CUDA Graph support #914

Closed · zhyncs opened this issue Aug 31, 2023 · 7 comments
Labels: enhancement (New feature or request), performance (Performance-related issues)

Comments

@zhyncs (Contributor) commented Aug 31, 2023

Hi vLLM geniuses @WoosukKwon @zhuohan123,

After reading "Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning" and the llama-cuda-graph-example by Fireworks.ai's @jamesr66a:

> CUDA graphs address all sources of CPU overhead highlighted above: user-written logic, PyTorch dispatcher logic, memory allocation overhead, and GPU driver/kernel overhead.
>
> Thus, incremental generation can be limited by the CPU speed and thus is a good candidate for CUDA graphs.
>
> While both the regular attention mechanism and the PagedAttention scheme undergo shape changes over iterations, the latter provides a unique advantage when integrating with CUDA graphs.

And with this benchmark:

> We find that without CUDA graphs, LLaMA-7B inference executes at 30 tokens/sec, but with CUDA graphs enabled it executes at 69 tokens/sec for a 2.3x speedup.

We may refer to and port similar optimizations to vLLM. Cheers.
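For context, here is a minimal sketch of the capture/replay pattern from PyTorch's CUDA Graphs API (torch.cuda.CUDAGraph). The toy decoder and its fixed batch size are illustrative assumptions, not vLLM code:

```python
# Minimal CUDA graph capture/replay sketch (toy decoder, not vLLM code).
import torch

# Stand-in for one decoding step with fixed shapes: batch of 8, one new token each.
decoder = torch.nn.Linear(4096, 32000).cuda().eval()
static_hidden = torch.zeros(8, 4096, device="cuda")

# Warm up on a side stream so lazy initialization is not captured into the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    static_logits = decoder(static_hidden)
torch.cuda.current_stream().wait_stream(s)

# Capture a single decode step.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_logits = decoder(static_hidden)

# Per iteration: copy new inputs into the static buffer and replay the whole graph
# with a single host-side call, instead of launching every kernel individually.
def decode_step(hidden: torch.Tensor) -> torch.Tensor:
    static_hidden.copy_(hidden)
    graph.replay()
    return static_logits.clone()
```

The key constraint is that shapes and tensor addresses are frozen at capture time, which is why the batch size in the sketch is fixed.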

@WoosukKwon (Collaborator) commented Aug 31, 2023

Hi @zhyncs, thanks for bringing it up. I believe their arguments make sense. Currently, we don't use CUDA graphs because of the difficulties in dynamic shape support, but things will get easier if we apply CUDA graphs only to the generation phase.
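To make the decode-only idea concrete, here is a rough sketch of capturing one graph per padded decode batch size and replaying the smallest bucket that fits. This is not an existing vLLM interface; run_decode, the bucket sizes, and the padding scheme are all assumptions:

```python
# Sketch: CUDA graphs restricted to the generation phase, one graph per padded
# decode batch size. Illustrative only; run_decode and the buckets are assumptions.
import torch

BATCH_BUCKETS = [1, 2, 4, 8, 16, 32]  # assumed padding sizes for decode batches

graphs, static_ids, static_logits = {}, {}, {}

def capture_decode_graphs(run_decode):
    """run_decode(input_ids) -> logits is an assumed fixed-shape, inference-only callable."""
    for bs in BATCH_BUCKETS:
        ids = torch.zeros(bs, 1, dtype=torch.long, device="cuda")
        run_decode(ids)                      # warm-up so one-time init isn't captured
        torch.cuda.synchronize()
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g), torch.no_grad():
            out = run_decode(ids)
        graphs[bs], static_ids[bs], static_logits[bs] = g, ids, out

def graphed_decode(input_ids: torch.Tensor) -> torch.Tensor:
    # Pad the live batch up to the nearest captured bucket, replay, then slice.
    n = input_ids.shape[0]
    bs = next(b for b in BATCH_BUCKETS if b >= n)
    static_ids[bs].zero_()
    static_ids[bs][:n].copy_(input_ids)
    graphs[bs].replay()
    return static_logits[bs][:n]
```

The prefill phase, with its highly variable shapes, would stay on the regular eager path under this scheme.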

WoosukKwon added the enhancement (New feature or request) and performance (Performance-related issues) labels on Aug 31, 2023
@WoosukKwon (Collaborator) commented Aug 31, 2023

However, we'd need more investigation and discussion on this, since using CUDA graphs certainly adds complexity and imposes some restrictions on our Python code. Considering the ease of future development, having a lightweight C++ backend just like FasterTransformer might be a better option.

@zhyncs (Contributor, Author) commented Aug 31, 2023

> However, we'd need more investigation and discussion on this, since using CUDA graphs certainly adds complexity and imposes some restrictions on our Python code. Considering the ease of future development, having a lightweight C++ backend just like FasterTransformer might be a better option.

Makes sense.

CUDA graphs are a method that combines significant performance improvement with code flexibility and usability.

But for now, these optimizations require specific expertise that is not yet sufficiently automated.

We will also do some investigation and hope to participate in building this.

@yunfeng-scale (Contributor)

@WoosukKwon would you mind clarifying how a C++ backend would help alleviate CPU overhead? Do you mean re-implementing the models in C++, like FasterTransformer, while using the paged attention kernels?

@yunfeng-scale (Contributor)

@WoosukKwon @zhuohan123 a CUDA graph POC: yunfeng-scale#1. I realized this is less useful for us since it does not improve throughput, but it might be helpful for the framework overall. Let me know if I should move forward with the PR and merge it upstream.

@yiakwy-xpu-ml-framework-team commented Aug 15, 2024

> However, we'd need more investigation and discussion on this, since using CUDA graphs certainly adds complexity and imposes some restrictions on our Python code. Considering the ease of future development, having a lightweight C++ backend just like FasterTransformer might be a better option.

Currently we use an independent memory pool (it is definitely needed, but it can also affect performance) to create the CUDA graph and capture the LLaMA-2 GPT model in the decoding stage.

This eases the kernel-launch overhead on the host CPU side (even with C++). In llama.cpp, @agray3 estimated this overhead at about 15% in this issue. (And yes, we should definitely do the same for ROCm HIP.)
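For reference, PyTorch exposes the graph memory pool explicitly, so a sketch of capturing with a dedicated pool could look like the following. The linear layer stands in for the real decode step and is an assumption, not the model discussed here:

```python
# Sketch: capturing into a dedicated graph memory pool (toy layer, not the real model).
import torch

layer = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.zeros(8, 4096, device="cuda")

with torch.no_grad():
    layer(static_in)                 # warm-up so lazy initialization isn't captured
torch.cuda.synchronize()

# A separate pool keeps the graph's allocations apart from the default caching allocator;
# several graphs can share one pool by passing the same handle to each capture.
pool = torch.cuda.graph_pool_handle()
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, pool=pool), torch.no_grad():
    static_out = layer(static_in)

g.replay()                           # one host-side call replays all captured kernels
```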

Do we have a new benchmark? @WoosukKwon @yunfeng-scale
