CUDA Graph support #914

Closed · zhyncs opened this issue Aug 31, 2023 · 7 comments
Labels: enhancement (New feature or request), performance (Performance-related issues)

Comments

@zhyncs (Contributor) commented Aug 31, 2023

Hi vLLM geniuses @WoosukKwon @zhuohan123,

After reading "Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning" and the llama-cuda-graph-example by Fireworks.ai's @jamesr66a:

> CUDA graphs address all sources of CPU overhead highlighted above: user-written logic, PyTorch dispatcher logic, memory allocation overhead, and GPU driver/kernel overhead.
>
> Thus, incremental generation can be limited by the CPU speed and thus is a good candidate for CUDA graphs.
>
> While both the regular attention mechanism and the PagedAttention scheme undergo shape changes over iterations, the latter provides a unique advantage when integrating with CUDA graphs.

And with this benchmark:

> We find that without CUDA graphs, LLaMA-7B inference executes at 30 tokens/sec, but with CUDA graphs enabled it executes at 69 tokens/sec for a 2.3x speedup.

We may refer to and port similar optimizations to vLLM. Cheers.
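For context, here is a minimal sketch of the capture/replay pattern from PyTorch's CUDA Graphs API (torch.cuda.CUDAGraph). The toy decoder and its fixed batch size are illustrative assumptions, not vLLM code:

```python
# Minimal CUDA graph capture/replay sketch (toy decoder, not vLLM code).
import torch

# Stand-in for one decoding step with fixed shapes: batch of 8, one new token each.
decoder = torch.nn.Linear(4096, 32000).cuda().eval()
static_hidden = torch.zeros(8, 4096, device="cuda")

# Warm up on a side stream so lazy initialization is not captured into the graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    static_logits = decoder(static_hidden)
torch.cuda.current_stream().wait_stream(s)

# Capture a single decode step.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_logits = decoder(static_hidden)

# Per iteration: copy new inputs into the static buffer and replay the whole graph
# with a single host-side call, instead of launching every kernel individually.
def decode_step(hidden: torch.Tensor) -> torch.Tensor:
    static_hidden.copy_(hidden)
    graph.replay()
    return static_logits.clone()
```

The key constraint is that shapes and tensor addresses are frozen at capture time, which is why the batch size in the sketch is fixed.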

@WoosukKwon (Collaborator) commented Aug 31, 2023

Hi @zhyncs, thanks for bringing it up. I believe their arguments make sense. Currently, we don't use CUDA graphs because of the difficulties in dynamic shape support, but things will get easier if we apply CUDA graphs only to the generation phase.
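To make the decode-only idea concrete, here is a rough sketch of capturing one graph per padded decode batch size and replaying the smallest bucket that fits. This is not an existing vLLM interface; run_decode, the bucket sizes, and the padding scheme are all assumptions:

```python
# Sketch: CUDA graphs restricted to the generation phase, one graph per padded
# decode batch size. Illustrative only; run_decode and the buckets are assumptions.
import torch

BATCH_BUCKETS = [1, 2, 4, 8, 16, 32]  # assumed padding sizes for decode batches

graphs, static_ids, static_logits = {}, {}, {}

def capture_decode_graphs(run_decode):
    """run_decode(input_ids) -> logits is an assumed fixed-shape, inference-only callable."""
    for bs in BATCH_BUCKETS:
        ids = torch.zeros(bs, 1, dtype=torch.long, device="cuda")
        run_decode(ids)                      # warm-up so one-time init isn't captured
        torch.cuda.synchronize()
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g), torch.no_grad():
            out = run_decode(ids)
        graphs[bs], static_ids[bs], static_logits[bs] = g, ids, out

def graphed_decode(input_ids: torch.Tensor) -> torch.Tensor:
    # Pad the live batch up to the nearest captured bucket, replay, then slice.
    n = input_ids.shape[0]
    bs = next(b for b in BATCH_BUCKETS if b >= n)
    static_ids[bs].zero_()
    static_ids[bs][:n].copy_(input_ids)
    graphs[bs].replay()
    return static_logits[bs][:n]
```

The prefill phase, with its highly variable shapes, would stay on the regular eager path under this scheme.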

WoosukKwon added the enhancement (New feature or request) and performance (Performance-related issues) labels on Aug 31, 2023
@WoosukKwon (Collaborator) commented Aug 31, 2023

However, we'd need more investigation and discussion on this, since using CUDA graphs certainly adds complexity and imposes some restrictions on our Python code. Considering the ease of future development, having a lightweight C++ backend just like FasterTransformer might be a better option.

@zhyncs (Contributor, Author) commented Aug 31, 2023

> However, we'd need more investigation and discussion on this, since using CUDA graphs certainly adds complexity and imposes some restrictions on our Python code. Considering the ease of future development, having a lightweight C++ backend just like FasterTransformer might be a better option.

Makes sense.

CUDA graphs are a method that combines significant performance improvement with code flexibility and usability.

But for now, these optimizations require specific expertise that is not yet sufficiently automated.

We will also do some investigation and hope to participate in building this.

@yunfeng-scale (Contributor)

@WoosukKwon would you mind clarifying how a C++ backend would help alleviate CPU overhead? Do you mean re-implementing the models in C++, like FasterTransformer, while using the paged attention kernels?

@yunfeng-scale (Contributor)

@WoosukKwon @zhuohan123 a CUDA graph POC: yunfeng-scale#1. I realized this is less useful for us since it does not improve throughput, but it might be helpful for the framework overall. Let me know if I should move forward with the PR and merge it upstream.

@yiakwy-xpu-ml-framework-team commented Aug 15, 2024

> However, we'd need more investigation and discussion on this, since using CUDA graphs certainly adds complexity and imposes some restrictions on our Python code. Considering the ease of future development, having a lightweight C++ backend just like FasterTransformer might be a better option.

Currently we use an independent memory pool (it is definitely needed, but it can also affect performance) to create the CUDA graph and capture the LLaMA-2 GPT model in the decoding stage.

This eases the kernel-launch overhead on the host CPU side (even with C++). In llama.cpp, @agray3 estimated this overhead at about 15% in this issue. (And yes, we should definitely do the same for ROCm HIP.)
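For reference, PyTorch exposes the graph memory pool explicitly, so a sketch of capturing with a dedicated pool could look like the following. The linear layer stands in for the real decode step and is an assumption, not the model discussed here:

```python
# Sketch: capturing into a dedicated graph memory pool (toy layer, not the real model).
import torch

layer = torch.nn.Linear(4096, 4096).cuda().eval()
static_in = torch.zeros(8, 4096, device="cuda")

with torch.no_grad():
    layer(static_in)                 # warm-up so lazy initialization isn't captured
torch.cuda.synchronize()

# A separate pool keeps the graph's allocations apart from the default caching allocator;
# several graphs can share one pool by passing the same handle to each capture.
pool = torch.cuda.graph_pool_handle()
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, pool=pool), torch.no_grad():
    static_out = layer(static_in)

g.replay()                           # one host-side call replays all captured kernels
```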

Do we have a new benchmark? @WoosukKwon @yunfeng-scale
