CUDA Graph support #914
Hi @zhyncs, thanks for bringing this up. I believe their arguments make sense. Currently, we don't use CUDA graphs because of the difficulties in dynamic shape support, but things get easier if we apply CUDA graphs only to the generation phase.
However, we'd need more investigation and discussion on this, since using CUDA graphs certainly adds complexity and imposes some restrictions on our Python code. Considering the ease of future development, having a lightweight C++ backend just like FasterTransformer might be a better option.
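The dynamic-shape difficulty mentioned above is that a captured CUDA graph bakes in fixed tensor shapes, so a common workaround is to capture one graph per batch-size "bucket" and pad live decode batches up to the nearest bucket. Here is a toy, GPU-free sketch of that padding scheme; the bucket sizes and function names are hypothetical illustrations, not vLLM APIs:

```python
# Hypothetical illustration: a captured graph fixes its input shapes, so
# variable-size decode batches must be padded to a pre-captured bucket size.
CAPTURED_BATCH_SIZES = [1, 2, 4, 8]  # one graph would be captured per bucket

def pick_bucket(batch_size):
    """Return the smallest captured bucket that fits the live batch."""
    for b in CAPTURED_BATCH_SIZES:
        if b >= batch_size:
            return b
    raise ValueError("batch too large for any captured graph")

def pad_batch(token_ids, bucket):
    """Pad with a dummy token id so the shape matches the captured graph."""
    return token_ids + [0] * (bucket - len(token_ids))

bucket = pick_bucket(3)            # a batch of 3 runs on the size-4 graph
padded = pad_batch([5, 7, 9], bucket)
```

The cost of this scheme is wasted compute on the padding slots, which is one of the restrictions CUDA graphs would impose on the Python code.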
Makes sense.
We will also do some investigation and hope to participate in building it.
@WoosukKwon would you mind clarifying how a C++ backend would help alleviate CPU overhead? Do you mean re-implementing the models in C++ like FasterTransformer, while reusing the paged attention kernels?
@WoosukKwon @zhuohan123 a CUDA graph POC: yunfeng-scale#1. I realized this is less useful for us since it does not improve throughput, but it might be helpful for the framework overall. Let me know if I should move forward with the PR and merge it upstream.
Currently we use an independent memory pool (it is definitely needed, but it could also affect performance) to create a CUDA graph and capture the llama2/gpt model in the decoding stage. This reduces the kernel-launch overhead on the host CPU side (which exists even with C++). For llama.cpp, @agray3 estimated this overhead at about 15% in this issue. (And yes, we should definitely do the same for ROCm HIP.) Do we have new benchmarks? @WoosukKwon @yunfeng-scale
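The capture/replay pattern behind the comment above can be sketched without a GPU: a graph records one decode step against fixed-address input/output buffers (which is why a dedicated memory pool is needed during capture), and each replay just copies new data into those same buffers. This is a toy illustration only; real capture goes through `cudaStreamBeginCapture` or `torch.cuda.CUDAGraph`, and `ToyGraph` is a made-up name:

```python
class ToyGraph:
    """Toy stand-in for a CUDA graph: replay always reads and writes the
    same static buffers, mirroring how captured kernel launches have their
    memory addresses baked in at capture time."""

    def __init__(self, size):
        self.static_in = [0] * size   # fixed-address input buffer
        self.static_out = [0] * size  # fixed-address output buffer
        self._step = None

    def capture(self, step_fn):
        # Record the decode step once; shapes and buffers are now frozen.
        self._step = step_fn

    def replay(self, new_inputs):
        # Copy fresh data into the captured input buffer, then replay.
        self.static_in[:] = new_inputs
        self._step(self.static_in, self.static_out)
        return list(self.static_out)

g = ToyGraph(4)
# A trivial "decode step": write input + 1 into the output buffer.
g.capture(lambda inp, out: out.__setitem__(slice(None), [x + 1 for x in inp]))
result = g.replay([1, 2, 3, 4])   # [2, 3, 4, 5]
```

On real hardware the win is that replay submits the whole recorded kernel sequence in one launch, bypassing per-kernel Python and CUDA driver dispatch, which is where the ~15% host-side overhead estimate comes from.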
Hi vLLM geniuses @WoosukKwon @zhuohan123,
After reading "Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning" and the llama-cuda-graph-example by Fireworks.ai's @jamesr66a, and given this benchmark, we may be able to refer to these and port similar optimizations to vLLM. Cheers.