performance: fp8 vs smoothquant int8 #2074

enozhu opened this issue Aug 1, 2024 · 2 comments
enozhu commented Aug 1, 2024

reference: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#performance

| Model | Batch Size | Speedup (FP8 vs. FP16) | Speedup (INT8 SQ vs. FP16) |
|---|---|---|---|
| GPT-J | 1 | 1.40x | 1.40x |
| GPT-J | 8 | 1.44x | 1.30x |
| LLaMA-v2-7B | 1 | 1.51x | 1.47x |
| LLaMA-v2-7B | 8 | 1.40x | 1.32x |

My question is: why is the FP8 speedup better than the INT8 SmoothQuant speedup, given that FP8 and INT8 tensor cores have the same peak TFLOPS on H100?


renjie0 commented Aug 8, 2024

INT8 SmoothQuant has quant/dequant cost around each GEMM.
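
For intuition, here is a minimal PyTorch sketch (not TensorRT-LLM code) of the extra work the INT8 SmoothQuant path does around each GEMM. `torch._int_mm` is a private PyTorch op used purely for illustration, and all shapes, scales, and smoothing factors below are made up:

```python
import torch

def smoothquant_int8_linear(x_fp16, w_int8, smooth, s_act, s_w):
    # 1) Per-channel smoothing migrates activation outliers into the
    #    weights (folded into the weights offline, applied to
    #    activations online).
    x = x_fp16 / smooth
    # 2) Dynamic quantization of activations to int8: an extra
    #    elementwise round/clamp pass before the GEMM.
    x_int8 = torch.clamp(torch.round(x / s_act), -128, 127).to(torch.int8)
    # 3) INT8 tensor-core GEMM with int32 accumulation.
    y_int32 = torch._int_mm(x_int8, w_int8)
    # 4) Dequantize the output back to fp16: another elementwise pass.
    return (y_int32.to(torch.float32) * (s_act * s_w)).to(torch.float16)

# Hypothetical shapes/scales just to run the sketch on a CUDA device.
x = torch.randn(32, 4096, dtype=torch.float16, device="cuda")
w = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8, device="cuda")
smooth = torch.rand(4096, device="cuda") + 0.5
print(smoothquant_int8_linear(x, w, smooth, 0.05, 0.01).shape)
```

Steps 1, 2, and 4 are memory-bound elementwise passes wrapped around the GEMM, whereas the FP8 path needs only a format cast plus scaling that fuses cheaply into adjacent kernels, which is one way to read the wider gap at batch size 8 in the table above.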


github-actions bot commented Sep 8, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
