reference: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#performance
| Model | Batch Size | Speedup (FP8 vs. FP16) | Speedup (INT8 SQ vs. FP16) |
|---|---|---|---|
| GPT-J | 1 | 1.40x | 1.40x |
| GPT-J | 8 | 1.44x | 1.30x |
| LLaMA-v2-7B | 1 | 1.51x | 1.47x |
| LLaMA-v2-7B | 8 | 1.40x | 1.32x |
My question is: why is the FP8 speedup better than the INT8 SmoothQuant speedup, when FP8 and INT8 tensor cores have the same peak TFLOPS on H100?
INT8 SmoothQuant has extra quantize/dequantize cost: activations must be quantized to INT8 before each GEMM, and the INT32 accumulator results must be dequantized back to FP16 afterward. The FP8 GEMM consumes and produces floating-point tensors directly, so even with identical tensor-core peak TFLOPS, the INT8 path spends extra time outside the GEMM itself.
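To make the overhead concrete, here is a minimal NumPy sketch of the INT8 SmoothQuant per-GEMM dataflow. This is an illustration, not the actual TensorRT-LLM kernel; the per-tensor scale names `s_x` and `s_w` are assumptions for the example. Steps 1 and 3 are the quant/dequant passes that the FP8 path avoids.

```python
import numpy as np

def int8_smoothquant_gemm(x_fp16, w_int8, s_x, s_w):
    """Sketch of one INT8 SmoothQuant GEMM (illustrative, not the TRT-LLM kernel).

    x_fp16: FP16 activations; w_int8: pre-quantized INT8 weights.
    s_x, s_w: per-tensor activation/weight scales (hypothetical names).
    """
    # 1. Quantize activations to INT8 on the fly -- an extra memory pass
    #    that the FP8 path does not need.
    x_int8 = np.clip(np.rint(x_fp16.astype(np.float32) / s_x), -128, 127).astype(np.int8)

    # 2. The INT8 tensor-core GEMM accumulates in INT32.
    acc_int32 = x_int8.astype(np.int32) @ w_int8.astype(np.int32)

    # 3. Dequantize the INT32 accumulator back to FP16 -- the second extra
    #    step; an FP8 GEMM epilogue can emit FP16 directly.
    return (acc_int32.astype(np.float32) * (s_x * s_w)).astype(np.float16)
```

In a fused kernel much of this is hidden in the GEMM prologue/epilogue, but the quant/dequant work is still nonzero, which is consistent with the smaller end-to-end speedup in the table above.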