Add Weight-Only Support To Whisper #794

Closed

Conversation

Eddie-Wang1120
Contributor

@Eddie-Wang1120 commented Jan 2, 2024

Support weight-only quantization for the Whisper model.

Uses the default hf-internal-testing/librispeech_asr_dummy dataset.

Only a single build command is needed:
python3 build.py --output_dir whisper_large_weight_only --use_gpt_attention_plugin --use_gemm_plugin --use_layernorm_plugin --use_bert_attention_plugin --use_weight_only
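
For context, --use_weight_only stores the linear-layer weights as int8 with per-channel scales and dequantizes them on the fly at inference time, while activations stay in fp16. A minimal NumPy sketch of that idea (illustrative only; the actual TensorRT-LLM plugin keeps the weights in int8 and fuses dequantization into the GEMM):

```python
import numpy as np

def quantize_weight_only_int8(w_fp16: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a weight matrix.

    Illustrative only: the real plugin does not materialize fp16 weights,
    it dequantizes inside the GEMM kernel.
    """
    # One scale per output channel (row of the weight matrix).
    scale = np.abs(w_fp16).max(axis=1, keepdims=True) / 127.0
    w_int8 = np.clip(np.round(w_fp16 / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)

def dequantize(w_int8: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover an fp16 approximation of the original weights.
    return w_int8.astype(np.float16) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float16)
    w_q, s = quantize_weight_only_int8(w)
    print("max abs error:", np.abs(w - dequantize(w_q, s)).max())
```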

Results:

|                  | float16   | int8 weight-only |
| ---------------- | --------- | ---------------- |
| GPU memory usage | 15964 MiB | 6420 MiB         |
| RTF              | 0.1132    | 0.0770           |
| processing time  | 54.473 s  | 37.037 s         |
| batch_size       | 4         | 4                |
| num_beams        | 1         | 1                |
| WER              | 2.48      | 2.48             |

In short, GPU memory usage drops by roughly 2.5x, inference is about 1.5x faster, and accuracy (WER) is unchanged.
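
For reference, those ratios follow directly from the table (a quick arithmetic check, not an additional measurement):

```python
# Numbers taken from the table above (float16 vs. int8 weight-only).
mem_fp16, mem_int8 = 15964, 6420        # MiB
time_fp16, time_int8 = 54.473, 37.037   # seconds

print(f"memory reduction: {mem_fp16 / mem_int8:.2f}x")   # ~2.49x
print(f"speedup:          {time_fp16 / time_int8:.2f}x")  # ~1.47x
```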

Looking forward to good news!
Eddie-Wang

@yuekaizhang

@Eddie-Wang1120 Hi Eddie, many thanks. We will take this PR into our internal GitLab, add your name to the co-author list, and credit your work in the release notes for Whisper int8 support.

@Eddie-Wang1120
Contributor Author

> @Eddie-Wang1120 Hi Eddie, many thanks. We will take this PR into our internal GitLab, add your name to the co-author list, and credit your work in the release notes for Whisper int8 support.

Thanks a lot! I will keep working on int8 KV cache and SmoothQuant support for Whisper.
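
As background on that follow-up work: SmoothQuant rescales per input channel so that activation outliers are migrated into the weights, Y = (X · diag(s)^-1)(diag(s) · W), which leaves the matmul result unchanged but makes both operands easier to quantize to int8. A minimal NumPy sketch of the rescaling step (illustrative only, not the planned Whisper implementation):

```python
import numpy as np

def smoothquant_rescale(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Per-input-channel smoothing: y = (x / s) @ (s[:, None] * w).

    Moves activation outliers into the weights so both are easier to
    quantize to int8. Illustrative sketch only.
    """
    act_max = np.abs(x).max(axis=0)   # per-channel activation range
    w_max = np.abs(w).max(axis=1)     # per-channel weight range
    s = (act_max ** alpha) / (w_max ** (1 - alpha))
    return x / s, w * s[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, w = rng.standard_normal((16, 8)), rng.standard_normal((8, 4))
    x_s, w_s = smoothquant_rescale(x, w)
    # The product is unchanged up to floating-point error.
    print(np.allclose(x @ w, x_s @ w_s))
```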

@paulxin001

Hello.
I used your code to complete the int8 quantization.
The memory footprint is reduced to 6 GB, but processing is about twice as slow.
I am using a Tesla V100 32 GB GPU. Do you know the possible cause? At the moment I suspect it is the hardware.

@yuekaizhang

yuekaizhang commented Jan 4, 2024

> Hello. I used your code to complete the int8 quantization. The memory footprint is reduced to 6 GB, but processing is about twice as slow. I am using a Tesla V100 32 GB GPU. Do you know the possible cause? At the moment I suspect it is the hardware.

Yeah, Ampere cards, e.g. A100 or A10, should work very well with int8. I am curious about the perf stats file under results_dir. Would you mind pasting the fp16 vs. int8 RTF and batch_size info here? The results above in the PR are from an RTX 4060 Ti 16 GB. You may also try increasing batch_size for int8, since it saves a lot of VRAM.
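
For anyone comparing numbers, RTF here is the usual real-time factor: processing time divided by total audio duration, so lower is better. A quick sanity check against the figures in the PR description:

```python
def real_time_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """RTF = time spent decoding / length of audio decoded (lower is better)."""
    return processing_time_s / audio_duration_s

# The PR description reports 54.473 s (fp16) and 37.037 s (int8 weight-only)
# with RTFs of 0.1132 and 0.0770, which both imply roughly 481 s of audio.
print(54.473 / 0.1132)   # ~481 s
print(37.037 / 0.0770)   # ~481 s
```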

@paulxin001

I used the 1221-135766-0002.wav file from the example and the default batch_size of 4.
Int8 takes 0.91 s and float16 takes 0.48 s. It feels weird.
I now suspect the V100 GPU is the problem.

@yuekaizhang

> I used the 1221-135766-0002.wav file from the example and the default batch_size of 4. Int8 takes 0.91 s and float16 takes 0.48 s. It feels weird. I now suspect the V100 GPU is the problem.

Also, I suggest benchmarking against a whole dataset, e.g. https://huggingface.co/datasets/hf-internal-testing/librispeech_asr_dummy
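
For reference, the dummy dataset can be pulled with the Hugging Face datasets library, e.g. (a sketch; the benchmark script in the example may load it differently):

```python
from datasets import load_dataset

# A few dozen short LibriSpeech clips; enough for a quick end-to-end benchmark.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

total_audio_s = sum(len(ex["audio"]["array"]) / ex["audio"]["sampling_rate"] for ex in ds)
print(f"{len(ds)} utterances, {total_audio_s:.1f} s of audio")
```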

@kristiankielhofner

@Eddie-Wang1120 This is great work!

While we're looking at optimized Whisper performance, are there any plans to support distil-whisper?

@Eddie-Wang1120
Contributor Author

> @Eddie-Wang1120 This is great work!
>
> While we're looking at optimized Whisper performance, are there any plans to support distil-whisper?

Thanks! Currently I'm working on int8_kv_cache support for Whisper, and will look into support for other models once Whisper is finished.

@yuekaizhang

> @Eddie-Wang1120 Hi Eddie, many thanks. We will take this PR into our internal GitLab, add your name to the co-author list, and credit your work in the release notes for Whisper int8 support.

Hi @kaiyux, would you mind adding @Eddie-Wang1120's name in the next release notes? I have imported and merged this PR into GitLab. Thanks.

@kaiyux
Member

kaiyux commented Jan 10, 2024

@Eddie-Wang1120 Thanks very much for your great contribution. The changes will be included in the next main branch update on GitHub, and we will credit you as a co-author. Thanks!

@Eddie-Wang1120
Contributor Author

> @Eddie-Wang1120 Thanks very much for your great contribution. The changes will be included in the next main branch update on GitHub, and we will credit you as a co-author. Thanks!

Thanks a lot!

@robosina

@Eddie-Wang1120, thank you for your contribution. I just tested this quantization on the latest TensorRT-LLM release (commit c896530), but I'm not seeing any improvement in performance or memory usage. In terms of inference speed, it is about three times slower. I'm building the model with the following command:

python3 build.py --output_dir whisper_large_woq8 \
                 --model_name large-v2 \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --use_layernorm_plugin \
                 --use_bert_attention_plugin \
                 --max_batch_size $BATCH_SIZE \
                 --max_beam_width $BEAM_WIDTH \
                 --log_level verbose \
                 --use_weight_only \
                 --weight_only_precision int8
           

But I see the following results, which I don't think are expected. I guess this might be related to the A10 GPU not performing well in int8 mode while being faster in float16. Even so, I should at least see a significantly lower memory footprint, right?

large-v2:

RTF: 0.0341
total_duration: 481.035 seconds (0.13 hours)
processing time: 16.386 seconds (0.00 hours)
batch size: 8
num_beams: 5

v2 int8 WOQ:

RTF: 0.0977
total_duration: 481.035 seconds (0.13 hours)
processing time: 46.985 seconds (0.01 hours)
batch size: 8
num_beams: 5
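
Quantifying the regression from the two runs above (simple arithmetic on the posted numbers):

```python
# large-v2 fp16 vs. int8 weight-only, batch_size 8, num_beams 5 (numbers above).
fp16_time, int8_time = 16.386, 46.985   # seconds for 481.035 s of audio
fp16_rtf, int8_rtf = 0.0341, 0.0977

print(f"int8 WOQ slowdown: {int8_time / fp16_time:.2f}x")  # ~2.87x
print(f"RTF ratio:         {int8_rtf / fp16_rtf:.2f}x")    # ~2.87x
```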

@yuekaizhang


The current weight-only quant solution for Whisper has a large speed/throughput regression, and we're investigating it now. However, you should see a significantly lower memory footprint with the current solution. What's your VRAM usage in the above cases? @robosina

@robosina

@yuekaizhang I see, thanks for the feedback. For the normal model with a batch size of 8 and a beam size of 5, the memory usage is approximately 19,862 MiB. For the WOQ8 model, it is around 19,044 MiB.

@yuekaizhang

yuekaizhang commented Jan 24, 2024

> @yuekaizhang I see, thanks for the feedback. For the normal model with a batch size of 8 and a beam size of 5, the memory usage is approximately 19,862 MiB. For the WOQ8 model, it is around 19,044 MiB.

Would you mind trying batch_size 4 and beam_size 1? It's weird, since I got about 16,000 MiB for fp16 and about 7,000 MiB for weight-only int8 on an A10 GPU. @robosina

@robosina

robosina commented Jan 24, 2024

@yuekaizhang Yes, it's weird for me too; in this config, the memory usage is 8,730 MiB for the normal model and 7,912 MiB for the WOQ8 model.

@yuekaizhang

> @yuekaizhang Yes, it's weird for me too; in this config, the memory usage is 8,730 MiB for the normal model and 7,912 MiB for the WOQ8 model.

Yeah, with this config the WOQ8 results are the same. Your fp16 memory usage is much lower than mine, though. I'm using the large-v3 model, which seems to be the only difference between us.

@robosina

robosina commented Jan 24, 2024

@yuekaizhang I see, thanks. I will check this in more detail and get back to you.

@Bhuvanesh09
Contributor

> The current weight-only quant solution for Whisper has a large speed/throughput regression, and we're investigating it now. However, you should see a significantly lower memory footprint with the current solution. What's your VRAM usage in the above cases? @robosina

@yuekaizhang Any updates on why the quantized models are slower than the unquantized ones?
By any chance, does the quantization apply only to the decoder, since it is the causal part? That could explain the slowdown, since the encoder would still run in fp16 while the decoder runs in int8.

@aramfaghfouri

Hi @yuekaizhang,
Is there any update on this?
Thanks!

@yuekaizhang

> Hi @yuekaizhang, Is there any update on this? Thanks!

@aramfaghfouri @Bhuvanesh09 Please see #992 (comment). All issues are now fixed, and the relationship between memory usage and speed matches the conclusions in that link: using int8 weight-only results in lower memory usage and faster speed.

You can wait for our code update, or directly use the PR corresponding to the link above.
