Add Weight-Only Support To Whisper #794

Closed

Conversation

Eddie-Wang1120
Contributor

@Eddie-Wang1120 commented Jan 2, 2024

Support weight-only quantization for the Whisper model.

Uses the default hf-internal-testing/librispeech_asr_dummy dataset.

Only a single build command is needed:
python3 build.py --output_dir whisper_large_weight_only --use_gpt_attention_plugin --use_gemm_plugin --use_layernorm_plugin --use_bert_attention_plugin --use_weight_only
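
For context, --use_weight_only stores the linear-layer weights as int8 with per-channel scales and dequantizes them on the fly at inference time, while activations stay in fp16. A minimal NumPy sketch of that idea (illustrative only; the actual TensorRT-LLM plugin keeps the weights in int8 and fuses dequantization into the GEMM):

```python
import numpy as np

def quantize_weight_only_int8(w_fp16: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a weight matrix.

    Illustrative only: the real plugin does not materialize fp16 weights,
    it dequantizes inside the GEMM kernel.
    """
    # One scale per output channel (row of the weight matrix).
    scale = np.abs(w_fp16).max(axis=1, keepdims=True) / 127.0
    w_int8 = np.clip(np.round(w_fp16 / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)

def dequantize(w_int8: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover an fp16 approximation of the original weights.
    return w_int8.astype(np.float16) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float16)
    w_q, s = quantize_weight_only_int8(w)
    print("max abs error:", np.abs(w - dequantize(w_q, s)).max())
```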

Results:

|                  | float16   | int8 weight-only |
| ---------------- | --------- | ---------------- |
| GPU memory usage | 15964 MiB | 6420 MiB         |
| RTF              | 0.1132    | 0.0770           |
| processing time  | 54.473 s  | 37.037 s         |
| batch_size       | 4         | 4                |
| num_beams        | 1         | 1                |
| WER              | 2.48      | 2.48             |

In short, GPU memory usage drops by roughly 2.5x, inference is about 1.5x faster, and accuracy (WER) is unchanged.
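
For reference, those ratios follow directly from the table (a quick arithmetic check, not an additional measurement):

```python
# Numbers taken from the table above (float16 vs. int8 weight-only).
mem_fp16, mem_int8 = 15964, 6420        # MiB
time_fp16, time_int8 = 54.473, 37.037   # seconds

print(f"memory reduction: {mem_fp16 / mem_int8:.2f}x")   # ~2.49x
print(f"speedup:          {time_fp16 / time_int8:.2f}x")  # ~1.47x
```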

Looking forward to good news!
Eddie-Wang

@yuekaizhang

@Eddie-Wang1120 Hi Eddie, many thanks. We will take this PR into our internal GitLab, add your name to the co-author list, and credit your work in the release notes for Whisper int8 support.

@Eddie-Wang1120
Contributor Author

> @Eddie-Wang1120 Hi Eddie, many thanks. We will take this PR into our internal GitLab, add your name to the co-author list, and credit your work in the release notes for Whisper int8 support.

Thanks a lot! I will keep working on int8 KV cache and SmoothQuant support for Whisper.
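
As background on that follow-up work: SmoothQuant rescales per input channel so that activation outliers are migrated into the weights, Y = (X · diag(s)^-1)(diag(s) · W), which leaves the matmul result unchanged but makes both operands easier to quantize to int8. A minimal NumPy sketch of the rescaling step (illustrative only, not the planned Whisper implementation):

```python
import numpy as np

def smoothquant_rescale(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Per-input-channel smoothing: y = (x / s) @ (s[:, None] * w).

    Moves activation outliers into the weights so both are easier to
    quantize to int8. Illustrative sketch only.
    """
    act_max = np.abs(x).max(axis=0)   # per-channel activation range
    w_max = np.abs(w).max(axis=1)     # per-channel weight range
    s = (act_max ** alpha) / (w_max ** (1 - alpha))
    return x / s, w * s[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, w = rng.standard_normal((16, 8)), rng.standard_normal((8, 4))
    x_s, w_s = smoothquant_rescale(x, w)
    # The product is unchanged up to floating-point error.
    print(np.allclose(x @ w, x_s @ w_s))
```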

@paulxin001

Hello.
I used your code to complete the int8 quantization.
The memory footprint is reduced to 6 GB, but processing is about twice as slow.
I am using a Tesla V100 32 GB GPU. Do you know the possible cause? At the moment I suspect it is the hardware.

@yuekaizhang

yuekaizhang commented Jan 4, 2024

> Hello. I used your code to complete the int8 quantization. The memory footprint is reduced to 6 GB, but processing is about twice as slow. I am using a Tesla V100 32 GB GPU. Do you know the possible cause? At the moment I suspect it is the hardware.

Yeah, Ampere cards, e.g. A100 or A10, should work very well with int8. I am curious about the perf stats file under results_dir. Would you mind pasting the fp16 vs. int8 RTF and batch_size info here? The results above in the PR are from an RTX 4060 Ti 16 GB. You may also try increasing batch_size for int8, since it saves a lot of VRAM.
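
For anyone comparing numbers, RTF here is the usual real-time factor: processing time divided by total audio duration, so lower is better. A quick sanity check against the figures in the PR description:

```python
def real_time_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """RTF = time spent decoding / length of audio decoded (lower is better)."""
    return processing_time_s / audio_duration_s

# The PR description reports 54.473 s (fp16) and 37.037 s (int8 weight-only)
# with RTFs of 0.1132 and 0.0770, which both imply roughly 481 s of audio.
print(54.473 / 0.1132)   # ~481 s
print(37.037 / 0.0770)   # ~481 s
```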

@paulxin001

I used the 1221-135766-0002.wav file from the example and the default batch_size of 4.
Int8 takes 0.91 s and float16 takes 0.48 s. It feels weird.
I now suspect the V100 GPU is the problem.

@yuekaizhang

> I used the 1221-135766-0002.wav file from the example and the default batch_size of 4. Int8 takes 0.91 s and float16 takes 0.48 s. It feels weird. I now suspect the V100 GPU is the problem.

Also, I suggest benchmarking against a whole dataset, e.g. https://huggingface.co/datasets/hf-internal-testing/librispeech_asr_dummy
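
For reference, the dummy dataset can be pulled with the Hugging Face datasets library, e.g. (a sketch; the benchmark script in the example may load it differently):

```python
from datasets import load_dataset

# A few dozen short LibriSpeech clips; enough for a quick end-to-end benchmark.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

total_audio_s = sum(len(ex["audio"]["array"]) / ex["audio"]["sampling_rate"] for ex in ds)
print(f"{len(ds)} utterances, {total_audio_s:.1f} s of audio")
```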

@kristiankielhofner

@Eddie-Wang1120 This is great work!

While we're looking at optimized Whisper performance, are there any plans to support distil-whisper?

@Eddie-Wang1120
Contributor Author

> @Eddie-Wang1120 This is great work!
>
> While we're looking at optimized Whisper performance, are there any plans to support distil-whisper?

Thanks! Currently I'm working on int8_kv_cache support for Whisper, and will look into support for other models once Whisper is finished.

@yuekaizhang

> @Eddie-Wang1120 Hi Eddie, many thanks. We will take this PR into our internal GitLab, add your name to the co-author list, and credit your work in the release notes for Whisper int8 support.

Hi @kaiyux, would you mind adding @Eddie-Wang1120's name in the next release notes? I have imported and merged this PR into GitLab. Thanks.

@kaiyux
Member

kaiyux commented Jan 10, 2024

@Eddie-Wang1120 Thanks very much for your great contribution. The changes will be included in the next main branch update on GitHub, and we will credit you as a co-author. Thanks!

@Eddie-Wang1120
Contributor Author

> @Eddie-Wang1120 Thanks very much for your great contribution. The changes will be included in the next main branch update on GitHub, and we will credit you as a co-author. Thanks!

Thanks a lot!

@robosina

@Eddie-Wang1120, thank you for your contribution. I just tested this quantization on the latest TensorRT-LLM release (commit c896530), but I'm not seeing any improvement in performance or memory usage. In terms of inference speed, it is about three times slower. I'm building the model with the following command:

python3 build.py --output_dir whisper_large_woq8 \
                 --model_name large-v2 \
                 --use_gpt_attention_plugin \
                 --use_gemm_plugin \
                 --use_layernorm_plugin \
                 --use_bert_attention_plugin \
                 --max_batch_size $BATCH_SIZE \
                 --max_beam_width $BEAM_WIDTH \
                 --log_level verbose \
                 --use_weight_only \
                 --weight_only_precision int8
           

But I see the following results, which I don't think are expected. I guess this might be related to the A10 GPU not performing well in int8 mode while being faster in float16. Even so, I should at least see a significantly lower memory footprint, right?

large-v2:

RTF: 0.0341
total_duration: 481.035 seconds (0.13 hours)
processing time: 16.386 seconds (0.00 hours)
batch size: 8
num_beams: 5

v2 int8 WOQ:

RTF: 0.0977
total_duration: 481.035 seconds (0.13 hours)
processing time: 46.985 seconds (0.01 hours)
batch size: 8
num_beams: 5
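
Quantifying the regression from the two runs above (simple arithmetic on the posted numbers):

```python
# large-v2 fp16 vs. int8 weight-only, batch_size 8, num_beams 5 (numbers above).
fp16_time, int8_time = 16.386, 46.985   # seconds for 481.035 s of audio
fp16_rtf, int8_rtf = 0.0341, 0.0977

print(f"int8 WOQ slowdown: {int8_time / fp16_time:.2f}x")  # ~2.87x
print(f"RTF ratio:         {int8_rtf / fp16_rtf:.2f}x")    # ~2.87x
```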

@yuekaizhang


The current weight-only quant solution for Whisper has a large speed/throughput regression, and we're investigating it now. However, you should see a significantly lower memory footprint with the current solution. What's your VRAM usage in the above cases? @robosina

@robosina

@yuekaizhang I see, thanks for the feedback. For the normal model with a batch size of 8 and a beam size of 5, the memory usage is approximately 19,862 MiB. For the WOQ8 model, it is around 19,044 MiB.

@yuekaizhang

yuekaizhang commented Jan 24, 2024

> @yuekaizhang I see, thanks for the feedback. For the normal model with a batch size of 8 and a beam size of 5, the memory usage is approximately 19,862 MiB. For the WOQ8 model, it is around 19,044 MiB.

Would you mind trying batch_size 4 and beam_size 1? It's weird, since I got about 16,000 MiB for fp16 and about 7,000 MiB for weight-only int8 on an A10 GPU. @robosina

@robosina

robosina commented Jan 24, 2024

@yuekaizhang Yes, it's weird for me too; in this config, the memory usage is 8,730 MiB for the normal model and 7,912 MiB for the WOQ8 model.

@yuekaizhang

> @yuekaizhang Yes, it's weird for me too; in this config, the memory usage is 8,730 MiB for the normal model and 7,912 MiB for the WOQ8 model.

Yeah, with this config the WOQ8 results are the same. Your fp16 memory usage is much lower than mine, though. I'm using the large-v3 model, which seems to be the only difference between us.

@robosina

robosina commented Jan 24, 2024

@yuekaizhang I see, thanks. I will check this in more detail and get back to you.

@Bhuvanesh09
Contributor

> The current weight-only quant solution for Whisper has a large speed/throughput regression, and we're investigating it now. However, you should see a significantly lower memory footprint with the current solution. What's your VRAM usage in the above cases? @robosina

@yuekaizhang Any updates on why the quantized models are slower than the unquantized ones?
By any chance, does the quantization apply only to the decoder, since it is the causal part? That could explain the slowdown, since the encoder would still run in fp16 while the decoder runs in int8.

@aramfaghfouri

Hi @yuekaizhang,
Is there any update on this?
Thanks!

@yuekaizhang

> Hi @yuekaizhang, Is there any update on this? Thanks!

@aramfaghfouri @Bhuvanesh09 Please see #992 (comment). All issues are now fixed, and the relationship between memory usage and speed matches the conclusions in that link: using int8 weight-only results in lower memory usage and faster speed.

You can wait for our code update, or directly use the PR corresponding to the link above.
