Fix llama conversion with smooth quant #1650

Closed
lopuhin wants to merge 3 commits

Conversation

lopuhin (Contributor) commented May 22, 2024

This PR fixes a few errors which appear when following the SmoothQuant section of the README (https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#smoothquant) on the current latest commit.

Note: the first commit looks quite obvious (although I'm not sure how this could have worked before), while I'm less sure about the second; I was just going by the error messages during engine conversion, and there might be a better place for the fix, so feel free to treat this as a bug report instead. I verified that an engine built this way produces reasonable outputs and the expected performance. The model I tested with is Mistral 7B (mistral-7b-v0.1-instruct), but I assume other Llama 2 and Llama 3 models should also work (I haven't gotten to Llama 3 yet).

Without this, it errors out with:

Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 456, in <module>
    main()
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 448, in main
    convert_and_save_hf(args)
  File "/app/tensorrt_llm/examples/llama/convert_checkpoint.py", line 353, in convert_and_save_hf
    LLaMAForCausalLM.quantize(args.model_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 405, in quantize
    convert.quantize(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1395, in quantize
    weights = load_weights_from_hf(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1437, in load_weights_from_hf
    weights = convert_hf_llama(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1089, in convert_hf_llama
    convert_layer(l)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 725, in convert_layer
    get_tllm_linear_sq_weight(int8_weights,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 610, in get_tllm_linear_sq_weight
    results[prefix + 'per_channel_scale'] = torch.Tensor([
ValueError: only one element tensors can be converted to Python scalars

We can also check the shapes:

cur_per_channel_value.shape -> torch.Size([6144])
col_shape -> [1, 6144]

So it's clear that the tensor was meant to be converted without the wrapping []. With these changes the model works and provides sensible output.
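
For illustration, here is a minimal sketch of the failure and the fix, assuming the original code in get_tllm_linear_sq_weight passed cur_per_channel_value to torch.Tensor wrapped in a list before reshaping; the tensor value below is a placeholder, not the real per-channel scales:

import torch

# Placeholder per-channel scales with the shape reported above; in the real
# conversion these come from the SmoothQuant scaling factors.
cur_per_channel_value = torch.ones(6144)
col_shape = [1, 6144]

# Before the fix: wrapping the multi-element tensor in a list makes the legacy
# torch.Tensor constructor try to turn each element into a Python scalar,
# which raises the ValueError seen in the traceback.
try:
    bad = torch.Tensor([cur_per_channel_value]).reshape(col_shape)
except ValueError as e:
    print(e)  # only one element tensors can be converted to Python scalars

# After the fix: reshape the existing tensor directly.
good = cur_per_channel_value.reshape(col_shape)
print(good.shape)  # torch.Size([1, 6144])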
kaiyux mentioned this pull request May 28, 2024
kaiyux (Member) commented May 28, 2024

Hi @lopuhin, the changes are integrated in #1688 and we've credited you as a co-author, hence I'm closing this PR now. Thanks a lot.

kaiyux closed this May 28, 2024
lopuhin (Contributor, Author) commented Jun 4, 2024

Hi @kaiyux, great, thank you! I think only the first commit was integrated; the other two were not, but they are also required, as they fix errors that would happen when running the engine. I'm experimenting with smooth quant Llama 3 right now and need all the commits to get it working. Do you mind having another look?
