I am running a Llama 3 model on an RTX 4090 with fp8 quantization. In the result, `outputTokenIds` seems to be correct, but the `generationLogits` are all wrong. I also tested the same model without quantization, and there the returned logits are all correct, so I suspect something goes wrong when returning logits with fp8 enabled.
How I tested: I deployed the model with tritonserver using tensorrtllm_backend. I modified the BLS backend slightly to return the softmax of the `generationLogits` along with the generated tokens (a sketch of the change is below). I then made a call using client.txt and got the result in log.txt.
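For context, this is roughly the kind of change I mean inside the BLS model's `execute()` loop. It is a minimal sketch, not my exact patch: the tensor name `"generation_logits"` and the `infer_response` variable are assumptions based on the tensorrtllm_backend conventions and the surrounding BLS loop.

```python
import numpy as np
import triton_python_backend_utils as pb_utils

def stable_softmax(x, axis=-1):
    # Subtract the per-row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

# Inside the BLS execute() loop, after receiving a response from the
# underlying tensorrt_llm model ("generation_logits" is the assumed name):
logits_tensor = pb_utils.get_output_tensor_by_name(infer_response, "generation_logits")
if logits_tensor is not None:
    probs = stable_softmax(logits_tensor.as_numpy(), axis=-1)
```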
Command to run the client:

```
python3 client.py -p "hello how are you" --model-name tensorrt_llm_bls --request-id testid --verbose -o 10 --return-generation-logits
```
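As a quick sanity check on the returned result (a hypothetical helper, not part of client.py): with greedy sampling, the argmax of each step's generation logits should match the corresponding output token ID, which makes a logits mismatch easy to spot in log.txt.

```python
import numpy as np

def logits_match_tokens(generation_logits, output_token_ids):
    # generation_logits: [num_generated_tokens, vocab_size]
    # output_token_ids:  [num_generated_tokens]
    # With greedy sampling, each step's argmax should equal the emitted token.
    predicted = np.argmax(np.asarray(generation_logits), axis=-1)
    return np.array_equal(predicted, np.asarray(output_token_ids))
```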
Please have a look. Thanks in advance!