I am running a Llama 3 model on an RTX 4090 with fp8 quantization. In the result, `outputTokenIds` seems to be correct, but the `generationLogits` are all wrong. I also tested the same model without quantization, and there the returned logits are all correct, so I suspect something goes wrong when returning logits with fp8 enabled.
How I tested: I deployed the model with tritonserver using tensorrtllm_backend. I modified the BLS backend slightly to return the softmax of the `generationLogits` along with the generated tokens (a sketch of the change is below). I then made a call using client.txt and got the result in log.txt.
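For context, this is roughly the kind of change I mean inside the BLS model's `execute()` loop. It is a minimal sketch, not my exact patch: the tensor name `"generation_logits"` and the `infer_response` variable are assumptions based on the tensorrtllm_backend conventions and the surrounding BLS loop.

```python
import numpy as np
import triton_python_backend_utils as pb_utils

def stable_softmax(x, axis=-1):
    # Subtract the per-row max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

# Inside the BLS execute() loop, after receiving a response from the
# underlying tensorrt_llm model ("generation_logits" is the assumed name):
logits_tensor = pb_utils.get_output_tensor_by_name(infer_response, "generation_logits")
if logits_tensor is not None:
    probs = stable_softmax(logits_tensor.as_numpy(), axis=-1)
```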
Command to run the client:

```
python3 client.py -p "hello how are you" --model-name tensorrt_llm_bls --request-id testid --verbose -o 10 --return-generation-logits
```
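As a quick sanity check on the returned result (a hypothetical helper, not part of client.py): with greedy sampling, the argmax of each step's generation logits should match the corresponding output token ID, which makes a logits mismatch easy to spot in log.txt.

```python
import numpy as np

def logits_match_tokens(generation_logits, output_token_ids):
    # generation_logits: [num_generated_tokens, vocab_size]
    # output_token_ids:  [num_generated_tokens]
    # With greedy sampling, each step's argmax should equal the emitted token.
    predicted = np.argmax(np.asarray(generation_logits), axis=-1)
    return np.array_equal(predicted, np.asarray(output_token_ids))
```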
Please have a look. Thanks in advance!