
Reference input randomSeeds by idx rather than batchSlot #1742

Closed

Conversation

@pathorn (Contributor) commented Jun 5, 2024

Steps to reproduce:

  1. Compile a model with max_batch_size 8 or higher (this makes the steps easier to reproduce); I used 128.
  2. Perform an extremely high-temperature request (the output will be random):
curl -Z -s --parallel-max 10 -d  '{"text_input": "Why did the chicken cross the", "max_tokens": 1, "bad_words": [], "stop_words":[],"stream":true, "temperature":100.0,"random_seed":30,"top_k":0,"top_p":1.0,"return_log_probs":true}' 'http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream?[1-4]' | grep text_output
  3. Perform the same request with a different random_seed:
curl -Z -s --parallel-max 10 -d  '{"text_input": "Why did the chicken cross the", "max_tokens": 1, "bad_words": [], "stop_words":[],"stream":true, "temperature":100.0,"random_seed":31,"top_k":0,"top_p":1.0,"return_log_probs":true}' 'http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream?[1-4]' | grep text_output
  4. Finally, perform a single request with random_seed 0:
curl -Z -s --parallel-max 10 -d  '{"text_input": "Why did the chicken cross the", "max_tokens": 1, "bad_words": [], "stop_words":[],"stream":true, "temperature":100.0,"random_seed":0,"top_k":0,"top_p":1.0,"return_log_probs":true}' 'http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream' | grep text_output

The output I get from steps 2-4 with my build of phi-4-mini-4k:

Step 2:
"output_log_probs":-10.639476776123047,"text_output":"ram"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
Step 3:
"output_log_probs":-10.946898460388184,"text_output":"implementation"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
Step 4:
"output_log_probs":-10.670726776123047,"text_output":"jpeg"

The same bug occurs in all models: you will get different output tokens, but the point is that the incorrect responses in steps 2 and 3 match the random_seed 0 output from step 4.

A single request goes through curandInitialize (which is correct) rather than curandBatchInitialize, so the output of a request issued on its own can be trusted.
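
For reference, the two initialization paths look roughly like the sketch below. This is a paraphrase, not the actual decodingCommon.cu source; the "Sketch" kernel names and parameter shapes only approximate the real ones. The single-request path takes one scalar seed, so there is nothing to mis-index, while the batched path takes a per-request seed array and has to decide whether to index it by idx or by batchSlot:

#include <curand_kernel.h>
#include <cstdint>

// Single-request path (sketch of curandInitialize): one scalar seed for all slots,
// so the indexing question does not arise.
__global__ void curandInitializeSketch(
    curandState_t* states, int const* batchSlots, int size, uint64_t randomSeed)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        int const batchSlot = batchSlots ? batchSlots[idx] : idx;
        curand_init(randomSeed, 0, 0, &states[batchSlot]);
    }
}

// Batched path (sketch of curandBatchInitialize): one seed per request.
// The curand state lives at the decoder slot, but the seed buffer is a separate
// input tensor -- the line below is the one this PR changes.
__global__ void curandBatchInitializeSketch(
    curandState_t* states, int const* batchSlots, int size, uint64_t const* randomSeeds)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        int const batchSlot = batchSlots ? batchSlots[idx] : idx;
        curand_init(randomSeeds[batchSlot], 0, 0, &states[batchSlot]);  // pre-fix indexing by slot
    }
}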

Note also that when a batch of requests is sent, both curl and trtllm are eager and start processing the first request on its own before the other three arrive. This is why each batch shows one correct result before the remaining requests produce the incorrect answer.

So in steps 2 and 3 you see one output that respects the requested random_seed, followed by three outputs that match the random_seed 0 result.

Debugging steps:

Thread 447 "tritonserver" hit Breakpoint 15, tensorrt_llm::kernels::invokeCurandBatchInitialize (states=0x3414e6600, batchSlots=0x7fffc5c00200, batchSize=3, randomSeeds=0x3414e7e00, stream=0x7ff8d0dcb380) at /app/tensorrt_llm/cpp/tensorrt_llm/kernels/decodingCommon.cu:63
63          curandBatchInitialize<<<grid, block, 0, stream>>>(states, batchSlots, batchSize, randomSeeds);
(gdb) print *(int(*)[128])batchSlots
$1 = {35, 36, 37, 0, 0, 0, 0, ... (all 0)}
(gdb) print batchSize
$2 = 3
(gdb) init-if-undefined $tensor_scratch_buf = malloc(1024*1024*256)
(gdb) print (int)cudaMemcpy($tensor_scratch_buf, randomSeeds, 128 * 8, 4)
$3 = 0
(gdb) x/128gd $tensor_scratch_buf
0x7ffbfbe00010: 123     123
0x7ffbfbe00020: 123     0
0x7ffbfbe00030: 0       0
0x7ffbfbe00040: 0       0
... (all 0) ...
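
To connect the dump to the symptom: the seed buffer is packed at indices 0-2 (value 123 in this debugging session) while batchSlots holds decoder slots 35-37, so indexing the seeds by batchSlot reads the zero-filled tail. A minimal host-side check built from the values above (illustration only):

#include <cstdint>
#include <cstdio>

int main()
{
    const int batchSize = 3;
    uint64_t randomSeeds[128] = {123, 123, 123};  // remaining entries are zero, as in the dump
    int batchSlots[128] = {35, 36, 37};

    for (int idx = 0; idx < batchSize; ++idx)
    {
        printf("request %d: by batchSlot -> randomSeeds[%d] = %llu, by idx -> randomSeeds[%d] = %llu\n",
            idx, batchSlots[idx], (unsigned long long) randomSeeds[batchSlots[idx]],
            idx, (unsigned long long) randomSeeds[idx]);
    }
    return 0;
}
// Prints seed 0 for every request when indexing by batchSlot, and the intended
// seed 123 when indexing by idx -- matching the "acts like random_seed 0" symptom.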

Thus it appears to be incorrect to read from randomSeeds[batchSlots[idx]]; the kernel should read randomSeeds[idx] instead, since the runtime packs the seed buffer contiguously by request index.
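
Under that reading, the change in this PR amounts to indexing the seed buffer by idx in the batch kernel, roughly as in the sketch below (again an approximation, not the exact diff):

#include <curand_kernel.h>
#include <cstdint>

// Sketch of the batch kernel after this PR's change: the curand state still lives
// at the decoder slot, but the seed is read from the position where the runtime
// packed it, i.e. the request index.
__global__ void curandBatchInitializeFixedSketch(
    curandState_t* states, int const* batchSlots, int size, uint64_t const* randomSeeds)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        int const batchSlot = batchSlots ? batchSlots[idx] : idx;
        curand_init(randomSeeds[idx], 0, 0, &states[batchSlot]);  // idx, not batchSlot
    }
}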

@juney-nvidia (Collaborator)

@pathorn thanks for submitting the fix. Our engineer is working to cherry-pick it into our internal repo; once the cherry-pick is done it will be pushed to the GitHub main branch on the following Tuesday. This MR will likely land in the TensorRT-LLM main branch next week or the week after.

Thanks
June

@pathorn (Contributor, Author) commented Jun 12, 2024

This fix was merged in. Thanks!

@pathorn pathorn closed this Jun 12, 2024
@byshiue byshiue added the Merged label Jun 13, 2024