
Reference input randomSeeds by idx rather than batchSlot #1742

Closed

Conversation

@pathorn (Contributor) commented Jun 5, 2024

Steps to reproduce:

  1. Compile a model with max_batch_size 8 or higher (this makes the steps easier to reproduce); I used 128.
  2. Perform an extremely high-temperature request (the output will be random):
curl -Z -s --parallel-max 10 -d  '{"text_input": "Why did the chicken cross the", "max_tokens": 1, "bad_words": [], "stop_words":[],"stream":true, "temperature":100.0,"random_seed":30,"top_k":0,"top_p":1.0,"return_log_probs":true}' 'http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream?[1-4]' | grep text_output
  3. Perform the same request with a different random_seed:
curl -Z -s --parallel-max 10 -d  '{"text_input": "Why did the chicken cross the", "max_tokens": 1, "bad_words": [], "stop_words":[],"stream":true, "temperature":100.0,"random_seed":31,"top_k":0,"top_p":1.0,"return_log_probs":true}' 'http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream?[1-4]' | grep text_output
  4. Finally, perform a single request with random_seed 0:
curl -Z -s --parallel-max 10 -d  '{"text_input": "Why did the chicken cross the", "max_tokens": 1, "bad_words": [], "stop_words":[],"stream":true, "temperature":100.0,"random_seed":0,"top_k":0,"top_p":1.0,"return_log_probs":true}' 'http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream' | grep text_output

The output I get from steps 2-4 with my build of phi-4-mini-4k:

Step 2:
"output_log_probs":-10.639476776123047,"text_output":"ram"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
Step 3:
"output_log_probs":-10.946898460388184,"text_output":"implementation"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
"output_log_probs":-10.670726776123047,"text_output":"jpeg"
Step 4:
"output_log_probs":-10.670726776123047,"text_output":"jpeg"

The same bug occurs in all models: you will get different output tokens, but the point is that the incorrect responses in steps 2 and 3 match the random_seed 0 output from step 4.

A single request goes through curandInitialize (which is correct) rather than curandBatchInitialize, so the output of a request issued on its own can be trusted.
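
For reference, the two initialization paths look roughly like the sketch below. This is a paraphrase, not the actual decodingCommon.cu source; the "Sketch" kernel names and parameter shapes only approximate the real ones. The single-request path takes one scalar seed, so there is nothing to mis-index, while the batched path takes a per-request seed array and has to decide whether to index it by idx or by batchSlot:

#include <curand_kernel.h>
#include <cstdint>

// Single-request path (sketch of curandInitialize): one scalar seed for all slots,
// so the indexing question does not arise.
__global__ void curandInitializeSketch(
    curandState_t* states, int const* batchSlots, int size, uint64_t randomSeed)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        int const batchSlot = batchSlots ? batchSlots[idx] : idx;
        curand_init(randomSeed, 0, 0, &states[batchSlot]);
    }
}

// Batched path (sketch of curandBatchInitialize): one seed per request.
// The curand state lives at the decoder slot, but the seed buffer is a separate
// input tensor -- the line below is the one this PR changes.
__global__ void curandBatchInitializeSketch(
    curandState_t* states, int const* batchSlots, int size, uint64_t const* randomSeeds)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        int const batchSlot = batchSlots ? batchSlots[idx] : idx;
        curand_init(randomSeeds[batchSlot], 0, 0, &states[batchSlot]);  // pre-fix indexing by slot
    }
}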

Note also that when a batch of requests is sent, both curl and trtllm are eager and start processing the first request on its own before the other three arrive. This is why each batch shows one correct result before the remaining requests produce the incorrect answer.

So in steps 2 and 3 you see one output that respects the requested random_seed, followed by three outputs that match the random_seed 0 result.

Debugging steps:

Thread 447 "tritonserver" hit Breakpoint 15, tensorrt_llm::kernels::invokeCurandBatchInitialize (states=0x3414e6600, batchSlots=0x7fffc5c00200, batchSize=3, randomSeeds=0x3414e7e00, stream=0x7ff8d0dcb380) at /app/tensorrt_llm/cpp/tensorrt_llm/kernels/decodingCommon.cu:63
63          curandBatchInitialize<<<grid, block, 0, stream>>>(states, batchSlots, batchSize, randomSeeds);
(gdb) print *(int(*)[128])batchSlots
$1 = {35, 36, 37, 0, 0, 0, 0, ... (all 0)}
(gdb) print batchSize
$2 = 3
(gdb) init-if-undefined $tensor_scratch_buf = malloc(1024*1024*256)
(gdb) print (int)cudaMemcpy($tensor_scratch_buf, randomSeeds, 128 * 8, 4)
$3 = 0
(gdb) x/128gd $tensor_scratch_buf
0x7ffbfbe00010: 123     123
0x7ffbfbe00020: 123     0
0x7ffbfbe00030: 0       0
0x7ffbfbe00040: 0       0
... (all 0) ...
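
To connect the dump to the symptom: the seed buffer is packed at indices 0-2 (value 123 in this debugging session) while batchSlots holds decoder slots 35-37, so indexing the seeds by batchSlot reads the zero-filled tail. A minimal host-side check built from the values above (illustration only):

#include <cstdint>
#include <cstdio>

int main()
{
    const int batchSize = 3;
    uint64_t randomSeeds[128] = {123, 123, 123};  // remaining entries are zero, as in the dump
    int batchSlots[128] = {35, 36, 37};

    for (int idx = 0; idx < batchSize; ++idx)
    {
        printf("request %d: by batchSlot -> randomSeeds[%d] = %llu, by idx -> randomSeeds[%d] = %llu\n",
            idx, batchSlots[idx], (unsigned long long) randomSeeds[batchSlots[idx]],
            idx, (unsigned long long) randomSeeds[idx]);
    }
    return 0;
}
// Prints seed 0 for every request when indexing by batchSlot, and the intended
// seed 123 when indexing by idx -- matching the "acts like random_seed 0" symptom.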

Thus it appears to be incorrect to read from randomSeeds[batchSlots[idx]]; the kernel should read randomSeeds[idx] instead, since the runtime packs the seed buffer contiguously by request index.
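
Under that reading, the change in this PR amounts to indexing the seed buffer by idx in the batch kernel, roughly as in the sketch below (again an approximation, not the exact diff):

#include <curand_kernel.h>
#include <cstdint>

// Sketch of the batch kernel after this PR's change: the curand state still lives
// at the decoder slot, but the seed is read from the position where the runtime
// packed it, i.e. the request index.
__global__ void curandBatchInitializeFixedSketch(
    curandState_t* states, int const* batchSlots, int size, uint64_t const* randomSeeds)
{
    int const idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        int const batchSlot = batchSlots ? batchSlots[idx] : idx;
        curand_init(randomSeeds[idx], 0, 0, &states[batchSlot]);  // idx, not batchSlot
    }
}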

@juney-nvidia (Collaborator)

@pathorn thanks for submitting the fix. Our engineer is working to cherry-pick it into our internal repo; once the cherry-pick is done it will be pushed to the GitHub main branch on the following Tuesday. This MR will likely land in the TensorRT-LLM main branch next week or the week after.

Thanks
June

@pathorn (Contributor, Author) commented Jun 12, 2024

This fix was merged in. Thanks!

@pathorn pathorn closed this Jun 12, 2024
@byshiue byshiue added the Merged label Jun 13, 2024