I see that when running streaming inference, the result contains `generationLogits` for the full sequence. This means a tensor of shape `batch_size * beam_size * max_output_length * vocab_size` is returned every time the `executor` yields a token, which is very inefficient. Is this expected behavior? I believe returning only the logits for the current token would be more optimal.
Please have a look. Thanks in advance!
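To illustrate the overhead, here is a rough back-of-the-envelope sketch of the per-token payload. The concrete sizes below are hypothetical placeholders, not values from this issue, and float32 logits are assumed:

```python
# Hypothetical example sizes for illustration only; real values depend on the model.
batch_size = 1
beam_size = 1
max_output_length = 1024
vocab_size = 32000
bytes_per_elem = 4  # assuming float32 logits

# Current behavior: full-sequence generationLogits returned with every streamed token.
full_logits_bytes = (
    batch_size * beam_size * max_output_length * vocab_size * bytes_per_elem
)

# Suggested behavior: only the newly generated token's logits per streamed response.
per_token_logits_bytes = batch_size * beam_size * 1 * vocab_size * bytes_per_elem

print(full_logits_bytes // (1024 * 1024), "MiB per streamed token (full sequence)")
print(per_token_logits_bytes // 1024, "KiB per streamed token (current token only)")
```

With these example numbers, every streamed token carries 125 MiB instead of 125 KiB, i.e. the full-sequence shape is `max_output_length` times larger than a per-token slice.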