I see that when running streaming inference, the result contains `generationLogits` for the full sequence. This means a tensor of shape `batch_size * beam_size * max_output_length * vocab_size` is returned every time the `executor` yields a token, which is very inefficient. Is this expected behavior? I believe returning only the logits for the current token would be more optimal.
Please have a look. Thanks in advance!
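To illustrate the overhead, here is a rough back-of-the-envelope sketch of the per-token payload. The concrete sizes below are hypothetical placeholders, not values from this issue, and float32 logits are assumed:

```python
# Hypothetical example sizes for illustration only; real values depend on the model.
batch_size = 1
beam_size = 1
max_output_length = 1024
vocab_size = 32000
bytes_per_elem = 4  # assuming float32 logits

# Current behavior: full-sequence generationLogits returned with every streamed token.
full_logits_bytes = (
    batch_size * beam_size * max_output_length * vocab_size * bytes_per_elem
)

# Suggested behavior: only the newly generated token's logits per streamed response.
per_token_logits_bytes = batch_size * beam_size * 1 * vocab_size * bytes_per_elem

print(full_logits_bytes // (1024 * 1024), "MiB per streamed token (full sequence)")
print(per_token_logits_bytes // 1024, "KiB per streamed token (current token only)")
```

With these example numbers, every streamed token carries 125 MiB instead of 125 KiB, i.e. the full-sequence shape is `max_output_length` times larger than a per-token slice.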