
enc-dec triton backend support #800

Closed

shannonphu opened this issue Jan 3, 2024 · 20 comments

@shannonphu

Hi, is there any update on when enc-dec models like T5 will get TRT-LLM Triton backend support? Posting an issue for awareness, and I just wanted to know if it's still being planned. Thanks in advance!

#424 (reply in thread)

@symphonylyh
Collaborator

Hi @shannonphu , yes we're working on it. Right now it's at the stage of adding the C++ runtime. Tentative date for Triton enc-dec support is around mid to late January. Thanks for your patience

@sihanwang41

> Hi @shannonphu , yes we're working on it. Right now it's at the stage of adding the C++ runtime. Tentative date for Triton enc-dec support is around mid to late January. Thanks for your patience

Does it also include continuous batching?

@symphonylyh
Collaborator

> Does it also include continuous batching?

Our current plan is to get there in steps: (1) C++ runtime, (2) regular Triton support, (3) continuous batching. Eventually we want to enable continuous batching, but the mid-to-late-January release is more likely to only have (1) and (2), with (3) coming right after.
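For readers unfamiliar with the distinction being discussed, a toy scheduling simulation illustrates why continuous (inflight) batching helps: with static batching a batch is held until its longest request finishes, while continuous batching backfills freed slots immediately. This is an illustration of the scheduling idea only, not TensorRT-LLM code.

```python
# Toy comparison of static vs. continuous (inflight) batching.
# Each request needs some number of decoding steps; the server has
# `slots` parallel sequence slots. Illustration only, not TRT-LLM code.
import heapq

def static_batching_steps(lengths, slots):
    """Admit a full batch and hold it until every member finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])  # gated by the longest request
    return steps

def continuous_batching_steps(lengths, slots):
    """Admit a waiting request as soon as any slot frees up."""
    pending = list(lengths)
    in_flight = []  # min-heap of finish times for occupied slots
    now = 0
    while pending or in_flight:
        while pending and len(in_flight) < slots:
            heapq.heappush(in_flight, now + pending.pop(0))
        now = heapq.heappop(in_flight)  # advance to the next completion
    return now

lengths = [10, 2, 2, 2, 10, 2, 2, 2]
print(static_batching_steps(lengths, 4))      # 20: each batch waits on a 10-step request
print(continuous_batching_steps(lengths, 4))  # 12: short requests drain and backfill
```

With four slots, static batching spends 10 steps on each of the two batches, while continuous batching lets the short requests finish and hand their slots to waiting work.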

@mlmonk
Contributor

mlmonk commented Feb 1, 2024

@symphonylyh Could you share if there's an update on this?

@shixianc

shixianc commented Feb 9, 2024

Hi, is there an update on this?

@symphonylyh
Collaborator

Hi @shannonphu , @sihanwang41 , @mlmonk , @shixianc ,
We have been actively working on this support, but we're finding the amount of work is more than expected, since we want a solid implementation that supports enc-dec and, in general, such two-stage pipelines.

May I use this thread to collect your feedback, so we can understand your needs and prioritize better? I know @sihanwang41 specifically asked about continuous batching, i.e., inflight batching, but the others haven't shared their requirements. Can you reply describing which of the options below would be helpful and could unblock you first:
(1) a Triton Python backend support to run enc-dec model
(2) a C++ runtime (no Triton) to run enc-dec model, without inflight batching
(3) a Triton C++ backend to run enc-dec model, without inflight batching
(4) a Triton C++ backend, with paged kv cache and inflight batching for enc-dec <-- final goal

Thanks

@shixianc

shixianc commented Feb 22, 2024

@symphonylyh Thanks for the update! Starting with (3) would unblock our team.

May I assume this would also support classic dynamic batching?

@symphonylyh
Collaborator

> @symphonylyh Thanks for the update! Starting with (3) would unblock our team.
>
> May I assume this would also support classic dynamic batching?

Got it, thanks for the input.
By dynamic batching, do you mean Triton's dynamic batching, which is unrelated to the inflight/continuous batching concept? If so, yes.
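For reference, Triton's classic dynamic batching is enabled per model in its config.pbtxt and is orthogonal to inflight batching inside the TRT-LLM runtime. A minimal illustrative fragment (the values here are placeholders, not recommendations):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

This tells the Triton server to coalesce individual requests into batches of a preferred size, waiting at most the given queue delay before dispatching whatever has accumulated.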

@shannonphu
Author

@symphonylyh (1) and/or (3). I am not super clear on the difference between the Python vs. C++ backend. I was using this guide to build the engine: https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/enc_dec/README.md

@mlmonk
Contributor

mlmonk commented Feb 24, 2024 via email

@shannonphu
Author

@mlmonk Oh interesting, I was under the impression that we just couldn't serve T5 models on Triton because the TRT-LLM backend wasn't ready for it yet.

@mlmonk
Contributor

mlmonk commented Mar 7, 2024

@symphonylyh @shannonphu We have been able to use Flan-T5 with Triton. I believe this is (1). You can reproduce it here. Note that this uses a much older version of both libraries, from when Flan-T5 was not officially supported.

Like @shixianc mentioned, (3) would unblock us and (4) would be the ideal state. It would be great if you could share how far along you are with the (3) release.

@LuckyL00ser

hey @symphonylyh , do you have any updates on the progress?

@XiaobingSuper

@symphonylyh, any progress?

@TeamSeshDeadBoy

Hello, @symphonylyh. Is there any progress on any of (1)-(4)?

@mrmuke

mrmuke commented May 13, 2024

We would love (1)

@symphonylyh
Collaborator

Hi @shannonphu , @sihanwang41 , @mlmonk , @shixianc, @LuckyL00ser , @XiaobingSuper @TeamSeshDeadBoy @mrmuke

As part of today's release #1725, the enc-dec C++ runtime has been implemented with inflight batching and paged kv cache. Please give it a try by following the README C++ runtime section. This directly corresponds to (4) above, with the Triton backend being added next.

Our near-term roadmap:

  1. Triton C++ backend, which is almost ready and will be released soon
  2. Multi-GPU support
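Since the release above mentions paged kv cache, here is a toy sketch of the idea for readers following along (an illustration only, not TensorRT-LLM's implementation): the cache is carved into fixed-size blocks, sequences grab blocks on demand as they generate tokens, and finished sequences return their blocks to a free pool, which is what lets inflight batching admit new requests without reserving worst-case memory up front.

```python
# Toy paged KV-cache allocator: fixed-size blocks handed out on demand.
# Illustration of the paging idea only, not TensorRT-LLM's implementation.

class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of free block ids
        self.tables = {}                     # seq id -> list of block ids
        self.lengths = {}                    # seq id -> tokens stored

    def append_token(self, seq):
        """Reserve cache space for one more token of `seq`; False if full."""
        table = self.tables.setdefault(seq, [])
        n = self.lengths.get(seq, 0)
        if n == len(table) * self.block_size:  # current blocks are full
            if not self.free:
                return False                   # must wait for a slot to free
            table.append(self.free.pop())      # grab one more block
        self.lengths[seq] = n + 1
        return True

    def release(self, seq):
        """Sequence finished: return its blocks to the pool."""
        self.free.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(40):                  # a 40-token sequence needs 3 blocks
    assert cache.append_token("req-A")
print(len(cache.free))               # 1 block left for other requests
cache.release("req-A")
print(len(cache.free))               # all 4 blocks back in the pool
```

The point is that memory is committed one block at a time as a sequence actually grows, instead of reserving the maximum sequence length for every request up front.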

@mlmonk
Contributor

mlmonk commented Jun 5, 2024

Thanks for the update! This is excellent news, I'm sure it was a lot of effort to make it happen.

@HamzaG737

Hello @symphonylyh,
Is there any progress on adding (1)?

@symphonylyh
Collaborator

@HamzaG737 it's full-fledged now. For (1) Triton backend, you can follow the guide here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md.

Also, closing this issue as support has been added.
