
Fail to run Medusa IFB with triton inference server #1449

Closed
1 of 4 tasks
chiendb97 opened this issue Apr 15, 2024 · 8 comments
Assignees
Labels
bug Something isn't working triaged Issue has been triaged by maintainers

Comments

@chiendb97

System Info

GPU: A30
GPU memory: 24G
TensorRT-LLM: 0.9.0.dev2024040900
CUDA: 12.3
OS: Ubuntu 20.04

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I use a Medusa model with attn_bias=true, and I modified the code in examples/medusa/convert_checkpoint.py to support it.
1. Convert model
python3 examples/medusa/convert_checkpoint.py --workers 16 --model_dir /models/production/chat_legal_llama --output_dir /models/medusa/chat_legal_medusa/tensorrt_llm/c-model --dtype float16 --tp_size 2 --pp_size 1 --medusa_model_dir /models/medusa/chat_legal_medusa --fixed_num_medusa_heads 5 --max_medusa_token_len 63
2. Build engine
trtllm-build --workers 16 --tp_size 2 --pp_size 1 --checkpoint_dir=/models/medusa/chat_legal_medusa/tensorrt_llm/c-model --output_dir=/models/medusa/chat_legal_medusa/tensorrt_llm/engine --use_custom_all_reduce disable --gemm_plugin float16 --gpt_attention_plugin float16 --use_paged_context_fmha enable --paged_kv_cache enable --remove_input_padding enable --context_fmha enable --multi_block_mode enable --max_batch_size 2 --max_beam_width 1 --max_input_len 4096 --max_output_len 1024
3. Deploy model with triton inference server
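Step 3 is not spelled out in the report; below is a minimal sketch of a Triton deployment, assuming a checkout of the standard tensorrtllm_backend repository layout. The repository path, template keys, and parameter values here are illustrative assumptions, not taken from the report:

```shell
# Illustrative sketch, assuming the tensorrtllm_backend repo layout.
# Copy the inflight-batcher model templates into a working model repository.
cp -r all_models/inflight_batcher_llm triton_model_repo

# Fill in the tensorrt_llm model config; decoupled mode enables streaming,
# and the engine_dir matches the trtllm-build output above.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "engine_dir:/models/medusa/chat_legal_medusa/tensorrt_llm/engine,decoupled_mode:True,batching_strategy:inflight_fused_batching"

# tp_size 2 in the build above implies a world size of 2.
python3 scripts/launch_triton_server.py --world_size 2 --model_repo triton_model_repo
```

This is a deployment fragment, not a runnable example; exact template keys depend on the tensorrtllm_backend version in use.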

Expected behavior

The model returns correct results.

Actual behavior

The server crashes when using streaming, when stopping early with an end id, or when using a decoding mode with top_k or top_p.

1. Stopping early with end id
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: 0 <= acceptedTokensLen && acceptedTokensLen <= nextDraftTokensLen (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:1424)
1 0x7f40905b2a60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f4034912362 /data01/kilm/users/chiendb/projects/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0xd9362) [0x7f4034912362]
3 0x7f4034abcdb4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less, std::allocator >&) + 36
4 0x7f4034ac4ee4 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 404
5 0x7f40ab1f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f40ab1f1253]
6 0x7f40aaf80ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f40aaf80ac3]
7 0x7f40ab011a04 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 63083907: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: 0 <= acceptedTokensLen && acceptedTokensLen <= nextDraftTokensLen (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:1424)
1 0x7f40905b2a60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f4034912362 /data01/kilm/users/chiendb/projects/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0xd9362) [0x7f4034912362]
3 0x7f4034abcdb4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less, std::allocator >&) + 36
4 0x7f4034ac4ee4 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 404
5 0x7f40ab1f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f40ab1f1253]
6 0x7f40aaf80ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f40aaf80ac3]
7 0x7f40ab011a04 clone + 68
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: 0 <= acceptedTokensLen && acceptedTokensLen <= nextDraftTokensLen (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:1424)
1 0x7f8bfc4e3a60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f8ba0912362 /projects/TensorRT-LLM/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0xd9362) [0x7f8ba0912362]
3 0x7f8ba0abcdb4 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> > >&, std::set<unsigned long, std::less, std::allocator >&) + 36
4 0x7f8ba0ac4ee4 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 404
5 0x7f8c0fff1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f8c0fff1253]
6 0x7f8c0fd80ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8c0fd80ac3]
7 0x7f8c0fe11a04 clone + 68
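For context on the assertion: Medusa speculatively proposes draft tokens, and the runtime counts how many of them the base model accepts; that count must lie in [0, nextDraftTokensLen], which is exactly the invariant being violated. A minimal sketch of the acceptance count (illustrative only, not the actual TensorRT-LLM implementation):

```python
def count_accepted_tokens(draft_tokens, target_tokens):
    """Length of the longest prefix of draft_tokens that the target model
    (greedily) agrees with; by construction 0 <= n <= len(draft_tokens)."""
    n = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        n += 1
    return n

# The target model agrees with the first two of three draft tokens.
print(count_accepted_tokens([5, 7, 9], [5, 7, 2]))  # prints 2
```

The crash suggests the runtime computed an acceptance length outside these bounds, rather than that the invariant itself is wrong.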

2. Using streaming
I0415 02:59:52.989804 11244 stream_infer_handler.cc:155] Process for ModelStreamInferHandler, rpc_ok=1, context 0, 0 step WRITTEN
I0415 02:59:52.989807 11244 infer_handler.h:1305] Returning from ModelStreamInferHandler, 0, ISSUED
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
what(): [TensorRT-LLM][ERROR] Assertion failed: newSize <= getCapacity() (/projects/TensorRT-LLM/cpp/tensorrt_llm/runtime/bufferView.h:83)
1 0x7f083839ba60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f07dca3730e virtual thunk to tensorrt_llm::runtime::TensorView::reshape(nvinfer1::Dims32 const&) + 366
3 0x7f07dca382a3 virtual thunk to tensorrt_llm::runtime::TensorView::resize(unsigned long) + 147
4 0x7f07dcabd2e1 tensorrt_llm::batch_manager::GptManager::returnCompletedRequests() + 1297
5 0x7f07dcac4f11 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 449
6 0x7f084adf1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f084adf1253]
7 0x7f084ab80ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f084ab80ac3]
8 0x7f084ac11a04 clone + 68 what(): [TensorRT-LLM][ERROR] Assertion failed: newSize <= getCapacity() (/projects/TensorRT-LLM/cpp/tensorrt_llm/runtime/bufferView.h:83)
1 0x7f0ce6459a60 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 121
2 0x7f0c84a3730e virtual thunk to tensorrt_llm::runtime::TensorView::reshape(nvinfer1::Dims32 const&) + 366
3 0x7f0c84a382a3 virtual thunk to tensorrt_llm::runtime::TensorView::resize(unsigned long) + 147
4 0x7f0c84abd2e1 tensorrt_llm::batch_manager::GptManager::returnCompletedRequests() + 1297
5 0x7f0c84ac4f11 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 449
6 0x7f0cfa7f1253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f0cfa7f1253]
7 0x7f0cfa580ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f0cfa580ac3]
8 0x7f0cfa611a04 clone + 68

Signal (6) received.
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

3. Using decoding mode with top_k or top_p
Received an error from server:
in ensemble 'ensemble', Failed to process the request(s) for model instance 'postprocessing_0_0', message: TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'

additional notes

When I use the script https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/run.py, the model returns correct results.

@chiendb97 chiendb97 added the bug Something isn't working label Apr 15, 2024
@nekorobov
Collaborator

nekorobov commented Apr 18, 2024

Hi @chiendb97 , thank you for reporting these issues.

"Stopping early with end id" is a known bug and we're on it. I hope we will get a fix into the main branch soon.

Regarding "Using streaming", is it also a failure that you've seen during the end_id experiment?

"Using decoding mode with top_k or top_p": Medusa uses its own decoding mode, "Medusa", internally, regardless of the decoding mode set by the user. So I'm not sure what you mean by this. Could you give a bit more context, please?

@nekorobov nekorobov added the triaged Issue has been triaged by maintainers label Apr 18, 2024
@chiendb97
Author

chiendb97 commented Apr 18, 2024

Hi @nekorobov , thank you for your reply.

Regarding "Using streaming", is it also a failure that you've seen during the end_id experiment?

Regarding "Using streaming", it's an error that occurs when I don't use early stopping with an end id.

"Using decoding mode with top_k or top_p": Medusa uses its own decoding mode, "Medusa", internally, regardless of the decoding mode set by the user. So I'm not sure what you mean by this. Could you give a bit more context, please?

When I set decoding_mode to "", the model returns normal results. When I set decoding_mode to "top_k" or "top_p", the model returns all token IDs as 0, which results in the error above.
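For reference, decoding_mode is typically set as a parameter in the tensorrt_llm model's config.pbtxt. A sketch of the parameter block, with an illustrative value (the exact accepted strings depend on the backend version, and for Medusa engines the runtime is expected to use the Medusa mode internally regardless):

```
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "top_k_top_p"
  }
}
```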

@nekorobov
Collaborator

nekorobov commented Apr 19, 2024

I've reproduced all 3 issues and am working on fixes. I will update this bug once the fixes are merged into the main branch. Thank you again for reporting!

@nekorobov nekorobov self-assigned this Apr 19, 2024
@nekorobov
Collaborator

All of these issues should be solved in the latest main branch. Could you try it and reopen if it does not work for you? Thank you!

@chiendb97
Author

@nekorobov I re-ran the code using the latest version from the main branch. However, I noticed incorrect results when using the Medusa model with a long context length. I also tested it with the Vicuna model and my own model. Below are the test results on the Vicuna model (max_input_length=1024, max_output_length=1024, repetition_penalty=1.03):

Prompt:
Write long essays in English: I am

Llama output:

  • Able to write a minimum of 500 words in English on a given topic.
  • Able to write a maximum of 1000 words in English on a given topic.
  • Able to use appropriate grammar, punctuation and spelling in my writing.
  • Able to organize my ideas effectively and present them in a logical manner.
  • Able to use a variety of sentence structures and vocabul The first time I saw the movie "The Secret Life of Walter Mitty," I was struck by the vivid imagery and the sense of adventure that it portrayed. The film follows the life of a man named Walter Mitty, who is stuck in a mundane life and dreams of adventure. He daydreams about being a hero, traveling the world, and living life to the fullest.

As I watched the movie, I couldn't help but relate to Walter Mitty. I too have always been drawn to adventure and have often found myself daydreaming about traveling the world and experiencing new things. I have always been fascinated by the idea of living life to the fullest and making the most of every moment.

However, as I got older, I realized that adventure doesn't always have to be far away or expensive. Sometimes, the greatest adventures can be found right in your own backyard.

One of the most adventurous things I have ever done was move to a foreign country alone. I left behind my family and friends and moved to a place where I didn't know anyone or speak the language. It was scary and challenging, but it was also one of the most rewarding experiences of my life.

Another adventure I had was learning to surf. I had always been afraid of trying new things, but I decided to take a surfing lesson and push myself out of my comfort zone. It was difficult at first, but the feeling of accomplishment and the rush of adrenaline was worth it.

Adventure can come in many forms, and it's important to remember that it doesn't always have to be expensive or far away. Sometimes, the greatest adventures are the ones that challenge us and push us out of our comfort zones.

So, if you're feeling stuck in a rut or looking for a new adventure, don't be afraid to step outside of your comfort zone and try something new. You never know what amazing experiences you might discover along the way.Home » News » News » 2019: PDP will defeat APC in all elections – Secondus
On February 14, 2019 12:37 pmIn Newsby vanguard
The National Chairman of the Peoples Democratic Party (PDP), Uche Secondus, has expressed confidence that the party would defeat the All Progressives Congress (APC) in all elections in 2019.
Secondus made the remark on Thursday in Abuja while addressing journalists on the state of the nation.
He said that the PDP was ready to take over power from the APC in 2019, adding that the party had learnt from its past mistakes and had put in place measures to ensure victory.
According to him, the PDP has a formidable structure across the country, with a strong membership base, adding that the party would work hard to ensure that it wins the presidential, National Assembly and governorship elections.
“We are ready for the 2019 elections. We are going to win in all the elections. We have learnt from our past mistakes and we have put in place measures to ensure victory.
“The PDP has a formidable structure across the country, with a strong membership base. We will work hard to ensure that we win the presidential, National Assembly and governorship elections,” he said.
Secondus also dismissed the notion that the PDP was divided, saying that the party was united and ready to take over power from the APC.
He said that the PDP had put in place measures to ensure that it won the 2019 elections, adding that the party would work hard to ensure that it won the presidential, National Assembly and governorship elections.
“The PDP is not divided. We are united and ready to take over power from the APC. We have put in place measures to ensure that we win the 2019 elections,” he said.
Secondus also expressed concern over the state of the economy, saying that the PDP was ready to address the challeng

Medusa output:

  • Able to write long essays in English with a high level of proficiency.
  • Able to write essays that are well-organized, well-structured, and easy to follow.
  • Able to write essays that are free of grammatical errors and spelling mistakes.
  • Able to write essays that are well-supported by evidence and examples.
  • Able to write essays that are persuasive and engaging.
  • Able to write essays that are free of plagiarism.
  • Able to write essays that are within the required word count.
  • Able to write essays that are tailored to the specific requirements of the task.
  • Able to write essays that are well-researched and well-referenced.
  • Able to write essays that are free of repetition and redundancy.
  • Able to write essays that are free of ambiguity and vagueness.
  • Able to write essays that are free of inconsistencies and contradictions.
  • Able to write essays that are free of errors in grammar______COCOCOAOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给

@chiendb97
Author

Hi @nekorobov
I'm experiencing this error even when using the script available at https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/run.py. Additionally, I didn't encounter this issue in version 0.9.0.dev2024040900. Could you try reproducing the problem on your end?
Thank you!

@littletomatodonkey

Hi @chiendb97, I use GptManager for auto-regressive LLM decoding now. How do you use Medusa with GptManager? Could you please provide a demo? Thanks!

@chiendb97
Author

Hi @chiendb97, I use GptManager for auto-regressive LLM decoding now. How do you use Medusa with GptManager? Could you please provide a demo? Thanks!

@littletomatodonkey I utilize Medusa via the TensorRT-LLM Backend. I've made modifications to the TensorRT-LLM and nvidia-modelopt code to enable Medusa decoding with FP8 precision. In my case, using Medusa decoding has resulted in a latency reduction of approximately 40% when decoding with FP8 precision.
