Segmentation fault with pipeline parallelism and gather_all_token_logits #1284

Closed
Marks101 opened this issue Mar 12, 2024 · 5 comments
Labels: bug

@Marks101 (Contributor)

System Info

  • NVIDIA H100 DGX
  • CUDA 12.1
  • TensorRT-LLM 0.8.0

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Starting from the official Falcon example, I enabled pipeline parallelism and gather_all_token_logits:

python convert_checkpoint.py --model_dir ./falcon/7b-instruct --dtype bfloat16 --output_dir ./falcon/7b-instruct/trt_ckpt/bf16/2-gpu/ --pp_size 2

trtllm-build --checkpoint_dir ./falcon/7b-instruct/trt_ckpt/bf16/2-gpu/ --gemm_plugin bfloat16 --remove_input_padding enable --gpt_attention_plugin bfloat16 --output_dir ./falcon/7b-instruct/trt_engines/bf16/2-gpu/ --gather_all_token_logits

python ../summarize.py --test_trt_llm --hf_model_dir ./falcon/7b-instruct --engine_dir ./falcon/7b-instruct/trt_engines/bf16/2-gpu/
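
Note that with --pp_size 2 the engine spans two GPUs, so the summarize step runs as two MPI ranks. A typical launch (the exact mpirun flags here are only illustrative) looks like:

mpirun -n 2 --allow-run-as-root python ../summarize.py --test_trt_llm --hf_model_dir ./falcon/7b-instruct --engine_dir ./falcon/7b-instruct/trt_engines/bf16/2-gpu/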

Expected behavior

The summarization produces results comparable to the same model built without pipeline parallelism and without gather_all_token_logits.

Actual behavior

Crashes with the following stack trace:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x8
[ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fa73ade1520]
[ 1] /virtualenv/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSession18executeContextStepERKSt6vectorINS0_15GenerationInputESaIS3_EERKS2_IiSaIiEEPKNS_13batch_manager16kv_cache_manager14KVCacheManagerE+0x5a2)[0x7fa455c9a7c2]
[ 2] /virtualenv/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSession15generateBatchedERSt6vectorINS0_16GenerationOutputESaIS3_EERKS2_INS0_15GenerationInputESaIS7_EERKNS0_14SamplingConfigERKSt8functionIFvibEE+0xc0b)[0x7fa455c9b89b]
[ 3] /virtualenv/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime10GptSession8generateERNS0_16GenerationOutputERKNS0_15GenerationInputERKNS0_14SamplingConfigE+0xc43)[0x7fa455c9d2f3]
[ 4] /virtualenv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x42f79)[0x7fa484d80f79]
[ 5] /virtualenv/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2d19e)[0x7fa484d6b19e]
[ 6] python(+0x15a10e)[0x55cc0703e10e]
[ 7] python(_PyObject_MakeTpCall+0x25b)[0x55cc07034a7b]
[ 8] python(+0x168acb)[0x55cc0704cacb]
[ 9] python(_PyEval_EvalFrameDefault+0x614a)[0x55cc0702ccfa]
[10] python(+0x1687f1)[0x55cc0704c7f1]
[11] python(PyObject_Call+0x122)[0x55cc0704d492]
[12] python(_PyEval_EvalFrameDefault+0x2a27)[0x55cc070295d7]
[13] python(_PyFunction_Vectorcall+0x7c)[0x55cc0703e9fc]
[14] python(_PyEval_EvalFrameDefault+0x198c)[0x55cc0702853c]
[15] python(_PyFunction_Vectorcall+0x7c)[0x55cc0703e9fc]
[16] python(_PyEval_EvalFrameDefault+0x6bd)[0x55cc0702726d]
[17] python(+0x13f9c6)[0x55cc070239c6]
[18] python(PyEval_EvalCode+0x86)[0x55cc07119256]
[19] python(+0x260108)[0x55cc07144108]
[20] python(+0x2599cb)[0x55cc0713d9cb]
[21] python(+0x25fe55)[0x55cc07143e55]
[22] python(_PyRun_SimpleFileObject+0x1a8)[0x55cc07143338]
[23] python(_PyRun_AnyFileObject+0x43)[0x55cc07142f83]
[24] python(Py_RunMain+0x2be)[0x55cc07135a5e]
[25] python(Py_BytesMain+0x2d)[0x55cc0710c02d]
[26] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fa73adc8d90]
[27] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fa73adc8e40]
[28] python(_start+0x25)[0x55cc0710bf25]
*** End of error message ***

If I add --use_py_session, I get the following error:

Traceback (most recent call last):
  File "/TensorRT-LLM/examples/falcon/../summarize.py", line 644, in <module>
    main(args)
  File "/TensorRT-LLM/examples/falcon/../summarize.py", line 388, in main
    output, *_ = eval_trt_llm(datapoint,
  File "/TensorRT-LLM/examples/falcon/../summarize.py", line 233, in eval_trt_llm
    outputs = runner.generate(
  File "/virtualenv/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner.py", line 642, in generate
    outputs = self._prepare_outputs(outputs, input_lengths)
  File "/virtualenv/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner.py", line 237, in _prepare_outputs
    context_logits = context_logits.flatten(end_dim=-2)
AttributeError: 'NoneType' object has no attribute 'flatten'
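
The traceback suggests that _prepare_outputs in model_runner.py assumes the context logits are always present, while under pipeline parallelism they come back as None here. A guard of roughly the following shape (a hypothetical sketch, not the actual fix) would avoid the AttributeError, although the gathered logits would still be missing on this rank:

# Hypothetical sketch around model_runner.py:237; not the actual TensorRT-LLM change.
# Only flatten the context logits if the runtime actually returned them.
if context_logits is not None:
    context_logits = context_logits.flatten(end_dim=-2)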

Additional notes

We first noticed this error in several of our own tasks that require gathering logits while using pipeline parallelism. Since we were able to reproduce it with the official examples, this issue description is based on that reproduction for simplicity.
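
For context, the downstream pattern that needs the gathered logits is simply reading them off the runner output. A rough sketch of that pattern (engine path, prompt ids, token ids, and the rank handling are placeholders; the keyword names follow what summarize.py uses and may differ between versions):

# Illustrative only: placeholder prompt, token ids, and engine path; in a 2-rank
# launch the rank would come from the MPI environment rather than being hard-coded.
import torch
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir(
    engine_dir="./falcon/7b-instruct/trt_engines/bf16/2-gpu/", rank=0)
batch_input_ids = [torch.tensor([1, 2, 3], dtype=torch.int32)]
outputs = runner.generate(batch_input_ids, max_new_tokens=8,
                          end_id=11, pad_id=11, return_dict=True)
# With --gather_all_token_logits these are expected to be populated; under
# pipeline parallelism they currently come back as None, triggering the crash.
context_logits = outputs.get("context_logits")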

Marks101 added the bug label on Mar 12, 2024
@yweng0828

Hi @Marks101, thanks for your feedback; we will try to reproduce and fix this issue.

@yweng0828

Hi @Marks101, we have fixed this issue in our latest version; could you please verify it?
Please feel free to contact us if there are still problems.

@Marks101 (Author, Contributor)

Hi @yweng0828, thank you for taking care of this issue.
I was able to verify that everything is fixed. Great 😄

@yweng0828

Thanks for your update, @Marks101. Let's close this issue! :)

@byshiue (Collaborator) commented Apr 18, 2024

C
