[TensorRT-LLM][ERROR] Assertion failed: mModule != nullptr #2069

Open · KeitaW opened this issue Jul 31, 2024 · 1 comment
Labels: bug (Something isn't working), stale

KeitaW commented Jul 31, 2024

System Info

Who can help?

Hello @byshiue,
I am currently trying to write minimal working code that loads a TensorRT-LLM engine and runs text generation inference.
I have followed the example engine build process described here and confirmed that the engine works with the run.py script.

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm \
-v /home/ubuntu/g5-vs-g6-bench:/workspace \
-v ~/.cache:/root/.cache \
bench-image:latest python3 /workspace/TensorRT-LLM/examples/run.py --max_output_len=50 \
--tokenizer_dir "/workspace/models/Meta-Llama-3-8B" \
--engine_dir "/workspace/models/Meta-Llama-3-8B/trt_engines/bf16/1-gpu"

Output:

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 24.07 (build 102761898)
Triton Server Version 2.48.0

Copyright (c) 2018-2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.4 driver version 550.54.15 with kernel driver version 535.183.01.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 7 is not available.
[07/31/2024-12:57:04] [TRT-LLM] [I] Load engine takes: 11.176877498626709 sec
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 7 is not available.
Input [Text 0]: "<|begin_of_text|>Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " painter in Paris and was a pupil of the history painter, Paul Delaroche. He was a member of the Société des Artistes Français and exhibited at the Salon from 1848. He was also a member of the Société des"

Now I'm trying to create a GenerationSession manually and run text generation inference, and I'm encountering

 what():  [TensorRT-LLM][ERROR] Assertion failed: mModule != nullptr (/workspace/tensorrt_llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/cubinObj.cpp:150)

which eventually crashes the Python process. It is highly likely that I'm making a mistake in the GenerationSession setup, but I would like to confirm whether that is the case.
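
For context, run.py drives generation through the higher-level ModelRunner API rather than a hand-built GenerationSession. Below is a minimal sketch of that path (my reading of the 0.11.0 API, with the same engine/tokenizer paths as above), included to show what I am trying to replicate at a lower level:

# Minimal sketch of the higher-level path run.py takes (ModelRunner).
# Assumes tensorrt_llm 0.11.0 and the same paths as above.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "/workspace/models/Meta-Llama-3-8B/trt_engines/bf16/1-gpu"
tokenizer = AutoTokenizer.from_pretrained("/workspace/models/Meta-Llama-3-8B")

runner = ModelRunner.from_dir(engine_dir=engine_dir, rank=0)
input_ids = tokenizer.encode("To tell a story", return_tensors="pt").type(torch.int32)
with torch.no_grad():
    output_ids = runner.generate(
        batch_input_ids=[input_ids[0]],  # list of 1-D token-id tensors
        max_new_tokens=50,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))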

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Step 1: Build the engine

Follow the steps described in https://github.com/NVIDIA/TensorRT-LLM/tree/v0.11.0/examples/llama#llama-v3-updates inside the nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 container. In this particular example, I'm testing with Llama 3 8B; the commands I ran look roughly like the sketch below.
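
# Sketch of the build, following the linked README; the flags come from the
# v0.11.0 llama example, and the paths (including the checkpoint dir) are mine.
python3 TensorRT-LLM/examples/llama/convert_checkpoint.py \
    --model_dir /workspace/models/Meta-Llama-3-8B \
    --output_dir /workspace/models/Meta-Llama-3-8B/trt_ckpt/bf16/1-gpu \
    --dtype bfloat16

trtllm-build \
    --checkpoint_dir /workspace/models/Meta-Llama-3-8B/trt_ckpt/bf16/1-gpu \
    --output_dir /workspace/models/Meta-Llama-3-8B/trt_engines/bf16/1-gpu \
    --gemm_plugin bfloat16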

Step 2: Run the repro script

Run the following script inside the same container.

import json
import os

import torch
from transformers import AutoTokenizer

from tensorrt_llm import Mapping
from tensorrt_llm.quantization import QuantMode
from tensorrt_llm.runtime import GenerationSession, ModelConfig, SamplingConfig

engine_dir = "/workspace/models/Meta-Llama-3-8B/trt_engines/bf16/1-gpu/"
tokenizer_dir = "/workspace/models/Meta-Llama-3-8B"
rank = 0

config_path = os.path.join(engine_dir, 'config.json')
with open(config_path, 'r') as f:
    config = json.load(f)
build_config = config["build_config"]
plugin_config = build_config["plugin_config"]
pretrained_config = config["pretrained_config"]
quantization_config = pretrained_config["quantization"]
mapping_config = pretrained_config["mapping"]


quant_mode = QuantMode.from_quant_algo(
    quant_algo=quantization_config["quant_algo"],
    kv_cache_quant_algo=quantization_config["kv_cache_quant_algo"]
)

# Reconstruct the runtime ModelConfig from the engine's config.json.
model_config = ModelConfig(
    max_batch_size=build_config["max_batch_size"],
    max_beam_width=build_config["max_beam_width"],
    vocab_size=pretrained_config["vocab_size"],
    num_layers=pretrained_config["num_hidden_layers"],
    num_heads=pretrained_config["num_attention_heads"],
    num_kv_heads=pretrained_config["num_key_value_heads"],
    hidden_size=pretrained_config["hidden_size"],
    dtype=pretrained_config["dtype"],
    paged_kv_cache=plugin_config["paged_kv_cache"],
    tokens_per_block=plugin_config["tokens_per_block"],
    gpt_attention_plugin=bool(plugin_config["gpt_attention_plugin"]),
    remove_input_padding=plugin_config["remove_input_padding"],
    quant_mode=quant_mode
)


torch.cuda.set_device(rank % mapping_config["gpus_per_node"])

runtime_mapping = Mapping(
    world_size=mapping_config["world_size"],
    rank=rank,
    tp_size=mapping_config["tp_size"],
    pp_size=mapping_config["pp_size"]
)

engine_name = f'rank{rank}.engine'
serialize_path = os.path.join(engine_dir, engine_name)
with open(serialize_path, 'rb') as f:
    engine_buffer = f.read()

session = GenerationSession(
    model_config=model_config,
    engine_buffer=engine_buffer,
    mapping=runtime_mapping,
    debug_mode=True
)

tokenizer = AutoTokenizer.from_pretrained(
    tokenizer_dir,
    legacy=False,
    padding_side='left',
    truncation_side='left',
    trust_remote_code=True,
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token
pad_id = tokenizer.encode(tokenizer.pad_token, add_special_tokens=False)[0]
end_id = tokenizer.encode(tokenizer.eos_token, add_special_tokens=False)[0]
top_k = 1  # top_k is a candidate count, so an int rather than a float
num_beams = 1
sampling_config = SamplingConfig(
    pad_id=pad_id,
    end_id=end_id,
    top_k=top_k,
    num_beams=num_beams
)



input_text = "To tell a story"
input_id = tokenizer.encode(input_text, return_tensors="pt").type(torch.int32)
line_encoded = [input_id]
input_lengths = [input_id.shape[-1]]
max_length = max(input_lengths)

# Allocate runtime buffers; these lengths are upper bounds for this session.
session.setup(
    batch_size=1, max_context_length=4096, max_new_tokens=4096, beam_width=1,
    max_attention_window_size=4096
)

session.decode_batch(line_encoded, sampling_config)
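
For completeness, if decode_batch returned successfully I would read the result back roughly as follows (a sketch; I'm assuming decode_batch returns output ids shaped [batch_size, beam_width, seq_len] when streaming is disabled):

# Hypothetical continuation of the script above.
output_ids = session.decode_batch(line_encoded, sampling_config)
torch.cuda.synchronize()
for beam in output_ids[0]:
    # Strip the prompt tokens and decode only the newly generated part.
    print(tokenizer.decode(beam[input_lengths[0]:], skip_special_tokens=True))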

Expected behavior

The script returns exit code 0.

Actual behavior

I got the following error, which eventually crashes the Python process.

/usr/local/lib/python3.10/dist-packages/torch/nested/__init__.py:219: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  return _nested.nested_tensor(
CUDA Error: CUDA_ERROR_ILLEGAL_ADDRESS /workspace/tensorrt_llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/cubinObj.cpp 149
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: mModule != nullptr (/workspace/tensorrt_llm/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/cubinObj.cpp:150)
1       0x7f90f9787d24 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 100
2       0x7f90f9789f16 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x71ff16) [0x7f90f9789f16]
3       0x7f90f98e1440 tensorrt_llm::kernels::DecoderXQAImplJIT::prepareForActualXQAParams(tensorrt_llm::kernels::XQAParams const&) + 688
4       0x7f90f98e1590 tensorrt_llm::kernels::DecoderXQAImplJIT::prepare(tensorrt_llm::kernels::XQAParams const&) + 96
5       0x7f9088000c27 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xf9c27) [0x7f9088000c27]
6       0x7f9088021466 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x11a466) [0x7f9088021466]
7       0x7f91cdbc8974 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10a0974) [0x7f91cdbc8974]
8       0x7f91cdb6c783 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1044783) [0x7f91cdb6c783]
9       0x7f91cdb6e0c1 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10460c1) [0x7f91cdb6e0c1]
10      0x7f9174aa48f0 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa48f0) [0x7f9174aa48f0]
11      0x7f9174a458f3 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3) [0x7f9174a458f3]
12      0x558431125c9e /usr/bin/python3(+0x15ac9e) [0x558431125c9e]
13      0x55843111c3cb _PyObject_MakeTpCall + 603
14      0x5584311343eb /usr/bin/python3(+0x1693eb) [0x5584311343eb]
15      0x55843111459a _PyEval_EvalFrameDefault + 25674
16      0x55843112659c _PyFunction_Vectorcall + 124
17      0x55843110e96e _PyEval_EvalFrameDefault + 2078
18      0x55843113425e /usr/bin/python3(+0x16925e) [0x55843113425e]
19      0x558431110a9d _PyEval_EvalFrameDefault + 10573
20      0x55843113425e /usr/bin/python3(+0x16925e) [0x55843113425e]
21      0x558431110a9d _PyEval_EvalFrameDefault + 10573
22      0x55843112659c _PyFunction_Vectorcall + 124
23      0x558431134db2 PyObject_Call + 290
24      0x558431110a9d _PyEval_EvalFrameDefault + 10573
25      0x558431134111 /usr/bin/python3(+0x169111) [0x558431134111]
26      0x558431134db2 PyObject_Call + 290
27      0x558431110a9d _PyEval_EvalFrameDefault + 10573
28      0x55843112659c _PyFunction_Vectorcall + 124
29      0x55843110e96e _PyEval_EvalFrameDefault + 2078
30      0x55843110af96 /usr/bin/python3(+0x13ff96) [0x55843110af96]
31      0x558431200c66 PyEval_EvalCode + 134
32      0x55843120681d /usr/bin/python3(+0x23b81d) [0x55843120681d]
33      0x5584311267f9 /usr/bin/python3(+0x15b7f9) [0x5584311267f9]
34      0x55843110e827 _PyEval_EvalFrameDefault + 1751
35      0x558431143890 /usr/bin/python3(+0x178890) [0x558431143890]
36      0x5584311109bf _PyEval_EvalFrameDefault + 10351
37      0x558431143890 /usr/bin/python3(+0x178890) [0x558431143890]
38      0x5584311109bf _PyEval_EvalFrameDefault + 10351
39      0x558431143890 /usr/bin/python3(+0x178890) [0x558431143890]
40      0x55843122119f /usr/bin/python3(+0x25619f) [0x55843122119f]
41      0x558431131eca /usr/bin/python3(+0x166eca) [0x558431131eca]
42      0x55843110e96e _PyEval_EvalFrameDefault + 2078
43      0x55843112659c _PyFunction_Vectorcall + 124
44      0x55843110e827 _PyEval_EvalFrameDefault + 1751
45      0x55843112659c _PyFunction_Vectorcall + 124
46      0x55843110e96e _PyEval_EvalFrameDefault + 2078
47      0x558431134111 /usr/bin/python3(+0x169111) [0x558431134111]
48      0x55843110fb77 _PyEval_EvalFrameDefault + 6695
49      0x55843112659c _PyFunction_Vectorcall + 124
50      0x55843110e96e _PyEval_EvalFrameDefault + 2078
51      0x55843112659c _PyFunction_Vectorcall + 124
52      0x55843110e96e _PyEval_EvalFrameDefault + 2078
53      0x55843112659c _PyFunction_Vectorcall + 124
54      0x55843110e96e _PyEval_EvalFrameDefault + 2078
55      0x558431134111 /usr/bin/python3(+0x169111) [0x558431134111]
56      0x558431134db2 PyObject_Call + 290
57      0x558431110a9d _PyEval_EvalFrameDefault + 10573
58      0x55843112659c _PyFunction_Vectorcall + 124
59      0x55843110e827 _PyEval_EvalFrameDefault + 1751
60      0x55843110af96 /usr/bin/python3(+0x13ff96) [0x55843110af96]
61      0x558431200c66 PyEval_EvalCode + 134
62      0x55843122bb38 /usr/bin/python3(+0x260b38) [0x55843122bb38]
63      0x5584312253fb /usr/bin/python3(+0x25a3fb) [0x5584312253fb]
64      0x55843122b885 /usr/bin/python3(+0x260885) [0x55843122b885]
65      0x55843122ad68 _PyRun_SimpleFileObject + 424
66      0x55843122a9b3 _PyRun_AnyFileObject + 67
67      0x55843121d45e Py_RunMain + 702
68      0x5584311f3a3d Py_BytesMain + 45
69      0x7f9441722d90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f9441722d90]
70      0x7f9441722e40 __libc_start_main + 128
71      0x5584311f3935 _start + 37
[142a2d5a219e:00424] *** Process received signal ***
[142a2d5a219e:00424] Signal: Aborted (6)
[142a2d5a219e:00424] Signal code:  (-6)
[142a2d5a219e:00424] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f944173b520]
[142a2d5a219e:00424] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f944178f9fc]
[142a2d5a219e:00424] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f944173b476]
[142a2d5a219e:00424] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f94417217f3]
[142a2d5a219e:00424] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f942f676b9e]
[142a2d5a219e:00424] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f942f68220c]
[142a2d5a219e:00424] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f942f6811e9]
[142a2d5a219e:00424] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f942f681959]
[142a2d5a219e:00424] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f943af7c884]
[142a2d5a219e:00424] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7f943af7d2dd]
[142a2d5a219e:00424] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7202bc)[0x7f90f978a2bc]
[142a2d5a219e:00424] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7kernels17DecoderXQAImplJIT7prepareERKNS0_9XQAParamsE+0x60)[0x7f90f98e1590]
[142a2d5a219e:00424] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xf9c27)[0x7f9088000c27]
[142a2d5a219e:00424] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x11a466)[0x7f9088021466]
[142a2d5a219e:00424] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10a0974)[0x7f91cdbc8974]
[142a2d5a219e:00424] [15] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1044783)[0x7f91cdb6c783]
[142a2d5a219e:00424] [16] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10460c1)[0x7f91cdb6e0c1]
[142a2d5a219e:00424] [17] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0xa48f0)[0x7f9174aa48f0]
[142a2d5a219e:00424] [18] /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x458f3)[0x7f9174a458f3]
[142a2d5a219e:00424] [19] /usr/bin/python3(+0x15ac9e)[0x558431125c9e]
[142a2d5a219e:00424] [20] /usr/bin/python3(_PyObject_MakeTpCall+0x25b)[0x55843111c3cb]
[142a2d5a219e:00424] [21] /usr/bin/python3(+0x1693eb)[0x5584311343eb]
[142a2d5a219e:00424] [22] /usr/bin/python3(_PyEval_EvalFrameDefault+0x644a)[0x55843111459a]
[142a2d5a219e:00424] [23] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x55843112659c]
[142a2d5a219e:00424] [24] /usr/bin/python3(_PyEval_EvalFrameDefault+0x81e)[0x55843110e96e]
[142a2d5a219e:00424] [25] /usr/bin/python3(+0x16925e)[0x55843113425e]
[142a2d5a219e:00424] [26] /usr/bin/python3(_PyEval_EvalFrameDefault+0x294d)[0x558431110a9d]
[142a2d5a219e:00424] [27] /usr/bin/python3(+0x16925e)[0x55843113425e]
[142a2d5a219e:00424] [28] /usr/bin/python3(_PyEval_EvalFrameDefault+0x294d)[0x558431110a9d]
[142a2d5a219e:00424] [29] /usr/bin/python3(_PyFunction_Vectorcall+0x7c)[0x55843112659c]
[142a2d5a219e:00424] *** End of error message ***
Aborted (core dumped)

Additional notes

N/A

KeitaW added the bug label Jul 31, 2024

github-actions bot commented Aug 31, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions bot added the stale label Aug 31, 2024