System Info
Working inside the nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3 image.

Who can help?
Hello @byshiue,
I am currently trying to write minimal working code that loads a TensorRT-LLM engine and runs text generation inference.
I have followed the example engine build process described here and confirmed that the engine works with the run.py script.
Output:
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 24.07 (build 102761898)
Triton Server Version 2.48.0
Copyright (c) 2018-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.4 driver version 550.54.15 with kernel driver version 535.183.01.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 7 is not available.
[07/31/2024-12:57:04] [TRT-LLM] [I] Load engine takes: 11.176877498626709 sec
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 4 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 5 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 6 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 7 is not available.
Input [Text 0]: "<|begin_of_text|>Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: " painter in Paris and was a pupil of the history painter, Paul Delaroche. He was a member of the Société des Artistes Français and exhibited at the Salon from 1848. He was also a member of the Société des"
Now I'm trying to create a GenerationSession manually and run text generation inference, and I'm encountering an error which eventually crashes the Python process. It is highly likely that I'm making a mistake in the GenerationSession setup, but I would like to confirm whether that is the case.
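For reference, below is roughly the shape of the minimal standalone script I am aiming for. This is only a hedged sketch: it uses the higher-level ModelRunner wrapper rather than a hand-built GenerationSession, and the engine/tokenizer paths are placeholders, not the ones from my actual setup:

```python
# Hedged sketch only, not my actual script: load a prebuilt TensorRT-LLM engine
# through the higher-level ModelRunner wrapper (instead of constructing a
# GenerationSession by hand) and run a single greedy generation.
# The engine/tokenizer paths below are placeholders.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

engine_dir = "/workspace/engines/llama3-8b"      # placeholder
tokenizer_dir = "/workspace/Meta-Llama-3-8B"     # placeholder

tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
runner = ModelRunner.from_dir(engine_dir=engine_dir)

prompt = "Born in north-east France, Soyer trained as a"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.int()

with torch.no_grad():
    # batch_input_ids is a list of 1-D token-id tensors, one entry per request.
    outputs = runner.generate(
        batch_input_ids=[input_ids[0]],
        max_new_tokens=64,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.eos_token_id,
        return_dict=True,
    )

# output_ids is [batch, beams, seq]; use sequence_lengths to trim padding and
# drop the prompt tokens before decoding.
output_ids = outputs["output_ids"][0, 0]
seq_len = outputs["sequence_lengths"][0, 0]
generated = output_ids[input_ids.shape[1]:seq_len]
print(tokenizer.decode(generated, skip_special_tokens=True))
```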
Information
The official example scripts
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Step 1: Build the engine
Follow the steps described in https://github.com/NVIDIA/TensorRT-LLM/tree/v0.11.0/examples/llama#llama-v3-updates inside nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3. In this particular example, I'm testing with Llama3-8B.

Step 2: Run the following command inside the same container.
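Not the reproduction command itself (which is omitted here), but as a purely illustrative aside: the engine directory produced in Step 1 contains a config.json describing the build, and inspecting it is a quick sanity check before running generation. The directory path and field names below are assumptions about recent TensorRT-LLM engine layouts, not values from my actual setup:

```python
# Illustration only, not the reproduction script: peek at the engine's
# config.json to confirm the build settings before running generation.
# The engine path and the field names are assumptions about the engine layout.
import json
from pathlib import Path

engine_dir = Path("/workspace/engines/llama3-8b")  # placeholder

config = json.loads((engine_dir / "config.json").read_text())
print("architecture:", config.get("pretrained_config", {}).get("architecture"))
print("max_batch_size:", config.get("build_config", {}).get("max_batch_size"))
print("engine files:", sorted(p.name for p in engine_dir.glob("*.engine")))
```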
Expected behavior
The script returns exit code 0.
actual behavior
I got the following error, which eventually crashes the Python process.
additional notes
NA