KeyError: 6 when getting nvlink_bandwidth #1467

Open · choyuansu opened this issue Apr 17, 2024 · 1 comment

choyuansu commented Apr 17, 2024

System Info

GPU: NVIDIA RTX A6000

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run git clone https://github.com/NVIDIA/TensorRT-LLM.git

  2. Create Dockerfile and docker-compose.yaml in TensorRT-LLM/

    Dockerfile
    # Obtain and start the basic docker image environment.
    FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
    
    # Install dependencies, TensorRT-LLM requires Python 3.10
    RUN apt-get update && apt-get -y install \
        python3.10 \
        python3-pip \
        openmpi-bin \
        libopenmpi-dev
    
    # Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
    # If you want to install the stable version (corresponding to the release branch), please
    # remove the `--pre` option.
    RUN --mount=type=cache,target=/root/.cache/pip pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
    
    COPY ./examples/qwen/requirements.txt .
    RUN --mount=type=cache,target=/root/.cache/pip pip3 install -r requirements.txt
    
    WORKDIR /workdir
    
    docker-compose.yaml
    services:
      tensorrt:
        image: tensorrt-llm
        volumes:
          - .:/workdir
          - /mnt/models:/mnt/models
        command:
        - bash
        - -ec
        - |
          cd examples/qwen
          pip install -r requirements.txt
          python3 convert_checkpoint.py --model_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/ \
                    --dtype float32 \
                    --output_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/trt_ckpt/fp32/1-gpu/
          trtllm-build --checkpoint_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/trt_ckpt/fp32/1-gpu/ \
                    --gemm_plugin float32 \
                    --output_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/trt_engines/fp32/1-gpu/
        deploy:
            resources:
              reservations:
                devices:
                  - driver: nvidia
                    count: 1
                    capabilities: [gpu]
    
  3. Run git clone https://huggingface.co/Qwen/Qwen-7B-Chat in /mnt/models/Large_Language_Model

  4. Run docker compose up

Expected behavior

No error; trtllm-build completes successfully.

Actual behavior

[04/16/2024-22:50:23] [TRT-LLM] [I] NVLink is active: True
[04/16/2024-22:50:23] [TRT-LLM] [I] NVLink version: 6
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 411, in main
    cluster_config = infer_cluster_config()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 523, in infer_cluster_config
    cluster_info=infer_cluster_info(),
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 487, in infer_cluster_info
    nvl_bw = nvlink_bandwidth(nvl_version)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 433, in nvlink_bandwidth
    return nvl_bw_table[nvlink_version]
KeyError: 6

Additional notes

Relevant code: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/auto_parallel/cluster_info.py#L427-L433

I can't find any information about NVLink version 6's bandwidth online.
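One possible mitigation until the real value is known: make the lookup fall back instead of raising KeyError for an unknown version. This is only a sketch; the bandwidth numbers below are placeholders, not the actual entries in cluster_info.py.

# Sketch of a defensive lookup; the table values are placeholders,
# not the real numbers from tensorrt_llm/auto_parallel/cluster_info.py.
def nvlink_bandwidth(nvlink_version: int) -> float:
    nvl_bw_table = {
        1: 80.0,   # placeholder GB/s
        2: 150.0,  # placeholder GB/s
        3: 300.0,  # placeholder GB/s
        4: 450.0,  # placeholder GB/s
    }
    if nvlink_version in nvl_bw_table:
        return nvl_bw_table[nvlink_version]
    # Unknown version (e.g. 6 reported on RTX A6000): fall back to the
    # highest known version instead of crashing with KeyError.
    return nvl_bw_table[max(nvl_bw_table)]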

choyuansu added the bug label Apr 17, 2024
byshiue (Collaborator) commented Apr 22, 2024

Thank you for the report. We will fix it in the next update.
