KeyError: 6 when getting nvlink_bandwidth #1467

Open · choyuansu opened this issue Apr 17, 2024 · 1 comment

choyuansu commented Apr 17, 2024

System Info

GPU: NVIDIA RTX A6000

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Run git clone https://github.com/NVIDIA/TensorRT-LLM.git

  2. Create Dockerfile and docker-compose.yaml in TensorRT-LLM/

    Dockerfile
    # Obtain and start the basic docker image environment.
    FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
    
    # Install dependencies, TensorRT-LLM requires Python 3.10
    RUN apt-get update && apt-get -y install \
        python3.10 \
        python3-pip \
        openmpi-bin \
        libopenmpi-dev
    
    # Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
    # If you want to install the stable version (corresponding to the release branch), please
    # remove the `--pre` option.
    RUN --mount=type=cache,target=/root/.cache/pip pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
    
    COPY ./examples/qwen/requirements.txt .
    RUN --mount=type=cache,target=/root/.cache/pip pip3 install -r requirements.txt
    
    WORKDIR /workdir
    
    docker-compose.yaml
    services:
      tensorrt:
        image: tensorrt-llm
        volumes:
          - .:/workdir
          - /mnt/models:/mnt/models
        command:
        - bash
        - -ec
        - |
          cd examples/qwen
          pip install -r requirements.txt
          python3 convert_checkpoint.py --model_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/ \
                    --dtype float32 \
                    --output_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/trt_ckpt/fp32/1-gpu/
          trtllm-build --checkpoint_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/trt_ckpt/fp32/1-gpu/ \
                    --gemm_plugin float32 \
                    --output_dir /mnt/models/Large_Language_Model/Qwen-7B-Chat/trt_engines/fp32/1-gpu/
        deploy:
            resources:
              reservations:
                devices:
                  - driver: nvidia
                    count: 1
                    capabilities: [gpu]
    
  3. Run git clone https://huggingface.co/Qwen/Qwen-7B-Chat in /mnt/models/Large_Language_Model

  4. Run docker compose up

Expected behavior

No error; trtllm-build completes successfully.

Actual behavior

[04/16/2024-22:50:23] [TRT-LLM] [I] NVLink is active: True
[04/16/2024-22:50:23] [TRT-LLM] [I] NVLink version: 6
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 411, in main
    cluster_config = infer_cluster_config()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 523, in infer_cluster_config
    cluster_info=infer_cluster_info(),
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 487, in infer_cluster_info
    nvl_bw = nvlink_bandwidth(nvl_version)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/auto_parallel/cluster_info.py", line 433, in nvlink_bandwidth
    return nvl_bw_table[nvlink_version]
KeyError: 6

Additional notes

Relevant code: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/auto_parallel/cluster_info.py#L427-L433

I can't find any information about NVLink version 6's bandwidth online.
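One possible mitigation until the real value is known: make the lookup fall back instead of raising KeyError for an unknown version. This is only a sketch; the bandwidth numbers below are placeholders, not the actual entries in cluster_info.py.

# Sketch of a defensive lookup; the table values are placeholders,
# not the real numbers from tensorrt_llm/auto_parallel/cluster_info.py.
def nvlink_bandwidth(nvlink_version: int) -> float:
    nvl_bw_table = {
        1: 80.0,   # placeholder GB/s
        2: 150.0,  # placeholder GB/s
        3: 300.0,  # placeholder GB/s
        4: 450.0,  # placeholder GB/s
    }
    if nvlink_version in nvl_bw_table:
        return nvl_bw_table[nvlink_version]
    # Unknown version (e.g. 6 reported on RTX A6000): fall back to the
    # highest known version instead of crashing with KeyError.
    return nvl_bw_table[max(nvl_bw_table)]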

choyuansu added the bug label Apr 17, 2024
byshiue (Collaborator) commented Apr 22, 2024

Thank you for the report. We will fix it in the next update.
