
Random generation when inferring llama 2 with and without LoRA in the same batch #2096

Closed
ilyaonoff opened this issue Aug 7, 2024 · 3 comments
Labels: bug (Something isn't working)

ilyaonoff commented Aug 7, 2024

System Info

  • CPU architecture: x86_64
  • GPU name: NVIDIA A100 40GB
  • OS: Ubuntu 20.04.6 LTS
  • TensorRT-LLM tag: v0.11.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run everything inside the TensorRT-LLM v0.11.0 container.

  1. Build the Llama-2 engine with LoRA support. The Llama checkpoint is taken from https://huggingface.co/meta-llama/Llama-2-7b-hf
    Using examples/llama/convert_checkpoint.py, run:
python3 convert_checkpoint.py --model_dir /workspace/llm_models/llama-2 \
                              --dtype float16 \
                              --output_dir tllm_checkpoint

Build the engine:

trtllm-build --checkpoint_dir tllm_checkpoint \
             --output_dir ./engine_outputs \
             --gemm_plugin float16 \
             --max_input_len 8182 \
             --max_seq_len 8182 \
             --max_batch_size 8 \
             --max_beam_width 1 \
             --max_num_tokens 65456 \
             --lora_plugin float16 \
             --max_lora_rank 16 \
             --lora_target_modules "attn_q" "attn_k" "attn_v" "attn_dense"
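
After the build, a quick way to confirm that the LoRA settings actually made it into the engine is to inspect config.json in the engine output directory. This is only a rough sketch; the key layout (build_config / plugin_config / lora_config) is an assumption based on the v0.11 engine config format and may differ between releases.

import json

# Rough sanity check (assumed v0.11 config layout): print the LoRA-related build settings.
with open("engine_outputs/config.json") as f:
    cfg = json.load(f)

build_cfg = cfg.get("build_config", {})
print("lora_plugin:", build_cfg.get("plugin_config", {}).get("lora_plugin"))
print("lora_config:", build_cfg.get("lora_config", {}))
# If the flags above took effect, this should report lora_plugin "float16",
# lora_target_modules ["attn_q", "attn_k", "attn_v", "attn_dense"], and max_lora_rank 16.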
  2. Run inference with the LoRA adapter from https://huggingface.co/tloen/alpaca-lora-7b, using the examples/run.py script

This run, where both requests skip LoRA (--lora_task_uids -1 -1), produces the correct result:

python3 run.py \
    --use_py_session \
    --engine_dir llama/engine_outputs \
    --max_output_len 123 \
    --temperature 1 \
    --tokenizer_dir /workspace/llm_models/llama-2/ \
    --input_text \
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n" \
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n" \
    --lora_task_uids -1 -1  \
    --lora_dir /workspace/llm_models/loras/alpaca-lora-7b
Input [Text 0]: "<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
Output [Text 0 Beam 0]: "\n\n### Tip 1:\n\n\n"
Input [Text 1]: "<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
Output [Text 1 Beam 0]: "\n\n### Tip 1:\n\n\n"

However, the next run, which mixes a no-LoRA request (task UID -1) and a LoRA request (task UID 0) in the same batch, produces random output for the no-LoRA request:

python3 run.py \
    --use_py_session \
    --engine_dir llama/engine_outputs \
    --max_output_len 16 \
    --temperature 1 \
    --tokenizer_dir /workspace/llm_models/llama-2/ \
    --input_text \
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n" \
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n" \
    --lora_task_uids -1 0  \
    --lora_dir /workspace/llm_models/loras/alpaca-lora-7b
Input [Text 0]: "<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
Output [Text 0 Beam 0]: "Љraztrightarrowrazrazrazraz←←←ikuikuikurackikurack"
Input [Text 1]: "<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
Output [Text 1 Beam 0]: "\n\n### Tip 1:\n\n\n"

Full output:
run.log
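
For reference, the same mixed batch can also be driven through the Python runtime instead of run.py, which makes it easier to experiment with. The sketch below mirrors how examples/run.py uses ModelRunner in v0.11 (from_dir with lora_dir / lora_ckpt_source, generate with lora_uids); the parameter names are assumptions taken from that script and may differ between versions.

import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Sketch only: mirrors the run.py invocation above; lora_dir, lora_ckpt_source and
# lora_uids are assumed to match the v0.11 ModelRunner API used by examples/run.py.
tokenizer = AutoTokenizer.from_pretrained("/workspace/llm_models/llama-2/")
prompt = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n")

batch_input_ids = [
    torch.tensor(tokenizer.encode(prompt), dtype=torch.int32),
    torch.tensor(tokenizer.encode(prompt), dtype=torch.int32),
]

runner = ModelRunner.from_dir(
    engine_dir="llama/engine_outputs",
    lora_dir=["/workspace/llm_models/loras/alpaca-lora-7b"],
    lora_ckpt_source="hf",
)

output_ids = runner.generate(
    batch_input_ids,
    max_new_tokens=16,
    temperature=1.0,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    lora_uids=["-1", "0"],  # request 0: base model only, request 1: first --lora_dir adapter
)

# Expected: both requests decode to the "### Tip 1:" style answer.
# Observed (this issue): request 0, the one without LoRA, decodes to random tokens.
print(tokenizer.decode(output_ids[0][0]))
print(tokenizer.decode(output_ids[1][0]))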

Expected behavior

The model produces meaningful results when requests with and without LoRA are combined in the same batch.

Actual behavior

When requests with and without LoRA are combined in the same batch, the model produces random output for the request without LoRA.

Additional notes

ilyaonoff added the bug (Something isn't working) label on Aug 7, 2024
ilyaonoff (Author) commented

@byshiue Can somebody help?

yuxianq commented Aug 21, 2024

@ilyaonoff This is a known bug that has been fixed on both the main branch and in v0.12; you can validate the fix with the main branch now or wait for the v0.12 release.

ilyaonoff (Author) commented

@yuxianq Thank you!
