
Random generation when inferring llama 2 with and without LoRA in the same batch #2096

Closed
ilyaonoff opened this issue Aug 7, 2024 · 3 comments
Labels: bug (Something isn't working)

ilyaonoff commented Aug 7, 2024

System Info

  • CPU architecture: x86_64
  • GPU name: NVIDIA A100 40GB
  • OS: Ubuntu 20.04.6 LTS
  • TensorRT-LLM tag: v0.11.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run everything inside the TensorRT-LLM v0.11.0 container.

  1. Build the Llama-2 engine with LoRA support. The Llama checkpoint is taken from https://huggingface.co/meta-llama/Llama-2-7b-hf
    Using examples/llama/convert_checkpoint.py, run:
python3 convert_checkpoint.py --model_dir /workspace/llm_models/llama-2 \
                              --dtype float16 \
                              --output_dir tllm_checkpoint

Build the engine:

trtllm-build --checkpoint_dir tllm_checkpoint \
             --output_dir ./engine_outputs \
             --gemm_plugin float16 \
             --max_input_len 8182 \
             --max_seq_len 8182 \
             --max_batch_size 8 \
             --max_beam_width 1 \
             --max_num_tokens 65456 \
             --lora_plugin float16 \
             --max_lora_rank 16 \
             --lora_target_modules "attn_q" "attn_k" "attn_v" "attn_dense"
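
After the build, a quick way to confirm that the LoRA settings actually made it into the engine is to inspect config.json in the engine output directory. This is only a rough sketch; the key layout (build_config / plugin_config / lora_config) is an assumption based on the v0.11 engine config format and may differ between releases.

import json

# Rough sanity check (assumed v0.11 config layout): print the LoRA-related build settings.
with open("engine_outputs/config.json") as f:
    cfg = json.load(f)

build_cfg = cfg.get("build_config", {})
print("lora_plugin:", build_cfg.get("plugin_config", {}).get("lora_plugin"))
print("lora_config:", build_cfg.get("lora_config", {}))
# If the flags above took effect, this should report lora_plugin "float16",
# lora_target_modules ["attn_q", "attn_k", "attn_v", "attn_dense"], and max_lora_rank 16.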
  2. Run inference with the LoRA adapter from https://huggingface.co/tloen/alpaca-lora-7b, using the examples/run.py script

This run, where both requests skip LoRA (--lora_task_uids -1 -1), produces the correct result:

python3 run.py \
    --use_py_session \
    --engine_dir llama/engine_outputs \
    --max_output_len 123 \
    --temperature 1 \
    --tokenizer_dir /workspace/llm_models/llama-2/ \
    --input_text \
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n" \
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n" \
    --lora_task_uids -1 -1  \
    --lora_dir /workspace/llm_models/loras/alpaca-lora-7b
Input [Text 0]: "<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
Output [Text 0 Beam 0]: "\n\n### Tip 1:\n\n\n"
Input [Text 1]: "<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
Output [Text 1 Beam 0]: "\n\n### Tip 1:\n\n\n"

However, the next run, which mixes a no-LoRA request (task UID -1) and a LoRA request (task UID 0) in the same batch, produces random output for the no-LoRA request:

python3 run.py \
    --use_py_session \
    --engine_dir llama/engine_outputs \
    --max_output_len 16 \
    --temperature 1 \
    --tokenizer_dir /workspace/llm_models/llama-2/ \
    --input_text \
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n" \
        "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n" \
    --lora_task_uids -1 0  \
    --lora_dir /workspace/llm_models/loras/alpaca-lora-7b
Input [Text 0]: "<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
Output [Text 0 Beam 0]: "Љraztrightarrowrazrazrazraz←←←ikuikuikurackikurack"
Input [Text 1]: "<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
Output [Text 1 Beam 0]: "\n\n### Tip 1:\n\n\n"

Full output:
run.log
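
For reference, the same mixed batch can also be driven through the Python runtime instead of run.py, which makes it easier to experiment with. The sketch below mirrors how examples/run.py uses ModelRunner in v0.11 (from_dir with lora_dir / lora_ckpt_source, generate with lora_uids); the parameter names are assumptions taken from that script and may differ between versions.

import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Sketch only: mirrors the run.py invocation above; lora_dir, lora_ckpt_source and
# lora_uids are assumed to match the v0.11 ModelRunner API used by examples/run.py.
tokenizer = AutoTokenizer.from_pretrained("/workspace/llm_models/llama-2/")
prompt = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n")

batch_input_ids = [
    torch.tensor(tokenizer.encode(prompt), dtype=torch.int32),
    torch.tensor(tokenizer.encode(prompt), dtype=torch.int32),
]

runner = ModelRunner.from_dir(
    engine_dir="llama/engine_outputs",
    lora_dir=["/workspace/llm_models/loras/alpaca-lora-7b"],
    lora_ckpt_source="hf",
)

output_ids = runner.generate(
    batch_input_ids,
    max_new_tokens=16,
    temperature=1.0,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.eos_token_id,
    lora_uids=["-1", "0"],  # request 0: base model only, request 1: first --lora_dir adapter
)

# Expected: both requests decode to the "### Tip 1:" style answer.
# Observed (this issue): request 0, the one without LoRA, decodes to random tokens.
print(tokenizer.decode(output_ids[0][0]))
print(tokenizer.decode(output_ids[1][0]))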

Expected behavior

The model produces meaningful results when requests with and without LoRA are combined in the same batch.

Actual behavior

When requests with and without LoRA are combined in the same batch, the model produces random output for the request without LoRA.

Additional notes

ilyaonoff added the bug (Something isn't working) label on Aug 7, 2024
ilyaonoff (Author) commented

@byshiue Can somebody help?

yuxianq commented Aug 21, 2024

@ilyaonoff This is a known bug that has been fixed on both the main branch and in v0.12; you can validate the fix with the main branch now or wait for the v0.12 release.

ilyaonoff (Author) commented

@yuxianq Thank you!
