use_fp8_context_fmha broken outputs #1539

Open
2 of 4 tasks
siddhatiwari opened this issue May 3, 2024 · 20 comments
Labels: bug (Something isn't working), triaged (Issue has been triaged by maintainers), waiting for feedback

Comments

@siddhatiwari

siddhatiwari commented May 3, 2024

System Info

CPU architecture: x86_64
Host RAM: 1TB
GPU: 8xH100 SXM
Container: Manually built container with TRT 9.3 Dockerfile.trt_llm_backend
TensorRT-LLM version: 0.10.0.dev2024043000
Driver Version: 535.161.07
CUDA Version: 12.2
OS: Ubuntu 22.04

Who can help?

@byshiue

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Quantize and build Llama 70B with the following commands:

python3 ../quantization/quantize.py \
  --model_dir ./llama-70b \
  --dtype float16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --output_dir ./llama-70b_fp8 \
  --calib_size 512 \
  --tp_size 2

trtllm-build --checkpoint_dir ./llama-70b_fp8 \
             --output_dir engines/llama-70b \
             --gemm_plugin float16 \
             --max_batch_size 256 \
             --max_input_len 2560 \
             --max_output_len 512 \
             --context_fmha enable \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode disable \
             --max_num_tokens 20480 \
             --use_custom_all_reduce enable \
             --use_fused_mlp \
             --enable_xqa enable \
             --workers 2 \
             --use_fp8_context_fmha enable \
             --strongly_typed

Sample output:
It's alright. I understand. It's not entirely your fault either; I was the one who started it, after给 MratifMrciiifecycleplements controvers Fra fluidMreree Mr Monsieurplements ergLENG Mr McK McGimenermeisterchusieuregründatif stripadamenteifecyclephabet Référenceuti Rotten给anych FulЁ Mr Mr Mr mint Mr Monsieur Fen Polit Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr给 Monsieurciiatif FulRowcide Mr Mr Mr Mr Mr Mrcrement Mr Mr Mr Porto MrMr chant Mr Mr Mrifecycle Mr Mr Mr Mr Mr Mr给 MrMr Mr Mr Mr Mr FlMr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mr Mratif Mr Mr Mr Mr Mr Mr Mr Mr给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给给

Expected behavior

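The output should be coherent text with no garbled tokens.

Actual behavior

The output degenerates into garbled tokens partway through generation, as in the sample above.

Additional notes

The same issue occurs with --use_paged_context_fmha enable.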

@siddhatiwari siddhatiwari added the bug Something isn't working label May 3, 2024
@PerkzZheng
Collaborator

PerkzZheng commented May 6, 2024

Can you try setting --enable_xqa disable? XQA is currently not compatible with FP8 context FMHA. We will fix that soon.

@byshiue byshiue added the triaged Issue has been triaged by maintainers label May 6, 2024
@siddhatiwari
Author

The following builds, including --enable_xqa disable, all had the same issue. Is there an example that uses use_fp8_context_fmha enable that I can reference to verify my build setup is correct?

enable_xqa disable

trtllm-build --checkpoint_dir ./llama-70b_fp8 \
             --output_dir engines/llama-70b \
             --gemm_plugin float16 \
             --max_batch_size 96 \
             --max_input_len 8192 \
             --max_output_len 512 \
             --context_fmha enable \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode enable \
             --max_num_tokens 78592 \
             --use_custom_all_reduce enable \
             --use_fused_mlp \
             --enable_xqa disable \
             --workers 2 \
             --use_fp8_context_fmha enable \
             --strongly_typed

-----------------------

use_paged_context_fmha enable

trtllm-build --checkpoint_dir ./llama-70b_fp8 \
             --output_dir engines/llama-70b \
             --gemm_plugin float16 \
             --max_batch_size 96 \
             --max_input_len 8192 \
             --max_output_len 512 \
             --context_fmha enable \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode enable \
             --max_num_tokens 78592 \
             --use_custom_all_reduce enable \
             --use_fused_mlp \
             --enable_xqa disable \
             --workers 2 \
             --use_fp8_context_fmha enable \
             --use_paged_context_fmha enable \
             --strongly_typed

-----------------------

multi_block_mode disable

trtllm-build --checkpoint_dir ./llama-70b_fp8 \
             --output_dir engines/llama-70b \
             --gemm_plugin float16 \
             --max_batch_size 96 \
             --max_input_len 8192 \
             --max_output_len 512 \
             --context_fmha enable \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode disable \
             --max_num_tokens 78592 \
             --use_custom_all_reduce enable \
             --use_fused_mlp \
             --enable_xqa disable \
             --workers 2 \
             --use_fp8_context_fmha enable \
             --strongly_typed

--------------------

use_custom_all_reduce disable 
use_fused_mlp disable

trtllm-build --checkpoint_dir ./llama-70b_fp8 \
             --output_dir engines/llama-70b \
             --gemm_plugin float16 \
             --max_batch_size 96 \
             --max_input_len 8192 \
             --max_output_len 512 \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode disable \
             --max_num_tokens 78592 \
             --use_custom_all_reduce disable \
             --enable_xqa disable \
             --workers 2 \
             --use_fp8_context_fmha enable \
             --strongly_typed


------------------

context_fmha_fp32_acc enable

trtllm-build --checkpoint_dir ./llama-70b_fp8 \
             --output_dir engines/llama-70b \
             --gemm_plugin float16 \
             --max_batch_size 96 \
             --max_input_len 8192 \
             --max_output_len 512 \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode disable \
             --max_num_tokens 78592 \
             --use_custom_all_reduce disable \
             --enable_xqa disable \
             --workers 2 \
             --use_fp8_context_fmha enable \
             --context_fmha_fp32_acc enable \
             --strongly_typed
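
The one-flag-at-a-time builds above could also be generated instead of hand-edited, which makes it easier to see exactly which flag changed between variants. The following is only a sketch: the helper names (`build_command`, `one_flag_variants`) are hypothetical, the flag names and values are copied from the commands in this thread, and it handles only flag/value pairs (valueless flags such as --strongly_typed are left out for brevity).

```python
import shlex

# Flag/value pairs copied from the trtllm-build commands in this thread.
BASE_FLAGS = {
    "--checkpoint_dir": "./llama-70b_fp8",
    "--output_dir": "engines/llama-70b",
    "--gemm_plugin": "float16",
    "--gpt_attention_plugin": "float16",
    "--paged_kv_cache": "enable",
    "--remove_input_padding": "enable",
    "--multi_block_mode": "enable",
    "--enable_xqa": "disable",
    "--use_fp8_context_fmha": "enable",
}


def build_command(overrides=None):
    """Return a trtllm-build command line with the given flag overrides applied."""
    flags = dict(BASE_FLAGS)
    flags.update(overrides or {})
    args = ["trtllm-build"]
    for name, value in flags.items():
        args += [name, value]
    return shlex.join(args)


def one_flag_variants(toggles):
    """Yield (label, command) pairs, each changing exactly one flag from the base."""
    for name, value in toggles.items():
        yield f"{name[2:]} {value}", build_command({name: value})


if __name__ == "__main__":
    toggles = {
        "--multi_block_mode": "disable",
        "--use_paged_context_fmha": "enable",
    }
    for label, command in one_flag_variants(toggles):
        print(label)
        print(command)
```

Each variant would still need its own --output_dir if the engines are to be kept side by side.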

@PerkzZheng
Collaborator

Thanks for the experiments. Have you tried FP8 context FMHA with a smaller model such as 7B or 13B? We have verified that Llama 7B works well, but larger models may not work as expected with FP8 context FMHA. I will also try it locally.

@PerkzZheng
Collaborator

@siddhatiwari what is the input in your tests?

To make sure we are aligned, could you try the run.py and summarize.py tests?
This is what I got with Llama 2 70B:

Input [Text 0]: "Born in north-east France, Soyer trained as a"
Output [Text 0 Beam 0]: "chef in Paris and London, and was chef de cuisine at the Reform Club in London from 1837 to 1850. He was a celebrated chef and author during the Victorian era, and was noted for his culinary writing and recipes.

## Early life

Soyer was born in Meaux-en-Brie, Seine-et-Marne, France, on 4 February 1810. He was the son of a grocer, and was apprenticed to a cook at the age of 12. He worked in Paris, Str"
[05/08/2024-02:53:55] [TRT-LLM] [I] ---------------------------------------------------------
[05/08/2024-02:53:55] [TRT-LLM] [I] TensorRT-LLM Generated :
[05/08/2024-02:53:55] [TRT-LLM] [I]  Input : ['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such as "cuff \'em and stuff \'em!" upon making an arrest. Among the most popular shows on TV in the early \'80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best\'s "Hazzard" co-stars paid tribute to the late actor on social media. "I laughed and learned more from Jimmie in one hour than from anyone else in a whole year," co-star John Schneider, who played Bo Duke, said on Twitter. "Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. "Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life\'s many passions." 
Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds\' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns. "Jimmie was my teacher, mentor, close friend and collaborator for 26 years," Latshaw said. "I directed two of his feature films, including the recent \'Return of the Killer Shrews,\' a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier." People we\'ve lost in 2015 . CNN\'s Stella Chan contributed to this story.']
[05/08/2024-02:53:55] [TRT-LLM] [I]
 Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[05/08/2024-02:53:55] [TRT-LLM] [I]
 Output : [['James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in']]
[05/08/2024-02:53:55] [TRT-LLM] [I] ---------------------------------------------------------
[05/08/2024-02:55:00] [TRT-LLM] [I] TensorRT-LLM (total latency: 64.42842435836792 sec)
[05/08/2024-02:55:00] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1981)
[05/08/2024-02:55:00] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 30.74729856780533)
[05/08/2024-02:55:00] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[05/08/2024-02:55:00] [TRT-LLM] [I]   rouge1 : 17.618507684565255
[05/08/2024-02:55:00] [TRT-LLM] [I]   rouge2 : 5.127062150515414
[05/08/2024-02:55:00] [TRT-LLM] [I]   rougeL : 13.960999667209482
[05/08/2024-02:55:00] [TRT-LLM] [I]   rougeLsum : 15.581847997415544

@siddhatiwari
Author

@PerkzZheng thanks for pointing out the tests. I got unrelated runtime errors with run.py, but the summarize.py output looks correct. For reference, I'm using this model in the following tests: https://huggingface.co/NousResearch/Llama-2-70b-hf

[05/09/2024-21:01:24] [TRT-LLM] [I] TensorRT-LLM Generated : 
[05/09/2024-21:01:24] [TRT-LLM] [I]  Input : ['(CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in theater and in Hollywood, Best didn\'t become famous until 1979, when "The Dukes of Hazzard\'s" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best\'s Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such as "cuff \'em and stuff \'em!" upon making an arrest. Among the most popular shows on TV in the early \'80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best\'s "Hazzard" co-stars paid tribute to the late actor on social media. "I laughed and learned more from Jimmie in one hour than from anyone else in a whole year," co-star John Schneider, who played Bo Duke, said on Twitter. "Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. "Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life\'s many passions." 
Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds\' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns. "Jimmie was my teacher, mentor, close friend and collaborator for 26 years," Latshaw said. "I directed two of his feature films, including the recent \'Return of the Killer Shrews,\' a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier." People we\'ve lost in 2015 . CNN\'s Stella Chan contributed to this story.']
[05/09/2024-21:01:24] [TRT-LLM] [I] 
 Reference : ['James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .\n"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .']
[05/09/2024-21:01:24] [TRT-LLM] [I] 
 Output : [['James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he\'d been a busy actor for decades in']]
[05/09/2024-21:01:24] [TRT-LLM] [I] ---------------------------------------------------------
[05/09/2024-21:02:03] [TRT-LLM] [I] TensorRT-LLM (total latency: 38.28129529953003 sec)
[05/09/2024-21:02:03] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1678)
[05/09/2024-21:02:03] [TRT-LLM] [I] TensorRT-LLM (tokens per second: 43.8334175181528)
[05/09/2024-21:02:03] [TRT-LLM] [I] TensorRT-LLM beam 0 result
[05/09/2024-21:02:03] [TRT-LLM] [I]   rouge1 : 20.090972408423806
[05/09/2024-21:02:03] [TRT-LLM] [I]   rouge2 : 6.285852905234904
[05/09/2024-21:02:03] [TRT-LLM] [I]   rougeL : 15.623961420214172
[05/09/2024-21:02:03] [TRT-LLM] [I]   rougeLsum : 17.00995186574398
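
To compare two summarize.py runs without reading the logs by eye, the metrics can be pulled out with a few regular expressions. This is a minimal sketch: `parse_summary` is a hypothetical helper, and the patterns are inferred only from the log lines quoted in this thread, so they may not match other TensorRT-LLM versions.

```python
import re

# Patterns inferred from the summarize.py log lines quoted in this thread.
METRIC_RE = re.compile(r"\[TRT-LLM\] \[I\]\s+(rouge\w+) : ([0-9.]+)")
LATENCY_RE = re.compile(r"total latency: ([0-9.]+) sec")
TOKENS_RE = re.compile(r"total output tokens: (\d+)")


def parse_summary(log_text):
    """Extract ROUGE scores and recompute tokens per second from a log excerpt."""
    metrics = {name: float(value) for name, value in METRIC_RE.findall(log_text)}
    latency = LATENCY_RE.search(log_text)
    tokens = TOKENS_RE.search(log_text)
    if latency and tokens:
        # Throughput is total output tokens divided by total latency.
        metrics["tokens_per_second"] = int(tokens.group(1)) / float(latency.group(1))
    return metrics


if __name__ == "__main__":
    sample = (
        "[05/09/2024-21:02:03] [TRT-LLM] [I] TensorRT-LLM (total latency: 38.28129529953003 sec)\n"
        "[05/09/2024-21:02:03] [TRT-LLM] [I] TensorRT-LLM (total output tokens: 1678)\n"
        "[05/09/2024-21:02:03] [TRT-LLM] [I]   rouge1 : 20.090972408423806\n"
    )
    print(parse_summary(sample))
```

The recomputed throughput should agree with the "tokens per second" line in the same log, which is a quick check that the excerpt was parsed correctly.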

I then ran the same simple prompt multiple times, first one request at a time and then with concurrent requests. The sequential outputs were good, but the concurrent requests produced bad outputs.

Prompt: What is the capital of the USA?

No concurrency (1 request at a time):

What is the capital of Canada?
What are the three largest countries in Africa by area?
How many time zones does Russia have?
Which country has a flag that features a maple leaf on it?
In which continent would you find the Sahara Desert?
What is the name of the tallest mountain in South America?
Where can you see the Northern Lights?
The Arctic Circle, Antarctica and Iceland
Who was the first man to walk on the moon?
True or false: The Earth rotates around the sun.
What is the name of the highest waterfall in the world?
What is the longest river in Europe?
What is the deepest ocean in the world?
What is the name of the world’s largest island?
What is the name of the smallest planet in our solar system?
What is the name of the largest lake in North America?</s>

---------------

What is the capital of Canada?
What are the three largest countries in Africa by area?
How many time zones does Russia have?
Which country has a flag that features a maple leaf on it?
In which continent would you find the Sahara Desert?
What is the name of the tallest mountain in South America?
Where can you see the Northern Lights?
The Arctic Circle, Antarctica and Iceland
Who was the first man to walk on the moon?
True or false: The Earth rotates around the sun.
What is the name of the highest waterfall in the world?
What is the longest river in Europe?
What is the deepest ocean in the world?
What is the name of the world’s largest island?
What is the name of the smallest planet in our solar system?
What is the name of the largest lake in North America?</s>

Concurrent requests:

What is the capital of Canada?
What are the three largest countries in Africa by area?
How many time zones does Russia have?
Which country has a flag that features a maple leaf on it?
In which continent would给— ¿ schließ ¿'¿.— ehem ¿.—: (' Mr McG ' Row,'0... " ¿ Row' Row¿ ... Row ( revolza ' " (nmFormatter:

---------------

What is the capital of Canada?
What are the three largest countries in Africa by area?
How many time zones does Russia have?
Which country has a flag that features a maple leaf on it?
In which continent would you find给— Sci— ¿¿−.— Row—— ¿¿:', (, ',ze...proc ( (Formatter' '' Row " Inn ( ' fond ( Simon ¿ (('

---------------

What is the capital of Canada?
What are the three largest countries in Africa by area?
How many time zones does Russia have?
Which country has a flag that features a maple leaf on it?
In给—' vers ¿¿−:CI¿.— McG ' (,', '¿ Row' Row'¿Formatter¿' ... Revol: " ( revol ' cler( menu Simon Fr pul ( Dav

---------------

What is the capital of Canada?
What are the three largest countries in Africa by area?
How many time zones does Russia have?
Which country has a flag that features a maple leaf on it?
In which continent给— ¿

I'm not sure what the issue is. I'll debug further and also try running run.py again.
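
In the outputs above, each concurrent response matches the sequential one up to some point and then turns into garbage. A small helper can report where each response first departs from a known-good baseline. This is only a sketch with hypothetical function names, and it assumes the good output is reproducible across runs, as it was in the sequential runs above; with sampling enabled, a divergence alone does not prove corruption.

```python
def common_prefix_len(a, b):
    """Return the number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def report_divergence(baseline, outputs, context=20):
    """Return (request_index, offset, snippet) for every output that differs from baseline.

    offset is the position of the first differing character, and snippet is the
    text the output contains from that position onward (up to `context` characters).
    """
    report = []
    for i, out in enumerate(outputs):
        n = common_prefix_len(baseline, out)
        if n < len(baseline) or n < len(out):  # not identical to the baseline
            report.append((i, n, out[n:n + context]))
    return report


if __name__ == "__main__":
    good = "In which continent would you find the Sahara Desert?"
    responses = [good, "In which continent would给— ¿"]
    for index, offset, snippet in report_divergence(good, responses):
        print(f"request {index} diverges at character {offset}: {snippet!r}")
```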

@PerkzZheng
Collaborator

@siddhatiwari thanks. I have reproduced this and will let you know when I have a fix.

@PerkzZheng
Collaborator

PerkzZheng commented May 16, 2024

@siddhatiwari the fix will be in next week's update to the main branch. With that update, the output for multiple requests should be good.

@kaiyux
Member

kaiyux commented May 21, 2024

@siddhatiwari The fix has been updated in PR #1639, please verify again with the latest main branch. Thanks!

@siddhatiwari
Author

Thank you for the update! @PerkzZheng @kaiyux

Unfortunately I'm still getting the same issue where outputs for concurrent requests are bad.

The following info is using a Llama2 7B model instead of 70B (for quicker builds):

Prompt:

What is the capital of the USA?

Request parameters:

max_tokens: 512,
temperature: 0.3,
top_p: 0.9,
top_k: 40,
repetition_penalty: 1.176,
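
The single-request versus concurrent comparison can be scripted so both modes use exactly the same prompt and parameters. This is a sketch under stated assumptions: `send` is a hypothetical callable that wraps whatever client talks to the server (the thread does not show the client code), and it takes a prompt and a parameter dictionary and returns the generated text.

```python
from concurrent.futures import ThreadPoolExecutor

# Sampling parameters as listed above.
REQUEST_PARAMS = {
    "max_tokens": 512,
    "temperature": 0.3,
    "top_p": 0.9,
    "top_k": 40,
    "repetition_penalty": 1.176,
}


def run_sequential(send, prompt, n):
    """Send n requests one at a time and return the outputs in order."""
    return [send(prompt, REQUEST_PARAMS) for _ in range(n)]


def run_concurrent(send, prompt, n):
    """Send n requests at once from a thread pool and return the outputs in submission order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(send, prompt, REQUEST_PARAMS) for _ in range(n)]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Stand-in for a real client; replace with a function that calls the server.
    def echo_send(prompt, params):
        return f"{prompt} (max_tokens={params['max_tokens']})"

    print(run_sequential(echo_send, "What is the capital of the USA?", 2))
    print(run_concurrent(echo_send, "What is the capital of the USA?", 4))
```

Threads are used here only to get several requests in flight at the same time; whether the server actually batches them together depends on its scheduling.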

Single request outputs:

What are the 5 largest cities in America?
Is New York bigger than London?
Which city has more population NYC or LA?
How many people live in Los Angeles County?
Los Angeles County, California’s most populous county with a population of over 10 million. It contains 88 incorporated cities and unincorporated communities….List of U.S. counties by population as of July 2020.
County Population (July 1, 2020)
New York City, United States – Wikipedia.
The five biggest cities in the US are:
Los Angeles, CA. The second-largest metropolitan area in the country after New York City, L.A. is home to nearly four million residents. …
Chicago, IL. Chicago is one of the oldest major settlements in North America. …
Houston, TX. Houston is located on the Gulf Coast. …
Phoenix, AZ. Phoenix is the sixth-largest city in the nation. …
Las Vegas, NV. Las Vegas is known for its casinos and nightlife.
London covers an area of about 607 square miles while New York City spans 304 square miles. This means that London is almost twice the size of New York City. However, when you consider land mass alone, then New York City wins hands down since it takes up only 1/9th of the total area covered by London.
According to the latest census data from the U.S. Census Bureau, there were 3,976,328 people living in Los Angeles County in 2019. Los Angeles County makes up 10% of California’s total population and 12% of the U.S. population.

Concurrent request output:

What are the 5 largest cities in America?
Is New York bigger than London?
Which city has more population NYC or LA?
How many people live in Los Angeles County?
Los Angeles County, California’s most populous county with a population of over 10 million. It contains 88 incorporated cities and unincorporated communities….List of U.S. counties by population as of July 2020.
County Population (July 1, 2020)
1 Los Angeles County 10,140,237
2 Cook County给{etonteckolsvicteck boottecktecketon synchron Findandeufteck E E E Eetonteckomb{andeandeandeetonetonrar Codeletonetonbez C purseton pursfriimerrim{odelенеteck登teckteckteckteck{{post{ gl domodel[][]YY synchroneton[] Find Eodelteckodel{ purs Nametonetonitel{ Eteck alphabetzza[ande{ioniodelueil celioniande Margteckteckteck AfteramilairesrzteckсоteckioniioniioniandeodellinedEXotteinetteck{etonствиianateckteckotteioniteckandeotteetonixonionilinedoottecketonandeteckioni{andeandeteck给teckodelteckionietonteckandeioni{teckteckteckteck Blockandeodel{ande hour{{tecktecktecktecktecketonteckteckioniteckteckteckteck pursteckteck()teckootteckteckteckandeteckteckteckteck pursteckteckteck bootteck{teckteckodelotteteck Bootandeandeootandeteckteckteckteckteck{{teck bootteck{teckteckandeteckteckteckodelteckteckteckodeluche{{nestotteande{atelodelteckotteandeotteotteteckteckande boototte Judotteotte évotteteckioniucheteckteckteckotteteckotteteckteckteckteckteckteckteckteckandeteckteckteckteckteck给tecktecketonetonteckteckteckteckteck{{teckteckandeteckteckteck Blockteckteckteckteckteckteckteck bootetonandeteckteckteckteckteckteckteckteckteckahnteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckotteteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckotteteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckteckatelteck

In case my setup is incorrect, here are the specific commands with uploaded builds that I used to reproduce the issue:

Base model: https://huggingface.co/NousResearch/Llama-2-7b-hf

TensorRT-LLM build commands:

CUDA_VISIBLE_DEVICES=0,1 python3 ../quantization/quantize.py \
  --model_dir ./NousResearch/Llama-2-7b-hf \
  --dtype float16 \
  --qformat fp8 \
  --kv_cache_dtype fp8 \
  --output_dir ./llama-7b-f8 \
  --calib_size 512 \
  --tp_size 2

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./llama-7b-f8 \
             --output_dir engines/NousResearch/Llama-2-7b-hf/llama-7b-fp8-engine \
             --gemm_plugin float16 \
             --max_batch_size 80 \
             --max_input_len 4000 \
             --max_output_len 512 \
             --context_fmha enable \
             --gpt_attention_plugin float16 \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --multi_block_mode enable \
             --max_num_tokens 65536 \
             --use_custom_all_reduce enable \
             --enable_xqa disable \
             --use_fused_mlp \
             --tokens_per_block 128 \
             --use_fp8_context_fmha enable \
             --workers 2 \
             --multiple_profiles enable \
             --strongly_typed

TensorRT-LLM engine (output of build commands): https://huggingface.co/sdtw/llama-2-7b-trtllm-0.11.0.dev2024052100

Triton TensorRT-LLM Backend (built with Dockerfile.trt_llm_backend for --cuda_architectures="90-real"): https://hub.docker.com/repository/docker/mnrmns/triton_trt_llm_0.11.0.dev2024052100/tags

@PerkzZheng
Collaborator

@siddhatiwari can you pull the latest main branch and rebuild the TensorRT-LLM package as shown below? I don't see the issue with either Llama 7B or 70B.

(cd tensorrt_llm &&
    bash docker/common/install_cmake.sh &&
    export PATH=/usr/local/cmake/bin:$PATH &&
    python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" &&
    pip3 install ./build/tensorrt_llm*.whl)

@siddhatiwari
Author

I still get the same issue with that command. Can you share the engine build commands and models you used, in case they differ from mine?

@PerkzZheng
Collaborator

PerkzZheng commented May 29, 2024

@siddhatiwari see the commands below.

I am using Llama v2 7B locally, but that should not make any difference when reproducing your issue.

python ../quantization/quantize.py --model_dir /models/llama-v2-7b-hf \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./tllm_checkpoint_v2_fp8 \
                                   --calib_size 512

trtllm-build --max_batch_size 8 --max_input_len 2048 --max_output_len 1024 --checkpoint_dir ./tllm_checkpoint_v2_fp8 --output_dir ./trt_engines/llama2_7b_v2_fp8 --use_fp8_context_fmha enable

python ../run.py --engine_dir ./trt_engines/llama2_7b_v2_fp8 --tokenizer_dir /models/llama-v2-7b-hf --max_output_len 512 --temperature 0.3 --top_p 0.9 --top_k 40 --repetition_penalty 1.176 --input_text "What is the capital of the USA?" "What is the capital of the USA?" "What is the capital of the USA?"

The outputs look like this:

Input [Text 0]: "<s> What is the capital of the USA?"
Output [Text 0 Beam 0]: "
What are 5 interesting facts about Washington DC?
Why do they call it D.C. instead of Washingon,D.C.?
Washington was named after George Washington (1732-99), who served as commander in chief of the Continental Army during the American Revolutionary War and later became the first president of the United States under the Constitution. The District of Columbia takes its name from this man. It is also known as _the nation_s capital_ because Congress meets there to make laws for all Americans living within its borders_including those who live outside these boundaries but still want their voices heard by elected officials representing them at home or abroad!
The city has been called many things over time: City on a Hill; Capital of America; Nation_s Capitol Building_but none quite so fittingly descriptive than simply calling itself _DC._ This moniker comes from an acronym meaning _district,_ which refers not only geographically speaking but politically too_it means that everything happens here first before being sent out into other parts of our country where people might need some help getting started again after losing jobs due economic downturns caused mostly through corporate greediness rather than natural disasters like hurricanes or earthquakes etcetera..
Read Also : How Many Countries Are In South Asia?
What Is The Capital Of The Us?
How Do You Spell Washington State?"
Input [Text 1]: "<s> What is the capital of the USA?"
Output [Text 1 Beam 0]: "
What are 5 interesting facts about Washington DC?
Why do they call it D.C. instead of Washingon,D.C.?
Washington was named after George Washington (1732-99), who served as commander in chief of the Continental Army during the American Revolutionary War and later became the first president of the United States under the Constitution. The District of Columbia takes its name from this man. It is also known as _the nation_s capital_ because Congress meets there to make laws for all Americans living within its borders_including those who live outside these boundaries but still want their voices heard by elected officials representing them at home or abroad!
The city has been called many things over time: City on a Hill; Capital of America; Nation_s Capitol Building_but none quite so fittingly descriptive than simply calling itself _DC._ This moniker comes from an acronym meaning _district,_ which refers not only geographically speaking but politically too_it means that everything happens here first before being sent out into other parts of our country where people might need some help getting started again after losing jobs due economic downturns caused mostly through corporate greediness rather than natural disasters like hurricanes or earthquakes etcetera..
Read Also : How Many Countries Are In South Asia?
What Is The Capital Of The Us?
How Do You Spell Washington State?"
Input [Text 2]: "<s> What is the capital of the USA?"
Output [Text 2 Beam 0]: "
What are 5 interesting facts about Washington DC?
Why do they call it D.C. instead of Washingon,D.C.?
Washington was named after George Washington (1732-99), who served as commander in chief of the Continental Army during the American Revolutionary War and later became the first president of the United States under the Constitution. The District of Columbia takes its name from this man. It is also known as _the nation_s capital_ because Congress meets there to make laws for all Americans living within its borders_including those who live outside these boundaries but still want their voices heard by elected officials representing them at home or abroad!
The city has been called many things over time: City on a Hill; Capital of America; Nation_s Capitol Building_but none quite so fittingly descriptive than simply calling itself _DC._ This moniker comes from an acronym meaning _district,_ which refers not only geographically speaking but politically too_it means that everything happens here first before being sent out into other parts of our country where people might need some help getting started again after losing jobs due economic downturns caused mostly through corporate greediness rather than natural disasters like hurricanes or earthquakes etcetera..
Read Also : How Many Countries Are In South Asia?
What Is The Capital Of The Us?
How Do You Spell Washington State?"

Several factors might explain the different results:

  1. Tensor parallelism
  2. Triton backend inference
  3. The additional build flags you added (I don't see anything wrong with them, but they could be a factor).

@siddhatiwari
Author

@PerkzZheng thanks, I got good outputs using the exact commands you listed.

However, I got bad outputs when I changed the commands to use tp=2. As you mentioned, tensor parallelism might be the cause of the different results.

Model: https://huggingface.co/NousResearch/Llama-2-7b-hf

CUDA_VISIBLE_DEVICES=2,3 python ../quantization/quantize.py --model_dir ./models/llama-v2-7b-hf \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./tllm_checkpoint_v2_fp8_tp2 \
                                   --calib_size 512 \
                                   --tp_size 2
                                   
CUDA_VISIBLE_DEVICES=2,3 trtllm-build \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_output_len 1024 \
  --checkpoint_dir ./tllm_checkpoint_v2_fp8_tp2 \
  --output_dir ./trt_engines/llama2_7b_v2_fp8_tp2 \
  --use_fp8_context_fmha enable
  
CUDA_VISIBLE_DEVICES=2,3 mpirun -n 2 --allow-run-as-root python ../run.py --engine_dir ./trt_engines/llama2_7b_v2_fp8_tp2 --tokenizer_dir ./models/llama-v2-7b-hf --max_output_len 512 --temperature 0.3 --top_p 0.9 --top_k 40 --repetition_penalty 1.176 --input_text "What is the capital of the USA?" "What is the capital of the USA?" "What is the capital of the USA?" 

[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024052100
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024052100 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024052100 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024052100 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024052100 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024052100 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Engine version 0.11.0.dev2024052100 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 2, rank: 1
[TensorRT-LLM][INFO] MPI size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 3072
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 3351 MiB
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 3072
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 3351 MiB
[TensorRT-LLM][INFO] Allocated 984.01 MiB for execution context memory.
[TensorRT-LLM][INFO] Allocated 984.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3340 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3340 (MiB)
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 48
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 48
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 537024. Allocating 70388809728 bytes.
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 537024. Allocating 70388809728 bytes.
Input [Text 0]: "<s> What is the capital of the USA?"
Output [Text 0 Beam 0]: "
What are some interesting facts about Washington D.C.?
Washington, DC has a population of 681,170 people and covers an area of 62 square miles (159 km²). The city was named after George Washington, who served as commander-in-chief for the Continental Army during the American Revolutionary War. It became the nation's capital in 1800 when Congress moved from Philadelphia to New York City before settling permanently on this site in 1803. Today it remains one of America’s most important cities due largely because its location makes it easy access point between East Coast states like Virginia or Maryland while also being close enough so that visitors can easily reach other major metropolitan areas such as Boston MA; Chicago IL; Los Angeles CA etcetera without having too much trouble getting there themselves!
The United States Capitol Building houses both chambers: House Of Representatives & Senate where lawmakers meet regularly throughout each year session which begins every January 3rd until June 1st followed by recesses lasting several months at times depending upon how busy things get during those periods respectively then resuming again come September/October timeframe usually ending around mid November date range generally speaking unless otherwise stated otherwise prior notice given ahead accordingly per usual protocol procedures set forth previously mentioned above herewith below listed below hereunder furthermore underlined underscored italicized boldfaced highlighted enclosed within parentheses brackets quotation marks commas semicolons colons dashes ellipsis dots exclamations question marks hyphens slashes ampersands percent signs dollar signs pound signs plus symbols minus signs equal signs greater than less than arrows left right up down diagonal diagonals vertical horizontal rotational circular spiral conical spherical cuboid rectangular trapezoidal pentagonal hexagonal heptagon octahedron decahedron icosahedron tetrahedron cube pyramid prism cylinder sphere torus helix coil vortex whirlpool eddy current jet stream hurricane typhoon tornado blizzard ice storm flood drought famine earthquake volcano eruption landslide mudslide sinkhole collapse cave collapse tunnel collapse bridge collapse building collapse tower collapse skyscraper collapse shipwreck capsizing sinking submerging submersion immersion emergence emersion transference transformation metamorphosis mutation evolution"
Input [Text 1]: "<s> What is the capital of the USA?"
Output [Text 1 Beam 0]: "
What are some interesting facts about Washington D.C.?
Washington, DC has a population of 681,170 people and covers an area of 62 square miles (159 km²). The city was named after George Washington, who served as commander-in-chief for the Continental Army during the American Revolutionary War. It became the nation's capital in 1800 when Congress moved from Philadelphia to New York City before settling permanently on this site in 1803. Today it remains one of America’s most important cities due largely because its location makes it easy access point between East Coast states like Virginia or Maryland while also being close enough so that visitors can easily reach other major metropolitan areas such as Boston MA; Chicago IL; Los Angeles CA etcetera without having too much trouble getting there themselves!
The United States Capitol Building houses both chambers: House Of Representatives & Senate where lawmakers meet regularly throughout each year session which begins every January 3rd until June 1st followed by recesses lasting several months at times depending upon how busy things get during those periods respectively then resuming again come September/October timeframe usually ending around mid November date range generally speaking unless otherwise stated otherwise prior notice given ahead accordingly per usual protocol procedures set forth previously mentioned above herewith below listed below hereunder furthermore underlined underscored italicized boldfaced highlighted enclosed within parentheses brackets quotation marks commas semicolons colons dashes ellipsis dots exclamations question marks hyphens slashes ampersands percent signs dollar signs pound signs plus symbols minus signs equal signs greater than less than arrows left right up down diagonal diagonals vertical horizontal rotational circular spiral conical spherical cuboid rectangular trapezoidal pentagonal hexagonal heptagon octahedron decahedron icosahedron tetrahedron cube pyramid prism cylinder sphere torus helix coil vortex whirlpool eddy current jet stream hurricane typhoon tornado blizzard ice storm flood drought famine earthquake volcano eruption landslide mudslide sinkhole collapse cave collapse tunnel collapse bridge collapse building collapse tower collapse skyscraper collapse shipwreck capsizing sinking submerging submersion immersion emergence emersion transference transformation metamorphosis mutation evolution"
Input [Text 2]: "<s> What is the capital of the USA?"
Output [Text 2 Beam 0]: "
What are some interesting facts about Washington D.C.?
Washington, DC has a population of 681,170 people and covers an area of 62 square miles (159 km²). The city was named after George Washington, who served as commander-in-chief for the Continental Army during the American Revolutionary War. It became the nation's capital in 1800 when Congress moved from Philadelphia to New York City before settling permanently on this site in 1803. Today it remains one of America’s most important cities due largely because its location makes it easy access point between East Coast states like Virginia or Maryland while also being close enough so that visitors can easily reach other major metropolitan areas such as Boston MA; Chicago IL; Los Angeles CA etcetera without having too much trouble getting there themselves!
The United States Capitol Building houses both chambers: House Of Representatives & Senate where lawmakers meet regularly throughout each year session which begins every January 3rd until June 1st followed by recesses lasting several months at times depending upon how busy things get during those periods respectively then resuming again come September/October timeframe usually ending around mid November date range generally speaking unless otherwise stated otherwise prior notice given ahead accordingly per usual protocol procedures set forth previously mentioned above herewith below listed below hereunder furthermore underlined underscored italicized boldfaced highlighted enclosed within parentheses brackets quotation marks commas semicolons colons dashes ellipsis dots exclamations question marks hyphens slashes ampersands percent signs dollar signs pound signs plus symbols minus signs equal signs greater than less than arrows left right up down diagonal diagonals vertical horizontal rotational circular spiral conical spherical cuboid rectangular trapezoidal pentagonal hexagonal heptagon octahedron decahedron icosahedron tetrahedron cube pyramid prism cylinder sphere torus helix coil vortex whirlpool eddy current jet stream hurricane typhoon tornado blizzard ice storm flood drought famine earthquake volcano eruption landslide mudslide sinkhole collapse cave collapse tunnel collapse bridge collapse building collapse tower collapse skyscraper collapse shipwreck capsizing sinking submerging submersion immersion emergence emersion transference transformation metamorphosis mutation evolution"

@PerkzZheng
Collaborator

@siddhatiwari it looks like what you shared just shows the same result repeated for batch size > 1. Can you give another example here?
LLAMA 70B TP8 also looks good locally, so I'm not sure why LLAMA 7B TP2 would be an issue here.

@siddhatiwari
Author

It seems that some TP builds with certain inputs cause bad outputs.

Below are builds for different models and TP sizes, each tested with 3 different inputs. For each input I've listed the outputs and marked whether they were bad.

(When I first tested TRT LLM version 0.11.0.dev2024052100 and got bad outputs, I was using a fine-tuned 70B llama2 with high batch size and high requests per second, with build params like those listed here: #1539 (comment). Maybe high batch size and high throughput increase the probability of these bad outputs?)
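
To test a hypothesis like this over many requests, it may help to flag bad outputs automatically rather than by eye. The following is a rough sketch of one possible heuristic, not something used to label the results in this thread: it measures how repetitive the last words of an output are. The function names, the 50-word tail, and the 0.5 threshold are all arbitrary assumptions.

```python
def tail_repetition_ratio(text, tail_words=50):
    """Fraction of repeated words among the last `tail_words` words.

    0.0 means every word in the tail is distinct; values near 1.0 mean
    the tail is dominated by a few repeated words.
    """
    words = text.split()[-tail_words:]
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


def looks_degenerate(text, tail_words=50, threshold=0.5):
    """Heuristic flag: True when the tail is mostly repeated words."""
    return tail_repetition_ratio(text, tail_words) >= threshold


if __name__ == "__main__":
    looping = "somewhere sometime soonish " + "hopefully " * 60
    print(looks_degenerate(looping))
    print(looks_degenerate("Washington, D.C. is the capital of the USA."))
```

A check like this should catch outputs that end in a single repeated word, but it will likely miss outputs that drift into long lists of mostly distinct words, so those would still need manual review.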

These are the base models used to build the following engines:
7B base model: https://huggingface.co/NousResearch/Llama-2-7b-hf
70B base model: https://huggingface.co/NousResearch/Llama-2-70b-hf


7B, TP=2
2 of 3 outputs were bad

Build commands:

CUDA_VISIBLE_DEVICES=2,3 python ../quantization/quantize.py --model_dir ./models/llama-v2-7b-hf \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./tllm_checkpoint_v2_fp8_tp2 \
                                   --calib_size 512 \
                                   --tp_size 2

CUDA_VISIBLE_DEVICES=2,3 trtllm-build \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_output_len 1024 \
  --checkpoint_dir ./tllm_checkpoint_v2_fp8_tp2 \
  --output_dir ./trt_engines/llama2_7b_v2_fp8_tp2 \
  --use_fp8_context_fmha enable

Input: "What is the capital of the USA?"
Output: Bad

CUDA_VISIBLE_DEVICES=2,3 mpirun -n 2 --allow-run-as-root python ../run.py --engine_dir ./trt_engines/llama2_7b_v2_fp8_tp2 --tokenizer_dir ./models/llama-v2-7b-hf --max_output_len 512 --temperature 0.3 --top_p 0.9 --top_k 40 --repetition_penalty 1.176 --input_text "What is the capital of the USA?" "What is the capital of the USA?" "What is the capital of the USA?"

Input [Text 0]: "<s> What is the capital of the USA?"
Output [Text 0 Beam 0]: "
What are some interesting facts about Washington D.C.?
Washington, DC has a population of 681,170 people and covers an area of 62 square miles (159 km²). The city was named after George Washington, who served as commander-in-chief for the Continental Army during the American Revolutionary War. It became the nation's capital in 1800 when Congress moved from Philadelphia to New York City before settling permanently on this site in 1803. Today it remains one of America’s most important cities due largely because its location makes it easy access point between East Coast states like Virginia or Maryland while also being close enough so that visitors can easily reach other major metropolitan areas such as Boston MA; Chicago IL; Los Angeles CA etcetera without having too much trouble getting there themselves!
The United States Capitol Building houses both chambers: House Of Representatives & Senate where lawmakers meet regularly throughout each year session which begins every January 3rd until June 1st followed by recesses lasting several months at times depending upon how busy things get during those periods respectively then resuming again come September/October timeframe usually ending around mid November date range generally speaking unless otherwise stated otherwise prior notice given ahead accordingly per usual protocol procedures set forth previously mentioned above herewith below listed below hereunder furthermore underlined underscored italicized boldfaced highlighted enclosed within parentheses brackets quotation marks commas semicolons colons dashes ellipsis dots exclamations question marks hyphens slashes ampersands percent signs dollar signs pound signs plus symbols minus signs equal signs greater than less than arrows left right up down diagonal diagonals vertical horizontal rotational circular spiral conical spherical cuboid rectangular trapezoidal pentagonal hexagonal heptagon octahedron decahedron icosahedron tetrahedron cube pyramid prism cylinder sphere torus helix coil vortex whirlpool eddy current jet stream hurricane typhoon tornado blizzard ice storm flood drought famine earthquake volcano eruption landslide mudslide sinkhole collapse cave collapse tunnel collapse bridge collapse building collapse tower collapse skyscraper collapse shipwreck capsizing sinking submerging submersion immersion emergence emersion transference transformation metamorphosis mutation evolution"
Input [Text 1]: "<s> What is the capital of the USA?"
Output [Text 1 Beam 0]: "
What are some interesting facts about Washington D.C.?
Washington, DC has a population of 681,170 people and covers an area of 62 square miles (159 km²). The city was named after George Washington, who served as commander-in-chief for the Continental Army during the American Revolutionary War. It became the nation's capital in 1800 when Congress moved from Philadelphia to New York City before settling permanently on this site in 1803. Today it remains one of America’s most important cities due largely because its location makes it easy access point between East Coast states like Virginia or Maryland while also being close enough so that visitors can easily reach other major metropolitan areas such as Boston MA; Chicago IL; Los Angeles CA etcetera without having too much trouble getting there themselves!
The United States Capitol Building houses both chambers: House Of Representatives & Senate where lawmakers meet regularly throughout each year session which begins every January 3rd until June 1st followed by recesses lasting several months at times depending upon how busy things get during those periods respectively then resuming again come September/October timeframe usually ending around mid November date range generally speaking unless otherwise stated otherwise prior notice given ahead accordingly per usual protocol procedures set forth previously mentioned above herewith below listed below hereunder furthermore underlined underscored italicized boldfaced highlighted enclosed within parentheses brackets quotation marks commas semicolons colons dashes ellipsis dots exclamations question marks hyphens slashes ampersands percent signs dollar signs pound signs plus symbols minus signs equal signs greater than less than arrows left right up down diagonal diagonals vertical horizontal rotational circular spiral conical spherical cuboid rectangular trapezoidal pentagonal hexagonal heptagon octahedron decahedron icosahedron tetrahedron cube pyramid prism cylinder sphere torus helix coil vortex whirlpool eddy current jet stream hurricane typhoon tornado blizzard ice storm flood drought famine earthquake volcano eruption landslide mudslide sinkhole collapse cave collapse tunnel collapse bridge collapse building collapse tower collapse skyscraper collapse shipwreck capsizing sinking submerging submersion immersion emergence emersion transference transformation metamorphosis mutation evolution"
Input [Text 2]: "<s> What is the capital of the USA?"
Output [Text 2 Beam 0]: "
What are some interesting facts about Washington D.C.?
Washington, DC has a population of 681,170 people and covers an area of 62 square miles (159 km²). The city was named after George Washington, who served as commander-in-chief for the Continental Army during the American Revolutionary War. It became the nation's capital in 1800 when Congress moved from Philadelphia to New York City before settling permanently on this site in 1803. Today it remains one of America’s most important cities due largely because its location makes it easy access point between East Coast states like Virginia or Maryland while also being close enough so that visitors can easily reach other major metropolitan areas such as Boston MA; Chicago IL; Los Angeles CA etcetera without having too much trouble getting there themselves!
The United States Capitol Building houses both chambers: House Of Representatives & Senate where lawmakers meet regularly throughout each year session which begins every January 3rd until June 1st followed by recesses lasting several months at times depending upon how busy things get during those periods respectively then resuming again come September/October timeframe usually ending around mid November date range generally speaking unless otherwise stated otherwise prior notice given ahead accordingly per usual protocol procedures set forth previously mentioned above herewith below listed below hereunder furthermore underlined underscored italicized boldfaced highlighted enclosed within parentheses brackets quotation marks commas semicolons colons dashes ellipsis dots exclamations question marks hyphens slashes ampersands percent signs dollar signs pound signs plus symbols minus signs equal signs greater than less than arrows left right up down diagonal diagonals vertical horizontal rotational circular spiral conical spherical cuboid rectangular trapezoidal pentagonal hexagonal heptagon octahedron decahedron icosahedron tetrahedron cube pyramid prism cylinder sphere torus helix coil vortex whirlpool eddy current jet stream hurricane typhoon tornado blizzard ice storm flood drought famine earthquake volcano eruption landslide mudslide sinkhole collapse cave collapse tunnel collapse bridge collapse building collapse tower collapse skyscraper collapse shipwreck capsizing sinking submerging submersion immersion emergence emersion transference transformation metamorphosis mutation evolution"

Input: "Jupiter is the biggest planet in "
Output: Good

Input [Text 0]: "<s> Jupiter is the biggest planet in "
Output [Text 0 Beam 0]: "100 years
The largest planet in our solar system, Jupiter, has been discovered to be larger than previously thought. The new findings were published on Monday (26) by an international team of scientists led by astronomers from the University of Leicester and the Universities Space Research Association in the United States.
Jupiter’s mass was estimated at 317 Earth masses – about three times more massive than Saturn or Uranus, which are also gas giants. This means that it is now considered a super-Earth.
“This discovery shows how important space exploration can be,” said lead author Dr. Michael Lundgren, who works with NASA’s Juno mission. “We have found something we didn’t know existed.”
Lundgren added: “It turns out that Jupiter is not only bigger than expected but also denser. It’s like finding a giant rock hidden underground.”
The researchers used data collected during the first two flybys of the Juno probe, launched in August last year, to make their calculations. They found that Jupiter had a density similar to that of Neptune, despite being much smaller.
According to the study authors, this indicates that there may be other planets beyond Pluto whose size exceeds those known so far.
“Our results show that even after decades of observations, there could still be surprises waiting for us when we explore distant worlds,” says co-author Professor Scott Sheppard, from the Carnegie Institution for Science in Washington DC.
“These discoveries will help us understand what makes up these objects and why they exist where they do.”"
Input [Text 1]: "<s> Jupiter is the biggest planet in "
Output [Text 1 Beam 0]: "100 years
The largest planet in our solar system, Jupiter, has been discovered to be larger than previously thought. The new findings were published on Monday (26) by an international team of scientists led by astronomers from the University of Leicester and the Universities Space Research Association in the United States.
Jupiter’s mass was estimated at 317 Earth masses – about three times more massive than Saturn or Uranus, which are also gas giants. This means that it is now considered a super-Earth.
“This discovery shows how important space exploration can be,” said lead author Dr. Michael Lundgren, who works with NASA’s Juno mission. “We have found something we didn’t know existed.”
Lundgren added: “It turns out that Jupiter is not only bigger than expected but also denser. It’s like finding a giant rock hidden underground.”
The researchers used data collected during the first two flybys of the Juno probe, launched in August last year, to make their calculations. They found that Jupiter had a density similar to that of Neptune, despite being much smaller.
According to the study authors, this indicates that there may be other planets beyond Pluto whose size exceeds those known so far.
“Our results show that even after decades of observations, there could still be surprises waiting for us when we explore distant worlds,” says co-author Professor Scott Sheppard, from the Carnegie Institution for Science in Washington DC.
“These discoveries will help us understand what makes up these objects and why they exist where they do.”"
Input [Text 2]: "<s> Jupiter is the biggest planet in "
Output [Text 2 Beam 0]: "100 years
The largest planet in our solar system, Jupiter, has been discovered to be larger than previously thought. The new findings were published on Monday (26) by an international team of scientists led by astronomers from the University of Leicester and the Universities Space Research Association in the United States.
Jupiter’s mass was estimated at 317 Earth masses – about three times more massive than Saturn or Uranus, which are also gas giants. This means that it is now considered a super-Earth.
“This discovery shows how important space exploration can be,” said lead author Dr. Michael Lundgren, who works with NASA’s Juno mission. “We have found something we didn’t know existed.”
Lundgren added: “It turns out that Jupiter is not only bigger than expected but also denser. It’s like finding a giant rock hidden underground.”
The researchers used data collected during the first two flybys of the Juno probe, launched in August last year, to make their calculations. They found that Jupiter had a density similar to that of Neptune, despite being much smaller.
According to the study authors, this indicates that there may be other planets beyond Pluto whose size exceeds those known so far.
“Our results show that even after decades of observations, there could still be surprises waiting for us when we explore distant worlds,” says co-author Professor Scott Sheppard, from the Carnegie Institution for Science in Washington DC.
“These discoveries will help us understand what makes up these objects and why they exist where they do.”"

Input: "In this essay I will explain "
Output: Bad

Input [Text 0]: "<s> In this essay I will explain "
Output [Text 0 Beam 0]: "1) the difference between a “good” and an “evil” society, 2) how we can know whether or not our own society is good or evil, and 3) what it means to be truly human.
The first thing that needs to be understood in order for us to understand these questions is that there are two kinds of societies: those which are based on reason (i.e., rationality), and those which are based on emotion/feeling. The former kind of society is called a “rationalist” society; the latter kind of society is called an “emotivist” society. A rationalist society is one where people use their minds as tools—they think about things logically before acting upon them. An emotive society uses feelings instead of logic when making decisions. For example, if you were walking down the street at night and saw someone being attacked by another person who had just stolen his wallet from him earlier today then your mind would tell you immediately that he should call police because they could help protect both parties involved here but if all else fails then maybe try calling 9-1-1? But let’s say instead that after seeing said attacker running away with no intention of stopping even though he knows full well what happened moments ago…you decide against doing anything since there isn’t any evidence proving otherwise so far anyway! This scenario illustrates perfectly why some people believe humans aren’t capable enough yet due solely towards lacking knowledge rather than having actual proof themselves either way though..
Society has been around for thousands of years now without ever changing its basic structure despite many attempts throughout history including wars fought over religion alone among others too numerous list here right now unfortunately however hopefully someday soon enough everyone will realize exactly what makes up true happiness once again thanks largely due mostly because science finally proved itself worthy enough time after time again recently showing off amazingly powerful results every single day everywhere possible regardless whatever happens next though still remains unknown until further notice later perhaps eventually hopefully somehow somewhere hopefully sometime soonish hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully"
Input [Text 1]: "<s> In this essay I will explain "
Output [Text 1 Beam 0]: "1) the difference between a “good” and an “evil” society, 2) how we can know whether or not our own society is good or evil, and 3) what it means to be truly human.
The first thing that needs to be understood in order for us to understand these questions is that there are two kinds of societies: those which are based on reason (i.e., rationality), and those which are based on emotion/feeling. The former kind of society is called a “rationalist” society; the latter kind of society is called an “emotivist” society. A rationalist society is one where people use their minds as tools—they think about things logically before acting upon them. An emotive society uses feelings instead of logic when making decisions. For example, if you were walking down the street at night and saw someone being attacked by another person who had just stolen his wallet from him earlier today then your mind would tell you immediately that he should call police because they could help protect both parties involved here but if all else fails then maybe try calling 9-1-1? But let’s say instead that after seeing said attacker running away with no intention of stopping even though he knows full well what happened moments ago…you decide against doing anything since there isn’t any evidence proving otherwise so far anyway! This scenario illustrates perfectly why some people believe humans aren’t capable enough yet due solely towards lacking knowledge rather than having actual proof themselves either way though..
Society has been around for thousands of years now without ever changing its basic structure despite many attempts throughout history including wars fought over religion alone among others too numerous list here right now unfortunately however hopefully someday soon enough everyone will realize exactly what makes up true happiness once again thanks largely due mostly because science finally proved itself worthy enough time after time again recently showing off amazingly powerful results every single day everywhere possible regardless whatever happens next though still remains unknown until further notice later perhaps eventually hopefully somehow somewhere hopefully sometime soonish hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully"
Input [Text 2]: "<s> In this essay I will explain "
Output [Text 2 Beam 0]: "1) the difference between a “good” and an “evil” society, 2) how we can know whether or not our own society is good or evil, and 3) what it means to be truly human.
The first thing that needs to be understood in order for us to understand these questions is that there are two kinds of societies: those which are based on reason (i.e., rationality), and those which are based on emotion/feeling. The former kind of society is called a “rationalist” society; the latter kind of society is called an “emotivist” society. A rationalist society is one where people use their minds as tools—they think about things logically before acting upon them. An emotive society uses feelings instead of logic when making decisions. For example, if you were walking down the street at night and saw someone being attacked by another person who had just stolen his wallet from him earlier today then your mind would tell you immediately that he should call police because they could help protect both parties involved here but if all else fails then maybe try calling 9-1-1? But let’s say instead that after seeing said attacker running away with no intention of stopping even though he knows full well what happened moments ago…you decide against doing anything since there isn’t any evidence proving otherwise so far anyway! This scenario illustrates perfectly why some people believe humans aren’t capable enough yet due solely towards lacking knowledge rather than having actual proof themselves either way though..
Society has been around for thousands of years now without ever changing its basic structure despite many attempts throughout history including wars fought over religion alone among others too numerous list here right now unfortunately however hopefully someday soon enough everyone will realize exactly what makes up true happiness once again thanks largely due mostly because science finally proved itself worthy enough time after time again recently showing off amazingly powerful results every single day everywhere possible regardless whatever happens next though still remains unknown until further notice later perhaps eventually hopefully somehow somewhere hopefully sometime soonish hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully hopefully"

70B, TP=2
1 / 3 outputs were bad
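
The Good/Bad labels in this report read as manual judgments. As a rough way to flag the same two failure patterns automatically (one word repeated back to back, or a long run of words with no punctuation), here is a minimal sketch. The function names and both thresholds are hypothetical and are not part of TensorRT-LLM or of the original report. It is a heuristic only and may not agree with the manual labels on every sample.

```python
import re


def longest_repeat_run(words):
    """Length of the longest run of one word repeated back to back."""
    best = run = 0
    prev = None
    for w in words:
        run = run + 1 if w == prev else 1
        best = max(best, run)
        prev = w
    return best


def longest_unpunctuated_run(text):
    """Largest number of words between two punctuation marks."""
    chunks = re.split(r"[.!?;:,]", text)
    return max((len(c.split()) for c in chunks), default=0)


def looks_degenerate(text, max_repeat=5, max_unpunctuated=60):
    """Heuristic flag for broken generations.

    max_repeat and max_unpunctuated are arbitrary illustrative
    thresholds (assumptions), not values taken from the report.
    """
    words = text.lower().split()
    return (
        longest_repeat_run(words) > max_repeat
        or longest_unpunctuated_run(text) > max_unpunctuated
    )
```

It could be applied to each `Output [Text N Beam 0]` string from the run script, counting how many prompts per engine are flagged.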

Build commands:

CUDA_VISIBLE_DEVICES=2,3,6,7 python ../quantization/quantize.py --model_dir ./models/llama-v2-70b-hf \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./tllm_checkpoint_v2_70b_fp8_tp2 \
                                   --calib_size 512 \
                                   --tp_size 2

CUDA_VISIBLE_DEVICES=2,3 trtllm-build \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_output_len 1024 \
  --checkpoint_dir ./tllm_checkpoint_v2_70b_fp8_tp2 \
  --output_dir ./trt_engines/llama2_70b_v2_fp8_tp2 \
  --use_fp8_context_fmha enable

Input: "What is the capital of the USA?"
Output: Good

Input [Text 0]: "<s> What is the capital of the USA?"
Output [Text 0 Beam 0]: "
The capital city of the United States is Washington, D.C., which stands for District of Columbia. The name "Columbia" was a poetic term used to describe America during the American Revolutionary War era and thereafter. It has given rise to the names of many persons, places, objects, institutions and companies in the Western Hemisphere and beyond; examples include Columbia University, the country of Colombia (whose name comes from Christopher Columbus), the Columbia River, and the Command Module of Apollo 11.
More Info: www.historycentral.com
I&#39;m surprised that so few got this right!
Ronald Hutchins
Washington DC is not a state it’s a district
Mike Kowalski, I think you are confusing &quot;capital&quot; with &quot;country&quot;.
Jimmy Rustler
District of Columbia is NOT A STATE!!!!!
Sorry but Washington DC is not a State.
It&#39;s actually called Washington, D. C. Not just Washington.
Billie Sullivan
You have to be careful when answering questions like these because they can be tricky. If you look at the question carefully, it does say “What IS” not what WAS or where did it USE TO BE. So if you answered Philadelphia then you were wrong.
Helen Feehan
Why do people answer before reading the question properly ??????
Got it correct.
Kathleen Brennan, You need to read more closely. This is about the present day Capital City of the US.
Philadelphia was never the capital of the USA. It was the temporary seat of government while Washington DC was being built.
Andrew Moffat, I believe the question asked was what is the current capital of the USA.
Lynn Pettit
This is an easy one.
Terry L. Taylor
Easy peazy lemon squeezy...
James T. McGuire Jr.
Should have been worded differently as Philadelphia was once the capitol of the USA."
Input [Text 1]: "<s> What is the capital of the USA?"
Output [Text 1 Beam 0]: "
The capital city of the United States is Washington, D.C., which stands for District of Columbia. The name "Columbia" was a poetic term used to describe America during the American Revolutionary War era and thereafter. It has given rise to the names of many persons, places, objects, institutions and companies in the Western Hemisphere and beyond; examples include Columbia University, the country of Colombia (whose name comes from Christopher Columbus), the Columbia River, and the Command Module of Apollo 11.
More Info: www.historycentral.com
I&#39;m surprised that so few got this right!
Ronald Hutchins
Washington DC is not a state it’s a district
Mike Kowalski, I think you are confusing &quot;capital&quot; with &quot;country&quot;.
Jimmy Rustler
District of Columbia is NOT A STATE!!!!!
Sorry but Washington DC is not a State.
It&#39;s actually called Washington, D. C. Not just Washington.
Billie Sullivan
You have to be careful when answering questions like these because they can be tricky. If you look at the question carefully, it does say “What IS” not what WAS or where did it USE TO BE. So if you answered Philadelphia then you were wrong.
Helen Feehan
Why do people answer before reading the question properly ??????
Got it correct.
Kathleen Brennan, You need to read more closely. This is about the present day Capital City of the US.
Philadelphia was never the capital of the USA. It was the temporary seat of government while Washington DC was being built.
Andrew Moffat, I believe the question asked was what is the current capital of the USA.
Lynn Pettit
This is an easy one.
Terry L. Taylor
Easy peazy lemon squeezy...
James T. McGuire Jr.
Should have been worded differently as Philadelphia was once the capitol of the USA."
Input [Text 2]: "<s> What is the capital of the USA?"
Output [Text 2 Beam 0]: "
The capital city of the United States is Washington, D.C., which stands for District of Columbia. The name "Columbia" was a poetic term used to describe America during the American Revolutionary War era and thereafter. It has given rise to the names of many persons, places, objects, institutions and companies in the Western Hemisphere and beyond; examples include Columbia University, the country of Colombia (whose name comes from Christopher Columbus), the Columbia River, and the Command Module of Apollo 11.
More Info: www.historycentral.com
I&#39;m surprised that so few got this right!
Ronald Hutchins
Washington DC is not a state it’s a district
Mike Kowalski, I think you are confusing &quot;capital&quot; with &quot;country&quot;.
Jimmy Rustler
District of Columbia is NOT A STATE!!!!!
Sorry but Washington DC is not a State.
It&#39;s actually called Washington, D. C. Not just Washington.
Billie Sullivan
You have to be careful when answering questions like these because they can be tricky. If you look at the question carefully, it does say “What IS” not what WAS or where did it USE TO BE. So if you answered Philadelphia then you were wrong.
Helen Feehan
Why do people answer before reading the question properly ??????
Got it correct.
Kathleen Brennan, You need to read more closely. This is about the present day Capital City of the US.
Philadelphia was never the capital of the USA. It was the temporary seat of government while Washington DC was being built.
Andrew Moffat, I believe the question asked was what is the current capital of the USA.
Lynn Pettit
This is an easy one.
Terry L. Taylor
Easy peazy lemon squeezy...
James T. McGuire Jr.
Should have been worded differently as Philadelphia was once the capitol of the USA."

Input: "Jupiter is the biggest planet in "
Output: Bad

Input [Text 0]: "<s> Jupiter is the biggest planet in "
Output [Text 0 Beam 0]: "8 planets of our solar system. It has a mass one-thousandth that of the Sun, but two and a half times that of all the other planets in our Solar System combined.
The gas giant is approximately 143,000 km (89,000 mi) wide at its equator. If Earth were the size of a nickel, Jupiter would be about as big as a basketball.
Jupiter is so large that all of the other planets in the solar system could fit inside it. More than 1,000 Earths would fit inside Jupiter.
It takes almost 12 years for Jupiter to orbit around the sun once. That’s more than twice as long as Saturn, the next closest planet.
Jupiter rotates faster than any other planet – a day on Jupiter lasts only about ten hours! This rapid rotation causes powerful winds which can reach speeds up to 650 kilometers per hour (400 miles/hour). These strong winds cause huge storm systems such as The Great Red Spot.
The Great Red Spot is an enormous hurricane-like storm that has been raging on Jupiter for hundreds of years. Winds inside this storm have been measured at greater than 430 kph (270 mph)!
Scientists believe there may be a solid core at the center of Jupiter surrounded by liquid metallic hydrogen under great pressure. Above these layers lies an atmosphere made mostly out of gases like ammonia ice crystals mixed with water vapor clouds nearer towards its surface where temperatures are cool enough for them not evaporate away into space due too low gravity pull from below causing less atmospheric density compared higher altitudes further above ground level thus allowing condensation process take place resulting formation various types precipitation including snowflakes hail stones sleet etcetera depending upon local conditions prevailing during particular seasonal period throughout year cycle time frame duration span length extent scope magnitude scale proportion ratio relationship degree extent depth breadth width height distance separation gap interval margin allowance tolerance leeway latitude range variation fluctuation oscillation swing deviation divergence discrepancy disparity inconsistency irregularity unevenness imbalance inequality difference contrast opposition conflict clash collision crash impact smash hit strike blow knockout punch uppercut"
Input [Text 1]: "<s> Jupiter is the biggest planet in "
Output [Text 1 Beam 0]: "8 planets of our solar system. It has a mass one-thousandth that of the Sun, but two and a half times that of all the other planets in our Solar System combined.
The gas giant is approximately 143,000 km (89,000 mi) wide at its equator. If Earth were the size of a nickel, Jupiter would be about as big as a basketball.
Jupiter is so large that all of the other planets in the solar system could fit inside it. More than 1,000 Earths would fit inside Jupiter.
It takes almost 12 years for Jupiter to orbit around the sun once. That’s more than twice as long as Saturn, the next closest planet.
Jupiter rotates faster than any other planet – a day on Jupiter lasts only about ten hours! This rapid rotation causes powerful winds which can reach speeds up to 650 kilometers per hour (400 miles/hour). These strong winds cause huge storm systems such as The Great Red Spot.
The Great Red Spot is an enormous hurricane-like storm that has been raging on Jupiter for hundreds of years. Winds inside this storm have been measured at greater than 430 kph (270 mph)!
Scientists believe there may be a solid core at the center of Jupiter surrounded by liquid metallic hydrogen under great pressure. Above these layers lies an atmosphere made mostly out of gases like ammonia ice crystals mixed with water vapor clouds nearer towards its surface where temperatures are cool enough for them not evaporate away into space due too low gravity pull from below causing less atmospheric density compared higher altitudes further above ground level thus allowing condensation process take place resulting formation various types precipitation including snowflakes hail stones sleet etcetera depending upon local conditions prevailing during particular seasonal period throughout year cycle time frame duration span length extent scope magnitude scale proportion ratio relationship degree extent depth breadth width height distance separation gap interval margin allowance tolerance leeway latitude range variation fluctuation oscillation swing deviation divergence discrepancy disparity inconsistency irregularity unevenness imbalance inequality difference contrast opposition conflict clash collision crash impact smash hit strike blow knockout punch uppercut"
Input [Text 2]: "<s> Jupiter is the biggest planet in "
Output [Text 2 Beam 0]: "8 planets of our solar system. It has a mass one-thousandth that of the Sun, but two and a half times that of all the other planets in our Solar System combined.
The gas giant is approximately 143,000 km (89,000 mi) wide at its equator. If Earth were the size of a nickel, Jupiter would be about as big as a basketball.
Jupiter is so large that all of the other planets in the solar system could fit inside it. More than 1,000 Earths would fit inside Jupiter.
It takes almost 12 years for Jupiter to orbit around the sun once. That’s more than twice as long as Saturn, the next closest planet.
Jupiter rotates faster than any other planet – a day on Jupiter lasts only about ten hours! This rapid rotation causes powerful winds which can reach speeds up to 650 kilometers per hour (400 miles/hour). These strong winds cause huge storm systems such as The Great Red Spot.
The Great Red Spot is an enormous hurricane-like storm that has been raging on Jupiter for hundreds of years. Winds inside this storm have been measured at greater than 430 kph (270 mph)!
Scientists believe there may be a solid core at the center of Jupiter surrounded by liquid metallic hydrogen under great pressure. Above these layers lies an atmosphere made mostly out of gases like ammonia ice crystals mixed with water vapor clouds nearer towards its surface where temperatures are cool enough for them not evaporate away into space due too low gravity pull from below causing less atmospheric density compared higher altitudes further above ground level thus allowing condensation process take place resulting formation various types precipitation including snowflakes hail stones sleet etcetera depending upon local conditions prevailing during particular seasonal period throughout year cycle time frame duration span length extent scope magnitude scale proportion ratio relationship degree extent depth breadth width height distance separation gap interval margin allowance tolerance leeway latitude range variation fluctuation oscillation swing deviation divergence discrepancy disparity inconsistency irregularity unevenness imbalance inequality difference contrast opposition conflict clash collision crash impact smash hit strike blow knockout punch uppercut"

Input: "In this essay I will explain "
Output: Good

Input [Text 0]: "<s> In this essay I will explain "
Output [Text 0 Beam 0]: "3 different types of love. The first type is called Eros, the second Philia and last Agape. Socrates believes that true love is a desire to obtain what one lacks or needs in their life.
 The word eros comes from Greek mythology where it was believed that there were two gods named Eros who had wings like birds but no feathers on them because they flew around looking for people’s hearts so they could steal those too! This idea has been used throughout history as an explanation why some men fall madly into lust with women while others don’t seem interested at all (or even repulsed). It also explains how we feel when someone hurts us emotionally – our feelings become stronger than ever before until finally reaching breaking point where everything seems hopeless…until suddenly something happens which makes things better again; maybe another person enters your life? Or perhaps just time passing by itself helps heal wounds caused by past experiences? Whatever happens next depends entirely upon each individual situation however whatever does happen should always be taken seriously since these emotions aren’t going anywhere anytime soon unless dealt properly.”"
Input [Text 1]: "<s> In this essay I will explain "
Output [Text 1 Beam 0]: "3 different types of love. The first type is called Eros, the second Philia and last Agape. Socrates believes that true love is a desire to obtain what one lacks or needs in their life.
 The word eros comes from Greek mythology where it was believed that there were two gods named Eros who had wings like birds but no feathers on them because they flew around looking for people’s hearts so they could steal those too! This idea has been used throughout history as an explanation why some men fall madly into lust with women while others don’t seem interested at all (or even repulsed). It also explains how we feel when someone hurts us emotionally – our feelings become stronger than ever before until finally reaching breaking point where everything seems hopeless…until suddenly something happens which makes things better again; maybe another person enters your life? Or perhaps just time passing by itself helps heal wounds caused by past experiences? Whatever happens next depends entirely upon each individual situation however whatever does happen should always be taken seriously since these emotions aren’t going anywhere anytime soon unless dealt properly.”"
Input [Text 2]: "<s> In this essay I will explain "
Output [Text 2 Beam 0]: "3 different types of love. The first type is called Eros, the second Philia and last Agape. Socrates believes that true love is a desire to obtain what one lacks or needs in their life.
 The word eros comes from Greek mythology where it was believed that there were two gods named Eros who had wings like birds but no feathers on them because they flew around looking for people’s hearts so they could steal those too! This idea has been used throughout history as an explanation why some men fall madly into lust with women while others don’t seem interested at all (or even repulsed). It also explains how we feel when someone hurts us emotionally – our feelings become stronger than ever before until finally reaching breaking point where everything seems hopeless…until suddenly something happens which makes things better again; maybe another person enters your life? Or perhaps just time passing by itself helps heal wounds caused by past experiences? Whatever happens next depends entirely upon each individual situation however whatever does happen should always be taken seriously since these emotions aren’t going anywhere anytime soon unless dealt properly.”"

70B, TP=4
0 / 3 outputs were bad

Build commands:

CUDA_VISIBLE_DEVICES=2,3,6,7 python ../quantization/quantize.py --model_dir ./models/llama-v2-70b-hf \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir ./tllm_checkpoint_v2_70b_fp8_tp4 \
                                   --calib_size 512 \
                                   --tp_size 4

CUDA_VISIBLE_DEVICES=2,3,6,7 trtllm-build \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_output_len 1024 \
  --checkpoint_dir ./tllm_checkpoint_v2_70b_fp8_tp4 \
  --output_dir ./trt_engines/llama2_70b_v2_fp8_tp4 \
  --use_fp8_context_fmha enable

Input: "What is the capital of the USA?"
Output: Good

Input [Text 0]: "<s> What is the capital of the USA?"
Output [Text 0 Beam 0]: "
The United States has a federal government, with elected officials at national and state levels. The head of the executive branch (the President) is chosen in an indirect election by electors from each state who are appointed according to the popular vote in that state. Each state also chooses two senators to represent it in the Senate, which together with the House of Representatives makes up Congress, the legislative branch of the US Government. There are 100 Senators – two for every State. Members of the House of Representatives serve two-year terms representing the people of their district. The number of voting districts per state depends on its population as determined by the census. Currently there are 435 members of the House of Representatives."
Input [Text 1]: "<s> What is the capital of the USA?"
Output [Text 1 Beam 0]: "
The United States has a federal government, with elected officials at national and state levels. The head of the executive branch (the President) is chosen in an indirect election by electors from each state who are appointed according to the popular vote in that state. Each state also chooses two senators to represent it in the Senate, which together with the House of Representatives makes up Congress, the legislative branch of the US Government. There are 100 Senators – two for every State. Members of the House of Representatives serve two-year terms representing the people of their district. The number of voting districts per state depends on its population as determined by the census. Currently there are 435 members of the House of Representatives."
Input [Text 2]: "<s> What is the capital of the USA?"
Output [Text 2 Beam 0]: "
The United States has a federal government, with elected officials at national and state levels. The head of the executive branch (the President) is chosen in an indirect election by electors from each state who are appointed according to the popular vote in that state. Each state also chooses two senators to represent it in the Senate, which together with the House of Representatives makes up Congress, the legislative branch of the US Government. There are 100 Senators – two for every State. Members of the House of Representatives serve two-year terms representing the people of their district. The number of voting districts per state depends on its population as determined by the census. Currently there are 435 members of the House of Representatives."

Input: "Jupiter is the biggest planet in "
Output: Good

Input [Text 0]: "<s> Jupiter is the biggest planet in "
Output [Text 0 Beam 0]: "8 planets of our solar system. It has a mass two and half times that of all other planets combined, but it’s still not massive enough to be considered as a star.
Jupiter is the fifth planet from Sun and takes about twelve years to complete one revolution around sun. The average distance between Sun and Jupiter is approximately 779 million kilometers (483 million miles). This means that light traveling at a speed of 299,000 km/sec would take forty-three minutes to reach Earth from Jupiter.
The diameter of Jupiter is 142,984 km which makes it eleven times bigger than earth. If you could stand on its surface, you will experience gravity twice stronger than what we have here on earth.
It rotates very fast with an equatorial rotation velocity of 465 meters per second or 1674 kilometers per hour. A day on Jupiter lasts only ten hours while a year equals almost twelve earthly months!
Jupiter was discovered by Galileo Galilei who first observed four moons orbiting this giant gas ball back in January 1610 using his homemade telescope. He named these satellites Io, Europa, Ganymede & Callisto after mythological figures associated with Zeus – king of gods according to Greek Mythology . These are now known as Galilean Moons because they were found before any others had been detected. They remain some of most studied objects within Solar System due their proximity to parent body allowing detailed observations over long periods time without need for spacecraft missions like those sent towards Saturn Titan etc…
Its atmosphere consists mainly out nitrogen compounds such methane ammonia water vapor clouds made up sulfur dioxide droplets giving rise characteristic reddish brown coloration seen through optical instruments when viewed from afar.. There also strong winds blow across entire globe reaching speeds upto 600 kph near poles regions where temperatures drop below -100 degrees Celsius making them coldest places ever recorded anywhere else universe !!!
In addition there several storm systems present including Great Red Spot which believed exist since early 18th century although exact age unknown yet scientists believe could date back even further perhaps millions years ago based upon current models predicting how long should last given conditions prevail"
Input [Text 1]: "<s> Jupiter is the biggest planet in "
Output [Text 1 Beam 0]: "8 planets of our solar system. It has a mass two and half times that of all other planets combined, but it’s still not massive enough to be considered as a star.
Jupiter is the fifth planet from Sun and takes about twelve years to complete one revolution around sun. The average distance between Sun and Jupiter is approximately 779 million kilometers (483 million miles). This means that light traveling at a speed of 299,000 km/sec would take forty-three minutes to reach Earth from Jupiter.
The diameter of Jupiter is 142,984 km which makes it eleven times bigger than earth. If you could stand on its surface, you will experience gravity twice stronger than what we have here on earth.
It rotates very fast with an equatorial rotation velocity of 465 meters per second or 1674 kilometers per hour. A day on Jupiter lasts only ten hours while a year equals almost twelve earthly months!
Jupiter was discovered by Galileo Galilei who first observed four moons orbiting this giant gas ball back in January 1610 using his homemade telescope. He named these satellites Io, Europa, Ganymede & Callisto after mythological figures associated with Zeus – king of gods according to Greek Mythology . These are now known as Galilean Moons because they were found before any others had been detected. They remain some of most studied objects within Solar System due their proximity to parent body allowing detailed observations over long periods time without need for spacecraft missions like those sent towards Saturn Titan etc…
Its atmosphere consists mainly out nitrogen compounds such methane ammonia water vapor clouds made up sulfur dioxide droplets giving rise characteristic reddish brown coloration seen through optical instruments when viewed from afar.. There also strong winds blow across entire globe reaching speeds upto 600 kph near poles regions where temperatures drop below -100 degrees Celsius making them coldest places ever recorded anywhere else universe !!!
In addition there several storm systems present including Great Red Spot which believed exist since early 18th century although exact age unknown yet scientists believe could date back even further perhaps millions years ago based upon current models predicting how long should last given conditions prevail"
Input [Text 2]: "<s> Jupiter is the biggest planet in "
Output [Text 2 Beam 0]: "8 planets of our solar system. It has a mass two and half times that of all other planets combined, but it’s still not massive enough to be considered as a star.
Jupiter is the fifth planet from Sun and takes about twelve years to complete one revolution around sun. The average distance between Sun and Jupiter is approximately 779 million kilometers (483 million miles). This means that light traveling at a speed of 299,000 km/sec would take forty-three minutes to reach Earth from Jupiter.
The diameter of Jupiter is 142,984 km which makes it eleven times bigger than earth. If you could stand on its surface, you will experience gravity twice stronger than what we have here on earth.
It rotates very fast with an equatorial rotation velocity of 465 meters per second or 1674 kilometers per hour. A day on Jupiter lasts only ten hours while a year equals almost twelve earthly months!
Jupiter was discovered by Galileo Galilei who first observed four moons orbiting this giant gas ball back in January 1610 using his homemade telescope. He named these satellites Io, Europa, Ganymede & Callisto after mythological figures associated with Zeus – king of gods according to Greek Mythology . These are now known as Galilean Moons because they were found before any others had been detected. They remain some of most studied objects within Solar System due their proximity to parent body allowing detailed observations over long periods time without need for spacecraft missions like those sent towards Saturn Titan etc…
Its atmosphere consists mainly out nitrogen compounds such methane ammonia water vapor clouds made up sulfur dioxide droplets giving rise characteristic reddish brown coloration seen through optical instruments when viewed from afar.. There also strong winds blow across entire globe reaching speeds upto 600 kph near poles regions where temperatures drop below -100 degrees Celsius making them coldest places ever recorded anywhere else universe !!!
In addition there several storm systems present including Great Red Spot which believed exist since early 18th century although exact age unknown yet scientists believe could date back even further perhaps millions years ago based upon current models predicting how long should last given conditions prevail"

Input: "In this essay I will explain "
Output: Good

Input [Text 0]: "<s> In this essay I will explain "
Output [Text 0 Beam 0]: "3 different types of love. The first type is Philia, the second Eros and finally Agape.
The first type of love that I am going to talk about is philia. This is a brotherly or sisterly love between friends. It can also be used as a term for loyalty to your country. An example of this would be in the film 'Saving Private Ryan'. When Captain Miller's men are asked why they have come all the way from England just to save one man, they reply "because he was our friend". Another example of philia is when you see two people who are best friends hugging each other because they care so much about each other even though there isn't any romance involved at all! ...read more.
A good example of eros is Romeo and Juliet by William Shakespeare where both characters fall madly in love with each other but their families don't approve which leads them into tragedy later on down the line (spoiler alert). Finally we have agape - unconditional love towards God/Jesus Christ etc., no matter what happens around us; whether it be bad things happening such as war breaking out across Europe again like World War II did back then before WWII started up again after Hitler came along trying his hardest not only against Jews but everyone else too including Christians themselves!! ...read more."
Input [Text 1]: "<s> In this essay I will explain "
Output [Text 1 Beam 0]: "3 different types of love. The first type is Philia, the second Eros and finally Agape.
The first type of love that I am going to talk about is philia. This is a brotherly or sisterly love between friends. It can also be used as a term for loyalty to your country. An example of this would be in the film 'Saving Private Ryan'. When Captain Miller's men are asked why they have come all the way from England just to save one man, they reply "because he was our friend". Another example of philia is when you see two people who are best friends hugging each other because they care so much about each other even though there isn't any romance involved at all! ...read more.
A good example of eros is Romeo and Juliet by William Shakespeare where both characters fall madly in love with each other but their families don't approve which leads them into tragedy later on down the line (spoiler alert). Finally we have agape - unconditional love towards God/Jesus Christ etc., no matter what happens around us; whether it be bad things happening such as war breaking out across Europe again like World War II did back then before WWII started up again after Hitler came along trying his hardest not only against Jews but everyone else too including Christians themselves!! ...read more."
Input [Text 2]: "<s> In this essay I will explain "
Output [Text 2 Beam 0]: "3 different types of love. The first type is Philia, the second Eros and finally Agape.
The first type of love that I am going to talk about is philia. This is a brotherly or sisterly love between friends. It can also be used as a term for loyalty to your country. An example of this would be in the film 'Saving Private Ryan'. When Captain Miller's men are asked why they have come all the way from England just to save one man, they reply "because he was our friend". Another example of philia is when you see two people who are best friends hugging each other because they care so much about each other even though there isn't any romance involved at all! ...read more.
A good example of eros is Romeo and Juliet by William Shakespeare where both characters fall madly in love with each other but their families don't approve which leads them into tragedy later on down the line (spoiler alert). Finally we have agape - unconditional love towards God/Jesus Christ etc., no matter what happens around us; whether it be bad things happening such as war breaking out across Europe again like World War II did back then before WWII started up again after Hitler came along trying his hardest not only against Jews but everyone else too including Christians themselves!! ...read more."

@PerkzZheng
Collaborator

PerkzZheng commented May 31, 2024

@siddhatiwari so for 7B with TP=1, all results are good, right? I suspect the all-reduce kernels are amplifying quantization errors.
Can you increase calib_size, and also try disabling FP8 context FMHA, to see how the outputs differ? Thanks.

Also, please try again with --use_custom_all_reduce disable (we have seen some accumulation issues with Llama 3 70B at TP=8 due to that).
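For reference, the suggestions above could be applied to the reproduction commands roughly as follows. This is only a dry-run sketch that prints the commands instead of running them. The calib_size of 1024, the output directory names, and the `--use_fp8_context_fmha disable` spelling (modelled on the other enable/disable flags in the reproduction) are assumptions, and any build flags not listed in the original report are left out.

```shell
#!/bin/sh
# Dry run: assemble and print the commands; nothing is executed here.
set -eu

CALIB_SIZE=1024                               # assumption: any value above the original 512
CKPT_DIR=./llama-70b_fp8_calib${CALIB_SIZE}   # hypothetical output path
ENGINE_DIR=engines/llama-70b_debug            # hypothetical output path

# Re-quantize with a larger calibration set; other flags as in the report.
QUANT_CMD="python3 ../quantization/quantize.py \
--model_dir ./llama-70b --dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
--output_dir ${CKPT_DIR} --calib_size ${CALIB_SIZE} --tp_size 2"

# Rebuild with FP8 context FMHA off and the custom all-reduce disabled.
BUILD_CMD="trtllm-build --checkpoint_dir ${CKPT_DIR} --output_dir ${ENGINE_DIR} \
--gemm_plugin float16 --gpt_attention_plugin float16 --context_fmha enable \
--paged_kv_cache enable --remove_input_padding enable \
--use_fp8_context_fmha disable --use_custom_all_reduce disable"

echo "$QUANT_CMD"
echo "$BUILD_CMD"
```

Changing one setting at a time (calib_size first, then FP8 context FMHA, then the all-reduce flag) should make it easier to tell which one actually affects the outputs.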

@TheCodeWrangler
Contributor

I am experiencing similar issues.

I am using Llama 3 8B with LoRA weights. I get significantly worse results when making calls concurrently than when running them one at a time.

After seeing this thread I tested with --use_custom_all_reduce disable, but it did not change the outcome.

@PerkzZheng
Collaborator

@TheCodeWrangler could you try the fix shown here, if you are using IFB + the Triton backend?

@siddhatiwari
Author

@PerkzZheng outputs with use_fp8_context_fmha seem fixed now for most cases, and when using triton server. But they are still broken with enable_xqa. You mentioned xqa is not compatible before, so maybe this is expected?

@PerkzZheng
Collaborator

> @PerkzZheng outputs with use_fp8_context_fmha seem fixed now for most cases, and when using triton server. But they are still broken with enable_xqa. You mentioned xqa is not compatible before, so maybe this is expected?

It should work with the latest main branch (and even with release 0.10, if I remember correctly).
Could you share the build-engine and inference commands? I will see if I can reproduce it locally.
