Tried to use 8 workers to convert the model checkpoint in parallel, but hit a CUDA out-of-memory (OOM) error in TensorRT-LLM during the conversion step.
Expected behavior
The checkpoint conversion completes with 8 workers without running out of GPU memory.
Actual behavior
INFO LmiUtils convert_py: Loading checkpoint shards: 100%|██████████| 30/30 [00:42<00:00, 1.22s/it]
INFO LmiUtils convert_py: Loading checkpoint shards: 100%|██████████| 30/30 [00:42<00:00, 1.42s/it]
INFO LmiUtils convert_py: Traceback (most recent call last):
INFO LmiUtils convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 409, in execute
INFO LmiUtils convert_py: future.result()
INFO LmiUtils convert_py: File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
INFO LmiUtils convert_py: return self.__get_result()
INFO LmiUtils convert_py: File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
INFO LmiUtils convert_py: raise self._exception
INFO LmiUtils convert_py: File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
INFO LmiUtils convert_py: result = self.fn(*self.args, **self.kwargs)
INFO LmiUtils convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/build_scripts/llama/convert_checkpoint.py", line 367, in convert_and_save_rank
INFO LmiUtils convert_py: llama = LLaMAForCausalLM.from_hugging_face(
INFO LmiUtils convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 317, in from_hugging_face
INFO LmiUtils convert_py: weights = load_weights_from_hf_model(hf_model, config)
INFO LmiUtils convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1128, in load_weights_from_hf_model
INFO LmiUtils convert_py: convert_layer(l)
INFO LmiUtils convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 1006, in convert_layer
INFO LmiUtils convert_py: mlp_gate_weight = get_weight(model_params, prefix + 'mlp.up_proj',
INFO LmiUtils convert_py: File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py", line 431, in get_weight
INFO LmiUtils convert_py: config[prefix + '.weight'].data = config[prefix + '.weight'].to(dtype)
INFO LmiUtils convert_py: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU
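The traceback shows each worker thread independently casting full weight tensors to the target dtype on the GPU (`config[prefix + '.weight'].to(dtype)`), so peak device memory grows roughly linearly with the worker count. The sketch below is a hypothetical illustration of that failure mode and of one way to cap it: a semaphore bounds how many "transfers" are in flight at once, regardless of pool size. The names, the 448 MiB tensor size (taken from the error message), and the `MAX_CONCURRENT_TRANSFERS` budget are all illustrative assumptions, not TensorRT-LLM APIs; the memory accounting is simulated with plain counters rather than real CUDA allocations.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Assumed budget for this sketch -- not a real TensorRT-LLM flag.
MAX_CONCURRENT_TRANSFERS = 2
gpu_gate = threading.Semaphore(MAX_CONCURRENT_TRANSFERS)

lock = threading.Lock()
current_bytes = 0   # simulated device memory currently in use
peak_bytes = 0      # high-water mark across all workers

def convert_rank(rank: int, tensor_bytes: int = 448 * 1024 * 1024) -> int:
    """Simulate one rank casting a 448 MiB weight tensor on the GPU."""
    global current_bytes, peak_bytes
    with gpu_gate:  # at most MAX_CONCURRENT_TRANSFERS ranks inside at once
        with lock:
            current_bytes += tensor_bytes
            peak_bytes = max(peak_bytes, current_bytes)
        # ... real code would call weight.to(dtype) here ...
        with lock:
            current_bytes -= tensor_bytes  # tensor released after conversion
    return rank

# 8 workers submitted, but the semaphore keeps peak usage bounded.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(convert_rank, range(8)))

print(f"peak simulated device memory: {peak_bytes / 2**20:.0f} MiB")
```

With an unbounded pool, all 8 workers could hold a tensor simultaneously (~3.5 GiB transient in this toy model); the gate caps that at 2 × 448 MiB while still letting the other workers run their CPU-side work concurrently.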
Additional notes
Reducing the number of workers to 1 mitigates the issue, but conversion then becomes very slow.
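Rather than dropping all the way to 1 worker, an intermediate worker count may fit within the available headroom. The helper below is a hypothetical back-of-the-envelope estimator (the function name, the per-worker transient estimate, and the free-memory figure are all illustrative assumptions; actual usage depends on the model and TensorRT-LLM version): it picks the largest worker count whose combined transient allocations fit in the free GPU memory.

```python
def max_safe_workers(free_bytes: int,
                     per_worker_bytes: int,
                     hard_cap: int = 8) -> int:
    """Largest worker count whose combined transient allocations fit in
    free_bytes. Illustrative estimate only; real usage varies by model."""
    if per_worker_bytes <= 0:
        raise ValueError("per_worker_bytes must be positive")
    return max(1, min(hard_cap, free_bytes // per_worker_bytes))

# Example: ~4 GiB free headroom, ~1.5 GiB transient per worker -> 2 workers.
print(max_safe_workers(4 * 2**30, int(1.5 * 2**30)))
```

In practice one could measure free memory (e.g. with `torch.cuda.mem_get_info()`) after the HF model is loaded, estimate the per-worker transient footprint from the largest weight tensor, and pass the result as the worker count instead of a hard-coded 8 or 1.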
System Info
A100 40GB x8, Ubuntu 22.04