[Bug] Error when running AWQ quantization on a fine-tuned qwen2 model #1836

qiuxuezhe123 · 2024-06-24T08:48:48Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.

Describe the bug

Running lmdeploy lite auto_awq to AWQ-quantize a qwen2-7b model after SFT fails with: assert torch.isnan(p).sum() == 0

Reproduction

lmdeploy lite auto_awq \
  qwen2-sft-checkpoint-1506-merged \
  --calib-dataset 'c4' \
  --calib-samples 128 \
  --calib-seqlen 4096 \
  --work-dir qwen2_7b_qg_2_epoch_awq

Environment

lmdeploy==0.4.1

Error traceback

Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 68, in auto_awq
    smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size, device)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 242, in smooth_layers
    smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 118, in smooth_ln_fcs
    assert torch.isnan(p).sum() == 0
AssertionError
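
The assertion at the end of this traceback checks that a tensor contains no NaN values. One way to narrow such a failure down is to scan the merged checkpoint's own tensors for NaN or Inf before quantizing, which shows whether the bad values are already in the fine-tuned weights or only appear during smoothing. The following is a minimal sketch, not something lmdeploy provides; the helper name `find_non_finite` and the tiny demo tensors are assumptions made for illustration.

```python
import torch


def find_non_finite(named_tensors):
    """Return {name: (nan_count, inf_count)} for tensors holding NaN or Inf.

    `named_tensors` is any iterable of (name, tensor) pairs, for example
    the result of `model.named_parameters()`. Hypothetical helper.
    """
    bad = {}
    for name, tensor in named_tensors:
        # Integer tensors cannot hold NaN or Inf, so skip them.
        if not torch.is_floating_point(tensor):
            continue
        nan_count = int(torch.isnan(tensor).sum())
        inf_count = int(torch.isinf(tensor).sum())
        if nan_count or inf_count:
            bad[name] = (nan_count, inf_count)
    return bad


# Tiny stand-in for model.named_parameters(); the names are made up.
demo = [
    ("layers.0.weight", torch.ones(2, 2)),
    ("layers.1.weight", torch.tensor([1.0, float("nan"), float("inf")])),
]
report = find_non_finite(demo)
print(report)
```

With a real checkpoint, the same function could be given `model.named_parameters()` after loading the merged model, for instance through transformers' `AutoModelForCausalLM.from_pretrained`. An empty report would suggest the weights themselves are clean and the NaN is produced later in the pipeline.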

lvhan028 · 2024-06-24T09:43:59Z

Can you paste the output of running lmdeploy check_env?

qiuxuezhe123 · 2024-06-24T11:51:41Z

> Can you paste the output of running lmdeploy check_env?

Below is the full output of running AWQ quantization in the lmdeploy environment:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.84it/s]
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
Found cached dataset json (/root/.cache/huggingface/datasets/json/c4-3f6237ecfc2df013/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
Found cached dataset json (/root/.cache/huggingface/datasets/json/c4-11668a7e9b799711/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
model.layers.0, samples: 128, max gpu memory: 6.73 GB
model.layers.1, samples: 128, max gpu memory: 8.48 GB
model.layers.2, samples: 128, max gpu memory: 8.48 GB
model.layers.3, samples: 128, max gpu memory: 8.48 GB
model.layers.4, samples: 128, max gpu memory: 8.48 GB
model.layers.5, samples: 128, max gpu memory: 8.48 GB
model.layers.6, samples: 128, max gpu memory: 8.48 GB
model.layers.7, samples: 128, max gpu memory: 8.48 GB
model.layers.8, samples: 128, max gpu memory: 8.48 GB
model.layers.9, samples: 128, max gpu memory: 8.48 GB
model.layers.10, samples: 128, max gpu memory: 8.48 GB
model.layers.11, samples: 128, max gpu memory: 8.48 GB
model.layers.12, samples: 128, max gpu memory: 8.48 GB
model.layers.13, samples: 128, max gpu memory: 8.48 GB
model.layers.14, samples: 128, max gpu memory: 8.48 GB
model.layers.15, samples: 128, max gpu memory: 8.48 GB
model.layers.16, samples: 128, max gpu memory: 8.48 GB
model.layers.17, samples: 128, max gpu memory: 8.48 GB
model.layers.18, samples: 128, max gpu memory: 8.48 GB
model.layers.19, samples: 128, max gpu memory: 8.48 GB
model.layers.20, samples: 128, max gpu memory: 8.48 GB
model.layers.21, samples: 128, max gpu memory: 8.48 GB
model.layers.22, samples: 128, max gpu memory: 8.48 GB
model.layers.23, samples: 128, max gpu memory: 8.48 GB
model.layers.24, samples: 128, max gpu memory: 8.48 GB
model.layers.25, samples: 128, max gpu memory: 8.48 GB
model.layers.26, samples: 128, max gpu memory: 8.48 GB
model.layers.27, samples: 128, max gpu memory: 8.48 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
model.layers.23 smooth weight done.
model.layers.24 smooth weight done.
model.layers.25 smooth weight done.
model.layers.26 smooth weight done.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/__main__.py", line 5, in <module>
    run()
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 137, in auto_awq
    auto_awq(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 124, in auto_awq
    smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size,
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 259, in smooth_layers
    smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 118, in smooth_ln_fcs
    assert torch.isnan(p).sum() == 0
AssertionError

lvhan028 · 2024-06-24T12:09:31Z

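Not that. Please run the command "lmdeploy check_env"; it prints the environment information. We would like to see in which environment this problem can be reproduced.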

serser · 2024-06-24T13:56:16Z

Related to #1786, env is listed

qiuxuezhe123 · 2024-06-24T14:01:26Z

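> Not that. Please run the command "lmdeploy check_env"; it prints the environment information. We would like to see in which environment this problem can be reproduced.

Running lmdeploy check_env itself failed. The error message is as follows: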
Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/cli.py", line 192, in check_env
    env_info = collect_env()
  File "/opt/conda/lib/python3.8/site-packages/mmengine/utils/dl_utils/collect_env.py", line 156, in collect_env
    import torchvision
  File "/opt/conda/lib/python3.8/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/opt/conda/lib/python3.8/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_ops.py", line 253, in inner
    custom_op = _find_custom_op(qualname, also_check_torch_library=True)
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_op/impl.py", line 1076, in _find_custom_op
    overload = get_op(qualname)
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_op/impl.py", line 1062, in get_op
    error_not_found()
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_op/impl.py", line 1052, in error_not_found
    raise ValueError(
ValueError: Could not find the operator torchvision::nms. Please make sure you have already registered the operator and (if registered from C++) loaded it via torch.ops.load_library.
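
A failed lookup of `torchvision::nms` during `import torchvision` is often a sign that the installed torchvision build does not match the installed torch. Since `lmdeploy check_env` cannot run in that state, the package versions can still be collected without importing anything. The following is a minimal sketch using only the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption made for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent.

    Reads package metadata only, so it works even when importing the
    package itself would fail. Hypothetical helper.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages that appear in this thread's tracebacks.
versions = {
    name: installed_version(name)
    for name in ("torch", "torchvision", "lmdeploy")
}
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

The printed torch and torchvision versions can then be compared against the compatibility table published by the PyTorch project.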

lvhan028 · 2024-06-24T15:50:59Z

> Related to #1786, env is listed

It may be related to the torch version. I also ran into the NaN problem with torch 2.1.0 + cu118, but everything works with torch 2.1.2 + cu12.

Could you set up a CUDA 12 environment and try it?

AllentDan · 2024-06-25T02:52:01Z

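> Related to #1786, env is listed
>
> It may be related to the torch version. I also ran into the NaN problem with torch 2.1.0 + cu118, but everything works with torch 2.1.2 + cu12.
>
> Could you set up a CUDA 12 environment and try it?

It is indeed related to the torch version. In my otherwise identical environment, downgrading from torch 2.1.2 + cu118 to torch 2.1.0 + cu118 produces NaN. We probably need to update the torch version in the released docker image.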
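
The version observations in this thread can be turned into a quick local check. The following is a minimal sketch that parses a torch version string and flags the one release reported here to produce NaN (2.1.0). The function names are assumptions made for illustration, and the rule encodes only the data points mentioned in this thread, not a complete list of affected builds.

```python
import re


def split_torch_version(version):
    """Split a string such as '2.1.0+cu118' into ((2, 1, 0), 'cu118')."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.groups())
    # Everything after '+' is the local build tag (empty if there is none).
    _, _, local = version.partition("+")
    return release, local


def matches_reported_nan_release(version):
    """True if the release is 2.1.0, the only one reported here to give NaN."""
    release, _ = split_torch_version(version)
    return release == (2, 1, 0)


# Example version strings; in practice, pass torch.__version__.
for candidate in ("2.1.0+cu118", "2.1.2+cu118", "2.1.2+cu121"):
    print(candidate, matches_reported_nan_release(candidate))
```

In a live environment the string to check is `torch.__version__`. A match would suggest upgrading torch before rerunning the quantization.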

qiuxuezhe123 · 2024-06-25T03:31:49Z

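> Related to #1786, env is listed
>
> It may be related to the torch version. I also ran into the NaN problem with torch 2.1.0 + cu118, but everything works with torch 2.1.2 + cu12.
>
> Could you set up a CUDA 12 environment and try it?

OK, thanks. I will try it in a CUDA 12 environment.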

lvhan028 assigned AllentDan Jun 24, 2024
