[Bug] Error when AWQ-quantizing a fine-tuned qwen2 model #1836

Open
2 tasks
qiuxuezhe123 opened this issue Jun 24, 2024 · 8 comments

@qiuxuezhe123

qiuxuezhe123 commented Jun 24, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

Using lmdeploy lite auto_awq to AWQ-quantize a qwen2-7b model after SFT fails with assert torch.isnan(p).sum() == 0.

Reproduction

lmdeploy lite auto_awq \
  qwen2-sft-checkpoint-1506-merged \
  --calib-dataset 'c4' \
  --calib-samples 128 \
  --calib-seqlen 4096 \
  --work-dir qwen2_7b_qg_2_epoch_awq

Environment

lmdeploy==0.4.1

Error traceback

Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 131, in auto_awq
    auto_awq(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 68, in auto_awq
    smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size, device)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 242, in smooth_layers
    smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 118, in smooth_ln_fcs
    assert torch.isnan(p).sum() == 0
AssertionError
@lvhan028
Collaborator

Can you paste the output of running lmdeploy check_env?

@qiuxuezhe123
Author

> Can you paste the output of running lmdeploy check_env?

Below is the full output of running AWQ quantization in the lmdeploy environment:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.84it/s]
Move model.embed_tokens to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.layers.24 to CPU.
Move model.layers.25 to CPU.
Move model.layers.26 to CPU.
Move model.layers.27 to CPU.
Move model.norm to GPU.
Move lm_head to CPU.
Loading calibrate dataset ...
Found cached dataset json (/root/.cache/huggingface/datasets/json/c4-3f6237ecfc2df013/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
Found cached dataset json (/root/.cache/huggingface/datasets/json/c4-11668a7e9b799711/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
model.layers.0, samples: 128, max gpu memory: 6.73 GB
model.layers.1, samples: 128, max gpu memory: 8.48 GB
model.layers.2, samples: 128, max gpu memory: 8.48 GB
model.layers.3, samples: 128, max gpu memory: 8.48 GB
model.layers.4, samples: 128, max gpu memory: 8.48 GB
model.layers.5, samples: 128, max gpu memory: 8.48 GB
model.layers.6, samples: 128, max gpu memory: 8.48 GB
model.layers.7, samples: 128, max gpu memory: 8.48 GB
model.layers.8, samples: 128, max gpu memory: 8.48 GB
model.layers.9, samples: 128, max gpu memory: 8.48 GB
model.layers.10, samples: 128, max gpu memory: 8.48 GB
model.layers.11, samples: 128, max gpu memory: 8.48 GB
model.layers.12, samples: 128, max gpu memory: 8.48 GB
model.layers.13, samples: 128, max gpu memory: 8.48 GB
model.layers.14, samples: 128, max gpu memory: 8.48 GB
model.layers.15, samples: 128, max gpu memory: 8.48 GB
model.layers.16, samples: 128, max gpu memory: 8.48 GB
model.layers.17, samples: 128, max gpu memory: 8.48 GB
model.layers.18, samples: 128, max gpu memory: 8.48 GB
model.layers.19, samples: 128, max gpu memory: 8.48 GB
model.layers.20, samples: 128, max gpu memory: 8.48 GB
model.layers.21, samples: 128, max gpu memory: 8.48 GB
model.layers.22, samples: 128, max gpu memory: 8.48 GB
model.layers.23, samples: 128, max gpu memory: 8.48 GB
model.layers.24, samples: 128, max gpu memory: 8.48 GB
model.layers.25, samples: 128, max gpu memory: 8.48 GB
model.layers.26, samples: 128, max gpu memory: 8.48 GB
model.layers.27, samples: 128, max gpu memory: 8.48 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
model.layers.23 smooth weight done.
model.layers.24 smooth weight done.
model.layers.25 smooth weight done.
model.layers.26 smooth weight done.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/__main__.py", line 5, in <module>
    run()
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/lite.py", line 137, in auto_awq
    auto_awq(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/apis/auto_awq.py", line 124, in auto_awq
    smooth_layers(layers, fc2fcs, norm2fcs, act_scales, w_group_size,
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 259, in smooth_layers
    smooth_ln_fcs(ln, fcs, a_scales[a_name], group_size)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/lite/quantization/awq.py", line 118, in smooth_ln_fcs
    assert torch.isnan(p).sum() == 0
AssertionError
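
For anyone reading a log like the one above: smoothing is reported as done for model.layers.0 through model.layers.26, so the assertion most likely fires while model.layers.27 is being processed. That assumes the "smooth weight done." line is printed only after a layer finishes, which the log suggests but this thread does not confirm. A minimal sketch for pulling that index out of a saved log follows; the function name first_unsmoothed_layer is made up for illustration and is not part of lmdeploy.

```python
import re

# Matches lines such as "Move model.layers.27 to CPU."
_MOVE = re.compile(r"^Move model\.layers\.(\d+) to (?:CPU|GPU)\.$")
# Matches lines such as "model.layers.26 smooth weight done."
_DONE = re.compile(r"^model\.layers\.(\d+) smooth weight done\.$")


def first_unsmoothed_layer(log_text):
    """Return the lowest layer index that was loaded but never reported
    as smoothed, or None if every loaded layer was reported as done.

    Hypothetical helper: it only parses the log format shown in this issue.
    """
    loaded, done = set(), set()
    for raw in log_text.splitlines():
        line = raw.strip()
        match = _MOVE.match(line)
        if match:
            loaded.add(int(match.group(1)))
            continue
        match = _DONE.match(line)
        if match:
            done.add(int(match.group(1)))
    pending = sorted(loaded - done)
    return pending[0] if pending else None
```

Feeding it the full output pasted above should point at layer 27, the last of the 28 decoder layers listed in the log.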

@lvhan028
Collaborator

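Not that one. Please run the command "lmdeploy check_env"; it prints the environment information. We would like to see in which environment this problem can be reproduced.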

@serser

serser commented Jun 24, 2024

Related to #1786, env is listed

@qiuxuezhe123
Author

> Not that one. Please run the command "lmdeploy check_env"; it prints the environment information. We would like to see in which environment this problem can be reproduced.

Running lmdeploy check_env failed. The error message is:
Traceback (most recent call last):
  File "/opt/conda/bin/lmdeploy", line 8, in <module>
    sys.exit(run())
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/conda/lib/python3.8/site-packages/lmdeploy/cli/cli.py", line 192, in check_env
    env_info = collect_env()
  File "/opt/conda/lib/python3.8/site-packages/mmengine/utils/dl_utils/collect_env.py", line 156, in collect_env
    import torchvision
  File "/opt/conda/lib/python3.8/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/opt/conda/lib/python3.8/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
    def meta_nms(dets, scores, iou_threshold):
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_ops.py", line 253, in inner
    custom_op = _find_custom_op(qualname, also_check_torch_library=True)
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_op/impl.py", line 1076, in _find_custom_op
    overload = get_op(qualname)
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_op/impl.py", line 1062, in get_op
    error_not_found()
  File "/opt/conda/lib/python3.8/site-packages/torch/_custom_op/impl.py", line 1052, in error_not_found
    raise ValueError(
ValueError: Could not find the operator torchvision::nms. Please make sure you have already registered the operator and (if registered from C++) loaded it via torch.ops.load_library.
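
Since check_env crashes while importing torchvision, the version information can still be collected without importing anything. A missing torchvision::nms operator is often a sign that the installed torch and torchvision builds do not match, though that is an assumption here and not something confirmed in this thread. The sketch below reads package metadata with the standard library only; installed_versions is a hypothetical helper name, not an lmdeploy command.

```python
from importlib import metadata


def installed_versions(packages=("lmdeploy", "torch", "torchvision")):
    """Map each package name to its installed version string.

    Packages that are not installed map to None. Nothing is imported,
    so a broken torchvision install cannot crash this check.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting the printed lines into the issue gives the maintainers the torch and torchvision versions they asked for, even while check_env itself is failing.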

@lvhan028
Collaborator

> Related to #1786, env is listed

This may be related to the torch version. I also ran into the NaN problem with torch 2.1.0 + cu118, but it works fine with torch 2.1.2 + cu12.

Could you create a CUDA 12 environment and try again?

@AllentDan
Collaborator

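> > Related to #1786, env is listed
>
> This may be related to the torch version. I also ran into the NaN problem with torch 2.1.0 + cu118, but it works fine with torch 2.1.2 + cu12.
>
> Could you create a CUDA 12 environment and try again?

It is indeed related to the torch version. In the same environment on my side, downgrading from torch 2.1.2 + cu118 to torch 2.1.0 + cu118 produces NaN. We may need to update the torch version in the released docker image.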
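
To compare a local install against the combinations mentioned so far, here is a small sketch that splits a torch version string such as "2.1.0+cu118" (the form torch.__version__ normally takes for CUDA wheels) and looks up what was reported in this thread. The table holds only the two releases discussed above and is not a compatibility matrix; both function names are made up for illustration.

```python
# Releases reported in this thread only. 2.1.0 was seen failing with cu118;
# 2.1.2 was reported working with both cu118 and CUDA 12 builds.
REPORTED_IN_THREAD = {
    "2.1.0": "nan",
    "2.1.2": "ok",
}


def split_torch_version(version):
    """Split '2.1.0+cu118' into ('2.1.0', 'cu118').

    The second element is None when there is no '+local' suffix.
    """
    release, _, local = version.partition("+")
    return release, (local or None)


def reported_status(version):
    """Return 'nan', 'ok', or 'unknown' for a torch version string,
    based solely on the reports collected in this issue."""
    release, _ = split_torch_version(version)
    return REPORTED_IN_THREAD.get(release, "unknown")
```

For example, reported_status("2.1.0+cu118") returns "nan", which matches the failing environment described here; any release not mentioned in the thread comes back as "unknown".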

@qiuxuezhe123
Author

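> > Related to #1786, env is listed
>
> This may be related to the torch version. I also ran into the NaN problem with torch 2.1.0 + cu118, but it works fine with torch 2.1.2 + cu12.
>
> Could you create a CUDA 12 environment and try again?

OK, thanks. I will try it in a CUDA 12 environment.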
