Training error #8

Open
wdp-007 opened this issue Apr 4, 2023 · 4 comments

Comments

wdp-007 commented Apr 4, 2023

Hi, I tried training by following the README and ran into the error below. Could you please take a look?
torch: 2.0.0
cuda: V10.1.243

accelerate launch --num_processes 1 --mixed_precision fp16 train.py --config=configs/imagenet64_uvit_mid.py
Traceback (most recent call last):
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/train.py", line 251, in <module>
    app.run(main)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/train.py", line 247, in main
    train(config)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/train.py", line 157, in train
    metrics = train_step(batch)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/train.py", line 83, in train_step
    loss = sde.LSimple(score_model, _batch[0], pred=config.pred, y=_batch[1])
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/sde.py", line 273, in LSimple
    noise_pred = score_model.noise_pred(xt, t, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/sde.py", line 177, in noise_pred
    pred = self.predict(xt, t, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/sde.py", line 174, in predict
    return self.nnet(xt, t * 999, **kwargs)  # follow SDE
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/libs/uvit.py", line 209, in forward
    label_emb = self.label_emb(y)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Owner

baofff commented Apr 6, 2023

You could first try changing num_processes to 2.

Author

wdp-007 commented Apr 6, 2023

Thanks for the reply!
I'm debugging on a single-GPU machine...
With num_processes=2 I still get the same error.

Owner

baofff commented Apr 6, 2023

You can first run accelerate config on the command line and set up a single-GPU environment. Then try running accelerate launch --mixed_precision fp16 train.py --config=configs/imagenet64_uvit_mid.py

Author

wdp-007 commented Apr 7, 2023

Thanks for the reply!
I tried both the CPU and GPU options during configuration, and both still give the same error.

(unidiffuser) ➜  U-ViT git:(main) ✗ accelerate config                                                                        
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 0
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:yes
Do you want to use DeepSpeed? [yes/NO]: NO
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: fp16
(unidiffuser) ➜  U-ViT git:(main) ✗ accelerate config                                                                        
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU [4] MPS): 0
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: fp16

Using DeepSpeed also gives the same error:

  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/train.py", line 251, in <module>
    app.run(main)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/train.py", line 247, in main
    train(config)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/train.py", line 157, in train
    metrics = train_step(batch)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/train.py", line 83, in train_step
    loss = sde.LSimple(score_model, _batch[0], pred=config.pred, y=_batch[1])
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/sde.py", line 273, in LSimple
    noise_pred = score_model.noise_pred(xt, t, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/sde.py", line 177, in noise_pred
    pred = self.predict(xt, t, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/sde.py", line 174, in predict
    return self.nnet(xt, t * 999, **kwargs)  # follow SDE
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/accelerate/utils/operations.py", line 507, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/diffusion/U-ViT/libs/uvit.py", line 209, in forward
    label_emb = self.label_emb(y)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/pai/envs/unidiffuser/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
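
A note on the traceback above: both runs stop inside `torch.embedding`, called from `self.label_emb(y)` in `libs/uvit.py`. On CUDA, a device-side assert at this point is commonly caused by an index in `y` that falls outside the embedding table, though nothing in this thread confirms that this is the cause here. The sketch below is one way to check for that, assuming the labels can be converted to plain Python integers. `find_bad_labels` and the example values are hypothetical and not part of U-ViT.

```python
from typing import Iterable, List, Tuple


def find_bad_labels(labels: Iterable[int], num_embeddings: int) -> List[Tuple[int, int]]:
    """Return (position, value) pairs for labels outside [0, num_embeddings)."""
    return [
        (pos, int(value))
        for pos, value in enumerate(labels)
        if not 0 <= int(value) < num_embeddings
    ]


# Hypothetical example (an assumption, not data from the issue): a 1000-row
# embedding table and a batch whose labels were stored 1-based (1..1000)
# instead of 0-based (0..999).
batch_labels = [1, 17, 1000, 512]
bad = find_bad_labels(batch_labels, num_embeddings=1000)
print(bad)  # [(2, 1000)]
```

In `train.py`, something like `find_bad_labels(_batch[1].tolist(), num_embeddings)` could be called just before `sde.LSimple(...)`, with `num_embeddings` read from the model's `label_emb.num_embeddings`. Setting `CUDA_LAUNCH_BLOCKING=1` in the environment may also make the reported stack location more precise.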
