
RuntimeError: CUDA error: no kernel image is available for execution on the device. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 #28

Open
qiuqc1 opened this issue May 15, 2024 · 2 comments


qiuqc1 commented May 15, 2024

Versions:
PyTorch: 1.10.1
CUDA: 11.1
chamferdist: 1.0.0

Traceback (most recent call last):
  File "./tools/test.py", line 266, in <module>
    main()
  File "./tools/test.py", line 237, in main
    outputs = custom_multi_gpu_test(model, data_loader, args.tmpdir,
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/apis/test.py", line 72, in custom_multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/detectors/bevformer.py", line 156, in forward
    return self.forward_test(**kwargs)
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/detectors/vidar.py", line 469, in forward_test
    e2e_predictor_utils.compute_chamfer_distance_inner(
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/utils/e2e_predictor_utils.py", line 183, in compute_chamfer_distance_inner
    return compute_chamfer_distance(inner_pred_pcd, inner_gt_pcd)
  File "/ml-engine/code/bb92172ed69b1dd0c567f677210a74af3015236f/projects/mmdet3d_plugin/bevformer/utils/e2e_predictor_utils.py", line 166, in compute_chamfer_distance
    loss_src, loss_dst, _ = chamfer_distance(
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/chamferdist/chamfer.py", line 77, in forward
    source_nn = knn_points(
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/chamferdist/chamfer.py", line 280, in knn_points
    p1_dists, p1_idx = _knn_points.apply(
  File "/root/Software/anaconda3/envs/py38t19/lib/python3.8/site-packages/chamferdist/chamfer.py", line 176, in forward
    idx, dists = _C.knn_points_idx(p1, p2, lengths1, lengths2, K, version)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Have you ever encountered this problem? While training on 8 A100s in a server cluster, the error occurred during the final eval. The first 24 epochs ran fine: the training loss decreased normally and the corresponding checkpoint files were generated, but when I separately tested the epoch-24 .pth file, it failed with this error.
Later, using the same Docker image, the test ran normally on a 4090. Why does this happen?
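Not a fix, but a quick diagnostic sketch for this class of error: "no kernel image is available" usually means the compiled CUDA kernels (in PyTorch itself or in an extension such as chamferdist) do not include the running GPU's compute capability (A100 is sm_80, RTX 4090 is sm_89), which would also explain why the same image works on one card but not the other. The snippet below only inspects the local setup; it assumes nothing beyond the public torch API.

```python
import torch

# "no kernel image is available" typically means the binary was not built
# for this GPU's SM architecture. Compare the arch list PyTorch was built
# with against the capability of the device that fails.
print("PyTorch built with CUDA:", torch.version.cuda)
print("compiled arch list:", torch.cuda.get_arch_list())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    # An A100 reports (8, 0); if sm_80 is absent from the arch list above,
    # the kernels for this card were never compiled.
    print(f"device 0 capability: sm_{major}{minor}")
```

If `sm_80` is missing from the arch list on the A100 node, the extension (or the PyTorch wheel) needs to be rebuilt for that architecture.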

@tomztyang
Contributor

It seems like something is wrong with the chamferdist package. Maybe try installing chamferdist following the 4docc instructions?


qiuqc1 commented May 16, 2024

I reinstalled chamferdist in the Docker environment and checked all the required dependencies against the link you gave me. There was a conflict on my side: the versions of numpy and setuptools were inconsistent with what chamferdist requires. Now that I have uninstalled the irrelevant package, there should be no more conflicts in the environment.
But when I ran the eval code with this newly built image today, I still got the same error with the same traceback, so it does not seem to be an environment problem.
Still investigating.
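One more thing worth trying, sketched below under the assumption that chamferdist compiles its CUDA extension from source at install time: reinstall it on (or for) the A100 node with `TORCH_CUDA_ARCH_LIST` pinned to the A100's architecture, so the build cannot silently target only the build machine's GPU. The exact pip flags are illustrative, not prescribed by chamferdist.

```shell
# Rebuild chamferdist's CUDA extension explicitly for the A100 (sm_80).
# TORCH_CUDA_ARCH_LIST tells torch's extension builder which SM targets
# to emit; without it, the build defaults to the local GPU's arch.
pip uninstall -y chamferdist
TORCH_CUDA_ARCH_LIST="8.0" pip install --no-cache-dir chamferdist
```

To cover both cards with one image, a list such as `TORCH_CUDA_ARCH_LIST="8.0;8.9"` would compile kernels for the A100 and the 4090 in the same wheel (sm_89 requires a sufficiently new CUDA toolkit).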
