
CUDA driver error inside Docker #306

Closed
brolinA opened this issue Jun 14, 2024 · 7 comments

Comments


brolinA commented Jun 14, 2024

Hi, I am trying to run the repository in Docker on Ubuntu 20.04.

The Docker setup was successful and I was able to run the Jackal robot as expected. Then I tried to run wild_visual_navigation_ros inside the container and got the following error.

Command used:
roslaunch wild_visual_navigation_ros wild_visual_navigation.launch

Error:
[screenshot: cuda_error]

This is the native CUDA driver on my Ubuntu 20.04 system:

[screenshot of host nvidia-smi output]

Should the CUDA version match between the Docker image and my native system, or is there something else causing the error?

mmattamala (Collaborator) commented Jun 16, 2024

Hi @brolinA, thanks for reporting this.
Can you check if you can:

  • run nvidia-smi inside the container?
  • run python3 -c "import torch; print(torch.cuda.is_available())" inside the container?

I share your concern that there could be some incompatibility or driver issue.
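
For a slightly more detailed check, something along these lines (a rough sketch, assuming a standard PyTorch install inside the container) also prints the CUDA version torch was built against:

# cuda_check.py -- quick diagnostic sketch (assumes PyTorch is installed)
import torch

print("torch version:         ", torch.__version__)
print("CUDA available:        ", torch.cuda.is_available())
print("CUDA torch built with: ", torch.version.cuda)
if torch.cuda.is_available():
    print("device:                ", torch.cuda.get_device_name(0))
    print("cuDNN version:         ", torch.backends.cudnn.version())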

brolinA (Author) commented Jun 18, 2024

Hi @mmattamala,
Thank you for the response. Unfortunately, I am not able to enter the container after restarting my system. I keep getting the following error:

[+] Running 0/0
 ⠋ Container docker-wvn_nvidia-1  Recreate                                                                                                                                                                    0.0s 
Error response from daemon: unknown or invalid runtime name: nvidia

I have made sure that both nvidia-docker2 and nvidia-container-toolkit are installed, but it still doesn't work. It was working before, but stopped working after I restarted the PC.

Any idea how to tackle this issue?

mmattamala (Collaborator) commented

I think something is messed up with the NVIDIA Docker configuration. Can you check that all the steps here were followed correctly? https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker

Similarly, this thread might have some tips: https://stackoverflow.com/questions/52865988/nvidia-docker-unknown-runtime-specified-nvidia
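
In particular, that error usually means the nvidia runtime was never registered with the Docker daemon. A quick sanity check (a sketch, assuming the default daemon config path /etc/docker/daemon.json) is to confirm the runtime entry is actually there:

# check_nvidia_runtime.py -- sketch: is the nvidia runtime registered with dockerd?
import json

with open("/etc/docker/daemon.json") as f:  # default Docker daemon config location
    config = json.load(f)

runtimes = config.get("runtimes", {})
if "nvidia" in runtimes:
    print("nvidia runtime registered:", runtimes["nvidia"])
else:
    print("nvidia runtime NOT registered; re-run the container-toolkit Docker configuration step")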

brolinA (Author) commented Jun 18, 2024

Hi @mmattamala,
Thank you. I was able to fix it.

Here is the output of the commands you mentioned:

[screenshot of nvidia-smi and torch.cuda.is_available() output inside the container]

mmattamala (Collaborator) commented

Good to know it helped.

Coming back to the original issue, it seems to be a mismatch between the driver in the host system and the one in the container (the nvidia-smi outputs don't match).

I'm a bit short on time at the moment to take a deeper look, but I recommend searching for similar Docker issues.

andreschreiber commented

Adding on to this -- I had the exact same issue (same error messages).
Using nvidia-smi in the container showed CUDA 12.3 whereas outside of the container it showed CUDA 12.2.
Changing the Dockerfile to use:
FROM nvidia/cuda:12.2.2-runtime-ubuntu20.04 as base
fixed the issue for me.
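
To confirm the driver and the container now line up after rebuilding the image, a rough check along these lines (a sketch, assuming nvidia-smi is on the PATH inside the container) can be run:

# compare_cuda_versions.py -- sketch: driver-reported CUDA vs. the CUDA torch was built with
import re
import subprocess

import torch

smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
match = re.search(r"CUDA Version:\s*([\d.]+)", smi)  # parse the banner line
driver_cuda = match.group(1) if match else "unknown"

print("driver-supported CUDA (nvidia-smi):", driver_cuda)
print("CUDA torch was built with:         ", torch.version.cuda)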

mmattamala (Collaborator) commented

Thanks @andreschreiber for the proposed fix! I'll close the issue.
