CUDNN_ERROR? #17

edwardcho · 2022-02-23T05:14:40Z

Hello Sir,

I tried to train your code on my own dataset, but I got a cuDNN error.

...
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [374,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [374,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/train.py", line 40, in <module>
    trainer.run_generator_one_step(data_i)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/trainers/pix2pix_trainer.py", line 35, in run_generator_one_step
    g_losses, generated = self.pix2pix_model(data, mode='generator')
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/models/pix2pix_model.py", line 46, in forward
    input_semantics, real_image)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/models/pix2pix_model.py", line 136, in compute_generator_loss
    input_semantics, real_image, compute_kld_loss=self.opt.use_vae)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/models/pix2pix_model.py", line 198, in generate_fake
    fake_image = self.netG(input_semantics, z=z)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/models/networks/generator.py", line 90, in forward
    x = self.fc(x)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f44b3256a22 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10aa3 (0x7f44b34b7aa3 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f44b34b9147 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f44b32405a4 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa2f382 (0x7f4558065382 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa2f421 (0x7f4558065421 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #21: __libc_start_main + 0xe7 (0x7f455add0b97 in /lib/x86_64-linux-gnu/libc.so.6)

How can I solve it?

Thanks,
Edward Cho.
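
The log above itself suggests rerunning with `CUDA_LAUNCH_BLOCKING=1`. Because CUDA kernel errors are reported asynchronously, the `CUDNN_STATUS_NOT_INITIALIZED` line may only be where the earlier device-side assert happened to surface. A minimal sketch of setting the variable from Python, assuming it goes at the very top of the training script before any CUDA work (that placement is an assumption, not something taken from the repository):

```python
import os

# Set this before the first CUDA call; the simplest way is to do it
# before importing torch. Kernels then run synchronously, so the Python
# traceback should point at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# The rest of the training script (imports of torch, model setup, ...)
# would follow here.
```

Setting the variable in the shell when launching the script has the same effect.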

Ha0Tang · 2022-02-23T13:43:30Z

Please provide more information on how you train.

edwardcho · 2022-02-28T01:23:05Z

Hello Sir,
I tried to train an image-to-image translation model using your code.

My dataset is as follows:

Blurred image vs. clear image: a paired image set, grayscale.
Because I could not prepare labeled images, I assumed the blurred image could serve as the labeled image.

edwardcho · 2022-02-28T01:26:13Z

If I can't prepare semantic labeled images, does that mean I can't use your code?

Ha0Tang · 2022-04-22T08:24:27Z

Did you successfully run my code with my dataset?

Ha0Tang · 2022-04-22T08:26:06Z

You can run the code without semantic labeled images.

Ha0Tang · 2022-04-22T08:29:08Z

How many channels does the blurred image have? It should be 3; if not, you need to change the code.

davidvfx07 · 2022-08-28T00:24:21Z

I am having the same issue! When I use the ADE dataset images it trains with no issues, but when I use my own images, with the same bit depth, it gives me this error!

davidvfx07 · 2022-08-28T00:34:50Z

I think I figured it out. It seems to be basically an overload error. Decreasing the number of images may help with that error. I don't know why a lower batch size still produces the error, though; I now have to reduce my image count to just 25. This is code-breaking. @Ha0Tang, please fix this or explain what I might be doing wrong.
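
The `index out of bounds` assertion at the top of the log comes from a scatter/gather kernel, which fails this way when an index tensor holds a value outside the target dimension. One possible cause, assuming the model one-hot encodes the label map with `scatter_` over `label_nc` channels as SPADE-style code does (this is an assumption about the repository, not confirmed in the thread), is a label image containing values at or above `label_nc`. A grayscale photo used as a label map can hold any value from 0 to 255. A minimal sketch of checking this on the CPU; the helper name and the class count of 150 are hypothetical:

```python
import numpy as np


def out_of_range_labels(label_map, label_nc):
    """Return the label values in label_map that fall outside [0, label_nc)."""
    values = np.unique(np.asarray(label_map))
    return [int(v) for v in values if v < 0 or v >= label_nc]


# Hypothetical example: an 8-bit grayscale image used as a label map,
# checked against a hypothetical class count of 150.
gray = np.array([[0, 37, 149], [150, 200, 255]], dtype=np.uint8)
bad = out_of_range_labels(gray, label_nc=150)
print(bad)  # [150, 200, 255]
```

If the list is non-empty for any training label, either the class count option or the label images would need to change. This would also be consistent with the error appearing only for some subsets of images, though that is not verified here.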
