Kernel dies with Convolutions on GPU

byvalentino Registered Posts: 1

The kernel dies every time I run a convolution on the GPU.

The environment seems to be set up correctly: the GPU is available, and simple transformations run on both GPU and CPU. Convolutions run on the CPU, but on the GPU they kill the kernel (see attached image).

I can't find any useful error message.

How can I troubleshoot this?

Thanks in advance for the help!

Valentino

torch==2.3.0
torchaudio==2.3.0
torchvision==0.18.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.1.105

Answers

  • NicolasD
    NicolasD Dataiker, Dataiku DSS Core Designer, Registered Posts: 12 Dataiker

    Hello. Out-of-memory problems are, alas, common when using GPUs.

    Would you be able to monitor your GPU memory usage while the cell runs? For example, running `watch nvidia-smi` on the server where the GPU is located and observing the memory usage would help identify an out-of-memory problem.
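
    If opening a terminal on the server is inconvenient, the same reading can be taken from inside the notebook. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH of the machine that runs the kernel; the helper names (`parse_memory_line`, `query_gpu_memory`, `log_gpu_memory`) are hypothetical and not part of any library.

```python
import subprocess


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into two ints."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory():
    """Return a list of (used_mib, total_mib) tuples, one per visible GPU.

    Assumes the nvidia-smi executable is available on this machine.
    """
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [
        parse_memory_line(line)
        for line in result.stdout.splitlines()
        if line.strip()
    ]


def log_gpu_memory(label):
    """Print current memory usage per GPU, tagged with a label."""
    for index, (used, total) in enumerate(query_gpu_memory()):
        # flush=True so the line is written out before a possible kernel crash
        print(f"[{label}] GPU {index}: {used} / {total} MiB used", flush=True)
```

    Calling `log_gpu_memory("before conv")` in the cell just before the convolution shows how much memory is already in use at that point. If the GPU is nearly full before the convolution starts, an out-of-memory problem becomes a more likely explanation.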
