
Multiple GPU Support

Level 2

DSS gives me the option to select both of my NVIDIA RTX 2080 Ti GPUs for training, but when I select both, the training session errors out and tells me to reduce the number of GPUs. Training runs fine with 1 GPU. How can I configure DSS to support multi-GPU sessions without errors?

Dataiker

Hi,

DSS supports multiple GPUs for training deep learning models. Could you please upload the full log of the failed training session?

Best regards,

Alex

 

Level 2

Hi Alex, 

I attached a screenshot showing both GPUs selected, along with the errors.

My Python environment is DSS code-envs: python2_7(path) with TensorFlow 1.15 and Keras 2.1.5.

It works well with one GPU selected.

I also noticed that when I select one GPU, training still runs on a different GPU. I attached an image of that as well.

Thanks for looking into this. 

Vinh

Dataiker

Hi,

Could you please send us the full error log (click on the "LOGS" button) as a text file attachment? Unfortunately, screenshots do not allow us to diagnose the problem.

Cheers,

Alex

 

 

Level 2

Hi, 

Attached is the error log. 

Thanks, 

Vinh

Dataiker

Hi,

The log shows the following message:

2020-05-13 07:52:21.593829: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.

I recommend following the steps highlighted in "Linux setup" on https://www.tensorflow.org/install/gpu.
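To see which GPU libraries fail to load on a given machine, a small check along these lines may help. It is only a sketch: the list of library names is an assumption based on what TensorFlow 1.15 builds commonly look for, and `check_libraries` is a hypothetical helper, not part of DSS or TensorFlow. Replace the names with the ones reported as missing in your own log.

```python
import ctypes

# Assumed list of shared libraries; adjust it to the names that your
# training log reports as missing.
EXPECTED_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def check_libraries(names):
    """Return a dict mapping each library name to True if it can be loaded."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the same dynamic loader search path
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in check_libraries(EXPECTED_LIBS).items():
        print("{:<24} {}".format(name, "OK" if ok else "MISSING"))
```

Run it with the same user and environment variables (in particular LD_LIBRARY_PATH) as the DSS process, otherwise the result may not reflect what the training session sees.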

Best regards,

Alex

Level 2

Hi, 

I verified that CUDA, cuDNN, and TensorRT are installed correctly in the /usr/local/cuda-10.2/lib64 directory.

I reviewed the libraries it could not open. I currently have CUDA 10.2, so the files are labeled libcuda10.2, etc.; however, DSS is looking for files like:

libcudart.so.10.0

Is DSS pinned to CUDA 10.0, so that the 10.2 libraries will not be loaded?

Also, when I select 1 GPU and training runs, it uses the GPU I did not select.

Another solution I am considering is nvidia-docker (I am still new to containers and Docker). Is it possible to run an instance of DSS and have it point to nvidia-docker so that I can use my GPUs correctly?
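One possible cause of such a mismatch, not confirmed in this thread, is device numbering: by default CUDA orders devices fastest-first, while nvidia-smi lists them in PCI bus order, so index 0 can refer to different cards. In a standalone Python process this can be aligned as sketched below. `pin_gpus` is a hypothetical helper, and whether DSS applies its own GPU selection on top of these variables is an open assumption.

```python
import os


def pin_gpus(indices):
    """Expose only the given GPU indices, numbered in PCI bus order."""
    # Make CUDA number devices the same way nvidia-smi does.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Restrict the process to the listed devices.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return os.environ["CUDA_VISIBLE_DEVICES"]


pin_gpus([0])
# Import TensorFlow only after the variables are set, for example:
# import tensorflow as tf
```

The variables must be set before the CUDA runtime is initialized, which in practice means before TensorFlow is imported.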

Dataiker

Hi,

Each version of TensorFlow and Keras expects specific GPU library versions. For Visual ML in DSS, we recommend:

- tensorflow 1.15

- keras 2.3

- CUDA 10.0

- cuDNN 7.6
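As a rough way to compare a code environment against the Python package versions in this list, something like the following could be used. It is a sketch only: `matches`, `report`, and `installed_versions` are hypothetical helpers, and the comparison looks at the major.minor prefix only. The CUDA and cuDNN versions are not checked here.

```python
import importlib

# Python package versions recommended in this thread.
RECOMMENDED = {"tensorflow": "1.15", "keras": "2.3"}


def matches(installed, recommended):
    """True if the installed version has the recommended major.minor prefix."""
    return installed.split(".")[:2] == recommended.split(".")[:2]


def report(installed, recommended=RECOMMENDED):
    """Map each package to True/False; packages not installed count as False."""
    return {
        name: name in installed and matches(installed[name], rec)
        for name, rec in recommended.items()
    }


def installed_versions(names):
    """Collect __version__ for each importable package in names."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed in this environment
        found[name] = getattr(module, "__version__", "")
    return found


# Versions reported earlier in this thread for the failing environment.
print(report({"tensorflow": "1.15.0", "keras": "2.1.5"}))
# {'tensorflow': True, 'keras': False}
```

Inside the actual code environment, `report(installed_versions(RECOMMENDED))` would run the same comparison against whatever is installed.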

We will investigate how to support CUDA 10.2 and keep you updated.

I do not recommend nvidia-docker in your case as it would increase the complexity of your setup for no benefit.

Hope it helps,

Alex

Level 2

I think I know why DSS does not recognize the GPUs: the Python scripts in DSS look for files named lib***10.0, whereas the files in the library directory are actually named lib***10, which is why they could not be found. Renaming the files worked.
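An alternative to renaming is to add symbolic links under the expected names, which leaves the original files in place. The sketch below assumes the mismatch is between names ending in `.so.10` (present) and `.so.10.0` (expected); that exact pattern, and the helper `link_expected_names`, are assumptions for illustration. Libraries from different CUDA releases are not guaranteed to be binary compatible, so this is a workaround rather than a supported setup.

```python
import os


def link_expected_names(lib_dir, expected_suffix=".10.0",
                        actual_suffix=".10", dry_run=True):
    """Plan symlinks such as libX.so.10.0 -> libX.so.10 inside lib_dir.

    Returns a list of (link_path, target_name) pairs. Links are created
    only when dry_run is False.
    """
    planned = []
    for name in sorted(os.listdir(lib_dir)):
        if not (name.startswith("lib") and name.endswith(".so" + actual_suffix)):
            continue
        link_name = name[: -len(actual_suffix)] + expected_suffix
        link_path = os.path.join(lib_dir, link_name)
        if os.path.lexists(link_path):
            continue  # the expected name already exists, leave it alone
        planned.append((link_path, name))
        if not dry_run:
            os.symlink(name, link_path)  # relative link within lib_dir
    return planned
```

Calling `link_expected_names("/usr/local/cuda-10.2/lib64")` first with the default `dry_run=True` shows which links would be created before anything on disk changes.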
