DSS gives me the option to select both NVIDIA RTX 2080 Ti GPUs for training, but when I select both, the training session errors out and tells me to reduce the number of GPUs. Training runs fine with one GPU. How can I configure DSS to support multi-GPU sessions without errors?
I attached a screenshot showing both GPUs selected, along with the errors.
My Python environment is DSS code-envs: python2_7(path), with TensorFlow 1.15 and Keras 2.1.5.
It works well with one GPU selected.
I also noticed that when I select one GPU, training still runs on a different GPU. I attached an image of that as well.
Thanks for looking into this.
Could you please send us the full error log (click on the "LOGS" button) as a text file attachment? Unfortunately, screenshots do not allow us to diagnose the problem.
The log shows the following message:
2020-05-13 07:52:21.593829: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
I recommend following the steps highlighted in "Linux setup" on https://www.tensorflow.org/install/gpu.
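To see which libraries the dynamic loader can actually open from inside the code env, a minimal sketch along these lines may help. The function name check_loadable is hypothetical, and the library names to pass in should be copied from the "Could not dlopen" lines of your own log; none are hard-coded here.

```python
import ctypes


def check_loadable(names):
    """Try to dlopen each shared library name and report success or failure.

    `names` should be the exact file names reported as missing in the
    TensorFlow log; this helper does not guess them.
    """
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # asks the dynamic loader to open the library
            results[name] = True
        except OSError:
            results[name] = False
    return results
```

Running it in the same Python environment that DSS uses for training matters, because the loader search path (for example LD_LIBRARY_PATH) can differ between shells and services.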
I verified that CUDA, cuDNN, and TensorRT are installed correctly in the /usr/local/cuda-10.2/lib64 directory.
I reviewed the libraries it could not open.
I currently have CUDA 10.2, so the files are labeled libcuda10.2, etc. However, DSS is looking for files like:
Is DSS fixed to CUDA 10.0, so that the 10.2 libraries won't be loaded?
Also, when I select one GPU and it trains, I noticed it uses the GPU I didn't select.
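One possible explanation, offered as an assumption rather than a diagnosis, is device numbering: by default CUDA orders devices fastest-first, while nvidia-smi lists them in PCI bus order, so index 0 in one tool is not always index 0 in the other. In a code recipe or notebook, a sketch like the following makes the two orderings agree and restricts the process to chosen GPUs. The helper name pin_gpus is hypothetical, and whether Visual ML honors these variables is not confirmed in this thread.

```python
import os


def pin_gpus(gpu_ids, env=None):
    """Set CUDA environment variables so that only `gpu_ids` are visible.

    Must run before TensorFlow initializes CUDA, i.e. before the first
    `import tensorflow` in the process.
    """
    env = os.environ if env is None else env
    # Number devices in PCI bus order, matching what nvidia-smi reports.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Expose only the requested devices to this process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env
```

For example, calling pin_gpus([1]) at the top of a script should leave only the second GPU (as numbered by nvidia-smi) visible to TensorFlow.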
Another solution I'm considering is nvidia-docker (I'm still new to containers and how Docker works). Is it possible to run an instance of DSS and point it at nvidia-docker so that I can use my GPUs correctly?
Specific versions of TensorFlow and Keras expect specific GPU library versions. For Visual ML in DSS, we recommend:
- tensorflow 1.15
- keras 2.3
- CUDA 10.0
- cuDNN 7.6
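As a rough way to compare a code env against the Python packages in this list, here is a minimal sketch. The function name version_mismatches and the prefix-matching rule are assumptions made for illustration; CUDA and cuDNN are system libraries and are not covered by it.

```python
# Recommended Python package versions, taken from the list above.
RECOMMENDED = {"tensorflow": "1.15", "keras": "2.3"}


def version_mismatches(installed, recommended=RECOMMENDED):
    """Return {package: (installed, recommended)} for packages that differ.

    A package matches when its installed version equals the recommended
    major.minor or extends it with a patch number (e.g. "1.15.2").
    """
    mismatches = {}
    for pkg, want in recommended.items():
        have = installed.get(pkg)
        if have is None or not (have == want or have.startswith(want + ".")):
            mismatches[pkg] = (have, want)
    return mismatches
```

Inside the code env, the installed versions can be read from tensorflow.__version__ and keras.__version__. With the versions reported at the start of this thread (tensorflow 1.15, Keras 2.1.5), the sketch would flag Keras only.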
We will investigate how to support CUDA 10.2 and keep you updated.
I do not recommend nvidia-docker in your case as it would increase the complexity of your setup for no benefit.
Hope it helps,
I think I know why it doesn't recognize the GPUs: the Python scripts in DSS look for files named lib***10.0, whereas the files in the library directory are named lib***10, so they could not be found. Renaming the files worked.
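For anyone who would rather not rename the installed files, a symbolic link under the expected name is a gentler variant of the same workaround. This is a minimal sketch: the helper name add_alias is hypothetical, the actual file names must come from your own log and library directory, and it does not check whether the library is binary-compatible with what TensorFlow expects.

```python
import os


def add_alias(lib_dir, existing_name, alias_name):
    """Create `alias_name` in `lib_dir` as a symlink to `existing_name`.

    Leaves the original file untouched and does nothing if the alias
    already exists. Returns the full path of the alias.
    """
    target = os.path.join(lib_dir, existing_name)
    alias = os.path.join(lib_dir, alias_name)
    if not os.path.exists(target):
        raise IOError("no such library: " + target)
    if not os.path.lexists(alias):
        # Relative link, so it stays valid if lib_dir is moved as a whole.
        os.symlink(existing_name, alias)
    return alias
```

Writing into a system library directory usually requires root privileges, and removing the link later restores the original state.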