DSS lets me select both NVIDIA RTX 2080 Ti GPUs for training, but when I select both, the training session errors out and tells me to reduce the number of GPUs. Training runs fine with 1 GPU. How can I configure DSS to support multi-GPU sessions without errors?
I attached a screenshot showing both GPUs selected, along with the errors.
My Python environment is a DSS code env: python2_7(path) with TensorFlow 1.15 and Keras 2.1.5.
It works well with one GPU selected.
I also noticed that when I select one GPU, training still runs on a different GPU. I attached an image of that as well.
Thanks for looking into this.
The log shows the following message:
2020-05-13 07:52:21.593829: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
I recommend following the steps highlighted in "Linux setup" on https://www.tensorflow.org/install/gpu.
I verified that CUDA, cuDNN, and TensorRT are installed correctly in the /usr/local/cuda-10.2/lib64 directory.
I reviewed the libraries it could not open.
I currently have CUDA 10.2, so the files are labeled libcuda10.2, etc. However, DSS is looking for files like:
Is DSS fixed to version 10.0 so version 10.2 won't be imported?
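One way to see exactly which library names fail to load is to try opening them directly from the same environment. This is only a rough sketch: the list of names below is an assumption based on what a TensorFlow 1.15 build against CUDA 10.0 and cuDNN 7 typically asks for, not something taken from the DSS log, so replace it with the names reported in your own log.

```python
from __future__ import print_function
import ctypes

# Assumed sonames for TensorFlow 1.15 built against CUDA 10.0 / cuDNN 7.
# Replace with the names your own log reports as missing.
EXPECTED_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def check_libs(names):
    """Return {name: bool}, True when the dynamic loader can open the library."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in sorted(check_libs(EXPECTED_LIBS).items()):
        print("{:<24} {}".format(name, "found" if ok else "MISSING"))
```

The loader searches the usual system paths plus LD_LIBRARY_PATH, so run this with the same environment variables the DSS process sees.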
Also, when I select 1 GPU and training runs, it uses the GPU I did not select.
Another solution I'm considering is nvidia-docker (I'm still new to how containers and Docker work). Is it possible to run an instance of DSS and have it point to nvidia-docker so I can use my GPUs correctly?
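One possible cause of the "wrong GPU" symptom, not confirmed for DSS in this thread, is device ordering: by default CUDA enumerates devices fastest-first, while nvidia-smi lists them in PCI bus order, so index 0 may not mean the same card in both. In code you control (for example a notebook or Python recipe), a minimal sketch for pinning a device looks like this. The helper name `pin_gpu` is hypothetical, and whether Visual ML honors these variables is an assumption.

```python
import os


def pin_gpu(index):
    """Expose a single GPU to CUDA, numbered the same way nvidia-smi does."""
    # Make CUDA device indices follow PCI bus order (matches nvidia-smi).
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every GPU except the requested one from this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


# Must run before TensorFlow is imported, otherwise it has no effect.
pin_gpu(0)
```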
Each version of TensorFlow and Keras expects specific GPU library versions. For Visual ML in DSS, we recommend:
- tensorflow 1.15
- keras 2.3
- CUDA 10.0
- cuDNN 7.6
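To compare a code env against this list, a small check like the following can help. It is only a sketch: the `installed` dictionary is filled in by hand here with the versions from the original post, and the function name `version_mismatches` is made up for illustration.

```python
from __future__ import print_function

# Python package versions from the recommendation above.
RECOMMENDED = {"tensorflow": "1.15", "keras": "2.3"}


def version_mismatches(installed, recommended=RECOMMENDED):
    """Return {package: (found, wanted)} for packages outside the recommended series."""
    mismatches = {}
    for package, wanted in recommended.items():
        found = installed.get(package)
        # Accept an exact match or any patch release of the wanted series.
        matches = found is not None and (
            found == wanted or found.startswith(wanted + ".")
        )
        if not matches:
            mismatches[package] = (found, wanted)
    return mismatches


# Versions reported in the original post.
print(version_mismatches({"tensorflow": "1.15", "keras": "2.1.5"}))
# {'keras': ('2.1.5', '2.3')}
```

With the versions from the original post, only Keras falls outside the recommended series.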
We will investigate how to support CUDA 10.2 and keep you updated.
I do not recommend nvidia-docker in your case as it would increase the complexity of your setup for no benefit.
Hope it helps,
I think I know why it doesn't recognize the GPUs: the Python scripts in DSS are looking for files named lib***10.0, but the files in the library directory are actually named lib***10, which is why they couldn't be found. Renaming the files worked.
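As an alternative to renaming, symbolic links keep the original files in place while also exposing the names being searched for. This is a hedged sketch, not a DSS-endorsed procedure: the function name and the file names in the usage comment are placeholders, and a link only fixes the name lookup. It does not guarantee that a library from a different CUDA version is binary-compatible.

```python
import os


def link_expected_names(lib_dir, name_map):
    """Create symlinks in lib_dir so each expected name points at an existing file.

    name_map maps the expected file name to the existing file name.
    Returns the sorted list of links that were created.
    """
    created = []
    for expected, existing in name_map.items():
        src = os.path.join(lib_dir, existing)
        dst = os.path.join(lib_dir, expected)
        # Skip missing sources and never overwrite an existing file or link.
        if os.path.exists(src) and not os.path.lexists(dst):
            os.symlink(existing, dst)  # relative link, resolved inside lib_dir
            created.append(expected)
    return sorted(created)


# Example call with placeholder names (needs write access to the directory):
# link_expected_names("/usr/local/cuda-10.2/lib64",
#                     {"libexample.so.10.0": "libexample.so.10"})
```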