Multiple GPU Support

vinhdiesal Registered Posts: 11 ✭✭✭✭

DSS gives me the option to select both NVIDIA RTX 2080 Ti GPUs for training, but when I select both, the training session errors out and tells me to reduce the number of GPUs. Training runs fine with one GPU. How can I configure DSS to support multi-GPU training sessions without errors?

Answers

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭

    Hi,

    DSS supports multiple GPUs for training deep learning models. Could you please upload the full log of the failed training session?

    Best regards,

    Alex

  • deeplearnyogi
    deeplearnyogi Registered Posts: 9 ✭✭✭✭

    Hi Alex,

    I have attached a screenshot showing both GPUs selected, along with the errors.

    My Python environment is the DSS code env python2_7(path), with TensorFlow 1.15 and Keras 2.1.5.

    It works well with one GPU selected.

    I also noticed that when I select one GPU, DSS still trains on a different GPU from the one I selected. I have attached an image of that as well.

    Thanks for looking into this.

    Vinh

  • Alex_Combessie

    Hi,

    Could you please send us the full error log (click on the "LOGS" button) as a text file attachment? Unfortunately, screenshots do not allow us to diagnose the problem.

    Cheers,

    Alex

  • deeplearnyogi

    Hi,

    Attached is the error log.

    Thanks,

    Vinh

  • Alex_Combessie

    Hi,

    The log shows the following message:

    2020-05-13 07:52:21.593829: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.

    I recommend following the steps highlighted in "Linux setup" on https://www.tensorflow.org/install/gpu.

    Best regards,

    Alex
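The "Cannot dlopen some GPU libraries" warning quoted above means the dynamic loader could not open one or more shared libraries by the exact file name TensorFlow asked for. A minimal sketch of how one might reproduce that check outside DSS, using Python's standard `ctypes` module, is given here. Only `libcudart.so.10.0` is named in this thread; the other entries in `EXPECTED_LIBS` and the helper name `find_unloadable` are assumptions made for illustration.

```python
import ctypes

# Only libcudart.so.10.0 is mentioned in the thread. The other names are
# assumptions about what a CUDA 10.0 build of TensorFlow typically loads.
EXPECTED_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcudnn.so.7",
]


def find_unloadable(lib_names):
    """Return the library names that the dynamic loader cannot open."""
    missing = []
    for name in lib_names:
        try:
            # CDLL performs a dlopen() with the normal library search path.
            ctypes.CDLL(name)
        except OSError:
            missing.append(name)
    return missing
```

Calling `find_unloadable(EXPECTED_LIBS)` from the same code env and with the same `LD_LIBRARY_PATH` as the training session should list the file names that are missing or not on the search path.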

  • deeplearnyogi
    edited July 17

    Hi,

    I verified that CUDA, cuDNN, and TensorRT are installed correctly in the /usr/local/cuda-10.2/lib64 directory.

    I reviewed the libraries it could not open.

    I currently have CUDA 10.2, so the files are labeled libcuda10.2, etc. However, DSS is looking for files like:

    libcudart.so.10.0

    Is DSS fixed to version 10.0 so version 10.2 won't be imported?

    Also, when I select one GPU and it trains, I noticed that it uses the GPU I didn't select.

    Another solution I'm considering is nvidia-docker (I'm still new to how containers and Docker work). Is it possible to run an instance of DSS and have it point to nvidia-docker so I can use my GPUs correctly?
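On the "wrong GPU" observation: one possible explanation, not confirmed anywhere in this thread, is an index mismatch. By default CUDA enumerates devices fastest-first, while `nvidia-smi` lists them in PCI bus order, so index 0 in one view is not guaranteed to be index 0 in the other. The sketch below shows how the standard `CUDA_DEVICE_ORDER` and `CUDA_VISIBLE_DEVICES` environment variables can pin the ordering and the visible devices. The helper `gpu_env` is hypothetical, and how DSS itself passes the GPU selection to TensorFlow is not something this sketch assumes.

```python
import os


def gpu_env(selected_indices, base_env=None):
    """Build an environment that exposes only the selected GPUs.

    Hypothetical helper: it returns a copy of the environment rather than
    modifying os.environ, so the caller decides where to apply it.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Make CUDA number devices the same way nvidia-smi does.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every GPU except the ones selected.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in selected_indices)
    return env
```

These variables only take effect if they are set before the process initializes CUDA, for example by passing the returned dictionary as the `env` argument of `subprocess.Popen`.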

  • Alex_Combessie

    Hi,

    Each version of TensorFlow and Keras expects specific versions of the GPU libraries. In the case of Visual ML in DSS, we recommend:

    - tensorflow 1.15

    - keras 2.3

    - CUDA 10.0

    - cuDNN 7.6

    We will investigate how to support CUDA 10.2 and keep you updated.

    I do not recommend nvidia-docker in your case as it would increase the complexity of your setup for no benefit.

    Hope it helps,

    Alex
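The recommended versions listed above can be compared against an installation with a small check like the following sketch. The `RECOMMENDED` table is copied from the reply; the function name `version_mismatches` and the idea of passing the installed versions in as a plain dictionary are assumptions made for illustration, since how the versions are collected depends on the setup.

```python
# Versions recommended in the reply above for Visual ML in DSS.
RECOMMENDED = {
    "tensorflow": "1.15",
    "keras": "2.3",
    "cuda": "10.0",
    "cudnn": "7.6",
}


def version_mismatches(installed, recommended=RECOMMENDED):
    """Return {name: (found, wanted)} for every component that differs.

    A component matches if its version equals the recommended one or only
    adds a patch level to it (for example "1.15.2" matches "1.15").
    """
    mismatches = {}
    for name, wanted in recommended.items():
        found = installed.get(name)
        matches = found is not None and (
            found == wanted or found.startswith(wanted + ".")
        )
        if not matches:
            mismatches[name] = (found, wanted)
    return mismatches
```

With the versions reported earlier in the thread (TensorFlow 1.15, Keras 2.1.5, CUDA 10.2, cuDNN version not stated), this check would flag Keras, CUDA, and the unknown cuDNN version.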

  • deeplearnyogi

    I think I know why it doesn't recognize the GPUs: the Python scripts in DSS look for files named lib***10.0, but the files in the library directory are actually named lib***10, which is why they could not be found. Renaming the files worked.
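The fix reported above was to rename the files. A variation on that, sketched below, is to leave the original files in place and add symbolic links under the expected names, which is easier to undo. This is a swapped-in technique, not what the poster did. The function name `link_expected_names` and the default suffixes are assumptions based on the `lib***10` and `lib***10.0` patterns described in the post, and whether a library loaded under a different version name actually works depends on binary compatibility.

```python
import os


def link_expected_names(lib_dir, suffix_from=".so.10", suffix_to=".so.10.0"):
    """Create symlinks named *suffix_to pointing at files named *suffix_from.

    Hypothetical helper. Returns the list of link names it created and
    never overwrites a file or link that already exists.
    """
    created = []
    for name in sorted(os.listdir(lib_dir)):
        if not name.endswith(suffix_from):
            continue
        link_name = name[: -len(suffix_from)] + suffix_to
        link_path = os.path.join(lib_dir, link_name)
        if os.path.lexists(link_path):
            continue  # expected name is already present
        # Relative link: resolved inside lib_dir itself.
        os.symlink(name, link_path)
        created.append(link_name)
    return created
```

Writing into a system directory such as the CUDA `lib64` folder normally requires root privileges, so in practice one might point `lib_dir` at a separate directory of links and add that directory to `LD_LIBRARY_PATH` instead.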
