Visual Time series model training on GPU fails

RiaanB

Hello,

I'm getting this error whilst trying to train a time series model on GPU. 

OSError: libnvToolsExt.so.1: cannot open shared object file: No such file or directory

I have done the following so far:

1. Created a CUDA 10.2-enabled base image in DSS and pushed the base images

2. Created a code environment and added the additional packages for visual time series forecasting (CUDA 10.2)

I've also tried to use a Docker append to add cuda-nvtx-10-2 to the base image:

USER root
# Install cuda-nvtx-10-2
RUN yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo && \
yum install -y cuda-nvtx-10-2 && \
yum clean all
# Globally enable cuda-nvtx-10-2
ENV PATH=/usr/local/cuda-10.2/bin:${PATH} \
LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:${LD_LIBRARY_PATH}
USER dataiku

The files are installed and available, but they're still not found when the code runs.

I've seen online that others resolved this by adding the /usr/local/cuda/lib64 path to $LD_LIBRARY_PATH, but I'm unable to do so. The ENV from the Docker append doesn't seem to take effect.

Does anyone have any suggestions?

Thanks

Riaan


Operating system used: CentOS (cloud stack)

sergeyd
Dataiker

 Hi @RiaanB 

As you have also reported this in a support ticket, I am replying here as well.

You will need to update LD_LIBRARY_PATH:

ENV LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

and rebuild the images. We are going to fix this permanently in upcoming releases.
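
To check whether the new LD_LIBRARY_PATH is actually visible to the process, a minimal sketch like the one below could be run with the code environment's Python inside the rebuilt container. The helper name find_in_ld_library_path is hypothetical, not part of DSS. It only inspects the directories listed in LD_LIBRARY_PATH, so a library the loader finds some other way (for example through the ld.so cache) can still load even if the helper reports nothing.

```python
import ctypes
import os


def find_in_ld_library_path(lib_name, ld_library_path=None):
    """Return the first file named lib_name found in an LD_LIBRARY_PATH directory, or None.

    Hypothetical helper for illustration only; it does not replicate the full
    dynamic-loader search order.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    for directory in ld_library_path.split(":"):
        if not directory:
            continue  # skip empty entries left by leading or trailing colons
        candidate = os.path.join(directory, lib_name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    name = "libnvToolsExt.so.1"
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    print("Found on path:", find_in_ld_library_path(name))
    try:
        # Same kind of load that raised the OSError in the original report
        ctypes.CDLL(name)
        print("Loaded", name)
    except OSError as exc:
        print("Could not load:", exc)
```

If LD_LIBRARY_PATH prints as unset or without the CUDA directories, the ENV line did not reach the image that the job is running in.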
