Visual Time series model training on GPU fails
Hello,
I'm getting this error whilst trying to train a time series model on GPU.
OSError: libnvToolsExt.so.1: cannot open shared object file: No such file or directory
I have done the following so far:
1. Built a CUDA 10.2-enabled base image on DSS and pushed the base images
2. Created a code environment and added the additional packages for visual time series forecasting (CUDA 10.2)
I've also tried using a Docker append to add cuda-nvtx-10-2 to the base image:
USER root
# Install cuda-nvtx-10-2
RUN yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo && \
    yum install -y cuda-nvtx-10-2 && \
    yum clean all
# Globally enable cuda-nvtx-10-2
ENV PATH=/usr/local/cuda-10.2/bin:${PATH} \
    LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:${LD_LIBRARY_PATH}
USER dataiku
The files are installed and available, but they're still not found when the code runs.
I've seen online that others resolved this by adding /usr/local/cuda/lib64 to $LD_LIBRARY_PATH, but I'm unable to do so. The ENV from the Docker append doesn't seem to take effect.
Does anyone have any suggestions?
Thanks
Riaan
Operating system used: centos (cloud stack)
Answers
Sergey (Dataiker)
Hi @RiaanB
Since you also reported this in a support ticket, I'm posting the answer here as well.
You will need to update LD_LIBRARY_PATH:
ENV LD_LIBRARY_PATH=/usr/local/cuda/compat:/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
Then rebuild the images. We are going to fix this permanently in an upcoming release.
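After the rebuild, it may help to confirm from inside the container that the variable actually reached the Python process. The following is only a rough diagnostic sketch, not anything DSS provides: the helper names `find_library_dirs` and `can_load` are made up for illustration, and it assumes you can run a small Python script in the same code environment and container that runs the training.

```python
import ctypes
import os


def find_library_dirs(lib_name, search_path):
    """Return the directories in a colon-separated search path that contain lib_name."""
    dirs = [d for d in search_path.split(":") if d]
    return [d for d in dirs if os.path.exists(os.path.join(d, lib_name))]


def can_load(lib_name):
    """Try to load the shared library by name, as the failing import would."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    lib = "libnvToolsExt.so.1"  # the library named in the error message
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print("LD_LIBRARY_PATH =", path or "(not set)")
    print("directories containing", lib, ":", find_library_dirs(lib, path))
    print("loadable:", can_load(lib))
```

If LD_LIBRARY_PATH prints as not set, or the directory list is empty, the ENV line most likely did not make it into the image that the job is actually using.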