Deep Learning / GPU training on Dataiku

Turribeach
edited July 16 in Setup & Configuration

We wanted to test some models using Deep Learning / GPU training on Dataiku, so I needed to set up the GPU in Dataiku. I found this post:

https://community.dataiku.com/t5/Setup-Configuration/How-do-I-use-a-graphics-processing-unit-GPU-with-Dataiku-DSS/m-p/2354

It provides some guidance, but it uses very old versions of CUDA and cuDNN. As we had already automated our DSS build with our cloud Linux images, I didn't want to use any of the GCP images that come ready for Deep Learning. So I worked through all the issues and put together my own series of steps to get everything working. Hopefully this will help someone else, assuming you have a similar environment.

VM: GCP n1-highmem-16

GPU: nvidia-tesla-t4

Google Image: rhel-7-v20210401

OS: RHEL 7.9

Steps

Edit /etc/default/grub (for example with vi) and add the nouveau blacklist to GRUB_CMDLINE_LINUX:

GRUB_CMDLINE_LINUX="crashkernel=auto console=ttyS0,38400n8 elevator=noop ipv6.disable=1 modprobe.blacklist=nouveau"

Then run:

grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
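
If you want to script the grub edit instead of using vi, something like the sketch below should work. The function name add_nouveau_blacklist is my own (hypothetical), and it assumes GRUB_CMDLINE_LINUX sits on a single double-quoted line, as in the example above. It keeps a .bak copy of the file and does nothing if the blacklist is already there.

```shell
# Append modprobe.blacklist=nouveau to GRUB_CMDLINE_LINUX in the given file,
# unless it is already present. sed keeps a backup copy as <file>.bak.
# Assumption: the setting is on one line and uses double quotes.
add_nouveau_blacklist() {
    grub_file="$1"
    if grep -q '^GRUB_CMDLINE_LINUX=.*modprobe\.blacklist=nouveau' "$grub_file"; then
        return 0
    fi
    sed -i.bak 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 modprobe.blacklist=nouveau"/' "$grub_file"
}

# Example (as root):
#   add_nouveau_blacklist /etc/default/grub
#   grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
```

You still need to run grub2-mkconfig afterwards, as in the step above.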

Packages needed (RHEL PDF):

yum install gcc make kernel-headers kernel-devel acpid libglvnd-glx libglvnd-opengl libglvnd-devel pkgconfig --enablerepo=epel --assumeyes

Packages needed (Dataiku community post):

yum install libstdc++.i686 dkms --enablerepo=epel --assumeyes


Needed by the driver:

yum install ocl-icd libva-vdpau-driver opencl-filesystem --enablerepo=epel --assumeyes
yum upgrade kernel kernel-devel --assumeyes
reboot

Check nouveau is not loaded (should return nothing)

lsmod | grep nouv

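Check that the running kernel and the kernel-devel package are the same version, because the Nvidia driver module is built against the kernel-devel headers. The sample output below shows a mismatch (1127.13.1 vs 1127.18.2); if you see this, you typically need to reboot into the upgraded kernel before continuing: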

uname -r
3.10.0-1127.13.1.el7.x86_64
rpm -q kernel-devel
kernel-devel-3.10.0-1127.18.2.el7.x86_64
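
To avoid comparing the two version strings by eye, a small check like this could be used. It is only a sketch: kernel_devel_matches is a hypothetical helper name, and it assumes the kernel-devel package name is exactly "kernel-devel-" followed by the uname -r string, which is the pattern in the output above.

```shell
# Succeeds only if one of the installed kernel-devel packages matches the
# running kernel exactly.
# $1: output of `uname -r`
# $2: output of `rpm -q kernel-devel` (one package per line)
kernel_devel_matches() {
    printf '%s\n' "$2" | grep -qxF "kernel-devel-$1"
}

# Example:
#   if kernel_devel_matches "$(uname -r)" "$(rpm -q kernel-devel)"; then
#       echo "kernel and kernel-devel match"
#   else
#       echo "mismatch: fix this before installing the driver"
#   fi
```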

Install the vulkan-filesystem package and the Nvidia drivers (download them from the web, and use the drivers that correspond to your GPU):

yum --nogpgcheck install /dataiku/installers/vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm --assumeyes
yum --nogpgcheck install /dataiku/installers/nvidia-diag-driver-local-repo-rhel7-410.129-1.0-1.x86_64.rpm --assumeyes
yum clean all
yum install cuda-drivers --assumeyes

Install CUDA 10

yum --nogpgcheck install /dataiku/installers/cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64.rpm --assumeyes
yum install cuda --assumeyes
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
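
Note that the two export lines only affect the current shell session. One way to make them persistent is a profile script, sketched below. The function name write_cuda_profile and the target /etc/profile.d/cuda.sh are my own choices (assumptions), and whether the DSS processes pick this up depends on how DSS is started on your machine.

```shell
# Write the CUDA environment settings to the given file so that they can be
# sourced by later shells. The LD_LIBRARY_PATH line only appends the old value
# when it is non-empty, which avoids a trailing ":" in the search path.
write_cuda_profile() {
    cat > "$1" <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
EOF
}

# Example (as root, hypothetical target path):
#   write_cuda_profile /etc/profile.d/cuda.sh
```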

Install cuDNN 7.4.2 for CUDA 10.0

tar -zxvf /dataiku/installers/cudnn-10.0-linux-x64-v7.4.2.24.tgz -C /dataiku/installers
cp -P /dataiku/installers/cuda/include/cudnn*.h /usr/local/cuda/include
cp -P /dataiku/installers/cuda/lib64/libcudnn* /usr/local/cuda/lib64
chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*

Check you can see the GPU

/usr/local/cuda/bin/nvcc --version
cat /proc/driver/nvidia/version
/usr/bin/nvidia-smi
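
If you want to script this check, the CUDA release can be pulled out of the nvcc output and compared with the version you expect (10.0 here, which is what tensorflow-gpu 1.15 is built against). This is only a sketch: cuda_release is a hypothetical helper name, and it assumes the usual "release X.Y" wording in the nvcc output.

```shell
# Read `nvcc --version` output on stdin and print the CUDA release,
# for example 10.0. Prints nothing if no "release X.Y" text is found.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Example:
#   found=$(/usr/local/cuda/bin/nvcc --version | cuda_release)
#   [ "$found" = "10.0" ] || echo "unexpected CUDA release: $found"
```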

Test on Dataiku. I used these packages:

tensorflow-gpu==1.15.0
keras==2.3.1
keras-preprocessing==1.1.0
scikit-learn>=0.20,<0.21
scipy>=1.2,<1.3
statsmodels>=0.10,<0.11
jinja2>=2.10,<2.11
flask>=1.0,<1.1
h5py==2.10.0
pillow==6.2.2
cloudpickle>=1.3,<1.6
matplotlib==3.3.4
