Using NVIDIA MIG instances
Hi, I have a containerized instance of DSS running without any problems. However, whenever I start it with MIG instances using docker run -d --runtime=nvidia --gpus '"device=3:0,3:1"' and enable GPUs on the Design>Runtime Environment tab, DSS keeps saying "Failed to fetch GPU stats, maybe nvidia-smi utility is not found" and I'm unable to see my GPU usage. I'd be very happy if there is a solution to this.
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
Hi, I am not familiar with NVIDIA Multi-Instance GPU (MIG) technology, but from a quick Google search it appears that it enables multiple GPU instances to run in parallel on a single physical NVIDIA Ampere architecture GPU. I am not sure whether that will work for ML training or for GPU-based ML workloads in Dataiku, so that's one for you to research. Irrespective of that, you should be aware that it is not enough to expose the GPU hardware to the container or DSS instance. You also need to add all the additional software and drivers that NVIDIA GPUs need to work, including the missing nvidia-smi mentioned in your post above. These steps will of course depend on the OS, version, and architecture that your container is running on. It's not a trivial setup, given that all the software versions need to be compatible with each other and with the GPU used, and this is not always clearly stated for each of the software components. Pretty much all the cloud vendors provide OS images with these software components pre-installed and properly configured for GPU-enabled instances, so you may be able to leverage those.
Over 3 years ago I wrote this post, which is a complete guide to all the setup steps needed to get GPU training working in a Dataiku instance running on RHEL v7.9. While this post is now outdated, it will give you a rough idea of all the steps involved. It will be up to you to work out the specific steps for your required environment.
-
Thanks for your reply. However, GPU training is not a problem when using the GPU itself, so I'm not just exposing the hardware and hoping that it works. I am running my Docker container the same way I do with many other containers that use NVIDIA MIG. I can also access nvidia-smi from inside the container, even though DSS is unable to find it.
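For anyone wanting to compare what an interactive shell sees with what a process running as the DSS user sees, here is a minimal Python sketch. It only checks whether nvidia-smi resolves on the current PATH and, if so, runs `nvidia-smi -L` and picks out the MIG lines. The function names are my own, and the assumption that DSS locates nvidia-smi through PATH is a guess, not something confirmed in this thread.

```python
import shutil
import subprocess


def list_mig_devices(smi_list_output):
    """Return the lines of `nvidia-smi -L` output that describe MIG devices.

    Assumes MIG entries are printed as indented lines starting with "MIG ".
    """
    return [
        line.strip()
        for line in smi_list_output.splitlines()
        if line.strip().startswith("MIG ")
    ]


def probe_nvidia_smi(executable="nvidia-smi"):
    """Check whether the executable is on PATH and, if so, list its devices."""
    path = shutil.which(executable)
    if path is None:
        # Not visible to this process, even if an interactive shell finds it.
        return {"path": None, "returncode": None, "mig_devices": []}
    result = subprocess.run(
        [path, "-L"], capture_output=True, text=True, timeout=30
    )
    return {
        "path": path,
        "returncode": result.returncode,
        "mig_devices": list_mig_devices(result.stdout),
    }


if __name__ == "__main__":
    print(probe_nvidia_smi())
```

Running it once from a shell inside the container and once from a context started by DSS (for example a Python notebook or recipe) should show whether the two environments resolve the same binary and see the same MIG devices.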
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
What does the backend log show, and where exactly are you enabling the GPU in DSS?