Added on November 11, 2024 7:14PM
Hello Dataiku Community,
We’re currently working on a project where we require GPU support for intensive training tasks. Our setup includes a primary Dataiku instance that is already configured and running in our AWS environment. To handle the more resource-intensive parts of our workflow, we’re exploring the option of adding a separate GPU instance within the same subnet to leverage its processing power directly from our main Dataiku instance.
Our goal is to offload certain tasks, like model training, to this GPU instance while maintaining the main instance for orchestration and general workflows. Here’s what we’d like to understand better:
Thank you in advance for your guidance! We’re eager to hear your thoughts and learn from the community’s expertise.
Best regards,
Isaac Chávez
First of all, let me say that getting a GPU working in Dataiku is not usually a trivial task. The difficulty will vary a lot depending on which GPU you want to work with, what software stack you want to use, and how you want to expose the GPU to Dataiku. I covered some of the issues in a recent answer I wrote and its linked post, which I strongly suggest you read to get an idea.
Using GPUs in a cloud context becomes expensive very quickly. Therefore the usual approach is to have the GPUs active only during the training phase and shut them down while you are not using them, so you stop paying for them. This elastic computation pattern is what Kubernetes is designed for. So start by giving a good read to the Elastic AI computation Dataiku documentation page, which covers the integration with Kubernetes in great detail. It has three main sections covering the three main cloud vendors (AWS, Azure and GCP) and their managed Kubernetes services (EKS, AKS and GKE). Each section also explains the requirements for building Docker images with CUDA and GPU support to use inside Kubernetes. Be warned that bringing Kubernetes into the mix will raise the complexity level even further, so this is not an easy setup. The level of complexity will depend a lot on what permissions you have in your cloud account and how you integrate with the relevant Kubernetes service.
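To give a feel for why the elastic pattern matters, here is a rough back-of-the-envelope comparison. The hourly rate below is a made-up placeholder, not a real AWS price; plug in the actual on-demand rate for your instance type.

```python
# Rough cost comparison: an always-on GPU node vs. one that Kubernetes
# scales up only for a daily training window.
GPU_HOURLY_RATE = 3.00  # hypothetical $/hour, NOT a real AWS price


def monthly_cost(hours_per_day: float,
                 rate: float = GPU_HOURLY_RATE,
                 days: int = 30) -> float:
    """Cost of running a GPU instance hours_per_day, every day, for a month."""
    return hours_per_day * rate * days


always_on = monthly_cost(24)  # node never shut down
elastic = monthly_cost(2)     # node only up for a 2-hour daily training window

print(f"Always-on: ${always_on:,.2f}/month")  # $2,160.00/month
print(f"Elastic:   ${elastic:,.2f}/month")    # $180.00/month
print(f"Savings:   ${always_on - elastic:,.2f}/month")
```

Even with a modest rate, leaving the node idle around the clock costs an order of magnitude more than spinning it up only when a training job needs it.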
At this stage it might make sense for you to get a CUDA/GPU cloud VM already configured by your cloud vendor, install Dataiku on top, and run a POC to find out whether GPU training is really what your Dataiku use cases need.
There is also no guarantee that your model training will be faster on a GPU. Often only certain types of machine learning benefit from GPU training, typically deep learning and neural networks. In addition, the ML algorithm you want to use needs to support GPU training; it's not as simple as flipping a switch and having everything run on the GPU. Dataiku does support running Keras models on GPUs, so in that sense it does allow for "switch flipping" GPU training. But I wouldn't underestimate the effort required to set up those GPUs, nor the fact that you wouldn't want them sitting idle for most of the day if you don't have workloads to train on them.
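To make the "algorithm must support it" point concrete, here is an illustrative sketch of a guard that only routes training to a GPU when both the hardware is visible and the algorithm can actually use it. The algorithm names, capability table, and helper functions are made up for this example; they are not a Dataiku API.

```python
import shutil

# Hypothetical capability table -- illustrative only, not from Dataiku.
GPU_CAPABLE_ALGORITHMS = {
    "keras_deep_learning": True,   # deep learning typically benefits from GPUs
    "xgboost": True,               # XGBoost supports GPU training (device="cuda")
    "random_forest": False,        # scikit-learn estimators are CPU-only
    "logistic_regression": False,
}


def gpu_visible() -> bool:
    """Crude host-level check: is the NVIDIA driver CLI on the PATH?"""
    return shutil.which("nvidia-smi") is not None


def pick_device(algorithm: str, gpu_available: bool) -> str:
    """Return 'gpu' only when the algorithm supports it AND a GPU is visible."""
    if gpu_available and GPU_CAPABLE_ALGORITHMS.get(algorithm, False):
        return "gpu"
    return "cpu"


print(pick_device("keras_deep_learning", gpu_available=True))   # gpu
print(pick_device("random_forest", gpu_available=True))         # cpu
print(pick_device("keras_deep_learning", gpu_available=False))  # cpu
```

The point of the sketch is that a visible GPU is necessary but not sufficient: a CPU-only algorithm like a scikit-learn random forest falls back to CPU no matter what hardware is attached.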