Guidance on Integrating GPU Instance with Main Dataiku Instance and Potential Alternatives
Hello Dataiku Community,
We’re currently working on a project where we require GPU support for intensive training tasks. Our setup includes a primary Dataiku instance that is already configured and running in our AWS environment. To handle the more resource-intensive parts of our workflow, we’re exploring the option of adding a separate GPU instance within the same subnet to leverage its processing power directly from our main Dataiku instance.
Our goal is to offload certain tasks, like model training, to this GPU instance while maintaining the main instance for orchestration and general workflows. Here’s what we’d like to understand better:
- Integrating the GPU Instance: How can we configure the primary Dataiku instance to recognize and utilize the GPU resources from a secondary instance? Is there a specific setup or configuration, such as using remote nodes, that would allow for seamless interaction between the two instances?
- Best Practices for Distributed Workflows: If anyone has experience distributing workflows between a main Dataiku instance and a GPU node, could you share best practices? We’d like to ensure efficient data exchange and processing between the instances.
- Exploring Alternatives: Finally, are there other, perhaps more efficient, alternatives to directly adding a GPU instance? For example, would integrating Dataiku DSS with EKS (Elastic Kubernetes Service) for dynamic GPU scaling be a viable solution? We’re weighing scalability and cost-effectiveness, so any insights into your experiences with different setups would be highly appreciated.
Thank you in advance for your guidance! We’re eager to hear your thoughts and learn from the community’s expertise.
Best regards,
Isaac Chávez
Answers
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,126 Neuron
First of all, let me say that getting a GPU working in Dataiku is usually not a trivial task. The difficulty will vary a lot depending on which GPU you want to work with, what software stack you want to use and how you want to expose the GPU to Dataiku. I covered some of the issues in a recent answer I wrote and its linked post, which I strongly suggest you read to get an idea.
Using GPUs in a cloud context becomes expensive very quickly. Therefore the usual approach is to have the GPUs active only during the training phase and shut them down when you are not using them, so you stop paying for them. This elastic computation pattern is what Kubernetes is designed for. So start by giving a good read to the Elastic AI computation Dataiku documentation page, which covers the integration with Kubernetes in great detail. There are three main sections in it, covering the three main cloud vendors (AWS, Azure and GCP) and their respective Kubernetes services (EKS, AKS and GKE). Each of the sections also explains the requirements to build Docker images with CUDA and GPU support for use inside Kubernetes.

Be warned that bringing Kubernetes into the mix will raise the complexity level even further, so this is not an easy setup. The level of complexity will depend a lot on what permissions you have in your cloud account and how you integrate with the relevant Kubernetes service. At this stage it might make sense for you to get a CUDA/GPU cloud VM already configured by your cloud vendor and install Dataiku on top of it to run a POC, to really find out whether GPU training is what your Dataiku use cases need.
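To make the cost argument concrete, here is a rough back-of-the-envelope sketch. The hourly rate and daily training hours are purely illustrative assumptions, not real AWS prices; the point is the ratio between an always-on GPU instance and the elastic pattern where the GPU node only exists while training runs:

```python
# Illustrative cost comparison: always-on GPU instance vs. elastic
# (Kubernetes-style) GPU usage. All figures are made-up assumptions.
HOURLY_GPU_RATE = 3.00   # assumed $/hour for a GPU instance (not a real price)
HOURS_PER_MONTH = 730    # average hours in a month

def elastic_monthly_cost(active_hours_per_day: float) -> float:
    """Cost when the GPU node only runs while training jobs are active."""
    return HOURLY_GPU_RATE * active_hours_per_day * 30

# GPU instance left running 24/7, even when idle
always_on = HOURLY_GPU_RATE * HOURS_PER_MONTH

# Elastic pattern: roughly 2 hours of training per day
elastic = elastic_monthly_cost(2)

print(f"Always-on: ${always_on:,.2f}/month")   # $2,190.00/month
print(f"Elastic:   ${elastic:,.2f}/month")     # $180.00/month
```

Even with these toy numbers the always-on node costs an order of magnitude more, which is why the shut-down-when-idle pattern is worth the extra Kubernetes complexity once GPU workloads are intermittent.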
There is also no guarantee that your model training will be faster on a GPU. Often only certain specific types of machine learning benefit from GPU training, typically deep learning and neural networks. In addition, for you to do GPU training the ML algorithm you want to use needs to support it; it's not as simple as flipping a switch and having it just run on the GPU. Dataiku does support running Keras models on GPUs, so in that sense it does allow for "switch flipping" GPU training. But I wouldn't underestimate the effort required to set up those GPUs, nor the fact that you wouldn't want them sitting idle for most of the day if you don't have workloads to train on them.
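As a practical illustration of the "it's not just flipping a switch" point, a minimal sketch of the kind of pre-flight check you might run in a Python recipe before enabling GPU training. It only probes for the NVIDIA driver tooling (`nvidia-smi`) on the node; the function name and fallback logic are my own illustrative choices, not a Dataiku API:

```python
import shutil
import subprocess

def gpu_available() -> bool:
    """Crude check: is the NVIDIA driver toolchain present and responding?

    A True result still doesn't mean your chosen ML algorithm supports
    GPU training -- it only means the node can see a GPU at all.
    """
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True)
        return result.returncode == 0
    except OSError:
        return False

# Fall back to CPU training rather than failing when no GPU is visible
device = "cuda" if gpu_available() else "cpu"
print(f"Training device: {device}")
```

A check like this is cheap insurance in an elastic setup, where a job may land on a CPU-only node if the GPU node pool hasn't scaled up yet.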