Using Dask with Dataiku DSS
GusCav
Hey everyone - Gus from Coiled (and formerly Dataiku) here. We recently started discussions with @DataikuCarlyT and the Dataiku team about how DSS users might use Dask. Dask is used across end-to-end ML workloads as well as more general scientific computing applications.
Based on my experience at Dataiku and Coiled, I wanted to offer the following recommendations to DSS Python users interested in Dask:
- Use Dask within your DSS project via a Jupyter Notebook or Python recipe
- Install Dask and/or Distributed in a DSS code environment
- Use Dask via the default multithreaded/multiprocessing scheduler, or with the distributed scheduler
- This would allow Dask to use the cores and RAM of the compute DSS provisioned for that notebook or recipe, for example:
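A minimal sketch of this option (the array shape, worker count, and thread count are illustrative, not recommendations):

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Default scheduler: Dask arrays/dataframes run on a local thread pool,
# i.e. the cores and RAM DSS provisioned for this notebook or recipe.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# Or opt into the distributed scheduler, still on the same machine:
cluster = LocalCluster(n_workers=4, threads_per_worker=2)  # sizes are illustrative
client = Client(cluster)
print("Dashboard:", client.dashboard_link)
```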
- Connect to an existing external Dask cluster
- (Assumes IT has already deployed Dask on infrastructure such as HPC, Kubernetes, or YARN)
- Connect to the external cluster from a Jupyter notebook or Python recipe in code, as sketched below
- This would allow the user to access the existing cluster's resources
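A sketch of the connection step; the scheduler address is a placeholder for whatever endpoint your IT team exposes:

```python
from dask.distributed import Client

# Placeholder address -- substitute the scheduler endpoint your IT team
# provides for the existing cluster (Kubernetes, YARN, HPC, ...).
client = Client("tcp://dask-scheduler.example.com:8786")
print("Dashboard:", client.dashboard_link)
```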
- Create and use external Dask clusters via Coiled
- Create a Coiled account
- Install the coiled Python package in your Dataiku code environment
- Create, scale, and stop Dask clusters from your Jupyter Notebook or Python recipe, for example:
```python
import coiled
from dask.distributed import Client

# Create (or reuse) a named 10-worker Coiled cluster
cluster = coiled.Cluster(n_workers=10, name="dask_from_dku")

# Connect a Dask client to it and surface the dashboard link
client = Client(cluster)
print('Dashboard:', client.dashboard_link)
```
- This would allow the user to configure cluster size and instance type (e.g., GPUs) and control the software dependencies on the cluster.
I think some exploration is required to read/write Dask collections through the Dataiku dataset API. If completed, this work might be best packaged as a DSS plugin; a rough starting point is sketched below.
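One possible starting point for that exploration is to round-trip through pandas. This is only a sketch, and the dataset names, partition count, and `amount` column are hypothetical; it also materializes the full dataset in driver memory, which is part of why a proper plugin would be worthwhile:

```python
import dataiku
import dask.dataframe as dd

# Hypothetical dataset names for illustration
input_ds = dataiku.Dataset("my_input")
pdf = input_ds.get_dataframe()  # pandas DataFrame via the DSS dataset API

ddf = dd.from_pandas(pdf, npartitions=8)  # hand off to Dask
ddf = ddf[ddf["amount"] > 0]              # ...do distributed work...

# Collect back to pandas, then write through the DSS dataset API
output_ds = dataiku.Dataset("my_output")
output_ds.write_with_schema(ddf.compute())
```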
I look forward to discussing this with the community.