Using Dask with Dataiku DSS

edited July 16

Hey everyone - Gus from Coiled (and formerly dku) here. We recently started discussions with @DataikuCarlyT and the Dataiku team about how DSS users might use Dask. Dask is used across end-to-end ML workloads as well as more general scientific computing applications.

Based on my experience at Dataiku and Coiled, I wanted to offer the following recommendations to DSS Python users interested in Dask:

  1. Use Dask within your DSS project via a Jupyter Notebook or Python recipe
  2. Connect to an existing external Dask cluster
    • (Assumes IT has already deployed Dask on a cluster, e.g., HPC, Kubernetes, or YARN)
    • Connect to the external cluster from a Jupyter Notebook or Python recipe by pointing a Dask client at the scheduler address
    • This would allow the user to access the existing external cluster resources
  3. Create and use external Dask clusters via Coiled
    • Create a Coiled account
    • Install the coiled Python package in your Dataiku code environment
    • Create, scale, and stop Dask clusters from your Jupyter Notebook or Python recipe

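import coiled
from dask.distributed import Client

# Create a 10-worker Dask cluster through Coiled
cluster = coiled.Cluster(n_workers=10, name="dask_from_dku")

# Attach a Dask client so later Dask work runs on that cluster
client = Client(cluster)
print('Dashboard:', client.dashboard_link)

# Resize with cluster.scale(20); stop with client.close() and cluster.close()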

  • This would allow the user to configure cluster size and type (e.g., GPUs) and to control software dependencies on the cluster.
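
For option 2 (an existing external cluster), the connection step might look something like the sketch below. It is only an illustration: the environment variable name `MY_DASK_SCHEDULER` and the fallback address are placeholders I made up, so swap in whatever address your IT team gives you. Port 8786 is the Dask scheduler's default.

```python
import os

# Hypothetical values: replace with the address your IT team provides.
# 8786 is the default port of the Dask scheduler.
FALLBACK_ADDRESS = "tcp://dask-scheduler.example.com:8786"
ADDRESS_ENV_VAR = "MY_DASK_SCHEDULER"  # hypothetical variable name


def resolve_scheduler_address(environ=None):
    """Return the scheduler address from the environment, or the fallback."""
    environ = os.environ if environ is None else environ
    return environ.get(ADDRESS_ENV_VAR, FALLBACK_ADDRESS)


def connect(address=None, timeout=10):
    """Open a Dask client against an already-running scheduler."""
    # Imported lazily so this module still loads where dask is not installed.
    from dask.distributed import Client

    return Client(address or resolve_scheduler_address(), timeout=timeout)
```

From a notebook or Python recipe you would then call `client = connect()` and print `client.dashboard_link`, same as in the Coiled example. Keeping the dask and distributed versions in the DSS code environment in line with the ones on the cluster should help avoid version mismatch problems.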

I think some exploration is still required to read and write through the Dataiku dataset API. Once that is worked out, it might be best packaged as a DSS plugin.
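
As a starting point for that exploration, here is a naive sketch that goes through pandas. It is not a tested integration: the function names and the default partition count are my own, and everything passes through the memory of the notebook or recipe process, so it only suits datasets that already fit in pandas. It assumes it runs inside DSS, where the `dataiku` package is available.

```python
def read_dataset_as_dask(dataset_name, npartitions=4):
    """Load a DSS dataset into pandas, then split it into a Dask DataFrame."""
    import dataiku  # only importable inside DSS
    import dask.dataframe as dd

    pdf = dataiku.Dataset(dataset_name).get_dataframe()
    return dd.from_pandas(pdf, npartitions=npartitions)


def write_dask_to_dataset(ddf, dataset_name):
    """Collect a Dask DataFrame back to pandas and write it to a DSS dataset."""
    import dataiku  # only importable inside DSS

    # compute() pulls the full result back to this process before writing
    dataiku.Dataset(dataset_name).write_with_schema(ddf.compute())
```

A proper plugin would presumably want workers to read and write partitions directly rather than funnel everything through one process, which is the part that still needs exploring.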

I look forward to discussing this with the community.

