Using Dask with Dataiku DSS
GusCav
Hey everyone - Gus from Coiled (and formerly Dataiku) here. We recently started discussions with @DataikuCarlyT and the Dataiku team about how DSS users might use Dask. Dask is used across end-to-end ML workloads as well as more general scientific computing applications.
Based on my experience at Dataiku and Coiled, I wanted to offer the following recommendations to DSS Python users interested in Dask:
- Use Dask within your DSS project via a Jupyter Notebook or Python recipe
- Install Dask and/or Distributed in a DSS code environment
- Use Dask via the default multithreaded/multiprocessing scheduler, or with the distributed scheduler
- This would allow Dask to use the cores and RAM of the compute DSS provisioned for that notebook or recipe, for example:
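A minimal sketch of this option (the array shape, worker count, and thread count are illustrative, not recommendations):

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Default scheduler: Dask arrays/dataframes run on a local thread pool,
# i.e. the cores and RAM DSS provisioned for this notebook or recipe.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# Or opt into the distributed scheduler, still on the same machine:
cluster = LocalCluster(n_workers=4, threads_per_worker=2)  # sizes are illustrative
client = Client(cluster)
print("Dashboard:", client.dashboard_link)
```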
- Connect to an existing external Dask cluster
- (Assumes IT has already deployed Dask on infrastructure such as HPC, Kubernetes, or YARN)
- Connect to the external cluster from a Jupyter notebook or Python recipe in code, as sketched below
- This would allow the user to access the existing cluster's resources
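A sketch of the connection step; the scheduler address is a placeholder for whatever endpoint your IT team exposes:

```python
from dask.distributed import Client

# Placeholder address -- substitute the scheduler endpoint your IT team
# provides for the existing cluster (Kubernetes, YARN, HPC, ...).
client = Client("tcp://dask-scheduler.example.com:8786")
print("Dashboard:", client.dashboard_link)
```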
- Create and use external Dask clusters via Coiled
- Create a Coiled account
- Install the coiled Python package in your Dataiku code environment
- Create, scale, and stop Dask clusters from your Jupyter Notebook or Python recipe, for example:
```python
import coiled
from dask.distributed import Client

# Create (or reuse) a named 10-worker Coiled cluster
cluster = coiled.Cluster(n_workers=10, name="dask_from_dku")

# Connect a Dask client to it and surface the dashboard link
client = Client(cluster)
print('Dashboard:', client.dashboard_link)
```
- This would allow the user to configure cluster size and instance type (e.g., GPUs) and control the software dependencies on the cluster.
I think some exploration is required to read/write Dask collections through the Dataiku dataset API. If completed, this work might be best packaged as a DSS plugin; a rough starting point is sketched below.
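One possible starting point for that exploration is to round-trip through pandas. This is only a sketch, and the dataset names, partition count, and `amount` column are hypothetical; it also materializes the full dataset in driver memory, which is part of why a proper plugin would be worthwhile:

```python
import dataiku
import dask.dataframe as dd

# Hypothetical dataset names for illustration
input_ds = dataiku.Dataset("my_input")
pdf = input_ds.get_dataframe()  # pandas DataFrame via the DSS dataset API

ddf = dd.from_pandas(pdf, npartitions=8)  # hand off to Dask
ddf = ddf[ddf["amount"] > 0]              # ...do distributed work...

# Collect back to pandas, then write through the DSS dataset API
output_ds = dataiku.Dataset("my_output")
output_ds.write_with_schema(ddf.compute())
```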
I look forward to discussing this with the community.