Support for Dask distributed jobs?
Hello,
I'm currently evaluating the various engines available in DSS, and I was wondering whether Dask support is something Dataiku is currently working on.
We tried PySpark in the past, but it seems like overkill for our use case (we have thousands of small partitions), and we never really managed to get it running anyway. Dask looks more suitable for small- to medium-sized jobs, without the Hadoop overhead.
Any thoughts about it?
Best,
Romain
Best Answer
-
Hi,
We studied leveraging Dask for our Visual ML in the past, but we encountered various stability issues that forced us to rule it out for that specific usage.
We are not currently considering adding Dask as an execution engine for visual recipes in Dataiku.
However, you should be able to leverage Dask as you wish in Python recipes and notebooks. Note that you'll still need to provide the cluster (for example, on Kubernetes) that Dask will leverage.
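For illustration, a minimal sketch of that pattern in a Python recipe could look like the following. The dataset names, the grouping column, and the scheduler address are all placeholders, and the sketch assumes the input fits in memory in the recipe's process, since it goes through pandas first:

```python
import dataiku
import dask.dataframe as dd
from dask.distributed import Client

# Attach to a Dask cluster you provisioned yourself (the address
# below is a placeholder, e.g. a scheduler running on Kubernetes)
client = Client("tcp://dask-scheduler:8786")

# Read the input dataset into pandas, then partition it for Dask.
# This assumes the data fits in memory in the recipe's process.
df = dataiku.Dataset("mydataset").get_dataframe()
ddf = dd.from_pandas(df, npartitions=16)

# Any distributed computation; "some_column" is a placeholder
result = ddf.groupby("some_column").mean().compute().reset_index()

# Write the collected pandas result back to the output dataset
dataiku.Dataset("mydataset_prepared").write_with_schema(result)
```

Note that dd.from_pandas only re-partitions data that has already been loaded, so this does not avoid the initial single-process read.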
Answers
-
I understand it might be tricky to include Dask as an engine in DSS's backend, but would it be possible to allow reading datasets into Dask data structures through the API nonetheless? For instance, a dataiku.Dataset.get_dask_dataframe() method (returning a handle to a Dask dataframe object) alongside the traditional dataiku.Dataset.get_dataframe(), which returns a pandas dataframe.
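In the meantime, something along these lines can approximate that in user code. The get_dask_dataframe helper below is hypothetical (it is not part of the Dataiku API); it relies on Dataset.iter_dataframes() for chunked reads, and the chunks are still materialized locally before Dask sees them:

```python
import dataiku
import dask
import dask.dataframe as dd

def get_dask_dataframe(dataset_name, chunksize=100000):
    """Hypothetical helper mimicking the requested API.

    Reads the dataset in pandas chunks via iter_dataframes() and
    assembles them into a Dask dataframe. Chunks are still pulled
    sequentially through the local process, so this is a
    convenience wrapper, not truly parallel ingestion.
    """
    ds = dataiku.Dataset(dataset_name)
    chunks = [dask.delayed(chunk)
              for chunk in ds.iter_dataframes(chunksize=chunksize)]
    return dd.from_delayed(chunks)

# Usage: "mydataset" is a placeholder dataset name
ddf = get_dask_dataframe("mydataset")
print(ddf.npartitions)
```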
-
Sampathvinta
Restarting a very old thread, but I have the same question about Dask: are there any thoughts on providing a Dask dataframe handle instead of a pandas dataframe handle?
Alternatively, if I get a pandas dataframe and then convert it to Dask, would that give me the same experience as getting a Dask dataframe directly from a Dataiku dataset?