I'm currently evaluating the various engines available in DSS, and I was wondering whether Dask is something Dataiku is currently working on?
We tried to use PySpark in the past, but it might be overkill for our use case (we have thousands of small partitions), and we never really managed to get it running anyway. Dask seems more suitable for small- to medium-sized jobs, without the Hadoop overhead.
Any thoughts about it?
In the past we studied leveraging Dask for our Visual ML, but we encountered various stability issues that forced us to rule it out for that specific usage.
We are not currently considering adding Dask as an execution engine for visual recipes in Dataiku.
However, you should be able to leverage Dask as you wish in Python recipes and notebooks. Note that you'll still need to provide the cluster (Kubernetes, for example) that Dask will leverage.
I understand it might be tricky to include Dask as an engine within DSS's backend, but could it be possible to allow reading into Dask data structures through the API nonetheless? For instance, having dataiku.Dataset.get_dask_dataframe() (returning a handle to a Dask DataFrame) alongside the traditional dataiku.Dataset.get_dataframe(), which returns a pandas DataFrame.