Support for Dask distributed jobs?

Solved!
rmnvncnt
Level 3

Hello,

I'm currently evaluating various engines available in DSS and I was wondering if Dask was something Dataiku was currently working on?

We tried to use PySpark in the past, but it might be overkill for our use case (we have thousands of small partitions) and we never really managed to get it running anyway. Dask seems a bit more suitable for small to medium sized jobs, without the Hadoop overhead.

Any thoughts about it?

Best,

Romain


3 Replies
Clément_Stenac

Hi,

In the past we studied leveraging Dask for our Visual ML, but we encountered various stability issues that forced us not to pursue it for that specific usage.

We are not currently considering adding Dask as an execution engine for visual recipes in Dataiku.

However, you should be able to leverage Dask as you wish in Python recipes and notebooks. Note that you'll still need to provide the cluster (Kubernetes for example) that Dask will leverage.
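For example, a minimal sketch of what that could look like in a Python recipe, assuming a Dask scheduler is already running and reachable (say at tcp://dask-scheduler:8786 on a Kubernetes cluster you manage yourself) and using placeholder dataset names:

```python
# Minimal sketch: using Dask from a DSS Python recipe against an externally
# managed Dask cluster. The scheduler address and dataset names are placeholders.
import dataiku
import dask.dataframe as dd
from dask.distributed import Client

# Attach to the Dask cluster you provision yourself (DSS does not manage it)
client = Client("tcp://dask-scheduler:8786")

# Read the input dataset with the regular pandas API, then hand it to Dask
input_ds = dataiku.Dataset("transactions")          # placeholder dataset name
pdf = input_ds.get_dataframe()
ddf = dd.from_pandas(pdf, npartitions=16)

# Example distributed computation
result = ddf.groupby("customer_id")["amount"].sum().compute().reset_index()

# Write the result back to a DSS output dataset
output_ds = dataiku.Dataset("customer_totals")      # placeholder dataset name
output_ds.write_with_schema(result)
```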

rmnvncnt
Level 3
Author

I understand it might be tricky to include Dask as an engine in DSS's backend, but would it be possible to allow reading datasets into Dask data structures through the API nonetheless? For instance, a dataiku.Dataset.get_dask_dataframe() method (returning a handle to a Dask dataframe object) alongside the traditional dataiku.Dataset.get_dataframe(), which returns a pandas dataframe.
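In the meantime, here is a rough sketch of how I imagine approximating that today. Note that get_dask_dataframe() does not exist; this just streams pandas chunks with iter_dataframes() and wraps them into a Dask dataframe, so the chunks still pass through the recipe's local memory:

```python
# Workaround sketch (no native Dask reader): build a Dask dataframe from the
# pandas chunks returned by iter_dataframes(). The dataset name is a placeholder.
import dataiku
import dask
import dask.dataframe as dd

ds = dataiku.Dataset("my_dataset")                   # placeholder dataset name

# Read the dataset in pandas chunks and wrap each one as a delayed partition
chunks = [dask.delayed(chunk)
          for chunk in ds.iter_dataframes(chunksize=100_000)]
ddf = dd.from_delayed(chunks)

print(ddf.npartitions, list(ddf.columns))
```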

Sampathvinta

Restarting a very old thread, but I have the same question about Dask: are there any thoughts on providing a Dask dataframe handle instead of a pandas dataframe handle?

Or, if I get a pandas dataframe and then convert it to Dask, would that give me the same experience as getting a Dask dataframe directly from a Dataiku dataset?
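To make the question concrete, this is the conversion path I mean (with "my_dataset" as a placeholder name):

```python
# Sketch of the pandas -> Dask conversion path. This is not equivalent to a
# native Dask reader: the whole dataset is first loaded into local pandas
# memory, and only then split into partitions that Dask can process in parallel.
import dataiku
import dask.dataframe as dd

pdf = dataiku.Dataset("my_dataset").get_dataframe()   # full load into local memory
ddf = dd.from_pandas(pdf, npartitions=8)              # then partition for Dask

# From here on the operations are lazy and parallel as usual, but the initial
# read is still bounded by the memory of the machine running this code.
print(ddf.map_partitions(len).compute().sum())
```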
