I'm currently evaluating the various engines available in DSS, and I was wondering whether Dask is something Dataiku is currently working on?
We tried to use PySpark in the past, but it might be overkill for our use case (we have thousands of small partitions), and we never really managed to get it running anyway. Dask seems more suitable for small- to medium-sized jobs, without the Hadoop overhead.
Any thoughts about it?
In the past we studied leveraging Dask for our Visual ML, but we encountered various stability issues that forced us to rule it out for that specific usage.
We are not currently considering adding Dask as an execution engine for visual recipes in Dataiku.
However, you should be able to leverage Dask as you wish in Python recipes and notebooks. Note that you'll still need to provide the cluster (Kubernetes, for example) that Dask will leverage.
I understand it might be tricky to include Dask as an engine within DSS's backend, but could it be possible to allow reading into Dask data structures through the API nonetheless? For instance, having dataiku.Dataset.get_dask_dataframe() (returning a handle to a Dask DataFrame) alongside the traditional dataiku.Dataset.get_dataframe(), which returns a pandas DataFrame.