ELT best practices? (workspace, intermediate data sets, views...)

aburkh
aburkh Registered Posts: 9 ✭✭✭
edited December 10 in Using Dataiku

We are preparing a SAS to Databricks migration and we are considering Dataiku as a low-code ETL tool for non-technical users. Dataiku feels very close to an awesome experience, but there are a few issues that make me worry about the sustainability of such an approach.

Do you have recommended best practices to mitigate these issues? Might Dataiku improve some of them in the future? Or is Dataiku the wrong tool for the job, and would we be better off with Informatica, Qlik, or another graphical ETL tool?

==============================================================================================

Workspace (transient datasets)

Most other data management tools have a concept of a workspace, an internal area where data is processed without being exposed or materialized. This minimizes dataset pollution and enables efficient processing.

  • In SAS, users work in the context of a "session" that lasts only until they disconnect. All intermediate data sets are stored in a work area and are discarded when the session ends. When users want to persist data sets, they define a storage location (outside of the work area).
  • In Python/pandas/Spark, users can define data frames that reside only in memory.
  • In SAP HANA, a calculation view is composed of many intermediate steps that are not directly exposed to users.
  • In Databricks/Snowflake and many other databases, the concept of "temporary tables" or "temporary views" serves that purpose (see the sketch after this list).
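
To make the comparison concrete, here is a minimal Python sketch of the in-memory and session-scoped patterns above (column and view names are hypothetical): a pandas frame that never touches storage, and a Spark temporary view that disappears when the session ends.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas: the intermediate frame lives only in memory, nothing is materialized
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 35.0]})
large_orders = orders[orders["amount"] > 15]

# Spark: a temporary view is scoped to the session and dropped when it ends,
# much like a SAS work library or a database temporary table
spark.createDataFrame(large_orders).createOrReplaceTempView("large_orders")
spark.sql("SELECT COUNT(*) AS n, SUM(amount) AS total FROM large_orders").show()
```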

Dataiku could have a concept of transient datasets that are only used during the execution of a flow/scenario. If the datasets are backed by a filesystem (S3, etc.), they could be stored in separate temp folders (one per "session"). If the datasets are backed by a database, they could use "temporary" views/tables.
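
In the meantime, one way to approximate transient datasets is to clear the intermediate datasets at the end of a scenario. A minimal sketch using the Dataiku Python API; the project key and dataset names are placeholders.

```python
import dataiku

# Clear intermediate datasets once the final outputs have been built.
# "SALES_MART" and the staging dataset names are hypothetical.
client = dataiku.api_client()
project = client.get_project("SALES_MART")

for name in ["stg_orders", "stg_customers", "stg_joined"]:
    # clear() drops the underlying data but keeps the dataset definition in the flow
    project.get_dataset(name).clear()
```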

==============================================================================================

ETL/ELT, Spark & SQL

Dataiku has good integration with SQL data platforms (e.g. Databricks/Snowflake), but we keep bumping into SQL compatibility issues: column names get uppercased by SQL recipes seemingly at random, in-database execution support for visual recipes is limited, sampling does not always work, and we cannot leverage modern SQL operations (pivot, etc.) for charts/dashboards…

This is frustrating because Dataiku feels so close to perfection, yet it has these little quirks when working with a SQL database.

The workarounds are to hand-code the transformations, to perform them in Databricks notebooks and trigger them through an API… or to switch to the Spark engine, which is 100% compatible with all visual recipes.
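
For the hand-coded route, one option is a Python recipe that pushes the SQL down to the warehouse with SQLExecutor2, so the transformation still runs in-database (ELT). A rough sketch, assuming a Snowflake-style PIVOT; the dataset and table names are placeholders.

```python
import dataiku
from dataiku import SQLExecutor2

# Placeholder output dataset, as defined in the flow
output_ds = dataiku.Dataset("orders_pivoted")

# ORDERS_MONTHLY stands in for the input dataset's actual table in the warehouse;
# the PIVOT syntax shown is Snowflake-style, Databricks SQL has its own variant
query = """
SELECT *
FROM (SELECT CUSTOMER_ID, ORDER_MONTH, AMOUNT FROM ORDERS_MONTHLY)
PIVOT (SUM(AMOUNT) FOR ORDER_MONTH IN ('2024-01', '2024-02', '2024-03'))
"""

# Executes the query in the database and writes the result into the output dataset
SQLExecutor2.exec_recipe_fragment(output_ds, query)
```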

AFAIK, Spark on K8S is not compatible with Unity Catalog (even though Apache Spark has supported Unity Catalog since 3.5.3), so it requires exporting all the data before transformation. This brings another dilemma: we have always stuck to the principle of ELT to speed up transformations, but switching to the Spark engine would force us into an ETL paradigm, exporting all the data and then copying it all back into the database.
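
The difference between the two paradigms is easiest to see in a small PySpark sketch (the connection string and table names are placeholders): the ETL variant pulls the whole table into Spark before transforming it, while the ELT variant pushes the query down so the warehouse does the work.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-vs-etl").getOrCreate()
jdbc_url = "jdbc:databricks://example-host:443/default"  # placeholder connection string

# ETL: export the full table into Spark, aggregate there, then write back
orders = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "sales.orders")
          .load())
monthly_etl = orders.groupBy("order_month").agg(F.sum("amount").alias("amount"))

# ELT: push the aggregation down so it runs inside the warehouse
monthly_elt = (spark.read.format("jdbc")
               .option("url", jdbc_url)
               .option("query",
                       "SELECT order_month, SUM(amount) AS amount "
                       "FROM sales.orders GROUP BY order_month")
               .load())
```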

==============================================================================================

Are there any users who have successfully integrated Dataiku with a SQL data platform (Databricks, Snowflake, Redshift, Azure SQL, Teradata, etc.) for ETL jobs?


Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
    edited December 10

    Workspace (transient datasets): Have a look at SQL pipelines (see the sketch at the end of this answer). Not having intermediate datasets is both a good and a bad thing. The whole concept of the Dataiku flow is that you can explore each of the transformation steps and see how the data is changing. This helps non-technical users and people not familiar with the project quickly understand what's going on. Not persisting the intermediate datasets defeats the purpose of using Dataiku and prevents you from using lots of features like Data Quality checks, conditional execution of steps, etc. I think you are trying to adapt an old concept that's not relevant in Dataiku. What actual user requirement does this point cover? Data is cheap to store. There is no point in trashing intermediate dataset data in Dataiku. Don't try to replicate old patterns in a new platform. Think of business outcomes, not technical features.

    ETL/ELT, Spark & SQL: Regarding column names, read this post about it that I just wrote. Rename your sources to supported characters and your problems will go away. Every data storage technology has its naming limitations.

    Limited support for visual recipes: If you look at the table in this post you can actually see that Snowflake supports a lot more than other SQL engines. So while there is a lack of in-database execution support for some visual recipes, it is not widespread. I agree it could and should be better, but it's manageable. For everything else you can either use SQL Recipes or a database view, assuming the database technology supports the desired transformation.

    Perform the transformations in Databricks notebooks / K8S is not compatible with Unity Catalog: You can use Dataiku Compute Clusters within Dataiku. See the following page.

    Sampling does not always work: There isn't much anyone can do with the limited information you are providing. I can assure you that Dataiku sampling works the way we expect it to work, and this is more likely a misconception on your side about how sampling works.

    In general, while Dataiku can do ETL and there are many advantages to doing ETL in Dataiku, it is NOT a pure ETL tool and will have its drawbacks too. Therefore, using Dataiku purely as an ETL tool is a misguided approach which will not realise the full capabilities of the Dataiku platform as a machine learning platform. In addition, the Designer license is priced with that in mind, so it's not going to be cost effective to use it only for ETL purposes. The reason Dataiku does ETL is to accelerate machine learning projects.
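
    For reference, SQL pipelines are enabled per dataset by allowing virtualization, so that DSS can chain several SQL steps into a single query instead of materializing each intermediate dataset. A minimal sketch with the Dataiku Python API; the project and dataset names are placeholders, and the exact location of the flag (shown here under flowOptions) is an assumption that may vary by DSS version.

    ```python
    import dataiku

    client = dataiku.api_client()
    project = client.get_project("SALES_MART")  # placeholder project key

    settings = project.get_dataset("stg_orders").get_settings()  # placeholder dataset
    raw = settings.get_raw()
    # Assumed key path: flag allowing this dataset to be virtualized in SQL pipelines
    raw.setdefault("flowOptions", {})["virtualizable"] = True
    settings.save()
    ```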

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 321 Neuron

    We do lots of ETL with Dataiku and Snowflake. We typically use SQL Script recipes. We generally don't worry about intermediate tables, but in some cases we use a scenario step to delete them. Overall, we are happy with Dataiku and Snowflake for ETL.
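
    A sketch of what such a cleanup step could look like, e.g. a scenario Python step that drops intermediate Snowflake tables through SQLExecutor2. The connection, schema, and table names are placeholders, and passing post_queries=['COMMIT'] is one common way to run statements that return no result set.

    ```python
    import dataiku
    from dataiku import SQLExecutor2

    # Runs on a named SQL connection; "snowflake_dwh" and the table names are hypothetical
    executor = SQLExecutor2(connection="snowflake_dwh")

    for table in ["STG_ORDERS", "STG_CUSTOMERS"]:
        executor.query_to_df(
            f"DROP TABLE IF EXISTS ANALYTICS.{table}",
            post_queries=["COMMIT"],  # ensure the DDL is committed
        )
    ```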
