ETL Dataiku
I hope you are doing well.
I am trying to set up an ETL. I have a flow zone containing the data preparation steps, followed by a step with the final joins.
The goal is to create an ETL that produces a sandbox of data, so that I can build different tables from the output dataset.
In terms of performance, is a single final join across all datasets the best option? I have about 200 columns to join, and the run easily takes 1 hour.
(Each dataset is stored in a PostgreSQL database.)
Thanks for your answers
Answers
Alexandru (Dataiker)
Hi @Kevin_dataiku8,
In DSS 12, you can now build a single flow zone, for example. It is hard to comment on the performance of the join, but typically joining on a few columns with unique join keys (with no null/empty values) performs better. You will need to experiment and see whether splitting your final join into several joins, and perhaps using SQL pipelines to avoid materializing intermediate datasets, performs better for your flow zone.
https://doc.dataiku.com/dss/latest/sql/pipelines/sql_pipelines.html
Thanks,
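As a rough illustration of the join-key advice above, here is a minimal Python sketch that checks whether a key column is unique and free of null/empty values before it is used in a join. It is not Dataiku API code: it uses the standard-library `sqlite3` module as a stand-in for the PostgreSQL connection, and the table names (`customers`, `orders`), the column `customer_id`, and the helpers `key_quality` and `is_safe_join_key` are all hypothetical names made up for the example.

```python
import sqlite3


def key_quality(conn, table, key):
    """Return (total_rows, distinct_keys, null_or_empty_keys) for a text key column."""
    # Identifiers are interpolated directly, so only pass trusted names here.
    sql = (
        f"SELECT COUNT(*), COUNT(DISTINCT {key}), "
        f"COALESCE(SUM(CASE WHEN {key} IS NULL OR {key} = '' THEN 1 ELSE 0 END), 0) "
        f"FROM {table}"
    )
    return conn.execute(sql).fetchone()


def is_safe_join_key(stats):
    """A key is 'safe' here if it has no null/empty values and no duplicates."""
    total, distinct, missing = stats
    return missing == 0 and distinct == total


# Hypothetical sample data; sqlite3 in memory stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("a", "Ann"), ("b", "Bob"), ("c", "Cy")],
)
conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("a", 5.0), (None, 7.0), ("", 1.0)],
)

customers_stats = key_quality(conn, "customers", "customer_id")
orders_stats = key_quality(conn, "orders", "customer_id")

print("customers:", customers_stats, is_safe_join_key(customers_stats))
print("orders:", orders_stats, is_safe_join_key(orders_stats))
```

Running a check like this on each input of the final join can show which keys are duplicated or contain null/empty values. Those inputs are reasonable candidates to clean up, or to join in a separate earlier step, when experimenting with splitting the join.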