
Parallelize split recipe


The split recipe is a powerful way to segment data. However, it writes to its output datasets sequentially rather than concurrently. Because partitioning an output dataset is often not viable due to deadlocks, writing to separate tables is the best way to increase ingestion throughput to a target. Today, loading data into a target in parallel requires multiple filter recipes instead. Attached is an example pipeline showing the pattern needed to ingest data rapidly.

A partitioned source dataset is defined. That dataset is synced to a partitioned filesystem dataset, which is then enriched with a column holding each row's partition name. Filter recipes then select on those partition names to copy the data into separate target tables, and finally the data is unioned back into a single table. In my experiments so far, this process speeds up data loads into Teradata by about 7x.
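To make the pattern concrete, here is a rough Python sketch of the same fan-out and union. It is not Dataiku code and uses no Dataiku API: plain lists stand in for datasets and tables, and the names `tag_with_partition`, `load_partition`, and `parallel_load` are hypothetical. It only illustrates the shape of the flow (tag rows with a partition name, filter per partition in parallel, union the results).

```python
from concurrent.futures import ThreadPoolExecutor


def tag_with_partition(rows, partition_of):
    """Enrich step: add a column holding each row's partition name."""
    return [{**row, "partition": partition_of(row)} for row in rows]


def load_partition(tagged_rows, partition_name):
    """Filter step: keep one partition's rows; the returned list stands in for a target table."""
    return [row for row in tagged_rows if row["partition"] == partition_name]


def parallel_load(rows, partition_of, max_workers=4):
    tagged = tag_with_partition(rows, partition_of)
    names = sorted({row["partition"] for row in tagged})
    # One filter-and-load task per partition, run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        loaded = pool.map(lambda name: load_partition(tagged, name), names)
        tables = dict(zip(names, loaded))
    # Union step: stack the per-partition tables back into a single table.
    unioned = [row for name in names for row in tables[name]]
    return tables, unioned


# Toy data: five rows spread over three hypothetical partitions.
rows = [{"id": i, "region": r} for i, r in enumerate(["eu", "us", "eu", "apac", "us"])]
tables, unioned = parallel_load(rows, partition_of=lambda row: row["region"])
```

In the real flow each `load_partition` call corresponds to a separate filter recipe writing to its own table, which is why the flow needs one recipe per partition.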

Ideally, to save time when building flows like this, the split recipe would be updated to run its data writers in parallel, rather than using a single data writer that writes to each output dataset in turn.
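As an illustration of the behavior I am asking for, the sketch below reads the input once and hands each row to a dedicated writer per output, with all writers running at the same time. This is only my assumption of how it could work, not how the split recipe is implemented: `split_parallel`, `route`, and the in-memory lists used as outputs are hypothetical stand-ins.

```python
import queue
import threading

_DONE = object()  # sentinel telling a writer that no more rows are coming


def _writer(inbox, sink):
    """Drain one output's queue into its sink until the sentinel arrives."""
    while True:
        row = inbox.get()
        if row is _DONE:
            break
        sink.append(row)  # a real writer would insert into the target table here


def split_parallel(rows, route, output_names):
    """Read rows once and route each to its output; one writer thread per output."""
    sinks = {name: [] for name in output_names}
    inboxes = {name: queue.Queue(maxsize=1000) for name in output_names}
    writers = [
        threading.Thread(target=_writer, args=(inboxes[name], sinks[name]))
        for name in output_names
    ]
    for thread in writers:
        thread.start()
    try:
        for row in rows:
            inboxes[route(row)].put(row)
    finally:
        # Always stop the writers, even if routing fails partway through.
        for inbox in inboxes.values():
            inbox.put(_DONE)
        for thread in writers:
            thread.join()
    return sinks


sinks = split_parallel(
    range(10),
    route=lambda n: "even" if n % 2 == 0 else "odd",
    output_names=["even", "odd"],
)
```

The bounded queues keep memory use flat when one target is slower than the others, at the cost of the reader pausing until that writer catches up.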

 

Screenshot 2022-04-29 132600.jpg