How to sync a partitioned table to a single table efficiently

Joshuazzx
Level 2

Hi,

My workflow generates a partitioned dataset with over 10000 partitions, and each partition holds just one row of data. I am currently syncing this dataset to a single table, but I find it very inefficient: syncing the 10000 partitions took over 1 hour to finish. Is there a way to run this more efficiently, or a better way to accomplish it?


I am running Dataiku 7.0 on a VM set up in Azure. 

Thanks in advance

3 Replies
Jediv
Dataiker

Is there any way to not generate a table with one row per partition? That is indeed extremely inefficient.

Joshuazzx
Level 2
Author

This partitioned dataset was generated by a Python recipe where each partition ID was fed into the recipe by using dataiku.dku_flow_variables["DKU_DST_ID"] from this dataset.

Any advice on how to avoid this partitioning?

Even if there were more than one row per partition, syncing 10000 partitions to a single table would still be inefficient. Is this statement correct?

Thanks a lot

Jediv
Dataiker

Hi,

Your statement is definitely correct. I'm not sure I fully understand the issue, but I'll try to explain some concepts around partitioning, and hopefully that helps. Let's say I have a set of file-based datasets in the following layout: a base directory, then multiple subdirectories, each with a dataset inside:

~ tree partition_ex/
partition_ex/
├── a
│   └── part1.csv
├── b
│   └── part1.csv
└── c
    └── part1.csv

I can treat this a couple of different ways in Dataiku. If I connect to the directory partition_ex as a server filesystem dataset, we'll get a view like this:

Pasted_Image_5_1_20__6_52_PM.png

I can then choose to partition this dataset by clicking the Partitioning tab and setting this:

Pasted_Image_5_1_20__6_55_PM.png

 

Then my dataset would have 3 partitions, and I would take those into account when working with it in Python. However, I could also just turn off partitioning, in which case my dataset would combine ALL the files in the subdirectories into a single dataset.

This is the better option if the number of records in each partition is small and the number of partitions is very high.

Specifically for your case, instead of syncing your partitioned dataset to an unpartitioned dataset, simply create a new unpartitioned dataset pointing to the same location as the partitioned one.
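
To make the difference between the two views concrete, here is a rough sketch in plain Python (standard library only, not the Dataiku API). It recreates the partition_ex layout from above in a temporary directory, then reads it once grouped by subdirectory (the partitioned view) and once as a single combined table (the unpartitioned view). The column names "id" and "value" and the helper function names are made up for illustration; your real files will look different.

```python
import csv
import tempfile
from pathlib import Path


def write_example_layout(base: Path) -> None:
    """Recreate the partition_ex layout: one subdirectory per partition, one row each."""
    for partition_id in ("a", "b", "c"):
        part_dir = base / partition_id
        part_dir.mkdir(parents=True)
        with open(part_dir / "part1.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "value"])  # hypothetical columns
            writer.writerow([partition_id, f"row-{partition_id}"])


def read_by_partition(base: Path) -> dict:
    """Partitioned view: rows grouped by the name of the subdirectory they live in."""
    partitions = {}
    for csv_path in sorted(base.glob("*/*.csv")):
        with open(csv_path, newline="") as f:
            partitions.setdefault(csv_path.parent.name, []).extend(csv.DictReader(f))
    return partitions


def read_combined(base: Path) -> list:
    """Unpartitioned view: every file under the base directory, read as one table."""
    rows = []
    for csv_path in sorted(base.rglob("*.csv")):
        with open(csv_path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "partition_ex"
    write_example_layout(base)
    by_partition = read_by_partition(base)  # one entry per subdirectory
    combined = read_combined(base)          # all rows in a single list

print(sorted(by_partition))  # partition ids found on disk
print(len(combined))         # total number of rows across all files
```

Both functions read exactly the same files. The only difference is whether the subdirectory name is treated as a partition key, which is the idea behind pointing an unpartitioned dataset at the same location instead of syncing partition by partition.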