Hi,
My workflow generates a partitioned dataset with over 10000 partitions, and each partition holds just one row of data. I am currently syncing this dataset to a single table, but I have found it very inefficient: syncing the 10000 partitions took over 1 hour to finish. Is there a way to run this more efficiently, or a better way to accomplish it?
I am running Dataiku 7.0 on a VM set up in Azure.
Thanks in advance
Is there any way to not generate a table with one row per partition? That is indeed extremely inefficient.
This partitioned dataset was generated by a Python recipe in which each partition ID was fed into the recipe by using dataiku.dku_flow_variables["DKU_DST_ID"] from this dataset.
Any advice on how to avoid this partitioning?
Even if there were more than one row per partition, syncing 10000 partitions to a single table would still be inefficient. Is this statement correct?
Thanks a lot
Hi,
Your statement is definitely correct. I'm not sure I understand the issue fully, but I'll try to talk about some concepts around partitioning and hopefully that helps. Let's say I have a set of file-based datasets in the following layout: a base directory, then multiple subdirectories, each with a data file inside:
~ tree partition_ex/
partition_ex/
├── a
│   └── part1.csv
├── b
│   └── part1.csv
└── c
    └── part1.csv
I can treat this a couple different ways in Dataiku. If I connect to the directory partition_ex as a server filesystem dataset, we'll get a view like this:
I can then choose to partition this dataset, by clicking the partitioned tab and setting this:
Then my dataset would have 3 partitions, and I would take those into account when working with it in Python. However, I could also just turn off partitioning, in which case my dataset would combine ALL the files in the subdirectories into a single dataset.
Turning partitioning off is the better option if the number of records in each partition is small and the number of partitions is very high.
Specifically for your case, instead of syncing your partitioned dataset to an unpartitioned dataset, simply create a new unpartitioned dataset pointing to the same location as the partitioned one.
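To make the difference between the two views concrete, here is a minimal sketch in plain Python. It does not use the Dataiku API and only mimics the idea on the partition_ex layout above. The function names (read_partitioned, read_unpartitioned, build_example) and the sample columns are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def build_example(base_dir):
    """Create the a/b/c layout with one data row per file (made-up data)."""
    for name, value in [("a", 1), ("b", 2), ("c", 3)]:
        part_dir = Path(base_dir) / name
        part_dir.mkdir(parents=True, exist_ok=True)
        with (part_dir / "part1.csv").open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "value"])
            writer.writerow([name, value])


def read_partitioned(base_dir):
    """Partitioned view: one entry per subdirectory, keyed by its name."""
    partitions = {}
    for csv_path in sorted(Path(base_dir).glob("*/*.csv")):
        partition_id = csv_path.parent.name  # "a", "b" or "c"
        with csv_path.open(newline="") as f:
            partitions.setdefault(partition_id, []).extend(csv.DictReader(f))
    return partitions


def read_unpartitioned(base_dir):
    """Unpartitioned view: every file under base_dir read as one table."""
    rows = []
    for csv_path in sorted(Path(base_dir).glob("*/*.csv")):
        with csv_path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_example(tmp)
        print(len(read_partitioned(tmp)), "partitions")
        print(len(read_unpartitioned(tmp)), "rows in the single table")
```

In the unpartitioned view, a single pass reads every file into one table. With the partitioned view, a downstream job generally has to handle the partitions one at a time, and with thousands of one-row partitions that per-partition overhead adds up.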