Efficiently applying window recipes across monthly database partitions

zwag20 — Registered Posts: 1

Hi all,
I’m looking for best practices when applying window recipes that need to span multiple partitions.

In my case, I have a dataset partitioned by month, and I’m using a window recipe with lead/lag functions to look ahead and behind by 3 months. To make this work, I currently:

  1. Unpartition the dataset using a sync recipe
  2. Apply the window recipe
  3. Redispatch the data back into monthly partitions

This approach works, but the redispatch step is painfully slow.
Has anyone found a more efficient way to handle this kind of cross-partition window logic?
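
For reference, step 2 is essentially doing this (a rough Python-recipe equivalent of the window recipe; dataset and column names are made up, and it assumes one row per customer per month):

```python
import dataiku

# Read the unpartitioned dataset produced by the sync recipe
df = dataiku.Dataset("sales_unpartitioned").get_dataframe()
df = df.sort_values(["customer_id", "month"])

# Look 3 months ahead and 3 months behind within each group;
# this is why the recipe needs to see across partition boundaries
grouped = df.groupby("customer_id")["amount"]
df["amount_lead_3"] = grouped.shift(-3)
df["amount_lag_3"] = grouped.shift(3)

dataiku.Dataset("sales_windowed").write_with_schema(df)
```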

Operating system used: Windows

Answers

  • Alexandru (Dataiker) — Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Posts: 1,335

    Hi,
    What type of datasets are you using? SQL or file-based?

    For the first step, you will need to collect the last three partitions (months) over which you want to perform the window recipe; the output of that recipe needs to be unpartitioned before the redispatch.

    https://knowledge.dataiku.com/latest/automation/partitioning/concept-redispatch.html#collecting-partitions
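
    In a notebook you can see what "collecting" amounts to on the read side; something along these lines (a sketch using the dataiku API's add_read_partitions, with made-up dataset and partition names — inside a recipe, the partition dependency function resolves this for you):

    ```python
    import dataiku

    ds = dataiku.Dataset("sales_by_month")

    # Explicitly collect the three monthly partitions the window
    # logic needs to see
    for partition in ["2024-01", "2024-02", "2024-03"]:
        ds.add_read_partitions(partition)

    df = ds.get_dataframe()  # rows from all three months
    ```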

    a) If it's file-based, unfortunately, there is not much you can do: you will need to redispatch the last three months back into the partitioned dataset. Note that you "lose" the partition column when you redispatch, so you may want to duplicate the partition column beforehand, or you can always add it back later with a Prepare recipe using the Enrich with record context processor: https://doc.dataiku.com/dss/latest/preparation/processors/enrich-with-record-context.html
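
    Duplicating the partition column before the redispatch could look like this (a minimal sketch with hypothetical names; the Prepare recipe processor above achieves the same without code):

    ```python
    import dataiku

    df = dataiku.Dataset("windowed_output").get_dataframe()

    # Keep a copy of the month: the original column is consumed as
    # the partitioning dimension during redispatch, while the copy
    # survives as a regular column
    df["month_value"] = df["month"]

    dataiku.Dataset("windowed_to_redispatch").write_with_schema(df)
    ```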


    b) If it's an SQL dataset, redispatch is not really needed.
    Simply declare the output SQL table of the window recipe as a new partitioned dataset and continue from there. There is no difference in what is stored in the database between a partitioned and a non-partitioned table; the difference is only in how DSS interacts with the dataset.
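
    To illustrate, the two DSS datasets below can point at the exact same SQL table, one declared unpartitioned and one partitioned by month (hypothetical names, assuming column-based partitioning with a YYYY-MM dimension):

    ```python
    import dataiku

    # Unpartitioned view: the window recipe's output, read as-is
    full = dataiku.Dataset("windowed_output").get_dataframe()

    # Partitioned view of the same table: when reading one
    # partition, DSS just filters on the partitioning column
    jan = dataiku.Dataset("windowed_output_partitioned")
    jan.add_read_partitions("2024-01")
    df_jan = jan.get_dataframe()
    ```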

    If you want to keep it connected to your flow, you can create an empty Python recipe with no code and manually point its output dataset at your window recipe's output table. This avoids the redispatch overhead entirely for SQL datasets.
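
    The pass-through recipe really can be empty; for example (just a comment, nothing executed):

    ```python
    # Intentionally empty pass-through recipe.
    # The output dataset is manually pointed at the table the window
    # recipe already wrote, so nothing is read or copied; the recipe
    # only keeps the flow connected for lineage and scenarios.
    ```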

    Thanks

  • Turribeach (Neuron) — Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Neuron 2023, Posts: 2,580

    Personally, I think you should move away from legacy technologies that require performance hacks like partitioning to deliver the performance you are looking for. I would suggest looking at cloud-native database technologies like Snowflake, Databricks, BigQuery, etc., which should be able to handle far more rows without much trouble.
