Efficiently applying window recipes across monthly database partitions

zwag20 Registered Posts: 4 ✭✭

Hi all,
I’m looking for best practices when applying window recipes that need to span multiple partitions.

In my case, I have a dataset partitioned by month, and I'm using a window recipe with lead/lag functions to look three months ahead and behind. To make this work, I currently do the following (a rough pandas equivalent of the window step is sketched after the list):

  1. Unpartition the dataset using a sync recipe
  2. Apply the window recipe
  3. Redispatch the data back into monthly partitions
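
Here is roughly what the window step computes, as a pandas sketch (the dataset and column names are placeholders, not my real ones):

    import dataiku

    # Placeholder names throughout; runs on the un-partitioned dataset.
    df = dataiku.Dataset("sales_unpartitioned").get_dataframe()

    # Order rows within each entity so shift() moves by whole months.
    # Assumes one row per entity per month.
    df = df.sort_values(["entity_id", "month"])
    grp = df.groupby("entity_id")["value"]

    # Look 3 months behind and 3 months ahead, per entity.
    df["value_lag_3"] = grp.shift(3)
    df["value_lead_3"] = grp.shift(-3)

    dataiku.Dataset("sales_windowed").write_with_schema(df)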

This approach works, but the redispatch step is painfully slow.
Has anyone found a more efficient way to handle this kind of cross-partition window logic?

Operating system used: Windows

Answers

  • Alexandru, Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,349

    Hi,
    What type of datasets are you using? SQL or file-based?

To satisfy the lookback/lookahead, you will need to collect the last three partitions (months) as input to the window recipe; the output needs to be un-partitioned before the redispatch.

    https://knowledge.dataiku.com/latest/automation/partitioning/concept-redispatch.html#collecting-partitions

a) If it's file-based, unfortunately, there is not much you can do. You will need to redispatch the last three months into the partitioned dataset. Note that you "lose" the partition column when you redispatch, so you may want to duplicate the partition column first (see the sketch below), or you can always add it back later with a Prepare recipe using the Enrich with record context processor: https://doc.dataiku.com/dss/latest/preparation/processors/enrich-with-record-context.html
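
    For the duplication step, a minimal Python-recipe sketch (the dataset names and the "month" dimension name are assumptions):

        import dataiku

        # Runs once per partition being built; all names are placeholders.
        src = dataiku.Dataset("sales_by_month")
        dst = dataiku.Dataset("sales_by_month_copy")

        # DKU_DST_month exposes the value of the "month" partition
        # dimension currently being built.
        month = dataiku.dku_flow_variables["DKU_DST_month"]

        df = src.get_dataframe()
        df["month_col"] = month  # keep the partition id as a regular column
        dst.write_with_schema(df)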


    b) If it's an SQL dataset, redispatch is not really needed.
    Simply read the output SQL table of the window recipe as a new partitioned dataset and continue from there. There is no difference in what is stored in the database between a partitioned and a non-partitioned table; the difference is only in how DSS interacts with that dataset.

    If you want to keep it connected to your flow, you can create an essentially empty Python recipe and manually set its output to your window recipe's output table. This avoids the redispatch overhead entirely for SQL datasets.
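
    As a sketch, that near-empty recipe can be as little as a schema sync (dataset names are placeholders; since the output is manually mapped to the table the window recipe already wrote, no rows need to move):

        import dataiku

        src = dataiku.Dataset("windowed_output")       # window recipe output
        dst = dataiku.Dataset("windowed_partitioned")  # mapped to same table

        # Propagate the schema so DSS considers the output built.
        dst.write_schema(src.read_schema())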

    Thanks

  • Turribeach, Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023, Circle Member Posts: 2,590

    Personally, I think you should move away from legacy technologies that require performance hacks like partitioning to deliver the performance you are looking for. I would suggest looking at cloud-native database technologies like Snowflake, Databricks, BigQuery, etc., which should handle far more rows without much trouble.

  • zwag20, Registered Posts: 4 ✭✭

    Thanks for the response. I am using Snowflake as my database. I am seeing odd behavior when I skip the redispatch. The Python recipe appears to work, but I can't see the data and receive an error saying it failed to read data from the table; I have no idea why, as the job completed successfully. And without the redispatch, it seems to put all the data in every partition regardless of what I set for partition matching. It's as if it uses "All" instead of "Equals" no matter what I pick.

    Basically, I don't understand what is happening, and I don't really trust the partitioning in Dataiku. It seems to do whatever it wants regardless of my settings.

  • zwag20, Registered Posts: 4 ✭✭

    Quite the assumptions you are making. I am using Snowflake. My issue is with Dataiku itself, not the database's performance.
