Efficiently applying window recipes across monthly database partitions

zwag20 — Registered Posts: 1

Hi all,
I’m looking for best practices when applying window recipes that need to span multiple partitions.

In my case, I have a dataset partitioned by month, and I’m using a window recipe with lead/lag functions to look ahead and behind by 3 months. To make this work, I currently:

  1. Unpartition the dataset using a sync recipe
  2. Apply the window recipe
  3. Redispatch the data back into monthly partitions

This approach works, but the redispatch step is painfully slow.
Has anyone found a more efficient way to handle this kind of cross-partition window logic?
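
For reference, step 2 is essentially doing this (a rough Python-recipe equivalent of the window recipe; dataset and column names are made up, and it assumes one row per customer per month):

```python
import dataiku

# Read the unpartitioned dataset produced by the sync recipe
df = dataiku.Dataset("sales_unpartitioned").get_dataframe()
df = df.sort_values(["customer_id", "month"])

# Look 3 months ahead and 3 months behind within each group;
# this is why the recipe needs to see across partition boundaries
grouped = df.groupby("customer_id")["amount"]
df["amount_lead_3"] = grouped.shift(-3)
df["amount_lag_3"] = grouped.shift(3)

dataiku.Dataset("sales_windowed").write_with_schema(df)
```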

Operating system used: Windows

Answers

  • Alexandru (Dataiker) — Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Posts: 1,335

    Hi,
    What type of datasets are you using? SQL or file-based?

    For the first step, you will need to collect the last three partitions (months) over which you want to perform the window recipe; the output of that recipe needs to be unpartitioned before the redispatch.

    https://knowledge.dataiku.com/latest/automation/partitioning/concept-redispatch.html#collecting-partitions
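
    In a notebook you can see what "collecting" amounts to on the read side; something along these lines (a sketch using the dataiku API's add_read_partitions, with made-up dataset and partition names — inside a recipe, the partition dependency function resolves this for you):

    ```python
    import dataiku

    ds = dataiku.Dataset("sales_by_month")

    # Explicitly collect the three monthly partitions the window
    # logic needs to see
    for partition in ["2024-01", "2024-02", "2024-03"]:
        ds.add_read_partitions(partition)

    df = ds.get_dataframe()  # rows from all three months
    ```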

    a) If it's file-based, unfortunately, there is not much you can do: you will need to redispatch the last three months back into the partitioned dataset. Note that you "lose" the partition column when you redispatch, so you may want to duplicate the partition column beforehand, or you can always add it back later with a Prepare recipe using the Enrich with record context processor: https://doc.dataiku.com/dss/latest/preparation/processors/enrich-with-record-context.html
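
    Duplicating the partition column before the redispatch could look like this (a minimal sketch with hypothetical names; the Prepare recipe processor above achieves the same without code):

    ```python
    import dataiku

    df = dataiku.Dataset("windowed_output").get_dataframe()

    # Keep a copy of the month: the original column is consumed as
    # the partitioning dimension during redispatch, while the copy
    # survives as a regular column
    df["month_value"] = df["month"]

    dataiku.Dataset("windowed_to_redispatch").write_with_schema(df)
    ```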


    b) If it's an SQL dataset, redispatch is not really needed.
    Simply declare the output SQL table of the window recipe as a new partitioned dataset and continue from there. There is no difference in what is stored in the database between a partitioned and a non-partitioned table; the difference is only in how DSS interacts with the dataset.
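
    To illustrate, the two DSS datasets below can point at the exact same SQL table, one declared unpartitioned and one partitioned by month (hypothetical names, assuming column-based partitioning with a YYYY-MM dimension):

    ```python
    import dataiku

    # Unpartitioned view: the window recipe's output, read as-is
    full = dataiku.Dataset("windowed_output").get_dataframe()

    # Partitioned view of the same table: when reading one
    # partition, DSS just filters on the partitioning column
    jan = dataiku.Dataset("windowed_output_partitioned")
    jan.add_read_partitions("2024-01")
    df_jan = jan.get_dataframe()
    ```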

    If you want to keep it connected to your flow, you can create an empty Python recipe with no code and manually point its output dataset at your window recipe's output table. This avoids the redispatch overhead entirely for SQL datasets.
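
    The pass-through recipe really can be empty; for example (just a comment, nothing executed):

    ```python
    # Intentionally empty pass-through recipe.
    # The output dataset is manually pointed at the table the window
    # recipe already wrote, so nothing is read or copied; the recipe
    # only keeps the flow connected for lineage and scenarios.
    ```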

    Thanks

  • Turribeach (Neuron) — Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Neuron 2023, Posts: 2,580

    Personally, I think you should move away from legacy technologies that require performance hacks like partitioning to deliver the performance you are looking for. I would suggest looking at cloud-native database technologies like Snowflake, Databricks, BigQuery, etc., which should be able to handle far more rows without much trouble.
