File-based partitioning


Hello,

I'm using file-based partitioning, whereby the partitions are derived from the folder names: Filename_YYYY_MM_DD/<files>. The partition pattern is: /Filename_%Y_%M_%D/.*
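
For concreteness, with a day-level time dimension I expect the mapping to look like this (the date is just an illustrative value):

    Filename_2024_03_15/part-0.csv  ->  partition 2024-03-15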

So far, so good. Normally I receive a new file every day, but unfortunately a day is sometimes skipped and that file is missing. The data also goes back a few years, so there are more than 1,300 partitions. I would like to select only the partitions from the last 6 months, and among those, only the ones that are actually available.

When using a SparkSQL recipe I can choose "Latest available", "All available", a time range, etc. (on the Input/Output tab). Unfortunately, the time range option doesn't check whether a partition is available, and the recipe fails when a file/partition is missing. I could write a custom Python dependency function, but here too there seems to be no option to return only the available partitions.

What are the best options?

  • I could use substitution variables, but the partition key is not present in the data itself; it is derived from the folder name. How could I use one of the substitution variables to select all available partitions of the last 6 months?
  • Or would it be better to use a PySpark recipe where I retrieve the list of available partitions (with list_partitions(raise_if_empty=True)), then define which partitions to read (with add_read_partitions()), and which partition to write to (with set_write_partition()?), roughly as in the sketch below?
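
For option 2, here is a minimal sketch of what I have in mind, using the dataiku Python API inside a Python recipe. The dataset name is a placeholder, and I'm assuming day-level partition ids of the form YYYY-MM-DD:

    import dataiku
    from datetime import datetime, timedelta

    # Input dataset, partitioned on a day-level time dimension
    src = dataiku.Dataset("my_input")  # placeholder dataset name

    # Only the partitions that actually exist (days where a file arrived)
    available = src.list_partitions(raise_if_empty=True)

    # Keep roughly the last 6 months; day-level partition ids look like "2024-03-15"
    cutoff = datetime.now() - timedelta(days=183)
    recent = [p for p in available
              if datetime.strptime(p, "%Y-%m-%d") >= cutoff]

    # Restrict the read to those partitions (this must happen before the first read)
    for p in recent:
        src.add_read_partitions(p)

    df = src.get_dataframe()

If I read the documentation correctly, set_write_partition() cannot be used inside a recipe, where the write partition is fixed by the Flow (the partition the recipe is being built for), so I'm not sure how the writing side of this would work.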

Would it be possible to provide some examples for both options: 1) how to filter partitions using variables, e.g. DKU_PARTITION_Filter or a source-specific filter such as (DKU_DST_DATE minus 6 months), and 2) a Python example that selects the partitions available in the last 6 months, reads them, and writes them to the same partitions in the output?
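
For option 1, the closest thing I can picture is computing the partition list myself in a scenario Python step and storing it in a project variable, so that a subsequent Build step can reference it as ${partitions_to_build}. The variable and dataset names are placeholders, and I'm not sure this is the intended approach:

    import dataiku
    from datetime import datetime, timedelta

    # Compute the available partitions of the last ~6 months
    ds = dataiku.Dataset("my_input")  # placeholder dataset name
    cutoff = datetime.now() - timedelta(days=183)
    recent = [p for p in ds.list_partitions()
              if datetime.strptime(p, "%Y-%m-%d") >= cutoff]

    # Store them as a comma-separated list in the project variables, so that
    # e.g. a Build step can use ${partitions_to_build} as its partition spec
    project = dataiku.api_client().get_default_project()
    variables = project.get_variables()
    variables["standard"]["partitions_to_build"] = ",".join(recent)
    project.set_variables(variables)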

This would be a tremendous help!

Many thanks,

Henk
