File based partitioning

Henk · December 2022

Hello,

I'm using file based partitioning, whereby the partitions are derived from the file name: Filename_YYYY_MM_DD/<files> . The partition pattern: /Filename_%Y-%M-%D/.*

So far so good. Normally I receive a new file every day, but unfortunately, sometimes a day is skipped and the file is missing. Also, data goes back a few years so there are more than 1300 partitions. I would like to only select partitions from the last 6 months, but only those that are available.

When using a SQLSpark recipe I can define Last Available, All Available, or a time range etc (in the input/output tab). Unfortunately, the time range option doesn't check whether a partition is available or not and the recipe fails when a file/partition is missing. I could write a Python Dependency function, but also here no option to return only available partitions?

What are the best options?

I could use substitution variables, but the partition key is not found in the data, it is derived from the filename? How could I use one of the substitution variables and select all available partitions of the last 6 months?
Or would it be better to use a pySpark node where I retrieve the list of available partitions (with .list_partitions(raise_if_empty=True), then define which data to read (with .add_read_partitions), and to what partitions to write (with .set_write_partitions ???)

Would it be possible to provide some examples for both options 1) how to filter partitions using variables, e.g. DKU_PARTITION_Filter or a source specific filters > (DKU_DST_DATE – 6 months), or 2) a Python example to select the partitions available in the last 6 months, how best to read these, and write these to the same partition in the output?

This would be a tremendous help!

Many thanks,

Henk

File based partitioning

Categories

Setup Info

Tags