Hello,
I'm using file-based partitioning, where the partitions are derived from the file name: Filename_YYYY_MM_DD/<files>. The partition pattern is /Filename_%Y-%M-%D/.*
So far so good. Normally I receive a new file every day, but unfortunately a day is sometimes skipped and the file is missing. The data also goes back a few years, so there are more than 1,300 partitions. I would like to select only the partitions from the last 6 months, but only those that are actually available.
When using a SparkSQL recipe I can define Last available, All available, a time range, etc. (in the Input/Output tab). Unfortunately, the time-range option doesn't check whether a partition is available, so the recipe fails when a file/partition is missing. I could write a custom Python dependency function, but there too I don't see an option to return only the available partitions.
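To make the Python idea concrete, this is roughly the filtering logic I would want such a function to perform. It is plain standard-library Python only, not the actual DSS dependency API; the function name, its arguments, and the "YYYY-MM-DD" partition-id format are just placeholders for illustration:

```python
from datetime import datetime, timedelta

def last_6_months_available(target_partition, available_partitions):
    """Keep only the partitions that fall in the ~6 months before the
    target partition AND actually exist (placeholder logic, not DSS API)."""
    fmt = "%Y-%m-%d"                               # e.g. "2022-12-08"
    target = datetime.strptime(target_partition, fmt)
    window_start = target - timedelta(days=183)    # roughly 6 months
    wanted = []
    for p in available_partitions:
        d = datetime.strptime(p, fmt)
        if window_start <= d <= target:
            wanted.append(p)
    return sorted(wanted)

# Example: 2022-12-07 is missing from the list, so it is simply not returned
print(last_6_months_available(
    "2022-12-08",
    ["2022-12-08", "2022-12-06", "2022-06-01", "2021-01-01"]))
```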
What are the best options?
Would it be possible to provide examples for both options: 1) how to filter partitions using variables, e.g. DKU_PARTITION_Filter or a source-specific filter such as (DKU_DST_DATE – 6 months), and 2) a Python example that selects the partitions available in the last 6 months, reads them in the best way, and writes the result to the same partition in the output? A rough sketch of what I have in mind for option 2 is below.
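I'm not certain these are the right calls, so please correct me: the dataset names are placeholders, I'm assuming partition ids look like "2022-12-08", that dku_flow_variables exposes DKU_DST_DATE in a partitioned recipe, and that add_read_partitions() can be called once per partition before the first read.

```python
import dataiku
from datetime import datetime, timedelta

# Placeholder dataset names
input_ds = dataiku.Dataset("my_partitioned_input")
output_ds = dataiku.Dataset("my_partitioned_output")

# Target day of the run: from the flow variables if present, else "today"
target_str = dataiku.dku_flow_variables.get(
    "DKU_DST_DATE", datetime.now().strftime("%Y-%m-%d"))
target = datetime.strptime(target_str, "%Y-%m-%d")
window_start = target - timedelta(days=183)   # roughly 6 months

# Keep only the partitions that exist AND fall inside the window
available = input_ds.list_partitions()
wanted = [p for p in available
          if window_start <= datetime.strptime(p, "%Y-%m-%d") <= target]

# Tell the recipe which partitions to read (assumption: one call per
# partition before the first read is allowed)
for p in wanted:
    input_ds.add_read_partitions(p)

df = input_ds.get_dataframe()

# Assumption: the output partition is the one set on the recipe in the Flow,
# so the write simply lands there
output_ds.write_with_schema(df)
```

I'm also not sure whether anything more is needed on the write side to land the data in the matching output partition, or whether the recipe's output partition in the Flow takes care of that.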
This would be a tremendous help!
Many thanks,
Henk