Discover all of the brand-new features and improvements to existing capabilities in the Dataiku 11.3 updateLET'S GO

File based partitioning

Henk
Level 1
File based partitioning

Hello,

I'm using file based partitioning, whereby the partitions are derived from the file name: Filename_YYYY_MM_DD/<files> .  The partition pattern: /Filename_%Y-%M-%D/.*

So far so good. Normally I receive a new file every day, but unfortunately, sometimes a day is skipped and the file is missing.  Also, data goes back a few years so there are more than 1300 partitions.  I would like to only select partitions from the last 6 months, but only those that are available.

When using a SQLSpark recipe I can define Last Available, All Available, or a time range etc (in the input/output tab).  Unfortunately, the time range option doesn't check whether a partition is available or not and the recipe fails when a file/partition is missing.  I could write a Python Dependency function, but also here no option to return only available partitions?

What are the best options? 

  • I could use substitution variables, but the partition key is not found in the data, it is derived from the filename?  How could I use one of the substitution variables and select all available partitions of the last 6 months?
  • Or would it be better to use a pySpark node where I retrieve the list of available partitions (with .list_partitions(raise_if_empty=True), then define which data to read (with .add_read_partitions), and to what partitions to write (with .set_write_partitions ???)

Would it be possible to provide some examples for both options 1) how to filter partitions using variables, e.g. DKU_PARTITION_Filter or a source specific filters > (DKU_DST_DATE – 6 months), or 2) a Python example to select the partitions available in the last 6 months, how best to read these, and write these to the same partition in the output?

This would be a tremendous help!

Many thanks,

Henk

0 Kudos
0 Replies