
file-based partitions - collect ALL partitions with several partitioning dimensions

Solved!
tanguy
Level 5

I would like to convert a table partitioned along several dimensions into a non-partitioned table.

So it seems logical to use the "all available" partition dependency function for each dimension, as in the screenshot below:

collect_1.jpg

However, this can result in an error, because Dataiku seems to:

  1. parse all available patterns in each dimension independently
  2. combine the patterns found in each dimension with a cartesian product
  3. try to build the output table from all those combinations

So I encounter an error: some of the combinations Dataiku generates do not exist in my data.

collect_2.jpg

For example, as seen in the screenshot above, the partition "2023/44" (where "2023" is the year and "44" is the week number) does not exist yet (at the time of writing, we are in week 18 of 2023).
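The cartesian-product behaviour is easy to reproduce outside Dataiku. The sketch below uses made-up partition values mirroring the year/week example above: parsing each dimension independently and then recombining the values requests partitions that were never written.

```python
from itertools import product

# Made-up partition identifiers actually present in the data
# (dimension 1 = year, dimension 2 = week number).
existing = {("2022", "44"), ("2023", "18")}

# Values found when each dimension is parsed independently.
years = sorted({y for y, _ in existing})   # ['2022', '2023']
weeks = sorted({w for _, w in existing})   # ['18', '44']

# "All available" then requests the cartesian product of the two lists...
requested = set(product(years, weeks))     # 4 combinations

# ...which includes combinations that were never written, e.g. ('2023', '44').
missing = requested - existing
```

Here `missing` contains `('2023', '44')` and `('2022', '18')`: two partitions the build would look for even though they do not exist on storage.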

So, isn't there a simple way to collect all currently *existing* partitions (rather than all theoretical combinations, which is how Dataiku appears to work)?

cc : @ElieA 

3 Replies
SarinaS
Dataiker

Hi @tanguy,

I think the easiest option would be to create a brand new input S3 dataset that points to the same location as your partitioned dataset. You can then leave partitioning disabled on that dataset, and your entire dataset will be read in from S3 🙂

Let me know if that doesn't work as an option for you!

Thanks,
Sarina 

tanguy
Level 5
Author

Hi @SarinaS,

Thank you for your answer.

Indeed, I resorted to this solution (which I have also used to point at a higher partitioning granularity, e.g. the year level in the example above).

IMHO, it is not completely satisfactory though, as it breaks the lineage in the flow and decreases the pipeline's readability.

I believe it would be nice to offer an "all *existing* available" dependency function, which would parse the partitions that actually exist (as in the solution you propose) instead of generating all possible combinations of the partitioning dimensions (as the "all available" dependency function currently does).
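Until such a dependency function exists, a similar effect can be scripted inside a Python recipe, which keeps the lineage in the flow. This is only a sketch (the dataset names are hypothetical, and it assumes the recipe is free to choose its read partitions), but `Dataset.list_partitions()` returns only the partitions that actually exist on storage, so it avoids the cartesian product entirely:

```python
import dataiku

# Hypothetical dataset names, for illustration only.
src = dataiku.Dataset("my_partitioned_dataset")
out = dataiku.Dataset("my_flat_dataset")

# list_partitions() enumerates the partitions that really exist,
# not the cartesian product of the per-dimension values.
for partition_id in src.list_partitions():
    src.add_read_partitions(partition_id)

# Read the selected partitions and write them all
# to a non-partitioned output dataset.
df = src.get_dataframe()
out.write_with_schema(df)
```

This only runs inside a Dataiku recipe, and depending on data volume a chunked read may be preferable to a single `get_dataframe()` call.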

SarinaS
Dataiker

Hi @tanguy, indeed I see your point. Keeping the connection in the flow would be more natural. I will pass this feedback along to our product team on your behalf.
