file-based partitions - collect ALL partitions with several partitioning dimensions
I would like to convert a partitioned table that has several partitioning dimensions into a non-partitioned table.
So it seems logical to use the "all available" partition dependency function for each dimension, as in the screenshot below:
However, this can result in an error, as Dataiku seems to:
- parse all available patterns in each dimension independently
- combine each pattern found in each dimension with a cartesian product
- try to build the output table from all those combinations
So, as expected, I encounter an error because Dataiku cannot find certain combinations that do not exist in my data.
For example, as seen in the screenshot above, the pattern "2023/44" (where "2023" is the year and "44" is the week number) does not exist yet: at the time of writing, we are in week 18 of 2023.
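To make the behaviour I suspect more concrete, here is a minimal sketch in plain Python. It does not call any Dataiku API, and the partition values are made up for illustration. It only shows how combining the values found in each dimension independently produces combinations that were never written:

```python
from itertools import product

# Hypothetical data: the (year, week) partitions that actually exist.
existing = {("2022", "44"), ("2022", "52"), ("2023", "01"), ("2023", "18")}

# Values seen in each dimension, taken independently of the other dimension.
years = sorted({year for year, _ in existing})
weeks = sorted({week for _, week in existing})

# Cartesian product of the two dimensions: every theoretical combination.
requested = set(product(years, weeks))

# Combinations that would be requested although they do not exist.
missing = sorted(requested - existing)

print(f"{len(requested)} combinations requested, {len(existing)} exist")
for year, week in missing:
    print(f"missing partition: {year}/{week}")
```

With this sample data, "2023/44" ends up in the requested set only because week 44 exists for 2022.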
So, isn't there a simple way to collect all currently available partitions (and not all theoretical partitions, which is how Dataiku works, IMO)?
cc : @ElieA
Best Answer
Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
Hi @tanguy,
I think the easiest option would be to simply create a brand new input S3 dataset that points to the same location as your partitioned dataset. You can then leave partitioning disabled on that dataset, and your entire dataset will be read in from S3. Let me know if that doesn't work as an option for you!
Thanks,
Sarina
Answers
Tanguy Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2023 Posts: 118 Neuron
Hi @SarinaS,
Thank you for your answer.
Indeed, I resorted to this solution (which I have also used to point to a higher partitioning granularity, e.g. at the year level in the above example).
IMHO, it is not completely satisfactory though, as it breaks the lineage in the flow and decreases the pipeline's readability.
I believe it would be nice to offer an "all *existing* available" dependency function (which would parse the existing partitions, as in the solution you propose, and not find all possible combinations between partitioning dimensions, as the "all available" dependency function currently does).
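As a rough illustration of what I mean by "existing" partitions, here is a small sketch in plain Python. The file paths, the `year/week` layout, and the function name `existing_partitions` are assumptions made up for the example and are not part of Dataiku. The idea is to derive the partition list from the files that are actually present rather than from a product of the dimensions:

```python
import re

# Assumed file layout: /<year>/<week>/<file>, e.g. /2023/18/part-0.csv
PATTERN = re.compile(r"^/?(?P<year>\d{4})/(?P<week>\d{1,2})/")


def existing_partitions(paths):
    """Return the sorted (year, week) pairs that appear in the given paths."""
    found = set()
    for path in paths:
        match = PATTERN.match(path)
        if match:  # ignore files that do not follow the partitioning pattern
            found.add((match.group("year"), match.group("week")))
    return sorted(found)


# Hypothetical listing of the files behind the partitioned dataset.
paths = [
    "/2022/44/part-0.csv",
    "/2022/44/part-1.csv",
    "/2023/18/part-0.csv",
    "/README.txt",
]

for year, week in existing_partitions(paths):
    print(f"{year}/{week}")
```

Only the combinations backed by at least one file are returned, so a pattern such as "2023/44" would never be requested.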