
file-based partitions - collect ALL partitions with several partitioning dimensions

Solved!
tanguy
Level 5

I would like to convert a table partitioned along several dimensions into a non-partitioned table.

So it seems logical to use the "all available" partition dependency function for each dimension, as in the screenshot below:

collect_1.jpg

However, this can result in an error, because Dataiku seems to:

  1. parse all available patterns in each dimension independently
  2. combine the patterns found in each dimension with a cartesian product
  3. try to build the output table from all those combinations

So I encounter an error: some of the combinations Dataiku generates do not exist in my data.

collect_2.jpg

For example, as seen in the screenshot above, the partition "2023/44" (where "2023" is the year and "44" is the week number) does not exist yet (at the time of writing, we are in week 18 of 2023).
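The cartesian-product behaviour is easy to reproduce outside Dataiku. The sketch below uses made-up partition values mirroring the year/week example above: parsing each dimension independently and then recombining the values requests partitions that were never written.

```python
from itertools import product

# Made-up partition identifiers actually present in the data
# (dimension 1 = year, dimension 2 = week number).
existing = {("2022", "44"), ("2023", "18")}

# Values found when each dimension is parsed independently.
years = sorted({y for y, _ in existing})   # ['2022', '2023']
weeks = sorted({w for _, w in existing})   # ['18', '44']

# "All available" then requests the cartesian product of the two lists...
requested = set(product(years, weeks))     # 4 combinations

# ...which includes combinations that were never written, e.g. ('2023', '44').
missing = requested - existing
```

Here `missing` contains `('2023', '44')` and `('2022', '18')`: two partitions the build would look for even though they do not exist on storage.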

So, isn't there a simple way to collect all currently *existing* partitions (rather than all theoretical combinations, which is how Dataiku appears to work)?

cc : @ElieA 

3 Replies
SarinaS
Dataiker

Hi @tanguy,

I think the easiest option would be to create a brand new input S3 dataset that points to the same location as your partitioned dataset. You can then leave partitioning disabled on that dataset, and your entire dataset will be read in from S3 🙂

Let me know if that doesn't work as an option for you!

Thanks,
Sarina 

tanguy
Level 5
Author

Hi @SarinaS,

Thank you for your answer.

Indeed, I resorted to this solution (which I have also used to point at a higher partitioning granularity, e.g. the year level in the example above).

IMHO, it is not completely satisfactory though, as it breaks the lineage in the flow and decreases the pipeline's readability.

I believe it would be nice to offer an "all *existing* available" dependency function, which would parse the partitions that actually exist (as in the solution you propose) instead of generating all possible combinations of the partitioning dimensions (as the "all available" dependency function currently does).
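Until such a dependency function exists, a similar effect can be scripted inside a Python recipe, which keeps the lineage in the flow. This is only a sketch (the dataset names are hypothetical, and it assumes the recipe is free to choose its read partitions), but `Dataset.list_partitions()` returns only the partitions that actually exist on storage, so it avoids the cartesian product entirely:

```python
import dataiku

# Hypothetical dataset names, for illustration only.
src = dataiku.Dataset("my_partitioned_dataset")
out = dataiku.Dataset("my_flat_dataset")

# list_partitions() enumerates the partitions that really exist,
# not the cartesian product of the per-dimension values.
for partition_id in src.list_partitions():
    src.add_read_partitions(partition_id)

# Read the selected partitions and write them all
# to a non-partitioned output dataset.
df = src.get_dataframe()
out.write_with_schema(df)
```

This only runs inside a Dataiku recipe, and depending on data volume a chunked read may be preferable to a single `get_dataframe()` call.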

SarinaS
Dataiker

Hi @tanguy, indeed I see your point. Keeping the connection in the flow would be more natural. I will pass this feedback along to our product team on your behalf.
