file-based partitions - collect ALL partitions with several partitioning dimensions

Tanguy
Tanguy Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2023 Posts: 118 Neuron

I would like to convert a partitioned table that has several partitioning dimensions into a non-partitioned table.

So it seems logical to use the "all available" partition dependency function for each dimension, as in the screenshot below:

collect_1.jpg

However, this can result in an error, as Dataiku seems to:

  1. parse all available patterns in each dimension independently
  2. combine each pattern found in each dimension with a cartesian product
  3. try to build the output table from all those combinations

So, as expected, I encounter an error because Dataiku looks for combinations that do not exist in my data.

collect_2.jpg

For example, as seen in the screenshot above, the pattern "2023/44" (where "2023" is the year and "44" is the week number) does not exist yet (at the time of writing, we are in week 18 of 2023).
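
To make the suspected behaviour concrete, here is a minimal sketch in plain Python. It does not use the Dataiku API, and the partition values are hypothetical; it only mimics the three steps listed above to show why a combination such as "2023/44" ends up being requested.

```python
from itertools import product

# Hypothetical partitions actually present in storage, as (year, week) pairs.
existing = {("2022", "44"), ("2022", "45"), ("2023", "17"), ("2023", "18")}

# Steps 1 and 2: collect the values seen in each dimension independently,
# then combine them with a cartesian product.
years = sorted({year for year, _ in existing})
weeks = sorted({week for _, week in existing})
theoretical = set(product(years, weeks))

# Step 3 would fail on the combinations that were never written.
missing = sorted(theoretical - existing)
print(["/".join(pair) for pair in missing])
```

In this toy data set, half of the theoretical combinations have no data behind them, which matches the kind of error shown in the screenshot.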

So, isn't there a simple way to collect all currently available partitions (and not all theoretical partitions, which, IMO, is what Dataiku does)?

cc : @ElieA

Best Answer

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
    Answer ✓

    Hi @tanguy,

    I think the easiest option would be to create a brand new input S3 dataset that points to the same location as your partitioned dataset. You can then leave partitioning disabled on that dataset, and your entire dataset will be read in from S3. Let me know if that doesn't work as an option for you!

    Thanks,
    Sarina

Answers

  • Tanguy

    Hi @SarinaS,

    Thank you for your answer.

    Indeed, I resorted to this solution (which I have also used to point to a higher partitioning granularity, e.g. at the year level in the above example).

    IMHO, it is not completely satisfactory though, as it breaks the lineage in the flow and decreases the readability of the pipeline.

    I believe it would be nice to offer an "all *existing* available" dependency function (which would parse the existing partitions, as in the solution you propose, and not find all possible combinations between partitioning dimensions, as the "all available" dependency function currently does).
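
    To illustrate what I have in mind, here is a rough sketch in plain Python (again not the Dataiku API). It assumes a hypothetical file layout of `<year>/<week>/<files>` under a root folder, and the helper name `list_existing_partitions` is made up for this example. It simply walks the storage and reports the combinations that actually contain files, instead of combining the dimensions.

```python
import os
import tempfile


def list_existing_partitions(root, depth=2):
    """Return partition ids such as '2023/18' for non-empty folders `depth` levels below root."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == "." else rel.split(os.sep)
        if len(parts) == depth and filenames:
            found.add("/".join(parts))
        if len(parts) >= depth:
            dirnames[:] = []  # do not descend below the partition level
    return sorted(found)


# Small demo on a temporary folder with three hypothetical partitions.
with tempfile.TemporaryDirectory() as root:
    for year, week in [("2022", "44"), ("2023", "17"), ("2023", "18")]:
        folder = os.path.join(root, year, week)
        os.makedirs(folder)
        open(os.path.join(folder, "part.csv"), "w").close()
    partitions = list_existing_partitions(root)

print(partitions)
```

    With this approach "2023/44" is never requested, because only folders that exist and hold data are listed.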

  • Sarina

    Hi @tanguy, indeed I see your point. Making the connection in the flow more natural would be good. I will pass this feedback along to our product team on your behalf.
