Join partition and unpartitioned dataset on partition key

Options
julesbertrand
julesbertrand Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 9 Partner

Hey,

I am trying to join a partitioned dataset (by month_key, type str not date) with a small unpartitioned dataset in which month_key is a column. I went to join them on 2 columns including month_key, and I want my output dataset to be unpartitioned. Is there a way to do it other than de-partitioning one of the datasets please ?

Thanks,

Jules


Operating system used: MacOS

Best Answer

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron
    Answer ✓
    Options

    Hi @julesbertrand

    Is your unpartitioned dataset small enough that it can fit in memory?
    If yes then one approach you can use is to leverage the prepare recipe.

    1. Define the input to the prepare recipe as your partitioned dataset (data A)
    2. The output to the prepare recipe (data C) should be defined as Not Partitioned
    3. In the prepare recipe add a step Enrich records with Files info and define a name(Ex: part_date) in the parameter Output Partition column this will create a new column part_date
    4. now in the prepare recipe add a new step join with other dataset (memory based) here you can define the part_date column as the join column and your smaller dataset (dataB) as the join dataset and can define all columns you need to pull from the small dataset and and prefix you wish to define.
    5. you can then run the recipe

    Please note the part_date column is derived at runtime hence when designing the prepare recipe you may not see any value in it.
    Hope i've understood your question.

Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 411 Neuron
    Options

    Hi,

    I guess the problem you are having is because the month_key is not a column in the partitioned dataset, and in that case, the solution by @NN
    will work. Never thought before of this approach!

    I was thinking that perhaps another solution, specially since you say the second dataset is small, is to activate the partitioning for this small one, and then join both datasets with the join recipe. But I'm not sure that the output can be left unpartitioned in that case...

    Cheers

  • julesbertrand
    julesbertrand Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 9 Partner
    Options

    @NN
    it works like a charm ! However if you have a bigger unpartitioned dataset, I think you have to build a partitioned output dataset and then un-partition it, in terms of efficiency.

    Thanks !

Setup Info
    Tags
      Help me…