Submit your inspiring success story or innovative use case to the 2022 Dataiku Frontrunner Awards! ENTER YOUR SUBMISSION

Join partition and unpartitioned dataset on partition key

Solved!
julesbertrand
Level 2
Level 2
Join partition and unpartitioned dataset on partition key

Hey,

I am trying to join a partitioned dataset (by month_key, type str not date) with a small unpartitioned dataset in which month_key is a column. I went to join them on 2 columns including month_key, and I want my output dataset to be unpartitioned. Is there a way to do it other than de-partitioning one of the datasets please ?

Thanks,

Jules


Operating system used: MacOS

0 Kudos
1 Solution
NN
Neuron
Neuron

Hi @julesbertrand 

Is your unpartitioned dataset small enough that it can fit in memory?
If yes then one approach you can use is to leverage the prepare recipe.

1. Define the input to the prepare recipe as your partitioned dataset (data A)
2. The output to the prepare recipe (data C) should be defined as Not Partitioned
3. In the prepare recipe add a step Enrich records with Files info and define a name(Ex: part_date) in the parameter Output Partition column this will create a new column part_date
4. now in the prepare recipe add a new step join with other dataset (memory based) here you can define the part_date column as the join column and your smaller dataset (dataB) as the join dataset and can define all columns you need to pull from the small dataset and and prefix you wish to define.
5. you can then run the recipe

Please note the part_date column is derived at runtime hence when designing the prepare recipe you may not see any value in it.
Hope i've understood your question. 

View solution in original post

3 Replies
NN
Neuron
Neuron

Hi @julesbertrand 

Is your unpartitioned dataset small enough that it can fit in memory?
If yes then one approach you can use is to leverage the prepare recipe.

1. Define the input to the prepare recipe as your partitioned dataset (data A)
2. The output to the prepare recipe (data C) should be defined as Not Partitioned
3. In the prepare recipe add a step Enrich records with Files info and define a name(Ex: part_date) in the parameter Output Partition column this will create a new column part_date
4. now in the prepare recipe add a new step join with other dataset (memory based) here you can define the part_date column as the join column and your smaller dataset (dataB) as the join dataset and can define all columns you need to pull from the small dataset and and prefix you wish to define.
5. you can then run the recipe

Please note the part_date column is derived at runtime hence when designing the prepare recipe you may not see any value in it.
Hope i've understood your question. 

Ignacio_Toledo

Hi,

I guess the problem you are having is because the month_key is not a column in the partitioned dataset, and in that case, the solution by @NN will work. Never thought before of this approach!

I was thinking that perhaps another solution, specially since you say the second dataset is small, is to activate the partitioning for this small one, and then join both datasets with the join recipe. But I'm not sure that the output can be left unpartitioned in that case...

Cheers

julesbertrand
Level 2
Level 2
Author

@NN it works like a charm ! However if you have a bigger unpartitioned dataset, I think you have to build a partitioned output dataset and then un-partition it, in terms of efficiency.

Thanks !