Joining partitioned datasets

cly91 · July 2020

Hi, I have two partitioned datasets and I would like to join them on a common key. The issue I have is the common key doesn't fall in the same partition (eg dataset 1 - 2020-07-17 & dataset 2 - 2020-07-01)

The difference in the number of days between dataset 1 and 2 could be anywhere from 0 to 365 prior

What's the best strategy for joining these two datasets

Clément_Stenac · July 2020

Hi,

The main thing that you will need is to use an "All available" dependency on at least one of the sides of the join, so that all partitions are considered.

For example, you may have the "main" dataset A, and a "reference" (left-joined) dataset B, writing into output dataset C.

All of A,B,C are partitioned by day. Your join recipe could have:

An "equal" dependency from A to C
An "all available" dependency from B to C

This way, C will contain the rows of A, per day, enriched by considering all days of B.

Please be aware of the fact that "all available" really means "partitions of B" already computed. We have more details on this here: https://doc.dataiku.com/dss/latest/partitions/dependencies.html#note-about-available-dependencies

Joining partitioned datasets

Answers

Categories

Setup Info

Tags