Joining partitioned datasets
Hi, I have two partitioned datasets and I would like to join them on a common key. The issue I have is the common key doesn't fall in the same partition (eg dataset 1 - 2020-07-17 & dataset 2 - 2020-07-01)
The difference in the number of days between dataset 1 and 2 could be anywhere from 0 to 365 prior
What's the best strategy for joining these two datasets
Answers
-
Hi,
The main thing that you will need is to use an "All available" dependency on at least one of the sides of the join, so that all partitions are considered.
For example, you may have the "main" dataset A, and a "reference" (left-joined) dataset B, writing into output dataset C.
All of A,B,C are partitioned by day. Your join recipe could have:
- An "equal" dependency from A to C
- An "all available" dependency from B to C
This way, C will contain the rows of A, per day, enriched by considering all days of B.
Please be aware of the fact that "all available" really means "partitions of B" already computed. We have more details on this here: https://doc.dataiku.com/dss/latest/partitions/dependencies.html#note-about-available-dependencies