Joining partitioned datasets

Options
cly91
cly91 Registered Posts: 1 ✭✭✭

Hi, I have two partitioned datasets and I would like to join them on a common key. The issue I have is the common key doesn't fall in the same partition (eg dataset 1 - 2020-07-17 & dataset 2 - 2020-07-01)

The difference in the number of days between dataset 1 and 2 could be anywhere from 0 to 365 prior

What's the best strategy for joining these two datasets

Tagged:

Answers

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer Posts: 753 Dataiker
    Options

    Hi,

    The main thing that you will need is to use an "All available" dependency on at least one of the sides of the join, so that all partitions are considered.

    For example, you may have the "main" dataset A, and a "reference" (left-joined) dataset B, writing into output dataset C.

    All of A,B,C are partitioned by day. Your join recipe could have:

    • An "equal" dependency from A to C
    • An "all available" dependency from B to C

    This way, C will contain the rows of A, per day, enriched by considering all days of B.

    Please be aware of the fact that "all available" really means "partitions of B" already computed. We have more details on this here: https://doc.dataiku.com/dss/latest/partitions/dependencies.html#note-about-available-dependencies

Setup Info
    Tags
      Help me…