Record unmatched rows from two datasets (without code recipe)

yesitsmeoffical
yesitsmeoffical Registered Posts: 15

I would like to compare Dataset 1 and Dataset 2 (see the sample below), and record all the differences in Dataset 3. I have been exploring the functions in the Join recipe but no luck so far.

Note that in Dataset 1 the ID field (join key) is unsorted, I could manually sort it in the prepare recipe but I would like to automate the sorting if possible. I tried "sort array" but it didn't seem to work.

What's the best way to create Dataset 3 without using code recipe?

Dataset 1

IDGenderAge
A C2F20
A A4F40
A A5F50
A D1M10

Dataset 2

IDGenderAge
A D1M10
A C2M20
A A3M30
A A4F44

Dataset 3 (record all the unmatched rows)

IDGenderAgeSource
A C2F20Dataset 1
A C2M20Dataset 2
A A3M30Dataset 2
A A4F40Dataset 1
A A4F44Dataset 2
A A5F50Dataset 1

Operating system used: win 11

Answers

  • NN
    NN Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 145 Neuron

    Hi ,

    Till someone gives you a better solution
    One approach you can use if you dont have too many columns is

    use a stack recipe to stack the 2 datasets , it has the option to generate a source column as well to tell you which dataset the row came from.

    Use a group by recipe to group by all the columns except the source column
    And put a post filter to only show rows where count=1

    this will only drop all rows which are a total match in both datasets.

  • yesitsmeoffical
    yesitsmeoffical Registered Posts: 15
    edited July 17

    Thanks, this is a good approach.

    In the actual table there are about 50 columns so adding all of them manually is a bit of work.

    I was wondering if a computed column can be used as the group key, where this computed column is a for-loop on the number of columns minus 1 (the last one being the source column)?

    Basically the python code below but in Dataiku syntax:

    df['concat'] = ''
    for col in df.columns:
        df['concat'] += df[col].fillna('')
    df['concat'] = df['concat'].str.rstrip()

    Also in the group recipe, since the "Source" column is not part of the group keys and will be dropped in the output table so that Dataset 3 will look like this, but then it's hard to tell the source apart. Is there a way to readd the "Source" column after grouping?

    IDGenderAge
    A C2F20
    A C2M20
    A A3M30
    A A4F40
    A A4F44
    A A5F50
  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,877 Neuron

    Below is how I did it without any code recipes and only took 1min to build. Join your datasets twice, first Dataset 1 joined to Dataset 2 and then Dataset 2 joined to Dataset 1. In both join recipes select the join type as "Left anti join" and select all the columns to be equal to each other (Dataiku will do this automatically if the columns are named the same on both datasets). This will in effect find the differences missing in each of the datasets in two new datasets. Then use the Stack recipe to join the results together. Don't forget to turn the Origin Column On in the Stack recipe to see where the records come from.

    Screenshot 2023-12-30 at 00.05.28.png

    Screenshot 2023-12-30 at 00.07.39.pngScreenshot 2023-12-30 at 00.08.03.pngScreenshot 2023-12-30 at 00.07.56.png

    Screenshot 2023-12-30 at 00.12.22.png

    I think @NN
    solution might be faster for larger datasets since it will have only 2 visual recipes and a Stack and Group By will be less hrd to compute than 2 left anti joins and a Stack recipe. However I think my solution will be slightly easier to debug as the differences can be seen in separate datasets as well as in a joined one.

Setup Info
    Tags
      Help me…