Record unmatched rows from two datasets (without code recipe)

yesitsmeoffical · ‎12-29-2023

I would like to compare Dataset 1 and Dataset 2 (see the sample below), and record all the differences in Dataset 3. I have been exploring the functions in the Join recipe but no luck so far.

Note that in Dataset 1 the ID field (join key) is unsorted, I could manually sort it in the prepare recipe but I would like to automate the sorting if possible. I tried "sort array" but it didn't seem to work.

What's the best way to create Dataset 3 without using code recipe?

Dataset 1

ID	Gender	Age
A C2	F	20
A A4	F	40
A A5	F	50
A D1	M	10

Dataset 2

ID	Gender	Age
A D1	M	10
A C2	M	20
A A3	M	30
A A4	F	44

Dataset 3 (record all the unmatched rows)

ID	Gender	Age	Source
A C2	F	20	Dataset 1
A C2	M	20	Dataset 2
A A3	M	30	Dataset 2
A A4	F	40	Dataset 1
A A4	F	44	Dataset 2
A A5	F	50	Dataset 1

Operating system used: win 11

NN · ‎12-29-2023

Hi ,

Till someone gives you a better solution
One approach you can use if you dont have too many columns is

use a stack recipe to stack the 2 datasets , it has the option to generate a source column as well to tell you which dataset the row came from.

Use a group by recipe to group by all the columns except the source column
And put a post filter to only show rows where count=1

this will only drop all rows which are a total match in both datasets.

yesitsmeoffical · ‎12-29-2023

Thanks, this is a good approach.

In the actual table there are about 50 columns so adding all of them manually is a bit of work.

I was wondering if a computed column can be used as the group key, where this computed column is a for-loop on the number of columns minus 1 (the last one being the source column)?

Basically the python code below but in Dataiku syntax:

df['concat'] = ''
for col in df.columns:
    df['concat'] += df[col].fillna('')
df['concat'] = df['concat'].str.rstrip()

Also in the group recipe, since the "Source" column is not part of the group keys and will be dropped in the output table so that Dataset 3 will look like this, but then it's hard to tell the source apart. Is there a way to readd the "Source" column after grouping?

ID	Gender	Age
A C2	F	20
A C2	M	20
A A3	M	30
A A4	F	40
A A4	F	44
A A5	F	50

Turribeach · ‎12-30-2023

Below is how I did it without any code recipes and only took 1min to build. Join your datasets twice, first Dataset 1 joined to Dataset 2 and then Dataset 2 joined to Dataset 1. In both join recipes select the join type as "Left anti join" and select all the columns to be equal to each other (Dataiku will do this automatically if the columns are named the same on both datasets). This will in effect find the differences missing in each of the datasets in two new datasets. Then use the Stack recipe to join the results together. Don't forget to turn the Origin Column On in the Stack recipe to see where the records come from.

I think @NN solution might be faster for larger datasets since it will have only 2 visual recipes and a Stack and Group By will be less hrd to compute than 2 left anti joins and a Stack recipe. However I think my solution will be slightly easier to debug as the differences can be seen in separate datasets as well as in a joined one.

Sign up to take part

Record unmatched rows from two datasets (without code recipe)

Record unmatched rows from two datasets (without code recipe)

Setup info