Discover this year's submissions to the Dataiku Frontrunner Awards and give kudos to your favorite use cases and success stories!READ MORE

To compare two datasets and find out the difference

pranauv
Level 1
To compare two datasets and find out the difference

Hi All,

I am trying to create a project flow where I am having two datasets let us say 
Dataset 1:
Name      id
xxxx        user1
yyyy        user2

Dataset 2:
Name     id
aaaa      user3
xxxx       user1
yyyy       user2
zzzz       user4

In both the datasets the id column is unique, I need to compare both the datasets and my output dataset should have 
output dataset:
Name    id
zzzz      user4
aaaa      user3

Which means I need to compare the dataset1 with dataset2 and if any of the users in dataset1 is not available in dataset2 I need to take them and put it as an output dataset.

Similarly I want the vice versa of the scenario where I also want an output dataset which  compare the dataset1 with dataset2 and if any of the users in dataset1 is available in dataset2 I need to take them and put it as an output dataset.

Output dataset :
Name    id
xxxx      user1
yyyy      user2

Can you help me to achieve this?

0 Kudos
1 Reply
fchataigner2
Dataiker
Dataiker

Hi,

you can do a left outer join of dataset 2 (on the left) with dataset 1 (on the right), then have a post-filter in the Join recipe to only keep rows where right_id (or whatever name you give that column) is not defined.

If the id are unique, then an inner join will give only rows for which there is a match. You can have an inner join by changing the join type in the modal of the join

Screenshot 2021-05-19 at 12.27.53.png