Announcing the winners & finalists of the Dataiku Frontrunner Awards 2021! Read their inspiring stories

To compare two datasets and find out the difference

pranauv
Level 1
To compare two datasets and find out the difference

Hi All,

I am trying to create a project flow where I am having two datasets let us say 
Dataset 1:
Name      id
xxxx        user1
yyyy        user2

Dataset 2:
Name     id
aaaa      user3
xxxx       user1
yyyy       user2
zzzz       user4

In both the datasets the id column is unique, I need to compare both the datasets and my output dataset should have 
output dataset:
Name    id
zzzz      user4
aaaa      user3

Which means I need to compare the dataset1 with dataset2 and if any of the users in dataset1 is not available in dataset2 I need to take them and put it as an output dataset.

Similarly I want the vice versa of the scenario where I also want an output dataset which  compare the dataset1 with dataset2 and if any of the users in dataset1 is available in dataset2 I need to take them and put it as an output dataset.

Output dataset :
Name    id
xxxx      user1
yyyy      user2

Can you help me to achieve this?

0 Kudos
1 Reply
fchataigner2
Dataiker
Dataiker

Hi,

you can do a left outer join of dataset 2 (on the left) with dataset 1 (on the right), then have a post-filter in the Join recipe to only keep rows where right_id (or whatever name you give that column) is not defined.

If the id are unique, then an inner join will give only rows for which there is a match. You can have an inner join by changing the join type in the modal of the join

Screenshot 2021-05-19 at 12.27.53.png

0 Kudos
A banner prompting to get Dataiku DSS