Ready for Dataiku 10? Try out the Crash Course on new features!GET STARTED

Join datasets A,B and replace matching entries with data from all columns in B

Flo_G
Level 2
Join datasets A,B and replace matching entries with data from all columns in B

I have the following dataset A:

entrytag_1tag_2tag_3tag_4
aasdertdfgghj
basdertdfgghj
casdertdfgghj

 

And the following dataset B:

entrytag_2tag_4
aqwerty
cqwerty

 

I would like to join them together so that when there's an entry match, the column in dataset A is filled with the values from dataset B. If there is no match, we keep the data from A. So the output is:

entrytag_1tag_2tag_3tag_4
aasdqwedfgrty
basdertdfgghj
casdqwedfgrty

 

What is the most efficient way to go about it? I had considered creating an inner join A + B where I only keep data from B columns, then a left join A + B where I exclude all entries that match between A and B, and finally stack the two - but this seems overly complicated?

I would also not consider the "Computed columns" option (e.g. with an if formula to fill in with B values) because in my real dataset I would have to compute data for around 20 columns.

Any suggestion?

0 Kudos
3 Replies
tgb417
Neuron
Neuron

@Flo_G ,

Welcome to the Dataiku Community.

One of the cool things about Dataiku is that you can mix a variety of tools (SQL, R, Python, ... and the Built-in Visual tools) in a single data flow, using the most useful bits of all, and storing your data in a variety of data stores to meet the scale of the work you are doing.

I tend to back my Dataiku instance with a PostgreSQL server even on my laptop design node. So, I might use a SQL recipe to do this the most efficiently inside a SQL server:

Something like this is a SQL recipe:

Select A.entry                     as entry,
A.tag_1 as tag_1,
COALESCE(B.tag_2, A.tag_2) as tag_2,
B.tag_3 as tag_3,
COALESCE(B.tag_4, A.tag_4) as tag_4,
from A
Left Join B
on A.entry = B.entry    

This assumes you have a SQL Server of some sort.

If you are using built-in data storage you might try a Visual Join Recipe using a left join like above (which is the default) and a post join computed column for tag_2 and tag_4 that uses a formula something like this.

Assuming on the select column screen you have set the prefix to A for table A and B for Table B.

COALESCE(B_tag_2, A_tag_2)  

As I said at the top of this note there are lots of ways to pull off these things with DSS.  I'm sure others might think of an R way and a Python Way.  You might be able to do this with several steps in a single visual recipe.  Using a Join Step, followed by two formula steps to implement the Coalesce steps, followed by a bit of cleanup to get rid of the extra copies of columns.

Let us know how you get on with this.

Others please feel free to jump in with your favorite solution to this kind of problem.

--Tom
0 Kudos
Flo_G
Level 2
Author

@tgb417 thank you for the input! Indeed the file at the moment is stored locally, so I don't have the SQL recipe option.

My main concern was to avoid the work of coalescing every column individually (I have about 20 to replace), I'll see if I can write a simpler workaround in R or Python 🙂

Flo_G
Level 2
Author

At the end I decided to go with the join + stack option:

  • Inner join A and B, select all columns from B and unselect the corresponding columns from A - so all values are replaced
  • Left excluding join A and B: take from B only the ID column, then filter rows where B.ID is null (ref.  http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)
  • Stack A and B
  • For some reason one row is duplicated, so I run a distinct recipe to clean it up

Hope this helps someone else in the future!

A banner prompting to get Dataiku DSS