Join datasets A,B and replace matching entries with data from all columns in B

Flo_G · ‎09-22-2021

I have the following dataset A:

entry	tag_1	tag_2	tag_3	tag_4
a	asd	ert	dfg	ghj
b	asd	ert	dfg	ghj
c	asd	ert	dfg	ghj

And the following dataset B:

entry	tag_2	tag_4
a	qwe	rty
c	qwe	rty

I would like to join them together so that when there's an entry match, the column in dataset A is filled with the values from dataset B. If there is no match, we keep the data from A. So the output is:

entry	tag_1	tag_2	tag_3	tag_4
a	asd	qwe	dfg	rty
b	asd	ert	dfg	ghj
c	asd	qwe	dfg	rty

What is the most efficient way to go about it? I had considered creating an inner join A + B where I only keep data from B columns, then a left join A + B where I exclude all entries that match between A and B, and finally stack the two - but this seems overly complicated?

I would also not consider the "Computed columns" option (e.g. with an if formula to fill in with B values) because in my real dataset I would have to compute data for around 20 columns.

Any suggestion?

tgb417 · ‎09-23-2021

@Flo_G ,

Welcome to the Dataiku Community.

One of the cool things about Dataiku is that you can mix a variety of tools (SQL, R, Python, ... and the Built-in Visual tools) in a single data flow, using the most useful bits of all, and storing your data in a variety of data stores to meet the scale of the work you are doing.

I tend to back my Dataiku instance with a PostgreSQL server even on my laptop design node. So, I might use a SQL recipe to do this the most efficiently inside a SQL server:

Something like this is a SQL recipe:

Select A.entry                     as entry,
       A.tag_1                     as tag_1,
       COALESCE(B.tag_2, A.tag_2)  as tag_2,
       B.tag_3                     as tag_3,
       COALESCE(B.tag_4, A.tag_4)  as tag_4,
from A
     Left Join B
        on A.entry = B.entry

This assumes you have a SQL Server of some sort.

If you are using built-in data storage you might try a Visual Join Recipe using a left join like above (which is the default) and a post join computed column for tag_2 and tag_4 that uses a formula something like this.

Assuming on the select column screen you have set the prefix to A for table A and B for Table B.

COALESCE(B_tag_2, A_tag_2)

As I said at the top of this note there are lots of ways to pull off these things with DSS. I'm sure others might think of an R way and a Python Way. You might be able to do this with several steps in a single visual recipe. Using a Join Step, followed by two formula steps to implement the Coalesce steps, followed by a bit of cleanup to get rid of the extra copies of columns.

Let us know how you get on with this.

Others please feel free to jump in with your favorite solution to this kind of problem.

--Tom

Flo_G · ‎09-23-2021

@tgb417 thank you for the input! Indeed the file at the moment is stored locally, so I don't have the SQL recipe option.

My main concern was to avoid the work of coalescing every column individually (I have about 20 to replace), I'll see if I can write a simpler workaround in R or Python 🙂

Flo_G · ‎09-23-2021

At the end I decided to go with the join + stack option:

Inner join A and B, select all columns from B and unselect the corresponding columns from A - so all values are replaced
Left excluding join A and B: take from B only the ID column, then filter rows where B.ID is null (ref. http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)
Stack A and B
For some reason one row is duplicated, so I run a distinct recipe to clean it up

Hope this helps someone else in the future!

Sign up to take part

Join datasets A,B and replace matching entries with data from all columns in B

Join datasets A,B and replace matching entries with data from all columns in B