Join datasets A,B and replace matching entries with data from all columns in B
I have the following dataset A:
entry | tag_1 | tag_2 | tag_3 | tag_4 |
a | asd | ert | dfg | ghj |
b | asd | ert | dfg | ghj |
c | asd | ert | dfg | ghj |
And the following dataset B:
entry | tag_2 | tag_4 |
a | qwe | rty |
c | qwe | rty |
I would like to join them together so that when there's an entry match, the column in dataset A is filled with the values from dataset B. If there is no match, we keep the data from A. So the output is:
entry | tag_1 | tag_2 | tag_3 | tag_4 |
a | asd | qwe | dfg | rty |
b | asd | ert | dfg | ghj |
c | asd | qwe | dfg | rty |
What is the most efficient way to go about it? I had considered creating an inner join A + B where I only keep data from B columns, then a left join A + B where I exclude all entries that match between A and B, and finally stack the two - but this seems overly complicated?
I would also not consider the "Computed columns" option (e.g. with an if formula to fill in with B values) because in my real dataset I would have to compute data for around 20 columns.
Any suggestion?
Answers
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
@Flo_G
,Welcome to the Dataiku Community.
One of the cool things about Dataiku is that you can mix a variety of tools (SQL, R, Python, ... and the Built-in Visual tools) in a single data flow, using the most useful bits of all, and storing your data in a variety of data stores to meet the scale of the work you are doing.
I tend to back my Dataiku instance with a PostgreSQL server even on my laptop design node. So, I might use a SQL recipe to do this the most efficiently inside a SQL server:
Something like this is a SQL recipe:
Select A.entry as entry,
A.tag_1 as tag_1,
COALESCE(B.tag_2, A.tag_2) as tag_2,
B.tag_3 as tag_3,
COALESCE(B.tag_4, A.tag_4) as tag_4,
from A
Left Join B
on A.entry = B.entryThis assumes you have a SQL Server of some sort.
If you are using built-in data storage you might try a Visual Join Recipe using a left join like above (which is the default) and a post join computed column for tag_2 and tag_4 that uses a formula something like this.
Assuming on the select column screen you have set the prefix to A for table A and B for Table B.
COALESCE(B_tag_2, A_tag_2)
As I said at the top of this note there are lots of ways to pull off these things with DSS. I'm sure others might think of an R way and a Python Way. You might be able to do this with several steps in a single visual recipe. Using a Join Step, followed by two formula steps to implement the Coalesce steps, followed by a bit of cleanup to get rid of the extra copies of columns.
Let us know how you get on with this.
Others please feel free to jump in with your favorite solution to this kind of problem.
-
@tgb417
thank you for the input! Indeed the file at the moment is stored locally, so I don't have the SQL recipe option.My main concern was to avoid the work of coalescing every column individually (I have about 20 to replace), I'll see if I can write a simpler workaround in R or Python
-
At the end I decided to go with the join + stack option:
- Inner join A and B, select all columns from B and unselect the corresponding columns from A - so all values are replaced
- Left excluding join A and B: take from B only the ID column, then filter rows where B.ID is null (ref. http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins)
- Stack A and B
- For some reason one row is duplicated, so I run a distinct recipe to clean it up
Hope this helps someone else in the future!