Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
I have the following dataset A:
entry | tag_1 | tag_2 | tag_3 | tag_4 |
a | asd | ert | dfg | ghj |
b | asd | ert | dfg | ghj |
c | asd | ert | dfg | ghj |
And the following dataset B:
entry | tag_2 | tag_4 |
a | qwe | rty |
c | qwe | rty |
I would like to join them together so that when there's an entry match, the column in dataset A is filled with the values from dataset B. If there is no match, we keep the data from A. So the output is:
entry | tag_1 | tag_2 | tag_3 | tag_4 |
a | asd | qwe | dfg | rty |
b | asd | ert | dfg | ghj |
c | asd | qwe | dfg | rty |
What is the most efficient way to go about it? I had considered creating an inner join A + B where I only keep data from B columns, then a left join A + B where I exclude all entries that match between A and B, and finally stack the two - but this seems overly complicated?
I would also not consider the "Computed columns" option (e.g. with an if formula to fill in with B values) because in my real dataset I would have to compute data for around 20 columns.
Any suggestion?
@Flo_G ,
Welcome to the Dataiku Community.
One of the cool things about Dataiku is that you can mix a variety of tools (SQL, R, Python, ... and the Built-in Visual tools) in a single data flow, using the most useful bits of all, and storing your data in a variety of data stores to meet the scale of the work you are doing.
I tend to back my Dataiku instance with a PostgreSQL server even on my laptop design node. So, I might use a SQL recipe to do this the most efficiently inside a SQL server:
Something like this is a SQL recipe:
Select A.entry as entry,
A.tag_1 as tag_1,
COALESCE(B.tag_2, A.tag_2) as tag_2,
B.tag_3 as tag_3,
COALESCE(B.tag_4, A.tag_4) as tag_4,
from A
Left Join B
on A.entry = B.entry
This assumes you have a SQL Server of some sort.
If you are using built-in data storage you might try a Visual Join Recipe using a left join like above (which is the default) and a post join computed column for tag_2 and tag_4 that uses a formula something like this.
Assuming on the select column screen you have set the prefix to A for table A and B for Table B.
COALESCE(B_tag_2, A_tag_2)
As I said at the top of this note there are lots of ways to pull off these things with DSS. I'm sure others might think of an R way and a Python Way. You might be able to do this with several steps in a single visual recipe. Using a Join Step, followed by two formula steps to implement the Coalesce steps, followed by a bit of cleanup to get rid of the extra copies of columns.
Let us know how you get on with this.
Others please feel free to jump in with your favorite solution to this kind of problem.
@tgb417 thank you for the input! Indeed the file at the moment is stored locally, so I don't have the SQL recipe option.
My main concern was to avoid the work of coalescing every column individually (I have about 20 to replace), I'll see if I can write a simpler workaround in R or Python 🙂
At the end I decided to go with the join + stack option:
Hope this helps someone else in the future!