Left join with empty values as one of the keys

jack
Level 3
Left join with empty values as one of the keys

Hi,

I have difficulties to make a left join with multiple columns as keys.
One of the columns can have empty cells (null) and it seems that the join is considering that no value can be found with the join.
For example :
Key : Columns 1 , 2 and 3.

left table : A B C D
D E Null F

on the right table:
A B C G
D E Null H

result :
A B C D G
D E Null F Null

So Dataiku seems to not recognize an empty value as a valid key and so it is removing the values.

Is there a way to avoid this problem ?

Thanks !

0 Kudos
5 Replies
GCase
Dataiker

Jack,

What recipe engine are you using?

GCase_0-1579515453672.png

Also

 

LEFT Table  
    
Col 1Col 2Col 3Col 4
ABCD
DENULLF
    
    
RIGHT Table  
Col 1Col 2Col 3Col 4
ABCG
DENULLH

 

Also, are these representations of the table you are trying to get results from?

0 Kudos
jack
Level 3
Author

Hi @GCase ,
11 months after my question, here i am, sorry for the quite long delay.
I have the same problem today, but i think i will be able to illustrate it more precisely, with an project example.
Left table of the join:

image.png

Right table of the join:

 

image.png

Join:
image.pngSelected columns:
image.png

 

Result:
image.png

Why is there no data in column "E_rapporté" ?

The join should be effective as there is the same data on the left side and the right side of the join.

I am using DSS engine. Dataiku says that Spark is not available.

Thanks!

0 Kudos
Ignacio_Toledo

Hi @jack. This is an interest topic that I was ready to dismiss with a comment like: "one should never do joins between columns that are nullable". However, life is not perfect and because of design or because of data problems, there are cases when you need to do it.

On the subject of nulls' joins, I was able to find this article that might help to get some ideas on how to proceed: https://bertwagner.com/posts/joining-on-nulls

----

Going back to your example, the outcome you are getting is correct, or at least is the expected behavior: there were no matches between the left and right tables. Since you used "Left Join" all the rows in the left table are kept, but since there was no match with the right table, all the values of E_rapporté are NULL (empty).

Why is the expected behavior? Because in most programming languages NULL != NULL (in python you can test what happens when you do np.nan == np.nan). And this has some logic, as a missing value is not necessarily equal to another missing value. So when B is Null then left.B != right.B

I hope this helps to understand the behavior you are seeing, and that the article can help you to see how to solve the problem.

Cheers!

 

jack
Level 3
Author

Thanks @Ignacio_Toledo , i will read the article, hoping i will find something.

Concerning the example, the origin of this column is caused by data. I do builds of my projects successively and sometimes the two sides of the join on column B are empty (the whole columns).
I don't understand why there is no match between the left side and right.
The key is the content of the column A/B/C/D.
For the first line of the two tables:
on left side, this key represents the values: a1/empty/c1/d1
on right side, this key represents the values: a1/empty/c1/d1

So for me there is a match.

The alternative solution i found is to make a concatenate of the key before the join of the columns:  concat(A,B,C,D), and then even if a column is totally empty on both side, the join works.

But since i have more than 200 tables with joins which can potentially have this problem, i hope there is a more elegant solution 😁

Thanks again

0 Kudos
Ignacio_Toledo

Hi @jack. About this part of your message:

I don't understand why there is no match between the left side and right.
The key is the content of the column A/B/C/D.
For the first line of the two tables:
on left side, this key represents the values: a1/empty/c1/d1
on right side, this key represents the values: a1/empty/c1/d1

The problem is that while a1==a1, c1==c1 and d1==d1, empty != empty, at least not when you have a NULL in place, and that is the way that SQL databases work. I hope you had time to check the article, because there are examples on to understand why this is so.

In your case, and without knowing anything about the particularities of your data, the solution could be simple: make first a preparation recipe where you fill the null or empty rows of column B with a string like 'NA' or 'NoData', and then go with the join, that should do the trick.

Cheers!

0 Kudos