Left join with empty values as one of the keys

jack
jack Dataiku DSS Core Designer, Registered Posts: 13 ✭✭✭✭

Hi,

I have difficulties to make a left join with multiple columns as keys.
One of the columns can have empty cells (null) and it seems that the join is considering that no value can be found with the join.
For example :
Key : Columns 1 , 2 and 3.

left table : A B C D
D E Null F

on the right table:
A B C G
D E Null H

result :
A B C D G
D E Null F Null

So Dataiku seems to not recognize an empty value as a valid key and so it is removing the values.

Is there a way to avoid this problem ?

Thanks !

Answers

  • GCase
    GCase Dataiker, PartnerAdmin, Registered Posts: 27 Dataiker

    Jack,

    What recipe engine are you using?

    GCase_0-1579515453672.png

    Also

    LEFT Table
    Col 1Col 2Col 3Col 4
    ABCD
    DENULLF
    RIGHT Table
    Col 1Col 2Col 3Col 4
    ABCG
    DENULLH

    Also, are these representations of the table you are trying to get results from?

  • jack
    jack Dataiku DSS Core Designer, Registered Posts: 13 ✭✭✭✭

    Hi @GCase
    ,
    11 months after my question, here i am, sorry for the quite long delay.
    I have the same problem today, but i think i will be able to illustrate it more precisely, with an project example.
    Left table of the join:

    image.png

    Right table of the join:

    image.png

    Join:
    image.pngSelected columns:
    image.png

    Result:
    image.png

    Why is there no data in column "E_rapporté" ?

    The join should be effective as there is the same data on the left side and the right side of the join.

    I am using DSS engine. Dataiku says that Spark is not available.

    Thanks!

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    Hi @jack
    . This is an interest topic that I was ready to dismiss with a comment like: "one should never do joins between columns that are nullable". However, life is not perfect and because of design or because of data problems, there are cases when you need to do it.

    On the subject of nulls' joins, I was able to find this article that might help to get some ideas on how to proceed: https://bertwagner.com/posts/joining-on-nulls

    ----

    Going back to your example, the outcome you are getting is correct, or at least is the expected behavior: there were no matches between the left and right tables. Since you used "Left Join" all the rows in the left table are kept, but since there was no match with the right table, all the values of E_rapporté are NULL (empty).

    Why is the expected behavior? Because in most programming languages NULL != NULL (in python you can test what happens when you do np.nan == np.nan). And this has some logic, as a missing value is not necessarily equal to another missing value. So when B is Null then left.B != right.B

    I hope this helps to understand the behavior you are seeing, and that the article can help you to see how to solve the problem.

    Cheers!

  • jack
    jack Dataiku DSS Core Designer, Registered Posts: 13 ✭✭✭✭

    Thanks @Ignacio_Toledo
    , i will read the article, hoping i will find something.

    Concerning the example, the origin of this column is caused by data. I do builds of my projects successively and sometimes the two sides of the join on column B are empty (the whole columns).
    I don't understand why there is no match between the left side and right.
    The key is the content of the column A/B/C/D.
    For the first line of the two tables:
    on left side, this key represents the values: a1/empty/c1/d1
    on right side, this key represents the values: a1/empty/c1/d1

    So for me there is a match.

    The alternative solution i found is to make a concatenate of the key before the join of the columns: concat(A,B,C,D), and then even if a column is totally empty on both side, the join works.

    But since i have more than 200 tables with joins which can potentially have this problem, i hope there is a more elegant solution Thanks again

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    Hi @jack
    . About this part of your message:

    I don't understand why there is no match between the left side and right.
    The key is the content of the column A/B/C/D.
    For the first line of the two tables:
    on left side, this key represents the values: a1/empty/c1/d1
    on right side, this key represents the values: a1/empty/c1/d1

    The problem is that while a1==a1, c1==c1 and d1==d1, empty != empty, at least not when you have a NULL in place, and that is the way that SQL databases work. I hope you had time to check the article, because there are examples on to understand why this is so.

    In your case, and without knowing anything about the particularities of your data, the solution could be simple: make first a preparation recipe where you fill the null or empty rows of column B with a string like 'NA' or 'NoData', and then go with the join, that should do the trick.

    Cheers!

Setup Info
    Tags
      Help me…