Cartesian product detection in join recipe

simon_aubert
simon_aubert Dataiku DSS Core Designer, Registered Posts: 15 ✭✭✭

What's your use case?
Cartesian product is a common issue when joining dataset with a bad key. It's not always easy to detect and users can even forget to check for it because they think they know their data.

What's your proposed solution?
What I suggest is an option to check if there will be a cartesian product on the join tool.

-there is a label "Cartesian product (non join key uniqueness) detection"
-under it a drop down menu with three choices
-do nothing
-fail
-warning

Algo :
if do nothing==> well... do nothing more than actual behaviour.
if "fail" or "warning" : count distinct of join key (potentially a combination of fields) versus count row on each side of the join. If no count distinct is equal to simple count, display a warning or an error message.

order_id

customer_id

amount

1

1

100

2

1

150

3

2

142

4

2

145

5

1

145

customer_id

customer_name

1

toto

2

tata

2

titi

on customer_id

=> count distinct left : 2; count left : 5
=> count distinct right : 2; count right : 3
2<>5 and 2<>3
=> there will be a cartesian product

=> for the count, it must be a count(*) or a sum(1) but not a count of the field itself because of the null.

Are there any alternative solutions?
N/A

Best regards,

Simon

Tagged:
1
1 votes

New · Last Updated

Comments

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron

    I believe this could be done with custom Python data quality rule. And in v13.1 data quality templates so you can apply pre-defined rules much more easily. And without any actual changes to your flow have a look at Records count view. This displays a heatmap of the range of record counts for your datasets in the Flow which provides a great visual clue on where record counts are changing, such as when a cartisian join may be happening. You also get the ability to recalculate record counts for all your datasets in the flow at once allowing you to get a consistent view of your datasets row counts.

  • simon_aubert
    simon_aubert Dataiku DSS Core Designer, Registered Posts: 15 ✭✭✭
    edited September 24

    @Turribeach Thanks for your answer. The counting record method is way easier to set but it's also misleading.
    (sorry for the formatting)
    Here an example :
    Order Table : 4 rows

    Category table

    Order_id

    Category

    Amount

    Category

    Category label

    1

    A

    100

    A

    toto

    2

    A

    200

    B

    tata

    3

    B

    300

    B

    titi

    4

    C

    400

    Result of inner join : 4 rows

    Order_id

    Category

    Amount

    Category label

    1

    A

    100

    toto

    2

    A

    200

    toto

    3

    B

    300

    tata

    3

    B

    300

    titi




    Without luck, I still have 4 rows… but there is a cartesian product !

    The second reason I'm not fan, it's an a posteriori control and you have to do the join to check, meaning you have spend time on something wrong instead of a faster a priori control.

    Best regards,

    Simon

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron

    Fair enough about the row count not being perfect for detecting a cartesian join but it's still a very good tool and doesn't need a new feature. Also the example you given above is not a cartesian join. A cartesian join is when there is a join for each row of one table to every row of another table. What you presented is a join with tables of different granularity or with duplicate rows. A cartesian join will have returned 12 rows as each row from the order table (4 rows) will be joined with each row of category table (3 rows).

  • simon_aubert
    simon_aubert Dataiku DSS Core Designer, Registered Posts: 15 ✭✭✭

    @Turribeach : About semantics, this may be an abuse of language, but we usually call it also a cartesian product (and not cartesian join) even if it's not a perfect cross join. ;)
    At least, with the people I have worked it and chatgpt seems to agree.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron

    Good point! Cartesian product not the same as Cartesian join. I stand corrected…

Setup Info
    Tags
      Help me…