Cartesian product detection in join recipe
What's your use case?
Cartesian product is a common issue when joining dataset with a bad key. It's not always easy to detect and users can even forget to check for it because they think they know their data.
What's your proposed solution?
What I suggest is an option to check if there will be a cartesian product on the join tool.
-there is a label "Cartesian product (non join key uniqueness) detection"
-under it a drop down menu with three choices
-do nothing
-fail
-warning
Algo :
if do nothing==> well... do nothing more than actual behaviour.
if "fail" or "warning" : count distinct of join key (potentially a combination of fields) versus count row on each side of the join. If no count distinct is equal to simple count, display a warning or an error message.
order_id | customer_id | amount |
---|---|---|
1 | 1 | 100 |
2 | 1 | 150 |
3 | 2 | 142 |
4 | 2 | 145 |
5 | 1 | 145 |
customer_id | customer_name |
---|---|
1 | toto |
2 | tata |
2 | titi |
on customer_id
=> count distinct left : 2; count left : 5
=> count distinct right : 2; count right : 3
2<>5 and 2<>3
=> there will be a cartesian product
=> for the count, it must be a count(*) or a sum(1) but not a count of the field itself because of the null.
Are there any alternative solutions?
N/A
Best regards,
Simon
Comments
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
I believe this could be done with custom Python data quality rule. And in v13.1 data quality templates so you can apply pre-defined rules much more easily. And without any actual changes to your flow have a look at Records count view. This displays a heatmap of the range of record counts for your datasets in the Flow which provides a great visual clue on where record counts are changing, such as when a cartisian join may be happening. You also get the ability to recalculate record counts for all your datasets in the flow at once allowing you to get a consistent view of your datasets row counts.
-
@Turribeach Thanks for your answer. The counting record method is way easier to set but it's also misleading.
(sorry for the formatting)
Here an example :
Order Table : 4 rowsCategory table
Order_id
Category
Amount
Category
Category label
1
A
100
A
toto
2
A
200
B
tata
3
B
300
B
titi
4
C
400
Result of inner join : 4 rows
Order_id
Category
Amount
Category label
1
A
100
toto
2
A
200
toto
3
B
300
tata
3
B
300
titi
Without luck, I still have 4 rows… but there is a cartesian product !
The second reason I'm not fan, it's an a posteriori control and you have to do the join to check, meaning you have spend time on something wrong instead of a faster a priori control.
Best regards,
Simon -
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
Fair enough about the row count not being perfect for detecting a cartesian join but it's still a very good tool and doesn't need a new feature. Also the example you given above is not a cartesian join. A cartesian join is when there is a join for each row of one table to every row of another table. What you presented is a join with tables of different granularity or with duplicate rows. A cartesian join will have returned 12 rows as each row from the order table (4 rows) will be joined with each row of category table (3 rows).
-
@Turribeach : About semantics, this may be an abuse of language, but we usually call it also a cartesian product (and not cartesian join) even if it's not a perfect cross join. ;)
At least, with the people I have worked it and chatgpt seems to agree. -
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,167 Neuron
Good point! Cartesian product not the same as Cartesian join. I stand corrected…