Cartesian product detection in join recipe

simon_aubert · September 2024

What's your use case?
Cartesian product is a common issue when joining dataset with a bad key. It's not always easy to detect and users can even forget to check for it because they think they know their data.

What's your proposed solution?
What I suggest is an option to check if there will be a cartesian product on the join tool.

-there is a label "Cartesian product (non join key uniqueness) detection"
-under it a drop down menu with three choices
-do nothing
-fail
-warning

Algo :
if do nothing==> well... do nothing more than actual behaviour.
if "fail" or "warning" : count distinct of join key (potentially a combination of fields) versus count row on each side of the join. If no count distinct is equal to simple count, display a warning or an error message.

order_id	customer_id	amount
order_id	1	customer_id	1	amount	100
order_id	2	customer_id	1	amount	150
order_id	3	customer_id	2	amount	142
order_id	4	customer_id	2	amount	145
order_id	5	customer_id	1	amount	145

customer_id	customer_name
customer_id	1	customer_name	toto
customer_id	2	customer_name	tata
customer_id	2	customer_name	titi

on customer_id

=> count distinct left : 2; count left : 5
=> count distinct right : 2; count right : 3
2<>5 and 2<>3
=> there will be a cartesian product

=> for the count, it must be a count(*) or a sum(1) but not a count of the field itself because of the null.

Are there any alternative solutions?
N/A

Best regards,

Simon

Turribeach · September 2024

I believe this could be done with custom Python data quality rule. And in v13.1 data quality templates so you can apply pre-defined rules much more easily. And without any actual changes to your flow have a look at Records count view. This displays a heatmap of the range of record counts for your datasets in the Flow which provides a great visual clue on where record counts are changing, such as when a cartisian join may be happening. You also get the ability to recalculate record counts for all your datasets in the flow at once allowing you to get a consistent view of your datasets row counts.

simon_aubert · September 2024

@Turribeach Thanks for your answer. The counting record method is way easier to set but it's also misleading.
(sorry for the formatting)
Here an example :
Order Table : 4 rows

Category table

Order_id

Category

Amount

Category label

1

A

100

toto

2

A

200

toto

3

B

300

tata

3

B

300

titi

Without luck, I still have 4 rows… but there is a cartesian product !

The second reason I'm not fan, it's an a posteriori control and you have to do the join to check, meaning you have spend time on something wrong instead of a faster a priori control.

Best regards,

Simon

Turribeach · September 2024

Fair enough about the row count not being perfect for detecting a cartesian join but it's still a very good tool and doesn't need a new feature. Also the example you given above is not a cartesian join. A cartesian join is when there is a join for each row of one table to every row of another table. What you presented is a join with tables of different granularity or with duplicate rows. A cartesian join will have returned 12 rows as each row from the order table (4 rows) will be joined with each row of category table (3 rows).

simon_aubert · September 2024

@Turribeach : About semantics, this may be an abuse of language, but we usually call it also a cartesian product (and not cartesian join) even if it's not a perfect cross join. ;)
At least, with the people I have worked it and chatgpt seems to agree.

Turribeach · September 2024

Good point! Cartesian product not the same as Cartesian join. I stand corrected…

rlien1 · March 20

Cardinality detection during the join recipe would help avoid this. Showing values like "one to one", "one to many", or "many to many" would take the guesswork out of whether the user has created the output dataset they were intending or have unintentionally created a headache.

Cartesian product detection in join recipe

Tags

New · Last Updated September 2024

Welcome!

Comments

Welcome!

Welcome!

Quick Links

Categories

Sign up to take part

Cartesian product detection in join recipe

Tags

New · Last Updated September 2024

Welcome!

Comments

Welcome!

Welcome!

Quick Links

Categories