Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Added on September 21, 2024 7:11AM
Likes: 2
Replies: 6
What's your use case?
Cartesian product is a common issue when joining dataset with a bad key. It's not always easy to detect and users can even forget to check for it because they think they know their data.
What's your proposed solution?
What I suggest is an option to check if there will be a cartesian product on the join tool.
-there is a label "Cartesian product (non join key uniqueness) detection"
-under it a drop down menu with three choices
-do nothing
-fail
-warning
Algo :
if do nothing==> well... do nothing more than actual behaviour.
if "fail" or "warning" : count distinct of join key (potentially a combination of fields) versus count row on each side of the join. If no count distinct is equal to simple count, display a warning or an error message.
order_id | customer_id | amount |
---|---|---|
1 | 1 | 100 |
2 | 1 | 150 |
3 | 2 | 142 |
4 | 2 | 145 |
5 | 1 | 145 |
customer_id | customer_name |
---|---|
1 | toto |
2 | tata |
2 | titi |
on customer_id
=> count distinct left : 2; count left : 5
=> count distinct right : 2; count right : 3
2<>5 and 2<>3
=> there will be a cartesian product
=> for the count, it must be a count(*) or a sum(1) but not a count of the field itself because of the null.
Are there any alternative solutions?
N/A
Best regards,
Simon
I believe this could be done with custom Python data quality rule. And in v13.1 data quality templates so you can apply pre-defined rules much more easily. And without any actual changes to your flow have a look at Records count view. This displays a heatmap of the range of record counts for your datasets in the Flow which provides a great visual clue on where record counts are changing, such as when a cartisian join may be happening. You also get the ability to recalculate record counts for all your datasets in the flow at once allowing you to get a consistent view of your datasets row counts.
@Turribeach Thanks for your answer. The counting record method is way easier to set but it's also misleading.
(sorry for the formatting)
Here an example :
Order Table : 4 rows
Category table |
Order_id | Category | Amount | Category | Category label |
1 | A | 100 | A | toto |
2 | A | 200 | B | tata |
3 | B | 300 | B | titi |
4 | C | 400 |
Result of inner join : 4 rows |
Order_id | Category | Amount | Category label |
1 | A | 100 | toto |
2 | A | 200 | toto |
3 | B | 300 | tata |
3 | B | 300 | titi |
Without luck, I still have 4 rows… but there is a cartesian product !
The second reason I'm not fan, it's an a posteriori control and you have to do the join to check, meaning you have spend time on something wrong instead of a faster a priori control.
Best regards,
Simon
Fair enough about the row count not being perfect for detecting a cartesian join but it's still a very good tool and doesn't need a new feature. Also the example you given above is not a cartesian join. A cartesian join is when there is a join for each row of one table to every row of another table. What you presented is a join with tables of different granularity or with duplicate rows. A cartesian join will have returned 12 rows as each row from the order table (4 rows) will be joined with each row of category table (3 rows).
@Turribeach : About semantics, this may be an abuse of language, but we usually call it also a cartesian product (and not cartesian join) even if it's not a perfect cross join. ;)
At least, with the people I have worked it and chatgpt seems to agree.
Good point! Cartesian product not the same as Cartesian join. I stand corrected…
Cardinality detection during the join recipe would help avoid this. Showing values like "one to one", "one to many", or "many to many" would take the guesswork out of whether the user has created the output dataset they were intending or have unintentionally created a headache.