In the Fuzzy Join visual recipe, add an option for strict inequality when joining columns

Background:

As a data analyst who works with "dirty data", I regularly have to identify duplicate records. The Fuzzy Join visual recipe helps with this challenge. Typically this is done by self-joining a dataset on the column or columns that hold the data items used to identify duplicates. However, in such a self join every record matches itself. The usual way to exclude these self matches is to require that the primary keys of the two records differ. Unfortunately, the Fuzzy Join visual recipe cannot apply a strict inequality to columns; only strict equality is available.
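To illustrate the problem outside of Dataiku, here is a minimal plain-Python sketch. It is not how the Fuzzy Join recipe is implemented. The sample records, the `similarity` helper, and the 0.8 threshold are all assumptions made up for the example, and the standard library's `difflib` stands in for whatever fuzzy distance the recipe uses.

```python
from difflib import SequenceMatcher

# Hypothetical sample records: (primary key, name). Not real data.
records = [
    (1, "Jon Smith"),
    (2, "John Smith"),
    (3, "Mary Jones"),
]

def similarity(a, b):
    """Similarity ratio between 0 and 1 (difflib stand-in for a fuzzy distance)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_self_join(rows, threshold=0.8):
    """Naive self join: every row is compared with every row, including itself."""
    return [
        (left_id, right_id)
        for left_id, left_name in rows
        for right_id, right_name in rows
        if similarity(left_name, right_name) >= threshold
    ]

all_matches = fuzzy_self_join(records)

# The extra step needed today: drop the rows where a record matched itself.
without_self = [(left, right) for left, right in all_matches if left != right]

print(all_matches)
print(without_self)
```

Every record shows up as a match for itself in `all_matches`, and only the separate filter on the primary key removes those rows.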

User Story

As a data analyst working with self joins in the Fuzzy Join visual recipe, I would like the option to set up the join with at least one strict inequality between columns, so that deduplication of records can be done more effectively and efficiently, and without the extra step of filtering out the self-joined records.

Condition of Satisfaction

  • This should be done with efficiency in mind, avoiding the compute needed to fuzzy match a record against itself.
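The efficiency point can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not Dataiku's implementation: the sample records, the `dedup_candidates` function, and the 0.8 threshold are hypothetical, and `difflib` stands in for the recipe's fuzzy distance. The key inequality is checked before the fuzzy comparison, so no comparison is spent on a record against itself.

```python
from difflib import SequenceMatcher

# Hypothetical sample records: (primary key, name). Not real data.
records = [
    (1, "Jon Smith"),
    (2, "John Smith"),
    (3, "Mary Jones"),
]

def dedup_candidates(rows, threshold=0.8):
    """Self join with a strict inequality on the key, applied before fuzzy matching.

    Returns the matching key pairs and the number of fuzzy comparisons performed.
    """
    pairs = []
    comparisons = 0
    for left_id, left_name in rows:
        for right_id, right_name in rows:
            # Cheap key check first: skip the record-on-itself case entirely.
            if left_id == right_id:
                continue
            comparisons += 1
            ratio = SequenceMatcher(None, left_name.lower(), right_name.lower()).ratio()
            if ratio >= threshold:
                pairs.append((left_id, right_id))
    return pairs, comparisons

pairs, comparisons = dedup_candidates(records)
print(pairs, comparisons)
```

With n records this performs n × (n − 1) fuzzy comparisons instead of n × n. If the condition were `left_id < right_id` rather than "not equal", each pair would also be reported only once instead of in both directions, which halves the work again.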

Screenshot of the Fuzzy Join recipe's columns to join screen. This shows where the strict inequality option might be added.

 

--Tom