In fuzzy join visual recipe add the option for strict inequality when joining columns

tgb417

Background

As a data analyst who works with "dirty data", I regularly have to identify duplicate records. The Fuzzy Join visual recipe helps with this. Typically I self-join a dataset on the column or columns that hold the values to be compared for duplicates. However, in such a self-join every record also matches itself. The usual way to exclude these self-matches is to require that the primary keys of the two records differ. Unfortunately, the Fuzzy Join visual recipe does not support a strict inequality condition on columns, only strict equality.
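To illustrate the situation outside of DSS, here is a minimal Python sketch of a fuzzy self-join followed by the extra filtering step. It is not how the Fuzzy Join recipe is implemented; the sample records, the `similarity` helper, and the 0.85 threshold are all assumptions made for the example, and `difflib.SequenceMatcher` stands in for whatever distance the recipe actually uses.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical sample data: (primary key, name)
records = [
    (1, "Jon Smith"),
    (2, "John Smith"),
    (3, "Alice Jones"),
]

THRESHOLD = 0.85  # assumed similarity cutoff, chosen only for this example


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_self_join(rows, threshold=THRESHOLD):
    """Pair every row with every row, including itself, and keep fuzzy matches."""
    return [
        (left_id, right_id)
        for (left_id, left_name), (right_id, right_name) in product(rows, repeat=2)
        if similarity(left_name, right_name) >= threshold
    ]


joined = fuzzy_self_join(records)

# Extra step needed today: drop the pairs where a record matched itself.
candidates = [(left_id, right_id) for left_id, right_id in joined if left_id != right_id]
```

In this sketch every record appears in `joined` paired with itself, and only the second pass removes those pairs, which is the extra step this idea is trying to avoid.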

User Story

As a data analyst working with self-joins in the Fuzzy Join visual recipe, I would like the option to set up the join with at least one strict inequality between columns, so that deduplication of records can be done more effectively and efficiently, without the extra step of filtering out the self-joined records.

Condition of Satisfaction

  • This should be done with efficiency in mind, avoiding the compute needed to fuzzy match each record against itself.
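The efficiency point can be sketched in Python as follows. This is only an illustration under assumptions, not a description of DSS internals: the function name, the sample records, and the threshold are hypothetical. The key comparison runs before the fuzzy comparison, so self-pairs never reach the similarity function.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical sample data: (primary key, name)
records = [(1, "Jon Smith"), (2, "John Smith"), (3, "Alice Jones")]


def fuzzy_self_join_excluding_self(rows, threshold=0.85):
    """Fuzzy self-join that applies a strict inequality on the key first.

    Returns the matching key pairs and the number of fuzzy comparisons made.
    """
    matches = []
    comparisons = 0
    for (left_id, left_name), (right_id, right_name) in product(rows, repeat=2):
        if left_id == right_id:
            continue  # strict inequality on the key: skip before any fuzzy work
        comparisons += 1
        score = SequenceMatcher(None, left_name.lower(), right_name.lower()).ratio()
        if score >= threshold:
            matches.append((left_id, right_id))
    return matches, comparisons


matches, comparisons = fuzzy_self_join_excluding_self(records)
```

With three records this performs 6 fuzzy comparisons instead of 9. If the condition were `left_id < right_id` rather than "not equal", each unordered pair would be compared only once, which may suit deduplication even better.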

Screenshot of the Fuzzy Join recipe's "columns to join" screen, showing where the strict inequality option might be added.
