Using Fuzzy Joins for Record Linkage

tgb417
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron

All,

I've been working on a record linkage project, in order to drive decisions about record merges in a CRM database. We have recently upgraded to DSS 9.0.1. Is there anyone out there who has been successfully using the new Fuzzy Join Visual recipe to do this kind of thing?

Let's imagine for simplicity's sake that I have ~1/4 million records. I'd like to find those customers who are in reality very likely to be exactly the same customer. So that I can use special steps to merge these customers for data analytics.

Let's oversimplify... Imagine I had this table:

IDNameEmail_Address
1Fred JonesFred.Jones@example.net
2Mary JonesMary@example.net
3Fred Jonsefred.jones@example.net
4Mary SmithMary1@example.net
5Mary M Jonesmary@example.net

6

Fred JonesFred.Jones+1@example.net

7

Tom Jonestom@example.net

I'm looking for a result table that shows

IDDupes
1[3,6]
2[5]

I don't want any of the self-joined rows (See Id in (1,2,4,7) or duplicates that start from already paired sets. (See ID in (3,5,6) )

IDDupesProblem
1[1,3,6]
  • ID 1 is showing up in Dupes Column (Self Reference)
2[2,5]
  • ID 2 is showing up in Dupe Column (Self Reference)
3[3,1,6]
  • ID 3 is showing up in Dupe Column (Self Reference)
  • Already covered in ID one above
4[4]
  • Unnecessary Self Reference
5[5,2]
  • Self Reference ID = 5
  • Already covered in ID 2 above
6

[6,3,1]

  • Self Reference ID = 6
  • Already covered in ID 1 above
7

[7]

  • Un necessary self reference ID = 7

To attempt to do this I'm trying to join a table to itself. Use the Fuzzy Join logic visual recipe.

My guess is that there might need to be several intermediate steps between source data and final data. That would be OK.

However, I've had several currently unresolved challenges using the New Fuzzy Join Recipe.

  1. How to say I don't want the self Joins that indeed would be perfect matches. (I'd like to say in the where clause ID <> ID.
  2. I might be able to use the distinct to keep different permutations of rows.
  3. When using Postgres Database and working with 1/4 million rows I've done a lot of processing but ended up with errors.

Looking forward to your thoughts.

Answers

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭

    Hi Tom,

    To answer your questions:

    1. How to say I don't want the self Joins that indeed would be perfect matches. (I'd like to say in the where clause ID <> ID.
    2. I might be able to use the distinct to keep different permutations of rows.
    3. When using Postgres Database and working with 1/4 million rows I've done a lot of processing but ended up with errors.

    The first two sound like what you need. You could combine them both in one filter recipe after the join with a condition: left_id < right_id. That would remove exact person matches and also permutations.

    You were mentioning the errors, could you give more details about what didn't work?

    Regards

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
    edited July 17

    @Andrey
    ,

    I've gone ahead and opened a Support ticket on the error messages.

    [#22857] New Fuzzy Join Visual Recipe Failing

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron

    @Andrey

    One of the challenges with the Fuzzy Join visual recipes interface is that it only offers "Strict Equality". Nothing like the typical >, >=, !=, etc. So I don't see here how to implement your suggestions:

    I'm on DSS V9.0.1

    Limitations of Fuzzy Joins.jpg

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron

    @Andrey
    ,

    Maybe I miss understood. Are you are suggesting a post fuzzy join filter recipe with the ID< ID filter? I see.

    I guess that I'd really like to do this in a single join. And hopefully, avoid doing some of the potential extra time-consuming compares.

  • Andrey
    Andrey Dataiker Alumni Posts: 119 ✭✭✭✭✭✭✭

    Yes, I meant an extra recipe after the fuzzy join one.

    Fuzzy join doesn't support post-filtering so there's no other way to do it.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron

    @Andrey

    Would it be possible to request in a future version that has post-filtering of Fuzzy Joins? I was even thinking about it in the actual Join logic. (Not really the where clause logic)

Setup Info
    Tags
      Help me…