Looking for advice on robust record linkage

tgb417

I'm working with customer records. We think that there is a moderate duplicate problem with the data. We would not be at all surprised if we have a 10% duplicate rate.

It would be great to do some robust record linkage, even though a majority of records are missing at least some data in identifiers like postal address, email address, phone, age, and gender. Each record has its own unique set of data we have and data we don't have.

Has anyone used the new fuzzy match visual recipe for this type of work?

We have also played around with the Python Record Linkage Toolkit (recordlinkage).

Our dataset is on the order of ~250,000 records. From a compute point of view, we can be patient; a small number of hours to calculate these record linkages would be fine.
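For what it's worth, here's roughly what we've been sketching with recordlinkage: a blocking step so we never score all 250,000² candidate pairs, then a handful of fuzzy field comparisons. The column names (postal_code, last_name, email, address, phone), the file name, and the acceptance threshold are placeholders for our schema, not a recommendation:

```python
import pandas as pd
import recordlinkage

df = pd.read_csv("customers.csv")  # hypothetical file; one row per record

# Blocking: only pair up records that share a key, so we score
# thousands of candidate pairs instead of ~250,000^2 of them.
indexer = recordlinkage.Index()
indexer.block("postal_code")               # exact match on the block key
indexer.sortedneighbourhood("last_name")   # also catches near-miss spellings
candidate_pairs = indexer.index(df)

# Fuzzy field-by-field comparisons on the candidate pairs.
compare = recordlinkage.Compare()
compare.string("email", "email", method="jarowinkler",
               threshold=0.9, label="email")
compare.string("address", "address", method="levenshtein",
               label="address")
compare.exact("phone", "phone", label="phone")
features = compare.compute(candidate_pairs, df)

# Crude rule of thumb: keep pairs whose combined score is at least 2
# (roughly two fields in strong agreement).
matches = features[features.sum(axis=1) >= 2]
```

One caveat we've noticed: as far as we can tell, records with a missing blocking key never enter the candidate set from that index, which is part of why we're asking about missing data below.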

What approaches have folks successfully used to deal with such problems at this scale or larger? I'm working with a smaller non-profit; we do not have the budget to implement tools like Snowflake, Spark, or the like.

How are folks dealing with missing data columns?
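For context, here's part of what we've tried on that front. By default recordlinkage scores a comparison as 0 when either side is missing, which makes sparse records look like non-matches. One workaround we've sketched (field names are again placeholders) is to normalize each pair's score by the number of fields actually present on both sides:

```python
import numpy as np
import pandas as pd

# features: the DataFrame of comparison scores, indexed by record pairs,
# as produced by Compare.compute() in the sketch above.
fields = ["email", "address", "phone"]  # hypothetical field names

def availability(df, pairs, fields):
    """For each candidate pair, count fields that are non-null on both sides."""
    left = df.loc[pairs.get_level_values(0), fields].notna().to_numpy()
    right = df.loc[pairs.get_level_values(1), fields].notna().to_numpy()
    return pd.Series((left & right).sum(axis=1), index=pairs)

avail = availability(df, features.index, fields)

# Fraction of *available* fields that agree, instead of a raw sum,
# so records with fewer populated fields aren't automatically penalized.
norm_score = features[fields].sum(axis=1) / avail.replace(0, np.nan)
matches = norm_score[norm_score >= 0.85].index  # hypothetical threshold
```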

We want to turn over to the pro bono data science team a list of all of the current record keys and the consolidated key that they should use for analysis, preferably selected from the existing keys. (That is, promoting one record to be the master of a set of duplicates.) Something like the sketch below is what we have in mind.
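In case it's useful, here's the shape of that last step as we imagine it: treat accepted pairs as edges in a graph, take connected components so transitivity is handled (if A matches B and B matches C, all three land in one group), and promote one existing key per group. The "lowest key wins" rule is arbitrary; swap in the oldest or most complete record as needed:

```python
import networkx as nx
import pandas as pd

# matches: an iterable of (key_a, key_b) pairs judged to be duplicates,
# e.g. matches.index from the first sketch or the accepted pairs above.
G = nx.Graph()
G.add_nodes_from(df.index)   # include every record, even singletons
G.add_edges_from(matches)

# Each connected component is one "entity"; promote one existing key
# per component to be the master.
rows = []
for component in nx.connected_components(G):
    master = min(component)  # simplest rule: lowest existing key wins
    rows.extend((key, master) for key in component)

crosswalk = pd.DataFrame(rows, columns=["record_key", "master_key"])
crosswalk.to_csv("key_crosswalk.csv", index=False)  # hypothetical output
```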

Thanks for any insights anyone can provide.

--Tom
