I'm working with customer records. We think that there is a moderate duplicate problem with the data. We would not be at all surprised if we have a 10% duplicate rate.
It would be great to do some robust record linkage, even though we have at least some missing data and identifiers like postal address, email address, phone, age, and gender, on a majority of records. Each record will have it's own unique set of data we have and data we don't have.
Has anyone used the new visual fuzzy match visual recipe for this type of work?
We have also played around with the python library record linkage.
Our dataset is on the order of ~250,000 records. From a computer point of view, we can be patient, a small number of hours would be fine for ultimately calculating these record linkages.
What approaches have folks successfully used to deal with such problems at this scale or larger? I'm working with a smaller non-profit we do not have a budget to implement tools like Snowflake, SPARK or the like.
How are folks dealing with missing data columns?
We want to turn over to the pro bono data science team, a list of all of the current record keys, and the consolidated key that they should use for analysis. Preferably selected from the existing keys. (That is promoting a record to a master of a set of duplicates.