I have a dataset with about 1,000 entity names in 1,000 rows. Some of the entity names are duplicates. For example, refer to the following names, which are normalised in the right column. This is just an example: there are many different kinds of names with various inconsistencies due to punctuation or extra words in the cell. Is there any way to normalise these names automatically?
| Original name | Normalised name |
|---|---|
| JMC Inc McDonald's 11310 | Mcdonald'S |
| JMC Inc McDonald's 15301 | Mcdonald'S |
| JMC Inc McDonal'ds 26990 | Mcdonald'S |
| JMC Inc McDonald's 6412 | Mcdonald'S |
| JMC, INC. #4496 | Mcdonald'S |
| True North Energy | True North Energy |
| True North Energy * | True North Energy |
| American Gas & Oil | American Gas & Oil |
| American Gas & Oil, Inc. * | American Gas & Oil |
| Auburn Hills #1 Inc - Store 25500 | Auburn Hills |
| Auburn Hills #2 Inc | Auburn Hills |
| Auburn Hills Car Wash | Auburn Hills |
| Circle K #2231 | Circle K |
| Circle K #2239 | Circle K |
| City of St. Clair | City Of St. Clair |
| City of St. Clair Shores | City Of St. Clair |
| Dayton Freight ** | Dayton Freight ** |
| Dayton Freight 32717 Hollingsworth** | Dayton Freight ** |
| Dayton Freight Corporation* | Dayton Freight ** |
| Phillips 66 | Phillips 66 |
| Phillips 66 Food Mart | Phillips 66 |
Welcome to the Dataiku Community.
I tend to think of the kind of problem you are describing here as record linkage: the task of looking at records and finding the features that suggest two records refer to the same entity.
There are tools like the Python libraries dedupe and pandas-dedupe that specialize in record linkage. You may want to take a look at those. There are also consultancies that specialize in this topic.
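To make the record-linkage idea concrete, here is a minimal, standard-library-only sketch of the core scoring step: compare every pair of names with a string-similarity measure and flag pairs above a threshold as likely matches. This is not the dedupe or pandas-dedupe API (those add blocking, active learning, and clustering on top); the names and the 0.8 threshold here are just illustrative assumptions.

```python
# Illustrative pairwise matching with Python's stdlib difflib.
# Real record-linkage libraries (dedupe, pandas-dedupe) are far more
# robust; this only shows the "score pairs of records" idea.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A few names taken from the question above.
names = [
    "JMC Inc McDonald's 11310",
    "JMC Inc McDonald's 15301",
    "True North Energy",
    "True North Energy *",
]

# Pairs whose similarity clears an arbitrary 0.8 threshold.
likely_matches = [
    (a, b) for a, b in combinations(names, 2) if similarity(a, b) >= 0.8
]
```

Note that a pure string-similarity pass would still miss rows like `JMC, INC. #4496`, which is exactly why the more interesting cases need extra information.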
One of the rows you are showing is very interesting:

| JMC, INC. #4496 | Mcdonald'S |

From the data listed here, I do not know how one would link that record to the others. You seem to be using information beyond what is shown above to make that determination.
Hope that helps a little bit.
Yeah, I would tend to use pandas-dedupe in a Jupyter notebook to do this kind of fuzzy record linkage / normalization, rather than creating all sorts of dataset-specific rules by hand.
pandas-dedupe is a simplified wrapper around the dedupe library.
pandas-dedupe has not seen much work in the past few years, but it is built on the dedupe library, which does appear to be under active development.
Looking at your specific use case, I might try to split the numbers at the end of the name strings out into a separate column. That would get you a set of clusters, and you could use the canonical option to get a consistent name in the normalised column. If you had specific ideas about what you wanted in that column, you might still need some sort of lookup.
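The pre-processing step suggested above could be sketched like this: peel a trailing store number (and stray asterisks) off the raw name into its own field before handing the data to the deduplication step. The regex and function name here are made up for the example, and the pattern only handles the trailing-number shapes visible in the sample data.

```python
# Sketch: split trailing store numbers / asterisks off raw entity names.
# Only covers patterns seen in the sample rows; a real dataset would
# likely need a broader set of rules.
import re

# Optional separator, then an optional '#'-prefixed number, then
# optional trailing spaces/asterisks, anchored at end of string.
TRAILING = re.compile(r"[\s,]*(#?\d+)?[\s*]*$")

def split_name(raw: str) -> tuple[str, str]:
    """Return (base_name, trailing_number) for one raw entity name."""
    match = TRAILING.search(raw)
    number = (match.group(1) or "").lstrip("#")
    base = raw[:match.start()].strip()
    return base, number

rows = [
    "JMC Inc McDonald's 11310",
    "Circle K #2231",
    "True North Energy *",
]
split_rows = [split_name(r) for r in rows]
```

Numbers embedded in the middle of a name (e.g. `Auburn Hills #2 Inc`) are left alone by this pattern, so those rows would still rely on the fuzzy-matching step.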
Here is a YouTube video about using this library.
Here is a bit more on this subject in a previous article here in the Dataiku Community.
I'm hoping that some members of the community will get together to create a Dataiku plugin on this topic.
In doing this for my work, I've discovered that, depending on the size of your dataset, this can take a lot of memory. If you keep the training set small you can make it work on machines with less memory, but the model will be less accurate.
If using the dedupe or pandas-dedupe library in Dataiku seems like a good idea to you, but you don't want to go through the trouble of doing this in a Jupyter notebook, I'd like to invite you to upvote the product idea I've suggested over here.