Name Normalisation

gauravgarg · March 2023

I have a data with some entity names say like 1000 names in 1000 rows. Some of the entity names are duplicate. For e.g refer to the following names. These are normalised in the right column. This is just and example there are different type of names with varies inconsistencies due to punctuations or excess names in the cell. Is there any way to normalise these names automatically?

Name	Normalised
JMC Inc McDonald's 11310	Mcdonald'S
JMC Inc McDonald's 15301	Mcdonald'S
JMC Inc McDonal'ds 26990	Mcdonald'S
JMC Inc McDonald's 6412	Mcdonald'S
JMC, INC. #4496	Mcdonald'S
McDonalds	Mcdonald'S
McDonald's	Mcdonald'S
True North Energy	True North Energy
True North Energy *	True North Energy
American Gas & Oil	American Gas & Oil
American Gas & Oil, Inc. *	American Gas & Oil
Auburn Hills #1 Inc - Store 25500	Auburn Hills
Auburn Hills #2 Inc	Auburn Hills
Auburn Hills Car Wash	Auburn Hills
Circle K #2231	Circle K
Circle K #2239	Circle K
City of St. Clair	City Of St. Clair
City of St. Clair Shores	City Of St. Clair
Dayton Freight **	Dayton Freight **
Dayton Freight 32717 Hollingsworth**	Dayton Freight **
Dayton Freight Corporation*	Dayton Freight **
Phillips 66	Phillips 66
Phillips 66 Food Mart	Phillips 66

tgb417 · March 2023

@gauravgarg

Welcome to the Dataiku Community.

I tend to think of the kind of problem you are describing here as record linkage. The task of looking at records and finding the things that would make you think you are talking about the same identity.

There are tools like the python library dedupe and panda-dedupe that specialize in the problem for record linkage. You may want to take a look at those. There are also some consultancies that also specialize in this topic.

One of the things you are showing below is very interesting.

JMC, INC. #4496

Mcdonald'S

From the data listed here I do not know how one would link those records. You seem to be using information beyond what is listed above to make that determination.

Hope that helps a little bit.

gauravgarg · March 2023

@tgb417

I mistakenly mentioned the following. If we ignore this, is there any tool to normalise such names?

JMC, INC. #4496

Mcdonald'S

tgb417 · March 2023

@gauravgarg
,

Yeah, I would tend to use pandas-dedupe in a Jupyter Notebook to do this kind of fuzzy record linkage / normalization, without creating all sorts of data set specific rules by hand.

https://pypi.org/project/pandas-dedupe/

That is a simplified version of the library dedupe.

https://pypi.org/project/dedupe/

The panda-dedupe has not had a bunch of work in the past few years. But the underlying technology is based on the dedupe library. The dedupe library does seem to be under active development.

In looking at your specific use case; I might split the try to split out the numbers at the end of the name strings into a separate column. That would get you a set of clusters and you could use the canonical option to get a consistent name in the Normalize Column. If you had specific ideas about what you wanted in that column you might still need some sort of lookup.

Here is a youtube video that is about using this library.

https://www.youtube.com/watch?v=lCFEzRaqoJA

Here is a bit more on this subject in a previous article here in the Dataiku Community.

https://community.dataiku.com/t5/Using-Dataiku/How-to-identify-duplicates-in-a-data-set/td-p/25823

I'm hoping that some in the community would get together to create a Dataiku Plugin on this topic.

https://community.dataiku.com/t5/Product-Ideas/Entity-Resolution-Record-Linkage-Plug-In/idi-p/19821

In doing this for my work, I've discovered that depending on the size of your dataset this may take a bunch of memory. But if you keep the training set small you can make this work on computers with less memory, but the model is less accurate.

tgb417 · March 2023

@gauravgarg

If using the Dedup or Pandas-Dedupe library in Dataiku seems like a good idea to you, but you don't want to go through the trouble of doing this in a Jupyter Notebook. I'd like to invite you to up-vote the product idea I've suggested over here.

https://community.dataiku.com/t5/Product-Ideas/Entity-Resolution-Record-Linkage-Plug-In/idc-p/33193#M1136

Name Normalisation

Answers

Categories

Setup Info

Tags