Name Normalisation

Options
gauravgarg
gauravgarg Registered Posts: 6

I have a data with some entity names say like 1000 names in 1000 rows. Some of the entity names are duplicate. For e.g refer to the following names. These are normalised in the right column. This is just and example there are different type of names with varies inconsistencies due to punctuations or excess names in the cell. Is there any way to normalise these names automatically?

NameNormalised
JMC Inc McDonald's 11310Mcdonald'S
JMC Inc McDonald's 15301Mcdonald'S
JMC Inc McDonal'ds 26990Mcdonald'S
JMC Inc McDonald's 6412Mcdonald'S
JMC, INC. #4496Mcdonald'S
McDonaldsMcdonald'S
McDonald'sMcdonald'S
True North EnergyTrue North Energy
True North Energy *True North Energy
American Gas & OilAmerican Gas & Oil
American Gas & Oil, Inc. *American Gas & Oil
Auburn Hills #1 Inc - Store 25500Auburn Hills
Auburn Hills #2 IncAuburn Hills
Auburn Hills Car WashAuburn Hills
Circle K #2231Circle K
Circle K #2239Circle K
City of St. ClairCity Of St. Clair
City of St. Clair ShoresCity Of St. Clair
Dayton Freight **Dayton Freight **
Dayton Freight 32717 Hollingsworth**Dayton Freight **
Dayton Freight Corporation*Dayton Freight **
Phillips 66Phillips 66
Phillips 66 Food MartPhillips 66
Tagged:

Answers

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @gauravgarg

    Welcome to the Dataiku Community.

    I tend to think of the kind of problem you are describing here as record linkage. The task of looking at records and finding the things that would make you think you are talking about the same identity.

    There are tools like the python library dedupe and panda-dedupe that specialize in the problem for record linkage. You may want to take a look at those. There are also some consultancies that also specialize in this topic.

    One of the things you are showing below is very interesting.

    JMC, INC. #4496Mcdonald'S

    From the data listed here I do not know how one would link those records. You seem to be using information beyond what is listed above to make that determination.

    Hope that helps a little bit.

  • gauravgarg
    gauravgarg Registered Posts: 6
    Options

    @tgb417

    I mistakenly mentioned the following. If we ignore this, is there any tool to normalise such names?

    JMC, INC. #4496Mcdonald'S
  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @gauravgarg
    ,

    Yeah, I would tend to use pandas-dedupe in a Jupyter Notebook to do this kind of fuzzy record linkage / normalization, without creating all sorts of data set specific rules by hand.

    https://pypi.org/project/pandas-dedupe/

    That is a simplified version of the library dedupe.

    https://pypi.org/project/dedupe/

    The panda-dedupe has not had a bunch of work in the past few years. But the underlying technology is based on the dedupe library. The dedupe library does seem to be under active development.

    In looking at your specific use case; I might split the try to split out the numbers at the end of the name strings into a separate column. That would get you a set of clusters and you could use the canonical option to get a consistent name in the Normalize Column. If you had specific ideas about what you wanted in that column you might still need some sort of lookup.

    Here is a youtube video that is about using this library.

    https://www.youtube.com/watch?v=lCFEzRaqoJA

    Here is a bit more on this subject in a previous article here in the Dataiku Community.

    https://community.dataiku.com/t5/Using-Dataiku/How-to-identify-duplicates-in-a-data-set/td-p/25823

    I'm hoping that some in the community would get together to create a Dataiku Plugin on this topic.

    https://community.dataiku.com/t5/Product-Ideas/Entity-Resolution-Record-Linkage-Plug-In/idi-p/19821

    In doing this for my work, I've discovered that depending on the size of your dataset this may take a bunch of memory. But if you keep the training set small you can make this work on computers with less memory, but the model is less accurate.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @gauravgarg

    If using the Dedup or Pandas-Dedupe library in Dataiku seems like a good idea to you, but you don't want to go through the trouble of doing this in a Jupyter Notebook. I'd like to invite you to up-vote the product idea I've suggested over here.

    https://community.dataiku.com/t5/Product-Ideas/Entity-Resolution-Record-Linkage-Plug-In/idc-p/33193#M1136

Setup Info
    Tags
      Help me…