ML Data prediction

rreff Partner Posts: 14 Partner

Hi Community.

We are not sure if we can find an ML solution based on the data we have.

We mainly have just one complex source field containing a product description that includes a company name. We need the company name only. We are currently using regex to extract string parts from the product description field. We could of course use a static mapping, but whenever new products appear, manual entry would be required.

Our idea is to use ML because, if you check the attachment, a human can readily see a pattern in the data.

Regards Roberto

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Answer ✓

    @rreff

    Here is a link to an article that looks interesting. I've not worked through this approach in detail, so go with care.

    https://www.analyticsinsight.net/company-names-standardization-using-a-fuzzy-nlp-approach/
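To give a rough feel for what "fuzzy" name standardization means, here is a minimal sketch using only Python's standard library. It is not the method from the linked article, which I have not worked through. The company list, the `standardize` function name, and the 0.8 cutoff are all made-up assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical reference list of already-known, clean company names.
KNOWN_COMPANIES = ["Acme Tools GmbH", "Beta Corp", "Gamma AG"]


def standardize(name, known=KNOWN_COMPANIES, cutoff=0.8):
    """Return the closest known company name, or None if nothing is close enough.

    The cutoff of 0.8 is an arbitrary assumption; tune it on real data.
    """
    # Compare case-insensitively, but return the canonical spelling.
    lookup = {k.lower(): k for k in known}
    matches = get_close_matches(name.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None
```

Names that come back as `None` are the ones a person would still need to review, which should be a much smaller set than every new product.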

    From your example file, I note a few things about your source data:

    1. It looks like this comes from some other source that is concatenating a bunch of existing fields together. I'd first try to get a data set from that source that provides those fields separately. If you can get clean data that way, go straight to step 4 of the article above.
    2. If I could not get the original data I'd then work on cleaning up the data. It appears from this sample that you have a concatenation of 4 fields in the one Source_ProductName column. These include:
      1. Company Name (There can be multiple white spaces between the company name and the one-word alphanumeric code. I'd likely use those extra spaces as a hint.)
      2. A product code or something like that, which is one "word" and is alphanumeric in nature.
      3. A Number (The spacing between this and the date is not consistent. Sometimes the space is omitted.)
      4. A Date
    3. To pull this apart I'd
      1. Likely start with getting the multiple spaces as a way to quickly find most of the names.
      2. For the rest, which have no multiple spaces as a break between the company name and what I'm calling a product code, I'd start at the right end. This will take some clever lookbehind (?<=...) constructions in regex.
        1. I'd get rid of the date first (being careful to leave a number and one-word code)
        2. Then extract the number and get rid of it
        3. Then take off the one alphanumeric code with spaces on each side.
    4. Then I'd take a serious look at the article I've shared. It looks like a reasonable starting place for the standardization problem you are facing.
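The splitting steps above could be sketched roughly like this in Python. I have not run it against the real attachment, so treat every pattern as an assumption: in particular the DD.MM.YYYY date format, the sample strings, and the `extract_company` name are hypothetical and would need adjusting to the actual data. This version anchors each pattern at the end of the string instead of using lookbehind.

```python
import re

# ASSUMPTION: dates look like DD.MM.YYYY; change this to match the real data.
DATE_AT_END = re.compile(r"\s*\d{2}\.\d{2}\.\d{4}\s*$")
# A trailing run of digits preceded by whitespace (the "number" field).
NUMBER_AT_END = re.compile(r"\s+\d+\s*$")
# A trailing one-word alphanumeric token (the "product code" field).
CODE_AT_END = re.compile(r"\s+[A-Za-z0-9]+\s*$")
# Two or more whitespace characters in a row.
MULTI_SPACE = re.compile(r"\s{2,}")


def extract_company(raw):
    """Return the company-name part of a concatenated product string."""
    text = raw.strip()

    # Shortcut: a run of spaces usually separates the name from the code.
    parts = MULTI_SPACE.split(text, maxsplit=1)
    if len(parts) == 2 and parts[0]:
        return parts[0]

    # Fallback: peel fields off the right end, one at a time.
    text = DATE_AT_END.sub("", text, count=1)    # drop the date
    text = NUMBER_AT_END.sub("", text, count=1)  # drop the number
    text = CODE_AT_END.sub("", text, count=1)    # drop the one-word code
    return text.strip()
```

One thing to watch: if a row has a single space after the company name but several spaces between the number and the date, the shortcut would split in the wrong place. If that occurs in your data, skip the shortcut and always peel from the right.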

    Good luck. From my point of view, this is a non-trivial challenge. Let us know how you get on with this.
