ML Data prediction

Solved!
rreff
Level 2

Hi Community.

 

We are not sure whether we can find an ML solution based on the data we have.

We mainly have just one complex source field, which contains a product description that includes a company name. We need only the company name. We are currently using regex to extract parts of the string from the product description field. We could of course use a static mapping, but whenever new products appear, manual entry would be required.

Our idea is to use ML because, as you can see in the attachment, a human brain is able to spot a pattern in the data.

 

Regards Roberto

1 Solution
tgb417
Neuron

@rreff 

Here is a link to an article that looks interesting.  I've not worked through this approach in detail, so proceed with care.

https://www.analyticsinsight.net/company-names-standardization-using-a-fuzzy-nlp-approach/

From your example file, I note a few things about your source data:

  1. It looks like this comes from some other source that is concatenating a bunch of existing fields together.  I'd first try to get a set of data from that source that provides those fields separately.  Ask them for clean data and then go to step 4 of the article above.
  2. If I could not get the original data I'd then work on cleaning up the data.  It appears from this sample that you have a concatenation of 4 fields in the one Source_ProductName column.  These include:
    1. Company Name  (There can be multiple white spaces between the company name and the one-word alphanumeric code, I'd likely use those extra spaces as a help.)
    2. A product code or something like that, which is one "word" and is alphanumeric in nature.
    3. A Number (Spaces between this and the date are not well-formatted. Sometimes omitted.)
    4. A Date
  3. To pull this apart I'd
    1. Likely start with getting the multiple spaces as a way to quickly find most of the names.
    2. For the rest, which have no multiple spaces as a break between the company name and what I'm calling a product code, I'd start at the right end.  This will take some clever lookbehind, i.e. (?<=...), constructions in regex.
      1. I'd get rid of the date first (being careful to leave a number and one-word code)
      2. Then extract the number and get rid of it
      3. Then take off the one alphanumeric code with spaces on each side.
  4. Then I'd take a serious look at the article I've shared.  It looks like a reasonable starting place for the standardization problem you are facing.
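
The cleanup in step 3 could be sketched roughly as follows in Python. This is only a sketch under assumptions, not a tested solution for the attached file: the function name `extract_company`, the sample rows, and above all the date shape (dd.mm.yyyy) are hypothetical, and it uses end-anchored patterns rather than lookbehinds to peel fields off the right-hand side.

```python
import re

# ASSUMPTION: dates look like dd.mm.yyyy; adjust to the real format in the data.
DATE_AT_END = re.compile(r"\s*\d{1,2}\.\d{1,2}\.\d{2,4}\s*$")
# A standalone number at the end, separated from the code by whitespace.
NUMBER_AT_END = re.compile(r"\s+\d+\s*$")
# The one-word alphanumeric code at the end.
CODE_AT_END = re.compile(r"\s+[A-Za-z0-9]+\s*$")
# Two or more whitespace characters in a row.
MULTI_SPACE = re.compile(r"\s{2,}")


def extract_company(raw: str) -> str:
    """Return the leading company name from a concatenated product field."""
    text = raw.strip()

    # Step 3.1: a run of multiple spaces is taken as the end of the name.
    parts = MULTI_SPACE.split(text, maxsplit=1)
    if len(parts) == 2:
        return parts[0]

    # Step 3.2: otherwise work from the right: date, then number, then code.
    text = DATE_AT_END.sub("", text)
    text = NUMBER_AT_END.sub("", text)
    text = CODE_AT_END.sub("", text)
    return text.strip()


# Hypothetical rows, not taken from the attachment.
for row in (
    "ACME Corp   X12 45 01.02.2021",
    "ACME Corp X12 45 01.02.2021",
    "Foo GmbH AB9 701.02.2021",  # number and date run together
):
    print(repr(row), "->", repr(extract_company(row)))
```

Note that the right-to-left branch always removes the last word as the code, so a row that contains only a company name would lose its final word. Rows like that would need a separate check before this function is applied.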

Good luck.  From my point of view, this is a non-trivial challenge.  Let us know how you get on with this.

--Tom

3 Replies
rreff
Level 2
Author

Attached some source data.

COL1 shows the source

COL2 shows the source after applying the regex ^(.*?)\ [0-9]+

COL3 is the target we are trying to achieve
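
For reference, here is a small Python sketch of how that regex behaves. The sample strings and the helper name `apply_current_regex` are made up for illustration and are not rows from the attachment. The lazy group stops at the first space that is followed by a digit, so a preceding product code stays in the result and a name that contains a number gets cut short.

```python
import re

# The regex currently used to produce COL2.
CURRENT_PATTERN = re.compile(r"^(.*?)\ [0-9]+")


def apply_current_regex(raw: str) -> str:
    """Return capture group 1, or the input unchanged when nothing matches."""
    match = CURRENT_PATTERN.search(raw)
    return match.group(1) if match else raw


# Hypothetical inputs, chosen to show the two failure modes.
print(apply_current_regex("ACME Corp X12 45 01.02.2021"))   # code is kept
print(apply_current_regex("Studio 54 X12 45 01.02.2021"))   # name is cut short
```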

rreff
Level 2
Author

Hi Tom. 

Thanks for your reply. I didn't quite manage to build your solution. I will have to try this again.

Regards Roberto
