Extract word that begins with specific character anywhere in the text

Solved!
LuizCavalcanti
Level 1
Extract word that begins with specific character anywhere in the text

Dear colleagues,

I stumbled across a situation where I need to extract part numbers from product descriptions so I can use them for some audit processes. Here are some premises:

1 - Product descriptions do not have a limited size (could be 50 characters or 500+);

2 - Part numbers can start with different letters or numbers - however, I have a database with the "starting patterns" of these parts, such as "WND", "448", "100W", which I believe that should be useful for this process;

3 - Part numbers can be anywhere in the description.

Having this information in mind, here is a draft of what the final product should look like:

Description448 ExtractWND extract100W extractPart Number
Copper cable, identified as 448W78568P001, intended use for electrical devices...448W78568P001  448W78568P001
WND09875-10 Electric motors for wind turbines, with 50Nm of... WND09875-10 WND09875-10
Turning device for positioning blades on wind turbine hubs 100W5678P004  100W5678P004100W5678P004

I know this kind of extract is feasible in excel, simply by following the formula below: 

=TRIM(LEFT(SUBSTITUTE(MID(txt,FIND("Prefix",txt),LEN(txt))," ",REPT(" ",100)),100))

Such idea comes from this article here: https://exceljet.net/formula/extract-word-that-begins-with-specific-character 

However, as I am working with massive datasets that are constantly updating, working with excel (even with PowerQuery features) would not be the best solution for me and the company. The real challenge then remains on how to replicate the idea of the formula above, but using dataiku language...That's where I am stuck 😫

I would need to use this formula to create the column "448 Extract", "WND extract" and "100W extract" and ultimately merge them in a final columns with the proper part numbers.

If any of you has ever done something similar to this in Dataiku and have some suggestions, that would be highly appreciated!

Thank you!

0 Kudos
1 Solution
tgb417

@LuizCavalcanti 

Welcome to the Dataiku Community!

I'm running between activities today so I don't have time to work this out completely.

However, the hint that I would offer is that regular expressions (regex) have the power to pull out this kind of substring information in some powerful ways.  There are many visual recipes that can use regex to do text pattern matching.

Regular expressions are very powerful however they can be a bit challenging to write.  There are several web sites out there that can be used to help you write regex.  For example https://regexr.com/

That said, I'm wondering if even this approach would make sense.  How many different extract types do you have.  If there are only three series, this likely would work OK.  However, if there are 100 series or 1000 series this approach is probably not the best.  In those cases, I'd likely be looking at finding an R Package or a Python library that does this kind of extraction from free text.  I don't know if there is a specific package to do this kind or thing.  However, I've got to imaging that others have had similar problems.  So, I'd look to leverage other folks genius.

Hope that helps a little bit. 

P.S. You could also write a Formula in a prepare recipe to do similar things.  The visual formula has most of the things you are showing in your MS Excel formula.  The syntax of visual formulas is a little different.  (And one hint, they are case sensitive. So if and IF are not the same thing.  One will work one will not.)

--Tom

View solution in original post

4 Replies
tgb417

@LuizCavalcanti 

Welcome to the Dataiku Community!

I'm running between activities today so I don't have time to work this out completely.

However, the hint that I would offer is that regular expressions (regex) have the power to pull out this kind of substring information in some powerful ways.  There are many visual recipes that can use regex to do text pattern matching.

Regular expressions are very powerful however they can be a bit challenging to write.  There are several web sites out there that can be used to help you write regex.  For example https://regexr.com/

That said, I'm wondering if even this approach would make sense.  How many different extract types do you have.  If there are only three series, this likely would work OK.  However, if there are 100 series or 1000 series this approach is probably not the best.  In those cases, I'd likely be looking at finding an R Package or a Python library that does this kind of extraction from free text.  I don't know if there is a specific package to do this kind or thing.  However, I've got to imaging that others have had similar problems.  So, I'd look to leverage other folks genius.

Hope that helps a little bit. 

P.S. You could also write a Formula in a prepare recipe to do similar things.  The visual formula has most of the things you are showing in your MS Excel formula.  The syntax of visual formulas is a little different.  (And one hint, they are case sensitive. So if and IF are not the same thing.  One will work one will not.)

--Tom
LuizCavalcanti
Level 1
Author

However, the hint that I would offer is that regular expressions (regex) have the power to pull out this kind of substring information in some powerful ways.  There are many visual recipes that can use regex to do text pattern matching.

Hey Tom,

Thank you a lot for this suggestion! I haven't heard of Regexp before, but I am extremely glad you introduced this topic. I took some lessons on this subject yesterday and in fact it is exactly what I need. Despite being a bit confusing at first, it is being a joy to use it.

Also, my dataset brings a column with part numbers, however in some cases it is messed up (showing blanks or random words that were not supposed to be there). So, what I am doing now is basically filtering out these lines where part codes are ok, reducing the scope of descriptions that I really need to write Regexp to get the right codes.

It is working so far! Thank you once again!!

 

--Luiz

CoreyS
Dataiker Alumni

Awesome to hear @LuizCavalcanti and thanks for the solution @tgb417

Looking for more resources to help you use Dataiku effectively and upskill your knowledge? Check out these great resources: Dataiku Academy | Documentation | Knowledge Base

A reply answered your question? Mark as ‘Accepted Solution’ to help others like you!
0 Kudos
YiXiangChong
Level 2

Hi @LuizCavalcanti ,

I'm not too sure. But can you try the following regex patterns in dataiku?

From prepare recipe, look for "Extract with regular expression" processor.

Then, key in the following regular expression or regex patterns in the Regular Expression fill box (Note: you will need 3 separate processors for each regex pattern):

^448.*?

^WND.*?

^100W.*?

 

(Note: "^" means "starts with" in regex; "." means "any character" after; "*?" means "0 or more matches")

See if it works. Hope this was helpful

0 Kudos