## Sign up to take part

Registered users can ask their own questions, contribute to discussions, and be part of the Community!

This website uses cookies. By clicking OK, you consent to the use of cookies. Read our cookie policy.

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Registered users can ask their own questions, contribute to discussions, and be part of the Community!

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Solved!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Extract word that begins with specific character anywhere in the text

Dear colleagues,

I stumbled across a situation where I need to extract part numbers from product descriptions so I can use them for some audit processes. Here are some **premises:**

1 - Product descriptions do not have a limited size (could be 50 characters or 500+);

2 - Part numbers can start with different letters or numbers - however, I have a database with the "starting patterns" of these parts, such as "WND", "448", "100W", which I believe that should be useful for this process;

3 - Part numbers can be anywhere in the description.

Having this information in mind, here is a draft **of what the final product should look like:**

Description | 448 Extract | WND extract | 100W extract | Part Number |

Copper cable, identified as 448W78568P001, intended use for electrical devices... | 448W78568P001 | 448W78568P001 | ||

WND09875-10 Electric motors for wind turbines, with 50Nm of... | WND09875-10 | WND09875-10 | ||

Turning device for positioning blades on wind turbine hubs 100W5678P004 | 100W5678P004 | 100W5678P004 |

I know this kind of extract is feasible in excel, simply by following the formula below:

=TRIM(LEFT(SUBSTITUTE(MID(txt,FIND("Prefix",txt),LEN(txt))," ",REPT(" ",100)),100))

Such idea comes from this article here: https://exceljet.net/formula/extract-word-that-begins-with-specific-character

However, as I am working with massive datasets that are constantly updating, working with excel (even with PowerQuery features) would not be the best solution for me and the company. The real challenge then remains on how to replicate the idea of the formula above, but using dataiku language...That's where I am stuck 😫

I would need to use this formula to create the column "448 Extract", "WND extract" and "100W extract" and ultimately merge them in a final columns with the proper part numbers.

If any of you has ever done something similar to this in Dataiku and have some suggestions, that would be highly appreciated!

**Thank you!**

1 Solution

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Welcome to the Dataiku Community!

I'm running between activities today so I don't have time to work this out completely.

However, the hint that I would offer is that regular expressions (regex) have the power to pull out this kind of substring information in some powerful ways. There are many visual recipes that can use regex to do text pattern matching.

Regular expressions are very powerful however they can be a bit challenging to write. There are several web sites out there that can be used to help you write regex. For example https://regexr.com/

That said, I'm wondering if even this approach would make sense. How many different extract types do you have. If there are only three series, this likely would work OK. However, if there are 100 series or 1000 series this approach is probably not the best. In those cases, I'd likely be looking at finding an R Package or a Python library that does this kind of extraction from free text. I don't know if there is a specific package to do this kind or thing. However, I've got to imaging that others have had similar problems. So, I'd look to leverage other folks genius.

Hope that helps a little bit.

P.S. You could also write a Formula in a prepare recipe to do similar things. The visual formula has most of the things you are showing in your MS Excel formula. The syntax of visual formulas is a little different. (And one hint, they are case sensitive. So if and IF are not the same thing. One will work one will not.)

--Tom

Solutions shown first - Read whole discussion

4 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Welcome to the Dataiku Community!

I'm running between activities today so I don't have time to work this out completely.

However, the hint that I would offer is that regular expressions (regex) have the power to pull out this kind of substring information in some powerful ways. There are many visual recipes that can use regex to do text pattern matching.

Regular expressions are very powerful however they can be a bit challenging to write. There are several web sites out there that can be used to help you write regex. For example https://regexr.com/

That said, I'm wondering if even this approach would make sense. How many different extract types do you have. If there are only three series, this likely would work OK. However, if there are 100 series or 1000 series this approach is probably not the best. In those cases, I'd likely be looking at finding an R Package or a Python library that does this kind of extraction from free text. I don't know if there is a specific package to do this kind or thing. However, I've got to imaging that others have had similar problems. So, I'd look to leverage other folks genius.

Hope that helps a little bit.

P.S. You could also write a Formula in a prepare recipe to do similar things. The visual formula has most of the things you are showing in your MS Excel formula. The syntax of visual formulas is a little different. (And one hint, they are case sensitive. So if and IF are not the same thing. One will work one will not.)

--Tom

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

However, the hint that I would offer is that regular expressions (regex) have the power to pull out this kind of substring information in some powerful ways. There are many visual recipes that can use regex to do text pattern matching.

Hey Tom,

Thank you a lot for this suggestion! I haven't heard of Regexp before, but I am extremely glad you introduced this topic. I took some lessons on this subject yesterday and in fact it is exactly what I need. Despite being a bit confusing at first, it is being a joy to use it.

Also, my dataset brings a column with part numbers, however in some cases it is messed up (showing blanks or random words that were not supposed to be there). So, what I am doing now is basically filtering out these lines where part codes are ok, reducing the scope of descriptions that I really need to write Regexp to get the right codes.

It is working so far! Thank you once again!!

--Luiz

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Awesome to hear @LuizCavalcanti and thanks for the solution @tgb417!

A reply answered your question? Mark as ‘Accepted Solution’ to help others like you!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi @LuizCavalcanti ,

I'm not too sure. But can you try the following regex patterns in dataiku?

From prepare recipe, look for **"Extract with regular expression"** processor.

Then, key in the following regular expression or regex patterns in the Regular Expression fill box (Note: you will need 3 separate processors for each regex pattern):

^448.*?

^WND.*?

^100W.*?

(Note: "^" means "starts with" in regex; "." means "any character" after; "*?" means "0 or more matches")

See if it works. Hope this was helpful