Extract word that begins with specific character anywhere in the text

LuizCavalcanti
LuizCavalcanti Dataiku DSS Core Designer, Registered Posts: 2 ✭✭✭
edited July 2024 in General Discussion

Dear colleagues,

I stumbled across a situation where I need to extract part numbers from product descriptions so I can use them for some audit processes. Here are some premises:

1 - Product descriptions do not have a limited size (could be 50 characters or 500+);

2 - Part numbers can start with different letters or numbers - however, I have a database with the "starting patterns" of these parts, such as "WND", "448", "100W", which I believe that should be useful for this process;

3 - Part numbers can be anywhere in the description.

Having this information in mind, here is a draft of what the final product should look like:

Description448 ExtractWND extract100W extractPart Number
Copper cable, identified as 448W78568P001, intended use for electrical devices...448W78568P001 448W78568P001
WND09875-10 Electric motors for wind turbines, with 50Nm of... WND09875-10 WND09875-10
Turning device for positioning blades on wind turbine hubs 100W5678P004 100W5678P004100W5678P004

I know this kind of extract is feasible in excel, simply by following the formula below:

=TRIM(LEFT(SUBSTITUTE(MID(txt,FIND("Prefix",txt),LEN(txt))," ",REPT(" ",100)),100))

Such idea comes from this article here: https://exceljet.net/formula/extract-word-that-begins-with-specific-character

However, as I am working with massive datasets that are constantly updating, working with excel (even with PowerQuery features) would not be the best solution for me and the company. The real challenge then remains on how to replicate the idea of the formula above, but using dataiku language...That's where I am stuck " height="20" loading="lazy">">

I would need to use this formula to create the column "448 Extract", "WND extract" and "100W extract" and ultimately merge them in a final columns with the proper part numbers.

If any of you has ever done something similar to this in Dataiku and have some suggestions, that would be highly appreciated!

Thank you!

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,618 Neuron
    Answer ✓

    @LuizCavalcanti

    Welcome to the Dataiku Community!

    I'm running between activities today so I don't have time to work this out completely.

    However, the hint that I would offer is that regular expressions (regex) have the power to pull out this kind of substring information in some powerful ways. There are many visual recipes that can use regex to do text pattern matching.

    Regular expressions are very powerful however they can be a bit challenging to write. There are several web sites out there that can be used to help you write regex. For example https://regexr.com/

    That said, I'm wondering if even this approach would make sense. How many different extract types do you have. If there are only three series, this likely would work OK. However, if there are 100 series or 1000 series this approach is probably not the best. In those cases, I'd likely be looking at finding an R Package or a Python library that does this kind of extraction from free text. I don't know if there is a specific package to do this kind or thing. However, I've got to imaging that others have had similar problems. So, I'd look to leverage other folks genius.

    Hope that helps a little bit.

    P.S. You could also write a Formula in a prepare recipe to do similar things. The visual formula has most of the things you are showing in your MS Excel formula. The syntax of visual formulas is a little different. (And one hint, they are case sensitive. So if and IF are not the same thing. One will work one will not.)

Answers

  • LuizCavalcanti
    LuizCavalcanti Dataiku DSS Core Designer, Registered Posts: 2 ✭✭✭

    However, the hint that I would offer is that regular expressions (regex) have the power to pull out this kind of substring information in some powerful ways. There are many visual recipes that can use regex to do text pattern matching.

    Hey Tom,

    Thank you a lot for this suggestion! I haven't heard of Regexp before, but I am extremely glad you introduced this topic. I took some lessons on this subject yesterday and in fact it is exactly what I need. Despite being a bit confusing at first, it is being a joy to use it.

    Also, my dataset brings a column with part numbers, however in some cases it is messed up (showing blanks or random words that were not supposed to be there). So, what I am doing now is basically filtering out these lines where part codes are ok, reducing the scope of descriptions that I really need to write Regexp to get the right codes.

    It is working so far! Thank you once again!!

    --Luiz

  • CoreyS
    CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Awesome to hear @LuizCavalcanti
    and thanks for the solution @tgb417
    !

  • YiXiangChong
    YiXiangChong Registered Posts: 5 ✭✭✭✭

    Hi @LuizCavalcanti
    ,

    I'm not too sure. But can you try the following regex patterns in dataiku?

    From prepare recipe, look for "Extract with regular expression" processor.

    Then, key in the following regular expression or regex patterns in the Regular Expression fill box (Note: you will need 3 separate processors for each regex pattern):

    ^448.*?

    ^WND.*?

    ^100W.*?

    (Note: "^" means "starts with" in regex; "." means "any character" after; "*?" means "0 or more matches")

    See if it works. Hope this was helpful

Setup Info
    Tags
      Help me…