regex that will remove everything between begin: and end:

Elzbieta
Elzbieta Registered Posts: 8 ✭✭

Hi

I have a dataset with regex patterns in one column and Python code that uses them to make replacements. I need a regex pattern that removes everything between "begin:" and "end:". I tried "begin:[\s\S]*?end: " but it doesn't work. The text in between can be more than 3000 characters long, including special characters.

Thank you

Ela

Operating system used: Windows

Answers

  • JaneB
    JaneB Dataiker, Registered Posts: 2 Dataiker
    edited July 29

    Hi Elzbieta,

    Have you tried using the Smart Pattern Builder to help you write the right regex?

    To use it, select one regex in your column and from the menu, click on "extract text like … ", then you should see a modal for your regex.

    You can find the documentation here: https://knowledge.dataiku.com/latest/ml-analytics/nlp/concept-regex.html#smart-pattern-builder

  • Elzbieta
    Elzbieta Registered Posts: 8 ✭✭

    Thank you for your suggestion - I will certainly use it to get my list of regexes under control. I currently have more than 1000 regexes to apply, and I believe that storing them in a dataset is the only option. To solve the problem above, I finally had to use the dotall flag so that the whole text between begin: and end:, including new lines, is taken into account.

    The remaining issue is that applying a long list of regexes with Python logic is time consuming. I tried using precompiled batches, but I failed to validate that logic: comparing the results of plain regex application with the results of regex batches showed some differences (out of 8000 text samples, about 120 differed). Do you have a template or example of how to apply precompiled regex batches in Python code for Dataiku, assuming the regexes are in a separate dataset?

    Thank you
    Ela
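
A minimal sketch of the dotall fix described above, combined with compiling each pattern once and applying the patterns in order. The rule list, the (pattern, replacement) layout, and the helper names `compile_rules` and `apply_rules` are hypothetical; in a Dataiku recipe the rows would presumably come from the regex dataset (for instance through `dataiku.Dataset(...).get_dataframe()`), which is assumed here rather than shown.

```python
import re

# Hypothetical rules, standing in for rows of the regex dataset:
# (pattern, replacement) pairs kept in the order they must be applied.
RULES = [
    (r"begin:.*?end:", ""),  # needs DOTALL so "." also matches newlines
    (r"[ \t]+", " "),        # collapse runs of spaces and tabs
]


def compile_rules(rules, flags=re.DOTALL):
    """Compile every pattern once, preserving the original order."""
    return [(re.compile(pattern, flags), repl) for pattern, repl in rules]


def apply_rules(text, compiled_rules):
    """Apply the rules one after another; each one sees the previous result."""
    for regex, repl in compiled_rules:
        text = regex.sub(repl, text)
    return text


COMPILED = compile_rules(RULES)

sample = "keep this begin: line one\nline two\n!@#$ end: and this"
cleaned = apply_rules(sample, COMPILED)
print(cleaned)  # keep this and this
```

Compiling once and looping in dataset order keeps the result identical to calling `re.sub` rule by rule; only the repeated compilation cost is saved. Passing `re.DOTALL` to every rule is an assumption of this sketch, since it changes what `.` matches in all patterns.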

  • JaneB
    JaneB Dataiker, Registered Posts: 2 Dataiker

    Hi again

    I don't have a template or example, but those differences may come from the order in which your regexes are applied. Batching precompiled regexes changes how replacements are applied if order or overlapping matches matter. If multiple regexes interact, applying them as a batch can produce different results from applying them one by one.

    For example:

    1. First pattern: Replace toto → titi
    2. Second pattern: Replace titi → tutu

    If you batch them, the toto → titi → tutu chain won't happen, because 2 won't see 1's result.

    So I would advise you to make sure the regexes are compiled and applied sequentially, in their original order.
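
The toto/titi/tutu example can be sketched as follows. The single-pass alternation in `apply_as_one_batch` is only an assumption about how a batch might be built (all patterns joined with `|`); both function names are hypothetical.

```python
import re

# Rules from the example, in application order.
RULES = [("toto", "titi"), ("titi", "tutu")]


def apply_sequentially(text, rules):
    """Each rule runs on the output of the previous one."""
    for pattern, repl in rules:
        text = re.compile(pattern).sub(repl, text)
    return text


def apply_as_one_batch(text, rules):
    """Single pass over the text with all patterns joined by '|'."""
    combined = re.compile("|".join(f"({pattern})" for pattern, _ in rules))
    # m.lastindex is the 1-based number of the group that matched; this is
    # valid here only because the patterns contain no groups of their own.
    return combined.sub(lambda m: rules[m.lastindex - 1][1], text)


text = "toto and titi"
sequential = apply_sequentially(text, RULES)
batched = apply_as_one_batch(text, RULES)
print(sequential)  # tutu and tutu
print(batched)     # titi and tutu
```

In the single pass, the text produced by the first rule is never rescanned, so the second rule cannot act on it. Comparing both functions on the samples that differ may help locate the rules that depend on each other.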
