regex that will remove everything between begin: and end:

Hi
I have a dataset with regex patterns in one column and Python code that uses them to make replacements. I need a regex pattern that removes everything between "begin:" and "end:". I tried "begin:[\s\S]*?end: " but it doesn't work. The text in between can be more than 3000 characters long, including special characters.
Thank you
Ela
Operating system used: Windows
Answers
-
Hi Elzbieta,
Have you tried using the Smart Pattern Builder to help you write the right regex?
To use it, select one regex in your column and from the menu, click on "extract text like … ", then you should see a modal for your regex.
You can find the documentation here:
-
Thank you for your suggestion - I will certainly use it to get my list of regexes under control. I currently have more than 1000 regexes to apply, and I believe that storing them in a dataset is the only option. In the end, to solve the problem above I had to use the dotall flag so that the whole text between begin: and end:, including new lines, is taken into account. The real issue is that applying a long list of regexes with Python logic is time consuming. I tried to use precompiled batches, but I failed to validate that logic: comparing the results of applying the regexes one by one with the results of applying them in batches showed some differences (out of 8000 text samples, ~120 differed). Do you have a template/example of how to apply precompiled regex batches in Python code for Dataiku, assuming the regexes are in a separate dataset? Thank you Ela
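For reference, here is a minimal sketch of the dotall approach described above. It assumes the "begin:" and "end:" markers should be removed along with the text between them; the sample text and variable names are made up for illustration.

```python
import re

# Made-up sample text: the part between the markers spans two lines
# and contains special characters.
text = "keep this begin: line one\nline two $%^ end: keep that"

# re.DOTALL lets "." match newlines, so ".*?" can span several lines.
# The lazy quantifier stops at the first "end:" after each "begin:".
remove_block = re.compile(r"begin:.*?end:", re.DOTALL)
cleaned = remove_block.sub("", text)
print(cleaned)  # keep this  keep that

# Variant (assumption: the markers should stay): lookarounds match only
# the text between them.
keep_markers = re.compile(r"(?<=begin:).*?(?=end:)", re.DOTALL)
cleaned_keep = keep_markers.sub("", text)
print(cleaned_keep)  # keep this begin:end: keep that
```

Which variant is right depends on whether the markers are needed downstream; the thread does not say.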
-
Hi again
I don't have a template/example, but those differences may come from the order in which your regexes are applied. Batching precompiled regexes changes how replacements are applied when order or overlapping matches matter. If multiple regexes interact, applying them as a batch can produce different results than applying them one by one.
For example:
- First pattern: replace toto → titi
- Second pattern: replace titi → tutu
If you batch them, the toto → titi → tutu chain won't happen, because pattern 2 won't see the result of pattern 1.
So I would advise you to make sure the regexes are applied sequentially, in their original order.
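The difference can be sketched in Python as follows. This is only an illustration under stated assumptions: the `rules` list stands in for rows read from the regex dataset, and the "batch" is assumed to be a single combined alternation, which may not be how the batches in question were built. The function names are hypothetical.

```python
import re

# Hypothetical rules, in the order they would appear in the regex dataset.
rules = [(r"toto", "titi"), (r"titi", "tutu")]

# Compile each pattern once, keeping the original order.
compiled = [(re.compile(pattern, re.DOTALL), repl) for pattern, repl in rules]

def apply_sequentially(text):
    # Each pattern sees the output of the previous replacement.
    for pattern, repl in compiled:
        text = pattern.sub(repl, text)
    return text

# One possible form of batching: a single alternation applied in one pass.
combined = re.compile(
    "|".join(f"(?P<r{i}>{pattern})" for i, (pattern, _) in enumerate(rules)),
    re.DOTALL,
)

def apply_as_batch(text):
    def pick(match):
        # lastgroup is the name of the alternative that matched, e.g. "r0".
        # The replacement is inserted literally (no backreference expansion).
        return rules[int(match.lastgroup[1:])][1]
    # Replaced text is never rescanned, so later rules miss earlier results.
    return combined.sub(pick, text)

print(apply_sequentially("toto"))  # tutu
print(apply_as_batch("toto"))      # titi
```

Precompiling each pattern with `re.compile` and then looping in order keeps the one-by-one semantics; only merging the patterns into one pass changes the results.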