Standardize the syntax for regular expressions (regex) across all uses in Dataiku DSS

tgb417

User Story

As a lazy data analyst with limited experience with regular expressions, I would like the implementation and syntax of regular expressions to be the same everywhere regex is used in DSS. This would give me more confidence in my use of regex and ultimately increase the power of DSS.

Nice to Have:

  • It would be nice if the new v9 regular expression helper showed up everywhere a regex can be used in DSS. It should include appropriate in situ examples drawn from column names, cells, or whatever data I am trying to match.
  • From a UI perspective, I wish the helper did not show up as a text string under the regex-enabled field. Instead, a simple icon, used consistently across DSS, could appear after any regex-enabled field and open the Regex Helper. Today, for most of my regex work, I copy examples to one of the regex web sites (being careful not to release confidential information), figure out my expression, and copy the result back to DSS.

Notes:

  • There appear to be a number of different implementations of regular expressions in DSS (possibly because DSS is built on multiple libraries from Java, Python, and other places). For example:
    • In some cases, such as shaker formulas, it appears that I have to escape a single backslash \ as a double backslash \\ . In other places this does not seem to be necessary.
    • In some places it appears that I have to put quotes around regular expressions, for example in visual recipe formulas. In others I don’t, for example in regex-based column selection.
    • In some cases it appears that I need to create groupings in parentheses to make a match ( ). However, in some other cases it appears that I don’t.
    • In some cases I need to account for all characters in a string to make a match, padding my criteria with something like .* or [\s\S]* on both ends. In other cases I do not.
    • In some cases it appears that I need to include the leading and closing slashes, for example /.*/ , and in other cases I do not and can just use something like .*
    • This is sometimes made more difficult by cell-level “duck typing”, which changes something like 08840 to 8840, or something like 012345678901234567890 to 1.234567890E19. My regular expressions that would work on the string versions of these cells fail on the duck-typed integer or decimal versions.
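
To make a few of these differences concrete, here is a minimal sketch in plain Python using the standard `re` module. It is not DSS code, and I am only assuming that the different places in DSS behave along similar lines (some of them may well use Java's regex engine instead). The sample values are the ones from the notes above; the variable names are just for illustration.

```python
import re

pattern = r"\d{5}"          # five consecutive digits
value = "ZIP 08840 NJ"      # hypothetical cell content

# "Search" semantics: the pattern may match anywhere inside the string.
found_anywhere = re.search(pattern, value) is not None

# "Full match" semantics: the pattern must account for every character.
matches_whole = re.fullmatch(pattern, value) is not None

# Padding with .* on both ends turns a full match into a "contains" test.
padded_whole = re.fullmatch(".*" + pattern + ".*", value) is not None

# Escaping: inside an ordinary (non-raw) string literal the backslash must
# be doubled, while a raw string takes it as typed. Both spell one pattern.
same_pattern = "\\d{5}" == r"\d{5}"

# Type inference: once a ZIP code is read as an integer, the leading zero
# is gone and a five-digit pattern no longer matches.
zip_as_text = "08840"
zip_as_number = str(int(zip_as_text))
text_ok = re.fullmatch(pattern, zip_as_text) is not None
number_ok = re.fullmatch(pattern, zip_as_number) is not None

# A long identifier read as a decimal is rendered in scientific notation,
# so an "all digits" pattern fails on it.
long_id = "012345678901234567890"
long_as_float = str(float(long_id))
long_ok = re.fullmatch(r"\d+", long_as_float) is not None

print(found_anywhere, matches_whole, padded_whole)
print(same_pattern, text_ok, number_ok, long_ok)
```

Which of these behaviors applies in a given DSS field is exactly what I find hard to predict, and it is why a shared helper with in situ examples would help.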

Regex is powerful and great. However, making the experience more consistent would be very helpful. The Regex helper added in DSS v9 is a nice start in this direction.

6 votes

In the Backlog · Last Updated
