How to create dynamic regex function

abalo006
abalo006 Registered Posts: 29

I have a column called filename and the field is formatted as

name1_name2_name3_name4_name5_YYYYMMDDHHMMSS.txt.pgp

name1_name2_name3_name4_YYYYMMDDHHMMSS.txt.pgp

name1_name2_name3_name4_name5_name6_YYYYMMDDHHMMSS.txt.pgp

up until now, the number of names can differ, but the filename always ends with YYYYMMDDHHMMSS.txt.pgp

I've been using the regex function (.+)_ to extract everything before the last _YYYYMMDDHHMMSS.txt.pgp resulting in the names only

but now I have a new field in the column formatted as

name1_name2_name3_name4_ddmmyyyy_HHMMSS.txt.pgp

Using my old regex this results in name1_name2_name3_name4_ddmmyyyy, is there a different regex function I can use that's more dynamic that will work for either of those naming conventions?

in my real data the dates are listed as numbers within the filenames, but for my configuration file its just letters.

Example of the data in my configuration file

name1_name2_name3_name4_name5_YYYYMMDDHHMMSS.txt.pgp

name1_name2_name3_name4_YYYYMMDDHHMMSS.txt.pgp

name1_name2_name3_name4_ddmmyyyy_HHMMSS.txt.pgp

Example of the data in my real data

name1_name2_name3_name4_name5_20240401120000.txt.pgp

name1_name2_name3_name4_20240401120000.txt.pgp

name1_name2_name3_name4_20240401_120000.txt.pgp

I'm trying to create 2 separate regex functions, one for my real data and one for my configuration data that only extracts the filename without the date, if there's one regex function that will work for both, that's fine too, I just don't know if that's possible. Does anyone know how I can accomplish this?


Operating system used: windows

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,914 Neuron

    Either you are making this way more complex than it needs to be or you are not explaining the requirements properly. A simple substring achieves what you want:

    Screenshot 2024-06-05 at 22.37.21.png

  • abalo006
    abalo006 Registered Posts: 29

    sorry maybe I'm not explaining correctly, essentially, I have these 4 values

    cat1_cat2_cat3_cat4_cat5_YYYYMMDDHHMMSS.txt.pgp

    product1_product2_product3_product4_product5_YYYYMMDDHHMMSS.txt.pgp

    name1_name2_name3_name4_name5_name6_YYYYMMDDHHMMSS.txt.pgp

    subcategory1_subcategory2_subcategory3_subcategory4_ddmmyyyy_HHMMSS.txt.pgp

    and im looking for dynamic regex function that returns:

    cat1_cat2_cat3_cat4_cat5

    product1_product2_product3_product4_product5

    name1_name2_name3_name4_name5_name6

    subcategory1_subcategory2_subcategory3_subcategory4

    I think it has to be a regex, substring won't work because the names aren't always the same length, im looking to return everything before the underscore preceding the (YYYYMMDDHHMMSS/ddmmyyyy_HHMMSS)

    does that make sense?

    In the example you provided I see that there's still an _ left at the end in some cases even after you performed your calculation, so I think it needs to be a regex

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,914 Neuron
    edited July 17

    You are discarding the simplest option just because it has an extra underscore at the end? That's trivial to fix. In my view this is a more robust option since the date part is unlikely to ever change.

    if(endsWith(substring(file_name, 0, length(file_name) - 23), "_"), substring(substring(file_name, 0, length(file_name) - 23), 0, -1), substring(file_name, 0, length(file_name) - 23))

    Screenshot 2024-06-06 at 07.37.52.png

Setup Info
    Tags
      Help me…