How to create dynamic regex function
I have a column called filename and the field is formatted as
name1_name2_name3_name4_name5_YYYYMMDDHHMMSS.txt.pgp
name1_name2_name3_name4_YYYYMMDDHHMMSS.txt.pgp
name1_name2_name3_name4_name5_name6_YYYYMMDDHHMMSS.txt.pgp
up until now, the number of names can differ, but the filename always ends with YYYYMMDDHHMMSS.txt.pgp
I've been using the regex function (.+)_ to extract everything before the last _YYYYMMDDHHMMSS.txt.pgp resulting in the names only
but now I have a new field in the column formatted as
name1_name2_name3_name4_ddmmyyyy_HHMMSS.txt.pgp
Using my old regex this results in name1_name2_name3_name4_ddmmyyyy, is there a different regex function I can use that's more dynamic that will work for either of those naming conventions?
in my real data the dates are listed as numbers within the filenames, but for my configuration file its just letters.
Example of the data in my configuration file
name1_name2_name3_name4_name5_YYYYMMDDHHMMSS.txt.pgp
name1_name2_name3_name4_YYYYMMDDHHMMSS.txt.pgp
name1_name2_name3_name4_ddmmyyyy_HHMMSS.txt.pgp
Example of the data in my real data
name1_name2_name3_name4_name5_20240401120000.txt.pgp
name1_name2_name3_name4_20240401120000.txt.pgp
name1_name2_name3_name4_20240401_120000.txt.pgp
I'm trying to create 2 separate regex functions, one for my real data and one for my configuration data that only extracts the filename without the date, if there's one regex function that will work for both, that's fine too, I just don't know if that's possible. Does anyone know how I can accomplish this?
Operating system used: windows
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron
Either you are making this way more complex than it needs to be or you are not explaining the requirements properly. A simple substring achieves what you want:
-
sorry maybe I'm not explaining correctly, essentially, I have these 4 values
cat1_cat2_cat3_cat4_cat5_YYYYMMDDHHMMSS.txt.pgp
product1_product2_product3_product4_product5_YYYYMMDDHHMMSS.txt.pgp
name1_name2_name3_name4_name5_name6_YYYYMMDDHHMMSS.txt.pgp
subcategory1_subcategory2_subcategory3_subcategory4_ddmmyyyy_HHMMSS.txt.pgp
and im looking for dynamic regex function that returns:
cat1_cat2_cat3_cat4_cat5
product1_product2_product3_product4_product5
name1_name2_name3_name4_name5_name6
subcategory1_subcategory2_subcategory3_subcategory4
I think it has to be a regex, substring won't work because the names aren't always the same length, im looking to return everything before the underscore preceding the (YYYYMMDDHHMMSS/ddmmyyyy_HHMMSS)
does that make sense?
In the example you provided I see that there's still an _ left at the end in some cases even after you performed your calculation, so I think it needs to be a regex
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron
You are discarding the simplest option just because it has an extra underscore at the end? That's trivial to fix. In my view this is a more robust option since the date part is unlikely to ever change.
if(endsWith(substring(file_name, 0, length(file_name) - 23), "_"), substring(substring(file_name, 0, length(file_name) - 23), 0, -1), substring(file_name, 0, length(file_name) - 23))