Survey banner
The Dataiku Community is moving to a new home! We are temporary in read only mode: LEARN MORE

How to create dynamic regex function

abalo006
Level 3
How to create dynamic regex function

I have a column called filename and the field is formatted as

 

name1_name2_name3_name4_name5_YYYYMMDDHHMMSS.txt.pgp

name1_name2_name3_name4_YYYYMMDDHHMMSS.txt.pgp

name1_name2_name3_name4_name5_name6_YYYYMMDDHHMMSS.txt.pgp

 

 

up until now, the number of names can differ, but the filename always ends with YYYYMMDDHHMMSS.txt.pgp 

 

I've been using the regex function (.+)_ to extract everything before the last _YYYYMMDDHHMMSS.txt.pgp resulting in the names only

 

but now I have a new field in the column formatted as 

 

name1_name2_name3_name4_ddmmyyyy_HHMMSS.txt.pgp

 

Using my old regex this results in name1_name2_name3_name4_ddmmyyyy, is there a different regex function I can use that's more dynamic that will work for either of those naming conventions?

 

in my real data the dates are listed as numbers within the filenames, but for my configuration file its just letters.

 

 

Example of the data in my configuration file 

 

name1_name2_name3_name4_name5_YYYYMMDDHHMMSS.txt.pgp

name1_name2_name3_name4_YYYYMMDDHHMMSS.txt.pgp

name1_name2_name3_name4_ddmmyyyy_HHMMSS.txt.pgp

 

 

Example of the data in my real data 

 

name1_name2_name3_name4_name5_20240401120000.txt.pgp

name1_name2_name3_name4_20240401120000.txt.pgp

name1_name2_name3_name4_20240401_120000.txt.pgp

 

I'm trying to create 2 separate regex functions, one for my real data and one for my configuration data that only extracts the filename without the date, if there's one regex function that will work for both, that's fine too, I just don't know if that's possible. Does anyone know how I can accomplish this?


Operating system used: windows

0 Kudos
3 Replies
Turribeach

Either you are making this way more complex than it needs to be or you are not explaining the requirements properly. A simple substring achieves what you want:

 

Screenshot 2024-06-05 at 22.37.21.png

0 Kudos
abalo006
Level 3
Author

sorry maybe I'm not explaining correctly, essentially, I have these 4 values

 

cat1_cat2_cat3_cat4_cat5_YYYYMMDDHHMMSS.txt.pgp

product1_product2_product3_product4_product5_YYYYMMDDHHMMSS.txt.pgp

name1_name2_name3_name4_name5_name6_YYYYMMDDHHMMSS.txt.pgp

subcategory1_subcategory2_subcategory3_subcategory4_ddmmyyyy_HHMMSS.txt.pgp

 

and im looking for dynamic regex function that returns:

 

cat1_cat2_cat3_cat4_cat5

product1_product2_product3_product4_product5

name1_name2_name3_name4_name5_name6

subcategory1_subcategory2_subcategory3_subcategory4

 

I think it has to be a regex, substring won't work because the names aren't always the same length, im looking to return everything before the underscore preceding the (YYYYMMDDHHMMSS/ddmmyyyy_HHMMSS)

does that make sense?

 

In the example you provided I see that there's still an _ left at the end in some cases even after you performed your calculation, so I think it needs to be a regex

 

 

 

0 Kudos

You are discarding the simplest option just because it has an extra underscore at the end? That's trivial to fix. In my view this is a more robust option since the date part is unlikely to ever change. 

if(endsWith(substring(file_name, 0, length(file_name) - 23), "_"), substring(substring(file_name, 0, length(file_name) - 23), 0, -1), substring(file_name, 0, length(file_name) - 23))

Screenshot 2024-06-06 at 07.37.52.png

0 Kudos

Labels

?
Labels (3)

Setup info

?
A banner prompting to get Dataiku