REMOVAL OF CHAR USING FIND AND REPLACE

Solved!
Aminmin
Level 3
REMOVAL OF CHAR USING FIND AND REPLACE

Hi fellow Dataiku community, i have used a prepare recipe to create the following logic in Dataiku:

if COL_A contains the word "column", keep COL_A
if COL_A=COL_B, keep one of them
if one is blank, remove "," after COL_A or before COL_B
if both columns blank, then blank

My code is: 

if (contains(to lowercase(COL_A)="column"), COL_A, if(isNotNull(COL_A) && isNotNull(COL_B), COL_A +","+ COL_B, if(isNotNull(COL_A) == isNotNull(COL B), COL_A, if(isNull(COL_A),COL_B, if(isNull(COL_B), COL_A, "")))))

My problem lies in the following rows:

RowCOL_ACOL_BOUTPUT
1statement Astatement A.statement A,statement A.
2*statement B*,statement B
3NANA Statement NA in the middle,Statement in the middle

 

QN 1-Row 1: COL_A and COL_B has same value so only value from 1 column shld be populated(result-> statement A). but because of  the "." in COL_B, both values are populated.

From COL_B, I tried using find and replace .$ with blank (using regex) to remove the dot but in other rows the last letter was removed as well (e..g 4%p.a becomes 4%p.)

How do i prevent this from happening? am i wrong to use .$?

QN 2-Row 2 and 3: for COL_A, I used find and replace to remove * and NA using

^* replace with blank

NA replace with blank

under Substring (bec using regex induce an error)

commands did not work. even if i use a separate find and replace for each command of COL_A, it works for 1 but not the other. what shld i do?

QN 3-Row 3: for COL_A, when i find and replace NA (individually) under substring. I managed to remove NA. but for COL_B using the same method, dataiku find all "NA" in the sentence and remove both of them. what should i do to prevent the middle NA from getting removed too?

QN 4- in what instance do i use Substring or Regular Expression?

I wish to thank the kind person in advance for replying to my queries.

Regards

Amy


Operating system used: IOS

0 Kudos
1 Solution
Jurre
Level 5

Hi @Aminmin ,

QN-1 : the dot (.) is a special character when using regex, it stands for any character. When you mean the actual dot itself you have to escape that out with a backslash. You can test your regex here if experimenting in DSS itself is not an option for you 

QN-2 : As the mentioned values are not part of a bigger value why not use the option "complete value"  ? And : (*) is a special character in Regex too, it matches the previous token in your regex between zero and unlimited times.. If you insist on using regex here then escape it with a backslash.  

QN-3 : try ^NA for this. This is a very simplistic solution because i can't check a better one right now.. ( a general search "regex replace first occurrence of string" would give you some more options)

QN-4 : i don't have a general rule for using one or the other... regex are great because of the near infinite possibilities, but can be complex to construct and read after a while. Using the substring-option is also nice because it's a quick solution for easy challenges where a big gun like regex is simply over-reacting. When i do not have a clear view of what's happening in a certain column i like regex over substrings as regex can be very specific. 

hope this helps!

Jurre

 

View solution in original post

2 Replies
Jurre
Level 5

Hi @Aminmin ,

QN-1 : the dot (.) is a special character when using regex, it stands for any character. When you mean the actual dot itself you have to escape that out with a backslash. You can test your regex here if experimenting in DSS itself is not an option for you 

QN-2 : As the mentioned values are not part of a bigger value why not use the option "complete value"  ? And : (*) is a special character in Regex too, it matches the previous token in your regex between zero and unlimited times.. If you insist on using regex here then escape it with a backslash.  

QN-3 : try ^NA for this. This is a very simplistic solution because i can't check a better one right now.. ( a general search "regex replace first occurrence of string" would give you some more options)

QN-4 : i don't have a general rule for using one or the other... regex are great because of the near infinite possibilities, but can be complex to construct and read after a while. Using the substring-option is also nice because it's a quick solution for easy challenges where a big gun like regex is simply over-reacting. When i do not have a clear view of what's happening in a certain column i like regex over substrings as regex can be very specific. 

hope this helps!

Jurre

 

Aminmin
Level 3
Author

Hi Jurre, greatly appreciate your gudance. Your explanation is easy to understand and the suggestions are very useful.