REMOVAL OF CHAR USING FIND AND REPLACE

Aminmin · February 2022

Hi fellow Dataiku community, i have used a prepare recipe to create the following logic in Dataiku:

if COL_A contains the word "column", keep COL_A
if COL_A=COL_B, keep one of them
if one is blank, remove "," after COL_A or before COL_B
if both columns blank, then blank

My code is:

if (contains(to lowercase(COL_A)="column"), COL_A, if(isNotNull(COL_A) && isNotNull(COL_B), COL_A +","+ COL_B, if(isNotNull(COL_A) == isNotNull(COL , COL_A, if(isNull(COL_A),COL_B, if(isNull(COL_B), COL_A, "")))))

My problem lies in the following rows:

Row	COL_A	COL_B	OUTPUT
1	statement A	statement A.	statement A,statement A.
2	*	statement B	*,statement B
3	NA	NA Statement NA in the middle	,Statement in the middle

QN 1-Row 1: COL_A and COL_B has same value so only value from 1 column shld be populated(result-> statement A). but because of the "." in COL_B, both values are populated.

From COL_B, I tried using find and replace .$ with blank (using regex) to remove the dot but in other rows the last letter was removed as well (e..g 4%p.a becomes 4%p.)

How do i prevent this from happening? am i wrong to use .$?

QN 2-Row 2 and 3: for COL_A, I used find and replace to remove * and NA using

^* replace with blank

NA replace with blank

under Substring (bec using regex induce an error)

commands did not work. even if i use a separate find and replace for each command of COL_A, it works for 1 but not the other. what shld i do?

QN 3-Row 3: for COL_A, when i find and replace NA (individually) under substring. I managed to remove NA. but for COL_B using the same method, dataiku find all "NA" in the sentence and remove both of them. what should i do to prevent the middle NA from getting removed too?

QN 4- in what instance do i use Substring or Regular Expression?

I wish to thank the kind person in advance for replying to my queries.

Regards

Amy

Operating system used: IOS

Jurre · March 2022

Hi @Aminmin
,

QN-1 : the dot (.) is a special character when using regex, it stands for any character. When you mean the actual dot itself you have to escape that out with a backslash. You can test your regex here if experimenting in DSS itself is not an option for you

QN-2 : As the mentioned values are not part of a bigger value why not use the option "complete value" ? And : (*) is a special character in Regex too, it matches the previous token in your regex between zero and unlimited times.. If you insist on using regex here then escape it with a backslash.

QN-3 : try ^NA for this. This is a very simplistic solution because i can't check a better one right now.. ( a general search "regex replace first occurrence of string" would give you some more options)

QN-4 : i don't have a general rule for using one or the other... regex are great because of the near infinite possibilities, but can be complex to construct and read after a while. Using the substring-option is also nice because it's a quick solution for easy challenges where a big gun like regex is simply over-reacting. When i do not have a clear view of what's happening in a certain column i like regex over substrings as regex can be very specific.

hope this helps!

Jurre

Aminmin · March 2022

Hi Jurre, greatly appreciate your gudance. Your explanation is easy to understand and the suggestions are very useful.

REMOVAL OF CHAR USING FIND AND REPLACE

Best Answer

Answers

Categories

Setup Info

Tags