REMOVAL OF CHAR USING FIND AND REPLACE
Hi fellow Dataiku community, i have used a prepare recipe to create the following logic in Dataiku:
if COL_A contains the word "column", keep COL_A
if COL_A=COL_B, keep one of them
if one is blank, remove "," after COL_A or before COL_B
if both columns blank, then blank
My code is:
if (contains(to lowercase(COL_A)="column"), COL_A, if(isNotNull(COL_A) && isNotNull(COL_B), COL_A +","+ COL_B, if(isNotNull(COL_A) == isNotNull(COL , COL_A, if(isNull(COL_A),COL_B, if(isNull(COL_B), COL_A, "")))))
My problem lies in the following rows:
Row | COL_A | COL_B | OUTPUT |
1 | statement A | statement A. | statement A,statement A. |
2 | * | statement B | *,statement B |
3 | NA | NA Statement NA in the middle | ,Statement in the middle |
QN 1-Row 1: COL_A and COL_B has same value so only value from 1 column shld be populated(result-> statement A). but because of the "." in COL_B, both values are populated.
From COL_B, I tried using find and replace .$ with blank (using regex) to remove the dot but in other rows the last letter was removed as well (e..g 4%p.a becomes 4%p.)
How do i prevent this from happening? am i wrong to use .$?
QN 2-Row 2 and 3: for COL_A, I used find and replace to remove * and NA using
^* replace with blank
NA replace with blank
under Substring (bec using regex induce an error)
commands did not work. even if i use a separate find and replace for each command of COL_A, it works for 1 but not the other. what shld i do?
QN 3-Row 3: for COL_A, when i find and replace NA (individually) under substring. I managed to remove NA. but for COL_B using the same method, dataiku find all "NA" in the sentence and remove both of them. what should i do to prevent the middle NA from getting removed too?
QN 4- in what instance do i use Substring or Regular Expression?
I wish to thank the kind person in advance for replying to my queries.
Regards
Amy
Operating system used: IOS
Best Answer
-
Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 115 ✭✭✭✭✭✭✭
Hi @Aminmin
,QN-1 : the dot (.) is a special character when using regex, it stands for any character. When you mean the actual dot itself you have to escape that out with a backslash. You can test your regex here if experimenting in DSS itself is not an option for you
QN-2 : As the mentioned values are not part of a bigger value why not use the option "complete value" ? And : (*) is a special character in Regex too, it matches the previous token in your regex between zero and unlimited times.. If you insist on using regex here then escape it with a backslash.
QN-3 : try ^NA for this. This is a very simplistic solution because i can't check a better one right now.. ( a general search "regex replace first occurrence of string" would give you some more options)
QN-4 : i don't have a general rule for using one or the other... regex are great because of the near infinite possibilities, but can be complex to construct and read after a while. Using the substring-option is also nice because it's a quick solution for easy challenges where a big gun like regex is simply over-reacting. When i do not have a clear view of what's happening in a certain column i like regex over substrings as regex can be very specific.
hope this helps!
Jurre
Answers
-
Hi Jurre, greatly appreciate your gudance. Your explanation is easy to understand and the suggestions are very useful.