Discover this year's submissions to the Dataiku Frontrunner Awards and give kudos to your favorite use cases and success stories!READ MORE

Masking of middle string of text

Solved!
Aminmin
Level 3
Masking of middle string of text

Hi all, I need to mask some columns of personal data while retaining the first 2 and last 2 letters.

Please see my mock data

NAMEADDRESSEMAILCONTACT NO
Emily Brown21 Annabelle Street New Jerseyemily.brown@gmail.com 
Amit Balakrishnan JuniorBlk 999, Hougang Str 99 #01-221 Singapore 221999Amit_Bala@hotmail.sg91234567
Farhan Bin MusaNo 24, Jalan Segamat SelangorF.musa@temp.com.my;farhanm@yahoo.com;+60 11 12345678
Kit NgBlk 2 Tingkat 7 unit 02 Kawasan Perumahan Jalan Bukit Jalil Malaysiakit-ng@src.com;ng_yee_long@gmail.com;020 700 11111
Alexander Bartholomew Desdemona2 Kitten St QLDadmin@sugar.com+61400111222

 

For name, i want to keep the first 2 letters and last 2 letters with the other letters to be replaced by *
eg. for Emily Brown my output should be Em*******wn, another example Kit Ng my output should be Ki**Ng
The middle part to be replaced will vary in length depending on the string in that cell.
 
i tried using the Pseudonymize Text in Prepare recipe but the whole text is replaced by random characters. This is not the output i want.
 
The same will apply for address and contact number, where i want to retain first 2 and last 2 letters/digit while everything else in the middle is replaced by *
 
For multiple email address eg F.musa@temp.com.my;farhanm@yahoo.com; my output should be F.***********om;fa*************om;
 
Would greatly appreciate if any kind expert is able to help me.
I thank you in advance for your attention and guidance
 
Regards
Aminmin
 

Operating system used: ios


Operating system used: ios

0 Kudos
2 Solutions
AlexT
Dataiker
Dataiker

Hi,

Just to expand on the suggestions from @tgb417. You can indeed use formula with regex to achieve this

To break down in case I did a concatenation of the first 2 characters, then using regex to replace all characters in the middle, and then also adding the last 2 characters based on the string length to get the final string.
 
The regex is using \w but if you want to replace spaces you can adapt the regex in the replace. 

concat(slice(ip_address_country,0,2),replace(slice(ip_address_country,2,length(ip_address_country)-2),/\w/,"*"),slice(ip_address_country,(length(ip_address_country)-2),length(ip_address_country)))

Screenshot 2022-08-31 at 22.05.56.png

 

 Hope that helps. 

View solution in original post

CatalinaS
Dataiker
Dataiker
 
For multiple email addresses separated by semicolon (such as F.musa@temp.com.my;farhanm@yahoo.com; ) you could use a Python recipe with below code to split them and replace the middle of every email address by *:
 
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
DATAmask = dataiku.Dataset("DATAmask")
df = DATAmask.get_dataframe()

for i in range(0, df.count()[2]):
  o="" 
  for s in df.at[i, 'email'].split(";"): 
      o+=";"+ s[0:2]+"*"*(len(s)-4)+s[-2:]  if len(s) >2 else s
  df.at[i, 'email'] = o[1:] 
print(df)

Maskdata_df = df 

# Write recipe outputs
Maskdata = dataiku.Dataset("Maskdata")
Maskdata.write_with_schema(Maskdata_df)

 

The output of this will be F.**************my;fa*************om.

View solution in original post

6 Replies
tgb417
Neuron
Neuron

@Aminmin 

So there are two way's I might consider doing something like this. 

 

--Tom
AlexT
Dataiker
Dataiker

Hi,

Just to expand on the suggestions from @tgb417. You can indeed use formula with regex to achieve this

To break down in case I did a concatenation of the first 2 characters, then using regex to replace all characters in the middle, and then also adding the last 2 characters based on the string length to get the final string.
 
The regex is using \w but if you want to replace spaces you can adapt the regex in the replace. 

concat(slice(ip_address_country,0,2),replace(slice(ip_address_country,2,length(ip_address_country)-2),/\w/,"*"),slice(ip_address_country,(length(ip_address_country)-2),length(ip_address_country)))

Screenshot 2022-08-31 at 22.05.56.png

 

 Hope that helps. 

Aminmin
Level 3
Author

Hi AlexT, extremely grateful and really appreciate your explanation.

Will try it out!!

 

Kindest regards

Aminmin

0 Kudos
Aminmin
Level 3
Author

Hi Tom, thank you for taking time to reply me.

I will read through your suggestion and try it together with AlexT's input.

Appreciate your help

Kindest regards

Aminmin

CatalinaS
Dataiker
Dataiker
 
For multiple email addresses separated by semicolon (such as F.musa@temp.com.my;farhanm@yahoo.com; ) you could use a Python recipe with below code to split them and replace the middle of every email address by *:
 
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
DATAmask = dataiku.Dataset("DATAmask")
df = DATAmask.get_dataframe()

for i in range(0, df.count()[2]):
  o="" 
  for s in df.at[i, 'email'].split(";"): 
      o+=";"+ s[0:2]+"*"*(len(s)-4)+s[-2:]  if len(s) >2 else s
  df.at[i, 'email'] = o[1:] 
print(df)

Maskdata_df = df 

# Write recipe outputs
Maskdata = dataiku.Dataset("Maskdata")
Maskdata.write_with_schema(Maskdata_df)

 

The output of this will be F.**************my;fa*************om.

Aminmin
Level 3
Author

Dear Catalina S, my apologies for the late reply and thank you for your kind guidance.

Regards

Aminmin

0 Kudos