Masking of middle string of text

Aminmin · ‎08-31-2022

Hi all, I need to mask some columns of personal data while retaining the first 2 and last 2 letters.

Please see my mock data

NAME	ADDRESS	EMAIL	CONTACT NO
Emily Brown	21 Annabelle Street New Jersey	emily.brown@gmail.com
Amit Balakrishnan Junior	Blk 999, Hougang Str 99 #01-221 Singapore 221999	Amit_Bala@hotmail.sg	91234567
Farhan Bin Musa	No 24, Jalan Segamat Selangor	F.musa@temp.com.my;farhanm@yahoo.com;	+60 11 12345678
Kit Ng	Blk 2 Tingkat 7 unit 02 Kawasan Perumahan Jalan Bukit Jalil Malaysia	kit-ng@src.com;ng_yee_long@gmail.com;	020 700 11111
Alexander Bartholomew Desdemona	2 Kitten St QLD	admin@sugar.com	+61400111222

For name, i want to keep the first 2 letters and last 2 letters with the other letters to be replaced by *

eg. for Emily Brown my output should be Em*******wn, another example Kit Ng my output should be Ki**Ng

The middle part to be replaced will vary in length depending on the string in that cell.

i tried using the Pseudonymize Text in Prepare recipe but the whole text is replaced by random characters. This is not the output i want.

The same will apply for address and contact number, where i want to retain first 2 and last 2 letters/digit while everything else in the middle is replaced by *

For multiple email address eg F.musa@temp.com.my;farhanm@yahoo.com; my output should be F.***********om;fa*************om;

Would greatly appreciate if any kind expert is able to help me.

I thank you in advance for your attention and guidance

Regards

Aminmin

Operating system used: ios

AlexT · ‎08-31-2022

Hi,

Just to expand on the suggestions from @tgb417. You can indeed use formula with regex to achieve this

To break down in case I did a concatenation of the first 2 characters, then using regex to replace all characters in the middle, and then also adding the last 2 characters based on the string length to get the final string.

The regex is using \w but if you want to replace spaces you can adapt the regex in the replace.

concat(slice(ip_address_country,0,2),replace(slice(ip_address_country,2,length(ip_address_country)-2),/\w/,"*"),slice(ip_address_country,(length(ip_address_country)-2),length(ip_address_country)))

Hope that helps.

View solution in original post

CatalinaS · ‎09-01-2022

Hi @Aminmin,

For multiple email addresses separated by semicolon (such as F.musa@temp.com.my;farhanm@yahoo.com; ) you could use a Python recipe with below code to split them and replace the middle of every email address by *:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
DATAmask = dataiku.Dataset("DATAmask")
df = DATAmask.get_dataframe()

for i in range(0, df.count()[2]):
  o="" 
  for s in df.at[i, 'email'].split(";"): 
      o+=";"+ s[0:2]+"*"*(len(s)-4)+s[-2:]  if len(s) >2 else s
  df.at[i, 'email'] = o[1:] 
print(df)

Maskdata_df = df 

# Write recipe outputs
Maskdata = dataiku.Dataset("Maskdata")
Maskdata.write_with_schema(Maskdata_df)

The output of this will be F.**************my;fa*************om.

View solution in original post

tgb417 · ‎08-31-2022

@Aminmin

So there are two way's I might consider doing something like this.

I would have looked at the Pseudonymize which is likely the safest way to do something like this.
You might be able to do something with a regular expression (regx). https://doc.dataiku.com/dss/latest/preparation/processors/find-replace.html
You might also take a look at
https://knowledge.dataiku.com/latest/courses/advanced-data-prep/prepare-recipe/smart-pattern-builder... OR https://knowledge.dataiku.com/latest/courses/advanced-data-prep/visual-recipes-102/adv-formula-regex...

--Tom

AlexT · ‎08-31-2022

Hi,

Just to expand on the suggestions from @tgb417. You can indeed use formula with regex to achieve this

To break down in case I did a concatenation of the first 2 characters, then using regex to replace all characters in the middle, and then also adding the last 2 characters based on the string length to get the final string.

The regex is using \w but if you want to replace spaces you can adapt the regex in the replace.

concat(slice(ip_address_country,0,2),replace(slice(ip_address_country,2,length(ip_address_country)-2),/\w/,"*"),slice(ip_address_country,(length(ip_address_country)-2),length(ip_address_country)))

Hope that helps.

Aminmin · ‎09-01-2022

Hi AlexT, extremely grateful and really appreciate your explanation.

Will try it out!!

Kindest regards

Aminmin

Aminmin · ‎09-01-2022

Hi Tom, thank you for taking time to reply me.

I will read through your suggestion and try it together with AlexT's input.

Appreciate your help

Kindest regards

Aminmin

CatalinaS · ‎09-01-2022

Hi @Aminmin,

For multiple email addresses separated by semicolon (such as F.musa@temp.com.my;farhanm@yahoo.com; ) you could use a Python recipe with below code to split them and replace the middle of every email address by *:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
DATAmask = dataiku.Dataset("DATAmask")
df = DATAmask.get_dataframe()

for i in range(0, df.count()[2]):
  o="" 
  for s in df.at[i, 'email'].split(";"): 
      o+=";"+ s[0:2]+"*"*(len(s)-4)+s[-2:]  if len(s) >2 else s
  df.at[i, 'email'] = o[1:] 
print(df)

Maskdata_df = df 

# Write recipe outputs
Maskdata = dataiku.Dataset("Maskdata")
Maskdata.write_with_schema(Maskdata_df)

The output of this will be F.**************my;fa*************om.

Aminmin · ‎09-08-2022

Dear Catalina S, my apologies for the late reply and thank you for your kind guidance.

Regards

Aminmin

Sign up to take part

Masking of middle string of text

Masking of middle string of text

Setup info