Masking of middle string of text
Hi all, I need to mask some columns of personal data while retaining the first 2 and last 2 letters.
Please see my mock data:
NAME | ADDRESS | EMAIL | CONTACT NO |
Emily Brown | 21 Annabelle Street New Jersey | emily.brown@gmail.com | |
Amit Balakrishnan Junior | Blk 999, Hougang Str 99 #01-221 Singapore 221999 | Amit_Bala@hotmail.sg | 91234567 |
Farhan Bin Musa | No 24, Jalan Segamat Selangor | F.musa@temp.com.my;farhanm@yahoo.com; | +60 11 12345678 |
Kit Ng | Blk 2 Tingkat 7 unit 02 Kawasan Perumahan Jalan Bukit Jalil Malaysia | kit-ng@src.com;ng_yee_long@gmail.com; | 020 700 11111 |
Alexander Bartholomew Desdemona | 2 Kitten St QLD | admin@sugar.com | +61400111222 |
Operating system used: ios
Best Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,211 Dataiker
Hi,
Just to expand on the suggestions from @tgb417: you can indeed use a formula with regex to achieve this.
To break it down: I concatenate the first 2 characters, then use a regex to replace every character in the middle, and finally append the last 2 characters, located via the string length, to get the final string.
The regex uses \w, so spaces are left as they are; if you want to mask spaces as well, you can adapt the regex in the replace. Here ip_address_country is the column being masked, so substitute your own column name. The formula:
concat(slice(ip_address_country,0,2),replace(slice(ip_address_country,2,length(ip_address_country)-2),/\w/,"*"),slice(ip_address_country,(length(ip_address_country)-2),length(ip_address_country)))
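For anyone who wants to check the logic outside a formula, here is a minimal Python sketch of the same three steps (first 2 characters, middle with word characters replaced, last 2 characters). The function name mask_middle and the guard for strings of 4 characters or fewer are my own assumptions, not part of the formula above.

```python
import re


def mask_middle(value, keep=2, mask_char="*"):
    """Keep the first and last `keep` characters; mask word characters in between."""
    # Assumption: strings too short to have a middle are returned unchanged.
    if len(value) <= 2 * keep:
        return value
    middle = value[keep:-keep]
    # \w matches letters, digits and underscore, so spaces and punctuation stay visible.
    return value[:keep] + re.sub(r"\w", mask_char, middle) + value[-keep:]


print(mask_middle("Emily Brown"))  # Em*** ***wn
print(mask_middle("Kit Ng"))       # Ki* Ng
```

Swapping r"\w" for r"." would mask spaces and punctuation as well.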
Hope that helps.
-
Hi @Aminmin,
For multiple email addresses separated by semicolons (such as F.musa@temp.com.my;farhanm@yahoo.com; ) you could use a Python recipe with the code below to split them and replace the middle of every email address with *:

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu

    # Read recipe inputs
    DATAmask = dataiku.Dataset("DATAmask")
    df = DATAmask.get_dataframe()

    for i in range(len(df)):
        masked = []
        for s in df.at[i, 'email'].split(";"):
            if not s:
                continue  # skip the empty piece left by a trailing semicolon
            # Mask only when something lies between the first 2 and last 2 characters
            masked.append(s[0:2] + "*" * (len(s) - 4) + s[-2:] if len(s) > 4 else s)
        df.at[i, 'email'] = ";".join(masked)
    print(df)

    Maskdata_df = df

    # Write recipe outputs
    Maskdata = dataiku.Dataset("Maskdata")
    Maskdata.write_with_schema(Maskdata_df)

The output of this will be F.**************my;fa*************om.
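If you want to try the masking logic without a DSS project, the sketch below isolates it in a plain function. The name mask_emails, the keep parameter and the handling of missing values are assumptions added for illustration; the sample strings come from the mock data in the question.

```python
def mask_emails(cell, keep=2):
    """Mask each ';'-separated address, keeping `keep` characters at both ends."""
    if not isinstance(cell, str):
        return cell  # assumption: missing values (None/NaN) are passed through
    masked = []
    for address in cell.split(";"):
        address = address.strip()
        if not address:
            continue  # skip empty pieces, e.g. after a trailing semicolon
        if len(address) > 2 * keep:
            stars = "*" * (len(address) - 2 * keep)
            address = address[:keep] + stars + address[-keep:]
        masked.append(address)
    return ";".join(masked)


samples = ["F.musa@temp.com.my;farhanm@yahoo.com;", "admin@sugar.com", None]
for cell in samples:
    print(mask_emails(cell))
```

In a recipe this could replace the explicit loop with df['email'] = df['email'].map(mask_emails), which also avoids errors on rows where the email is empty.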
Answers
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
So there are a few ways I might consider doing something like this.
- I would look at the Pseudonymize processor first, which is likely the safest way to do something like this.
- You might be able to do something with a regular expression (regex): https://doc.dataiku.com/dss/latest/preparation/processors/find-replace.html
- You might also take a look at https://knowledge.dataiku.com/latest/courses/advanced-data-prep/prepare-recipe/smart-pattern-builder.html or https://knowledge.dataiku.com/latest/courses/advanced-data-prep/visual-recipes-102/adv-formula-regex/regex-summary.html
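To give an idea of what a regex-only approach could look like, here is a rough sketch using Python's re module. The pattern uses a lookbehind and a lookahead so that a character is replaced only when at least 2 characters come before it and at least 2 come after it. This is tested against Python only; whether the same pattern behaves identically in the find-and-replace processor is an assumption I have not verified. The names MIDDLE and mask_with_regex are made up for the example.

```python
import re

# A character qualifies only if 2 characters precede it and 2 follow it.
MIDDLE = re.compile(r"(?<=.{2}).(?=.{2})")


def mask_with_regex(value, mask_char="*"):
    """Replace every middle character, leaving the first 2 and last 2 intact."""
    return MIDDLE.sub(mask_char, value)


print(mask_with_regex("Emily Brown"))  # Em*******wn
print(mask_with_regex("91234567"))     # 91****67
```

Unlike a \w-based replace, this masks spaces and punctuation too, and strings of 4 characters or fewer come back unchanged.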
-
Hi Tom, thank you for taking the time to reply.
I will read through your suggestions and try them together with AlexT's input.
Appreciate your help
Kindest regards
Aminmin
-
Hi AlexT, extremely grateful and really appreciate your explanation.
Will try it out!!
Kindest regards
Aminmin
-
Dear Catalina S, my apologies for the late reply and thank you for your kind guidance.
Regards
Aminmin