Masking of middle string of text

Options
Aminmin
Aminmin Dataiku DSS Core Designer, Registered Posts: 18 ✭✭✭✭

Hi all, I need to mask some columns of personal data while retaining the first 2 and last 2 letters.

Please see my mock data

NAMEADDRESSEMAILCONTACT NO
Emily Brown21 Annabelle Street New Jerseyemily.brown@gmail.com
Amit Balakrishnan JuniorBlk 999, Hougang Str 99 #01-221 Singapore 221999Amit_Bala@hotmail.sg91234567
Farhan Bin MusaNo 24, Jalan Segamat SelangorF.musa@temp.com.my;farhanm@yahoo.com;+60 11 12345678
Kit NgBlk 2 Tingkat 7 unit 02 Kawasan Perumahan Jalan Bukit Jalil Malaysiakit-ng@src.com;ng_yee_long@gmail.com;020 700 11111
Alexander Bartholomew Desdemona2 Kitten St QLDadmin@sugar.com+61400111222

For name, i want to keep the first 2 letters and last 2 letters with the other letters to be replaced by *
eg. for Emily Brown my output should be Em*******wn, another example Kit Ng my output should be Ki**Ng
The middle part to be replaced will vary in length depending on the string in that cell.
i tried using the Pseudonymize Text in Prepare recipe but the whole text is replaced by random characters. This is not the output i want.
The same will apply for address and contact number, where i want to retain first 2 and last 2 letters/digit while everything else in the middle is replaced by *
For multiple email address eg F.musa@temp.com.my;farhanm@yahoo.com; my output should be F.***********om;fa*************om;
Would greatly appreciate if any kind expert is able to help me.
I thank you in advance for your attention and guidance
Regards
Aminmin

Operating system used: ios


Operating system used: ios

Best Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    edited July 17 Answer ✓
    Options

    Hi,

    Just to expand on the suggestions from @tgb417
    . You can indeed use formula with regex to achieve this

    To break down in case I did a concatenation of the first 2 characters, then using regex to replace all characters in the middle, and then also adding the last 2 characters based on the string length to get the final string.

    The regex is using \w but if you want to replace spaces you can adapt the regex in the replace.

    concat(slice(ip_address_country,0,2),replace(slice(ip_address_country,2,length(ip_address_country)-2),/\w/,"*"),slice(ip_address_country,(length(ip_address_country)-2),length(ip_address_country)))

    Screenshot 2022-08-31 at 22.05.56.png

    Hope that helps.

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker
    edited July 17 Answer ✓
    Options
    For multiple email addresses separated by semicolon (such as F.musa@temp.com.my;farhanm@yahoo.com; ) you could use a Python recipe with below code to split them and replace the middle of every email address by *:
    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    DATAmask = dataiku.Dataset("DATAmask")
    df = DATAmask.get_dataframe()
    
    for i in range(0, df.count()[2]):
      o="" 
      for s in df.at[i, 'email'].split(";"): 
          o+=";"+ s[0:2]+"*"*(len(s)-4)+s[-2:]  if len(s) >2 else s
      df.at[i, 'email'] = o[1:] 
    print(df)
    
    Maskdata_df = df 
    
    # Write recipe outputs
    Maskdata = dataiku.Dataset("Maskdata")
    Maskdata.write_with_schema(Maskdata_df)

    The output of this will be F.**************my;fa*************om.

Answers

Setup Info
    Tags
      Help me…