Masking of middle string of text

Aminmin Dataiku DSS Core Designer, Registered Posts: 18 ✭✭✭✭

Hi all, I need to mask some columns of personal data while retaining the first 2 and last 2 letters.

Please see my mock data

Emily Brown21 Annabelle Street New
Amit Balakrishnan JuniorBlk 999, Hougang Str 99 #01-221 Singapore 221999Amit_Bala@hotmail.sg91234567
Farhan Bin MusaNo 24, Jalan Segamat;;+60 11 12345678
Kit NgBlk 2 Tingkat 7 unit 02 Kawasan Perumahan Jalan Bukit Jalil;;020 700 11111
Alexander Bartholomew Desdemona2 Kitten St

For name, i want to keep the first 2 letters and last 2 letters with the other letters to be replaced by *
eg. for Emily Brown my output should be Em*******wn, another example Kit Ng my output should be Ki**Ng
The middle part to be replaced will vary in length depending on the string in that cell.
i tried using the Pseudonymize Text in Prepare recipe but the whole text is replaced by random characters. This is not the output i want.
The same will apply for address and contact number, where i want to retain first 2 and last 2 letters/digit while everything else in the middle is replaced by *
For multiple email address eg;; my output should be F.***********om;fa*************om;
Would greatly appreciate if any kind expert is able to help me.
I thank you in advance for your attention and guidance

Operating system used: ios

Operating system used: ios

Best Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Answer ✓


    Just to expand on the suggestions from @tgb417
    . You can indeed use formula with regex to achieve this

    To break down in case I did a concatenation of the first 2 characters, then using regex to replace all characters in the middle, and then also adding the last 2 characters based on the string length to get the final string.

    The regex is using \w but if you want to replace spaces you can adapt the regex in the replace.


    Screenshot 2022-08-31 at 22.05.56.png

    Hope that helps.

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker
    Answer ✓
    For multiple email addresses separated by semicolon (such as;; ) you could use a Python recipe with below code to split them and replace the middle of every email address by *:
    import dataikuimport pandas as pd, numpy as npfrom dataiku import pandasutils as pdu# Read recipe inputsDATAmask = dataiku.Dataset("DATAmask")df = DATAmask.get_dataframe()for i in range(0, df.count()[2]):o=""for s in[i, 'email'].split(";"):o+=";"+ s[0:2]+"*"*(len(s)-4)+s[-2:] if len(s) >2 else[i, 'email'] = o[1:]print(df)Maskdata_df = df# Write recipe outputsMaskdata = dataiku.Dataset("Maskdata")Maskdata.write_with_schema(Maskdata_df)

    The output of this will be F.**************my;fa*************om.


Setup Info
      Help me…