
HOW TO VALIDATE EMAIL ADDRESS IN A COLUMN OF A FILE

Aminmin
Level 3

Hi Dataiku Community, I have a column in an Excel file containing email addresses.

I would appreciate any advice on how to validate the email addresses to ensure they are correctly formed, for example by checking that the '@' and '.' signs are in the right places, separating the name, domain name and domain.

How can I create a simple flag or pattern, or use a recipe in a flow? I saw some posts on using Python or a plugin, but I don't have the brain capacity to understand that jargon 😅

Thank you in advance for your time and kind advice.

Regards

Aminmin


Operating system used: IOS

7 Replies
AlexT
Dataiker

Hi,

You can use a DSS prepare recipe with the processor: https://doc.dataiku.com/dss/latest/preparation/processors/flag-on-meaning.html

Set the meaning to E-mail for your column and you will be able to flag valid/invalid email addresses.

Thanks,

Aminmin
Level 3
Author

Hi AlexT, I have tried your suggestion. However, Dataiku seems to accept an email address as valid even when I enter values such as @.sc.com or @sccom. Please see rows 2 and 4.

Can you please advise what Dataiku checks when the meaning is E-mail? What else do you suggest I try?

I have also tried @Jurre's suggestion to split multiple email addresses on the semicolon (;).

Thank you for your attention. I look forward to your advice.

 

Regards

Aminmin

Aminmin
Level 3
Author

Hi AlexT, thank you for your reply.

I will try it out. 

Btw, will this work if there are multiple email addresses in a cell, separated by semicolons?

Thanks and have a great weekend!!

 

Regards

Aminmin

Jurre
Neuron

Hi @Aminmin ,

It might be a good idea to filter out the records with multiple email addresses, for example by splitting the dataset on the occurrence of a semicolon in the email address column. Then, in the resulting multi-email dataset, split the column containing multiple email addresses on that semicolon to get individual, recognisable addresses. Just a thought, best wishes for the weekend all!

Aminmin
Level 3
Author

Hi Jurre, thank you for your suggestion. I will be sure to try it out.

Regards

Aminmin

Jurre
Neuron

Another option might be to:

  • check whether the email addresses are always separated by a semicolon (with a formula processor within the prepare recipe)
  • split and fold the column with email addresses to get a single column with (hopefully) single values, which can then be processed further as @AlexT suggested.

Just a suggestion, I'm sure you will find alternatives or variations which better suit your challenge. Would it be possible to share the one which worked best for you? Thanks!
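
For anyone who does want to see the split-and-fold idea in code, here is a minimal plain-Python sketch. It is only an illustration of the logic, not what the prepare recipe does internally; the function name `split_and_fold`, the column name `email` and the sample rows are all made up for the example, and it assumes empty pieces should simply be dropped.

```python
def split_and_fold(rows, column="email", sep=";"):
    """Return one row per address: a cell holding several addresses
    separated by `sep` becomes several rows, other columns copied as-is."""
    folded = []
    for row in rows:
        cell = row.get(column) or ""
        for part in str(cell).split(sep):
            part = part.strip()          # remove spaces around each address
            if part:                     # assumption: drop empty pieces
                folded.append({**row, column: part})
    return folded


# Hypothetical sample data
rows = [
    {"id": 1, "email": "a@x.com; b@y.org"},
    {"id": 2, "email": "c@z.net"},
]
for r in split_and_fold(rows):
    print(r)
```

Each resulting row holds a single address, so it can be checked on its own afterwards.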

 

Turribeach
Level 6

So you really have two questions in one. With regard to separating multiple email addresses into separate columns, you should post another question, as it is a completely different issue.

With regard to email validation, it clearly seems that Dataiku's E-mail meaning is not clever enough to detect incorrect email addresses like @.sc.com or @sccom. Validating email addresses can be a very complex task depending on the level of validation that you want to achieve. For instance, do you want to validate that the email domain exists? That the user account exists? The py3-validate-email Python package is one of the most complete email validators out there, supporting many levels of validation. But since you would rather not get your hands dirty coding a solution in Python, the best you can realistically achieve is to use a regular expression:

https://stackoverflow.com/questions/8022530/how-to-check-for-valid-email-address
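
As a rough sketch of what such a regular expression could look like, here is a deliberately loose pattern written with Python's standard `re` module. It is not the pattern from the linked answer and falls far short of full RFC-level validation; it only checks for "something@domain.tld". The names `EMAIL_RE` and `is_plausible_email` are invented for illustration.

```python
import re

# Loose format check only: local part, '@', one or more domain labels,
# then a final label of at least two letters. Does not check that the
# domain or the mailbox actually exists.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+"          # name before the '@' (at least one char)
    r"@"
    r"[A-Za-z0-9-]+"              # first domain label (cannot be empty)
    r"(\.[A-Za-z0-9-]+)*"         # optional further labels
    r"\.[A-Za-z]{2,}"             # final label, e.g. .com
)

def is_plausible_email(value):
    """True if `value` is a string that matches the loose pattern above."""
    if not isinstance(value, str):
        return False
    return EMAIL_RE.fullmatch(value.strip()) is not None


for candidate in ["user@sc.com", "@.sc.com", "@sccom", "user@sccom"]:
    print(candidate, is_plausible_email(candidate))
# user@sc.com True
# @.sc.com False
# @sccom False
# user@sccom False
```

A pattern like this will still accept some addresses that do not exist and may reject a few unusual but valid ones, which is the trade-off of format-only checks.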