How to parse an improperly concatenated dataset with missing nulls and delimiters into correct columns

UserBird (Dataiker, Alpha Tester, Posts: 535)
I need to parse a dataset with incorrectly concatenated fields of vendor data crammed into one "trouble" cell per row.
I have cleaned up the prior columns to the left except for the column with these "trouble" cells.
All of the data within the concatenated cell is in the correct order, and the column groupings the data belong in repeat in the same order each time.
However (the BIG however), the concatenated data within the "trouble" cell is semicolon delimited, but was never given delimiters for the null fields.

The end result is that any missing data in the semicolon-delimited cell shifts all later fields of the same record when it is parsed out.

For example, here is the eight-field contact detail per vendor (the address spans two fields, street and city, so the list below shows seven column names):

vendor_name_1, vendor_title_1, vendor_address_1, vendor_phone_1, vendor_fax_1, vendor_contact_mobile_1, vendor_contact_email_1

When there is more than one vendor for that product, the same set of columns must repeat, in the same order, to hold the data:

vendor_name_2, vendor_title_2, vendor_address_2, vendor_phone_2, vendor_fax_2, vendor_contact_mobile_2, vendor_contact_email_2

The sets of columns must keep repeating, one set per additional vendor on that row, until the trouble cell is fully read out.

A "good" row will have all of the available data in the correct column for that vendor set:

Motion Distributors; 3231 Apex Drive; Dulles, Ohio 45321; (321) 542-6422(p); (321) 542-6428(f); (321) 542-6680(m); alan@motiondist.com; etc. etc.

A "bad" row will have (one or more) missing items for at least one vendor within the cell on that row.

This affects everything to the right of the missing data (where the null should have been), so all of that data becomes "shifted".

My problem is getting the data within the trouble cell to parse back out into the correct columns.

Until I have a way to parse out the data into the correct columns again, I do not know how many sets of data are actually within the trouble cell.
I already have regex descriptions for what should be in each column.

Since the trouble cell can hold several sets of data, and the concatenated data is in the correct order, how do I direct Dataiku to keep parsing the trouble cell (inserting nulls where they should be) into the correct columns until the cell is completely read?

Does Dataiku require adding enough columns to hold the parsed data first?
Can I somehow encapsulate my regex statement array into a module to pass the trouble concatenated data cell through?
My regex is ready to go in an array to process each set of data encountered within the trouble cell.

name: ^.+$
title: ^.+$
street: ^\d+\s.+$
city: ^.+\s\d{5}$
phone: ^\(\d{3}\)\s\d{3}-\d{4}\(p\)$
fax: ^\(\d{3}\)\s\d{3}-\d{4}\(f\)$
mobile: ^\(\d{3}\)\s\d{3}-\d{4}\(m\)$
email: ^\w+@\w+\.\w+$
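For reference, the patterns can be kept as Python raw strings and sanity-checked against fields from the sample row above (the exact street/city split and the escaping are my reading of the list; adjust to your real data):

```python
import re

# The eight field patterns, written as Python raw-string regexes.
# The street/city split and escaping are illustrative assumptions.
patterns = {
    "name":   r"^.+$",
    "title":  r"^.+$",
    "street": r"^\d+\s.+$",
    "city":   r"^.+\s\d{5}$",
    "phone":  r"^\(\d{3}\)\s\d{3}-\d{4}\(p\)$",
    "fax":    r"^\(\d{3}\)\s\d{3}-\d{4}\(f\)$",
    "mobile": r"^\(\d{3}\)\s\d{3}-\d{4}\(m\)$",
    "email":  r"^\w+@\w+\.\w+$",
}

def matches(field, value):
    """True if the value looks like the given field."""
    return re.match(patterns[field], value) is not None
```

One caveat worth noting: the name and title patterns are catch-alls (`^.+$`), so a missing name or title cannot be detected by pattern matching alone.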

I am not clear how to get Dataiku to treat my regex as a set to parse through the trouble cell and isolate what should go into each column.

Dataiku looks amazing. I hope this is possible; any help would be greatly appreciated...

Answers

  • cperdigou (Alpha Tester, Dataiker Alumni, Posts: 115)
    You can probably do this in Python, using a list of regexes. I would:
    Check whether the current item matches the current regex:
    if yes: append it to a list, move to the next item, and advance to the next regex;
    if not: append null to the list, stay on the same item, and advance to the next regex.

    Then iterate, cycling through the regex list, and stop once the last item has been consumed (padding the list with nulls until len(list) % 8 == 0).
    Would that work?
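A minimal Python sketch of the cycle-and-pad approach described above (the field patterns, function name, and the "Purchasing Mgr" title in the example are my illustrative assumptions, not part of the original post):

```python
import re

# Field patterns in repeating order; street/city split and escaping are assumptions.
FIELD_PATTERNS = [
    ("name",   re.compile(r"^.+$")),
    ("title",  re.compile(r"^.+$")),
    ("street", re.compile(r"^\d+\s.+$")),
    ("city",   re.compile(r"^.+\s\d{5}$")),
    ("phone",  re.compile(r"^\(\d{3}\)\s\d{3}-\d{4}\(p\)$")),
    ("fax",    re.compile(r"^\(\d{3}\)\s\d{3}-\d{4}\(f\)$")),
    ("mobile", re.compile(r"^\(\d{3}\)\s\d{3}-\d{4}\(m\)$")),
    ("email",  re.compile(r"^\w+@\w+\.\w+$")),
]

def parse_trouble_cell(cell):
    """Align semicolon-delimited items to the cyclic field list,
    emitting None wherever the expected field is missing."""
    items = [s.strip() for s in cell.split(";") if s.strip()]
    out, i, slot = [], 0, 0
    while i < len(items):
        _, pattern = FIELD_PATTERNS[slot]
        if pattern.match(items[i]):
            out.append(items[i])
            i += 1            # item matched this field: consume it
        else:
            out.append(None)  # field missing: leave a null, keep the item
        slot = (slot + 1) % len(FIELD_PATTERNS)
    # pad the last vendor set out to a full group of eight
    while len(out) % len(FIELD_PATTERNS) != 0:
        out.append(None)
    return out
```

In Dataiku this could run inside a Python recipe, applying `parse_trouble_cell` to the trouble column and expanding the returned lists into new columns with pandas. Note the same caveat as above: the catch-all name/title patterns match anything, so a missing name or title will silently absorb the next item rather than produce a null.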