The Dataiku Frontrunner Awards have launched to recognize your achievements! SUBMIT YOUR ENTRY

Split multiple columns based on comma-separated values in cells

Vestlink
Level 1
Level 1
Split multiple columns based on comma-separated values in cells

Hi.

Is it possible to split multiple columns in a prepare recipe where the columns have the same prefix and where the content in a column is comma-separated without having to make multiple steps?

Example:

prefix_01   prefix_02

1,2,3          2,3,4

 

etc.

 

Best regards,

 

Nicolai

0 Kudos
6 Replies
tgb417
Neuron
Neuron

@Vestlink ,

I'm not clear that I understand exactly what you are trying to do.

In the examples you have listed it appears that you are trying to split on commas within a column and you want to make those results into an initially unknown number of additional columns. There are several ways to do this kind of thing with DSS.  You could look at the split or extract recipes steps.

You then want to be able to repeat those steps across multiple columns that share a column prefix.  (My guess is that you eventually plan to have a lot more than 2 columns or the names of the columns are not known while you are creating the recipe.)

The challenge you may be having is that the split and extract recipes steps only work on a single column at a time.  And you either don't know the names of the incoming columns or there are a large enough number of columns that hand-selecting each column for spiting would be challenging.  Is this a correct intuition on my part?

Do you know how to program in Python?  If so, it may be possible to do what you want to do in a python visual recipe step.

The code show here does not solve your problem .  It is the example code.The code show here does not solve your problem . It is the example code.

Although, I've created a few python recipe steps to do some wonderful things.  It is still a bit of a mysterious art to me how to actually create the kind of results you are describing.  (Although, I believe it might be doable.)

You can find the documentation for Python Function steps here.

Here is an extended thread on Python Visual recipe steps.

Here is a knowledge base article on the topic.

Here is an example that is similar to your problem set.

Hope this helps a little bit.  Can you clarify for others if the description above is what you are trying to do. Then others may be able to jump in here and give a bit more clarity.  Please also let us know how you get on with your challenge.  Most of us here enjoying solving puzzles like this.

--Tom
0 Kudos
Vestlink
Level 1
Level 1
Author

Hi.

Excellent!

I'll look into your suggestions and do appreciate the challenges you have outlined.

BR,

Nicolai

0 Kudos
Vestlink
Level 1
Level 1
Author

I got hold on an example of the data in mind.

 

Screenshot_6.jpg

If you look at this example.

For each column "LID_split_" the column should be split based on the comma - separated values in the cell.

 

Sorry for not providing the example earlier...

0 Kudos
tgb417
Neuron
Neuron

@Vestlink ,

Hmmmm..... I'm looking at the data you are showing here.  It almost looks like, what you are showing is an attempt upstream of this data to flatten a hierarchical maybe XML data source.

I'm not an expert in this.  However, I'm wondering if going back to the original source with something like xpath might get you a cleaner data parsing.  Here is some documentation about Dataiku's XML implementation.  

Others please feel free to jump in.

--Tom
0 Kudos
Vestlink
Level 1
Level 1
Author

Hi Tom.

Tnx for you valuable input.

 

I'll look into the links you've send.

The data source is a DB text dump :o, but would have been easier to handle if it indeed had been xml.

Tnx again

 

Nicolai

0 Kudos
Jurre
Level 3

Hi @Vestlink  (and @tgb417 !),

Possibly a bit too simplistic thinking but .. as you state the original file is a DB-text dump. If that's a simple, delimited values file the column descriptions will be on top right ? why not work on the column-specs provided there instead of the data itself ?

a rough idea what i have in mind here: 

  • fill empty slots with something structurally similar to the rest of the data to get a fully rectangular dataset, then export as csv
  • add columns in the row with columnspecs as needed. 
  • import again

Pretty sure this could be done with some custom coding directly on your DB-dump, but i'm not yet qualified to start sharing code-examples.. 

 

0 Kudos
A banner prompting to get Dataiku DSS
Public