Split multiple columns based on comma-separated values in cells

Options
Vestlink
Vestlink Registered Posts: 4 ✭✭✭

Hi.

Is it possible to split multiple columns in a prepare recipe where the columns have the same prefix and where the content in a column is comma-separated without having to make multiple steps?

Example:

prefix_01 prefix_02

1,2,3 2,3,4

etc.

Best regards,

Nicolai

Answers

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @Vestlink
    ,

    I'm not clear that I understand exactly what you are trying to do.

    In the examples you have listed it appears that you are trying to split on commas within a column and you want to make those results into an initially unknown number of additional columns. There are several ways to do this kind of thing with DSS. You could look at the split or extract recipes steps.

    You then want to be able to repeat those steps across multiple columns that share a column prefix. (My guess is that you eventually plan to have a lot more than 2 columns or the names of the columns are not known while you are creating the recipe.)

    The challenge you may be having is that the split and extract recipes steps only work on a single column at a time. And you either don't know the names of the incoming columns or there are a large enough number of columns that hand-selecting each column for spiting would be challenging. Is this a correct intuition on my part?

    Do you know how to program in Python? If so, it may be possible to do what you want to do in a python visual recipe step.

    Python Code Recipie Step.jpg

    Although, I've created a few python recipe steps to do some wonderful things. It is still a bit of a mysterious art to me how to actually create the kind of results you are describing. (Although, I believe it might be doable.)

    You can find the documentation for Python Function steps here.

    Here is an extended thread on Python Visual recipe steps.

    Here is a knowledge base article on the topic.

    Here is an example that is similar to your problem set.

    Hope this helps a little bit. Can you clarify for others if the description above is what you are trying to do. Then others may be able to jump in here and give a bit more clarity. Please also let us know how you get on with your challenge. Most of us here enjoying solving puzzles like this.

  • Vestlink
    Vestlink Registered Posts: 4 ✭✭✭
    Options

    Hi.

    Excellent!

    I'll look into your suggestions and do appreciate the challenges you have outlined.

    BR,

    Nicolai

  • Vestlink
    Vestlink Registered Posts: 4 ✭✭✭
    Options

    I got hold on an example of the data in mind.

    Screenshot_6.jpg

    If you look at this example.

    For each column "LID_split_" the column should be split based on the comma - separated values in the cell.

    Sorry for not providing the example earlier...

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    Options

    @Vestlink
    ,

    Hmmmm..... I'm looking at the data you are showing here. It almost looks like, what you are showing is an attempt upstream of this data to flatten a hierarchical maybe XML data source.

    I'm not an expert in this. However, I'm wondering if going back to the original source with something like xpath might get you a cleaner data parsing. Here is some documentation about Dataiku's XML implementation.

    Others please feel free to jump in.

  • Vestlink
    Vestlink Registered Posts: 4 ✭✭✭
    Options

    Hi Tom.

    Tnx for you valuable input.

    I'll look into the links you've send.

    The data source is a DB text dump :o, but would have been easier to handle if it indeed had been xml.

    Tnx again

    Nicolai

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭
    Options

    Hi @Vestlink
    (and @tgb417
    !),

    Possibly a bit too simplistic thinking but .. as you state the original file is a DB-text dump. If that's a simple, delimited values file the column descriptions will be on top right ? why not work on the column-specs provided there instead of the data itself ?

    a rough idea what i have in mind here:

    • fill empty slots with something structurally similar to the rest of the data to get a fully rectangular dataset, then export as csv
    • add columns in the row with columnspecs as needed.
    • import again

    Pretty sure this could be done with some custom coding directly on your DB-dump, but i'm not yet qualified to start sharing code-examples..

Setup Info
    Tags
      Help me…