Select columns in Split and Filter recipe

luong_bayard Registered Posts: 3

Hi,

I often prepare a dataset in the desired format and then need to split it into two or more datasets based on one column that is not needed in the final output.

I then have to add one more recipe for each split dataset just to remove that column and get the desired schema, which is a big waste.

For the Filter recipe, we sometimes just need a smaller version of a dataset, both vertically and horizontally, but the Filter recipe only allows us to filter rows, not select columns.

It would be nice to be able to select the desired columns in the Split and Filter recipes, as we already can in TopN, Join, etc.


Comments

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,712 Neuron

    I totally understand where this comes from, but if you keep adding options to visual recipes you end up with really complicated visual recipes. The trade-off you pay for using visual recipes is that you need to perform most actions in separate recipes. If you want to do everything at once or reduce the number of recipes, you should use a code recipe such as Python.
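
As a rough sketch of the code-recipe route, the split and the column removal can happen in a single pass. The example below is plain Python over dictionary rows and does not use the Dataiku API; the function name `split_and_drop`, the column names, and the sample rows are all hypothetical. In an actual Python recipe the rows would come from the input dataset and each group would be written to its own output dataset.

```python
from collections import defaultdict


def split_and_drop(rows, split_column):
    """Group rows by the value of split_column and drop that column.

    rows is any iterable of dicts; returns {split_value: [row, ...]}.
    """
    groups = defaultdict(list)
    for row in rows:
        # Copy every field except the split column into the output row.
        trimmed = {k: v for k, v in row.items() if k != split_column}
        groups[row[split_column]].append(trimmed)
    return dict(groups)


# Hypothetical sample data: "region" drives the split and is not wanted downstream.
rows = [
    {"id": 1, "amount": 10.0, "region": "EU"},
    {"id": 2, "amount": 7.5, "region": "US"},
    {"id": 3, "amount": 3.0, "region": "EU"},
]

outputs = split_and_drop(rows, "region")
for value, group in outputs.items():
    print(value, group)
```

Note that this sketch keeps every group in memory, so it illustrates the logic rather than a memory-efficient implementation.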

  • luong_bayard
    luong_bayard Registered Posts: 3

    Hi @Turribeach, thanks a lot for your reply.

    I understand your point, which is why, out of the many improvements I had in mind, I chose the two that most affect my team's everyday work.

    We don't have Spark, so using Python means using the memory of our own machine, which is quite costly, and not every user in my company is comfortable with coding (which is why visual recipes were a big selling point when my company switched to DSS).

    The option to remove the split-on column is quite basic, and I think it is really necessary in many cases. The danger of not being able to select columns is that a scenario can fail when the schema of the input table changes. When we are not the only ones using the same input table, changes requested by others shouldn't impact everyone just because we bring along all of the columns every time.
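
The schema concern can be illustrated with a small sketch: when a step keeps only an explicit list of columns, a column added upstream by another team does not leak into the output. This is plain Python, not a Dataiku feature; `select_columns` and the column names are hypothetical.

```python
def select_columns(rows, wanted):
    """Keep only the wanted columns, in order; fail loudly if one is missing."""
    selected = []
    for row in rows:
        missing = [c for c in wanted if c not in row]
        if missing:
            raise KeyError(f"input is missing expected columns: {missing}")
        selected.append({c: row[c] for c in wanted})
    return selected


# Hypothetical input where another team later added "new_flag" upstream.
before = [{"id": 1, "amount": 10.0}]
after = [{"id": 1, "amount": 10.0, "new_flag": True}]

wanted = ["id", "amount"]
# The output schema is the same for both versions of the input.
print(select_columns(before, wanted) == select_columns(after, wanted))
```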
