Question about the behaviour of the "Split and Fold" prepare processor and NULL handling

Neil_B
Neil_B Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 9 ✭✭✭✭

Let's say I have a table containing the following data:

ID | Fruit               | Other Random Data
1  | Apple, Pear, Cherry | aksdhkajshda
2  | NULL                | kasdhjkasjhkas
3  | Watermelon          | ajshdgjashgdjashg

If I run the Split and Fold prepare step on the Fruit column, I get the following result:

ID | Fruit      | Other Random Data
1  | Apple      | aksdhkajshda
1  | Pear       | aksdhkajshda
1  | Cherry     | aksdhkajshda
3  | Watermelon | ajshdgjashgdjashg

Regardless of the setting of the 'keep empty chunks' checkbox, I am not able to retain the row with the null (empty) value. I find that in order to do so I need to pre-emptively run other processors to fill empty cells before running a split and fold, and then empty those cells out again after the split and fold is complete, which is quite cumbersome. Is the lack of the ability to preserve the null/empty rows by design, or is the 'keep empty chunks' checkbox not performing as expected?
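For anyone who wants to reason about the fill / split / restore workaround outside the visual recipe, here is a minimal plain-Python sketch of it. It is not Dataiku's implementation: `split_and_fold`, `split_and_fold_keep_empty` and the `__EMPTY__` placeholder are hypothetical names chosen for illustration, and the dropping of empty values simply mirrors the behaviour described above. Trimming whitespace around each chunk is also an assumption of the sketch.

```python
PLACEHOLDER = "__EMPTY__"  # hypothetical sentinel; must never occur in real data


def split_and_fold(rows, column, separator=","):
    """Emulate the behaviour described in this thread: one output row per
    chunk, and rows whose value is empty (None or "") are dropped."""
    folded = []
    for row in rows:
        value = row[column]
        if value is None or value == "":
            continue  # this is where the NULL row is lost
        for chunk in value.split(separator):
            folded.append({**row, column: chunk.strip()})
    return folded


def split_and_fold_keep_empty(rows, column, separator=","):
    """Fill empty cells, split and fold, then empty the placeholder cells again."""
    filled = [
        {**row, column: PLACEHOLDER if row[column] in (None, "") else row[column]}
        for row in rows
    ]
    folded = split_and_fold(filled, column, separator)
    return [
        {**row, column: None if row[column] == PLACEHOLDER else row[column]}
        for row in folded
    ]


rows = [
    {"ID": 1, "Fruit": "Apple, Pear, Cherry", "Other": "aksdhkajshda"},
    {"ID": 2, "Fruit": None, "Other": "kasdhjkasjhkas"},
    {"ID": 3, "Fruit": "Watermelon", "Other": "ajshdgjashgdjashg"},
]

dropped = split_and_fold(rows, "Fruit")           # row with ID 2 disappears
kept = split_and_fold_keep_empty(rows, "Fruit")   # row with ID 2 survives, Fruit is None
```

The placeholder has to be a value that can never appear in the real column, otherwise genuine data would be blanked out in the final step.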

The documentation for this processor does not contain any reference to this checkbox and what it actually does (https://doc.dataiku.com/dss/latest/preparation/processors/split-fold.html?highlight=split%20fold)

Prior to Dataiku, I would use a left join to a cross-applied string split of the column in SQL, where the left join preserved the empty rows.
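The same left-join idea can be sketched in plain Python, for readers who prefer a code step. This is only an illustration under assumptions: `explode_chunks` and `left_join_split` are hypothetical helper names, and `explode_chunks` stands in for the cross-applied string split, which produces no rows for an empty value.

```python
def explode_chunks(rows, key, column, separator=","):
    """Stand-in for the cross-applied string split: (key, chunk) pairs,
    with nothing emitted for rows whose value is empty."""
    return [
        (row[key], chunk.strip())
        for row in rows
        if row[column]  # skips None and ""
        for chunk in row[column].split(separator)
    ]


def left_join_split(rows, key, column, separator=","):
    """Left join the original rows to their chunks, so unmatched rows are kept."""
    chunks_by_key = {}
    for k, chunk in explode_chunks(rows, key, column, separator):
        chunks_by_key.setdefault(k, []).append(chunk)

    joined = []
    for row in rows:
        # No match on the right side yields a single row with a null chunk.
        for chunk in chunks_by_key.get(row[key], [None]):
            joined.append({**row, column: chunk})
    return joined


table = [
    {"ID": 1, "Fruit": "Apple, Pear, Cherry", "Other": "aksdhkajshda"},
    {"ID": 2, "Fruit": None, "Other": "kasdhjkasjhkas"},
    {"ID": 3, "Fruit": "Watermelon", "Other": "ajshdgjashgdjashg"},
]

joined_rows = left_join_split(table, "ID", "Fruit")
```

The sketch assumes the join key (`ID` here) is unique per input row; with duplicate keys, each duplicate row would receive the chunks of all rows sharing that key.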

Community, if you use this processor yourself, may I ask how you are dealing with NULL values in the cells that you split and fold? I am in search of a best practice for this that doesn't involve bookending each split and fold operation with two 'find and replace' processors.

Dataiku folks, is there any possibility of adding NULL handling to this processor? Additionally, would it be possible to detail the exact behaviour of the 'keep empty chunks' checkbox in the processor documentation?

Best regards,

Neil

Best Answer

  • Emma
    Emma Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 52 Dataiker
    Answer ✓

    Hey @Neil_B,

    That is the expected behavior of "split and fold": it drops rows where the value to be split is empty.

    Your workaround is our current recommended process, but I've put a feature request in on your behalf to be able to include NULL/empty values in the future.

    The "keep empty chunks" option is detailed in the "split column" recipe documentation, linking here for you: https://doc.dataiku.com/dss/latest/preparation/processors/split.html

    Hope that helps,

    Emma

Answers

  • Neil_B
    Neil_B Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 9 ✭✭✭✭

    Hi @Emma,

    Thanks very much for the detailed and informative response.

    I definitely appreciate this going in as a feature request. It is actually quite a common scenario for me to want to retain the NULLs when doing this kind of data prep in my work, and this would be a very welcome enhancement.

    With regard to the details of the 'split and fold' options being contained in the documentation for the separate 'split' processor, I can truthfully say that I have had need for 'split and fold' countless times in my day to day, and have never had cause to use 'split'. As such, despite what may seem like an obvious thematic connection between the two processors, I would have never thought to look at the 'split' documentation to find information on the functionality of check-boxes for 'split and fold'. If I may, I would like to make a suggestion that 'split and fold' have its documentation enhanced to contain these details, or alternately, if that is a redundancy, that the 'split and fold' documentation might encourage readers to look to the 'split' documentation for additional details on checkbox functionality.

    Best regards,

    Neil

  • Emma
    Emma Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 52 Dataiker

    FYI @Neil_B, we took your feedback, and the documentation will be updated for Split and Fold when we release Dataiku 12!

  • shahas71
    shahas71 Dataiku DSS Core Designer, Registered Posts: 12

    Hi @Neil_B & @Emma, I'm looking for the same solution. The null cell is dropped after splitting. What is the recommended step?

    Thanks in advance.
