Question about the behaviour of the "Split and Fold" prepare processor and NULL handling

Solved!
Neil_B
Level 3

Let's say I have a table containing the following data:

ID | Fruit               | Other Random Data
1  | Apple, Pear, Cherry | aksdhkajshda
2  | NULL                | kasdhjkasjhkas
3  | Watermelon          | ajshdgjashgdjashg

If I run the Split and Fold prepare step on the Fruit column, I get the following result:

ID | Fruit      | Other Random Data
1  | Apple      | aksdhkajshda
1  | Pear       | aksdhkajshda
1  | Cherry     | aksdhkajshda
3  | Watermelon | ajshdgjashgdjashg

Regardless of the setting of the 'keep empty chunks' checkbox, I am not able to retain the row with the null (empty) value. To keep it, I have to run other processors to fill the empty cells before the split and fold, and then empty those cells out again once the split and fold is complete, which is quite cumbersome. Is the inability to preserve null/empty rows by design, or is the 'keep empty chunks' checkbox not performing as expected?

The documentation for this processor does not mention this checkbox or what it actually does (https://doc.dataiku.com/dss/latest/preparation/processors/split-fold.html?highlight=split%20fold).

Prior to Dataiku, I would use a left join to a cross-applied string split of the column in SQL, where the left join preserved the empty rows.
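
For anyone not familiar with that SQL pattern, the row-preserving behaviour I am after can be sketched in plain Python. This is only an illustration of the left-join idea, not how Dataiku implements anything; the function name `split_and_fold_keep_nulls`, the dict-based rows and the shortened `Other` column name are assumptions made for the example.

```python
from typing import Dict, Iterator, List, Optional

Row = Dict[str, Optional[str]]


def split_and_fold_keep_nulls(rows: List[Row], column: str, sep: str = ",") -> Iterator[Row]:
    """Yield one row per chunk of `column`.

    Rows whose value is None or empty are passed through once with None,
    like the unmatched side of a LEFT JOIN (hypothetical helper).
    """
    for row in rows:
        value = row[column]
        chunks = [c.strip() for c in value.split(sep)] if value else []
        chunks = [c for c in chunks if c]  # ignore empty chunks such as "a,,b"
        if not chunks:
            # Nothing to fold: keep the row once instead of dropping it.
            yield {**row, column: None}
            continue
        for chunk in chunks:
            yield {**row, column: chunk}


rows = [
    {"ID": "1", "Fruit": "Apple, Pear, Cherry", "Other": "aksdhkajshda"},
    {"ID": "2", "Fruit": None, "Other": "kasdhjkasjhkas"},
    {"ID": "3", "Fruit": "Watermelon", "Other": "ajshdgjashgdjashg"},
]

folded = list(split_and_fold_keep_nulls(rows, "Fruit"))
for r in folded:
    print(r["ID"], r["Fruit"], r["Other"])
```

With the sample table above, this yields five rows: three for ID 1, one for ID 2 with an empty Fruit, and one for ID 3.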

Community, if you use this processor yourself, may I ask how you are dealing with NULL values in the cells that you split and fold? I am in search of a best practice that doesn't involve bookending every split and fold operation with two 'find and replace' processors.

Dataiku folks, is there any possibility of adding NULL handling to this processor?  Additionally, would it be possible to detail the exact behaviour of the 'keep empty chunks' checkbox in the processor documentation?

Best regards,

Neil

Emma
Dataiker

Hey @Neil_B,

That is the expected behavior of "split and fold": it drops rows where the value to be split is empty.

Your workaround is our current recommended process, but I've put a feature request in on your behalf to be able to include NULL/empty values in the future. 
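
A rough sketch of that workaround, for readers who want to see the three steps end to end. The `split_and_fold` function below is a simplified stand-in that only mimics the behaviour described in this thread (rows with an empty value are dropped); it is not Dataiku's code, and the placeholder string `__EMPTY__` and the helper names are arbitrary choices for the example.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]
PLACEHOLDER = "__EMPTY__"  # hypothetical marker; must not occur in real values


def split_and_fold(rows: List[Row], column: str, sep: str = ",") -> List[Row]:
    """Stand-in for the processor as described in the thread:
    rows whose value is None or empty produce no output rows."""
    out = []
    for row in rows:
        value = row[column]
        if not value:
            continue  # the empty row is dropped
        for chunk in value.split(sep):
            out.append({**row, column: chunk.strip()})
    return out


def fill_empty(rows: List[Row], column: str, filler: str) -> List[Row]:
    """Step 1: replace None/empty cells with the placeholder."""
    return [{**r, column: r[column] or filler} for r in rows]


def clear_placeholder(rows: List[Row], column: str, filler: str) -> List[Row]:
    """Step 3: turn the placeholder back into an empty (None) cell."""
    return [{**r, column: None if r[column] == filler else r[column]} for r in rows]


rows = [
    {"ID": "1", "Fruit": "Apple, Pear, Cherry"},
    {"ID": "2", "Fruit": None},
    {"ID": "3", "Fruit": "Watermelon"},
]

# Without the workaround, row 2 disappears.
dropped = split_and_fold(rows, "Fruit")

# Step 1 (fill), step 2 (split and fold), step 3 (clear the placeholder).
filled = fill_empty(rows, "Fruit", PLACEHOLDER)
kept = clear_placeholder(split_and_fold(filled, "Fruit"), "Fruit", PLACEHOLDER)

print([r["ID"] for r in dropped])  # ['1', '1', '1', '3']
print([r["ID"] for r in kept])     # ['1', '1', '1', '2', '3']
```

In a prepare recipe, steps 1 and 3 correspond to the extra processors placed before and after the split and fold step, as described in the question.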

The "keep empty chunks" option is detailed in the documentation for the "split column" processor, linking here for you: https://doc.dataiku.com/dss/latest/preparation/processors/split.html

Hope that helps, 

Emma 

Neil_B
Level 3
Author

Hi @Emma,

Thanks very much for the detailed and informative response.

I definitely appreciate this going in as a feature request.  It is actually quite a common scenario for me to want to retain the NULLs when doing this kind of data prep in my work, and this would be a very welcome enhancement.

With regard to the details of the 'split and fold' options being contained in the documentation for the separate 'split' processor: I have needed 'split and fold' countless times in my day-to-day work and have never had cause to use 'split'. So, despite the thematic connection between the two processors, I would never have thought to look at the 'split' documentation for information on the checkboxes in 'split and fold'. If I may, I would like to suggest that the 'split and fold' documentation be enhanced to contain these details or, if that would be redundant, that it point readers to the 'split' documentation for additional details on checkbox functionality.

Best regards,

Neil

Emma
Dataiker

FYI @Neil_B, we took your feedback, and the documentation will be updated for Split and Fold when we release Dataiku 12! 

shahas71
Level 2

Hi @Neil_B & @Emma, I'm looking for the same solution. The null cell is dropped after splitting. What is the recommended step?

Thanks in advance.
