Question about the behaviour of the "Split and Fold" prepare processor and NULL handling
Let's say I have a table containing the following data:
ID | Fruit | Other Random Data |
1 | Apple, Pear, Cherry | aksdhkajshda |
2 | NULL | kasdhjkasjhkas |
3 | Watermelon | ajshdgjashgdjashg |
If I run the Split and Fold prepare step on the Fruit column, I get the following result:
ID | Fruit | Other Random Data |
1 | Apple | aksdhkajshda |
1 | Pear | aksdhkajshda |
1 | Cherry | aksdhkajshda |
3 | Watermelon | ajshdgjashgdjashg |
Regardless of the setting of the 'keep empty chunks' checkbox, I am not able to retain the row with the null (empty) value. To keep it, I have to run another processor first to fill the empty cells, run the split and fold, and then empty those cells out again afterwards, which is quite cumbersome. Is the inability to preserve null/empty rows by design, or is the 'keep empty chunks' checkbox not performing as expected?
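To make the workaround concrete, here is a minimal plain-Python sketch of the fill, split and fold, then un-fill sequence. It is not Dataiku code: `split_and_fold`, `split_and_fold_keep_nulls` and the `__EMPTY__` placeholder are names made up for illustration, and the first function only mimics the behaviour I observe (trimming whitespace around each chunk is also an assumption).

```python
PLACEHOLDER = "__EMPTY__"  # hypothetical sentinel; must not occur in the real data


def split_and_fold(rows, column, separator=","):
    """Mimic the observed behaviour: one output row per chunk,
    and rows whose value is null/empty are dropped."""
    out = []
    for row in rows:
        value = row[column]
        if value is None or value == "":
            continue  # this is where the NULL row disappears
        for chunk in value.split(separator):
            out.append({**row, column: chunk.strip()})
    return out


def split_and_fold_keep_nulls(rows, column, separator=","):
    """Bookend the split with a fill step and an un-fill step."""
    # Step 1: fill empty cells with the placeholder.
    filled = [
        {**r, column: PLACEHOLDER if r[column] in (None, "") else r[column]}
        for r in rows
    ]
    # Step 2: split and fold as usual.
    folded = split_and_fold(filled, column, separator)
    # Step 3: turn the placeholder back into an empty cell.
    return [
        {**r, column: None if r[column] == PLACEHOLDER else r[column]}
        for r in folded
    ]


rows = [
    {"ID": 1, "Fruit": "Apple, Pear, Cherry", "Other": "aksdhkajshda"},
    {"ID": 2, "Fruit": None, "Other": "kasdhjkasjhkas"},
    {"ID": 3, "Fruit": "Watermelon", "Other": "ajshdgjashgdjashg"},
]

result = split_and_fold_keep_nulls(rows, "Fruit")
for r in result:
    print(r["ID"], r["Fruit"])
```

The weak point of this approach is the placeholder itself: if a genuine value ever equals it, that value would be blanked out in the last step.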
The documentation for this processor does not mention this checkbox or what it does (https://doc.dataiku.com/dss/latest/preparation/processors/split-fold.html?highlight=split%20fold).
Prior to Dataiku, I would use a left join to a cross-applied string split of the column in SQL, where the left join preserved the empty rows.
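For comparison, the left-join semantics I am describing can be sketched in plain Python as below. This is only an illustration of the idea, not the SQL I used and not a Dataiku API; the function name `split_left_join` is hypothetical, and treating an empty string the same as NULL is an assumption.

```python
def split_left_join(rows, column, separator=","):
    """Emit one row per chunk, but keep a row with a null value when
    there is nothing to split (like a LEFT JOIN onto the split result)."""
    out = []
    for row in rows:
        value = row[column]
        # NULL or empty string yields no chunks (assumption: both count as empty).
        chunks = [c.strip() for c in value.split(separator)] if value else []
        if not chunks:
            # No match on the "right side": keep the original row with a null.
            out.append({**row, column: None})
        for chunk in chunks:
            out.append({**row, column: chunk})
    return out


rows = [
    {"ID": 1, "Fruit": "Apple, Pear, Cherry"},
    {"ID": 2, "Fruit": None},
    {"ID": 3, "Fruit": "Watermelon"},
]

folded = split_left_join(rows, "Fruit")
for r in folded:
    print(r["ID"], r["Fruit"])
```

Unlike the fill and un-fill approach, this needs no placeholder value, which is why I would prefer the processor to handle it natively.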
Community, if you use this processor yourself, may I ask how you are dealing with NULL values in the cells that you split and fold? I am in search of a best practice for this that doesn't involve having to bookend each split and fold operation with two 'find and replace' processors every time.
Dataiku folks, is there any possibility of adding NULL handling to this processor? Additionally, would it be possible to detail the exact behaviour of the 'keep empty chunks' checkbox in the processor documentation?
Best regards,
Neil
Best Answer
Emma Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 52 Dataiker
Hey @Neil_B,
That is the expected behavior of "split and fold": it drops rows where the value to be split is empty.
Your workaround is our current recommended process, but I've put a feature request in on your behalf to be able to include NULL/empty values in the future.
The "keep empty chunks" option is detailed in the "split column" processor documentation, linked here for you: https://doc.dataiku.com/dss/latest/preparation/processors/split.html
Hope that helps,
Emma
Answers
Neil_B Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 9 ✭✭✭✭
Hi @Emma,
Thanks very much for the detailed and informative response.
I definitely appreciate this going in as a feature request. It is actually quite a common scenario for me to want to retain the NULLs when doing this kind of data prep in my work, and this would be a very welcome enhancement.
With regard to the details of the 'split and fold' options being contained in the documentation for the separate 'split' processor, I can truthfully say that I have needed 'split and fold' countless times in my day to day, and have never had cause to use 'split'. As such, despite what may seem like an obvious thematic connection between the two processors, I would never have thought to look at the 'split' documentation to find information on the checkboxes for 'split and fold'. If I may, I would like to suggest that the 'split and fold' documentation be enhanced to contain these details or, if that would be redundant, that it point readers to the 'split' documentation for additional details on checkbox functionality.
Best regards,
Neil