My goal is to filter out observations that contain at least one missing value, in any of the columns. I noticed that the processor Remove rows where cell is empty, with its option all, comes in handy as it says: The row is considered as matching if at least one of the selected columns is empty.
However, if in the same Prepare recipe I insert other processors as well, the final result changes according to the order in which I put the processors.
I try to insert a small reproducible example below, with an initial toy dataset (txt attached) and some images to explain the weird results. I'm using Dataiku DSS 7.0.
My dataset at first looks like this:
If I create a column `newcol` that is basically the sum between `col2` and `col3` and then use the processor in question, the result after running the Prepare recipe is as expected, i.e. the final dataset will contain only the first two rows as they do not have any missing value.
Instead, if I switch the order of the two processors, i.e. I put first the Remove rows where cell is empty and then the Formula processor to compute `newcol`, I would expect the same final result. In the preview inside the Prepare recipe I indeed obtain the expected output
but when I save and run the Prepare recipe, in the output dataset I obtain zero rows, even if in the preview two rows were shown:
Is this behaviour the expected one? If yes, then I'm missing the rationale of why this would happen. Otherwise it may be a bug in the processor.
Sorry for the long example and thank you for your help.
Yes, I can reproduce your issue on v8.0.0. If you select the original four columns in the first processor, rather than All, the Prepare works OK.
So I would hazard as guess that this is a conflation problem where the new column is being created in advance of the filter executing, but since the formula hasn't run, the new cells are empty and the filter removes all rows.
It's certainly counter-intuitive and the preview really should behave in the same way. I'll raise it as a bug for investigation.
In the meantime, not using All columns in the Filter would be a workaround, if viable in your actual use-case
thank you for the possible explanation and for raising the issue as a bug.
In my use case I could use your suggestion as I have around 30 variables (it is tedious but certainly feasible in a couple of minutes). Indeed it works if I input all of my columns inside the "multiple" option.
As a sidenote, I tried to insert the regex pattern .* inside the "pattern" option (basically it takes every column name), but the obtained result still has zero rows.