Split Recipe
Hi all,
How can I edit the Split recipe so I can add the validation set to be splitted, I couldn't find a third condition that can take random percentage after filtering test and train datasets.
and if this can be done using DSS formulas, what is the syntax of the formula to split data for validation?
My test set and train set are splitted based on a certain filter
Operating system used: Windos
Answers

LouisDHulst Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Registered, Neuron 2023 Posts: 53 Neuron
Hi @Arwamus0
,This might be easier to two with two different split recipes, one for you train/test filter and one for the random selection.
If you really want to do this in one recipe you can try first creating the filter for train/test, them using rand() and splitting <= 0.8 for example.

Thanks @LouisDHulst
I have seen a tutorial on dataiku website about this and it was showing one split for this so I wanted to try the same, how can I write the formula correctly with rand and split? I think the functions need integer ranges

LouisDHulst Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Registered, Neuron 2023 Posts: 53 Neuron
Just using rand() <= 0.8 as your formula should be enough to give you a rough 8020 split on your data. If your dataset is big enough the split will be very close to 8020.
As an example I created a CSV with 10k rows whose values range between 1 and 1000. I then split that CSV into 3 datasets (train, val, test) and added two filters to my split recipe:
All of the other rows go to the validation set.
The row counts for the outputs are:
 Test: 1,891 (~18% of 10,000)
 Train: 6,499 (~80% of the remaining 8,109 rows)
 Val: 1,609 (~20% of the remaining 8,109 rows)
Is this what you were looking for?

I want to achieve the same result from the flow attached, when I try rand() it gives empty validation set.
I need 70% training data
15% testing data
15% validation
the testing and training are splitted based on filter (training = when L1 column is defined, and the opposite for testing).
I also wanted to ask how to find the score of the scored model (using score recipe)? Do I have to add the evaluation recipe?