Split Recipe

Arwamus0
Level 2

Hi all,

 

How can I edit the Split recipe so that it also produces a validation set? I couldn't find a third condition that takes a random percentage of the rows left after filtering out the test and train datasets.

If this can be done with DSS formulas, what is the syntax of the formula to split off the validation data?

My test and train sets are split based on a certain filter.


Operating system used: Windows

4 Replies
LouisDHulst

Hi @Arwamus0 , 

This might be easier to do with two different Split recipes, one for your train/test filter and one for the random selection.

If you really want to do this in one recipe, you can try first creating the filter for train/test, then using rand() and splitting on rand() <= 0.8, for example.

Arwamus0
Level 2
Author

Thanks @LouisDHulst 

I have seen a tutorial on the Dataiku website about this, and it showed a single split, so I wanted to try the same. How can I write the formula correctly with rand() and split? I think the functions need integer ranges.

LouisDHulst

Just using rand() <= 0.8 as your formula should be enough to give you a rough 80-20 split on your data. If your dataset is big enough the split will be very close to 80-20.
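To illustrate why a threshold on a uniform random number gives a rough 80-20 split, here is a minimal Python sketch. It only simulates the idea outside DSS: `rough_split` is a hypothetical helper, and Python's `random.random()` stands in for the formula's rand(), assuming that rand() returns a uniform value between 0 and 1.

```python
import random

def rough_split(n_rows, threshold=0.8, seed=0):
    """Count the rows whose uniform draw is <= threshold, and the rest."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    kept = sum(1 for _ in range(n_rows) if rng.random() <= threshold)
    return kept, n_rows - kept

# The observed share approaches 0.8 as the number of rows grows.
for n in (100, 10_000, 1_000_000):
    kept, rest = rough_split(n)
    print(n, kept / n)
```

The threshold is a fraction, not an integer range, so no integer bounds are needed under that assumption.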

 

As an example I created a CSV with 10k rows whose values range between 1 and 1000. I then split that CSV into 3 datasets (train, val, test) and added two filters to my split recipe:

    1. [image.png: first filter]

    2. [image.png: second filter]


All of the other rows go to the validation set.

The row counts for the outputs are:

  • Test: 1,891 (~19% of 10,000)
  • Train: 6,499 (~80% of the remaining 8,109 rows)
  • Val: 1,609 (~20% of the remaining 8,109 rows)
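The cascade behind these counts can be sketched in Python as follows. This is only an approximation of the recipe's behaviour: the actual filter conditions were in the images, so `test_cutoff=190` is a made-up stand-in for the first filter, and `cascade_split` is a hypothetical helper. The sketch assumes that filters are applied in order and that rows matching no filter fall through to the last output.

```python
import random

def cascade_split(values, test_cutoff=190, train_share=0.8, seed=0):
    """Send each row to the first output whose condition it matches."""
    rng = random.Random(seed)
    out = {"test": [], "train": [], "val": []}
    for v in values:
        if v <= test_cutoff:               # filter 1 (hypothetical condition)
            out["test"].append(v)
        elif rng.random() <= train_share:  # filter 2: random draw on the remaining rows
            out["train"].append(v)
        else:                              # everything else goes to validation
            out["val"].append(v)
    return out

# 10k rows with values between 1 and 1000, as in the example dataset.
data_rng = random.Random(1)
values = [data_rng.randint(1, 1000) for _ in range(10_000)]
parts = cascade_split(values)
remaining = len(parts["train"]) + len(parts["val"])
print({name: len(rows) for name, rows in parts.items()})
print("train share of remaining rows:", len(parts["train"]) / remaining)
```

Because the random condition is only evaluated on rows that did not match the first filter, the 80-20 ratio applies to the remaining rows, not to the full dataset.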

Is this what you were looking for?

Arwamus0
Level 2
Author

I want to achieve the same result as in the attached flow. When I try rand(), it gives an empty validation set.

I need 70% training data

15% testing data

15% validation

The testing and training sets are split based on a filter (training = rows where the L1 column is defined, testing = the opposite).

I also wanted to ask how to find the score of the scored model (using the Score recipe). Do I have to add an evaluation recipe?
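If the three sets were drawn purely at random, the 70/15/15 target above could be reached with two thresholds in sequence. The Python sketch below simulates this outside DSS; `three_way` is a hypothetical helper, and it assumes that conditions are evaluated in order and that each random call is an independent uniform draw. Under those assumptions the second threshold must be 0.5 (half of the remaining 30%), not 0.15.

```python
import random

def three_way(n_rows, seed=0):
    """Simulate two sequential random conditions for a 70/15/15 split."""
    rng = random.Random(seed)
    counts = {"train": 0, "test": 0, "val": 0}
    for _ in range(n_rows):
        if rng.random() <= 0.70:    # condition 1: 70% of all rows
            counts["train"] += 1
        elif rng.random() <= 0.50:  # condition 2: fresh draw, half of the remaining 30%
            counts["test"] += 1
        else:                       # remaining rows: the other 15%
            counts["val"] += 1
    return counts

n = 100_000
for name, count in three_way(n).items():
    print(name, count / n)
```

Reusing 0.15 as the second threshold would send only about 0.30 × 0.15 = 4.5% of all rows to the second output, since the draw is applied to the rows left over from the first condition.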
