Split Recipe

Options
Arwamus0
Arwamus0 Dataiku DSS Core Designer, Registered Posts: 13

Hi all,

How can I edit the Split recipe so I can add the validation set to be splitted, I couldn't find a third condition that can take random percentage after filtering test and train datasets.

and if this can be done using DSS formulas, what is the syntax of the formula to split data for validation?

My test set and train set are splitted based on a certain filter


Operating system used: Windos

Answers

  • LouisDHulst
    LouisDHulst Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Registered, Neuron 2023 Posts: 44 Neuron
    Options

    Hi @Arwamus0
    ,

    This might be easier to two with two different split recipes, one for you train/test filter and one for the random selection.

    If you really want to do this in one recipe you can try first creating the filter for train/test, them using rand() and splitting <= 0.8 for example.

  • Arwamus0
    Arwamus0 Dataiku DSS Core Designer, Registered Posts: 13
    Options

    Thanks @LouisDHulst

    I have seen a tutorial on dataiku website about this and it was showing one split for this so I wanted to try the same, how can I write the formula correctly with rand and split? I think the functions need integer ranges

  • LouisDHulst
    LouisDHulst Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Registered, Neuron 2023 Posts: 44 Neuron
    Options

    Just using rand() <= 0.8 as your formula should be enough to give you a rough 80-20 split on your data. If your dataset is big enough the split will be very close to 80-20.

    As an example I created a CSV with 10k rows whose values range between 1 and 1000. I then split that CSV into 3 datasets (train, val, test) and added two filters to my split recipe:

      1. image.png

      2. image.png

    All of the other rows go to the validation set.

    The row counts for the outputs are:

    • Test: 1,891 (~18% of 10,000)
    • Train: 6,499 (~80% of the remaining 8,109 rows)
    • Val: 1,609 (~20% of the remaining 8,109 rows)

    Is this what you were looking for?

  • Arwamus0
    Arwamus0 Dataiku DSS Core Designer, Registered Posts: 13
    Options

    I want to achieve the same result from the flow attached, when I try rand() it gives empty validation set.

    I need 70% training data

    15% testing data

    15% validation

    the testing and training are splitted based on filter (training = when L1 column is defined, and the opposite for testing).

    I also wanted to ask how to find the score of the scored model (using score recipe)? Do I have to add the evaluation recipe?

Setup Info
    Tags
      Help me…