
Questions regarding Cross-Test and Training/Model Evaluation

rmios
Level 3

Hello,

I want to train a model using Leave-One-Group-Out cross-validation (see, e.g., the scikit-learn `LeaveOneGroupOut` method). Unfortunately, it seems that this is not possible in a simple way with DSS. To set up the train/test sets, I would choose a column that contains the keys for splitting and select single values. However, the only available way to split is "Random":

Split: Randomly / For more advanced splitting, use a split recipe, and then use "Explicit extracts from two datasets" policy


[Screenshot 2021-01-25 103529.png]

 


What I really want is to leave one of the 8 possible values in the column out for each cross-test. From the comment I understand that the way to achieve this should be to manually create the splits in different datasets. But how does that work? If I create 16 datasets (one per combination of train/test split), I can still only choose one combination for the training with "Explicit extracts from two datasets". How can I achieve a K-fold cross-test with custom split values in DSS?
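For context, the scikit-learn behaviour I am after is `LeaveOneGroupOut`; a minimal sketch with made-up data (8 rows, a "groups" key column with 4 distinct values):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(16).reshape(8, 2)              # 8 rows, 2 toy features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])       # toy binary target
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # the "key" column used for splitting

logo = LeaveOneGroupOut()
print(logo.get_n_splits(groups=groups))      # 4 folds, one per distinct group value
for train_idx, test_idx in logo.split(X, y, groups):
    # each fold holds out exactly one group value as the test set
    print(sorted(set(groups[test_idx])), len(train_idx), len(test_idx))
```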

The second question is: how can I verify the training results (even with a random split)? How can I see the individual metrics? How can I see which rows were used for training and which for testing? In the metrics view there is a message saying: "This model was evaluated using K-fold cross-test. These results are for the first fold only." Where are the detailed results for each fold?

Thank you!

AlexandreV
Dataiker

Hello,
The custom K-fold option is not available for the train/test split (more info here on the train/test settings).
However, you can create a custom cross-validation strategy in the "Hyperparameters" panel and use the code sample for "leave one label out".
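The built-in sample follows the generic scikit-learn cross-validator contract (`get_n_splits` / `split`); a rough sketch of a "leave one value out" splitter along those lines (this is an illustration, not the exact built-in sample, and the way the group column is wired in may differ):

```python
import numpy as np

class LeaveOneValueOut:
    """Sketch of a scikit-learn-style cross-validator: one fold per
    distinct value of a grouping array supplied at construction time."""

    def __init__(self, groups):
        self.groups = np.asarray(groups)

    def get_n_splits(self, X=None, y=None, groups=None):
        # one fold per distinct group value
        return len(np.unique(self.groups))

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(self.groups))
        for value in np.unique(self.groups):
            test_mask = self.groups == value
            # train on everything except the held-out value
            yield indices[~test_mask], indices[test_mask]

cv = LeaveOneValueOut(groups=["a", "a", "b", "b", "c", "c"])
print(cv.get_n_splits())  # 3
```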

 

[Screenshot 2021-01-25 at 16.04.59.png]


If you need to define a custom train/test split, you can export one of the visual Machine Learning models as a Jupyter notebook, and use this as a base. (Action > Export to Jupyter notebook)
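In such a notebook, the leave-one-group-out evaluation itself is only a few lines; a sketch with synthetic data (the column names "f1", "group_col", and "target" are placeholders for your own schema):

```python
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "f1":        [0, 1, 2, 3, 4, 5, 6, 7],
    "group_col": [1, 1, 2, 2, 3, 3, 4, 4],  # the column whose values define the folds
    "target":    [0, 1, 0, 1, 0, 1, 0, 1],
})

# one fit/score per held-out group value
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    df[["f1"]], df["target"],
    groups=df["group_col"],
    cv=LeaveOneGroupOut(),
)
print(scores)  # one accuracy score per fold
```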

 

How can I verify the training results (even with a random split)? How can I see the individual metrics? How can I see which rows were used for training and which for testing?

You can see the rows used for testing under the "Predicted data" tab:

[Screenshot 2021-01-25 at 16.07.45.png]

 

In the metrics view there is a message saying: This model was evaluated using K-fold cross-test. These results are for the first fold only. Where are the detailed results for each fold?

The results shown are for the first fold only; we do not display the results for the other folds.

 

Alex

rmios
Level 3
Author

Thank you for your answer!

@AlexandreV wrote:

The custom K-fold option is not available for train/test split. (more info here on the train/test setting)

Thank you for clarifying this. Is this a planned feature? If not, I would like to propose it as a new feature, because with the alternative (using a Jupyter notebook) one loses most of the benefits of the visual analyses, such as comparing different models and doing a grid search. Quote from the exported Jupyter notebook:

That's it. It's now up to you to tune your preprocessing, your algo, and your analysis !

In my understanding, it is a combination of two train/test split methods: K-fold, and explicit extracts from the dataset with a filter. The latter would be:

[Screenshot 2021-01-25 174804.png]

 
The value for != and == in the conditions would vary over the folds. Let me know whether that makes sense to you.

@AlexandreV wrote:

You can see the rows used for testing under the predicted data tab

Thanks! Is there any way to see which rows were used in the train set and which rows were used in the test set?

EDIT: I found in a different analysis that the predicted data tab contains the results for the test set only (the train set is not included), unless the training did not use a test set, in which case it shows predictions for the full train set. This leads to an issue: when using a random K-fold train/test split with e.g. 8 folds, a large value for "Nb. records" (e.g. 100000) combined with a small dataset (e.g. 5000 rows) seems to produce a split of 100% train and 0% test. This is counter-intuitive, because I would expect to always get 8 folds with my settings (or at least an error if that is not possible). Also, the split settings say:

Column values subset (approx. nb. records)
Randomly selects a subset of values and chooses all rows with these values, in order to obtain approximately N rows. This is useful for selecting a subset of customers, for example. Requires 2 full passes.

So, I would expect to receive 8 folds, and that DSS would first check whether 8 unique values are available. If there are only 8 unique values, I would expect to get 8 folds regardless of the dataset size (i.e., even when it is much smaller than 100000).

 

@AlexandreV wrote:

The results are for the first fold only and we do not display the results for other folds.

Is this a planned feature? I think it would be very interesting to see the details, for example to understand why one fold performed worse or better than another. This becomes even more relevant when the folds are manually defined. It seems very opaque not to have full insight into the training (which folds were used) and the results (metrics for each fold).

Daniel

AlexandreV
Dataiker

Hello Daniel,
Thank you for your interest in this feature. The feature request has been transmitted to the product team.

 


I found in a different analysis that the predicted data tab contains the results for the test set (train set not included) except if the training did not use a test set

I don't see how you can have an empty test set: when doing an 8-fold, you train 8 times on 7/8ths of the dataset, and the test data is the remaining 1/8th each time.

 

it seems that with a large number for "Nb. records" (e.g. 100000) and a small dataset (e.g. 5000 rows), the split will be 100% train and 0% test

The sampling takes a subset of your dataset (so in the next steps you will use at most 100000 rows) and is not related to the train/test splitting that occurs later. Here the sampling keeps all 5000 rows, and the folding then takes 7/8ths of those 5000 rows for train and the remaining 1/8th for test.
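To make the arithmetic concrete with the numbers from your example:

```python
nb_records_cap = 100000                       # the "Nb. records" sampling setting
dataset_rows = 5000                           # actual dataset size
sampled = min(dataset_rows, nb_records_cap)   # sampling happens first, caps the rows

n_folds = 8                                   # K-fold splitting happens on the sample
test_rows = sampled // n_folds                # 1/8th held out per fold
train_rows = sampled - test_rows              # remaining 7/8ths used for training
print(train_rows, test_rows)                  # 4375 train / 625 test per fold
```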

 

Alex

rmios
Level 3
Author

Hello Alex,

Thank you for your answer. I think I had a misunderstanding due to the training showing the complete dataset as "train set":

[Screenshot 2021-01-26 152034.png]

 
I understand now that this shows the whole set size rather than the 7/8 value I expected from the K-fold settings.

Thanks for submitting the feature request. To finalize my thought process, let me once more state what I would need:

  1. Ability to choose custom values for folds (i.e. value-filtered folds)
  2. Ability to do the same in a Deep Learning (Keras) analysis (K-fold cross-test with custom fold values) - currently Keras in DSS does not have K-fold at all (not even random)
  3. Ability to see which folds were created (this would mitigate the issue that I don't know which values were chosen for the random folds; currently I am blindly using some random values from the column as folds)
  4. It would be very beneficial to see all the visualized analyses for each fold (not only the first fold)

Thanks again!

rmios
Level 3
Author

Did you delete my reply from yesterday evening? 😅

AlexandreV
Dataiker

No @rmios, it disappeared for me as well.
I will ask about it.

 

Alex

rmios
Level 3
Author

Thanks! Let me know if it is lost (maybe a DB crash); then I will write up a new reply.
