Classification error : Ended up with only one class in the test set. Cannot proceed

Solved!
taloh90
Level 2
Hello,

I encountered an error when I started training my classification models. The error is: Ended up with only one class in the test set. Cannot proceed

I have an imbalanced dataset and I would like to run a classification for a comparative analysis. I have already split my dataset manually: 70% for training and 30% for testing. In the design of my model, I selected them with the policy: Explicit extracts from two datasets.

If I understand correctly, this is a cross-validation error in the hyperparameter search. Do I have to write custom code to solve my problem, such as Leave One Out? Or is there another solution? I confess I have only a very basic knowledge of Python.

AlexandreL
Dataiker

Hi,

The issue here seems to be that your dataset is so imbalanced that one of the folds contains only observations from one class. Could you please tell me what your positive class proportion is? Before going to Python code, you can try the following options:

- shuffle your dataset to make sure your positive observations are evenly distributed in the dataset (you can use a sort recipe with a pre-computed random column: the formula is just rand(), and you then sort by this column)

- use fewer folds for your cross-validation
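To illustrate why both options help, here is a minimal sketch in plain Python. It is not how DSS splits data internally; the contiguous fold split, the helper names, and the label counts (13 positives, 4 negatives) are assumptions made only for the illustration.

```python
import random
from collections import Counter


def fold_class_counts(labels, n_folds):
    """Cut labels into n_folds contiguous folds and count the classes in each."""
    fold_size, remainder = divmod(len(labels), n_folds)
    counts, start = [], 0
    for i in range(n_folds):
        # The first `remainder` folds get one extra row.
        end = start + fold_size + (1 if i < remainder else 0)
        counts.append(Counter(labels[start:end]))
        start = end
    return counts


def single_class_folds(labels, n_folds):
    """Return the indices of the folds that contain fewer than two classes."""
    return [i for i, c in enumerate(fold_class_counts(labels, n_folds)) if len(c) < 2]


# Hypothetical dataset sorted by class: 13 positives, then 4 negatives.
labels = [1] * 13 + [0] * 4

# Unshuffled, most folds contain a single class.
print("sorted, 5 folds:", single_class_folds(labels, 5))

# Shuffling spreads the rare class across the rows.
shuffled = labels[:]
random.Random(0).shuffle(shuffled)
print("shuffled, 5 folds:", single_class_folds(shuffled, 5))
print("shuffled, 2 folds:", single_class_folds(shuffled, 2))
```

Note that with only 4 negatives, a 5-fold split always leaves at least one fold without any negative, however the rows are shuffled. That is why reducing the number of folds matters in addition to shuffling.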

I hope this solves your issue,

Alex

taloh90
Level 2
Author

Hi,

Thanks for your suggested options. I used fewer folds for my cross-validation, as you suggested, and it works.

In the training part of my dataset, the proportion is 13 for the positive class and 4 for the negative class. In the test part, the proportion is 6 for the positive and 1 for the negative.

So in the design part of my classification model, I selected only the training part of my dataset and used 2 folds. I then deployed the model with the best score to my flow and ran a classification on my test dataset.
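For anyone reading later, the sketch below shows one way to see why 2 folds leave both classes in every fold. It assumes the figures above are row counts (13 positive, 4 negative), and it uses a simple stratified assignment written for this illustration; `stratified_folds` is a hypothetical helper, not a DSS function, and DSS may split differently.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds):
    """Give every row a fold index, dealing out each class round-robin."""
    assignment = []
    seen = defaultdict(int)  # rows already seen per class
    for label in labels:
        assignment.append(seen[label] % n_folds)
        seen[label] += 1
    return assignment


# Assumed training labels: 13 positives and 4 negatives.
labels = [1] * 13 + [0] * 4
n_folds = 2

folds = stratified_folds(labels, n_folds)

# Count the classes that land in each fold.
per_fold = [Counter() for _ in range(n_folds)]
for label, fold in zip(labels, folds):
    per_fold[fold][label] += 1

for i, counts in enumerate(per_fold):
    print(f"fold {i}: positives={counts[1]}, negatives={counts[0]}")
```

Under these assumptions each fold keeps 2 negatives, so neither fold collapses to a single class.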
