How to train a classification model on an imbalanced dataset

oksanab
Level 1

I have an imbalanced dataset: most instances are in Class-1, and only 1% are labeled Class-2. When I run my classification models (decision tree, logistic regression) using ROC AUC as the evaluation metric, I get an excellent score but very low precision. This happens because of the imbalanced training set. How do you handle this in the Dataiku environment? I am new to Dataiku and still learning my way around it.

Is there an option for a SMOTE supervised filter?

1 Solution
Alex_Combessie
Dataiker Alumni

Hi, 

Note that when you train a prediction model with DSS visual ML, DSS will, for most algorithms, use scikit-learn's rebalancing capabilities by setting appropriate instance weights.

In addition to this, you can use our built-in class rebalancing sampling capability: https://doc.dataiku.com/dss/latest/explore/sampling.html#class-rebalancing-approximate-number-of-rec...

However, note that such rebalancing means your performance metrics may no longer be representative of the real-world distribution.
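To make this concrete, here is a small arithmetic sketch. The true-positive and false-positive rates are made-up numbers for a hypothetical classifier, not results from DSS. ROC AUC is built only from these two rates, so it does not change with the class proportions, whereas precision depends heavily on how rare the positive class is. That is also why an excellent ROC AUC can coexist with very low precision on a 1% minority class.

```python
def precision(tpr, fpr, positive_rate):
    """Precision implied by a fixed TPR/FPR at a given share of positives."""
    true_pos = tpr * positive_rate
    false_pos = fpr * (1.0 - positive_rate)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: catches 90% of positives, wrongly flags 10% of negatives.
balanced = precision(0.9, 0.1, positive_rate=0.5)     # rebalanced 50/50 sample
real_world = precision(0.9, 0.1, positive_rate=0.01)  # 1% minority class

print(f"precision on balanced data: {balanced:.3f}")    # 0.900
print(f"precision at 1% prevalence: {real_world:.3f}")  # 0.083
```

The same classifier looks far better on the rebalanced sample, so it is safer to read precision from a test set that keeps the original class proportions.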

Alternatively, you can implement SMOTE as a preliminary Python or R recipe.

Hope it helps,

Alex

11 Replies
oksanab
Level 1
Author

Alex, thank you

cwentz
Level 3

Hello @Alex_Combessie 

I have a question regarding this post. How do I know whether the rebalancing done by scikit-learn is working?

My target is binary and class 0 accounts for less than 8% of the rows, so I would like to confirm whether or not the rebalancing is being applied.

Thank you!!

tgb417

@oksanab,

Good question.

@Alex_Combessie ,

I'm trying to understand a little more about what lies "under the covers" in DSS when it does this class rebalancing.

Does DSS use the class_weight parameter in scikit-learn to make this work?

If so, when I export the model as a Jupyter notebook, I don't see any reference to class_weight in the resulting notebook. Is this to be expected? Or does scikit-learn do this by default, so that it does not need to be referenced?

Thanks for any further insights you can share.

--Tom
Samuel_R_
Dataiker

Under the hood, DSS does indeed use the `class_weight` parameter (whenever available). Class weights are computed on the whole train set.

Regarding the Jupyter export, we are indeed missing the `class_weight` parameter and the required computation. This will be added in a future release.

In the meantime, you can compute the class weights as:

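import numpy as np

# train_y: array of training labels
unique_values = np.unique(train_y)
n_classes = unique_values.size
# Weight each class inversely to its frequency in the train set
class_weight_dict = {
    y: float(len(train_y)) / (n_classes * np.sum(train_y == y))
    for y in unique_values
}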

And set

clf.class_weight = class_weight_dict

before running the hyperparameter search and the final training.
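As a worked illustration of that formula, here is a minimal pure-Python sketch on made-up labels in which class 0 is 8% of the rows. The helper name `balanced_class_weights` and the data are hypothetical. The formula is the same one scikit-learn documents for `class_weight="balanced"`, namely n_samples / (n_classes * class count). One way to sanity-check the weights is to confirm that every class ends up with the same total weight. This checks the weights themselves, not what DSS does internally.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up labels: class 0 is 8% of 100 rows.
train_y = [0] * 8 + [1] * 92
weights = balanced_class_weights(train_y)

# Per-row weights, and the total weight each class ends up carrying.
row_weights = [weights[y] for y in train_y]
totals = {
    cls: sum(w for w, y in zip(row_weights, train_y) if y == cls)
    for cls in weights
}

print(weights)  # the rare class gets the larger weight (6.25 for class 0)
print(totals)   # both totals come out at roughly 50, i.e. n_samples / n_classes
```

If the two totals match, the minority class contributes as much to the training loss as the majority class, which is the intended effect of the weighting.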

tgb417

@Samuel_R_ 

I'm wondering whether this should be added to the documentation. I wasn't able to find any discussion of this point.

Up until now, I've felt that I had to handle heavily skewed target features by using the class rebalance option (under Model Design -> Train / Test set).

[Image: Imballance Class rebalance.jpg]

--Tom
nats12
Level 1

How can I implement SMOTE? I have been trying, but it seems impossible.
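For orientation, here is a simplified, dependency-free sketch of the idea behind SMOTE: each synthetic row is placed at a random point on the segment between a minority-class row and one of its nearest minority-class neighbours. The function name `smote_sketch` and the sample rows are hypothetical, and this is not the exact published algorithm. It picks base rows at random and handles numeric features only. In a Python recipe the usual route would be the `SMOTE` class from the imbalanced-learn package (`imblearn.over_sampling.SMOTE`, used via `fit_resample`), assuming that package is installed in the code environment.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base row (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up minority-class rows with two numeric features.
minority_rows = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_rows = smote_sketch(minority_rows, n_new=6)

for row in new_rows:
    print(row)
```

Oversampling like this is normally applied to the train split only, so that the test set keeps the original class proportions.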

cwentz
Level 3

@Samuel_R_ I can't find any documentation on the 'Class weights' parameter or where it is available. Which algorithms use class weights? I don't want to use the rebalance option if balancing is already happening within the train set.

Samuel_R_
Dataiker

Hi @cwentz @tgb417, the documentation for Dataiku 9.0 has been updated accordingly. Thanks for suggesting it!

tgb417

@Samuel_R_ 

Thanks for the update on this. Can you share a link to where this is now covered?

--Tom