I have an imbalanced dataset (most instances are in Class-1, and only 1% of instances are labeled Class-2). When I run my classification models (decision tree, logistic regression) using ROC AUC as the accuracy measure, I get an excellent score but very low precision. This happens because of the imbalanced training set. How do you handle this in the Dataiku environment? I am new to Dataiku and still learning my way around it.
Is there an option for a SMOTE supervised filter?
Hi,
Note that when doing a prediction with DSS visual ML, in most algorithms, DSS will use scikit-learn's capabilities for rebalancing by setting appropriate instance weights.
In addition to this, you can use our built-in class rebalancing sampling capability: https://doc.dataiku.com/dss/latest/explore/sampling.html#class-rebalancing-approximate-number-of-rec...
However, note that such rebalancing means your performance metrics may not be representative anymore of a real-world distribution.
Alternatively, you can implement SMOTE as a preliminary Python or R recipe.
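For example, here is a minimal sketch of what such a Python recipe could look like, using the imbalanced-learn package. The dataset and column names ("training_data", "training_data_smote", "target") are placeholders, and note that SMOTE requires all features to be numeric:

import dataiku
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE

# Read the training data (dataset name is a placeholder)
df = dataiku.Dataset("training_data").get_dataframe()

# Split features from the binary target ("target" is a placeholder column name)
X = df.drop(columns=["target"])
y = df["target"]

# Synthesize minority-class samples until both classes are balanced
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Rebuild a single dataframe and write it to the output dataset (placeholder name)
out_df = pd.DataFrame(X_res, columns=X.columns)
out_df["target"] = np.asarray(y_res)
dataiku.Dataset("training_data_smote").write_with_schema(out_df)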
Hope it helps,
Alex
Alex, thank you
Hello @Alex_Combessie
I have a question regarding this post. How do I know if the rebalancing by scikit-learn is working?
I have a dataset where the target is binary, class 0 represents less than 8% of the records, and I would like to know whether it is or is not working.
Thank you!!
Good question.
I'm trying to understand a little bit more about what lies "under the covers" in DSS when you do this class rebalancing.
Does DSS use the class_weight parameter in scikit-learn to make this work?
If so, when I export the model as a Jupyter notebook, I don't see a reference to class_weight in the resulting notebook. Is this to be expected? Or does scikit-learn do this as default behavior? (Therefore you do not have to reference it.)
Thanks for any further insights you can share.
Under the hood, DSS does indeed use the `class_weight` parameter (whenever it is available). Class weights are computed on the whole train set.
Regarding the Jupyter export, we are indeed missing the `class_weight` parameter and the required computation. This will be added in a future release.
In the meantime, you can compute the class weights yourself:

import numpy as np

# Inverse-frequency weights over the whole train set,
# matching what DSS computes internally
unique_values = np.unique(train_y)
n_classes = unique_values.size
class_weight_dict = {
    y: float(len(train_y)) / (n_classes * np.sum(train_y == y))
    for y in unique_values
}

and set them on the classifier before running the hyperparameter search and final train:

clf.class_weight = class_weight_dict
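As a side note, unless I'm missing a subtlety, this is the same formula as scikit-learn's built-in "balanced" heuristic, so the same dictionary can also be obtained with a utility function (reusing train_y from above):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# "balanced" computes n_samples / (n_classes * count(class)), as above
classes = np.unique(train_y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=train_y)
class_weight_dict = dict(zip(classes, weights))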
I'm wondering if this should be added to the documentation in some way. I wasn't able to find a discussion of this point.
Up until now, I've felt that I had to handle heavily skewed target features by using the class rebalance option (under Model Design -> Train/Test set).
@Samuel_R_ I am not finding any documentation on the 'class weights' parameter or where it is available. Which algorithms use class weights? I don't want to use the rebalance option if balancing is already happening within the train set.