
How to train a classification model on an imbalanced dataset

Level 1

I have an imbalanced dataset: most instances are in Class-1, and only 1% are labeled Class-2. When I run my classification models (decision tree, logistic regression) with ROC AUC as the evaluation metric, I get an excellent score but very low precision. This happens because of the imbalanced training set. How do you handle this in the Dataiku environment? I am new to Dataiku and still learning my way around it.

Is there an option for a SMOTE supervised filter?

2 Replies
Dataiker

Hi, 

Note that when you train a prediction model with DSS visual ML, for most algorithms DSS uses scikit-learn's rebalancing capabilities by setting appropriate instance weights.
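To give an idea of what instance weighting does, here is a minimal sketch of the "balanced" heuristic that scikit-learn documents for `class_weight="balanced"`, namely n_samples / (n_classes * class_count). Whether DSS applies exactly this formula is an assumption on my part, and the function name and the label list are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count).

    This mirrors scikit-learn's documented "balanced" heuristic; it is an
    assumption that DSS uses the same formula internally.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical label column: 99% Class-1 and 1% Class-2, as in the question.
labels = ["Class-1"] * 990 + ["Class-2"] * 10
class_weights = balanced_class_weights(labels)

# One weight per row; rare-class rows count much more during training.
sample_weights = [class_weights[y] for y in labels]
```

With these weights each class contributes the same total weight to the training loss, so the rare class is no longer drowned out by the majority class.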

In addition to this, you can use our built-in class rebalancing sampling capability: https://doc.dataiku.com/dss/latest/explore/sampling.html#class-rebalancing-approximate-number-of-rec...

However, note that such rebalancing means your performance metrics may no longer be representative of a real-world distribution.
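A small worked example of why the metrics shift: precision depends on the share of positives in the data being scored. The sketch below assumes a classifier whose true positive rate and false positive rate stay fixed, and the rates used (0.9 and 0.05) are made-up numbers, not results from any real model.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a fixed TPR/FPR at a given positive-class share."""
    true_pos = tpr * prevalence            # expected fraction of true positives
    false_pos = fpr * (1.0 - prevalence)   # expected fraction of false positives
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: catches 90% of positives, flags 5% of negatives.
balanced = precision_at_prevalence(tpr=0.9, fpr=0.05, prevalence=0.5)
real_world = precision_at_prevalence(tpr=0.9, fpr=0.05, prevalence=0.01)

print(f"precision on a 50/50 rebalanced set: {balanced:.3f}")
print(f"precision at 1% real-world prevalence: {real_world:.3f}")
```

Under these assumptions the same classifier goes from a precision of roughly 0.95 on the rebalanced set to roughly 0.15 at 1% prevalence, so it is safer to measure precision on data that keeps the original class distribution.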

Alternatively, you can implement SMOTE as a preliminary Python or R recipe.
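As a rough illustration of what such a recipe would do, here is a minimal, standard-library-only sketch of the core SMOTE idea: each synthetic row is an interpolation between a minority-class row and one of its nearest minority-class neighbours. The function name, parameters, and sample rows are hypothetical. In practice you would more likely use the `SMOTE` class from the imbalanced-learn package (`imblearn.over_sampling`, with `fit_resample`), assuming it is installed in your code environment.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows from numeric minority-class rows.

    Each synthetic row lies on the segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    Illustrative sketch only, not a tuned implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest other minority rows (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical minority-class rows with two numeric features.
minority_rows = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]]
new_rows = smote_sketch(minority_rows, n_new=6, k=2)
```

If you go this route, apply the oversampling to the training split only, so that the test set keeps the original class distribution.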

Hope it helps,

Alex

Level 1
Author

Alex, thank you