How to train a classification model on an imbalanced dataset
I have an imbalanced dataset (most instances are in Class-1, and only 1% are labeled Class-2). When I run my classification model (decision tree, logistic regression) using ROC AUC as the evaluation metric, I get an excellent score but very low precision. This happens because of the imbalanced training set. How do you handle this in the Dataiku environment? I am new to Dataiku and still learning my way around it.
Is there an option for a SMOTE supervised filter?
Best Answer
-
Hi,
Note that when doing a prediction with DSS visual ML, in most algorithms, DSS will use scikit-learn's capabilities for rebalancing by setting appropriate instance weights.
In addition to this, you can use our built-in class rebalancing sampling capability: https://doc.dataiku.com/dss/latest/explore/sampling.html#class-rebalancing-approximate-number-of-records.
However, note that such rebalancing means your performance metrics may not be representative anymore of a real-world distribution.
Alternatively, you can implement SMOTE as a preliminary Python or R recipe.
Hope it helps,
Alex
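The caveat about metrics on rebalanced data can be illustrated with a small calculation. This is only a sketch: the true positive rate and false positive rate below are hypothetical numbers, not results from any real model. It shows how the same operating point gives very different precision depending on the share of the minority class, while ROC AUC, which depends only on TPR and FPR, does not change.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a fixed TPR/FPR at a given positive-class share."""
    true_pos = tpr * prevalence            # share of all rows that are true positives
    false_pos = fpr * (1.0 - prevalence)   # share of all rows that are false positives
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: one operating point, evaluated on two class mixes.
TPR, FPR = 0.90, 0.05

precision_real = precision_at_prevalence(TPR, FPR, prevalence=0.01)      # 1% minority
precision_balanced = precision_at_prevalence(TPR, FPR, prevalence=0.50)  # rebalanced 50/50

print(f"precision at 1% prevalence:  {precision_real:.3f}")      # 0.154
print(f"precision at 50% prevalence: {precision_balanced:.3f}")  # 0.947
```

This is why precision measured on a rebalanced test set can look much better than what you would see on real-world data.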
Answers
-
Alex, thank you
-
Hello @Alex_Combessie
I have a question regarding this post. How do I know if the rebalancing by Scikit-Learn is working?
I have a dataset where the target is binary and class 0 represents less than 8% of the records, and I would like to know whether the rebalancing is working or not.
Thank you!!
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
@oksanab, good question.
I'm trying to understand a little bit more about what lies "under the covers" in DSS when it does this class rebalancing.
Does DSS use the class_weight parameter in scikit-learn to make this work?
If so, when I export the model as a Jupyter notebook, I'm not seeing a reference to class_weight in the resulting notebook. Is this to be expected? Or does scikit-learn do this by default, so that you do not have to reference it?
Thanks for any further insights you can share.
-
Under the hood, DSS does indeed use the `class_weight` parameter (whenever available). Class weights are computed on the whole train set.
Regarding the Jupyter export, we are indeed missing the `class_weight` parameter and the required computation. This will be added in a future release.
In the meantime, you can compute the class weights as:
    import numpy as np

    unique_values = np.unique(train_y)
    n_classes = unique_values.size
    class_weight_dict = {
        y: float(len(train_y)) / (n_classes * np.sum(train_y == y))
        for y in unique_values
    }

And

    clf.class_weight = class_weight_dict

before running the hyperparameter search and final train.
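As a self-contained illustration of the formula above, here is a minimal sketch on made-up labels (the `train_y` array is hypothetical). It also computes the total weight carried by each class, which is one simple way to check that the weights balance the classes. The formula matches what scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np

# Hypothetical labels: 8 majority-class rows (0) and 2 minority-class rows (1).
train_y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

unique_values, counts = np.unique(train_y, return_counts=True)
n_classes = unique_values.size

# weight(y) = n_samples / (n_classes * count(y))
class_weight_dict = {
    int(y): len(train_y) / (n_classes * int(c))
    for y, c in zip(unique_values, counts)
}

# Per-row weights, as a weighted loss function would see them.
sample_weight = np.array([class_weight_dict[int(y)] for y in train_y])

# Total weight carried by each class; equal totals mean the classes are balanced.
weight_per_class = {
    int(y): float(sample_weight[train_y == y].sum()) for y in unique_values
}

print(class_weight_dict)  # {0: 0.625, 1: 2.5}
print(weight_per_class)   # {0: 5.0, 1: 5.0}
```

Each minority row counts four times as much as a majority row here, so both classes contribute the same total weight to training.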
-
tgb417
I'm wondering if this should be added to the documentation in some way. I was not able to find a discussion of this point.
Up until now, I've felt that I had to deal with heavily skewed target features by using the class rebalance option (under Model Design -> Train / Test set).
-
@Samuel_R_
I am not finding any documentation on the 'Class weights' parameter or where it is available. Which algorithms use the class weights? I don't want to use the rebalance option if balancing is already happening within the train set.
-
tgb417
Thanks for the update on this. Can you share a link to where this is now being covered?
-
Sure, it's in two notes in https://doc.dataiku.com/dss/latest/machine-learning/supervised/settings.html in subsampling and weighting strategy.
-
How can I implement SMOTE? I have been trying, but it seems impossible.
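-
One possible starting point for a Python recipe is sketched below. It is a simplified, NumPy-only illustration of the core SMOTE idea (interpolating between a minority row and one of its nearest minority neighbours), not a full implementation. The function name `smote_sketch`, its parameters, and the sample data are all hypothetical. It assumes numeric features only and should be applied to the training set only, never to the test set. If you can install the imbalanced-learn package in your code environment, its `SMOTE` class with `fit_resample(X, y)` is a maintained alternative.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = np.random.default_rng(seed)
    k = min(k, n - 1)  # cannot have more neighbours than other rows

    # Pairwise Euclidean distances between minority rows (fine for small data).
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)         # starting row for each new point
    choice = rng.integers(0, k, size=n_new)       # which of its k neighbours to use
    picked = neighbours[base, choice]
    gap = rng.random((n_new, 1))                  # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical minority-class rows (numeric features only).
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_synthetic = smote_sketch(X_minority, n_new=6, k=2)
print(X_synthetic.shape)  # (6, 2)
```

The synthetic rows would then be appended to the training data with the minority label before fitting the model.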