I have an imbalanced dataset (most instances are in Class-1, and only 1% of instances are labeled Class-2). When I run my classification models (decision tree, logistic regression) using ROC AUC as the accuracy measure, I get an excellent score but very low precision. This happens because of the imbalanced training set. How do you handle this in the Dataiku environment? I am new to Dataiku and still learning my way around it.
Is there an option for a SMOTE supervised filter?
Hi,
Note that when doing a prediction with DSS visual ML, in most algorithms, DSS will use scikit-learn's capabilities for rebalancing by setting appropriate instance weights.
In addition to this, you can use our built-in class rebalancing sampling capability: https://doc.dataiku.com/dss/latest/explore/sampling.html#class-rebalancing-approximate-number-of-rec...
However, note that such rebalancing means your performance metrics may not be representative anymore of a real-world distribution.
Alternatively, you can implement SMOTE as a preliminary Python or R recipe.
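For example, here is a minimal sketch of what such a Python recipe could look like, using the imbalanced-learn package. The dataset and column names ("training_data", "training_data_smote", "target") are placeholders, and note that SMOTE requires all features to be numeric:

import dataiku
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE

# Read the training data (dataset name is a placeholder)
df = dataiku.Dataset("training_data").get_dataframe()

# Split features from the binary target ("target" is a placeholder column name)
X = df.drop(columns=["target"])
y = df["target"]

# Synthesize minority-class samples until both classes are balanced
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Rebuild a single dataframe and write it to the output dataset (placeholder name)
out_df = pd.DataFrame(X_res, columns=X.columns)
out_df["target"] = np.asarray(y_res)
dataiku.Dataset("training_data_smote").write_with_schema(out_df)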
Hope it helps,
Alex
Alex, thank you
Hello @Alex_Combessie
I have a question regarding this post. How do I know if the rebalancing by scikit-learn is working?
I have a dataset where the target is binary, class 0 represents less than 8% of the records, and I would like to know whether it is or is not working.
Thank you!!
Good question.
I'm trying to understand a little bit more about what lies "under the covers" in DSS when you do this class rebalancing.
Does DSS use the class_weight parameter in scikit-learn to make this work?
If so, when I export the model as a Jupyter notebook, I don't see a reference to class_weight in the resulting notebook. Is this to be expected? Or does scikit-learn do this as default behavior? (Therefore you do not have to reference it.)
Thanks for any further insights you can share.
Under the hood, DSS does indeed use the `class_weight` parameter (whenever it is available). Class weights are computed on the whole train set.
Regarding the Jupyter export, we are indeed missing the `class_weight` parameter and the required computation. This will be added in a future release.
In the meantime, you can compute the class weights yourself:

import numpy as np

# Inverse-frequency weights over the whole train set,
# matching what DSS computes internally
unique_values = np.unique(train_y)
n_classes = unique_values.size
class_weight_dict = {
    y: float(len(train_y)) / (n_classes * np.sum(train_y == y))
    for y in unique_values
}

and set them on the classifier before running the hyperparameter search and final train:

clf.class_weight = class_weight_dict
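As a side note, unless I'm missing a subtlety, this is the same formula as scikit-learn's built-in "balanced" heuristic, so the same dictionary can also be obtained with a utility function (reusing train_y from above):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# "balanced" computes n_samples / (n_classes * count(class)), as above
classes = np.unique(train_y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=train_y)
class_weight_dict = dict(zip(classes, weights))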
I'm wondering if this should be added to the documentation in some way. I wasn't able to find a discussion of this point.
Up until now, I've felt that I had to handle heavily skewed target features by using the class rebalance option (under Model Design -> Train/Test set).
@Samuel_R_ I am not finding any documentation on the 'class weights' parameter or where it is available. Which algorithms use class weights? I don't want to use the rebalance option if balancing is already happening within the train set.