I have an imbalanced dataset (most instances are in Class-1, and only 1% of instances are labeled Class-2). When I run my classification models (decision tree, logistic regression) using ROC AUC as the evaluation metric, I get an excellent ROC AUC score but very low precision. This happens because of the imbalanced training set. How do you handle it in the Dataiku environment? I am new to Dataiku and still learning my way around it.
Is there an option for a SMOTE supervised filter?
Hi,
Note that when doing a prediction with DSS visual ML, DSS will, for most algorithms, use scikit-learn's rebalancing capabilities by setting appropriate instance weights.
In addition to this, you can use our built-in class rebalancing sampling capability: https://doc.dataiku.com/dss/latest/explore/sampling.html#class-rebalancing-approximate-number-of-rec...
However, note that with such rebalancing, your performance metrics may no longer be representative of the real-world class distribution.
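To illustrate why metrics measured on a rebalanced sample can mislead, here is a small worked example. The true-positive and false-positive rates are hypothetical numbers chosen for illustration, and `precision_at_prevalence` is a helper name invented here, not a DSS or scikit-learn function. It shows how the same classifier yields very different precision at a 50/50 class split versus a 1% positive rate.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a fixed TPR/FPR at a given positive-class rate."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier: catches 90% of positives, 5% false-positive rate.
tpr, fpr = 0.90, 0.05

balanced = precision_at_prevalence(tpr, fpr, prevalence=0.50)
real_world = precision_at_prevalence(tpr, fpr, prevalence=0.01)

print(f"precision on a 50/50 rebalanced sample: {balanced:.3f}")
print(f"precision at 1% real-world prevalence:  {real_world:.3f}")
```

TPR and FPR do not depend on the class mix, which is why ROC AUC can look excellent while precision on the real distribution stays low.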
Alternatively, you can implement SMOTE as a preliminary Python or R recipe.
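In a Python recipe, the usual route would be the `SMOTE` class from the imbalanced-learn package (`imblearn.over_sampling.SMOTE`, applied with `fit_resample(X, y)`), assuming that package is installed in your code environment. For readers who want to see what the technique does, below is a dependency-free sketch of the core interpolation step only. `smote_sketch` and the sample rows are hypothetical, and this is not the library's implementation.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Needs at least two minority rows, each a list of numeric features.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class rows with two numeric features
minority_rows = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_rows = smote_sketch(minority_rows, n_new=6, k=2)
print(new_rows)
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the real-world class distribution.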
Hope it helps,
Alex
Alex, thank you
Hello @Alex_Combessie
I have a question regarding this post. How do I know if the rebalancing by Scikit-Learn is working?
I have a dataset where the target is binary and class 0 represents less than 8% of the rows, and I would like to know whether or not the rebalancing is working.
Thank you!!
Good question.
I'm trying to understand a little more about what lies "under the covers" in DSS when it does this class rebalancing.
Does DSS use the class_weight parameter in scikit-learn to make this work?
If so, when I export the model as a Jupyter Notebook, I'm not seeing a reference to class_weight in the resulting notebook. Is this to be expected? Or does scikit-learn do this as a default behavior (and therefore you do not have to reference it)?
Thanks for any further insights you can share.
Under the hood, DSS does indeed use the `class_weight` parameter (whenever available). Class weights are computed on the whole train set.
Regarding the Jupyter export, we are indeed missing the `class_weight` parameter and the required computation. This will be added in a future release.
In the meantime, you can compute the class weights as:
    import numpy as np

    # Weight each class by n_samples / (n_classes * class_count)
    unique_values = np.unique(train_y)
    n_classes = unique_values.size
    class_weight_dict = {
        y: float(len(train_y)) / (n_classes * np.sum(train_y == y))
        for y in unique_values
    }
And then set

    clf.class_weight = class_weight_dict

before running the hyperparameter search and the final training.
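As a quick sanity check on the formula above, here is a minimal pure-Python sketch (no NumPy needed) that computes the same weights on a made-up target and confirms that each class ends up with the same total weight. `balanced_class_weights` and the 8%/92% split are hypothetical. The formula is the same one scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical target: 8% zeros, 92% ones
train_y = [0] * 8 + [1] * 92
weights = balanced_class_weights(train_y)

# Total weight carried by each class (weight * number of rows in the class)
totals = {cls: weights[cls] * train_y.count(cls) for cls in weights}

print(weights)
print(totals)
```

If the weighting is doing its job, the rare class gets a proportionally larger weight and both classes contribute equally to the training loss.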
I'm wondering if this should be added to the documentation in some way. I wasn't able to find a discussion of this point.
Up until now, I've felt that I had to deal with heavily skewed targets by using the class rebalance option (under Model Design -> Train / Test set).
How can I implement SMOTE? I have been trying, but it seems impossible.
@Samuel_R_ I am not finding any documentation on the 'Class weights' parameter or where it is available. Which algorithms use the class weights? I don't want to use the rebalance option if balancing is already happening within the train set.
Thanks for the update on this. Can you share a link to where this is now covered?
Sure, it's covered in two notes at https://doc.dataiku.com/dss/latest/machine-learning/supervised/settings.html, in the subsampling and weighting strategy sections.