How to train a classification model on an imbalanced dataset

I have an imbalanced dataset (most instances are in Class-1, and only 1% of instances are labeled Class-2). When I run my classification model (decision tree, logistic regression) using ROC AUC as the evaluation metric, I get an excellent score but very low precision. This happens because of the imbalanced training set. How do you handle this in the Dataiku environment? I am new to Dataiku and still learning my way around it.

Is there an option for SMOTE supervised filter?

1 Solution

Hi,

Note that when you train a prediction model with DSS visual ML, for most algorithms DSS uses scikit-learn's rebalancing capabilities by setting appropriate instance weights.

In addition to this, you can use our built-in class rebalancing sampling capability: https://doc.dataiku.com/dss/latest/explore/sampling.html#class-rebalancing-approximate-number-of-rec...

However, note that with such rebalancing your performance metrics may no longer be representative of the real-world distribution.
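To make the point about metrics concrete, here is a small worked example. It is only a sketch: the true positive rate and false positive rate below are made-up illustrative values, not results from DSS or from this thread. It shows how the same classifier can look very precise on a 50/50 rebalanced sample and much less precise at a 1% real-world positive rate.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a fixed TPR/FPR at a given share of positives."""
    true_pos = tpr * prevalence            # fraction of all rows that are true positives
    false_pos = fpr * (1.0 - prevalence)   # fraction of all rows that are false positives
    return true_pos / (true_pos + false_pos)


# Assumed operating point, chosen only for illustration.
TPR, FPR = 0.90, 0.05

rebalanced = precision_at_prevalence(TPR, FPR, prevalence=0.50)
real_world = precision_at_prevalence(TPR, FPR, prevalence=0.01)

print(f"precision on a 50/50 rebalanced sample: {rebalanced:.3f}")
print(f"precision at 1% real-world prevalence:  {real_world:.3f}")
```

TPR and FPR do not depend on the class ratio, which is why ROC AUC can stay high while precision drops sharply once positives are rare.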

Alternatively, you can implement SMOTE as a preliminary Python or R recipe.
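As a rough idea of what such a recipe could contain, below is a minimal NumPy sketch of the core SMOTE step: each synthetic row is an interpolation between a minority-class row and one of its nearest minority-class neighbours. The function name `smote_sketch` and its parameters are hypothetical, it assumes purely numeric features, and it is not the code DSS uses. In practice the `imbalanced-learn` package offers a `SMOTE` class with `fit_resample`, which is probably the more robust choice if that package is installed in your code environment.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows from minority-class rows X_min (numeric only)."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority rows to interpolate")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is never its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class: 20 rows, 3 numeric features (synthetic data for illustration).
X_minority = np.random.default_rng(42).normal(size=(20, 3))
X_synthetic = smote_sketch(X_minority, n_new=100)
print(X_synthetic.shape)
```

Oversampling like this is normally applied to the training split only, so that the test set keeps the real class distribution.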

Hope it helps,

Alex

11 Replies

Alex, thank you

Hello @Alex_Combessie

I have a question regarding this post. How do I know if the rebalancing by Scikit-Learn is working?

I have a data set where the target is binary and class 0 represents less than 8% of the records, and I would like to know whether or not the rebalancing is working.

Thank you!!

Good question.

I'm trying to understand a little bit more about what lies "under the covers" in DSS when you are doing this class rebalancing.

Does DSS use the **class_weight** parameter in scikit-learn to make this work?

If so, when I export the model as a Jupyter Notebook, I'm not seeing a reference to class_weight in the resulting notebook. Is this to be expected? Or does scikit-learn do this as a default behavior? (In which case you would not have to reference it.)

Thanks for any further insights you can share.

--Tom

Under the hood, DSS does use the `class_weight` parameter (whenever it is available). Class weights are computed on the whole train set.

Regarding the Jupyter export, we are indeed missing the `class_weight` parameter and the required computation. This will be added in a future release.

In the meantime, you can compute the class weights as:

    import numpy as np

    # Weight each class by n_samples / (n_classes * class count)
    unique_values = np.unique(train_y)
    n_classes = unique_values.size
    class_weight_dict = {
        y: float(len(train_y)) / (n_classes * np.sum(train_y == y))
        for y in unique_values
    }

And

    clf.class_weight = class_weight_dict

before running the hyperparameter search and final train.
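One way to sanity-check these weights is shown in the sketch below. It uses a made-up target with 8% zeros (the `train_y` array and the helper name `balanced_class_weights` are assumptions for illustration, not part of DSS). The formula is the same one scikit-learn documents for `class_weight="balanced"`, and the check is that both classes end up carrying the same total weight.

```python
import numpy as np


def balanced_class_weights(y):
    """Return {label: n_samples / (n_classes * count of that label)}."""
    y = np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    return {
        label.item(): float(len(y)) / (labels.size * int(count))
        for label, count in zip(labels, counts)
    }


# Hypothetical binary target: 8 zeros and 92 ones.
train_y = np.array([0] * 8 + [1] * 92)
weights = balanced_class_weights(train_y)

# Total weight carried by each class: weight times number of rows in the class.
totals = {
    label: w * int(np.sum(train_y == label))
    for label, w in weights.items()
}

print(weights)
print(totals)
```

If the two totals are equal, the rare class contributes as much to the training loss as the frequent one, which is what the weighting is meant to achieve.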

I'm wondering if this should be added to the documentation in some way. I wasn't able to find a discussion of this point there.

Up until now, I've felt that I had to deal with heavily skewed target features by using the class rebalance option (under Model Design -> Train / Test set).

--Tom

How can I implement SMOTE? I have been trying, but it seems impossible.

Thanks for the update on this. Can you share a link to where this is now being covered?

--Tom
