Solved!

How to train a classification model on an imbalanced dataset

I have an imbalanced dataset: most instances are in Class-1, and only 1% are labeled Class-2. When I run my classification models (decision tree, logistic regression) using ROC AUC as the evaluation metric, I get an excellent score but very low precision. This is due to the imbalanced training set. How do you handle this in the Dataiku environment? I am new to Dataiku and still learning my way around it.

Is there an option for a SMOTE supervised filter?

1 Solution

Hi,

Note that when training a prediction model with DSS visual ML, for most algorithms DSS uses scikit-learn's rebalancing capabilities by setting appropriate instance weights.

In addition to this, you can use our built-in class rebalancing sampling capability: https://doc.dataiku.com/dss/latest/explore/sampling.html#class-rebalancing-approximate-number-of-rec...

However, note that with such rebalancing your performance metrics may no longer be representative of the real-world distribution.
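To make this caveat concrete, here is a small sketch of how precision depends on the share of positives in the data being evaluated, for a classifier whose true positive rate and false positive rate stay fixed. The function name `precision_at_prevalence` and the rates used are assumptions chosen for illustration; they are not taken from this thread or from DSS.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by a true positive rate, a false positive rate,
    and the share of positives in the evaluated data."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier: catches 90% of positives, flags 5% of negatives.
balanced = precision_at_prevalence(0.90, 0.05, 0.50)    # rebalanced 50/50 sample
real_world = precision_at_prevalence(0.90, 0.05, 0.01)  # 1% positives, as in the question

print(f"precision on a 50/50 sample: {balanced:.3f}")
print(f"precision at 1% prevalence:  {real_world:.3f}")
```

With these assumed rates, precision is about 0.95 on the balanced sample but only about 0.15 at 1% prevalence, even though the true and false positive rates (and therefore the ROC curve) are unchanged. This is why metrics computed on a rebalanced sample can look much better than what you would see on real-world data.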

Alternatively, you can implement SMOTE as a preliminary Python or R recipe.
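As an illustration of the SMOTE idea mentioned above, here is a minimal, dependency-free sketch. It is not Dataiku's code and not a full SMOTE implementation: it assumes purely numeric features, works on the minority class only, and the names `smote_sketch`, `minority_rows`, and `new_rows` are hypothetical. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of numeric feature vectors (minority class only).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (Euclidean distance)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # New point at a random position on the segment base -> neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class with two numeric features (made-up values)
minority_rows = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_rows = smote_sketch(minority_rows, n_new=6, k=2)
print(len(new_rows))
```

In a Python recipe, a maintained implementation such as the `SMOTE` class from the imbalanced-learn package (`imblearn.over_sampling.SMOTE`, used through its `fit_resample` method) would normally be preferable, assuming that package is installed in the code environment. Either way, oversampling only the training split keeps the test metrics representative of the real-world distribution.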

Hope it helps,

Alex

11 Replies

Alex, thank you

Hello @Alex_Combessie

I have a question regarding this post. How do I know if the rebalancing by scikit-learn is working?

I have a dataset where the target is binary and class 0 represents less than 8% of the records. I would like to check whether or not the rebalancing is working.

Thank you!!

Good question.

I'm trying to understand a little more about what lies "under the covers" in DSS when it does this class rebalancing.

Does DSS use the **class_weight** parameter in scikit-learn to make this work?

If so, when I export the model as a Jupyter notebook, I'm not seeing a reference to class_weight in the resulting notebook. Is this to be expected? Or does scikit-learn do this by default, so that it does not need to be referenced?

Thanks for any further insights you can share.

--Tom

Under the hood, DSS does indeed use the `class_weight` parameter (whenever available). Class weights are computed on the whole train set.

Regarding the Jupyter export, we are indeed missing the `class_weight` parameter and the required computation. This will be added in a future release.

In the meantime, you can compute the class weights as:

    import numpy as np

    # train_y: array of training labels
    unique_values = np.unique(train_y)
    n_classes = unique_values.size
    # weight of class y = n_samples / (n_classes * number of samples in class y)
    class_weight_dict = {
        y: float(len(train_y)) / (n_classes * np.sum(train_y == y))
        for y in unique_values
    }

And

    clf.class_weight = class_weight_dict

before running the hyperparameter search and final train.
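As a worked example of the formula above, here is a small standard-library sketch on a made-up target with 8% of records in class 0, similar to the case asked about earlier in the thread. The helper name `balanced_class_weights` and the data are assumptions for illustration only; this does not inspect what DSS itself computed.

```python
from collections import Counter

def balanced_class_weights(labels):
    """n_samples / (n_classes * count(class)), the formula quoted above."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {y: n_samples / (n_classes * c) for y, c in counts.items()}

# Made-up target: 80 records of class 0 (8%) and 920 of class 1
train_y = [0] * 80 + [1] * 920
weights = balanced_class_weights(train_y)
print(weights)

# Sanity check: after weighting, each class carries the same total weight
totals = {y: weights[y] * c for y, c in Counter(train_y).items()}
print(totals)
```

Here the rare class gets a weight of 6.25 and the frequent class a weight of roughly 0.54, so both classes contribute about the same total weight (around 500 each) to the training loss. The same formula is what scikit-learn documents for `class_weight="balanced"`, so comparing a dictionary like this one with the weights actually passed to the estimator is one way to check that rebalancing is in effect.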

I'm wondering if this should be added to the documentation in some way. I was not able to find a discussion of this point there.

Up until now, I've felt that I had to deal with heavily skewed target features by using the class rebalance option (under Model Design -> Train / Test set).

--Tom

How can I implement SMOTE? I have been trying, but it seems impossible.

Thanks for the update on this. Can you share a link to where this is now being covered?

--Tom
