How to train a classification model on an imbalanced dataset

oksanab Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered Posts: 2 ✭✭✭✭

I have an imbalanced dataset (most instances are in Class-1, and only 1% are labeled Class-2). When I run my classification model (decision tree, logistic regression) using ROC AUC as the evaluation metric, I get an excellent score but very low precision. This happens because of the imbalanced training set. How do you handle this in the Dataiku environment? I am new to Dataiku and still learning my way around it.

Is there an option for a SMOTE supervised filter?

Answers

  • oksanab
    oksanab Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered Posts: 2 ✭✭✭✭

    Alex, thank you

  • cwentz
    cwentz Dataiku DSS Core Concepts, Registered Posts: 33 ✭✭✭✭

    Hello @Alex_Combessie

    I have a question regarding this post. How do I know if the rebalancing by Scikit-Learn is working?

    I have a dataset where the target is binary and class 0 represents less than 8% of the rows. I would like to know whether the rebalancing is working or not.

    Thank you!!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

    @oksanab,

    Good question.

    @Alex_Combessie,

    I'm trying to understand a little more about what lies "under the covers" in DSS when it does this class rebalancing.

    Does DSS use the class_weight parameter in scikit-learn to make this work?

    If so, when I export the model as a Jupyter Notebook, I'm not seeing a reference to class_weight in the resulting notebook. Is this to be expected? Or does Scikit-Learn do this as a default behavior, so that you do not have to reference it?

    Thanks for any further insights you can share.

  • Samuel_R_
    Samuel_R_ Dataiker Posts: 8 Dataiker
    edited July 17

    Under the hood, DSS does indeed use the `class_weight` parameter (whenever available). Class weights are computed on the whole train set.

    Regarding the Jupyter export, we are indeed missing the `class_weight` parameter and the required computation. This will be added in a future release.

    In the meantime, you can compute the class weights as:

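    import numpy as np

    # train_y: array of training labels
    unique_values = np.unique(train_y)
    n_classes = unique_values.size
    # Weight per class: n_samples / (n_classes * class_count)
    class_weight_dict = {
        y: float(len(train_y)) / (n_classes * np.sum(train_y == y))
        for y in unique_values
    }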

    And

    clf.class_weight = class_weight_dict

    before running the hyperparameter search and final train.
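One way to check that the weighting is doing something is to train the same model with and without the weights and compare recall on the rare class. The sketch below is only an illustration under assumptions: the synthetic data from `make_classification`, the choice of `LogisticRegression`, and the helper `minority_recall` are stand-ins, not what DSS runs internally. It also checks that the formula above gives the same numbers as scikit-learn's `"balanced"` heuristic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in data (an assumption): roughly 8% of rows are class 0.
X, labels = make_classification(
    n_samples=5000, n_features=10, weights=[0.08, 0.92], random_state=0
)
train_X, test_X, train_y, test_y = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)

# Class weights computed with the formula from the post above.
unique_values = np.unique(train_y)
n_classes = unique_values.size
class_weight_dict = {
    label: float(len(train_y)) / (n_classes * np.sum(train_y == label))
    for label in unique_values
}

# scikit-learn's "balanced" heuristic should produce the same weights.
balanced = compute_class_weight(
    class_weight="balanced", classes=unique_values, y=train_y
)
weights_match = np.allclose(
    [class_weight_dict[label] for label in unique_values], balanced
)


def minority_recall(class_weight):
    """Hypothetical helper: fit a model and return recall on class 0."""
    clf = LogisticRegression(max_iter=1000, class_weight=class_weight)
    clf.fit(train_X, train_y)
    return recall_score(test_y, clf.predict(test_X), pos_label=0)


recall_plain = minority_recall(None)
recall_weighted = minority_recall(class_weight_dict)

print("weights match 'balanced':", weights_match)
print("class weights:", class_weight_dict)
print("minority recall without weights:", recall_plain)
print("minority recall with weights:", recall_weighted)
```

If the weights are being applied, recall on the rare class usually goes up, often at the cost of some precision. Comparing the two confusion matrices gives the same information in more detail.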

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

    @Samuel_R_

    I'm wondering if this should be added to the documentation in some way. I was not able to find a discussion of this point.

    Up until now, I've felt that I had to handle heavily skewed targets by using the class rebalance option (under Model Design -> Train / Test set).

    [Screenshot: Imballance Class rebalance.jpg]

  • cwentz
    cwentz Dataiku DSS Core Concepts, Registered Posts: 33 ✭✭✭✭

    @Samuel_R_
    I am not finding any documentation on the 'Class weights' parameter or where it is available. Which algorithms use class weights? I don't want to use the rebalance option if balancing is already happening within the train set.

  • Samuel_R_
    Samuel_R_ Dataiker Posts: 8 Dataiker

    Hi @cwentz, @tgb417,
    the documentation for Dataiku 9.0 has been updated accordingly. Thanks for suggesting it!

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

    @Samuel_R_

    Thanks for the update on this. Can you share a link to where this is now covered?

  • nats12
    nats12 Registered Posts: 1 ✭✭✭

    How can I implement SMOTE? I have been trying, but it seems impossible.
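This thread does not confirm a built-in SMOTE option, so one route is to oversample in custom Python code. The imbalanced-learn package provides a `SMOTE` class with a `fit_resample(X, y)` method for this. To show the idea without that dependency, here is a simplified, NumPy-only sketch of SMOTE-style interpolation. The function name `smote_like_oversample` and the toy data are hypothetical, and this is not the imbalanced-learn implementation.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Simplified illustration of the SMOTE idea; needs at least 2 minority rows.
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random minority row, then one of its k nearest neighbours.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place each synthetic point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Toy data (an assumption): 990 majority rows and 10 minority rows.
data_rng = np.random.default_rng(42)
X_majority = data_rng.normal(0.0, 1.0, size=(990, 2))
X_minority = data_rng.normal(3.0, 0.5, size=(10, 2))

X_new = smote_like_oversample(
    X_minority, n_new=len(X_majority) - len(X_minority)
)
X_balanced = np.vstack([X_majority, X_minority, X_new])
y_balanced = np.concatenate([
    np.zeros(len(X_majority), dtype=int),
    np.ones(len(X_minority) + len(X_new), dtype=int),
])

print("synthetic rows:", X_new.shape)
print("class counts after oversampling:", np.bincount(y_balanced))
```

Whichever implementation is used, oversampling should be applied to the training split only, so that the test set keeps the real class distribution.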
