Deal with imbalanced data
Hi,
I am working on an imbalanced dataset (99% vs 1%).
1- What is the difference between using the 'class rebalancing' subsampling method and using weighting strategies?
2- Can I improve my model by using both of them?
3- How does the 'class rebalancing' subsampling method work?
Thank you for your help.
Answers
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
Welcome to the Dataiku Community.
One of the things to consider is how big the 1% is. If there are only 1 or 10 records in this group, you likely don't have enough data to build a model. If you have hundreds or thousands of records in the 1% class, you might have enough data to work with the approaches you are describing.
I'll let others jump in on the rest of the question.
--Tom
-
Thank you for your response.
Yes I have enough data to build my model (135000 samples).
I have tried both methods and still have a low F1-score.
Can we use oversampling methods in Dataiku?
-
tgb417
I’m not an expert on this front. I expect that there are others here who are more experienced in dealing with data sets that are strongly imbalanced like yours.
But my understanding is that one way to proceed is to smartly reduce your training set down to the point that your minority class makes up a more significant percentage of the training data, say down as far as 2,700 to 10,000 records for training. In this smaller dataset one would use most records from the minority class in training, with a few held out for validation. The challenge is getting a truly representative sample of the majority class in your data.
Once this is done, there are two questions in my mind.
1. Can one actually build a model in these simplified conditions that finds your minority class? If not, then I’d start to look at features and data gathering first.
2. Once one has built a model with the smaller dataset, is this model generalizable to the larger dataset? Even though you trained on the much smaller dataset, do you have a viable model?
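To illustrate why the second question matters, here is a small arithmetic sketch in Python. It assumes a hypothetical classifier whose recall (0.80) and false positive rate (0.05) stay the same on both datasets; those two numbers are made up for illustration and do not come from any real model. Only the share of the minority class changes between the two evaluations.

```python
def precision_recall_f1(prevalence, recall, false_positive_rate):
    """Return (precision, recall, F1) for a classifier with a fixed recall and
    false positive rate, evaluated on data with the given minority prevalence."""
    # Expected fraction of all rows that are true positives / false positives.
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * false_positive_rate
    precision = true_pos / (true_pos + false_pos)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative assumptions only: recall 0.80, false positive rate 0.05.
balanced = precision_recall_f1(prevalence=0.50, recall=0.80, false_positive_rate=0.05)
original = precision_recall_f1(prevalence=0.01, recall=0.80, false_positive_rate=0.05)

print("balanced 50/50: precision=%.3f recall=%.3f f1=%.3f" % balanced)
print("original 99/1 : precision=%.3f recall=%.3f f1=%.3f" % original)
```

With these assumed numbers the F1-score is about 0.86 on the balanced data and about 0.24 on the 99/1 data, because the many majority records produce far more false positives. A model that looks good on the reduced dataset can therefore still score poorly once it is checked against the original class distribution.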
One could test these conditions with the Split recipe. Split off all of your minority class into one dataset, then randomly select partner records from the majority class to go with it. Recombine these two new datasets into a much smaller overall dataset. Then proceed to split and train on this much smaller data.
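Outside of the visual recipes, the same idea can be sketched in plain Python. This is only a minimal illustration, not how Dataiku implements it: the function name `undersample_majority`, the `target` column, and the ratio of 4 majority records per minority record are all assumptions made for the example.

```python
import random


def undersample_majority(rows, label_key, minority_label,
                         majority_per_minority=4, seed=0):
    """Keep every minority row plus a random sample of majority rows.

    majority_per_minority is how many majority rows to keep per minority row.
    """
    minority = [r for r in rows if r[label_key] == minority_label]
    majority = [r for r in rows if r[label_key] != minority_label]
    # Never ask for more majority rows than actually exist.
    n_keep = min(len(majority), majority_per_minority * len(minority))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(majority, n_keep)
    combined = minority + sampled
    rng.shuffle(combined)
    return combined


# Toy data: 1,000 rows with a 1% positive class (hypothetical column names).
rows = [{"id": i, "target": 1 if i % 100 == 0 else 0} for i in range(1000)]
small = undersample_majority(rows, "target", minority_label=1)
n_pos = sum(r["target"] for r in small)
print(len(small), n_pos)
```

On the toy data this keeps all 10 minority rows and 40 randomly chosen majority rows, so the minority class goes from 1% to 20% of the training data. Random sampling is the simplest choice here; it does not by itself guarantee that the majority sample is representative.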
Once one has determined that a model can be built in these much simpler conditions, the various automated features we are discussing in this post would make sense to me.
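On the difference between subsampling and weighting: subsampling drops majority records, while weighting keeps every record and makes each minority record count for more during training. The sketch below shows one common weighting scheme, the "balanced" heuristic n_samples / (n_classes * class_count). I am not certain this is the exact formula Dataiku uses, so treat it as an assumption; the function name `balanced_class_weights` is also just for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels with a 1% positive class.
labels = [1] * 10 + [0] * 990
weights = balanced_class_weights(labels)
print(weights)
```

Here each minority record gets a weight of 50.0 and each majority record a weight of about 0.505, so both classes carry the same total weight (500 each) without any record being thrown away.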
I’m hoping there are others out there on the forum who can jump in here. I know that I’ve struggled with strongly unbalanced data sets in the past.