Sampling Techniques for AutoML Lab Recipe
Hi -
I am new to using the Dataiku modeling interface.
I am looking to create a two-class classification algorithm using the 'AutoML' feature in the Visual ML lab section.
The data I am using has a large class imbalance (80% negative and 20% positive). Where and how can I make sure the data is being rebalanced correctly (e.g. undersampling, oversampling, SMOTE, etc.)? I see the Sampling method in the 'Train / Test Set' section, but the descriptions of the options are fairly vague and not clear to me. Is the 'Train / Test Set' section the proper place to configure rebalancing?
Thanks for any insight.
Answers
LouisDHulst
Hi @m_schneids,
Dataiku offers some class balancing features, but it doesn't include more sophisticated methods like SMOTE. If you want to use SMOTE, you would have to split your dataset using a Python recipe and code the sampling strategy yourself.
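To give an idea of what "split, then code the sampling strategy" could look like, here is a minimal sketch using only the Python standard library. It is not Dataiku's API and not a full SMOTE implementation: `smote_sketch`, the toy data and the 80/20 split ratio are all made up for illustration. In a real Python recipe you would more likely read the dataset into a dataframe and use a library such as imbalanced-learn (its `SMOTE` class with `fit_resample`). The point to take away is that the split happens first and only the training part is oversampled, so the test set keeps the original class distribution.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style idea).
    Hypothetical helper for illustration, needs at least 2 minority points."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data (assumption): 800 negatives, 200 positives, two numeric features.
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(800)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(200)]
rng.shuffle(data)

# 1) Split first, so the test set is never touched by the resampling.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# 2) Oversample the minority class in the training part only.
train_pos = [x for x, y in train if y == 1]
train_neg = [x for x, y in train if y == 0]
new_pos = smote_sketch(train_pos, n_new=len(train_neg) - len(train_pos))
train_balanced = train + [(x, 1) for x in new_pos]
```

After this, `train_balanced` has as many positives as negatives, while `test` still reflects the original imbalance. The two parts could then be written out as separate datasets and used as explicit train and test sets.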
This post might be interesting for you.
The two built-in sampling methods that help with class imbalance are the Class Rebalance options. If you want to undersample so that you get balanced classes, you could count the rows in the minority class, then select Class Rebalance - Approximate number of records and set Column = Target var and Nb. records = #positive.
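For intuition, the idea behind undersampling to balanced classes can be sketched in plain Python. This is only an illustration of the concept, not how Dataiku implements Class Rebalance (which is described as approximate); `undersample_to_minority`, the `target` column name and the toy rows are hypothetical.

```python
import random
from collections import Counter


def undersample_to_minority(rows, target_key, seed=0):
    """Randomly drop rows from the larger classes until every class
    has as many rows as the smallest one. Hypothetical helper."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # sample without replacement, so no row is duplicated
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy data (assumption): 1000 rows, 20% positive and 80% negative.
rows = [{"id": i, "target": 1 if i % 5 == 0 else 0} for i in range(1000)]
balanced = undersample_to_minority(rows, "target")
counts = Counter(r["target"] for r in balanced)
```

Here `counts` ends up with the same number of rows for each class, at the cost of discarding most of the negative rows.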
If you want to validate using a K-fold cross test, you can select the "Stratified" option to make sure the target var dist