ML basics - Hands on: Tune the model

Solved!
ManarAlmutairi

Hello, 

I am working on the hands-on exercise for tuning a model and I need to understand something.

Why do we need to re-balance the model? Dataiku DSS has already divided our samples into 80% and 20%, so why do we need this step, and how do I decide which sampling method to use?

SeanA
Community Manager

Hi @ManarAlmutairi, this is a good question. In this case, Dataiku has done an 80/20 split from the first 100k rows. Are these 100k rows a good representation of the whole dataset? We probably don't want to make that assumption. Maybe the first 100k records do not contain any high revenue customers!
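
To make the risk concrete, here is a minimal sketch in plain Python. It is not Dataiku code, and the dataset is made up: I am assuming 1,000,000 customers, 5% of them high revenue, stored so that the high revenue ones happen to sit at the end. It compares taking the first 100k rows with drawing 100k rows at random.

```python
import random

random.seed(0)

# Hypothetical dataset (an assumption for illustration): 1,000,000 customers,
# sorted so that the 5% of high revenue customers (label 1) are all at the end.
n_rows = 1_000_000
n_high = 50_000
labels = [0] * (n_rows - n_high) + [1] * n_high

sample_size = 100_000

# "First records" sampling: take the top 100k rows as they come.
head_sample = labels[:sample_size]

# Random sampling: draw 100k rows uniformly from the whole dataset.
random_sample = random.sample(labels, sample_size)


def high_revenue_rate(sample):
    """Share of rows in the sample that are labelled high revenue."""
    return sum(sample) / len(sample)


print(f"full dataset: {high_revenue_rate(labels):.3f}")
print(f"first 100k:   {high_revenue_rate(head_sample):.3f}")
print(f"random 100k:  {high_revenue_rate(random_sample):.3f}")
```

With this ordering, the first 100k rows contain no high revenue customers at all, while the random sample lands close to the true 5%. Real data is rarely sorted this neatly, but any ordering by date, region, or customer ID can skew a "first rows" sample in the same way.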

This is particularly acute when we know we have a class imbalance (many more "not high revenue" customers than high revenue customers), which is the reality for many ML use cases.

So when training the model, we want to make sure that high revenue customers are represented in both the train and test sets in roughly the proportion they have in the full dataset.
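
One common way to get that representation is a stratified split, where the 80/20 split is done separately within each class. The sketch below is plain Python and only illustrates the idea; `stratified_split`, the `high_revenue` field, and the 5% rate are hypothetical names and numbers of mine, not what Dataiku does internally.

```python
import random


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split rows into train/test so each class keeps the same share in both."""
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    train, test = [], []
    for members in by_class.values():
        # Shuffle within the class, then cut off the test share of that class.
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])

    # Shuffle again so the classes are not stored in blocks.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical data: 10,000 customers, 5% flagged as high revenue.
rows = [{"id": i, "high_revenue": i % 20 == 0} for i in range(10_000)]
train, test = stratified_split(rows, lambda r: r["high_revenue"])


def rate(part):
    """Share of high revenue customers in a split."""
    return sum(r["high_revenue"] for r in part) / len(part)


print(len(train), len(test))
print(f"train rate: {rate(train):.3f}, test rate: {rate(test):.3f}")
```

Both splits keep the 5% share of high revenue customers, so the model is trained and evaluated on the same class mix. Note that this only preserves the imbalance; re-balancing techniques such as class weights or over/under-sampling are a separate step on top.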

Class imbalance is an important ML concept (whether using Dataiku or not). You might want to look for external sources to supplement your understanding. This article might be one place to start.
