ML basics - Hands on: Tune the model

ManarAlmutairi Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 20 ✭✭✭


I am working on the hands-on exercise for tuning a model and I need to understand something.

Why do we need to re-balance the model? Dataiku DSS has already divided our samples into 80% and 20%, so why do we need this step, and how do I decide which sampling method to use?

Best Answer

  • Sean
    Sean Dataiker, Alpha Tester, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer Posts: 168 Dataiker
    Answer ✓

    Hi @ManarAlmutairi, this is a good question. In this case, Dataiku has done an 80/20 split from the first 100k rows. Are these 100k rows a good representation of the dataset? We probably don't want to make that assumption. Maybe the first 100k records do not have any high revenue customers!

    This is particularly acute when we know the classes are imbalanced (we have many more "not high revenue" customers than high revenue customers). And this is the reality for many different ML use cases.

    So when training the model, we want to try to make sure we have a more accurate representation of high revenue customers in the train and test sets.
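
To make the point above concrete, here is a minimal sketch in plain Python (not Dataiku's implementation). It assumes a made-up dataset in which high revenue customers become more common toward the end of the file, then compares the share of high revenue rows in three samples: the first records, a random sample, and a class-rebalanced sample. The row counts, the 10% ceiling, and the names `head`, `rand`, and `balanced` are all illustrative assumptions.

```python
import random

random.seed(0)

# Hypothetical dataset: 200,000 customers labelled 1 (high revenue) or 0.
# Assumption for illustration: the chance of being high revenue rises from
# 0% at the top of the file to 10% at the bottom (e.g. file sorted by date).
N = 200_000
rows = [1 if random.random() < 0.10 * i / N else 0 for i in range(N)]


def positive_rate(sample):
    """Share of high revenue customers in a sample."""
    return sum(sample) / len(sample)


# 1) "First records" sampling: take the top 50,000 rows as they come.
head = rows[:50_000]

# 2) Random sampling: 50,000 rows drawn from the whole dataset.
rand = random.sample(rows, 50_000)

# 3) Class rebalancing: draw the same number of rows from each class.
pos = [r for r in rows if r == 1]
neg = [r for r in rows if r == 0]
k = min(len(pos), len(neg), 25_000)
balanced = random.sample(pos, k) + random.sample(neg, k)

print(f"full dataset : {positive_rate(rows):.3f}")
print(f"first records: {positive_rate(head):.3f}")
print(f"random sample: {positive_rate(rand):.3f}")
print(f"rebalanced   : {positive_rate(balanced):.3f}")
```

Under these assumptions, the first-records sample under-represents high revenue customers relative to the full dataset, the random sample roughly matches it, and the rebalanced sample contains both classes in equal numbers. Which one is appropriate depends on how the data is ordered and how rare the minority class is.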

    Class imbalance is an important ML concept (whether using Dataiku or not). You might want to look for external sources to supplement your understanding. This article might be one place to start.
