Hi @ManarAlmutairi, this is a good question. In this case, Dataiku has done an 80/20 split from the first 100k rows. Are these 100k rows a good representation of the dataset? We probably don't want to make that assumption. Maybe the first 100k records do not have any high revenue customers!
This is particularly acute when we know we have a class imbalance (many more "not high revenue" customers than high revenue customers). And this is the reality for many different ML use cases.
So when training the model, we want to try to make sure we have a more accurate representation of high revenue customers in the train and test sets.
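To make the idea concrete outside of Dataiku, here is a minimal sketch in plain Python. It is not what Dataiku runs internally; the labels, the 5% high revenue rate, and the `stratified_split` helper are all made up for illustration. It compares the share of high revenue customers in the first rows of a file with the share in a split that samples each class separately.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hypothetical helper: split row indices so each class keeps the same train/test ratio."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then send the same fraction of it to the test set.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Made-up labels: 950 "not high revenue" rows (0) followed by 50 "high revenue" rows (1).
labels = [0] * 950 + [1] * 50

# Share of high revenue customers if we only look at the first 500 rows.
head_rate = sum(labels[:500]) / 500

# Share of high revenue customers in each side of a stratified 80/20 split.
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)

print(head_rate, train_rate, test_rate)
```

With this toy ordering, the first rows contain no high revenue customers at all, while the stratified split keeps the same class proportion in both train and test. If you work in Python with scikit-learn, `train_test_split(X, y, test_size=0.2, stratify=y)` gives you this kind of split without writing the helper yourself.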
Class imbalance is an important ML concept (whether using Dataiku or not). You might want to look for external sources to supplement your understanding. This article might be one place to start.