Why do we build a prediction model when the data is already labeled?
Hi, I'm working on the ML Quick Start and have a basic question.
"This project begins from a labeled dataset named job_postings composed of 95% real and 5% fake job postings. For the column fraudulent, values of 0 and 1 represent real and fake job postings, respectively. Your task will be to build a prediction model capable of classifying a job posting as real or fake."
Q1: Since the dataset already has a "fraudulent" attribute, why do we build a prediction model to determine whether a posting is real or fake?
I would understand if there were no "fraudulent" column; then we would build a model to determine whether a posting is fake or not.
Q2: How were the values in the "fraudulent" column of the sample dataset determined?
Or am I comparing apples and oranges, i.e., do "fake" and "fraudulent" use different definitions?
Thank you for your help!
Best Answer
Sean (Dataiker)
Hi @Kumi
Good questions! You're right that the training data is already labeled. We might still want to use that labeled data to build a model for at least two reasons: 1) to gain statistical insight into the data, and 2) to predict whether a new job posting is real or fake when we don't know the true answer (also known as the ground truth). The idea is that the model can "learn" from these labeled examples what real and fake job postings look like. We can then use that model to make predictions on unseen data.
For Q2, the researcher who provided the starting data came up with these labels, so we don't really know how the ground truth was actually collected. But we can imagine that the researcher collected the job postings and then verified, one by one, which postings were real and which were fake.
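In case a concrete example helps, here is a minimal, hypothetical sketch in Python/scikit-learn (not the Quick Start's actual visual ML flow) of the same idea: fit a model on rows that already have a "fraudulent" label, then use it on a new posting that has no label yet. The tiny inline data and the "description" column name are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled historical data (illustrative only): 0 = real, 1 = fake
labeled = pd.DataFrame({
    "description": [
        "Earn $5000/week from home, no experience needed!",
        "Senior data engineer, 5+ years of Spark required.",
    ],
    "fraudulent": [1, 0],
})

# Learn the relationship between the posting text and the label
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(labeled["description"], labeled["fraudulent"])

# A new posting with no "fraudulent" value yet -- this is where the model is useful
new_posting = ["Remote assistant needed, send your bank details to apply."]
print(model.predict(new_posting))        # predicted class: 0 (real) or 1 (fake)
print(model.predict_proba(new_posting))  # predicted class probabilities
```

The labeled rows are only ever used for training and evaluation; the payoff comes when the fitted model scores postings whose true label nobody has checked yet.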
Answers
@SeanA
Thank you for the explanation! Glad that I asked; I was scratching my head, but now I get it.