Combining Training and Unlabelled Data Using the Stack Recipe Causes Data Leakage
Hi,
I am following the Academy Learning Path, and the content is great.
But under ML Practitioner > Scoring Basics > Concept: Scoring Data,
it was shown that the unlabelled data we want to score with the model is combined with the training data using the Stack recipe and then run through all of the preprocessing steps, so that it is prepared to be consumed by the model for prediction.
But this seems like a bad practice, as rescaling features, doing imputation, etc. on the combined data causes data leakage between the training data and the unseen/unlabelled data.
Could someone throw some light on whether I am missing something?
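To make the concern concrete, here is a minimal sketch of the issue outside Dataiku, using plain scikit-learn with made-up values (the variable names and numbers are purely illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])    # labelled training data
to_score = np.array([[10.0], [12.0]])      # unlabelled data we want to score

# Fitting the scaler on the stacked data picks up statistics from the unseen rows
stacked = np.vstack([train, to_score])
scaler_stacked = StandardScaler().fit(stacked)

# Fitting on the training data only gives different statistics
scaler_train = StandardScaler().fit(train)

print(scaler_stacked.mean_, scaler_train.mean_)   # [5.6] vs [2.]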
Answers
-
Sean (Dataiker)
Hi @ajayrathode, thanks very much for sharing your feedback! We'll take this into consideration as we look to improve the content. On one hand we certainly don't want to promote bad practices, yet there is also a need to simplify the situation as much as possible to communicate the objectives at hand. If we assume the observations are drawn from the same distribution (an assumption, yes), the rescaling and imputation issues may not be so problematic. The loss of this nuance may be worth being able to quickly demonstrate that both datasets need the same preprocessing steps.
In general, none of the setups you'll see in the Academy are necessarily meant to be 100% representative of real-world solutions (have a look at Dataiku Solutions for that).
-
ajayrathode
Thank you for the clarification @SeanA.
Can you suggest how we can achieve independent data preprocessing steps? Is duplicating the data preprocessing steps in the flow for the test dataset the way to go?
-
Sean (Dataiker)
Hi @ajayrathode, I think the answer is one of those "it depends" situations. For many cases, the very simple approach shown in the video may be sufficient. What is required for your situation may depend on many factors, like the data distributions, objectives, and goals. I don't think there's necessarily a one-size-fits-all approach.
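For what it's worth, the general "independent preprocessing" pattern (outside of a Dataiku visual flow) is to fit imputation and rescaling on the training data only and then reuse the fitted transformers on the unlabelled data at scoring time. A minimal sketch with scikit-learn, where the columns and values are invented purely for illustration:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

train_X = pd.DataFrame({"age": [25, 32, None, 41], "income": [40e3, 52e3, 61e3, None]})
train_y = pd.Series([0, 1, 1, 0])
to_score = pd.DataFrame({"age": [29, None], "income": [48e3, 55e3]})

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # medians computed from the training rows only
    ("scale", StandardScaler()),                   # mean/std computed from the training rows only
    ("clf", LogisticRegression()),
])
model.fit(train_X, train_y)             # all preprocessing statistics come from the training data
predictions = model.predict(to_score)   # the same fitted steps are applied to the unseen rows

Whether you replicate the preprocessing steps in the flow or keep them inside the model, the key point is the same: the parameters of those steps are learned from the training data alone.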