Combining Training and Unlabelled Data Using a Stack Recipe Causes Data Leakage

ajayrathode · May 2024

Hi,

I am following the Academy Learning Path, and the content is great.

However, in ML Practitioner > Scoring Basics > Concept Scoring Data, the unlabelled data that we want to score is combined with the training data using a Stack recipe and passed through all the preprocessing steps, so that it is ready to be consumed by the model for prediction.

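But this is surely bad practice: rescaling features, imputing missing values, etc. on the combined data causes data leakage between the training data and the unseen/unlabelled data, because the statistics used in those steps are computed over both datasets at once.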
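To make the concern concrete, here is a minimal sketch in plain Python. The feature values are hypothetical toy numbers, not taken from the Academy project; the point is only that the mean used for imputation (and the standard deviation used for rescaling) changes once the two datasets are stacked.

```python
from statistics import mean, pstdev

# Hypothetical values for one numeric feature; None marks a missing value.
train = [10.0, 12.0, None, 14.0]
unlabelled = [30.0, None, 34.0]


def observed(values):
    """Return only the non-missing values."""
    return [v for v in values if v is not None]


# Statistics learned from the training data alone.
train_mean = mean(observed(train))
train_std = pstdev(observed(train))

# Statistics learned after stacking both datasets.
stacked = train + unlabelled
stacked_mean = mean(observed(stacked))
stacked_std = pstdev(observed(stacked))

# The imputation value and the scaling factor both shift once the
# unlabelled rows are included in the computation.
print("train only:", train_mean, train_std)
print("stacked:   ", stacked_mean, stacked_std)
```

With these toy numbers the imputation mean moves from 12.0 (training only) to 20.0 (stacked), so the training rows would be filled and rescaled using information from the rows to be scored.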

Could someone shed some light on whether I am missing something?

Sean · May 2024

Hi @ajayrathode, thanks very much for sharing your feedback! We'll take this into consideration as we look to improve the content.

On the one hand, we certainly don't want to promote bad practices; on the other, there is a need to simplify the situation as much as possible to communicate the objectives at hand. Assuming the observations are drawn from the same distribution (an assumption, yes), the rescaling and imputation issues may not be so problematic. Losing this nuance may be worth it in order to quickly demonstrate that both datasets need the same preprocessing steps.

In general, none of the setups you'll see in the Academy are necessarily meant to be 100% representative of real-world solutions (have a look at Dataiku Solutions for that).

ajayrathode · May 2024

Thank you for the clarification, @SeanA.

Can you suggest how we can achieve independent data preprocessing steps?

Is duplicating the data preprocessing steps in the flow for the test dataset the way to go?
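For reference, outside of any particular tool, the usual pattern is not to duplicate the logic but to learn the preprocessing parameters from the training data once and then reuse them unchanged on the data to be scored. Below is a minimal sketch in plain Python; `FittedPreprocessor` and `fit_preprocessor` are hypothetical helper names invented for this illustration, and the values are toy numbers.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class FittedPreprocessor:
    """Holds statistics learned from the training data only (hypothetical helper)."""
    fill_value: float
    center: float
    scale: float

    def transform(self, values):
        # Impute with the training mean, then standardize with training statistics.
        filled = [self.fill_value if v is None else v for v in values]
        return [(v - self.center) / self.scale for v in filled]


def fit_preprocessor(train_values):
    """Learn imputation and scaling parameters from training values only."""
    observed = [v for v in train_values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in train_values]
    scale = pstdev(filled) or 1.0  # avoid dividing by zero on a constant feature
    return FittedPreprocessor(fill_value=fill, center=mean(filled), scale=scale)


train = [10.0, 12.0, None, 14.0]
unlabelled = [30.0, None, 34.0]

prep = fit_preprocessor(train)             # parameters come from training data only
train_ready = prep.transform(train)        # used to train the model
scored_ready = prep.transform(unlabelled)  # same parameters, no refitting
```

Both datasets go through the same steps, but the unlabelled rows never influence the imputation value or the scaling factor. In scikit-learn, fitting a `Pipeline` on the training set and calling `transform` on new data plays the same role.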

Sean · June 2024

Hi @ajayrathode, I think the answer is one of those "it depends" situations. For many cases, the very simple approach shown in the video may be sufficient. What is required for your situation may depend on many factors like the data distributions, objectives, purpose, goals, etc. I don't think there's necessarily a one-size-fits-all approach.
