
Combining Training and Unlabelled Data Using Stack Recipe Causes Data Leakage

ajayrathode
Level 1
Hi, 

I am following the Academy Learning Path. The content is great.

However, under ML Practitioner > Scoring Basics > Concept Scoring Data, it is shown that the unlabelled data we want to score with the model is combined with the training data using a Stack recipe, so that it goes through all the preprocessing steps and is ready to be consumed by the model for prediction.

But this is definitely bad practice: rescaling features, imputing missing values, and so on over the combined data causes data leakage between the training data and the unseen/unlabelled data.
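To make the concern concrete, here is a minimal sketch in plain Python (not Dataiku's API). The feature values and the helper names `fit_scaler` and `standardise` are hypothetical and chosen only for illustration. It shows that the statistics used for standardisation change depending on whether they are computed over the stacked rows or over the training rows alone, so the training features end up depending on the unlabelled rows.

```python
from statistics import mean, pstdev

# Hypothetical toy values for one numeric feature (not taken from the Academy project).
train_values = [10.0, 12.0, 14.0, 16.0]
unlabelled_values = [30.0, 34.0]


def fit_scaler(values):
    """Return the (mean, population standard deviation) used for standardisation."""
    return mean(values), pstdev(values)


def standardise(values, params):
    """Standardise values with previously computed (mean, standard deviation)."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Stacked approach: statistics are computed over training and unlabelled rows together.
stacked_params = fit_scaler(train_values + unlabelled_values)

# Separate approach: statistics are computed over the training rows only.
train_only_params = fit_scaler(train_values)

# The same training rows get different scaled values under the two approaches.
train_scaled_stacked = standardise(train_values, stacked_params)
train_scaled_separate = standardise(train_values, train_only_params)

print("stacked mean/std:   ", stacked_params)
print("train-only mean/std:", train_only_params)
```

How much this matters in practice depends on how different the two sets of rows are. The toy values here are deliberately far apart to make the effect visible.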

 

Could someone shed some light on whether I am missing something?

3 Replies
SeanA
Community Manager

Hi @ajayrathode , thanks very much for sharing your feedback! We'll take this into consideration as we look to improve the content.

On the one hand, we certainly don't want to promote bad practices; on the other, there is a need to simplify the situation as much as possible to communicate the objectives at hand. If we assume the observations are drawn from the same distribution (an assumption, yes), the rescaling and imputation issues may not be so problematic. Losing this nuance may be worth it to quickly demonstrate that both datasets need the same preprocessing steps.

In general, none of the setups you'll see in the Academy are necessarily meant to be 100% representative of real-world solutions (have a look at Dataiku Solutions for that).

Dataiku
ajayrathode
Level 1
Author

Thank you for the clarification, @SeanA.
Can you suggest how we can achieve independent data preprocessing steps?

Is duplicating the data preprocessing steps in the Flow for the test dataset the way to go?

SeanA
Community Manager

Hi @ajayrathode , I think the answer is one of those "it depends" situations. For many cases, the very simple approach shown in the video may be sufficient. What your situation requires may depend on many factors, such as the data distributions and your objectives. I don't think there's necessarily a one-size-fits-all approach.
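For cases where the simple approach is not sufficient, one general pattern (outside of any specific Dataiku recipe) is to learn the preprocessing parameters from the training rows only and then reuse those same parameters on the unlabelled rows. The sketch below is plain Python with hypothetical data and hypothetical helper names (`fit_preprocessor`, `apply_preprocessor`). It is an illustration of the idea under those assumptions, not a description of how the Academy project or Dataiku itself implements it.

```python
from statistics import mean, pstdev


def fit_preprocessor(train_values):
    """Learn imputation and scaling parameters from the training rows only."""
    observed = [v for v in train_values if v is not None]
    return {"fill": mean(observed), "mu": mean(observed), "sigma": pstdev(observed)}


def apply_preprocessor(values, params):
    """Impute missing values, then standardise, using already-learned parameters."""
    filled = [params["fill"] if v is None else v for v in values]
    return [(v - params["mu"]) / params["sigma"] for v in filled]


# Hypothetical toy feature; None marks a missing value.
train_values = [10.0, None, 14.0, 18.0]
unlabelled_values = [None, 22.0]

# Fit once on the training data ...
params = fit_preprocessor(train_values)

# ... then apply the same parameters to both datasets.
train_ready = apply_preprocessor(train_values, params)
unlabelled_ready = apply_preprocessor(unlabelled_values, params)

print(params)
print(unlabelled_ready)
```

With this split, the steps are defined once and applied twice, so nothing is duplicated by hand and the unlabelled rows never influence the learned parameters.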

Dataiku