Problem reproducing a prediction

DaitakuNapoleon · ‎07-02-2020

Hello, I have a problem with Dataiku's predictions: I don't understand how they work. I am building a model to predict sales. So far no problem, but when I try to reproduce the model, using the same variables, the same algorithm, and the same file structure, the result is different from the one I had initially. The R² and the mean squared error also change, especially with Random Forest and XGBoost. I don't understand why. Is this normal?

Clément_Stenac · ‎07-06-2020

Hi,

There are lots of things that could be at play here.

The most important one by far is that Dataiku preprocesses your data. This includes numerous things, but on "normal" data, the two most important ones are dummifying categorical values, and standardizing numericals.

In order to get as close as possible to the results of Visual ML using external code, you'd need to disable as much of this preprocessing as possible, by:

Only passing numerical variables to the Visual ML
Disabling standardization on all numerical variables in the Visual ML
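
To make the two preprocessing steps concrete, here is a minimal sketch of what dummifying and standardizing do to a column, written in plain Python. The function names, the `is_` column prefix and the sample values are made up for illustration. Dataiku's actual implementation is not described in this thread and may differ in details, for example in how it names dummy columns or handles rare categories.

```python
from statistics import mean, pstdev

def dummify(values):
    """One-hot encode a list of categorical values into 0/1 columns."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}

def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical sample columns, not taken from the original post.
stores = ["north", "south", "north", "east"]
prices = [10.0, 20.0, 30.0, 40.0]

dummies = dummify(stores)            # one 0/1 column per category
scaled_prices = standardize(prices)  # same order, rescaled values
```

If external code skips either step while Visual ML applies it (or the reverse), the two models are trained on different inputs, so their metrics are not directly comparable.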

Another thing that may be different is the train/test split. In order to get more reproducible results, you'd want to pass explicit train and test sets to both Visual ML and your code.
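
One way to produce such an explicit split is to cut the data once with a fixed seed, save the two parts, and feed the same two parts to both tools. The sketch below is plain Python; `split_rows`, the 20% test ratio, the seed and the sample rows are all illustrative assumptions, not something prescribed in this thread.

```python
import random

def split_rows(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut it into train and test."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical rows standing in for the sales dataset.
rows = [{"id": i, "sales": 100 + i} for i in range(10)]

train_a, test_a = split_rows(rows)
train_b, test_b = split_rows(rows)  # same seed, so the same split again
```

Once the train and test parts are stored as separate datasets, neither tool has to draw its own random split, which removes one source of variation between runs.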

Then there is hyperparameter search. The easiest option here would be to disable it entirely, so that both sides train with the same fixed hyperparameters.
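
On the external-code side, skipping the search means training a single model with hand-picked hyperparameters. Below is a minimal scikit-learn sketch of that. The synthetic data and the hyperparameter values are arbitrary placeholders. Fixing `random_state` is an extra assumption that is not mentioned above: random forests are randomized, so without a fixed seed two otherwise identical runs can give different scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2.0 + X[:, 1] + rng.normal(scale=0.1, size=200)

# Explicit, fixed train/test parts (first 150 rows train, last 50 test).
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

def fit_and_score(seed):
    """Train one forest with fixed hyperparameters and return its test R²."""
    model = RandomForestRegressor(
        n_estimators=50,    # arbitrary value, no search performed
        max_depth=5,        # arbitrary value, no search performed
        random_state=seed,  # fixes the bootstrap and feature sampling
    )
    model.fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

score_a = fit_and_score(seed=1337)
score_b = fit_and_score(seed=1337)  # same data, settings and seed
```

With the same data, hyperparameters and seed, the two scores should match. To compare against Visual ML, the same hyperparameter values would have to be set there as well.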
