set the random state in visual ML models

tanguy

I have an ongoing project in production that I intend to replace with another project currently in development. As part of this transition, I find myself comparing a dataset that has undergone scoring from a model in each project. Initially, I anticipated the model scores to be identical or, at the very least, very similar. However, I have observed significant differences despite the fact that the underlying data provided to both models is the same.

Consequently, I am seeking a method to standardize the model training between the two projects by setting the random state. I am using a random forest classifier within a visual recipe, and random forests in scikit-learn have a `random_state` parameter.
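For context, this is the behavior scikit-learn's `random_state` parameter guarantees on its own, outside DSS. A minimal sketch using a toy dataset (the dataset shape and hyperparameters are illustrative, not from the actual project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset standing in for the project's input data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two forests trained on the same data with the same random_state
# build identical trees and therefore produce identical scores.
clf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
clf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

assert (clf_a.predict_proba(X) == clf_b.predict_proba(X)).all()
```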

Is there a recommended approach to achieve this?


Operating system used: Redhat 8

Young-Sang_Lee
Dataiker

Hi,

DSS already comes with a random seed that ensures reproducibility when training multiple times on the same input dataset. This seed is used when splitting the DSS input dataset, during hyper-parameter search, and in the algorithm itself, so that you get the same results when retraining with the same data.


A straightforward way to confirm this is to create two sample projects from the DSS tutorial and train a random forest model with the same input in each. You can also search the training logs for `random_seed`.

Now, coming back to your project: you may need to check which DSS version each model was trained in. The core Python packages used to train the model (e.g. scikit-learn, numpy) might have been updated, and that can cause different results.
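One quick way to check this is to print the package versions from each project's code environment and compare them. A minimal sketch (run it, for example, in a Python recipe or notebook in each project):

```python
import sys

import numpy
import sklearn

# Print the versions active in this code environment; run the same
# snippet in both projects and compare the output line by line.
print("python:      ", sys.version.split()[0])
print("numpy:       ", numpy.__version__)
print("scikit-learn:", sklearn.__version__)
```

If the versions differ, aligning both projects on the same code environment should remove that source of divergence.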

Can you try retraining both models in both projects after making sure you have selected the same code environment in Visual Analyses > Design > Advanced > Runtime environment > Code environment?

Cheers,

YSL

