Using Scenarios to automatically retrain models


I'm new to Dataiku and the community, and I'm using Dataiku Online. The documentation indicates that scenarios can be used to "Automate the retraining of “saved models” on a regular basis, and only activate the new version if the performance is improved". This is exactly what I need to set up, but I can't find examples showing the specific steps I would set up in a scenario to accomplish this. Attaching my current flow (I got an error pasting a screenshot). Any references to documentation or working examples would be great.

Thank you







Operating system used: windows


Hi @StephenEaster ,

Thanks for posting here. Sharing with the Community the solution we implemented on your end.

We added a custom Python step to a scenario:

import dataiku

# Define variables necessary to run this code (Please assign your own environment's values)
ANALYSIS_ID = 'XXXXXXX' # The identifier of the visual analysis containing the desired ML task
ML_TASK_ID = 'XXXXXXX' # The identifier of the desired ML task
SAVED_MODEL_ID = 'S-RESPONSEMODELING-ngsDEF4J-1656449809436' # The identifier of the saved model, used on the first scenario run; later runs read it from a project variable
TRAINING_RECIPE_NAME = 'train_Predict_Revenue_NDays_Company__regression_' # Name of the training recipe to update for redeploying the model

# client is a DSS API client.
client = dataiku.api_client()
p = client.get_project(dataiku.default_project_key())

# Retrieve existing ML task to retrain the model
mltask = p.get_ml_task(ANALYSIS_ID, ML_TASK_ID)

# Wait for the ML task to be ready
mltask.wait_guess_complete()

# Start train and wait for it to be complete
mltask.start_train()
mltask.wait_train_complete()

# Get the identifiers of the trained models
# There will be 3 of them because Logistic regression and Random forest were default enabled
ids = mltask.get_trained_models_ids()

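# Iterate through all the trained models to find the one with the best score
# for the chosen metric (r2, auc, f1, etc.). The variables are named "auc",
# but they hold whichever metric actual_metric selects.
actual_metric = "r2"

best_model = None
temp_auc = float("-inf")  # r2 can be negative, so do not start from 0
for id in ids:
    details = mltask.get_trained_model_details(id)
    algorithm = details.get_modeling_settings()["algorithm"]
    auc = details.get_performance_metrics()[actual_metric]

    if auc > temp_auc:
        # Remember the best score seen so far, not just the model ID
        temp_auc = auc
        best_model = id
        print("Better model identified")

    print("Algorithm=%s %s=%s" % (algorithm, actual_metric, auc))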

# Let's compare the "best" model of the newly trained model vs the existing model to see which is better

details = mltask.get_trained_model_details(best_model)
auc = details.get_performance_metrics()[actual_metric]

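# We'll need to pull the current model ID from project variables and retrieve the model info.
# On the first run the variable does not exist yet, so fall back to the hard-coded ID.
vars = p.get_variables()

try:
    current_model = vars["standard"]["current_model"]
except KeyError:
    current_model = SAVED_MODEL_ID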

current_details = mltask.get_trained_model_details(current_model)
current_auc = current_details.get_performance_metrics()[actual_metric]

# Let's deploy the model with the best score (either new or existing)
if auc > current_auc:
    model_to_deploy = best_model
else:
    model_to_deploy = current_model

print("Model to deploy identified: " + model_to_deploy)

# Update project variables to reflect the new model ID that is being deployed,
# and save them so that the next run compares against this model
vars["standard"]["current_model"] = model_to_deploy
p.set_variables(vars)

# Deploy the model to the Flow
ret = mltask.redeploy_to_flow(model_to_deploy, recipe_name = TRAINING_RECIPE_NAME, activate = True)


This assumes the model has already been trained and the winning model has been deployed to the Flow. To obtain the required variables:


[two screenshots]


The saved model ID is the ID of the model that is already deployed: [screenshot]
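
One caveat on the selection loop in the step above: it keeps the model with the highest score, which suits metrics such as r2, auc or f1, but would pick the worst model for an error metric where lower is better. Below is a minimal sketch of a direction-aware selection. It is plain Python with no Dataiku calls; the helper name pick_best, the LOWER_IS_BETTER set and the model IDs are illustrative assumptions, and the metric names should be checked against the keys that get_performance_metrics() returns on your instance.

```python
# Hypothetical helper, not part of the Dataiku API.
# The metric names below are assumptions; adjust them to the keys
# actually returned by get_performance_metrics() in your project.
LOWER_IS_BETTER = {"rmse", "mae", "mse", "logLoss"}


def pick_best(scores, metric):
    """Return (model_id, score) for the best entry in scores.

    scores maps a model ID to its value for the given metric.
    """
    if not scores:
        raise ValueError("no trained models to compare")
    if metric in LOWER_IS_BETTER:
        best_id = min(scores, key=scores.get)
    else:
        best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


# Example with made-up model IDs and scores
r2_scores = {"model_a": 0.71, "model_b": 0.84, "model_c": -0.10}
rmse_scores = {"model_a": 12.5, "model_b": 9.8, "model_c": 30.1}

best_r2 = pick_best(r2_scores, "r2")
best_rmse = pick_best(rmse_scores, "rmse")
print(best_r2, best_rmse)
```

In the scenario step, the scores dict would be built from the loop over ids, and the same direction check would apply when comparing the new best model with the currently deployed one.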



Hi Alex,

Looking to leverage the solution you've provided above to do automated weekly retraining of a model. Is there a way to make the value for 'SAVED_MODEL_ID' more dynamic? I ask because I worry that explicitly coding in the 'saved_model_id' won't be effective in future retrainings, if a new model ID performs better somewhere down the line. 

Here is an example:

- Initially deployed model ID is model_123

- On initial retraining step, all potential models are evaluated against model_123

- Model_456 performs better than model_123 and is automatically deployed

- In the following week of retraining, the R-squared values would be compared against the initially deployed model (123) instead of the newly deployed model (456). 


I appreciate your help in advance!


The example provided uses a project variable to keep track of the currently deployed model.
So it should update to model 456, and the next run will compare against model 456, not 123.

# Update project variables to reflect the new model ID that is being deployed
vars["standard"]["current_model"] = model_to_deploy
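
To make the model_123 / model_456 example concrete, here is a small plain-Python simulation of two weekly runs. It is only a sketch: a dict stands in for the project variables, the scores are made up, and weekly_run is a hypothetical name. In the real step the dict comes from p.get_variables() and, as far as I know, needs to be written back with p.set_variables(vars) for the change to persist between runs.

```python
# Plain-Python simulation; a dict stands in for the project variables.
SAVED_MODEL_ID = "model_123"  # hard-coded fallback, as in the scenario step


def weekly_run(project_vars, scores, new_best):
    """Mimic one scenario run and return the model that ends up deployed."""
    # Baseline: the stored variable if present, otherwise the hard-coded ID
    current = project_vars["standard"].get("current_model", SAVED_MODEL_ID)
    deployed = new_best if scores[new_best] > scores[current] else current
    # Persist the choice so that the next run compares against it
    project_vars["standard"]["current_model"] = deployed
    return deployed


project_vars = {"standard": {}}
scores = {"model_123": 0.70, "model_456": 0.80, "model_789": 0.75}

# Week 1: no variable yet, so model_456 is compared with model_123
week1 = weekly_run(project_vars, scores, "model_456")
# Week 2: model_789 is compared with model_456, not with model_123
week2 = weekly_run(project_vars, scores, "model_789")
print(week1, week2)
```

Here model_789 would have beaten model_123, but it is not deployed in week 2 because the baseline has already moved to model_456.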


Thanks, Alex. 

If I run this first block of code the following week, won't all the project variables revert back to what is coded here?

# Define variables necessary to run this code (Please assign your own environment's values)
ANALYSIS_ID = 'XXXXXXX' # The identifier of the visual analysis containing the desired ML task
ML_TASK_ID = 'XXXXXXX' # The identifier of the desired ML task
SAVED_MODEL_ID = 'S-RESPONSEMODELING-ngsDEF4J-1656449809436' # The identifier of the saved model, used on the first scenario run; later runs read it from a project variable
TRAINING_RECIPE_NAME = 'train_Predict_Revenue_NDays_Company__regression_' # Name of the training recipe to update for redeploying the model

It will only fall back to the initial hard-coded value if the project variable is not present:


try:
    current_model = vars["standard"]["current_model"]
except KeyError:
    current_model = SAVED_MODEL_ID
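
For anyone who wants to check this fallback without running the scenario, here is a minimal standalone sketch. The helper name resolve_current_model is hypothetical and the dicts only imitate the shape returned by p.get_variables() (a "standard" and a "local" section); no Dataiku call is made.

```python
SAVED_MODEL_ID = "S-RESPONSEMODELING-ngsDEF4J-1656449809436"


def resolve_current_model(project_vars, fallback=SAVED_MODEL_ID):
    """Return the stored model ID, or the fallback if none is stored yet."""
    try:
        return project_vars["standard"]["current_model"]
    except KeyError:
        # First run: the variable has not been written yet
        return fallback


# First run: the variable is absent, so the hard-coded ID is used
first_run = resolve_current_model({"standard": {}, "local": {}})

# Later run: the variable written by a previous run takes precedence
later_run = resolve_current_model(
    {"standard": {"current_model": "model_456"}, "local": {}}
)
print(first_run, later_run)
```

So the constants at the top of the step are only defaults: once current_model has been saved in the project variables, the hard-coded SAVED_MODEL_ID is no longer read.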



I got it now. Thank you for clarifying!
