Using Scenarios to automatically retrain models

StephenEaster
Level 1

Hi,

I'm new to Dataiku and the community, and I'm using Dataiku Online. The documentation indicates that scenarios can be used to "Automate the retraining of “saved models” on a regular basis, and only activate the new version if the performance is improved". This is exactly what I need to set up, but I can't find examples showing the specific steps I would set up in scenarios to accomplish this. I tried to attach my current flow but got an error pasting the screenshot. Any references to documentation or working examples would be great.

Thank you

Operating system used: windows

AlexT
Dataiker

Hi @StephenEaster ,

Thanks for posting here. I'm sharing the solution we implemented on your end with the Community.

We added a custom Python step to a scenario:

import dataiku

# Variables required by this code (replace these with the values from your own environment)
ANALYSIS_ID = 'XXXXXXX'  # Identifier of the visual analysis containing the desired ML task
ML_TASK_ID = 'XXXXXXX'  # Identifier of the desired ML task
SAVED_MODEL_ID = 'S-RESPONSEMODELING-ngsDEF4J-1656449809436'  # Identifier of the saved model; used on the first run only, later runs read the project variable current_model
TRAINING_RECIPE_NAME = 'train_Predict_Revenue_NDays_Company__regression_'  # Name of the training recipe to update when redeploying the model

# client is a DSS API client.
client = dataiku.api_client()
p = client.get_project(dataiku.default_project_key())

# Retrieve existing ML task to retrain the model
mltask = p.get_ml_task(ANALYSIS_ID, ML_TASK_ID)

# Wait for the ML task to be ready
mltask.wait_guess_complete()

# Start train and wait for it to be complete
mltask.start_train()
mltask.wait_train_complete()

# Get the identifiers of all trained models in the ML task
ids = mltask.get_trained_models_ids()

# Iterate over the trained models to find the one with the best value of the chosen metric (r2, auc, f1, etc.)
actual_metric = "r2"

best_model = None
best_score = float("-inf")  # r2 can be negative, so do not start the comparison from 0
for model_id in ids:
    details = mltask.get_trained_model_details(model_id)
    algorithm = details.get_modeling_settings()["algorithm"]
    score = details.get_performance_metrics()[actual_metric]
    print("Algorithm=%s %s=%s" % (algorithm, actual_metric, score))

    # Keep track of the best score seen so far, not just the last model above the initial value
    if score > best_score:
        best_model = model_id
        best_score = score
        print("Better model identified")

# Let's compare the "best" model of the newly trained model vs the existing model to see which is better

details = mltask.get_trained_model_details(best_model)
auc = details.get_performance_metrics()[actual_metric]

# We'll need to pull the current model ID from project variables and retrieve the model info
vars = p.get_variables()

try:
    current_model = vars["standard"]["current_model"]
except KeyError:
    # First run: the project variable does not exist yet
    current_model = SAVED_MODEL_ID

current_details = mltask.get_trained_model_details(current_model)
current_auc = current_details.get_performance_metrics()[actual_metric]

# Let's deploy the model with the best metric value (either new or existing)
if auc > current_auc:
    model_to_deploy = best_model
else:
    model_to_deploy = current_model

print("Model to deploy identified: " + model_to_deploy)

# Update project variables to reflect the new model ID that is being deployed
vars["standard"]["current_model"] = model_to_deploy
p.set_variables(vars)

# Deploy the model to the Flow
ret = mltask.redeploy_to_flow(model_to_deploy, recipe_name=TRAINING_RECIPE_NAME, activate=True)
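
Note that the comparisons above assume a higher metric value is better. That holds for r2, auc and f1, but not for error metrics. Below is a minimal sketch of a direction-aware selection in plain Python, with no DSS calls. The helper names and the LOWER_IS_BETTER set are hypothetical, and the metric keys are assumptions that should be checked against what get_performance_metrics() actually returns for your model.

```python
import math

# Assumed set of metric keys where a smaller value means a better model.
# Adjust it to the keys your model actually reports.
LOWER_IS_BETTER = {"rmse", "mse", "mae", "mape"}


def pick_best(scores, metric):
    """Return (model_id, score) for the best entry of scores ({model_id: score})."""
    # Ignore models that did not report a usable value for this metric.
    valid = {m: s for m, s in scores.items() if s is not None and not math.isnan(s)}
    if not valid:
        raise ValueError("no model reported a usable value for %s" % metric)
    chooser = min if metric in LOWER_IS_BETTER else max
    best_id = chooser(valid, key=valid.get)
    return best_id, valid[best_id]


def is_improvement(new_score, current_score, metric):
    """True only if the new score is strictly better than the current one."""
    if metric in LOWER_IS_BETTER:
        return new_score < current_score
    return new_score > current_score
```

In the scenario step, scores would be built from get_performance_metrics()[actual_metric] for each trained model ID, and is_improvement would replace the plain `>` comparison before redeploying.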

 

This assumes the model has already been trained and the winning model has been deployed to the Flow. To obtain the required variables:

[Two screenshots]

The saved model ID is the ID of the model that is already deployed. [Screenshot]

Regards,