Using Scenarios to automatically retrain models
Hi,
I'm new to Dataiku and the community, and I'm using Dataiku Online. The documentation indicates that scenarios can be used to "Automate the retraining of “saved models” on a regular basis, and only activate the new version if the performance is improved". This is exactly what I need to set up, but I can't find examples showing the specific steps I would set up in a scenario to accomplish this. I wanted to attach my current flow, but I got an error pasting a screenshot. Any references to documentation or working examples would be great.
Thank you
Operating system used: windows
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
Hi @StephenEaster,
Thanks for posting here. Sharing the solution we implemented on your end with the Community.
Added a custom Python step in a scenario:

    import dataiku

    # Define variables necessary to run this code (please assign your own environment's values)
    ANALYSIS_ID = 'XXXXXXX'  # The identifier of the visual analysis containing the desired ML task
    ML_TASK_ID = 'XXXXXXX'  # The identifier of the desired ML task
    # The identifier of the saved model; used only on the first run, later runs read the project variable
    SAVED_MODEL_ID = 'S-RESPONSEMODELING-ngsDEF4J-1656449809436'
    # Name of the training recipe to update for redeploying the model
    TRAINING_RECIPE_NAME = 'train_Predict_Revenue_NDays_Company__regression_'

    # client is a DSS API client
    client = dataiku.api_client()
    p = client.get_project(dataiku.default_project_key())

    # Retrieve existing ML task to retrain the model
    mltask = p.get_ml_task(ANALYSIS_ID, ML_TASK_ID)

    # Wait for the ML task to be ready
    mltask.wait_guess_complete()

    # Start train and wait for it to be complete
    mltask.start_train()
    mltask.wait_train_complete()

    # Get the identifiers of the trained models
    # There will be several of them if several algorithms are enabled in the ML task
    ids = mltask.get_trained_models_ids()

    # Iterate through all the trained models to determine which one has the best score
    # for the chosen metric (r2, auc, f1, etc.)
    actual_metric = "r2"
    best_model = None
    temp_auc = float("-inf")
    for id in ids:
        details = mltask.get_trained_model_details(id)
        algorithm = details.get_modeling_settings()["algorithm"]
        auc = details.get_performance_metrics()[actual_metric]
        if auc > temp_auc:
            # Remember the best score so far so that later models are compared against it
            temp_auc = auc
            best_model = id
            print("Better model identified")
            print("Algorithm=%s actual_metric=%s" % (algorithm, auc))

    # Compare the best newly trained model with the existing model to see which is better
    details = mltask.get_trained_model_details(best_model)
    auc = details.get_performance_metrics()[actual_metric]

    # Pull the current model ID from project variables and retrieve the model info
    vars = p.get_variables()
    try:
        current_model = vars["standard"]["current_model"]
    except KeyError:
        # Variable not set yet (first run): fall back to the hard-coded ID
        current_model = SAVED_MODEL_ID
    current_details = mltask.get_trained_model_details(current_model)
    current_auc = current_details.get_performance_metrics()[actual_metric]

    # Deploy the model with the best score (either new or existing)
    if auc > current_auc:
        model_to_deploy = best_model
    else:
        model_to_deploy = current_model
    print("Model to deploy identified: " + model_to_deploy)

    # Update project variables to reflect the new model ID that is being deployed
    vars["standard"]["current_model"] = model_to_deploy
    p.set_variables(vars)

    # Deploy the model to the Flow
    ret = mltask.redeploy_to_flow(model_to_deploy, recipe_name=TRAINING_RECIPE_NAME, activate=True)

The assumption here is that the model is already trained and the winning model has been deployed to the Flow. To obtain the required variables:
The Saved Model ID, i.e. the one that is already deployed
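The comparison logic in the script can be separated from the Dataiku API and checked on its own. Here is a minimal sketch, assuming the scores have already been collected into a plain dict of model ID to metric value. `pick_best` and `choose_model_to_deploy` are hypothetical helper names, not part of the Dataiku API, and the `higher_is_better` flag is an assumption added to cover error metrics such as RMSE, where a lower value wins:

```python
def pick_best(metrics, higher_is_better=True):
    """Return the model ID with the best score from a {model_id: score} dict."""
    if not metrics:
        raise ValueError("no trained models to compare")
    chooser = max if higher_is_better else min
    return chooser(metrics, key=metrics.get)


def choose_model_to_deploy(new_metrics, current_id, current_score, higher_is_better=True):
    """Keep the current model unless the best new model strictly beats it."""
    challenger = pick_best(new_metrics, higher_is_better)
    challenger_score = new_metrics[challenger]
    if higher_is_better:
        improved = challenger_score > current_score
    else:
        improved = challenger_score < current_score
    return challenger if improved else current_id


# Illustrative values only (hypothetical model IDs and r2 scores)
new_models = {"model_456": 0.81, "model_789": 0.64}
print(choose_model_to_deploy(new_models, "model_123", 0.75))
```

On a tie the current model is kept, which avoids redeploying when nothing has improved.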
Regards,
-
Hi Alex,
Looking to leverage the solution you've provided above to do automated weekly retraining of a model. Is there a way to make the value for 'SAVED_MODEL_ID' more dynamic? I ask because I worry that explicitly coding in the 'saved_model_id' won't be effective in future retrainings, if a new model ID performs better somewhere down the line.
Here is an example:
- Initially deployed model ID is model_123
- On initial retraining step, all potential models are evaluated against model_123
- Model_456 performs better than model_123 and is automatically deployed
- In the following week of retraining, the R-squared values would be compared against the initially deployed model (123) instead of the newly deployed model (456).
I appreciate your help in advance!
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
The example provided uses a project variable to keep track of the currently deployed best model.
So after the first retraining the variable is updated to model 456, and the next run compares the new models against model 456, not 123.
    # Update project variables to reflect the new model ID that is being deployed
    vars["standard"]["current_model"] = model_to_deploy
    p.set_variables(vars)
-
Thanks, Alex.
If I run this first block of code the following week, won't all the project variables revert back to what is coded here?
    # Define variables necessary to run this code (Please assign your own environment's values)
    ANALYSIS_ID = 'XXXXXXX'  # The identifier of the visual analysis containing the desired ML task
    ML_TASK_ID = 'XXXXXXX'  # The identifier of the desired ML task
    SAVED_MODEL_ID = 'S-RESPONSEMODELING-ngsDEF4J-1656449809436'  # The identifier of the saved model when initially running the scenario will use variables later
    TRAINING_RECIPE_NAME = 'train_Predict_Revenue_NDays_Company__regression_'  # Name of the training recipe to update for redeploying the model
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
It will only fall back to the initial hard-coded value if the project variable is not present:
    try:
        current_model = vars["standard"]["current_model"]
    except KeyError:
        current_model = SAVED_MODEL_ID
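
To see why the hard-coded ID does not overwrite the stored one on later runs, the fallback can be simulated without a DSS instance. This is only a sketch: the `store` dict is a stand-in for what `p.get_variables()` returns, `resolve_current_model` and `weekly_run` are hypothetical helpers, and the model IDs are the illustrative ones from the question above:

```python
SAVED_MODEL_ID = "model_123"  # hard-coded fallback, used only when no variable is stored yet


def resolve_current_model(project_vars, fallback):
    """Read the stored model ID, falling back only when the key is missing."""
    return project_vars.get("standard", {}).get("current_model", fallback)


def weekly_run(project_vars, best_new_model, new_is_better):
    """Simulate one scenario run against an in-memory stand-in for project variables."""
    current = resolve_current_model(project_vars, SAVED_MODEL_ID)
    deployed = best_new_model if new_is_better else current
    # Persist the deployed ID so the next run compares against it
    project_vars.setdefault("standard", {})["current_model"] = deployed
    return current, deployed


store = {"standard": {}, "local": {}}  # assumed shape of the project variables
week1 = weekly_run(store, "model_456", new_is_better=True)
week2 = weekly_run(store, "model_789", new_is_better=False)
print(week1)  # ('model_123', 'model_456')
print(week2)  # ('model_456', 'model_456')
```

In week 1 the variable is absent, so the baseline is the hard-coded model_123. In week 2 the baseline is model_456, read from the stored variable, even though `SAVED_MODEL_ID` is still set to model_123 at the top of the script.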
-
I got it now. Thank you for clarifying!