How to fully automate model retraining on the most up-to-date training data?
We are trying to build an automated pipeline (via a Scenario) that, among other things, involves retraining our main classification model each time the Scenario runs. Ideally, this retraining should happen on freshly updated training data (the training dataset is refreshed/recalculated earlier in the same Scenario). In practice, however, we have found that the model is not retrained on the most up-to-date version of the training data. Instead, it appears to cache earlier training data (possibly the data that was first used to create/train the model, though I'm not certain) and retrains on that, even though fresher data is available.
When I try to retrain the model manually (by going to the corresponding Analysis and clicking the "Train" button in the top-right corner), I see the following warning (see attached screenshot): "Dataset <dataset name> was updated since (on <updated timestamp>)". It is accompanied by a checkbox labeled "Drop existing sets, recompute new ones". Checking this box and proceeding with the manually started retraining temporarily fixes the issue and forces the model to retrain on the updated training data.
Is there any way to accomplish this "dataset refresh" in an automated fashion/as part of the Scenario? It seems odd to me that the "default" behavior when retraining is to not use the most up-to-date version of the training set...
For further context, in our Scenario, we are retraining the model using a Python step, the most salient parts of which I have copied below:
import dataiku
import pandas as pd
import numpy as np

client = dataiku.api_client()
p = client.get_project(dataiku.default_project_key())  # gets a reference to the current project instance
variables = p.get_variables()
ml_task = p.get_ml_task(...)

...

# re-train and re-deploy model
ids = ml_task.train()
model_to_deploy = ids[0]
ml_task.redeploy_to_flow(model_to_deploy, '<train_recipe_name>', '<model_name>', True)
While we need the Python step for other business/use-case reasons, I have tried adding a Build/Train step for the model in question both before and after the custom Python step; neither approach worked.
I appreciate any help/guidance anyone can provide.
Answers
cedwards036
It looks like the fix described in https://community.dataiku.com/t5/Using-Dataiku/API-analogous-for-quot-Drop-existing-sets-recompute-new-ones/td-p/14422 still works as of DSS 11.1.1; I'm not sure whether there is a better or more "official" way to do this, though.
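For reference, the approach from that thread amounts to incrementing the instanceIdRefresher field in the ML task's split parameters before calling train(); as far as I can tell, this is the API-side equivalent of ticking the "Drop existing sets, recompute new ones" checkbox. A minimal sketch of how it could slot into the Python step above (the analysis/task IDs are placeholders, not values from the original post):

import dataiku

client = dataiku.api_client()
p = client.get_project(dataiku.default_project_key())
ml_task = p.get_ml_task('<analysis_id>', '<ml_task_id>')  # placeholder IDs

# Bump instanceIdRefresher in the split params so that DSS drops the
# existing train/test sets and recomputes them from the current version
# of the training dataset on the next training run
settings = ml_task.get_settings()
settings.get_raw()['splitParams']['instanceIdRefresher'] += 1
settings.save()

# Retrain: the train/test sets are now rebuilt from the refreshed data
ids = ml_task.train()

With this in place, the redeploy_to_flow() call from the original Python step can be kept as-is.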