How to fully automate model retraining on the most up-to-date training data?

cedwards036
Level 1

We are trying to build an automated pipeline (via a Scenario) that, among other things, retrains our main classification model each time the Scenario runs. Ideally, this retraining should happen on freshly updated training data (the training dataset is refreshed/recalculated earlier in the same Scenario). In practice, however, we have found that the model is not retrained on the most up-to-date version of the training data. Instead, it seems to "cache" earlier training data (possibly the data that was first used to create/train the model, though I'm not 100% sure about that) and retrain on that, even though fresher data is available.

When I try to retrain the model manually (by going to the corresponding Analysis and clicking the "Train" button in the top-right corner), I see the following warning (see attached screenshot): "Dataset <dataset name> was updated since (on <updated timestamp>)". It is accompanied by a checkbox labeled "Drop existing sets, recompute new ones". Checking this box and proceeding with the manually started retraining temporarily fixes the issue and forces the model to retrain on the updated training data.

Is there any way to accomplish this "dataset refresh" in an automated fashion, as part of the Scenario? It seems odd to me that the "default" behavior when retraining is not to use the most up-to-date version of the training set...

For further context, in our Scenario, we are retraining the model using a Python step, the most salient parts of which I have copied below:

import dataiku
import pandas as pd
import numpy as np

# get a handle on the current project via the public API
client = dataiku.api_client()
p = client.get_project(dataiku.default_project_key())
variables = p.get_variables()
ml_task = p.get_ml_task(...)

...

# re-train and re-deploy the model; train() waits for training to
# complete and returns the ids of the newly trained models
ids = ml_task.train()
model_to_deploy = ids[0]
ml_task.redeploy_to_flow(model_to_deploy, '<train_recipe_name>', '<model_name>', True)

While we need the Python step for other business/use-case reasons, I have tried adding a Build/Train step for the model in question both before and after the custom Python step; neither approach worked.

I appreciate any help/guidance anyone can provide.

cedwards036
Level 1
Author

It looks like the fix described in https://community.dataiku.com/t5/Using-Dataiku/API-analogous-for-quot-Drop-existing-sets-recompute-n... still works as of 11.1.1; I'm not sure whether there is a better/more "official" way to do this, however.
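
For anyone landing here later, here is a minimal sketch of that workaround as I understand it from the linked thread. The splitParams['instanceIdRefresher'] counter and the get_raw()/save() calls on the ML task settings are the approach described there, not an officially documented API, so treat them as assumptions and verify against your DSS version:

import dataiku

client = dataiku.api_client()
p = client.get_project(dataiku.default_project_key())
ml_task = p.get_ml_task(...)  # same ML task handle as in the snippet above

# Bump the split "refresher" counter so DSS drops the cached train/test
# sets and recomputes them from the current data, mirroring the
# "Drop existing sets, recompute new ones" checkbox in the UI.
# NOTE: 'instanceIdRefresher' is the workaround from the linked thread,
# not a documented public API.
settings = ml_task.get_settings()
split_params = settings.get_raw()['splitParams']
split_params['instanceIdRefresher'] = split_params.get('instanceIdRefresher', 0) + 1
settings.save()

# retraining now runs on the freshly recomputed train/test sets
ids = ml_task.train()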
