How to fully automate model retraining on the most up-to-date training data?

cedwards036 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 2
edited July 16 in Using Dataiku

We are trying to build an automated pipeline (via a Scenario) that, among other things, retrains our main classification model each time the Scenario runs. Ideally, this retraining should use freshly updated training data (the training dataset is rebuilt earlier in the same Scenario). In practice, however, the model is not retrained on the most up-to-date version of the training data. Instead, it seems to reuse a "cached" copy of earlier training data (possibly the data that was first used to create and train the model; I am not 100% sure about that) and retrains on that, even though fresher data is available.

When I try to retrain the model manually (by going to the corresponding Analysis and clicking the "Train" button in the top-right corner), I see the following warning (see attached screenshot): "Dataset <dataset name> was updated since (on <updated timestamp>)". This is accompanied by a checkbox labeled "Drop existing sets, recompute new ones". Checking this box and proceeding with the manually-started retraining temporarily fixes the issue and forces the model to retrain on the updated training data.

Is there any way to accomplish this "dataset refresh" in an automated fashion/as part of the Scenario? It seems odd to me that the "default" behavior when retraining is to not use the most up-to-date version of the training set...

For further context, in our Scenario, we are retraining the model using a Python step, the most salient parts of which I have copied below:

import dataiku
import pandas as pd
import numpy as np

client = dataiku.api_client()
p = client.get_project(dataiku.default_project_key())  # gets a reference to the current project instance
variables = p.get_variables()
ml_task = p.get_ml_task(...)

...

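# re-train the model; train() returns the IDs of the models trained in this session
ids = ml_task.train()
model_to_deploy = ids[0]  # take the first returned model ID
# re-deploy that model to the Flow via the existing training recipe and saved model
ml_task.redeploy_to_flow(model_to_deploy, '<train_recipe_name>', '<model_name>', True)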
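A minimal sketch, not a fix for the stale-data problem: the retrain-and-redeploy part of the step above wrapped in a function, so that it fails with a clear message if train() returns no model IDs instead of raising an IndexError on ids[0]. The names retrain_and_redeploy and FakeMLTask are hypothetical, introduced only so the sketch can run without a DSS instance; it assumes ml_task.train() returns a list of model IDs and that redeploy_to_flow takes the same four positional arguments as in the step above.

```python
def retrain_and_redeploy(ml_task, recipe_name, saved_model_name, activate=True):
    """Train a new session and redeploy the first returned model.

    `ml_task` is assumed to behave like the object obtained from
    p.get_ml_task(...) in the step above.
    """
    ids = ml_task.train()
    if not ids:
        # Fail loudly rather than with an IndexError on ids[0]
        raise RuntimeError("Training returned no model IDs; nothing to redeploy")
    model_to_deploy = ids[0]
    ml_task.redeploy_to_flow(model_to_deploy, recipe_name, saved_model_name, activate)
    return model_to_deploy


class FakeMLTask:
    """Hypothetical stand-in for an ML task, used only to exercise the function."""

    def __init__(self, ids):
        self._ids = ids
        self.redeployed = []  # records every redeploy_to_flow call

    def train(self):
        return list(self._ids)

    def redeploy_to_flow(self, model_id, recipe_name, saved_model_name, activate):
        self.redeployed.append((model_id, recipe_name, saved_model_name, activate))


if __name__ == "__main__":
    fake = FakeMLTask(["model-a", "model-b"])
    deployed = retrain_and_redeploy(fake, "<train_recipe_name>", "<model_name>")
    print("redeployed:", deployed)
```

In the Scenario step itself, the real ml_task would be passed in place of FakeMLTask; the behavior regarding which version of the training data is used is unchanged by this wrapper.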

While we need the Python step for other business/use-case reasons, I have tried adding a Build/Train step for the model in question both before and after the custom Python step; neither approach worked.

I appreciate any help/guidance anyone can provide.
