How to fully automate model retraining on the most up-to-date training data?

edited July 2024 in Using Dataiku

We are trying to build an automated pipeline (via a Scenario) that, among other things, retrains our main classification model each time the Scenario runs. Ideally, this retraining should use freshly updated training data (the training dataset is refreshed/recalculated earlier in the same Scenario). In practice, however, the model is not retrained on the most up-to-date version of the training data. Instead, it seems to "cache" (?) earlier training data (possibly the data that was first used to create/train the model; I'm not 100% sure about that) and retrain on that, even though fresher data is available.

When I retrain the model manually (by going to the corresponding Analysis and clicking the "Train" button in the top-right corner), I see the following warning (see attached screenshot): "Dataset <dataset name> was updated since (on <updated timestamp>)". This is accompanied by a checkbox labeled "Drop existing sets, recompute new ones". Checking this box and proceeding with the manually started retraining fixes the issue for that run and forces the model to retrain on the updated training data.

Is there any way to accomplish this "dataset refresh" in an automated fashion/as part of the Scenario? It seems odd to me that the "default" behavior when retraining is to not use the most up-to-date version of the training set...

For further context, in our Scenario, we are retraining the model using a Python step, the most salient parts of which I have copied below:

import dataiku
import pandas as pd
import numpy as np

client = dataiku.api_client()
p = client.get_project(dataiku.default_project_key())  # gets a reference to the current project instance
variables = p.get_variables()
ml_task = p.get_ml_task(...)

...

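# re-train and re-deploy model
ids = ml_task.train()  # waits for training to finish; returns the ids of the models trained in this session
model_to_deploy = ids[0]  # take the first model from this training session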
ml_task.redeploy_to_flow(model_to_deploy, '<train_recipe_name>', '<model_name>', True)

While we need the Python step for other business/use-case reasons, I have tried adding a Build/Train step for the model in question both before and after the custom Python step; neither approach worked.

I appreciate any help/guidance anyone can provide.

Answers
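
One thing that may be worth trying: the "Drop existing sets, recompute new ones" checkbox appears to correspond to a counter stored in the ML task's split settings, so incrementing that counter from the Python step before calling `ml_task.train()` might force the train/test sets to be re-extracted from the refreshed dataset. The sketch below is only that, a sketch. The key name `instanceIdRefresh` under `splitParams` is an assumption about the raw settings layout and is not confirmed anywhere in this thread, and the helper name `force_split_refresh` is made up for illustration. Printing `ml_task.get_settings().get_raw()['splitParams']` on your own instance is a sensible check before relying on it.

```python
def force_split_refresh(ml_task, refresh_key="instanceIdRefresh"):
    """Bump the split-refresh counter in an ML task's settings and save them.

    ASSUMPTION: the raw settings hold a counter at splitParams[refresh_key].
    Verify the key on your DSS version before relying on this.
    """
    settings = ml_task.get_settings()
    # get_raw() exposes the settings as a plain dict that can be edited in place
    split_params = settings.get_raw()["splitParams"]
    new_value = split_params.get(refresh_key, 0) + 1
    split_params[refresh_key] = new_value
    settings.save()  # persist the change before training starts
    return new_value


# Usage inside the Scenario's Python step (ml_task obtained as in the question):
# force_split_refresh(ml_task)
# ids = ml_task.train()
```

If the counter turns out not to exist under that name, the helper will simply create a new key that DSS may ignore, so compare the training set row count or the model's training date after a run to confirm that fresh data was actually used.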
