Detect model drift for clustering

haoxian · ‎09-16-2020

Hello,

I would like to know whether we can monitor clustering models to detect and handle model drift. For example, with K-means, can we detect drift when new data arrives, and then automatically relaunch the training of K-means (with hyperparameter tuning for the number of clusters K)?

So far, I have tried to use a Scenario, but I cannot integrate my own metrics for scoring new data (detecting new outliers using the centroids and the previously existing data). I also cannot rerun the visual analysis automatically; I can only retrain the model on new data, without changing the hyperparameters defined when the model was deployed.

Could you please help me?

Thank you very much!


simonamaggio · ‎09-22-2020

Hi @haoxian,

Data drift

You can detect data drift by comparing two datasets, the training dataset and a new dataset, regardless of the model you are training. You could use the model drift plugin (see its last two sections, 'Recipe: compute drift between two datasets' and 'Custom recipe: retrieve most recent drift score'). The plugin's drift score indicates how much the data is changing, and you can add checks on its value to trigger retraining. What kind of data are you dealing with? The drift plugin supports text and structured datasets.
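To make the idea of a drift check concrete, here is a minimal, library-agnostic sketch of the kind of custom metric mentioned in the question: the share of new points that lie farther from every existing centroid than most of the training points did. This is not how the drift plugin computes its score. The function names, the 0.95 quantile, and the 0.20 alert level are all assumptions chosen for illustration.

```python
import math


def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def distance_threshold(train_points, centroids, quantile=0.95):
    """Distance under which roughly `quantile` of the training points fall.

    The 0.95 default is an assumption, not a recommended value.
    """
    distances = sorted(
        nearest_centroid_distance(p, centroids) for p in train_points
    )
    index = min(len(distances) - 1, int(quantile * len(distances)))
    return distances[index]


def outlier_rate(new_points, centroids, threshold):
    """Fraction of new points farther than `threshold` from every centroid."""
    far = sum(
        nearest_centroid_distance(p, centroids) > threshold for p in new_points
    )
    return far / len(new_points)


def needs_retraining(new_points, centroids, threshold, max_outlier_rate=0.20):
    """Hypothetical check: flag drift when too many new points are outliers."""
    return outlier_rate(new_points, centroids, threshold) > max_outlier_rate
```

A check like `needs_retraining` could play the same role as a check on the plugin's drift score: when it fails, the retraining step is triggered.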

Hyper-parameter tuning

As you pointed out, you can't change the hyperparameters when retraining a deployed model. But if you already have a custom metric and a custom automatic hyperparameter tuning procedure that works on old and new data, I'd suggest creating a custom clustering recipe through the Dataiku ML API that takes the hyperparameter K as input; your tuning recipe would update K after a drift alert has been triggered.
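As a rough sketch of what such a pair of steps could look like, the code below re-selects K only when drift has been flagged and then refits K-means on the combined data. It uses scikit-learn as a stand-in rather than the Dataiku ML API, and it assumes the silhouette score as the tuning criterion; `select_k`, `retrain`, and the candidate range for K are hypothetical names and values, to be replaced by your own metric and tuning logic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def select_k(X, candidate_ks=range(2, 9), random_state=0):
    """Return (best_k, scores), picking K with the highest silhouette score."""
    scores = {}
    for k in candidate_ks:
        if k >= len(X):
            break  # the silhouette score needs fewer clusters than samples
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


def retrain(X_old, X_new, drift_detected, current_k, random_state=0):
    """Refit K-means on old + new data; re-tune K only after a drift alert."""
    X = np.vstack([X_old, X_new])
    k = select_k(X, random_state=random_state)[0] if drift_detected else current_k
    model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    return model, k
```

Keeping the selection of K separate from the fit mirrors the split suggested above: one recipe updates K after a drift alert, and the clustering recipe simply consumes whatever K it is given.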

Don't hesitate to give us more details about the task you are addressing; hopefully we can help you build the needed pipeline.

Best,

Simona
