
Train the model and automatically deploy the winning algorithm based on the metrics defined.


Hi Team,

I have a use case where, for each iteration, I have to train the same model on a different dataset and automatically deploy the winning model based on a metric. The algorithm used here is K-means clustering in AutoML.

To better illustrate the scenario, I am giving an example below.

Dataset 1 ---> Clustering model A  (winning model is at K = 10) ---> deploy the model for k = 10.

Dataset 2 ---> Clustering model A  (winning model is at K = 3) ---> deploy the model for k = 3.

Dataset 3 ---> Clustering model A  (winning model is at K = 5) ---> deploy the model for k = 5.

At the end I want to merge all the outputs of the deployed models.


Thank you in advance !

1 Reply



The code below uses the Python API to train K-means models for different values of K, then deploys the best-performing model to the Flow. It then creates a scoring recipe to generate cluster labels for all records in the input dataset.


You can wrap it in a for loop to apply it to multiple input datasets. You'll have to change the dataset names, feature names, values of K to try, and the desired performance metric.

import dataiku
from dataiku import pandasutils as pdu
from dataikuapi.dss.recipe import ClusteringScoringRecipeCreator
import pandas as pd
import numpy as np

client = dataiku.api_client()
project = client.get_project('PROJECT_KEY')

# Replace with your input dataset - or wrap the code below in a for loop over multiple datasets
ml_dataset = "INPUT_DATASET_NAME"

# Create a new ML clustering task on the input dataset
mltask = project.create_clustering_ml_task(
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='KMEANS'         # Template to use for setting default parameters
)

settings = mltask.get_settings()

# Set the values of K to try and other clustering settings
algorithm_settings = settings.get_algorithm_settings('KMEANS')
algorithm_settings['k'] = [3, 5, 6]

settings.mltask_settings['preprocessing']['outliers']['method'] = 'DROP'
settings.mltask_settings['preprocessing']['outliers']['min_cum_ratio'] = 0.05
settings.mltask_settings['preprocessing']['outliers']['min_n'] = 2

# Turn features on/off (replace with your own feature names)
for feature in ["FEATURE_1", "FEATURE_2"]:
    settings.use_feature(feature)

settings.save()

# Train the ML task (one model is trained per value of K)
mltask.train()
ids = mltask.get_trained_models_ids()

# Find the highest-scoring model
scores = []
for count, id in enumerate(ids):
    details = mltask.get_trained_model_details(id)

    # Replace 'silhouette' with your chosen metric
    performance = details.get_performance_metrics()['silhouette']

    scores.append({"model_id": id,
                   "performance": performance,
                   "count": count})

scores = sorted(scores, key=lambda k: k['performance'], reverse=True) 

best_model_id = scores[0]['model_id']

# Deploy the best model to the Flow
ret = mltask.deploy_to_flow(best_model_id, "{}_model".format(ml_dataset), ml_dataset)
model_id = ret["savedModelId"]

# Use a scoring recipe with the training dataset and best model to generate cluster labels
builder = ClusteringScoringRecipeCreator("{}_scoring_recipe".format(ml_dataset), project)
builder.with_input_model(model_id)
builder.with_input(ml_dataset)
builder.with_new_output("{}_scored".format(ml_dataset), "filesystem_managed", format_option_id="CSV_EXCEL_GZIP")

cluster_recipe = builder.build()
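
For the final step in your question - merging the outputs of all the deployed models - here's a minimal sketch, assuming each scored dataset is named `<dataset>_scored` as in the code above (the helper name and the `source_dataset` column are my own additions):

```python
import pandas as pd

def merge_scored_outputs(named_dfs):
    """Concatenate scored outputs, tagging each row with the dataset it came from.

    named_dfs: dict mapping dataset name -> DataFrame of scored rows.
    """
    frames = []
    for name, df in named_dfs.items():
        tagged = df.copy()
        tagged["source_dataset"] = name  # remember which model/dataset produced the row
        frames.append(tagged)
    return pd.concat(frames, ignore_index=True)

# Inside Dataiku you would first read each "<name>_scored" dataset, e.g.:
# import dataiku
# named_dfs = {name: dataiku.Dataset("{}_scored".format(name)).get_dataframe()
#              for name in ["DATASET_1", "DATASET_2", "DATASET_3"]}
# merged = merge_scored_outputs(named_dfs)
```

If the three datasets share a schema, you could also do the merge in the Flow with a Stack recipe instead of Python.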