
Train the model and automatically deploy the winning algorithm based on the metrics defined.

Tsurapaneni
Level 3

Hi Team,

I have a use case where, on each iteration, I need to train the same model on a different dataset and automatically deploy the winning model based on a metric. The algorithm is K-means clustering in AutoML.

To better illustrate the scenario, I am giving an example below.

Dataset 1 ---> Clustering model A  (winning model is at K = 10) ---> deploy the model for k = 10.

Dataset 2 ---> Clustering model A  (winning model is at K = 3) ---> deploy the model for k = 3.

Dataset 3 ---> Clustering model A  (winning model is at K = 5) ---> deploy the model for k = 5.

At the end, I want to merge the outputs of all the deployed models.

Thank you in advance!

pmasiphelps
Dataiker

Hi,

The code below uses the Python API to train K-means models for several values of K and then deploys the best-performing model to the flow. It then creates a scoring recipe that generates cluster labels for every record in the input dataset.

 

You can wrap it in a for loop to apply to multiple input datasets. You'll have to change the dataset names, feature names, values of K to try, and desired performance metric.
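
As a rough sketch of that loop, assuming the per-dataset steps in the script that follows are moved into a single function: the function name `train_deploy_and_score`, the config keys, and all dataset, feature, and K values here are hypothetical placeholders, not part of the DSS API.

```python
# Hypothetical per-dataset settings; every name and value here is a placeholder.
DATASET_CONFIGS = [
    {"dataset": "DATASET_1", "features": ["FEATURE_1", "FEATURE_2"], "k_values": [5, 10, 15]},
    {"dataset": "DATASET_2", "features": ["FEATURE_1", "FEATURE_3"], "k_values": [2, 3, 4]},
]


def run_all(configs, train_deploy_and_score):
    """Run the per-dataset pipeline once per config and collect the outputs.

    train_deploy_and_score is a caller-supplied (hypothetical) function that
    wraps the DSS script for one dataset and returns the scored dataset name.
    """
    scored_outputs = {}
    for cfg in configs:
        scored_outputs[cfg["dataset"]] = train_deploy_and_score(
            cfg["dataset"], cfg["features"], cfg["k_values"]
        )
    return scored_outputs
```

Passing the pipeline in as a function keeps the loop independent of DSS, so it can be tried with a stub before pointing it at a real project.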

import dataiku
from dataiku import pandasutils as pdu
from dataikuapi.dss.recipe import ClusteringScoringRecipeCreator
import pandas as pd
import numpy as np

client = dataiku.api_client()
project = client.get_project('PROJECT_KEY')

#Replace with your input dataset - or can wrap the below code in a for loop on multiple datasets
ml_dataset = "INPUT_DATASET_NAME"

#Create a new ML clustering task on input dataset
mltask = project.create_clustering_ml_task(
    input_dataset=ml_dataset,
    ml_backend_type='PY_MEMORY', # ML backend to use
    guess_policy='KMEANS', # Template to use for setting default parameters
    wait_guess_complete=True
)

settings = mltask.get_settings()

#Set K and other clustering settings
algorithm_settings = settings.get_algorithm_settings('KMEANS')
algorithm_settings['k'] = [3,5,6]

settings.mltask_settings['preprocessing']['outliers']['method'] = 'DROP'
settings.mltask_settings['preprocessing']['outliers']['min_cum_ratio'] = 0.05
settings.mltask_settings['preprocessing']['outliers']['min_n'] = 2

settings.save()

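#Turn features on/off: reject every feature, then re-enable the ones to cluster on
features = list(settings.mltask_settings['preprocessing']['per_feature'].keys())
for feature in features:
    settings.reject_feature(feature)
settings.use_feature('FEATURE_1')
settings.use_feature('FEATURE_2')
settings.save()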


#Train ML task
mltask.start_train()
mltask.wait_train_complete()

ids = mltask.get_trained_models_ids()

#Find the highest-scoring model
scores = []
for full_model_id in ids:
    details = mltask.get_trained_model_details(full_model_id)

    #Replace with your chosen metric
    performance = details.get_performance_metrics()['silhouette']

    scores.append({"model_id": full_model_id,
                   "performance": performance})
    print('-----------Performance-----------')
    print(performance)

#Sort descending because a higher silhouette is better; use reverse=False for a metric where lower is better
scores = sorted(scores, key=lambda s: s['performance'], reverse=True)

best_model_id = scores[0]['model_id']

#Deploy the best model to the flow
ret = mltask.deploy_to_flow(best_model_id, "{}_model".format(ml_dataset), ml_dataset)
model_id = ret["savedModelId"]

#Use a scoring recipe with the training dataset and the deployed (saved) model to generate cluster labels
builder = ClusteringScoringRecipeCreator("{}_scoring_recipe".format(ml_dataset), project)
builder.with_input_model(model_id) #Saved model ID returned by deploy_to_flow, not the ML task model ID
builder.with_input(ml_dataset)

builder.with_new_output("{}_scored".format(ml_dataset),"filesystem_managed", format_option_id="CSV_EXCEL_GZIP")

cluster_recipe = builder.build()
cluster_recipe.compute_schema_updates().apply()
cluster_recipe.run()
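
For the final merge asked about in the question, one possible approach is to stack the scored outputs and tag each row with its source, since every model numbers its clusters independently and the same label from two models does not mean the same cluster. The sketch below is plain Python over lists of row dictionaries. The helper name `merge_scored_rows` is hypothetical, and the default label column name `cluster_labels` is an assumption about the scoring recipe's output that should be checked against your scored dataset.

```python
def merge_scored_rows(rows_by_dataset, label_column="cluster_labels"):
    """Stack scored rows from several datasets into one list.

    rows_by_dataset maps a source dataset name to a list of row dicts.
    Each output row gains a 'source_dataset' column, and its cluster label
    is prefixed with the source name so labels from different models stay
    distinct. The label column name is an assumption; adjust as needed.
    """
    merged = []
    for source, rows in rows_by_dataset.items():
        for row in rows:
            out = dict(row)  # copy so the caller's rows are not mutated
            out["source_dataset"] = source
            if label_column in out:
                out[label_column] = "{}:{}".format(source, out[label_column])
            merged.append(out)
    return merged
```

Inside DSS, the rows could come from `dataiku.Dataset(name).get_dataframe()` converted with `to_dict("records")`. A Stack recipe in the flow is another way to get a similar result without code.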

 

Best,

Pat
