Hi Team,
I have a use case where, for each iteration, I have to train the same model on a different dataset and automatically deploy the winning model based on a metric. The algorithm used here is K-means clustering in AutoML.
To better illustrate the scenario, here is an example:
Dataset 1 ---> Clustering model A (winning model is at K = 10) ---> deploy the model for K = 10.
Dataset 2 ---> Clustering model A (winning model is at K = 3) ---> deploy the model for K = 3.
Dataset 3 ---> Clustering model A (winning model is at K = 5) ---> deploy the model for K = 5.
At the end I want to merge all the outputs of the deployed model.
Thank you in advance !
Hi,
The code below uses the Python API to train K-means models for different values of K, then deploys the best-performing model to the Flow. Finally, it creates a scoring recipe to generate cluster labels for all records in the input dataset.
You can wrap it in a for loop to apply it to multiple input datasets. You'll need to change the dataset name, feature names, values of K to try, and the desired performance metric.
import dataiku
from dataiku import pandasutils as pdu
from dataikuapi.dss.recipe import ClusteringScoringRecipeCreator
import pandas as pd
import numpy as np
client = dataiku.api_client()
project = client.get_project('PROJECT_KEY')
#Replace with your input dataset - or can wrap the below code in a for loop on multiple datasets
ml_dataset = "INPUT_DATASET_NAME"
#Create a new ML clustering task on input dataset
mltask = project.create_clustering_ml_task(
    input_dataset=ml_dataset,
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='KMEANS',        # Template to use for setting default parameters
    wait_guess_complete=True
)
settings = mltask.get_settings()
#Set K and other clustering settings
algorithm_settings = settings.get_algorithm_settings('KMEANS')
algorithm_settings['k'] = [3,5,6]
settings.mltask_settings['preprocessing']['outliers']['method'] = 'DROP'
settings.mltask_settings['preprocessing']['outliers']['min_cum_ratio'] = 0.05
settings.mltask_settings['preprocessing']['outliers']['min_n'] = 2
settings.save()
#Turn features on/off: reject all detected features first, then enable the ones to cluster on
for feature in settings.mltask_settings['preprocessing']['per_feature']:
    settings.reject_feature(feature)
settings.use_feature('FEATURE_1')
settings.use_feature('FEATURE_2')
settings.save()
#Train ML task
mltask.start_train()
mltask.wait_train_complete()
ids = mltask.get_trained_models_ids()
#Find highest scoring cluster
scores = []
count = 0
for id in ids:
    details = mltask.get_trained_model_details(id)
    #Replace with your chosen metric
    performance = details.get_performance_metrics()['silhouette']
    scores.append({"model_id": id,
                   "performance": performance,
                   "count": count})
    print('-----------Performance-----------')
    print(performance)
    count += 1
scores = sorted(scores, key=lambda k: k['performance'], reverse=True)
best_model_id = scores[0]['model_id']
#Deploy the best model to the flow
ret = mltask.deploy_to_flow(best_model_id, "{}_model".format(ml_dataset), ml_dataset)
model_id = ret["savedModelId"]
#Use a scoring recipe with the training dataset and best model to generate cluster labels
builder = ClusteringScoringRecipeCreator("{}_scoring_recipe".format(ml_dataset), project)
builder.with_input_model(model_id)  # the saved model deployed to the flow, not the ML task model id
builder.with_input(ml_dataset)
builder.with_new_output("{}_scored".format(ml_dataset), "filesystem_managed", format_option_id="CSV_EXCEL_GZIP")
cluster_recipe = builder.build()
cluster_recipe.compute_schema_updates().apply()
cluster_recipe.run()
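For the final step of merging the outputs of the deployed models, you can use a Stack recipe in the Flow, or concatenate the scored datasets with pandas. Here is a minimal sketch using plain pandas; the dataset names, ids, and the `cluster_labels` column are placeholders (in Dataiku each DataFrame would come from `dataiku.Dataset("..._scored").get_dataframe()`):

```python
import pandas as pd

# Hypothetical scored outputs of the three deployed models
scored_1 = pd.DataFrame({"id": [1, 2], "cluster_labels": ["cluster_0", "cluster_1"]})
scored_2 = pd.DataFrame({"id": [3], "cluster_labels": ["cluster_2"]})
scored_3 = pd.DataFrame({"id": [4], "cluster_labels": ["cluster_0"]})

# Tag each output with its source dataset so rows stay traceable after the merge
merged = pd.concat(
    [df.assign(source_dataset=name)
     for name, df in [("dataset_1", scored_1),
                      ("dataset_2", scored_2),
                      ("dataset_3", scored_3)]],
    ignore_index=True,
)
print(merged.shape)  # (4, 3)
```

Note that cluster labels are only meaningful within each model's own dataset, which is why the source column is kept: "cluster_0" from dataset 1 and "cluster_0" from dataset 3 are unrelated clusters.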
Best,
Pat