Train the model and automatically deploy the winning model based on the defined metrics.
Hi Team,
I have a use case where, for each iteration, I have to train the same model on a different dataset and have the winning model deploy automatically based on a metric. The algorithm used here is K-means clustering in AutoML.
To better illustrate the scenario, I am giving an example below.
Dataset 1 ---> Clustering model A (winning model is at K = 10) ---> deploy the model for K = 10.
Dataset 2 ---> Clustering model A (winning model is at K = 3) ---> deploy the model for K = 3.
Dataset 3 ---> Clustering model A (winning model is at K = 5) ---> deploy the model for K = 5.
At the end, I want to merge the outputs of all the deployed models.
Thank you in advance!
Answers
Hi,
The code below uses the Python API to train K-means models for different values of K and deploys the best-performing model to the Flow. It then creates a scoring recipe to generate cluster labels for all records in the input dataset.
You can wrap it in a for loop to apply it to multiple input datasets (see the sketch after the code). You'll have to change the dataset names, feature names, values of K to try, and the desired performance metric.
import dataiku
from dataikuapi.dss.recipe import ClusteringScoringRecipeCreator

client = dataiku.api_client()
project = client.get_project('PROJECT_KEY')

# Replace with your input dataset - or wrap the code below in a for loop over multiple datasets
ml_dataset = "INPUT_DATASET_NAME"

# Create a new clustering ML task on the input dataset
mltask = project.create_clustering_ml_task(
    input_dataset=ml_dataset,
    ml_backend_type='PY_MEMORY',  # ML backend to use
    guess_policy='KMEANS',        # Template to use for setting default parameters
    wait_guess_complete=True
)

settings = mltask.get_settings()

# Set the values of K to try and other clustering settings
algorithm_settings = settings.get_algorithm_settings('KMEANS')
algorithm_settings['k'] = [3, 5, 6]
settings.mltask_settings['preprocessing']['outliers']['method'] = 'DROP'
settings.mltask_settings['preprocessing']['outliers']['min_cum_ratio'] = 0.05
settings.mltask_settings['preprocessing']['outliers']['min_n'] = 2
settings.save()

# Turn features on/off: reject everything, then re-enable the features you want
for feature in settings.mltask_settings['preprocessing']['per_feature']:
    settings.reject_feature(feature)
settings.use_feature('FEATURE_1')
settings.use_feature('FEATURE_2')
settings.save()

# Train the ML task (one model per value of K)
mltask.start_train()
mltask.wait_train_complete()
ids = mltask.get_trained_models_ids()

# Find the highest-scoring model - replace 'silhouette' with your chosen metric
scores = []
for trained_id in ids:
    details = mltask.get_trained_model_details(trained_id)
    performance = details.get_performance_metrics()['silhouette']
    scores.append({"model_id": trained_id, "performance": performance})
    print('-----------Performance-----------')
    print(performance)
scores = sorted(scores, key=lambda s: s['performance'], reverse=True)
best_model_id = scores[0]['model_id']

# Deploy the best model to the Flow
ret = mltask.deploy_to_flow(best_model_id, "{}_model".format(ml_dataset), ml_dataset)
saved_model_id = ret["savedModelId"]

# Use a scoring recipe with the training dataset and the deployed (saved) model
# to generate cluster labels for all records
builder = ClusteringScoringRecipeCreator("{}_scoring_recipe".format(ml_dataset), project)
builder.with_input_model(saved_model_id)
builder.with_input(ml_dataset)
builder.with_new_output("{}_scored".format(ml_dataset), "filesystem_managed",
                        format_option_id="CSV_EXCEL_GZIP")
cluster_recipe = builder.build()
cluster_recipe.compute_schema_updates().apply()
cluster_recipe.run()
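To cover the for loop and the final merge you asked about, here is a minimal sketch. It assumes you refactor the code above into a hypothetical function train_and_score(ml_dataset) that runs the train/deploy/score steps and returns the name of the scored output; the input dataset names and the "merged_clusters" output are placeholders, and the merged dataset must already exist in the Flow (e.g. as the output of the recipe running this code).

import dataiku
import pandas as pd

def train_and_score(ml_dataset):
    # Hypothetical wrapper: run the training / deployment / scoring steps
    # from the code above for this dataset, then return the scored output name.
    ...
    return "{}_scored".format(ml_dataset)

input_datasets = ["INPUT_DATASET_1", "INPUT_DATASET_2", "INPUT_DATASET_3"]

scored_frames = []
for ml_dataset in input_datasets:
    scored_name = train_and_score(ml_dataset)
    df = dataiku.Dataset(scored_name).get_dataframe()
    df["source_dataset"] = ml_dataset  # remember which input each row came from
    scored_frames.append(df)

# Stack all scored outputs into a single dataset
merged = pd.concat(scored_frames, ignore_index=True)
dataiku.Dataset("merged_clusters").write_with_schema(merged)

Note that each deployed model produces its own independent cluster labels (each scored dataset has its own cluster 0, cluster 1, ...), so keeping a source_dataset column makes the merged output unambiguous.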
Best,
Pat