Train the model and automatically deploy the winning algorithm based on the metrics defined.

Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

Hi Team,

I have a use case where, on each iteration, I need to train the same model on a different dataset and automatically deploy the winning model based on a chosen metric. The algorithm used here is K-means clustering in AutoML.

To better illustrate the scenario, I am giving an example below.

Dataset 1 ---> Clustering model A (winning model is at K = 10) ---> deploy the model for k = 10.

Dataset 2 ---> Clustering model A (winning model is at K = 3) ---> deploy the model for k = 3.

Dataset 3 ---> Clustering model A (winning model is at K = 5) ---> deploy the model for k = 5.

At the end, I want to merge the outputs of all the deployed models.

Thank you in advance!

Answers

  • pmasiphelps
    pmasiphelps Dataiker, Dataiku DSS Core Designer, Registered Posts: 33 Dataiker
    edited July 17

    Hi,

    The code below uses the Python API to train K-means models for different values of K, then deploys the best-performing model to the flow. It then creates a scoring recipe to generate cluster labels for all records in the input dataset.

    You can wrap it in a for loop to apply it to multiple input datasets. You'll have to change the dataset names, feature names, values of K to try, and desired performance metric.

    import dataiku
    from dataiku import pandasutils as pdu
    from dataikuapi.dss.recipe import ClusteringScoringRecipeCreator
    import pandas as pd
    import numpy as np
    
    client = dataiku.api_client()
    project = client.get_project('PROJECT_KEY')
    
    #Replace with your input dataset - or can wrap the below code in a for loop on multiple datasets
    ml_dataset = "INPUT_DATASET_NAME"
    
    #Create a new ML clustering task on input dataset
    mltask = project.create_clustering_ml_task(
        input_dataset=ml_dataset,
        ml_backend_type='PY_MEMORY', # ML backend to use
        guess_policy='KMEANS', # Template to use for setting default parameters
        wait_guess_complete=True
    )
    
    settings = mltask.get_settings()
    
    #Set K and other clustering settings
    algorithm_settings = settings.get_algorithm_settings('KMEANS')
    algorithm_settings['k'] = [3,5,6]
    
    settings.mltask_settings['preprocessing']['outliers']['method'] = 'DROP'
    settings.mltask_settings['preprocessing']['outliers']['min_cum_ratio'] = 0.05
    settings.mltask_settings['preprocessing']['outliers']['min_n'] = 2
    
    settings.save()
    
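    #Turn features on/off: reject every feature, then enable only the ones to cluster on
    features = list(settings.mltask_settings['preprocessing']['per_feature'].keys())
    for feature in features:
        settings.reject_feature(feature)
    settings.use_feature('FEATURE_1')
    settings.use_feature('FEATURE_2')
    settings.save()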
    
    
    #Train ML task
    mltask.start_train()
    mltask.wait_train_complete()
    
    ids = mltask.get_trained_models_ids()
    
    #Find the highest-scoring model
    scores = []
    for trained_id in ids:
        details = mltask.get_trained_model_details(trained_id)
    
        #Replace with your chosen metric (for silhouette, higher is better)
        performance = details.get_performance_metrics()['silhouette']
    
        scores.append({"model_id": trained_id,
                       "performance": performance})
        print('-----------Performance-----------')
        print(performance)
    
    #Sort in descending order so the best model comes first
    scores = sorted(scores, key=lambda s: s['performance'], reverse=True)
    
    best_model_id = scores[0]['model_id']
    
    #Deploy the best model to the flow
    ret = mltask.deploy_to_flow(best_model_id, "{}_model".format(ml_dataset), ml_dataset)
    model_id = ret["savedModelId"]
    
    #Use a scoring recipe with the training dataset and the deployed (saved) model to generate cluster labels
    builder = ClusteringScoringRecipeCreator("{}_scoring_recipe".format(ml_dataset), project)
    #The recipe takes the saved model created by deploy_to_flow, not the ML task model id
    builder.with_input_model(model_id)
    builder.with_input(ml_dataset)
    
    builder.with_new_output("{}_scored".format(ml_dataset),"filesystem_managed", format_option_id="CSV_EXCEL_GZIP")
    
    cluster_recipe = builder.build()
    cluster_recipe.compute_schema_updates().apply()
    cluster_recipe.run()
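
The script sorts in descending order, which suits silhouette (higher is better). If you switch to a metric where lower is better, such as inertia, the direction has to flip. A minimal sketch of that selection step, runnable without DSS: `pick_best_model` and the sample scores are hypothetical names and values made up for illustration, and skipping entries with a missing metric is an assumption rather than documented DSS behaviour.

```python
from typing import Dict, List


def pick_best_model(scores: List[Dict], higher_is_better: bool = True) -> str:
    """Return the model_id of the best entry in `scores`.

    Each entry is a dict with "model_id" and "performance" keys, the same
    shape the script builds. Entries whose performance is None (a missing
    metric) are skipped.
    """
    valid = [s for s in scores if s["performance"] is not None]
    if not valid:
        raise ValueError("no trained model reported the chosen metric")
    # max for metrics like silhouette, min for metrics like inertia
    chooser = max if higher_is_better else min
    return chooser(valid, key=lambda s: s["performance"])["model_id"]


# Hypothetical scores, for illustration only
example_scores = [
    {"model_id": "model-k3", "performance": 0.41},
    {"model_id": "model-k5", "performance": 0.57},
    {"model_id": "model-k6", "performance": None},
]
best_when_higher_wins = pick_best_model(example_scores)
best_when_lower_wins = pick_best_model(example_scores, higher_is_better=False)
```

In the script, `best_model_id = pick_best_model(scores)` would take the place of the sort and the `scores[0]` lookup.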
    
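For several input datasets, one possible shape for the for loop is to move the script into a function that takes the dataset name and returns the name of its scored output. The sketch below assumes such a function exists. `run_for_datasets` and `fake_pipeline` are hypothetical names, and `fake_pipeline` only stands in for the real DSS calls so that the loop itself can be run anywhere.

```python
from typing import Callable, Dict, Iterable


def run_for_datasets(dataset_names: Iterable[str],
                     train_deploy_score: Callable[[str], str]) -> Dict[str, str]:
    """Run the per-dataset pipeline and map each input dataset to its scored output."""
    scored = {}
    for name in dataset_names:
        # Each call trains, deploys the best model, and scores one dataset
        scored[name] = train_deploy_score(name)
    return scored


def fake_pipeline(ml_dataset: str) -> str:
    # Stand-in for the script: returns the output name it would create
    return "{}_scored".format(ml_dataset)


outputs = run_for_datasets(["dataset_1", "dataset_2", "dataset_3"], fake_pipeline)
```

The returned mapping lists the scored datasets to merge at the end, for example with a Stack recipe in the flow.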

    Best,

    Pat
