Using Scenarios to automatically retrain models

StephenEaster
StephenEaster Registered Posts: 1 ✭✭✭

Hi,

I'm new to Dataiku and the community, and I'm using Dataiku online. The documentation says that scenarios can be used to "Automate the retraining of “saved models” on a regular basis, and only activate the new version if the performance is improved". This is exactly what I need to set up, but I can't find examples showing the specific steps I would set up in a scenario to accomplish this. I'm attaching my current flow (I got an error pasting a screenshot). Any references to documentation or working examples would be great.

Thank you


Operating system used: windows



Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,349 Dataiker
    edited July 2024

    Hi @StephenEaster,

    Thanks for posting here. I'm sharing the solution we implemented on your end with the Community.

    We added a custom Python step to a scenario:

    import dataiku
    
    # Define variables necessary to run this code (Please assign your own environment's values)
    ANALYSIS_ID = 'XXXXXXX' # The identifier of the visual analysis containing the desired ML task
    ML_TASK_ID = 'XXXXXXX' # The identifier of the desired ML task
    SAVED_MODEL_ID = 'S-RESPONSEMODELING-ngsDEF4J-1656449809436' # Fallback model identifier for the first run of the scenario; later runs read the ID from project variables
    TRAINING_RECIPE_NAME = 'train_Predict_Revenue_NDays_Company__regression_' # Name of the training recipe to update for redeploying the model
    
    # client is a DSS API client.
    client = dataiku.api_client()
    p = client.get_project(dataiku.default_project_key())
    
    # Retrieve existing ML task to retrain the model
    mltask = p.get_ml_task(ANALYSIS_ID, ML_TASK_ID)
    
    # Wait for the ML task to be ready
    mltask.wait_guess_complete()
    
    # Start train and wait for it to be complete
    mltask.start_train()
    mltask.wait_train_complete()
    
    # Get the identifiers of the trained models
    # One identifier per trained model; the number depends on which algorithms are enabled in the ML task
    ids = mltask.get_trained_models_ids()
    
    # Iterate over the trained models to find the one with the best value of the chosen metric (e.g. r2, auc, f1).
    actual_metric = "r2"
    
    temp_auc = float("-inf")  # best metric value seen so far; -inf because metrics such as r2 can be negative
    for model_id in ids:
        details = mltask.get_trained_model_details(model_id)
        algorithm = details.get_modeling_settings()["algorithm"]
        auc = details.get_performance_metrics()[actual_metric]
    
        if auc > temp_auc:
            temp_auc = auc  # update the running best so later models are compared against it
            best_model = model_id
            print("Better model identified")
    
        print("Algorithm=%s actual_metric=%s" % (algorithm, auc))
    
    # Let's compare the "best" model of the newly trained model vs the existing model to see which is better
    
    details = mltask.get_trained_model_details(best_model)
    auc = details.get_performance_metrics()[actual_metric]
    
    # We'll need to pull the current model ID from project variables and retrieve the model info
    vars = p.get_variables()
    
    try:
        current_model = vars["standard"]["current_model"]
    except KeyError:
        # The variable has not been set yet (first run), so fall back to the hard-coded ID
        current_model = SAVED_MODEL_ID
    
    current_details = mltask.get_trained_model_details(current_model)
    current_auc = current_details.get_performance_metrics()[actual_metric]
    
    # Deploy whichever model scores best on the chosen metric (either new or existing)
    if auc > current_auc:
        model_to_deploy = best_model
    else:
        model_to_deploy = current_model
    
    print("Model to deploy identified: " + model_to_deploy)
    
    # Update project variables to reflect the new model ID that is being deployed
    vars["standard"]["current_model"] = model_to_deploy
    p.set_variables(vars)
    
    # Deploy the model to the Flow
    ret = mltask.redeploy_to_flow(model_to_deploy, recipe_name = TRAINING_RECIPE_NAME, activate = True)

    This assumes the model has already been trained and the winning model has been deployed to the Flow. To obtain the required variables:

    [Two screenshots in the original post]

    The Saved Model ID is that of the model that is already deployed. [Screenshot in the original post]

    Regards,

  • COREY
    COREY Registered Posts: 18 ✭✭✭✭

    @AlexT


    Hi Alex,

    Looking to leverage the solution you've provided above to do automated weekly retraining of a model. Is there a way to make the value for 'SAVED_MODEL_ID' more dynamic? I ask because I worry that explicitly coding in the 'saved_model_id' won't be effective in future retrainings, if a new model ID performs better somewhere down the line.

    Here is an example:

    - Initially deployed model ID is model_123

    - On initial retraining step, all potential models are evaluated against model_123

    - Model_456 performs better than model_123 and is automatically deployed

    - In the following week of retraining, the R-squared values would be compared against the initially deployed model (123) instead of the newly deployed model (456).

    I appreciate your help in advance!

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,349 Dataiker

    The example provided uses a project variable to keep track of the current best model.
    So the variable is updated to model 456, and the next run compares against model 456, not 123.

    # Update project variables to reflect the new model ID that is being deployed
    vars["standard"]["current_model"] = model_to_deploy
    p.set_variables(vars)
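
The behaviour described above can be illustrated with a minimal sketch outside DSS. It is not the scenario code itself: a plain dict stands in for the project variables, the metric values are made up, and `weekly_run` is a hypothetical helper named here only for illustration.

```python
# Illustration only: a plain dict stands in for the DSS project variables,
# and metric values are passed in directly instead of being read from an ML task.

def weekly_run(project_vars, scores, new_candidates, fallback_model_id):
    """Pick the model to deploy and record it in project_vars.

    project_vars      -- dict shaped like {"standard": {...}}
    scores            -- dict mapping model ID -> metric value (higher is better)
    new_candidates    -- IDs of the models trained during this run
    fallback_model_id -- hard-coded ID, used only if no earlier run stored one
    """
    current = project_vars["standard"].get("current_model", fallback_model_id)
    best_new = max(new_candidates, key=lambda m: scores[m])
    winner = best_new if scores[best_new] > scores[current] else current
    project_vars["standard"]["current_model"] = winner
    return winner


# Made-up scores, chosen so that model_789 beats model_123 but not model_456.
scores = {"model_123": 0.60, "model_456": 0.70, "model_789": 0.65}
project_vars = {"standard": {}}

week1 = weekly_run(project_vars, scores, ["model_456"], "model_123")
week2 = weekly_run(project_vars, scores, ["model_789"], "model_123")
print(week1, week2)
```

In week 2 the candidate model_789 is rejected because it is compared against model_456 (read from the variable), even though the hard-coded fallback is still model_123.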


  • COREY
    COREY Registered Posts: 18 ✭✭✭✭
    edited July 2024

    @AlexT

    Thanks, Alex.

    If I run this first block of code the following week, won't all the project variables revert back to what is coded here?

    # Define variables necessary to run this code (Please assign your own environment's values)
    ANALYSIS_ID = 'XXXXXXX' # The identifier of the visual analysis containing the desired ML task
    ML_TASK_ID = 'XXXXXXX' # The identifier of the desired ML task
    SAVED_MODEL_ID = 'S-RESPONSEMODELING-ngsDEF4J-1656449809436' # Fallback model identifier for the first run of the scenario; later runs read the ID from project variables
    TRAINING_RECIPE_NAME = 'train_Predict_Revenue_NDays_Company__regression_' # Name of the training recipe to update for redeploying the model
  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,349 Dataiker
    edited July 2024

    It will only fall back to the initial hard-coded value if the project variable is not present:

    try:
        current_model = vars["standard"]["current_model"]
    except KeyError:
        current_model = SAVED_MODEL_ID

  • COREY
    COREY Registered Posts: 18 ✭✭✭✭

    I got it now. Thank you for clarifying!

  • COREY
    COREY Registered Posts: 18 ✭✭✭✭

    @Alexandru

    Hi Alex - I'm in the process of applying this retraining code to other scenarios that we have in place, but I'm starting to receive warnings and failures on the retraining step that I haven't seen before and I'm not sure I quite understand the cause.

    The log tail says the following:

    2024-09-17 12:17:12,223 459023 INFO [Child] opened stderr
    2024-09-17 12:17:12,223 459023 INFO [Child] about to close other fd
    2024-09-17 12:17:12,223 459023 INFO [Child] closed other fd
    2024-09-17 12:17:12,223 459023 INFO [Child] chdired
    2024-09-17 12:17:12,223 459023 INFO setting username=dssuser_Corey_A_7de0ab31 uid=1031 gid=1001
    2024-09-17 12:17:12,225 459023 INFO [Child] dropped privileges
    2024-09-17 12:17:12,226 459023 INFO [Child] Checking access to DKUINSTALLDIR and DIP_HOME directories
    2024-09-17 12:17:12,226 459023 INFO [Child] Executing: /data/dataiku/dss_data/code-envs/python/BruceC_python_nba/bin/python : /data/dataiku/dss_data/code-envs/python/BruceC_python_nba/bin/python -u /data/dataiku/dss_data/scenarios/NEW_CUSTOMER_LTV_SCORING/New_Customer_LTV_Test_Scenario/2024-09-17-12-06-22-321/custom-step-Step #0-H: Retrain model/script.py
    Better model identified
    Algorithm=EXTRA_TREES actual_metric=0.6899158968011293
    Better model identified
    Algorithm=XGBOOST_REGRESSION actual_metric=0.49536019840601764
    Better model identified
    Algorithm=XGBOOST_REGRESSION actual_metric=0.6067963591367262
    Better model identified
    Algorithm=XGBOOST_REGRESSION actual_metric=0.5081919720514062
    Better model identified
    Algorithm=XGBOOST_REGRESSION actual_metric=0.5197821734190259
    Better model identified
    Algorithm=EXTRA_TREES actual_metric=0.674242617744111
    Traceback (most recent call last):
      File "/data/dataiku/dss_data/scenarios/NEW_CUSTOMER_LTV_SCORING/New_Customer_LTV_Test_Scenario/2024-09-17-12-06-22-321/custom-step-Step #0-H: Retrain model/script.py", line 34, in <module>
        auc = details.get_performance_metrics()[actual_metric]
    KeyError: 'r2'

    The step log says this:

    [2024/09/17-12:17:26.193] [AsyncCloser-861] [INFO] [com.dataiku.dip.analysis.ml.distributed.workers]  - Closing worker pool pool-kpgpnlf6e4ghxl5b
    [2024/09/17-12:17:26.193] [AsyncCloser-861] [INFO] [com.dataiku.dip.analysis.ml.distributed.workers] - Unregistered worker pool: pool-kpgpnlf6e4ghxl5b
    [2024/09/17-12:17:26.193] [AsyncCloser-861] [INFO] [dku.engine] - Successfully closed WorkerPool
    [2024/09/17-12:17:28.984] [AsyncCloser-861] [INFO] [com.dataiku.dip.analysis.ml.distributed.workers] - Closing worker pool pool-zjtneqhoh1djphlm
    [2024/09/17-12:17:28.984] [AsyncCloser-861] [INFO] [com.dataiku.dip.analysis.ml.distributed.workers] - Unregistered worker pool: pool-zjtneqhoh1djphlm
    [2024/09/17-12:17:28.985] [AsyncCloser-861] [INFO] [dku.engine] - Successfully closed WorkerPool

    I appreciate your help in advance.
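
Note: the traceback shows that the metrics dictionary for one of the trained models has no 'r2' key. The thread does not say why. The log also prints "Better model identified" for every model, including ones that score lower than an earlier one, which suggests the script that ran was not tracking a running best. Below is a defensive sketch of the selection step, assuming the metrics for each model are available as plain dicts. `pick_best` and the sample data are hypothetical and not part of the Dataiku API.

```python
def pick_best(metrics_by_model, metric_name):
    """Return (best_model_id, best_value, skipped_ids).

    Models whose metrics dict lacks metric_name are skipped and reported
    instead of raising KeyError.
    """
    best_id, best_value, skipped = None, float("-inf"), []
    for model_id, metrics in metrics_by_model.items():
        value = metrics.get(metric_name)
        if value is None:
            skipped.append(model_id)  # metric missing for this model
            continue
        if value > best_value:
            best_id, best_value = model_id, value  # update the running best
    return best_id, best_value, skipped


# Made-up metrics; the third entry deliberately lacks "r2".
metrics_by_model = {
    "extra_trees": {"r2": 0.69, "rmse": 12.0},
    "xgboost": {"r2": 0.61, "rmse": 14.0},
    "incomplete": {"rmse": 20.0},
}
best_id, best_value, skipped = pick_best(metrics_by_model, "r2")
print(best_id, best_value, skipped)
```

Printing the keys of the metrics dictionary for the skipped models would show which metrics are actually available for them.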
