Using Scenarios to automatically retrain models

Registered Posts: 1 ✭✭✭

Hi,

I'm new to Dataiku and the community, and I'm using Dataiku Online. The documentation indicates that scenarios can be used to "Automate the retraining of “saved models” on a regular basis, and only activate the new version if the performance is improved". This is exactly what I need to set up, but I can't seem to find examples showing the specific steps I would set up in a scenario to accomplish this. I'm attaching my current flow (I got an error pasting a screenshot). Any references to documentation / working examples would be great.

Thank you


Operating system used: windows


Answers

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,270 Dataiker
    edited July 2024

    Hi @StephenEaster,

    Thanks for posting here. I'm sharing with the Community the solution we implemented on your end.

    We added a custom Python step in a scenario:

    import dataiku
    
    # Define variables necessary to run this code (Please assign your own environment's values)
    ANALYSIS_ID = 'XXXXXXX' # The identifier of the visual analysis containing the desired ML task
    ML_TASK_ID = 'XXXXXXX' # The identifier of the desired ML task
    SAVED_MODEL_ID = 'S-RESPONSEMODELING-ngsDEF4J-1656449809436' # The identifier of the saved model; only used on the first run (later runs read it from project variables)
    TRAINING_RECIPE_NAME = 'train_Predict_Revenue_NDays_Company__regression_' # Name of the training recipe to update for redeploying the model
    
    # client is a DSS API client.
    client = dataiku.api_client()
    p = client.get_project(dataiku.default_project_key())
    
    # Retrieve existing ML task to retrain the model
    mltask = p.get_ml_task(ANALYSIS_ID, ML_TASK_ID)
    
    # Wait for the ML task to be ready
    mltask.wait_guess_complete()
    
    # Start train and wait for it to be complete
    mltask.start_train()
    mltask.wait_train_complete()
    
    # Get the identifiers of the trained models
    # There will be one ID per algorithm enabled in the ML task's design
    ids = mltask.get_trained_models_ids()
    
    # Iterate over all trained models to find the one with the best score for the chosen metric (r2 here; auc, f1, etc. also work)
    actual_metric = "r2"
    
    temp_auc = float("-inf")  # start below any possible score so the first model always becomes the initial best
    for id in ids:
        details = mltask.get_trained_model_details(id)
        algorithm = details.get_modeling_settings()["algorithm"]
        auc = details.get_performance_metrics()[actual_metric]
    
        if auc > temp_auc:
            temp_auc = auc  # keep track of the best score seen so far
            best_model = id
            print("Better model identified")
    
        print("Algorithm=%s actual_metric=%s" % (algorithm, auc))
    
    # Let's compare the "best" model of the newly trained model vs the existing model to see which is better
    
    details = mltask.get_trained_model_details(best_model)
    auc = details.get_performance_metrics()[actual_metric]
    
    # We'll need to pull the current model ID from project variables and retrieve the model info
    vars = p.get_variables()
    
    try:
        current_model = vars["standard"]["current_model"]
    except KeyError:
        current_model = SAVED_MODEL_ID
    
    current_details = mltask.get_trained_model_details(current_model)
    current_auc = current_details.get_performance_metrics()[actual_metric]
    
    # Let's deploy the model with the best AUC score (either new or existing)
    if auc > current_auc:
        model_to_deploy = best_model
    else: 
        model_to_deploy = current_model
    
    print("Model to deploy identified: " + model_to_deploy)
    
    # Update project variables to reflect the new model ID that is being deployed
    vars["standard"]["current_model"] = model_to_deploy
    p.set_variables(vars)
    
    # Deploy the model to the Flow
    ret = mltask.redeploy_to_flow(model_to_deploy, recipe_name = TRAINING_RECIPE_NAME, activate = True)

    The assumption here is that the model has already been trained and the winning model was deployed to the Flow. To obtain the required variables:

    (see attached screenshots)

    The Saved Model ID, e.g. the one that is already deployed (see attached screenshot).
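
    If it's easier, these identifiers can also be listed programmatically with the public API. A minimal sketch (the exact keys in the returned summaries may differ slightly between DSS versions):

    import dataiku
    
    client = dataiku.api_client()
    p = client.get_project(dataiku.default_project_key())
    
    # Each ML task summary contains the visual analysis ID and the ML task ID
    for task in p.list_ml_tasks():
        print(task)
    
    # Each saved model entry contains the Saved Model ID (the "S-..." value) and its name
    for sm in p.list_saved_models():
        print(sm.get("id"), sm.get("name"))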

    Regards,

  • Registered Posts: 18 ✭✭✭✭

    Hi @AlexT,

    I'm looking to leverage the solution you've provided above to do automated weekly retraining of a model. Is there a way to make the value for 'SAVED_MODEL_ID' more dynamic? I ask because I worry that hard-coding 'SAVED_MODEL_ID' won't hold up in future retrainings if a new model ID performs better somewhere down the line.

    Here is an example:

    - Initially deployed model ID is model_123

    - On initial retraining step, all potential models are evaluated against model_123

    - Model_456 performs better than model_123 and is automatically deployed

    - In the following week of retraining, the R-squared values would be compared against the initially deployed model (123) instead of the newly deployed model (456).

    I appreciate your help in advance!

  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,270 Dataiker

    The example provided uses a project variable to track the current best model.
    So it will update to model 456, and during the next run it will compare against model 456, not 123.

    # Update project variables to reflect the new model ID that is being deployed
    vars["standard"]["current_model"] = model_to_deploy
    p.set_variables(vars)


  • Registered Posts: 18 ✭✭✭✭
    edited July 2024

    Thanks, @AlexT.

    If I run this first block of code the following week, won't all the project variables revert back to what is coded here?

    # Define variables necessary to run this code (Please assign your own environment's values)
    ANALYSIS_ID = 'XXXXXXX' # The identifier of the visual analysis containing the desired ML task
    ML_TASK_ID = 'XXXXXXX' # The identifier of the desired ML task
    SAVED_MODEL_ID = 'S-RESPONSEMODELING-ngsDEF4J-1656449809436' # The identifier of the saved model; only used on the first run (later runs read it from project variables)
    TRAINING_RECIPE_NAME = 'train_Predict_Revenue_NDays_Company__regression_' # Name of the training recipe to update for redeploying the model
  • Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,270 Dataiker
    edited July 2024

    It will only revert to the initially defined hard-coded value if the project variable is not present:

    try:
        current_model = vars["standard"]["current_model"]
    except KeyError:
        current_model = SAVED_MODEL_ID

  • Registered Posts: 18 ✭✭✭✭

    I got it now. Thank you for clarifying!

  • Registered Posts: 18 ✭✭✭✭

    Hi @Alexandru - I'm in the process of applying this retraining code to other scenarios that we have in place, but I'm starting to receive warnings and failures on the retraining step that I haven't seen before, and I'm not sure I quite understand the cause.

    The log tail says the following:

    2024-09-17 12:17:12,223 459023 INFO [Child] opened stderr
    2024-09-17 12:17:12,223 459023 INFO [Child] about to close other fd
    2024-09-17 12:17:12,223 459023 INFO [Child] closed other fd
    2024-09-17 12:17:12,223 459023 INFO [Child] chdired
    2024-09-17 12:17:12,223 459023 INFO setting username=dssuser_Corey_A_7de0ab31 uid=1031 gid=1001
    2024-09-17 12:17:12,225 459023 INFO [Child] dropped privileges
    2024-09-17 12:17:12,226 459023 INFO [Child] Checking access to DKUINSTALLDIR and DIP_HOME directories
    2024-09-17 12:17:12,226 459023 INFO [Child] Executing: /data/dataiku/dss_data/code-envs/python/BruceC_python_nba/bin/python : /data/dataiku/dss_data/code-envs/python/BruceC_python_nba/bin/python -u /data/dataiku/dss_data/scenarios/NEW_CUSTOMER_LTV_SCORING/New_Customer_LTV_Test_Scenario/2024-09-17-12-06-22-321/custom-step-Step #0-H: Retrain model/script.py
    Better model identified
    Algorithm=EXTRA_TREES actual_metric=0.6899158968011293
    Better model identified
    Algorithm=XGBOOST_REGRESSION actual_metric=0.49536019840601764
    Better model identified
    Algorithm=XGBOOST_REGRESSION actual_metric=0.6067963591367262
    Better model identified
    Algorithm=XGBOOST_REGRESSION actual_metric=0.5081919720514062
    Better model identified
    Algorithm=XGBOOST_REGRESSION actual_metric=0.5197821734190259
    Better model identified
    Algorithm=EXTRA_TREES actual_metric=0.674242617744111
    Traceback (most recent call last):
    File "/data/dataiku/dss_data/scenarios/NEW_CUSTOMER_LTV_SCORING/New_Customer_LTV_Test_Scenario/2024-09-17-12-06-22-321/custom-step-Step #0-H: Retrain model/script.py", line 34, in <module>
    auc = details.get_performance_metrics()[actual_metric]
    KeyError: 'r2'

    The step log says this:

    [2024/09/17-12:17:26.193] [AsyncCloser-861] [INFO] [com.dataiku.dip.analysis.ml.distributed.workers]  - Closing worker pool pool-kpgpnlf6e4ghxl5b
    [2024/09/17-12:17:26.193] [AsyncCloser-861] [INFO] [com.dataiku.dip.analysis.ml.distributed.workers] - Unregistered worker pool: pool-kpgpnlf6e4ghxl5b
    [2024/09/17-12:17:26.193] [AsyncCloser-861] [INFO] [dku.engine] - Successfully closed WorkerPool
    [2024/09/17-12:17:28.984] [AsyncCloser-861] [INFO] [com.dataiku.dip.analysis.ml.distributed.workers] - Closing worker pool pool-zjtneqhoh1djphlm
    [2024/09/17-12:17:28.984] [AsyncCloser-861] [INFO] [com.dataiku.dip.analysis.ml.distributed.workers] - Unregistered worker pool: pool-zjtneqhoh1djphlm
    [2024/09/17-12:17:28.985] [AsyncCloser-861] [INFO] [dku.engine] - Successfully closed WorkerPool

    I appreciate your help in advance.
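
    In case it's useful for narrowing this down, a minimal sketch (reusing the objects from the script above) that just prints which metric keys each trained model actually returns before indexing into them:

    # Inspect the available performance metric keys per trained model,
    # to see which model is missing 'r2' (illustrative sketch only)
    for id in ids:
        details = mltask.get_trained_model_details(id)
        metrics = details.get_performance_metrics()
        print(id, sorted(metrics.keys()))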
