Getting Gini variable importance via API?

Jason Registered Posts: 29 ✭✭✭✭✭

I was looking at this post for guidance: https://community.dataiku.com/t5/Using-Dataiku/How-to-get-Variable-Importance-from-Model/m-p/3589

It led me to this documentation: https://developer.dataiku.com/latest/api-reference/python/ml.html#exploration-of-results

which documents a function called compute_shapley_feature_importance().

When I look at my model (an XGBoost binary classifier), I see an option for Shapley as well as Gini importance. Because my model is given only a single variable (a numeric vector of length 3501), the Shapley importance always says the array is 100% important (thanks, Captain Obvious), but the Gini importance actually shows me the importance of the individual elements of my vector.

I would like to access the Gini importance via the API so I can visualize this data (I want to graph the vector, then use the Gini importance to highlight the important parts of the vector with a vertical reference line). Sadly there is no documentation that I can find that explains how to access the Gini importance. This request is further complicated by the fact that my model is partitioned, so I actually want to access each partition's variable importance.
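A minimal sketch of the plot described above, assuming the importances have already been retrieved as a mapping from element index to importance. The names `top_k_indices`, `plot_with_reference_lines`, and the example values are hypothetical and are not part of the Dataiku API; matplotlib is assumed to be installed for the plotting step.

```python
from typing import Dict, List, Sequence


def top_k_indices(importance_by_index: Dict[int, float], k: int) -> List[int]:
    """Return the k element indices with the highest importance, most important first."""
    ranked = sorted(importance_by_index.items(), key=lambda kv: kv[1], reverse=True)
    return [idx for idx, _ in ranked[:k]]


def plot_with_reference_lines(vector: Sequence[float], marked: Sequence[int]):
    """Plot the vector and draw one vertical reference line per marked index."""
    # Imported lazily so the ranking helper above has no third-party dependency.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(range(len(vector)), vector)
    for idx in marked:
        ax.axvline(x=idx, color="red", linestyle="--", alpha=0.6)
    ax.set_xlabel("element index")
    ax.set_ylabel("value")
    return fig


# Hypothetical importances for a few elements of the vector (illustration only).
example_importance = {10: 0.05, 1200: 0.40, 2048: 0.25, 3500: 0.30}
important = top_k_indices(example_importance, k=2)
```

Calling `plot_with_reference_lines(my_vector, important)` would then draw the vector with a dashed line at each selected index. For a partitioned model, the same two steps could be repeated once per partition, given each partition's importances.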

I've googled it and come up empty-handed.

Can anybody lend a hand and point me to some documentation?

Thanks,

-Jason


Operating system used: Red Hat

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    edited July 17 Answer ✓

    Hi @Jason,

    Your problem probably comes from the fact that the model is partitioned:

    https://community.dataiku.com/t5/Using-Dataiku/Retrieve-feature-importance-from-partioned-model/m-p/24412#M9446


    Are you able to retrieve the feature importance with something like this on a non-partitioned model?

    import dataiku
    import pandas as pd

    # Example IDs: replace with the ones from your own analysis
    analysis_id = "r1111"
    ml_task_id = "q1111"
    trained_model_id = "A-PROJECT-KEY-rYUdqksI-qZZ8xzfM-s5-pp1-m2"

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    analysis = project.get_analysis(analysis_id)
    ml_task = analysis.get_ml_task(ml_task_id)
    # To list the available model IDs:
    # trained_model_ids = ml_task.get_trained_models_ids()

    trained_model_detail = ml_task.get_trained_model_details(trained_model_id)

    # Raw model details as a dict; the importance sits under "iperf" or "perf"
    raw_details = trained_model_detail.get_raw()
    if "iperf" in raw_details:
        raw_importance = raw_details.get("iperf").get("rawImportance")
    else:
        raw_importance = raw_details.get("perf").get("variables_importance")

    feature_importance_df = pd.DataFrame(raw_importance)
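Once the raw importance has been retrieved, it can be ranked before plotting. The sketch below is a dependency-free illustration that assumes the payload is a dict with parallel `variables` and `importances` lists; that layout is an assumption and should be checked against what `get_raw()` actually returns on your DSS version. The helper name `rank_importance` and the sample values are made up.

```python
from typing import Dict, List, Tuple


def rank_importance(raw_importance: Dict[str, list]) -> List[Tuple[str, float]]:
    """Pair each variable with its importance and sort from most to least important."""
    # Assumed layout: {"variables": [...], "importances": [...]} with equal lengths.
    variables = raw_importance["variables"]
    importances = raw_importance["importances"]
    if len(variables) != len(importances):
        raise ValueError("variables and importances must have the same length")
    pairs = zip(variables, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up payload for illustration only.
sample = {"variables": ["feat_a", "feat_b", "feat_c"], "importances": [0.2, 0.7, 0.1]}
ranked = rank_importance(sample)
```

The length check is there because a mismatched payload would otherwise be silently truncated by `zip`.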
    


    Thanks

Answers

  • Jason
    Jason Registered Posts: 29 ✭✭✭✭✭

    I have successfully retrieved them from a non-partitioned model in the past. I have not tried with this set of models, nor since I upgraded to version 12. Here's hoping this is added to the API in the future. Thanks!
