Getting Gini variable importance via API?

Jason · ‎02-21-2024

I was looking at this post for guidance: https://community.dataiku.com/t5/Using-Dataiku/How-to-get-Variable-Importance-from-Model/m-p/3589

It led me to this documentation: https://developer.dataiku.com/latest/api-reference/python/ml.html#exploration-of-results

which documents a function called compute_shapley_feature_importance().

When I look at my model (xgboost binary classifier), I see an option for Shapley as well as Gini importance. Because my model is only given a single variable (a numeric vector of length 3501) the Shapley importance always says the array is 100% important (thanks Captain Obvious), but the Gini importance actually shows me the importance of the various element numbers in my vector.

I would like to access the Gini importance via the API so I can visualize this data (I want to graph the vector, then use the Gini importance to highlight the important parts of the vector with a vertical reference line). Sadly there is no documentation that I can find that explains how to access the Gini importance. This request is further complicated by the fact that my model is partitioned, so I actually want to access each partition's variable importance.
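A minimal sketch of the visualization described above, assuming the importances are already available as a list aligned index-for-index with the vector (one score per element). The names top_k_indices and plot_with_markers are hypothetical helpers, not part of the Dataiku API, and matplotlib is assumed to be installed for the plotting part.

```python
def top_k_indices(importances, k):
    """Return the positions of the k largest importance values, in ascending index order."""
    # Rank element positions by importance, highest first
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return sorted(order[:k])


def plot_with_markers(vector, importances, k=5):
    """Plot the vector and draw a vertical reference line at each of the k most important elements."""
    # Imported here so that top_k_indices stays usable without matplotlib
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(range(len(vector)), vector)
    for idx in top_k_indices(importances, k):
        ax.axvline(x=idx, linestyle="--", color="red")
    ax.set_xlabel("element index")
    ax.set_ylabel("value")
    return fig
```

Calling plot_with_markers(vector, importances, k=10) would return a figure with ten reference lines. Picking a fixed k is only one option; a threshold on the importance value would work just as well.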

I've googled it and come up empty-handed.

Can anybody lend a hand and point me to some documentation?

Thanks,

-Jason

Operating system used: Red Hat

AlexT · ‎02-21-2024

Hi @Jason ,
Your problem probably comes from the fact that the model is partitioned:

https://community.dataiku.com/t5/Using-Dataiku/Retrieve-feature-importance-from-partioned-model/m-p/...

Are you able to retrieve the feature importance with something like this on a non-partitioned model?

import dataiku
import pandas as pd

# Example IDs; replace with the values from your own project
analysis_id = "r1111"
ml_task_id = "q1111"
trained_model_id = "A-PROJECT-KEY-rYUdqksI-qZZ8xzfM-s5-pp1-m2"

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
analysis = project.get_analysis(analysis_id)
ml_task = analysis.get_ml_task(ml_task_id)
# Uncomment to list the trained model IDs available in this ML task
#trained_model_ids = ml_task.get_trained_models_ids()

trained_model_detail = ml_task.get_trained_model_details(trained_model_id)

# The raw model details are a plain dictionary
feature_importance = trained_model_detail.get_raw()
# Prefer the "iperf" section if present, otherwise fall back to "perf"
if 'iperf' in feature_importance.keys():
    raw_importance = feature_importance.get("iperf").get("rawImportance")
else:
    raw_importance = feature_importance.get("perf").get("variables_importance")

feature_importance_df = pd.DataFrame(raw_importance)
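As a rough sketch, the same lookup can be wrapped in small functions that work on the plain dictionary returned by get_raw(), so they can be applied to several models in a loop. The function names are hypothetical. The payload layout (parallel "variables" and "importances" lists) is an assumption and should be checked against your own model's raw details. Unlike the snippet above, this version also falls back to "perf" when "iperf" exists but has no "rawImportance" key, and returns None instead of raising when neither is present.

```python
def extract_raw_importance(raw_details):
    """Return the raw importance payload from a model-details dict, or None if absent."""
    iperf = raw_details.get("iperf") or {}
    if "rawImportance" in iperf:
        return iperf["rawImportance"]
    perf = raw_details.get("perf") or {}
    return perf.get("variables_importance")


def rank_importance(raw_importance):
    """Pair variable names with their importances, highest importance first."""
    if not raw_importance:
        return []
    # Assumed layout: two parallel lists under these keys
    names = raw_importance.get("variables", [])
    scores = raw_importance.get("importances", [])
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical payload for illustration only, not real model output
example = {
    "iperf": {
        "rawImportance": {
            "variables": ["vec:0", "vec:1", "vec:2"],
            "importances": [0.1, 0.7, 0.2],
        }
    }
}
ranked = rank_importance(extract_raw_importance(example))
```

With real data, raw_details would be trained_model_detail.get_raw(); the ranked pairs can then be passed to pd.DataFrame if a table is preferred.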

Thanks

Jason · ‎03-07-2024

I have successfully retrieved them from a non-partitioned model in the past, but I have not tried with this set of models, nor since upgrading to version 12. Here's hoping this is added to the API in the future. Thanks!
