Getting Gini variable importance via API?
I was looking at this post for guidance: https://community.dataiku.com/t5/Using-Dataiku/How-to-get-Variable-Importance-from-Model/m-p/3589
It led me to this documentation: https://developer.dataiku.com/latest/api-reference/python/ml.html#exploration-of-results
which documents a function called compute_shapley_feature_importance().
When I look at my model (xgboost binary classifier), I see an option for Shapley as well as Gini importance. Because my model is only given a single variable (a numeric vector of length 3501), the Shapley importance always says the array is 100% important (thanks Captain Obvious), but the Gini importance actually shows me the importance of the individual elements in my vector.
I would like to access the Gini importance via the API so I can visualize this data (I want to graph the vector, then use the Gini importance to highlight the important parts of the vector with a vertical reference line). Sadly there is no documentation that I can find that explains how to access the Gini importance. This request is further complicated by the fact that my model is partitioned, so I actually want to access each partition's variable importance.
I've googled it and come up empty-handed.
Can anybody lend a hand and point me to some documentation?
Thanks,
-Jason
Operating system used: Red Hat
Best Answer
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,212 Dataiker
Hi @Jason,
Your problem probably comes from the fact that the model is partitioned:
https://community.dataiku.com/t5/Using-Dataiku/Retrieve-feature-importance-from-partioned-model/m-p/24412#M9446
Are you able to retrieve the feature importance with something like this on a non-partitioned model?

import dataiku
import pandas as pd

analysis_id = "r1111"
ml_task_id = "q1111"
trained_model_id = "A-PROJECT-KEY-rYUdqksI-qZZ8xzfM-s5-pp1-m2"

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
analysis = project.get_analysis(analysis_id)
ml_task = analysis.get_ml_task(ml_task_id)
#trained_model_ids = ml_task.get_trained_models_ids()
trained_model_detail = ml_task.get_trained_model_details(trained_model_id)
feature_importance = trained_model_detail.get_raw()

if 'iperf' in feature_importance.keys():
    raw_importance = feature_importance.get("iperf").get("rawImportance")
else:
    raw_importance = feature_importance.get("perf").get("variables_importance")

feature_importance_df = pd.DataFrame(raw_importance)
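Once raw_importance has been retrieved, picking out the most important vector elements and marking them on a plot could look roughly like the sketch below. This is only an illustration under some assumptions: it assumes raw_importance is a dict with parallel "variables" and "importances" lists (check what your own model actually returns), and it assumes each feature name ends with the element number, which is a hypothetical naming scheme. The helper names top_importances, element_index and plot_with_markers are made up for this example and are not part of the Dataiku API.

```python
import re


def top_importances(raw_importance, k=10):
    """Return the k (variable, importance) pairs with the largest importance.

    Assumption: raw_importance looks like
    {"variables": [...], "importances": [...]}.
    """
    pairs = zip(raw_importance["variables"], raw_importance["importances"])
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]


def element_index(variable_name):
    """Extract a trailing integer from a feature name, or None if there is none.

    Assumption (hypothetical naming): names end with the element number,
    for example "spectrum_42".
    """
    match = re.search(r"(\d+)$", variable_name)
    return int(match.group(1)) if match else None


def plot_with_markers(vector, indices):
    """Plot the vector and draw a vertical reference line at each index."""
    # Imported lazily so the two helpers above work without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(range(len(vector)), vector)
    for i in indices:
        ax.axvline(x=i, color="red", linestyle="--", alpha=0.6)
    return fig
```

Usage would be something like calling top_importances(raw_importance, k=20), mapping each variable name through element_index, dropping any None results, and passing the remaining indices to plot_with_markers along with the vector for one row of data.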
Thanks
Answers
I have successfully retrieved them from a non-partitioned model in the past. I have not tried with this set of models, nor since upgrading to version 12. Here's hoping this is added to the API in the future. Thanks!