Looking at this post for guidance: https://community.dataiku.com/t5/Using-Dataiku/How-to-get-Variable-Importance-from-Model/m-p/3589
led me to this documentation: https://developer.dataiku.com/latest/api-reference/python/ml.html#exploration-of-results
where there is documented a function called: compute_shapley_feature_importance()
When I look at my model (xgboost binary classifier), I see an option for Shapley as well as Gini importance. Because my model is only given a single variable (a numeric vector of length 3501) the Shapley importance always says the array is 100% important (thanks Captain Obvious), but the Gini importance actually shows me the importance of the various element numbers in my vector.
I would like to access the Gini importance via the API so I can visualize this data (I want to graph the vector, then use the Gini importance to highlight the important parts of the vector with a vertical reference line). Sadly there is no documentation that I can find that explains how to access the Gini importance. This request is further complicated by the fact that my model is partitioned, so I actually want to access each partition's variable importance.
I've googled it and come up empty-handed.
Can anybody lend a hand and point me to some documentation?
Thanks,
-Jason
Operating system used: Red Hat
Hi @Jason ,
Your problem probably comes from the fact that the model is partitioned:
https://community.dataiku.com/t5/Using-Dataiku/Retrieve-feature-importance-from-partioned-model/m-p/...
Are you able to retrieve the feature importance with something like this on a non-partitioned model?
import dataiku
import pandas as pd

# IDs of the analysis, ML task, and trained model to inspect
analysis_id = "r1111"
ml_task_id = "q1111"
trained_model_id = "A-PROJECT-KEY-rYUdqksI-qZZ8xzfM-s5-pp1-m2"

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
analysis = project.get_analysis(analysis_id)
ml_task = analysis.get_ml_task(ml_task_id)
# trained_model_ids = ml_task.get_trained_models_ids()
trained_model_detail = ml_task.get_trained_model_details(trained_model_id)
feature_importance = trained_model_detail.get_raw()

# Read the importance from "iperf" when present, otherwise fall back to "perf"
if "iperf" in feature_importance:
    raw_importance = feature_importance.get("iperf").get("rawImportance")
else:
    raw_importance = feature_importance.get("perf").get("variables_importance")
feature_importance_df = pd.DataFrame(raw_importance)
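If that returns the importances, the plot described in the question (the vector with vertical reference lines at the important elements) could be sketched roughly as below. This is only a sketch under assumptions: I am assuming the raw importance yields a list of feature names and a parallel list of scores, and that each feature name ends with the element index of the vector. The helper names top_positions and plot_with_importance are made up for illustration and are not part of the Dataiku API, so check the names your model actually produces first.

```python
import re


def top_positions(variables, importances, n=10):
    """Return (element_index, importance) pairs for the n highest scores.

    Assumption: each feature name contains the element index as its
    last run of digits, e.g. "vec[12]" -> 12. Names without digits are skipped.
    """
    pairs = []
    for name, score in zip(variables, importances):
        match = re.search(r"(\d+)\D*$", name)  # last run of digits in the name
        if match:
            pairs.append((int(match.group(1)), score))
    # Highest importance first
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:n]


def plot_with_importance(vector, positions):
    """Plot the vector and mark each important element with a vertical line."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.plot(range(len(vector)), vector)
    for pos, _score in positions:
        ax.axvline(pos, color="red", linestyle="--", alpha=0.5)
    ax.set_xlabel("element index")
    ax.set_ylabel("value")
    return fig
```

With the DataFrame from above, and assuming its columns are named "variables" and "importances", you would call top_positions(feature_importance_df["variables"], feature_importance_df["importances"]) and pass the result to plot_with_importance together with one of your vectors.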
Thanks
I have successfully retrieved them from a non-partitioned model in the past.... I have not tried with this set of models, nor since I've upgraded to version 12. Here's to hoping this is added to the API in the future. Thanks!