About Shapley calculations
Hi,
we're facing serious problems with Shapley value calculation at a customer site. I've found some useful documentation at https://doc.dataiku.com/dss/latest/machine-learning/supervised/explanations.html and a few messages in the community (https://community.dataiku.com/t5/Using-Dataiku/SHAP-Shapley-values-in-Dataiku/m-p/22241, https://community.dataiku.com/t5/Using-Dataiku/Interpretation-of-Shapley-values-in-Dataiku/m-p/7233, https://community.dataiku.com/t5/General-Discussion/Individual-Explanations/m-p/15378).
Because of a major performance issue (the calculation takes more than 16 hours on the local DSS), we're trying to find a workaround. The PDF referenced in the documentation says:
"If the model does not provide feature importances, they are computed by training a random forest surrogate model and using its feature importances"
This is our main obstacle in trying to reproduce the same Shapley values. Which kind of model is used in this case? Is there any further reference?
Thanks. Rgds.
Giuseppe
Best Answer
Hi,
If the model does not expose feature importances from which to determine the most impactful columns for the Shapley value estimation, DSS builds a surrogate model using a random forest regressor (100 trees, max depth 5, trained on a subsample of at most 1000 rows) and uses the feature importances of this surrogate model.
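To try reproducing this outside DSS, something along these lines might be a starting point. This is only a minimal sketch, not DSS's actual code: it assumes scikit-learn, assumes the surrogate is fit on the original model's predictions (the target is not stated above), and assumes the features are already numeric. The function name `surrogate_feature_importances` and the synthetic demo data are hypothetical; only the three hyperparameters (100 trees, max depth 5, at most 1000 rows) come from the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def surrogate_feature_importances(X, predictions, max_rows=1000, random_state=0):
    """Fit a small random forest surrogate and return its feature importances.

    X: 2-D numeric array of (already preprocessed) features.
    predictions: 1-D array of the original model's outputs (assumed target).
    """
    X = np.asarray(X, dtype=float)
    predictions = np.asarray(predictions, dtype=float)

    # Subsample to at most `max_rows` rows, without replacement.
    if X.shape[0] > max_rows:
        rng = np.random.default_rng(random_state)
        idx = rng.choice(X.shape[0], size=max_rows, replace=False)
        X, predictions = X[idx], predictions[idx]

    # Hyperparameters taken from the answer: 100 trees, max depth 5.
    surrogate = RandomForestRegressor(
        n_estimators=100, max_depth=5, random_state=random_state
    )
    surrogate.fit(X, predictions)
    return surrogate.feature_importances_


# Synthetic stand-in for a black-box model: feature 0 dominates the output.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(5000, 4))
pred_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 2]

importances = surrogate_feature_importances(X_demo, pred_demo)
ranking = np.argsort(importances)[::-1]  # most important feature first
print(ranking)
```

The resulting ranking could then be used to pick which columns to include in the Shapley estimation. Exact values will not match DSS unless the sampling, preprocessing and random seed also match.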
Answers
gnaldi62
Hi Adrien,
thanks for your quick response. Rgds.
Giuseppe