About Shapley calculations

Solved!
gnaldi62
Hi,

  we're facing serious problems with the Shapley calculation at a customer site. I've found some useful documentation at https://doc.dataiku.com/dss/latest/machine-learning/supervised/explanations.html and a few messages in the community (https://community.dataiku.com/t5/Using-Dataiku/SHAP-Shapley-values-in-Dataiku/m-p/22241, https://com...

Because of a serious performance issue (the calculation takes more than 16 hours on the local DSS), we're trying to find a workaround. The PDF referred to in the documentation says:

"If the model does not provide feature importances, they are computed by training a random
forest surrogate model and using its feature importances", and this is our main problem in trying to reproduce the same Shapley values. Which kind of model is used in this case? Is there any further reference?

Thanks. Rgds.

Giuseppe

1 Solution
AdrienL
Dataiker

Hi,

If the model does not expose feature importances from which to determine the most impactful columns to use for Shapley value estimation, DSS builds a surrogate model using a random forest regressor (100 trees, max depth 5, subsample of at most 1000 rows) and uses the feature importances of this surrogate model.
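
For anyone trying to reproduce this outside DSS, here is a minimal sketch of such a surrogate using scikit-learn. It is not the DSS code: the function name `surrogate_feature_importances`, the `predict_fn` callable, the random seed, and the choice of fitting the surrogate on the original model's predictions are all assumptions made for illustration. Only the three settings (100 trees, max depth 5, at most 1000 rows) come from the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def surrogate_feature_importances(predict_fn, X, max_rows=1000, random_state=0):
    """Rank features by fitting a random forest to a black-box model's output.

    predict_fn is a hypothetical callable returning one number per row
    (for a classifier, this could be the probability of one class).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(random_state)

    # Subsample to at most max_rows rows, as described in the answer.
    if X.shape[0] > max_rows:
        idx = rng.choice(X.shape[0], size=max_rows, replace=False)
        X = X[idx]

    # Assumption: the surrogate is trained to mimic the original model,
    # so its target is the model's prediction, not the true label.
    y = np.asarray(predict_fn(X), dtype=float)

    surrogate = RandomForestRegressor(
        n_estimators=100,  # 100 trees
        max_depth=5,       # max depth 5
        random_state=random_state,
    )
    surrogate.fit(X, y)
    return surrogate.feature_importances_


# Toy usage: a "black box" that depends mostly on column 0.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(5000, 4))
black_box = lambda X: 3.0 * X[:, 0] + 0.5 * X[:, 2]

importances = surrogate_feature_importances(black_box, X_demo)
top_features = np.argsort(importances)[::-1]  # most impactful columns first
```

The resulting ranking can then be used to restrict the Shapley estimation to the top columns. Expect the numbers to differ from DSS, since preprocessing, sampling and seeds will not match.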

gnaldi62
Author

Hi Adrien,

  thanks for your quick response. Rgds.

Giuseppe
