About Shapley calculations

gnaldi62 · ‎08-07-2023

Hi,

we're facing serious problems with Shapley value calculation at a customer site. I've found some useful documentation at https://doc.dataiku.com/dss/latest/machine-learning/supervised/explanations.html and a few messages in the community (https://community.dataiku.com/t5/Using-Dataiku/SHAP-Shapley-values-in-Dataiku/m-p/22241, https://com...

Because of a major performance issue (the calculation takes more than 16 hours on the local DSS), we're trying to find a workaround. The PDF referred to in the documentation says:

"If the model does not provide feature importances, they are computed by training a random forest surrogate model and using its feature importances."

This is our main problem when trying to reproduce the same Shapley values. Which kind of model is used in this case? Is there any further reference?

Thanks. Rgds.

Giuseppe

AdrienL · ‎08-07-2023

Hi,

If the model does not expose feature importances, which are used to pick the most impactful columns for the Shapley value estimation, DSS builds a surrogate model using a random forest regressor (100 trees, max depth 5, subsample of at most 1000 rows) and uses the feature importances of that surrogate model.
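
For anyone trying to reproduce this outside DSS, here is a minimal sketch of such a surrogate using scikit-learn. It is not the DSS source code. The function name `surrogate_feature_importances` and the `predict_fn` argument are hypothetical, and two points are assumptions: that the surrogate is fit on the original model's numeric output (for example a predicted value or probability), and that "subsample of max 1000 rows" means the dataset is randomly subsampled before fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def surrogate_feature_importances(predict_fn, X, max_rows=1000, random_state=0):
    """Approximate feature importances for a black-box model.

    predict_fn: callable returning one numeric prediction per row (assumption).
    X: 2-D array of already preprocessed, numeric features.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(random_state)

    # Assumption: keep at most `max_rows` rows, drawn without replacement.
    if X.shape[0] > max_rows:
        idx = rng.choice(X.shape[0], size=max_rows, replace=False)
        X = X[idx]

    # The surrogate learns to imitate the black-box model's output.
    y = np.asarray(predict_fn(X), dtype=float)

    # Settings quoted in the answer above: 100 trees, max depth 5.
    surrogate = RandomForestRegressor(
        n_estimators=100, max_depth=5, random_state=random_state
    )
    surrogate.fit(X, y)

    # Impurity-based importances, normalized by scikit-learn to sum to 1.
    return surrogate.feature_importances_
```

Sorting the returned array in descending order gives a ranking of the columns that could then be used to restrict the Shapley estimation to the most impactful features. The exact ranking may differ from what DSS produces, since the preprocessing and random seed are not specified here.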

gnaldi62 · ‎08-07-2023

Hi Adrien,

thanks for your quick response. Rgds.

Giuseppe
