SHAP (Shapley values) in Dataiku

jxh
Level 2
I am still new to Dataiku and learning about its capabilities. Is there a way to replicate the global interpretability of the SHAP algorithm in Dataiku? I am familiar with SHAP in Python and have used it to show the positive and negative relationships of predictors with the target variable across all of our data.

How can I replicate this interpretation in Dataiku? I see the ability to calculate Shapley values at the individual row level (Individual explanations tab) or calculate them in the Interactive scoring tab which shows the impact of changing a feature.

I want to output an interpretation similar to SHAP's summary_plot in Python: one that shows each feature's overall impact on the model output across all data points, so I can see which features are positively or negatively related to the target variable based on their SHAP values.

8 Replies
CoreyS
Dataiker Alumni

Hi @jxh, while you wait for a more complete response, have you had a chance to look at this post: Interpretation of Shapley values in Dataiku?

I hope this helps!

jxh
Level 2
Author

I did have a chance to look at this thread but it's not immediately clear to me how to get to the result I'm looking for. Do I select "Compute individual explanations" when scoring the dataset I want to apply the model to? Or is there a way to find the feature importance of the model in the model training under Visual Analysis?

louisplt
Dataiker

Hello @jxh,

Be aware that Dataiku DSS doesn't use the SHAP package; it uses its own in-house algorithm to compute Shapley values. This means you cannot directly output the summary plot you are used to with SHAP.

You can use the scoring recipe with the option "Compute individual explanations" to compute Shapley values for all the input rows. To get a global feature importance, you can then average the absolute Shapley values per feature across the data (this is how feature importance is computed with the SHAP package).
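As a rough illustration of the averaging step, here is a minimal Python sketch. It assumes the scored dataset has a column named `explanations` that holds one JSON object per row mapping feature names to Shapley values; check the column name and format in your own output. The sample data, the dataset name in the comment, and the `global_importance` helper are hypothetical. The `direction` column is an extra that is not part of the answer above: it correlates each numeric feature's value with its Shapley value as a crude stand-in for the colour gradient of a SHAP summary plot.

```python
import json

import pandas as pd

# Hypothetical stand-in for the scored output dataset. Inside DSS you would
# load the real one instead, e.g. dataiku.Dataset("scored").get_dataframe(),
# where "scored" is a placeholder dataset name.
scored = pd.DataFrame({
    "age": [25, 40, 60],
    "income": [30.0, 55.0, 80.0],
    "explanations": [
        '{"age": -0.20, "income": -0.10}',
        '{"age": 0.05, "income": 0.02}',
        '{"age": 0.30, "income": 0.15}',
    ],
})


def global_importance(df, explanations_col="explanations"):
    """Aggregate per-row Shapley values into one global score per feature."""
    # Expand the JSON strings: one column per feature, one row per record.
    contrib = pd.DataFrame(
        [json.loads(s) for s in df[explanations_col]], index=df.index
    )
    # Mean absolute Shapley value per feature (NaN entries are skipped).
    summary = pd.DataFrame({"mean_abs_shapley": contrib.abs().mean()})
    # Sign of the relationship: correlation between the feature value and
    # its Shapley value. Only meaningful for numeric features.
    summary["direction"] = [
        df[f].corr(contrib[f])
        if f in df.columns and pd.api.types.is_numeric_dtype(df[f])
        else float("nan")
        for f in contrib.columns
    ]
    return summary.sort_values("mean_abs_shapley", ascending=False)


importance = global_importance(scored)
print(importance)
```

If the explanations only cover a subset of features for each row, the missing entries become NaN here and are left out of the mean, so features that are rarely explained are averaged over fewer rows.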

Hope this helps

Louis

jxh
Level 2
Author

Do you have further insight on how sub chunk size, number of Monte Carlo steps, and the "Use input as explanation basis" affects the results that come out of this option? I can't find much documentation on it.

louisplt
Dataiker

Hello @jxh,

- Sub chunk size is used to reduce the memory footprint of the algorithm. Increasing it makes the computation faster, but you could run out of memory.

- The higher the number of Monte Carlo steps, the more accurate the Shapley values produced. Increasing this number slows the computation.

- When computing the explanations, the Shapley algorithm needs sample rows: it uses them to modify the rows to explain and observe the impact of those modifications on the prediction. Usually a sample of the test set is used as sample rows, but you can check the option "Use input as explanation basis" to use a sample of the input dataset (the one to be scored) instead. The impact of this option on the output is very difficult to predict, because it depends on how the input data compares to the test set. Unless you specifically need this behavior, I suggest you keep this option deactivated.
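To make the roles of the Monte Carlo steps and the sample rows more concrete, here is a minimal sketch of a generic permutation-sampling Shapley estimator in plain Python. It is not Dataiku's in-house algorithm, and the `monte_carlo_shapley` function, the toy `predict` model, and the data are all made up for illustration. `n_steps` plays the role of the number of Monte Carlo steps, and `background` plays the role of the sample rows (a test set sample by default, or an input dataset sample with "Use input as explanation basis").

```python
import random


def monte_carlo_shapley(predict, row, background, n_steps, seed=0):
    """Estimate Shapley values for `row` by sampling feature orderings.

    predict    : function taking a list of feature values, returning a float
    row        : the row to explain
    background : sample rows that supply the "feature not yet known" values
    n_steps    : number of Monte Carlo samples to average over
    """
    rng = random.Random(seed)
    n_features = len(row)
    totals = [0.0] * n_features
    for _ in range(n_steps):
        # Draw a random feature ordering and a random background row.
        order = list(range(n_features))
        rng.shuffle(order)
        current = list(rng.choice(background))
        previous = predict(current)
        for j in order:
            # Switch feature j to the explained row's value and record
            # how much the prediction moves.
            current[j] = row[j]
            value = predict(current)
            totals[j] += value - previous
            previous = value
    return [t / n_steps for t in totals]


# Toy model with an interaction between features 0 and 2 (illustrative only).
def predict(x):
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[2]


background = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.0], [2.0, 1.0, 4.0], [3.0, 3.0, 2.0]]
row = [4.0, 1.0, 2.0]

few = monte_carlo_shapley(predict, row, background, n_steps=10)
many = monte_carlo_shapley(predict, row, background, n_steps=5000)
print("10 steps:  ", [round(v, 3) for v in few])
print("5000 steps:", [round(v, 3) for v in many])
```

With few steps the estimates vary noticeably from one seed to the next; with many steps they settle down, at the cost of more calls to the model. Swapping in a different `background` shifts the baseline the row is compared against, which is why the effect of the "explanation basis" option depends on how the two datasets differ.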

Hope this is clearer now

Louis

Marlan

Hi @jxh,

I would really like to see this feature added to Dataiku as well. I've used the SHAP package a lot with hand-coded models and have found it very useful for gaining insight into a model. It's one of a couple of things I miss the most when using Dataiku's visual ML functionality. (Another is not being able to automatically stratify when splitting the train/test sets.)

If you'd be up for it, you could copy and paste your post (with perhaps a bit of editing) over to the Product Ideas section where it can be considered more formally for a future enhancement to Dataiku. I'd certainly vote for it!

Marlan

jxh
Level 2
Author

Agreed on the stratified sampling as well. It looks like some functionality is currently missing from the Dataiku platform that would be very useful to include.

AshleyW
Dataiker

Hi, 

Updating this thread to let you know about 'Universal Feature Importance', which shipped with version 12 and might be what you're looking for. Here's a short video with some details.

Cheers, 

Ashley
