SHAP (Shapley values) in Dataiku

jxh · Registered Posts: 6 ✭✭✭✭

I am still new to Dataiku and am learning about its capabilities. Is there a way to replicate the global interpretability of the SHAP algorithm in Dataiku? I am familiar with SHAP in Python and have used it to show the positive and negative relationships of predictors with the target variable across all of our data.

How can I replicate this interpretation in Dataiku? I see the ability to calculate Shapley values at the individual row level (the Individual explanations tab) or in the Interactive scoring tab, which shows the impact of changing a feature.

I want to be able to output an interpretation similar to SHAP's summary_plot in Python, which shows the impact of each feature on the model output across all data points, making it clear which features are positively or negatively related to the target variable based on their SHAP values.
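For reference, this is roughly what I do with the SHAP package today (a minimal sketch; the dataset and model here are just placeholders for our own):

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Placeholder data and model; substitute your own
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Beeswarm summary plot: one dot per row per feature, positioned by
# SHAP value and colored by feature value, so positive and negative
# relationships with the target are visible at a glance
shap.summary_plot(shap_values, X)
```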

Best Answer

  • louisplt · Dataiker Posts: 21
    Answer ✓

    Hello @jxh,

    Be aware that Dataiku DSS doesn't use the SHAP package but its own in-house algorithm to compute Shapley values. This means you cannot directly output the summary plot you are used to from SHAP.

    You can use the scoring recipe with the option "Compute individual explanations" to compute the Shapley values for all input rows. Then, to compute feature importance, you can average the absolute Shapley values per feature across the data (this is how feature importance is computed in the SHAP package); see the sketch below.
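    As a sketch, in a Python recipe downstream of the scoring recipe it could look like this. I am assuming the explanations land in a JSON column named "explanations" and that the scored dataset is called "scored_dataset"; adjust the names to your flow:

    ```python
    import json

    import dataiku
    import pandas as pd

    # Read the output of the scoring recipe
    # ("scored_dataset" is a placeholder name; use your own)
    df = dataiku.Dataset("scored_dataset").get_dataframe()

    # Each "explanations" cell holds a JSON object mapping
    # feature name -> Shapley value for that row
    per_row = df["explanations"].apply(json.loads)
    shap_df = pd.DataFrame(list(per_row))

    # Global importance: mean absolute Shapley value per feature
    importance = shap_df.abs().mean().sort_values(ascending=False)
    print(importance)
    ```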

    Hope this helps

    Louis

Answers

  • CoreyS · Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭

    Hi @jxh, while you wait for a more complete response, I was wondering if you had the opportunity to look at this post:

    I hope this helps!

  • jxh · Registered Posts: 6 ✭✭✭✭

    I did have a chance to look at that thread, but it's not immediately clear to me how to get the result I'm looking for. Do I select "Compute individual explanations" when scoring the dataset I want to apply the model to? Or is there a way to find the model's feature importance during model training, under Visual Analysis?

  • Marlan · Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319

    Hi @jxh,

    I would really like to see this feature added to Dataiku as well. I've used the SHAP package a lot with hand-coded models and have found it very useful for gaining insight into a model. It's one of a couple of things I miss most when using Dataiku's visual ML functionality. (Another is not being able to automatically stratify when splitting the train/test sets.)

    If you'd be up for it, you could copy and paste your post (with perhaps a bit of editing) over to the Product Ideas section where it can be considered more formally for a future enhancement to Dataiku. I'd certainly vote for it!

    Marlan

  • jxh · Registered Posts: 6 ✭✭✭✭

    Agreed on the stratified sampling as well. It looks like there are some capabilities currently missing from the Dataiku platform that would be very useful to include.

  • jxh · Registered Posts: 6 ✭✭✭✭

    Do you have further insight into how the sub chunk size, the number of Monte Carlo steps, and the "Use input as explanation basis" option affect the results that come out of this feature? I can't find much documentation on them.

  • louisplt · Dataiker Posts: 21

    Hello @jxh,

    - Sub chunk size is used to reduce the memory footprint of the algorithm. If you increase it, computation will be faster, but you could run out of memory.

    - The higher the number of Monte Carlo steps, the more accurate the resulting Shapley values; increasing this number slows the computation.

    - When computing the explanations, the Shapley algorithm needs sample rows: it modifies the rows to explain and observes the impact of those modifications on the prediction. Usually a sample of the test set is used as the sample rows, but you can check the option "Use input as explanation basis" to use a sample of the input dataset (the one being scored) instead. The impact of this option on the output is very difficult to predict, since it depends on how the input data compares to the test set. Unless you specifically need this behavior, I suggest you keep it deactivated. The toy sketch below shows how the sample rows and the Monte Carlo steps enter the estimate.
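    To give some intuition for those two settings, here is a toy sketch of Monte Carlo Shapley estimation in general. This is not the exact algorithm DSS implements, only an illustration of the technique: each step draws a random sample row and a random feature ordering, and more steps average away the sampling noise.

    ```python
    import numpy as np

    def monte_carlo_shapley(predict, x, background, n_steps, seed=0):
        """Toy Monte Carlo estimate of the Shapley values for one row.

        predict:    function mapping a 2D array of rows to predictions
        x:          1D array, the row to explain
        background: 2D array of sample rows (e.g. a test set sample)
        n_steps:    more steps = more accurate, but slower
        """
        rng = np.random.default_rng(seed)
        n_features = len(x)
        phi = np.zeros(n_features)
        for _ in range(n_steps):
            z = background[rng.integers(len(background))]  # random sample row
            order = rng.permutation(n_features)            # random feature order
            for i, j in enumerate(order):
                # Take the features up to and including j from x, the rest from z
                mask_with = np.isin(np.arange(n_features), order[:i + 1])
                mask_without = np.isin(np.arange(n_features), order[:i])
                with_j = np.where(mask_with, x, z)
                without_j = np.where(mask_without, x, z)
                # Marginal contribution of feature j under this ordering
                phi[j] += predict(with_j[None, :])[0] - predict(without_j[None, :])[0]
        return phi / n_steps
    ```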

    Hope this is clearer now

    Louis

  • Ashley · Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 161

    Hi,

    Updating this thread to let you know about 'Universal Feature Importance', which shipped with version 12 and might be what you're looking for. Here's a short video with some details.

    Cheers,

    Ashley
