Interpretation of Shapley values in Dataiku

Hello Dataiku users 🙂

I have a question regarding the interpretation of the Shapley values provided by Dataiku for a classification problem. From what I have read about this method, the Shapley value of each feature in the model, for a given individual, gives that feature's contribution to explaining the difference between the probability predicted for this individual and the mean probability. Thus, the sum of the Shapley values should equal the difference between the individual's predicted probability and the mean probability.

I would like to understand in more detail how Dataiku calculates these Shapley values. Dataiku's documentation says that for classification problems these values are log-odds ratios of the calculated probabilities, and I assume it is not the difference between the log-odds of the predicted probability and the log-odds of the mean probability, as is the case for the partial dependence plot.

I need your help on this. And if you also have other suggestions on how to get the best out of these shapley values, I'm interested :).

Sincerely,

As you say, for classification, the Shapley values are computed on the log-odds scale. Therefore, the sum of the Shapley values should be equal to the difference between the **log-odds of** the predicted probability of that individual and the mean of **the log-odds of** the probabilities. The algorithm is explained in the documentation: https://doc.dataiku.com/dss/7.0/machine-learning/supervised/explanations.html
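Written out, the relationship described here looks roughly as follows. The notation is introduced only for illustration and is not taken from the Dataiku documentation: $\phi_j(x)$ is the Shapley value of feature $j$ for row $x$, $p(x)$ is the predicted probability, $B$ is the set of rows over which the mean is taken, and the logarithm is assumed to be natural.

```latex
\sum_{j=1}^{d} \phi_j(x)
  \;=\; \operatorname{logit}\bigl(p(x)\bigr)
  \;-\; \frac{1}{|B|} \sum_{b \in B} \operatorname{logit}\bigl(p(b)\bigr),
\qquad
\operatorname{logit}(p) = \ln\frac{p}{1-p}
```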

There are many ways to use Shapley values. For example:

- you can present them to the final user to support decision making,

- you can look at the Shapley values of the extreme predictions to understand how your model works,

- you can look at Shapley values of large errors to understand why your model is making errors,

- you can average the Shapley values (multiplied by the sign of the difference between the prediction and the average prediction) over all samples to obtain feature importances,

- you can cluster the Shapley values to understand which groups of samples are explained in the same way, to get a better understanding of your model.
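The signed-average idea from the list above could be sketched like this. It is only an illustration of one possible reading of that suggestion: the function name, the data layout (one list of Shapley values per row) and the sample numbers are all hypothetical and do not come from DSS.

```python
from statistics import mean

def signed_mean_importance(shap_rows, predictions):
    """Average each feature's Shapley values over all rows, after multiplying
    every row by the sign of (prediction - average prediction)."""
    avg_pred = mean(predictions)
    n_features = len(shap_rows[0])
    totals = [0.0] * n_features
    for row, pred in zip(shap_rows, predictions):
        # Sign of the row's deviation from the average prediction
        sign = 1.0 if pred > avg_pred else -1.0 if pred < avg_pred else 0.0
        for j, value in enumerate(row):
            totals[j] += sign * value
    return [t / len(shap_rows) for t in totals]

# Hypothetical data: 3 rows, 2 features, predictions on the log-odds scale
shap_rows = [[0.8, -0.1], [-0.6, 0.2], [0.1, 0.0]]
predictions = [1.0, -0.5, 0.4]
importances = signed_mean_importance(shap_rows, predictions)
print(importances)
```

With this convention, a feature that consistently pushes predictions away from the average gets a large positive score, while a feature that tends to pull them back toward the average gets a negative one.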

@nomont Thank you very much for your answer and especially for your very relevant recommendations on how to use the Shapley values.

However, I'm afraid that the sum of the Shapley values calculated by Dataiku is not equal to "the difference between the **log-odds of** the predicted probability of that individual and the mean of **the log-odds of** the probabilities". When I did the calculation, I found that **the sum of the Shapley values is greater than this difference, and I can't figure out why**.

I see 2 possibilities:

- The mean is computed on a subset of the test dataset (if simple split is selected) or of the training dataset (if K-fold evaluation is selected). It is different from the mean computed on the full test (resp. training) dataset. It is also different from the mean computed on the scored dataset. Note that the difference should be the same for all rows.

- The mean of the log-odds of probabilities is different from the log-odds of the mean of probabilities.
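The second point is easy to check numerically. The sketch below uses three made-up probabilities (not from any real model) and the natural logarithm to show that the two quantities differ.

```python
import math
from statistics import mean

def logit(p):
    """Natural-log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Hypothetical predicted probabilities
probs = [0.05, 0.10, 0.90]

mean_of_log_odds = mean([logit(p) for p in probs])  # roughly -0.98
log_odds_of_mean = logit(mean(probs))               # roughly -0.62

print(mean_of_log_odds, log_odds_of_mean)
```

Because the logit is nonlinear, averaging before or after the transformation gives different baselines, so comparing against the wrong one leaves an unexplained offset.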

@nomont Thank you for your response.

Concerning the first possibility:

- I used k-fold cross-validation. I ran a simple test: for each individual, I computed the difference between the sum of the Shapley values and the log-odds of the predicted probability. This difference is not constant across individuals. That's why I'm asking these questions.
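A minimal sketch of this kind of test is shown below. It assumes the natural logarithm and uses synthetic rows built so that the Shapley values are exactly additive on the log-odds scale. The keys `shap` and `proba` and the baseline value are hypothetical and are not DSS column names.

```python
import math

def logit(p):
    """Natural-log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def additivity_gaps(rows):
    """Per row: sum of Shapley values minus the log-odds of the prediction."""
    return [sum(r["shap"]) - logit(r["proba"]) for r in rows]

def is_constant(values, tol=1e-9):
    return max(values) - min(values) <= tol

# Synthetic rows: proba is derived from a hypothetical baseline plus the Shapley sum
base = -0.4
shap_values = [[0.9, -0.2], [-0.5, 0.1], [0.3, 0.3]]
rows = [{"shap": s, "proba": sigmoid(base + sum(s))} for s in shap_values]

gaps = additivity_gaps(rows)
print(gaps, is_constant(gaps))
```

When additivity holds, every gap equals minus the baseline (here 0.4), so a gap that varies from row to row points to a mismatch in the transformation or in the rows used for the mean.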

Sincerely.

OK. Can you send me a project in which I can see what happens?

What do you need to have in the project? Because of our data privacy policy, I can't share the whole project with you.

Sincerely

@nomont I tried the model on the Titanic dataset and came to the same conclusion as before.

Please find attached the project.

Sincerely.

Hello @noureini,

Thank you for sharing this project.

After looking at this project, I managed to obtain the constant "diff" you are talking about by changing log10 to ln in the Prepare recipe step.
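The effect of the logarithm base can be reproduced on synthetic numbers. In the sketch below, the baseline and the Shapley sums are made up, and the predictions are constructed to be additive on the natural-log-odds scale. The gap is then constant with `math.log` but varies from row to row with `math.log10`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical baseline and per-row sums of Shapley values (natural-log-odds scale)
base = -0.3
shap_sums = [1.2, -0.7, 0.3]
probas = [sigmoid(base + s) for s in shap_sums]

def gaps(log_fn):
    """Per row: Shapley sum minus the log-odds computed with the given log function."""
    return [s - log_fn(p / (1.0 - p)) for s, p in zip(shap_sums, probas)]

gaps_ln = gaps(math.log)        # constant, equal to -base
gaps_log10 = gaps(math.log10)   # not constant: log10(x) = ln(x) / ln(10)

spread_ln = max(gaps_ln) - min(gaps_ln)
spread_log10 = max(gaps_log10) - min(gaps_log10)
print(spread_ln, spread_log10)
```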

Best regards,

Louis Pouillot

Dear @louisplt ,

Thank you very much for your response, I really appreciate it.

Yes indeed, using the natural logarithm gives the solution. It seems that the constant "Diff" is therefore **the mean ln-odds of the probabilities**, used to assess the difference between an individual prediction and what the model gives if we don't have any information on that particular individual.

**It seems that the constant diff is in fact the ln-odds of the mean probability and not the mean of the ln-odds of the probabilities**? Can you confirm this?

In the project, the constant "Diff" is equal to "-0.285" for the vast majority of individuals, but for some others it is equal to "2.51" (see Passenger Id = 258, 521, 538). What is particular about these observations?

Sincerely yours,

Have a great day.

To compute the Shapley values DSS draws some random samples (called background rows). They can be drawn from the test set of the model or from the dataset to score (if the option "Use input as explanation basis" is checked in the scoring recipe settings).

This constant "diff" must be equal to the mean of the natural-log-odds. But this mean is not computed on the whole dataset you scored, only on the "background rows".

If you want to verify this, you could make sure that the dataset you want to score is the background rows. For that you could filter the dataset to keep only 100 rows and then make sure in the scoring recipe that the Monte Carlo steps are set to 100 and the option "Use input as explanation basis" is enabled.
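To illustrate why the choice of rows matters, here is a small sketch with simulated probabilities. It does not call DSS: `base_value` is a hypothetical helper that simply averages natural-log-odds, and the random sample only stands in for the background rows.

```python
import math
import random
from statistics import mean

def logit(p):
    """Natural-log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def base_value(probs):
    """Mean of the natural-log-odds over the given rows."""
    return mean([logit(p) for p in probs])

rng = random.Random(0)
# Simulated predicted probabilities for a scored dataset of 1000 rows
scored_probs = [rng.uniform(0.02, 0.98) for _ in range(1000)]
# Stand-in for the background rows: a random sample of 100 rows
background = rng.sample(scored_probs, 100)

base_full = base_value(scored_probs)
base_background = base_value(background)
print(base_full, base_background)  # generally not equal

# If the scored dataset is exactly the background rows, the two baselines coincide
small_scored = list(background)
assert base_value(small_scored) == base_background
```

This mirrors the check suggested above: once the scored dataset and the background rows are the same 100 rows, a mean computed by hand should match the constant.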

About the outliers in the "Diff" column, like Passenger 258, I don't have an explanation for now. I was not able to reproduce this behavior on another dataset. I will have a closer look.

Best regards,

Louis Pouillot
