Hello Dataiku users 🙂
I have a question about interpreting the Shapley values that Dataiku provides for a classification problem. From what I have read about this method, the Shapley value of each feature in our model, for a given individual, is that feature's contribution to explaining the difference between the probability predicted for this individual and the mean probability. Thus, the sum of the Shapley values should equal the difference between the individual's predicted probability and the mean probability.
I would like to understand in detail how Dataiku calculates these Shapley values. Dataiku's documentation says that for classification problems these values are log-odds ratios of the calculated probabilities, and I assume they are not the difference between the log-odds of the predicted probability and the log-odds of the mean probability, as is the case for the partial dependence plot.
I need your help on this. And if you also have other suggestions on how to get the best out of these shapley values, I'm interested :).
As you say, for classification, the Shapley values are computed on the log-odds scale. Therefore, the sum of the Shapley values should be equal to the difference between the log-odds of the predicted probability of that individual and the mean of the log-odds of the probabilities. The algorithm is explained in the documentation: https://doc.dataiku.com/dss/7.0/machine-learning/supervised/explanations.html
There are many ways to use Shapley values. For example:
- you can present them to the final user to support decision making,
- you can look at the Shapley values of the extreme predictions to understand how your model works,
- you can look at Shapley values of large errors to understand why your model is making errors,
- you can average the Shapley values (multiplied by the sign of the difference between the prediction and the average prediction) over all samples to obtain feature importances,
- you can cluster the Shapley values to see which groups of samples are explained in the same way, and so get a better understanding of your model.
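The signed-average idea from the list above can be sketched in plain Python. This is only an illustration under stated assumptions: `signed_importances` is a hypothetical helper (not a Dataiku function), the Shapley values are assumed to be available as one list per row, and the predictions are assumed to be on the same scale as the Shapley values.

```python
from statistics import mean

def signed_importances(shap_rows, predictions):
    """Average each feature's Shapley value, multiplied by the sign of
    (prediction - average prediction), over all rows.

    Hypothetical helper: shap_rows is a list of per-row lists of Shapley
    values, predictions is the matching list of model outputs.
    """
    avg = mean(predictions)
    # +1 when the row is predicted above average, -1 below, 0 when equal
    signs = [(p > avg) - (p < avg) for p in predictions]
    n_features = len(shap_rows[0])
    return [
        mean([sign * row[j] for sign, row in zip(signs, shap_rows)])
        for j in range(n_features)
    ]

# Toy data (made up): two rows, two features
shap_rows = [[1.0, -0.5], [-1.0, 0.5]]
predictions = [0.5, -0.5]
importances = signed_importances(shap_rows, predictions)
```

With this sign convention, a feature that consistently pushes a row further from the average prediction gets a positive score, and one that consistently pushes back toward it gets a negative score.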
@nomont Thank you very much for your answer and especially for your very relevant recommendations on how to use the Shapley values.
However, I'm afraid that the sum of the Shapley values calculated by Dataiku is not equal to "the difference between the log-odds of the predicted probability of that individual and the mean of the log-odds of the probabilities". When I did the calculation, I found that the sum of the Shapley values is greater than this difference, and I can't figure out why.
I see 2 possibilities:
- The mean is computed on a subset of the test dataset (if simple split is selected) or of the training dataset (if K-fold evaluation is selected). It is different from the mean computed on the full test (resp. training) dataset. It is also different from the mean computed on the scored dataset. Note that, in this case, the discrepancy should be the same for all rows.
- The mean of the log-odds of probabilities is different from the log-odds of the mean of probabilities.
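The second possibility is easy to check numerically: because the logit is non-linear, averaging before or after the transform generally gives different numbers. A minimal sketch with made-up probabilities (the `logit` helper and the values are illustrative, not taken from the project):

```python
import math
from statistics import mean

def logit(p):
    """Natural-log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# Made-up predicted probabilities for three rows
probas = [0.05, 0.2, 0.9]

mean_of_logits = mean([logit(p) for p in probas])  # average of the log-odds
logit_of_mean = logit(mean(probas))                # log-odds of the average

print(mean_of_logits, logit_of_mean)
```

The two quantities coincide only in special cases (for example when all probabilities are equal), so it matters which one is used as the baseline.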
@nomont Thank you for your response.
Concerning the first possibility:
- I used K-fold cross-validation. I ran a simple test: for each individual, I computed the difference between the sum of the Shapley values and the log-odds of the predicted probability. This difference is not constant across individuals. That's why I'm asking these questions.
Thank you for sharing this project.
After looking at this project, I managed to get the constant "diff" you are talking about by changing log10 to ln (the natural logarithm) in the prepare recipe step.
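Here is a small synthetic check of why the base of the logarithm matters. It assumes, as discussed in this thread, that each row's natural-log odds equal a shared baseline plus the sum of that row's Shapley values; the baseline and the sums below are made-up numbers, and "diff" is taken here as log-odds minus the sum of the Shapley values.

```python
import math

def sigmoid(x):
    """Inverse of the natural-log logit."""
    return 1.0 / (1.0 + math.exp(-x))

baseline = -0.285                  # illustrative baseline on the ln-odds scale
shap_sums = [-1.2, 0.0, 0.8, 2.0]  # made-up per-row sums of Shapley values

# Probabilities consistent with: ln-odds = baseline + sum of Shapley values
probas = [sigmoid(baseline + s) for s in shap_sums]

# "diff" computed with the natural log: the same value for every row
diff_ln = [math.log(p / (1 - p)) - s for p, s in zip(probas, shap_sums)]

# "diff" computed with log10: varies from row to row
diff_log10 = [math.log10(p / (1 - p)) - s for p, s in zip(probas, shap_sums)]

print(diff_ln)
print(diff_log10)
```

With ln, every row gives back the baseline. With log10, the log-odds are divided by ln(10), so the result depends on the row.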
Dear @louisplt ,
Thank you very much for your response, I really appreciate it.
Yes, indeed, using the natural logarithm gives the solution. It seems that the constant "Diff" is therefore the mean ln-odds of the probabilities, used as the reference for comparing an individual prediction with what the model gives when we have no information about that particular individual.
Or is the constant "Diff" in fact the ln-odds of the mean probability, and not the mean of the ln-odds of the probabilities? Can you confirm this?
In the project, the constant "Diff" is equal to "-0.285" for the vast majority of individuals, but for some others it is equal to "2.51" (see Passenger Id = 258, 521, 538). What is particular about these observations?
Have a great day.
To compute the Shapley values DSS draws some random samples (called background rows). They can be drawn from the test set of the model or from the dataset to score (if the option "Use input as explanation basis" is checked in the scoring recipe settings).
This constant "diff" must be equal to the mean of the natural-log odds. But this mean is not computed on the whole dataset you scored, only on the "background rows".
If you want to verify this, you could make sure that the dataset you want to score is the background rows. For that you could filter the dataset to keep only 100 rows and then make sure in the scoring recipe that the Monte Carlo steps are set to 100 and the option "Use input as explanation basis" is enabled.
About the outliers in the "Diff" column like Passenger 258, I don't have any explanations for now. I was not able to reproduce this behavior on another dataset. I will have a closer look.
Quick follow-up: for passenger 258, the difference comes from the fact that probabilities are clipped when lower than 0.01 or higher than 0.99 before the natural-log (Napierian) logit is computed.
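To illustrate the effect of this clipping, here is a minimal sketch. `clipped_logit` is a hypothetical helper written for this post, not DSS code; only the 0.01 and 0.99 bounds come from the explanation above.

```python
import math

def clipped_logit(p, low=0.01, high=0.99):
    """Natural-log odds after clipping p into [low, high].

    Hypothetical helper: the bounds follow the thread, the rest is an
    assumption about how the clipping is applied.
    """
    p = min(max(p, low), high)
    return math.log(p / (1.0 - p))

p = 0.999                          # made-up, very confident prediction
raw = math.log(p / (1.0 - p))      # logit without clipping
clipped = clipped_logit(p)         # logit after clipping to 0.99

print(raw, clipped, raw - clipped)
```

For rows with a predicted probability outside [0.01, 0.99], a "Diff" recomputed from the unclipped probability is shifted by the gap between the raw and the clipped logit, which is why those rows stand out from the others.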