When you interpret the output of a predictive model, users need insights that are easy to understand. One way to provide them is to favor simple models over complex ones: linear models, for instance, are easy to interpret. With large amounts of data, however, complex models often deliver better accuracy, which brings the trade-off between accuracy and interpretability to the forefront. Many methods have been proposed to explain the predictions of complex models, but it is often unclear how these methods relate to each other or when one method is preferable to another.
SHAP addresses this by bringing earlier explanation methods together in a unified framework built on Shapley values. Below, you will find a definition of SHAP and see how to apply it with the Python package.
What is SHAP?
SHapley Additive exPlanations, or SHAP, is an approach rooted in game theory that explains the output of a machine learning model. It connects optimal credit allocation with local explanations by using Shapley values from game theory.
How to Calculate Shapley Values
The Shapley value of a feature is its average marginal contribution across all possible combinations of features. This idea is what has made SHAP (SHapley Additive exPlanations) a popular technique in machine learning. The following example illustrates the concept.
Consider the points a team scores in each match of a season. Suppose we want to find how much Player A contributes, on average, to the team’s score in a match. To do that, we need to measure Player A’s contribution when playing alongside Player B and Player C.
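For reference, the standard game-theory definition behind this example can be written out as follows. The notation is not part of the text above and is introduced here only for illustration: N is the set of players (A, B and C), S is any group of players that excludes player i, and v(S) is the team’s score when only the players in S take part.

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,\bigl(|N| - |S| - 1\bigr)!}{|N|!}
  \Bigl[ v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr]
```

The bracketed term is the marginal contribution of player i to the group S, and the fraction weights each group so that the result is an average over every order in which the players could join the team.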
NOTE
While you perform the experiment, assume the following about the matches:
- The players’ trials are already complete.
- Each player has played in at least one match, so the result rests on relevant data.
- There is a match in which one player is unavailable while the other two are available.
This is only an example. You can use any metric that suits the tournament’s ranking. Here, the metric is total points:
Step 1: Player A is not playing, but Players B and C are playing together.
In this case, take the average points across those matches. Because Player A is not playing, the average reflects only the play of Players B and C. You can also take a single random sample instead of an average. In this example, assume the average total score is 60 points.
Step 2: Player C is not playing, but Players A and B are playing together.
Now consider the average for the matches where Players A and B play while Player C does not. Suppose the team’s total score is 90 points.
Because every player appeared in at least one of these matches, we can now estimate Player A’s contribution by subtracting 60 from 90, which gives 30 points. You can also repeat the experiment several times, average the results, and take the difference. A full Shapley value averages such differences over every possible combination of players.
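The idea can be sketched in a few lines of Python. This is a minimal illustration, not the SHAP library’s implementation: it goes through every order in which the three players could join the team and averages each player’s marginal contribution. Only the scores for B and C together (60) and for A and B together (90) come from the example. Every other line-up score, along with the names COALITION_POINTS and shapley_values, is a made-up assumption.

```python
from itertools import permutations

PLAYERS = ("A", "B", "C")

# Average team points for each line-up.
# Only {B, C} = 60 and {A, B} = 90 come from the example above;
# all other numbers are hypothetical and chosen for illustration.
COALITION_POINTS = {
    frozenset(): 0,
    frozenset("A"): 40,
    frozenset("B"): 30,
    frozenset("C"): 20,
    frozenset("AB"): 90,
    frozenset("AC"): 70,
    frozenset("BC"): 60,
    frozenset("ABC"): 120,
}


def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for order in orderings:
        lineup = frozenset()
        for player in order:
            with_player = lineup | {player}
            # Points gained by adding this player to the current line-up.
            totals[player] += value[with_player] - value[lineup]
            lineup = with_player
    return {player: totals[player] / len(orderings) for player in players}


points = shapley_values(PLAYERS, COALITION_POINTS)
for player, share in points.items():
    print(f"Player {player}: {share:.2f} points")
```

With these assumed scores, the three shares add up to the score of the full team. That same property is what lets SHAP split a single prediction among the features of a model.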
Implementation in Code
First, import all the necessary libraries with the following code:
import pandas as pd
import numpy as np
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import tree
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Pre-process the Data after Reading
The following example uses a real estate dataset, but you can apply this method to any dataset. Because this is only an illustration, imputation and pre-processing are kept to a minimum. On a real project, follow the complete procedure:
# Read the data
data = pd.read_csv('data.csv')
# Remove features with high null values
data.drop(['PoolQC', 'MiscFeature', 'Fence', 'FireplaceQu',
           'LotFrontage'], inplace=True, axis=1)
# Drop null values
data.dropna(inplace=True)
# Prepare X and Y
X = pd.get_dummies(data)
X.drop(['SalePrice'], inplace=True, axis=1)
y = data['SalePrice']
Fit your Model
In this step, fit the model to the dataset:
model = XGBRegressor(n_estimators=1000, max_depth=10, learning_rate=0.001)
# Fit the Model
model.fit(X, y)
Important Features of SHAP Values
Now use the SHAP library. It is a powerful library for explaining model predictions, and it offers several plots worth exploring.
• First, load the JS visualization code into your notebook.
# Load JS visualization code into the notebook
shap.initjs()
• Now you can explain the predictions of your model.
• Start by creating the explainer and computing the SHAP values.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
Plot the Results
Force Plotting
i = 5
shap.force_plot(explainer.expected_value, shap_values[i], features=X.iloc[i], feature_names=X.columns)
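A force plot relies on the additive structure of SHAP values: the base value plus the per-feature contributions equals the model’s prediction for that row. The sketch below checks this by hand for a toy linear model, where, assuming the features are independent, each contribution is the weight times the feature’s distance from its average. The model, the weights, the three data rows, and the helper names predict and linear_shap are all hypothetical and unrelated to the real estate data or the XGBoost model above.

```python
from statistics import mean

# Hypothetical toy linear model and data, invented for illustration only.
WEIGHTS = {"area": 100.0, "rooms": 5000.0}
INTERCEPT = 20000.0
TRAINING_ROWS = [
    {"area": 800.0, "rooms": 2.0},
    {"area": 1200.0, "rooms": 3.0},
    {"area": 1600.0, "rooms": 4.0},
]


def predict(row):
    """Prediction of the toy linear model for one row."""
    return INTERCEPT + sum(WEIGHTS[name] * row[name] for name in WEIGHTS)


def linear_shap(row, background):
    """Contribution of each feature for a linear model, assuming the
    features are independent: weight * (value - average value)."""
    feature_means = {
        name: mean(r[name] for r in background) for name in WEIGHTS
    }
    return {
        name: WEIGHTS[name] * (row[name] - feature_means[name])
        for name in WEIGHTS
    }


# Base value: the average model output over the training data.
base_value = mean(predict(r) for r in TRAINING_ROWS)

row = TRAINING_ROWS[2]
contributions = linear_shap(row, TRAINING_ROWS)
reconstructed = base_value + sum(contributions.values())

print("base value:   ", base_value)
print("contributions:", contributions)
print("prediction:   ", predict(row), "reconstructed:", reconstructed)
```

Tree-based models such as the one fitted above need a more involved calculation, which is what shap.TreeExplainer handles, but the same additive reading applies to the plot.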
Conclusion
With the explanation above, you can see which features contribute to your model’s output and push it away from the base value. The base value is the average model output over the training data.