Explanation of SHAP values from XGBoost - Python

I have fitted an XGBoost model for binary classification. I am trying to understand the fitted model and am using SHAP to explain its predictions.
However, I am confused by the force plot generated by SHAP. I expected the output value to be smaller than 0, since the predicted probability is less than 0.5; however, the SHAP output value shows 8.12.
Below is my code to generate the result.
import shap
import xgboost as xgb
import json
from scipy.sparse import load_npz
print('Version of SHAP: {}'.format(shap.__version__))
print('Version of XGBoost: {}'.format(xgb.__version__))
Version of SHAP: 0.39.0
Version of XGBoost: 1.4.1
# Read the data
X = load_npz('test_data.npz')
X_dmatrix = xgb.DMatrix(X)
# Read the selected features
with open('feature_list.json', 'r') as file:
    feature_list = json.load(file)
feature_names = [f'Feature {x:04d}' for x in range(len(feature_list))]
# Read the XGBoost model
xgb_model = xgb.Booster()
xgb_model.load_model('xgboost.json')
# Model prediction
model_pred_detail = xgb_model.predict(X_dmatrix, pred_contribs=True)
model_pred_prob = xgb_model.predict(X_dmatrix)
model_pred_detail.shape
(7887, 501)
# Random select a case
xid=4549
print('Predict proba: {:.04f}'.format(model_pred_prob[xid]))
Predict proba: 0.2292
# Doing SHAP way (https://github.com/slundberg/shap)
explainer = shap.Explainer(xgb_model, feature_names=feature_names, algorithm='tree')
shap_values = explainer(X.toarray())
shap.plots.force(shap_values[xid])
However, I get a different plot if I use the SHAP values returned by the XGBoost library itself, and that one looks closer to my expectation.
shap.force_plot(
    model_pred_detail[xid, -1],    # From XGBoost.Booster.predict with pred_contribs=True
    model_pred_detail[xid, 0:-1],  # From XGBoost.Booster.predict with pred_contribs=True
    feature_names=feature_names,
    features=X[xid].toarray()
)
Why does this happen? Which one should be the correct SHAP value to explain the XGBoost model?
Thank you for your help.
Follow-up to the reply from @sergey-bushmanov:
Since I cannot share my own data, I reproduce the situation with an open dataset from Kaggle.
Here is the code for model training:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
import xgboost as xgb
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
from matplotlib import pyplot
import io
from scipy.sparse import save_npz
# parameter setting
class_weight = 10
minNgramLength = 1
maxNgramLength = 3
binary = False
min_df = 20
# Convert to fix the problem of encoding
with open('Corona_NLP_train.csv', 'rb') as file:
    csv_file = file.read()
csv_file2 = csv_file.decode('utf-8', errors='replace')
# Read and split data
df_note = pd.read_csv(io.StringIO(csv_file2), encoding='utf-8')
df_note['label'] = np.where(df_note['Sentiment'].str.contains('negative', flags=re.I), 0, 1)
df_train, df_test = train_test_split(df_note, test_size=0.2, random_state=42)
# Tokenization
vectorizer = CountVectorizer(max_df=0.98,
min_df=min_df,
binary=binary,
ngram_range=(minNgramLength, maxNgramLength))
vectorizer.fit(df_train['OriginalTweet'])
X_train = vectorizer.transform(df_train['OriginalTweet']).astype(float)
y_train = df_train['label'].astype(float).reset_index(drop=True)
last_params ={
'lambda': 0.00016096144192346114,
'alpha': 0.057770973181367063,
'eta': 0.19258319097144733,
'gamma': 0.40032424821976653,
'max_depth': 9,
'min_child_weight': 5,
'subsample': 0.31304772813494836,
'colsample_bytree': 0.4214452441229869,
'objective': 'binary:logistic',
'verbosity': 0,
'n_estimators': 400
}
classifierCV = xgb.XGBClassifier(**last_params, importance_type='gain')
# w_train was missing from the original snippet; assuming the class weight is applied to the positive class
w_train = np.where(y_train == 1, class_weight, 1)
classifierCV.fit(X_train, y_train, sample_weight=w_train)
# Get the features
feature_names = vectorizer.get_feature_names()
# save model
classifierCV.get_booster().save_model('xgboost.json')
# Save features
import json
with open('feature_list.json', 'w') as file:
    file.write(json.dumps({y: x for x, y in enumerate(feature_names)}))
# save data
save_npz('test_data.npz', X_train)
The problem is still here with this model.

Which one should be the correct SHAP value to explain the XGBoost model?
Let's assume you have a binary classification problem at hand. Then what you're getting in your 2nd example is indeed the right decomposition of raw SHAP values:
In [1]: from scipy.special import expit
In [2]: expit(-1.21)
Out[2]: 0.22970105095339813
Note that .2297 is close to what you see in your output:
Predict proba: 0.2292
As for:
Why does this happen?
most probably you have a typo somewhere, but to be sure you would have to provide a fully reproducible example, including your data, because code-wise both ways of calculating SHAP values are correct.
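As a sanity check, here is a minimal sketch reusing the variable names from the question (assuming the model was trained with the binary:logistic objective): the per-feature contributions returned by pred_contribs=True plus the bias column should sum to the raw log-odds, and applying the sigmoid (expit) to that sum should reproduce the predicted probability.
from scipy.special import expit

# Columns 0..n-1 of model_pred_detail are per-feature contributions,
# the last column is the bias (expected value in raw log-odds space).
raw_margin = model_pred_detail[xid].sum()   # contributions + bias = raw log-odds
prob_from_shap = expit(raw_margin)          # sigmoid maps log-odds to probability

print('Raw margin (log-odds): {:.4f}'.format(raw_margin))
print('Probability from SHAP sum: {:.4f}'.format(prob_from_shap))
print('Model predicted probability: {:.4f}'.format(model_pred_prob[xid]))
# The last two numbers should agree (up to floating-point error) when the
# SHAP decomposition is consistent with the model's raw output.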

Related

How to use a GradientBoostingRegressor in scikit-learn with 3 output dimensions

I am trying to map 13-dimensional input data to 3-dimensional output data by using RandomForest and GradientBoostingRegressor of scikit-learn. While for the RandomForest regressor this works fine, I get a ValueError for the GradientBoostingRegressor stating ValueError: y should be a 1d array, got an array of shape (16127, 3) instead.
I don't really understand why I get this error when using GradientBoostingRegressor and not when using RandomForestRegressor. As far as I understand, both of them use decision trees as weak learners and combine them to get a good result. Of course I know that I could transform the 3-dimensional output labels into a 1-dimensional array, but this does not make sense as I want to map to a 3-dimensional output vector. Any idea how I can do this using the GradientBoostingRegressor?
Here is my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
# Read data from csv files
Input_data_features = pd.read_csv("C:/Users/wi9632/Desktop/TestData_InputFeatures.csv", sep=';')
Input_data_labels = pd.read_csv("C:/Users/wi9632/Desktop/TestData_OutputLabels.csv", sep=';')
Input_data_features = Input_data_features.values
Input_data_labels = Input_data_labels.values
# standardize input features X and output labels Y
scaler_standardized_X = StandardScaler()
Input_data_features = scaler_standardized_X.fit_transform(Input_data_features)
scaler_standardized_Y = StandardScaler()
Input_data_labels = scaler_standardized_Y.fit_transform(Input_data_labels)
# Split dataset into train, validation, and test
index_X_Train_End = int(0.7 * len(Input_data_features))
index_X_Validation_End = int(0.9 * len(Input_data_features))
X_train = Input_data_features[0: index_X_Train_End]
X_valid = Input_data_features[index_X_Train_End: index_X_Validation_End]
X_test = Input_data_features[index_X_Validation_End:]
Y_train = Input_data_labels[0: index_X_Train_End]
Y_valid = Input_data_labels[index_X_Train_End: index_X_Validation_End]
Y_test = Input_data_labels[index_X_Validation_End:]
#Define a random forest model and train it
model_randomForest = RandomForestRegressor( )
model_randomForest.fit(X_train, Y_train)
#Predict the test data with Random Forest
Y_pred_randomForest = model_randomForest.predict(X_test)
print(f"Random Forest Prediction: {Y_pred_randomForest}")
#Define a gradient boosting model and train it (-->Here I get the ValueError)
model_gradientBoosting = GradientBoostingRegressor( )
model_gradientBoosting.fit(X_train, Y_train)
#Predict the test data with Gradient Boosting
Y_pred_gradientBoosting = model_gradientBoosting.predict(X_test)
print(f"Gradient Boosting Prediction: {Y_pred_gradientBoosting}")
Here is the test data: https://filetransfer.io/data-package/ABCrGPzt#link
Reminder: As I could not solve my problem yet, I would like to remind you of this question. Does anybody have an idea how to tackle it?
RandomForestRegressor supports multi-output regression, see the docs. GradientBoostingRegressor does not.
You can use MultiOutputRegressor + GradientBoostingRegressor for the problem. See this answer.
from sklearn import ensemble
from sklearn.multioutput import MultiOutputRegressor

params = {'n_estimators': 5000, 'max_depth': 4, 'min_samples_split': 2, 'min_samples_leaf': 2}
estimator = MultiOutputRegressor(ensemble.GradientBoostingRegressor(**params))
estimator.fit(train_data, train_targets)  # train_data / train_targets: your training features and multi-dimensional targets
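Applied to the variable names from your code, a sketch could look like this (MultiOutputRegressor simply fits one GradientBoostingRegressor per output column, three in your case):
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# One GradientBoostingRegressor is fitted per output dimension
model_gradientBoosting = MultiOutputRegressor(GradientBoostingRegressor())
model_gradientBoosting.fit(X_train, Y_train)  # Y_train has shape (n_samples, 3)

# Predict the test data with Gradient Boosting
Y_pred_gradientBoosting = model_gradientBoosting.predict(X_test)
print(f"Gradient Boosting Prediction: {Y_pred_gradientBoosting}")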

How to properly use SMOTE for data balancing

I wanted to know whether SMOTE should be applied only after splitting the data into train and test sets. I used SMOTE after train_test_split for churn prediction, but have not seen any significant improvement pre- or post-SMOTE. Below is my entire code using SMOTE; I am not sure where the issue is and would like to know whether I used SMOTE properly.
Below is the code
import pandas as pd
import numpy as np
from datetime import timedelta,datetime,date
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from numpy import percentile
tel_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
tel_data.info()
tel_data.isnull().sum()
num = {"No":0,"Yes":1}
tel_data = tel_data.replace({"Churn":num})
# TotalCharges is read as an object column; converting it to numeric
tel_data['TotalCharges'] = pd.to_numeric(tel_data['TotalCharges'])
tel_data.head(2)
tel_data['Churn'].value_counts()
plt.figure(figsize=(6,5))
sns.countplot(tel_data['Churn'])
plt.show()
# using pd.to_numeric to convert the TotalCharges column to numeric will help us see the null values
tel_data.TotalCharges = pd.to_numeric(tel_data.TotalCharges, errors="coerce")
tel_data.isnull().sum()
# deleting the rows with null values
tel_data = tel_data.dropna(axis=0)
# encoding all categorical variables using one hot encoding
tel_data = pd.get_dummies(tel_data,drop_first=True,columns=['gender','Partner','Dependents',
'PhoneService','MultipleLines','InternetService',
'OnlineSecurity','OnlineBackup','DeviceProtection',
'TechSupport','StreamingTV','StreamingMovies',
'Contract','PaperlessBilling','PaymentMethod'])
# splitting the dataset (removing 'customerID' since it doesn't serve any purpose)
X = tel_data.drop(['customerID','Churn'],axis=1)
y = tel_data['Churn']
# performing feature selection using chi2 test
from sklearn.feature_selection import chi2
chi_scores = chi2(X,y)
print('chi_values:',chi_scores[0],'\n')
print('p_values:',chi_scores[1])
p_values = pd.Series(chi_scores[1],index = X.columns)
p_values.sort_values(ascending = False , inplace = True)
plt.figure(figsize=(12,8))
p_values.plot.bar()
plt.show()
tel_data.drop(['PhoneService_Yes','gender_Male','MultipleLines_No phone service','MultipleLines_Yes','customerID'],axis=1,inplace=True)
tel_data.head(2)
# splitting the dataset into features and target ('customerID' was already dropped above)
X = tel_data.drop(['Churn'],axis=1)
y = tel_data['Churn']
# import sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb
from sklearn.metrics import accuracy_score
# splitting into train and test data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42,stratify=y)
model_xgb_1 = xgb.XGBClassifier(n_estimators=100,
learning_rate=0.3,
max_depth=5,
random_state=42 )
xgbmod = model_xgb_1.fit(X_train,y_train)
# checking accuracy of training data
print('Accuracy of XGB classifier on training set: {:.2f}'
.format(xgbmod.score(X_train, y_train)))
y_xgb_pred = xgbmod.predict(X_test)
print(classification_report(y_test, y_xgb_pred))
from imblearn.over_sampling import SMOTE
smote_preprocess = SMOTE(random_state=42)
X_train_resampled,y_train_resampled = smote_preprocess.fit_resample(X_train,y_train)
model_xgb_smote = xgb.XGBClassifier(n_estimators=100,
learning_rate=0.3,
max_depth=5,
random_state=42 )
xgbmod_smote = model_xgb_smote.fit(X_train_resampled,y_train_resampled)
# checking accuracy of training data
print('Accuracy of XGB classifier on training set: {:.2f}'
.format(xgbmod_smote.score(X_train_resampled,y_train_resampled)))
y_xgb_pred_smote = xgbmod_smote.predict(X_test)
print(classification_report(y_test, y_xgb_pred_smote))

Using Hyper-parameters from H2O to re-build XGBoost in Sklearn gives Difference Performance in Python

After using the H2O Python module's AutoML, I found that XGBoost was at the top of the leaderboard. I then tried to extract the hyper-parameters from the H2O XGBoost model and replicate them with the XGBoost Sklearn API. However, the performance differs between these two approaches:
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import classification_report
import xgboost as xgb
import scikitplot as skplt
import h2o
from h2o.automl import H2OAutoML
import numpy as np
import pandas as pd
h2o.init()
iris = datasets.load_iris()
X = iris.data
y = iris.target
data = pd.DataFrame(np.concatenate([X, y[:,None]], axis=1))
data.columns = iris.feature_names + ['target']
data = data.sample(frac=1)
# data.shape
train_df = data[:120]
test_df = data[120:]
# Import a sample binary outcome train/test set into H2O
train = h2o.H2OFrame(train_df)
test = h2o.H2OFrame(test_df)
# Identify predictors and response
x = train.columns
y = "target"
x.remove(y)
# For classification, the response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
aml = H2OAutoML(max_models=10, seed=1, nfolds = 3,
keep_cross_validation_predictions=True,
exclude_algos = ["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
m = h2o.get_model([mid for mid in model_ids if "XGBoost" in mid][0])
# m.params.keys()
Performance of H2O Xgboost
skplt.metrics.plot_confusion_matrix(test_df['target'],
m.predict(test).as_data_frame()['predict'],
normalize=False)
Replicate in XGBoost Sklearn API:
mapping_dict = {
"booster": "booster",
"colsample_bylevel": "col_sample_rate",
"colsample_bytree": "col_sample_rate_per_tree",
"gamma": "min_split_improvement",
"learning_rate": "learn_rate",
"max_delta_step": "max_delta_step",
"max_depth": "max_depth",
"min_child_weight": "min_rows",
"n_estimators": "ntrees",
"nthread": "nthread",
"reg_alpha": "reg_alpha",
"reg_lambda": "reg_lambda",
"subsample": "sample_rate",
"seed": "seed",
# "max_delta_step": "score_tree_interval",
# 'missing': None,
# 'objective': 'binary:logistic',
# 'scale_pos_weight': 1,
# 'silent': 1,
# 'base_score': 0.5,
}
parameter_from_water = {}
for item in mapping_dict.items():
    parameter_from_water[item[0]] = m.params[item[1]]['actual']
# parameter_from_water
xgb_clf = xgb.XGBClassifier(**parameter_from_water)
xgb_clf.fit(train_df.drop('target', axis=1), train_df['target'])
Performance of Sklearn XGBoost:
(always worse than H2O in all examples I tried.)
skplt.metrics.plot_confusion_matrix(test_df['target'],
xgb_clf.predict(test_df.drop('target', axis=1) ),
normalize=False);
Anything obvious that I missed?
When you use H2O auto ml with the following lines of code :
aml = H2OAutoML(max_models=10, seed=1, nfolds = 3,
keep_cross_validation_predictions=True,
exclude_algos = ["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)
you use the option nfolds = 3, which means each algorithm is trained three times, each time using two thirds of the data for training and the remaining third for validation. This makes the algorithm more stable and can sometimes yield better performance than training on the entire training dataset in one go.
Training on everything in one go is what you do when you train your XGBoost using fit(). So even though you have the same algorithm (XGBoost) with the same hyperparameters, you don't use the training set the same way H2O does. Hence the difference in your confusion matrices!
If you want to get the same performance when copying the best model, you can change the parameter to H2OAutoML(..., nfolds = 0).
Furthermore, H2O takes into account approximately 60 different parameters, and you missed a few important ones in your dictionary, such as min_child_weight. So your XGBoost is not exactly the same as your H2O model, which could explain the differences in performance.
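For example, a sketch of the suggested change (the rest of the setup from the question stays the same; depending on your H2O version, disabling cross-validation may require supplying a validation or leaderboard frame):
# Disable cross-validation so the leader models are trained on the full training frame,
# which makes them easier to reproduce with the sklearn API
aml = H2OAutoML(max_models=10, seed=1, nfolds=0,
                keep_cross_validation_predictions=False,
                exclude_algos=["GLM", "DeepLearning", "DRF", "GBM"])
aml.train(x=x, y=y, training_frame=train)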

How to implement a Hyperparameter using a MNIST dataset

I am currently running a program in Jupyter notebook to classify MNIST dataset.
I am trying to use the KNN classifier to do this and it is taking more than an hour to run. I am new to classifiers and hyper-parameters, and there do not seem to be any decent tutorials on how to properly implement one. Could anyone give me some tips on how to use a hyper-parameter for this classification? I have searched and seen GridSearchCV and RandomizedSearchCV. From their examples, it seems they select different attribute names and alter them to the ones necessary for their code. I do not understand how this can be done for the MNIST dataset if the data is just handwritten digits. Seeing that there are only digits, is there even a need for a hyperparameter in this case? This is the code that I am currently still running. Thank you for any help you can provide.
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"
def save_fig(fig_id, tight_layout=True):
    image_dir = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
    if not os.path.exists(image_dir):
        os.makedirs(image_dir)
    path = os.path.join(image_dir, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)
def sort_by_target(mnist):
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]
try:
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784', version=1, cache=True)
    mnist.target = mnist.target.astype(np.int8)  # fetch_openml() returns targets as strings
    sort_by_target(mnist)  # fetch_openml() returns an unsorted dataset
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
mnist["data"], mnist["target"]
mnist.data.shape
X, y = mnist["data"], mnist["target"]
X.shape
y.shape
#select and display some digit from the dataset
import matplotlib
import matplotlib.pyplot as plt
some_digit_index = 7201
some_digit = X[some_digit_index]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary,
interpolation="nearest")
plt.axis("off")
save_fig("some_digit_plot")
plt.show()
#print some digit's label
print('The ground truth label for the digit above is: ',y[some_digit_index])
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
#random shuffle
import numpy as np
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3, n_jobs=-1)
from sklearn.metrics import f1_score
f1_score(y_multilabel, y_train_knn_pred, average="macro")
The most popular hyperparameter for KNN would be n_neighbors, that is, how many nearest neighbors you consider to assign a label to a new point. By default, it is set to 5, but it might not be the best choice. Therefore it is often better to find what the best choice is for your specific problem.
This is how you would find the optimal hyperparameter for your example:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {"n_neighbors" : [3,5,7]}
KNN=KNeighborsClassifier()
grid=GridSearchCV(KNN, param_grid = param_grid , cv = 5, scoring = 'accuracy', return_train_score = False)
grid.fit(X_train,y_train)
What this does is compare the performance of your KNN model for the different values of n_neighbors that you set. Then when you run:
print(grid.best_score_)
print(grid.best_params_)
it will show you what the best performance score was, and for which choice of parameters it was achieved.
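If you then want to use the tuned model, here is a small sketch (reusing the X_test/y_test split from your code; by default GridSearchCV refits the best model on the full training set):
from sklearn.metrics import accuracy_score

# best_estimator_ is the KNN model refitted with the best n_neighbors found by the search
best_knn = grid.best_estimator_
y_pred = best_knn.predict(X_test)
print('Test accuracy: {:.4f}'.format(accuracy_score(y_test, y_pred)))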
All this has nothing to do with the fact that you are using MNIST data. You could use this approach for any other classification task, as long as you think KNN might be a sensible choice for your task (which might be arguable for image classification). The only thing that would change from one task to another is the optimal value of the hyperparameters.
PS: I would advise against using the y_multilabel terminology, as this might refer to a specific classification task where each data point could have several labels, which is not the case in MNIST (each image represents only one digit at a time).

Python Machine Learning

I am trying to predict my server load, but I am getting below 10% accuracy. I am using Linear Regression to predict the data; can anyone help me out?
P.S. The csv file contains date and time, so I converted both to integers. I am not sure I am doing this right.
These are my Codes:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mpl_toolkits
import imp
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
data = pd.read_csv(".....\\Machine_Learning_Serious\\Server_Prediction\\testing_server.csv")
describe = data.describe()
data_cleanup = {"Timestamp":{'AM': 0, 'PM': 1},
"Function":{'ccpl_db01': 0, 'ccpl_fin01': 1, 'ccpl_web01': 2},
"Type": {'% Disk Time': 0, 'CPU Load': 1, 'DiskFree%_C:': 2, 'DiskFree%_D:': 3, 'DiskFree%_E:': 4, 'FreeMemory': 5, 'IIS Current Connections': 6, 'Processor Queue Length': 7, 'SQL_Buffer cache hit ratio': 8, 'SQL_User Connections': 9}}
data.replace(data_cleanup,inplace = True)
final_data = data.head()
#print(final_data)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
labels = data['Data']
train1 = data.drop(['Data'], axis = 1)
from sklearn.model_selection import train_test_split
from sklearn import ensemble
x_train , x_test , y_train , y_test = train_test_split(train1, labels, test_size = 0.25, random_state = 2)
#clf = ensemble.GradientBoostingRegressor(n_estimators= 400 , max_depth = 5,min_samples_split = 2, learning_rate = 0.1,loss='ls')
fitting = reg.fit(x_train,y_train)
score = reg.score(x_test,y_test)
The main objective is to predict the correct load, but right now I am way off.
Maybe do some exploratory data analysis first to see if you can find a pattern between your target variable and your features.
It would also help to extract some features from your date/time variables rather than using them as raw integers (for example weekday or not, season, hour of day).
You can also try transforming your features (log, sqrt) to see if the score improves.
Finally, I would suggest trying a simple random forest/XGBoost model to check how it performs against the linear regression model, as in the sketch below.
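A minimal sketch of both ideas, reusing the data DataFrame from your code (the 'Date' column name is hypothetical; substitute your actual date/time column before the integer conversion step):
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Derive calendar features from a datetime column instead of a raw integer encoding
data['Date'] = pd.to_datetime(data['Date'])
data['hour'] = data['Date'].dt.hour
data['weekday'] = data['Date'].dt.weekday
data['is_weekend'] = (data['weekday'] >= 5).astype(int)

# Same target/feature split as before (categorical columns already mapped to integers)
labels = data['Data']
train1 = data.drop(['Data', 'Date'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.25, random_state=2)

# Tree-based baseline to compare against the linear regression
rf = RandomForestRegressor(n_estimators=200, random_state=2)
rf.fit(x_train, y_train)
print('Random forest R^2 on the test set:', rf.score(x_test, y_test))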
