SHAP: How do I interpret expected values for force_plot?

I'm trying to create a force_plot for my Random Forest model, which has two classes (1 and 2), but I am a bit confused about the parameters for force_plot.
There are two different sets of parameters I can provide:
shap.force_plot(explainer.expected_value[0], shap_values[0], choosen_instance, show=True, matplotlib=True)
expected and SHAP values: 0
shap.force_plot(explainer.expected_value[1], shap_values[1], choosen_instance, show=True, matplotlib=True)
expected and SHAP values: 1
So my questions are:
When creating the force_plot, I must supply expected_value. For my model I have two expected values, [0.20826239, 0.79173761]; how do I know which one to use? My understanding of the expected value is that it is the average prediction of my model on the training data. Are there two values because I have both class_1 and class_2? So for class_1 the average prediction is 0.20826239, and for class_2 it is 0.79173761?
The next parameter is the shap_values, for my chosen instance:
index B G R Prediction
113833 107 119 237 2
I get the following SHAP_values:
[array([[ 0.01705462, -0.01812987, 0.23416978]]),
array([[-0.01705462, 0.01812987, -0.23416978]])]
I don't quite understand why I get two sets of SHAP values. Is one for class_1 and one for class_2? I have been trying to compare the attached plots, given both sets of SHAP values and expected values, but I can't really explain what is going on in terms of the prediction.

Let's try a reproducible example:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from shap.maskers import Independent
from scipy.special import expit, logit
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
Then, your SHAP expected values are:
masker = Independent(data = X_train)
explainer = TreeExplainer(model, data=masker)
ev = explainer.expected_value
ev
array([0.35468973, 0.64531027])
This is what your model would predict on average, given the background dataset (fed to the explainer above):
model.predict_proba(masker.data).mean(0)
array([0.35468973, 0.64531027])
Then, if you have a datapoint of interest:
data_to_explain = X_train[[0]]
model.predict_proba(data_to_explain)
array([[0.00470234, 0.99529766]])
You can achieve exactly the same with SHAP values:
sv = explainer.shap_values(data_to_explain)
np.array(sv).sum(2).ravel()
array([-0.34998739, 0.34998739])
Note that they are symmetrical: whatever increases the chance of class 1 decreases the chance of class 0 by the same amount.
Adding the base values and the summed SHAP values, the probabilities (the chances of the data point ending up in class 0 or class 1) are:
ev + np.array(sv).sum(2).ravel()
array([0.00470234, 0.99529766])
Note that this is the same as the model's predictions.
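To tie this back to the original question: for a two-class model you simply pick the index of the class you want to visualize, and the force plot pushes from that class's base value towards that class's predicted probability. A minimal sketch, assuming the objects built above and the classic list-of-arrays output of shap_values:
import shap
# force plot for class 1: base value ev[1], the SHAP row sv[1][0],
# and the raw feature values of the explained point
shap.force_plot(ev[1], sv[1][0], data_to_explain[0], matplotlib=True, show=True)
Using index 0 instead would show the mirror image, pushing from ev[0] towards the class 0 probability.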

Related

How can I generate a random n-dimensional dataset for my regression task?

I need to generate a random n-dimensional dataset with m tuples. The first four dimensions are expected to be correlated with the ground truth vector y, and the remaining ones should be arbitrarily generated. I will use the dataset for a regression task with scikit-learn. How can I generate this data?
for example:
A dataset where
tuple size(m)=10000
and
dimension size(n)=100
After that, I need to split the dataset such that randomly selected 70% tuples are used for training while 30% tuples are used for testing.
PS: I have found this code in the scikit-learn docs, but I am not sure whether I can use it. How can I translate it to my problem?
x, y, coef = datasets.make_regression(n_samples=100,    # number of samples
                                      n_features=1,     # number of features
                                      n_informative=1,  # number of useful features
                                      noise=10,         # standard deviation of the gaussian noise
                                      coef=True,        # return the true coefficient used to generate the data
                                      random_state=0)   # set for the same data points on each run
make_regression will indeed do the job:
from sklearn import datasets
# your example values:
m = 10000
n = 100
random_seed = 42 # for reproducibility, exact value does not matter
X, y = datasets.make_regression(n_samples=m, n_features=n, n_informative=4, random_state=random_seed)
And train_test_split for splitting:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed) # does not need to be the same random_seed
In both calls, random_seed is set only for reproducibility; its exact value does not matter, nor does it need to be the same in both.
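As a quick sanity check of the shapes produced by the snippet above (illustrative only):
print(X.shape, y.shape)              # (10000, 100) (10000,)
print(X_train.shape, X_test.shape)   # (7000, 100) (3000, 100)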

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and I want to perform multiple linear regression on it.
First, I drop the deliveries column because of its high correlation with miles. Although gasprice is supposed to be removed as well, I keep it so that I can perform multiple linear regression rather than simple linear regression.
Finally, I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# split X and Y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression, and also the intercept
intercept = regression_model.intercept_[0]
coefficient = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the columns and print each coefficient
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later
# rebuild the model using statsmodels for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I get different coefficients every time I print them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so using only 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not split into train and test sets)
sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
In order to evaluate whether any of them are "correct", you will need to evaluate the model on unseen data and look at the adjusted R^2 score, etc. (hence you need the holdout (test) set). If you want to improve the model further, you can try to better understand the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
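For reference, here is a minimal sketch of that three-way comparison. It uses the diabetes regression dataset (not the asker's data, and not iris), so the numbers are illustrative only:
import statsmodels.api as sm
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# 1) sklearn fit on the 80% training split
lr_split = LinearRegression().fit(X_train, y_train)
# 2) statsmodels OLS on the full dataset
ols_full = sm.OLS(y, sm.add_constant(X)).fit()
# 3) sklearn fit on the full dataset
lr_full = LinearRegression().fit(X, y)

print(lr_split.coef_)       # close, but not identical: fit on 80% of the data
print(ols_full.params[1:])  # skip the constant; matches #3 exactly
print(lr_full.coef_)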

setting a range for sklearn linear regression prediction

My issue is similar to this question, but the answer there didn't help me; I need further clarification.
I am using sklearn linear regression prediction -for the first time- to add more data points to my dataset. Adding more data points will help me identify outliers more accurately. I have built my model and got the predictions but I want the model to return predicted points with a certain range. Is it possible to achieve this?
I would like to predict values in a column called 'delivery_fee'.
The values in the column start from 3 and increase steadily until they reach 27.
The last value in the column, which comes right after 27, is 47.
I would like the model to predict values between 27 and 47.
my code:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
#create a copy of the dataframe
delivery_linreg = outlierFileNew.copy()
le = preprocessing.LabelEncoder()
delivery_linreg['branch_code'] = le.fit_transform(delivery_linreg['branch_code'])
#select all columns in the dataframe except for delivery_fee
x = delivery_linreg[[x for x in delivery_linreg.columns if x != 'delivery_fee']]
#selecting delivery_fee as the column to be predicted
y = delivery_linreg.delivery_fee
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
#fitting simple linear regression to training set
linreg = LinearRegression()
linreg.fit(x_train,y_train)
delivery_predict = linreg.predict(x_test)
My model returns values that range from 4 to 17, which is not the range I want. Any suggestions on how to change the predicted range?
Thank you,

Sklearn Linear Regression giving me an inaccurate accuracy reading?

I have an excel file which contains 3 simple columns: Unit Price, Quantity Sold, and Total. The Total is simply obtained by multiplying the unit price by the quantity sold. So I set up a simple sklearn linear regression script to predict this value:
import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection
data = pd.read_excel("Sample.xls")[["Units", "Unit Cost", "Total"]]
predict = "Total"
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.1)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)
print(acc)
I ran the print function 3 times and it gave me the following results:
-1.517292267629657
0.9167778505657976
0.15292336331892453
Why am I getting this and not 100% accuracy? The model should recognise that the prediction is simply the multiplication of the first column by the second.
Linear regression doesn't work that way. Plot the data with matplotlib and see whether you get a straight line for the training data. If your inputs are X1 and X2, your output is X1*X2, which is not of the form a*X1 + b*X2. The model is fitting correctly; what is wrong is applying a linear model to a multiplicative relationship.
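To see this concretely, here is a hedged sketch with synthetic data standing in for the Excel file: the raw features give an R^2 below 1.0, while feeding the product as a single feature gives a perfect linear fit.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
units = rng.integers(1, 100, size=500)
cost = rng.uniform(1, 50, size=500)
total = units * cost

# raw features: total = units*cost is not of the form a*units + b*cost
X_raw = np.column_stack([units, cost])
print(LinearRegression().fit(X_raw, total).score(X_raw, total))  # less than 1.0

# product feature: the relationship is now exactly linear, so R^2 == 1.0
X_prod = (units * cost).reshape(-1, 1)
print(LinearRegression().fit(X_prod, total).score(X_prod, total))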

SVM (probabilistic model) on Python3+: almost the same probability for all cases

I am using the following code to run an SVM (probabilistic model) on Python3+.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from time import time
dum_data = pd.read_csv("dummy_variable_file.csv")
train_cols = dum_data.columns[5:]
target = dum_data.columns[4]
x = dum_data[train_cols]
y = dum_data[target]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=442, train_size=0.70)
clf = svm.SVC(probability = True)
model = clf.fit(x_train,y_train)
output_predict_test = model.predict(x_test)
output_proba_test = model.predict_proba(x_test)
output_decision_test = model.decision_function(x_test)
My data set has 0.24 million rows and 38 dummy features (I have taken care of the dummy trap). I am running this code on 70% of the data (the train set) and applying the model to the remaining 30% (the test set). The code runs fine, with no errors.
Now the problems that I am facing are as follows:
In output_proba_test the model gives me almost the same probabilities for all 71k cases (one of two probabilities for the prediction of "1": 0.856 or 0.144). My target variable has two values: 1 (purchase) and 0 (not purchase).
The model fitting process takes more than 7 hours.
Can anybody explain why this is happening? I have run the same code on the Iris data and it gives me a proper probability distribution, but for my data it is giving me a hard time. Again, just to reiterate: the target variable is "0" or "1"; the independent variables are 37 dummy variables with values of either "1" or "0". I have taken care of the dummy trap. In my overall dataset (0.24 million rows) the number of 1s is 39k. In my train set the number of 1s is 26k out of 0.16 million total rows.
