I want to perform residual analysis, and I know that residuals equal the observed values minus the predicted ones. But I don't know: should I calculate the residuals from the training set or the test set?
Should I use this:
import statsmodels.api as sm
# Making predictions
lm = sm.OLS(y_train, X_train).fit()
y_pred = lm.predict(X_train)
resid = y_train - y_pred.to_frame('price')
OR this:
import statsmodels.api as sm
# Making predictions
lm = sm.OLS(y_train, X_train).fit()
y_pred = lm.predict(X_test)
resid = y_test - y_pred.to_frame('price')
The residuals should be computed from the actual values (the expected outcome) y_test of the test set and the values the fitted model predicts for X_test. The model is fitted to the training set, and its accuracy is then assessed on the test set. This is the intuition behind formally calling the two datasets train (for training) and test (for testing) in the first place.
Specifically, use the second case:
resid = y_test - y_pred.to_frame('price')
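As a quick illustration of the residual analysis itself, here is one way to plot those test-set residuals against the predicted values (a minimal sketch; it assumes resid is a single-column frame named 'price', as above, and that matplotlib is available):
import matplotlib.pyplot as plt
# residuals vs. predicted values: the points should scatter randomly around zero
plt.scatter(y_pred, resid['price'])
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted price')
plt.ylabel('Residual (observed - predicted)')
plt.title('Residuals on the test set')
plt.show()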
I'm trying to create a force_plot for my Random Forest model that has two classes (1 and 2), but I am a bit confused about the parameters for the force_plot.
There are two different sets of parameters I can provide to force_plot:
# using the expected value and SHAP values at index 0:
shap.force_plot(explainer.expected_value[0], shap_values[0], choosen_instance, show=True, matplotlib=True)
# using the expected value and SHAP values at index 1:
shap.force_plot(explainer.expected_value[1], shap_values[1], choosen_instance, show=True, matplotlib=True)
So my questions are:
When creating the force_plot, I must supply expected_value. For my model I have two expected values: [0.20826239 0.79173761], how do I know which to use? My understanding of expected value is that it is the average prediction of my model on train data. Are there two values because I have both class_1 and class_2? So for class_1, the average prediction is 0.20826239 and class_2, it is 0.79173761?
The next parameter is the shap_values, for my chosen instance:
 index    B    G    R  Prediction
113833  107  119  237           2
I get the following SHAP_values:
[array([[ 0.01705462, -0.01812987, 0.23416978]]),
array([[-0.01705462, 0.01812987, -0.23416978]])]
I don't quite understand why I get two sets of SHAP values. Is one for class_1 and one for class_2? I have been trying to compare the images I attached, given both sets of SHAP values and expected values, but I can't really explain what is going on in terms of the prediction.
Let's try a reproducible example:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from shap.maskers import Independent
from scipy.special import expit, logit
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
Then, your SHAP expected values are:
masker = Independent(data = X_train)
explainer = TreeExplainer(model, data=masker)
ev = explainer.expected_value
ev
array([0.35468973, 0.64531027])
This is what your model would predict on average, given the background dataset (the one fed to the explainer above):
model.predict_proba(masker.data).mean(0)
array([0.35468973, 0.64531027])
Then, if you have a datapoint of interest:
data_to_explain = X_train[[0]]
model.predict_proba(data_to_explain)
array([[0.00470234, 0.99529766]])
You can achieve exactly the same with SHAP values:
sv = explainer.shap_values(data_to_explain)
np.array(sv).sum(2).ravel()
array([-0.34998739, 0.34998739])
Note that they are symmetrical, because what increases the chances towards class 1 decreases the chances for class 0 by the same amount.
With the base values and SHAP values, the probabilities (the chances for the data point to end up in class 0 or 1) are:
ev + np.array(sv).sum(2).ravel()
array([0.00470234, 0.99529766])
Note, this is the same as the model's predictions.
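Coming back to the original question: for a force plot of class 1, pick index 1 consistently for both the expected value and the SHAP values. A minimal sketch under the same setup as above:
import shap
# force plot for class 1: starts at the base value ev[1], and the
# SHAP values sv[1] push it to the predicted probability of class 1
shap.force_plot(ev[1], sv[1][0], data_to_explain[0], matplotlib=True)
The two sets of SHAP values are indeed one per class, and by the symmetry noted above, the class 0 plot shows the same forces with opposite signs.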
So I have this small dataset and I want to perform multiple linear regression on it.
First, I drop the deliveries column because of its high correlation with miles. Although gasprice is supposed to be removed too, I keep it so that I can perform multiple linear regression rather than simple linear regression.
Finally, I removed the outliers and did the following:
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# split X and Y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficient = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the columns and print each coefficient
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create an OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coefficients every time I printed them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train and test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
In order to evaluate whether any of them is "correct", you will need to evaluate the model on unseen data and look at the adjusted R^2 score, etc. (hence you need the holdout (test) set). If you want to further improve the model, you can try to better understand the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
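For example, a quick sketch of the formula interface (assuming df holds the miles, gasprice, and hours columns from your question):
import statsmodels.formula.api as smf
# R-like formula: model hours as a function of miles and gasprice
est = smf.ols('hours ~ miles + gasprice', data=df).fit()
print(est.summary())
The formula API adds the intercept for you, unlike sm.OLS, where you need sm.add_constant.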
I'm a bit confused as to what the line model.predict(X) actually predicts. I can't find anything on it with a Google search.
import statsmodels.api as sm
# Step 1) Load data into dataframe
df = pd.read_csv('my_data.csv')
# Step 2) Separate dependent and independent variables
X = df['independent_variable']
y = df["dependent_variable"]
# Step 3) Using OLS, fit a linear regression
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make predictions
predictions
I'm not sure what predictions is showing? Is it predicting the next x amount of rows or something? Aren't I just passing in my independent variables?
You are fitting an OLS model on your data, which is most likely interpreted as an array. The predict method returns an array of fitted values given the trained model.
In other words, from statsmodels documentation:
Return linear predicted values from a design matrix.
This is similar to scikit-learn: after model = sm.OLS(y, X).fit(), you have a fitted model, and predictions = model.predict(X) does not predict the next x rows; it predicts from your X, the training dataset. The model found by ordinary least squares is a function of x, and the output is:
$$ \hat{y}=f(x) $$
If you want predictions for new X, you need to split X into training and testing datasets.
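A minimal sketch of that idea (it assumes sklearn is available for the split, and adds a constant so the fit has an intercept, which sm.OLS does not include on its own):
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# hold out 20% of the rows as unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# fit on the training portion only
model = sm.OLS(y_train, sm.add_constant(X_train)).fit()

# these are genuine out-of-sample predictions for rows the model never saw
predictions = model.predict(sm.add_constant(X_test))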
Actually, you are doing it wrong. The predict method is meant to produce predictions for data the model was not fitted on. After separating the dependent and independent variables, you can split the data into two parts, train and test:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
This makes X_train 80% of the total data, with only the independent variables. You can then pass X_test to the predict method and compare the result against Y_test to check how well the model performs.
I'm trying to compare the RMSE I have from performing multiple linear regression on the full dataset, to that of 10-fold cross-validation, using the KFold module in scikit-learn. I found some code that I tried to adapt, but I can't get it to work (and I suspect it never worked in the first place).
TIA for any help!
Here's my linear regression function
def standRegres(xArr, yArr):
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws
## I run it on my matrix ("comm_df") and my dependent var (comm_target)
## Calculate RMSE (omitted some code)
initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2))
## Now trying to get RMSE after training model through 10-fold cross validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf:
    linreg.fit(comm_df, comm_target)
    p = linreg.predict(comm_df)
    e = p - comm_target
    xval_err += np.sqrt(np.dot(e, e) / len(comm_df))
rmse_10cv = xval_err / 10
I get an error about how the KFold object is not iterable.
There are several things you need to correct in this code.
You cannot iterate over kf itself; you can only iterate over kf.split(comm_df).
You need to actually use the train/test indices that KFold provides; you are not using them in your code! The goal of KFold is to fit your regression on the train observations, and to evaluate the regression (i.e., compute the RMSE in your case) on the test observations.
With this in mind, here is how I would correct your code (it is assumed here that your data is in numpy arrays, but you can easily switch to pandas):
linreg = LinearRegression()
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf.split(comm_df):
    linreg.fit(comm_df[train], comm_target[train])
    p = linreg.predict(comm_df[test])
    e = p - comm_target[test]
    xval_err += np.sqrt(np.dot(e, e) / len(comm_target[test]))
rmse_10cv = xval_err / 10
So the code you provided still threw an error. I abandoned what I had above in favor of the following, which works:
## KFold cross-validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

## Define variables for the for loop
kf = KFold(n_splits=10)
RMSE_sum = 0
RMSE_length = 10
X = np.array(comm_df)
y = np.array(comm_target)

for loop_number, (train, test) in enumerate(kf.split(X)):
    ## Get training matrix and vector
    training_X_array = X[train]
    training_y_array = y[train].reshape(-1, 1)
    ## Get testing matrix values
    X_test_array = X[test]
    y_actual_values = y[test]
    ## Fit the linear regression model
    lr_model = LinearRegression().fit(training_X_array, training_y_array)
    ## Compute the predictions for the test data
    prediction = lr_model.predict(X_test_array)
    crime_probabilites = np.array(prediction)
    ## Calculate the RMSE (RMSEcalc is my own helper function)
    RMSE_cross_fold = RMSEcalc(crime_probabilites, y_actual_values)
    ## Add each RMSE_cross_fold value to the sum
    RMSE_sum = RMSE_cross_fold + RMSE_sum

## Calculate the average and print
RMSE_cross_fold_avg = RMSE_sum / RMSE_length
print('The Mean RMSE across all folds is', RMSE_cross_fold_avg)
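For reference, recent versions of scikit-learn can compute the same per-fold RMSE average in a few lines (a sketch; it assumes sklearn >= 0.22 for the 'neg_root_mean_squared_error' scorer):
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# one negated RMSE per fold; negate back and average
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=10),
                         scoring='neg_root_mean_squared_error')
print('The Mean RMSE across all folds is', -scores.mean())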
I am in the process of building a regression model that will eventually be used by other users. This model serves to predict flower temperature by using multiple atmospheric variables such as air temperature, humidity, solar radiation, wind, etc.
After much doodling around, I've noticed that a 2nd-degree polynomial regression through sklearn gives a good RMSE for both my training and testing data. However, since there are over 36 coefficients, collinearity occurs, and according to a comment on this post: https://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia, collinearity would disturb the betas, so the RMSE I am getting would be improper.
I've heard that perhaps I should standardize to remove the collinearity, or use an orthogonal decomposition, but I don't know which would be better. In any case, I've tried standardizing my X variables, and when I compute the RMSE for my training and testing data, I get the same RMSE for the training data but a different RMSE for the testing data.
Here is the code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics
def OpenFile(ThePath):
    path = Location + ThePath
    Prepared_df = pd.read_csv(path, sep=',', encoding='utf-8')
    Prepared_df = Prepared_df.loc[:, ~Prepared_df.columns.str.contains('^Unnamed')]
    return Prepared_df
def EvaluateRegression(Test_data, Predict_data):
    MAE = np.round(metrics.mean_absolute_error(Test_data, Predict_data), 3)
    MSE = np.round(metrics.mean_squared_error(Test_data, Predict_data), 3)
    RMSE = np.round(np.sqrt(metrics.mean_squared_error(Test_data, Predict_data)), 3)
    print('Mean absolute error :', MAE)
    print('Mean square error :', MSE)
    print('RMSE :', RMSE)
    return MAE, MSE, RMSE
#Read files ------------------------------------------------------------------------------------------------------------
Location = 'C:\\Users\...'
#Training data
File_Station_day = 'Flower_Station_data_day.csv' #X training data
File_TD = 'Flower_Y_data_day.csv' #Y training data
Chosen_Air = OpenFile(File_Station_day)
Day_TC = OpenFile(File_TD)
#Testing data
File_Fluke_Station= 'Fluke_Station_data.csv' #X testing data
File_Fluke = 'Flower_Fluke_data.csv' #Y testing data
Chosen_Air_Fluke = OpenFile(File_Fluke_Station)
Fluke_data = OpenFile(File_Fluke)
#Prepare data --------------------------------------------------------------------------------------------------------
y_train = Day_TC
y_test = Fluke_data
#Get the desired atmospheric variables
Air_cols = ['MAXTemp_data', 'MINTemp_data', 'Humidity', 'Precipitation', 'Pression', 'Arti_InSW', 'sin_time'] #Specify the desired atmospheric variables
X_train = Chosen_Air[Air_cols]
X_test = Chosen_Air_Fluke[Air_cols]
#If not standardizing
poly = PolynomialFeatures(degree=2)
linear_poly = LinearRegression()
X_train_rdy = poly.fit_transform(X_train)
linear_poly.fit(X_train_rdy,y_train)
X_test_rdy = poly.fit_transform(X_test)
Input_model= linear_poly
print('Regression: For train')
MAE, MSE, RMSE = EvaluateRegression(y_train, Input_model.predict(X_train_rdy))
#For testing data
print('Regression: For test')
MAE, MSE, RMSE = EvaluateRegression(y_test, Input_model.predict(X_test_rdy))
#Output:
Regression: For train
Mean absolute error : 0.391
Mean square error : 0.256
RMSE : 0.506
Regression: For test
Mean absolute error : 0.652
Mean square error : 0.569
RMSE : 0.754
#If standardizing
std = StandardScaler()
X_train_std = pd.DataFrame(std.fit_transform(X_train),columns = Air_cols)
X_test_std = pd.DataFrame(std.fit_transform(X_test),columns = Air_cols)
poly = PolynomialFeatures(degree=2)
linear_poly_std = LinearRegression()
X_train_std_rdy = poly.fit_transform(X_train_std)
linear_poly_std.fit(X_train_std_rdy,y_train)
X_test_std_rdy = poly.fit_transform(X_test_std)
Input_model= linear_poly_std
print('Regression: For train')
MAE, MSE, RMSE = EvaluateRegression(y_train, Input_model.predict(X_train_std_rdy))
#For testing data
print('Regression: For test')
MAE, MSE, RMSE = EvaluateRegression(y_test, Input_model.predict(X_test_std_rdy))
#Output:
Regression: For train
Mean absolute error : 0.391
Mean square error : 0.256
RMSE : 0.506
Regression: For test
Mean absolute error : 10.901
Mean square error : 304.53
RMSE : 17.451
Why is the RMSE I am getting for the standardized testing data so different from the non-standardized one? Perhaps the way I'm doing this is no good at all? Please let me know if I should attach the files to the post.
Thank you for your time!
IIRC, at the very least you should not call fit_transform twice (neither for the scaler nor for PolynomialFeatures): treat them the same way as the regression model, fitting once on the train data and only transforming the test data later. Right now you are re-fitting the scaler on the test set (which most likely gives you a different mean/std), while applying the same regression model.
Side note: your code is rather hard to read and debug, and that easily leads to simple typos and mistakes. I suggest wrapping the training logic inside a single function, and optionally using sklearn pipelines. This will make toggling the scaler a matter of [un]commenting a single line, literally.
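For instance, a minimal pipeline sketch along those lines (it reuses the X_train/X_test/y_train/y_test frames and the EvaluateRegression helper from your script):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

model = make_pipeline(
    StandardScaler(),             # comment this line out to drop scaling
    PolynomialFeatures(degree=2),
    LinearRegression(),
)

# fit() fits the scaler and the polynomial expansion on the train data only;
# predict() then reuses those fitted transforms on the test data
model.fit(X_train, y_train)
EvaluateRegression(y_test, model.predict(X_test))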