10-fold cross-validation and obtaining RMSE - python

I'm trying to compare the RMSE I get from performing multiple linear regression on the full data set to the RMSE from 10-fold cross-validation, using the KFold module in scikit-learn. I found some code that I tried to adapt, but I can't get it to work (and I suspect it never worked in the first place).
TIA for any help!
Here's my linear regression function:
def standRegres(xArr, yArr):
    # Ordinary least squares via the normal equations: ws = (X^T X)^-1 X^T y
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws
## I run it on my matrix ("comm_df") and my dependent var (comm_target)
## Calculate RMSE (omitted some code)
initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2))
## Now trying to get RMSE after training model through 10-fold cross validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf:
    linreg.fit(comm_df, comm_target)
    p = linreg.predict(comm_df)
    e = p - comm_target
    xval_err += np.sqrt(np.dot(e, e) / len(comm_df))
rmse_10cv = xval_err / 10
I get an error saying the KFold object is not iterable.

There are several things you need to correct in this code.
You cannot iterate over kf itself; you can only iterate over kf.split(comm_df).
You need to actually use the train/test indices that KFold provides; you are not using them in your code! The goal of KFold is to fit your regression on the train observations and to evaluate it (i.e. compute the RMSE, in your case) on the test observations.
With this in mind, here is how I would correct your code (it is assumed here that your data is in numpy arrays, but you can easily switch to pandas):
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf.split(comm_df):
    linreg.fit(comm_df[train], comm_target[train])
    p = linreg.predict(comm_df[test])
    e = p - comm_label[test]
    xval_err += np.sqrt(np.dot(e, e) / len(comm_target[test]))
rmse_10cv = xval_err / 10

So the code you provided still threw an error (likely because my data was in pandas objects rather than numpy arrays, and comm_label looks like a typo for comm_target). I abandoned what I had above in favor of the following, which works:
## KFold cross-validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

## Define variables for the for loop
kf = KFold(n_splits=10)
RMSE_sum = 0
RMSE_length = 10
X = np.array(comm_df)
y = np.array(comm_target)

for loop_number, (train, test) in enumerate(kf.split(X)):
    ## Get Training Matrix and Vector
    training_X_array = X[train]
    training_y_array = y[train].reshape(-1, 1)
    ## Get Testing Matrix Values
    X_test_array = X[test]
    y_actual_values = y[test]
    ## Fit the Linear Regression Model
    lr_model = LinearRegression().fit(training_X_array, training_y_array)
    ## Compute the predictions for the test data
    prediction = lr_model.predict(X_test_array)
    crime_probabilites = np.array(prediction)
    ## Calculate the RMSE (RMSEcalc is my own helper, defined elsewhere)
    RMSE_cross_fold = RMSEcalc(crime_probabilites, y_actual_values)
    ## Add each RMSE_cross_fold value to the sum
    RMSE_sum = RMSE_cross_fold + RMSE_sum

## Calculate the average and print
RMSE_cross_fold_avg = RMSE_sum / RMSE_length
print('The Mean RMSE across all folds is', RMSE_cross_fold_avg)
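For what it's worth, newer versions of scikit-learn can do the whole loop above in a few lines with cross_val_score. A minimal sketch, assuming scikit-learn >= 0.22 (for the "neg_root_mean_squared_error" scorer) and the same X and y arrays as above:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression

# scikit-learn scorers follow a "greater is better" convention,
# so the RMSE scorer returns negated values
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=10),
                         scoring="neg_root_mean_squared_error")
rmse_10cv = -scores.mean()  # mean of the per-fold RMSEs, as in the loop above
print('The Mean RMSE across all folds is', rmse_10cv)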

Related

Compute cross validation for multi-variate linear regression

I am training different models for a regression problem. Since I want to find the best model among the choices, I wanted to perform cross-validation with k = 20 to characterize the MSE of the models and statistically determine which model is better.
The problem has multiple dependent variables, and I would like to determine the MSE separately for each dependent variable, but cross_val_score doesn't let me do that explicitly.
Here is some example code of one of my models:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x_test)
mse = mean_squared_error(scaler2.inverse_transform(y_test), scaler2.inverse_transform(y_pred), multioutput="raw_values")
How can I iterate the training over the k folds, corresponding to the k models trained and tested in k-fold cross-validation?
Scikit-learn provides a KFold class, but it seems to be just a way of specifying the number of folds; it doesn't actually return the training and test folds, so I can't think of a way to actually train different models using k-fold cross-validation. Plus, I would need to evaluate the MSE separately for each dependent variable, since it's a multi-output regression problem.
You can use scikit-learn's KFold cross-validation with just a simple for loop.
Here is an example testing 5-fold cross-validation on a Bayes classifier:
from sklearn.model_selection import KFold
k = 5
kf = KFold(n_splits=k)
res = []
for train_index, test_index in kf.split(X_train_concat):
    X_train_kf, X_test_kf = X_train_concat[train_index, :], X_train_concat[test_index, :]
    y_train_kf, y_test_kf = y_train_concat[train_index], y_train_concat[test_index]
    # Append the fold's labels as the last column, as trainBayes expects
    X_train = np.append(X_train_kf, np.reshape(y_train_kf, (len(y_train_kf), 1)), axis=1)
    W_bayes = trainBayes(X_train)
    y_pred = predict(X_test_kf, W_bayes)
    mis_classification = len(y_pred) - np.count_nonzero(y_pred == y_test_kf)
    e = (mis_classification / y_test_kf.shape[0]) * 100
    res.append(e)
avg_res = sum(res) / k
print('Result of each fold - {}'.format(res))
print('Avg result : {}'.format(avg_res))
For more check this
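Back in your regression setting, the loop has the same shape, and mean_squared_error with multioutput="raw_values" gives one MSE per dependent variable. A minimal sketch, assuming X and y are numpy arrays with one column of y per dependent variable:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

k = 20
kf = KFold(n_splits=k)
fold_mses = []
for train_index, test_index in kf.split(X):
    X_train_kf, X_test_kf = X[train_index], X[test_index]
    y_train_kf, y_test_kf = y[train_index], y[test_index]
    model = LinearRegression().fit(X_train_kf, y_train_kf)
    y_pred = model.predict(X_test_kf)
    # one MSE per dependent variable for this fold
    fold_mses.append(mean_squared_error(y_test_kf, y_pred, multioutput="raw_values"))
# average each target's MSE over the k folds
avg_mse_per_target = np.mean(fold_mses, axis=0)
print(avg_mse_per_target)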

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and I want to perform multiple linear regression on it.
First, I drop the deliveries column because of its high correlation with miles. Although gasprice arguably should be removed as well, I keep it so that I'm performing multiple linear regression rather than simple linear regression.
Finally, I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the columns and print each coefficient
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I get different coefficients every time I print them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not split into train and test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2
I tried to reproduce this with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
In order to evaluate whether any of them are "correct", you will need to evaluate the model on unseen data and look at the adjusted R^2 score, etc. (hence you need the holdout (test) set). If you want to further improve the model, you can try to better understand the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula interface for specifying your model: https://www.statsmodels.org/dev/example_formulas.html
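As a quick sanity check you can run yourself, fitting both libraries on exactly the same data should give matching coefficients. A minimal sketch on synthetic data (the dataset here is made up; any small dataset works):
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X = rng.rand(100, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.1, size=100)

# case #3: sklearn on the full dataset
skl = LinearRegression().fit(X, y)
print(skl.intercept_, skl.coef_)

# case #2: statsmodels OLS on the same full dataset
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.params)  # [intercept, coef_0, coef_1], should match sklearn above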

Function for finding ROC score of a feature after randomly shuffling training target data is not acting random

I am trying to write a function that will give the average ROC score of 10 logistic regression classifiers, each trained on a different random shuffling of the training target data, for one feature at a time (for the purpose of comparing against the non-shuffled ROC score). But I am getting very strange, non-random results for each ROC score.
I have tried using np.random.shuffle instead of pd.sample and got the same result.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

def shuffled_roc(df, feature):
    df = df.sample(frac=1, random_state=0)
    x = df[feature][np.isfinite(df[feature])].copy()
    y = df['target'][np.isfinite(df[feature])].copy()
    x_train = x.iloc[:int(0.8*len(x))]
    y_train = y.iloc[:int(0.8*len(x))]
    x_test = x.iloc[int(0.8*len(x)):]
    y_test = y.iloc[int(0.8*len(x)):]
    y_train_shuffled = y_train.sample(frac=1).reset_index(drop=True)
    rocs = []
    for i in range(10):
        y_train_shuffled = y_train_shuffled.sample(frac=1).reset_index(drop=True)
        lr = LogisticRegression(solver='lbfgs').fit(x_train.values.reshape(-1, 1), y_train_shuffled)
        roc = metrics.roc_auc_score(y_test, lr.predict_proba(x_test.values.reshape(-1, 1))[:, 1])
        rocs.append(roc)
    print(rocs)
    return np.mean(rocs)

shuffled_roc(df_accident, 'target_suspension_count')
I expect 10 different values for the 10 ROC scores, but instead I get:
[0.7572317596566523, 0.24276824034334765, 0.24276824034334765, 0.7572317596566523, 0.7572317596566523, 0.7572317596566523, 0.24276824034334765, 0.7572317596566523, 0.7572317596566523, 0.24276824034334765]
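Note that the two repeating values sum to 1, which hints at what is going on: ROC AUC depends only on the ranking of the scores, and with a single feature the predicted probabilities of a logistic regression are a monotone function of that feature. Shuffling the labels can therefore only flip the sign of the learned coefficient, so the AUC on the fixed test set can only come out as some p or 1 - p. A minimal sketch of that complement property on synthetic data:
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
scores = rng.rand(1000)                # stands in for the single feature
labels = rng.randint(0, 2, size=1000)  # binary target

auc_up = roc_auc_score(labels, scores)     # positive coefficient: rank by x
auc_down = roc_auc_score(labels, -scores)  # negative coefficient: rank by -x
print(auc_up, auc_down, auc_up + auc_down)  # the two AUCs sum to 1.0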

coefficient from logistic regression to write function in python

I just completed a logistic regression. The data can be downloaded from the link below:
please click this link to download the data
Below is the code for the logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
scaler = StandardScaler()
data = pd.read_csv('data.csv')
dataX = data.drop('outcome',axis =1).values.astype(float)
X = scaler.fit_transform(dataX)
dataY = data[['outcome']]
Y = dataY.values
X_train,X_test,y_train,y_test = train_test_split (X,Y,test_size = 0.25, random_state = 33)
lr = LogisticRegression()
lr.fit(X_train,y_train)
# Predict the probability of the testing samples to belong to 0 or 1 class
predicted_probs = lr.predict_proba(X_test)
print(predicted_probs[0:3])
print(lr.coef_)
I can print the coefficients of the logistic regression, and I can compute the probability of an event being 1 or 0.
But when I write a Python function using those coefficients and compute the probability of a 1, I don't get the same answer as lr.predict_proba(X_test).
The function I wrote is as follows:
def xG(bodyPart, shotQuality, defPressure, numDefPlayers, numAttPlayers, shotdist, angle, chanceRating, type):
    coeff = [0.09786083, 2.30523761, -0.05875112, 0.07905136,
             -0.1663424, -0.73930942, -0.10385882, 0.98845481, 0.13175622]
    return (coeff[0]*bodyPart + coeff[1]*shotQuality + coeff[2]*defPressure + coeff[3]*numDefPlayers
            + coeff[4]*numAttPlayers + coeff[5]*shotdist + coeff[6]*angle + coeff[7]*chanceRating + coeff[8]*type)
I get weird answers, so I know something is wrong in the function's calculation.
May I ask for your advice, as I am new to machine learning and statistics?
I think you missed the intercept_ in your xG, as well as the logistic (sigmoid) transformation. You can retrieve the intercept from lr.intercept_, and it should be summed into the final formula before applying the sigmoid (using np.exp, assuming numpy is imported as np):
return 1/(1 + np.exp(-(intercept + coeff[0]*bodyPart + coeff[1]*shotQuality + coeff[2]*defPressure + coeff[3]*numDefPlayers + coeff[4]*numAttPlayers + coeff[5]*shotdist + coeff[6]*angle + coeff[7]*chanceRating + coeff[8]*type)))
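To check the formula, you can compute the probability by hand from lr.intercept_ and lr.coef_ and compare it against predict_proba. A minimal sketch, reusing the lr and X_test from the question (X_test is already scaled, which matters since the coefficients were learned on scaled inputs):
import numpy as np

def manual_proba(lr, x_row):
    # linear combination plus intercept, then the logistic (sigmoid) link
    z = lr.intercept_[0] + np.dot(lr.coef_[0], x_row)
    return 1.0 / (1.0 + np.exp(-z))

print(manual_proba(lr, X_test[0]))         # probability of class 1, by hand
print(lr.predict_proba(X_test[:1])[0, 1])  # same value from scikit-learn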

Python Dataframe: Different RMSE Score Every Time I Run Random Forest Regressor

I currently run a random forest model using the following code. I set a random_state equal to 100.
from sklearn.cross_validation import train_test_split
X_train_RIA_INST_PWM, X_test_RIA_INST_PWM, y_train_RIA_INST_PWM, y_test_RIA_INST_PWM = train_test_split(X_RIA_INST_PWM, Y_RIA_INST_PWM, test_size=0.3, random_state = 100)
# Random Forest Regressor for RIA_INST_PWM accounts
import numpy as np
from sklearn.ensemble import RandomForestRegressor
regressor_RIA_INST_PWM = RandomForestRegressor(n_estimators=100, min_samples_split = 10)
regressor_RIA_INST_PWM.fit(X_RIA_INST_PWM, Y_RIA_INST_PWM)
print ("R^2 for training set:"),
print (regressor_RIA_INST_PWM.score(X_train_RIA_INST_PWM, y_train_RIA_INST_PWM))
print ('-'*50)
print ("R^2 for test set:"),
print (regressor_RIA_INST_PWM.score(X_test_RIA_INST_PWM, y_test_RIA_INST_PWM))
And then I use the following code to calculate the prediction values.
def predict_AUM(df, features, regressor):
    # Reset index for later merge of predicted target values with Account IDs
    df.reset_index()
    # Set predictor variables
    X_Predict = df[features]
    # Clean inputs
    X_Predict = X_Predict.replace([np.inf, -np.inf], np.nan)
    X_Predict = X_Predict.fillna(0)
    # Predict Current_AUM
    Y_AUM_Snapshot_1yr_Predict = regressor.predict(X_Predict)
    df['PREDICTED_SPAN'] = Y_AUM_Snapshot_1yr_Predict
    return df
df_EVENT5_20 = predict_AUM(df_EVENT5_19, dfzip_features_AUM_RIA_INST_PWM, regressor_RIA_INST_PWM)
Finally, I calculate the RMSE of my results:
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(df_EVENT5_20['SPAN_DAYS'], df_EVENT5_20['PREDICTED_SPAN']))
rmse
Each time I run my code, my RMSE changes. It has varied from 7.75 to 16.4. Why is this happening? And how can I get the same RMSE each time I run the code? Additionally, how do I optimize my model for RMSE?
You only seeded train_test_split, which makes sure that the random allocation of the data to the train and test sets is reproducible.
As the name suggests, RandomForestRegressor also contains parts of the algorithm that rely on random numbers (e.g., sampling different parts of the data or different features for training the individual decision trees). If you want reproducible results, you need to seed it as well by initializing it with random_state, like this:
regressor_RIA_INST_PWM = RandomForestRegressor(
    n_estimators=100,
    min_samples_split=10,
    random_state=100,
)
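With the seed fixed, two identically configured regressors trained on the same data produce identical forests, and hence identical predictions and RMSE from run to run. A quick way to convince yourself, assuming the training and test arrays from the question:
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf_a = RandomForestRegressor(n_estimators=100, min_samples_split=10,
                             random_state=100).fit(X_train_RIA_INST_PWM, y_train_RIA_INST_PWM)
rf_b = RandomForestRegressor(n_estimators=100, min_samples_split=10,
                             random_state=100).fit(X_train_RIA_INST_PWM, y_train_RIA_INST_PWM)

# identical seeds -> identical forests -> identical predictions
print(np.allclose(rf_a.predict(X_test_RIA_INST_PWM),
                  rf_b.predict(X_test_RIA_INST_PWM)))  # True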
