I am building a regression model that will eventually be used by other users. The model predicts flower temperature from multiple atmospheric variables such as air temperature, humidity, solar radiation, wind, etc.
After some experimentation, I've noticed that a 2nd-degree polynomial regression through scikit-learn gives a good RMSE for both my training and testing data. However, since there are over 36 coefficients, collinearity occurs, and according to a comment on this post: https://stats.stackexchange.com/questions/29781/when-conducting-multiple-regression-when-should-you-center-your-predictor-varia, collinearity would disturb the betas, so the RMSE I am getting would be improper.
I've heard that perhaps I should standardize in order to remove collinearity, or use an orthogonal decomposition, but I don't know which would be better. In any case, I've tried standardizing my X variables, and when I compute the RMSE for my training and testing data, I get the same RMSE for the training data but a different RMSE for the testing data.
Here is the code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics
def OpenFile(ThePath):
    path = Location + ThePath
    Prepared_df = pd.read_csv(path, sep=',', encoding='utf-8')
    # Drop any unnamed index columns left over from the CSV export
    Prepared_df = Prepared_df.loc[:, ~Prepared_df.columns.str.contains('^Unnamed')]
    return Prepared_df

def EvaluateRegression(Test_data, Predict_data):
    MAE = np.round(metrics.mean_absolute_error(Test_data, Predict_data), 3)
    MSE = np.round(metrics.mean_squared_error(Test_data, Predict_data), 3)
    RMSE = np.round(np.sqrt(metrics.mean_squared_error(Test_data, Predict_data)), 3)
    print('Mean absolute error :', MAE)
    print('Mean square error :', MSE)
    print('RMSE :', RMSE)
    return MAE, MSE, RMSE
#Read files ------------------------------------------------------------------------------------------------------------
Location = 'C:\\Users\...'
#Training data
File_Station_day = 'Flower_Station_data_day.csv' #X training data
File_TD = 'Flower_Y_data_day.csv' #Y training data
Chosen_Air = OpenFile(File_Station_day)
Day_TC = OpenFile(File_TD)
#Testing data
File_Fluke_Station= 'Fluke_Station_data.csv' #X testing data
File_Fluke = 'Flower_Fluke_data.csv' #Y testing data
Chosen_Air_Fluke = OpenFile(File_Fluke_Station)
Fluke_data = OpenFile(File_Fluke)
#Prepare data --------------------------------------------------------------------------------------------------------
y_train = Day_TC
y_test = Fluke_data
#Get the desired atmospheric variables
Air_cols = ['MAXTemp_data', 'MINTemp_data', 'Humidity', 'Precipitation', 'Pression', 'Arti_InSW', 'sin_time'] #Specify the desired atmospheric variables
X_train = Chosen_Air[Air_cols]
X_test = Chosen_Air_Fluke[Air_cols]
#If not standardizing
poly = PolynomialFeatures(degree=2)
linear_poly = LinearRegression()
X_train_rdy = poly.fit_transform(X_train)
linear_poly.fit(X_train_rdy,y_train)
X_test_rdy = poly.fit_transform(X_test)
Input_model= linear_poly
print('Regression: For train')
MAE, MSE, RMSE = EvaluateRegression(y_train, Input_model.predict(X_train_rdy))
#For testing data
print('Regression: For test')
MAE, MSE, RMSE = EvaluateRegression(y_test, Input_model.predict(X_test_rdy))
#Output:
Regression: For train
Mean absolute error : 0.391
Mean square error : 0.256
RMSE : 0.506
Regression: For test
Mean absolute error : 0.652
Mean square error : 0.569
RMSE : 0.754
#If standardizing
std = StandardScaler()
X_train_std = pd.DataFrame(std.fit_transform(X_train),columns = Air_cols)
X_test_std = pd.DataFrame(std.fit_transform(X_test),columns = Air_cols)
poly = PolynomialFeatures(degree=2)
linear_poly_std = LinearRegression()
X_train_std_rdy = poly.fit_transform(X_train_std)
linear_poly_std.fit(X_train_std_rdy,y_train)
X_test_std_rdy = poly.fit_transform(X_test_std)
Input_model= linear_poly_std
print('Regression: For train')
MAE, MSE, RMSE = EvaluateRegression(y_train, Input_model.predict(X_train_std_rdy))
#For testing data
print('Regression: For test')
MAE, MSE, RMSE = EvaluateRegression(y_test, Input_model.predict(X_test_std_rdy))
#Output:
Regression: For train
Mean absolute error : 0.391
Mean square error : 0.256
RMSE : 0.506
Regression: For test
Mean absolute error : 10.901
Mean square error : 304.53
RMSE : 17.451
Why is the RMSE I am getting for the standardized testing data so different from the non-standardized one? Perhaps the way I'm doing this is no good at all? Please let me know if I should attach the files to the post.
Thank you for your time!
IIRC, at the very least you should not call poly.fit_transform twice: treat it the same way as the regression model, i.e. fit once on the train data, then only transform the test data. Right now you are re-fitting the scaler on the test set (which probably gives you a different mean/std), but applying the same regression model.
Side note: your code is rather hard to read/debug, and that easily leads to simple typos/mistakes. I suggest wrapping the training logic inside a single function, and optionally using sklearn pipelines. That would make testing the scaler a matter of [un]commenting a single line, literally.
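A minimal sketch of the pipeline idea. The data here is synthetic, standing in for the asker's seven atmospheric columns; only the structure matters:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

# Toy data standing in for the atmospheric variables (hypothetical)
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 7)), rng.normal(size=(30, 7))
y_train = rng.normal(size=100)

model = Pipeline([
    ('scale', StandardScaler()),          # comment this line out to test without scaling
    ('poly', PolynomialFeatures(degree=2)),
    ('reg', LinearRegression()),
])
model.fit(X_train, y_train)    # scaler and poly are fitted on the train data only
y_pred = model.predict(X_test) # transform (not fit_transform) is applied to the test data
```

With a pipeline, the scaler's mean/std learned from the training data are automatically reused at prediction time, which is exactly the behavior the manual code above gets wrong.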
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(regressor, X, y, scoring='neg_mean_absolute_error',
                         cv=cv, n_jobs=-1)
np.mean(np.abs(scores))
regressor is the fitted model, X holds the independent features, and y the dependent feature. Is the code right? Also, I'm confused: can RMSE be bigger than 100? I'm getting values such as 121 from some regression models. Is RMSE used to tell you how good your model is in general, or only how good it is compared to other models?
rmse = 121
The RMSE value can be calculated using sklearn.metrics as follows:
import math
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test, predictions)
rmse = math.sqrt(mse)
print('RMSE: %f' % rmse)
In terms of interpretation, you need to compare the RMSE to the mean of your test data to judge the model's accuracy. RMSE is expressed in the same units as the target, so what counts as "large" depends entirely on the scale of the data; yes, it can certainly exceed 100.
For instance, an RMSE of 5 compared to a mean of 100 is a good score, as the error is quite small relative to the mean.
On the other hand, an RMSE of 5 compared to a mean of 2 would not be a good result: the typical error is larger than the mean itself.
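As a small numeric sketch of that comparison (the arrays are made-up values, not from any of the datasets above):

```python
import numpy as np

# Hypothetical actuals and predictions
actual = np.array([102.0, 98.0, 110.0, 95.0, 105.0])
pred = np.array([100.0, 101.0, 104.0, 97.0, 108.0])

# RMSE, then the same error expressed relative to the scale of the target
rmse = np.sqrt(np.mean((actual - pred) ** 2))
relative = rmse / actual.mean()
print(rmse, relative)  # an RMSE around 3.5 on a mean of 102 is only ~3.5% error
```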
If you want RMSE, why are you using mean absolute error for scoring? Change it to this:
scores = cross_val_score(regressor, X, y, scoring='neg_mean_squared_error',
                         cv=cv, n_jobs=-1)
Since RMSE is the square root of the mean squared error, we have to do this:
np.mean(np.sqrt(np.abs(scores)))
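Putting the pieces together, here is a self-contained version on synthetic data (make_regression stands in for the asker's X and y):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for X and y (hypothetical)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_squared_error', cv=cv, n_jobs=-1)
# scores holds negated per-fold MSEs, so negate, take the root, then average
rmse = np.mean(np.sqrt(np.abs(scores)))
print(rmse)
```

Note this averages the per-fold RMSEs; recent scikit-learn versions also accept scoring='neg_root_mean_squared_error' directly.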
So I have this small dataset and I want to perform multiple linear regression on it.
First, I drop the deliveries column because of its high correlation with miles. Although gasprice is supposed to be removed as well, I keep it so that I am performing multiple linear regression and not simple linear regression.
Finally, I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# Let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the columns and print each coefficient
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0], coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I got different coefficients every time I printed them out. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not split into train/test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2
I tried to reproduce this with the iris dataset, and I get identical results for cases #2 and #3 (which are trained on the exact same data), and only slightly different coefficients for case #1.
In order to evaluate whether any of them are "correct", you will need to evaluate the model on unseen data and look at the adjusted R^2 score, etc. (hence you need the holdout (test) set). If you want to further improve the model, you can try to better understand the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
I'm trying to compare the RMSE I get from performing multiple linear regression on the full dataset to that of 10-fold cross-validation, using the KFold module in scikit-learn. I found some code that I tried to adapt, but I can't get it to work (and I suspect it never worked in the first place).
TIA for any help!
Here's my linear regression function
def standRegres(xArr, yArr):
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws
## I run it on my matrix ("comm_df") and my dependent var (comm_target)
## Calculate RMSE (omitted some code)
initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2))
## Now trying to get RMSE after training model through 10-fold cross validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
kf = KFold(n_splits=10)
linreg = LinearRegression()
xval_err = 0
for train, test in kf:
    linreg.fit(comm_df, comm_target)
    p = linreg.predict(comm_df)
    e = p - comm_target
    xval_err += np.sqrt(np.dot(e, e)/len(comm_df))
rmse_10cv = xval_err/10
I get an error about how kfold object is not iterable
There are several things you need to correct in this code.
You cannot iterate over kf directly; you can only iterate over kf.split(comm_df).
You need to actually use the train/test split that KFold provides; you are not using it anywhere in your code! The goal of KFold is to fit your regression on the train observations and to evaluate the regression (i.e. compute the RMSE, in your case) on the test observations.
With this in mind, here is how I would correct your code (it is assumed here that your data is in numpy arrays, but you can easily switch to pandas):
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf.split(comm_df):
    linreg.fit(comm_df[train], comm_target[train])
    p = linreg.predict(comm_df[test])
    e = p - comm_label[test]
    xval_err += np.sqrt(np.dot(e, e)/len(comm_target[test]))
rmse_10cv = xval_err/10
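For reference, the same pattern in a self-contained form on synthetic data (make_regression and the random_state are stand-ins for comm_df/comm_target, which I don't have):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-ins for comm_df and comm_target
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

kf = KFold(n_splits=10)
linreg = LinearRegression()
xval_err = 0.0
for train, test in kf.split(X):
    linreg.fit(X[train], y[train])            # fit on the training folds only
    e = linreg.predict(X[test]) - y[test]     # errors on the held-out fold
    xval_err += np.sqrt(np.dot(e, e) / len(y[test]))
rmse_10cv = xval_err / 10                     # average RMSE across the 10 folds
print(rmse_10cv)
```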
So the code you provided still threw an error. I abandoned what I had above in favor of the following, which works:
## KFold cross-validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

## Define variables for the for loop
kf = KFold(n_splits=10)
RMSE_sum = 0
RMSE_length = 10
X = np.array(comm_df)
y = np.array(comm_target)

for loop_number, (train, test) in enumerate(kf.split(X)):
    ## Get the training matrix and vector
    training_X_array = X[train]
    training_y_array = y[train].reshape(-1, 1)
    ## Get the testing matrix values
    X_test_array = X[test]
    y_actual_values = y[test]
    ## Fit the linear regression model
    lr_model = LinearRegression().fit(training_X_array, training_y_array)
    ## Compute the predictions for the test data
    prediction = lr_model.predict(X_test_array)
    crime_probabilites = np.array(prediction)
    ## Calculate the RMSE (RMSEcalc is my own helper function)
    RMSE_cross_fold = RMSEcalc(crime_probabilites, y_actual_values)
    ## Add each RMSE_cross_fold value to the sum
    RMSE_sum = RMSE_cross_fold + RMSE_sum

## Calculate the average and print
RMSE_cross_fold_avg = RMSE_sum/RMSE_length
print('The Mean RMSE across all folds is', RMSE_cross_fold_avg)
I have performed a ridge regression model on a data set
(link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE using the metrics library from sklearn as
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large value of MSE = 554084039.54321 and RMSE = 21821.8, I am trying to understand if my implementation is correct.
RMSE implementation
Your RMSE implementation is correct, which is easily verifiable when you take the square root of sklearn's mean_squared_error.
I think you are missing a closing parenthesis though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
High error problem
Your MSE is high because the model is not able to capture the relationships between your variables and the target very well. Bear in mind that each error is squared, so being 1000 off in price sky-rockets the contribution to 1,000,000.
You may want to transform the price with the natural logarithm (numpy.log) and model it on the log scale; this is common practice, especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques), see the available kernels for guidance. With this approach you will not get such big values.
Last but not least, check the Mean Absolute Error to see that your predictions are not as terrible as they seem.
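A sketch of the log-target idea on synthetic data (the price-like target here is fabricated; with the real SalePrice column you would fit on np.log(y_train) the same way):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic, strictly positive, skewed "prices" standing in for SalePrice
X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)
y = np.exp(y / y.std())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, np.log(y_train))    # fit on log-prices
pred = np.exp(ridge.predict(X_test))   # back-transform predictions to the original scale

rmse = np.sqrt(mean_squared_error(y_test, pred))
print(rmse)
```

Fitting on the log scale means the model penalizes relative errors rather than absolute ones, so a few expensive houses no longer dominate the loss.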
import pandas
with open('aotiz.csv', 'r') as csvfile:
    aotiz = pandas.read_csv(csvfile)
test = aotiz.loc[16:7000]
# Generate the train set with the rest of the data.
train = aotiz.loc[7000:7006]
x_columns = distance_columns
y_column = ["PM2.5"]
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn import metrics
knn = KNeighborsRegressor(n_neighbors=6)
# Fit the model on the training data.
knn.fit(train[x_columns], train[y_column])
# Make point predictions on the test set using the fit model.
predictions = knn.predict(test[x_columns])
actual = test[y_column]
mse = (((predictions - actual) ** 2).sum()) / len(predictions)
print(mse)
I'm trying to find out how to get the accuracy of this model from scikit-learn. For the moment I could only get the mean squared error, but how do I compare the 'actual' and 'predictions' sets so I can express the error against the 'actual' values as a percentage?
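One way to express the error as a percentage is the mean absolute percentage error, which can be computed by hand with numpy (the arrays below are made-up values, not the aotiz data; recent scikit-learn versions also provide sklearn.metrics.mean_absolute_percentage_error):

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([50.0, 80.0, 120.0, 60.0])
predictions = np.array([55.0, 76.0, 110.0, 63.0])

# Mean absolute percentage error: average of |error| / |actual|, as a percentage
mape = np.mean(np.abs((actual - predictions) / actual)) * 100
print(mape)
```

Note that MAPE is undefined when any actual value is zero, which can be an issue for some targets.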