Scaling gives me worse result (lower R-squared) in Linear Regression - python

(The dataset is linked at the bottom.) I'm trying to use Linear Regression on a dataset where the predictors are the product ID, weight, type, Outlet_Establishment_Year, etc., and the target variable is Item_Outlet_Sales. I use R-squared as the metric. The predictors have different units, so I think I need to scale them. First I prepare the data:
X = cleaned_data.iloc[:, :-1] # predictors
X = pd.get_dummies(data = X, drop_first = True) # convert categorical variables to numerical variables
Y = cleaned_data.iloc[:, -1] # target
Then I scale the data, fit a Linear Regression, and calculate R-squared, which yields 0.57:
from sklearn.preprocessing import StandardScaler
concat_data = pd.concat([X, Y], axis = 1)
scaled_data = StandardScaler().fit_transform(concat_data)
X_scaled = scaled_data[:, :-1]
Y_scaled = scaled_data[:, -1]
print(X_scaled.shape, Y_scaled.shape)
from sklearn.linear_model import LinearRegression
LR_scaled_model = LinearRegression()
LR_scaled_model.fit(X_scaled, Y_scaled)
from sklearn.metrics import r2_score
predicted_sales = LR_scaled_model.predict(X_scaled)
print('R-squared:', r2_score(Y_scaled, predicted_sales))
And if I just implement Linear Regression without scaling, the R-squared is 0.67
LR_non_scaling_model = LinearRegression()
LR_non_scaling_model.fit(X, Y)
predicted_sales = LR_non_scaling_model.predict(X)
print('R-squared:', r2_score(Y, predicted_sales))
How would you explain this? And, in linear regression tasks, when should I scale my data and when should I not?
Dataset: https://drive.google.com/file/d/1AeK2aCnKtr0xMHz1B_Vfq4HnIkd2pxW_/view?usp=share_link

It seems the scaling is also applied to the one-hot-encoded dummy variables, which in my opinion should not happen. If you scale only the continuous variables, does that change the behavior?
Generally, scaling affects only the interpretation of the coefficients, not the quality of the fit. After standard scaling, a coefficient $\beta_1$ can be interpreted as:
A one-standard-deviation change in the independent variable is associated with a $\beta_1$ change in the dependent variable (measured in standard deviations if the target was scaled as well).
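To illustrate the claim that scaling does not change the quality of an ordinary least-squares fit, here is a minimal sketch on synthetic data. The data, column meanings, and variable names are assumptions made up for the example, not the asker's dataset. It fits the same model on raw and on standardized data and compares the two R-squared values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: two continuous predictors on very
# different scales plus one 0/1 dummy column (all values are made up).
n = 500
X = np.column_stack([
    rng.normal(50.0, 10.0, n),      # a weight-like column
    rng.normal(2000.0, 300.0, n),   # a year-like column
    rng.integers(0, 2, n),          # a one-hot dummy
])
y = 3.0 * X[:, 0] - 0.5 * X[:, 1] + 40.0 * X[:, 2] + rng.normal(0.0, 25.0, n)

# Fit on the raw data.
raw_model = LinearRegression().fit(X, y)
r2_raw = r2_score(y, raw_model.predict(X))

# Fit on standardized predictors and a standardized target.
X_scaled = StandardScaler().fit_transform(X)
y_scaled = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()
scaled_model = LinearRegression().fit(X_scaled, y_scaled)
r2_scaled = r2_score(y_scaled, scaled_model.predict(X_scaled))

print(f"R-squared (raw):    {r2_raw:.6f}")
print(f"R-squared (scaled): {r2_scaled:.6f}")
```

On well-conditioned data like this, the two values agree up to floating-point error. If they differ on real data, as in the question, one thing that may be worth checking is whether the dummy-encoded design matrix is rank deficient (a product ID column, for example, can expand into a very large number of dummy columns), because the numerical solution can then depend on how the columns are scaled.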

Related

Python: how to rescale to the original values after scaling with scikit-learn?

I am doing some regressions with python using scikit-learn and statsmodels.
I am using the function preprocessing to normalize the values and statsmodels to do the modeling part.
from sklearn import preprocessing
scale = preprocessing.StandardScaler().fit_transform
import statsmodels.api as sm
Let's say that my data is stored as a dataframe with a target variable target_var and some predictors predictor_vars. This is what I am doing:
def extract_scale_features(df, target_var, predictor_vars):
    return scale(df[predictor_vars].values), scale(df[target_var].values.reshape(-1, 1))
def run_OLS(df, target_var, predictor_vars):
    X, y = extract_scale_features(df, target_var, predictor_vars)
    estimator = sm.OLS(y, sm.add_constant(X)).fit()
    return estimator, y
Now I can compare the normalized values between the prediction and the target variable.
est_ols, y = run_OLS(df,target_var,predictor_vars)
plt.scatter(est_ols.predict(), y)
How can I rescale these values to the original scale to compare them?
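One possible approach, sketched here under assumptions, is to keep the fitted StandardScaler objects rather than only their fit_transform method, so that inverse_transform can map values back to the original units. The data below is synthetic, and scikit-learn's LinearRegression stands in for sm.OLS only to keep the example self-contained; the scaler handling would be the same with statsmodels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data standing in for df[predictor_vars] and df[target_var].
X = rng.normal(10.0, 3.0, size=(200, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.5, 200)

# Keep the fitted scaler objects so the transformation can be undone later.
x_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(y.reshape(-1, 1))

X_scaled = x_scaler.transform(X)
y_scaled = y_scaler.transform(y.reshape(-1, 1)).ravel()

model = LinearRegression().fit(X_scaled, y_scaled)
pred_scaled = model.predict(X_scaled)

# inverse_transform maps scaled values back to the original units.
pred_original = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
y_restored = y_scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(pred_original[:3], y_restored[:3])
```

With the predictions and the target both back in original units, a scatter plot of pred_original against y_restored compares them on the original scale.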

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and I want to perform multiple linear regression on it.
First I drop the deliveries column because of its high correlation with miles. Although gasprice is supposed to be removed, I keep it so that I can perform multiple linear regression rather than simple linear regression.
Finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# Let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the column names and coefficients and print them
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I got different coefficients every time I printed them out. What did I do wrong, and are any of them correct?
I see you are trying three different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so it uses only 80% of the data (but the split should be the same every time you run it, since you fixed the random state).
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not split into train and test).
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I get identical results for cases #2 and #3 (which are trained on exactly the same data) and only slightly different coefficients for case #1.
To evaluate whether any of them is "correct", you need to evaluate the model on unseen data and look at the adjusted R^2 score, etc. (hence you need the holdout (test) set). If you want to improve the model further, you can try to better understand the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula interface for specifying your model: https://www.statsmodels.org/dev/example_formulas.html
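The comparison above can be sketched on synthetic data. The data and variable names are assumptions, and a plain NumPy least-squares solve with an explicit constant column stands in for sm.OLS so the example has no statsmodels dependency. Fitting on the same rows gives the same coefficients, while fitting on an 80% split gives slightly different ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical stand-in for the 'miles' and 'gasprice' predictors and 'hours'.
X = rng.normal(size=(100, 2))
y = 1.5 + 0.8 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.3, 100)

# Case 1: fit on an 80% training split only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
coef_split = LinearRegression().fit(X_train, y_train).coef_

# Case 3: fit on the full dataset.
coef_full = LinearRegression().fit(X, y).coef_

# Case 2 stand-in: least squares with an explicit constant column on the
# full dataset, solved with NumPy.
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
coef_lstsq = beta[1:]  # drop the intercept term

print("80% split :", coef_split)
print("full data :", coef_full)
print("lstsq     :", coef_lstsq)
```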

Negative accuracy in linear regression

My linear regression model has negative coefficient of determination R².
How can this happen? Any idea is helpful.
Here is my dataset:
year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
The code of the LinearRegression model is as follows:
import pandas as pd
from sklearn.linear_model import LinearRegression
data =pd.read_csv("data.csv", header=None )
data = data.drop(0,axis=0)
X=data[0]
Y=data[1]
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1,shuffle =False)
lm = LinearRegression()
lm.fit(X_train.values.reshape(-1,1), Y_train.values.reshape(-1,1))
Y_pred = lm.predict(X_test.values.reshape(-1,1))
accuracy = lm.score(Y_test.values.reshape(-1,1),Y_pred)
print(accuracy)
output
-3592622948027972.5
Here is the formula of the R² score:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
where $\hat{y}_i$ is the prediction for the i-th observation $y_i$ and $\bar{y}$ is the mean of all observations.
Therefore, a negative R² means that if someone knew the mean of your y_test sample and always used it as a "prediction", this "prediction" would be more accurate than your model.
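As a small illustration of this point, the sketch below fits a straight line to the early part of a made-up, fast-growing series and scores it on the later part. The series and all names are assumptions for the example. R² is computed directly from the definition and then through the estimator's score method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# A small trending series, split in time order (no shuffling).
X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.exp(0.3 * X.ravel())  # grows faster than any straight line
X_train, X_test = X[:15], X[15:]
y_train, y_test = y[:15], y[15:]

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 computed directly from the definition.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# score takes (X, y): it predicts from X internally and compares with y.
r2_score_method = model.score(X_test, y_test)

print(r2_manual, r2_score_method, r2_score(y_test, y_pred))
```

Note that score expects the features first and the true targets second. The call in the question, lm.score(Y_test.values.reshape(-1,1), Y_pred), passes the target values where the features belong, which by itself can produce an extreme value; lm.score(X_test.values.reshape(-1,1), Y_test.values.reshape(-1,1)) matches the documented signature.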
Moving on to your dataset (thanks to @Prayson W. Daniel for the convenient loading script), let us have a quick look at your data.
df.population.plot()
It looks like a logarithmic transformation could help.
import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()
Now let us perform a linear regression using OpenTURNS.
import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)
Output:
Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469
This is an almost exact fit.
EDIT
As suggested by @Prayson W. Daniel, here is the model fit after it is transformed back to the original scale.
# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)
# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))
# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph
Output:
R2 score in original scale = 0.9979032805107133
Scikit-learn's LinearRegression.score uses the R² score. A negative R² means that the model fits your data extremely badly. Since R² compares the fit of the model with that of the null hypothesis (a horizontal straight line), R² is negative when the model fits worse than a horizontal line.
R² = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))
So if SUM((y - ypred)**2) is greater than SUM((y - AVG(y))**2), then R² will be negative.
Reasons and ways to correct it
Problem 1: You are performing a random split of time-series data. A random split ignores the temporal dimension.
Solution: Preserve the time order (see the code below).
Problem 2: The target values are very large.
Solution: Unless you use tree-based models, you have to do some feature engineering on the target to scale it into a range that models can learn.
Here is a code example. Using the default parameters of LinearRegression and a log/exp transformation of the target values, my attempt yields an R² score of ~87%:
import pandas as pd
import numpy as np
# we need to transform/feature engineer our target
# I will use log from numpy. The np.log and np.exp to make the value learnable
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
# your data, df
# transform year to reference
df = df.assign(ref_year = lambda x: x.year - 1960)
df.population = df.population.astype(int)
split = int(df.shape[0] *.9) #split at 90%, 10%-ish
df = df[['ref_year', 'population']]
train_df = df.iloc[:split]
test_df = df.iloc[split:]
X_train = train_df[['ref_year']]
y_train = train_df.population
X_test = test_df[['ref_year']]
y_test = test_df.population
# regressor
regressor = LinearRegression()
lr = TransformedTargetRegressor(
    regressor=regressor,
    func=np.log, inverse_func=np.exp)
lr.fit(X_train,y_train)
print(lr.score(X_test,y_test))
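To make clearer what TransformedTargetRegressor does here, the following sketch compares it with doing the log and exp steps by hand. The exponential series is made up for the illustration and is not the population data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical exponential-growth series standing in for the population data.
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 1000.0 * np.exp(0.05 * X.ravel())

# Manual version: fit on log(y), then undo the log on the predictions.
manual = LinearRegression().fit(X, np.log(y))
pred_manual = np.exp(manual.predict(X))

# Wrapped version: func is applied to y before fitting and inverse_func
# is applied to the predictions afterwards.
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(X, y)
pred_wrapped = wrapped.predict(X)

print(np.max(np.abs(pred_manual - pred_wrapped)))
```

The wrapper has the practical advantage that score and predict work in the original units, so the transformation cannot be forgotten at prediction time.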
For those interested in making it better, here is a way to read that dataset
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
'''))
Results:

Predict future values after using polynomial regression in python

I'm currently using TensorFlow and scikit-learn to try to build a model that can predict the amount of sales for a certain product, X, based on the outdoor temperature in Celsius.
I took the temperature data as the x variable and the amount of sales as the y variable. As seen in the picture below, there is some sort of correlation between the temperature and the amount of sales:
First and foremost, I tried linear regression to see how well it would fit. This is the code for that:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train) #fit tries to fit the x variable and y variable.
#Let's try to plot it out.
y_pred = model.predict(x_train)
plt.scatter(x_train,y_train)
plt.plot(x_train,y_pred,'r')
plt.legend(['Predicted Line', 'Observed data'])
plt.show()
This resulted in a predicted line that had a pretty poor fit:
A very nice feature of sklearn, however, is that you can predict a value based on a temperature, so if I were to write
model.predict(15)
I'd get the output
array([6949.05567873])
This is exactly what I want; I just wanted the line to fit better, so instead I tried polynomial regression with sklearn by doing the following:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=8, include_bias=False) # no constant column needed: LinearRegression fits its own intercept
x_new = poly.fit_transform(x_train)
new_model = LinearRegression()
new_model.fit(x_new,y_train)
#plotting
y_prediction = new_model.predict(x_new) #this actually predicts x...?
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()
The line seems to fit better now:
My problem now is that I can't use new_model.predict(x), since it results in "ValueError: shapes (1,1) and (8,) not aligned: 1 (dim 1) != 8 (dim 0)". I understand that this is because I'm using an 8th-degree polynomial, but is there any way for me to predict the y value based on ONE temperature using the polynomial regression model?
Try using new_model.predict([[x**a for a in range(1, 9)]])
or, following the code you already have, new_model.predict(poly.transform([[x]]))
Since you fit a polynomial
y = a*x^1 + b*x^2 + ... + h*x^8 (plus an intercept)
you need to transform your input in the same manner, i.e. expand it into the polynomial terms x^1 through x^8 (without the constant term, since include_bias=False). That is what you passed to the LinearRegression fit call, which learns one coefficient per term. The plot you've shown only uses the x^1 column you indexed into (x_new[:,0]); the data the model was trained on has more columns.
One last note: always make sure your training data and future/validation data undergo the same preprocessing steps to ensure your model works.
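One way to guarantee that training and prediction inputs go through the same preprocessing is to bundle the expansion and the regression in a single pipeline. The sketch below is an illustration under assumptions: the temperature and sales data are synthetic, and it uses degree 2 instead of the question's degree 8 to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical temperature/sales data; the real x_train and y_train differ.
x_train = rng.uniform(-5.0, 30.0, size=(200, 1))
y_train = (
    5000.0
    + 120.0 * x_train[:, 0]
    - 2.0 * x_train[:, 0] ** 2
    + rng.normal(0.0, 50.0, 200)
)

# The pipeline applies the polynomial expansion before every fit and predict.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(x_train, y_train)

# A single temperature is still passed as a 2D array: one row, one column.
single_prediction = poly_model.predict([[15.0]])
print(single_prediction)
```

With this design there is no separate x_new to keep in sync: poly_model.predict accepts raw temperatures directly.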
Here's some detail :
Let's start by running your code, on synthetic data.
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from numpy.random import rand
x_train = rand(1000,1)
y_train = rand(1000,1)
poly = PolynomialFeatures(degree=8, include_bias=False) # no constant column needed: LinearRegression fits its own intercept
x_new = poly.fit_transform(x_train)
new_model = LinearRegression()
new_model.fit(x_new,y_train)
#plotting
y_prediction = new_model.predict(x_new) #this predicts y
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()
Now we can predict a y value by transforming an x value into a polynomial of degree 8 without an intercept
print(new_model.predict(poly.transform([[0.25]])))
[[0.47974408]]

how to reduce rmse while performing Linear Regression in python

I am not really a coder, but this is what I have so far. I'm trying to apply linear regression to predict something from sample data. I'm probably making a mistake somewhere, since I'm getting an RMSE above 1. I've checked the correlations among the continuous variables, which turned out to be quite small, with a maximum of 0.2. I also checked for outliers using the interquartile range method, but there were none.
Can someone please tell me how I can reduce the RMSE?
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import train_test_split
df_hosp = pd.read_csv(r'C:\Users\LAPPY-2\Desktop\LengthOfStay.csv')
df_temp, df_test = train_test_split(df_hosp, test_size=0.30, train_size=0.70)
df_train, df_val = train_test_split(df_temp, test_size=0.30, train_size=0.70)
X = df_train[['rcount', 'male', 'female', 'dialysisrenalendstage', 'asthma', \
'irondef', 'pneum', 'substancedependence', \
'psychologicaldisordermajor', 'depress', 'psychother', \
'fibrosisandother', 'malnutrition', 'hemo', 'hematocrit', \
'neutrophils', 'sodium', 'glucose', 'bloodureanitro', \
'creatinine', 'bmi', 'pulse', 'respiration', \
'secondarydiagnosisnonicd9']]
y = df_train['lengthofstay']
model = linear_model.LinearRegression(fit_intercept=True, copy_X=True) # normalize=True dropped: the parameter was removed in scikit-learn 1.2
m = model.fit(X, y)
predictions_train = m.predict(X)
print('Score: %.2f' % m.score(X, y))
rms_train = sqrt(mean_squared_error(y, predictions_train))
print ('Training set RMSE: %.2f' % rms_train)
Output:
Score: 0.75
Training set RMSE: 1.19
Since your y variable is the length of stay, there is no reason why it should have an rmse < 1. Here is a resource explaining the formula definition of RMSE. You can see that if (y_pred - y) is on average bigger than 1, then your RMSE is going to be bigger than 1.
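For reference, the usual definition of RMSE over n observations, written with $\hat{y}_i$ for the prediction and $y_i$ for the observed value, is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
```

RMSE is expressed in the same units as y, so a training RMSE of 1.19 means the typical error is a little over one unit of lengthofstay. Whether that is large depends on the range of the target, not on the value 1.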
As for why this is happening, you appear to be fitting a model with a large number of variables, some of which are not actually correlated with your output variable. Variables that show no relationship with the target add noise rather than predictive signal, so it is better to fit the model on inputs that are actually related to the output.
Try limiting the number of input variables you fit to, starting with the most highly correlated data.
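A rough sketch of that suggestion follows, on synthetic data. The column names and data are made up and stand in for the hospital dataset. It ranks predictors by absolute correlation with the target on the training rows and then checks the validation RMSE as features are added in that order.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical data: two informative columns and three pure-noise columns.
n = 1000
columns = ["signal_a", "signal_b", "noise_1", "noise_2", "noise_3"]
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=columns)
df["target"] = 3.0 * df["signal_a"] + 1.0 * df["signal_b"] + rng.normal(0.0, 1.0, n)

# Hold out the last 30% of rows for validation.
train, val = df.iloc[:700], df.iloc[700:]

# Rank predictors by absolute correlation with the target (training rows only).
ranking = (
    train.drop(columns="target")
    .corrwith(train["target"])
    .abs()
    .sort_values(ascending=False)
)
print(ranking)

# Validation RMSE when using the top-k ranked features.
rmse_by_k = {}
for k in range(1, len(ranking) + 1):
    cols = list(ranking.index[:k])
    model = LinearRegression().fit(train[cols], train["target"])
    preds = model.predict(val[cols])
    rmse_by_k[k] = float(np.sqrt(mean_squared_error(val["target"], preds)))
    print(k, cols[-1], round(rmse_by_k[k], 3))
```

Comparing RMSE on held-out rows rather than on the training set matters here, because training RMSE can only stay the same or decrease as more columns are added.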
