My linear regression model has negative coefficient of determination Rยฒ.
How can this happen? Any idea is helpful.
Here is my dataset:
year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
The code of the LinearRegression model is as follows:
import pandas as pd
from sklearn.linear_model import LinearRegression
data =pd.read_csv("data.csv", header=None )
data = data.drop(0,axis=0)
X=data[0]
Y=data[1]
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1,shuffle =False)
lm = LinearRegression()
lm.fit(X_train.values.reshape(-1,1), Y_train.values.reshape(-1,1))
Y_pred = lm.predict(X_test.values.reshape(-1,1))
accuracy = lm.score(Y_test.values.reshape(-1,1),Y_pred)
print(accuracy)
output
-3592622948027972.5
Here is the formula of the Rยฒ score:
\hat{y_i} is the predictor of the i-th observation y_i and \bar{y} is the mean of all observations.
Therefore, a negative Rยฒ means that if someone knew the mean of your y_test sample and always used it as a "prediction", this "prediction" would be more accurate than your model.
Moving on to your dataset (thanks to #Prayson W. Daniel for the convenient loading script), let us have a quick look at your data.
df.population.plot()
It looks like a logarithmic transformation could help.
import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()
Now let us perform a linear regression using OpenTURNS.
import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)
Output:
Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469
This is an almost exact fit.
EDIT
As suggested by #Prayson W. Daniel, here is the model fit after it is transformed back to the original scale.
# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)
# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))
# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph
Output:
R2 score in original scale = 0.9979032805107133
Sckit-learnโs LinearRegression scores uses ๐
2 score. A negative ๐
2 means that the model fitted your data extremely bad. Since ๐
2 compares the fit of the model with that of the null hypothesis( a horizontal straight line ), then ๐
2 is negative when the model fits worse than a horizontal line.
๐
2 = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))
So if SUM((y - ypred)**2 is greater than SUM((y - AVG(y))**2, then ๐
2 will be negative.
reasons and ways to correct it
Problem 1: You are performing a random split of time-series data. Random split will ignore the temporal dimension.
Solution: Preserve time flow (See code below)
Problem 2: Target values are so large.
Solution: Unless we use Tree-base models, you would have to do some target feature engineering to scale data in a range that models can learn.
Here is a code example. Using defaults parameters of LinearRegression and log|exp transformation of our target values, my attempt yield ~87% R2 score:
import pandas as pd
import numpy as np
# we need to transform/feature engineer our target
# I will use log from numpy. The np.log and np.exp to make the value learnable
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
# your data, df
# transform year to reference
df = df.assign(ref_year = lambda x: x.year - 1960)
df.population = df.population.astype(int)
split = int(df.shape[0] *.9) #split at 90%, 10%-ish
df = df[['ref_year', 'population']]
train_df = df.iloc[:split]
test_df = df.iloc[split:]
X_train = train_df[['ref_year']]
y_train = train_df.population
X_test = test_df[['ref_year']]
y_test = test_df.population
# regressor
regressor = LinearRegression()
lr = TransformedTargetRegressor(
regressor=regressor,
func=np.log, inverse_func=np.exp)
lr.fit(X_train,y_train)
print(lr.score(X_test,y_test))
For those interested in making it better, here is a way to read that dataset
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
'''))
Results:
Related
So I have this small dataset and ฤฑ want to perform multiple linear regression on it.
first I drop the deliveries column for it's high correlation with miles. Although gasprice is supposed to be removed, I don't remove it so that I can perform multiple linear regression and not simple linear regression.
finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
#lets find out what are our coeffs of the multiple linear regression and olso find intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
odel_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coeffs every time I printed them out. what did I do wrong and is any of them correct?
I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
sklearn.linear_model.LinearRegression() with the full dataset, as in n2.
I tried to reproduce with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case 1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
I was training a model that contains 8 features that allow us to predict the probability of a room been sold.
Region: The region the room belongs to (an integer, taking a value between 1 and 10)
Date: The date of stay (an integer between 1โ365, here we consider only oneโday request)
Weekday: Day of week (an integer between 1โ7)
Apartment: Whether the room is a whole apartment (1) or just a room (0)
#beds:The number of beds in the room (an integer between 1โ4)
Review: Average review of the seller (a continuous variable between 1 and 5)
Pic Quality: Quality of the picture of the room (a continuous variable between 0 and 1)
Price: he historic posted price of the room (a continuous variable)
Accept: Whether this post gets accepted (someone took it, 1) or not (0) in the end*
Column Accept is the "y". Hence, this is a binary classification.
I have done OneHotEncoder for the categorical data.
I have applied normalization to the data.
I have modified the following RandomRofrest parameters:
Max_depth: Peak at 16
n_estimators: Peak at 300
min_samples_leaf:
Peak at 2
max_features: Has no effect on the AUC.
The AUC peaked at 0.7889. What else can I do to increase it?
Here is my code
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
df_train = pd.read_csv('case2_training.csv')
# Exclude ID since it is not a feature
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05,shuffle=False)
ohe = OneHotEncoder(sparse = False)
column_trans = make_column_transformer(
(OneHotEncoder(),['Region','Weekday','Apartment']),remainder='passthrough')
X_train = column_trans.fit_transform(X_train)
X_test = column_trans.fit_transform(X_test)
# Normalization
from sklearn.preprocessing import MaxAbsScaler
mabsc = MaxAbsScaler()
X_train = mabsc.fit_transform(X_train)
X_test = mabsc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
RF = RandomForestClassifier(min_samples_leaf=2,random_state=0, n_estimators=300,max_depth = 16,n_jobs=-1,oob_score=True,max_features=i)
cross_val_score(RF,X_train,y_train,cv=5,scoring = 'roc_auc').mean()
RF.fit(X_train, y_train)
yhat = RF.predict_proba(X_test)
print("AUC:",roc_auc_score(y_test, yhat[:,-1]))
# Run the prediction on the given test set.
testset = pd.read_csv('case2_testing.csv')
testset = testset.iloc[:, 1:] # exclude the 'ID' column
testset = column_trans.fit_transform(testset)
testset = mabsc.transform(testset)
yhat_2 = RF.predict_proba(testset)
final_prediction = yhat[:,-1]
However, all the probabilities from 'final_prediction` are below 0.45, basically, the model believes that all the samples are 0.
Can anyone help ?
You are using column_trans.fit_transform on the test set, this completely overwrites the features that were fitted during training. Basically the data is now in a format your trained model doesn't understand.
Once fitted in training on the training set, simply use column_trans.transform afterwards.
I'm trying to compare the RMSE I have from performing multiple linear regression upon the full data set, to that of 10-fold cross validation, using the KFold module in scikit learn. I found some code that I tried to adapt but I can't get it to work (and I suspect it never worked in the first place.
TIA for any help!
Here's my linear regression function
def standRegres(xArr,yArr):
xMat = np.mat(xArr); yMat = np.mat(yArr).T
xTx = xMat.T*xMat
if np.linalg.det(xTx) == 0.0:
print("This matrix is singular, cannot do inverse")
return
ws = xTx.I * (xMat.T*yMat)
return ws
## I run it on my matrix ("comm_df") and my dependent var (comm_target)
## Calculate RMSE (omitted some code)
initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2)
## Now trying to get RMSE after training model through 10-fold cross validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf:
linreg.fit(comm_df,comm_target)
p = linreg.predict(comm_df)
e = p-comm_target
xval_err += np.sqrt(np.dot(e,e)/len(comm_df))
rmse_10cv = xval_err/10
I get an error about how kfold object is not iterable
There are several things you need to correct in this code.
You cannot iterate over kf. You can only iterate over kf.split(comm_df)
You need to somehow use the train test split that KFold provides. You are not using them in your code! The goal of the KFold is to fit your regression on the train observations, and to evaluate the regression (ie compute the RMSE in your case) on the test observations.
With this in mind, here is how I would correct your code (it is assumed here that your data is in numpy arrays, but you can easily switch to pandas)
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf.split(comm_df):
linreg.fit(comm_df[train],comm_target[train])
p = linreg.predict(comm_df[test])
e = p-comm_label[test]
xval_err += np.sqrt(np.dot(e,e)/len(comm_target[test]))
rmse_10cv = xval_err/10
So the code you provided still threw an error. I abandoned what I had above in favor of the following, which works:
## KFold cross-validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
## Define variables for the for loop
kf = KFold(n_splits=10)
RMSE_sum=0
RMSE_length=10
X = np.array(comm_df)
y = np.array(comm_target)
for loop_number, (train, test) in enumerate(kf.split(X)):
## Get Training Matrix and Vector
training_X_array = X[train]
training_y_array = y[train].reshape(-1, 1)
## Get Testing Matrix Values
X_test_array = X[test]
y_actual_values = y[test]
## Fit the Linear Regression Model
lr_model = LinearRegression().fit(training_X_array, training_y_array)
## Compute the predictions for the test data
prediction = lr_model.predict(X_test_array)
crime_probabilites = np.array(prediction)
## Calculate the RMSE
RMSE_cross_fold = RMSEcalc(crime_probabilites, y_actual_values)
## Add each RMSE_cross_fold value to the sum
RMSE_sum=RMSE_cross_fold+RMSE_sum
## Calculate the average and print
RMSE_cross_fold_avg=RMSE_sum/RMSE_length
print('The Mean RMSE across all folds is',RMSE_cross_fold_avg)
I am new to statistic modelling so please forgive if I am mistaken about this.
I am currently working on a function in python which will predict accuracy score for logistics regression model on the test data set. User will have the flexibility to supply model parameters/coefficients (other than the ones generated by training model-part of the requirement). I have a functional code which updates the coefficients but accuracy or prediction on test data set stays the same no matter how different model parameters I supply. My understanding is that the score on test set should change if I change model coefficients?
I am using statsmodel library to make things easier for me and following this link. Can someone please help me understand what am I missing ? Below is the code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.formula.api as sm
from sklearn.model_selection import train_test_split
data = pd.read_csv("E:\\Dev\\testing\\rawdata.txt", header=None,
names=['Exam1', 'Exam2', 'Admitted'])
X = data.copy() # ou training data
y = X.Admitted.copy() # copy โyโ column values out
X.drop(['Admitted'], axis=1, inplace=True) # then, drop y column
# manually add the intercept
X['intercept'] = 1.0 # so we don't need to use sm.add_constant every time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
model = sm.Logit(y_train, X_train)
result = model.fit()
print("old parameters :\n" + str(list(result.params)))
#New parameters supplied
mdict = { 'Exam1':10000000.2234, 'Exam2':1.1233423, 'intercept':2313.423 }
result.params = mdict
print("New parameters: \n"+str(result.params))
def logitPredict(modelParams, X, threshold):
probabilities = modelParams.predict(X)
return [1 if x >= threshold else 0 for x in probabilities]
predictions = logitPredict(result, X_test, .5)
accuracy = np.mean(predictions == y_test)
#accuracy always remains same as train model
print ('accuracy = {0}%'.format(accuracy*100) )
#test sample
myExams = pd.DataFrame({'Exam1': [40.], 'Exam2': [78.], 'intercept': [1.]})
myExams
print ('Your probability = {0}%'.format(result.predict(myExams)[0]*100))
I'm working on a what I thought was a fairly simple machine learning problem.
In this problem the y (label) I'm wanting to classify is a multi-class value. In this dataset I have 6 possible choices.
I've been using the preprocessing.LabelBinarizer() function to pivot my y set to an array of ones or zeros in hopes that this would be sufficient (e.g. [0 0 0 0 0 1]).
This code below fails on the model.fit() due to a ValueError: Found arrays with inconsistent numbers of samples: [ 217 1302] || 1302 is 217*6 BTW
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
It seems that the binarizer returns results that appear like 6 columns to the algorithm instead of 1 column containing an array of 6 digits.
I've tried to force it into an array model using the code below but then the fit function bails for another reason: ValueError: Unknown label type array([array[0,1,0,0,0]), arrary([0,1,0,0...])
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y_list = []
for x in api_y:
item = {'gear': np.array(x)}
y_list.append(item)
y = pd.DataFrame(y_list)
print("after changing to binary classes array y is "+repr(y.shape))
y = np.ravel(y)
I also tried the sklearn_pandas.DataFrameMapper to no avail as it also created 6 distinct fields vs. an array of 6 values represented as one field.
Any help or suggestions would be appreciated...full version of what I thought was right posted here for clarity:
#!/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
import pandas as pd
import numpy as np
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn import metrics
import sklearn_pandas
#
# load traing data taken from 2 years of strava rides
df = pd.DataFrame.from_csv("gear_train.csv")
#
# Prepare data for logistic regression
#
y, X = dmatrices('gear ~ distance + moving_time + total_elevation_gain + average_speed + max_speed + average_cadence + has_heartrate + device_watts', df, return_type="dataframe")
#
# Fix up y to be a flattened array of 1 column (binary array?)
#
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
#
# run the logistic regression
#
model = LogisticRegression()
model = model.fit(X, y)
score = model.score(X, y)
#
# evaluate the model by splitting into training and testing data sets
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
predicted = model2.predict(X_test)
print("predicted="+repr(lb2.inverse_transform(predicted)))
print(metrics.classification_report(y_test, predicted))
#
# do a 10-fold CV test to see if this model holds up
#
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(scores.mean())enter code here
The root cause of my problem was y field contained string values instead of numeric. For example b12345 as a key instead of 12345. Once I changed that to use LabelEncoding and Decoding it worked like a champ.