I am so confused. I am comparing lasso and linear regression on a model that predicts housing prices. I don't understand how, when I run a linear model in sklearn, I get a negative R^2, yet when I run it with lasso I get a reasonable R^2. I know that you can get a negative R^2 if linear regression is a poor fit for your data, so I decided to check it using OLS in statsmodels, where I also get a high R^2. How is this possible, and what is going on? Is it due to multicollinearity?
Also, yes, I know that I can use GridSearchCV to find alpha for lasso, but my professor wanted us to try it this way in order to get practice coding. I am a math major and this is for a statistics course.
# Linear regression in sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=60)
lm = LinearRegression()
lm.fit(X_train, y_train)
predictions_linear = lm.predict(X_test)
print('\nR^2 of linear model is {:.5f}\n'.format(metrics.r2_score(y_test, predictions_linear)))
>>>>R^2 of linear model is -213279628873266528256.00000
# Lasso in sklearn
import numpy as np
from operator import itemgetter
from sklearn import linear_model

r2_alpha_lasso = [None] * 200
i = 0
for num in np.logspace(-4, 1, len(r2_alpha_lasso)):
    lasso = linear_model.Lasso(alpha=num, random_state=50)
    lasso.fit(X_train, y_train)
    predictions_lasso = lasso.predict(X_test)
    r2 = metrics.r2_score(y_test, predictions_lasso)
    r2_alpha_lasso[i] = [num, r2]  # store (alpha, R^2) pairs
    i += 1
r2_maximized_lasso = sorted(r2_alpha_lasso, key=itemgetter(1))[-1]
print("\nR^2 maximized where:\n Alpha: {:.5f}\n R^2: {:.5f}\n".format(r2_maximized_lasso[0], r2_maximized_lasso[1]))
>>>>R^2 maximized where:
Alpha: 0.00120
R^2: 0.90498
# OLS in statsmodels
import statsmodels.api as sm

df['Constant'] = 1  # add an intercept column manually
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
mod = sm.OLS(endog=y, exog=X)
res = mod.fit()
print(res.summary())  # only printed the relevant results, not the entire table
>>>>R-squared: 0.921
Adj. R-squared: 0.908
[2] The smallest eigenvalue is 1.26e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
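A minimal diagnostic sketch, assuming the same X (with the manual 'Constant' column) used in the OLS block above, of how one could check the near-singularity that the warning is pointing at; the VIF helper is from statsmodels, the rest is plain numpy:
# Diagnostic sketch (assumption: X is the same design matrix as above, all columns numeric)
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_mat = X.values.astype(float)
print('Condition number of X: {:.3e}'.format(np.linalg.cond(X_mat)))  # huge values mean near-singular

# Variance inflation factors: values far above ~10 flag strong multicollinearity
for j, col in enumerate(X.columns[:10]):  # first 10 columns only, as an illustration
    print(col, variance_inflation_factor(X_mat, j))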
Related
I am trying to tune the alpha parameter for a lasso regression model in sklearn, but I'm finding that GridSearchCV does not seem to be choosing the parameter with the best R-squared. When I test other values of lambda manually, they have a higher R-squared than what GridSearchCV returns. Here is my code:
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso = Lasso(random_state=0, max_iter=10000, tol=1)
alphas = np.logspace(-80, 20, 101)
tuned_parameters = [{'alpha': alphas}]
n_folds = 3
clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=True, scoring='r2')
clf.fit(feature_matrix, labels)
print("\nBest parameters:")
print(clf.best_params_)
print("\nR-Squared:")
print(clf.best_estimator_.score(feature_matrix, labels))
This returns a best alpha of 1e4, with R-squared equal to zero. However, when I probe the Lasso model manually, I get a different result:
clf = Lasso(alpha=1e2, max_iter=10000, tol=1)
clf.fit(feature_matrix, labels)
print("R-squared:\n{}".format(clf.score(feature_matrix, labels)))
This returns an R-squared value of 0.39 (as well as pretty much any alpha smaller than this). Why would this model not have been chosen by GridSearchCV?
I built a simple linear regression model to predict students' final grade using this dataset https://archive.ics.uci.edu/ml/datasets/Student+Performance.
While my accuracy is very good, the errors seem to be big.
I'm not sure if I'm just not understanding the meaning of the errors correctly or if I made some errors in my code. I thought that with an accuracy of 0.92, the errors should be much smaller and closer to 0.
Here's my code:
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn import linear_model, metrics

data = pd.read_csv("/Users/.../student/student-por.csv", sep=";")
predict = "G3"  # target column: the final grade (as noted in the answer below)
X = np.array(data.drop([predict], axis=1))  # assumes the non-numeric columns have already been handled
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1, random_state=42)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
linear_prediction = linear.predict(x_test)  # predictions used by the error metrics below
linear_accuracy = round(linear.score(x_test, y_test), 5)
linear_mean_abs_error = metrics.mean_absolute_error(y_test, linear_prediction)
linear_mean_sq_error = metrics.mean_squared_error(y_test, linear_prediction)
linear_root_mean_sq_error = np.sqrt(metrics.mean_squared_error(y_test, linear_prediction))
Did I make any errors in the code, or do the errors make sense in this case?
The accuracy metric in sklearn linear regression is the R^2 metric. It essentially tells you the percentage of the variation in the dependent variable that is explained by the model's predictors. 0.92 is a very good score, but it does not mean that your errors will be 0. I looked at your work and it seems that you used all the numeric variables as your predictors and your target was G3. The code seems fine and the results seem accurate too. In regression tasks it is really hard to get 0 errors. Please let me know if you have any questions. Cheers
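To make the point concrete, here is a small sketch with made-up grades on a 0-20 scale (my numbers, not from the question's data): R^2 = 1 - SS_res / SS_tot, so a high R^2 only says the residuals are small relative to the spread of y, not that they are zero.
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Hypothetical true grades and predictions that are each off by roughly one point
y_true = np.array([4.0, 8.0, 10.0, 12.0, 15.0, 18.0])
y_pred = np.array([5.0, 7.0, 11.0, 11.0, 16.0, 17.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
print(1 - ss_res / ss_tot)                        # manual R^2, about 0.96
print(r2_score(y_true, y_pred))                   # same value from sklearn
print(mean_absolute_error(y_true, y_pred))        # yet the average error is almost a full grade point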
I have fit a ridge regression model on a data set
(link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE using the metrics library from sklearn as
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large value of MSE = 554084039.54321 and RMSE = 21821.8, and I am trying to understand if my implementation is correct.
RMSE implementation
Your RMSE implementation is correct, which you can easily verify by taking the square root of sklearn's mean_squared_error.
I think you are missing a closing parenthesis though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
High error problem
Your MSE is high because the model is not able to capture the relationships between your variables and the target very well. Bear in mind that each error is squared, so being 1000 off in price sky-rockets that single term to 1,000,000.
You may want to transform the price to log scale with the natural logarithm (numpy.log); it is a common practice, especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques), so see the available kernels for guidance. With this approach, you will not get such big values.
Last but not least, check the mean absolute error to see that your predictions are not as terrible as they seem.
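A minimal sketch of the log-target idea, assuming the same train DataFrame and the same kind of split as in the question (log1p/expm1 and the random_state are my choices, to stay safe around zero and keep the run reproducible):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

y_log = np.log1p(train['SalePrice'])   # fit the model on log(1 + price)
X = train.drop('SalePrice', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.30, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
pred_log = ridge.predict(X_test)

print(np.sqrt(mean_squared_error(y_test, pred_log)))      # RMSE on the log scale: much smaller numbers
pred_price = np.expm1(pred_log)                            # back-transform to dollars for reporting
print(mean_absolute_error(np.expm1(y_test), pred_price))   # MAE in dollars, easier to interpret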
I am trying to fit a multivariable linear regression on a dataset to find out how well the model explains the data. My predictors have 120 dimensions and I have 177 samples:
X.shape=(177,120), y.shape=(177,)
Using statsmodels, I get a very good R-squared of 0.76 with a Prob(F-statistic) of 0.06, which trends towards significance and suggests the model describes the data well.
When I use scikit-learn's linear regression and try to compute 5-fold cross validation r2 score, I get an average r2 score of -5.06 which shows very poor generalization performance.
The two models should be exactly the same, as their training R^2 scores are. So why are the performance evaluations from these two libraries so different? Which one should I use? I would greatly appreciate your comments on this.
Here is my code for your reference:
# using statsmodels:
import statsmodels.api as sm
X = sm.add_constant(X)
est = sm.OLS(y, X)
est2 = est.fit()
print(est2.summary())
# using scikit-learn:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

lin_reg = LinearRegression()
lin_reg.fit(X, y)
print('train r2 score:', lin_reg.score(X, y))
cv_results = cross_val_score(lin_reg, X, y, cv=5, scoring='r2')
msg = "%s: %f (%f)" % ('r2 score', cv_results.mean(), cv_results.std())
print(msg)
The difference in R-squared is because of the difference between the training sample and the left-out cross-validation sample.
You are most likely strongly overfitting: 121 regressors (including the constant) with only 177 observations and no regularization or variable selection.
Statsmodels only reports the R-squared for the training sample; there is no cross-validation. Scikit-learn has to reduce the training sample size for cross-validation, which makes the overfitting even worse.
A low cross-validation score as reported by scikit-learn then means that the overfitted estimates do not generalize to the left-out data and instead match idiosyncratic features of the training sample.
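A small sketch with pure noise of the same shape (177 samples, 120 regressors) reproduces the pattern described above: the training R^2 looks respectable while the cross-validated R^2 goes negative, because there is nothing to learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(177, 120))   # same shape as the real problem, but pure noise
y_noise = rng.normal(size=177)

lin_reg = LinearRegression().fit(X_noise, y_noise)
print('train r2:', lin_reg.score(X_noise, y_noise))    # inflated: roughly p/n of the variance is "explained"
cv = cross_val_score(lin_reg, X_noise, y_noise, cv=5, scoring='r2')
print('cv r2:', cv.mean())                             # typically strongly negative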
I'm studying machine learning with 'Python Machine Learning' book written by Sebastian Raschka.
My question is about learning rate eta0 in scikit-learn Perceptron Class.
The following code, from that book, implements an Iris data classifier using the Perceptron.
(...omitted...)
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
ml = Perceptron(eta0=0.1, n_iter=40, random_state=0)
ml.fit(X_train_std, y_train)
y_pred = ml.predict(X_test_std)
print('total test:%d, errors:%d' %(len(y_test), (y_test != y_pred).sum()))
print('accuracy: %.2f' %accuracy_score(y_test, y_pred))
My question is the following.
The result (total test, errors, accuracy) does not change for various eta0 values.
The same result of 'total test=45, errors=4, accuracy=0.91' comes out with both eta0=0.1 and eta0=100.
What is wrong?
I will try to briefly explain the role of the learning rate in the Perceptron so you understand why there is no difference in the final number of errors or the accuracy score.
The Perceptron algorithm always finds a solution, provided we have defined a finite number of epochs (i.e. iterations or steps), no matter how big eta0 is, because this constant simply scales the resulting weights during fitting.
The learning rate in other implementations (like neural nets and basically everything else*) is a value that multiplies the partial derivatives of a given function during the process of reaching the optimal minimum. While higher learning rates give us a higher chance of overshooting the optimum, lower learning rates take more time to converge (to reach the optimal point). The theory is complex, though; there is a really good chapter describing the learning rate which you should read:
http://neuralnetworksanddeeplearning.com/chap3.html
Okay, now I will also show you that the learning rate in the Perceptron is only used to rescale the weights. Let us take X as our training data and y as our training labels, and fit the Perceptron with two different values of eta0, say 1.0 and 100.0:
from sklearn.linear_model import Perceptron

X = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
y = [1, 0, 1]
clf = Perceptron(eta0=1.0, n_iter=5)
clf.fit(X, y)
clf.coef_  # returns the weights assigned to the input features
array([[-5., -1., 3.]])
clf = Perceptron(eta0=100.0, n_iter=5)
clf.fit(X, y)
clf.coef_
array([[-500., -100., 300.]])
As you can see, the learning rate in the Perceptron only rescales the model's weights (leaving their signs unchanged), while leaving the accuracy score and the error count constant.
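A quick follow-up sketch (my addition, using max_iter since n_iter was later renamed in scikit-learn): because the prediction depends only on the sign of the decision function, scaling all weights and the intercept by the same positive eta0 cannot change a single predicted label.
import numpy as np
from sklearn.linear_model import Perceptron

X = [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
y = [1, 0, 1]

preds = {}
for eta in (0.1, 1.0, 100.0):
    clf = Perceptron(eta0=eta, max_iter=5, random_state=0)
    clf.fit(X, y)
    preds[eta] = clf.predict(X)   # labels depend only on sign(w.x + b)

print(all(np.array_equal(preds[0.1], p) for p in preds.values()))   # True: identical predictions for every eta0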
Hope that suffices. E.