I am trying to tune the regularization parameter for a lasso regression model in scikit-learn, but GridSearchCV does not seem to be choosing the parameter with the best R-squared. When I test other values of alpha manually, they give a higher R-squared than the one GridSearchCV returns. Here is my code:
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso = Lasso(random_state=0, max_iter=10000, tol=1)
alphas = np.logspace(-80, 20, 101)
tuned_parameters = [{'alpha': alphas}]
n_folds = 3

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=True, scoring='r2')
clf.fit(feature_matrix, labels)

print("\nBest parameters:")
print(clf.best_params_)
print("\nR-Squared:")
print(clf.best_estimator_.score(feature_matrix, labels))
This returns a best alpha of 1e4, with R-squared equal to zero. However, when I probe the Lasso model manually, I get a different result:
clf = Lasso(alpha=1e2, max_iter=10000, tol=1)
clf.fit(feature_matrix, labels)
print("R-squared:\n{}".format(clf.score(feature_matrix, labels)))
This returns an R-squared value of 0.39 (as does pretty much any alpha smaller than this). Why would this model not have been chosen by GridSearchCV?
I understand that the Support Vector Machine algorithm does not compute probabilities, which are needed to find the AUC value. Is there any other way to get the AUC score?
from sklearn.svm import SVC

model_ksvm = SVC(kernel='rbf', random_state=0)
model_ksvm.fit(X_train, y_train)
# Raises an AttributeError: predict_proba is not available when probability=False
model_ksvm.predict_proba(X_test)
I can't get the probability output from the SVM algorithm, and without the probability output I can't get the AUC score, which I can get with other algorithms.
You don't really need probabilities for the ROC, just any sort of confidence score. You need to rank-order the samples according to how likely they are to be in the positive class. Support Vector Machines can use the (signed) distance from the separating plane for that purpose, and indeed sklearn does that automatically under the hood when scoring with AUC: it uses the decision_function method, which is the signed distance.
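For illustration, here is a minimal sketch of scoring AUC directly from the decision function, assuming the X_train, X_test, y_train, y_test split from your snippet and a binary target:

from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

model_ksvm = SVC(kernel='rbf', random_state=0)
model_ksvm.fit(X_train, y_train)

# Signed distances to the separating hyperplane act as confidence scores.
scores = model_ksvm.decision_function(X_test)
print(roc_auc_score(y_test, scores))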
You can also set the probability option in the SVC (docs), which fits a Platt calibration model on top of the SVM to produce probability outputs:
model_ksvm = SVC(kernel='rbf', probability=True, random_state=0)
But this will lead to the same AUC, because the Platt calibration just maps the signed distances to probabilities monotonically.
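And a rough sketch of the probability-based variant under the same assumptions; the positive-class column of predict_proba goes into roc_auc_score:

from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

model_ksvm = SVC(kernel='rbf', probability=True, random_state=0)
model_ksvm.fit(X_train, y_train)

# Column 1 holds the Platt-calibrated probability of the positive class.
proba = model_ksvm.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, proba))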
I have tried GridSearchCV:
pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': [1000, 100, 10, 1, 0.1]}
grid_pipe = GridSearchCV(pipe, param_grid, cv=5)
If I exclude 1000 from the alpha grid, alpha = 100 gives the best model. If I include 1000, alpha = 1000 is chosen as the best model, even though that model has both a higher RMSE and a lower R^2... Why does it not choose alpha = 100 even when 1000 is included? I thought R^2 is the default scoring criterion for regression with Ridge...
GridSearchCV scores each hyperparameter value on every cross-validation split and averages those per-split scores to pick the best hyperparameter. You can use grid_pipe.cv_results_ to check the score on every split, the mean score, and the standard deviation for each hyperparameter; it gives the detailed results of every hyperparameter across the cross-validation splits, so you can analyze that data to see why alpha = 1000 wins on the cross-validated average even if it looks worse on the data you evaluated afterwards.
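For example, a minimal sketch of inspecting those results, assuming grid_pipe from the question has already been fit on your training data:

import pandas as pd

# Put the cross-validation results into a table for easier inspection.
results = pd.DataFrame(grid_pipe.cv_results_)

# Mean, spread, and per-split test scores (R^2 by default for Ridge) per alpha.
cols = ['param_ridge__alpha', 'mean_test_score', 'std_test_score',
        'split0_test_score', 'split1_test_score', 'split2_test_score',
        'split3_test_score', 'split4_test_score']
print(results[cols].sort_values('mean_test_score', ascending=False))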
I am so confused. I am comparing lasso and linear regression on a model that predicts housing prices. I don't understand how, when I run a linear model in sklearn, I get a negative R^2, yet when I run Lasso I get a reasonable R^2. I know that you can get a negative R^2 if linear regression is a poor fit for your data, so I decided to check it using OLS in statsmodels, where I also get a high R^2. I am just confused about how this is possible and what is going on. Is it due to multicollinearity?
Also, yes I know that I can use grid search cv to find alpha for lasso but my professor wanted us to try it this way in order to get practice coding. I am a math major and this is for a statistics course.
# Linear regression in sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=60)

lm = LinearRegression()
lm.fit(X_train, y_train)
predictions_linear = lm.predict(X_test)
print('\nR^2 of linear model is {:.5f}\n'.format(metrics.r2_score(y_test, predictions_linear)))
>>>>R^2 of linear model is -213279628873266528256.00000
# Lasso in sklearn
import numpy as np
from operator import itemgetter
from sklearn import linear_model

r2_alpha_lasso = [None] * 200
i = 0
for num in np.logspace(-4, 1, len(r2_alpha_lasso)):
    lasso = linear_model.Lasso(alpha=num, random_state=50)
    lasso.fit(X_train, y_train)
    predictions_lasso = lasso.predict(X_test)
    r2 = metrics.r2_score(y_test, predictions_lasso)
    r2_alpha_lasso[i] = [num, r2]
    i += 1

# Keep the (alpha, R^2) pair with the highest R^2.
r2_maximized_lasso = sorted(r2_alpha_lasso, key=itemgetter(1))[-1]
print("\nR^2 maximized where:\n Alpha: {:.5f}\n R^2: {:.5f}\n".format(r2_maximized_lasso[0], r2_maximized_lasso[1]))
>>>>R^2 maximized where:
Alpha: 0.00120
R^2: 0.90498
# OLS in statsmodels
import statsmodels.api as sm

df['Constant'] = 1
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
mod = sm.OLS(endog=y, exog=X, data=df)
res = mod.fit()
print(res.summary())  # only printed the relevant results, not the entire table
>>>>R-squared: 0.921
Adj. R-squared: 0.908
[2] The smallest eigenvalue is 1.26e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
I am using stratified 10-fold cross-validation to find the model that predicts y (a binary outcome) from X (which has 34 features) with the highest AUC. I set up the GridSearchCV:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

log_reg = LogisticRegression()
parameter_grid = {'penalty': ['l1', 'l2'], 'C': np.arange(0.1, 3, 0.1)}
cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
grid_search = GridSearchCV(log_reg, param_grid=parameter_grid, scoring='roc_auc',
                           cv=cross_validation)
And then do the cross-validation:
grid_search.fit(X, y)
y_pr = grid_search.predict(X)
I do not understand the following:
Why do grid_search.score(X, y) and roc_auc_score(y, y_pr) give different results (the former is 0.74 and the latter is 0.63)? Why don't these commands do the same thing in my case?
This is due to how the roc_auc scorer is constructed when it is used inside GridSearchCV.
Look at the source code here
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)
Observe the third parameter, needs_threshold. When it is true, the scorer requires continuous values for y_pred, such as probabilities or confidence scores, which in the grid search are calculated from log_reg.decision_function().
When you explicitly call roc_auc_score with y_pr, you are using .predict(), which outputs the predicted class labels rather than probabilities or confidence scores. That accounts for the difference.
Try:
y_pr = grid_search.decision_function(X)
roc_auc_score(y, y_pr)
If you still don't get the same results, please update the question with the complete code and some sample data.
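For illustration, here is a minimal sketch of the comparison, assuming grid_search has already been fit on X and y as in the question:

from sklearn.metrics import roc_auc_score

# AUC computed from hard class labels: every sample is scored 0 or 1,
# so most of the ranking information is lost and the AUC is lower.
auc_from_labels = roc_auc_score(y, grid_search.predict(X))

# AUC computed from continuous confidence scores, which is what the
# 'roc_auc' scorer inside GridSearchCV uses via decision_function.
auc_from_scores = roc_auc_score(y, grid_search.decision_function(X))

print(auc_from_labels, auc_from_scores)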
I'm building a logistic regression model as follows:
cross_validation_object = cross_validation.StratifiedKFold(Y, n_folds = 10)
scaler = MinMaxScaler(feature_range = [0,1])
logistic_fit = LogisticRegression()
pipeline_object = Pipeline([('scaler', scaler),('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01,0.1,1,10],
'model__penalty': ['l1','l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv = cross_validation_object, scoring = 'roc_auc')
I looked at the roc_auc score for the best estimator:
grid_search_object.best_score_
Out[195]: 0.94505225726738229
However, when I used the best estimator to score the full training set, I got a worse score:
grid_search_object.best_estimator_.score(X,Y)
Out[196]: 0.89636762322433028
How can this be? What am I doing wrong?
Edit: Never mind, I'm an idiot. grid_search_object.best_estimator_.score calculates accuracy, not roc_auc. Right?
But if that is the case, how does GridSearchCV compute grid_scores_? Does it pick the best decision threshold for each parameter, or is the decision threshold always at 0.5? For area under the ROC curve the decision threshold doesn't matter, but it does for, say, f1_score.
If you evaluate the best_estimator_ on the full training set, it is not surprising that the scores are different from best_score_, even if the scoring methods are the same:
The best_score_ is the average over your cross-validation fold scores of the best model (best in exactly that sense: scores highest on average over folds).
When scoring on the whole training set, your score may be higher or lower than this. Especially if you have some sort of temporal structure in your data and you are using the wrong data splitting, scores on the full set can be worse.
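A minimal sketch of the distinction, assuming the pipeline from the question and that grid_search_object has already been fit on X and Y:

from sklearn.metrics import roc_auc_score

# best_score_ is the mean ROC AUC over the cross-validation folds
# for the best parameter combination.
print(grid_search_object.best_score_)

# Scoring the refit best estimator on the full training set with the same
# metric is a different quantity and need not match best_score_.
y_scores = grid_search_object.best_estimator_.decision_function(X)
print(roc_auc_score(Y, y_scores))

# .score() on a classifier pipeline returns accuracy, not ROC AUC, which is
# why the 0.896 above is not comparable to the 0.945 from best_score_.
print(grid_search_object.best_estimator_.score(X, Y))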