sklearn GridSearchCV gives questionable results - python

I have input data X_train with dimension (477 x 200) and y_train with length 477.
I want to use a support vector machine regressor and I am doing grid search.
param_grid = {'kernel': ['poly', 'rbf', 'linear','sigmoid'], 'degree': [2,3,4,5], 'C':[0.01,0.1,0.3,0.5,0.7,1,1.5,2,5,10]}
grid = GridSearchCV(estimator=regressor_2, param_grid=param_grid, scoring='neg_root_mean_squared_error', n_jobs=1, cv=3, verbose = 1)
grid_result = grid.fit(X_train, y_train)
I get for grid_result.best_params_ {'C': 0.3, 'degree': 2, 'kernel': 'linear'} with a score of -7.76, while {'C': 10, 'degree': 2, 'kernel': 'rbf'} gives -8.0.
However, when I do
regressor_opt = SVR(kernel='linear', degree=2, C=0.3)
regressor_opt.fit(X_train, y_train)
y_train_pred = regressor_opt.predict(X_train)
print("rmse=", np.sqrt(np.mean((y_train - y_train_pred)**2)))
I get 7.4 and when I do
regressor_2 = SVR(kernel='rbf', degree=2, C=10)
regressor_2.fit(X_train, y_train)
y_train_pred = regressor_2.predict(X_train)
print("rmse=", np.sqrt(np.mean((y_train - y_train_pred)**2)))
I get 5.9. This is clearly better than 7.4 but in the gridsearch the negative rmse I got for that parameter combination was -8 and therefore worse than 7.4.
Can anybody explain to me what is going on? Should I not use scoring='neg_root_mean_squared_error'?

GridSearchCV will give you the score based on the left-out data. This is fundamentally how cross-validation works. What you're doing when you train and assess on the full train set is failing to do that cross-validation; you will be obtaining an overly optimistic result. You see this slightly for the linear kernel (7.4 vs 7.76) and more exaggerated for the more flexible RBF kernel (5.9 vs 8). GridSearchCV has, I expect correctly, identified that your more flexible model does not generalise as well.
You should be able to see this effect more clearly by taking your specific estimators (regressor_opt and regressor_2) and using sklearn's cross_validate() to get the results for left-out folds. I expect you will see regressor_2 performing considerably worse than your optimistic value of 5.9. You may find that an informative exercise.
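The exercise suggested above can be sketched roughly as follows. This is a minimal illustration, not the asker's setup: the data is a synthetic stand-in generated with make_regression, and the names candidates and results are made up for this example. It prints the RMSE on the training data next to the RMSE on left-out folds for both parameter combinations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate
from sklearn.svm import SVR

# Synthetic stand-in for X_train / y_train (assumption: not the asker's data).
X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# The two parameter combinations compared in the question.
candidates = {
    "linear, C=0.3": SVR(kernel="linear", C=0.3),
    "rbf, C=10": SVR(kernel="rbf", C=10),
}

results = {}
for name, model in candidates.items():
    # RMSE on the same data the model was fitted on (optimistic).
    train_rmse = np.sqrt(mean_squared_error(y, model.fit(X, y).predict(X)))
    # RMSE on left-out folds, which is what GridSearchCV ranks by (cv=3 as in the question).
    cv = cross_validate(model, X, y, cv=3, scoring="neg_root_mean_squared_error")
    cv_rmse = -cv["test_score"].mean()
    results[name] = (train_rmse, cv_rmse)
    print(f"{name}: train RMSE = {train_rmse:.2f}, CV RMSE = {cv_rmse:.2f}")
```

On real data, the gap between the two columns is the part to look at: the larger it is, the more the training-set figure overstates how well the model generalises.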
Remember, you want a model that will perform best on new data, not a model that fits arbitrarily well to your training data.
I suggest further discussion of this does not belong on Stack Overflow, but instead on Cross Validated.

Related

GridSearchCV Returns WORST Possible Parameter (Ridge & Lasso Regression)

Problem: Scikit-learn's GridSearchCV is returning the parameter which results in the worst score (Root MSE) rather than the best.
I think it is possible the problem is that I am not using train-test-split to create a hold out test set because it is time series data, and I do not want to disrupt the time order. Another possible cause is that I have over 7,000 features but only 50 observations. But clarification from anyone who knows whether these could be the problems and what I might do to remedy these potential issues would be greatly appreciated.
I start with the following code (and have imported Ridge, GridSearchCV, make_pipeline, TimeSeriesSplit, numpy, pandas, etc.):
ridge_pipe = make_pipeline(Ridge(random_state=42, max_iter=100000))
tscv = TimeSeriesSplit(n_splits=5)
param_grid = {'ridge__alpha': np.logspace(1e-300, 1e-1, 500)}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv, scoring='neg_root_mean_squared_error',
n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1.2589254117941673}
-4.067235334106922
Skeptical that this would be the best Root MSE, I next tried finding the score when considering an alpha value of 1e-300 alone:
param_grid = {'ridge__alpha': [1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1e-300}
-2.0906161667718835e-13
Clearly then, an alpha value of 1e-300 has a better Root MSE (approx. -2e-13) than does an alpha value of 1e-1 (approx. -4) since negative Root MSE using GridSearchCV means the same thing - as I understand it - as positive Root MSE in all other contexts. So a Root MSE of -2e-13 is really 2e-13 and -4 is really 4. And the lower the Root MSE the better.
To see if np.logspace could be the culprit, I instead provide just a list of values:
param_grid = {'ridge__alpha': [1e-1, 1e-50, 1e-60, 1e-70, 1e-80, 1e-90, 1e-100, 1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv, scoring='neg_root_mean_squared_error',
n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
And the output shows the same problem:
{'ridge__alpha': 0.1}
-2.0419740158869386
And I don't think it's because I'm using TimeSeriesSplit, because I have tried using cv=5 instead of cv=tscv inside GridSearchCV() and it results in the same problem.
The same issue happens when I try Lasso instead of Ridge. Any thoughts?
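One detail in the first grid is worth checking separately: np.logspace takes exponents, not endpoint values. The short sketch below shows what the call as written produces, next to what was presumably intended (that intent is an assumption; the variable names are illustrative).

```python
import numpy as np

# As written in the question: the arguments are treated as exponents of 10,
# so the grid runs from 10**1e-300 (= 1.0) up to 10**0.1 (about 1.2589).
as_written = np.logspace(1e-300, 1e-1, 500)
print(as_written.min(), as_written.max())

# Presumably intended (assumption): alphas from 1e-300 up to 1e-1.
intended = np.logspace(-300, -1, 500)
print(intended.min(), intended.max())
```

This matches the first reported best alpha of 1.2589254117941673, which is simply the largest value in the grid as written.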
This appears to be fine. The problem is that you're comparing the final outputs on the same dataset that the best_estimator_ was trained on (search's method score delegates to the score method of search.best_estimator_, which is the model using best hyperparameters refitted on the entire training set); but the grid search is selecting based on cross-validated scores, which are a better indicator for future performance.
Specifically, with alpha=1e-300 (practically zero), the model overfits badly to the training data, and so the rmse on that training data is very small (2e-13). Meanwhile, with alpha=1.26, the model performs worse on the training data (rmse 4), but performs better on unseen data. You can see those cross-validation scores in the grid search's attribute cv_results_.
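A small sketch of the difference between the two numbers, using synthetic data as a stand-in (an assumption: few observations and many features, loosely like the question; the alpha values are picked for illustration). It prints the cross-validated score for each alpha from cv_results_, then the score of the refitted best model on its own training data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline

# Synthetic stand-in (assumption): 50 observations, 200 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=50)

grid = GridSearchCV(
    make_pipeline(Ridge()),
    {"ridge__alpha": [1e-10, 1.0, 100.0]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)

# Cross-validated score per alpha: this is what the search ranks by.
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    print(f"alpha={params['ridge__alpha']:g}: mean CV score = {score:.3f}")

# Score of the refitted best model on its own training data (optimistic).
train_score = grid.score(X, y)
print("best alpha:", grid.best_params_["ridge__alpha"])
print("best CV score:", grid.best_score_)
print("score on training data:", train_score)
```

grid.best_score_ is the number to compare across alphas; grid.score(X, y) on the training data only shows how closely the refitted model fits data it has already seen.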

Classification Model's parameters produce different results

I'm working on an SVC model for classification, and I get a different accuracy each time I change the values of the parameters (svc__gamma, svc__kernel and svc__C). I read the scikit-learn documentation but could not understand what those parameters mean. I have three questions:
What do those parameters indicate?
How does changing them affect the accuracy?
What are the correct parameter values?
The accuracy is 0.70, but when I remove svc__gamma and svc__C, it increases to 0.76.
pipe = make_pipeline(TfidfVectorizer(),
SVC())
param_grid = {'svc__kernel': ['rbf', 'linear', 'poly'],
'svc__gamma': [0.1, 1, 10, 100],
'svc__C': [0.1, 1, 10, 100]}
svc_model = GridSearchCV(pipe, param_grid, cv=3)
svc_model.fit(X_train, Y_train)
prediction = svc_model.predict(X_test)
print(f"Accuracy score is {accuracy_score(Y_test, prediction):.2f}")
print(classification_report(Y_test, prediction))
Regarding 1:
gamma is the kernel coefficient. It is the width parameter of the Gaussian bell curve in the RBF (Gaussian) kernel; in scikit-learn it is also used by the 'poly' and 'sigmoid' kernels, but it has no effect on the 'linear' kernel.
C is the regularization parameter of the optimization problem; the strength of the regularization is inversely proportional to C.
Regarding 2:
Get familiar with the mathematical background to fully understand how they affect your accuracy (side note: accuracy is often not a reliable measure, but that depends on context).
Regarding 3:
There are no 'correct' parameters. They depend on the context, the data and the goal you want to achieve. Usually there is a tradeoff between how well the algorithm works on the training data and how it works on new data (overfitting vs. underfitting).
I hope that helps as a first step :)
For further information, I suggest SVM.
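As a rough illustration of which kernels gamma touches, the sketch below fits the same SVC with two different gamma values per kernel and compares the decision values. The data is synthetic (an assumption, not the asker's TF-IDF features), and decision_values is a helper made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the asker's text features).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def decision_values(kernel, gamma):
    """Fit an SVC and return its decision function on the training data."""
    return SVC(kernel=kernel, gamma=gamma, C=1.0).fit(X, y).decision_function(X)

# gamma is ignored by the linear kernel ...
linear_same = np.allclose(decision_values("linear", 0.1), decision_values("linear", 10))
# ... but changes the model for the RBF kernel.
rbf_same = np.allclose(decision_values("rbf", 0.1), decision_values("rbf", 10))

print("linear unaffected by gamma:", linear_same)
print("rbf unaffected by gamma:", rbf_same)
```

One practical consequence for the grid in the question: every gamma value is refitted for the linear kernel even though it cannot change the result there, so a list of two dictionaries (one per kernel family) would avoid the redundant fits.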

Why does running best_estimator_ from GridSearchCV through cross-validation produce a different accuracy score?

Basically, I want to perform binary classification using SVM (SVC) from sk-learn. Since I do not have separate training and testing data, I use cross-validation to evaluate the effectiveness of the feature set that I use.
Then, I use GridSearchCV to find the best estimator and set the cross-validation parameter to 10. Because I want to analyze the prediction result, I use the best estimator to perform cross-validation using the same dataset (of course I use 10-fold cross-validation).
However, when I print the performance scores (precision, recall, f-measure, and accuracy), they differ from the grid search scores. Why do you think this happens?
I am wondering, in sk-learn should I specify the label for the positive class? In my dataset, I have already labelled the positive case as 1.
Lastly, the following is a snippet of my code.
tuned_parameters = [{'kernel': ['linear','rbf'], 'gamma': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10], 'C': [0.1, 1, 5, 10, 50, 100, 1000]}]
scoring = ['f1_macro', 'precision_macro', 'recall_macro', 'accuracy']
clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=10, scoring= scoring, refit='f1_macro')
clf.fit(feature, label)
param_C = clf.cv_results_['param_C']
param_gamma = clf.cv_results_['param_gamma']
P = clf.cv_results_['mean_test_precision_macro']
R = clf.cv_results_['mean_test_recall_macro']
F1 = clf.cv_results_['mean_test_f1_macro']
A = clf.cv_results_['mean_test_accuracy']
# print(clf.best_estimator_)
print(clf.best_score_)
scoring2 = ['f1', 'precision', 'recall', 'accuracy']
scores = cross_validate(clf.best_estimator_, feature, label, cv=n, scoring=scoring2, return_train_score=True)
print(scores)
scores_f1 = np.mean(scores['test_f1'])
scores_p = np.mean(scores['test_precision'])
scores_r = np.mean(scores['test_recall'])
scores_a = np.mean(scores['test_accuracy'])
print('\t'.join([str(scores_f1), str(scores_p), str(scores_r), str(scores_a)]))
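One possible cause is that the cross-validation splits used in cross_validate and GridSearchCV differ. Any difference in splits has a large effect here because your dataset is very small (93 samples) and the number of folds is large (10). Note also that the two calls use different scorers: the grid search reports macro-averaged metrics (f1_macro, precision_macro, recall_macro), while cross_validate uses the binary versions (f1, precision, recall), which score only the positive class, so those numbers are not directly comparable even with identical splits. A possible fix is to pass the same fixed train/test splits as cv to both calls, and to reduce the number of folds to reduce the variance, i.e.
kfolds = list(StratifiedKFold(n_splits=3).split(feature, label))  # a list, so the splits can be reused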
...
clf = GridSearchCV(..., cv=kfolds, ...)
...
scores = cross_validate(..., cv=kfolds, ...)
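A fuller, runnable version of that idea might look like the sketch below. The data is a synthetic stand-in for feature and label (an assumption), and the parameter grid is shortened for illustration. Both calls receive the same list of splits and the same scorers, so the two f1_macro figures can be compared directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for feature / label (assumption), sized like the 93-sample dataset.
feature, label = make_classification(n_samples=93, n_features=8, random_state=0)

# Materialise the splits once so both calls see exactly the same folds.
kfolds = list(StratifiedKFold(n_splits=3).split(feature, label))
scoring = ["f1_macro", "accuracy"]

clf = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
                   cv=kfolds, scoring=scoring, refit="f1_macro")
clf.fit(feature, label)

# Same estimator settings, same folds, same scorers.
scores = cross_validate(clf.best_estimator_, feature, label, cv=kfolds, scoring=scoring)

print("grid search best f1_macro:", clf.best_score_)
print("cross_validate f1_macro:  ", scores["test_f1_macro"].mean())
```

Swapping "f1_macro" for "f1" in only one of the two calls is enough to make the numbers diverge again, which is the situation in the question's snippet.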

Use both Recursive Feature Elimination and Grid Search in SciKit-Learn

I have a machine learning problem and want to optimize my SVC estimators as well as the feature selection.
For optimizing SVC estimators I use essentially the code from the docs. Now my question is, how can I combine this with recursive feature elimination with cross-validation (RFECV)? That is, for each estimator combination I want to run RFECV in order to determine the best combination of estimators and features.
I tried the solution from this thread, but it yields the following error:
ValueError: Invalid parameter C for estimator RFECV. Check the list of available parameters with `estimator.get_params().keys()`.
My code looks like this:
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-4,1e-3],'C': [1,10]},
{'kernel': ['linear'],'C': [1, 10]}]
estimator = SVC(kernel="linear")
selector = RFECV(estimator, step=1, cv=3, scoring=None)
clf = GridSearchCV(selector, tuned_parameters, cv=3)
clf.fit(X_train, y_train)
The error appears at clf = GridSearchCV(selector, tuned_parameters, cv=3).
I would use a Pipeline, but a more complete response is given here:
Recursive feature elimination and grid search using scikit-learn
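The error message itself points at one way forward: GridSearchCV sets parameters on the RFECV object, so parameters of the wrapped SVC have to be addressed with the estimator__ prefix. The sketch below is a minimal illustration under assumptions: the data is synthetic, and the grid is restricted to the linear kernel because RFECV needs an estimator that exposes coef_ or feature_importances_, which an RBF SVC does not.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (assumption).
X_train, y_train = make_classification(n_samples=120, n_features=10,
                                       n_informative=4, random_state=0)

# RFECV wraps the SVC; only the linear kernel exposes coef_ for ranking features.
selector = RFECV(SVC(kernel="linear"), step=1, cv=3)

# Parameters of the wrapped SVC are reached through the "estimator__" prefix.
param_grid = {"estimator__C": [1, 10]}

clf = GridSearchCV(selector, param_grid, cv=3)
clf.fit(X_train, y_train)

print(clf.best_params_)
print("number of selected features:", clf.best_estimator_.support_.sum())
```

Calling selector.get_params().keys(), as the error message suggests, lists the valid names, including estimator__C, estimator__kernel and estimator__gamma.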

Grid-search with specific validation data

I'm looking for a way to grid-search for hyperparameters in sklearn without using K-fold validation, i.e. I want my grid to train on a specific dataset (X1,y1 in the example below) and validate itself on a specific hold-out dataset (X2,y2 in the example below).
X1,y1 = train data
X2,y2 = validation data
clf_ = SVC(kernel='rbf',cache_size=1000)
Cs = [1,10.0,50,100.0,]
Gammas = [ 0.4,0.42,0.44,0.46,0.48,0.5,0.52,0.54,0.56]
clf = GridSearchCV(clf_,dict(C=Cs,gamma=Gammas),
cv=???, # validate on X2,y2
n_jobs=8,verbose=10)
clf.fit(X1, y1)
Use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR
param_grid = [
{'C': [1, 10, 100], 'kernel': ['linear']},
{'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model = SVR(), param_grid = param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
clf = GridSearchCV(clf_,dict(C=Cs,gamma=Gammas),
cv=???, # validate on X2,y2
n_jobs=8,verbose=10)
n_jobs sets how many jobs run in parallel: n_jobs=8 uses eight worker processes, n_jobs=-1 uses all the cores on your machine, and n_jobs=1 uses only one core.
If cv=5, it will run five-fold cross-validation for every parameter combination.
In your case the total number of fits would be 4 (size of Cs) * 9 (size of Gammas) * 5 (value of cv) = 180.
If you are using cross-validation, it does not make much sense to hold out data for rechecking your model. If you are not confident about the performance, you can increase cv to get a more reliable estimate.
This will be very time consuming, especially for SVM. I would rather suggest RandomizedSearchCV, which lets you set the number of parameter combinations to sample at random (n_iter).
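If you do want the single train/validation split the question asks for, without an extra package, scikit-learn's PredefinedSplit can express it. The sketch below is a minimal illustration under assumptions: the data is synthetic, the Gammas list is shortened, and test_fold and ps are names made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# Synthetic stand-ins for (X1, y1) and (X2, y2) (assumption).
X, y = make_classification(n_samples=150, n_features=6, random_state=0)
X1, y1, X2, y2 = X[:100], y[:100], X[100:], y[100:]

# -1 marks rows that are only ever used for training;
# 0 marks rows that form the single validation fold.
test_fold = np.concatenate([np.full(len(X1), -1), np.zeros(len(X2), dtype=int)])
ps = PredefinedSplit(test_fold)

clf = GridSearchCV(SVC(kernel="rbf"),
                   {"C": [1, 10.0, 50, 100.0], "gamma": [0.4, 0.5]},
                   cv=ps)
# Note: with the default refit=True, the best model is refitted on X1 and X2 combined.
clf.fit(np.vstack([X1, X2]), np.concatenate([y1, y2]))

print(clf.best_params_, clf.best_score_)
```

Here every parameter combination is fitted once on X1 and scored once on X2. Pass refit=False if the final model should not see the validation rows.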
