GridSearchCV scoring on mean absolute error - python

I'm trying to set up an instance of GridSearchCV to determine which set of hyperparameters will produce the lowest mean absolute error. This scikit-learn documentation indicates that scoring metrics can be passed into the grid when the GridSearchCV is created (below).
param_grid = {
    'hidden_layer_sizes': [(20,), (21,), (22,), (23,), (24,), (25,), (26,), (27,), (28,), (29,), (30,), (31,), (32,), (33,), (34,), (35,), (36,), (37,), (38,), (39,), (40,)],
    'activation': ['relu'],
    'random_state': [0]
}
gs = GridSearchCV(model, param_grid, scoring='neg_mean_absolute_error')
gs.fit(X_train, y_train)
print(gs.scorer_)
[1] make_scorer(mean_absolute_error, greater_is_better=False)
However, the grid search is not selecting the best-performing model in terms of mean absolute error:
model = gs.best_estimator_.fit(X_train, y_train)
print(metrics.mean_squared_error(y_test, model.predict(X_test)))
print(gs.best_params_)
[2] 125.0
[3] Best parameters found by grid search are: {'hidden_layer_sizes': (28,), 'learning_rate': 'constant', 'learning_rate_init': 0.01, 'random_state': 0, 'solver': 'lbfgs'}
After running the above code and determining the so-called 'best parameters', I delete one of the values found in gs.best_params_ from the grid, and find that when I run my program again the mean squared error will sometimes decrease.
param_grid = {
    'hidden_layer_sizes': [(20,), (21,), (22,), (23,), (24,), (25,), (26,), (31,), (32,), (33,), (34,), (35,), (36,), (37,), (38,), (39,), (40,)],
    'activation': ['relu'],
    'random_state': [0]
}
[4] 122.0
[5] Best parameters found by grid search are: {'hidden_layer_sizes': (23,), 'learning_rate': 'constant', 'learning_rate_init': 0.01, 'random_state': 0, 'solver': 'lbfgs'}
To clarify: I changed the set that was fed into my grid search so that it no longer contained the option to select a hidden layer size of 28. When I ran the code again, it picked a hidden layer size of 23 and the mean absolute error decreased, even though the size of 23 had been available from the start. Why didn't it just pick this option from the start if it is evaluating the mean absolute error?

Grid search and model fitting essentially depend on random number generators for different purposes. In scikit-learn this is controlled by the parameter random_state. See my other answers to learn more about it:
https://stackoverflow.com/a/42477052/3374996
https://stackoverflow.com/a/42197534/3374996
Now in your case, I can think of these ways in which random-number generation affects the training:
1) GridSearchCV will by default use a KFold with 3 folds for regression tasks, which may split the data differently on different runs. It may happen that the splits used in the two grid-search runs are different, and hence the scores differ.
2) You are using separate test data to calculate the MSE, which GridSearchCV does not have access to. So it will find the parameters appropriate for the supplied data, which may or may not be perfectly valid for the separate dataset.
Update:
I see now that you have used random_state in the param grid for the model, so point 3 below no longer applies.
3) You have not shown which model you are using. But if the model uses sub-samples of the data during training (like selecting a smaller number of features, or a smaller number of rows per iteration, or different internal estimators), then you need to fix that too to get the same scores.
Check the results again after fixing that first.
Recommendation Example
You can take ideas from this example:
# Define a custom KFold (shuffle=True is required for random_state to take effect)
from sklearn.model_selection import KFold
kf = KFold(n_splits=3, shuffle=True, random_state=0)

# Check if the model you chose supports random_state
model = WhateverYouChoseClassifier(..., random_state=0, ...)

# Pass these to the grid search
gs = GridSearchCV(model, param_grid, scoring='neg_mean_absolute_error', cv=kf)
Then repeat the two experiments you did with the changed param grid.
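For the setup in the question, a minimal reproducible sketch could look like the following. This assumes the model is an MLPRegressor (which the hidden_layer_sizes / activation / solver='lbfgs' parameters suggest) and that X_train and y_train already exist; treat it as an illustration, not the exact code from the question.
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPRegressor

# Fix the CV splits so every grid-search run sees the same folds
kf = KFold(n_splits=3, shuffle=True, random_state=0)

# Fix the model's own randomness (weight initialization) through random_state
param_grid = {
    'hidden_layer_sizes': [(n,) for n in range(20, 41)],
    'activation': ['relu'],
    'random_state': [0],
}

gs = GridSearchCV(MLPRegressor(solver='lbfgs', max_iter=1000),
                  param_grid,
                  scoring='neg_mean_absolute_error',
                  cv=kf)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
With both the splitter and the estimator seeded, removing a value from the grid should no longer change which of the remaining candidates scores best.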

Related

GridSearchCV Returns WORST Possible Parameter (Ridge & Lasso Regression)

Problem: Scikit-learn's GridSearchCV is returning the parameter which results in the worst score (Root MSE) rather than the best.
I think it is possible the problem is that I am not using train-test-split to create a hold out test set because it is time series data, and I do not want to disrupt the time order. Another possible cause is that I have over 7,000 features but only 50 observations. But clarification from anyone who knows whether these could be the problems and what I might do to remedy these potential issues would be greatly appreciated.
I start with the following code (and have imported Ridge, GridSearchCV, make_pipeline, TimeSeriesSplit, numpy, pandas, etc.):
ridge_pipe = make_pipeline(Ridge(random_state=42, max_iter=100000))
tscv = TimeSeriesSplit(n_splits=5)
param_grid = {'ridge__alpha': np.logspace(1e-300, 1e-1, 500)}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv, scoring='neg_root_mean_squared_error',
                    n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1.2589254117941673}
-4.067235334106922
Skeptical that this would be the best Root MSE, I next tried finding the score when considering an alpha value of 1e-300 alone:
param_grid = {'ridge__alpha': [1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1e-300}
-2.0906161667718835e-13
Clearly, then, an alpha value of 1e-300 has a better Root MSE (approx. -2e-13) than an alpha value of 1e-1 (approx. -4), since negative Root MSE from GridSearchCV means the same thing - as I understand it - as positive Root MSE in all other contexts. So a Root MSE of -2e-13 is really 2e-13, -4 is really 4, and the lower the Root MSE the better.
To see if np.logspace could be the culprit, I instead provide just a list of values:
param_grid = {'ridge__alpha': [1e-1, 1e-50, 1e-60, 1e-70, 1e-80, 1e-90, 1e-100, 1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv, scoring='neg_root_mean_squared_error',
                    n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
And the output shows the same problem:
{'ridge__alpha': 0.1}
-2.0419740158869386
And I don't think it's because I'm using TimeSeriesSplit, because I have tried using cv=5 instead of cv=tscv inside GridSearchCV() and it results in the same problem.
The same issue happens when I try Lasso instead of Ridge. Any thoughts?
This appears to be fine. The problem is that you're comparing the final outputs on the same dataset that the best_estimator_ was trained on (search's method score delegates to the score method of search.best_estimator_, which is the model using best hyperparameters refitted on the entire training set); but the grid search is selecting based on cross-validated scores, which are a better indicator for future performance.
Specifically, with alpha=1e-300 (practically zero), the model overfits badly to the training data, and so the rmse on that training data is very small (2e-13). Meanwhile, with alpha=1.26, the model performs worse on the training data (rmse 4), but performs better on unseen data. You can see those cross-validation scores in the grid search's attribute cv_results_.
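One way to see this in a run like the one in the question (a sketch reusing the grid, news_df and y_battles names from above) is to put the cross-validated scores next to the training-set score being compared against:
import pandas as pd

# Score of the refit best_estimator_ on the data it was trained on:
# this is what grid.score(news_df, y_battles) reports, and it rewards overfitting.
print(grid.score(news_df, y_battles))

# Cross-validated scores, which are what the search actually used to pick alpha.
cv_results = pd.DataFrame(grid.cv_results_)
print(cv_results[['param_ridge__alpha', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())

# The single number the search optimized for the winning candidate:
print(grid.best_score_)
Sorted this way, the tiny alphas should show noticeably worse mean_test_score values despite their near-zero training error, which is exactly the overfitting described above.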

Time series regression - having problems trying to predict with KRR and gaussian process regression

I need to predict what the output will be depending on the time. I want to train my model on the first 20% of the data, then build a model that follows the behaviour and predicts the remaining 80%.
The data I am working on looks as follows:
[Image: My data]
But when I try to make regressions to do this, I either get something way off target, or something quite close but linear, which is not acceptable.
I think my problem may be the choice of kernel, or the way I am making the regressions. Right now I am making them with the sklearn package as follows:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

gpr = GaussianProcessRegressor(kernel=1.15**2 * RBF(length_scale=41.4) + WhiteKernel(noise_level=1.32e-4),
                               n_restarts_optimizer=10,
                               optimizer='fmin_l_bfgs_b',
                               normalize_y=True,
                               alpha=0.051)
gpr.fit(X_train, y_train)
y_gpr, y_std = gpr.predict(X_test, return_std=True)
But after a few predictions, the predictions just become the same steady value instead of following the curve in the data. Also, the standard deviations of the predictions become very large.
[Image: The GPR prediction on the real data]
When doing kernel ridge regression in Python, I can't seem to get the curve to follow the data either. Either it drops to 0 within a few predictions, or the prediction has to be linear.
[Image: The KRR model, but linear instead - which is not good enough]
The KRR model is made as follows (I know the kernel is polynomial with a degree of 1, but I can't seem to figure out or find an appropriate kernel that will follow my data):
# The kernel ridge regression
from sklearn.kernel_ridge import KernelRidge

krr = KernelRidge(alpha=0.051, kernel='polynomial', degree=1)
# krr = KernelRidge(alpha=0.051, kernel=RBF(0.5))
krr.fit(X_train, y_train)
list_y_pred = krr.predict(X_test)
So if possible, I would like some input on how it should be done instead, or whether a different approach to the problem would be better. But I am really hoping I can get the KRR, and the Gaussian process regression as well, to fit the data.
There is nothing fundamentally wrong with your code. I believe your parameters are off, and that is why the fit is not the best.
My suggestion would be to use grid search and pipelines to estimate the best parameters.
An example of how it would work:
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'alpha': [1, 10, 100, 1000], 'kernel': ['linear']},                         ## test linear kernel with varying alpha
    {'alpha': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},  ## test rbf kernel while varying gamma and alpha
]
## You can have as many dictionaries as you want inside this list, or just one. Keep in
## mind the number of fits is the product of the sizes of the parameter lists (times the
## number of CV folds), so it can take a long time.
estimator = KernelRidge()
clf = GridSearchCV(estimator, param_grid)
clf.fit(X_train, y_train)
list_y_pred = clf.predict(X_test)
For a more comprehensive tutorial, take a look at the official docs here and here, or even here for a faster but less thorough search.
Keep in mind my parameters are way off; I just copied the example from the docs.
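As a concrete illustration of that "faster but less thorough" option, a randomized search over the same kind of parameter space might look like the sketch below. The distributions and n_iter are placeholders chosen for illustration, and X_train, y_train, X_test are assumed to exist.
from scipy.stats import loguniform
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import RandomizedSearchCV

# Sample alpha and gamma from log-uniform distributions instead of an exhaustive grid
param_distributions = {
    'alpha': loguniform(1e-3, 1e2),
    'gamma': loguniform(1e-4, 1e1),
    'kernel': ['rbf'],
}

search = RandomizedSearchCV(KernelRidge(),
                            param_distributions,
                            n_iter=50,        # number of random candidates to try
                            random_state=0)
search.fit(X_train, y_train)
print(search.best_params_)
list_y_pred = search.best_estimator_.predict(X_test)
Because it only samples 50 candidates, this is much cheaper than an exhaustive grid while still exploring a wide range of alpha and gamma values.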

Classification Model's parameters produce different results

I'm working on an SVC model for classification, and I get a different accuracy result each time I change the values of the parameters (svc__gamma, svc__kernel and svc__C). I read the sklearn documentation but could not understand what those parameters mean. I have three questions:
What do those parameters indicate?
How do they affect accuracy when I change them?
What are the correct parameter values?
The accuracy is 0.70, but when I delete svc__gamma and svc__C, it increases to 0.76.
pipe = make_pipeline(TfidfVectorizer(),
                     SVC())
param_grid = {'svc__kernel': ['rbf', 'linear', 'poly'],
              'svc__gamma': [0.1, 1, 10, 100],
              'svc__C': [0.1, 1, 10, 100]}
svc_model = GridSearchCV(pipe, param_grid, cv=3)
svc_model.fit(X_train, Y_train)
prediction = svc_model.predict(X_test)
print(f"Accuracy score is {accuracy_score(Y_test, prediction):.2f}")
print(classification_report(Y_test, prediction))
To 1:
gamma is a parameter of the Gaussian bell curve, so it should only affect the RBF (Gaussian) kernel.
C is the parameter of the optimization problem, the inverse of the Lagrangian multiplier.
To 2:
Get familiar with the mathematical background to fully understand how they affect your accuracy (side note: accuracy is usually not a reliable measure, but it depends on the context).
To 3:
There are no 'correct' parameters. They depend on the context, the data and the goal you want to achieve. Usually there is a trade-off between how well the algorithm works on test data and how it works on new data (overfitting vs. underfitting).
I hope that helps as a first step :)
For further information I suggest SVM.
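To see how accuracy moves with gamma and C on your own data, you can inspect the grid search's cv_results_; the sketch below simply reuses the svc_model object from the question.
import pandas as pd

results = pd.DataFrame(svc_model.cv_results_)
# Mean cross-validated accuracy for every (kernel, gamma, C) combination tried
print(results[['param_svc__kernel', 'param_svc__gamma', 'param_svc__C',
               'mean_test_score']].sort_values('mean_test_score', ascending=False))
print(svc_model.best_params_)
Since the question already wraps the pipeline in GridSearchCV, this table answers question 2 empirically: it shows which kernel/gamma/C combinations help or hurt on your particular data.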

Ridge regression model using cross validation technique and Grid-search technique

I created Python code for ridge regression. For that I used cross-validation and the grid-search technique together. I got the output shown below. I want to check whether my regression model building steps are correct or not. Can someone explain it?
from sklearn.linear_model import Ridge
ridge_reg = Ridge()
from sklearn.model_selection import GridSearchCV
params_Ridge = {'alpha': [1,0.1,0.01,0.001,0.0001,0] , "fit_intercept": [True, False], "solver": ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
Ridge_GS = GridSearchCV(ridge_reg, param_grid=params_Ridge, n_jobs=-1)
Ridge_GS.fit(x_train,y_train)
Ridge_GS.best_params_
output - {'alpha': 1, 'fit_intercept': True, 'solver': 'cholesky'}
Ridgeregression = Ridge(random_state=3, **Ridge_GS.best_params_)
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=Ridgeregression, X=x_train, y=y_train, cv=5)
all_accuracies
output - array([0.93335508, 0.8984485 , 0.91529146, 0.89309012, 0.90829416])
print(all_accuracies.mean())
output - 0.909695864130532
Ridgeregression.fit(x_train,y_train)
Ridgeregression.score(x_test,y_test)
output - 0.9113458623386644
Is 0.9113458623386644 my ridge regression accuracy (R squared)?
If it is, then what is the meaning of the value 0.909695864130532?
Yes, the score method of Ridge regression returns your R-squared value (docs).
In case you are not aware how the CV method works it splits your data into 5 equal chunks. Then for each combination of parameters it fits the model five times using each chunk once as evaluation set, while using the remainder of the data as the training set. The best parameter set is chosen to be the set which gives the highest average score.
Your main question seems to be why the average of your CV scores is less than the score from the model trained on the full training set and evaluated on the test set. This is not necessarily surprising, since the full training set is larger than any of the CV training folds used to produce the all_accuracies values. More training data will generally get you a more accurate model.
The test set score (i.e. your second 'score', 0.91...) is most likely to represent how your model will generalize to unseen data. This is what you should quote as the 'score' of your model. The performance on CV set is biased, since this is the data on which you based your parameter choices.
In general your method looks correct. The step where you re-evaluate ridge regression using cross_val_score seems unnecessary: once you have found your best parameters from GridSearchCV, I would go straight to fitting on the full training dataset (as you do at the end).
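Put together, the streamlined workflow described above could look like this sketch, which reuses the variable names from the question (x_train, y_train, x_test, y_test are assumed to exist):
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

params_Ridge = {'alpha': [1, 0.1, 0.01, 0.001, 0.0001, 0],
                'fit_intercept': [True, False],
                'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}

# Hyperparameter selection by cross-validation on the training data only
Ridge_GS = GridSearchCV(Ridge(), param_grid=params_Ridge, cv=5, n_jobs=-1)
Ridge_GS.fit(x_train, y_train)

# With refit=True (the default), best_estimator_ is already refit on the full training set,
# so the held-out test score can be reported directly.
print(Ridge_GS.best_params_)
print(Ridge_GS.score(x_test, y_test))   # R-squared on unseen data
The extra cross_val_score call from the question is dropped; the single test-set score is what you would quote as the model's performance.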

GridSearchCV.best_score not same as cross_val_score(GridSearchCV.best_estimator_)

Consider the following gridsearch :
grid = GridSearchCV(clf, parameters, n_jobs=-1, iid=True, cv=5)
grid_fit = grid.fit(X_train1, y_train1)
According to sklearn's documentation, grid_fit.best_score_
returns "the mean cross-validated score of the best_estimator_".
To me that would mean that the average of :
cross_val_score(grid_fit.best_estimator_, X_train1, y_train1, cv=5)
should be exactly the same as:
grid_fit.best_score_.
However I am getting a 10% difference between the two numbers. What am I missing ?
I am using the gridsearch on proprietary data so I am hoping somebody has run into something similar in the past and can guide me without a fully reproducible example. I will try to reproduce this with the Iris dataset if it's not clear enough...
When an integer is passed to the GridSearchCV(..., cv=int_number) parameter, StratifiedKFold is used for the cross-validation splitting (for classifiers). So the data set is split by StratifiedKFold, and the splits produced inside GridSearchCV and inside cross_val_score are not guaranteed to be the same. This can affect the fold accuracies and therefore the best score.
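If you want the two numbers to be directly comparable, one option (a sketch assuming a classification task and that clf has no unfixed internal randomness) is to pass the same splitter object to both calls so they see identical folds:
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# One fixed splitter, shared by the search and the follow-up cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(clf, parameters, n_jobs=-1, cv=skf)
grid_fit = grid.fit(X_train1, y_train1)

# Same splits, same data, same hyperparameters: these two numbers should now agree closely
scores = cross_val_score(grid_fit.best_estimator_, X_train1, y_train1, cv=skf)
print(grid_fit.best_score_, scores.mean())
If the estimator itself is stochastic, its random_state also needs to be fixed for the two averages to match exactly.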
