Ridge regression model using cross-validation and grid search - Python

I created Python code for ridge regression. For that I used cross-validation and the grid-search technique together, and I got an output result. I want to check whether my regression model building steps are correct or not. Can someone explain it?
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

ridge_reg = Ridge()
params_Ridge = {'alpha': [1, 0.1, 0.01, 0.001, 0.0001, 0],
                'fit_intercept': [True, False],
                'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
Ridge_GS = GridSearchCV(ridge_reg, param_grid=params_Ridge, n_jobs=-1)
Ridge_GS.fit(x_train, y_train)
Ridge_GS.best_params_
output - {'alpha': 1, 'fit_intercept': True, 'solver': 'cholesky'}
Ridgeregression = Ridge(random_state=3, **Ridge_GS.best_params_)
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=Ridgeregression, X=x_train, y=y_train, cv=5)
all_accuracies
output - array([0.93335508, 0.8984485 , 0.91529146, 0.89309012, 0.90829416])
print(all_accuracies.mean())
output - 0.909695864130532
Ridgeregression.fit(x_train,y_train)
Ridgeregression.score(x_test,y_test)
output - 0.9113458623386644
Is 0.9113458623386644 my ridge regression accuracy (R squared)?
If it is, then what is the meaning of the 0.909695864130532 value?

Yes, the score method of Ridge regression returns your R-squared value (docs).
In case you are not aware of how the CV method works: it splits your data into 5 equal chunks. Then, for each combination of parameters, it fits the model five times, using each chunk once as the evaluation set and the remainder of the data as the training set. The best parameter set is the one that gives the highest average score.
Your main question seems to be why the average of your CV scores is less than the score of the model trained on the full training set and evaluated on the test set. This is not necessarily surprising, since the full training set is larger than any of the CV training folds used for the all_accuracies values. More training data will generally give you a more accurate model.
The test set score (i.e. your second 'score', 0.91...) is the most likely to represent how your model will generalize to unseen data. This is what you should quote as the 'score' of your model. The performance on the CV set is biased, since this is the data on which you based your parameter choices.
In general your method looks correct. The step where you re-run the ridge regression through cross_val_score seems unnecessary: once you have found your best parameters from GridSearchCV, I would go straight to fitting on the full training dataset (as you do at the end).
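For reference, here is a minimal sketch of that streamlined flow (assuming the same x_train, y_train, x_test, y_test as in the question). GridSearchCV refits the best model on the whole training set by default (refit=True), so best_score_ already holds the mean cross-validated R-squared and score() evaluates the refitted model on the held-out test set:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

params_ridge = {'alpha': [1, 0.1, 0.01, 0.001, 0.0001, 0],
                'fit_intercept': [True, False],
                'solver': ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
ridge_gs = GridSearchCV(Ridge(), param_grid=params_ridge, cv=5, n_jobs=-1)
ridge_gs.fit(x_train, y_train)          # cross-validated search, then refit on all of x_train
print(ridge_gs.best_params_)            # best hyperparameter combination
print(ridge_gs.best_score_)             # mean cross-validated R-squared of that combination
print(ridge_gs.score(x_test, y_test))   # R-squared of the refitted model on unseen test data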

Related

GridSearchCV Returns WORST Possible Parameter (Ridge & Lasso Regression)

Problem: Scikit-learn's GridSearchCV is returning the parameter which results in the worst score (Root MSE) rather than the best.
I think it is possible the problem is that I am not using train-test-split to create a hold out test set because it is time series data, and I do not want to disrupt the time order. Another possible cause is that I have over 7,000 features but only 50 observations. But clarification from anyone who knows whether these could be the problems and what I might do to remedy these potential issues would be greatly appreciated.
I start with the following code (and have imported Ridge, GridSearchCV, make_pipeline, TimeSeriesSplit, numpy, pandas, etc.):
ridge_pipe = make_pipeline(Ridge(random_state=42, max_iter=100000))
tscv = TimeSeriesSplit(n_splits=5)
param_grid = {'ridge__alpha': np.logspace(1e-300, 1e-1, 500)}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1.2589254117941673}
-4.067235334106922
Skeptical that this would be the best Root MSE, I next tried finding the score when considering an alpha value of 1e-300 alone:
param_grid = {'ridge__alpha': [1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1e-300}
-2.0906161667718835e-13
Clearly, then, an alpha value of 1e-300 has a better Root MSE (approx. -2e-13) than an alpha value of 1e-1 (approx. -4), since negative Root MSE from GridSearchCV means the same thing, as I understand it, as positive Root MSE in all other contexts. So a Root MSE of -2e-13 is really 2e-13, and -4 is really 4. And the lower the Root MSE, the better.
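A quick check of that sign convention (a small sketch with made-up numbers; scikit-learn negates RMSE so that 'greater is better' holds for every scorer):
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)    # ordinary RMSE, lower is better
print(-rmse)   # what scoring='neg_root_mean_squared_error' reports, higher is better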
To see if np.logspace could be the culprit, I instead provide just a list of values:
param_grid = {'ridge__alpha': [1e-1, 1e-50, 1e-60, 1e-70, 1e-80, 1e-90, 1e-100, 1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
And the output shows the same problem:
{'ridge__alpha': 0.1}
-2.0419740158869386
And I don't think it's because I'm using TimeSeriesSplit, because I have tried using cv=5 instead of cv=tscv inside GridSearchCV() and it results in the same problem.
The same issue happens when I try Lasso instead of Ridge. Any thoughts?
This appears to be fine. The problem is that you're comparing the final outputs on the same dataset that the best_estimator_ was trained on (the search's score method delegates to the score method of search.best_estimator_, which is the model with the best hyperparameters refitted on the entire training set); but the grid search selects based on cross-validated scores, which are a better indicator of future performance.
Specifically, with alpha=1e-300 (practically zero), the model overfits badly to the training data, and so the RMSE on that training data is very small (2e-13). Meanwhile, with alpha=1.26, the model performs worse on the training data (RMSE 4), but performs better on unseen data. You can see those cross-validated scores in the grid search's cv_results_ attribute.
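For example, a short sketch of how those cross-validated scores could be inspected (assuming the fitted grid object from the question and pandas):
import pandas as pd

cv_results = pd.DataFrame(grid.cv_results_)
# mean_test_score is the mean negated RMSE across the CV folds; flip the sign to get RMSE
cv_results['mean_cv_rmse'] = -cv_results['mean_test_score']
print(cv_results[['param_ridge__alpha', 'mean_cv_rmse']].sort_values('mean_cv_rmse'))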

Logistic Regression - Python?

Could you briefly describe to me what the lines of code below mean? This is the code of a logistic regression in Python.
What do test_size=0.25 and random_state=0 mean? And what is train_test_split? What is done in this line of code?
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
And what is done in these lines of code?
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
Have a look at the description of the function here:
random_state sets the seed for the random number generator to give you the same result with each run, especially useful in education settings to give everyone an identical result.
test_size refers to the proportion used in the test split; here 75% of the data is used for training and 25% for testing the model.
The other lines simply run the logistic regression on the training dataset. You then use the test dataset to check the goodness of the fitted regression.
What do test_size=0.25 and random_state=0 mean?
test_size=0.25 -> 25% of the data is split off as the test set, the rest is the training set.
random_state=0 -> for reproducible results; this can be any fixed number.
What is done in this line of code?
Splits X and y into X_train, X_test, y_train, y_test.
And what is done in these lines of code?
Trains the logistic regression model through fit(X_train, y_train) and then makes predictions on the test set X_test.
Later you probably compare y_pred to y_test to see what the accuracy of the model is.
Based on the documentation:
test_size : float, int or None, optional (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
This gives you the split between your train data and test data; if you have 1000 data points in total, a test_size=0.25 would mean that you have:
750 data points for train
250 data points for test
The perfect size is still under discussion; for large datasets (1,000,000+) I currently prefer to set it to 0.1. And on top of that I have another validation dataset, which I keep completely out until I have decided to run the algorithm.
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
For machine learning you should set this to a value: if you set it, you will be able to open your program on another day and still produce the same results. Normally random_state is also available in all classifiers/regression models, so that you can start working and tuning and have it reproducible.
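A tiny sketch of that reproducibility point, using made-up data (not the asker's):
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_tr1, X_te1, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
print(np.array_equal(X_te1, X_te2))  # True: the same random_state gives the same split every run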
To comment your regression:
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
Will create your regression object; in Python this line only instantiates (names) the model.
Will fit your logistic regression on your training set; in this example it will use 750 data points to train the regression. Training means that the weights of the logistic regression are optimized on those 750 entries so that the estimate fits your y_train.
This will use the weights learned in step 2 to compute an estimate y_pred from X_test.
After that you can test your results: you now have the y_pred you calculated and the real y_test, so you can now calculate some accuracy scores and see how well the regression was trained.
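A short sketch of that last step (assuming the y_test and y_pred produced by the question's code):
from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(y_test, y_pred))    # fraction of test samples predicted correctly
print(confusion_matrix(y_test, y_pred))  # per-class breakdown of correct and incorrect predictions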
This line:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
divides your data into train and test sets; 0.25 means 25% of the data will be used for testing and the rest for training.
For random_state=0, here is a brief discussion.
An excerpt from the above link:
if you use random_state=some_number, then you can guarantee that the
output of Run 1 will be equal to the output of Run 2,
logistic_regression= LogisticRegression() #Creates logistic regressor
Calculates some values for your data. Recommended read:
logistic_regression.fit(X_train,y_train)
An excerpt from the above link:
Here the fit method, when applied to the training dataset, learns the
model parameters (for example, mean and standard deviation)
....
It doesn't matter what the actual random_state number is (42, 0, 21, ...). The important thing is that every time you use 42, you will always get the same output the first time you make the split. This is useful if you want reproducible results.
Performs a prediction on the test set based on what was learned from the training set.
y_pred=logistic_regression.predict(X_test)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
The above line splits your data into training and testing data randomly.
X is your dataset minus the output variable
y is your output variable
test_size=0.25 means you are dividing the data 75%-25%, where 25% is your testing dataset
random_state is used to generate the same sample again when you run the code
Refer to the train_test_split documentation.

GridSearchCV.best_score not same as cross_val_score(GridSearchCV.best_estimator_)

Consider the following gridsearch :
grid = GridSearchCV(clf, parameters, n_jobs=-1, iid=True, cv=5)
grid_fit = grid.fit(X_train1, y_train1)
According to sklearn's documentation, grid_fit.best_score_
returns the mean cross-validated score of the best_estimator_.
To me that would mean that the average of:
cross_val_score(grid_fit.best_estimator_, X_train1, y_train1, cv=5)
should be exactly the same as:
grid_fit.best_score_.
However, I am getting a 10% difference between the two numbers. What am I missing?
I am using the grid search on proprietary data, so I am hoping somebody has run into something similar in the past and can guide me without a fully reproducible example. I will try to reproduce this with the Iris dataset if it's not clear enough...
When an integer is passed to the GridSearchCV(..., cv=int_number) parameter, StratifiedKFold will be used for the cross-validation splitting. So the data set will be randomly split by StratifiedKFold. This might affect the accuracy and therefore the best score.
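One way to make the comparison apples-to-apples is to pass the same splitter object to both calls, so they evaluate on identical folds. A sketch, assuming the clf, parameters, X_train1 and y_train1 from the question, and a current scikit-learn version (which no longer has the iid parameter):
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # fixed, shared folds
grid = GridSearchCV(clf, parameters, n_jobs=-1, cv=cv)
grid_fit = grid.fit(X_train1, y_train1)
scores = cross_val_score(grid_fit.best_estimator_, X_train1, y_train1, cv=cv)
print(grid_fit.best_score_, scores.mean())  # evaluated on the same folds, these should now agree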

RFECV does not return same features for same data

I have a dataframe X which comprises 60 features and ~450k observations. My response variable y is categorical (survival, no survival).
I would like to use RFECV to reduce the number of significant features for my estimator (right now, logistic regression) on Xtrain, which I would like to score for accuracy under an ROC curve. features_selected is a list of all features.
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm
# Create train and test datasets to evaluate each model
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,train_size = 0.70)
# Use RFECV to reduce features
# Create a logistic regression estimator
logreg = lm.LogisticRegression()
# Use RFECV to pick best features, using Stratified Kfold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(ytrain, 10), scoring='roc_auc')
# Fit the features to the response variable
X_new = rfecv.fit_transform(Xtrain[features_selected], ytrain)
I have a few questions:
a) X_new returns different features when run on separate occasions (one time it returned 5 features, another run it returned 9. One is not a subset of the other). Why would this be?
b) Does this imply an unstable solution? While using the same seed for StratifiedKFold should solve this problem, does this mean I need to reconsider the approach in totality?
c) In general, how do I approach tuning? e.g., features are selected BEFORE tuning in my current implementation. Would tuning affect the significance of certain features? Or should I tune simultaneously?
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized sub-samples. Therefore, it's not surprising to get different results every time you execute the algorithm. Source
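So, for the reproducibility concern in (b), fixing the seed of the splitter is indeed enough. A sketch using the current scikit-learn API (the question uses the older sklearn.cross_validation module), assuming the Xtrain, ytrain and features_selected from the question:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)  # fixed folds
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), cv=cv, scoring='roc_auc')
X_new = rfecv.fit_transform(Xtrain[features_selected], ytrain)
print(rfecv.n_features_)  # the same features are selected on every run with this fixed seed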
There is an approach based on Pearson's correlation coefficient. Using this method, you can calculate a correlation coefficient between each pair of features and aim to remove features with a high correlation. This method could be considered a stable solution to such a problem. Source
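A sketch of that correlation filter (the 0.9 threshold is an arbitrary choice, and X is assumed to be the pandas DataFrame of features from the question):
import numpy as np

corr = X.corr().abs()                                               # pairwise absolute Pearson correlations
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # keep the upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)                                 # drop one feature from each highly correlated pair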

Why should we perform K-fold cross-validation on the test set?

I was working on a k-nearest neighbours problem set. I couldn't understand why they are performing K-fold cross-validation on the test set. Can't we directly test how well our best parameter K performed on the entire test data, rather than doing a cross-validation?
iris = sklearn.datasets.load_iris()
X = iris.data
Y = iris.target
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
    X, Y, test_size=0.33, random_state=42)
k = np.arange(20)+1
parameters = {'n_neighbors': k}
knn = sklearn.neighbors.KNeighborsClassifier()
clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10)
clf.fit(X_train, Y_train)
def computeTestScores(test_x, test_y, clf, cv):
    kFolds = sklearn.cross_validation.KFold(test_x.shape[0], n_folds=cv)
    scores = []
    for _, test_index in kFolds:
        test_data = test_x[test_index]
        test_labels = test_y[test_index]
        scores.append(sklearn.metrics.accuracy_score(test_labels, clf.predict(test_data)))
    return scores
scores = computeTestScores(test_x = X_test, test_y = Y_test, clf=clf, cv=5)
TL;DR
Did you ever have a science teacher who said, 'any measurement without error bounds is meaningless'?
You might worry that the score on using your fitted, hyperparameter optimized, estimator on your test set is a fluke. By doing a number of tests on a randomly chosen subsample of the test set you get a range of scores; you can report their mean and standard deviation etc. This is, hopefully, a better proxy for how the estimator will perform on new data from the wild.
The following conceptual model may not apply to all estimators but it is a useful one to bear in mind. You end up needing 3 subsets of your data. You can skip to the final paragraph if the numbered points are things you are already happy with.
Training your estimator will fit some internal parameters that you need not ever see directly. You optimize these by training on the training set.
Most estimators also have hyperparameters (number of neighbours, alpha for Ridge, ...). Hyperparameters also need to be optimized. You need to fit them to a different subset of your data; call it the validation set.
Finally, when you are happy with the fit of both the estimator's internal parameters and the hyperparameters, you want to see how well the fitted estimator predicts on new data. You need a final subset (the test set) of your data to figure out how well the training and hyperparameter optimization went.
In lots of cases, partitioning your data into 3 subsets means you don't have enough samples in each one. One way around this is to randomly split the training set a number of times, fit hyperparameters and aggregate the results. This also helps stop your hyperparameters being over-fit to a particular validation set. K-fold cross-validation is one strategy.
Another use for this splitting of a data set at random is to get a range of results for how your final estimator did. By splitting the test set and computing the score you get a range of answers to 'how might we do on new data'. The hope is that this is more representative of what you might see as real-world performance on novel data. You can also get a standard deviation for your final score. This appears to be what the Harvard cs109 gist is doing.
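A sketch of that idea with the current scikit-learn API (the gist uses the older sklearn.cross_validation module), assuming the fitted clf, X_test and Y_test from the question:
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

scores = []
for _, test_index in KFold(n_splits=5).split(X_test):   # disjoint chunks of the test set
    scores.append(accuracy_score(Y_test[test_index], clf.predict(X_test[test_index])))
print(np.mean(scores), np.std(scores))                  # mean and spread of the test-set score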
If you make a program that adapts to input, then it will be optimal for the input you adapted it to.
This leads to a problem known as overfitting.
In order to see if you have made a good or a bad model, you need to test it on some other data that is not what you used to make the model. This is why you separate your data into 2 parts.
