I use 20% of the data set as my testing set and use GridSearchCV to implement K-fold cross-validation to tune hyperparameters.
By using a pipeline, we can pass the column transformer and the machine learning algorithm to GridSearchCV together. If I set up 5-fold cross-validation for GridSearchCV, the function will use 5 different training/validation splits to train and validate each combination of hyperparameters. As far as I know, GridSearchCV uses the mean of the 5 fold scores to choose the best model.
Then my question is: how does it transform the testing set?
I'm very confused about this because, to avoid data leakage, we should use only the training set to fit the transformer. But in this case we have 5 different training sets, and I don't know which one GridSearchCV uses to fit the transformer before transforming the validation and testing sets.
My code is given below
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, mean_squared_error

# Hold out 20% of the data as the test set (i is a random seed defined elsewhere)
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=i)
# K-fold splitter used for cross-validation inside GridSearchCV
kf = KFold(n_splits=4, shuffle=True, random_state=i)
# Preprocessing and model are combined so both are refitted on each training fold
pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", ML_algo)])
grid = GridSearchCV(estimator=pipe, param_grid=param_grid,
                    scoring=make_scorer(mean_squared_error, greater_is_better=False,
                                        squared=False),
                    cv=kf, return_train_score=True, n_jobs=-1, verbose=False)
grid.fit(X_other, y_other)
# Final evaluation on the held-out test set (RMSE)
test_score = mean_squared_error(y_test, grid.predict(X_test), squared=False)
Short answer: there is no data leakage; the test set is not used (and should not be used) for training the model in your code.
Long answer: k-fold cross-validation randomly divides your X_other and y_other (the training set) into k folds. In each iteration of cross-validation, k-1 folds are used to fit the model (the whole pipeline, preprocessor included) and the remaining fold is used to evaluate it with the metric you specified in scoring= (see the diagram in the sklearn docs for details: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).
After GridSearchCV() has found the best set of hyperparameters, the whole pipeline is refitted on all of the training data with those hyperparameters (the default refit=True behaviour). It is that refitted pipeline, whose preprocessor was fitted on the full training set, that transforms X_test at prediction time. Note that X_test, y_test (the test set) are not used, and should not be used, for anything other than the final evaluation.
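To make that concrete, here is a minimal sketch (reusing the names from your code, and assuming the default refit=True) of what happens after grid.fit:
# The best pipeline, refitted on ALL of X_other / y_other with the winning hyperparameters
best_pipe = grid.best_estimator_
print(grid.best_params_)                 # hyperparameters chosen by the 5-fold CV
# predict() on the GridSearchCV object delegates to this refitted pipeline, so
# preprocessor.transform(X_test) uses statistics learned on the full training set
y_test_pred = best_pipe.predict(X_test)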
Related
I previously saw a post with code like this:
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score

scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv=cv)
My understanding is that when we apply the scaler, we should use 3 of the 4 folds to calculate the mean and standard deviation, and then apply that mean and standard deviation to all 4 folds.
In the above code, how can I know that sklearn is following this strategy? And if sklearn is not following it, i.e. if it calculates the mean/std from all 4 folds, does that mean I should not use this code?
I do like this code because it saves tons of time.
In the example you gave, I would add an additional step using sklearn.model_selection.train_test_split:
from sklearn.model_selection import train_test_split

folds = 4
# Hold out one fold's worth of data as a final test set, stratified by class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1/folds), random_state=0, stratify=y)
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=(folds - 1))
scores = cross_val_score(pipeline, X_train, y_train, cv=cv)
I think best practice is to use only the training data set (i.e., X_train, y_train) when tuning the hyperparameters of your model; the test data set (i.e., X_test, y_test) should be used as a final check, to make sure your model isn't biased towards the validation folds. At that point you would apply the same scaler that was fit on your training data to your testing data.
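As a rough sketch of that final check (reusing the pipeline defined above; pipeline.score here is just the default accuracy of LinearSVC):
# Fit the whole pipeline on the training set: the scaler's mean/std come from X_train only
pipeline.fit(X_train, y_train)
# The same fitted scaler is applied to X_test inside score()/predict()
test_accuracy = pipeline.score(X_test, y_test)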
Yes, this is done properly; this is one of the reasons for using pipelines: all the preprocessing is fitted only on training folds.
Some references.
Section 6.1.1 of the User Guide:
Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
The note at the end of section 3.1.1 of the User Guide:
Data transformation with held out data
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:
...code sample...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
...
Finally, you can look into the source for cross_val_score. It calls cross_validate, which clones and fits the estimator (in this case, the entire pipeline) on each training split. GitHub link.
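For intuition, here is a simplified sketch of the loop that cross_val_score effectively runs (assuming X and y are NumPy arrays, and reusing cv and pipeline from the question):
from sklearn.base import clone

fold_scores = []
for train_idx, valid_idx in cv.split(X):
    fold_pipe = clone(pipeline)                   # fresh, unfitted copy of the whole pipeline
    fold_pipe.fit(X[train_idx], y[train_idx])     # scaler AND SVC fitted on the training folds only
    fold_scores.append(fold_pipe.score(X[valid_idx], y[valid_idx]))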
I was assigned a task that requires creating a Decision Tree Classifier and determining the accuracy rates using the training set and 10-fold cross-validation. I went over the documentation for cross_val_predict as I believe that this is the module I am going to need.
What I am having trouble with is the splitting of the data set. As far as I am aware, in the usual case the train_test_split() method is used to split the data set into two parts: the training set and the test set. From my understanding, for K-fold cross-validation you then need to further split the training set into K parts.
My question is: do I need to split the data set at the beginning into train and test, or not?
It depends. My personal opinion is that yes, you should split your dataset into a training and a test set, and then do cross-validation on your training set with K folds. Why? Because it is useful to test, after training and fine-tuning, your model on examples it has never seen.
But some people just do cross-validation. Here is the workflow I often use:
from sklearn import model_selection
from sklearn.model_selection import cross_val_score

# Data partition: hold out 20% as the final test set
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=21)

# Cross-validation on multiple models to see which one gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then inspect the scores you just obtained using mean, std or a plot
print('Mean CV-score : ' + str(cv_score.mean()))

# Then I tune the hyperparameters of the best (or top-n best) model using another cross-val
for param in my_param:
    model = model_with_param
    cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
    print('Mean CV-score with param: ' + str(cv_score.mean()))

# Now that I have the best parameters for the model, I can train the final model
model = model_with_best_parameters
model.fit(X_train, Y_train)

# And finally test your tuned model on the test set
y_pred = model.predict(X_test)
plot_or_print_metric(y_pred, Y_test)
Short answer: NO
Long answer.
If you use K-fold cross-validation, you do not usually split the data into train/test at the beginning.
There are a lot of ways to evaluate a model. The simplest one is to use a train/test split, fit the model on the train set and evaluate it on the test set.
If you adopt a cross-validation method, then you directly do the fitting/evaluation during each fold/iteration.
It's up to you what to choose but I would go with K-Folds or LOOCV.
The K-Folds procedure (for K=5) is summarised in the figure below: [figure not included]
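For the decision-tree task in the question, a minimal sketch of the "no initial split" option could look like this (X and y being the full dataset; random_state=0 is an arbitrary choice):
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
# Each of the 10 iterations trains on 9 folds and evaluates on the held-out fold
scores = cross_val_score(clf, X, y, cv=cv)
print(scores.mean(), scores.std())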
I am trying to solve a regression problem on the Boston dataset with the help of a random forest regressor. I was using GridSearchCV to select the best hyperparameters.
Problem 1
Should I fit GridSearchCV on some X_train, y_train and then get the best parameters?
OR
Should I fit it on X, y to get the best parameters? (X, y = the entire dataset)
Problem 2
Say I fit it on X, y, get the best parameters, and then build a new model with these best parameters.
What should I train this new model on?
Should I train the new model on X_train, y_train or on X, y?
Problem 3
If I train the new model on X, y, then how will I validate the results?
My code so far
#Dataframes
feature_cols = ['CRIM','ZN','INDUS','NOX','RM','AGE','DIS','TAX','PTRATIO','B','LSTAT']
X = boston_data[feature_cols]
y = boston_data['PRICE']
Train Test Split of Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
Grid Search to get best hyperparameters
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Base random forest passed to the grid search
RFReg = RandomForestRegressor(random_state = 1)

param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth' : [4,5,6,7,8,9,10]
}

CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10)
CV_rfc.fit(X_train, y_train)
CV_rfc.best_params_
#{'max_depth': 10, 'n_estimators': 100}
Train a model with max_depth=10, n_estimators=100
RFReg = RandomForestRegressor(max_depth = 10, n_estimators = 100, random_state = 1)
RFReg.fit(X_train, y_train)
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
RMSE: 2.8139766730629394
I just want some guidance with what the correct steps would be
In general, to tune the hyperparameters you should always train your model on X_train and use X_test to check the results; you tune the parameters based on the results obtained on X_test.
You should never tune hyperparameters over the whole dataset, because that would defeat the purpose of the train/test split (as you correctly suspect in Problem 3).
This is indeed a valid concern.
Problem 1
GridSearchCV does cross-validation internally to find a good set of hyperparameters. But you should still keep a held-out set to make sure the chosen parameters are sound on it (so in the end you have train, validation, and test sets).
Problem 2
GridSearchCV already gives you the best estimator (best_estimator_), so you don't need to train a new one. Strictly speaking, cross-validation only checks that the model-building procedure is sound; you can then train on the full dataset (see https://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation for a detailed discussion).
Problem 3
What you have already validated is the way you trained your model (i.e. you have validated that the hyperparameters you found are sound and that the training works as expected for the data you have).
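Putting the three points together as code, a rough sketch (reusing the names from the question and assuming GridSearchCV's default refit=True) could look like this:
from sklearn.metrics import mean_squared_error

# Problem 1: fit the grid search on the training split only
CV_rfc.fit(X_train, y_train)

# Problem 2: no need to train a new model by hand; with refit=True (the default)
# the best parameter combination has already been refitted on all of X_train
best_rf = CV_rfc.best_estimator_

# Problem 3: evaluate that refitted model on the held-out test split
y_pred = best_rf.predict(X_test)
rmse = mean_squared_error(y_test, y_pred) ** 0.5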
The code for hyperparameter tuning using scikit-learn looks like this:
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10,
                  n_jobs=-1)
gs = gs.fit(X_train, y_train)
clf = gs.best_estimator_
clf.fit(X_train, y_train)
where, for each combination of hyperparameters, K-fold cross-validation is performed; the combination that gives the best score is then used to fit a model on the entire training data, and that model is used to predict on the test (unseen) data.
My question is how can I do the same job using nested cross-validation.
The below code performs nested 5x2 cross-validation
from sklearn.model_selection import GridSearchCV, cross_val_score

gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=2)
scores = cross_val_score(gs, X_train, y_train,
                         scoring='accuracy', cv=5)
where GridSearchCV runs the inner loop and cross_val_score() runs the outer loop. Since cv=5 is given to cross_val_score(), the result is five different fitted models, and hence potentially five different sets of hyperparameters.
If the model is stable enough, all the resulting hyperparameters may be the same. If not, one would naturally choose the hyperparameters corresponding to the highest entry in the scores array returned by cross_val_score().
I would like to know how to access it so that I can use it to once again fit the model using the entire training data and finally predict on the test dataset.
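One hedged way to get at those per-fold models (a sketch, not necessarily the canonical solution) is to use cross_validate with return_estimator=True instead of cross_val_score, which returns the fitted inner searches:
from sklearn.model_selection import cross_validate

outer = cross_validate(gs, X_train, y_train, scoring='accuracy', cv=5,
                       return_estimator=True)
# Each returned estimator is a fitted GridSearchCV, so it exposes best_params_
for fold_gs, fold_score in zip(outer['estimator'], outer['test_score']):
    print(fold_score, fold_gs.best_params_)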
I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
print(CV_score)
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)
The matrices X_train,y_train,X_test and y_test are saved in a .mat file available at this link. My problem is the following :
Using this approach, I get good performance on the training data (CV_score = 0.8), but the performance on the test data is much worse: AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC < 0.5 if a naive classifier should score AUC = 0.5? Is this due to the fact that I have a small number of samples for training?
I have noticed that the performance on test data improves if I change the code for :
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
print(AUC)
Indeed, AUC = 0.87 for C=1 and 0.90 for C=0.01. Why is the AUC score so much better when using cross-validated predictions? Is it because cross-validation makes predictions on subsets of the test data which do not contain the objects/samples that decrease the AUC?
Looks like you are encountering an overfitting problem, i.e. the classifier trained using the training data is overfitting to the training data. It has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier on part of your test data and then predicting on the rest, so the performance looks much better.
Overall, there seems to be quite some difference between your training and testing datasets. So the classifier with the highest training accuracy doesn't work well on your testing set.
Another point not directly related to your question: since the number of your training samples is much smaller than the feature dimension, it may be helpful to perform dimensionality reduction before feeding the data to the classifier.
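As an illustration of that last point, one could insert a PCA step into the pipeline (n_components=10 is an arbitrary choice here for illustration, not a tuned value):
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

clf = make_pipeline(StandardScaler(),
                    PCA(n_components=10),        # reduce the 258 features before fitting
                    LogisticRegression(C=1, class_weight='balanced'))
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)[:, 1]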
It looks like your training and test processes are inconsistent. Although your code suggests that you intend to standardize your data, you fail to do so during testing. What I mean is:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you never fit the pipeline (clf.fit); you fit only the logistic regression. This matters, because your cross-validated score is computed with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')), but at test time, instead of using the pipeline to predict, you use only LogReg, so the test data are never standardized.
The second point you raise is different. In y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
you get predictions by doing cross-validation on the test data alone, ignoring the training data. Here you do standardize the data, since you use clf (the pipeline), and thus your score is high; this is further evidence that the standardization step matters.
To summarize, I believe that standardizing the test data (with the statistics learned on the training data) will improve your test score.
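In code, the fix described above would look roughly like this (reusing the names from the question):
clf = make_pipeline(StandardScaler(), LogReg)
clf.fit(X_train, y_train)                    # fits the scaler AND the logistic regression on the training data
probas = clf.predict_proba(X_test)[:, 1]     # X_test is standardized with the training-set mean/std
AUC = metrics.roc_auc_score(y_test, probas)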
Firstly, it makes little sense to have 258 features for only 60 training items. Secondly, cv=10 with 60 items means you split the data into 10 train/test folds, each of which has only 6 items in the test fold, so whatever results you obtain will be useless. You need more training data and fewer features.