Held out training and validation set in gridsearchcv sklearn - python

I see that in GridSearchCV the best parameters are determined based on cross-validation, but what I really want to do is to determine the best parameters based on one held-out validation set instead of cross-validation.
Not sure if there is a way to do that. I found some similar posts about customizing the cross-validation folds. However, what I really need is to train on one set and validate the parameters on a separate validation set.
One more piece of information: my dataset is basically a text Series created by pandas.

I did come up with an answer to my own question through the use of PredefinedSplit
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 keeps a sample in the training set for every split;
# 0 assigns it to the single validation fold.
train_ind = np.full(len(doc_train), -1, dtype=int)
val_ind = np.full(len(doc_val), 0, dtype=int)
ps = PredefinedSplit(test_fold=np.concatenate((train_ind, val_ind)))
and then pass it as the cv argument to GridSearchCV:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1, cv=ps)
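Note that with PredefinedSplit the search has to be fitted on the combined training and validation data, in the same order used to build the test_fold array. A minimal sketch of that last step, assuming doc_train and doc_val are the pandas text Series from the question and y_train / y_val (hypothetical names) hold their labels:
import numpy as np
import pandas as pd

# Concatenate in the same order as test_fold: training samples first, then validation.
X_all = pd.concat([doc_train, doc_val], ignore_index=True)
y_all = np.concatenate([y_train, y_val])  # y_train / y_val are hypothetical label arrays

grid_search.fit(X_all, y_all)   # each candidate trains on doc_train and is scored on doc_val
print(grid_search.best_params_)
print(grid_search.best_score_)  # score on the held-out validation set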

Use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))

Related

Finding accuracy, precision and recall of a model after hyperparameter tuning in sklearn

I have a binary classification problem, for which I've chosen 3 algorithms: Logistic Regression, SVM and AdaBoost. I'm using grid search and k-fold cross-validation on each of these to find the optimal set of hyper-parameters. Now, based on accuracy, precision and recall, I need to choose the best model. But the problem is that I'm not able to find any suitable way to retrieve this information. My code is given below:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

# TODO: Initialize the classifiers
clfr_A = LogisticRegression(random_state=128)
clfr_B = SVC(random_state=128)
clfr_C = AdaBoostClassifier(random_state=128)

lr_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
svc_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}
adb_param_grid = {'n_estimators': [50, 100, 150, 200, 250, 500],
                  'learning_rate': [.5, .75, 1.0, 1.25, 1.5, 1.75, 2.0]}

# TODO: Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score, beta=0.5)

# TODO: Perform grid search on the classifiers using 'scorer' as the scoring method with GridSearchCV()
clfrs = [clfr_A, clfr_B, clfr_C]
params = [lr_param_grid, svc_param_grid, adb_param_grid]

for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scorer, refit=True)
    grid_fit = grid_obj.fit(features_raw, target_raw)
    print(grid_fit.best_estimator_)
    print(grid_fit.cv_results_)
The problem is that cv_results_ gives out a lot of info, but I'm not able to find anything relevant except mean_test_score. Moreover, I don't see any accuracy, precision or recall related metric there.
I can think of one way to achieve it. I can change the for loop to look something like the following:
score_params = ["accuracy", "precision", "recall"]

for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scorer, refit=True)
    grid_fit = grid_obj.fit(features_raw, target_raw)
    best_clf = grid_fit.best_estimator_
    for score in score_params:
        print(score, ":", cross_val_score(best_clf, features_raw, target_raw,
                                          scoring=score, cv=3).mean())
But is there any better way of doing it? It seems I'm doing the operations multiple times for each model. Any help is appreciated.
GridSearchCV is doing exactly what you told it to do. You passed the f-beta scorer, so mean_test_score holds the f-beta results for each parameter combination.
If you want to access other metrics, you need to tell GridSearchCV explicitly to compute them.
GridSearchCV in newer versions of scikit-learn supports multi-metric scoring, so you can pass multiple scorers at once. From the documentation:
scoring : string, callable, list/tuple, dict or None, default: None
...
...
For evaluating multiple metrics, either give a list of (unique)
strings or a dict with names as keys and callables as values.
See this example here:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#running-gridsearchcv-using-multiple-evaluation-metrics
And change your scoring param to:
scoring = {'Accuracy': 'accuracy',
           'FBeta': make_scorer(fbeta_score, beta=0.5),
           # ... add others here as you want.
           }
But when you do this, you also need to change the refit param. Since the different metrics will rank the parameter combinations differently, you need to decide which one to use when refitting the estimator, so pick one of the keys from the scoring dict for refit:
for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scoring, refit='FBeta')
    ...
    ...
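With the multi-metric scoring dict in place, cv_results_ gains one mean_test_<name> entry per key, which is where the accuracy, precision and recall from the original question can be read off. A minimal sketch of how that could look (the 'Precision' and 'Recall' keys are additions for illustration, not part of the answer above):
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score

scoring = {'Accuracy': 'accuracy',
           'Precision': 'precision',
           'Recall': 'recall',
           'FBeta': make_scorer(fbeta_score, beta=0.5)}

for clfr, param in zip(clfrs, params):
    grid_obj = GridSearchCV(clfr, param, cv=3, scoring=scoring, refit='FBeta')
    grid_fit = grid_obj.fit(features_raw, target_raw)
    best_idx = grid_fit.best_index_          # best candidate according to the refit metric
    results = grid_fit.cv_results_
    print(grid_fit.best_estimator_)
    print('Accuracy :', results['mean_test_Accuracy'][best_idx])
    print('Precision:', results['mean_test_Precision'][best_idx])
    print('Recall   :', results['mean_test_Recall'][best_idx])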

Prevent overfitting in Logistic Regression using Sci-Kit Learn

I trained a model using Logistic Regression to predict whether a name field and a description field belong to a profile of a male, female, or brand. My train accuracy is around 99% while my test accuracy is around 83%. I have tried implementing regularization by tuning the C parameter, but the improvements were barely noticeable. I have around 5,000 examples in my training set. Is this an instance where I just need more data, or is there something else I can do in Sci-Kit Learn to get my test accuracy higher?
Overfitting is a multifaceted problem. It could be your train/test/validate split (anything from 50/40/10 to 90/9/1 could change things). You might need to shuffle your input. Try an ensemble method, or reduce the number of features. You might have outliers throwing things off.
Then again, it could be none of these, or all of these, or some combination of these.
For starters, try to plot the test score as a function of the test split size and see what you get (a sketch of that follows).
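A minimal sketch of that plot, assuming a generic feature matrix X and label vector y (hypothetical names) and a plain LogisticRegression model:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

split_sizes = np.arange(0.1, 0.55, 0.05)
test_scores = []

for size in split_sizes:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=size, shuffle=True, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    test_scores.append(model.score(X_test, y_test))

plt.plot(split_sizes, test_scores, marker='o')
plt.xlabel('test split size')
plt.ylabel('test accuracy')
plt.show()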
# The 'C' value in Logistic Regression works very similarly to the one in the
# Support Vector Machine (SVM) algorithm. When I use SVM I like to use GridSearch
# to find the best possible values for 'C' and 'gamma'; maybe this can give you some light:

# For SVC you can remove the gamma and kernel keys
# param_grid = {'C': [0.1, 1, 10, 100, 1000],
#               'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
#               'kernel': ['rbf']}
param_grid = {'C': [0.1, 1, 10, 100, 1000]}

import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
# Train and fit your model to see initial values
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.30, random_state=101)
model = SVC()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
# Find the best 'C' value
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)
grid.best_params_
c_val = grid.best_estimator_.C
#Then you can re-run predictions on this grid object just like you would with a normal model.
grid_predictions = grid.predict(X_test)
# use the best 'C' value found by GridSearch and reload your LogisticRegression module
logmodel = LogisticRegression(C=c_val)
logmodel.fit(X_train,y_train)
print(confusion_matrix(y_test,grid_predictions))
print(classification_report(y_test,grid_predictions))

Performing grid search with a predefined validation set Sklearn

This question has been asked several times before, but I get an error when following the answer.
First I specify which part is the training set and the validation set as follows.
my_test_fold = []
for i in range(len(train_x)):
    my_test_fold.append(-1)
for i in range(len(test_x)):
    my_test_fold.append(0)
And then the grid search is performed.
from sklearn.model_selection import PredefinedSplit
from xgboost import XGBClassifier

param = {
    'n_estimators': [200],
    'max_depth': [5],
    'min_child_weight': [3],
    'reg_alpha': [6],
    'gamma': [0.6],
    'scale_neg_weight': [1],
    'learning_rate': [0.09]
}
gsearch1 = GridSearchCV(estimator=XGBClassifier(objective='reg:linear', seed=1),
                        param_grid=param,
                        scoring='roc_auc',
                        cv=PredefinedSplit(test_fold=my_test_fold),
                        verbose=1)
gsearch1.fit(new_data_df, df_y)
But I get the following error
object of type 'PredefinedSplit' has no len()
Try replacing
cv = PredefinedSplit(test_fold=my_test_fold)
with
cv = list(PredefinedSplit(test_fold=my_test_fold).split(new_data_df, df_y))
The reason is that you may need to call the split method to actually generate the training/validation split, and then turn the resulting generator into a list so that it has a length.
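As a quick sanity check (a sketch reusing the variable names from the question), the resulting list contains exactly one (train_indices, test_indices) pair, which is what GridSearchCV iterates over:
from sklearn.model_selection import PredefinedSplit

ps = PredefinedSplit(test_fold=my_test_fold)
cv_list = list(ps.split(new_data_df, df_y))

print(len(cv_list))                  # 1 -> a single train/validation split
train_idx, val_idx = cv_list[0]
print(len(train_idx), len(val_idx))  # should match len(train_x) and len(test_x)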
The hypopt Python package (pip install hypopt), for which I am an author, was created for this exact purpose: parameter optimization with a validation set. It works with scikit-learn models and can be used with Tensorflow, PyTorch, Caffe2, etc.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
Edit: Has something changed with hypopt to cause the sudden recent downvotes? Some feedback would help as hypopt solves this exact problem and if there is an issue, we should fix it.

Use both Recursive Feature Eliminiation and Grid Search in SciKit-Learn

I have a machine learning problem and want to optimize my SVC estimator as well as the feature selection.
For optimizing the SVC estimator I essentially use the code from the docs. Now my question is: how can I combine this with recursive feature elimination with cross-validation (RFECV)? That is, for each parameter combination I want to run RFECV in order to determine the best combination of parameters and features.
I tried the solution from this thread, but it yields the following error:
ValueError: Invalid parameter C for estimator RFECV. Check the list of available parameters with `estimator.get_params().keys()`.
My code looks like this:
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-4, 1e-3], 'C': [1, 10]},
                    {'kernel': ['linear'], 'C': [1, 10]}]

estimator = SVC(kernel="linear")
selector = RFECV(estimator, step=1, cv=3, scoring=None)
clf = GridSearchCV(selector, tuned_parameters, cv=3)
clf.fit(X_train, y_train)
The error appears at clf = GridSearchCV(selector, tuned_parameters, cv=3).
I would use a Pipeline, but here you have a more adequate answer:
Recursive feature elimination and grid search using scikit-learn
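For reference, the ValueError above comes from passing SVC parameters (like C) directly to the RFECV wrapper. One way around it, sketched below under the assumption of a linear kernel (RFE needs coef_ or feature_importances_, so an rbf kernel cannot be used inside RFECV), is to prefix the parameters with estimator__ so they reach the wrapped SVC:
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV

# Tune the wrapped SVC's C through the RFECV wrapper via the estimator__ prefix.
tuned_parameters = {'estimator__C': [1, 10]}

selector = RFECV(SVC(kernel="linear"), step=1, cv=3)
clf = GridSearchCV(selector, tuned_parameters, cv=3)
clf.fit(X_train, y_train)

print(clf.best_params_)
print(clf.best_estimator_.n_features_)   # number of features selected by RFECV
This keeps the feature selection inside each grid-search fold, so the number of selected features is re-estimated for every value of C.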

Grid-search with specific validation data

I'm looking for a way to grid-search for hyperparameters in sklearn without using K-fold cross-validation, i.e. I want my grid to train on a specific dataset (X1, y1 in the example below) and validate itself on a specific hold-out dataset (X2, y2 in the example below).
X1, y1 = train data
X2, y2 = validation data

clf_ = SVC(kernel='rbf', cache_size=1000)
Cs = [1, 10.0, 50, 100.0]
Gammas = [0.4, 0.42, 0.44, 0.46, 0.48, 0.5, 0.52, 0.54, 0.56]
clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=???,  # validate on X2,y2
                   n_jobs=8, verbose=10)
clf.fit(X1, y1)
Use the hypopt Python package (pip install hypopt). It's a professional package created specifically for parameter optimization with a validation set. It works with any scikit-learn model out-of-the-box and can be used with Tensorflow, PyTorch, Caffe2, etc. as well.
# Code from https://github.com/cgnorthcutt/hypopt
# Assuming you already have train, test, val sets and a model.
from hypopt import GridSearch
from sklearn.svm import SVR

param_grid = [
    {'C': [1, 10, 100], 'kernel': ['linear']},
    {'C': [1, 10, 100], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
# Grid-search all parameter combinations using a validation set.
opt = GridSearch(model=SVR(), param_grid=param_grid)
opt.fit(X_train, y_train, X_val, y_val)
print('Test Score for Optimized Parameters:', opt.score(X_test, y_test))
clf = GridSearchCV(clf_, dict(C=Cs, gamma=Gammas),
                   cv=???,  # validate on X2,y2
                   n_jobs=8, verbose=10)
n_jobs only controls parallelism, it does not change the validation scheme. If n_jobs=-1, the processing will use all the cores on your machine; if it is 1, only one core will be used.
If cv=5, it will run five-fold cross-validation for every parameter combination.
In your case the total number of fits will be 4 (size of Cs) * 9 (size of Gammas) * 5 (value of cv).
If you are using cross-validation, it does not make much sense to also hold out data for rechecking your model. If you are not confident about the performance, you can just increase cv to get a better estimate.
This will be very time-consuming, especially for an SVM, so I would rather suggest using RandomizedSearchCV, which lets you choose how many parameter settings to sample randomly, as sketched below.
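A minimal sketch of that suggestion, reusing the Cs and Gammas lists from the question (the n_iter and random_state values here are arbitrary choices):
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_distributions = dict(C=Cs, gamma=Gammas)

clf = RandomizedSearchCV(SVC(kernel='rbf', cache_size=1000),
                         param_distributions,
                         n_iter=10,          # sample only 10 of the 36 combinations
                         cv=5,
                         n_jobs=-1,
                         verbose=10,
                         random_state=42)
clf.fit(X1, y1)
print(clf.best_params_)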
