How to fine-tune SVM hyperparameters with KFold - Python

I would like to use GridSearchCV to fine-tune my SVM model. I copied the code below from other GitHub repositories, and it has been working fine for my cross-validation folds.
X = Corpus.drop(['text','ManipulativeTag','compound'],axis=1).values # note: 'compound' is dropped because of Naive Bayes
y = Corpus['ManipulativeTag'].values
kf = KFold(n_splits=5, shuffle=True, random_state=1)
# Create splits
splits = kf.split(X)
# Access the training and validation indices of splits
kfold_accuracy = {}
kfold_precision = {}
kfold_f = {}
kfold_recall = {}
for i, (train_index, val_index) in enumerate(splits):
    print("Split n°: ", i)
    # Setup the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    # print("training:", train_index, "validations:", val_index)
    X_val,y_val= X[val_index], y[val_index]
    SVM = svm.SVC(C=1.0, kernel='linear', random_state=1111, probability=True) ### the base estimator
    SVM.fit(X_train, y_train)
    # predict the labels on validation dataset
    predictions = SVM.predict(X_val)
    # Use accuracy_score function to get the accuracy
    kfold_accuracy[i] = accuracy_score(y_val, predictions)
    kfold_precision[i] = precision_score(y_val, predictions)
    kfold_f[i] = f1_score(y_val,predictions)
    kfold_recall[i] = recall_score(y_val,predictions)
However, when trying to implement a grid search, most of the articles I found use train_test_split() rather than my kf.split(), and I am having trouble finding the right place to put the GridSearchCV() call:
GridSearchCV(estimator=classifier,
             param_grid=grid_param,
             scoring='accuracy',
             cv=5,
             n_jobs=-1)

I found my solution here: Grid search and cross validation SVM
I have copied this from the post:
tuned_parameters = [
    {'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
     'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
    {'kernel': ['sigmoid'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
     'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
    {'kernel': ['linear'], 'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]}]
I kept the rest of my code and only changed the loop, adding GridSearchCV() inside it:
for i, (train_index, val_index) in enumerate(splits):
    print("Split n°: ", i)
    # Setup the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    X_val,y_val= X[val_index], y[val_index]
    # this is where I put GridSearch()
    # here cv cannot be 1, so I put 2 instead
    SVM = GridSearchCV(SVC(), tuned_parameters, cv=2, scoring='accuracy')
    SVM.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(SVM.best_params_)
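
For readers wondering whether the manual loop is needed at all: GridSearchCV accepts a KFold object directly as its cv argument, so the search can do the splitting itself. The following is only a minimal sketch under assumptions: the data is synthetic (make_classification stands in for the Corpus features and labels) and the grid is deliberately trimmed. The optional outer loop shows one way to get per-fold metrics for the tuned model (nested cross-validation).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the Corpus features and labels (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Small illustrative grid (assumption); widen it for real use.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [1e-2, 1e-3]},
]

# Inner loop: GridSearchCV runs the 5-fold split itself, no manual loop needed.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=inner_cv)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Optional outer loop: score the whole tuning procedure on held-out folds.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
nested = cross_validate(
    search, X, y, cv=outer_cv,
    scoring=["accuracy", "precision", "recall", "f1"],
)
print(nested["test_accuracy"].mean())
```

The inner search picks the hyperparameters; the outer folds only estimate how well that tuned model generalizes, which mirrors the four metric dictionaries kept in the original loop.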

Related

Invalid parameter for estimator Pipeline (SVR)

I have a data set with 100 columns of continuous features and a continuous label, and I want to run SVR: selecting the relevant features, tuning the hyperparameters, and then cross-validating the model that is fit to my data.
I wrote this code:
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the pipeline to evaluate
model = SVR()
fs = SelectKBest(score_func=mutual_info_regression)
pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])
# define the grid
grid = dict()
#How many features to try
grid['estimator__sel__k'] = [i for i in range(1, X_train.shape[1]+1)]
# define the grid search
#search = GridSearchCV(pipeline, grid, scoring='neg_mean_squared_error', n_jobs=-1, cv=cv)
search = GridSearchCV(
    pipeline,
    # estimator=SVR(kernel='rbf'),
    param_grid={
        'estimator__svr__C': [0.1, 1, 10, 100, 1000],
        'estimator__svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
        'estimator__svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
    },
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)
for param in search.get_params().keys():
    print(param)
# perform the search
results = search.fit(X_train, y_train)
# summarize best
print('Best MAE: %.3f' % results.best_score_)
print('Best Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))
I get the error:
ValueError: Invalid parameter estimator for estimator Pipeline(memory=None,
steps=[('sel',
SelectKBest(k=10,
score_func=<function mutual_info_regression at 0x7fd2ff649cb0>)),
('svr',
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
tol=0.001, verbose=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
When I print estimator.get_params().keys(), as suggested in the error message, I get:
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__sel
estimator__svr
estimator__sel__k
estimator__sel__score_func
estimator__svr__C
estimator__svr__cache_size
estimator__svr__coef0
estimator__svr__degree
estimator__svr__epsilon
estimator__svr__gamma
estimator__svr__kernel
estimator__svr__max_iter
estimator__svr__shrinking
estimator__svr__tol
estimator__svr__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
Fitting 5 folds for each of 405 candidates, totalling 2025 fits
But when I change the line:
pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])
to:
pipeline = Pipeline(steps=[('estimator__sel',fs), ('estimator__svr', model)])
I get the error:
ValueError: Estimator names must not contain __: got ['estimator__sel', 'estimator__svr']
Could someone explain what I'm doing wrong, i.e. how do I combine the pipeline/feature selection step into the GridSearchCV?
As a side note, if I comment out pipeline in the GridSearchCV call and uncomment estimator=SVR(kernel='rbf'), the cell runs without issue, but in that case I presume I am not incorporating the feature selection, as it's not called anywhere. I have seen some previous SO questions, e.g. here, but they don't seem to answer this specific question.
Is there a cleaner way to write this?
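The first error message is about the pipeline's parameters, not the search's parameters: the names printed by search.get_params() carry an estimator__ prefix, but param_grid is applied to the pipeline itself, which has no parameter called estimator. So it is your param_grid that is wrong, not the pipeline step names. Running pipeline.get_params().keys() should show you the right parameter names. Your grid should be: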
param_grid={
    'svr__C': [0.1, 1, 10, 100, 1000],
    'svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
    'svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
},
I don't know how substituting the plain SVR for the pipeline runs; your parameter grid doesn't specify the right things there either...
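
To make the naming rule concrete, here is a minimal sketch of the feature-selection pipeline inside a grid search. It assumes synthetic data from make_regression, and it swaps in f_regression for mutual_info_regression only to keep the run fast; the step names sel and svr follow the question, while the grid values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Synthetic stand-in for scaled_df / target (assumption).
X, y = make_regression(n_samples=120, n_features=8, n_informative=3,
                       noise=0.1, random_state=0)

pipeline = Pipeline(steps=[
    ("sel", SelectKBest(score_func=f_regression)),  # swapped-in score function
    ("svr", SVR()),
])

# Keys are "<step name>__<parameter>" with no "estimator__" prefix,
# because the grid is applied to the pipeline itself.
param_grid = {
    "sel__k": [2, 4, 8],
    "svr__C": [1, 10],
    "svr__gamma": [0.01, 0.1],
}

search = GridSearchCV(pipeline, param_grid,
                      scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)

# The valid key names can be listed from the pipeline directly.
valid_keys = set(pipeline.get_params().keys())
print(search.best_params_)
```

Because the selector is a pipeline step, it is refit inside every fold, so the number of features k is tuned together with the SVR parameters.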

Using early stopping with SVR and grid search

I am trying to run a grid search on my SVR model. Because fitting takes too long, I wonder whether I could use early stopping, but I don't know how to do so.
Instead, I used max_iter, but I am still not sure the parameters I get are the best ones. Any suggestions? Thank you!
#We can use a grid search to find the best parameters for this model. Lets try
#X_feat = F_DF.drop(columns=feat)
y = F_DF["Production_MW"]
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2, random_state=42)
#Define a list of parameters for the models
params = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
          'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
          'epsilon': [0.001, 0.01, 0.1, 1, 10, 100]
          }
#searchcv.fit(X, y, callback=on_step)
#We can build Grid Search model using the above parameters.
#cv=5 means cross validation with 5 folds
grid_search = GridSearchCV(SVR(kernel='rbf'), params, cv=5, n_jobs=-1,verbose=1)
grid_search.fit(X_train, y_train)
print("train score - " + str(grid_search.score(X_train, y_train)))
print("test score - " + str(grid_search.score(X_test, y_test)))
print("SVR GridSearch score: "+str(grid_search.best_score_))
print("SVR GridSearch params: ")
print(grid_search.best_params_)
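
As far as I know, scikit-learn's SVR has no early-stopping hook; max_iter and tol are the closest controls. One commonly used alternative for cutting search time, sketched below, is to sample the grid with RandomizedSearchCV instead of trying all 216 combinations. This is a swapped-in technique, not early stopping, and the synthetic data, n_iter=20 and max_iter=10000 are assumptions chosen for illustration.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the production data (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

params = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
    "epsilon": [0.001, 0.01, 0.1, 1, 10, 100],
}

# n_iter=20 samples 20 of the 6 * 6 * 6 = 216 combinations.
# max_iter caps the solver iterations of each individual fit.
search = RandomizedSearchCV(
    SVR(kernel="rbf", max_iter=10000),
    params, n_iter=20, cv=5, random_state=42,
)

with warnings.catch_warnings():
    # Fits that hit max_iter emit ConvergenceWarning; silence them here.
    warnings.simplefilter("ignore", ConvergenceWarning)
    search.fit(X_train, y_train)

print(search.best_params_)
print("test score:", search.score(X_test, y_test))
```

The trade-off is that a sampled search may miss the single best grid point, so the result is best read as a good region to refine rather than a guaranteed optimum.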

How to do only simple cross validation using GridSearchCV

I am using the code below to perform both simple cross-validation and K-fold cross-validation:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb
import numpy as np
# our hyperparameters to choose from
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2]
n_estimators = [30, 50, 100, 150, 200]
param_grid = dict(learning_rate = learning_rate, n_estimators = n_estimators)
xgb_model = xgb.XGBClassifier(random_state=42, n_jobs = -1)
clf = GridSearchCV(xgb_model, param_grid, scoring = 'roc_auc', cv=3, return_train_score=True)
sc = clf.fit(X_train, y_train)
# getting all the results
scores = clf.cv_results_
# getting train scores and cross validation scores
train_score = scores['mean_train_score']
cv_score = scores['mean_test_score']
Access the classifier trained with the best set of hyper-parameters, then call the score method, which will make predictions from X_cv and score accuracy compared to y_cv:
clf.best_estimator_.score(X_cv,y_cv)
If you just want the predictions, then call the predict method instead with X_cv as argument.
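
If the goal is a single train/validation split inside GridSearchCV itself, one option is to pass a splitter that yields exactly one split as cv. The sketch below is a sketch under assumptions: ShuffleSplit with n_splits=1 is one such splitter, the data is synthetic, and GradientBoostingClassifier is swapped in for XGBClassifier only so the example depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic stand-in for X_train / y_train (assumption).
X_train, y_train = make_classification(n_samples=300, n_features=10,
                                       random_state=42)

param_grid = {"learning_rate": [0.01, 0.1], "n_estimators": [30, 50]}

# One split only: 75% of the rows for fitting, 25% for validation.
single_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=42)

clf = GridSearchCV(
    GradientBoostingClassifier(random_state=42),  # swapped-in estimator
    param_grid, scoring="roc_auc", cv=single_split,
    return_train_score=True,
)
clf.fit(X_train, y_train)

# With one split, the "mean" scores are just the scores on that split.
train_score = clf.cv_results_["mean_train_score"]
cv_score = clf.cv_results_["mean_test_score"]
print(clf.n_splits_, clf.best_params_)
```

With a single split the scores are noisier than K-fold averages, so small differences between candidates should be read with care.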

Results from GridSearchCV/RandomizedSearchCV cannot be reproduced by running a single model using the same parameters

I am running RandomizedSearchCV with 5-folds in order to find best parameters. I have a hold-out set (X_test) that I use to predict. My portion of code is:
svc= SVC(class_weight=class_weights, random_state=42)
Cs = [0.01, 0.1, 1, 10, 100, 1000, 10000]
gammas = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
param_grid = {'C': Cs,
              'gamma': gammas,
              'kernel': ['linear', 'rbf', 'poly']}
my_cv = TimeSeriesSplit(n_splits=5).split(X_train)
rs_svm = RandomizedSearchCV(SVC(), param_grid, cv = my_cv, scoring='accuracy',
                            refit='accuracy', verbose = 3, n_jobs=1, random_state=42)
rs_svm.fit(X_train, y_train)
y_pred = rs_svm.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
clfreport = classification_report(y_test, y_pred)
print (rs_svm.best_params_)
The result is classification report:
Now, I am interested in reproducing this result using a run-alone model (no randomizedsearchCV) with the selected parameters:
from sklearn.model_selection import TimeSeriesSplit
tcsv=TimeSeriesSplit(n_splits=5)
for train_index, test_index in tcsv.split(X_train):
    train_index_ = int(train_index.shape[0])
    test_index_ = int(test_index.shape[0])
    X_train_, y_train_ = X_train[0:train_index_],y_train[0:train_index_]
    X_test_, y_test_ = X_train[test_index_:],y_train[test_index_:]
    class_weights = compute_class_weight('balanced', np.unique(y_train_), y_train_)
    class_weights = dict(enumerate(class_weights))
    svc= SVC(C=0.01, gamma=0.1, kernel='linear', class_weight=class_weights, verbose=True,
             random_state=42)
    svc.fit(X_train_, y_train_)
y_pred_=svc.predict(X_test)
cm = confusion_matrix(y_test, y_pred_)
clfreport = classification_report(y_test, y_pred_)
In my understanding, the two classification reports should be identical, but my results after this run are:
Does anyone have any suggestions why that might be happening?
Given your 1st code snippet, where you use RandomizedSearchCV to find the best hyperparameters, you don't need to do any splitting again; so, in your 2nd snippet, you should just fit using the found hyperparameters and the class weights using the whole of your training set, and then predict on your test set:
# classes and y are passed by keyword, which newer scikit-learn releases require
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weights = dict(enumerate(class_weights))
svc= SVC(C=0.01, gamma=0.1, kernel='linear', class_weight=class_weights, verbose=True, random_state=42)
svc.fit(X_train, y_train)
y_pred_=svc.predict(X_test)
# evaluate the new predictions (y_pred_), not the earlier y_pred from the search
cm = confusion_matrix(y_test, y_pred_)
clfreport = classification_report(y_test, y_pred_)
The discussion in Order between using validation, training and test sets might be useful for clarifying the procedure...
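
The point can be checked with a small sketch. With the default refit, the search refits the best configuration on the whole training set, so a standalone model with the same parameters, fit on the same data, should predict identically. The assumptions here are synthetic data, a trimmed grid, and class_weight="balanced" set on the estimator that is passed to the search. Note that the question's search was given a plain SVC(), so the class weights were not part of the searched model there, which may be a second source of the mismatch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (RandomizedSearchCV, TimeSeriesSplit,
                                     train_test_split)
from sklearn.svm import SVC

# Synthetic stand-in for the data; keep row order as for a time series.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

param_grid = {"C": [0.01, 0.1, 1, 10],
              "gamma": [1e-1, 1e-2, 1e-3],
              "kernel": ["linear", "rbf"]}

# The same base estimator (including class weights) goes into the search.
base = SVC(class_weight="balanced", random_state=42)
rs_svm = RandomizedSearchCV(base, param_grid, n_iter=5,
                            cv=TimeSeriesSplit(n_splits=5),
                            scoring="accuracy", random_state=42)
rs_svm.fit(X_train, y_train)
y_pred = rs_svm.predict(X_test)  # uses the model refit on all of X_train

# Standalone model: same settings, fit once on the whole training set.
svc = SVC(class_weight="balanced", random_state=42, **rs_svm.best_params_)
svc.fit(X_train, y_train)
y_pred_ = svc.predict(X_test)

print(np.array_equal(y_pred, y_pred_))
```

Unpacking rs_svm.best_params_ into the standalone model avoids retyping the selected values by hand, which removes one easy way for the two runs to drift apart.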

How to run RFECV with SVC in sklearn

I am trying to perform Recursive Feature Elimination with Cross Validation (RFECV) with GridSearchCV as follows using SVC as the classifier.
My code is as follows.
X = df[my_features]
y = df['gold_standard']
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = SVC(class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=k_fold, scoring='roc_auc')
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1.0, 10.0, 100.0, 1000.0],
              'estimator__gamma': [0.001, 0.01, 0.1, 1.0, 2.0, 3.0, 10.0, 100.0, 1000.0],
              'estimator__kernel':('rbf', 'sigmoid', 'poly')
              }
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc', verbose=10)
CV_rfc.fit(x_train, y_train)
However, I got an error saying: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
Is there a way to resolve this error? If not what are the other feature selection techniques that I can use with SVC?
I am happy to provide more details if needed.
To look at more feature selection implementations you can have a look at:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
As an example, in the next link they use PCA with k-best feature selection and svc.
https://scikit-learn.org/stable/auto_examples/compose/plot_feature_union.html#sphx-glr-auto-examples-compose-plot-feature-union-py
An example of use, modified from the previous link for simplicity:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
# Select the k best original features
selection = SelectKBest()
# Build SVC
svm = SVC(kernel="linear")
# Do grid search over k and C:
pipeline = Pipeline([("features", selection), ("svm", svm)])
param_grid = dict(features__k=[1, 2],
                  svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
In sklearn 0.19.2 the problem seems to have been solved. My code is similar to yours, but it works:
svc = SVC(
    kernel = 'linear',
    probability = True,
    random_state = 1 )
rfecv = RFECV(
    estimator = svc,
    scoring = 'roc_auc'
)
rfecv.fit(train_values,train_Labels)
selecInfo = rfecv.support_
selecIndex = np.where(selecInfo==1)
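
One likely reason the snippet above works while the question's code fails is the kernel: as far as I know, SVC only exposes coef_ when kernel='linear', and RFECV needs coef_ or feature_importances_ to rank features. The sketch below illustrates this under assumptions: synthetic data from make_classification, with the fold settings adapted from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for df[my_features] / df['gold_standard'] (assumption).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# A linear kernel exposes coef_, which RFECV uses to rank features.
clf = SVC(kernel="linear", class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=k_fold, scoring="roc_auc")
rfecv.fit(X, y)

selected = np.where(rfecv.support_)[0]  # indices of the kept features
print(rfecv.n_features_, selected)

# A non-linear kernel has no coef_, which is what triggers the error.
rbf = SVC(kernel="rbf").fit(X, y)
print(hasattr(rbf, "coef_"))
```

If a non-linear kernel is required, a filter-style selector such as SelectKBest inside a Pipeline does not depend on coef_ and can be tuned in the same grid search.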
