I have a data set with 100 columns of continuous features and a continuous label, and I want to run SVR: extract the relevant features, tune the hyperparameters, and then cross-validate the model that is fit to my data.
I wrote this code:
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the pipeline to evaluate
model = SVR()
fs = SelectKBest(score_func=mutual_info_regression)
pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])
# define the grid
grid = dict()
#How many features to try
grid['estimator__sel__k'] = [i for i in range(1, X_train.shape[1]+1)]
# define the grid search
#search = GridSearchCV(pipeline, grid, scoring='neg_mean_squared_error', n_jobs=-1, cv=cv)
search = GridSearchCV(
    pipeline,
    # estimator=SVR(kernel='rbf'),
    param_grid={
        'estimator__svr__C': [0.1, 1, 10, 100, 1000],
        'estimator__svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
        'estimator__svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
    },
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)
for param in search.get_params().keys():
    print(param)
# perform the search
results = search.fit(X_train, y_train)
# summarize best
print('Best MAE: %.3f' % results.best_score_)
print('Best Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))
I get the error:
ValueError: Invalid parameter estimator for estimator Pipeline(memory=None,
steps=[('sel',
SelectKBest(k=10,
score_func=<function mutual_info_regression at 0x7fd2ff649cb0>)),
('svr',
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
tol=0.001, verbose=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
When I print estimator.get_params().keys(), as suggested in the error message, I get:
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__sel
estimator__svr
estimator__sel__k
estimator__sel__score_func
estimator__svr__C
estimator__svr__cache_size
estimator__svr__coef0
estimator__svr__degree
estimator__svr__epsilon
estimator__svr__gamma
estimator__svr__kernel
estimator__svr__max_iter
estimator__svr__shrinking
estimator__svr__tol
estimator__svr__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
Fitting 5 folds for each of 405 candidates, totalling 2025 fits
But when I change the line:
pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])
to:
pipeline = Pipeline(steps=[('estimator__sel',fs), ('estimator__svr', model)])
I get the error:
ValueError: Estimator names must not contain __: got ['estimator__sel', 'estimator__svr']
Could someone explain what I'm doing wrong, i.e. how do I combine the pipeline/feature selection step into the GridSearchCV?
As a side note, if I comment out pipeline in the GridSearchCV and uncomment estimator=SVR(kernel='rbf'), the cell runs without issue, but in that case I presume I am not incorporating the feature selection, as it is not called anywhere. I have seen some previous SO questions, e.g. here, but they don't seem to answer this specific question.
Is there a cleaner way to write this?
The first error message is about the pipeline's parameters, not the search's parameters: it is telling you that your param_grid is wrong, not that your pipeline step names are. The estimator__ prefix is only needed when the pipeline is itself nested inside another estimator; since you pass the pipeline directly to GridSearchCV, the keys are just <step>__<parameter>. Running pipeline.get_params().keys() will show you the right parameter names. Your grid should be:
param_grid={
    'svr__C': [0.1, 1, 10, 100, 1000],
    'svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
    'svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
},
I don't know how the version with the plain SVR in place of the pipeline runs; the parameter grid doesn't use the right names for that case either (it would need bare 'C', 'epsilon', and 'gamma')...
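For completeness, here is a minimal sketch of the whole search with corrected names, including the SelectKBest k from your original grid as a searchable parameter (X_train and y_train are assumed from your train_test_split above, and I've trimmed the epsilon/gamma grids for brevity, since the full cross-product over 100 values of k gets very large):

from sklearn.svm import SVR
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RepeatedKFold

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
pipeline = Pipeline(steps=[('sel', SelectKBest(score_func=mutual_info_regression)),
                           ('svr', SVR())])
param_grid = {
    'sel__k': list(range(1, X_train.shape[1] + 1)),  # feature-selection step
    'svr__C': [0.1, 1, 10, 100, 1000],               # SVR step
    'svr__epsilon': [0.001, 0.01, 0.1, 1],
    'svr__gamma': [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(pipeline, param_grid,
                      scoring='neg_mean_squared_error',
                      cv=cv, n_jobs=-1, verbose=1)
results = search.fit(X_train, y_train)
print(results.best_params_)

If the run time is still too long, RandomizedSearchCV accepts the same pipeline and parameter names and samples a fixed number of candidates instead of trying them all.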
I would like to use GridSearchCV to fine-tune my SVM model. I have copied this code from other GitHub repos and it has been working perfectly fine for my cross-fold validation.
X = Corpus.drop(['text','ManipulativeTag','compound'],axis=1).values # !!! this drops compound because of Naive Bayes
y = Corpus['ManipulativeTag'].values
kf = KFold(n_splits=5, shuffle=True, random_state=1)
# Create splits
splits = kf.split(X)
# Access the training and validation indices of splits
kfold_accuracy = {}
kfold_precision = {}
kfold_f = {}
kfold_recall = {}
for i, (train_index, val_index) in enumerate(splits):
    print("Split n°: ", i)
    # Setup the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    # print("training:", train_index, "validations:", val_index)
    X_val, y_val = X[val_index], y[val_index]
    SVM = svm.SVC(C=1.0, kernel='linear', random_state=1111, probability=True)  ### the base estimator
    SVM.fit(X_train, y_train)
    # predict the labels on validation dataset
    predictions = SVM.predict(X_val)
    # Use accuracy_score function to get the accuracy
    kfold_accuracy[i] = accuracy_score(y_val, predictions)
    kfold_precision[i] = precision_score(y_val, predictions)
    kfold_f[i] = f1_score(y_val, predictions)
    kfold_recall[i] = recall_score(y_val, predictions)
However, when trying to implement GridSearchCV, most of the articles I ran into use train_test_split() rather than my kf.split(), and I am having trouble finding the right place to put the GridSearchCV() call:
GridSearchCV(estimator=classifier,
             param_grid=grid_param,
             scoring='accuracy',
             cv=5,
             n_jobs=-1)
I found my solution here: Grid search and cross validation SVM
I have copied this from the post:
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                     'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
                    {'kernel': ['sigmoid'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                     'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
                    {'kernel': ['linear'], 'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]}]
I kept everything else from my code and only made changes in the loop by adding GridSearchCV():
for i, (train_index, val_index) in enumerate(splits):
    print("Split n°: ", i)
    # Setup the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]
    # this is where I put GridSearchCV()
    # here cv cannot be 1, so I put 2 instead
    SVM = GridSearchCV(SVC(), tuned_parameters, cv=2, scoring='accuracy')
    SVM.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(SVM.best_params_)
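One thing worth noting (my own addition, not from the linked post): if all you need is the hyperparameter tuning, you can hand your KFold object directly to GridSearchCV as its cv argument and skip the outer loop entirely; the loop-plus-inner-search version above is effectively nested cross-validation, which is useful when you also want an unbiased estimate of the tuned model's performance. A minimal sketch, assuming X, y, and tuned_parameters as defined above:

from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

kf = KFold(n_splits=5, shuffle=True, random_state=1)
SVM = GridSearchCV(SVC(), tuned_parameters, cv=kf, scoring='accuracy', n_jobs=-1)
SVM.fit(X, y)  # GridSearchCV does the splitting internally
print(SVM.best_params_)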
I am trying to use a grid search with my SVR model, and since it takes too much time to fit, I wonder if I could use early stopping, but I don't know how to do so.
Instead, I used max_iter, but I'm still not sure about my best parameters. Any suggestions? Thank you!
#We can use a grid search to find the best parameters for this model. Let's try
#X_feat = F_DF.drop(columns=feat)
y = F_DF["Production_MW"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#Define a list of parameters for the models
params = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
          'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
          'epsilon': [0.001, 0.01, 0.1, 1, 10, 100]
          }
#searchcv.fit(X, y, callback=on_step)
#We can build Grid Search model using the above parameters.
#cv=5 means cross validation with 5 folds
grid_search = GridSearchCV(SVR(kernel='rbf'), params, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
print("train score - " + str(grid_search.score(X_train, y_train)))
print("test score - " + str(grid_search.score(X_test, y_test)))
print("SVR GridSearch score: "+str(grid_search.best_score_))
print("SVR GridSearch params: ")
print(grid_search.best_params_)
I'm using SVR to solve a prediction problem and I would like to do feature selection as well as hyperparameter search. I'm trying to use both RFECV and GridSearchCV, but my code is raising errors.
My code is as follows:
def svr_model(X, Y):
    estimator = SVR(kernel='rbf')
    param_grid = {
        'C': [0.1, 1, 100, 1000],
        'epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
        'gamma': [0.0001, 0.001, 0.005, 0.1, 1, 3, 5]
    }
    selector = RFECV(estimator, step=1, cv=5)
    gsc = GridSearchCV(
        selector,
        param_grid,
        cv=5, scoring='neg_root_mean_squared_error', verbose=0, n_jobs=-1)
    grid_result = gsc.fit(X, Y)
    best_params = grid_result.best_params_
    best_svr = SVR(kernel='rbf', C=best_params["C"], epsilon=best_params["epsilon"], gamma=best_params["gamma"],
                   coef0=0.1, shrinking=True,
                   tol=0.001, cache_size=200, verbose=False, max_iter=-1)
    scoring = {
        'abs_error': 'neg_mean_absolute_error',
        'squared_error': 'neg_mean_squared_error',
        'r2': 'r2'}
    scores = cross_validate(best_svr, X, Y, cv=10, scoring=scoring, return_train_score=True, return_estimator=True)
    return scores
The error is:
ValueError: Invalid parameter C for estimator RFECV(cv=5,
estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
tol=0.001, verbose=False),
min_features_to_select=1, n_jobs=None, scoring=None, step=1, verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
I'm quite new to using machine learning, any help would be highly appreciated.
Grid search runs the selector, initialized with different combinations of the parameters passed in param_grid. But in this case we want the grid search to initialize the estimator inside the selector. That is achieved by using the <estimator>__<parameter> naming style for the dictionary keys. See the docs for more details.
Working code
estimator = SVR(kernel='linear')
selector = RFECV(estimator, step=1, cv=5)
gsc = GridSearchCV(
    selector,
    param_grid={
        'estimator__C': [0.1, 1, 100, 1000],
        'estimator__epsilon': [0.0001, 0.0005],
        'estimator__gamma': [0.0001, 0.001]},
    cv=5, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)
grid_result = gsc.fit(X, Y)
Two other bugs in your code:
neg_root_mean_squared_error is not a valid scoring string in the scikit-learn version you appear to be running (it was only added in 0.22); neg_mean_squared_error works instead.
The rbf kernel does not expose coefficients or feature importances, so you cannot use it with RFECV; that is why the working code above switches to a linear kernel.
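As a follow-up (a sketch of my own, assuming grid_result from the working code above): because GridSearchCV refits the best RFECV on the full data by default, you can read both the winning hyperparameters and the features RFECV kept from the fitted search:

print(grid_result.best_params_)              # best estimator__C / epsilon / gamma

best_selector = grid_result.best_estimator_  # the refitted RFECV
print(best_selector.n_features_)             # how many features were kept
print(best_selector.support_)                # boolean mask over the input columns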
I am running this:
# Hyperparameter tuning - Random Forest #
# Hyperparameters' grid
parameters = {'n_estimators': list(range(100, 250, 25)), 'criterion': ['gini', 'entropy'],
              'max_depth': list(range(2, 11, 2)), 'max_features': [0.1, 0.2, 0.3, 0.4, 0.5],
              'class_weight': [{0: 1, 1: i} for i in np.arange(1, 4, 0.2).tolist()],
              'min_samples_split': list(range(2, 7))}
# Instantiate random forest
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=0)
# Execute grid search and retrieve the best classifier
from sklearn.model_selection import GridSearchCV
classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters, scoring='balanced_accuracy',
                                cv=5, refit=True, n_jobs=-1)
classifiers_grid.fit(X, y)
and I am receiving this warning:
.../anaconda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536:
FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
TypeError: '<' not supported between instances of 'str' and 'int'
Why is this and how can I fix it?
I had a similar FitFailedWarning with different details. After many runs I found that the error was in the parameter values being passed; try:
parameters = {'n_estimators': [100, 125, 150, 175, 200, 225, 250],
              'criterion': ['gini', 'entropy'],
              'max_depth': [2, 4, 6, 8, 10],
              'max_features': [0.1, 0.2, 0.3, 0.4, 0.5],
              'class_weight': [0.2, 0.4, 0.6, 0.8, 1.0],
              'min_samples_split': [2, 3, 4, 5, 6, 7]}
This passed without warnings for me; in my case it happened with XGBClassifier, where the value datatypes were somehow getting mixed up.
Another cause is a value outside the allowed range: for example, in XGBClassifier the 'subsample' parameter's maximum is 1.0, and setting it to 1.1 will also produce a FitFailedWarning.
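One caveat I would add to the grid above (my own note, since the original question uses RandomForestClassifier rather than XGBClassifier): scikit-learn's class_weight does not accept bare floats like the list shown; it expects dicts such as {0: 1, 1: w} or the string 'balanced', so for the random-forest case you would keep the dict form, for example:

parameters['class_weight'] = [{0: 1, 1: w} for w in [1.0, 1.5, 2.0, 2.5, 3.0]]
# or simply let scikit-learn derive the weights from the class frequencies
parameters['class_weight'] = ['balanced']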
For me this was giving the same error, but after removing 'None' from max_depth it fit properly.
param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'criterion': ['gini', 'entropy'],
              'max_depth': ['None', 5, 10, 20, 30, 40, 50, 60, 70],
              'min_samples_split': [5, 10, 20, 25, 30, 40, 50],
              'max_features': ['sqrt', 'log2'],
              'max_leaf_nodes': [5, 10, 20, 25, 30, 40, 50],
              'min_samples_leaf': [1, 100, 200, 300, 400, 500]
              }
Code which runs properly:
param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'criterion': ['gini', 'entropy'],
              'max_depth': [5, 10, 20, 30, 40, 50, 60, 70],
              'min_samples_split': [5, 10, 20, 25, 30, 40, 50],
              'max_features': ['sqrt', 'log2'],
              'max_leaf_nodes': [5, 10, 20, 25, 30, 40, 50],
              'min_samples_leaf': [1, 100, 200, 300, 400, 500]
              }
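If you do want unlimited depth in the search, a small note of my own: use the Python None object rather than the string 'None'. max_depth=None is valid for RandomForestClassifier, while mixing the string 'None' with integers is exactly the kind of thing that can trigger a str-vs-int comparison error:

param_grid['max_depth'] = [None, 5, 10, 20, 30, 40, 50, 60, 70]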
I got the same error too, and when I passed the hyperparameters as in this MachineLearningMastery example, I got output without the warning.
Try this approach if you run into similar issues:
# grid search logistic regression model on the sonar dataset
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = LogisticRegression()
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)
Make sure the y-variable is an int, not bool or str.
Change your last line of code to make the y series a 0 or 1, for example:
classifiers_grid.fit(X, list(map(int, y)))
I have used GridSearchCV for parameter optimization when predicting values with 10-fold cross-validation using sklearn, as shown below:
svr_params = {
    'C': [0.1, 1, 10],
    'epsilon': [0.01, 0.05, 0.1, 0.5, 1],
}
svr = SVR(kernel='linear', coef0=0.1, shrinking=True, tol=0.001, cache_size=200, verbose=False, max_iter=-1)
best_svr = GridSearchCV(
    svr, param_grid=svr_params, cv=10, verbose=0, n_jobs=-1)
predicted = cross_val_predict(best_svr, X, y, cv=10)
I want to print out the best parameters selected by the grid search for C and epsilon. I would really appreciate some help. Thanks in advance.
The best parameters are available as the best_params_ attribute of a fitted GridSearchCV. Note that cross_val_predict clones the search for each fold and does not keep those fitted copies, so to see the chosen parameters you fit the search yourself and read the attribute from it:
best_svr = GridSearchCV(svr, param_grid=svr_params, cv=10, verbose=0, n_jobs=-1, refit=True)
best_svr.fit(X, y)
print(best_svr.best_params_)
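If you also want to see which parameters were picked in each outer fold of the cross_val_predict setup (which is effectively nested cross-validation), one option, sketched here as my own suggestion rather than part of the original answer, is cross_validate with return_estimator=True:

from sklearn.model_selection import cross_validate

cv_results = cross_validate(best_svr, X, y, cv=10, return_estimator=True)
for i, fitted_search in enumerate(cv_results['estimator']):
    print(i, fitted_search.best_params_)  # C and epsilon chosen in each outer fold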