I'm using SVR to solve a prediction problem, and I would like to do feature selection as well as hyper-parameter search. I'm trying to use both RFECV and GridSearchCV, but I'm receiving errors from my code.
My code is as follows:
from sklearn.svm import SVR
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, cross_validate

def svr_model(X, Y):
    estimator = SVR(kernel='rbf')
    param_grid = {
        'C': [0.1, 1, 100, 1000],
        'epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
        'gamma': [0.0001, 0.001, 0.005, 0.1, 1, 3, 5]
    }
    selector = RFECV(estimator, step=1, cv=5)
    gsc = GridSearchCV(
        selector,
        param_grid,
        cv=5, scoring='neg_root_mean_squared_error', verbose=0, n_jobs=-1)
    grid_result = gsc.fit(X, Y)
    best_params = grid_result.best_params_
    best_svr = SVR(kernel='rbf', C=best_params["C"], epsilon=best_params["epsilon"],
                   gamma=best_params["gamma"], coef0=0.1, shrinking=True,
                   tol=0.001, cache_size=200, verbose=False, max_iter=-1)
    scoring = {
        'abs_error': 'neg_mean_absolute_error',
        'squared_error': 'neg_mean_squared_error',
        'r2': 'r2'}
    scores = cross_validate(best_svr, X, Y, cv=10, scoring=scoring,
                            return_train_score=True, return_estimator=True)
    return scores
The error is:
ValueError: Invalid parameter C for estimator RFECV(cv=5,
estimator=SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
tol=0.001, verbose=False),
min_features_to_select=1, n_jobs=None, scoring=None, step=1, verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
I'm quite new to machine learning; any help would be highly appreciated.
GridSearchCV applies the parameter combinations from param_grid to the estimator it is given, which here is the RFECV selector itself. In this case, however, we want the grid search to set the parameters of the SVR inside the selector. That is done with the nested naming style <estimator>__<parameter>. Follow the docs for more details.
Working code
estimator = SVR(kernel='linear')
selector = RFECV(estimator, step=1, cv=5)
gsc = GridSearchCV(
    selector,
    param_grid={
        'estimator__C': [0.1, 1, 100, 1000],
        'estimator__epsilon': [0.0001, 0.0005],
        'estimator__gamma': [0.0001, 0.001]},
    cv=5, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)
grid_result = gsc.fit(X, Y)
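If you are ever unsure which parameter names the grid search will accept, you can list them from the selector itself; the nested SVR parameters show up with the estimator__ prefix (a quick check, not required for the fix):

print(sorted(selector.get_params().keys()))
# ..., 'estimator__C', 'estimator__epsilon', 'estimator__gamma', ...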
Two other bugs in your code:
neg_root_mean_squared_error is not a valid scoring method (you can list the accepted scoring names as shown below).
The rbf kernel does not expose coefficients or feature importances, so you cannot use it with RFECV; use the linear kernel instead.
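To check which scoring strings your installed scikit-learn accepts, you can list them. A small sketch; the exact helper depends on the version:

import sklearn.metrics

# Newer releases provide get_scorer_names(); older ones expose the SCORERS dict.
try:
    valid_scorers = sorted(sklearn.metrics.get_scorer_names())
except AttributeError:
    valid_scorers = sorted(sklearn.metrics.SCORERS.keys())
print(valid_scorers)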
Related
I've got a multiclass classification problem and I need to find the best parameters. I cannot change the max_iter, solver, and tol (they are given), but I'd like to check which penalty is better. However, GridSearchCV always returns the first given penalty as the best one.
Example:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

fixed_params = {
    'random_state': 42,
    'multi_class': 'multinomial',
    'solver': 'saga',
    'tol': 1e-3,
    'max_iter': 500
}

parameters = [
    {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2', None]},
    {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['elasticnet'], 'l1_ratio': np.arange(0.0, 1.0, 0.1)}
]

model = GridSearchCV(LogisticRegression(**fixed_params), parameters,
                     n_jobs=-1, verbose=10, scoring='f1_macro', cv=cv)
model.fit(X_train, y_train)

print(model.best_score_)
# 0.6836409100287101
print(model.best_params_)
# {'C': 0.1, 'penalty': 'l2'}
If I swap the order of the parameter dicts, the result is the opposite:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

fixed_params = {
    'random_state': 42,
    'multi_class': 'multinomial',
    'solver': 'saga',
    'tol': 1e-3,
    'max_iter': 500
}

parameters = [
    {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['elasticnet'], 'l1_ratio': np.arange(0.0, 1.0, 0.1)},
    {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2', None]}
]

model = GridSearchCV(LogisticRegression(**fixed_params), parameters,
                     n_jobs=-1, verbose=10, scoring='f1_macro', cv=cv)
model.fit(X_train, y_train)

print(model.best_score_)
# 0.6836409100287101
print(model.best_params_)
# {'C': 0.1, 'l1_ratio': 0.0, 'penalty': 'elasticnet'}
So, the best_score_ is the same for both options, but the best_params_ is not.
Could you please tell me what is wrong?
Edited
GridSearchCV gives a worse result in comparison to baseline with default parameters.
Baseline:
baseline_model = LogisticRegression(multi_class='multinomial', solver='saga', tol=1e-3, max_iter=500)
baseline_model.fit(X_train, y_train)
train_pred_baseline = baseline_model.predict(X_train)
print(f1_score(y_train, train_pred_baseline, average='micro'))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='saga', tol=0.001, verbose=0,
                   warm_start=False)

The baseline gives me a better f1_micro than GridSearchCV:
0.7522768670309654
Edited-2
So, judging by the best f1_score, C = 1 is the best choice for my model, but GridSearchCV returns C = 0.1.
I think I'm missing something...
The baseline's f1_macro is also better than GridSearchCV's:
train_pred_baseline = baseline_model.predict(X_train)
print(f1_score(y_train, train_pred_baseline, average='macro'))
# 0.7441968750050458
Actually, there's nothing wrong. Elasticnet uses both the L1 and L2 penalty terms, but if l1_ratio is 0 you are only applying the L2 penalty term, i.e. plain L2 regularization. As stated in the docs:
Setting l1_ratio=0 is equivalent to using penalty='l2', while setting l1_ratio=1 is equivalent to using penalty='l1'. For 0 < l1_ratio <1, the penalty is a combination of L1 and L2.
Since your second result has l1_ratio equal to 0, it is equivalent to using the L2 penalty, which is why both runs find the same model (and the same best_score_) under different-looking best_params_.
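A quick way to convince yourself is to fit both parameterizations and compare the coefficients. This is a minimal sketch on toy data; the dataset and the C value are just illustrative:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# penalty='l2' ...
l2_model = LogisticRegression(penalty='l2', C=0.1, solver='saga', max_iter=5000).fit(X, y)
# ... and penalty='elasticnet' with l1_ratio=0 apply the same regularization.
enet_model = LogisticRegression(penalty='elasticnet', l1_ratio=0.0, C=0.1,
                                solver='saga', max_iter=5000).fit(X, y)

# The fitted coefficients agree up to solver tolerance.
print(np.abs(l2_model.coef_ - enet_model.coef_).max())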
I have a data set with 100 columns of continuous features and a continuous label, and I want to run SVR: extract the relevant features, tune the hyper-parameters, and then cross-validate the model fitted to my data.
I wrote this code:
X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, test_size=0.2)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# define the pipeline to evaluate
model = SVR()
fs = SelectKBest(score_func=mutual_info_regression)
pipeline = Pipeline(steps=[('sel', fs), ('svr', model)])

# define the grid
grid = dict()
# how many features to try
grid['estimator__sel__k'] = [i for i in range(1, X_train.shape[1]+1)]

# define the grid search
#search = GridSearchCV(pipeline, grid, scoring='neg_mean_squared_error', n_jobs=-1, cv=cv)
search = GridSearchCV(
    pipeline,
    # estimator=SVR(kernel='rbf'),
    param_grid={
        'estimator__svr__C': [0.1, 1, 10, 100, 1000],
        'estimator__svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
        'estimator__svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
    },
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)

for param in search.get_params().keys():
    print(param)

# perform the search
results = search.fit(X_train, y_train)
# summarize best
print('Best MAE: %.3f' % results.best_score_)
print('Best Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))
I get the error:
ValueError: Invalid parameter estimator for estimator Pipeline(memory=None,
steps=[('sel',
SelectKBest(k=10,
score_func=<function mutual_info_regression at 0x7fd2ff649cb0>)),
('svr',
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
tol=0.001, verbose=False))],
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
When I print estimator.get_params().keys(), as suggested in the error message, I get:
cv
error_score
estimator__memory
estimator__steps
estimator__verbose
estimator__sel
estimator__svr
estimator__sel__k
estimator__sel__score_func
estimator__svr__C
estimator__svr__cache_size
estimator__svr__coef0
estimator__svr__degree
estimator__svr__epsilon
estimator__svr__gamma
estimator__svr__kernel
estimator__svr__max_iter
estimator__svr__shrinking
estimator__svr__tol
estimator__svr__verbose
estimator
iid
n_jobs
param_grid
pre_dispatch
refit
return_train_score
scoring
verbose
Fitting 5 folds for each of 405 candidates, totalling 2025 fits
But when I change the line:
pipeline = Pipeline(steps=[('sel',fs), ('svr', model)])
to:
pipeline = Pipeline(steps=[('estimator__sel',fs), ('estimator__svr', model)])
I get the error:
ValueError: Estimator names must not contain __: got ['estimator__sel', 'estimator__svr']
Could someone explain what I'm doing wrong, i.e. how do I combine the pipeline/feature selection step into the GridSearchCV?
As a side note, if I comment out pipeline in the GridSearchCV and uncomment estimator=SVR(kernel='rbf'), the cell runs without issue, but in that case I presume I am not incorporating the feature selection, as it's not called anywhere. I have seen some previous SO questions, e.g. here, but they don't seem to answer this specific question.
Is there a cleaner way to write this?
The first error message is about the pipeline parameters, not the search parameters, and indicates that your param_grid is bad, not the pipeline step names. Running pipeline.get_params().keys() should show you the right parameter names. Your grid should be:
param_grid={
    'svr__C': [0.1, 1, 10, 100, 1000],
    'svr__epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10],
    'svr__gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 5, 10]
},
I don't know how substituting the plain SVR for the pipeline runs; your parameter grid doesn't specify the right things there either...
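Putting it together, a minimal sketch of the corrected search, assuming the step names 'sel' and 'svr' from the pipeline above; the k values and the SVR grid shown here are just illustrative:

from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

pipeline = Pipeline(steps=[('sel', SelectKBest(score_func=mutual_info_regression)),
                           ('svr', SVR())])
param_grid = {
    'sel__k': [5, 10, 20],          # number of features SelectKBest keeps
    'svr__C': [0.1, 1, 10],
    'svr__epsilon': [0.01, 0.1],
    'svr__gamma': ['scale', 0.01, 0.1],
}
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
search = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error',
                      cv=cv, n_jobs=-1, verbose=1)
# search.fit(X_train, y_train) now accepts these parameter names.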
I am trying to perform Recursive Feature Elimination with Cross Validation (RFECV) with GridSearchCV as follows using SVC as the classifier.
My code is as follows.
X = df[my_features]
y = df['gold_standard']

x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

clf = SVC(class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=k_fold, scoring='roc_auc')

param_grid = {'estimator__C': [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1.0, 10.0, 100.0, 1000.0],
              'estimator__gamma': [0.001, 0.01, 0.1, 1.0, 2.0, 3.0, 10.0, 100.0, 1000.0],
              'estimator__kernel': ('rbf', 'sigmoid', 'poly')}

CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold, scoring='roc_auc', verbose=10)
CV_rfc.fit(x_train, y_train)
However, I got an error saying: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
Is there a way to resolve this error? If not what are the other feature selection techniques that I can use with SVC?
I am happy to provide more details if needed.
For more feature selection implementations, have a look at:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
As an example, the following link combines PCA and k-best feature selection with an SVC:
https://scikit-learn.org/stable/auto_examples/compose/plot_feature_union.html#sphx-glr-auto-examples-compose-plot-feature-union-py
An example of use, modified from the previous link for simplicity:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# Maybe some original features were good, too?
selection = SelectKBest()
# Build SVC
svm = SVC(kernel="linear")

# Do grid search over k and C:
pipeline = Pipeline([("features", selection), ("svm", svm)])
param_grid = dict(features__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
Emmm... in sklearn 0.19.2, the problem seems to have been solved. My code is similar to yours, but it works:
svc = SVC(
    kernel='linear',
    probability=True,
    random_state=1)

rfecv = RFECV(
    estimator=svc,
    scoring='roc_auc')

rfecv.fit(train_values, train_Labels)
selecInfo = rfecv.support_
selecIndex = np.where(selecInfo == 1)
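For reference, RFECV needs the fitted estimator to expose coef_ or feature_importances_. A quick check on toy data (a sketch; the dataset is just illustrative) shows that the linear kernel provides coef_ while rbf does not:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
print(hasattr(SVC(kernel='linear').fit(X, y), 'coef_'))  # True
print(hasattr(SVC(kernel='rbf').fit(X, y), 'coef_'))     # False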
I have used a GridSearch for parameter optimization when predicting values with 10-fold cross validation using sklearn, as shown below,
svr_params = {
    'C': [0.1, 1, 10],
    'epsilon': [0.01, 0.05, 0.1, 0.5, 1],
}

svr = SVR(kernel='linear', coef0=0.1, shrinking=True, tol=0.001, cache_size=200, verbose=False, max_iter=-1)

best_svr = GridSearchCV(
    svr, param_grid=svr_params, cv=10, verbose=0, n_jobs=-1)

predicted = cross_val_predict(best_svr, X, y, cv=10)
I want to print out the best parameters selected by the GridSearch for C and epsilon. I would really appreciate some help. Thanks in advance.
The best parameters are available as the best_params_ attribute of GridSearchCV once it has been fitted.
best_svr = GridSearchCV(svr, param_grid=svr_params, cv=10, verbose=0, n_jobs=-1, refit=True)
best_svr.fit(X, y)
print(best_svr.best_params_)
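If you also want to see which parameters the nested search picked inside each of the 10 outer folds (cross_val_predict clones best_svr, so those fits are discarded), one option is cross_validate with return_estimator=True. A sketch, assuming the same X and y as above:

from sklearn.model_selection import cross_validate

# Keeps one fitted GridSearchCV per outer fold so its best_params_ can be inspected.
cv_results = cross_validate(best_svr, X, y, cv=10, return_estimator=True)
for fold, fitted_search in enumerate(cv_results['estimator']):
    print(fold, fitted_search.best_params_)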
I want to use GridSearchCV over a range of alphas (Laplace smoothing parameters) to check which gives me the best accuracy with a Bernoulli Naive Bayes model.
def binarize_pixels(data, threshold=0.784):
    # Initialize a new feature array with the same shape as the original data.
    binarized_data = np.zeros(data.shape)
    # Apply a threshold to each feature.
    for feature in range(data.shape[1]):
        binarized_data[:, feature] = data[:, feature] > threshold
    return binarized_data

binarized_train_data = binarize_pixels(mini_train_data)

def BNB():
    clf = BernoulliNB()
    clf.fit(binarized_train_data, mini_train_labels)
    scoring = clf.score(mini_train_data, mini_train_labels)
    predsNB = clf.predict(dev_data)
    print("Bernoulli binarized model accuracy: {:.4}".format(np.mean(predsNB == dev_labels)))
The model runs fine, while my GridSearch cross validation does not:
pipeline = Pipeline([('classifier', BNB())])

def P8(alphas):
    gs_clf = GridSearchCV(pipeline, param_grid=alphas, refit=True)
    y_predictions = gs_clf.best_estimator_.predict(dev_data)
    print(classification_report(dev_labels, y_predictions))

alphas = {'alpha': [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
P8(alphas)
I get AttributeError: 'GridSearchCV' object has no attribute 'best_estimator_'
The problem is in the following two rows:
gs_clf = GridSearchCV(pipeline, param_grid = alphas, refit=True)
y_predictions = gs_clf.best_estimator_.predict(dev_data)
Note that before using predict, you first need to fit the model, that is, call gs_clf.fit. See the following example from the documentation:
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svr = svm.SVC()
>>> clf = GridSearchCV(svr, parameters)
>>> clf.fit(iris.data, iris.target)
...
GridSearchCV(cv=None, error_score=...,
estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
decision_function_shape=None, degree=..., gamma=...,
kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=...,
verbose=False),
fit_params={}, iid=..., n_jobs=1,
param_grid=..., pre_dispatch=..., refit=..., return_train_score=...,
scoring=..., verbose=...)
>>> sorted(clf.cv_results_.keys())
...
['mean_fit_time', 'mean_score_time', 'mean_test_score',...
'mean_train_score', 'param_C', 'param_kernel', 'params',...
'rank_test_score', 'split0_test_score',...
'split0_train_score', 'split1_test_score', 'split1_train_score',...
'split2_test_score', 'split2_train_score',...
'std_fit_time', 'std_score_time', 'std_test_score', 'std_train_score'...]
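Applied to the code above, a minimal sketch of the fix might look like the following. It assumes binarized_train_data, mini_train_labels, dev_data and dev_labels from the question; the pipeline wraps BernoulliNB() directly rather than the BNB() helper, and the step name 'classifier' (hence the classifier__alpha key) and the alpha values are illustrative:

from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# The pipeline wraps the estimator itself; grid keys use the <step>__<param> form.
pipeline = Pipeline([('classifier', BernoulliNB())])
alphas = {'classifier__alpha': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}

gs_clf = GridSearchCV(pipeline, param_grid=alphas, refit=True)
gs_clf.fit(binarized_train_data, mini_train_labels)   # fit before touching best_estimator_
y_predictions = gs_clf.best_estimator_.predict(dev_data)
print(classification_report(dev_labels, y_predictions))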