Use both Recursive Feature Elimination and Grid Search in scikit-learn

I have a machine learning problem and want to optimize my SVC hyperparameters as well as the feature selection.
For optimizing the SVC hyperparameters I essentially use the code from the docs. My question is: how can I combine this with recursive feature elimination with cross-validation (RFECV)? That is, for each hyperparameter combination I want to run RFECV in order to determine the best combination of hyperparameters and features.
I tried the solution from this thread, but it yields the following error:
ValueError: Invalid parameter C for estimator RFECV. Check the list of available parameters with `estimator.get_params().keys()`.
My code looks like this:
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-4, 1e-3], 'C': [1, 10]},
                    {'kernel': ['linear'], 'C': [1, 10]}]
estimator = SVC(kernel="linear")
selector = RFECV(estimator, step=1, cv=3, scoring=None)
clf = GridSearchCV(selector, tuned_parameters, cv=3)
clf.fit(X_train, y_train)
The error appears at clf = GridSearchCV(selector, tuned_parameters, cv=3).

I would use a Pipeline, but this thread has a more complete answer:
Recursive feature elimination and grid search using scikit-learn
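
For reference, here is a minimal sketch of one way to make the combination work. The error in the question comes from GridSearchCV looking up C on the RFECV object itself. Parameters of the wrapped SVC are addressed with the estimator__ prefix instead. The synthetic data from make_classification is an assumption for illustration only, and the grid is limited to the linear kernel because RFECV needs an estimator that exposes coef_ or feature_importances_, which an rbf SVC does not.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for X_train, y_train (assumption, not the asker's data)
X, y = make_classification(n_samples=120, n_features=8,
                           n_informative=3, random_state=0)

# RFECV wraps the SVC, so the SVC's parameters live under "estimator__"
selector = RFECV(SVC(kernel="linear"), step=1, cv=3)
param_grid = {"estimator__C": [1, 10]}

# Outer grid search over C; each candidate runs its own RFECV
clf = GridSearchCV(selector, param_grid, cv=3)
clf.fit(X, y)

print(clf.best_params_)               # best value of estimator__C
print(clf.best_estimator_.support_)   # boolean mask of the selected features
```

Calling selector.get_params().keys() lists every name that is valid in the grid, which is what the error message suggests checking.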

Related

Using Multiple Metric Evaluation with GridSearchCV

I am attempting to use multiple metrics in GridSearchCV. My project needs multiple metrics including "accuracy" and "f1 score". However, after following the sklearn models and online posts, I can't seem to get mine to work. Here is my code:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
param_grid = {'n_neighbors': range(1, 30),
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'weights': ['uniform', 'distance'],
              'p': range(1, 5)}
# Metrics for evaluation:
met_grid = ['accuracy', 'f1']  # The metric codes from sklearn
custom_knn = GridSearchCV(clf, param_grid, scoring=met_grid, refit='accuracy', return_train_score=True)
custom_knn.fit(X_train, y_train)
y_pred = custom_knn.predict(X_test)
My error occurs on custom_knn.fit(X_train, y_train). Furthermore, if you comment out scoring=met_grid, refit='accuracy', return_train_score=True, it works.
Here is my error:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
Also, if you could explain multiple metric evaluation or refer me to someone who can, that would be much appreciated!
Thanks

f1 is a binary classification metric. For multi-class classification, you have to use an averaged F1 score, such as f1_macro, f1_micro or f1_weighted, depending on how you want the per-class scores aggregated. You can find the exhaustive list of scoring options available in sklearn here.
Try this!
scoring = ['accuracy','f1_macro']
custom_knn = GridSearchCV(clf, param_grid, scoring=scoring,
refit='accuracy', return_train_score=True,cv =3)
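
As for how multiple metric evaluation works, the following is a small sketch rather than a full explanation. With a list of scorers, every candidate is scored with each metric, the results appear in cv_results_ under keys named mean_test_<metric>, and refit names the metric used to pick best_params_. The iris data and the reduced grid are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Three-class toy data (assumption; stands in for X_train, y_train)
X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
scoring = ["accuracy", "f1_macro"]

# refit="accuracy" means best_params_ is chosen by mean accuracy
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring=scoring, refit="accuracy", cv=3)
search.fit(X, y)

# One row per candidate, one column per metric
results = search.cv_results_
for params, acc, f1 in zip(results["params"],
                           results["mean_test_accuracy"],
                           results["mean_test_f1_macro"]):
    print(params, round(acc, 3), round(f1, 3))
```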

python imblearn make_pipeline TypeError: Last step of Pipeline should implement fit

I am trying to implement SMOTE of imblearn inside the Pipeline. My data sets are text data stored in pandas dataframe. Please see below the code snippet
text_clf =Pipeline([('vect', TfidfVectorizer()),('scale', StandardScaler(with_mean=False)),('smt', SMOTE(random_state=5)),('clf', LinearSVC(class_weight='balanced'))])
After this I am using GridsearchCV.
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring = 'accuracy')
Where parameters are nothing but tuning parameters mostly for TfidfVectorizer().
I am getting the following error.
All intermediate steps should be transformers and implement fit and transform. 'SMOTE
After this error, I changed the code as follows.
vect = TfidfVectorizer(use_idf=True,smooth_idf = True, max_df = 0.25, sublinear_tf = True, ngram_range=(1,2))
X = vect.fit_transform(X).todense()
Y = vect.fit_transform(Y).todense()
X_Train,X_Test,Y_Train,y_test = train_test_split(X,Y, random_state=0, test_size=0.33, shuffle=True)
text_clf =make_pipeline([('smt', SMOTE(random_state=5)),('scale', StandardScaler(with_mean=False)),('clf', LinearSVC(class_weight='balanced'))])
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring = 'accuracy')
Where parameters are nothing but tuning C in the SVC classifier.
This time I am getting the following error:
Last step of Pipeline should implement fit.SMOTE(....) doesn't
What is going on here? Can anyone please help?

imblearn's SMOTE has no transform method. The docs are here.
But all steps except the last in a pipeline should have it, along with fit.
To use SMOTE with an sklearn pipeline you would have to implement a custom transformer that calls SMOTE.fit_sample() (named fit_resample() in newer imblearn releases) in its transform method.
Another, easier option is just to use the imblearn pipeline:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.svm import LinearSVC

# This doesn't work with sklearn.pipeline.Pipeline because
# SMOTE doesn't have a .transform() method.
# (It has .fit_sample() or .sample(); newer imblearn releases use .fit_resample().)
pipe = imbPipeline([
    # ... (earlier steps elided in the original answer)
    ('oversample', SMOTE(random_state=5)),
    ('clf', LinearSVC(class_weight='balanced'))
])

Basic Sklearn: How to Pass Scoring Function to Fit Method

I'm using sklearn to do some machine learning. I often use GridSearchCV to explore hyperparameters and perform cross-validation. Using this, I can specify a scoring function, like this:
scores = -cross_val_score(svr, X, Y, cv=10, scoring='neg_mean_squared_error')
However, I want to train my SVR model using mean squared error. Unfortunately, there's no scoring parameter in either the constructor for SVR or the fit method.
How should I do this?
Thanks!

I typically use a Pipeline for this. You can build a pipeline containing the SVR model (and other steps if you want) and pass it to GridSearchCV as the estimator. Note that scoring only controls how candidate hyperparameters are compared during the search; SVR itself is still fit with its own epsilon-insensitive loss.
The search space goes in params_grid, with keys of the form stepname__paramname (double underscore in between). For example, if the pipeline step is named svr and I want to search over parameter C, the key in the parameter dictionary is svr__C.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVR

c_range = np.arange(1, 10, 1)
pipeline = Pipeline([('svr', SVR())])
params_grid = {'svr__C': c_range}
# grid search with 3-fold cross validation
gridsearch_model = GridSearchCV(pipeline, params_grid,
                                cv=3, scoring='neg_mean_squared_error')
Then fit on the training data and read off the best score and parameters:
gridsearch_model.fit(X_train, y_train)
print(gridsearch_model.best_params_, gridsearch_model.best_score_)
You can also use cross_val_score to find the score:
cross_val_score(gridsearch_model, X_train, y_train,
cv=3, scoring='neg_mean_squared_error')
Hope this helps!
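
To make the sign convention concrete, here is a self-contained sketch of the same search on synthetic regression data (the make_regression data is an assumption for illustration). Scorers in scikit-learn follow a "greater is better" rule, so neg_mean_squared_error is reported as a negative number and has to be negated to get the MSE back.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Synthetic stand-in for X_train, y_train (assumption)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

pipeline = Pipeline([("svr", SVR())])
params_grid = {"svr__C": np.arange(1, 10, 1)}

search = GridSearchCV(pipeline, params_grid, cv=3,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

# best_score_ is the negated MSE of the best candidate
best_mse = -search.best_score_
print(search.best_params_, best_mse, np.sqrt(best_mse))
```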

Prevent overfitting in Logistic Regression using Sci-Kit Learn

I trained a model using Logistic Regression to predict whether a name field and description field belong to a profile of a male, female, or brand. My train accuracy is around 99% while my test accuracy is around 83%. I have tried implementing regularization by tuning the C parameter but the improvements were barely noticed. I have around 5,000 examples in my training set. Is this an instance where I just need more data or is there something else I can do in Sci-Kit Learn to get my test accuracy higher?

Overfitting is a multifaceted problem. It could be your train/test/validate split (anything from 50/40/10 to 90/9/1 could change things). You might need to shuffle your input. Try an ensemble method, or reduce the number of features. You might also have outliers throwing things off.
Then again, it could be none of these, or all of these, or some combination of these.
For starters, try plotting the test score as a function of the test split size and see what you get.
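
A rough sketch of that last suggestion, printing the numbers instead of plotting them so it has no plotting dependency. The three-class synthetic data from make_classification is an assumption that only loosely mimics the male/female/brand setup; swap in your own features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Three-class synthetic data (assumption, not the asker's data)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

scores = {}
for test_size in (0.1, 0.2, 0.3, 0.4, 0.5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for this split size
    scores[test_size] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for test_size, (train_acc, test_acc) in scores.items():
    print(test_size, round(train_acc, 3), round(test_acc, 3))
```

A gap between train and test accuracy that stays large at every split size points more toward the model or the features than toward an unlucky split.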

# The 'C' value in Logistic Regression works very similarly to the one in the
# Support Vector Machine (SVM) algorithm. When I use SVM I like to use
# GridSearchCV to find the best possible values for 'C' and 'gamma';
# maybe this can shed some light:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

# For SVC the grid would be the following; gamma and kernel are removed
# below so that only C is searched
# param_grid = {'C': [0.1, 1, 10, 100, 1000],
#               'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
#               'kernel': ['rbf']}
param_grid = {'C': [0.1, 1, 10, 100, 1000]}

# Train and fit your model to see initial values
# (df_feat and df_target are your own feature and target frames)
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.30, random_state=101)
model = SVC()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Find the best 'C' value
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3)
grid.fit(X_train, y_train)  # the search must be fitted before best_params_ exists
print(grid.best_params_)
c_val = grid.best_estimator_.C
# Then you can re-run predictions on this grid object just like you would with a normal model.
grid_predictions = grid.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print(classification_report(y_test, grid_predictions))

# Use the best 'C' value found by GridSearchCV in your LogisticRegression model
logmodel = LogisticRegression(C=c_val)
logmodel.fit(X_train, y_train)
log_predictions = logmodel.predict(X_test)
print(confusion_matrix(y_test, log_predictions))
print(classification_report(y_test, log_predictions))

scikit-learn GridSearchCV with multiple repetitions

I'm trying to get the best set of parameters for an SVR model.
I'd like to use the GridSearchCV over different values of C.
However, from the previous test, I noticed that the split into the Training/Test set highly influences the overall performance (r2 in this instance).
To address this problem, I'd like to implement a repeated 5-fold cross-validation (10 x 5CV). Is there a built-in way of performing it using GridSearchCV?
Quick solution, following the idea presented in the official scikit-learn documentation:
import numpy
from sklearn.model_selection import GridSearchCV, KFold

NUM_TRIALS = 10
scores = []
for i in range(NUM_TRIALS):
    cv = KFold(n_splits=5, shuffle=True, random_state=i)
    clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
    clf.fit(X, y)  # X, y: the full data set; best_score_ only exists after fit
    scores.append(clf.best_score_)
print("Average Score: {0} STD: {1}".format(numpy.mean(scores), numpy.std(scores)))

This is called nested cross-validation. The example in the official documentation should point you in the right direction; also have a look at my other answer here for a similar approach.
You can adapt the steps to suit your needs:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)  # data from the documentation example
i = 0  # trial index used as the random seed; loop over several values to repeat
svr = SVC(kernel="rbf")
c_grid = {"C": [1, 10, 100, ... ]}
# CV technique: "GroupKFold", "LeaveOneOut", "LeaveOneGroupOut", etc.
# To be used within GridSearch (5 in your case)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)
# To be used in outer CV (you asked for 10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=i)
# Non-nested parameter search and scoring
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_score = clf.best_score_
# Pass the GridSearchCV estimator to cross_val_score
# This will be your required 10 x 5 cvs
# 10 for outer cv and 5 for GridSearchCV's internal CV
clf = GridSearchCV(estimator=svr, param_grid=c_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv).mean()
Edit - Description of nested cross validation with cross_val_score() and GridSearchCV()
1. clf = GridSearchCV(estimator, param_grid, cv=inner_cv).
2. Pass clf, X, y, outer_cv to cross_val_score.
3. As seen in the source code of cross_val_score, X is divided into X_outer_train and X_outer_test using outer_cv. Same for y.
4. X_outer_test is held back and X_outer_train is passed on to clf for fit() (GridSearchCV in our case). Call X_outer_train X_inner from here on, since it is passed to the inner estimator, and call y_outer_train y_inner.
5. X_inner is now split into X_inner_train and X_inner_test using inner_cv in the GridSearchCV. Same for y.
6. The grid search estimator is trained using X_inner_train and y_inner_train and scored using X_inner_test and y_inner_test.
7. Steps 5 and 6 are repeated for inner_cv_iters (5 in this case).
8. The hyper-parameters whose average score over all inner iterations (X_inner_train, X_inner_test) is best are passed on to clf.best_estimator_, which is fitted on all the data, i.e. X_outer_train.
9. This clf (gridsearch.best_estimator_) is then scored using X_outer_test and y_outer_test.
10. Steps 3 to 9 are repeated for outer_cv_iters (10 here) and an array of scores is returned from cross_val_score.
11. We then use mean() to get back nested_score.
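
The steps above can be sketched as an explicit outer loop, which may make it easier to see what cross_val_score does with a GridSearchCV estimator. This is an illustration under assumptions, not the library's actual source: the iris data, the three C values and the fixed random_state=0 are chosen only so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
c_grid = {"C": [1, 10, 100]}  # illustrative values (assumption)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Manual version of the outer loop
manual_scores = []
for train_idx, test_idx in outer_cv.split(X):
    # Inner search sees only the outer training part
    clf = GridSearchCV(SVC(kernel="rbf"), c_grid, cv=inner_cv)
    clf.fit(X[train_idx], y[train_idx])
    # The refitted best estimator is scored on the held-back outer test part
    manual_scores.append(clf.score(X[test_idx], y[test_idx]))

# Same procedure delegated to cross_val_score
auto_scores = cross_val_score(
    GridSearchCV(SVC(kernel="rbf"), c_grid, cv=inner_cv), X, y, cv=outer_cv)

print(np.mean(manual_scores), auto_scores.mean())
```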

You can supply different cross-validation generators to GridSearchCV. The default for binary or multiclass classification problems is StratifiedKFold. Otherwise, it uses KFold. But you can supply your own. In your case, it looks like you want RepeatedKFold or RepeatedStratifiedKFold.
from sklearn.model_selection import GridSearchCV, RepeatedKFold
# Define svr here
...
# Specify cross-validation generator, in this case (10 x 5CV)
cv = RepeatedKFold(n_splits=5, n_repeats=10)
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=cv)
# Continue as usual
clf.fit(...)
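
A runnable version of the same idea, as a sketch: the synthetic regression data, the three C values and the r2 scoring are assumptions filled in for illustration. With n_splits=5 and n_repeats=10, every candidate is evaluated on 50 train/test splits.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVR

# Synthetic stand-in for the asker's data (assumption)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# 10 repetitions of 5-fold CV, each repetition with a different shuffle
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

search = GridSearchCV(SVR(), {"C": [1, 10, 100]}, cv=cv, scoring="r2")
search.fit(X, y)

print(cv.get_n_splits())   # total number of splits per candidate
print(search.best_params_, search.best_score_)
```

Fixing random_state on the splitter keeps the repeated splits reproducible between runs.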
