I would like to write a script to select the most important features using RFECV. The estimator I want to use is logistic regression. In addition, I also want to do a GridSearchCV. In other words, I want to tune the parameters with GridSearchCV and then update the parameters in each iteration of RFECV.
I have written the code below, but I'm not sure whether, when I use RFECV(GridSearchCV(LogisticRegression)), the parameters of the model are actually tuned and updated in each iteration of RFECV. Please give me some advice on this issue.
Thank you so much!
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = make_classification(n_samples=50,
                           n_features=5,
                           n_informative=3,
                           n_redundant=0,
                           random_state=0)
class GridSearchWithCoef(GridSearchCV):
    # Expose the best estimator's coefficients so RFECV can rank features
    @property
    def coef_(self):
        return self.best_estimator_.coef_
c_values = np.logspace(-4, 4, 20)
# lbfgs supports only the l2 penalty, so split the grid into compatible combinations
param_grid = [
    {'penalty': ['l2'], 'solver': ['lbfgs', 'liblinear'], 'C': c_values},
    {'penalty': ['l1'], 'solver': ['liblinear'], 'C': c_values},
]

GS = GridSearchWithCoef(LogisticRegression(), param_grid=param_grid, cv=3, verbose=True, n_jobs=-1)
min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(
    estimator=GS, cv=3, scoring="accuracy",
    min_features_to_select=min_features_to_select
)
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
(The code above was adapted from other people on the forum. Thank you for your code.)
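For what it's worth, here is a minimal sketch of one way to check this (the logging subclass name is mine, not from the original code): RFECV clones and refits its estimator for every candidate feature subset, so printing best_params_ inside fit() shows whether the inner grid search is re-tuned at each elimination step.
# Sketch only: a logging variant of the wrapper above (hypothetical name)
class LoggingGridSearchWithCoef(GridSearchCV):
    @property
    def coef_(self):
        return self.best_estimator_.coef_

    def fit(self, X, y=None, **fit_params):
        super().fit(X, y, **fit_params)
        # One line is printed per refit, i.e. per feature subset that RFECV tries
        print(f"{X.shape[1]} features -> best params: {self.best_params_}")
        return self
Using estimator=LoggingGridSearchWithCoef(LogisticRegression(), param_grid=param_grid, cv=3) in the RFECV call above makes every re-tuning visible.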
The whole idea is to perform a grid search over all possible values of lambda, where each possible value of lambda would give a specific best subset of features. At the end of the day, I'm trying to do hyperparameter tuning (lambda) and feature selection at the same time. Any advice is greatly appreciated! Thank you so much.
ISSUE:
gs_cv.best_estimator_[0].estimator.alpha comes out as 0.0, while gs_cv.best_estimator_[1].alpha = 1.0 (pipeline indexing results).
The best parameters from the grid search don't seem to be applied to the model part of the pipeline, as seen in the output below.
I got this when printing gs_cv.best_estimator_.named_steps; the Ridge() in the regression step still uses the default alpha of 1:
{'sfs_ridge': SequentialFeatureSelector(estimator=Ridge(alpha=0.0), k_features=5,
scoring='r2'), 'ridge_regression': Ridge()}
------------Code------------------
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
#Model
ridge = Ridge()
#hyperparameter_alpha = np.logspace(-6,6, num=5)
#SFS model
sfs_ridge = SFS(estimator=ridge, k_features = 5, forward=True, floating=False, scoring='r2', cv = 5)
#Pipeline model
pipe = Pipeline([ ('sfs_ridge', sfs_ridge), ('ridge_regression', ridge) ])
#GridSearchCV
#The parameter_grid for the model should start with the name you give when defining the pipeline!!
param_grid = [ {'sfs_ridge__k_features': [2,4,5] ,'sfs_ridge__estimator__alpha': np.arange(0,1,0.05) }]
gs_cv = GridSearchCV(estimator= pipe, param_grid= param_grid, scoring="neg_mean_absolute_error", n_jobs = -1, cv=5, refit=True)
gs_cv.fit(X_train, y_train)
print(gs_cv.best_estimator_[0].estimator.alpha) #print out 0.0
print(gs_cv.best_estimator_[1].alpha) #print out 1.0
print(gs_cv.best_estimator_[0].k_feature_idx_)
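As a minimal sketch of one way to address this (an assumption about the intent, not a confirmed fix): the grid only tunes the alpha of the Ridge cloned inside the selector ('sfs_ridge__estimator__alpha'), while the 'ridge_regression' step is an independent clone of ridge, so its alpha has to be tuned (or set) explicitly as well, e.g.:
# Hypothetical extended grid: also tune the final regression step of the pipeline
param_grid = [{'sfs_ridge__k_features': [2, 4, 5],
               'sfs_ridge__estimator__alpha': np.arange(0, 1, 0.05),
               'ridge_regression__alpha': np.arange(0, 1, 0.05)}]
Note that this multiplies the number of candidates in the grid, so the search will take longer.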
I am trying to use RFECV for feature selection with different machine learning algorithms, and it is taking too long. The code has been running for hours without giving any output.
Here is my code:
# Feature selection by RFECV
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
estimator = AdaBoostClassifier(random_state=0)
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(features, popular)
selector.ranking_
#estimator_LR = LogisticRegression(random_state=0)
estimator_LR = LogisticRegression(C=1.0, tol=0.01, random_state=0, max_iter=10000)
selector_LR = RFECV(estimator_LR, step=1, cv=5)
selector_LR = selector_LR.fit(features, popular)
selector_LR.ranking_
estimator_RF = RandomForestClassifier(random_state=0)
selector_RF = RFECV(estimator_RF, step=1, cv=5)
selector_RF = selector_RF.fit(features, popular)
selector_RF.ranking_
I tried to run the code one line at a time, and it gets stuck on the selector.fit line for all three classifiers.
selector_RF = selector_RF.fit(features, popular)
My dataset consists of almost 35,000 instances and 60 attributes.
I think you should make these changes to make RFECV faster:
selector = RFECV(estimator, step=1, cv=5, n_jobs = -1)
and
estimator_RF = RandomForestClassifier(random_state=0, n_jobs = -1)
and max_iter=10000 is too much, so:
estimator_LR = LogisticRegression(C=1.0, tol=0.01, random_state=45, max_iter=100)
Setting the n_jobs=-1 parameter will tell your model to use all of your CPU cores and hence make the code run faster.
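Putting the suggestions together, a minimal sketch (assuming features and popular are the data and target from the question):
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

# Parallelize both the forest itself and the RFECV cross-validation folds
estimator_RF = RandomForestClassifier(random_state=0, n_jobs=-1)
selector_RF = RFECV(estimator_RF, step=1, cv=5, n_jobs=-1)
selector_RF = selector_RF.fit(features, popular)
print(selector_RF.ranking_)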
Below is the code that I am trying to execute
# Train a logistic regression model, report the coefficients and model performance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics
clf = LogisticRegression().fit(X_train, y_train)
params = {'penalty':['l1','l2'],'dual':[True,False],'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000], 'fit_intercept':[True,False],
'solver':['saga']}
gridlog = GridSearchCV(clf, params, cv=5, n_jobs=2, scoring='roc_auc')
cv_scores = cross_val_score(gridlog, X_train, y_train)
#find best parameters
print('Logistic Regression parameters: ',gridlog.best_params_) # throws error
The last line of code above is where the error is thrown. I have used this exact same code to run other models. Any idea why I may be facing this issue?
You need to fit gridlog first. cross_val_score will not do this; it returns the scores and nothing else.
Hence, as gridlog isn't trained, it throws an error.
The code below works perfectly fine:
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
cancer = datasets.load_breast_cancer()
x = cancer.data[:150]
y = cancer.target[:150]
clf = LogisticRegression().fit(x, y)
params = {'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
gridlog = GridSearchCV(clf, params, cv=2, n_jobs=2,
scoring='roc_auc')
gridlog.fit(x,y) # <- missing in your code
cv_scores = cross_val_score(gridlog, x, y)
print(cv_scores)
#find best parameters
print('Logistic Regression parameters: ',gridlog.best_params_)
# result:
Logistic Regression parameters: {'C': 1}
Your code should be updated such that the LogisticRegression classifier is passed to the GridSearch (not its fit):
from sklearn.datasets import load_breast_cancer # For example only
X_train, y_train = load_breast_cancer(return_X_y=True)
params = {'penalty':['l1', 'l2'],'dual':[True, False],'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000], 'fit_intercept':[True, False],
'solver':['saga']}
gridlog = GridSearchCV(LogisticRegression(), params, cv=5, n_jobs=2, scoring='roc_auc')
gridlog.fit(X_train, y_train)
#find best parameters
print('Logistic Regression parameters: ', gridlog.best_params_) # Now it displays all the parameters selected by the grid search
Results
Logistic Regression parameters: {'C': 0.1, 'dual': False, 'fit_intercept': True, 'penalty': 'l2', 'solver': 'saga'}
Note that, as desertnaut pointed out, you don't use cross_val_score with GridSearchCV.
See a complete example of how to use GridSearch here.
The example uses an SVC classifier instead of a LogisticRegression, but the approach is the same.
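If the aim was simply to get cross-validated numbers, a short sketch of how to read them directly from the fitted grid search (rather than wrapping it in cross_val_score):
# After gridlog.fit(X_train, y_train):
print(gridlog.best_score_)                     # mean CV score of the best parameter combination
print(gridlog.cv_results_['mean_test_score'])  # mean CV score for every candidate in the grid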
I am trying to optimize hyperparameters for ridge regression and also add polynomial features. The pipeline looks okay, but I get an error when I try GridSearchCV. Here:
# Importing the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from collections import Counter
from IPython.core.display import display, HTML
sns.set_style('darkgrid')
# Data Preprocessing
from sklearn.datasets import load_boston
boston_dataset = load_boston()
dataset = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
dataset['MEDV'] = boston_dataset.target
# X and y Variables
X = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values.reshape(-1,1)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 25)
# Building the Model ------------------------------------------------------------------------
# Fitting regressior to the Training set
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
steps = [
('scalar', StandardScaler()),
('poly', PolynomialFeatures(degree=2)),
('model', Ridge())
]
ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
# Predicting the Test set results
y_pred = ridge_pipe.predict(X_test)
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = ridge_pipe, X = X_train, y = y_train, cv = 10)
accuracies.mean()
#accuracies.std()
# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV
parameters = [ {'alpha': np.arange(0, 0.2, 0.01) } ]
grid_search = GridSearchCV(estimator = ridge_pipe,
param_grid = parameters,
scoring = 'accuracy',
cv = 10,
n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train) # <-- GETTING ERROR IN HERE
Error:
ValueError: Invalid parameter ridge for estimator
What should I do, or is there a better way to use ridge regression with a pipeline? I would appreciate some sources about grid search, because I am a newbie on this.
There are two problems in your code. First, since you are using a pipeline, you need to specify in the params list which part of the pipeline the params belong to. See the official documentation for more information:
The purpose of the pipeline is to assemble several steps that can be
cross-validated together while setting different parameters. For this,
it enables setting parameters of the various steps using their names
and the parameter name separated by a ‘__’, as in the example below
In this case, since alpha is going to be used with the ridge regression and you have used the string model in the Pipeline definition, you need to rename the key alpha to model__alpha:
steps = [
('scalar', StandardScaler()),
('poly', PolynomialFeatures(degree=2)),
('model', Ridge()) # <------ Whatever string you assign here will be used later
]
# Since you have named it 'model', you need to change the key to 'model__alpha'
parameters = [ {'model__alpha': np.arange(0, 0.2, 0.01) } ]
Next, you need to understand that this dataset is for regression. You should not use accuracy here; instead, use a regression-based scoring function such as mean_squared_error. Here are some other metrics for regression that you can use. Something like this:
from sklearn.metrics import mean_squared_error, make_scorer

# Lower MSE is better, so tell the scorer not to treat bigger values as better
scoring_func = make_scorer(mean_squared_error, greater_is_better=False)

grid_search = GridSearchCV(estimator = ridge_pipe,
                           param_grid = parameters,
                           scoring = scoring_func,  # <--- Use the scoring func defined above
                           cv = 10,
                           n_jobs = -1)
Here is a link to a Google colab notebook with working code.
For the GridSearchCV parameters, the parameter name for the ridge step should be prefixed with the name of that step in the pipeline, i.e. 'model__alpha' here (note the 2 underscores) instead of just 'alpha'.
I want to do binary classification for 30 groups of subjects, with 230 samples by 150 features. I have found it very hard to implement, especially the feature selection and parameter tuning through nested leave-one-group-out cross-validation, reporting the accuracy with two classifiers (SVM and random forest), and seeing which features were selected.
I'm new to this and I'm sure the following code is not correct:
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
X = ...                               # the data (230 samples * 150 features)
y = [1, 0, 1, 0, 0, 0, 1, 1, 1, ...]  # binary labels
groups = [1, 2, ..., 30]              # group label for each sample
param_grid = [{'estimator__C': [0.01, 0.1, 1.0, 10.0]}]
inner_cross_validation = LeaveOneGroupOut().split(X, y, groups)
outer_cross_validation = LeaveOneGroupOut().split(X, y, groups)
estimator = SVC(kernel="linear")
selector = RFE(estimator, step=1)
grid_search = GridSearchCV(selector, param_grid, cv=inner_cross_validation)
grid_search.fit(X, y)
scores = cross_val_score(grid_search, X, y,cv=outer_cross_validation)
I don't know where to set the random forest classifier in the code above, because I want to compare the accuracies of the SVM and the random forest.
Thank you very much for reading, and I hope that someone can help me.
Best regards
You should call the tree-based classifier in the same way that you call the SVM:
# your libraries
from sklearn.tree import DecisionTreeClassifier
# ....

estimator = SVC(kernel="linear")
estimator2 = DecisionTreeClassifier( ...parameters here... )

selector = RFE(estimator, step=1)
selector2 = RFE(estimator2, step=1)

grid_search = GridSearchCV(selector, param_grid, cv=inner_cross_validation)
grid_search2 = GridSearchCV(selector2, ..grid for the tree here.., cv=inner_cross_validation)
Please note that this procedure will lead to two different sets of selected features: one for the SVM and one for the decision tree.
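If you want the random forest from the question rather than a decision tree, here is a minimal sketch of the same pattern (the grid values below are placeholders of mine): RFE only needs an estimator exposing coef_ or feature_importances_, and RandomForestClassifier exposes the latter.
from sklearn.ensemble import RandomForestClassifier

estimator_rf = RandomForestClassifier(random_state=0)
selector_rf = RFE(estimator_rf, step=1)

# Parameters of the wrapped estimator are addressed through RFE's 'estimator__' prefix
param_grid_rf = [{'estimator__n_estimators': [100, 300],
                  'estimator__max_depth': [None, 5, 10]}]

grid_search_rf = GridSearchCV(selector_rf, param_grid_rf, cv=LeaveOneGroupOut().split(X, y, groups))
Then fit and score grid_search_rf exactly as you do for the SVM-based grid_search, and after fitting you can read the selected features from grid_search_rf.best_estimator_.support_.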