Below is the code that I am trying to execute:
# Train a logistic regression model, report the coefficients and model performance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import metrics
clf = LogisticRegression().fit(X_train, y_train)
params = {'penalty': ['l1', 'l2'], 'dual': [True, False],
          'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
          'fit_intercept': [True, False], 'solver': ['saga']}
gridlog = GridSearchCV(clf, params, cv=5, n_jobs=2, scoring='roc_auc')
cv_scores = cross_val_score(gridlog, X_train, y_train)
#find best parameters
print('Logistic Regression parameters: ',gridlog.best_params_) # throws error
The last line above is the one throwing the error. I have used this exact same code to run other models. Any idea why I may be facing this issue?
You need to fit gridlog first. cross_val_score will not do this for you: it clones the estimator, fits the clones on each fold, and returns only the array of scores.
Hence, since gridlog itself is never trained, accessing best_params_ throws the error.
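A quick way to see this (a minimal sketch on synthetic data, not your dataset): the grid-search object passed to cross_val_score never acquires best_params_.

```python
# Sketch on synthetic data: cross_val_score fits internal clones of the
# estimator it is given, so the original object stays unfitted.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
gridlog = GridSearchCV(LogisticRegression(), {'C': [0.1, 1]}, cv=3)
scores = cross_val_score(gridlog, X, y, cv=3)  # just an array of fold scores
print(hasattr(gridlog, 'best_params_'))  # False -> gridlog was never fitted
```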
Below code works perfectly fine:
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
cancer = datasets.load_breast_cancer()
x = cancer.data[:150]
y = cancer.target[:150]
clf = LogisticRegression().fit(x, y)
params = {'C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}
gridlog = GridSearchCV(clf, params, cv=2, n_jobs=2,
                       scoring='roc_auc')
gridlog.fit(x,y) # <- missing in your code
cv_scores = cross_val_score(gridlog, x, y)
print(cv_scores)
#find best parameters
print('Logistic Regression parameters: ',gridlog.best_params_)
# result:
Logistic Regression parameters:  {'C': 1}
Your code should be updated so that the LogisticRegression classifier itself is passed to the grid search, not the result of calling its fit:
from sklearn.datasets import load_breast_cancer # For example only
X_train, y_train = load_breast_cancer(return_X_y=True)
params = {'penalty': ['l1', 'l2'], 'dual': [True, False],
          'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
          'fit_intercept': [True, False], 'solver': ['saga']}
gridlog = GridSearchCV(LogisticRegression(), params, cv=5, n_jobs=2, scoring='roc_auc')
gridlog.fit(X_train, y_train)
#find best parameters
print('Logistic Regression parameters: ', gridlog.best_params_) # Now it displays all the parameters selected by the grid search
Results
Logistic Regression parameters: {'C': 0.1, 'dual': False, 'fit_intercept': True, 'penalty': 'l2', 'solver': 'saga'}
Note that, as @desertnaut pointed out, you don't use cross_val_score with GridSearchCV.
See a complete example of how to use GridSearch here.
That example uses an SVC classifier instead of a LogisticRegression, but the approach is the same.
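For completeness, here is a rough sketch of that same pattern with an SVC, reusing the breast cancer data (the scaler and the parameter grid are my own choices, not taken from the linked example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # SVC is sensitive to feature scale

params = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), params, cv=3, scoring='roc_auc')
grid.fit(X, y)  # the grid search is fitted directly, as in the answer above
print('SVC parameters: ', grid.best_params_)
```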
Related
I have a binary classification problem and have been using cross-validation to optimize the ElasticNet parameters. However, ElasticNet only seems to work when I supply roc_auc as the scoring method for CV; I also want to test a wide range of other scoring methods, in particular accuracy. Specifically, when using accuracy, ElasticNet returns this error:
ValueError: Classification metrics can't handle a mix of binary and continuous targets
However my y targets are indeed binary. Below is a replication of my problem using the dataset from here:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import ElasticNet
data = pd.read_csv('data 2.csv')
# by default majority class (benign) will be negative
lb = LabelBinarizer()
data['diagnosis'] = lb.fit_transform(data['diagnosis'].values)
targets = data['diagnosis']
data.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(data, targets, stratify=targets)
#elastic net logistic regression
lr = ElasticNet(max_iter=2000)
scorer = 'accuracy'
param_grid = {
    'alpha': [1e-4, 1e-3, 1e-2, 0.1, 1, 5, 10],
    'l1_ratio': np.arange(0.2, 0.9, 0.1)
}
skf = StratifiedKFold(n_splits=10)
clf = GridSearchCV(lr, param_grid, scoring=scorer, cv=skf,
                   return_train_score=True, n_jobs=-1)
clf.fit(X_train.values, y_train.values)
I figured that ElasticNet might be trying to solve a linear regression problem so I tried lr = LogisticRegression(penalty='elasticnet', l1_ratios=[0.1, 0.5, 0.9], solver='saga') as the classifier but the same problem persists.
If I use as the scoring metric scorer = 'roc_auc' then the model is built as expected.
Also, as a sanity check, to see whether there is something wrong with the data, I tried the same with a random forest classifier, and there the problem disappears:
# random forest
clf = RandomForestClassifier(n_jobs=-1)
param_grid = {
    'min_samples_split': [3, 5, 10],
    'n_estimators': [100, 300],
    'max_depth': [3, 5, 15, 25],
    'max_features': [3, 5, 10, 20]
}
skf = StratifiedKFold(n_splits=10)
scorer = 'accuracy'
grid_search = GridSearchCV(clf, param_grid, scoring=scorer,
                           cv=skf, return_train_score=True, n_jobs=-1)
grid_search.fit(X_train.values, y_train.values)
Has anyone got any ideas on what's happening here?
ElasticNet is a regression model.
If you want an ElasticNet penalty in classification, use LogisticRegression:
lr = LogisticRegression(solver="saga", penalty="elasticnet")
Minimal Reproducible Example:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
lr = LogisticRegression(solver="saga", penalty="elasticnet", max_iter=2000)
param_grid = {
    'l1_ratio': np.arange(0.2, 0.9, 0.1)
}
clf = GridSearchCV(lr, param_grid, scoring='accuracy',
                   cv=StratifiedKFold(n_splits=10),
                   return_train_score=True, n_jobs=-1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
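As for why the original error appears at all: ElasticNet is a regressor, so its predictions are continuous floats, and accuracy_score refuses to compare those against binary labels. A small sketch on synthetic data reproducing the message:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import ElasticNet
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=0)
preds = ElasticNet().fit(X, y).predict(X)  # continuous values, not 0/1 labels
try:
    accuracy_score(y, preds)
except ValueError as e:
    # Classification metrics can't handle a mix of binary and continuous targets
    print(e)
```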
I am trying to run GridSearchCV and then RFECV on an SVM:
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['linear', 'rbf', 'sigmoid']}
estimator = SVC(probability=True, coef0=1.0)
clf = GridSearchCV(estimator, param_grid, cv=10, verbose=True)
clf.fit(data_train, label_train)
selector = RFECV(estimator, step=1, cv=10)
selector.fit(data_train, label_train)
label_predicted = selector.predict(data_test)
print(classification_report(label_test, label_predicted, digits=4))
It throws an error:
ValueError: when importance_getter=='auto', the underlying estimator SVC should have coef_ or feature_importances_ attribute. Either pass a fitted estimator to feature selector or call fit before calling transform.
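For what it's worth, RFECV ranks features by the estimator's coef_ or feature_importances_ attribute, and SVC only exposes coef_ when kernel='linear'. A minimal sketch on synthetic data (since data_train is not shown) that avoids the error:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)
# A linear-kernel SVC has a coef_ attribute, so importance_getter='auto' works
selector = RFECV(SVC(kernel='linear'), step=1, cv=5)
selector.fit(X, y)
print('Selected features:', selector.n_features_)
```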
I would like to write a script to select the most important features using RFECV, with logistic regression as the estimator. In addition, I also want to run GridSearchCV. In other words, I want to tune the parameters with GridSearchCV first and then update the parameters in each iteration of RFECV.
I have written the code below, but I'm not sure whether, when I use RFECV(GridSearchCV(LogisticRegression)), the model's parameters are actually tuned and updated in each iteration of RFECV. Please give me some advice on this issue.
Thank you so much!
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = make_classification(n_samples=50,
                           n_features=5,
                           n_informative=3,
                           n_redundant=0,
                           random_state=0)
class GridSeachWithCoef(GridSearchCV):
    @property
    def coef_(self):
        return self.best_estimator_.coef_
solvers = ['lbfgs', 'liblinear']
penalty = ['l1', 'l2']
c_values = np.logspace(-4, 4, 20)
param_grid = [
    {'penalty': penalty,
     'C': c_values,
     'solver': solvers}
]
GS = GridSeachWithCoef(LogisticRegression(), param_grid=param_grid, cv=3,
                       verbose=True, n_jobs=-1)
min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(estimator=GS, cv=3, scoring="accuracy")
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
(The code above was adapted from other posts in the forum. Thank you for your code!)
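One way to convince yourself that the grid is re-run inside RFECV: after fitting, rfecv.estimator_ is a refitted clone of the grid-search wrapper trained on the final feature subset, so it carries its own best_params_ (every elimination step fits a fresh clone, grid search included). A self-contained sketch of the same idea, with a small hypothetical C grid of my own choosing:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Grid-search wrapper that exposes the winning model's coefficients,
# which RFECV needs in order to rank features
class GridSearchWithCoef(GridSearchCV):
    @property
    def coef_(self):
        return self.best_estimator_.coef_

X, y = make_classification(n_samples=50, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
gs = GridSearchWithCoef(LogisticRegression(max_iter=1000),
                        param_grid={'C': [0.1, 1, 10]}, cv=3)
rfecv = RFECV(estimator=gs, cv=3, scoring='accuracy')
rfecv.fit(X, y)
print(rfecv.n_features_)              # size of the selected feature subset
print(rfecv.estimator_.best_params_)  # params tuned on that final subset
```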
I have a dataset with multiple outputs and am trying to use gradient boosting to predict all the values at once. I imported MultiOutputRegressor so multiple outputs can be predicted at once; I'm able to make it work for the default gradient boosting function. However, I'm running into an error when I try to optimize the gradient boosting function for each output.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import GridSearchCV
params = {'max_depth': 3, 'n_estimators': 100, 'learning_rate': 0.1}
gradient_regressor = MultiOutputRegressor(GradientBoostingRegressor(**params))
GradBoostModel = gradient_regressor.fit(X_train, y_train)
prediction_GradBoost = GradBoostModel.predict(X_test)
LR = {'learning_rate': [0.15, 0.125, 0.1, 0.75, 0.05],
      'n_estimators': [50, 75, 100, 150, 200, 250, 300, 400]}
tuning = GridSearchCV(estimator=GradBoostModel, param_grid=LR, scoring='r2')
tuning.fit(X_train, y_train)
tuning.best_params_, tuning.best_score_
I'm trying to use GridSearchCV to cycle through the listed learning rates and number of estimators to find the optimal values. But, I get the following error:
Invalid parameter learning_rate for estimator MultiOutputRegressor.
Check the list of available parameters with `estimator.get_params().keys()`
I think I understand the reason for the error: when I try to optimize the gradient boosting parameters, they are passed through the MultiOutputRegressor, which doesn't recognize them. Is this the case? Also, how can I change my code, such that I can optimize these parameters for each output?
Indeed, the params need the estimator__ prefix. In general, to find out what parameter names to use downstream in your pipeline, call the .get_params().keys() method on your model, e.g.:
print(GradBoostModel.get_params().keys())
dict_keys(['estimator__alpha', 'estimator__ccp_alpha', 'estimator__criterion', 'estimator__init', 'estimator__learning_rate',...
Full working example with the linnerud dataset:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.datasets import load_linnerud
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
# Data
rng = np.random.RandomState(0)
X, y = load_linnerud(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
# Model
params = {'max_depth': 3, 'n_estimators': 100, 'learning_rate': 0.1}
gradient_regressor = MultiOutputRegressor(GradientBoostingRegressor(**params))
GradBoostModel = gradient_regressor.fit(X_train, y_train)
prediction_GradBoost = GradBoostModel.predict(X_test)
LR = {'estimator__learning_rate': [0.15, 0.125, 0.1, 0.75, 0.05],
      'estimator__n_estimators': [50, 75, 100, 150, 200, 250, 300, 400]}
print('Params from GradBoostModel', GradBoostModel.get_params().keys())
tuning = GridSearchCV(estimator=GradBoostModel, param_grid=LR, scoring='r2')
tuning.fit(X_train, y_train)
Scikit-learn's GridSearchCV is used for hyperparameter tuning of XGBRegressor models. Independent of the eval_metric specified in XGBRegressor().fit(), GridSearchCV produces the same score values. The docs at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html say for the scoring parameter: "If None, the estimator's score method is used." That does not seem to happen; I always get the same value. How can I get results corresponding to the XGBRegressor eval_metric?
This sample code:
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.datasets import load_boston
import xgboost as xgb
rng = np.random.RandomState(31337)
boston = load_boston()
y = boston['target']
X = boston['data']
kf = KFold(n_splits=2)  # random_state has no effect without shuffle=True
folds = list(kf.split(X))
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', verbose=False)
reg = GridSearchCV(estimator=xgb_model,
                   param_grid={'max_depth': [2], 'n_estimators': [50]},
                   cv=folds,
                   verbose=False)
reg.fit(X, y, **{'eval_metric': 'mae', 'verbose': False})
print('GridSearchCV mean(mae)?: ', reg.cv_results_['mean_test_score'])
# -----------------------------------------------
reg.fit(X, y, **{'eval_metric': 'rmse', 'verbose': False})
print('GridSearchCV mean(rmse)?: ', reg.cv_results_['mean_test_score'])
print("----------------------------------------------------")
xgb_model.set_params(**{'max_depth': 2, 'n_estimators': 50})
xgb_model.fit(X[folds[0][0], :], y[folds[0][0]], eval_metric='mae',
              eval_set=[(X[folds[0][0], :], y[folds[0][0]])], verbose=False)
print('XGBRegressor 0-mae:', xgb_model.evals_result()['validation_0']['mae'][-1])
xgb_model.fit(X[folds[0][1], :], y[folds[0][1]], eval_metric='mae',
              eval_set=[(X[folds[0][1], :], y[folds[0][1]])], verbose=False)
print('XGBRegressor 1-mae:', xgb_model.evals_result()['validation_0']['mae'][-1])
xgb_model.fit(X[folds[0][0], :], y[folds[0][0]], eval_metric='rmse',
              eval_set=[(X[folds[0][0], :], y[folds[0][0]])], verbose=False)
print('XGBRegressor 0-rmse:', xgb_model.evals_result()['validation_0']['rmse'][-1])
xgb_model.fit(X[folds[0][1], :], y[folds[0][1]], eval_metric='rmse',
              eval_set=[(X[folds[0][1], :], y[folds[0][1]])], verbose=False)
print('XGBRegressor 1-rmse:', xgb_model.evals_result()['validation_0']['rmse'][-1])
This returns the following (I expected the numbers above the separator line to be averages of those below it):
GridSearchCV mean(mae)?: [0.70941007]
GridSearchCV mean(rmse)?: [0.70941007]
----------------------------------------------------
XGBRegressor 0-mae: 1.273626
XGBRegressor 1-mae: 1.004947
XGBRegressor 0-rmse: 1.647694
XGBRegressor 1-rmse: 1.290872
TL;DR: what you get back is the so-called R2, or coefficient of determination. This is the default metric of XGBRegressor's score method, which GridSearchCV picks up when scoring=None.
Compare the results when scoring is set explicitly:
from sklearn.metrics import make_scorer, r2_score, mean_squared_error
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', verbose=False)
reg = GridSearchCV(estimator=xgb_model, scoring=make_scorer(r2_score),
                   param_grid={'max_depth': [2], 'n_estimators': [50]},
                   cv=folds,
                   verbose=False)
reg.fit(X, y)
reg.best_score_
0.7333542105472226
with the results when scoring=None:
reg = GridSearchCV(estimator=xgb_model, scoring=None,
                   param_grid={'max_depth': [2], 'n_estimators': [50]},
                   cv=folds,
                   verbose=False)
reg.fit(X, y)
reg.best_score_
0.7333542105472226
If you read the GridSearchCV docstring:
estimator : estimator object.
This is assumed to implement the scikit-learn estimator interface.
Either estimator needs to provide a score function,
or scoring must be passed.
At this point you would want to check docs for xgb_model.score?:
Signature: xgb_model.score(X, y, sample_weight=None)
Docstring:
Return the coefficient of determination R^2 of the prediction.
So, with the help of those docs: if you do not like XGBRegressor's default R2 score function, provide your scoring function explicitly to GridSearchCV.
E.g. if you want RMSE you may do:
reg = GridSearchCV(estimator=xgb_model,
                   scoring=make_scorer(mean_squared_error, squared=False),
                   param_grid={'max_depth': [2], 'n_estimators': [50]},
                   cv=folds,
                   verbose=False)
reg.fit(X, y)
reg.best_score_
4.618242594168436
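One caveat I would add here: make_scorer defaults to greater_is_better=True, so a raw error metric like RMSE would be maximized if the grid contained more than one candidate. For error metrics, pass greater_is_better=False so GridSearchCV maximizes the negated error. A sketch with MAE on synthetic data (Ridge stands in for the XGB model just to keep the example self-contained):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, noise=10.0, random_state=0)
# greater_is_better=False negates the metric, so "bigger" really means "better"
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)
reg = GridSearchCV(Ridge(), {'alpha': [0.1, 1.0, 10.0]},
                   scoring=mae_scorer, cv=3)
reg.fit(X, y)
print(reg.best_score_)  # negative: the best params minimize the actual MAE
```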