Tuning parameters of the classifier used by BaggingClassifier - python

Say that I want to train BaggingClassifier that uses DecisionTreeClassifier:
dt = DecisionTreeClassifier(max_depth = 1)
bc = BaggingClassifier(dt, n_estimators = 500, max_samples = 0.5, max_features = 0.5)
bc = bc.fit(X_train, y_train)
I would like to use GridSearchCV to find the best parameters for both BaggingClassifier and DecisionTreeClassifier (e.g. max_depth from DecisionTreeClassifier and max_samples from BaggingClassifier). What is the syntax for this?

I found the solution myself:
param_grid = {
'base_estimator__max_depth' : [1, 2, 3, 4, 5],
'max_samples' : [0.05, 0.1, 0.2, 0.5]
}
clf = GridSearchCV(BaggingClassifier(DecisionTreeClassifier(),
n_estimators = 100, max_features = 0.5),
param_grid, scoring = choosen_scoring)
clf.fit(X_train, y_train)
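The double underscore in base_estimator__max_depth says that max_depth "belongs to" the base_estimator, i.e. my DecisionTreeClassifier in this case. This works and returns the correct results. (In scikit-learn 1.2 and later the BaggingClassifier argument is named estimator rather than base_estimator, so the key becomes estimator__max_depth.)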
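The nested-parameter syntax above can be sketched end to end as follows. This is a minimal example, not the original poster's setup: the synthetic data from make_classification, the small grid, and the names X_demo, y_demo, bagging and inner are assumptions made for illustration. The name of the wrapped estimator is looked up at run time because it differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the sketch is runnable (an assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            max_features=0.5, random_state=0)

# The wrapped estimator is exposed as 'estimator' in newer scikit-learn
# releases and as 'base_estimator' in older ones, so look the name up.
inner = "estimator" if "estimator" in bagging.get_params() else "base_estimator"

param_grid = {
    inner + "__max_depth": [1, 2, 3],   # parameter of the decision tree
    "max_samples": [0.2, 0.5],          # parameter of the bagging ensemble
}

search = GridSearchCV(bagging, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Calling bagging.get_params().keys() lists every key that GridSearchCV will accept, which is a quick way to check a nested name before running a long search.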

If you are using a pipeline, you can extend the accepted answer with something like this (note the two double underscores: one after the pipeline step name and one after base_estimator):
model = {'model': BaggingClassifier,
'kwargs': {'base_estimator': DecisionTreeClassifier()},
'parameters': {
'name__base_estimator__max_leaf_nodes': [10,20,30]
}}
pipeline = Pipeline([('name', model['model'](**model['kwargs']))])
cv_model = GridSearchCV(pipeline, param_grid=model['parameters'], cv=cv, scoring=scorer)
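A runnable sketch of the pipeline case, under stated assumptions: the step names "scale" and "bag", the StandardScaler step, the synthetic data and the grid values are all hypothetical and not taken from the answer. The key is built as step name, then the bagging argument, then the tree argument, each separated by a double underscore.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data so the sketch runs on its own (an assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("bag", BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                              random_state=0)),
])

# 'estimator' in newer scikit-learn releases, 'base_estimator' in older ones.
bag_params = pipe.named_steps["bag"].get_params()
inner = "estimator" if "estimator" in bag_params else "base_estimator"

# <step name>__<bagging argument>__<tree argument>
nested_key = "bag__" + inner + "__max_leaf_nodes"
param_grid = {nested_key: [5, 10], "bag__max_samples": [0.5, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```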

Related

Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty

I'm building a logistic regression model to predict a binary target feature. I want to try different values of different parameters using the param_grid argument, to find the best fit with the best values. This is my code:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state = 42)
logModel = LogisticRegression(C = 1, penalty='l1',solver='liblinear')
Grid_params = {
"penalty" : ['l1','l2','elasticnet','none'],
"C" : [0.001, 0.01, 0.1, 1, 10, 100, 1000], # smaller C means stronger regularization
'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
'max_iter' : [50,100,200,500,1000,2500]
}
clf = GridSearchCV(logModel, param_grid=Grid_params, cv = 10, verbose = True, n_jobs=-1,error_score='raise')
clf_fitted = clf.fit(X_train,Y_train)
This is where I get the error. I have already read that some solvers don't work with l1 and some don't work with l2. How can I set up the param_grid in this case?
I also tried a plain logModel = LogisticRegression(), but that didn't work either.
Full error:
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
GridSearchCV accepts a list of dicts for exactly this purpose. If you absolutely need to include the solvers in the grid, you should be able to do something like this:
Grid_params = [
{'solver' : ['saga'],
'penalty' : ['elasticnet', 'l1', 'l2', 'none'],
'max_iter' : [50,100,200,500,1000,2500],
'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
{'solver' : ['newton-cg', 'lbfgs'],
'penalty' : ['l2','none'],
'max_iter' : [50,100,200,500,1000,2500],
'C' : [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
# add more parameter sets as needed...
]
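To check which combinations a list-of-dicts grid actually generates before spending time on fitting, the grid can be expanded with ParameterGrid. This is a small sketch: the shortened C_values list and the liblinear entry are my additions rather than part of the answer, and the 'none' and 'elasticnet' penalties are left out to keep the example independent of the scikit-learn version.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.01, 1, 100]  # shortened list, for illustration only

grid = [
    {"solver": ["saga"], "penalty": ["l1", "l2"], "C": C_values},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "C": C_values},
    # liblinear handles both l1 and l2 (added here as an example entry)
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": C_values},
]

# Each dict is expanded separately, so solvers are only ever paired with
# the penalties listed next to them.
combos = list(ParameterGrid(grid))
print(len(combos))

# Sanity check: no lbfgs/newton-cg candidate uses the l1 penalty.
bad = [c for c in combos
       if c["solver"] in ("lbfgs", "newton-cg") and c["penalty"] == "l1"]
print(bad)
```

The same list can be passed unchanged as param_grid to GridSearchCV.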

How to fine-tune SVM hyperparameters with KFold

I would like to use GridSearchCV to fine-tune my SVM model. I copied this code from other GitHub repositories, and it has been working fine for my cross-validation folds.
X = Corpus.drop(['text','ManipulativeTag','compound'],axis=1).values # !!! this drops compound because of Naive Bayes
y = Corpus['ManipulativeTag'].values
kf = KFold(n_splits=5, shuffle=True, random_state=1)
# Create splits
splits = kf.split(X)
# Access the training and validation indices of splits
kfold_accuracy = {}
kfold_precision = {}
kfold_f = {}
kfold_recall = {}
for i, (train_index, val_index) in enumerate(splits):
    print("Split n°: ", i)
    # Setup the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    # print("training:", train_index, "validations:", val_index)
    X_val, y_val = X[val_index], y[val_index]
    SVM = svm.SVC(C=1.0, kernel='linear', random_state=1111, probability=True) ### the base estimator
    SVM.fit(X_train, y_train)
    # predict the labels on validation dataset
    predictions = SVM.predict(X_val)
    # Use accuracy_score function to get the accuracy
    kfold_accuracy[i] = accuracy_score(y_val, predictions)
    kfold_precision[i] = precision_score(y_val, predictions)
    kfold_f[i] = f1_score(y_val, predictions)
    kfold_recall[i] = recall_score(y_val, predictions)
However, when trying to implement a grid search, most of the articles I ran into use train_test_split() rather than my kf.split(), and I am having trouble finding the right place to put the GridSearchCV() call:
GridSearchCV(estimator=classifier,
param_grid=grid_param,
scoring='accuracy',
cv=5,
n_jobs=-1)
I found my solution here: Grid search and cross validation SVM
I have copied this from the post:
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]},
{'kernel': ['sigmoid'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000] },{'kernel': ['linear'], 'C': [0.001, 0.10, 0.1, 10, 25, 50, 100, 1000]}]
I kept everything else in my code and only changed the loop by adding GridSearchCV() inside it:
for i, (train_index, val_index) in enumerate(splits):
    print("Split n°: ", i)
    # Setup the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]
    # this is where I put GridSearchCV()
    # here cv cannot be 1, so I put 2 instead
    SVM = GridSearchCV(SVC(), tuned_parameters, cv=2, scoring='accuracy')
    SVM.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(SVM.best_params_)
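An alternative to writing the outer loop by hand is to pass KFold objects directly as cv. The sketch below is not the poster's code: it uses synthetic data and a reduced grid (both assumptions), tunes the SVC on an inner KFold, and scores the tuned model on an outer KFold with cross_val_score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix and labels (an assumption).
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=1)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # used for tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # used for scoring

# Reduced version of the grid, to keep the run short.
param_grid = [
    {"kernel": ["rbf"], "gamma": [1e-2, 1e-3], "C": [0.1, 10]},
    {"kernel": ["linear"], "C": [0.1, 10]},
]

# GridSearchCV accepts a splitter object for cv, not only an integer.
search = GridSearchCV(SVC(), param_grid, cv=inner_cv, scoring="accuracy")

# Each outer fold refits the whole search on its training part and then
# scores the best model on the held-out part.
outer_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv,
                               scoring="accuracy")
print(outer_scores.mean())
```

With this layout the best parameters can differ from one outer fold to the next, just as they can in the manual loop.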

GridSearchCV - FitFailedWarning: Estimator fit failed

I am running this:
# Hyperparameter tuning - Random Forest #
# Hyperparameters' grid
parameters = {'n_estimators': list(range(100, 250, 25)), 'criterion': ['gini', 'entropy'],
'max_depth': list(range(2, 11, 2)), 'max_features': [0.1, 0.2, 0.3, 0.4, 0.5],
'class_weight': [{0: 1, 1: i} for i in np.arange(1, 4, 0.2).tolist()], 'min_samples_split': list(range(2, 7))}
# Instantiate random forest
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=0)
# Execute grid search and retrieve the best classifier
from sklearn.model_selection import GridSearchCV
classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters, scoring='balanced_accuracy',
cv=5, refit=True, n_jobs=-1)
classifiers_grid.fit(X, y)
and I am receiving this warning:
.../anaconda/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536:
FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
TypeError: '<' not supported between instances of 'str' and 'int'
Why is this and how can I fix it?
I had a similar FitFailedWarning with different details. After many runs I found that the error came from how the parameter values were passed. Try listing the values explicitly:
parameters = {'n_estimators': [100,125,150,175,200,225,250],
'criterion': ['gini', 'entropy'],
'max_depth': [2,4,6,8,10],
'max_features': [0.1, 0.2, 0.3, 0.4, 0.5],
'class_weight': [{0: 1, 1: 1.0}, {0: 1, 1: 2.0}, {0: 1, 1: 3.0}], # must be a dict, 'balanced' or None, not a bare float
'min_samples_split': [2,3,4,5,6,7]}
This should avoid the warning. In my case it happened with XGBClassifier, where the datatypes of the values were somehow getting mixed up.
Another cause is a value outside the allowed range. For example, the maximum value of XGBClassifier's 'subsample' parameter is 1.0; if it is set to 1.1, a FitFailedWarning will occur.
For me the following grid gave the same error, but after removing 'None' from max_depth it fits properly. Note that 'None' in quotes is a string, not the Python None object:
param_grid={'n_estimators':[100,200,300,400,500],
'criterion':['gini', 'entropy'],
'max_depth':['None',5,10,20,30,40,50,60,70],
'min_samples_split':[5,10,20,25,30,40,50],
'max_features':[ 'sqrt', 'log2'],
'max_leaf_nodes':[5,10,20,25,30,40,50],
'min_samples_leaf':[1,100,200,300,400,500]
}
The code that runs properly:
param_grid={'n_estimators':[100,200,300,400,500],
'criterion':['gini', 'entropy'],
'max_depth':[5,10,20,30,40,50,60,70],
'min_samples_split':[5,10,20,25,30,40,50],
'max_features':[ 'sqrt', 'log2'],
'max_leaf_nodes':[5,10,20,25,30,40,50],
'min_samples_leaf':[1,100,200,300,400,500]
}
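The difference between the quoted string and the real None can be checked on a tiny problem. This is a sketch under assumptions: synthetic data, a small forest, and a hypothetical helper try_grid. Setting error_score="raise" makes GridSearchCV stop at the first failing fit instead of recording nan, so the underlying message becomes visible; its exact wording depends on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0)

def try_grid(max_depth_values):
    """Return None if the search fits, otherwise a description of the error."""
    search = GridSearchCV(forest, {"max_depth": max_depth_values}, cv=3,
                          error_score="raise")  # fail loudly instead of nan
    try:
        search.fit(X_demo, y_demo)
    except Exception as exc:
        return type(exc).__name__ + ": " + str(exc)
    return None

print(try_grid(["None", 5]))  # quoted string: the fit is expected to fail
print(try_grid([None, 5]))    # Python None object: means "no depth limit"
```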
I got the same error too, and when I passed the hyperparameters the way MachineLearningMastery does, I got output without the warning.
Try it this way if you run into a similar issue:
# grid search logistic regression model on the sonar dataset
from pandas import read_csv
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = LogisticRegression()
# define evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define search space
space = dict()
space['solver'] = ['newton-cg', 'lbfgs', 'liblinear']
space['penalty'] = ['none', 'l1', 'l2', 'elasticnet']
space['C'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)
# execute search
result = search.fit(X, y)
# summarize result
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)
Make sure the y-variable is an int, not bool or str.
Change your last line of code to make the y series a 0 or 1, for example:
classifiers_grid.fit(X, list(map(int, y)))
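One way a "'<' not supported between instances of 'str' and 'int'" error can arise from the labels is when y mixes strings and integers, because sorting the unique labels then compares a str with an int. Whether that is the cause in the question above is an assumption; the arrays y_mixed and y_clean below are made up for illustration.

```python
import numpy as np

# Labels with mixed types, e.g. after reading a messy CSV column.
y_mixed = np.array([0, 1, "1", 0], dtype=object)

try:
    np.unique(y_mixed)   # sorting compares int with str and fails
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)
print(sort_error)

# Casting every label to int removes the mixed types.
y_clean = np.array([int(v) for v in y_mixed])
print(np.unique(y_clean))
```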

How to run RFECV with SVC in sklearn

I am trying to perform Recursive Feature Elimination with Cross Validation (RFECV) with GridSearchCV as follows using SVC as the classifier.
My code is as follows.
X = df[my_features]
y = df['gold_standard']
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = SVC(class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=k_fold, scoring='roc_auc')
param_grid = {'estimator__C': [0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1.0, 10.0, 100.0, 1000.0],
'estimator__gamma': [0.001, 0.01, 0.1, 1.0, 2.0, 3.0, 10.0, 100.0, 1000.0],
'estimator__kernel':('rbf', 'sigmoid', 'poly')
}
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc', verbose=10)
CV_rfc.fit(x_train, y_train)
However, I got an error saying: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
Is there a way to resolve this error? If not what are the other feature selection techniques that I can use with SVC?
I am happy to provide more details if needed.
For more feature selection implementations, have a look at:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
As an example, the next link uses PCA together with k-best feature selection and an SVC:
https://scikit-learn.org/stable/auto_examples/compose/plot_feature_union.html#sphx-glr-auto-examples-compose-plot-feature-union-py
A usage example, modified from the previous link for simplicity:
iris = load_iris()
X, y = iris.data, iris.target
# Keep only the k best original features
selection = SelectKBest()
# Build SVC
svm = SVC(kernel="linear")
# Do grid search over k and C:
pipeline = Pipeline([("features", selection), ("svm", svm)])
param_grid = dict(features__k=[1, 2],
svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
In sklearn 0.19.2 the problem seems to have been solved. My code is similar to yours, but it works (note that it uses kernel = 'linear', and an SVC with a linear kernel exposes coef_):
svc = SVC(
kernel = 'linear',
probability = True,
random_state = 1 )
rfecv = RFECV(
estimator = svc,
scoring = 'roc_auc'
)
rfecv.fit(train_values,train_Labels)
selecInfo = rfecv.support_
selecIndex = np.where(selecInfo==1)
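The role of the kernel can be checked directly. This is a minimal sketch on synthetic data (an assumption), not the code from the question: it shows that a fitted SVC has coef_ only with a linear kernel, and then runs RFECV with a linear SVC.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Synthetic data, for illustration only.
X_demo, y_demo = make_classification(n_samples=150, n_features=8,
                                     n_informative=3, random_state=0)

# coef_ is available for the linear kernel but not for rbf.
linear_svc = SVC(kernel="linear").fit(X_demo, y_demo)
rbf_svc = SVC(kernel="rbf").fit(X_demo, y_demo)
print(hasattr(linear_svc, "coef_"), hasattr(rbf_svc, "coef_"))

# RFECV ranks features by coef_ (or feature_importances_), so it is given
# a linear SVC here.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
rfecv = RFECV(estimator=SVC(kernel="linear", class_weight="balanced"),
              step=1, cv=cv, scoring="roc_auc")
rfecv.fit(X_demo, y_demo)
print(rfecv.n_features_, rfecv.support_)
```

A grid over 'rbf', 'sigmoid' and 'poly' kernels, as in the question, therefore cannot be combined with RFECV on an SVC; a selector that does not need coef_, such as SelectKBest in a pipeline, avoids the restriction.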

LinearSVC and LogisticRegression are equivalent?

I am comparing the performance of LinearSVC and LogisticRegression from scikit-learn on textual data. I am using LinearSVC with the 'l2' penalty and 'squared_hinge' loss, and LogisticRegression with the 'l1' penalty. However, I find that their performance is nearly identical on my data sets (in terms of classification accuracy, precision/recall etc., as well as running times). This can't be sheer coincidence, and it leads me to suspect that these are in fact identical implementations. Is that the case?
If I want to compare LogisticRegression with a support-vector-based implementation (that can handle multiclass data), which class in scikit-learn should I use?
Here's my code
scorefunc = make_scorer(fbeta_score, beta = 1, pos_label = None)
splits = StratifiedShuffleSplit(data.labels, n_iter = 5, test_size = 0.2)
params1 = {'penalty':['l1'],'C':[0.0001, 0.001, 0.01, 0.1, 1.0]}
lr = GridSearchCV(LogisticRegression(), param_grid = params1, cv = splits, n_jobs = 5, scoring = scorefunc)
lr.fit(data.combined_mat, data.labels)
print lr.best_params_, lr.best_score_
>> {'penalty': 'l1', 'C': 1.0} 0.91015049974
params2 = {'C':[0.0001, 0.001, 0.01, 0.1, 1.0], 'penalty':['l2']}
svm = GridSearchCV(LinearSVC(), param_grid = params2, cv = splits, n_jobs = 5, scoring = scorefunc)
svm.fit(data.combined_mat, data.labels)
print svm.best_params_, svm.best_score_
>> {'penalty': 'l2', 'C': 1.0} 0.968320989681
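As background for the question, and not taken from the thread: the two models minimize different loss functions, which can be compared numerically. The sketch below evaluates the squared hinge loss (the loss named in the question for LinearSVC) and the logistic loss at a few made-up margin values y * f(x), with labels in {-1, +1}.

```python
import numpy as np

# Example margins y * f(x); negative means misclassified (made-up values).
margins = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])

# Squared hinge loss: exactly zero once the margin reaches 1.
squared_hinge = np.maximum(0.0, 1.0 - margins) ** 2

# Logistic loss: positive for every finite margin.
logistic = np.log1p(np.exp(-margins))

for m, sh, lg in zip(margins, squared_hinge, logistic):
    print(f"margin={m:+.1f}  squared_hinge={sh:.3f}  logistic={lg:.3f}")
```

Both losses penalize small or negative margins, which is one reason the two classifiers can score similarly on the same data even though they are separate models.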
