Multiple classification models in a scikit-learn pipeline (Python)

I am solving a binary classification problem over some text documents using Python and the scikit-learn library, and I want to try different models and compare the results: mainly a Naive Bayes classifier, an SVM with K-Fold CV, and CV=5. I am having difficulty combining all of the methods into one pipeline, given that the latter two models use GridSearchCV(). I cannot have multiple pipelines running during a single implementation due to concurrency issues, so I need to implement all the different models using one pipeline.
This is what I have so far:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# pipeline for naive bayes
# (split_into_lemmas is my own tokenizer function, defined elsewhere)
naive_bayes_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
# accessing and using the pipelines
naive_bayes = naive_bayes_pipeline.fit(train_data['data'], train_data['gender'])

# pipeline for SVM
svm_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', SVC())
])
param_svm = [
    {'classifier__C': [1, 10], 'classifier__kernel': ['linear']},
    {'classifier__C': [1, 10], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]
grid_svm_skf = GridSearchCV(
    svm_pipeline,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 uses "all cores"
    scoring='accuracy',
    # StratifiedKFold CV with 5 folds
    # (older releases spelled this StratifiedKFold(y, n_folds=5))
    cv=StratifiedKFold(n_splits=5),
)
svm_skf = grid_svm_skf.fit(train_data['data'], train_data['gender'])
predictions_svm_skf = svm_skf.predict(test_data['data'])
EDIT 1:
The second pipeline is the only one using GridSearchCV(), and it never seems to be executed.
EDIT 2:
Added more code to show the GridSearchCV() use.

Consider checking out similar questions here:
Compare multiple algorithms with sklearn pipeline
Pipeline: Multiple classifiers?
To summarize,
Here is an easy way to optimize over any classifier and for each classifier any settings of parameters.
Create a switcher class that works for any estimator
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier  # needed for the default argument below

class ClfSwitcher(BaseEstimator):

    def __init__(
        self,
        estimator = SGDClassifier(),
    ):
        """
        A Custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
Now you can pass in anything for the estimator parameter. And you can optimize any parameter for any estimator you pass in as follows:
Perform hyper-parameter optimization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

parameters = [
    {
        'clf__estimator': [SGDClassifier()],  # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        # 'log_loss' was called 'log' before scikit-learn 1.1
        'clf__estimator__loss': ['hinge', 'log_loss', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]

# train_data / train_labels are your own documents and labels
gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)
How to interpret clf__estimator__loss
clf__estimator__loss is interpreted as the loss parameter of whatever estimator is set to. Here estimator = SGDClassifier() (the first dictionary in the grid above), and estimator is itself a parameter of clf, which is a ClfSwitcher object.
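To make the double-underscore naming concrete, here is a minimal sketch. It is not part of the original answer: it uses scikit-learn's own OneVsRestClassifier as a stand-in for ClfSwitcher, purely because it also exposes a parameter named estimator, so the same path syntax applies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# OneVsRestClassifier stands in for ClfSwitcher here (assumption for illustration):
# both wrap another classifier in a parameter called "estimator".
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(SGDClassifier())),
])

# step name "clf" -> its parameter "estimator" -> that object's parameter "loss"
pipe.set_params(clf__estimator__loss="modified_huber")

inner = pipe.named_steps["clf"].estimator
nested_names = [name for name in pipe.get_params() if name.startswith("clf__estimator__")]
```

GridSearchCV resolves the keys of its parameter grid through the same set_params mechanism, which is why the grid can address the wrapped classifier's settings.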

Related

Multilabel classification in scikit-learn with hyperparameter search: specifying averaging

I am working on a simple multioutput classification problem and noticed this error showing up whenever running the below code:
ValueError: Target is multilabel-indicator but average='binary'. Please
choose another average setting, one of [None, 'micro', 'macro', 'weighted', 'samples'].
I understand the problem it is referencing, i.e., when evaluating multilabel models one needs to explicitly set the type of averaging. Nevertheless, I am unable to figure out where this average argument should go, since only the built-in metric functions such as precision_score and recall_score have this argument, and I do not call them explicitly in my code. Moreover, since I am doing a RandomizedSearch, I cannot just pass precision_score(average='micro') to the scoring or refit arguments either, since precision_score() requires the true and predicted y labels to be passed. This is why this former SO question and this one here, both with a similar issue, didn't help.
My code with example data generation is as follows:
from sklearn.datasets import make_multilabel_classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
X, Y = make_multilabel_classification(
    n_samples=1000,
    n_features=2,
    n_classes=5,
    n_labels=2
)
pipe = Pipeline(
    steps = [
        ('scaler', MinMaxScaler()),
        ('model', MultiOutputClassifier(MultinomialNB()))
    ]
)
search = RandomizedSearchCV(
    estimator = pipe,
    param_distributions={'model__estimator__alpha': (0.01,1)},
    scoring = ['accuracy', 'precision', 'recall'],
    refit = 'precision',
    cv = 5
).fit(X, Y)
What am I missing?
From the scikit-learn docs, I see that you can pass a callable that returns a dictionary where the keys are the metric names and the values are the metric scores. This means you can write your own scoring function, which has to take the estimator, X_test, and y_test as inputs. It must compute y_pred and use that to compute the scores you want. You can do this with the built-in metric functions, where you can specify which keyword arguments (such as average) should be used. In code that would look like
from sklearn.metrics import accuracy_score, precision_score, recall_score

def my_scorer(estimator, X_test, y_test) -> dict[str, float]:
    y_pred = estimator.predict(X_test)
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='micro'),
        'recall': recall_score(y_test, y_pred, average='micro'),
    }

search = RandomizedSearchCV(
    estimator = pipe,
    param_distributions={'model__estimator__alpha': (0.01,1)},
    scoring = my_scorer,
    refit = 'precision',
    cv = 5
).fit(X, Y)
From the table of scoring metrics, note f1_micro, f1_macro, etc., and the notes "suffixes apply as with ‘f1’" given for precision and recall. So e.g.
search = RandomizedSearchCV(
    ...
    scoring = ['accuracy', 'precision_micro', 'recall_macro'],
    ...
)
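A third option, sketched here as an addition rather than something from the answers above, is sklearn.metrics.make_scorer, which binds keyword arguments such as average to a metric function without needing the labels up front. The data set size, the alpha values, and the use of GridSearchCV instead of RandomizedSearchCV are arbitrary choices to keep the sketch small.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB

# Small synthetic multilabel problem (count features, so MultinomialNB applies).
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

# make_scorer stores average='micro' and applies it each time the scorer is called.
scoring = {
    "precision": make_scorer(precision_score, average="micro", zero_division=0),
    "recall": make_scorer(recall_score, average="micro", zero_division=0),
}

search = GridSearchCV(
    MultiOutputClassifier(MultinomialNB()),
    param_grid={"estimator__alpha": [0.01, 1.0]},
    scoring=scoring,
    refit="precision",  # must name one of the keys in the scoring dict
    cv=3,
).fit(X, Y)

print(search.best_params_, search.best_score_)
```

The keys of the scoring dictionary become the suffixes in cv_results_ (for example mean_test_precision), and refit has to name one of them.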

How to run GridSearchCV inside RFECV?

I would like to write a script to select the most important features using RFECV. The estimator I want to use is logistic regression. In addition, I also want to run GridSearchCV. In other words, I want to tune the parameters first using GridSearchCV and then update the parameters in each iteration of RFECV.
I have written the code below, but I'm not sure whether, when I use RFECV(GridSearchCV(LogisticRegression)), the parameters of the model are tuned and updated in each iteration of RFECV. Please give me some advice on this issue.
Thank you so much!
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = make_classification(n_samples=50,
                           n_features=5,
                           n_informative=3,
                           n_redundant=0,
                           random_state=0)

class GridSeachWithCoef(GridSearchCV):
    # expose the coefficients of the best model so RFECV can rank the features
    @property
    def coef_(self):
        return self.best_estimator_.coef_

solvers = ['lbfgs', 'liblinear']
penalty = ['l1', 'l2']
c_values = np.logspace(-4, 4, 20)
param_grid = [
    {'penalty': penalty,
     'C': c_values,
     'solver': solvers}
]
GS = GridSeachWithCoef(LogisticRegression(), param_grid=param_grid, cv=3, verbose=True, n_jobs=-1)
min_features_to_select = 1  # Minimum number of features to consider
rfecv = RFECV(
    estimator=GS, cv=3, scoring="accuracy",
    min_features_to_select=min_features_to_select
)
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)
(The code above was adapted from other people in the forum. Thank you for your code.)

Pipeline: Multiple classifiers?

I read the following example on Pipelines and GridSearchCV in Python:
http://www.davidsbatista.net/blog/2017/04/01/document_classification/
Logistic Regression:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'))),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    "clf__estimator__C": [0.01, 0.1, 1],
    "clf__estimator__class_weight": ['balanced', None],
}
SVM:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    "clf__estimator__C": [0.01, 0.1, 1],
    "clf__estimator__class_weight": ['balanced', None],
}
Is there a way that Logistic Regression and SVM could be combined into one Pipeline? Say, I have a TfidfVectorizer and like to test against multiple classifiers that each then output the best model/parameters.
Yes, you can do that by building a wrapper function. The idea is to pass it two dictionaries: the models and the parameters.
Then you iteratively call the models with all the parameters to test, using GridSearchCV for this.
Check this example; it adds extra functionality so that at the end you output a data frame with a summary of the different models/parameters and their performance scores.
EDIT: It's too much code to paste here; you can check a full working example here:
http://www.davidsbatista.net/blog/2018/02/23/model_optimization/
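For readers who only want the gist of the wrapper idea, a rough sketch follows. It is not the code from the linked post: the function name compare_models, the iris data, and the small grids are all made up for illustration, and it returns a sorted list of dictionaries instead of a data frame.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC


def compare_models(models, params, X, y, cv=3):
    """Run one GridSearchCV per model and return a summary sorted by score.

    `models` maps a name to an estimator; `params` maps the same name
    to that estimator's parameter grid. (Hypothetical helper.)
    """
    rows = []
    for name, model in models.items():
        grid = GridSearchCV(model, params[name], cv=cv)
        grid.fit(X, y)
        rows.append({
            "model": name,
            "best_score": grid.best_score_,
            "best_params": grid.best_params_,
        })
    return sorted(rows, key=lambda row: row["best_score"], reverse=True)


X, y = load_iris(return_X_y=True)
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(max_iter=10000),
}
params = {
    "logreg": {"C": [0.1, 1]},
    "svm": {"C": [0.1, 1]},
}
summary = compare_models(models, params, X, y)
for row in summary:
    print(row)
```

If pandas is available, passing the returned list to pandas.DataFrame gives the tabular summary the answer describes.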
This is how I did it without a wrapper function.
You can evaluate any number of classifiers. Each one can have multiple parameters for hyperparameter optimization.
The grid search with the best score will be saved to disk using joblib.
from operator import itemgetter

import joblib
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer

# pipeline parameters
parameters = [
    {
        'clf': [MultinomialNB()],
        'tf-idf__stop_words': ['english', None],
        'clf__alpha': [0.001, 0.1, 1, 10, 100]
    },
    {
        'clf': [SVC()],
        'tf-idf__stop_words': ['english', None],
        'clf__C': [0.001, 0.1, 1, 10, 100, 10e5],
        'clf__kernel': ['linear', 'rbf'],
        'clf__class_weight': ['balanced'],
        'clf__probability': [True]
    },
    {
        'clf': [DecisionTreeClassifier()],
        'tf-idf__stop_words': ['english', None],
        'clf__criterion': ['gini', 'entropy'],
        'clf__splitter': ['best', 'random'],
        'clf__class_weight': ['balanced', None]
    }
]

# evaluating multiple classifiers
# based on pipeline parameters
# -------------------------------
result = []
for params in parameters:
    # pop the classifier out, leaving only its arguments in params
    clf = params.pop('clf')[0]
    # pipeline
    steps = [('tf-idf', TfidfVectorizer()), ('clf', clf)]
    # cross validation using grid search
    # (features / labels are your own documents and targets)
    grid = GridSearchCV(Pipeline(steps), param_grid=params, cv=3)
    grid.fit(features, labels)
    # storing result
    result.append({
        'grid': grid,
        'classifier': grid.best_estimator_,
        'best score': grid.best_score_,
        'best params': grid.best_params_,
        'cv': grid.cv
    })

# sorting result by best score
result = sorted(result, key=itemgetter('best score'), reverse=True)
# saving best classifier
grid = result[0]['grid']
joblib.dump(grid, 'classifier.pickle')
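As a variation on the loop above (an addition, not part of the original answer): because a pipeline step is itself a settable parameter, a single GridSearchCV can also swap the classifier directly when each dictionary in the grid names the step. The tiny corpus below is invented just to make the sketch runnable, and LogisticRegression is an arbitrary second classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus, repeated so that 3-fold CV has enough samples per class.
docs = ["good movie", "great film", "nice plot", "fine acting",
        "bad movie", "awful film", "poor plot", "weak acting"] * 3
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

# The initial 'clf' is only a placeholder; the grid replaces it.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# Each dictionary sets the 'clf' step and tunes that classifier's own parameters.
param_grid = [
    {"clf": [MultinomialNB()], "clf__alpha": [0.1, 1.0]},
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0]},
]

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(docs, labels)

for candidate, score in zip(grid.cv_results_["params"],
                            grid.cv_results_["mean_test_score"]):
    print(candidate, round(score, 3))
```

This keeps everything in one fitted search object, so best_estimator_ is already the winning pipeline and no manual sorting of results is needed.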

How to pass two estimator objects to sklearn's GridSearchCV so that they have the same parameters in each step?

I'm trying to use GridSearchCV from sklearn to tune hyperparameters for my estimator.
In the first step, the estimator is used for SequentialFeatureSelection, which is a custom library that performs wrapper-based feature selection. This means iteratively adding new features and identifying the ones with which the estimator performs best. Hence, the SequentialFeatureSelection method requires my estimator. This library is written so that it works fine with sklearn, so I integrate it as the first step of the GridSearchCV pipeline to transform the features to the ones selected.
In the second step, I would like to use exactly the same classifier with exactly the same parameters to be fitted and predict the outcome. However, with the parameter grid I can only set the parameters either on the classifier that I pass to SequentialFeatureSelector or on the one in 'clf', and I cannot ensure that they are always the same.
Finally, with the selected features and selected parameters I want to predict on a previously held-out test set.
At the bottom of the page of the SFS library, they show how to use SFS with GridSearchCV, but there the KNN algorithm used to select features and the one used to predict also use different parameters. And when I check for myself after training SFS and GridSearchCV, the parameters are never the same, even when I use the clf.clone() as proposed. Here is my code:
import sklearn.pipeline
import sklearn.tree
import sklearn.model_selection
import mlxtend.feature_selection

def sfs(x, y):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2, random_state=0)
    clf = sklearn.tree.DecisionTreeClassifier()
    param_grid = {
        "sfs__estimator__max_depth": [5]
    }
    sfs = mlxtend.feature_selection.SequentialFeatureSelector(clone_estimator=True,  # Clone like in Tutorial
                                                              estimator=clf,
                                                              k_features=10,
                                                              forward=True,
                                                              floating=False,
                                                              scoring='accuracy',
                                                              cv=3,
                                                              n_jobs=1)
    pipe = sklearn.pipeline.Pipeline([('sfs', sfs), ("clf", clf)])
    gs = sklearn.model_selection.GridSearchCV(estimator=pipe,
                                              param_grid=param_grid,
                                              scoring='accuracy',
                                              n_jobs=1,
                                              cv=3,
                                              refit=True)
    gs = gs.fit(x_train, y_train)
    # Both estimators should have depth 5!
    print("SFS Final Estimator Depth: " + str(gs.best_estimator_.named_steps.sfs.estimator.max_depth))
    print("CLF Final Estimator Depth: " + str(gs.best_estimator_._final_estimator.max_depth))
    # Evaluate...
    y_test_pred = gs.predict(x_test)
    # Accuracy etc...
The question is: how do I ensure that they always have the same parameters set within the same pipeline?
Thanks!
I found a solution in which I override some methods of the SequentialFeatureSelector (SFS) class so that its estimator is also used for predicting after the transformation. This is done by introducing a custom SFS class, 'CSequentialFeatureSelector', that overrides the following methods of SFS:
In the fit(self, X, y) method, not only is the normal fit performed, but self.estimator is then also fitted on the transformed data, so that it is possible to implement predict and predict_proba methods for the SFS class.
I implemented predict and predict_proba methods for the SFS class that call the predict and predict_proba methods of the fitted self.estimator.
Hence, I only have one estimator left, which is used both for SFS and for predicting.
Here is some of the code:
import sklearn.pipeline
import sklearn.tree
import sklearn.model_selection
import mlxtend.feature_selection

class CSequentialFeatureSelector(mlxtend.feature_selection.SequentialFeatureSelector):

    def predict(self, X):
        X = self.transform(X)
        return self.estimator.predict(X)

    def predict_proba(self, X):
        X = self.transform(X)
        return self.estimator.predict_proba(X)

    def fit(self, X, y):
        # Run the original SFS fit first. (I had copied the 'old' fit method into a
        # helper named fit_helper; calling the parent's fit does the same job.)
        super().fit(X, y)
        # Then fit the same estimator on the selected features for prediction.
        self.estimator.fit(self.transform(X), y)
        return self

def sfs(x, y):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2, random_state=0)
    clf = sklearn.tree.DecisionTreeClassifier()
    param_grid = {
        "sfs__estimator__max_depth": [3, 4, 5]
    }
    # use the custom subclass so that the pipeline's only step can also predict
    sfs = CSequentialFeatureSelector(clone_estimator=True,
                                     estimator=clf,
                                     k_features=10,
                                     forward=True,
                                     floating=False,
                                     scoring='accuracy',
                                     cv=3,
                                     n_jobs=1)
    # Now only one object in the pipeline (in fact this is not even needed anymore)
    pipe = sklearn.pipeline.Pipeline([('sfs', sfs)])
    gs = sklearn.model_selection.GridSearchCV(estimator=pipe,
                                              param_grid=param_grid,
                                              scoring='accuracy',
                                              n_jobs=1,
                                              cv=3,
                                              refit=True)
    gs = gs.fit(x_train, y_train)
    print("SFS Final Estimator Depth: " + str(gs.best_estimator_.named_steps.sfs.estimator.max_depth))
    y_test_pred = gs.predict(x_test)
    # Evaluate performance of y_test_pred
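A different route, sketched here as an assumption-laden alternative rather than the author's method, is to keep both pipeline steps and tie the two parameters together by hand: loop over the candidate values, set both with set_params, and score each setting with cross_val_score. The sketch swaps in scikit-learn's own SequentialFeatureSelector (not mlxtend's) and the iris data so that it is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Selector and final classifier are separate objects, as in the question.
pipe = Pipeline([
    ("sfs", SequentialFeatureSelector(
        DecisionTreeClassifier(random_state=0), n_features_to_select=2, cv=3)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# A parameter grid cannot express "these two values must be equal",
# so the loop sets both parameters to the same value explicitly.
scores = {}
for depth in [2, 3, 4]:
    pipe.set_params(sfs__estimator__max_depth=depth, clf__max_depth=depth)
    scores[depth] = cross_val_score(pipe, X, y, cv=3, scoring="accuracy").mean()

best_depth = max(scores, key=scores.get)
pipe.set_params(sfs__estimator__max_depth=best_depth, clf__max_depth=best_depth)
pipe.fit(X, y)
print(best_depth, scores)
```

The trade-off is that the conveniences of GridSearchCV (cv_results_, parallel candidates, automatic refit) are handled manually, but no library class has to be subclassed.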

Ensuring right order of operations in random forest classification in scikit learn

I would like to ensure that the order of operations for my machine learning is right:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.grid_search import GridSearchCV
# 1. Initialize model
model = RandomForestClassifier(5000)
# 2. Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
# 3. Remove unimportant features
model = SelectFromModel(model, threshold=0.5).estimator
# 4. cross validate model on the important features
k_fold = KFold(n=len(data), n_folds=10, shuffle=True)
for k, (train, test) in enumerate(k_fold):
    self.model.fit(data[train], target[train])
# 5. grid search for best parameters
param_grid = {
    'n_estimators': [1000, 2500, 5000],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [3, 5, data.shape[1]]
}
gs = GridSearchCV(estimator=model, param_grid=param_grid)
gs.fit(X, y)
model = gs.best_estimator_
# Now the model can be used for prediction
Please let me know if this order looks good or if something can be done to improve it.
EDIT: clarifying to reduce downvotes.
Specifically,
1. Should the SelectFromModel be done after cross validation?
2. Should grid search be done before cross validation?
The main problem with your approach is you are confusing the feature selection transformer with the final estimator. What you will need to do is create two stages, the transformer first:
rf_feature_imp = RandomForestClassifier(100)
feat_selection = SelectFromModel(rf_feature_imp, threshold=0.5)
Then you need a second phase, where you train a classifier on the reduced feature set.
clf = RandomForestClassifier(5000)
Once you have your phases, you can build a pipeline to combine the two into a final model.
model = Pipeline([
    ('fs', feat_selection),
    ('clf', clf),
])
Now you can perform a GridSearch on your model. Keep in mind you have two stages, so the parameters must be specified by stage fs or clf. In terms of the feature selection stage, you can also access the base estimator using fs__estimator. Below is an example of how to search parameters on any of the three objects.
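params = {
    'fs__threshold': [0.5, 0.3, 0.7],
    # 'auto' used to be accepted here as well; it was removed in scikit-learn 1.3
    # and was equivalent to 'sqrt' for classifiers
    'fs__estimator__max_features': ['sqrt', 'log2'],
    'clf__max_features': ['sqrt', 'log2'],
}
gs = GridSearchCV(model, params, ...)
gs.fit(X, y)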
You can then make predictions with gs directly or using gs.best_estimator_.
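Put together, the steps above might look like the following sketch on the iris data. The tree counts, the 'mean' and 'median' thresholds, and cv=3 are choices made here to keep it quick and to guarantee that at least one feature is always selected; they are not values from the answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# Stage 1: feature selection driven by a small forest.
# Stage 2: the classifier trained on the selected features.
model = Pipeline([
    ("fs", SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Parameters are addressed per stage: fs, fs__estimator, and clf.
params = {
    "fs__threshold": ["mean", "median"],
    "fs__estimator__max_features": ["sqrt", "log2"],
    "clf__max_depth": [3, 5],
}

gs = GridSearchCV(model, params, cv=3)
gs.fit(X, y)

# Number of features kept by the best pipeline's selection stage.
n_selected = int(gs.best_estimator_.named_steps["fs"].get_support().sum())
print(gs.best_params_, n_selected)
```

Because the selector sits inside the pipeline, it is refit on the training part of every fold, so the cross-validated score does not leak information from the held-out part into the feature selection.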
