I'm building an SGDClassifier and using a TF-IDF transformer. Aside from the TF-IDF features, I'd also like to add extra features such as document length or other ratings. How can I add these features to the feature set? Here is how the classifier is constructed in a pipeline:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

data = fetch_20newsgroups(subset='train', categories=None)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'tfidf__use_idf': (True, False),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(data.data, data.target)
print(grid_search.best_score_)
You can use FeatureUnion: http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces
There is a nice example in the documentation, https://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html, which I think fits your requirements exactly. See the TextStats transformer there.
[Update: that example is for scikit-learn <= 0.18]
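For illustration, here is a minimal sketch of the idea (not the documentation example itself): a small custom transformer extracts document length, and FeatureUnion stacks its output next to the bag-of-words / tf-idf columns. The TextLength class and the step names are hypothetical.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

class TextLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding each document's length."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(doc)] for doc in X])

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('bow', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
        ])),
        ('length', TextLength()),  # extra hand-crafted feature appended to the tf-idf columns
    ])),
    ('clf', SGDClassifier()),
])

Note that nesting the steps changes the grid-search parameter keys, e.g. 'features__bow__vect__max_df' instead of 'vect__max_df'.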
I trained my classifier using a pipeline:
import category_encoders as ce
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
# ColumnSelector is assumed to be a column-selection transformer
# (e.g. from mlxtend.feature_selection, or a custom implementation)

param_tuning = {
    'classifier__learning_rate': [0.01, 0.1],
    'classifier__max_depth': [3, 5, 7, 10],
    'classifier__min_child_weight': [1, 3, 5],
    'classifier__subsample': [0.5, 0.7],
    'classifier__n_estimators': [100, 200, 500],
}

cat_pipe = Pipeline(
    [
        ('selector', ColumnSelector(categorical_features)),
        ('encoder', ce.one_hot.OneHotEncoder())
    ]
)

num_pipe = Pipeline(
    [
        ('selector', ColumnSelector(numeric_features)),
        ('scaler', StandardScaler())
    ]
)

preprocessor = FeatureUnion(
    transformer_list=[
        ('cat', cat_pipe),
        ('num', num_pipe)
    ]
)

xgb_pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', xgb.XGBClassifier())
    ]
)

grid = GridSearchCV(xgb_pipe, param_tuning, cv=5, n_jobs=-1, scoring='accuracy')
xgb_model = grid.fit(X_train, y_train)
The training data contain categorical features, so the transformed data have shape (x, 100). Afterwards, I try to explain the model's prediction on unseen data. Since I pass a single unseen example directly through the preprocessor, it is transformed to shape (1, 15), because a single observation does not contain all the categorical values seen in training.
eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns=xgb['classifier'].get_booster().feature_names)
And I got:
ValueError: Shape of passed values is (1, 15), indices imply (1, 100).
This occurs because the model was trained on the whole preprocessed dataset with shape (x, 100), but I pass a single observation with shape (1, 15) to the explainer. How do I correctly pass a single unseen observation to the explainer?
We never use .fit_transform() on unseen data; the correct way is to use the .transform() method of the preprocessor already fitted on your training data (here xgb['preprocessor']). That way, we ensure that the (transformed) unseen data have the same features as our (transformed) training data, and so they are compatible with the model built from the latter.
So, you should replace .fit_transform(df) here:
eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns=xgb['classifier'].get_booster().feature_names)
with .transform(df):
eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].transform(df), columns=xgb['classifier'].get_booster().feature_names)
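To illustrate why this matters, here is a minimal, self-contained sketch (the column name and data are made up): an encoder fitted on the training frame produces the full training feature space when .transform() is applied to a single unseen row, whereas calling .fit_transform() on that row only produces columns for the categories it happens to contain.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
new = pd.DataFrame({'color': ['green']})  # hypothetical single unseen row

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

print(enc.transform(new).shape)      # (1, 3): same feature space as training
print(enc.fit_transform(new).shape)  # (1, 1): refitted on one row, feature space lost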
I'm new to machine learning and I'm trying to predict the topic of an article from a labeled dataset in which each sample contains all the words of one article. There are 11 different topics in total and each article has only a single topic.
I have built a processing pipeline:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(XGBClassifier(objective="multi:softmax", num_class=11), n_jobs=-1)),
])
I'm trying to implement GridSearchCV to find the best hyperparameters:
parameters = {
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
}

gs_clf_svm = GridSearchCV(classifier, parameters, n_jobs=-1, cv=10, scoring='f1_micro')
gs_clf_svm = gs_clf_svm.fit(X, Y)
This works fine; however, how do I tune the hyperparameters of XGBClassifier? I have tried using the notation:
parameters = {'clf__learning_rate': [0.1, 0.01, 0.001]}
It doesn't work because GridSearchCV looks for the hyperparameters of OneVsRestClassifier. How do I actually tune the hyperparameters of XGBClassifier?
Also, which hyperparameters would you suggest are worth tuning for my problem?
As written, the grid search looks for a parameter learning_rate on OneVsRestClassifier, can't find one (unsurprisingly, since that estimator does not have such a parameter), and raises an error. Since you actually want the learning_rate parameter of the wrapped XGBClassifier, you should go one level deeper, i.e.:
parameters = {'clf__estimator__learning_rate': [0.1, 0.01, 0.001]}
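If you're unsure of the exact nested name, a quick way (a small sketch using the classifier pipeline defined above) is to list the parameter keys the pipeline exposes; the names are built by joining step names and estimator attributes with double underscores.

# Print every tunable parameter name of the pipeline defined above; entries such as
# 'clf__estimator__learning_rate' show the nesting:
# pipeline step 'clf' -> OneVsRestClassifier.estimator -> XGBClassifier parameter.
for name in sorted(classifier.get_params().keys()):
    print(name)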
I read the following example on Pipelines and GridSearchCV in Python:
http://www.davidsbatista.net/blog/2017/04/01/document_classification/
Logistic Regression:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
# stop_words: a list of stop words defined elsewhere

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'))),
])

parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    "clf__estimator__C": [0.01, 0.1, 1],
    "clf__estimator__class_weight": ['balanced', None],
}
SVM:
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    "clf__estimator__C": [0.01, 0.1, 1],
    "clf__estimator__class_weight": ['balanced', None],
}
Is there a way to combine Logistic Regression and SVM into one Pipeline? Say I have a TfidfVectorizer and would like to test it against multiple classifiers, with each run outputting the best model/parameters.
Here is an easy way to optimize over any classifier and, for each classifier, over any setting of its parameters.
Create a switcher class that works for any estimator
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier

class ClfSwitcher(BaseEstimator):

    def __init__(self, estimator=SGDClassifier()):
        """
        A custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - the classifier
        """
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)
Now you can pass in anything for the estimator parameter. And you can optimize any parameter for any estimator you pass in as follows:
Perform hyper-parameter optimization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', ClfSwitcher()),
])

parameters = [
    {
        'clf__estimator': [SGDClassifier()],  # SVM if hinge loss / logreg if log loss
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': ['english', None],
        'clf__estimator__penalty': ('l2', 'elasticnet', 'l1'),
        'clf__estimator__max_iter': [50, 80],
        'clf__estimator__tol': [1e-4],
        'clf__estimator__loss': ['hinge', 'log', 'modified_huber'],
    },
    {
        'clf__estimator': [MultinomialNB()],
        'tfidf__max_df': (0.25, 0.5, 0.75, 1.0),
        'tfidf__stop_words': [None],
        'clf__estimator__alpha': (1e-2, 1e-3, 1e-1),
    },
]
gscv = GridSearchCV(pipeline, parameters, cv=5, n_jobs=12, return_train_score=False, verbose=3)
gscv.fit(train_data, train_labels)
How to interpret clf__estimator__loss
clf__estimator__loss is interpreted as the loss parameter of whatever estimator is currently set, where estimator = SGDClassifier() in the first grid above; estimator is itself a parameter of clf, which is a ClfSwitcher object.
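As a follow-up (a small usage sketch, assuming gscv has been fitted as above), the nested names also show up in the results, so you can see which estimator and which of its parameters won:

# Which estimator and parameter settings performed best?
print(gscv.best_params_['clf__estimator'])  # e.g. SGDClassifier(...) or MultinomialNB(...)
print(gscv.best_params_)                    # full dict keyed by the nested parameter names
print(gscv.best_score_)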
Yes, you can do that by building a wrapper function. The idea is to pass it two dictionaries: the models and the parameters.
Then you iteratively call GridSearchCV on each model with all the parameters to test.
Check this example; it adds extra functionality so that at the end you get a data frame summarizing the different models/parameters and their performance scores.
EDIT: It's too much code to paste here, but you can check a full working example here:
http://www.davidsbatista.net/blog/2018/02/23/model_optimization/
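For reference, a minimal sketch of that idea (not the blog post's code; the model names, grids, and the train_x/train_y variables are made up): one dict maps a name to an estimator, another maps the same name to its parameter grid, and the results are collected into a data frame.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

models = {
    'logreg': LogisticRegression(max_iter=1000),
    'svm': LinearSVC(),
}
params = {
    'logreg': {'clf__C': [0.01, 0.1, 1]},
    'svm': {'clf__C': [0.01, 0.1, 1]},
}

rows = []
for name, clf in models.items():
    pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', clf)])
    gs = GridSearchCV(pipe, params[name], cv=5)
    gs.fit(train_x, train_y)  # train_x, train_y: your training texts and labels
    rows.append({'model': name, 'best_score': gs.best_score_, 'best_params': gs.best_params_})

summary = pd.DataFrame(rows).sort_values('best_score', ascending=False)
print(summary)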
This is how I did it without a wrapper function.
You can evaluate any number of classifiers, each with multiple parameters for hyperparameter optimization.
The one with the best score will be saved to disk with joblib (as a pickle file).
import joblib
from operator import itemgetter
from sklearn.svm import SVC
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
# pipeline parameters
parameters = [
    {
        'clf': [MultinomialNB()],
        'tf-idf__stop_words': ['english', None],
        'clf__alpha': [0.001, 0.1, 1, 10, 100]
    },
    {
        'clf': [SVC()],
        'tf-idf__stop_words': ['english', None],
        'clf__C': [0.001, 0.1, 1, 10, 100, 10e5],
        'clf__kernel': ['linear', 'rbf'],
        'clf__class_weight': ['balanced'],
        'clf__probability': [True]
    },
    {
        'clf': [DecisionTreeClassifier()],
        'tf-idf__stop_words': ['english', None],
        'clf__criterion': ['gini', 'entropy'],
        'clf__splitter': ['best', 'random'],
        'clf__class_weight': ['balanced', None]
    }
]

# evaluating multiple classifiers
# based on pipeline parameters
# -------------------------------
result = []

for params in parameters:
    # classifier
    clf = params['clf'][0]

    # getting arguments by
    # popping out classifier
    params.pop('clf')

    # pipeline
    steps = [('tf-idf', TfidfVectorizer()), ('clf', clf)]

    # cross validation using
    # Grid Search
    grid = GridSearchCV(Pipeline(steps), param_grid=params, cv=3)
    grid.fit(features, labels)

    # storing result
    result.append({
        'grid': grid,
        'classifier': grid.best_estimator_,
        'best score': grid.best_score_,
        'best params': grid.best_params_,
        'cv': grid.cv
    })

# sorting result by best score
result = sorted(result, key=itemgetter('best score'), reverse=True)

# saving best classifier
grid = result[0]['grid']
joblib.dump(grid, 'classifier.pickle')
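A short usage sketch (assuming the file written above, and new_documents as a hypothetical iterable of raw texts): the saved GridSearchCV object can be reloaded later and used for prediction directly, since refit=True by default means it holds the best fitted pipeline.

import joblib

# Reload the persisted grid-search object and predict with its best pipeline.
grid = joblib.load('classifier.pickle')
predictions = grid.predict(new_documents)  # new_documents: an iterable of raw texts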
I am solving a binary classification problem over text documents in Python with scikit-learn, and I wish to try different models to compare and contrast results, mainly a Naive Bayes classifier and an SVM tuned with stratified K-fold CV (5 folds). I am finding it difficult to combine all of the methods into one pipeline, given that the SVM model uses GridSearchCV(). I cannot have multiple pipelines running during a single implementation due to concurrency issues, so I need to implement all the different models using one pipeline.
This is what I have so far:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
# split_into_lemmas is a user-defined tokenizer (not shown here)

# pipeline for naive bayes
naive_bayes_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# accessing and using the pipelines
naive_bayes = naive_bayes_pipeline.fit(train_data['data'], train_data['gender'])

# pipeline for SVM
svm_pipeline = Pipeline([
    ('bow_transformer', CountVectorizer(analyzer=split_into_lemmas, stop_words='english')),
    ('tf_idf', TfidfTransformer()),
    ('classifier', SVC())
])

param_svm = [
    {'classifier__C': [1, 10], 'classifier__kernel': ['linear']},
    {'classifier__C': [1, 10], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]

grid_svm_skf = GridSearchCV(
    svm_pipeline,              # pipeline from above
    param_grid=param_svm,      # parameters to tune via cross validation
    refit=True,                # fit using all data, on the best detected classifier
    n_jobs=-1,                 # number of cores to use for parallelization; -1 uses all cores
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=5),  # stratified K-fold CV with 5 folds
)

svm_skf = grid_svm_skf.fit(train_data['data'], train_data['gender'])
predictions_svm_skf = svm_skf.predict(test_data['data'])
EDIT 1:
The second pipeline is the only one using GridSearchCV(), and it never seems to be executed.
EDIT 2:
Added more code to show the GridSearchCV() usage.
Consider checking out similar questions here:
Compare multiple algorithms with sklearn pipeline
Pipeline: Multiple classifiers?
To summarize: the ClfSwitcher approach shown in the answer above applies directly here. Wrap the classifier step in a switcher estimator, list both MultinomialNB and the SVM (each with its own parameter grid) in the GridSearchCV parameters, and a single pipeline plus a single grid search evaluates all of them.
TfidfVectorizer provides an easy way to encode & transform texts into vectors.
My question is: how do I choose proper values for parameters such as min_df, max_features, smooth_idf, and sublinear_tf?
Update:
Maybe I should have put more detail in the question:
What if I am doing unsupervised clustering on a bunch of texts, I don't have any labels for them, and I don't know how many clusters there might be (which is actually what I am trying to figure out)?
If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.
You can do that easily in sklearn with the GridSearchCV and Pipeline objects:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(fit_prior=True, class_prior=None))),
])

parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3)
}

grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)

print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)
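As a follow-up (a small sketch assuming the grid search above has finished), you can also read off the winning parameter values directly and inspect the fitted vectorizer inside the best pipeline:

# The parameter values that won the search, keyed by their pipeline names.
print(grid_search_tune.best_params_)

# The fitted TfidfVectorizer inside the best pipeline, e.g. to check vocabulary size.
best_tfidf = grid_search_tune.best_estimator_.named_steps['tfidf']
print(len(best_tfidf.vocabulary_))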