Hyperparameter tuning using GridSearchCV - python

I'm new to machine learning and I'm trying to predict the topic of an article given a labeled dataset in which each entry contains all the words from one article. There are 11 topics in total and each article has only a single topic.
I have built a process pipeline:
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(XGBClassifier(objective="multi:softmax", num_class=11), n_jobs=-1)),
])
I'm trying to implement GridSearchCV to find the best hyperparameters:
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
              'tfidf__use_idf': (True, False)}
gs_clf_svm = GridSearchCV(classifier, parameters, n_jobs=-1, cv=10, scoring='f1_micro')
gs_clf_svm = gs_clf_svm.fit(X, Y)
This works fine; however, how do I tune the hyperparameters of XGBClassifier? I have tried using the notation:
parameters = {'clf__learning_rate': [0.1, 0.01, 0.001]}
It doesn't work because GridSearchCV looks for the hyperparameters of OneVsRestClassifier. How do I actually tune the hyperparameters of XGBClassifier?
Also, which hyperparameters would you suggest are worth tuning for my problem?

As is, the pipeline looks for a parameter learning_rate in OneVsRestClassifier, can't find one (unsurprisingly, since that estimator has no such parameter), and raises an error. Since you actually want the parameter learning_rate of the nested XGBClassifier, you should go one level deeper, i.e.:
parameters = {'clf__estimator__learning_rate': [0.1, 0.01, 0.001]}
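More generally, every valid grid key can be listed with classifier.get_params().keys(); nested estimators are reached by chaining step and attribute names with double underscores. As a hedged sketch, a combined grid could look like the following, where the XGBoost value ranges are illustrative assumptions rather than recommendations:
# Hypothetical combined grid; each key walks
# Pipeline step -> OneVsRestClassifier.estimator -> XGBClassifier.
parameters = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__estimator__learning_rate': [0.1, 0.01, 0.001],
    'clf__estimator__max_depth': [3, 5, 7],       # illustrative range
    'clf__estimator__n_estimators': [100, 300],   # illustrative range
}
print(sorted(classifier.get_params().keys()))  # shows all tunable keys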

Related

Is preprocessing repeated in a Pipeline each time a new ML model is loaded?

I have created a pipeline using sklearn so that multiple models will go through it. Since there is vectorization before fitting the model, I wonder: is this vectorization always performed before the model fitting process? If so, maybe I should take this preprocessing out of the pipeline.
log_reg = LogisticRegression()
rand_for = RandomForestClassifier()
lin_svc = LinearSVC()
svc = SVC()
# The pipeline contains both the vectorization model and the classifier
pipe = Pipeline(
    [
        ('vect', tfidf),
        ('classifier', log_reg)
    ]
)
# params dictionary example
params_log_reg = {
    'classifier__penalty': ['l2'],
    'classifier__C': [0.01, 0.1, 1.0, 10.0, 100.0],
    'classifier__class_weight': ['balanced', class_weights],
    'classifier__solver': ['lbfgs', 'newton-cg'],
    # 'classifier__verbose': [2],
    'classifier': [log_reg]
}
params = [params_log_reg, params_rand_for, params_lin_svc, params_svc] # param dictionaries for each model
# Grid search to combine it all
grid = GridSearchCV(
    pipe,
    params,
    cv=skf,
    scoring='f1_weighted')
grid.fit(features_train, labels_train[:, 0])
When you run a GridSearchCV, the pipeline steps are recomputed for every combination of hyperparameters. So yes, this vectorization process is done every time the pipeline is fitted.
Have a look at the sklearn Pipeline and composite estimators.
To quote:
Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This feature is used to avoid computing the fit transformers within a pipeline if the parameters and input data are identical. A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration.
So you can use the memory parameter to cache the transformers:
from tempfile import mkdtemp  # import needed for the snippet below
cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)
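Applied to the grid search above, a minimal sketch might look like this (reusing the tfidf and log_reg objects from the question):
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline

cachedir = mkdtemp()
# With memory set, the fitted 'vect' step is cached on disk and reused by
# grid-search candidates that share the same vectorizer parameters.
pipe = Pipeline([
    ('vect', tfidf),
    ('classifier', log_reg)
], memory=cachedir)
Note that the cache only pays off when candidates reuse identical transformer parameters and input data; a candidate that changes a 'vect' parameter still triggers a refit.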

How to use GridSearchCV with MultiOutputClassifier(MLPClassifier) Pipeline

I am trying out scikit-learn for the first time, for a Multi-Output Multi-Class text classification problem. I am attempting to use GridSearchCV to optimize the parameters of MLPClassifier for this purpose.
I will admit that I am shooting in the dark here, having no prior experience. Please let me know if this makes sense.
Below is what I currently have:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
df = pd.read_csv('data.csv')
df.fillna('', inplace=True)  # Replaces NaNs with "" in the DataFrame (which would be considered a viable choice in this multi-classification model)
x_features = df['input_text']
y_labels = df[['output_text_label_1', 'output_text_label_2']]
x_train, x_test, y_train, y_test = train_test_split(x_features, y_labels, test_size=0.3, random_state=7)
pipe = Pipeline(steps=[('cv', CountVectorizer()),
                       ('mlpc', MultiOutputClassifier(MLPClassifier()))])
pipe.fit(x_train, y_train)
pipe.score(x_test, y_test)
pipe.score gives a score of ~0.837, which seems to suggest that the above code is doing something. Running pipe.predict() on some test strings also yields reasonably adequate results.
However, even after looking at plenty of examples, I don't understand how to implement GridSearchCV for this Pipeline. (Additionally, I would like advice on which parameters to search.)
I doubt it makes sense to post my attempts with GridSearchCV since they have been varied and all unsuccessful. But a brief example from a Stack Overflow answer could be:
grid = [
    {
        'activation': ['identity', 'logistic', 'tanh', 'relu'],
        'solver': ['lbfgs', 'sgd', 'adam'],
        'hidden_layer_sizes': [(100,), (200,)]
    }
]
grid_search = GridSearchCV(pipe, grid, scoring='accuracy', n_jobs=-1)
grid_search.fit(x_train, y_train)
This gives the error:
ValueError: Invalid parameter activation for estimator
Pipeline(steps=[('cv', CountVectorizer()),
('mlpc', MultiOutputClassifier(estimator=MLPClassifier()))]). Check the list of
available parameters with estimator.get_params().keys().
I'm not sure what causes this, nor exactly how to utilize estimator.get_params().keys() to figure out which parameters are faulty.
Perhaps my use of ('cv', CountVectorizer()) or ('mlpc', MultiOutputClassifier(estimator=MLPClassifier())) is incorrect in relation to the grid parameters.
I believe I need to use CountVectorizer() here because my inputs (and desired label outputs) are all strings.
I would very much appreciate an example of how GridSearchCV should be used for a Pipeline presumably utilizing CountVectorizer() and MLPClassifier in the correct way, and which grid parameters may be advisable to search.
TL;DR Try something like this:
import numpy as np
from sklearn.preprocessing import StandardScaler  # needed for the 'scale' step

mlpc = MLPClassifier(solver='adam',
                     learning_rate_init=0.01,
                     max_iter=300,
                     activation='relu',
                     early_stopping=True)
# with_mean=False keeps the scaler compatible with the sparse matrix
# that CountVectorizer produces
pipe = Pipeline(steps=[('cv', CountVectorizer(ngram_range=(1, 1))),
                       ('scale', StandardScaler(with_mean=False)),
                       ('mlpc', MultiOutputClassifier(mlpc))])
search_space = {
    'cv__max_df': (0.9, 0.95, 0.99),
    'cv__min_df': (0.01, 0.05, 0.1),
    'mlpc__estimator__alpha': 10.0 ** -np.arange(1, 5),
    'mlpc__estimator__hidden_layer_sizes': ((64, 32), (128, 64),
                                            (64, 32, 16), (128, 64, 32)),
    'mlpc__estimator__tol': (1e-3, 5e-3, 1e-4),
}
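Wiring the search space into the grid search then follows the usual pattern; in this minimal sketch cv=3 is an arbitrary choice, and scoring is left at its default so the pipeline's own score method (which handles the multi-output case) is used:
search = GridSearchCV(pipe, search_space, n_jobs=-1, cv=3)  # cv=3 is arbitrary
search.fit(x_train, y_train)
print(search.best_params_)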
Discussion:
[Edit] Since MLPClassifier natively supports multi-output classification (for multi-output binary classification, at least), and your outputs are presumably interrelated, I wouldn't recommend MultiOutputClassifier: it trains a separate MLPClassifier per output without taking the relationships between outputs into account. Training only one MLPClassifier is faster, cheaper, and usually more accurate.
The ValueError is due to improper parameter grid names. See Nested parameters.
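As the error message itself suggests, the set of valid names can be inspected directly on the pipeline; for the pipeline above, keys carry the step prefixes 'cv' and 'mlpc', plus 'estimator' for the wrapped MLPClassifier:
# Prints every tunable key, e.g. 'cv__max_df' and 'mlpc__estimator__alpha'
print(sorted(pipe.get_params().keys()))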
With a modest workstation and/or large training data, set solver='adam' to use a cheaper, first-order method, as opposed to the second-order 'lbfgs'. Alternatively, try solver='sgd' (even cheaper to compute), but then also tune momentum. I anticipate that your data will be sparse and on different scales after CountVectorizer, and momentum/solver='adam' is a way to cope with the resulting varying gradients.
Insert one of the standardization transformers (I guess StandardScaler will work better) after CountVectorizer, as MLPs are sensitive to feature scaling. Although solver='adam' would probably handle an imbalanced bag of words well, I believe it won't hurt to standardize your data.
I think tuning activation is needless. Set activation='relu'.
Use early_stopping=True, specify a large enough max_iter, and tune tol to prevent overfitting.
Definitely tune learning_rate_init with solver='sgd'; for solver='adam', I assume higher learning rates will be OK and adam won't require comprehensive learning-rate tuning.
Prefer deeper nets to wider ones (e.g., hidden_layer_sizes=(128, 64, 32) to hidden_layer_sizes=(256, 192)).
Always tune alpha.
Optimal hidden_layer_sizes may depend on the dimensions of the document-term matrix.
Try higher batch_size values, but take the computational expense into account.
If you wish to optimize CountVectorizer, tune max_df and min_df, but not ngram_range; I believe an MLP with at least two hidden layers will capture unigram relationships itself without the need to process n-grams explicitly.
Optimize the hyperparameters in the code sample above first, but note that the remaining hyperparameters can also affect both computational performance and predictive power; a possible extension of the search space is sketched below.
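For example, here is a hedged sketch of such an extension, assuming solver='sgd' (momentum is only used by the 'sgd' solver in MLPClassifier) and with purely illustrative value ranges:
sgd_space = {
    'mlpc__estimator__solver': ['sgd'],
    'mlpc__estimator__learning_rate_init': (0.001, 0.01, 0.1),  # illustrative
    'mlpc__estimator__momentum': (0.8, 0.9, 0.95),              # illustrative
    'mlpc__estimator__batch_size': (128, 256, 512),             # illustrative
}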
Disclaimer: Most of the remarks are based on my (insubstantial šŸ¤”) assumptions about your data, and pertain only to scikit-learn's MLPs. Refer to the docs to learn more about neural networks and experiment with other tips. And remember, There is No Free Lunch.

RandomizedSearchCV with scoring accuracy leads to error: Scoring failed. The score on this train-test partition for these parameters will be set to nan

I want to find a good neural network instance by using RandomizedSearchCV with accuracy as the scoring metric, because the task is a binary classification problem. Unfortunately I get the error message:
Scoring failed. The score on this train-test partition for these parameters will be set to nan.
This is my implementation:
# Define neural network instance
def build_model(n_hidden_layers=2, n_neurons=77, dropout_rate=0.5, optimizer='adam',
                input_shape=77, activation_hidden="relu", activation_output="sigmoid",
                loss='binary_crossentropy', metrics=['binary_accuracy'],
                hidden_weight_initializer="he_normal",
                output_weight_initializer="glorot_normal",
                l1=0, l2=0, use_batch_norm=0):
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    for layer in range(n_hidden_layers):
        model.add(keras.layers.Dense(n_neurons, activation=activation_hidden,
                                     kernel_initializer=hidden_weight_initializer,
                                     kernel_regularizer=tf.keras.regularizers.l1_l2(l1, l2)))
        model.add(keras.layers.Dropout(dropout_rate))
        if use_batch_norm == 1:
            model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Dense(1, activation=activation_output,
                                 kernel_initializer=output_weight_initializer))
    model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
    return model
# Create wrapper class for RandomizedSearchCV
keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)
# Define hyperparameter spaces for trained neural network instances
param_distribs = {
    "n_hidden_layers": [1, 2, 3, 4, 5],
    "n_neurons": [x for x in range(10, 100)],
    "dropout_rate": [0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "use_batch_norm": [0, 1],
    # "optimizer": ['adam'],
    "activation_hidden": ['relu', 'elu', 'selu'],  # also considered: 'LeakyReLU(alpha=0.2)', 'PReLU(alpha_initializer=Constant(value=0.25))'
    # "activation_output": ['relu', 'sigmoid'],
    # "loss": ['binary_crossentropy'],
    # "l1":
    # "l2":
}
from sklearn.metrics import make_scorer, precision_score, accuracy_score
precision = make_scorer(precision_score, pos_label="donated")
accuracy = make_scorer(accuracy_score, pos_label="donated")
# Use RandomizedSearchCV to find the model instance with the best performance on the training data
rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=2, cv=2, scoring="accuracy")  # also tried: scoring=accuracy, random_state=1, n_iter=10, cv=3
rnd_search_cv.fit(X_train, y_train, epochs=10,  # epochs=100 also considered
                  validation_data=(X_test, y_test),
                  callbacks=[keras.callbacks.EarlyStopping(patience=5)],
                  batch_size=256)
I was trying to do the same thing and ran into the same problem. I found out that the problem was the scorer object supplied to RandomizedSearchCV(). The loss function should be specified within your build_model() function when compiling the model, and must not be supplied via RandomizedSearchCV().
This should be easy if you want to use a simple loss function or metric like 'accuracy', since it is already provided by Keras. If you want a more specific loss function, you probably need to build it yourself and make sure it fulfils the requirements described in the TensorFlow documentation: https://www.tensorflow.org/api_docs/python/tf/keras/Model.
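As a minimal sketch of that fix (assuming build_model() keeps compiling with loss='binary_crossentropy' and metrics=['binary_accuracy']), drop the scoring argument and let the wrapper's default scorer be used:
rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=2, cv=2)  # no scoring=
rnd_search_cv.fit(X_train, y_train, epochs=10,
                  validation_data=(X_test, y_test),
                  callbacks=[keras.callbacks.EarlyStopping(patience=5)],
                  batch_size=256)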

python imblearn make_pipeline TypeError: Last step of Pipeline should implement fit

I am trying to implement SMOTE from imblearn inside a Pipeline. My data sets are text data stored in a pandas DataFrame. Please see the code snippet below:
text_clf = Pipeline([('vect', TfidfVectorizer()),
                     ('scale', StandardScaler(with_mean=False)),
                     ('smt', SMOTE(random_state=5)),
                     ('clf', LinearSVC(class_weight='balanced'))])
After this I am using GridSearchCV:
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring='accuracy')
where parameters is just a set of tuning parameters, mostly for TfidfVectorizer().
I am getting the following error.
All intermediate steps should be transformers and implement fit and transform. 'SMOTE
After this error, I changed the code as follows:
vect = TfidfVectorizer(use_idf=True, smooth_idf=True, max_df=0.25, sublinear_tf=True, ngram_range=(1, 2))
X = vect.fit_transform(X).todense()
Y = vect.fit_transform(Y).todense()
X_Train, X_Test, Y_Train, y_test = train_test_split(X, Y, random_state=0, test_size=0.33, shuffle=True)
text_clf = make_pipeline([('smt', SMOTE(random_state=5)),
                          ('scale', StandardScaler(with_mean=False)),
                          ('clf', LinearSVC(class_weight='balanced'))])
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring='accuracy')
where parameters is just a grid for tuning C in the SVC classifier.
This time I am getting the following error:
Last step of Pipeline should implement fit.SMOTE(....) doesn't
What is going on here? Can anyone please help?
imblearn's SMOTE has no transform method; the docs are here.
But all steps except the last in a pipeline should have it, along with fit.
To use SMOTE with an sklearn pipeline, you would need to implement a custom transformer that calls SMOTE.fit_sample() in its transform method.
Another, easier option is just to use the imblearn pipeline:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline

# This doesn't work with sklearn.pipeline.Pipeline because
# SMOTE doesn't have a .transform() method.
# (It has .fit_sample() or .sample().)
pipe = imbPipeline([
    ...
    ('oversample', SMOTE(random_state=5)),
    ('clf', LinearSVC(class_weight='balanced'))
])
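Filled in with the steps from the original snippet (step names carried over from the question), the whole thing might look like this sketch:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# imblearn's Pipeline knows how to call SMOTE's resampling API during fit
# (and skips it at predict time), so the raw text can stay in the pipeline.
text_clf = imbPipeline([
    ('vect', TfidfVectorizer()),
    ('scale', StandardScaler(with_mean=False)),
    ('smt', SMOTE(random_state=5)),
    ('clf', LinearSVC(class_weight='balanced')),
])
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring='accuracy')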

how to choose parameters in TfidfVectorizer in sklearn during unsupervised clustering

TfidfVectorizer provides an easy way to encode & transform texts into vectors.
My question is how to choose the proper values for parameters such as min_df, max_features, smooth_idf, sublinear_tf?
Update:
Maybe I should have put more detail in the question: what if I am doing unsupervised clustering on a bunch of texts? I don't have any labels for the texts, and I don't know how many clusters there might be (which is actually what I am trying to figure out).
If you are, for instance, using these vectors in a classification task, you can vary these parameters (and of course also the parameters of the classifier) and see which values give you the best performance.
You can do that easily in sklearn with the GridSearchCV and Pipeline objects:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(MultinomialNB(
        fit_prior=True, class_prior=None))),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__alpha': (1e-2, 1e-3)
}
grid_search_tune = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2, verbose=3)
grid_search_tune.fit(train_x, train_y)
print("Best parameters set:")
print(grid_search_tune.best_estimator_.steps)
