GridSearch for best model: Save and load parameters - python

I'd like to run the following workflow:
Select a model for text vectorization
Define a list of parameters
Apply a pipeline with GridSearchCV on the parameters, using LogisticRegression() as a baseline to find the best model parameters
Save the best model (parameters)
Load the best model parameters so that we can apply a range of other classifiers on top of this tuned vectorizer
Here is code that you can reproduce:
GridSearch:
%%time
import numpy as np
import pandas as pd
import joblib  # sklearn.externals.joblib is deprecated in modern scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess

np.random.seed(0)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(
    [simple_preprocess(doc) for doc in data.text],
    data.label, random_state=0)

# Find the best Tfidf model using LogisticRegression as a baseline classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
    ('clf', LogisticRegression())
])

parameters = {
    'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
    'tfidf__smooth_idf': (True, False),
    'tfidf__norm': ('l1', 'l2', None),
}

grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)
print(grid.best_params_)

# Save model
#joblib.dump(grid.best_estimator_, 'best_tfidf.pkl', compress=1)  # this unfortunately includes the LogReg
joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress=1)  # only the best parameters
Fitting 2 folds for each of 24 candidates, totalling 48 fits
{'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}
Load model with best parameters:
from sklearn.model_selection import GridSearchCV

# Load best parameters
tfidf_params = joblib.load('best_tfidf.pkl')

pipeline = Pipeline([
    ('vec', TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**tfidf_params)),  # here is the issue?
    ('clf', LogisticRegression())
])

cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print("Cross-Validation Score: %s" % (np.mean(cval)))
ValueError: Invalid parameter tfidf for estimator
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2',
        preprocessor=<built-in method join of str object>,
        smooth_idf=True, stop_words=None, strip_accents=None,
        sublinear_tf=False, token_pattern='(?u)\b\w\w+\b',
        tokenizer=None, use_idf=True, vocabulary=None).
Check the list of available parameters with `estimator.get_params().keys()`.
Question:
How can I load the best parameters of the Tfidf model?

This line:
joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress=1)  # Only best parameters
saves the parameters of the pipeline, not of the TfidfVectorizer itself: the keys in grid.best_params_ are prefixed with the step name (e.g. 'tfidf__max_df'), so they cannot be set directly on a bare TfidfVectorizer. Rebuild the pipeline with the same step name and set the parameters on the pipeline instead:
pipeline = Pipeline([
    # Keep the step name the same as before ('tfidf', not 'vec')
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
    ('clf', LogisticRegression())
])
pipeline.set_params(**tfidf_params)
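For completeness, a minimal sketch of the reload-and-evaluate step with that fix applied (assuming X_train and y_train from the question are still in scope):
import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Load the step-prefixed parameter dict saved by the grid search
tfidf_params = joblib.load('best_tfidf.pkl')

# Same step name as in the grid search, so the 'tfidf__*' keys resolve correctly
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
    ('clf', LogisticRegression())
])
pipeline.set_params(**tfidf_params)

cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print("Cross-Validation Score: %s" % np.mean(cval))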

Related

GridSearchCV best_estimator_ parameter doesn't have the same value as the fitted model when using pipeline indexing (also uses sequential feature selection)

The whole idea is to perform a grid search over all possible values of lambda, where each possible value of lambda would give a specific best subset of features. At the end of the day I'm trying to do hyperparameter tuning (lambda) and feature selection at the same time. Any advice is greatly appreciated, thank you so much!
ISSUE:
gs_cv.best_estimator_[0].estimator.alpha is 0.0 while gs_cv.best_estimator_[1].alpha is 1.0 (pipeline indexing results), so the best parameter from the grid search doesn't seem to be applied to the model step of the pipeline.
I got this when I print(gs_cv.best_estimator_.named_steps). The Ridge() still uses the default value alpha=1:
{'sfs_ridge': SequentialFeatureSelector(estimator=Ridge(alpha=0.0), k_features=5, scoring='r2'),
 'ridge_regression': Ridge()}
------------Code------------------
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Model
ridge = Ridge()
#hyperparameter_alpha = np.logspace(-6, 6, num=5)

# SFS model
sfs_ridge = SFS(estimator=ridge, k_features=5, forward=True, floating=False, scoring='r2', cv=5)

# Pipeline model
pipe = Pipeline([('sfs_ridge', sfs_ridge), ('ridge_regression', ridge)])

# GridSearchCV
# The parameter grid for the model should start with the name you give when defining the pipeline!
param_grid = [{'sfs_ridge__k_features': [2, 4, 5], 'sfs_ridge__estimator__alpha': np.arange(0, 1, 0.05)}]
gs_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=5, refit=True)
gs_cv.fit(X_train, y_train)

print(gs_cv.best_estimator_[0].estimator.alpha)  # prints 0.0
print(gs_cv.best_estimator_[1].alpha)            # prints 1.0
print(gs_cv.best_estimator_[0].k_feature_idx_)
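A likely explanation, for what it's worth: GridSearchCV clones every estimator before fitting, so the Ridge inside the SequentialFeatureSelector and the final 'ridge_regression' step become independent copies even though they were built from the same ridge object, and the grid above only tunes the former. A minimal sketch that also searches the alpha of the final step (note this searches the full cross-product, so the two alphas are not tied to the same value):
param_grid = [{
    'sfs_ridge__k_features': [2, 4, 5],
    'sfs_ridge__estimator__alpha': np.arange(0, 1, 0.05),  # alpha of the Ridge inside the selector
    'ridge_regression__alpha': np.arange(0, 1, 0.05),      # hypothetical addition: alpha of the final Ridge step
}]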

How to build a pipeline that finds the best preprocessing per column in a fine-grained fashion?

In sklearn we can use a ColumnTransformer within a pipeline to apply a preprocessing choice to specific columns like this:
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler, ...
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# this is my x_data
x_data = pd.DataFrame(..., columns=['Variable1', 'Variable2', 'Variable3'])

pipeline = Pipeline(steps=[
    ('preprocessing1', make_column_transformer((StandardScaler(), ['Variable1']), remainder='passthrough')),
    ('preprocessing2', make_column_transformer((MaxAbsScaler(), ['Variable2']), remainder='passthrough')),
    ('preprocessing3', make_column_transformer((MinMaxScaler(), ['Variable3']), remainder='passthrough')),
    ('clf', MLPClassifier(...))
])

then we would run GridSearchCV along the lines of the following:
params = [{'preprocessing1': [MinMaxScaler(), MaxAbsScaler(), StandardScaler()],  # <<<<<<<<<<<<< How???
           'preprocessing2': [MinMaxScaler(), MaxAbsScaler(), StandardScaler()],  # <<<<<<<<<<<<< How???
           'preprocessing3': [MinMaxScaler(), MaxAbsScaler(), StandardScaler()],  # <<<<<<<<<<<<< How???
           'clf__hidden_layer_sizes': [(100,), (200,)],
           'clf__solver': ['adam', 'lbfgs', 'sgd'],
           ...
           }]
cv = GridSearchCV(pipeline, params, cv=10, verbose=1, n_jobs=-1, refit=True)
What I would like to do is find the best preprocessing per predictor, because a single preprocessing for all predictors usually doesn't work best.
The naming convention in a pipeline uses a double underscore __ to separate steps from their parameters.
You can see the different parameters of your pipeline and their values using pipeline.get_params().
In your case the parameter preprocessing1__standardscaler references the scaler defined in the first step of your pipeline (make_column_transformer auto-names each transformer after its lowercased class), and this is the parameter that should be set during the GridSearchCV.
The example below illustrates how to perform this operation:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(
    n_features=3, n_informative=3, n_redundant=0, random_state=42
)

pipeline = Pipeline(
    steps=[
        ("preprocessing1", make_column_transformer((StandardScaler(), [0]), remainder="passthrough")),
        ("preprocessing2", make_column_transformer((StandardScaler(), [1]), remainder="passthrough")),
        ("preprocessing3", make_column_transformer((StandardScaler(), [2]), remainder="passthrough")),
        ("clf", MLPClassifier()),
    ]
)

param_grid = {
    "preprocessing1__standardscaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
    "preprocessing2__standardscaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
    "preprocessing3__standardscaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=10, verbose=1, n_jobs=-1)
grid_search.fit(X, y)
grid_search.best_params_
This will return the following output:
{'preprocessing1__standardscaler': MinMaxScaler(),
'preprocessing2__standardscaler': StandardScaler(),
'preprocessing3__standardscaler': MaxAbsScaler()}
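If in doubt about the exact key names, a quick way to check is to list them from the pipeline, e.g.:
# Print the tunable parameter names that target the scalers
for name in pipeline.get_params().keys():
    if 'standardscaler' in name:
        print(name)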

How to get a list of wrong predictions on the validation set

I'm trying to build a text-classification model on a database of site reviews (3 classes).
I cleaned the DF, tokenized it (with CountVectorizer), applied Tfidf (TfidfTransformer) and built an MNB model.
Now, after I trained and evaluated the model, I want to get a list of the wrong predictions so I can pass them through LIME and explore the words that confuse the model.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)

df = pd.read_csv(
    "https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]

# tokenization
vectorizer = CountVectorizer()
vectorizer_fit = vectorizer.fit(x)
bow_x = vectorizer_fit.transform(x)

# transform BOW to TF-IDF
transformer = TfidfTransformer()
transformer_x = transformer.fit(bow_x)
tfidf_x = transformer_x.transform(bow_x)

# splitting the dataset into training set and testing set
x_train, x_test, y_train, y_test = train_test_split(
    tfidf_x, y, test_size=0.3, random_state=101
)

mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train, y_train)
predmnb = mnb.predict(x_test)
My objective is to get the original indices of the reviews that the model predicted wrongly.
I managed to get the result like this:
predictions = c.predict(preprocessed_df['review_text'])
df2 = preprocessed_df.join(pd.DataFrame(predictions))
df2.columns = ['review_text', 'business_category', 'word_count', 'prediction']
df2[df2['business_category'] != df2['prediction']]
I'm sure there is a more elegant way...
It seems like there is another problem in your code: generally the TF-IDF vectorizer is fit on the training data only, and the test data is then transformed with the fitted vectorizer so it ends up in the same format. This is primarily done to avoid data leakage. Please refer to TfidfVectorizer: should it be used on train only or train+test. I have modified your code to suit your need.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv(
    "https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv"
)
cleaned_df = df.drop(
    labels=["review_id", "user_id", "business_id", "review_date"], axis=1
)
x = cleaned_df["review_text"]
y = cleaned_df["business_category"]

# split first, on the raw text, so the vectorizer never sees the test data
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=101
)

# TfidfVectorizer works directly on raw text (TfidfTransformer expects a count matrix)
vectorizer = TfidfVectorizer()
x_train_tf = vectorizer.fit_transform(x_train)  # fit on the training data only
x_test_tf = vectorizer.transform(x_test)        # transform the test data with the fitted vocabulary

mnb = MultinomialNB(alpha=0.14)
mnb.fit(x_train_tf, y_train)
predmnb = mnb.predict(x_test_tf)

# the wrongly predicted reviews (note != : mismatches, not matches)
incorrect_docs = x_test[predmnb != y_test]
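Building on that last line, a small sketch for the stated objective: since x_test is a pandas Series, its index still carries the original row labels, so the misclassified reviews can be collected together with their true and predicted labels (the column names here are illustrative):
mask = predmnb != y_test
wrong_idx = x_test.index[mask]  # original indices of the misclassified reviews
wrong_df = pd.DataFrame({
    'review_text': x_test[mask],
    'true_label': y_test[mask],
    'predicted': predmnb[mask],
})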

Applying GridSearchCV on a scikit-learn pipeline [[feature selection] + [algorithm]] gives an invalid parameter error

I would like to apply GridSearchCV on a scikit-learn pipeline [[feature selection] + [algorithm]], but it gives the following error. How can I correct the code?
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import SelectFromModel

pipeline1 = Pipeline([
    ('feature_selection', SelectFromModel(svm.SVC(kernel='linear'))),
    ('filter', SelectKBest(k=11)),
    ('classification', svm.SVC(kernel='linear'))
])
grid_parameters_tune = [{'estimator__C': [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}]
model = GridSearchCV(pipeline1, grid_parameters_tune, cv=5, n_jobs=-1, verbose=1)
model.fit(X, y)
ValueError: Invalid parameter estimator for estimator Pipeline(memory=None,
    steps=[('feature_union', FeatureUnion(n_jobs=None,
    transformer_list=[('filter', SelectKBest(k=10, score_func=<function f_classif at 0x000001ECCBB3E840>)),
    ('feature_selection', SelectFromModel(estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', ...r', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False))]).
Check the list of available parameters with `estimator.get_params().keys()`.
I think the error comes from the names in your grid_parameters_tune. You are trying to access estimator__C, but there is no step named estimator in your pipeline. Renaming it to classification__C should do the trick.
If you want to access the C parameter of the SVC inside SelectFromModel, you can do so with feature_selection__estimator__C.
Below is a working example with random data. I changed some of the parameters from your original pipeline in order to save time; do not necessarily copy it directly.
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.feature_selection import SelectFromModel, SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X = pd.DataFrame(data=np.arange(1000).reshape(-1, 25))
y = np.random.binomial(1, 0.5, 1000 // 25)

pipeline1 = Pipeline(
    [
        ("feature_selection", SelectFromModel(svm.SVC(kernel="linear"))),
        ("filter", SelectKBest(k=11)),
        ("classification", svm.SVC(kernel="linear")),
    ]
)
grid_parameters_tune = [{"classification__C": [0.01, 0.1, 1.0, 10.0]}]
model = GridSearchCV(pipeline1, grid_parameters_tune, cv=3, n_jobs=-1, verbose=1)
model.fit(X, y)
As for the second way:
pipeline1 = Pipeline(
    [
        ("feature_selection", SelectFromModel(svm.SVC(kernel="linear"))),
        ("filter", SelectKBest(k=11)),
        ("classification", svm.SVC(kernel="linear")),
    ]
)
grid_parameters_tune = [{"feature_selection__estimator__C": [0.01, 0.1, 1.0, 10.0]}]
model = GridSearchCV(pipeline1, grid_parameters_tune, cv=3, n_jobs=-1, verbose=1)
model.fit(X, y)
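Nothing stops you from searching both parameters at once, for what it's worth; a sketch combining the two names in a single grid (GridSearchCV will try the full cross-product):
grid_parameters_tune = [{
    "classification__C": [0.01, 0.1, 1.0, 10.0],
    "feature_selection__estimator__C": [0.01, 0.1, 1.0, 10.0],
}]
model = GridSearchCV(pipeline1, grid_parameters_tune, cv=3, n_jobs=-1, verbose=1)
model.fit(X, y)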

Save model for later prediction (OneVsRest)

I would like to know how to save a OneVsRest classifier model for later prediction.
I have an issue saving it, since it implies saving the vectorizer as well, as I learnt in this post.
Here's the model I have created:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
vectorizer.fit(train_text)
vectorizer.fit(test_text)  # note: this second fit overwrites the fit on train_text

x_train = vectorizer.transform(train_text)
y_train = train.drop(labels=['id', 'comment_text'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels=['id', 'comment_text'], axis=1)

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

%%time
# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
])

for category in categories:
    printmd('**Processing {} comments...**'.format(category))
    # Training logistic regression model on train data
    LogReg_pipeline.fit(x_train, train[category])
    # calculating test accuracy
    prediction = LogReg_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
    print("\n")
Any help will be very much appreciated.
Sincerely,
Using joblib you can save any scikit-learn Pipeline complete with all its elements, therefore including also the fitted TfidfVectorizer.
Here I have rewritten your example using the first 200 examples of the Newsgroups20 dataset:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

data = fetch_20newsgroups()

vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')

x_train = data.data[:100]
y_train = data.target[:100]
x_test = data.data[100:200]
y_test = data.target[100:200]

# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag', class_weight='balanced'), n_jobs=-1))
])

# Training logistic regression model on train data
LogReg_pipeline.fit(x_train, y_train)
In the above code you start by defining your train and test data and instantiating your TfidfVectorizer. You then define your pipeline, comprising both the vectorizer and the OVR classifier, and fit it to the training data. It will learn to predict all the classes at once.
Now you simply save the entire fitted pipeline, as if it were a single predictor, using joblib:
from joblib import dump, load
dump(LogReg_pipeline, 'LogReg_pipeline.joblib')
Your entire model is now saved to disk under the name 'LogReg_pipeline.joblib'. You can load it back and use it directly on raw data with this code snippet:
clf = load('LogReg_pipeline.joblib')
clf.predict(x_test)
You will get the predictions on the raw text because the pipeline will vectorize it automatically.
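As a quick sanity check (a sketch reusing the y_test slice defined above), you can score the reloaded pipeline on the held-out raw text:
from sklearn.metrics import accuracy_score

clf = load('LogReg_pipeline.joblib')
print(accuracy_score(y_test, clf.predict(x_test)))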