Save model for later prediction (OneVsRest) - python

I would like to know how to save a OneVsRestClassifier model for later prediction.
I have an issue saving it, since saving the model implies saving the vectorizer as well, as I have learned in this post.
Here's the model I have created:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
vectorizer.fit(train_text)  # fit on the training text only
x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['id','comment_text'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['id','comment_text'], axis=1)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
%%time
# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
])
for category in categories:
    printmd('**Processing {} comments...**'.format(category))
    # Training logistic regression model on train data
    LogReg_pipeline.fit(x_train, train[category])
    # Calculating test accuracy
    prediction = LogReg_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
    print("\n")
Any help will be very much appreciated.
Sincerely,

Using joblib you can save any scikit-learn Pipeline complete with all of its elements, therefore including the fitted TfidfVectorizer.
Here I have rewritten your example using the first 200 examples of the 20 Newsgroups dataset:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
x_train = data.data[:100]
y_train = data.target[:100]
x_test = data.data[100:200]
y_test = data.target[100:200]
# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag',
                                                   class_weight='balanced'),
                                n_jobs=-1))
])
# Training logistic regression model on train data
LogReg_pipeline.fit(x_train, y_train)
In the above code you start by defining your train and test data and instantiating your TfidfVectorizer. You then define your pipeline comprising both the vectorizer and the OVR classifier, and you fit it to the training data. It will learn to predict all the classes at once.
Now you simply save the entire fitted pipeline as if it were a single predictor using joblib:
from joblib import dump, load
dump(LogReg_pipeline, 'LogReg_pipeline.joblib')
Your entire model is now saved to disk under the name 'LogReg_pipeline.joblib'. You can load it back and use it directly on raw data with this code snippet:
clf = load('LogReg_pipeline.joblib')
clf.predict(x_test)
You will get the predictions on the raw text because the pipeline will vectorize it automatically.
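As a quick sanity check (a minimal sketch, assuming the fitted pipeline and x_test from above are still in scope), you can verify that the reloaded pipeline reproduces the in-memory predictions:
import numpy as np
from joblib import load

reloaded = load('LogReg_pipeline.joblib')
# The reloaded pipeline should behave exactly like the one still in memory
assert np.array_equal(LogReg_pipeline.predict(x_test), reloaded.predict(x_test))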

Related

Keep model made with TFIDF for predicting new content using Scikit for Python

This is a sentiment analysis model made with tf-idf for feature extraction.
I want to know how I can save this model and reuse it.
I tried saving it this way, but when I load it, do the same pre-processing on the test text, and call fit_transform on it, it gives an error that the model expected X number of features but got Y.
This is how I saved it:
filename = "model.joblib"
joblib.dump(model, filename)
and this is the code for my tf-idf model
import pandas as pd
import re
import nltk
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
from nltk.corpus import stopwords
processed_text = ['List of pre-processed text']
y = ['List of labels']
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(processed_text).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
text_classifier = BernoulliNB()
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
Edit:
Just to be exact about where to put every line. After:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
add:
tfidf_obj = tfidfconverter.fit(processed_text)  # this is what will be used again
joblib.dump(tfidf_obj, 'tf-idf.joblib')
Then you do the rest of the steps. You will save the classifier after training as well, so after:
text_classifier.fit(X_train, y_train)
put:
joblib.dump(text_classifier, "classifier.joblib")
Now when you want to predict any text:
tf_idf_converter = joblib.load("tf-idf.joblib")
classifier = joblib.load("classifier.joblib")
Now, with a list of sentences to predict:
sent = ['List of sentences to predict']
classifier.predict(tf_idf_converter.transform(sent))
This returns a sentiment for each sentence in the list.
You can first fit tfidf to your training set using:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
tfidf_obj = tfidfconverter.fit(processed_text)
Then find a way to store the tfidf_obj, for instance using pickle or joblib, e.g.:
joblib.dump(tfidf_obj, filename)
Then load the saved tfidf_obj and apply transform only on your test set:
loaded_tfidf = joblib.load(filename)
test_new = loaded_tfidf.transform(X_test)
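Putting both answers together, a minimal end-to-end sketch (file names are just examples, and the variable names follow the question's code). The key point is that the reloaded vectorizer must only transform new text; calling fit_transform again rebuilds the vocabulary and produces a different number of features than the classifier was trained on:
import joblib

# Training time: persist the fitted vectorizer and the trained classifier
joblib.dump(tfidf_obj, 'tf-idf.joblib')
joblib.dump(text_classifier, 'classifier.joblib')

# Prediction time: reload both, then transform (never fit_transform) the new text
loaded_tfidf = joblib.load('tf-idf.joblib')
loaded_clf = joblib.load('classifier.joblib')
new_sentences = ['List of sentences to predict']
predictions = loaded_clf.predict(loaded_tfidf.transform(new_sentences).toarray())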

How to fix the error ValueError: could not convert string to float in an NLP project in Python?

I am writing Python code using a Jupyter notebook that trains and tests a dataset in order to return the correct sentiment.
The problem is that when I try to predict the sentiment of a phrase, the system crashes and displays the error below:
ValueError: could not convert string to float: 'this book was so interstening it made me not happy'
Note: I have an imbalanced dataset, so I use SMOTE to oversample it.
code:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE  # for imbalanced dataset
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer,TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.pipeline import Pipeline
df = pd.read_csv("data/Apple-Twitter-Sentiment-DFE.csv",encoding="ISO-8859-1")
df
# data is cleaned using preprocessing functions
# Solving imbalanced dataset using SMOTE
vectorizer = TfidfVectorizer()
vect_df = vectorizer.fit_transform(df["clean_text"])
oversample = SMOTE(random_state = 42)
x_smote,y_smote = oversample.fit_resample(vect_df, df["sentiment"])
print("shape x before SMOTE: {}".format(vect_df.shape))
print("shape x after SMOTE: {}".format(x_smote.shape))
print("balance of targets feild %")
y_smote.value_counts(normalize = True)*100
# split the dataset into train and test
x_train, x_test, y_train, y_test = train_test_split(x_smote, y_smote, test_size=0.2, random_state=42)
logreg = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression(n_jobs=1, C=1e5)),
])
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred))
# Make prediction
exl = "this book was so interstening it made me not happy"
logreg.predict(exl)
You should define your variable exl as follows:
exl = vectorizer.transform(["this book was so interstening it made me not happy"])
and then do the prediction.
First, put the test string in a list, and then use the fitted vectorizer, which holds the features extracted from your training data, to transform it before doing the prediction.
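As an aside, a sketch of an alternative design (not what the question's code does): if the TfidfVectorizer is placed inside the Pipeline itself, the pipeline accepts raw strings directly and the manual transform step disappears. Note this sketch leaves out the SMOTE step; keeping resampling inside a pipeline would require imblearn.pipeline.Pipeline instead:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# The vectorizer lives inside the pipeline, so fit/predict accept raw strings
logreg = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(C=1e5)),
])
logreg.fit(df["clean_text"], df["sentiment"])
logreg.predict(["this book was so interstening it made me not happy"])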

Value error when training a model with RandomForestClassifier

from sklearn import ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
import time
from sklearn import metrics
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
enc = preprocessing.OneHotEncoder()
onehotencoder = OneHotEncoder(categories='auto')
enc.fit(X)
onehotlabels = enc.transform(X).toarray()
onehotlabels.shape
clf=RandomForestClassifier(n_estimators=10)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
predict = clf.predict(X_test)
print("Evaluation on Test Set",predict)
I am doing this to train my model with a random forest classifier. I am getting the following error:
ValueError: could not convert string to float: 'gorilla'
I can't tell for sure by looking at your code, because the data structures of X, X_train, and X_test are not clear.
However, I suspect that the onehotlabels variable is not used.
If the one-hot encoding had been applied properly, the 'gorilla' string would not have been included.
So I suggest that you check whether the following code had been executed:
X_train, X_test = train_test_split(onehotlabels)
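For reference, a minimal sketch of the intended flow, assuming X holds the raw categorical features and y the labels (both names taken from the question):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Encode the categorical features first, then split the encoded matrix
enc = OneHotEncoder(handle_unknown='ignore')
onehotlabels = enc.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(onehotlabels, y, random_state=0)

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)
print("Accuracy:", metrics.accuracy_score(y_test, clf.predict(X_test)))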

How can I create an instance of a multi-layer perceptron network to use in a bagging classifier?

I am trying to create an instance of a multi-layer perceptron network to use in a bagging classifier, but I don't understand how to put them together.
Here is my code:
My task is:
1. To apply a bagging classifier (with or without replacement) with the eight base classifiers created at the previous step.
It would be really great if you could show me how to implement this in my algorithm. I searched but couldn't find a way to do it.
To train your BaggingClassifier:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Load the digits data:
X,y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
# Feature scaling
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Finally for the MLP- Multilayer Perceptron
mlp = MLPClassifier(hidden_layer_sizes=(16, 8, 4, 2), max_iter=1001)
clf = BaggingClassifier(mlp, n_estimators=8)
clf.fit(X_train,y_train)
To analyze your output you may try:
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
print(cm)
To see the number of correctly predicted instances per class:
print(cm[np.eye(len(clf.classes_)).astype("bool")])
To see the fraction of correctly predicted instances per class:
cm[np.eye(len(clf.classes_)).astype("bool")]/cm.sum(1)
To see the total accuracy of your algorithm:
(y_pred==y_test).mean()
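(For reference, this is equivalent to accuracy_score(y_test, y_pred) from sklearn.metrics.)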
EDIT
To access predictions on a per base estimator basis, i.e. your mlps, you can do:
estimators = clf.estimators_
# print(len(estimators), type(estimators[0]))
preds = []
for base_estimator in estimators:
    preds.append(base_estimator.predict(X_test))
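For instance, a small sketch using the variables above to compare each base MLP's individual test accuracy with the ensemble's:
for i, p in enumerate(preds):
    print("mlp {}: {:.3f}".format(i, (p == y_test).mean()))
print("ensemble: {:.3f}".format((clf.predict(X_test) == y_test).mean()))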

How to get the predicted probabilities of a classification model?

I'm trying out different classification models using a binary dependent variable (occupied/unoccupied). The models I am interested in are Logistic regression, Decision tree and Gaussian Naïve Bayes.
My input data is a csv-file with a datetime index (e.g. 2019-01-07 14:00), three variable columns ("R", "P", "C", containing numerical values), and the dependent variable column ("value", containing the binary values).
Training the models is not the problem; that all works fine. All the models give me their predictions in binary values (this of course should be the ultimate outcome), but I would also like to see the predicted probabilities that made them decide on either of the binary values. Is there any way to get these values as well?
I have tried all of the classification visualizers that work with the yellowbrick package (ClassBalance, ROCAUC, ClassificationReport, ClassPredictionError), but none of them gives me a graph that shows the probabilities the model calculated for the data set.
import pandas as pd
import numpy as np
data = pd.read_csv('testrooms_data.csv', parse_dates=['timestamp'])
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
## Split dataset into training and test set
X = data.drop("value", axis=1) # X contains all the features
y = data["value"] # y contains only the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1)
###model training
###Logistic Regression###
clf_lr = LogisticRegression()
# fit the dataset into LogisticRegression Classifier
clf_lr.fit(X_train, y_train)
#predict on the unseen data
pred_lr = clf_lr.predict(X_test)
###Decision Tree###
from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier()
pred_dt = clf_dt.fit(X_train, y_train).predict(X_test)
###Bayes###
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
###visualization for e.g. LogReg
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ROCAUC
#classificationreport
visualizer = ClassificationReport(clf_lr, support=True)
visualizer.fit(X_train, y_train) # Fit the visualizer and the model
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.poof() # Draw/show/poof the data
#classprediction report
visualizer2 = ClassPredictionError(LogisticRegression())
visualizer2.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer2.score(X_test, y_test) # Evaluate the model on the test data
g2 = visualizer2.poof() # Draw visualization
#(ROC)
visualizer3 = ROCAUC(LogisticRegression())
visualizer3.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer3.score(X_test, y_test) # Evaluate the model on the test data
g3 = visualizer3.poof() # Draw/show/poof the data
It would be great to have, e.g., an array similar to pred_lr that contains the probabilities calculated for each row of the csv file. Is that possible? If so, how can I get it?
In most sklearn estimators (if not all) there is a method for obtaining the probabilities behind the classification, either as log-probabilities or plain probabilities.
For example, if you have your Naive Bayes classifier and you want to obtain the probabilities rather than the classification itself, you could do the following (I used the same names as in your code):
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
#for probabilities
bayes.predict_proba(X_test)
bayes.predict_log_proba(X_test)
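The same pattern works for the logistic regression in your code; a minimal sketch (the column order follows clf_lr.classes_, so for a 0/1 target the second column is the probability of the positive class):
proba_lr = clf_lr.predict_proba(X_test)  # shape: (n_samples, 2)
positive_prob = proba_lr[:, 1]           # probability of the positive class per test row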
Hope this helps.
