Classification_report between two files - python

I'm trying to compute scores between two files. The two contain the same data but not the same labels: the labels in the train data are correct, while the labels in the test data are not necessarily so, and I would like to know the accuracy, recall and F-score.
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
df_train = pd.read_csv('train.csv', sep = ',')
df_test = pd.read_csv('teste.csv', sep = ',')
vec_train = TfidfVectorizer()
X_train = vec_train.fit_transform(df_train['text'])
y_train = df_train['label']
vec_test = TfidfVectorizer()
X_test = vec_test.fit_transform(df_train['text'])
y_test = df_test['label']
clf = LogisticRegression(penalty='l2', multi_class = 'multinomial',solver ='newton-cg')
y_pred = clf.predict(X_test)
print ("Accuracy on training set:")
print (clf.score(X_train, y_train))
print ("Accuracy on testing set:")
print (clf.score(X_test, y_test))
print ("Classification Report:")
print (metrics.classification_report(y_test, y_pred))
A toy example of the data:
TRAIN
text,label
dogs are cool,animal
flowers are beautifil,plants
pen is mine,objet
beyonce is an artist,person
TEST
text,label
dogs are cool,objet
flowers are beautifil,plants
pen is mine,person
beyonce is an artist,animal
Error:
Traceback (most recent call last):
File "accuracy.py", line 30, in
y_pred = clf.predict(X_test)
File "/usr/lib/python3/dist-packages/sklearn/linear_model/base.py", line 324, in predict
scores = self.decision_function(X)
File "/usr/lib/python3/dist-packages/sklearn/linear_model/base.py", line 298, in decision_function
"yet" % {'name': type(self).name})
sklearn.exceptions.NotFittedError: This LogisticRegression instance is not fitted yet
I just want to calculate the accuracy on the test set.

You are fitting a new TfidfVectorizer on test data. This will give wrong results. You should use the same object which you fitted on train data.
Do this:
vec_train = TfidfVectorizer()
X_train = vec_train.fit_transform(df_train['text'])
X_test = vec_train.transform(df_test['text'])
After that, as @MohammedKashif said, you need to first train your LogisticRegression model and then predict on the test set.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
After that you can use the scoring code without any errors.
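Putting both fixes together, a minimal end-to-end sketch (same file and column names as in the question):

import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

df_train = pd.read_csv('train.csv', sep=',')
df_test = pd.read_csv('teste.csv', sep=',')

# Fit the vectorizer on the training text only, then reuse it on the test text
vec = TfidfVectorizer()
X_train = vec.fit_transform(df_train['text'])
X_test = vec.transform(df_test['text'])
y_train = df_train['label']
y_test = df_test['label']

clf = LogisticRegression(penalty='l2', multi_class='multinomial', solver='newton-cg')
clf.fit(X_train, y_train)  # fit before calling predict
y_pred = clf.predict(X_test)

print("Accuracy on testing set:")
print(clf.score(X_test, y_test))
print("Classification Report:")
print(metrics.classification_report(y_test, y_pred))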

You have to first train your classifier object on X_train before using the predict function on X_test. Something like this:
clf = LogisticRegression(penalty='l2', multi_class = 'multinomial',solver ='newton-cg')
#Then train the classifier over training data
clf.fit(X_train, y_train)
#Then use predict function to make predictions
y_pred = clf.predict(X_test)

Related

Keep model made with TFIDF for predicting new content using Scikit for Python

This is a sentiment analysis model that uses tf-idf for feature extraction.
I want to know how I can save this model and reuse it.
I tried saving it the way shown below, but when I load it, do the same pre-processing on the test text, and call fit_transform on it, I get an error that the model expected X features but got Y.
This is how I saved it:
filename = "model.joblib"
joblib.dump(model, filename)
And this is the code for my tf-idf model:
import pandas as pd
import re
import nltk
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
from nltk.corpus import stopwords
processed_text = ['List of pre-processed text']
y = ['List of labels']
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(processed_text).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
text_classifier = BernoulliNB()
text_classifier.fit(X_train, y_train)
predictions = text_classifier.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
Edit:
just to be exact about where every line goes,
so after:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
then
tfidf_obj = tfidfconverter.fit(processed_text)  # this is what will be reused
joblib.dump(tfidf_obj, 'tf-idf.joblib')
Then do the rest of the steps. You will save the classifier after training as well, so after:
text_classifier.fit(X_train, y_train)
put
joblib.dump(text_classifier, "classifier.joblib")
Now, when you want to predict any text:
tf_idf_converter = joblib.load("tf-idf.joblib")
classifier = joblib.load("classifier.joblib")
Now, given a list of sentences to predict:
sent = []  # fill with the sentences you want to classify
classifier.predict(tf_idf_converter.transform(sent))
This will return the predicted sentiment for each sentence.
You can first fit tfidf to your training set using:
tfidfconverter = TfidfVectorizer(max_features=10000, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
tfidf_obj = tfidfconverter.fit(processed_text)
Then store the tfidf_obj, for instance using pickle or joblib, e.g.:
joblib.dump(tfidf_obj, filename)
Then load the saved tfidf_obj and apply transform only (not fit_transform) on your test set:
loaded_tfidf = joblib.load(filename)
test_new = loaded_tfidf.transform(X_test)
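Putting it together, a self-contained sketch of the full save/load round trip (toy data; the file names are just examples):

import joblib
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import TfidfVectorizer

processed_text = ['good movie', 'bad movie', 'great film', 'awful film']
y = ['pos', 'neg', 'pos', 'neg']

# Training side: fit the vectorizer once, then persist both objects
tfidf_obj = TfidfVectorizer().fit(processed_text)
X = tfidf_obj.transform(processed_text).toarray()
clf = BernoulliNB().fit(X, y)
joblib.dump(tfidf_obj, 'tf-idf.joblib')
joblib.dump(clf, 'classifier.joblib')

# Prediction side: load both and call transform only, never fit_transform
loaded_tfidf = joblib.load('tf-idf.joblib')
loaded_clf = joblib.load('classifier.joblib')
print(loaded_clf.predict(loaded_tfidf.transform(['nice film']).toarray()))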

How can i create an instance of multi-layer perceptron network to use in bagging classifier?

I am trying to create an instance of a multi-layer perceptron network to use in a bagging classifier, but I don't understand how to put them together.
Here is my code:
My task is:
1. To apply a bagging classifier (with or without replacement) with eight base classifiers created at the previous step.
It would be really great if you could show me how to implement this in my algorithm. I searched but couldn't find a way to do it.
To train your BaggingClassifier:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Load the digits data:
X,y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
# Feature scaling
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Finally for the MLP- Multilayer Perceptron
mlp = MLPClassifier(hidden_layer_sizes=(16, 8, 4, 2), max_iter=1001)
clf = BaggingClassifier(mlp, n_estimators=8)
clf.fit(X_train,y_train)
To analyze your output you may try:
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
print(cm)
To see num of correctly predicted instances per class:
print(cm[np.eye(len(clf.classes_)).astype("bool")])
To see percentage of correctly predicted instances per class:
cm[np.eye(len(clf.classes_)).astype("bool")]/cm.sum(1)
To see total accuracy of your algo:
(y_pred==y_test).mean()
EDIT
To access the predictions of each base estimator (i.e. your MLPs), you can do:
estimators = clf.estimators_
# print(len(estimators), type(estimators[0]))
preds = []
for base_estimator in estimators:
    preds.append(base_estimator.predict(X_test))
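For example, a quick sketch that compares each base estimator's test accuracy with the ensemble's (continuing from the variables above):

for i, p in enumerate(preds):
    print("estimator %d accuracy: %.3f" % (i, (p == y_test).mean()))
print("ensemble accuracy: %.3f" % (y_pred == y_test).mean())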

Classification metrics can't handle a mix of binary and unknown targets, how to ignore unknown targets and only consider integers?

I am solving a classification problem in which I am trying to predict the first column "gold" of my input file based on the values of the remaining columns in the same input file. My input file is under the form:
gold, callersAtLeast1T, CalleesAtLeast1T, ...
T,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
I am using probabilities to make predictions, and I choose to refrain from making a prediction when my conditions are not satisfied. In other words, when neither if (probs[i][0]>=0.8) & (probs[i][1]<0.8): nor (probs[i][0]<0.8) & (probs[i][1]>=0.8): holds, I leave y_pred equal to None instead of assigning 0 or 1. I then get the error ValueError: Classification metrics can't handle a mix of binary and unknown targets from the line print('confusion matrix\n',confusion_matrix(y_test,y_pred)), because y_pred contains a mix of integers (0 and 1) and None. I would like to ignore every position where y_pred is None and compute the confusion matrix, classification report, and accuracy only over the positions where y_pred is 0 or 1. How can I do that?
import pandas as pd
import numpy as np
dataset = pd.read_csv( 'data1extended.txt', sep= ',')
#convert T into 1 and N into 0
dataset['gold'] = dataset['gold'].astype('category').cat.codes
print(dataset.head())
row_count, column_count = dataset.shape
X = dataset.iloc[:, 1:column_count].values
y = dataset.iloc[:, 0].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
regressor = RandomForestClassifier(n_estimators=200, random_state=0)
regressor.fit(X_train, y_train)
probs = regressor.predict_proba(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
i=0
y_pred=[None]*len(y_test)
for i in range(len(probs)):
    #print('i==> ', i)
    if (probs[i][0] >= 0.8) & (probs[i][1] < 0.8):
        y_pred[i] = 0
    elif (probs[i][0] < 0.8) & (probs[i][1] >= 0.8):
        y_pred[i] = 1
    print(y_pred[i])
    print("Probabilities=%s, Predicted=%s" % (probs[i], y_pred[i]))
print(y_pred)
print('confusion matrix\n',confusion_matrix(y_test,y_pred))
print('classification report\n', classification_report(y_test,y_pred))
print('accuracy score', accuracy_score(y_test, y_pred))
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
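One way to do this is to keep only the positions where a prediction was actually made, and compute the metrics on that subset. A minimal sketch (variable names as in the code above; the same masked arrays can also be fed to the regression-style metrics at the end):

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Keep only the rows where a 0/1 prediction was made
mask = np.array([p is not None for p in y_pred])
y_test_kept = np.asarray(y_test)[mask]
y_pred_kept = np.asarray([p for p in y_pred if p is not None])

print('kept %d of %d test instances' % (mask.sum(), len(mask)))
print('confusion matrix\n', confusion_matrix(y_test_kept, y_pred_kept))
print('classification report\n', classification_report(y_test_kept, y_pred_kept))
print('accuracy score', accuracy_score(y_test_kept, y_pred_kept))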

NameError: name 'fit_classifier' is not defined

I'm trying to make a text classifier
import numpy as np
import pandas as pd
from sklearn import cross_validation
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
dataset = pd.read_csv('data.csv', encoding = 'utf-8')
data = dataset['text']
labels = dataset['label']
X_train, X_test, y_train, y_test = train_test_split (data, labels, test_size = 0.2, random_state = 0)
count_vector = CountVectorizer()
tfidf = TfidfTransformer()
classifier = OneVsOneClassifier(SVC(kernel = 'linear', random_state = 84))
train_counts = count_vector.fit_transform(X_train)
train_tfidf = tfidf.fit_transform(train_counts)
classifier.fit(train_tfidf, y_train)
test_counts = count_vector.transform(X_test)
test_tfidf = tfidf.transform(test_counts)
classifier.predict(test_tfidf)
fit_classifier(X_train, y_train)
predicted = predict(X_test)
print("confusion matrix")
print(confusion_matrix(X_test, predicted, labels = labels))
print("cross validation")
test_counts = count_vector.fit_transform(data)
test_tfidf = tfidf.fit_transform(test_counts)
scores = cross_validation.cross_val_score(classifier, test_tfidf, labels, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))
But I get the following error and I cannot understand it:
Traceback (most recent call last):
File "classificacao.py", line 37, in
fit_classifier(X_train, y_train)
NameError: name 'fit_classifier' is not defined
But isn't fit always defined by default?
You are calling a non-existent function:
fit_classifier(X_train, y_train)
To fit your classifier, you would use
classifier.fit(X_train, y_train)
instead.
You'll get the same kind of error when trying to predict on your test data. You need to change
predicted = predict(X_test)
to
predicted = classifier.predict(test_tfidf)
(the classifier was trained on the tf-idf features, so it must also predict on the transformed test features, not on the raw text).
Your confusion matrix should be given your true test labels, not your test data:
print(confusion_matrix(y_test, predicted, labels = labels))
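With those fixes applied, the middle of the script becomes (a sketch; imports and setup stay the same, and the labels argument, which here would receive duplicated values, is dropped):

train_counts = count_vector.fit_transform(X_train)
train_tfidf = tfidf.fit_transform(train_counts)
classifier.fit(train_tfidf, y_train)        # was: fit_classifier(X_train, y_train)

test_counts = count_vector.transform(X_test)
test_tfidf = tfidf.transform(test_counts)
predicted = classifier.predict(test_tfidf)  # was: predicted = predict(X_test)

print("confusion matrix")
print(confusion_matrix(y_test, predicted))  # true labels first, not X_test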

ValueError: cannot use sparse input in 'SVC' trained on dense data

I'm trying to run my classifier but I get this error
import numpy as np
import pandas as pd
from sklearn import cross_validation
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
dataset = pd.read_csv('all_topics_limpo.csv', encoding = 'utf-8')
data = pd.get_dummies(dataset['verbatim_corrige'])
labels = dataset['label']
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size = 0.2, random_state = 0)
count_vector = CountVectorizer()
tfidf = TfidfTransformer()
classifier = OneVsOneClassifier(SVC(kernel = 'linear', random_state = 100))
#classifier = LogisticRegression()
train_counts = count_vector.fit_transform(X_train)
train_tfidf = tfidf.fit_transform(train_counts)
classifier.fit(X_train, y_train)
test_counts = count_vector.transform(X_test)
test_tfidf = tfidf.transform(test_counts)
predicted = classifier.predict(test_tfidf)
predicted = classifier.predict(X_test)
print("confusion matrix")
print(confusion_matrix(y_test, predicted, labels = labels))
print("F-score")
print(f1_score(y_test, predicted))
print(precision_score(y_test, predicted))
print(recall_score(y_test, predicted))
print("cross validation")
test_counts = count_vector.fit_transform(data)
test_tfidf = tfidf.fit_transform(test_counts)
scores = cross_validation.cross_val_score(classifier, test_tfidf, labels, cv = 10)
print(scores)
print("Accuracy: {} +/- {}".format(scores.mean(), scores.std() * 2))
My output error:
ValueError: cannot use sparse input in 'SVC' trained on dense data
I cannot run my code because of this problem, and I don't understand what is happening.
Full error output:
Traceback (most recent call last):
File "classification.py", line 42, in
predicted = classifier.predict(test_tfidf)
File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 584, in predict
Y = self.decision_function(X)
File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 614, in decision_function
for est, Xi in zip(self.estimators_, Xs)]).T
File "/usr/lib/python3/dist-packages/sklearn/multiclass.py", line 614, in
for est, Xi in zip(self.estimators_, Xs)]).T
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 548, in predict
y = super(BaseSVC, self).predict(X)
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 308, in predict
X = self._validate_for_predict(X)
File "/usr/lib/python3/dist-packages/sklearn/svm/base.py", line 448, in _validate_for_predict
% type(self).__name__)
ValueError: cannot use sparse input in 'SVC' trained on dense data
You get this error because your training & test data are not of the same kind: while you train in your initial X_train set:
classifier.fit(X_train, y_train)
you are trying to get predictions from a dataset which has undergone count vectorization & tf-idf transformations first:
predicted = classifier.predict(test_tfidf)
It is puzzling why you chose to do so, why you nevertheless compute train_counts and train_tfidf (you don't seem to actually use them anywhere), and why you also try to redefine predicted as classifier.predict(X_test) immediately afterwards. Normally, changing your training line to
classifier.fit(train_tfidf, y_train)
and getting rid of your second predicted definition should work OK...
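That is, a sketch of the consistent train/predict sequence (everything else unchanged):

train_counts = count_vector.fit_transform(X_train)
train_tfidf = tfidf.fit_transform(train_counts)
classifier.fit(train_tfidf, y_train)        # train on the tf-idf features

test_counts = count_vector.transform(X_test)
test_tfidf = tfidf.transform(test_counts)
predicted = classifier.predict(test_tfidf)  # predict on the same kind of features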
Alternatively, you can convert the tf-idf matrix to a dense array before predicting:
test_tfidf = tfidf.transform(test_counts).toarray()
and then:
predicted = classifier.predict(test_tfidf)
