I was using classification_report to check the accuracy, along with the confusion matrix. I made some modifications to the code and it seems to work now:
import numpy as np
import sklearn as sk
import sklearn.metrics
from sklearn import svm
from sklearn.model_selection import train_test_split

x = np.array([17, 17.083333, 17.166667, 17.25, 17.333333, 17.416667])
x = x.reshape(6, 1)
y = [1, 0, 1, 1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
score = sk.metrics.accuracy_score(y_test, pred)
report = sk.metrics.classification_report(y_test, pred, target_names=['0', '1'])
confusionmatrix = sk.metrics.confusion_matrix(y_test, pred)
print("Accuracy_Score: " + str(score))
print("Classification_Report:\n" + report)
print("Confusion_Matrix:")
print(confusionmatrix)
output:
Accuracy_Score: 0.5
Classification_Report:
precision recall f1-score support
0 0.00 0.00 0.00 1
1 0.50 1.00 0.67 1
avg / total 0.25 0.50 0.33 2
Confusion_Matrix:
[[0 1]
[0 1]]
I converted the input "x" to a NumPy array, fixed the arguments to x.reshape, and corrected a typo: in clf.predict() you had written "Xtest" where it has to be "X_test".
Hope this helps
I wrote a very simple multiclass classifier based on the iris dataset. This is the code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
# Load the data
iris = load_iris()
X = iris.data
y = iris.target
# Use label_binarize to create multilabel-like settings
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]
# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.5, random_state=0
)
# Create classifier
classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
)
# Train the model
classifier.fit(X_train, y_train)
My goal is to predict the values of the test set in two ways:
Using the classifier.predict() function to define y_pred.
Using classifier.decision_function() to get the scores and then picking the highest one for each instance to define y_pred_.
Here is how I did it:
# Get the scores for the Test set
y_score = classifier.decision_function(X_test)
# Make predictions
y_pred = classifier.predict(X_test)
y_pred_ = label_binarize(np.argmax(y_score, axis=1), classes=[0, 1, 2])
However, when I compute the classification report I get slightly different results, while I would expect them to be the same, since the predictions are based on the scores obtained from the decision function, as can be seen in the documentation (line 789). Here are both reports:
print(classification_report(y_test, y_pred))
print(classification_report(y_test, y_pred_))
precision recall f1-score support
0 0.54 0.62 0.58 21
1 0.44 0.40 0.42 30
2 0.36 0.50 0.42 24
micro avg 0.44 0.49 0.47 75
macro avg 0.45 0.51 0.47 75
weighted avg 0.45 0.49 0.46 75
samples avg 0.39 0.49 0.42 75
precision recall f1-score support
0 0.42 0.38 0.40 21
1 0.52 0.47 0.49 30
2 0.38 0.46 0.42 24
micro avg 0.44 0.44 0.44 75
macro avg 0.44 0.44 0.44 75
weighted avg 0.45 0.44 0.44 75
samples avg 0.44 0.44 0.44 75
What am I doing wrong? Would you be able to suggest a smart and elegant solution so that both reports are identical?
For multilabel classification you should use
y_pred_ = np.where(classifier.decision_function(X_test) > 0, 1, 0)
to replicate the output of the predict() method, since in this case the different classes are not mutually exclusive, i.e. a given sample can belong to multiple classes.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report
# Load the data
iris = load_iris()
X = iris.data
y = label_binarize(iris.target, classes=[0, 1, 2])
# Split the data into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
# Create classifier
classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=0))
)
# Train the model
classifier.fit(X_train, y_train)
# Make predictions
y_pred = classifier.predict(X_test)
y_pred_ = np.where(classifier.decision_function(X_test) > 0, 1, 0)
print(classification_report(y_test, y_pred))
# precision recall f1-score support
# 0 1.00 1.00 1.00 21
# 1 0.58 0.37 0.45 30
# 2 0.95 0.83 0.89 24
# micro avg 0.85 0.69 0.76 75
# macro avg 0.84 0.73 0.78 75
# weighted avg 0.82 0.69 0.74 75
# samples avg 0.66 0.69 0.67 75
print(classification_report(y_test, y_pred_))
# precision recall f1-score support
# 0 1.00 1.00 1.00 21
# 1 0.58 0.37 0.45 30
# 2 0.95 0.83 0.89 24
# micro avg 0.85 0.69 0.76 75
# macro avg 0.84 0.73 0.78 75
# weighted avg 0.82 0.69 0.74 75
# samples avg 0.66 0.69 0.67 75
For multiclass classification you can instead use
y_pred_ = np.argmax(classifier.decision_function(X_test), axis=1)
as in your code, since in this case the different classes are mutually exclusive, i.e. each sample is assigned to only one class.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import classification_report
# Load the data
iris = load_iris()
X = iris.data
y = iris.target
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
# Create classifier
classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=0))
)
# Train the model
classifier.fit(X_train, y_train)
# Make predictions
y_pred = classifier.predict(X_test)
y_pred_ = np.argmax(classifier.decision_function(X_test), axis=1)
print(classification_report(y_test, y_pred))
# precision recall f1-score support
# 0 1.00 1.00 1.00 21
# 1 0.85 0.73 0.79 30
# 2 0.71 0.83 0.77 24
# accuracy 0.84 75
# macro avg 0.85 0.86 0.85 75
# weighted avg 0.85 0.84 0.84 75
print(classification_report(y_test, y_pred_))
# precision recall f1-score support
# 0 1.00 1.00 1.00 21
# 1 0.85 0.73 0.79 30
# 2 0.71 0.83 0.77 24
# accuracy 0.84 75
# macro avg 0.85 0.86 0.85 75
# weighted avg 0.85 0.84 0.84 75
OneVsRestClassifier assumes that you expect a multi-label result, i.e. there may be more than one positive label for a single input. The result is therefore different from using argmax with decision_function.
Try
print(y_pred[0])
print(y_pred_[0])
Output:
[0 1 1]
[0 0 1]
I have a three-class problem and I'm able to report precision and recall for each class with the below code:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
which gives me the precision and recall nicely for each of the 3 classes in a table format.
My question is how can I now get sensitivity and specificity for each of the 3 classes? I looked at sklearn.metrics and I didn't find anything for reporting sensitivity and specificity.
If we check the help page for classification report:
Note that in binary classification, recall of the positive class is
also known as “sensitivity”; recall of the negative class is
“specificity”.
So we can convert the predictions into a binary (one-vs-rest) problem for every class, and then use the recall values from precision_recall_fscore_support.
Using an example:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
Looks like:
precision recall f1-score support
class 0 0.50 1.00 0.67 1
class 1 0.00 0.00 0.00 1
class 2 1.00 0.67 0.80 3
accuracy 0.60 5
macro avg 0.50 0.56 0.49 5
weighted avg 0.70 0.60 0.61 5
Using sklearn:
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

res = []
for l in [0, 1, 2]:
    prec, recall, _, _ = precision_recall_fscore_support(np.array(y_true) == l,
                                                         np.array(y_pred) == l,
                                                         pos_label=True, average=None)
    # recall[1] is the recall of the positive class (sensitivity),
    # recall[0] is the recall of the negative class (specificity)
    res.append([l, recall[1], recall[0]])
Put the results into a dataframe:
pd.DataFrame(res, columns=['class', 'sensitivity', 'specificity'])
   class  sensitivity  specificity
0      0     1.000000         0.75
1      1     0.000000         0.75
2      2     0.666667         1.00
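If you prefer not to loop over the classes manually, a similar result can be obtained from sklearn.metrics.multilabel_confusion_matrix, which returns a one-vs-rest 2x2 confusion matrix per class. A minimal sketch, assuming the same y_true and y_pred as above:
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

# One 2x2 matrix per class, in one-vs-rest fashion:
# [[TN, FP],
#  [FN, TP]]
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
tn, fp, fn, tp = mcm[:, 0, 0], mcm[:, 0, 1], mcm[:, 1, 0], mcm[:, 1, 1]

sensitivity = tp / (tp + fn)  # recall of the positive class
specificity = tn / (tn + fp)  # recall of the negative class
print(sensitivity)  # approximately [1.   0.   0.67]
print(specificity)  # [0.75 0.75 1.  ]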
The classification report's output is a formatted string. This code snippet extracts the required values and stores them in a 2-D list.
Note: to understand the code better, add print statements to check the variable values.
y = classification_report(y_test, y_pred)  # classification report's output is a string
lines = y.split('\n')  # extract every line and store the lines in a list
res = []  # list to store the cleaned results
for i in range(len(lines)):
    line = lines[i].split(" ")  # values are separated by blanks; split at the blank spaces
    line = [j for j in line if j != '']  # keep only the non-empty values
    if len(line) != 0:
        # empty lines get added as empty lists; remove those
        res.append(line)
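As an alternative to parsing the string, classification_report can also return a nested dict directly via output_dict=True (available in recent scikit-learn versions); a minimal sketch, assuming the same y_test and y_pred as above:
from sklearn.metrics import classification_report

# output_dict=True returns a nested dict instead of a formatted string,
# so no string parsing is needed
report = classification_report(y_test, y_pred, output_dict=True)
print(report['macro avg']['f1-score'])
print(report['0']['recall'])  # per-class metrics are keyed by the label as a string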
I made up an Excel sheet of random numbers (3000 rows and 6 columns) and set it so that any row with a B column >= 50, a C column of 0 and an E column of 1 gets a final 'y' value of 1; otherwise it gets a 0. I ran this through the RandomForestClassifier code below and it doesn't work: it either returns 0 for all new test data or doesn't even take the B column into account when predicting. How can I solve this?
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pickle
data_crd = pd.read_csv(r'C:\Users\Rada1\.spyder-py3\new_created_data.csv')
#C:\Users\Rada1\.spyder-py3\new_created_data.csv
data_crd.head()
X = data_crd.iloc[:,1:5]
y = data_crd.iloc[:,5]
#print (X)
#print (y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
classifier = RandomForestClassifier(n_estimators=500, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
with open('model_wcd', 'wb') as f:
    pickle.dump(classifier, f)
I get a 100% accuracy rate as my result, which already feels wrong. What do I need to adjust?
precision recall f1-score support
0 1.00 1.00 1.00 515
1 1.00 1.00 1.00 85
accuracy 1.00 600
macro avg 1.00 1.00 1.00 600
weighted avg 1.00 1.00 1.00 600
[[515 0]
[ 0 85]]
1.0
If you use stratify=y when splitting, it might work:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
Also consider using a MinMaxScaler for the numerical features and reshaping them to (-1, 1); here num_feature stands for a MinMaxScaler already fitted on the corresponding training column (see the sketch below):
x_train_num = num_feature.transform(x_train[column_name].values.reshape(-1, 1))
x_test_num = num_feature.transform(x_test[column_name].values.reshape(-1, 1))
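For completeness, a minimal sketch of how those two suggestions could fit together; column_name is a hypothetical placeholder for one of your numeric columns, and X, y are the frames from the question:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stratified split keeps the 0/1 class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hypothetical numeric column scaled to [0, 1]; fit the scaler on the training data only
num_feature = MinMaxScaler()
x_train_num = num_feature.fit_transform(X_train[column_name].values.reshape(-1, 1))
x_test_num = num_feature.transform(X_test[column_name].values.reshape(-1, 1))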
I'm trying to generate a line graph with x and y axes demonstrating the accuracy of 2 different algorithms running a classification - Naive Bayes and SVM.
I train/test the data like this:
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(result['post'], result['type'], test_size=0.30, random_state=1)
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # transform only; the encoder is already fitted on train_y
def tokenizersplit(text):
    return text.split()
tfidf_vect = TfidfVectorizer(tokenizer=tokenizersplit, encoding='utf-8', min_df=2, ngram_range=(1, 2), max_features=25000)
tfidf_vect.fit(result['post'])
tfidf_vect.transform(result['post'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
def train_model(classifier, trains, t_labels, valids, v_labels):
    # fit the training dataset on the classifier
    classifier.fit(trains, t_labels)
    # predict the labels on validation dataset
    predictions = classifier.predict(valids)
    return metrics.accuracy_score(predictions, v_labels)
# Naive Bayes
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print ("NB accuracy: ", accuracy)
However for an assignment I need something plotted on the x/y axis using matplotlib. I tried this:
m=linear_model.LogisticRegression()
m.fit(xtrain_tfidf, train_y)
y_pred = m.predict(xvalid_tfidf)
print(metrics.classification_report(valid_y, y_pred))
plt.plot(valid_y, y_pred)
plt.show()
But this just plots valid_y against y_pred, which is not what I need. I need something that can more easily compare the accuracy of Naive Bayes vs SVM vs another algorithm. How can I do this?
Plotting classification report:
plt.plot(metrics.classification_report(valid_y, y_pred))
plt.show()
My classification output:
precision recall f1-score support
0 1.00 0.18 0.31 11
1 0.00 0.00 0.00 14
2 0.00 0.00 0.00 19
3 0.50 0.77 0.61 66
4 0.39 0.64 0.49 47
5 0.00 0.00 0.00 23
accuracy 0.46 180
macro avg 0.32 0.27 0.23 180
weighted avg 0.35 0.46 0.37 180
Error with the suggested edit:
df = pd.DataFrame(metrics.classification_report(valid_y, y_pred)).transpose()
gives the error:
ValueError: DataFrame constructor not properly called!
metrics.classification_report summarizes the prediction results, so it is not meant for plotting, only for printing a "report". If you want the table in a visual format you can follow https://stackoverflow.com/a/34304414/4005668.
Otherwise you can get a dataframe by capturing the report in a dataframe:
import pandas as pd
# put it in a dataframe (output_dict=True makes classification_report return a dict instead of a string)
df = pd.DataFrame(metrics.classification_report(valid_y, y_pred, output_dict=True)).transpose()
# plot the dataframe
df.plot()
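Since the actual goal is to compare the accuracy of Naive Bayes vs SVM vs another algorithm, a bar chart of accuracies is often easier to read than plotting the report itself. Here is a minimal sketch that reuses the train_model() helper and the TF-IDF matrices from the question; the choice of LinearSVC and LogisticRegression is only illustrative:
import matplotlib.pyplot as plt
from sklearn import naive_bayes, svm, linear_model

# Train each model with the question's train_model() helper and collect the accuracies
models = {
    'Naive Bayes': naive_bayes.MultinomialNB(),
    'SVM': svm.LinearSVC(),
    'Logistic Regression': linear_model.LogisticRegression(),
}
accuracies = {name: train_model(clf, xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
              for name, clf in models.items()}

# One bar per algorithm makes the comparison immediate
plt.bar(list(accuracies.keys()), list(accuracies.values()))
plt.ylabel('Accuracy')
plt.show()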
I have a strange problem: I have a model with 4 clusters, and the data is unbalanced with the following proportions: 75%, 15%, 7% and 3%. I split it into train and test with an 80/20 proportion, then I train a KNN with 5 neighbors, which gives me an accuracy of 1.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
train_index, test_index = next(sss.split(X, y))
x_train, y_train = X[train_index], y[train_index]
x_test, y_test = X[test_index], y[test_index]
KNN_final = KNeighborsClassifier()
KNN_final.fit(x_train, y_train)
y_pred = KNN_final.predict(x_test)
print('Avg. accuracy for all classes:', metrics.accuracy_score(y_test, y_pred))
print('Classification report: \n',metrics.classification_report(y_test, y_pred, digits=2))
Avg. accuracy for all classes: 1.0
Classification report:
precision recall f1-score support
0 1.00 1.00 1.00 140
1 1.00 1.00 1.00 60
2 1.00 1.00 1.00 300
3 1.00 1.00 1.00 1500
accuracy 1.00 2000
macro avg 1.00 1.00 1.00 2000
weighted avg 1.00 1.00 1.00 2000
Although it seems strange, I keep going, get new data and try to classify it based on this model, but it never finds the class with the smallest percentage; it always misclassifies it as the second smallest class.
So I try to balance the data using the imbalanced-learn library with the SMOTEENN algorithm:
Original dataset shape Counter({3: 7500, 2: 1500, 0: 700, 1: 300})
sme = SMOTEENN(sampling_strategy='all', random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({0: 7500, 1: 7500, 2: 7500, 3: 7500})
Then I do the same thing: split it into train and test with the same 80/20 proportion and train a new KNN classifier with 5 neighbors. But the classification report seems even worse now:
Avg. accuracy for all classes: 1.0
Classification report:
precision recall f1-score support
0 1.00 1.00 1.00 1500
1 1.00 1.00 1.00 500
accuracy 1.00 2000
macro avg 1.00 1.00 1.00 2000
weighted avg 1.00 1.00 1.00 2000
I don't see what I'm doing wrong. Is there any process I need to do after resampling the data, other than splitting and shuffling, before training a new classifier? Why is my KNN not seeing 4 classes now?
Although a full investigation requires your data, which you do not provide, such behavior is (at least partially) consistent with the following scenario:
You have duplicates (possibly a lot) in your initial data
Due to these duplicates, some (most? all?) of your test data are actually not new/unseen, but copies of samples in your training data, which leads to an unreasonably high test accuracy of 1.0
When adding new data (no duplicates of your initial ones), the model unsurprisingly fails to fulfill the expectations created from such a high accuracy (1.0) in the test data.
Notice that the stratified split will not protect you from such a scenario; here is a demonstration with toy data, adapted from the documentation:
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 1, 0, 1, 0, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
train_index, test_index = next(sss.split(X, y))
X[train_index]
# result:
array([[3, 4],
[1, 2],
[3, 4]])
X[test_index]
# result:
array([[3, 4],
[1, 2],
[1, 2]])
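One quick way to check whether this scenario applies to your case is to count duplicate rows in the feature matrix before splitting; a minimal sketch, assuming X is the feature array from the question:
import numpy as np

# Many duplicate rows mean that, after the split, many test samples may be
# exact copies of training samples, inflating the test accuracy
n_total = len(X)
n_unique = len(np.unique(X, axis=0))
print(f"{n_total - n_unique} duplicate rows out of {n_total}")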