print function shifts the output result - python

I don't know if it is explicit in the title, but I want to print a classification_report.
I want to label it as belonging to the test set, like this:
print(f'Test classification report :{classification_report(y_test, y_pred)}')
But it gives this output with the 4 column names shifted:
Test classification report :              precision    recall  f1-score   support

           0       0.68      0.50      0.57       187
           1       0.79      0.89      0.84       407

    accuracy                           0.77       594
   macro avg       0.74      0.69      0.71       594
weighted avg       0.76      0.77      0.76       594
Thanks

\n is the way to go. It adds a newline between your label and the report, so the header row stays aligned with the columns below it.
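For example:
print(f'Test classification report :\n{classification_report(y_test, y_pred)}')
The newline pushes the report onto its own lines, so the header is no longer offset by the label.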

Related

How to plot 2 classification report results in one graph in python

I have two classification report results (from 2 different models), and I want to use a bar chart to plot them in one single graph. How can I do this?
Sample classification report result:
              precision    recall  f1-score   support

       False       0.94      0.95      0.95     10078
        True       0.95      0.94      0.94     10078

    accuracy                           0.94     20156
   macro avg       0.94      0.94      0.94     20156
weighted avg       0.94      0.94      0.94     20156
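A minimal sketch of one way to do this (my own suggestion, not from the original thread): classification_report accepts output_dict=True, so each report can be turned into a dict and the per-class f1-scores plotted as grouped bars with pandas. The toy labels and the pred_a/pred_b predictions below are placeholders for your two models' outputs.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

# Toy stand-ins for the two models' predictions; replace with your own.
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 500)
pred_a = np.where(rng.random(500) < 0.9, y_test, 1 - y_test)  # ~90% correct
pred_b = np.where(rng.random(500) < 0.8, y_test, 1 - y_test)  # ~80% correct

rep_a = classification_report(y_test, pred_a, output_dict=True)
rep_b = classification_report(y_test, pred_b, output_dict=True)

# Keep the per-class and average rows; 'accuracy' maps to a bare float, so skip it.
rows = [k for k in rep_a if k != 'accuracy']
df = pd.DataFrame({'model A': {k: rep_a[k]['f1-score'] for k in rows},
                   'model B': {k: rep_b[k]['f1-score'] for k in rows}})
df.plot.bar(rot=0)
plt.ylabel('f1-score')
plt.show()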

sklearn.metrics.classification_report support field is showing wrong number for the labels

I am using sklearn.metrics.classification_report to evaluate the result of my classification.
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)
print(classification_report(y_true, y_pred, target_names=list(le.classes_)))
And here is my result:
               precision    recall  f1-score   support

   Technology       0.00      0.00      0.00         1
       Travel       0.00      0.00      0.00         5
      Fashion       0.00      0.00      0.00        25
Entertainment       0.72      1.00      0.84       130
          Art       0.00      0.00      0.00         7
      Politic       0.00      0.00      0.00        12

  avg / total       0.52      0.72      0.61       180
The problem is that I actually have 7 labels, in the order Technology, Travel, Fashion, Entertainment, Art, Politic, Sports. There is no Art label in my y_true at all, yet the report lists Art and skips Sports: the result for Politic appears in Art's row, and the result for Sports ends up in Politic's row.
Why doesn't it skip Art? I have no idea how to solve this.
The row labels in the classification report are just the values of the target_names argument, applied in order to whichever labels actually occur in your data. Please make sure you are giving the right values to that argument.
According to your required output, you should have
target_names = ["Technology", "Travel", "Fashion", "Entertainment", "Politic", "Sports"]
I would also suggest checking the output of your le.classes_; I am not sure which transformer le refers to.
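Alternatively (my addition, assuming le is a LabelEncoder so the classes are encoded as 0..6), you can pass the labels argument so the report keeps a row for every class, including ones absent from y_true, instead of silently dropping them and shifting the names:

import numpy as np
from sklearn.metrics import classification_report

# labels= pins one row per encoded class, so an absent class such as Art
# gets its own zero-support row and Sports is no longer pushed out.
print(classification_report(y_true, y_pred,
                            labels=np.arange(len(le.classes_)),
                            target_names=list(le.classes_)))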

all classifiers are predicting "bad" positives

I'm classifying my data using several algorithms including
KNN, LogisticRegression, RandomForest, DecisionTreeClassifier, GaussianNB, etc.
After fitting my data I am analyzing results using the following:
from sklearn.metrics import confusion_matrix, classification_report
classification_report(y_test, predicted)
I'm not totally clear on the semantics of "predicted positive / negative" et al. with respect to which label it is trying to predict.
Also, and maybe more importantly, I don't understand, and am trying to analyze, why all of the various algorithms predict relatively well in the "Predicted Negative / True Negative vs Predicted Negative / True Positive" portions but very badly in the "Predicted Positive" portion.
In other words, from my understanding they are quite good at saying "not something" but basically tossing a coin at predicting "is something" (around 50-50).
Here are some example classification reports I generated for the different techniques:
confusion matrix (knn)

                   Predicted Negative   Predicted Positive
True Negative                   14776                 5442
True Positive                    2367                 6337

             precision    recall  f1-score   support

          f       0.73      0.86      0.79     17143
          t       0.73      0.54      0.62     11779

avg / total       0.73      0.73      0.72     28922

confusion matrix (SVM)

                   Predicted Negative   Predicted Positive
True Negative                   14881                 4947
True Positive                    2262                 6832

             precision    recall  f1-score   support

          f       0.75      0.87      0.81     17143
          t       0.75      0.58      0.65     11779

avg / total       0.75      0.75      0.74     28922

confusion matrix (logistic regression)

                   Predicted Negative   Predicted Positive
True Negative                   14881                 4947
True Positive                    2262                 6832

             precision    recall  f1-score   support

          f       0.75      0.87      0.81     17143
          t       0.75      0.58      0.65     11779

avg / total       0.75      0.75      0.74     28922

confusion matrix (decision tree)

                   Predicted Negative   Predicted Positive
True Negative                   14852                 4941
True Positive                    2291                 6838

             precision    recall  f1-score   support

          f       0.75      0.87      0.80     17143
          t       0.75      0.58      0.65     11779

avg / total       0.75      0.75      0.74     28922

confusion matrix (naive_bayes)

                   Predicted Negative   Predicted Positive
True Negative                   13435                 4759
True Positive                    3708                 7020

             precision    recall  f1-score   support

          f       0.74      0.78      0.76     17143
          t       0.65      0.60      0.62     11779

avg / total       0.70      0.71      0.70     28922

confusion matrix (random_forest)

                   Predicted Negative   Predicted Positive
True Negative                   13287                 5248
True Positive                    3856                 6531

             precision    recall  f1-score   support

          f       0.72      0.78      0.74     17143
          t       0.63      0.55      0.59     11779

avg / total       0.68      0.69      0.68     28922

confusion matrix (gradient_boost)

                   Predicted Negative   Predicted Positive
True Negative                   15071                 5583
True Positive                    2072                 6196

             precision    recall  f1-score   support

          f       0.73      0.88      0.80     17143
          t       0.75      0.53      0.62     11779

avg / total       0.74      0.74      0.72     28922

confusion matrix (neural network MLPClassifier)

                   Predicted Negative   Predicted Positive
True Negative                   10789                 3653
True Positive                    6354                 8126

             precision    recall  f1-score   support

          f       0.75      0.63      0.68     17143
          t       0.56      0.69      0.62     11779

avg / total       0.67      0.65      0.66     28922
The only one that seems to handle the "Predicted Positive" portion reasonably is the MLPClassifier.
Sorry, I don't know what your dataset looks like. But let's say there is a coin-flipping experiment with two possible outcomes, heads (1) or tails (0), and we build a model to predict the outcome based on a bunch of possible features.
If the prediction matches the class label, we count it as true; if not, it is a false record.
If the algorithm predicts "heads", that is regarded as a positive result, and "tails" as negative.
The "True Positive" count alone tells you little. But add it to "False Negative" and the sum is the total number of actual positive cases.
If we divide "True Positive" by that total of positive cases, we get what is normally called "recall" or the TP rate: the model's accuracy at predicting the positive (heads) case.
We can compare the TP rate (TP/P) with the FP rate (FP/N) to analyze the performance of a given model.
There are other combinations and uses of these positive, negative, true, and false counts as well, such as sensitivity and specificity.
If you want to know more, I would recommend looking at the ROC curve.
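A minimal sketch of those rates with toy labels (not the questioner's data), using sklearn's confusion_matrix, whose ravel() order for binary labels is tn, fp, fn, tp:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # recall / TP rate: fraction of actual positives caught
fpr = fp / (fp + tn)  # FP rate: fraction of actual negatives falsely flagged
print(f'TPR (recall) = {tpr:.2f}, FPR = {fpr:.2f}')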

Should stats.norm.pdf give the same result as stats.gaussian_kde in Python?

I was trying to estimate the PDF of 1-D data using gaussian_kde. However, when I plot the pdf using stats.norm.pdf, it gives me a different result. Please correct me if I am wrong, but I think they should give quite similar results. Here's my code.
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

npeaks = 9
mean = np.array([0.2, 0.3, 0.38, 0.55, 0.65, 0.7, 0.75, 0.8, 0.82])  # peak locations
support = np.arange(0, 1.01, 0.01)
std = 0.03
pkfun = sum(stats.norm.pdf(support, loc=mean[i], scale=std) for i in range(0, npeaks))
df = pd.DataFrame(support)
X = df.iloc[:, 0]
min_x, max_x = X.min(), X.max()
# plot the summed normal pdfs
plt.figure(1)
plt.plot(support, pkfun)
# fit a KDE to the support values and evaluate it on a 100-point grid
kernel = stats.gaussian_kde(X)
grid = 100j
X = np.mgrid[min_x:max_x:grid]
Z = np.reshape(kernel(X), X.shape)
# plot KDE
plt.figure(2)
plt.plot(X, Z)
plt.show()
Also, the first derivative of the stats.gaussian_kde estimate was far from the original signal, while the first derivative of stats.norm.pdf does make sense. So I am assuming I might have an error in my code above.
Value of X= np.mgrid[min_x:max_x:grid]:
[
0. 0.01010101 0.02020202 0.03030303 0.04040404 0.05050505
0.06060606 0.07070707 0.08080808 0.09090909 0.1010101 0.11111111
0.12121212 0.13131313 0.14141414 0.15151515 0.16161616 0.17171717
0.18181818 0.19191919 0.2020202 0.21212121 0.22222222 0.23232323
0.24242424 0.25252525 0.26262626 0.27272727 0.28282828 0.29292929
0.3030303 0.31313131 0.32323232 0.33333333 0.34343434 0.35353535
0.36363636 0.37373737 0.38383838 0.39393939 0.4040404 0.41414141
0.42424242 0.43434343 0.44444444 0.45454545 0.46464646 0.47474747
0.48484848 0.49494949 0.50505051 0.51515152 0.52525253 0.53535354
0.54545455 0.55555556 0.56565657 0.57575758 0.58585859 0.5959596
0.60606061 0.61616162 0.62626263 0.63636364 0.64646465 0.65656566
0.66666667 0.67676768 0.68686869 0.6969697 0.70707071 0.71717172
0.72727273 0.73737374 0.74747475 0.75757576 0.76767677 0.77777778
0.78787879 0.7979798 0.80808081 0.81818182 0.82828283 0.83838384
0.84848485 0.85858586 0.86868687 0.87878788 0.88888889 0.8989899
0.90909091 0.91919192 0.92929293 0.93939394 0.94949495 0.95959596
0.96969697 0.97979798 0.98989899 1. ]
Value of X = df.iloc[:,0]:
[ 0. 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11
0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23
0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35
0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47
0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71
0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83
0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95
0.96 0.97 0.98 0.99 1. ]
In the line below you evaluate a pdf at each of the 9 peak locations along the support points with std = 0.03, so you get one array per peak, which you then sum element-wise:

pkfun = sum(stats.norm.pdf(support, loc=mean[i], scale=std) for i in range(0, npeaks))

Thus you get a graph with 9 narrow bumps (narrow because std = 0.03). Are you sure that was your purpose with this line?
This will never resemble the kernel estimate, because you fit gaussian_kde to the evenly spaced support values themselves rather than to data actually sampled from the 9-peak distribution.
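A minimal sketch of what would make the two curves agree (my assumption about the intent, not part of the original answer): draw actual samples from the 9-peak mixture, fit gaussian_kde to those samples, and compare against the sum of norm.pdf curves normalized by the number of peaks:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

mean = np.array([0.2, 0.3, 0.38, 0.55, 0.65, 0.7, 0.75, 0.8, 0.82])
std = 0.03
support = np.arange(0, 1.01, 0.01)

# Sample from the equal-weight mixture: pick a peak, then add Gaussian noise.
rng = np.random.default_rng(0)
samples = rng.normal(loc=rng.choice(mean, size=5000), scale=std)

pkfun = sum(stats.norm.pdf(support, loc=m, scale=std) for m in mean)
kde = stats.gaussian_kde(samples)

plt.plot(support, pkfun / len(mean), label='mixture pdf (norm.pdf sum / 9)')
plt.plot(support, kde(support), label='gaussian_kde of mixture samples')
plt.legend()
plt.show()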

python matplotlib plotfile: explicitly use floating-point numbers

I have a simple data file to plot.
Here is the content of the data file, which I named "ttry":
0.27 0
0.28 0
0.29 0
0.3 0
0.31 0
0.32 0
0.33 0
0.34 0
0.35 0
0.36 0
0.37 0
0.38 0.00728737997257
0.39 0.0600137174211
0.4 0.11488340192
0.41 0.157321673525
0.42 0.193158436214
0.43 0.233882030178
0.44 0.273319615912
0.45 0.311556927298
0.46 0.349879972565
0.47 0.387602880658
0.48 0.424211248285
0.49 0.460390946502
0.5 0.494855967078
0.51 0.529406721536
0.52 0.561814128944
0.53 0.594307270233
0.54 0.624228395062
0.55 0.654492455418
0.56 0.683984910837
0.57 0.711762688615
0.58 0.739368998628
0.59 0.765775034294
0.6 0.790895061728
0.61 0.815586419753
0.62 0.840192043896
0.63 0.863082990398
0.64 0.886231138546
0.65 0.906292866941
0.66 0.915809327846
0.67 0.911436899863
0.68 0.908179012346
0.69 0.904749657064
0.7 0.899519890261
0.71 0.895147462277
0.72 0.891632373114
0.73 0.888803155007
0.74 0.884687928669
0.75 0.879029492455
0.76 0.876114540466
0.77 0.872170781893
0.78 0.867541152263
0.79 0.86274005487
0.8 0.858367626886
0.81 0.854080932785
0.82 0.850994513032
0.83 0.997170781893
0.84 1.13477366255
0.85 1.24296982167
0.86 1.32690329218
0.87 1.40397805213
0.88 1.46836419753
0.89 1.52306241427
0.9 1.53232167353
0.91 1.52906378601
0.92 1.52211934156
0.93 1.516718107
0.94 1.51543209877
0.95 1.50660150892
0.96 1.50137174211
0.97 1.49408436214
0.98 1.48816872428
0.99 1.48088134431
1 1.4723079561
Then I use matplotlib.pyplot.plotfile to plot it. Here is my Python script:
from matplotlib import pyplot
pyplot.plotfile("ttry", cols=(0,1), delimiter=" ")
pyplot.show()
However, the following error appears:
C:\WINDOWS\system32\cmd.exe /c ttry.py
Traceback (most recent call last):
File "E:\research\ttry.py", line 2, in <module>
pyplot.plotfile("ttry",col=(0,1),delimiter=" ")
File "C:\Python33\lib\site-packages\matplotlib\pyplot.py", line 2311, in plotfile
checkrows=checkrows, delimiter=delimiter, names=names)
File "C:\Python33\lib\site-packages\matplotlib\mlab.py", line 2163, in csv2rec
rows.append([func(name, val) for func, name, val in zip(converters, names, row)])
File "C:\Python33\lib\site-packages\matplotlib\mlab.py", line 2163, in <listcomp>
rows.append([func(name, val) for func, name, val in zip(converters, names, row)])
File "C:\Python33\lib\site-packages\matplotlib\mlab.py", line 2031, in newfunc
return func(val)
ValueError: invalid literal for int() with base 10: '0.00728737997257'
shell returned 1
Hit any key to close this window...
Obviously, Python just considers the y-axis data as int. So how do I tell Python that the y-axis data are floats?
plotfile infers the type of your second column from its first few values, which here are all ints. To make it check all rows, add checkrows = 0 to the arguments, that is:
pyplot.plotfile("ttry", cols=(0,1), delimiter=" ", checkrows=0)
The argument comes from matplotlib.mlab.csv2rec; see its documentation for more info.
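Worth noting (my addition, not from the original answer): pyplot.plotfile was later deprecated and removed in modern Matplotlib, so on current versions the equivalent is to load the file with numpy, which parses every column as float by default:

import numpy as np
import matplotlib.pyplot as plt

# np.loadtxt splits on whitespace and returns floats, so no type
# inference problem arises.
x, y = np.loadtxt("ttry", unpack=True)
plt.plot(x, y)
plt.show()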
