I am using the code below to calculate the F1 score for my dataset:
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer
m = MultiLabelBinarizer().fit(y_test_true_f)
print("F1-score is : {:.1%}".format(f1_score(m.transform(y_test_true_f),
m.transform(y_pred_f),
average='macro')))
and the classification report:
from sklearn.metrics import classification_report
print(classification_report(m.transform(y_test_true_f), m.transform(y_pred_f)))
but the output of the classification report does not show the label names
precision recall f1-score support
0 0.88 1.00 0.94 15
1 1.00 0.95 0.98 22
2 0.82 0.74 0.78 19
3 0.90 0.85 0.88 33
4 0.68 0.87 0.76 15
5 0.94 0.98 0.96 46
6 0.83 0.94 0.88 16
7 0.33 0.86 0.48 7
8 0.95 0.90 0.92 20
9 0.67 1.00 0.80 10
10 0.91 0.83 0.87 12
11 0.29 0.33 0.31 6
12 0.25 0.40 0.31 5
13 0.00 0.00 0.00 3
14 0.88 1.00 0.93 7
15 0.50 0.75 0.60 8
16 0.50 1.00 0.67 1
17 1.00 1.00 1.00 10
18 0.80 1.00 0.89 8
19 0.89 1.00 0.94 17
20 0.88 1.00 0.94 15
21 0.86 0.80 0.83 15
22 0.71 0.79 0.75 19
23 0.65 1.00 0.79 11
24 0.74 0.82 0.78 17
25 1.00 1.00 1.00 11
26 0.75 0.86 0.80 14
How shall I update my code to see the label names instead of the numbers 0, 1, 2, 3, ...?
According to the output there are 27 classes in the dataset, if I am not wrong. To get the class names you need the mapping between the labels and the numeric indices 0, 1, 2, ..., because MultiLabelBinarizer transforms each label into a numeric position.
That mapping is exposed by the .classes_ attribute, which you can pass as the target_names parameter of classification_report:
print(classification_report(m.transform(y_test_true_f), m.transform(y_pred_f), target_names=m.classes_))
This should show the class labels instead of the numbers.
Specify them as target_names when calling classification_report.
From their examples:
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 2]
>>> y_pred = [0, 0, 2, 2, 1]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
precision recall f1-score support
class 0 0.50 1.00 0.67 1
class 1 0.00 0.00 0.00 1
class 2 1.00 0.67 0.80 3
accuracy 0.60 5
macro avg 0.50 0.56 0.49 5
weighted avg 0.70 0.60 0.61 5
My class names are lengthy, so when I save the classification report from sklearn to a txt file, the formatting looks messy. For example:
precision recall f1-score support
カップ 0.96 0.94 0.95 69
セット 0.70 0.61 0.65 23
パウチ 0.96 0.92 0.94 53
ビニール容器 0.53 0.47 0.50 19
ビン 0.91 0.90 0.90 305
プラ容器(ヤクルト型) 0.69 0.80 0.74 25
プラ容器(ヨーグルト型) 0.94 0.53 0.68 32
ペットボトル 0.98 0.98 0.98 1189
ペットボトル(ミニ) 0.71 0.74 0.72 23
ボトル缶 0.93 0.89 0.91 96
ポーション 0.80 0.52 0.63 23
箱(飲料) 0.76 0.77 0.76 134
紙パック(Pキャン) 0.86 0.69 0.76 35
紙パック(キャップ付き) 0.93 0.80 0.86 126
紙パック(ゲーブルトップ) 0.85 0.93 0.88 54
紙パック(ブリックパック) 0.84 0.90 0.87 277
紙パック(円柱型) 0.90 0.56 0.69 16
缶 0.89 0.96 0.92 429
accuracy 0.91 2928
macro avg 0.84 0.77 0.80 2928
weighted avg 0.91 0.91 0.91 2928
Is there any way to adjust the spacing between the metrics so that the alignment stays the same for all rows and columns?
I have looked into the sklearn.metrics.classification_report parameters, but I have not found anything to adjust spacing.
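There is indeed no spacing parameter, but one workaround (a sketch, not a built-in classification_report feature) is to request the report as a dict with output_dict=True and let pandas do the alignment; pandas can also be told to treat East Asian wide characters as two cells:

```python
import pandas as pd
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]

# output_dict=True returns nested dicts instead of a pre-formatted string
report = classification_report(
    y_true, y_pred,
    target_names=["class 0", "class 1", "class 2"],
    output_dict=True,
)

# pandas aligns the columns no matter how long the row labels are
report_df = pd.DataFrame(report).transpose()

# count East Asian wide characters as two cells when aligning
pd.set_option("display.unicode.east_asian_width", True)

with open("report.txt", "w", encoding="utf-8") as f:
    f.write(report_df.to_string())
```

The same per-class numbers are all in the DataFrame, so you can also round or reorder them before writing.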
I have a dataframe with multiple columns holding numerical float values. I want to give a fractional weight to each column, compute the weighted average, and append it as a new column to the same df.
Let's say we have the columns: s1, s2, s3
I want to give the weights: w1, w2, w3 to them respectively
I was able to do this manually while experimenting with all the values at hand, but when I move to a list of weights and try to iterate, I get an error. I have attached my iteration attempt below, along with the manual code that worked.
Code which didn't work:
score_df["weighted_avg"] += weight * score_df[feature]
Manual Code which worked but not with lists:
df["weighted_scores"] = 0.5*df["s1"] + 0.25*df["s2"] + 0.25*df["s3"]
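For reference, the loop version fails because the weighted_avg column is never initialised before +=. A minimal sketch of the fixed loop (the feature and weight lists here are hypothetical stand-ins for your own):

```python
import pandas as pd

score_df = pd.DataFrame({"s1": [1.0, 2.0], "s2": [3.0, 4.0], "s3": [5.0, 6.0]})

features = ["s1", "s2", "s3"]
weights = [0.5, 0.25, 0.25]

# initialise the column first; += on a column that does not exist raises a KeyError
score_df["weighted_avg"] = 0.0
for feature, weight in zip(features, weights):
    score_df["weighted_avg"] += weight * score_df[feature]
```

The vectorised answers below avoid the explicit loop entirely.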
We can use numpy broadcasting for this, since weights has the same length as your column axis:
import numpy as np
import pandas as pd

# given the following example df
df = pd.DataFrame(np.random.rand(10, 3), columns=["s1", "s2", "s3"])
print(df)
s1 s2 s3
0 0.49 1.00 0.50
1 0.65 0.87 0.75
2 0.45 0.85 0.87
3 0.91 0.53 0.30
4 0.96 0.44 0.50
5 0.67 0.87 0.24
6 0.87 0.41 0.29
7 0.06 0.15 0.73
8 0.76 0.92 0.69
9 0.92 0.28 0.29
weights = [0.5, 0.25, 0.25]
df["weighted_scores"] = df.mul(weights).sum(axis=1)
print(df)
s1 s2 s3 weighted_scores
0 0.49 1.00 0.50 0.62
1 0.65 0.87 0.75 0.73
2 0.45 0.85 0.87 0.66
3 0.91 0.53 0.30 0.66
4 0.96 0.44 0.50 0.71
5 0.67 0.87 0.24 0.61
6 0.87 0.41 0.29 0.61
7 0.06 0.15 0.73 0.25
8 0.76 0.92 0.69 0.78
9 0.92 0.28 0.29 0.60
You can use DataFrame.dot:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,3), columns=["s1", "s2", "s3"])
df['weighted_scores'] = df.dot([.5,.25,.25])
df
Out
s1 s2 s3 weighted_scores
0 0.053543 0.659316 0.033540 0.199985
1 0.631627 0.257241 0.494959 0.503863
2 0.220939 0.870247 0.875165 0.546822
3 0.890487 0.519320 0.944459 0.811188
4 0.029416 0.016780 0.987503 0.265779
5 0.843882 0.784933 0.677096 0.787448
6 0.396092 0.297580 0.965454 0.513805
7 0.109894 0.011217 0.443796 0.168700
8 0.202096 0.637105 0.959876 0.500293
9 0.847020 0.949703 0.668615 0.828090
This question already has answers here:
Python - Drop row if two columns are NaN
(5 answers)
How to drop column according to NAN percentage for dataframe?
(4 answers)
Closed 4 years ago.
I have the following dataset:
A B C D
0 0.46 0.23 NaN 0.41
1 0.65 0.48 0.07 0.15
2 NaN 1.00 0.79 0.09
3 0.50 0.97 0.07 0.55
4 0.45 0.44 0.23 0.41
5 NaN 0.39 NaN 0.31
6 0.32 0.87 0.73 0.57
7 0.82 0.67 0.73 0.19
8 0.65 NaN NaN 0.81
9 0.36 0.23 1.00 0.51
I would like to delete rows with 2 or more missing values (or, alternatively, rows with more than 50% missing), i.e. drop rows 5 and 8 to get the following output:
A B C D
0 0.46 0.23 NaN 0.41
1 0.65 0.48 0.07 0.15
2 NaN 1.00 0.79 0.09
3 0.50 0.97 0.07 0.55
4 0.45 0.44 0.23 0.41
6 0.32 0.87 0.73 0.57
7 0.82 0.67 0.73 0.19
9 0.36 0.23 1.00 0.51
Thank you.
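A minimal sketch using DataFrame.dropna: its thresh parameter is the minimum number of non-NA values a row needs to be kept, so with 4 columns thresh=len(df.columns) - 1 drops every row missing 2 or more values (a fractional cutoff can be derived from the column count the same way):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [0.46, 0.65, np.nan, 0.50, 0.45, np.nan, 0.32, 0.82, 0.65, 0.36],
    "B": [0.23, 0.48, 1.00, 0.97, 0.44, 0.39, 0.87, 0.67, np.nan, 0.23],
    "C": [np.nan, 0.07, 0.79, 0.07, 0.23, np.nan, 0.73, 0.73, np.nan, 1.00],
    "D": [0.41, 0.15, 0.09, 0.55, 0.41, 0.31, 0.57, 0.19, 0.81, 0.51],
})

# keep rows with at least 3 of the 4 values present,
# i.e. drop rows with 2 or more NaN (rows 5 and 8 here)
cleaned = df.dropna(thresh=len(df.columns) - 1)
```

Note that thresh counts values that are present, not values that are missing.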
I have a Pandas dataframe (Dt) like this:
Pc Cvt C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
0 1 2 0.08 0.17 0.16 0.31 0.62 0.66 0.63 0.52 0.38
1 2 2 0.09 0.15 0.13 0.49 0.71 1.28 0.42 1.04 0.43
2 3 2 0.13 0.24 0.22 0.17 0.66 0.17 0.28 0.11 0.30
3 4 1 0.21 0.10 0.23 0.08 0.53 0.14 0.59 0.06 0.53
4 5 1 0.16 0.21 0.18 0.13 0.44 0.08 0.29 0.12 0.52
5 6 1 0.14 0.14 0.13 0.20 0.29 0.35 0.40 0.29 0.53
6 7 1 0.21 0.16 0.19 0.21 0.28 0.23 0.40 0.19 0.52
7 8 1 0.31 0.16 0.34 0.19 0.60 0.32 0.56 0.30 0.55
8 9 1 0.20 0.19 0.26 0.19 0.63 0.30 0.68 0.22 0.58
9 10 2 0.12 0.18 0.13 0.22 0.59 0.40 0.50 0.24 0.36
10 11 2 0.10 0.10 0.19 0.17 0.89 0.36 0.65 0.23 0.37
11 12 2 0.19 0.20 0.17 0.17 0.38 0.14 0.48 0.08 0.36
12 13 1 0.16 0.17 0.15 0.13 0.35 0.12 0.50 0.09 0.52
13 14 2 0.19 0.19 0.29 0.16 0.62 0.19 0.43 0.14 0.35
14 15 2 0.01 0.16 0.17 0.20 0.89 0.38 0.63 0.27 0.46
15 16 2 0.09 0.19 0.33 0.15 1.11 0.16 0.87 0.16 0.29
16 17 2 0.07 0.18 0.19 0.15 0.61 0.19 0.37 0.15 0.36
17 18 2 0.14 0.23 0.23 0.20 0.67 0.38 0.45 0.27 0.33
18 19 1 0.27 0.15 0.20 0.10 0.40 0.05 0.53 0.02 0.52
19 20 1 0.12 0.13 0.18 0.22 0.60 0.49 0.66 0.39 0.66
20 21 2 0.15 0.20 0.18 0.32 0.74 0.58 0.51 0.45 0.37
.
.
.
From this I want to plot a histogram with a KDE for each column from C1 to C10, arranged in the same grid I get when plotting with pandas:
Dt.iloc[:,2:].hist()
But so far I have not been able to add the KDE to each histogram; I want something like this:
Any ideas on how to accomplish this?
You want to first plot your histogram then plot the kde on a secondary axis.
Minimal, Complete, and Verifiable Example (MCVE)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4)).add_prefix('C')
k = len(df.columns)
n = 2
m = (k - 1) // n + 1
fig, axes = plt.subplots(m, n, figsize=(n * 5, m * 3))
for i, (name, col) in enumerate(df.items()):  # iteritems() was removed in pandas 2.0
r, c = i // n, i % n
ax = axes[r, c]
col.hist(ax=ax)
ax2 = col.plot.kde(ax=ax, secondary_y=True, title=name)
ax2.set_ylim(0)
fig.tight_layout()
How It Works
Keep track of total number of subplots
k = len(df.columns)
n will be the number of chart columns; change this to suit individual needs. m will be the number of rows required, calculated from k and n.
n = 2
m = (k - 1) // n + 1
Create a figure and array of axes with required number of rows and columns.
fig, axes = plt.subplots(m, n, figsize=(n * 5, m * 3))
Iterate through columns, tracking the column name and which number we are at i. Within each iteration, plot.
for i, (name, col) in enumerate(df.items()):  # iteritems() was removed in pandas 2.0
r, c = i // n, i % n
ax = axes[r, c]
col.hist(ax=ax)
ax2 = col.plot.kde(ax=ax, secondary_y=True, title=name)
ax2.set_ylim(0)
Use tight_layout() as an easy way to sharpen up the layout spacing
fig.tight_layout()
Here is a pure seaborn solution, using FacetGrid.map_dataframe as explained here.
Stealing the example from #piRSquared:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000, 4)).add_prefix('C')
Get the data in the required format:
df = df.stack().reset_index(level=1, name="val")
Result:
level_1 val
0 C0 0.879714
0 C1 -0.927096
0 C2 -0.929429
0 C3 -0.571176
1 C0 -1.127939
Then:
import matplotlib.pyplot as plt
import seaborn as sns

def distplot(x, **kwargs):
    ax = plt.gca()
    data = kwargs.pop("data")
    sns.distplot(data[x], ax=ax, **kwargs)

# 'size' was renamed to 'height' in newer seaborn versions
g = sns.FacetGrid(df, col="level_1", col_wrap=2, height=3.5)
g = g.map_dataframe(distplot, "val")
You can adjust col_wrap as needed.
I have a pandas dataframe with about 100 columns of following type:
X1 Y1 X2 Y2 X3 Y3
0.78 0.22 0.19 0.42 0.04 0.65
0.43 0.29 0.43 0.84 0.14 0.42
0.57 0.70 0.59 0.86 0.11 0.40
0.92 0.52 0.81 0.33 0.54 1.00
where (X, Y) are basically pairs of values.
I need to create the following from above.
X Y
0.78 0.22
0.43 0.29
0.57 0.70
0.92 0.52
0.19 0.42
0.43 0.84
0.59 0.86
0.81 0.33
0.04 0.65
0.14 0.42
0.11 0.40
0.54 1.00
i.e. stack all the X columns which are odd numbered and then stack all the Y columns which are even numbered.
I have no clue where to even start. For a small number of columns I could easily have used the column names.
You can use lreshape, for column names use list comprehension:
x = [col for col in df.columns if 'X' in col]
y = [col for col in df.columns if 'Y' in col]
df = pd.lreshape(df, {'X': x,'Y': y})
print (df)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00
Solution with MultiIndex and stack:
import numpy as np

df.columns = [np.arange(len(df.columns)) % 2, np.arange(len(df.columns)) // 2]
df = df.stack().reset_index(drop=True)
df.columns = ['X','Y']
print (df)
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
It may also be worth noting that you could just construct a new DataFrame explicitly with the X-Y values. This will most likely be quicker, but it assumes that the X-Y column pairs are the entirety of your DataFrame.
pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
Y=df.values[:,1::2].reshape(-1)))
Demo
>>> pd.DataFrame(dict(X=df.values[:,::2].reshape(-1),
Y=df.values[:,1::2].reshape(-1)))
X Y
0 0.78 0.22
1 0.19 0.42
2 0.04 0.65
3 0.43 0.29
4 0.43 0.84
5 0.14 0.42
6 0.57 0.70
7 0.59 0.86
8 0.11 0.40
9 0.92 0.52
10 0.81 0.33
11 0.54 1.00
You can use pd.wide_to_long, but you will need a 'dummy' column to uniquely identify each row. You can drop this column later.
pd.wide_to_long(df.reset_index(),
stubnames=['X', 'Y'],
i='index',
j='dropme').reset_index(drop=True)
X Y
0 0.78 0.22
1 0.43 0.29
2 0.57 0.70
3 0.92 0.52
4 0.19 0.42
5 0.43 0.84
6 0.59 0.86
7 0.81 0.33
8 0.04 0.65
9 0.14 0.42
10 0.11 0.40
11 0.54 1.00