I want to plot the accuracy rate of each letter as a bar graph using matplotlib.
Example Dataset
data = {'Actual Letter': ['U', 'A', 'X', 'P', 'C', 'R', 'C', 'U', 'J', 'D'], 'Predicted Letter': ['U', 'A', 'X', 'P', 'C', 'R', 'C', 'U', 'J', 'D']}
df = pd.DataFrame(data, index=[10113, 19164, 12798, 12034, 17719, 17886, 4624, 6047, 15608, 11815])
Actual Letter Predicted Letter
10113 U U
19164 A A
12798 X X
12034 P P
17719 C C
17886 R R
4624 C C
6047 U U
15608 J J
11815 D D
df.plot(kind='bar')
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-a5f21be4f14b> in <module>
3 df = pd.DataFrame(data, index=[10113, 19164, 12798, 12034, 17719, 17886, 4624, 6047, 15608, 11815])
4
----> 5 df.plot(kind='bar')
e:\Anaconda3\lib\site-packages\pandas\plotting\_core.py in __call__(self, *args, **kwargs)
970 data.columns = label_name
971
--> 972 return plot_backend.plot(data, kind=kind, **kwargs)
973
974 __call__.__doc__ = __doc__
e:\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\__init__.py in plot(data, kind, **kwargs)
69 kwargs["ax"] = getattr(ax, "left_ax", ax)
70 plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 71 plot_obj.generate()
72 plot_obj.draw()
73 return plot_obj.result
e:\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\core.py in generate(self)
284 def generate(self):
285 self._args_adjust()
--> 286 self._compute_plot_data()
287 self._setup_subplots()
288 self._make_plot()
e:\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\core.py in _compute_plot_data(self)
451 # no non-numeric frames or series allowed
452 if is_empty:
--> 453 raise TypeError("no numeric data to plot")
454
455 self.data = numeric_data.apply(self._convert_to_ndarray)
TypeError: no numeric data to plot
I want a bar graph like this, but I don't know how to produce it.
Imports and Sample DataFrame
import pandas as pd
import numpy as np # for sample data only
import string # for sample data only
# create sample dataframe for testing
np.random.seed(365)
rows = 1100
data = {'Actual': np.random.choice(list(string.ascii_uppercase), size=rows),
'Predicted': np.random.choice(list(string.ascii_uppercase), size=rows)}
df = pd.DataFrame(data)
Calculations and Plotting
Updated
The following implementation is more succinct; unnecessary steps have been removed.
1. Create a Boolean 'Match' column indicating whether 'Predicted' matches 'Actual'.
2. .groupby on 'Actual', aggregate with .mean(), multiply by 100, and round to get the percent.
   The mean of each letter's group sums the Booleans and divides by the count. For 'A', the sum is 1, because there is 1 True, which is divided by the total count of the group, 33. Therefore, 1/33 = 0.030303030303030304.
3. Plot the bars for the selected data with pandas.DataFrame.plot.
Note that steps (1) and (2) can be combined into the following:
dfa = df.Predicted.eq(df.Actual).groupby(df.Actual).mean().mul(100).round(2)
# determine where Predicted equals Actual
df['Match'] = df.Predicted.eq(df.Actual)
# display(df.head())
Actual Predicted Match
0 S Z False
1 U J False
2 B L False
3 M V False
4 F C False
# groupby and get percent
dfa = df.groupby('Actual').Match.mean().mul(100).round(2)
# display(dfa.head())
Actual
A 3.03
B 2.63
C 4.44
D 6.82
E 5.77
Name: Match, dtype: float64
# plot
# dfa is a Series indexed by 'Actual', so x= and y= are not needed
ax = dfa.plot(kind='bar', rot=0, legend=False, grid=True, figsize=(8, 5),
              ylabel='Percent %', xlabel='Letter', title='Accuracy Rate % per letter')
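The claim that the grouped mean of a Boolean column equals "matches divided by group size" can be checked by hand on a tiny dataset. The sketch below uses made-up data (not the sample above) and skips the plot:

```python
import pandas as pd

# Tiny hypothetical dataset, small enough to verify by hand
df = pd.DataFrame({
    "Actual":    ["A", "A", "A", "B", "B"],
    "Predicted": ["A", "B", "A", "B", "C"],
})

# True where the prediction is correct: True, False, True, True, False
match = df["Predicted"].eq(df["Actual"])

# Mean of Booleans per letter = correct predictions / group size
accuracy = match.groupby(df["Actual"]).mean().mul(100).round(2)
print(accuracy)
# 'A' has 2 of 3 correct and 'B' has 1 of 2 correct
```

Calling accuracy.plot(kind='bar') on the result would draw one bar per letter, as in the code above.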
Original Code
This works as well
# determine where Predicted equals Actual and convert to an int; True = 1 and False = 0
df['Match'] = df.Predicted.eq(df.Actual).astype(int)
# get the normalized value counts
dfg = df.groupby('Actual').Match.value_counts(normalize=True).mul(100).round(2).reset_index(name='%')
# get the accuracy scores where there is a Match
df_accuracy = dfg[dfg.Match.eq(1)]
# display(df_accuracy.head())
Actual Match %
1 A 1 3.03
3 B 1 2.63
5 C 1 4.44
7 D 1 6.82
9 E 1 5.77
# plot
ax = df_accuracy.plot(kind='bar', x='Actual', y='%', rot=0, legend=False, grid=True, figsize=(8, 5),
ylabel='Percent %', xlabel='Letter', title='Accuracy Rate % per letter')
I have simulated data like the sample you describe.
The graph is very simple to produce if you calculate the percentages first.
import numpy as np
import pandas as pd
# simulate some data...
df = pd.DataFrame(
    {"Actual Letter": np.random.choice(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 200)}
).assign(
    **{
        "Predicted Letter": lambda d: d["Actual Letter"].apply(
            lambda l: np.random.choice(
                [l] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 1, p=tuple([0.95] + [0.05 / 26] * 26)
            )[0]
        )
    }
)
# now just calc percentage of where actual and predicted are the same
# graph it...
df.groupby("Actual Letter").apply(lambda d: (d["Actual Letter"]==d["Predicted Letter"]).sum()/len(d)).plot(kind="bar")
Related
Here's the table from the dataframe:
   Points_groups  Qty Contracts  Qty Gones
1  350+                     108        275
2  300-350                  725       1718
3  250-300                  885       3170
4  200-250                 2121      10890
5  150-200                 3120       7925
6  100-150                  653       1318
7  50-100                   101        247
8  0-50                      45        137
I'd like to get something like this out of it:
but with the bars aligned to the categories on the x-axis,
which come from the 'Points_groups' column, like this:
I have already tried several options, but none of them worked.
For example:
df.plot(kind ='hist')
plt.xlabel('Points_groups')
plt.ylabel("Number Of Students");
or
sns.distplot(df['Qty Gones'])
sns.distplot(df['Qty Contracts'])
plt.show()
or
df.hist(column='Points_groups', by=['Qty Contracts', 'Qty Gones'], bins=2, grid=False, rwidth=0.9, color='purple', sharex=True);
Since you already have the distribution in your pandas dataframe, the plot you need can be achieved with the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Df = pd.DataFrame({'key': ['red', 'green', 'blue'], 'A': [1, 2, 1], 'B': [2, 4, 3]})
X_axis = np.arange(len(Df['key']))
plt.bar(X_axis - 0.2, Df['A'], 0.4, label = 'A')
plt.bar(X_axis + 0.2, Df['B'], 0.4, label = 'B')
X_label = list(Df['key'].values)
plt.xticks(X_axis, X_label)
plt.legend()
plt.show()
Since I don't have access to your data, I made a mock dataframe. This results in the following figure:
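As a sketch of the same idea applied to the table from the question: if one column is passed as x, pandas draws the remaining numeric columns as grouped bars. The values below are copied from the question's table; the axis labels are taken from the question's own attempt.

```python
import pandas as pd

# Data copied from the table in the question
df = pd.DataFrame({
    "Points_groups": ["350+", "300-350", "250-300", "200-250",
                      "150-200", "100-150", "50-100", "0-50"],
    "Qty Contracts": [108, 725, 885, 2121, 3120, 653, 101, 45],
    "Qty Gones": [275, 1718, 3170, 10890, 7925, 1318, 247, 137],
})

# One group of bars per category; each numeric column becomes one bar in the group
ax = df.plot(kind="bar", x="Points_groups", rot=0, figsize=(8, 5))
ax.set_xlabel("Points_groups")
ax.set_ylabel("Number Of Students")
```

Outside a notebook, follow this with matplotlib.pyplot.show() to display the figure.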
I am trying to make a pie chart that looks like the one below.
I am using geopandas for that:
us_states = gpd.read_file("conus_state.shp")
data = gpd.read_file("data_file.shp")
fig, ax = plt.subplots(figsize= (10,10))
us_states.plot(color = "None", ax = ax)
data.plot(column = ["Column1","Column2"], ax= ax, kind = "pie",subplots=True)
This gives me the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
C:\Users\LSRATH~1.STU\AppData\Local\Temp/ipykernel_17992/1047905594.py in <module>
1 fig, ax = plt.subplots(figsize= (10,10))
2 us_states.plot(color = "None", ax = ax)
----> 3 diff_env.plot(column = ["WS_MON1","WS_MON2"], ax= ax, kind = "pie")
c:\python38\lib\site-packages\geopandas\plotting.py in __call__(self, *args, **kwargs)
951 if kind in self._pandas_kinds:
952 # Access pandas plots
--> 953 return PlotAccessor(data)(kind=kind, **kwargs)
954 else:
955 # raise error
c:\python38\lib\site-packages\pandas\plotting\_core.py in __call__(self, *args, **kwargs)
921 if isinstance(data, ABCDataFrame):
922 if y is None and kwargs.get("subplots") is False:
--> 923 raise ValueError(
924 f"{kind} requires either y column or 'subplots=True'"
925 )
ValueError: pie requires either y column or 'subplots=True'
Even after specifying subplots=True, it does not work.
How can I make a pie chart using 2 columns of the dataframe?
Below are the first five rows of the relevant columns:
diff_env[["Column1", "Column2", "geometry"]].head().to_dict()
{'Column1': {0: 2, 1: 0, 2: 0, 3: 1, 4: 12},
'Column2': {0: 2, 1: 0, 2: 0, 3: 1, 4: 12},
'geometry': {0: <shapely.geometry.point.Point at 0x2c94e07f190>,
1: <shapely.geometry.point.Point at 0x2c94e07f130>,
2: <shapely.geometry.point.Point at 0x2c94e07f0d0>,
3: <shapely.geometry.point.Point at 0x2c94bb86d30>,
4: <shapely.geometry.point.Point at 0x2c94e07f310>}}
You have not provided any usable sample data, so I have randomly generated some.
This is inspired by "How to plot scatter pie chart using matplotlib".
sample data
   value0  value1  geometry                                          size
0       5       3  POINT (-105.96116535117056 31.014979334448164)     312
1       2       3  POINT (-79.70609244147155 36.46222924414716)       439
2       4       7  POINT (-68.89518006688962 37.84436728093645)       363
3       7       9  POINT (-118.12344177257525 31.909303946488293)     303
4       2       7  POINT (-102.1001252173913 28.57591221070234)       326
5       3       3  POINT (-96.88772103678929 47.76324025083612)       522
6       5       8  POINT (-112.33188157190635 48.16975143812709)      487
7       7       6  POINT (-95.15025297658862 44.59245298996656)       594
8       3       1  POINT (-100.36265715719063 46.787613401337794)     421
9       2       4  POINT (-81.82966451505015 35.161393444816056)      401
full code
import geopandas as gpd
import numpy as np
import shapely
import matplotlib.pyplot as plt
states = (
    gpd.read_file(
        "https://raw.githubusercontent.com/nvkelso/natural-earth-vector/master/geojson/ne_110m_admin_1_states_provinces.geojson"
    )
    .loc[lambda d: d["iso_3166_2"].ne("US-AK"), "geometry"]
    .exterior
)
# geodataframe of points where pies are to be plotted
n = 10
pies = gpd.GeoDataFrame(
    geometry=[
        shapely.geometry.Point(xy)
        for xy in zip(
            np.random.choice(np.linspace(*states.total_bounds[[0, 2]], 300), n),
            np.random.choice(np.linspace(*states.total_bounds[[1, 3]], 300), n),
        )
    ],
    data={f"value{c}": np.random.randint(1, 10, n) for c in range(2)},
    crs=states.crs,
).pipe(lambda d: d.assign(size=np.random.randint(300, 600, n)))
# utility function inspired by https://stackoverflow.com/questions/56337732/how-to-plot-scatter-pie-chart-using-matplotlib
def draw_pie(dist, xpos, ypos, size, ax):
    # for incremental pie slices
    cumsum = np.cumsum(dist)
    cumsum = cumsum / cumsum[-1]
    pie = [0] + cumsum.tolist()
    colors = ["blue", "red", "yellow"]
    for i, (r1, r2) in enumerate(zip(pie[:-1], pie[1:])):
        angles = np.linspace(2 * np.pi * r1, 2 * np.pi * r2)
        x = [0] + np.cos(angles).tolist()
        y = [0] + np.sin(angles).tolist()
        xy = np.column_stack([x, y])
        ax.scatter([xpos], [ypos], marker=xy, s=size, color=colors[i], alpha=1)
    return ax
fig, ax = plt.subplots()
ax = states.plot(ax=ax, edgecolor="black", linewidth=0.5)
for _, r in pies.iterrows():
    ax = draw_pie([r.value0, r.value1], r.geometry.x, r.geometry.y, r["size"], ax)
output
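To make the wedge arithmetic in draw_pie easier to follow, here is a minimal sketch of just the cumulative-fraction step, using the two values from row 0 of the sample data. The helper name wedge_fractions is hypothetical and only for illustration.

```python
import numpy as np

def wedge_fractions(dist):
    """Return cumulative wedge boundaries as fractions of a full circle."""
    cumsum = np.cumsum(dist)
    cumsum = cumsum / cumsum[-1]      # normalise so the last boundary is 1.0
    return [0] + cumsum.tolist()

bounds = wedge_fractions([5, 3])      # value0=5, value1=3 from row 0
print(bounds)

# Each consecutive (start, end) pair is one wedge; shown here in degrees
degrees = [(360 * r1, 360 * r2) for r1, r2 in zip(bounds[:-1], bounds[1:])]
print(degrees)
```

The first wedge covers 5/8 of the circle and the second the remaining 3/8; draw_pie turns each pair into a polygon marker and draws it with ax.scatter.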
I got a 'No numeric types to aggregate' error when I run the following code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x_labels = ['Female', 'Male']
genderincomeTM1 = round(TM1.groupby('Gender')['Income'].mean())
genderincomeTM2 = round(TM2.groupby('Gender')['Income'].mean())
genderincomeTM3 = round(TM3.groupby('Gender')['Income'].mean())
genderTM1 = genderincomeTM1.index
genderTM2 = genderincomeTM2.index
genderTM3 = genderincomeTM3.index
x = np.arange(len(x_labels))
plt.figure(figsize=(12,8))
width = 0.35
fig, ax = plt.subplots()
bar1 = ax.bar(x - 0.3, genderincomeTM1, width=0.2, label='TM1')
bar2 = ax.bar(x, genderincomeTM2, width=0.2, label='TM2')
bar3 = ax.bar(x + 0.3, genderincomeTM3, width=0.2, label='TM3')
ax.set_title('Average Income by Product Model', fontsize = 18)
ax.set_ylabel('Sales', fontsize = 12)
ax.set_xticks(x)
ax.set_xticklabels(x_labels)
ax.set_ylim(bottom = 0, top = 90000)
ax.legend(loc=(1.02,0.4), borderaxespad=0, fontsize = 12)
def autolabel(bars):
    for each in bars:
        height = each.get_height()
        ax.annotate('{}'.format(height),
                    xy=(each.get_x() + each.get_width() / 2, height),
                    xytext=(0, 2),  # 2 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')
autolabel(bar1)
autolabel(bar2)
autolabel(bar3)
DataError Traceback (most recent call last)
<ipython-input-24-fb1aa4ae1242> in <module>
1 x_labels = ['Female', 'Male']
2
----> 3 genderincomeTM1 = round(TM1.groupby('Gender')['Income'].mean())
4 genderincomeTM2 = round(TM2.groupby('Gender')['Income'].mean())
5 genderincomeTM3 = round(TM3.groupby('Gender')['Income'].mean())
~\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in mean(self, *args, **kwargs)
1223 """
1224 nv.validate_groupby_func("mean", args, kwargs, ["numeric_only"])
-> 1225 return self._cython_agg_general(
1226 "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
1227 )
~\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
906
907 if len(output) == 0:
--> 908 raise DataError("No numeric types to aggregate")
909
910 return self._wrap_aggregated_output(output)
DataError: No numeric types to aggregate
I have removed all empty rows and checked on 'Income' column using is_numeric_dtype. I also converted the column to int.
from pandas.api.types import is_numeric_dtype
is_numeric_dtype(df['Income'])
>True
df['Income'] = df['Income'].astype(int)
df.info()
><class 'pandas.core.frame.DataFrame'>
>Int64Index: 180 entries, 0 to 182
>Data columns (total 10 columns):
> # Column Non-Null Count Dtype
>--- ------ -------------- -----
> 0 Product 180 non-null object
> 1 Branch 180 non-null object
> 2 Age 180 non-null object
> 3 Gender 180 non-null object
> 4 Education 180 non-null object
> 5 MaritalStatus 180 non-null object
> 6 Usage 180 non-null object
> 7 Fitness 180 non-null object
> 8 Income 180 non-null int32
I am confused why there is no numeric type for Income after validation. Could it be referring to 'Gender'? How should I go about resolving the error?
I probably can't answer your question, because you haven't provided a dataframe that reproduces the error. Maybe you could start by running this code:
import numpy as np
import pandas as pd
TM1 = pd.DataFrame({'Gender':['M','F','M','F'],'Income':['10','20','30','40']})
TM1['Income'] = TM1.Income.astype(int)
TM1.groupby('Gender')['Income'].mean().round()
Because the incomes are initially given as strings, the mean can't be computed until they're converted to integers.
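One way this situation can arise (an assumption, since the question does not show how TM1, TM2 and TM3 were created) is that the subsets were taken from df before df['Income'] was converted, so they still hold strings. The sketch below uses made-up data to show the effect and a possible fix with pd.to_numeric:

```python
import pandas as pd

# Hypothetical data: incomes stored as strings
df = pd.DataFrame({
    "Product": ["TM1", "TM1", "TM2", "TM2"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Income": ["30000", "50000", "40000", "60000"],
})

# Subset taken BEFORE the conversion (assumed, not shown in the question)
TM1 = df[df["Product"] == "TM1"].copy()

# Converting the parent frame does not touch the earlier copy
df["Income"] = df["Income"].astype(int)
print(df["Income"].dtype)    # an integer dtype
print(TM1["Income"].dtype)   # still a string/object dtype

# Convert the subset itself; errors="coerce" turns unparseable values into NaN
TM1["Income"] = pd.to_numeric(TM1["Income"], errors="coerce")
gender_income_tm1 = TM1.groupby("Gender")["Income"].mean().round()
print(gender_income_tm1)
```

Checking TM1.dtypes directly, rather than df.dtypes, would confirm or rule out this explanation.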
I used this tip https://python-graph-gallery.com/58-show-number-of-observation-on-violinplot/ to add the number of observations to a violin plot.
Here is my code:
# Calculate number of obs per group & median to position labels
medians = dataset.groupby([x_attrib])[y_attrib].median().values
nobs = dataset[x_attrib].value_counts().values
nobs = [str(x) for x in nobs.tolist()]
#nobs = ["Nb: " + i for i in nobs]
nobs = [i for i in nobs]
# Add it to the plot
pos = range(len(nobs))
for tick, label in zip(pos, ax.get_xticklabels()):
    ax.text(pos[tick], medians[tick] + 0.03, nobs[tick], horizontalalignment='center', size='x-large', color='black', weight='semibold')
I plot variable with these value counts:
0 355
1 174
2 36
-1 19
3 15
4 5
...
As you can see on the plot, for the value -1 the real count is 19, but the plot shows 355 (the count for the value 0).
How can I modify the code to get a correct plot, please?
Thanks a lot.
Theo
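A likely cause, assuming the violins are drawn in sorted order of the x values: groupby(...).median() is ordered by group key, while value_counts() is ordered by frequency, so the two arrays do not line up. The sketch below uses made-up data standing in for dataset, x_attrib and y_attrib, and takes the counts from the same groupby so both share one order:

```python
import pandas as pd

# Hypothetical data standing in for `dataset`, `x_attrib`, `y_attrib`
dataset = pd.DataFrame({
    "group": [0, 0, 0, 1, 1, -1],
    "value": [1.0, 2.0, 3.0, 4.0, 6.0, 10.0],
})
x_attrib, y_attrib = "group", "value"

grouped = dataset.groupby(x_attrib)[y_attrib]
medians = grouped.median()                        # ordered by key: -1, 0, 1
by_frequency = dataset[x_attrib].value_counts()   # ordered by count: 0, 1, -1
nobs = grouped.size()                             # same key order as `medians`

print(by_frequency.index.tolist())
print(nobs.index.tolist())
```

With counts taken this way, medians.values and nobs.values can be zipped position by position as in the original loop. dataset[x_attrib].value_counts().sort_index() would give the same ordering.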
Based on a selection ds of a dataframe d with:
{'x': d.x, 'y': d.y, 'a': d.a, 'b': d.b, 'c': d.c, 'n': d.n}
With n rows, x ranges from 0 to n-1. The column n is needed since ds is a selection and the original indices need to be kept for a later query.
How do you efficiently compute the difference between each pair of rows (e.g. a_0, a_1, etc.) for each column (a, b, c) without losing the row information (e.g. a new column with the indices of the rows that were used)?
MWE
Sample selection ds:
x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291
Desired output:
dist: Euclidean distance, math.hypot(x2 - x1, y2 - y1)
da, db, dc: absolute differences, e.g. for da: np.abs(a1 - a2)
ns: a string with the n values of both rows used
the result would look like:
dist da db dc ns
42.61365102824963 993 340 241 146-225
293.82347069813255 8181 2132 4740 146-291
.. .. .. .. 225-291
You can use itertools.combinations() to generate the pairs:
Read data first:
import pandas as pd
from io import StringIO
import numpy as np
text = """ x y a b c n
554.607085 400.971878 9789 4151 6837 146
512.231450 405.469524 8796 3811 6596 225
570.427284 694.369140 1608 2019 2097 291"""
# sep=r"\s+" splits on whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv(StringIO(text), sep=r"\s+")
Create the index and calculate the results:
from itertools import combinations
index = np.array(list(combinations(range(df.shape[0]), 2)))
df1, df2 = [df.iloc[idx].reset_index(drop=True) for idx in index.T]
res = pd.concat([
np.hypot(df1.x - df2.x, df1.y - df2.y),
(df1[["a", "b", "c"]] - df2[["a", "b", "c"]]).abs(),  # absolute differences, as requested
df1.n.astype(str) + "-" + df2.n.astype(str)
], axis=1)
res.columns = ["dist", "da", "db", "dc", "ns"]
res
the output:
dist da db dc ns
0 42.613651 993 340 241 146-225
1 293.823471 8181 2132 4740 146-291
2 294.702805 7188 1792 4499 225-291
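This approach makes good use of pandas and the underlying NumPy capabilities, but the matrix manipulations are a little hard to keep track of. Note that it relies on pd.Panel, which was deprecated in pandas 0.20 and removed in 0.25, so it only runs on older pandas versions: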
import pandas as pd, numpy as np
ds = pd.DataFrame(
[
[554.607085, 400.971878, 9789, 4151, 6837, 146],
[512.231450, 405.469524, 8796, 3811, 6596, 225],
[570.427284, 694.369140, 1608, 2019, 2097, 291]
],
columns = ['x', 'y', 'a', 'b', 'c', 'n']
)
def concat_str(*arrays):
    # element-wise string concatenation with broadcasting
    result = arrays[0]
    for arr in arrays[1:]:
        result = np.char.add(result, arr)
    return result
# Make a panel with one item for each column, with a square data frame for
# each item, showing the differences between all row pairs.
# This creates perpendicular matrices of values based on the underlying numpy arrays;
# then numpy broadcasts them along the missing axis when calculating the differences
p = pd.Panel(
(ds.values[np.newaxis,:,:] - ds.values[:,np.newaxis,:]).transpose(),
items=['d'+c for c in ds.columns], major_axis=ds.index, minor_axis=ds.index
)
# calculate Euclidean distance
p['dist'] = np.hypot(p['dx'], p['dy'])
# create strings showing row relationships
p['ns'] = concat_str(ds['n'].values.astype(str)[:,np.newaxis], '-', ds['n'].values.astype(str)[np.newaxis,:])
# remove unneeded items
del p['dx'], p['dy'], p['dn']
# convert to frame
diffs = p.to_frame().reindex_axis(['dist', 'da', 'db', 'dc', 'ns'], axis=1)
diffs
This gives:
dist da db dc ns
major minor
0 0 0.000000 0 0 0 146-146
1 42.613651 993 340 241 146-225
2 293.823471 8181 2132 4740 146-291
1 0 42.613651 -993 -340 -241 225-146
1 0.000000 0 0 0 225-225
2 294.702805 7188 1792 4499 225-291
2 0 293.823471 -8181 -2132 -4740 291-146
1 294.702805 -7188 -1792 -4499 291-225
2 0.000000 0 0 0 291-291
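For pandas versions without pd.Panel, the same all-pairs table can be built from plain NumPy broadcasting. This is a minimal sketch rather than a drop-in replacement: the helper name pairwise_diff is hypothetical, and the sign convention (row "major" minus row "minor") is chosen to match the output above.

```python
import numpy as np
import pandas as pd

ds = pd.DataFrame(
    [
        [554.607085, 400.971878, 9789, 4151, 6837, 146],
        [512.231450, 405.469524, 8796, 3811, 6596, 225],
        [570.427284, 694.369140, 1608, 2019, 2097, 291],
    ],
    columns=["x", "y", "a", "b", "c", "n"],
)

def pairwise_diff(values):
    """Matrix whose [i, j] entry is values[i] - values[j] (via broadcasting)."""
    v = np.asarray(values)
    return v[:, np.newaxis] - v[np.newaxis, :]

# Row-major ravel of an (n, n) matrix matches the (major, minor) product order
index = pd.MultiIndex.from_product([ds.index, ds.index], names=["major", "minor"])
diffs = pd.DataFrame(
    {
        "dist": np.hypot(pairwise_diff(ds["x"]), pairwise_diff(ds["y"])).ravel(),
        "da": pairwise_diff(ds["a"]).ravel(),
        "db": pairwise_diff(ds["b"]).ravel(),
        "dc": pairwise_diff(ds["c"]).ravel(),
        "ns": [f"{i}-{j}" for i in ds["n"] for j in ds["n"]],
    },
    index=index,
)
print(diffs)
```

To keep each unordered pair only once, filter on the index levels, for example diffs[diffs.index.get_level_values("major") < diffs.index.get_level_values("minor")].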