Pandas Plotting with Multi-Index - python

After performing a groupby.sum() on a DataFrame I'm having some trouble trying to create my intended plot.
import pandas as pd
import numpy as np
np.random.seed(365)
rows = 100
data = {'Month': np.random.choice(['2014-01', '2014-02', '2014-03', '2014-04'], size=rows),
'Code': np.random.choice(['A', 'B', 'C'], size=rows),
'ColA': np.random.randint(5, 125, size=rows),
'ColB': np.random.randint(0, 51, size=rows),}
df = pd.DataFrame(data)
Month Code ColA ColB
0 2014-03 C 59 47
1 2014-01 A 24 9
2 2014-02 C 77 50
dfg = df.groupby(['Code', 'Month']).sum()
ColA ColB
Code Month
A 2014-01 124 102
2014-02 398 282
2014-03 474 198
2014-04 830 237
B 2014-01 477 300
2014-02 591 167
2014-03 522 192
2014-04 367 169
C 2014-01 412 180
2014-02 275 205
2014-03 795 291
2014-04 901 309
How can I create a subplot (kind='bar') for each Code, where the x-axis is the Month and the bars are ColA and ColB?

I found the unstack(level) method to work perfectly, which has the added benefit of not needing a priori knowledge about how many Codes there are.
import matplotlib.pyplot as plt
ax = dfg.unstack(level=0).plot(kind='bar', subplots=True, rot=0, figsize=(9, 7), layout=(2, 3))
plt.tight_layout()
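To see why this works, it can help to look at the shape that unstack(level=0) produces. The sketch below uses a small made-up frame (not the data from the question): the Code level moves from the row index into the columns, so every (column, Code) pair becomes its own column and, with subplots=True, its own subplot.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame({
    "Month": ["2014-01", "2014-01", "2014-02", "2014-02", "2014-01"],
    "Code": ["A", "B", "A", "B", "A"],
    "ColA": [1, 2, 3, 4, 5],
    "ColB": [10, 20, 30, 40, 50],
})

# Rows are indexed by (Code, Month) after the groupby.
dfg = df.groupby(["Code", "Month"]).sum()

# Move level 0 (Code) into the columns; Month stays as the row index.
wide = dfg.unstack(level=0)

print(wide.index.tolist())    # the months, used as the x-axis
print(wide.columns.tolist())  # one (column, Code) pair per subplot
```

With the question's data there are 2 columns and 3 codes, hence 6 subplots, which is why layout=(2, 3) fits.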

Using the following DataFrame ...
# using pandas version 0.14.1
from pandas import DataFrame
import pandas as pd
import matplotlib.pyplot as plt
data = {'ColB': {('A', 4): 3.0,
('C', 2): 0.0,
('B', 4): 51.0,
('B', 1): 0.0,
('C', 3): 0.0,
('B', 2): 7.0,
('Code', 'Month'): '',
('A', 3): 5.0,
('C', 1): 0.0,
('C', 4): 0.0,
('B', 3): 12.0},
'ColA': {('A', 4): 66.0,
('C', 2): 5.0,
('B', 4): 125.0,
('B', 1): 5.0,
('C', 3): 41.0,
('B', 2): 52.0,
('Code', 'Month'): '',
('A', 3): 22.0,
('C', 1): 14.0,
('C', 4): 51.0,
('B', 3): 122.0}}
df = DataFrame(data)
... you can plot the following (using cross-section):
f, a = plt.subplots(3,1)
df.xs('A').plot(kind='bar',ax=a[0])
df.xs('B').plot(kind='bar',ax=a[1])
df.xs('C').plot(kind='bar',ax=a[2])
One for A, one for B and one for C, x-axis: 'Month', the bars are ColA and ColB.
Maybe this is what you are looking for.

Creating the desired visualization is all about shaping the dataframe to fit the plotting API.
seaborn can easily aggregate long form data from a dataframe without .groupby or .pivot_table.
Given the original dataframe df, the easiest option is to convert it to a long form with pandas.DataFrame.melt, and then plot with seaborn.catplot, which is a high-level API for matplotlib.
Change the default estimator from mean to sum
The 'Month' column in the OP is a string type. In general, it's better to convert the column to datetime dtype with pd.to_datetime
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.2, seaborn 0.11.2
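A minimal sketch of the datetime conversion mentioned above, assuming the 'Month' strings follow the 'YYYY-MM' pattern shown in the question. The three rows are taken from the head of the sample output; the format string is an assumption that matches that pattern.

```python
import pandas as pd

# First three rows of the question's sample data.
df = pd.DataFrame({"Month": ["2014-03", "2014-01", "2014-02"],
                   "ColA": [59, 24, 77]})

# Parse the strings into real datetimes (first day of each month).
df["Month"] = pd.to_datetime(df["Month"], format="%Y-%m")

# Datetimes sort chronologically, and can be formatted back for tick labels.
df = df.sort_values("Month")
labels = df["Month"].dt.strftime("%Y-%m").tolist()
print(labels)
```

Note that the plotting code in this answer was written against the string column, so the conversion is optional there.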
seaborn.catplot
import seaborn as sns
import matplotlib.pyplot as plt
dfm = df.melt(id_vars=['Month', 'Code'], var_name='Cols')
Month Code Cols value
0 2014-03 C ColA 59
1 2014-01 A ColA 24
2 2014-02 C ColA 77
3 2014-04 B ColA 114
4 2014-01 C ColA 67
# specify row and col to get a plot like that produced by the accepted answer
sns.catplot(kind='bar', data=dfm, col='Code', x='Month', y='value', row='Cols', order=sorted(dfm.Month.unique()),
col_order=sorted(df.Code.unique()), estimator=sum, ci=None, height=3.5)
sns.catplot(kind='bar', data=dfm, col='Code', x='Month', y='value', hue='Cols', estimator=sum, ci=None,
order=sorted(dfm.Month.unique()), col_order=sorted(df.Code.unique()))
pandas.DataFrame.plot
pandas uses matplotlib as the default plotting backend.
To produce the plot like the accepted answer, it's better to use pandas.DataFrame.pivot_table instead of .groupby, because the resulting dataframe is in the correct shape, without the need to unstack.
dfp = df.pivot_table(index='Month', columns='Code', values=['ColA', 'ColB'], aggfunc='sum')
dfp.plot(kind='bar', subplots=True, rot=0, figsize=(9, 7), layout=(2, 3))
plt.tight_layout()
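As a quick check of the claim that pivot_table yields the shape directly, the sketch below compares it with groupby followed by unstack on a small made-up frame in which every Code/Month combination is present. The variable names are illustrative only.

```python
import pandas as pd

# Illustrative data; every (Code, Month) combination appears at least once.
df = pd.DataFrame({
    "Month": ["2014-01", "2014-01", "2014-02", "2014-02", "2014-01", "2014-02"],
    "Code": ["A", "B", "A", "B", "A", "B"],
    "ColA": [1, 2, 3, 4, 5, 6],
    "ColB": [10, 20, 30, 40, 50, 60],
})

# Route 1: aggregate, then reshape.
via_groupby = df.groupby(["Code", "Month"]).sum().unstack(level=0)

# Route 2: aggregate and reshape in one call.
via_pivot = df.pivot_table(index="Month", columns="Code",
                           values=["ColA", "ColB"], aggfunc="sum")

same_columns = list(via_groupby.columns) == list(via_pivot.columns)
same_values = bool((via_groupby.to_numpy() == via_pivot.to_numpy()).all())
print(same_columns, same_values)
```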

Related

Can't plot comparative (double) histogram from Pandas table

Here's the table from the dataframe:
   Points_groups  Qty Contracts  Qty Gones
1           350+            108        275
2        300-350            725       1718
3        250-300            885       3170
4        200-250           2121      10890
5        150-200           3120       7925
6        100-150            653       1318
7         50-100            101        247
8           0-50             45        137
I'd like to get something like this out of it:
But with the bars corresponding to the x-axis categories,
which are built from the 'Points_groups' column.
I tried a bunch of options already, but I couldn't get it.
For example:
df.plot(kind ='hist')
plt.xlabel('Points_groups')
plt.ylabel("Number Of Students");
or
sns.distplot(df['Qty Gones'])
sns.distplot(df['Qty Contracts'])
plt.show()
or
df.hist(column='Points_groups', by= ['Qty Contracts', 'Qty Gones'], bins=2, grid=False, rwidth=0.9,color='purple', sharex=True);
Since you already have the distribution in your pandas dataframe, the plot you need can be achieved with the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Df = pd.DataFrame({'key': ['red', 'green', 'blue'], 'A': [1, 2, 1], 'B': [2, 4, 3]})
X_axis = np.arange(len(Df['key']))
plt.bar(X_axis - 0.2, Df['A'], 0.4, label = 'A')
plt.bar(X_axis + 0.2, Df['B'], 0.4, label = 'B')
X_label = list(Df['key'].values)
plt.xticks(X_axis, X_label)
plt.legend()
plt.show()
Since I don't have access to your data, I used a mock dataframe. The result is a grouped bar chart with the A and B bars side by side for each key.
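The same idea can be applied to the table from the question. The sketch below retypes that table by hand and uses pandas.DataFrame.plot instead of positioning the bars manually, which is a swapped-in shortcut rather than the code above. The Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to show a window
import matplotlib.pyplot as plt
import pandas as pd

# The table from the question, typed in by hand.
df = pd.DataFrame({
    "Points_groups": ["350+", "300-350", "250-300", "200-250",
                      "150-200", "100-150", "50-100", "0-50"],
    "Qty Contracts": [108, 725, 885, 2121, 3120, 653, 101, 45],
    "Qty Gones": [275, 1718, 3170, 10890, 7925, 1318, 247, 137],
})

# One group of two bars per Points_groups value.
ax = df.plot(kind="bar", x="Points_groups",
             y=["Qty Contracts", "Qty Gones"], rot=0, figsize=(9, 5))
ax.set_xlabel("Points_groups")
ax.set_ylabel("Number Of Students")
plt.tight_layout()
```

kind='bar' is used rather than kind='hist' because the table already holds the counts per bin; a histogram would try to bin the counts themselves.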

creating a dataframe using lists

I am trying to create a dataframe that looks like this excel sheet but I can't figure out how to do so. Here is the code I am attempting to use
import pandas as pd
ls_super = ['Supernatant',total_volume_super_ml, Total_activity_super, total_protein_super_mg, specific_activity_sup, 100, 1]
df3 = pd.DataFrame(ls_super, columns =['Sample','Total Volume','Total activity','Total protein','Specific Activity','percent yield,','purification'
])
Here is the error message: ValueError Traceback (most recent call last)
/tmp/ipykernel_125580/4224246098.py in <module>
20
21 # list of strings
---> 22 df3 = pd.DataFrame(ls_super, columns =['Sample','Total Volume','Total activity','Total protein','Specific Activity','percent yield,','purification'
23 ])
24
~/.local/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
709 )
710 else:
--> 711 mgr = ndarray_to_mgr(
712 data,
713 index,
~/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
322 )
323
--> 324 _check_values_indices_shape_match(values, index, columns)
325
326 if typ == "array":
~/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index, columns)
391 passed = values.shape
392 implied = (len(index), len(columns))
--> 393 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
394
395
ValueError: Shape of passed values is (7, 1), indices imply (7, 7)
Problem: the DataFrame() constructor insists on interpreting the one-dimensional python list ls_super as 1 column and 7 rows ("Shape of passed values is (7, 1)"), as opposed to 1 row with 7 columns (which would have shape=(1, 7)).
Solution: add a second set of brackets ([]) around your definition of the ls_super list. In other words, make ls_super a two-dimensional list. The DataFrame constructor then sees a single value in the first dimension, and seven values in the second dimension, producing the desired shape of (1, 7).
ls_super = [['Supernatant',1, 2, 3, 4, 100, 1]]
df3 = pd.DataFrame(ls_super,
columns=['Sample', 'Total Volume', 'Total activity', 'Total protein', 'Specific Activity', 'percent yield', 'purification'])
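The difference between the two shapes can be checked directly. This is a small sketch with placeholder numbers in place of the measured values; the names flat_shape, error_message and one_row are only for illustration.

```python
import pandas as pd

columns = ["Sample", "Total Volume", "Total activity", "Total protein",
           "Specific Activity", "percent yield", "purification"]
row = ["Supernatant", 1, 2, 3, 4, 100, 1]  # placeholder values

# A flat list is read as one column with seven rows.
flat_shape = pd.DataFrame(row).shape

# Asking for seven column names on that shape raises the ValueError.
try:
    pd.DataFrame(row, columns=columns)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Wrapping the list in another list gives one row with seven columns.
one_row = pd.DataFrame([row], columns=columns)
print(flat_shape, one_row.shape)
```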
import pandas as pd
first_row_list = [
"Supernatant",
total_volume_super_ml,
Total_activity_super,
total_protein_super_mg,
specific_activity_sup,
100,
1
]
columns = [
"Sample",
"Total Volume",
"Total Activity",
"Total protein",
"Specific Activity",
"percent yield",
"purification"
]
d = dict(zip(columns, [[f] for f in first_row_list]))
df = pd.DataFrame(d)
or
d = {'Sample': ["Supernatant"],
'Total Volume': [total_volume_super_ml],
'Total Activity': [Total_activity_super],
'Total protein': [total_protein_super_mg],
'Specific Activity': [specific_activity_sup],
'percent yield': [100],
'purification': [1]}
df = pd.DataFrame(d)
The input to a pandas DataFrame can be a list of lists.
import pandas as pd
from random import uniform
total_volume_super_ml = uniform(0,1)
Total_activity_super = uniform(0,1)
total_protein_super_mg = uniform(0,1)
specific_activity_sup = uniform(0,1)
ls_super = ['Supernatant',total_volume_super_ml, Total_activity_super, total_protein_super_mg, specific_activity_sup, 100, 1]
df3 = pd.DataFrame([ls_super], columns =['Sample','Total Volume','Total activity','Total protein','Specific Activity','percent yield','purification'])
df3

How can I speed up this iteration?

I have a dataframe with over ten million rows containing 2 columns 'left_index' and 'right_index'.
'left_index' is the index of a value and 'right_index' contains indexes of rows that have a possible match.
The problem is that this contains duplicate matches (Ex: 0,1 and 1,0).
I want to filter this dataframe and only keep one combination of each match.
I'm using a list here for an example.
In: [(0,1), (1,0), (3,567)]
Out: [(0,1), (3, 567)]
The below code produces what I want however it is very slow. Is there a faster way to solve this?
lst2 = []
for i in lst1:
    if(i in lst2):
        lst1.remove(i)
    else:
        lst2.append((i[1],i[0]))
Using numpy to keep the first occurrence of a non-unique array:
import numpy as np
lst1 = [(1,0), (0,1), (2, 5), (3,567), (5,2)]
arr = np.array(lst1)
result = arr[np.unique(np.sort(arr), return_index=True, axis=0)[1]]
>>> result
array([[ 1, 0],
[ 2, 5],
[ 3, 567]])
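Broken into steps, the one-liner does the following. This sketch adds one assumption that is not in the answer: the returned indices are sorted so that the kept rows stay in their original order, since np.unique orders them by the sorted unique rows instead.

```python
import numpy as np

lst1 = [(1, 0), (0, 1), (2, 5), (3, 567), (5, 2)]
arr = np.array(lst1)

# Sort within each row so that (1, 0) and (0, 1) look identical.
canonical = np.sort(arr, axis=1)

# Unique rows of the canonical form, plus the index of each first occurrence.
_, first_idx = np.unique(canonical, axis=0, return_index=True)

# Index the ORIGINAL array, keeping the first occurrences in input order.
result = arr[np.sort(first_idx)]
print(result.tolist())
```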
I believe pandas lets you avoid the explicit loop.
import pandas as pd
df = pd.DataFrame([
[(0, 0), (0, 0), 123],
[(0, 0), (0, 1), 234],
[(1, 0), (0, 1), 345],
[(1, 1), (0, 1), 456],
], columns=['left_index', 'right_index', 'value'])
print(df)
left_index right_index value
0 (0, 0) (0, 0) 123
1 (0, 0) (0, 1) 234
2 (1, 0) (0, 1) 345
3 (1, 1) (0, 1) 456
df['left_index_set'] = df['left_index'].apply(set)
df['right_index_set'] = df['right_index'].apply(set)
I am not sure what you need after this point. If you want to filter duplicates you do the following.
df = df[df['left_index_set'] != df['right_index_set']]
df_final1= df[['left_index', 'right_index', 'value']]
print(df_final1)
left_index right_index value
1 (0, 0) (0, 1) 234
3 (1, 1) (0, 1) 456
However, If you do not want to filter dataframe but to modify it:
df.loc[df['left_index_set'] != df['right_index_set'], 'right_index'] = None # None, '' or what you want. It's up to you
df_final2 = df[['left_index', 'right_index', 'value']]
print(df_final2)
left_index right_index value
0 (0, 0) (0, 0) 123
1 (0, 0) None 234
2 (1, 0) (0, 1) 345
3 (1, 1) None 456
You mention the data is in a dataframe and tagged pandas so we can use numpy to do this work for us using vectorization.
First, since you did not provide a way to create the data, I generated a dataframe per your description using:
import numpy as np
import pandas
def build_dataframe():
    def rand_series():
        """Create series of 1 million random integers in range [0, 9999]."""
        return (np.random.rand(1000000) * 10000).astype('int')
    data = pandas.DataFrame({
        'left_index': rand_series(),
        'right_index': rand_series()
    })
    return data.set_index('left_index')
data = build_dataframe()
Since (0,1) is the same as (1,0) per your requirements, let's just create an index that has the values sorted for us. First create two new columns with the minimum and maximum value of left and right index:
data['min_index'] = np.minimum(data.index, data.right_index)
data['max_index'] = np.maximum(data.index, data.right_index)
print(data)
right_index min_index max_index
left_index
4270 438 438 4270
1277 9378 1277 9378
20 7080 20 7080
4646 6623 4646 6623
3280 4481 3280 4481
... ... ... ...
3656 2492 2492 3656
2345 210 210 2345
9241 1934 1934 9241
369 8362 369 8362
5251 6047 5251 6047
[1000000 rows x 3 columns]
Then we can reset the index to these two new columns (really we just want a multi-index, and this is one way of getting it for us).
data = data.reset_index().set_index(keys=['min_index', 'max_index'])
print(data)
left_index right_index
min_index max_index
438 4270 4270 438
1277 9378 1277 9378
20 7080 20 7080
4646 6623 4646 6623
3280 4481 3280 4481
... ... ...
2492 3656 3656 2492
210 2345 2345 210
1934 9241 9241 1934
369 8362 369 8362
5251 6047 5251 6047
[1000000 rows x 2 columns]
Then we just want the unique values of the index. This is the operation that takes the most time, but should still be significantly faster than the naive implementation using lists.
unique = data.index.unique()
print(unique)
MultiIndex([( 438, 4270),
(1277, 9378),
( 20, 7080),
(4646, 6623),
(3280, 4481),
(4410, 9367),
(1864, 7881),
( 516, 3287),
(1678, 6946),
(1253, 7890),
...
(6669, 9527),
(1095, 8866),
( 455, 7800),
(2862, 8587),
(8221, 9808),
(2492, 3656),
( 210, 2345),
(1934, 9241),
( 369, 8362),
(5251, 6047)],
names=['min_index', 'max_index'], length=990197)
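If the goal is to get back the original rows rather than the unique index, a compact variant of the same min/max idea is sketched below. It assumes 'left_index' and 'right_index' are ordinary integer columns, and it uses the small example from the question instead of the random data.

```python
import numpy as np
import pandas as pd

# The example pairs from the question: (0, 1), (1, 0), (3, 567).
pairs = pd.DataFrame({"left_index": [0, 1, 3], "right_index": [1, 0, 567]})

# Put the smaller value first in every row, so (1, 0) becomes (0, 1).
canon = np.sort(pairs[["left_index", "right_index"]].to_numpy(), axis=1)

# Mark rows whose canonical form has not been seen before.
keep = ~pd.DataFrame(canon).duplicated().to_numpy()

# Filter the original frame; the first occurrence of each pair is kept.
deduped = pairs[keep]
result = list(deduped.itertuples(index=False, name=None))
print(result)
```

Both steps are vectorized, so no Python-level loop over the rows is involved.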

How to calculate and plot accuracy between two columns

I wanted to create an accuracy rate of each letter in a bar graph using matplotlib.
Example Dataset
data = {'Actual Letter': ['U', 'A', 'X', 'P', 'C', 'R', 'C', 'U', 'J', 'D'], 'Predicted Letter': ['U', 'A', 'X', 'P', 'C', 'R', 'C', 'U', 'J', 'D']}
df = pd.DataFrame(data, index=[10113, 19164, 12798, 12034, 17719, 17886, 4624, 6047, 15608, 11815])
Actual Letter Predicted Letter
10113 U U
19164 A A
12798 X X
12034 P P
17719 C C
17886 R R
4624 C C
6047 U U
15608 J J
11815 D D
df.plot(kind='bar')
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-a5f21be4f14b> in <module>
3 df = pd.DataFrame(data, index=[10113, 19164, 12798, 12034, 17719, 17886, 4624, 6047, 15608, 11815])
4
----> 5 df.plot(kind='bar')
e:\Anaconda3\lib\site-packages\pandas\plotting\_core.py in __call__(self, *args, **kwargs)
970 data.columns = label_name
971
--> 972 return plot_backend.plot(data, kind=kind, **kwargs)
973
974 __call__.__doc__ = __doc__
e:\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\__init__.py in plot(data, kind, **kwargs)
69 kwargs["ax"] = getattr(ax, "left_ax", ax)
70 plot_obj = PLOT_CLASSES[kind](data, **kwargs)
---> 71 plot_obj.generate()
72 plot_obj.draw()
73 return plot_obj.result
e:\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\core.py in generate(self)
284 def generate(self):
285 self._args_adjust()
--> 286 self._compute_plot_data()
287 self._setup_subplots()
288 self._make_plot()
e:\Anaconda3\lib\site-packages\pandas\plotting\_matplotlib\core.py in _compute_plot_data(self)
451 # no non-numeric frames or series allowed
452 if is_empty:
--> 453 raise TypeError("no numeric data to plot")
454
455 self.data = numeric_data.apply(self._convert_to_ndarray)
TypeError: no numeric data to plot
I wanted a bar graph that would be like this. However I don't know how to do it.
Imports and Sample DataFrame
import pandas as pd
import numpy as np # for sample data only
import string # for sample data only
# create sample dataframe for testing
np.random.seed(365)
rows = 1100
data = {'Actual': np.random.choice(list(string.ascii_uppercase), size=rows),
'Predicted': np.random.choice(list(string.ascii_uppercase), size=rows)}
df = pd.DataFrame(data)
Calculations and Plotting
Updated
The following implementation is more succinct; unnecessary steps have been removed.
1. Create a Boolean 'Match' column depending on whether 'Predicted' matches 'Actual'.
2. .groupby on 'Actual', aggregate with .mean(), multiply by 100, and round, to get the percent.
   The mean of each letter's group is the sum of the Booleans divided by the count. For 'A', the sum is 1, because there is 1 True, which is divided by the total count of the group, 33. Therefore, 1/33 = 0.030303030303030304
3. Plot the bar for the selected data with pandas.DataFrame.plot
Note that steps (1) and (2) can be reduced and combined to the following:
dfa = df.Predicted.eq(df.Actual).groupby(df.Actual).mean().mul(100).round(2)
# determine where Predicted equals Actual
df['Match'] = df.Predicted.eq(df.Actual)
# display(df.head())
Actual Predicted Match
0 S Z False
1 U J False
2 B L False
3 M V False
4 F C False
# groupby and get percent
dfa = df.groupby('Actual').Match.mean().mul(100).round(2)
# display(dfa.head())
Actual
A 3.03
B 2.63
C 4.44
D 6.82
E 5.77
Name: Match, dtype: float64
# plot
ax = dfa.plot(kind='bar', x='Actual', y='%', rot=0, legend=False, grid=True, figsize=(8, 5),
ylabel='Percent %', xlabel='Letter', title='Accuracy Rate % per letter')
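The arithmetic behind the .mean() step can be checked on a tiny hand-made example (five rows made up for this sketch, not the sample data above): the mean of a Boolean group is its number of True values divided by its size.

```python
import pandas as pd

# Made-up data: 'A' is right 1 time out of 2, 'B' is right 2 times out of 3.
df = pd.DataFrame({"Actual": list("AABBB"), "Predicted": list("ACBBA")})

match = df["Predicted"].eq(df["Actual"])

# Mean of Booleans per group ...
by_mean = match.groupby(df["Actual"]).mean()

# ... is the same as sum of True values divided by the group count.
grouped = match.groupby(df["Actual"])
by_hand = grouped.sum() / grouped.count()

percent = by_mean.mul(100).round(2)
print(percent.to_dict())
```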
Original Code
This works as well
# determine where Predicted equals Actual and convert to an int; True = 1 and False = 0
df['Match'] = df.Predicted.eq(df.Actual).astype(int)
# get the normalized value counts
dfg = df.groupby('Actual').Match.value_counts(normalize=True).mul(100).round(2).reset_index(name='%')
# get the accuracy scores where there is a Match
df_accuracy = dfg[dfg.Match.eq(1)]
# display(df_accuracy.head())
Actual Match %
1 A 1 3.03
3 B 1 2.63
5 C 1 4.44
7 D 1 6.82
9 E 1 5.77
# plot
ax = df_accuracy.plot(kind='bar', x='Actual', y='%', rot=0, legend=False, grid=True, figsize=(8, 5),
ylabel='Percent %', xlabel='Letter', title='Accuracy Rate % per letter')
I have simulated data similar to what you describe.
The graph is very simple to produce if you calculate the percentages first.
import numpy as np
import pandas as pd
# simulate some data...
df = pd.DataFrame(
{"Actual Letter": np.random.choice(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 200)}
).assign(
**{
"Predicted Letter": lambda d: d["Actual Letter"].apply(
lambda l: np.random.choice(
[l] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 1, p=tuple([0.95]+ [0.05/26]*26)
)[0]
)
}
)
# now just calc percentage of where actual and predicted are the same
# graph it...
df.groupby("Actual Letter").apply(lambda d: (d["Actual Letter"]==d["Predicted Letter"]).sum()/len(d)).plot(kind="bar")

Seaborn scatter plot from pandas dataframe colours based on third column

I have a pandas dataframe, with columns 'groupname', 'result', and 'temperature'. I've plotted a Seaborn swarmplot, where x='groupname' and y='result', which shows the results data separated into the groups.
What I also want to do is to colour the markers according to their temperature, using a colormap, so that for example the coldest are blue and hottest red.
Plotting the chart is very simple:
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
data = {'groupname': ['G0', 'G0', 'G0', 'G0', 'G1', 'G1', 'G1'], 'shot': [1, 2, 3, 4, 1, 2, 3], 'temperature': [20, 25, 35, 10, -20, -17, -6], 'result': [10.0, 10.1, 10.5, 15.0, 15.1, 13.5, 10.5]}
results = pd.DataFrame(data)
groupname shot temperature result
0 G0 1 20 10.0
1 G0 2 25 10.1
2 G0 3 35 10.5
3 G0 4 10 15.0
4 G1 1 -20 15.1
5 G1 2 -17 13.5
6 G1 3 -6 10.5
plt.figure()
sns.stripplot(data=results, x="groupname", y="result")
plt.show()
But now I'm stuck trying to colour the points, I've tried a few things like:
sns.stripplot(data=results, x="groupname", y="result", cmap=matplotlib.cm.get_cmap('Spectral'))
which doesn't seem to do anything.
Also tried:
sns.stripplot(data=results, x="groupname", y="result", hue='temperature')
which does colour the points depending on the temperature, however the colours are random rather than mapped.
I feel like there is probably a very simple way to do this, but haven't been able to find any examples.
Ideally looking for something like:
sns.stripplot(data=results, x="groupname", y="result", colorscale='temperature')
The keyword you are looking for is palette.
The following should work:
sns.stripplot(data=results, x="groupname", y="result", hue='temperature',palette="vlag")
http://man.hubwiz.com/docset/Seaborn.docset/Contents/Resources/Documents/generated/seaborn.stripplot.html
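If a continuous colour scale with a colorbar is wanted, one alternative is to draw the points with matplotlib directly. This is a different technique from the seaborn call above, sketched under the assumption that no jitter is needed; the 'coolwarm' colormap is just one choice that runs from blue (cold) to red (hot).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to show a window
import matplotlib.pyplot as plt
import pandas as pd

data = {'groupname': ['G0', 'G0', 'G0', 'G0', 'G1', 'G1', 'G1'],
        'shot': [1, 2, 3, 4, 1, 2, 3],
        'temperature': [20, 25, 35, 10, -20, -17, -6],
        'result': [10.0, 10.1, 10.5, 15.0, 15.1, 13.5, 10.5]}
results = pd.DataFrame(data)

# Turn the group names into integer x positions (G0 -> 0, G1 -> 1).
codes, groups = pd.factorize(results['groupname'])

fig, ax = plt.subplots()
# c= takes the numeric values, cmap= maps them onto the colour scale.
points = ax.scatter(codes, results['result'],
                    c=results['temperature'], cmap='coolwarm')
ax.set_xticks(range(len(groups)))
ax.set_xticklabels(groups)
ax.set_xlabel('groupname')
ax.set_ylabel('result')
fig.colorbar(points, ax=ax, label='temperature')
```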
