Pandas plot subplots of a 'group by' result - python

I am struggling with my (poor) pandas knowledge while trying to get a bar plot from a hierarchical index produced by a group-by operation.
My data look like this:
id, val, cat1, cat2
Then I create a hierarchical index:
df_group = df_len.groupby(['cat1','cat2'])
I would like to get one horizontal bar (hbar) plot per cat1 value that lists all cat2 values in that group along with their values.
None of my approaches worked:
df_group.plot(...)
for name, group in df_group: .... group.plot(...)
df_group.xs(...) experiments
The result should look somewhat like this one
I guess I just lack knowledge of the pandas and matplotlib internals, and it is not that difficult to plot a few hundred items (cat2 < 10, cat1 = 30).

I'd recommend using seaborn to do this type of faceted plot. Doing it in matplotlib is very tricky as the library is quite low level. Seaborn excels at this use case.

OK, here is how I finally solved it:
dfc = df_len.groupby(['cat1','cat2']).count().reset_index()
dfp=dfc.pivot(index="cat1",columns="cat2")
dfp.columns = dfp.columns.get_level_values(1)
dfp.plot(kind='bar', figsize=(15, 5), stacked=True);
In short: I used a pivot table to reshape my data so that each cat2 value became its own column, and then I was able to plot the individual columns automatically, as in example 2 here.
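For readers who want to try the pivot idea end to end, here is a minimal sketch. The sample data are made up (only the column names id, val, cat1, cat2 come from the question), and it uses size().unstack() as a shorter route to the same counts table, which is an assumption about what the plot should show rather than the exact code above.

```python
import pandas as pd

# Hypothetical sample data with the question's column layout: id, val, cat1, cat2
df_len = pd.DataFrame({
    "id":   range(8),
    "val":  [3, 1, 4, 1, 5, 9, 2, 6],
    "cat1": ["A", "A", "A", "B", "B", "B", "B", "C"],
    "cat2": ["x", "x", "y", "x", "y", "y", "z", "z"],
})

# Count rows per (cat1, cat2) pair and move cat2 into the columns
counts = df_len.groupby(["cat1", "cat2"]).size().unstack(fill_value=0)

# One stacked bar per cat1 value, one colour per cat2 value
ax = counts.plot(kind="bar", stacked=True, figsize=(15, 5))
ax.set_ylabel("number of rows")
```

Call plt.show() from matplotlib.pyplot (or run this in a notebook) to display the figure.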

Not so tricky in matplotlib, see:
In [54]:
print(df)
cat1 cat2 val
0 A 1 0.011887
1 A 2 0.880121
2 A 3 0.034244
3 A 4 0.530230
4 B 1 0.510812
5 B 2 0.405322
6 B 3 0.406259
7 B 4 0.406405
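In [55]:
import matplotlib.pyplot as plt
col_list = ['r', 'g']
ax = plt.subplot(111)
# Draw one set of bars per cat1 group, shifted sideways so the groups sit next to each other
for idx, (grp, val) in enumerate(df.groupby('cat1')):
    ax.bar(val.cat2 + 0.25*idx - 0.25,
           val.val, width=0.25,
           color=col_list[idx],
           label=grp)
plt.legend()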
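If you really want one horizontal bar plot per cat1 value, as the question asks, the same groupby loop can feed a grid of subplots. This is only a sketch: the values are made up, and the figure layout (one row per group, shared x-axis) is my own assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same toy layout as the answer above (values are made up)
df = pd.DataFrame({
    "cat1": ["A"] * 4 + ["B"] * 4,
    "cat2": [1, 2, 3, 4] * 2,
    "val":  [0.01, 0.88, 0.03, 0.53, 0.51, 0.41, 0.41, 0.41],
})

groups = df.groupby("cat1")
# squeeze=False keeps axes two-dimensional even if there is only one group
fig, axes = plt.subplots(nrows=groups.ngroups, sharex=True, squeeze=False)

# One horizontal bar chart per cat1 value, with cat2 on the y-axis
for ax, (name, group) in zip(axes.ravel(), groups):
    ax.barh(group["cat2"].astype(str), group["val"])
    ax.set_title(f"cat1 = {name}")
    ax.set_ylabel("cat2")

fig.tight_layout()
```

With 30 cat1 values you would probably want several columns of subplots and a larger figsize; add plt.show() to display the result.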

Related

Reshape dataframe and plot stacked bar graph

What I have
I have a frame df of the following style, where each row represents a malfunction that occurred with a specimen:
index specimen malfunction
1 'first' 'cracked'
2 'first' 'cracked'
3 'first' 'bent'
4 'second' 'bent'
5 'second' 'bent'
6 'second' 'bent'
7 'second' 'cracked'
8 'third' 'cracked'
9 'third' 'broken'
In the real dataset I have about 15 different specimens and about 10 different types of malfunction.
What I need
I want to plot a bar graph that shows how many malfunctions occurred with each specimen (so the x-axis holds the specimen label and the y-axis the number of malfunctions that occurred). I need a stacked bar chart, so the malfunctions must be separated by color.
What I tried
I tried to use seaborn's catplot(kind='count') which would be exactly what I need if only it could plot a stacked chart. Unfortunately it can't, and I can't figure out how to reshape my data to plot it using pandas.plot.bar(stacked=True)
Try something like this:
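import matplotlib.pyplot as plt
import pandas as pd
# Count rows per (specimen, malfunction) pair, then spread the malfunctions into columns.
# size() is used instead of count() so this also works when df has no other columns.
df = df.groupby(['specimen', 'malfunction']).size().unstack(fill_value=0)
This generates the following table:
malfunction  bent  broken  cracked
specimen
first           1       0        2
second          3       0        1
third           0       1        1
fig, ax = plt.subplots()
df.plot(kind='bar', stacked=True, ax=ax)
ax.legend(["bent", "broken", "cracked"]);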
The result is this graph:
The first step is to convert your categorical data to numeric codes:
import matplotlib.pyplot as plt
df_toPlot = df.copy()  # work on a copy so the original data in df stays unchanged
df_toPlot['mapMal'] = df_toPlot.malfunction.astype("category").cat.codes
This is the print of df_toPlot.
index specimen malfunction mapMal
0 1 first cracked 2
1 2 first cracked 2
2 3 first bent 0
3 4 second bent 0
4 5 second bent 0
5 6 second bent 0
6 7 second cracked 2
7 8 third cracked 2
8 9 third broken 1
df_toPlot.groupby(['specimen', 'mapMal']).size().unstack().plot(kind='bar', stacked=True)
plt.show()
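As a further option, pd.crosstab builds the same specimen-by-malfunction counts table in one call, without the intermediate category codes. A minimal sketch using the nine rows from the question:

```python
import pandas as pd

# The nine example rows from the question
df = pd.DataFrame({
    "specimen": ["first"] * 3 + ["second"] * 4 + ["third"] * 2,
    "malfunction": ["cracked", "cracked", "bent",
                    "bent", "bent", "bent", "cracked",
                    "cracked", "broken"],
})

# Rows: specimen, columns: malfunction, cells: number of occurrences
counts = pd.crosstab(df["specimen"], df["malfunction"])

# One stacked bar per specimen; the legend keeps the malfunction names
ax = counts.plot(kind="bar", stacked=True)
ax.set_ylabel("number of malfunctions")
```

Because the columns keep their original labels, the legend shows bent, broken and cracked rather than 0, 1 and 2. Add plt.show() from matplotlib.pyplot to display the chart.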

Hue two panda series

I have two pandas series that I want to compare visually by plotting them on top of each other. I already tried the following
>>> s1 = pd.Series([1,2,3,4,5])
>>> s2 = pd.Series([3,3,3,3,3])
>>> df = pd.concat([s1, s2], axis=1)
>>> sns.stripplot(data = df)
which yields the following picture:
Now, I am aware of the hue keyword of sns.stripplot, but applying it requires me to use the keywords x and y. I already tried to transform my data into a different dataframe like this
>>> df = pd.concat([pd.DataFrame({'data':s1, 'type':'s1'}), pd.DataFrame({'data':s2, 'type':'s2'})])
so that I can "hue over" type; but even then I have no idea what to pass for the keyword x (assuming y = 'data'). Omitting the keyword x like this
>>> sns.stripplot(y='data', data=df, hue='type')
fails to color anything by type:
seaborn generally works best with long-form data, so you might need to rearrange your dataframe slightly. The hue keyword is expecting a column, so we'll use .melt() to get one.
long_form = df.melt()
long_form['X'] = 1
sns.stripplot(data=long_form, x='X', y='value', hue='variable')
This will give you a plot that roughly matches your requirements:
When we call pd.melt, we change the frame from having multiple columns of values to having a single column of values, with a "variable" column identifying which of the original columns each value came from. We add an 'X' column because stripplot needs both x and hue to work properly in this case. Our long_form dataframe then looks like this:
variable value X
0 0 1 1
1 0 2 1
2 0 3 1
3 0 4 1
4 0 5 1
5 1 3 1
6 1 3 1
7 1 3 1
8 1 3 1
9 1 3 1
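The 0 and 1 in the variable column are just the default column labels from pd.concat. If you would rather see the series names in the legend, one possible tweak (the names s1, s2 and "series" are my own choice) is to name the columns before melting:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([3, 3, 3, 3, 3])

# Passing a dict to concat uses its keys as the column names
wide = pd.concat({"s1": s1, "s2": s2}, axis=1)

# Long form: one row per value, with a 'series' column saying where it came from
long_form = wide.melt(var_name="series", value_name="value")
long_form["X"] = 1  # dummy x position shared by all points

# long_form can then be passed to seaborn, for example:
# sns.stripplot(data=long_form, x="X", y="value", hue="series")
```

The stripplot call is left as a comment so the snippet runs with pandas alone.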

Boxplot of Multiple Columns of a Pandas Dataframe on the Same Figure (seaborn)

I feel I am probably not thinking of something obvious. I want to put the box plot of every column of a dataframe in the same figure, with the columns' names on the x-axis. In seaborn.boxplot() this would be equivalent to grouping by every column.
In pandas I would do
df = pd.DataFrame(data = np.random.random(size=(4,4)), columns = ['A','B','C','D'])
df.boxplot()
which yields
Now I would like to get the same thing in seaborn. But when I try sns.boxplot(df), I get only one grouped boxplot. How do I reproduce the same figure in seaborn?
The seaborn equivalent of
df.boxplot()
is
sns.boxplot(x="variable", y="value", data=pd.melt(df))
or just
sns.boxplot(data=df)
which will plot any column of numeric values, without converting the DataFrame from a wide to long format, using seaborn v0.11.1. This will create a single figure, with a separate boxplot for each column.
Complete example with melt:
import numpy as np; np.random.seed(42)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(data = np.random.random(size=(4,4)), columns = ['A','B','C','D'])
sns.boxplot(x="variable", y="value", data=pd.melt(df))
plt.show()
This works because pd.melt converts a wide-form dataframe
A B C D
0 0.374540 0.950714 0.731994 0.598658
1 0.156019 0.155995 0.058084 0.866176
2 0.601115 0.708073 0.020584 0.969910
3 0.832443 0.212339 0.181825 0.183405
to long-form
variable value
0 A 0.374540
1 A 0.156019
2 A 0.601115
3 A 0.832443
4 B 0.950714
5 B 0.155995
6 B 0.708073
7 B 0.212339
8 C 0.731994
9 C 0.058084
10 C 0.020584
11 C 0.181825
12 D 0.598658
13 D 0.866176
14 D 0.969910
15 D 0.183405
You could use the built-in pandas method df.plot(kind='box') as suggested in this question.
I realize this answer will not help you if you have to use seaborn, but it may be useful for people with simpler requirements.
import numpy as np; np.random.seed(42)
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data = np.random.random(size=(4,4)), columns = ['A','B','C','D'])
df.plot(kind='box')
plt.show()
plt.boxplot([df1,df2], boxprops=dict(color='red'), labels=['title 1','title 2'])
The rest of the answers are great and should work well for most use-cases.
But if the range of values in one column is much larger than in the others (possibly because it is on a different scale), so that you cannot see anything for the other columns, you can use subplots to give each column its own y-axis within the figure.
# Store the list of columns
columns_to_plot = ['A', 'B', 'C', 'D']
# Create the figure with one subplot (and so one y-axis) per column
fig, axes = plt.subplots(ncols=len(columns_to_plot))
# Create the boxplots with seaborn
for column, axis in zip(columns_to_plot, axes):
    sns.boxplot(data=df[column], ax=axis)
    axis.set_title(column)
    # axis.set(xticklabels=[], xticks=[], ylabel=column)
# Show the plot
plt.tight_layout()
plt.show()
I have also added a commented out line for removing the redundant xticks and their labels, which looked really annoying to me, as well as setting the ylabel name.

Grouped bar chart from two pandas data frames

I have two data frames containing different values but the same structure:
df1 =
0 1 2 3 4
D 0.003073 0.014888 0.155815 0.826224 NaN
E 0.000568 0.000435 0.000967 0.002956 0.067249
df2 =
0 1 2 3 4
D 0.746689 0.185769 0.060107 0.007435 NaN
E 0.764552 0.000000 0.070288 0.101148 0.053499
I want to plot both data frames in a single grouped bar chart. In addition, each row (index) should be a subplot.
This can be easily achieved for one of them using pandas directly:
df1.T.plot(kind="bar", subplots=True, layout=(2,1), width=0.7, figsize=(10,10), sharey=True)
I tried to join them using
pd.concat([df1, df2], axis=1)
which results in a new dataframe:
0 1 2 3 4 0 1 2 3 4
D 0.003073 0.014888 0.155815 0.826224 NaN 0.746689 0.185769 0.060107 0.007435 NaN
E 0.000568 0.000435 0.000967 0.002956 0.067249 0.764552 0.000000 0.070288 0.101148 0.053499
However, plotting the data frame with the above method does not group the bars per column but treats them separately. In each subplot this results in an x-axis with duplicated ticks in the order of the columns, e.g. 0,1,2,3,4,0,1,2,3,4.
Any ideas?
It is not entirely clear how the data is organized. Pandas and seaborn usually expect tidy datasets. Because you transpose the data prior to plotting, I assume you have two variables (A and B) and four observations (e.g. measurements):
df1 = pd.DataFrame.from_records(np.random.rand(2,4), index = ['A','B'])
df2 = pd.DataFrame.from_records(np.random.rand(2,4), index = ['A','B'])
df1.T
Maybe this is close to what you want:
df4 = pd.concat([df1.T, df2.T], axis=0, ignore_index=False)
df4['col'] = (len(df1.T)*(0,) + len(df2.T)*(1,))
df4.reset_index(inplace=True)
df4
Using seaborn's facet grid allows for convenient plotting (factorplot was renamed to catplot in seaborn 0.9):
sns.catplot(x='index', y='A', hue='col', kind='bar', data=df4)
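If you would rather stay with pandas and matplotlib, a possible sketch is to concatenate the two frames under an outer column level and then draw one grouped bar chart per row. The keys "df1" and "df2" and the layout settings are my own assumptions; the numbers are the ones from the question.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [[0.003073, 0.014888, 0.155815, 0.826224, np.nan],
     [0.000568, 0.000435, 0.000967, 0.002956, 0.067249]],
    index=["D", "E"],
)
df2 = pd.DataFrame(
    [[0.746689, 0.185769, 0.060107, 0.007435, np.nan],
     [0.764552, 0.000000, 0.070288, 0.101148, 0.053499]],
    index=["D", "E"],
)

# Put the frames side by side under an outer column level naming the source
combined = pd.concat({"df1": df1, "df2": df2}, axis=1)

fig, axes = plt.subplots(nrows=len(combined), sharey=True, figsize=(10, 10))
for ax, (row, values) in zip(axes, combined.iterrows()):
    # values is indexed by (source, column); move the source level into columns
    # so that each original column gets one bar per source frame
    values.unstack(level=0).plot(kind="bar", ax=ax, width=0.7, title=row)

fig.tight_layout()
```

Each subplot then has the ticks 0 to 4 only once, with a df1 bar and a df2 bar at every tick. Add plt.show() to display the figure.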

Color coding or labelling the scatter plot of a pandas dataframe?

I have a data frame that I am plotting in pandas:
import pandas as pd
df = pd.read_csv('Test.csv')
df.plot.scatter(x='x',y='y')
the data frame has 3 columns
x y result
0 2 5 Good
1 3 2 Bad
2 4 1 Bad
3 1 1 Good
4 2 23 Bad
5 1 34 Good
I want to format the scatter plot such that each point is green if df['result'] == 'Good' and red if df['result'] == 'Bad'.
Can I do that using the pandas plot methods, or is there a way of doing it using pyplot?
df.plot.scatter('x', 'y', c=df.result.map(dict(Good='green', Bad='red')))
One approach is to plot twice on the same axes. First we plot only the "good" points, then we plot only the "bad". The trick is to use the ax keyword to the scatter method, as such:
ax = df[df.result == 'Good'].plot.scatter('x', 'y', color='green')
df[df.result == 'Bad'].plot.scatter('x', 'y', ax=ax, color='red')
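If you also want a legend, a pyplot variant of the same idea is to loop over the groups and give each scatter call a label. A minimal sketch with the six rows from the question (the colors dictionary is my own helper):

```python
import matplotlib.pyplot as plt
import pandas as pd

# The six example rows from the question
df = pd.DataFrame({
    "x": [2, 3, 4, 1, 2, 1],
    "y": [5, 2, 1, 1, 23, 34],
    "result": ["Good", "Bad", "Bad", "Good", "Bad", "Good"],
})

colors = {"Good": "green", "Bad": "red"}

fig, ax = plt.subplots()
# One scatter call per result value, so each gets its own legend entry
for result, group in df.groupby("result"):
    ax.scatter(group["x"], group["y"], color=colors[result], label=result)

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(title="result")
```

This scales to more than two result values by extending the dictionary; add plt.show() to display the plot.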
