Grouped bar chart from two pandas data frames - python

I have two data frames containing different values but the same structure:
df1 =
0 1 2 3 4
D 0.003073 0.014888 0.155815 0.826224 NaN
E 0.000568 0.000435 0.000967 0.002956 0.067249
df2 =
0 1 2 3 4
D 0.746689 0.185769 0.060107 0.007435 NaN
E 0.764552 0.000000 0.070288 0.101148 0.053499
I want to plot both data frames in a single grouped bar chart. In addition, each row (index) should be a subplot.
This can be easily achieved for one of them using pandas directly:
df1.T.plot(kind="bar", subplots=True, layout=(2,1), width=0.7, figsize=(10,10), sharey=True)
I tried to join them using
pd.concat([df1, df2], axis=1)
which results in a new dataframe:
0 1 2 3 4 0 1 2 3 4
D 0.003073 0.014888 0.155815 0.826224 NaN 0.746689 0.185769 0.060107 0.007435 NaN
E 0.000568 0.000435 0.000967 0.002956 0.067249 0.764552 0.000000 0.070288 0.101148 0.053499
However, plotting the data frame with the above method does not group the bars per column but treats them separately. Per subplot this results in an x-axis with duplicated ticks in column order, e.g. 0,1,2,3,4,0,1,2,3,4.
Any ideas?

It is not exactly clear how the data is organized. Pandas and seaborn usually expect tidy datasets. Because you transpose the data prior to plotting, I assume you have two variables (A and B) and four observations (e.g. measurements):
df1 = pd.DataFrame.from_records(np.random.rand(2,4), index = ['A','B'])
df2 = pd.DataFrame.from_records(np.random.rand(2,4), index = ['A','B'])
df1.T
Maybe this is close to what you want:
df4 = pd.concat([df1.T, df2.T], axis=0, ignore_index=False)
df4['col'] = (len(df1.T)*(0,) + len(df2.T)*(1,))
df4.reset_index(inplace=True)
df4
Using seaborn's facet grid allows for convenient plotting (factorplot was renamed to catplot in seaborn 0.9):
sns.catplot(x='index', y='A', hue='col', kind='bar', data=df4)
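For the layout the question asks for (one subplot per row, with the bars of both frames grouped per column), a pandas-only variant is also possible. The following is a minimal sketch, not the answer's method: it assumes the frames from the question, labels them with the hypothetical names "df1" and "df2" via the keys argument of pd.concat, and selects a headless matplotlib backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# The two frames from the question.
df1 = pd.DataFrame(
    [[0.003073, 0.014888, 0.155815, 0.826224, np.nan],
     [0.000568, 0.000435, 0.000967, 0.002956, 0.067249]],
    index=["D", "E"],
)
df2 = pd.DataFrame(
    [[0.746689, 0.185769, 0.060107, 0.007435, np.nan],
     [0.764552, 0.000000, 0.070288, 0.101148, 0.053499]],
    index=["D", "E"],
)

# keys= adds an outer column level, so the duplicated labels 0..4 stay
# distinguishable as ("df1", 0), ..., ("df2", 4).
combined = pd.concat([df1, df2], axis=1, keys=["df1", "df2"])

fig, axes = plt.subplots(nrows=len(combined), ncols=1, figsize=(10, 10), sharey=True)
for ax, (row_label, row) in zip(axes, combined.iterrows()):
    # Each row is a Series indexed by (source, column). Unstacking the source
    # level gives a small frame: index 0..4, columns "df1" and "df2".
    # Plotting that frame as bars groups the two sources per column.
    row.unstack(level=0).plot(kind="bar", ax=ax, width=0.7, title=row_label)
fig.tight_layout()
```

Each subplot then has the ticks 0 to 4 once, with one bar per source frame at every tick. Call plt.show() or fig.savefig(...) to view the result.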

Related

Hue two panda series

I have two pandas series that I want to compare visually by plotting them on top of each other. I already tried the following:
>>> s1 = pd.Series([1,2,3,4,5])
>>> s2 = pd.Series([3,3,3,3,3])
>>> df = pd.concat([s1, s2], axis=1)
>>> sns.stripplot(data = df)
which yields the following picture:
Now, I am aware of the hue keyword of sns.stripplot, but applying it requires me to use the keywords x and y. I already tried to transform my data into a different dataframe like this
>>> df = pd.concat([pd.DataFrame({'data':s1, 'type':'s1'}), pd.DataFrame({'data':s2, 'type':'s2'})])
so I can "hue over" type; but even then I have no idea what to put for the keyword x (assuming y = 'data'). Ignoring the keyword x like that
>>> sns.stripplot(y='data', data=df, hue='type')
fails to hue anything:
seaborn generally works best with long-form data, so you might need to rearrange your dataframe slightly. The hue keyword is expecting a column, so we'll use .melt() to get one.
long_form = df.melt()
long_form['X'] = 1
sns.stripplot(data=long_form, x='X', y='value', hue='variable')
Will give you a plot that roughly reflects your requirements:
When we do pd.melt, we change the frame from having multiple columns of values to having a single column of values, with a "variable" column to identify which of our original columns they came from. We add in an 'X' column because stripplot needs both x and hue to work properly in this case. Our long_form dataframe, then, looks like this:
variable value X
0 0 1 1
1 0 2 1
2 0 3 1
3 0 4 1
4 0 5 1
5 1 3 1
6 1 3 1
7 1 3 1
8 1 3 1
9 1 3 1
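The same reshaping can carry readable labels instead of 0 and 1. This is a small sketch under the assumption that the series should be named "s1" and "s2" (the labels used in the question's own attempt); the column names "type" and "data" are likewise taken from the question, not from the answer.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([3, 3, 3, 3, 3])

# keys= names the two columns of the wide frame.
wide = pd.concat([s1, s2], axis=1, keys=["s1", "s2"])

# melt stacks the value columns into one column ("data") and records the
# originating column name in another ("type").
long_form = wide.melt(var_name="type", value_name="data")

# Constant x position so that all points share one strip.
long_form["X"] = 1
```

With this frame, the plotting call from the answer becomes sns.stripplot(data=long_form, x='X', y='data', hue='type').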

Plotting a stacked bar graph from a dataframe where all bars are of same color

I am struggling to plot a simple stacked bar graph. Currently I have a dataframe in which two columns share the same unique values, like:
df
col1 col2
0 1 2
1 1 2
2 2 1
3 3 3
4 5 5 ...
I want to plot a stacked bar graph for these two columns, with one bar per unique value.
I have tried to create two different arrays to plot using seaborn as below:
df1= df.col1.value_counts()
df2= df.col2.value_counts()
plt.figure(figsize=(18, 12))
sns.barplot(df1.index, df1.values, alpha = 0.8, color=p[0], label = 'event')
sns.barplot(df2.index, df2.values, alpha = 0.4,color=p[1], label = 'hour')
But I get
TypeError: 'float' object has no attribute '__getitem__'
Suggestions please?
Thanks
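No answer is recorded here, so the following is only one possible approach, sketched under assumptions: it uses the five sample rows shown in the question, counts each column separately, aligns the counts on the unique values, and lets pandas draw the stacked bars instead of calling sns.barplot twice. The headless backend line is there only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import pandas as pd

# The sample rows shown in the question.
df = pd.DataFrame({"col1": [1, 1, 2, 3, 5], "col2": [2, 2, 1, 3, 5]})

# One value_counts Series per column; building a DataFrame from the dict
# aligns them on the union of unique values. Values missing from one column
# become NaN, hence the fillna(0).
counts = pd.DataFrame({col: df[col].value_counts() for col in df.columns})
counts = counts.fillna(0).astype(int)

# One bar per unique value, with the col1 and col2 counts stacked.
ax = counts.plot(kind="bar", stacked=True, figsize=(18, 12))
ax.set_xlabel("value")
ax.set_ylabel("count")
```

Because pandas assigns a color per column, the two segments of each bar are colored differently without passing color explicitly.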

Boxplot of Multiple Columns of a Pandas Dataframe on the Same Figure (seaborn)

I feel I am probably not thinking of something obvious. I want to put in the same figure, the box plot of every column of a dataframe, where on the x-axis I have the columns' names. In the seaborn.boxplot() this would be equal to groupby by every column.
In pandas I would do
df = pd.DataFrame(data = np.random.random(size=(4,4)), columns = ['A','B','C','D'])
df.boxplot()
which yields
Now I would like to get the same thing in seaborn. But when I try sns.boxplot(df), I get only one grouped boxplot. How do I reproduce the same figure in seaborn?
The seaborn equivalent of
df.boxplot()
is
sns.boxplot(x="variable", y="value", data=pd.melt(df))
or just
sns.boxplot(data=df)
which will plot any column of numeric values, without converting the DataFrame from a wide to long format, using seaborn v0.11.1. This will create a single figure, with a separate boxplot for each column.
Complete example with melt:
import numpy as np; np.random.seed(42)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(data = np.random.random(size=(4,4)), columns = ['A','B','C','D'])
sns.boxplot(x="variable", y="value", data=pd.melt(df))
plt.show()
This works because pd.melt converts a wide-form dataframe
A B C D
0 0.374540 0.950714 0.731994 0.598658
1 0.156019 0.155995 0.058084 0.866176
2 0.601115 0.708073 0.020584 0.969910
3 0.832443 0.212339 0.181825 0.183405
to long-form
variable value
0 A 0.374540
1 A 0.156019
2 A 0.601115
3 A 0.832443
4 B 0.950714
5 B 0.155995
6 B 0.708073
7 B 0.212339
8 C 0.731994
9 C 0.058084
10 C 0.020584
11 C 0.181825
12 D 0.598658
13 D 0.866176
14 D 0.969910
15 D 0.183405
You could use the built-in pandas method df.plot(kind='box') as suggested in this question.
I realize this answer will not help you if you have to use seaborn, but it may be useful for people with simpler requirements.
import numpy as np; np.random.seed(42)
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data = np.random.random(size=(4,4)), columns = ['A','B','C','D'])
df.plot(kind='box')
plt.show()
plt.boxplot([df1,df2], boxprops=dict(color='red'), labels=['title 1','title 2'])
The rest of the answers are great and should work well for most use-cases.
But if someone has the same problem as I have, where the range of values is very large for one column (possibly a different scale) and nothing is visible for the other columns, you can use subplots to give each column its own y-axis within the figure.
# Store the list of columns
columns_to_plot = ['A', 'B', 'C', 'D']
# Create the figure with one subplot per column
fig, axes = plt.subplots(ncols=len(columns_to_plot))
# Create the boxplot with Seaborn
for column, axis in zip(columns_to_plot, axes):
    sns.boxplot(data=df[column], ax=axis)
    axis.set_title(column)
    # axis.set(xticklabels=[], xticks=[], ylabel=column)
# Show the plot
plt.tight_layout()
plt.show()
I have also added a commented out line for removing the redundant xticks and their labels, which looked really annoying to me, as well as setting the ylabel name.

pandas - a better way to plot binned x vs y

New to Pandas and I'm wondering if there's a better way to accomplish the following -
Set up:
import pandas as pd
import numpy as np
x = np.arange(0, 1, .01)
y = np.random.binomial(10, x, 100)
bins = 50
df = pd.DataFrame({'x':x, 'y':y})
print(df.head())
x y
0 -1 1
1 38 1
2 56 0
3 42 0
4 41 0
I would like to group the x values into equal size bins, and for each bin take the average value of both x and y.
my_bins = pd.cut(x, bins=20)
data = df[['x', 'y']].groupby(my_bins).agg(['mean', 'size'])
print(data.head())
x y
mean size mean size
age
(-1.101, 4.05] -1.000000 87990 0.768428 87990
(4.05, 9.1] NaN 0 NaN 0
(9.1, 14.15] NaN 0 NaN 0
(14.15, 19.2] 18.512286 1872 0.493590 1872
(19.2, 24.25] 22.768022 8906 0.496968 8906
Well that works. But from here, how do I plot x's mean vs y's mean? I know I can do something like
data.columns = data.columns.droplevel() # remove the multiple levels that were created
data.columns = ['x_mean', 'x_size', 'y_mean', 'y_size'] # manually set new column names
data.plot.scatter(x='x_mean', y='y_mean') # plot
But this feels wrong and clunky as I have to drop the column levels (which removes useful structure from my data) and I have to manually rename the columns. Is there a better way?
You can specify the x and y parameters pointing to the multi-level columns using tuples:
data.plot.scatter(x=('x', 'mean'), y=('y', 'mean'))
This way, you don't need to rename the columns in order to plot it.
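If flat column names are preferred anyway, named aggregation (available since pandas 0.25) produces them in the agg call itself, so nothing has to be dropped or renamed by hand. This is an alternative to the tuple approach, not part of the answer. The sketch assumes the x/y setup from the question with a seeded generator; the output names x_mean, y_mean and n are hypothetical, and the headless backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import numpy as np
import pandas as pd

# Setup as in the question, with a seeded generator for reproducibility.
rng = np.random.default_rng(0)
x = np.arange(0, 1, .01)
y = rng.binomial(10, x)
df = pd.DataFrame({"x": x, "y": y})

# Bin x into 20 equal-width intervals.
my_bins = pd.cut(df["x"], bins=20)

# Named aggregation: each keyword is an output column, each value is an
# (input column, aggregation function) pair.
binned = df.groupby(my_bins, observed=False).agg(
    x_mean=("x", "mean"),
    y_mean=("y", "mean"),
    n=("x", "size"),
)

ax = binned.plot.scatter(x="x_mean", y="y_mean")
```

The trade-off is that the two-level column structure is never created, so this suits cases where only a handful of aggregates are needed.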

Pandas plot subplots of a 'group by' result

I am struggling with my (poor) Pandas knowledge while trying to get a bar plot on a hierarchical index produced by a group by operation.
My data look like this
id, val, cat1, cat2
Then I create a hierarchical index:
df_group = df_len.groupby(['cat1','cat2'])
I would like to get one hbar plot per cat1 object, each listing all cat2 objects within that cat1 group together with their values.
None of my approaches worked:
df_group.plot(...)
for name, group in df_group: .... group.plot(...)
df_group.xs(...) experiments
The result should look somewhat like this one
I guess I just lack knowledge of the pandas and matplotlib internals, and it's not that difficult to plot a few hundred items (cat2<10, cat1=30).
I'd recommend using seaborn to do this type of faceted plot. Doing it in matplotlib is very tricky as the library is quite low level. Seaborn excels at this use case.
OK guys, here is how I finally solved it:
dfc = df_len.groupby(['cat1','cat2']).count().reset_index()
dfp=dfc.pivot(index="cat1",columns="cat2")
dfp.columns = dfp.columns.get_level_values(1)
dfp.plot(kind='bar', figsize=(15, 5), stacked=True);
In short: I used a pivot table to transpose my matrix, and then I was able to plot the single columns automatically, as in example 2 here.
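The pivot step can also feed the one-plot-per-cat1 layout the question asked for. The following is a rough sketch under assumptions: it uses a small hypothetical sample frame with the columns cat1, cat2 and val rather than the original df_len, plots val directly instead of counts, and selects a headless backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line in a notebook
import pandas as pd

# Hypothetical sample data with the column names from the question.
df = pd.DataFrame({
    "cat1": list("AAAABBBB"),
    "cat2": [1, 2, 3, 4] * 2,
    "val": [0.011887, 0.880121, 0.034244, 0.530230,
            0.510812, 0.405322, 0.406259, 0.406405],
})

# Pivot so that each cat1 value becomes a column and each cat2 value a row.
wide = df.pivot(index="cat2", columns="cat1", values="val")

# subplots=True draws one horizontal bar chart per column, i.e. per cat1.
axes = wide.plot(kind="barh", subplots=True, layout=(1, 2),
                 sharex=True, legend=False, figsize=(10, 4))
```

For the roughly 30 cat1 values mentioned in the question, the layout tuple would need to grow accordingly, e.g. layout=(6, 5).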
Not so tricky in matplotlib, see:
In [54]:
print(df)
cat1 cat2 val
0 A 1 0.011887
1 A 2 0.880121
2 A 3 0.034244
3 A 4 0.530230
4 B 1 0.510812
5 B 2 0.405322
6 B 3 0.406259
7 B 4 0.406405
In [55]:
col_list = ['r', 'g']
ax = plt.subplot(111)
for (idx, (grp, val)) in enumerate(df.groupby('cat1')):
    ax.bar(val.cat2+0.25*idx-0.25,
           val.val, width=0.25,
           color=col_list[idx],
           label=grp)
plt.legend()
