Reshape dataframe and plot stacked bar graph - python

What I have
I have a DataFrame df of the following form, where each row represents a malfunction that occurred with a specimen:
index specimen malfunction
1 'first' 'cracked'
2 'first' 'cracked'
3 'first' 'bent'
4 'second' 'bent'
5 'second' 'bent'
6 'second' 'bent'
7 'second' 'cracked'
8 'third' 'cracked'
9 'third' 'broken'
In the real dataset I have about 15 different specimens and about 10 different types of malfunction.
What I need
I want to plot a bar graph that shows how many malfunctions occurred with each specimen (x-axis for the specimen label, y-axis for the number of malfunctions). I need a stacked bar chart, so the malfunctions must be separated by color.
What I tried
I tried to use seaborn's catplot(kind='count'), which would be exactly what I need if only it could plot a stacked chart. Unfortunately it can't, and I can't figure out how to reshape my data to plot it using DataFrame.plot.bar(stacked=True).

Try something like this:
import matplotlib.pyplot as plt
import pandas as pd
# Count the rows for each (specimen, malfunction) pair and move the malfunctions into columns
df = df.groupby(['specimen', 'malfunction']).size().unstack(fill_value=0)
This generates the following table:
Generated table
fig, ax = plt.subplots()
df.plot(kind='bar', stacked=True, ax=ax)
ax.legend(["bent", "broken", "cracked"]);
The result is this graph:
Result
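As a possible alternative, the same reshaping can be sketched with pd.crosstab, which counts the combinations directly. The DataFrame below is rebuilt from the sample rows in the question; the axis label is my own choice, not part of the original answer.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data copied from the question.
df = pd.DataFrame({
    "specimen": ["first"] * 3 + ["second"] * 4 + ["third"] * 2,
    "malfunction": ["cracked", "cracked", "bent",
                    "bent", "bent", "bent", "cracked",
                    "cracked", "broken"],
})

# Rows: specimen, columns: malfunction, cells: number of occurrences.
# Combinations that never occur are filled with 0.
counts = pd.crosstab(df["specimen"], df["malfunction"])

# One bar per specimen, one colored segment per malfunction type.
ax = counts.plot(kind="bar", stacked=True, rot=0)
ax.set_ylabel("number of malfunctions")
```

Call plt.show() afterwards to display the chart when running this as a script.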

The first step is to convert your categorical data to numeric codes:
import matplotlib.pyplot as plt
df_toPlot = df.copy()  # work on a copy so the original data in df stays unchanged
df_toPlot['mapMal'] = df_toPlot.malfunction.astype("category").cat.codes
This is the printed df_toPlot:
index specimen malfunction mapMal
0 1 first cracked 2
1 2 first cracked 2
2 3 first bent 0
3 4 second bent 0
4 5 second bent 0
5 6 second bent 0
6 7 second cracked 2
7 8 third cracked 2
8 9 third broken 1
df_toPlot.groupby(['specimen', 'mapMal']).size().unstack().plot(kind='bar', stacked=True)
plt.show()
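One side effect of plotting the codes is that the legend shows 0, 1, 2 instead of the malfunction names. A minimal sketch of how the names could be restored, assuming the sample data from the question (code_to_label and counts are names I made up for illustration):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "specimen": ["first"] * 3 + ["second"] * 4 + ["third"] * 2,
    "malfunction": ["cracked", "cracked", "bent",
                    "bent", "bent", "bent", "cracked",
                    "cracked", "broken"],
})

malfunction = df["malfunction"].astype("category")

# cat.codes are positions in cat.categories, so enumerating the
# categories recovers the code -> label mapping.
code_to_label = dict(enumerate(malfunction.cat.categories))

counts = (
    df.assign(mapMal=malfunction.cat.codes)
      .groupby(["specimen", "mapMal"])
      .size()
      .unstack(fill_value=0)
      .rename(columns=code_to_label)  # put the readable names back
)
```

Plotting counts with plot(kind='bar', stacked=True) should then give a legend with the original labels.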

Related

Python Pandas. Describe() by date

I would like to plot summary statistics over time for panel data. The X axis would be time and the Y axis would be the variable of interest with lines for Mean, min/max, P25, P50, P75 etc.
This would basically loop through and calc the stats for each date over all the individual observations and then plot them.
What I am trying to do is similar to the example below, but the x axis would be dates instead of the column numbers.
import numpy as np
import pandas as pd
# Create random data
rd = pd.DataFrame(np.random.randn(100, 10))
rd.describe().T.drop('count', axis=1).plot()
In my dataset, the time series of the individuals are stacked on top of one another.
I tried running the following, but I seem to get the descriptive stats of the entire dataset, not broken down by date.
rd = rd.groupby('period').count().describe()
print (rd)
rd.show()
Using the dataframe below as the example:
df = pd.DataFrame({'Values':[10,20,30,20,40,60,40,80,120],'period': [1,2,3,1,2,3,1,2,3]})
df
Values period
0 10 1
1 20 2
2 30 3
3 20 1
4 40 2
5 60 3
6 40 1
7 80 2
8 120 3
Now, plot the descriptive statistics by date using groupby:
df.groupby('period').describe()['Values'].drop('count', axis = 1).plot()
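If only a few statistics are wanted, or quantiles other than the ones describe() reports, a rough sketch along these lines might work. It reuses the example frame above; the column names p25 and p75 are my own labels.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Values": [10, 20, 30, 20, 40, 60, 40, 80, 120],
                   "period": [1, 2, 3, 1, 2, 3, 1, 2, 3]})

grouped = df.groupby("period")["Values"]

# One row per period, one column per statistic.
stats = grouped.agg(["mean", "min", "median", "max"])
stats["p25"] = grouped.quantile(0.25)
stats["p75"] = grouped.quantile(0.75)

# Each column becomes one line, with period on the x axis.
ax = stats.plot()
```

With real dates in the grouping column, the x axis would show the dates instead of 1 to 3.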

Plotting a stacked bar graph from a dataframe where all bars are of same color

I am struggling to plot a simple stacked bar graph. Currently I have a dataframe with the same unique values in two columns, like:
df
col1 col2
0 1 2
1 1 2
2 2 1
3 3 3
4 5 5 ...
I want to plot a stacked bar graph for these two columns, with each unique value shown as a single bar.
I have tried to create two different arrays to plot using seaborn as below:
df1= df.col1.value_counts()
df2= df.col2.value_counts()
plt.figure(figsize=(18, 12))
sns.barplot(df1.index, df1.values, alpha = 0.8, color=p[0], label = 'event')
sns.barplot(df2.index, df2.values, alpha = 0.4,color=p[1], label = 'hour')
But I get
TypeError: 'float' object has no attribute '__getitem__'
Suggestions please?
Thanks
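One possible approach, sketched here without seaborn and assuming the five sample rows shown above: count the values of each column, line the counts up side by side, and let pandas stack them.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows copied from the question (the real frame is longer).
df = pd.DataFrame({"col1": [1, 1, 2, 3, 5],
                   "col2": [2, 2, 1, 3, 5]})

# Rows: unique values, columns: col1/col2, cells: how often the value occurs.
counts = (
    pd.concat([df["col1"].value_counts(), df["col2"].value_counts()],
              axis=1, keys=["col1", "col2"])
      .fillna(0)
      .astype(int)
      .sort_index()
)

# Each column gets its own color by default, so the segments stay distinguishable.
ax = counts.plot(kind="bar", stacked=True, rot=0)
```

This does not explain the TypeError itself, which depends on what p is in the original code.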

Pandas DataFrame - How to make a stacked area graph stack (matplotlib)

I am trying to convert data in a pandas DataFrame into a stacked area graph but cannot seem to get it to stack.
The data is in the format
index | datetime (yyyy/mm/dd) | name | weight_change
With 6 different people each measured daily.
I want the stacked graph to show the weight_change (y) over the datetime (x), with the weight_change for each of the 6 people stacked on top of each other.
The closest I have been able to get to it is with:
df = df.groupby(['datetime', 'name'], as_index=False).agg({'weight_change': 'sum'})
agg = df.groupby('datetime').sum()
agg.plot.area()
This produces the area graph for the aggregate of the weight_change values (the sum of each person's weight_change for each day), but I can't figure out how to split this up for each person like the different values here:
I have tried various things with no luck. Any ideas?
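A simplified version of your data:
import numpy as np
import pandas as pd
# list(range(4)) * 2 repeats the days; plain range(4) * 2 only works in Python 2
df = pd.DataFrame(dict(days=list(range(4)) * 2,
                       change=np.random.rand(8) * 2.,
                       name=['John'] * 4 + ['Jane'] * 4))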
df:
change days name
0 0.238336 0 John
1 0.293901 1 John
2 0.818119 2 John
3 1.567114 3 John
4 1.295725 0 Jane
5 0.592008 1 Jane
6 0.674388 2 Jane
7 1.763043 3 Jane
Now we can simply use pyplot's stackplot:
import matplotlib.pyplot as plt
days = df.days[df.name == 'John']
plt.stackplot(days, df.change[df.name == 'John'],
              df.change[df.name == 'Jane'])
This produces the following plot:
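With six people, selecting each series by hand gets tedious. A sketch of an alternative, assuming the same simplified column names (days, change, name): pivot to one column per person and use the pandas area plot, which stacks by default.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
df = pd.DataFrame({
    "days": list(range(4)) * 2,
    "change": rng.random(8) * 2.0,
    "name": ["John"] * 4 + ["Jane"] * 4,
})

# Wide layout: one row per day, one column per person.
wide = df.pivot(index="days", columns="name", values="change")

# DataFrame.plot.area stacks the columns on top of each other by default.
ax = wide.plot.area()
```

Note that a stacked pandas area plot expects each column to be all positive or all negative, so real weight_change data with mixed signs would need stacked=False or some preprocessing.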

Why can't I set the y-axis range on a plot produced from a Pandas Series?

I'm trying to create a bar graph where the y-axis ranges from 0% - 100% using matplotlib and pandas. The range I get is only 0% - 50%. Now, since all of my bars top out at ~10%, this isn't disastrous. It's just frustrating and may interfere with comparisons to other plots with the complete range.
The code I'm using is (roughly) as follows:
from matplotlib import pyplot as plt
import pandas as pd
labels = list(cm.index)  # Where cm is a DataFrame
for curr in sorted(labels):
    xa = cm[curr]  # Pulls 1 column out of DataFrame to be plotted
    xplt = xa.plot(kind='bar', rot = 0, ylim = (0,1))
    xplt.set_yticklabels(['{:3.0f}%'.format(x*10) for x in range(11)])
    plt.show()
Is there anything obviously wrong or missing?
A sample of a plot I get is this:
Oddly, when I omit the set_yticklabels statement, I get this:
I now realize that the first graph is not just oddly scaled, but is also giving incorrect results. The values shown in the 2nd graph are the correct ones. I guess the error is in the set_yticklabels statement, but I have no idea what it could be.
Looks like the keyword ylim works fine for pandas.DataFrame.plot.bar():
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(10, 2)), columns=['low', 'high'])
df.high = df.high * 10
low high
0 3 10
1 2 0
2 7 20
3 3 90
4 7 60
5 0 40
6 1 0
7 3 70
8 1 80
9 6 90
for col in df:
    df[col].plot.bar(ylim=(0, 100))
gives:
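As for the mislabeled axis in the question: set_yticklabels assigns the given strings to the existing tick positions in order and does not move the ticks, so with ticks at 0, 0.2, ..., 1.0 the first six labels (0% to 50%) would likely end up on them. A sketch that formats the ticks instead of replacing the labels, using a made-up Series in place of the cm column:

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# Hypothetical fractions standing in for one column of cm.
s = pd.Series([0.05, 0.10, 0.08], index=["a", "b", "c"])

ax = s.plot(kind="bar", rot=0, ylim=(0, 1))

# Render each tick value as a percentage; xmax=1.0 means that 1.0 is 100%.
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1.0))
```

Because the formatter derives the label from the tick value, the labels stay correct whatever tick positions matplotlib picks.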

Pandas plot subplots of a 'group by' result

I am struggling with my (poor) pandas knowledge as I try to get a bar plot on a hierarchical index produced by a group-by operation.
My data look like this
id, val, cat1, cat2
Then I create a hierarchical index:
df_group = df_len.groupby(['cat1','cat2'])
I would like to get a hbar plot per cat1 object that lists all cat2 objects that lists the values of all cat1 objects within.
None of my approaches worked:
df_group.plot(...)
for name, group in df_group: .... group.plot(...)
df_group.xs(...) experiments
The result should look somewhat like this one
I guess I just lack knowledge of the pandas and matplotlib internals, and it's not that difficult to plot a few hundred items (cat2<10, cat1=30).
I'd recommend using seaborn to do this type of faceted plot. Doing it in matplotlib is very tricky as the library is quite low level. Seaborn excels at this use case.
OK, so here is how I finally solved it:
dfc = df_len.groupby(['cat1','cat2']).count().reset_index()
dfp=dfc.pivot(index="cat1",columns="cat2")
dfp.columns = dfp.columns.get_level_values(1)
dfp.plot(kind='bar', figsize=(15, 5), stacked=True);
In short: I used a pivot table to transpose my matrix, and then I was able to plot the individual columns automatically, as in example 2 here.
Not so tricky in matplotlib, see:
print(df)
cat1 cat2 val
0 A 1 0.011887
1 A 2 0.880121
2 A 3 0.034244
3 A 4 0.530230
4 B 1 0.510812
5 B 2 0.405322
6 B 3 0.406259
7 B 4 0.406405
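import matplotlib.pyplot as plt
col_list = ['r', 'g']
ax = plt.subplot(111)
# One set of bars per cat1 group, shifted sideways so the groups sit next to each other
for idx, (grp, val) in enumerate(df.groupby('cat1')):
    ax.bar(val.cat2 + 0.25 * idx - 0.25,
           val.val, width=0.25,
           color=col_list[idx],
           label=grp)
plt.legend()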
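If one horizontal bar plot per cat1 value is wanted, as in the question, a sketch along these lines might be closer. It reuses the sample values printed above and relies on the subplots option of DataFrame.plot; wide and axes are names chosen for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample values copied from the printed frame above.
df = pd.DataFrame({
    "cat1": list("AAAABBBB"),
    "cat2": [1, 2, 3, 4] * 2,
    "val": [0.011887, 0.880121, 0.034244, 0.530230,
            0.510812, 0.405322, 0.406259, 0.406405],
})

# Wide layout: one row per cat2, one column per cat1.
wide = df.pivot(index="cat2", columns="cat1", values="val")

# subplots=True draws every column (every cat1) in its own axes.
axes = wide.plot(kind="barh", subplots=True, sharex=True, legend=False)
```

With around 30 cat1 values, the layout and figsize arguments of plot would probably need tuning to keep the panels readable.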
