Pandas DataFrame - How to make a stacked area graph stack (matplotlib) - python

I am trying to convert data in a pandas DataFrame into a stacked area graph but cannot get it to stack.
The data is in the format
index | datetime (yyyy/mm/dd) | name | weight_change
There are 6 different people, each measured daily.
I want the stacked graph to show the weight_change (y) over the datetime (x), with the weight_change for each of the 6 people stacked on top of each other.
The closest I have been able to get is with:
df = df.groupby(['datetime', 'name'], as_index=False).agg({'weight_change': 'sum'})
agg = df.groupby('datetime').sum()
agg.plot.area()
This produces the area graph for the aggregate of the weight_change values (the sum of each person's weight_change for each day), but I can't figure out how to split this up for each person.
I have tried various things with no luck. Any ideas?

A simplified version of your data:
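import numpy as np
import pandas as pd
# list(range(4)) is needed on Python 3, where a range cannot be multiplied
df = pd.DataFrame(dict(days=list(range(4))*2,
                       change=np.random.rand(8)*2.,
                       name=['John']*4 + ['Jane']*4))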
df:
     change  days  name
0  0.238336     0  John
1  0.293901     1  John
2  0.818119     2  John
3  1.567114     3  John
4  1.295725     0  Jane
5  0.592008     1  Jane
6  0.674388     2  Jane
7  1.763043     3  Jane
Now we can simply use pyplot's stackplot:
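import matplotlib.pyplot as plt
# Both people are measured on the same days, so one x vector serves both series
days = df.days[df.name == 'John']
plt.stackplot(days, df.change[df.name == 'John'],
              df.change[df.name == 'Jane'],
              labels=['John', 'Jane'])
plt.legend(loc='upper left')
plt.show()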
This produces a stacked area plot with one band per person.
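With six people, selecting each series by hand gets tedious. Here is a minimal sketch of an alternative, assuming the same column names as the simplified frame above (days, change, name): pivot the long table so that each person becomes a column, then let pandas stack the columns. The helper name plot_stacked_area is made up for illustration and is not part of any library.

```python
import numpy as np
import pandas as pd

# Simplified data in the same shape as above (values are random).
df = pd.DataFrame({
    "days": list(range(4)) * 2,
    "change": np.random.rand(8) * 2.0,
    "name": ["John"] * 4 + ["Jane"] * 4,
})

# Long -> wide: one row per day, one column per person.
wide = df.pivot(index="days", columns="name", values="change")
print(wide)


def plot_stacked_area(frame):
    """Hypothetical helper: draw one stacked band per column of `frame`."""
    import matplotlib.pyplot as plt

    ax = frame.plot.area()  # pandas area plots are stacked by default
    ax.set_xlabel("days")
    ax.set_ylabel("change")
    plt.show()
    return ax
```

Calling plot_stacked_area(wide) should then draw the chart. Two caveats: pivot needs each (day, name) pair to be unique, so duplicated pairs would first have to be summed (for example with pivot_table and aggfunc='sum'), and as far as I know pandas refuses to stack a column that mixes positive and negative values, which may matter for real weight_change data.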


Incorrect labels for bars in bar plot

I'm taking a biostatistics class and we've been asked to manipulate some data from a CSV into various different types of plots. I'm having issues getting each bar on a bar plot to show the correct categorical variable. I'm following an example the professor provided and not getting what I want. I'm totally new to this, so my apologies for formatting errors.
I've created the dataframe variable and am now trying to plot it as a bar graph (and later on other variables in the CSV as other types of plots). Not sure if I'm providing the code in the correct manner, but here's what I have so far. We're supposed to create a bar plot of PET using the number of cases (number of each pet/type of pet).
This is the data for this particular question. In the CSV it's shown as just the type of pet each student has (not sure how to share the CSV, but if it'd help I can post it).
I'm editing the post to show the code I've run to get the plot, and include the CSV info (hope I'm doing this right):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
HW2 = pd.read_csv("/Path/to/file")
HW2Grouped = HW2.groupby('Pet').count()
HW2Grouped['Pet'] = HW2Grouped.index
HW2Grouped.columns = ['Pet', 'Count', 'col_1', 'col_2', 'col_3', 'col_4']
%matplotlib inline
HW2bar = HW2Grouped.plot.bar(x = 'Pet', y = 'Count', title = "Pet count for students")
HW2bar.set_xlabel('Pet Type')
t = HW2bar.set_ylabel('Count')
This is the data I have to work with (sorry it's just a screenshot).
This is the bar plot I got from the code I ran.
It seems to me that when you added a new column, Pet, it became the new last column. Then you renamed columns of the HW2Grouped, and the first column (where the results of count aggregation are) was renamed to Pet, and the actual Pet column became col_4.
Let me now trace back through the steps you tried, to make it clear what went wrong.
When you grouped your DataFrame with this code:
HW2Grouped = HW2.groupby('Pet').count()
You received this:
       Height  Ice Cream  n of letters  Favorite TA  Minutes to Hometown
Pet
Cat         1          1             1            1                    1
Dog        17         17            17           17                   17
Horse       2          2             2            2                    2
None        4          4             4            4                    4
After you added a new column Pet to HW2Grouped (which you might have thought was creating a variable), it started to look like this:
       Height  Ice Cream  n of letters  Favorite TA  Minutes to Hometown    Pet
Pet
Cat         1          1             1            1                    1    Cat
Dog        17         17            17           17                   17    Dog
Horse       2          2             2            2                    2  Horse
None        4          4             4            4                    4   None
Then, when you changed the .columns attribute, your grouped DataFrame became like this:
       Pet  Count  col_1  col_2  col_3  col_4
Pet
Cat      1      1      1      1      1    Cat
Dog     17     17     17     17     17    Dog
Horse    2      2      2      2      2  Horse
None     4      4      4      4      4   None
Then, when plotting HW2Grouped, you passed Pet as x, but after the renaming the column called Pet no longer held the pet names: it was the former Height column, which holds counts. This led to the wrong bar labels.
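The positional renaming can be reproduced on a tiny frame. This is only a sketch: the data and the two extra column names are made up and stand in for the homework CSV.

```python
import pandas as pd

# Made-up stand-in for the homework CSV (values are assumptions).
hw = pd.DataFrame({
    "Pet": ["Dog", "Dog", "Cat"],
    "Height": [60, 65, 70],
    "Ice Cream": ["Vanilla", "Mint", "Vanilla"],
})

grouped = hw.groupby("Pet").count()
grouped["Pet"] = grouped.index      # a new column is appended at the end
print(list(grouped.columns))        # ['Height', 'Ice Cream', 'Pet']

# Assigning to .columns renames by position, not by name.
grouped.columns = ["Pet", "Count", "col_1"]
print(grouped["Pet"].tolist())      # [1, 2] -> the counts, formerly 'Height'
print(grouped["col_1"].tolist())    # ['Cat', 'Dog'] -> the real pet names
```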
You may try:
%matplotlib inline
HW2bar = HW2Grouped.plot.bar(x = 'col_4', y = 'Count', title = "Pet count for students")
HW2bar.set_xlabel('Pet Type')
t = HW2bar.set_ylabel('Count')
I think what you originally intended to do was this (except you didn't indicate the column to perform the count on):
HW2Grouped = HW2.groupby('Pet')['Pet'].count()
However, this won't sort the bars in descending order.
There is a shorter way that needs no added or renamed columns and also sorts the bars:
HW2['Pet'].value_counts().plot.bar()
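To see the difference in ordering without the CSV, here is a small sketch. The frame hw is made up so that its counts match the tables above; it is not the real homework data.

```python
import pandas as pd

# Made-up Pet column whose counts match the tables above.
hw = pd.DataFrame({"Pet": ["Dog"] * 17 + ["None"] * 4 + ["Horse"] * 2 + ["Cat"]})

by_group = hw.groupby("Pet")["Pet"].count()  # ordered by label
by_value = hw["Pet"].value_counts()          # ordered by count, largest first

print(by_group.index.tolist())  # ['Cat', 'Dog', 'Horse', 'None']
print(by_value.index.tolist())  # ['Dog', 'None', 'Horse', 'Cat']
```

Either result can be passed to .plot.bar(); only the bar order differs.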

Python Pandas. Describe() by date

I would like to plot summary statistics over time for panel data. The X axis would be time and the Y axis would be the variable of interest with lines for Mean, min/max, P25, P50, P75 etc.
This would basically loop through and calc the stats for each date over all the individual observations and then plot them.
What I am trying to do is similar to the example below, but the x axis would be dates instead of 0-9.
import numpy as np
import pandas as pd
# Create random data
rd = pd.DataFrame(np.random.randn(100, 10))
rd.describe().T.drop('count', axis=1).plot()
In my dataset, the time series of each individual is stacked on one another.
I tried running the following but I seem to get the descriptive stats of entire dataset and not broken down by date.
rd = rd.groupby('period').count().describe()
print (rd)
rd.show()
Using the dataframe below as the example:
df = pd.DataFrame({'Values':[10,20,30,20,40,60,40,80,120],'period': [1,2,3,1,2,3,1,2,3]})
df
   Values  period
0      10       1
1      20       2
2      30       3
3      20       1
4      40       2
5      60       3
6      40       1
7      80       2
8     120       3
Now plot the descriptive statistics by date using groupby:
df.groupby('period').describe()['Values'].drop('count', axis = 1).plot()
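It can help to look at the table before plotting it. A minimal sketch on the same example frame; the second variant, which picks the statistics by name with agg, is my own addition rather than part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Values": [10, 20, 30, 20, 40, 60, 40, 80, 120],
    "period": [1, 2, 3, 1, 2, 3, 1, 2, 3],
})

# One row per period, one column per statistic (count dropped).
stats = df.groupby("period")["Values"].describe().drop(columns="count")
print(stats[["mean", "min", "50%", "max"]])

# Alternative: choose the statistics explicitly by name.
picked = df.groupby("period")["Values"].agg(["mean", "min", "median", "max"])
print(picked)
```

Calling .plot() on either table draws one line per statistic with period on the x axis.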

Pandas: how to GROUPBY by number of not NaNs for each row?

I have a result from check-all-that-apply questions:
A | B | C ...
1 | NaN | 1
NaN | 1 | NaN
Where NaN means the responder did not select that option, and 1 if they selected it.
I want to group by the number of not NaNs in each row. Specifically, this is the kind of output visualization I am trying to do:
I tried using count():
df.count(axis=1).reset_index()
And I get the number of selected boxes per user, but I don't know what's next.
If the dataframe is like this (I included one more row so that we get values of 4+):
import pandas as pd
import numpy as np
from matplotlib.ticker import FuncFormatter
df = pd.DataFrame({'A':[1,np.nan,1,np.nan],
                   'B':[np.nan,1,np.nan,np.nan],
                   'C':[np.nan,1,1,np.nan],
                   'D':[np.nan,np.nan,1,np.nan]})
df.isna().sum(axis=1) gives the number of NaNs per row. To bin these values, you can use pd.cut:
labels = pd.cut(df.isna().sum(axis=1),[-np.inf,1,3,+np.inf],labels=['0-1','2-3','4+'])
labels
0 2-3
1 2-3
2 0-1
3 4+
And just plot this:
ax = (labels.value_counts(sort=False) / len(labels)).plot.bar()
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
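Note that the code above bins the number of NaNs per row. If the goal is the number of selected options (the non-NaN cells), the same idea works with df.count(axis=1). A minimal sketch on the same frame, reusing the same bin edges as an assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, np.nan, 1, np.nan],
    "B": [np.nan, 1, np.nan, np.nan],
    "C": [np.nan, 1, 1, np.nan],
    "D": [np.nan, np.nan, 1, np.nan],
})

# Non-NaN cells per row; equivalent to df.notna().sum(axis=1).
selected = df.count(axis=1)

# Bins are right-inclusive: (-inf, 1], (1, 3], (3, inf).
labels = pd.cut(selected, [-np.inf, 1, 3, np.inf], labels=["0-1", "2-3", "4+"])

# Share of respondents per bin, kept in bin order.
share = labels.value_counts(sort=False) / len(labels)
print(share)
```

share can be plotted with .plot.bar() and the same percent formatter as above.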

Reshape dataframe and plot stacked bar graph

What I have
I have a frame df of the following style, where each row represents a malfunction that occurred with a specimen:
index  specimen  malfunction
1      'first'   'cracked'
2      'first'   'cracked'
3      'first'   'bent'
4      'second'  'bent'
5      'second'  'bent'
6      'second'  'bent'
7      'second'  'cracked'
8      'third'   'cracked'
9      'third'   'broken'
In the real dataset I have about 15 different specimens and about 10 different types of malfunction.
What I need
I want to plot a bar graph that shows how many malfunctions occurred with each specimen (x-axis for the specimen label, y-axis for the number of malfunctions). I need a stacked bar chart, so malfunctions must be separated by color.
What I tried
I tried to use seaborn's catplot(kind='count') which would be exactly what I need if only it could plot a stacked chart. Unfortunately it can't, and I can't figure out how to reshape my data to plot it using pandas.plot.bar(stacked=True)
Try something like this:
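import matplotlib.pyplot as plt
import pandas as pd
# Count rows per (specimen, malfunction) pair and move malfunction types into columns
df = df.groupby(['specimen', 'malfunction']).count().unstack()
This generates a table with one row per specimen and one column per malfunction type.
fig, ax = plt.subplots()
df.plot(kind='bar', stacked=True, ax=ax)
ax.legend(["bent", "broken", "cracked"]);
The result is a stacked bar chart with one bar per specimen.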
The first step is to convert your categorical data to numeric codes:
import matplotlib.pyplot as plt
df_toPlot = df.copy()  # work on a copy so the original data in df stays unchanged
df_toPlot['mapMal'] = df_toPlot.malfunction.astype("category").cat.codes
This is the print of df_toPlot.
   index  specimen  malfunction  mapMal
0      1     first      cracked       2
1      2     first      cracked       2
2      3     first         bent       0
3      4    second         bent       0
4      5    second         bent       0
5      6    second         bent       0
6      7    second      cracked       2
7      8     third      cracked       2
8      9     third       broken       1
df_toPlot.groupby(['specimen', 'mapMal']).size().unstack().plot(kind='bar', stacked=True)
plt.show()
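Another way to get the reshaped counts is pd.crosstab, which keeps the malfunction names as column labels and fills missing combinations with 0. This is a sketch under the assumption that the frame has the two columns shown in the question; the helper plot_stacked_bars is a made-up name for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "specimen": ["first"] * 3 + ["second"] * 4 + ["third"] * 2,
    "malfunction": ["cracked", "cracked", "bent",
                    "bent", "bent", "bent", "cracked",
                    "cracked", "broken"],
})

# Rows: specimens, columns: malfunction types, cells: number of occurrences.
counts = pd.crosstab(df["specimen"], df["malfunction"])
print(counts)


def plot_stacked_bars(table):
    """Hypothetical helper: one bar per row, one colored segment per column."""
    import matplotlib.pyplot as plt

    ax = table.plot(kind="bar", stacked=True)
    ax.set_ylabel("number of malfunctions")
    plt.show()
    return ax
```

Calling plot_stacked_bars(counts) should give the stacked chart, with the legend entries taken from the column names.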

Hue two panda series

I have two pandas Series that I want to compare visually by plotting them on top of each other. I already tried the following:
>>> s1 = pd.Series([1,2,3,4,5])
>>> s2 = pd.Series([3,3,3,3,3])
>>> df = pd.concat([s1, s2], axis=1)
>>> sns.stripplot(data = df)
which yields the following picture:
Now, I am aware of the hue keyword of sns.stripplot, but applying it requires me to use the keywords x and y. I already tried to transform my data into a different dataframe like this:
>>> df = pd.concat([pd.DataFrame({'data':s1, 'type':'s1'}), pd.DataFrame({'data':s2, 'type':'s2'})])
so I can "hue over" type; but even then I have no idea what to put for the keyword x (assuming y = 'data'). Ignoring the keyword x like that
>>> sns.stripplot(y='data', data=df, hue='type')
fails to hue anything:
seaborn generally works best with long-form data, so you might need to rearrange your dataframe slightly. The hue keyword is expecting a column, so we'll use .melt() to get one.
long_form = df.melt()
long_form['X'] = 1
sns.stripplot(data=long_form, x='X', y='value', hue='variable')
This will give you a plot that roughly reflects your requirements.
When we do pd.melt, we change the frame from having multiple columns of values to having a single column of values, with a "variable" column to identify which of our original columns they came from. We add in an 'X' column because stripplot needs both x and hue to work properly in this case. Our long_form dataframe, then, looks like this:
   variable  value  X
0         0      1  1
1         0      2  1
2         0      3  1
3         0      4  1
4         0      5  1
5         1      3  1
6         1      3  1
7         1      3  1
8         1      3  1
9         1      3  1
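The 0 and 1 in the variable column are just the default column labels from pd.concat. A small sketch of a variation, assuming you would rather see readable names in the legend: name the Series first and tell melt what to call the two new columns. The names type and data are borrowed from the question's own attempt.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5], name="s1")
s2 = pd.Series([3, 3, 3, 3, 3], name="s2")

wide = pd.concat([s1, s2], axis=1)

# var_name/value_name replace the default 'variable'/'value' column names.
long_form = wide.melt(var_name="type", value_name="data")
long_form["X"] = 1  # dummy x position shared by every point

print(long_form.head())
# With seaborn imported as sns, the plot call would then be:
# sns.stripplot(data=long_form, x="X", y="data", hue="type")
```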
