Stacked bar plots from list of dataframes with groupby command - python

I want to create a 2x3 grid of stacked bar chart subplots from the results of a groupby.size command. I have a list of dataframes: list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]. A small sample of one of these dataframes:
... Create Time Location Area Id Beat Priority ... Closed Time
2011-01-01 00:00:00 ST&SAN PABLO AV 1.0 06X 1.0 ... 2011-01-01 00:28:17
2011-01-01 00:01:11 ST&HANNAH ST 1.0 07X 1.0 ... 2011-01-01 01:12:56
.
.
.
(can only add a few columns as the layout messes up)
I'm using a groupby.size command to get the required count of events for each of these dataframes, see below:
list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]
for i in list_df:
    print(i.groupby(['Beat', 'Priority']).size())
    print(' ')
Producing:
Beat  Priority
01X   1.0          394
      2.0         1816
02X   1.0          644
      2.0         1970
02Y   1.0          661
      2.0         2309
03X   1.0          857
      2.0         2962
.
.
.
I want to identify the 10 beats with the largest totals, where a beat's total is the sum of its counts over all priorities. For example, the totals for some of the beats above are:
Beat  Priority          Total for Beat
01X   1.0          394
      2.0         1816  2210
02Y   1.0          661
      2.0         2309  2970
03X   1.0          857
      2.0         2962  3819
.
.
.
So far I have called plot on the groupby.size result, but that ranks individual (Beat, Priority) pairs rather than the per-beat totals described above:
list_df = [df_2011, df_2012, df_2013, df_2014, df_2015, df_2016]
fig, axes = plt.subplots(2, 3)
for d, i in zip(list_df, range(6)):
    ax = axes.ravel()[i]
    d.groupby(['Beat', 'Priority']).size().nlargest(10).plot(ax=ax, kind='bar', figsize=(15, 7), stacked=True, legend=True)
    ax.set_title(f"Top 10 Beats for {i + 2011}")
plt.tight_layout()
I wish to have the 2x3 subplot layout, but with stacked barcharts like this one I have done previously:
Thanks in advance. This has been harder than I thought it would be!

For a stacked bar chart, each data series (here, each Priority) needs to be its own column, so you probably want something like:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# create fake input data
ncols = 300
list_df = [pd.DataFrame({'Beat': np.random.choice(['{:02d}X'.format(i) for i in range(15)], ncols),
                         'Priority': np.random.choice(['1', '2'], ncols),
                         'othercolumn1': range(ncols),
                         'othercol2': range(ncols),
                         'year': [yr] * ncols}) for yr in range(2011, 2017)]
In [22]: print(list_df[0].head(5))
Beat Priority othercolumn1 othercol2 year
0 06X 1 0 0 2011
1 05X 1 1 1 2011
2 04X 1 2 2 2011
3 01X 2 3 3 2011
4 00X 1 4 4 2011
fig, axes = plt.subplots(2, 3)
for i, d in enumerate(list_df):
    ax = axes.flatten()[i]
    dplot = d[['Beat', 'Priority']].pivot_table(index='Beat', columns='Priority', aggfunc=len)
    dplot = (dplot.assign(total=lambda x: x.sum(axis=1))
                  .sort_values('total', ascending=False)
                  .head(10)
                  .drop('total', axis=1))
    dplot.plot.bar(ax=ax, figsize=(15, 7), stacked=True, legend=True)
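The same Beat-by-Priority table can probably also be built with groupby(...).size().unstack(), which does not depend on how pivot_table behaves when no value column is left over. This is only a minimal sketch: the helper name top_beats and the fake data are illustrative assumptions, not part of the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to see a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
nrows = 300
beats = [f"{i:02d}X" for i in range(15)]
# fake input data: one dataframe per year (stand-ins for df_2011 ... df_2016)
list_df = [
    pd.DataFrame({
        "Beat": rng.choice(beats, nrows),
        "Priority": rng.choice([1.0, 2.0], nrows),
    })
    for _ in range(2011, 2017)
]


def top_beats(df, n=10):
    """Beat x Priority count table for the n beats with the largest totals."""
    counts = df.groupby(["Beat", "Priority"]).size().unstack(fill_value=0)
    top = counts.sum(axis=1).nlargest(n).index  # beats ranked by total count
    return counts.loc[top]


fig, axes = plt.subplots(2, 3, figsize=(15, 7))
for year, df, ax in zip(range(2011, 2017), list_df, axes.ravel()):
    top_beats(df).plot.bar(ax=ax, stacked=True, legend=True)
    ax.set_title(f"Top 10 Beats for {year}")
fig.tight_layout()
```

Ranking on the row sums before plotting is what makes the bars reflect per-beat totals rather than individual (Beat, Priority) pairs.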

Related

Seaborn boxplot with grouped data into categories with count column

I ran into a problem when trying to plot my dataset with a seaborn boxplot. The dataset comes pre-aggregated from a database, like this:
region age total
0 STC 2.0 11024
1 PHA 84.0 3904
2 OLK 55.0 12944
3 VYS 72.0 5592
4 PAK 86.0 2168
... ... ... ...
1460 KVK 62.0 4600
1461 MSK 41.0 26568
1462 LBK 13.0 6928
1463 JHC 18.0 8296
1464 HKK 88.0 2408
And I would like to create a box plot with the region on an x-scale, age on a y-scale, based on the total number of observations.
When I try ax = sns.boxplot(x='region', y='age', data=df), I get a simple boxplot that does not take the total column into account. One brute-force option is to repeat each row as many times as its total, but I don't like this solution.
sns.histplot and sns.kdeplot support a weights= parameter, but sns.boxplot doesn't. Simply repeating values isn't necessarily a bad solution, but in this case the numbers are very large. You could create a new dataframe with repeated data, dividing the 'total' column by a constant to keep the row count manageable.
The sample data have all different regions, which makes creating a boxplot rather strange. The code below supposes there aren't too many regions (1400 regions certainly wouldn't work well).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO
df_str = ''' region age total
STC 2.0 11024
STC 84.0 3904
STC 55.0 12944
STC 72.0 5592
STC 86.0 2168
PHA 62.0 4600
PHA 41.0 26568
PHA 13.0 6928
PHA 18.0 8296
PHA 88.0 2408'''
df = pd.read_csv(StringIO(df_str), sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas
# use a scaled down version of the totals as a repeat factor
repeats = df['total'].to_numpy(dtype=int) // 100
df_total = pd.DataFrame({'region': np.repeat(df['region'].values, repeats),
                         'age': np.repeat(df['age'].values, repeats)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
sns.kdeplot(data=df, x='age', weights='total', hue='region', ax=ax1)
sns.boxplot(data=df_total, y='age', x='region', ax=ax2)
plt.tight_layout()
plt.show()
An alternative would be to do everything outside seaborn, using statsmodels.stats.weightstats.DescrStatsW to calculate the percentiles and plot the boxplots via matplotlib. Outliers would still have to be calculated separately. (See also this post)
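As a rough illustration of the "outside seaborn" route, the sketch below computes weighted quartiles with plain NumPy and draws the boxes with matplotlib's Axes.bxp. It does not use DescrStatsW; the helper weighted_quantile is a hypothetical name using a simple non-interpolating definition, so its results may differ slightly from what statsmodels returns. Letting the whiskers span the full data range is also an assumption made for brevity.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to see a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["STC"] * 5 + ["PHA"] * 5,
    "age": [2.0, 84.0, 55.0, 72.0, 86.0, 62.0, 41.0, 13.0, 18.0, 88.0],
    "total": [11024, 3904, 12944, 5592, 2168, 4600, 26568, 6928, 8296, 2408],
})


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (no interpolation)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, q * cum[-1])]


stats = []
for region, g in df.groupby("region", sort=False):
    q1, med, q3 = (weighted_quantile(g["age"], g["total"], q) for q in (0.25, 0.5, 0.75))
    stats.append({
        "label": region,
        "q1": q1, "med": med, "q3": q3,
        # assumption: whiskers cover the full range instead of the usual 1.5 * IQR rule
        "whislo": g["age"].min(), "whishi": g["age"].max(),
        "fliers": [],
    })

fig, ax = plt.subplots(figsize=(6, 4))
ax.bxp(stats, showfliers=False)  # draw boxes from precomputed statistics
ax.set_xlabel("region")
ax.set_ylabel("age")
```

Because the statistics are computed from the weights directly, no rows need to be repeated, however large the totals are.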

Multi-year time series chart with shaded range in python

I have these charts that I've created in Excel from dataframes of a structure like such:
so that the chart can be created like this, stacking the 5-Year Range area on top of the Min range (no fill) so that the range area can be shaded. The min/max/range/avg columns all calculate off of 2016-2020.
I know that I can plot lines for multiple years on the same axis by using a date index and applying month labels, but is there a way to replicate the shading of this chart, more specifically if my dataframes are in a simple date index-value format, like so:
Quantity
1/1/2016 6
2/1/2016 4
3/1/2016 1
4/1/2016 10
5/1/2016 7
6/1/2016 10
7/1/2016 10
8/1/2016 2
9/1/2016 1
10/1/2016 2
11/1/2016 3
… …
1/1/2020 4
2/1/2020 8
3/1/2020 3
4/1/2020 5
5/1/2020 8
6/1/2020 6
7/1/2020 6
8/1/2020 7
9/1/2020 8
10/1/2020 5
11/1/2020 4
12/1/2020 3
1/1/2021 9
2/1/2021 7
3/1/2021 7
I haven't been able to find anything similar in the plot libraries.
Two-step process:
1. Restructure the DataFrame so that years are columns and rows are indexed by a uniform date.
2. Plot using matplotlib.
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# straight date as index, quantity as column
d = pd.date_range("1-Jan-2016", "1-Mar-2021", freq="MS")
df = pd.DataFrame({"Quantity":np.random.randint(1, 10, len(d))}, index=d)
# re-structure as multi-index so that each year becomes a column
dfg = (df.set_index(pd.MultiIndex.from_arrays([df.index.map(lambda d: dt.date(dt.date.today().year, d.month, d.day)),
                                               df.index.year], names=["month", "year"]))
         .unstack("year")
         .droplevel(0, axis=1))
# add calculated columns, using only the complete (12-month) years;
# selecting them up front keeps the new "min" and "max" columns out of the average
full_years = [c for c in dfg.columns if dfg[c].count() == 12]
dfg = dfg.assign(min=dfg[full_years].min(axis=1),
                 max=dfg[full_years].max(axis=1),
                 avg=dfg[full_years].mean(axis=1).round(1))
fig, ax = plt.subplots(1, figsize=[14,4])
# now plot all the parts
ax.fill_between(dfg.index, dfg["min"], dfg["max"], label="5y range", facecolor="oldlace")
ax.plot(dfg.index, dfg[2020], label="2020", c="r")
ax.plot(dfg.index, dfg[2021], label="2021", c="g")
ax.plot(dfg.index, dfg.avg, label="5 yr avg", c="y", ls=(0,(1,2)), lw=3)
# adjust axis
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
ax.legend(loc = 'best')
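For comparison, here is a shorter sketch of the same idea that indexes rows by month number and uses DataFrame.pivot for the restructuring. The random data and the names wide, band and full_years are illustrative assumptions; the month tick labels depend on the locale.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to see a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# fake monthly data, Jan 2016 through Mar 2021
idx = pd.date_range("2016-01-01", "2021-03-01", freq="MS")
rng = np.random.default_rng(0)
df = pd.DataFrame({"Quantity": rng.integers(1, 11, len(idx))}, index=idx)

# rows = month number (1..12), columns = year
tmp = df.assign(year=df.index.year, month=df.index.month)
wide = tmp.pivot(index="month", columns="year", values="Quantity")

# only complete years (2016..2020 here) feed the range and the average
full_years = [y for y in wide.columns if wide[y].notna().all()]
band = wide[full_years]
lo, hi, avg = band.min(axis=1), band.max(axis=1), band.mean(axis=1)

fig, ax = plt.subplots(figsize=(14, 4))
ax.fill_between(wide.index, lo, hi, color="oldlace", label="5-yr range")
ax.plot(wide.index, avg, ls=":", lw=3, color="y", label="5-yr avg")
ax.plot(wide.index, wide[2020], color="r", label="2020")
ax.plot(wide.index, wide[2021], color="g", label="2021")
ax.set_xticks(range(1, 13))
ax.set_xticklabels(pd.date_range("2000-01-01", periods=12, freq="MS").strftime("%b"))
ax.legend(loc="best")
```

fill_between shades the area between the per-month minimum and maximum, which plays the role of the stacked "Min (no fill) + Range" areas in the Excel chart.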

How to make multiple scatter subplots with sharing one-axis?

date  name   amount
1     harry  100
1     joe    20
2     harry  50
3     joe    60
3     lee    25
4     lee    60
4     harry  200
4     joe    90
I am trying to share the 'date' axis (x-axis) across plots for 432 different names. The image was too large to show.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec=pd.read_csv('december.csv')
sns.lmplot(x='date', y='amount',
           data=dec, fit_reg=False, hue='name', legend=True, palette='Set1')
This code gives one graph with 432 hues, but I want 432 graphs. How can I do that?
Using the same code you wrote, but instead of putting hue='name', you put col='name' and it should give you the expected behavior:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
dec = pd.DataFrame(
    [
        [1, 'harry', 100],
        [1, 'joe', 20],
        [2, 'harry', 50],
        [3, 'joe', 60],
        [3, 'lee', 25],
        [4, 'lee', 60],
        [4, 'harry', 200],
        [4, 'joe', 90],
    ],
    columns=['date', 'name', 'amount'],
)
sns.lmplot(
    x='date',
    y='amount',
    data=dec,
    fit_reg=False,
    col='name',
    legend=True,
    palette='Set1',
)
If you want to break the rows, you can define a column wrapper with col_wrap (number of plots per row):
sns.lmplot(
    x='date',
    y='amount',
    data=dec,
    fit_reg=False,
    col='name',
    col_wrap=1,
    legend=True,
    palette='Set1',
)
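If you would rather not go through seaborn, a similar layout can be sketched with plain matplotlib, using plt.subplots(sharex=True) so that all panels share the 'date' axis. This is only an illustration on the sample data; with 432 names the figure would need to be very tall or split over several figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to see a window
import matplotlib.pyplot as plt
import pandas as pd

dec = pd.DataFrame(
    [[1, "harry", 100], [1, "joe", 20], [2, "harry", 50], [3, "joe", 60],
     [3, "lee", 25], [4, "lee", 60], [4, "harry", 200], [4, "joe", 90]],
    columns=["date", "name", "amount"],
)

groups = dec.groupby("name")
# one row per name; sharex=True links the x-axis limits of all panels
fig, axes = plt.subplots(nrows=groups.ngroups, sharex=True, squeeze=False,
                         figsize=(6, 2 * groups.ngroups))
for ax, (name, g) in zip(axes.ravel(), groups):
    ax.scatter(g["date"], g["amount"])
    ax.set_title(f"name = {name}")
    ax.set_ylabel("amount")
axes[-1, 0].set_xlabel("date")
fig.tight_layout()
```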
EDIT: using the groupby() method, you can easily get aggregates such as number of dots per plot and total amount per group.
The main idea is to group the records in the dec dataframe by name (as was implicitly done in the plot above).
Continuing on the code above, you can have a preview of the groupby operation using the describe method:
dec.groupby('name').describe()
Out[2]:
date amount
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
name
harry 3.0 2.333333 1.527525 1.0 1.50 2.0 3.00 4.0 3.0 116.666667 76.376262 50.0 75.00 100.0 150.00 200.0
joe 3.0 2.666667 1.527525 1.0 2.00 3.0 3.50 4.0 3.0 56.666667 35.118846 20.0 40.00 60.0 75.00 90.0
lee 2.0 3.500000 0.707107 3.0 3.25 3.5 3.75 4.0 2.0 42.500000 24.748737 25.0 33.75 42.5 51.25 60.0
Using the pandas groupby method, we group records by 'name' and pick any column (here: 'amount') to get the count (the count is the same for every column, as it counts each individual occurrence of each different 'name'):
counts = dec.groupby('name')['amount'].count()
counts
Out[3]:
name
harry 3
joe 3
lee 2
Name: amount, dtype: int64
To get the total amount, we do the same, we pick the 'amount' column and call the sum() method instead of the count() method:
total_amounts = dec.groupby('name')['amount'].sum()
total_amounts
Out[4]:
name
harry 350
joe 170
lee 85
Name: amount, dtype: int64
We now have two series indexed by 'name' containing the information we want: counts and total_amounts.
We're gonna use these two series to build a title for each subplot:
plot = sns.lmplot(
    x='date',
    y='amount',
    data=dec,
    fit_reg=False,
    col='name',
    legend=True,
    palette='Set1',
)
for name in plot.axes_dict:
    subplot_title = f'name = {name}, number of dots = {counts[name]}, total amount = {total_amounts[name]}'
    plot.axes_dict[name].set_title(subplot_title)
plot.fig
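This produces a figure with one scatter panel per name, each titled with its number of dots and total amount (image not shown).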
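As a side note, the count and the total can also be computed in a single pass with agg. The sketch below rebuilds the sample data so that it runs on its own; the names summary and titles are illustrative.

```python
import pandas as pd

dec = pd.DataFrame(
    [[1, "harry", 100], [1, "joe", 20], [2, "harry", 50], [3, "joe", 60],
     [3, "lee", 25], [4, "lee", 60], [4, "harry", 200], [4, "joe", 90]],
    columns=["date", "name", "amount"],
)

# one row per name, with a 'count' column and a 'sum' column
summary = dec.groupby("name")["amount"].agg(["count", "sum"])

# build one title string per name from the summary table
titles = {
    name: f"name = {name}, number of dots = {row['count']}, total amount = {row['sum']}"
    for name, row in summary.iterrows()
}
print(titles["harry"])  # name = harry, number of dots = 3, total amount = 350
```

The titles dictionary can then be used in the same loop over plot.axes_dict instead of the two separate series.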

Rolling Mean ValueError

I am trying to plot the rolling mean on a double-axis graph. However, I get the ValueError: view limit minimum -36867.6 is less than 1 and is an invalid Matplotlib date value. This often happens if you pass a non-datetime value to an axis that has datetime units error. My columns do have datetime objects in them so I am not sure why this is happening.
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
lns1 = ax1.plot(df5['TIME'],
                df5["y"])
lns2 = ax2.plot(df3_plot.rolling(window=3).mean(),
                color='black')
df5 looks like this:
TIME y
0 1990-01-01 3.380127
1 1990-02-01 3.313274
2 1990-03-01 4.036463
3 1990-04-01 3.813060
4 1990-05-01 3.847867
...
355 2019-08-01 8.590325
356 2019-09-01 7.642616
357 2019-10-01 8.362921
358 2019-11-01 7.696176
359 2019-12-01 8.206370
And df3_plot looks like this:
date y
0 1994-01-01 239.274414
1 1994-02-01 226.126581
2 1994-03-01 211.591748
3 1994-04-01 214.708679
4 1995-01-01 223.093071
...
99 2018-04-01 181.889699
100 2019-01-01 174.500096
101 2019-02-01 179.803310
102 2019-03-01 175.570419
103 2019-04-01 176.697451
Furthermore, the graph comes out fine if I don't use the rolling mean for df3_plot, which suggests that the x-axis values are datetimes for both. When I have
lns2 = ax2.plot(df3_plot['date'],
                df3_plot['y'],
                color='black')
I get this graph
Edit
Suppose that df5 has another column 'y2' whose rolling mean is computed correctly along with 'y'. How can I plot and label both properly? I currently have
df6 = df5.rolling(window=12).mean()
lns1 = ax1.plot(
    df6,
    label = 'y', # how do I add 'y2' label correctly?
    linewidth = 2.0)
df6 looks like this:
TIME y y2
0 1990-01-01 NaN NaN
1 1990-02-01 NaN NaN
2 1990-03-01 NaN NaN
3 1990-04-01 NaN NaN
4 1990-05-01 NaN NaN
... ... ... ...
355 2019-08-01 10.012447 8.331901
356 2019-09-01 9.909044 8.263813
357 2019-10-01 9.810155 8.185539
358 2019-11-01 9.711690 8.085016
359 2019-12-01 9.619968 8.035330
Making 'date' the index of my dataframe did the trick: df3_plot.set_index('date', inplace=True).
However, I'm not sure why the error messages are different for @dm2 and me.
You already caught this, but the problem is that rolling by default works on the index. There is also an on parameter for setting a column to work on instead:
rolling = df3_plot.rolling(window=3, on='date').mean()
lns2 = ax2.plot(rolling['date'], rolling['y'], color='black')
Note that if you just do df3_plot.rolling(window=3).mean(), you get this:
y
0 NaN
1 NaN
2 0.376586
3 0.168073
4 0.258431
.. ...
299 0.285585
300 0.327987
301 0.518088
302 0.300169
303 0.299366
[304 rows x 1 columns]
Seems like matplotlib tries to plot y here since there is only one column. But the index is int, not dates, so I believe that leads to the error you saw when trying to plot over the other date axis.
When you use on to create rolling in my example, the result still has date and y columns, so you still need to reference the appropriate columns when plotting.
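The two working approaches (rolling with on='date', or making 'date' the index first) can be compared on a tiny frame. The data below is made up for illustration, and the names by_column and by_index are not from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("1994-01-01", periods=6, freq="MS"),
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# option 1: keep 'date' as a column and tell rolling about it
by_column = df.rolling(window=3, on="date").mean()
# the result still has both columns, so plot by_column['date'] against by_column['y']

# option 2: move 'date' into the index, then roll over the remaining column
by_index = df.set_index("date").rolling(window=3).mean()
# the result is indexed by date, so ax.plot(by_index) gets datetime x-values

print(by_column)
print(by_index)
```

In both cases the first two rows are NaN because a window of 3 needs three observations before it can produce a mean.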

Scatter plotting data from two different data frames in python

I have two different data frames in the following format.
dfclean
Out[1]:
obj
0 682
1 101
2 33
dfmalicious
Out[2]:
obj
0 17
1 43
2 8
3 9
4 211
My use case is to plot a single scatter graph that distinctly shows the obj values from both dataframes. I am using Python for this purpose. I looked at a few examples where two columns of the same dataframe were used to plot the data, but couldn't replicate it for my use case. Any help is greatly appreciated.
How to plot two DataFrame on same graph for comparison
To plot multiple column groups on a single axes, repeat the plot call, passing the target ax each time.
Option 1
In [2391]: ax = dfclean.reset_index().plot(kind='scatter', x='index', y='obj',
                                           color='Red', label='G1')
In [2392]: dfmalicious.reset_index().plot(kind='scatter', x='index', y='obj',
                                          color='Blue', label='G2', ax=ax)
Out[2392]: <matplotlib.axes._subplots.AxesSubplot at 0x2284e7b8>
Option 2
In [2399]: dff = dfmalicious.merge(dfclean, right_index=True, left_index=True,
                                   how='outer').reset_index()
In [2406]: dff
Out[2406]:
index obj_x obj_y
0 0 17 682.0
1 1 43 101.0
2 2 8 33.0
3 3 9 NaN
4 4 211 NaN
In [2400]: ax = dff.plot(kind='scatter', x='index', y='obj_x', color='Red', label='G1')
In [2401]: dff.plot(kind='scatter', x='index', y='obj_y', color='Blue', label='G2', ax=ax)
Out[2401]: <matplotlib.axes._subplots.AxesSubplot at 0x11dbe1d0>
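The same picture can also be drawn with matplotlib directly, without reset_index or merge, by calling scatter twice on one axes. This is a minimal sketch using the sample values from the question; the legend labels are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to see a window
import matplotlib.pyplot as plt
import pandas as pd

dfclean = pd.DataFrame({"obj": [682, 101, 33]})
dfmalicious = pd.DataFrame({"obj": [17, 43, 8, 9, 211]})

fig, ax = plt.subplots()
# each call adds one point collection to the same axes
ax.scatter(dfclean.index, dfclean["obj"], color="red", label="clean")
ax.scatter(dfmalicious.index, dfmalicious["obj"], color="blue", label="malicious")
ax.set_xlabel("index")
ax.set_ylabel("obj")
ax.legend()
```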
