Multi-year time series chart with shaded range in python - python

I have these charts that I've created in Excel from dataframes of a structure like such:
so that the chart can be created like this, stacking the 5-Year Range area on top of the Min range (no fill) so that the range area can be shaded. The min/max/range/avg columns all calculate off of 2016-2020.
I know that I can plot lines for multiple years on the same axis by using a date index and applying month labels, but is there a way to replicate the shading of this chart, more specifically if my dataframes are in a simple date index-value format, like so:
Quantity
1/1/2016 6
2/1/2016 4
3/1/2016 1
4/1/2016 10
5/1/2016 7
6/1/2016 10
7/1/2016 10
8/1/2016 2
9/1/2016 1
10/1/2016 2
11/1/2016 3
… …
1/1/2020 4
2/1/2020 8
3/1/2020 3
4/1/2020 5
5/1/2020 8
6/1/2020 6
7/1/2020 6
8/1/2020 7
9/1/2020 8
10/1/2020 5
11/1/2020 4
12/1/2020 3
1/1/2021 9
2/1/2021 7
3/1/2021 7
I haven't been able to find anything similar in the plot libraries.

This is a two-step process:
1. Restructure the DataFrame so that years are columns and rows are indexed by a uniform datetime.
2. Plot using matplotlib.
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
# straight date as index, quantity as column
d = pd.date_range("1-Jan-2016", "1-Mar-2021", freq="MS")
df = pd.DataFrame({"Quantity":np.random.randint(1, 10, len(d))}, index=d)
# re-structure as multi-index, make year column
# add calculated columns
dfg = (df.set_index(pd.MultiIndex.from_arrays([df.index.map(lambda d: dt.date(dt.date.today().year, d.month, d.day)),
                                               df.index.year], names=["month","year"]))
       .unstack("year")
       .droplevel(0, axis=1)
       .assign(min=lambda dfa: dfa.loc[:,[c for c in dfa.columns if dfa[c].count()==12]].min(axis=1),
               max=lambda dfa: dfa.loc[:,[c for c in dfa.columns if dfa[c].count()==12]].max(axis=1),
               avg=lambda dfa: dfa.loc[:,[c for c in dfa.columns if dfa[c].count()==12]].mean(axis=1).round(1),
              )
      )
fig, ax = plt.subplots(1, figsize=[14,4])
# now plot all the parts
ax.fill_between(dfg.index, dfg["min"], dfg["max"], label="5y range", facecolor="oldlace")
ax.plot(dfg.index, dfg[2020], label="2020", c="r")
ax.plot(dfg.index, dfg[2021], label="2021", c="g")
ax.plot(dfg.index, dfg.avg, label="5 yr avg", c="y", ls=(0,(1,2)), lw=3)
# adjust axis
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
ax.legend(loc = 'best')
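The restructuring step can also be checked on its own, without any plotting. Below is a minimal sketch of that step, assuming a monthly series on a DatetimeIndex with a single Quantity column; the sample values and the names tidy, wide, complete and stats are made up for illustration and are not from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data, Jan 2016 through Mar 2021, with a repeating 1..10 pattern
idx = pd.date_range("2016-01-01", "2021-03-01", freq="MS")
df = pd.DataFrame({"Quantity": np.arange(len(idx)) % 10 + 1}, index=idx)

# One row per calendar month, one column per year
tidy = df.assign(year=df.index.year, month=df.index.month)
wide = tidy.pivot(index="month", columns="year", values="Quantity")

# Only years with all 12 months feed the range statistics (2016-2020 here)
complete = [int(year) for year in wide.columns if wide[year].count() == 12]
stats = pd.DataFrame({
    "min": wide[complete].min(axis=1),
    "max": wide[complete].max(axis=1),
    "avg": wide[complete].mean(axis=1).round(1),
})
```

The min and max columns of stats are what would be passed to ax.fill_between, with the month numbers in stats.index as the x values.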

Related

Getting labels for legend after graphing pivot of dataframe in pandas

I am trying to have my plot show a legend in which each line is labeled with the column it came from. I did not separate the plt.plot() call from the pivot step, but I want to know whether it is still possible to have a legend. No legend shows up at all, and if I add
plt.plot(df_EPErrorPercentByWeekAndDC.pivot(index='hellofresh delivery week', columns='DC', values='Error Percent'), label='DC')
it just uses that string as every label. If I pass df_EPErrorPercentByWeekAndDC['DC'] instead, it shows one letter of it per legend item. Here is the code I have:
print("### Graphing Error Rates by Week and DC EP ###")
# remove percent sign from percent in place
df_EPErrorPercentByWeekAndDC['Error Percent'] = df_EPErrorPercentByWeekAndDC['Error Percent'].str[:-1].astype(float)
plt.xticks(rotation = 90)
plt.plot(df_EPErrorPercentByWeekAndDC.pivot(index='hellofresh delivery week', columns='DC', values='Error Percent'))
plt.legend()
plt.savefig('EPErrorPercentByWeekAndDC.png', bbox_inches="tight", dpi=500)
plt.close()
I can't share any of the data, but it is in the format of a pivot table whose columns are state names, each full of percentages. The graph works fine, but the legend isn't there.
Either save the pivoted frame then specify the legend as the columns:
df['Error Percent'] = df['Error Percent'].str[:-1].astype(float)
plt.xticks(rotation=90)
pivoted_df = df.pivot(index='hellofresh delivery week', columns='DC',
values='Error Percent')
plt.plot(pivoted_df)
plt.legend(pivoted_df.columns, title=pivoted_df.columns.name)
plt.tight_layout()
plt.show()
Or use DataFrame.plot directly on the pivoted DataFrame which will handle the legend automatically:
df['Error Percent'] = df['Error Percent'].str[:-1].astype(float)
pivoted_df = df.pivot(index='hellofresh delivery week', columns='DC',
values='Error Percent')
ax = pivoted_df.plot(rot=90)
plt.tight_layout()
plt.show()
Sample DataFrame and imports used:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame({
'hellofresh delivery week': np.repeat(
pd.date_range('2021-05-01', periods=5, freq='W'), 2),
'DC': ['A', 'B'] * 5,
'Error Percent': pd.Series(np.random.randint(10, 100, 10), dtype=str) + '%'
})
df:
hellofresh delivery week DC Error Percent
0 2021-05-02 A 88%
1 2021-05-02 B 71%
2 2021-05-09 A 26%
3 2021-05-09 B 83%
4 2021-05-16 A 18%
5 2021-05-16 B 72%
6 2021-05-23 A 37%
7 2021-05-23 B 40%
8 2021-05-30 A 90%
9 2021-05-30 B 17%
pivoted_df:
DC A B
hellofresh delivery week
2021-05-02 88.0 71.0
2021-05-09 26.0 83.0
2021-05-16 18.0 72.0
2021-05-23 37.0 40.0
2021-05-30 90.0 17.0
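For context on why a single label= string ends up on every legend entry: when plt.plot receives 2-D data it creates one line per column and applies the same label to each. Here is a small sketch, with made-up data and the Agg backend assumed so it runs without a display, that labels each line from the column names instead:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up pivoted data: one column per DC
pivoted = pd.DataFrame(
    {"A": [88.0, 26.0, 18.0], "B": [71.0, 83.0, 72.0]},
    index=pd.to_datetime(["2021-05-02", "2021-05-09", "2021-05-16"]),
)
pivoted.columns.name = "DC"

fig, ax = plt.subplots()
# Passing a 2-D array returns a list with one Line2D per column
lines = ax.plot(pivoted.index, pivoted.to_numpy())
for line, name in zip(lines, pivoted.columns):
    line.set_label(name)

legend = ax.legend(title=pivoted.columns.name)
legend_labels = [text.get_text() for text in legend.get_texts()]
plt.close(fig)
```

This does the same job as passing pivoted_df.columns to plt.legend; it is just more explicit about which label goes with which line.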

How do I make a line graph out of this?

I've imported seaborn and typed in this:
Bunker2019_Jan_to_Jun.plot(x='2019', y='Total')
Bunker2019_Jan_to_Jun.plot(x='2019', y='MGO')
and it shows two graphs. Is there any way I can show 2019 (Jan to Dec) and 2020 (Jan to Jun) together?
If you want them on the same plot, you need to combine the data frames. I'm not sure what your "2019" column is (date or string?), so below I create DataFrames that resemble yours:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
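# "MS" (month start) yields the same Jan..Dec labels; the "M" alias is deprecated in recent pandas
mths = pd.date_range(start='1/1/2019', periods=12, freq="MS").strftime("%b").to_list()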
Bunker2019 = pd.DataFrame({'2019':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
Bunker2020 = pd.DataFrame({'2020':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
Simple way to add the year to create a new date:
Bunker2019['Date'] = '2019_'+ Bunker2019['2019'].astype(str)
Bunker2020['Date'] = '2020_'+ Bunker2020['2020'].astype(str)
We concat and melt, setting an order:
df = pd.concat([Bunker2019[['Date','Total','MGO']],Bunker2020[['Date','Total','MGO']]])
df = df.melt(id_vars='Date')
df['Date'] = pd.Categorical(df['Date'],categories=df['Date'].unique(),ordered=True)
So now it is a long format, containing information for both 2020 and 2019:
Date variable value
0 2019_Jan Total 0.187751
1 2019_Feb Total 0.091374
2 2019_Mar Total 0.929739
3 2019_Apr Total 0.621981
4 2019_May Total 0.371236
5 2019_Jun Total 0.027078
6 2019_Jul Total 0.719046
7 2019_Aug Total 0.138531
Now to plot:
plt.figure(figsize=(12,5))
ax = sns.lineplot(data=df,x='Date',y='value',hue='variable')
sns.scatterplot(data=df,x='Date',y='value',hue='variable',ax=ax,legend=False)
plt.xticks(rotation=65, horizontalalignment='right')
plt.show()
I created the source DataFrame as:
Month MGO MFO
0 2019-01 79.1 85.0
1 2019-02 69.9 91.2
2 2019-03 68.9 90.4
3 2019-04 71.1 87.0
4 2019-05 75.9 85.6
5 2019-06 60.9 82.1
6 2019-07 68.4 75.0
7 2019-08 75.8 60.7
8 2019-09 82.0 58.8
9 2019-10 95.3 56.6
10 2019-11 90.2 59.7
11 2019-12 86.5 57.7
12 2020-01 79.1 50.0
13 2020-02 88.9 52.2
14 2020-03 74.9 54.4
15 2020-04 87.1 51.0
16 2020-05 92.9 52.6
17 2020-06 105.9 53.1
(for now, the Month column is a string).
If you have 2 separate source DataFrames, concatenate them.
The first processing step is to convert the Month column to datetime type and set it as the index:
df.Month = pd.to_datetime(df.Month)
df.set_index('Month', inplace=True)
The first, more straightforward way to create the drawing is:
df.plot(style='-x');
For my data sample I got:
The second possibility is to generate the picture with smoothed lines.
To do this, you can draw two plots on a single axes:
- first, a smoothed line from the resampled DataFrame, with interpolation but without markers, as there are now many more points,
- second, only the markers, taken from the original DataFrame,
both with the same list of colors.
The code to do it is:
fig, ax = plt.subplots()
color = ['blue', 'orange']
df.resample('D').interpolate('quadratic').plot(ax=ax, color=color)
df.plot(ax=ax, marker='x', linestyle='None', legend=False, color=color);
This time the result is:
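To see what the resample-and-interpolate step does to the data before it is plotted, here is a small sketch with made-up values. It uses linear interpolation so that it runs with pandas alone; the 'quadratic' method used above additionally needs SciPy. The names monthly, daily and filled are only for illustration.

```python
import pandas as pd

# Made-up monthly values on a DatetimeIndex
monthly = pd.DataFrame(
    {"MGO": [79.1, 69.9, 68.9]},
    index=pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
)

# Upsample to one row per day; the newly created rows are NaN at this point
daily = monthly.resample("D").asfreq()

# Fill the gaps between the original monthly points
filled = daily.interpolate("linear")
```

Three monthly points become 60 daily rows, which is why the smoothed line is drawn without markers and the markers come from the original frame.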

In pandas, how can I choose the curve color in a plot based on MultiIndex?

I have a MultiIndex dataframe like this one
Freights Passengers
Station Hogwarts Hogwarts
Direction Northbound Southbound Northbound Southbound
Date
2020-04-10 0 1 0 9
2020-04-11 0 2 0 8
2020-04-12 5 3 15 17
2020-04-13 6 4 4 26
I would like to get a plot with a curve for each column, the Freights columns should be red while the Passengers should be blue.
I have managed to get the result as in the following snippet but I would like to know whether there is a more idiomatic way (for example avoiding the for-loop) of doing it.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Date': ['2020-04-10', '2020-04-11', '2020-04-12', '2020-04-13','2020-04-10', '2020-04-11', '2020-04-12', '2020-04-13'],
'Station': ['Hogwarts', 'Hogwarts', 'Hogwarts', 'Hogwarts','Hogwarts', 'Hogwarts', 'Hogwarts', 'Hogwarts'],
'Direction': ['Southbound', 'Southbound', 'Southbound', 'Southbound','Northbound','Northbound','Northbound','Northbound'],
'Daily trains': [10,10,20,30,0,0,20,10],
'Freights': [1,2,3,4,0,0,5,6],
'Passengers': [9,8,17,26,0,0,15,4]})
df['Date'] = pd.to_datetime(df['Date'],format="%Y-%m-%d")
df1 = df.pivot_table(index='Date', columns=['Station','Direction'],values=['Freights','Passengers'])
colors = {'Freights':'red','Passengers':'blue'}
fig, ax = plt.subplots(1)
for i in df1:
    df1[i].plot(ax=ax,color=colors[i[0]])
ax.figure.savefig('so.png',bbox_inches='tight')
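One possible way to avoid the loop, offered as a sketch rather than a definitive answer: DataFrame.plot accepts a list of colors, one per column, so the list can be built from the first level of the column MultiIndex. The frame below is a shortened copy of the one in the question, and color_list is a name introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-10", "2020-04-11"] * 2),
    "Station": ["Hogwarts"] * 4,
    "Direction": ["Southbound", "Southbound", "Northbound", "Northbound"],
    "Freights": [1, 2, 0, 0],
    "Passengers": [9, 8, 0, 0],
})
df1 = df.pivot_table(index="Date", columns=["Station", "Direction"],
                     values=["Freights", "Passengers"])

colors = {"Freights": "red", "Passengers": "blue"}
# Level 0 of the column MultiIndex holds 'Freights' / 'Passengers'
color_list = [colors[name] for name in df1.columns.get_level_values(0)]

# One call draws every column, each with its mapped color
ax = df1.plot(color=color_list)
```

The list comprehension still iterates over the columns, but the plotting itself is a single call and the legend is built automatically.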

How to generate discrete data to pass into a contour plot using pandas and matplotlib?

I have two sets of continuous data that I would like to pass into a contour plot. The x-axis would be time, the y-axis would be mass, and the z-axis would be frequency (as in how many times that data point appears). However, most data points are not identical but rather very similar. Thus, I suspect it's easiest to discretize both the x-axis and y-axis.
Here's the data I currently have:
INPUT
import pandas as pd
df = pd.read_excel('data.xlsx')
df['Dates'].head(5)
df['Mass'].head(5)
OUTPUT
13 2003-05-09
14 2003-09-09
15 2010-01-18
16 2010-11-21
17 2012-06-29
Name: Date, dtype: datetime64[ns]
13 2500.0
14 3500.0
15 4000.0
16 4500.0
17 5000.0
Name: Mass, dtype: float64
I'd like to convert the data such that it groups up data points within the year (ex: all datapoints taken in 2003) and it groups up data points within different levels of mass (ex: all datapoints between 3000-4000 kg). Next, the code would count how many data points are within each of these blocks and pass that as the z-axis.
Ideally, I'd also like to be able to adjust the levels of slices. Ex: grouping points up every 100kg instead of 1000kg, or passing a custom list of levels that aren't equally distributed. How would I go about doing this?
I think the function you are looking for is pd.cut
import pandas as pd
import numpy as np
import datetime
n = 10
scale = 1e3
Min = 0
Max = 1e4
np.random.seed(6)
Start = datetime.datetime(2000, 1, 1)
Dates = np.array([Start + datetime.timedelta(days=i*180) for i in range(n)])
Mass = np.random.rand(n)*10000
df = pd.DataFrame(index = Dates, data = {'Mass':Mass})
print(df)
gives you:
Mass
2000-01-01 8928.601514
2000-06-29 3319.798053
2000-12-26 8212.291231
2001-06-24 416.966257
2001-12-21 1076.566799
2002-06-19 5950.520642
2002-12-16 5298.173622
2003-06-14 4188.074286
2003-12-11 3354.078493
2004-06-08 6225.194322
If you want to group your masses by, say, 1000, or use your own custom bins, you can do this:
Bins,Labels=np.arange(Min,Max+.1,scale),(np.arange(Min,Max,scale))+(scale)/2
EqualBins = pd.cut(df['Mass'],bins=Bins,labels=Labels)
df.insert(1,'Equal Bins',EqualBins)
Bins,Labels=[0,1000,5000,10000],['Small','Medium','Big']
CustomBins = pd.cut(df['Mass'],bins=Bins,labels=Labels)
df.insert(2,'Custom Bins',CustomBins)
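One detail worth checking when choosing bin edges: by default pd.cut builds right-closed intervals, so a value exactly equal to the lowest edge falls outside every bin unless include_lowest=True is passed. A small sketch with made-up masses:

```python
import pandas as pd

# Made-up masses that sit exactly on, or just past, the bin edges
mass = pd.Series([0.0, 1000.0, 1000.5, 5000.0])
edges = [0, 1000, 5000, 10000]
labels = ["Small", "Medium", "Big"]

# Default: intervals are (0, 1000], (1000, 5000], (5000, 10000]
default = pd.cut(mass, bins=edges, labels=labels)

# include_lowest=True also admits the lowest edge into the first bin
with_lowest = pd.cut(mass, bins=edges, labels=labels, include_lowest=True)
```

With the default settings the 0.0 entry comes back as NaN, while 1000.0 lands in Small and 5000.0 in Medium.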
If you want to just show the year, month, etc it is very simple:
df['Year'] = df.index.year
df['Month'] = df.index.month
but you can also do custom date ranges if you like:
Bins=[datetime.datetime(1999, 12, 31),datetime.datetime(2000, 9, 1),
datetime.datetime(2002, 1, 1),datetime.datetime(2010, 9, 1)]
Labels = ['Early','Middle','Late']
CustomDateBins = pd.cut(df.index,bins=Bins,labels=Labels)
df.insert(3,'Custom Date Bins',CustomDateBins)
print(df)
This yields something like what you want:
Mass Equal Bins Custom Bins Custom Date Bins Year Month
2000-01-01 8928.601514 8500.0 Big Early 2000 1
2000-06-29 3319.798053 3500.0 Medium Early 2000 6
2000-12-26 8212.291231 8500.0 Big Middle 2000 12
2001-06-24 416.966257 500.0 Small Middle 2001 6
2001-12-21 1076.566799 1500.0 Medium Middle 2001 12
2002-06-19 5950.520642 5500.0 Big Late 2002 6
2002-12-16 5298.173622 5500.0 Big Late 2002 12
2003-06-14 4188.074286 4500.0 Medium Late 2003 6
2003-12-11 3354.078493 3500.0 Medium Late 2003 12
2004-06-08 6225.194322 6500.0 Big Late 2004 6
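The .groupby function is probably of interest to you as well:
yeargroup = df.groupby(df.index.year).mean(numeric_only=True)  # skip the categorical bin columns
massgroup = df.groupby(df['Equal Bins'], observed=False).count()  # keep empty bins in the result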
print(yeargroup)
print(massgroup)
Mass Year Month
2000 6820.230266 2000.0 6.333333
2001 746.766528 2001.0 9.000000
2002 5624.347132 2002.0 9.000000
2003 3771.076389 2003.0 9.000000
2004 6225.194322 2004.0 6.000000
Mass Custom Bins Custom Date Bins Year Month
Equal Bins
500.0 1 1 1 1 1
1500.0 1 1 1 1 1
2500.0 0 0 0 0 0
3500.0 2 2 2 2 2
4500.0 1 1 1 1 1
5500.0 2 2 2 2 2
6500.0 1 1 1 1 1
7500.0 0 0 0 0 0
8500.0 2 2 2 2 2
9500.0 0 0 0 0 0
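To get from the binned columns to the z values of a contour plot, the counts have to be arranged on a year-by-mass grid. Below is a minimal sketch of one way to do that, using the five rows shown in the question, an assumed 1000 kg bin width, and the Agg backend so it runs without a display. The names edges, centers, table and counts are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2003-05-09", "2003-09-09", "2010-01-18",
                        "2010-11-21", "2012-06-29"])
mass = pd.Series([2500.0, 3500.0, 4000.0, 4500.0, 5000.0])

edges = np.arange(2000, 6001, 1000)        # assumed bin edges; any increasing list works
centers = (edges[:-1] + edges[1:]) / 2     # bin midpoints, used as y coordinates

table = pd.DataFrame({
    "year": dates.year.to_numpy(),
    "mass_mid": pd.cut(mass, bins=edges, labels=centers).astype(float),
})

# Rows: mass bins, columns: years, cells: number of data points
counts = (table.groupby(["mass_mid", "year"]).size()
          .unstack(fill_value=0)
          .reindex(index=centers, fill_value=0))

fig, ax = plt.subplots()
cs = ax.contourf(counts.columns.to_numpy(), counts.index.to_numpy(), counts.to_numpy())
fig.colorbar(cs, ax=ax, label="count")
ax.set_xlabel("Year")
ax.set_ylabel("Mass bin midpoint")
plt.close(fig)
```

Changing edges to a finer step or to an uneven custom list changes the grid resolution without touching the rest of the code.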

Pandas scatter plot by category and point size

I had the idea of using a single pandas plot to show two different quantities, one on the Y axis and the other as the point size, but I wanted to categorize them, i.e., the X axis is not a numerical value but a set of categories. I'll start by illustrating my two example dataframes:
earnings:
DayOfWeek Hotel Bar Pool
0 Sunday 41 32 15
1 Monday 45 38 24
2 Tuesday 42 32 27
3 Wednesday 45 37 23
4 Thursday 47 34 26
5 Friday 43 30 19
6 Saturday 48 30 28
and
tips:
DayOfWeek Hotel Bar Pool
0 Sunday 7 8 6
1 Monday 9 7 5
2 Tuesday 5 4 1
3 Wednesday 8 6 7
4 Thursday 4 5 10
5 Friday 3 1 1
6 Saturday 10 2 6
Earnings is the total earnings in the hotel, the bar and the pool, and tips is the average tip value in the same locations. I'll post my code as an answer; please feel free to improve/update.
Cheers!
See also:
Customizing Plot Legends
Here's the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
earnings = pd.read_csv('earnings.csv', sep=';')
tips = pd.read_csv('tips.csv', sep=';')
print(earnings)
print(tips)
earnings['index'] = earnings.index
height, width = earnings.shape
cols = list(earnings.columns.values)
colors = ['r', 'g', 'b']
# Thanks for
# https://stackoverflow.com/questions/43812911/adding-second-legend-to-scatter-plot
plt.rcParams["figure.subplot.right"] = 0.8
plt.figure(figsize=(8,4))
# get axis
ax = plt.gca()
# plot each column, each row will be in a different X coordinate, creating a category
for x in range(1, width-1):
    earnings.plot.scatter(x='index', y=cols[x], label=None,
                          xticks=earnings.index, c=colors[x-1],
                          s=tips[cols[x]].multiply(10), linewidth=0, ax=ax)
# This second 'dummy' plot is to create the legend. If we use the one above,
# the circles in the legend might have different sizes
for x in range(1,width-1):
    earnings.loc[:1].plot.scatter([], [], label=cols[x], c=colors[x-1], s=30,
                                  linewidth=0, ax=ax)
# Label the X ticks with the categories' names
ax.set_xticklabels(earnings.loc[:,'DayOfWeek'])
ax.set_ylabel("Total Earnings")
ax.set_xlabel("Day of Week")
leg = plt.legend(title="Location", loc=(1.03,0))
ax.add_artist(leg)
# Create a second legend for the points' scale.
h = [plt.plot([],[], color="gray", marker="o", ms=i, ls="")[0] for i in range(1,10, 2)]
plt.legend(handles=h, labels=range(1,10, 2), loc=(1.03,0.5), title="Avg. Tip")
plt.show()
# See also:
# https://jakevdp.github.io/PythonDataScienceHandbook/04.06-customizing-legends.html
Resulting figure
This is the kind of plot that is suited for a grammar of graphics.
from io import StringIO
import pandas as pd
from plotnine import *
# Create data
s1 = StringIO("""
DayOfWeek Hotel Bar Pool
0 Sunday 41 32 15
1 Monday 45 38 24
2 Tuesday 42 32 27
3 Wednesday 45 37 23
4 Thursday 47 34 26
5 Friday 43 30 19
6 Saturday 48 30 28
""")
s2 = StringIO("""
DayOfWeek Hotel Bar Pool
0 Sunday 7 8 6
1 Monday 9 7 5
2 Tuesday 5 4 1
3 Wednesday 8 6 7
4 Thursday 4 5 10
5 Friday 3 1 1
6 Saturday 10 2 6
""")
# Read data
earnings = pd.read_csv(s1, sep=r"\s+")
tips = pd.read_csv(s2, sep=r"\s+")
# Make tidy data
kwargs = dict(value_vars=['Hotel', 'Bar', 'Pool'], id_vars=['DayOfWeek'], var_name='location')
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
earnings = pd.melt(earnings, value_name='earnings', **kwargs)
tips = pd.melt(tips, value_name='tip', **kwargs)
df = pd.merge(earnings, tips, on=['DayOfWeek', 'location'])
df['DayOfWeek'] = pd.Categorical(df['DayOfWeek'], categories=days, ordered=True)
# Create plot
p = (ggplot(df)
+ geom_point(aes('DayOfWeek', 'earnings', color='location', size='tip'))
)
print(p)
