Time-series boxplot in pandas - python

How can I create a boxplot for a pandas time-series where I have a box for each day?
Sample dataset of hourly data where one box should consist of 24 values:
import numpy as np
import pandas as pd
n = 480
ts = pd.Series(np.random.randn(n),
               index=pd.date_range(start="2014-02-01",
                                   periods=n,
                                   freq="H"))
ts.plot()
I am aware that I could make an extra column for the day, but I would like to have proper x-axis labeling and x-limit functionality (like in ts.plot()), so being able to work with the datetime index would be great.
There is a similar question for R/ggplot2 here, if it helps to clarify what I want.

If it's an option for you, I would recommend using Seaborn, which is a wrapper for Matplotlib. You could do it yourself by looping over the groups from your time series, but that's much more work.
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
Which gives:
Note that I'm passing the day of year as the grouper to seaborn. If your data spans multiple years, this won't work, because the same day of year from different years would end up in one box. You could then consider something like:
ts.index.to_series().apply(lambda x: x.strftime('%Y%m%d'))
Edit: for 3-hourly boxes you could use this as a grouper, but it only works if the timestamps have no minutes or smaller units defined:
import datetime
[(dt - datetime.timedelta(hours=int(dt.hour % 3))).strftime('%Y%m%d%H') for dt in ts.index]
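As a possible alternative to the list comprehension above, pandas can round timestamps down to a fixed bin width, which should also cope with minutes and seconds. This is a minimal sketch rather than part of the original answer; the names `grouper` and `group_sizes` are only illustrative.

```python
import numpy as np
import pandas as pd

n = 480
ts = pd.Series(np.random.randn(n),
               index=pd.date_range(start="2014-02-01", periods=n, freq="h"))

# Round every timestamp down to the start of its 3-hour bin
grouper = ts.index.floor("3h")

# Each bin of this hourly series should hold three values
group_sizes = ts.groupby(grouper).size()
print(group_sizes.head())
```

The resulting `grouper` can be passed wherever the string labels were used. Because its values are still timestamps, it can also be formatted later with `strftime`.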

(Not enough rep to comment on accepted solution, so adding an answer instead.)
The accepted code has two small errors: (1) the numpy import needs to be added and (2) the x and y parameters in the boxplot statement need to be swapped. The following produces the plot shown.
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)

I have a solution that may be helpful. It uses only native pandas and allows for hierarchical date-time grouping (i.e. spanning years). The key is that if you pass a function to groupby(), it will be called on each element of the dataframe's index. If your index is a DatetimeIndex (or similar), each element is a Timestamp, so you can use all of its datetime convenience methods (such as strftime) to build the grouping key!
Try this:
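import numpy as np
import pandas as pd
n = 480
ts = pd.DataFrame(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
# The function passed to groupby() receives each index element (a Timestamp)
ts.groupby(lambda x: x.strftime("%Y-%m-%d")).boxplot(subplots=False, figsize=(12,9), rot=90)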
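The question also asked for a real datetime x-axis. One way to approach that, sketched here in plain matplotlib and not taken from any of the answers, is to group per day and place each box at that day's position in matplotlib date units. The names `daily`, `days`, `data` and `positions` are illustrative, and `manage_ticks=False` assumes matplotlib 3.1 or newer.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

n = 480
ts = pd.Series(np.random.randn(n),
               index=pd.date_range(start="2014-02-01", periods=n, freq="h"))

# Group by calendar day; normalize() sets every timestamp to midnight
daily = ts.groupby(ts.index.normalize())
days = pd.DatetimeIndex([day for day, _ in daily])
data = [values.to_numpy() for _, values in daily]

# Center each box at noon of its day, in matplotlib date units (days)
positions = mdates.date2num((days + pd.Timedelta(hours=12)).to_pydatetime())

fig, ax = plt.subplots(figsize=(12, 5))
# manage_ticks=False keeps boxplot from replacing the ticks with one label per box
ax.boxplot(data, positions=positions, widths=0.6, manage_ticks=False)
ax.xaxis_date()
ax.xaxis.set_major_locator(mdates.AutoDateLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
fig.autofmt_xdate()
```

Since the x-axis is now a regular date axis, date limits can be set with `ax.set_xlim`, and the same grouping keeps working for data that spans several years. Call `plt.show()` to display the figure.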

Related

Python: How to construct a joyplot with values taken from a column in pandas dataframe as y axis

I have a dataframe df in which the column extracted_day consists of dates ranging from 2022-05-08 to 2022-05-12. I have another column named gas_price, which contains the price of the gas. I want to construct a joyplot such that, for each date, it shows gas_price on the y-axis and minutes_elapsed_from_start_of_day on the x-axis. We may also use a ridgeplot or any other plot if this doesn't work.
This is the code that I have written, but it doesn't serve my purpose.
from joypy import joyplot
import matplotlib.pyplot as plt
df['extracted_day'] = df['extracted_day'].astype(str)
joyplot(df, by = 'extracted_day', column = 'minutes_elapsed_from_start_of_day',figsize=(14,10))
plt.xlabel("Number of minutes elapsed throughout the day")
plt.show()
Create dataframe with mock data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from joypy import joyplot
np.random.seed(111)
df = pd.DataFrame({
'minutes_elapsed_from_start_of_day': np.tile(np.arange(1440), 5),
'extracted_day': np.repeat(['2022-05-08', '2022-05-09', '2022-05-10','2022-05-11', '2022-05-12'], 1440),
'gas_price': abs(np.cumsum(np.random.randn(1440*5)))})
Then create the joyplot. It is important that you set kind='values', since you do not want joyplot to show KDEs (kernel density estimates, joyplot's default) but the raw gas_price values:
joyplot(df, by='extracted_day',
column='gas_price',
kind='values',
x_range=np.arange(1440),
figsize=(7,5))
The resulting joyplot looks like this (the fake gas prices are represented by the y-values of the lines):

How to sequentially add seaborn boxplots to the same axis?

Is there a way to add multiple seaborn boxplots to one figure sequentially?
Taking example from Time-series boxplot in pandas:
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
This gives me one series of box plots.
Now, is there any way to plot two time series like this on the same plot, side by side? I want to plot it in a function that has a make_new_plot boolean parameter for separating the boxplots that are plotted from the for-loop.
If I try to just call it on the same axis, it gives me the overlapping plots:
I know that it is possible to concatenate the dataframes and make box plots of the concatenated dataframe together, but I would not want to have this plotting function returning any dataframes.
Is there some other way to do it? Maybe it is possible to manipulate the width and position of the boxes to achieve this? I need a time series of boxplots, and seaborn deliberately does not support matplotlib's "positions" parameter, which makes it tricky for me to figure out how to do it.
Note that it is NOT the same as eg. Plotting multiple boxplots in seaborn?, because I want to plot it sequentially without returning any dataframes from the plotting function.
You could do something like the following if you want to have hue nesting of different time-series in your boxplots.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
n = 480
ts0 = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts1 = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts2 = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
def ts_boxplot(ax, list_of_ts):
    new_list_of_ts = []
    for i, ts in enumerate(list_of_ts):
        ts = ts.to_frame(name='ts_variable')
        ts['ts_number'] = i
        ts['doy'] = ts.index.dayofyear
        new_list_of_ts.append(ts)
    plot_data = pd.concat(new_list_of_ts)
    sns.boxplot(data=plot_data, x='doy', y='ts_variable', hue='ts_number', ax=ax)
    return ax
fig, ax = plt.subplots(figsize=(12,5))
ax = ts_boxplot(ax, [ts0, ts1, ts2])
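If the boxes really have to be added one series at a time, without concatenating anything, one option is to skip seaborn and give each series its own slot through matplotlib's `positions` and `widths` parameters. The following is only a sketch under that assumption; `add_daily_boxes`, `slot` and `n_slots` are hypothetical names, not part of seaborn or matplotlib.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def add_daily_boxes(ax, ts, slot, n_slots, color):
    """Draw one box per day for ts, shifted into its own slot (hypothetical helper)."""
    groups = ts.groupby(ts.index.dayofyear)
    days = np.array([doy for doy, _ in groups])
    data = [values.to_numpy() for _, values in groups]
    width = 0.8 / n_slots
    # Spread the slots symmetrically around each integer day-of-year position
    positions = days + (slot - (n_slots - 1) / 2) * width
    ax.boxplot(data, positions=positions, widths=width * 0.9,
               patch_artist=True, boxprops={"facecolor": color},
               manage_ticks=False)
    return positions


n = 480
index = pd.date_range(start="2014-02-01", periods=n, freq="h")
series = [pd.Series(np.random.randn(n), index=index) for _ in range(2)]

fig, ax = plt.subplots(figsize=(12, 5))
# Each call adds one more series to the same axes; no dataframe is returned
for slot, (ts, color) in enumerate(zip(series, ["C0", "C1"])):
    positions = add_daily_boxes(ax, ts, slot, n_slots=2, color=color)
ax.set_xlabel("day of year")
```

The number of slots has to be known up front so the boxes do not overlap. That is the price of not collecting the data in one dataframe first.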

Dates in X-axis using pandas and matplotlib

I am trying to plot some data from pandas. First I group by weeks and count the rows in each week, then I want to plot a point for each date. However, when I plot, only some of the dates appear on the x-axis, not all of them.
I am using the following code:
my_data = res1.groupby(pd.Grouper(key='d', freq='W-MON')).agg('count').u
p1, = plt.plot(my_data, '.-')
a = plt.xticks(rotation=45)
My result is the following:
I wanted a value in the x-axis for each date in the grouped dataframe.
EDIT: I tried to use plt.xticks(list(my_data.index.astype(str)), rotation=45)
The plot I get is the following:
Please find a working chunk of code below:
from datetime import date, timedelta
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
a = pd.Series(np.random.randint(10, 99, 10))
# One major tick per day, formatted as month/day/year
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator())
plt.plot(pd.date_range(date(2016,1,1), periods=10, freq='D'), a)
plt.gcf().autofmt_xdate()
Hope it helps :)
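For the weekly grouping in the question, the same idea can be adapted by swapping the day locator for a weekday locator. This is a sketch with made-up data: `res1` below is a hypothetical stand-in with the columns `d` and `u` from the question, and it relies on `W-MON` bins being labeled with Mondays.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Hypothetical stand-in for res1: 500 events spread over ten weeks
rng = np.random.default_rng(0)
res1 = pd.DataFrame({
    "d": pd.to_datetime("2016-01-01") + pd.to_timedelta(rng.integers(0, 70, 500), unit="D"),
    "u": rng.integers(0, 10, 500),
})

# Weekly counts; each W-MON bin is labeled with the Monday that closes it
my_data = res1.groupby(pd.Grouper(key="d", freq="W-MON"))["u"].count()

fig, ax = plt.subplots()
ax.plot(my_data.index, my_data.values, ".-")
# One major tick on every Monday, which matches the bin labels
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MO))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
fig.autofmt_xdate(rotation=45)
```

With many weeks the labels will crowd again. In that case the `interval` argument of `WeekdayLocator` can thin them out.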

How to combine bar and line plots with x-axis as datetime in matplotlib

I have a DataFrame with a DatetimeIndex and two columns with int values. I would like to plot Col1 as a bar plot and Col2 as a line plot on the same graph.
An important requirement is a correctly labeled datetime x-axis, also when zooming in and out. I think solutions with DateFormatter would not work, since I want dynamic xtick labeling.
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt
import numpy as np
startDate = dt.datetime(2018,1,1,0,0)
nrHours = 144
datetimeIndex = [startDate + dt.timedelta(hours=x) for x in range(0,nrHours)]
dF = pd.DataFrame(index=datetimeIndex)
dF['Col1'] = np.random.randint(1,3,nrHours)
dF['Col2'] = np.random.randint(3,6,nrHours)
axes = dF[['Col1']].plot(kind='bar')
dF[['Col2']].plot(ax=axes)
What seemed to be a simple task turned out to be very challenging. Even after an extensive search on the net, I still haven't found any clean solutions.
I have tried to use both pandas plot and matplotlib.
The main issue arises from the bar plot, which seems to have difficulties handling a datetime index (it prefers integers; in some cases it plots dates, but in Epoch 1970-1-1 style, which is equivalent to 0).
I finally found a way using mdates and date2num. The solution is not very clean but provides an efficient solution to:
Combine bar and line plot on same graph
Using datetime on x-axis
Correctly and dynamically displaying x-ticks time labels (also when zooming in and out)
Working example :
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import datetime as dt
import numpy as np
startDate = dt.datetime(2018,1,1,0,0)
nrHours = 144
datetimeIndex = [startDate + dt.timedelta(hours=x) for x in range(0, nrHours)]
dF = pd.DataFrame(index=datetimeIndex)
dF['Col1'] = np.random.randint(1,3,nrHours)
dF['Col2'] = np.random.randint(3,6,nrHours)
fig,axes = plt.subplots()
axes.xaxis_date()
axes.plot(mdates.date2num(list(dF.index)),dF['Col2'])
axes.bar(mdates.date2num(list(dF.index)),dF['Col1'],align='center',width=0.02)
fig.autofmt_xdate()
Sample output:
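As a possible simplification of the working example above: recent matplotlib versions appear to accept datetime values directly in both `bar` and `plot`, so the `date2num` calls may not be needed. On a date axis the bar width is measured in days, which is why a small value such as 0.02 is used above. The sketch below derives the width from the index spacing instead; `bar_width_days` is an illustrative name.

```python
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

startDate = dt.datetime(2018, 1, 1, 0, 0)
nrHours = 144
datetimeIndex = pd.date_range(startDate, periods=nrHours, freq="h")
dF = pd.DataFrame({"Col1": np.random.randint(1, 3, nrHours),
                   "Col2": np.random.randint(3, 6, nrHours)},
                  index=datetimeIndex)

# Bar widths on a date axis are in days: use 80% of the one-hour spacing
bar_width_days = 0.8 * (dF.index[1] - dF.index[0]) / pd.Timedelta(days=1)

fig, ax = plt.subplots()
ax.bar(dF.index, dF["Col1"], width=bar_width_days, align="center")
ax.plot(dF.index, dF["Col2"], color="C1")
fig.autofmt_xdate()
```

Because no fixed formatter is set, matplotlib's default date locator and formatter stay in charge of the ticks, so the labels should adapt when zooming.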

datetime x-axis matplotlib labels causing uncontrolled overlap

I'm trying to plot a pandas series with a 'pandas.tseries.index.DatetimeIndex'. The x-axis labels stubbornly overlap, and I cannot make them presentable, even with several suggested solutions.
I tried a Stack Overflow solution suggesting autofmt_xdate, but it doesn't help.
I also tried the suggestion to use plt.tight_layout(), which has no effect.
ax = test_df[(test_df.index.year ==2017) ]['error'].plot(kind="bar")
ax.figure.autofmt_xdate()
#plt.tight_layout()
print(type(test_df[(test_df.index.year ==2017) ]['error'].index))
UPDATE: That I'm using a bar chart is an issue. A regular time-series plot shows nicely-managed labels.
A pandas bar plot is a categorical plot. It shows one bar for each index value at integer positions on the scale. Hence the first bar is at position 0, the next at 1, etc. The labels correspond to the dataframe's index. If you have 100 bars, you'll end up with 100 labels. This makes sense because pandas cannot know if those should be treated as categories or ordinal/numeric data.
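The categorical placement can be checked by inspecting the ticks pandas creates. This is a small sketch with made-up data; `tick_positions` and `tick_labels` are illustrative names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
df = pd.DataFrame(np.cumsum(np.random.randn(42)), columns=["error"],
                  index=pd.date_range("2017-01-01", periods=42))

fig, ax = plt.subplots()
df["error"].plot(kind="bar", ax=ax)

# The bars sit at 0, 1, 2, ... regardless of the dates in the index
tick_positions = ax.get_xticks()
# One label per bar, so 42 bars give 42 labels
tick_labels = ax.get_xticklabels()
print(tick_positions[:5], len(tick_labels))
```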
If instead you use a normal matplotlib bar plot, it will treat the dataframe index numerically. This means the bars have their position according to the actual dates and labels are placed according to the automatic ticker.
import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
datelist = pd.date_range('2017-01-01', periods=42).tolist()
df = pd.DataFrame(np.cumsum(np.random.randn(42)),
columns=['error'], index=pd.to_datetime(datelist))
plt.bar(df.index, df["error"].values)
plt.gcf().autofmt_xdate()
plt.show()
An additional advantage is that matplotlib.dates locators and formatters can be used, e.g. to label the first and fifteenth of each month with a custom format:
import pandas as pd
import numpy as np; np.random.seed(42)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
datelist = pd.date_range('2017-01-01', periods=93).tolist()
df = pd.DataFrame(np.cumsum(np.random.randn(93)),
columns=['error'], index=pd.to_datetime(datelist))
plt.bar(df.index, df["error"].values)
plt.gca().xaxis.set_major_locator(mdates.DayLocator((1,15)))
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%d %b %Y"))
plt.gcf().autofmt_xdate()
plt.show()
In your situation, the easiest would be to manually create labels and spacing, and apply that using ax.xaxis.set_major_formatter.
Here's a possible solution:
Since no sample data was provided, I tried to mimic the structure of your dataset in a dataframe with some random numbers.
The setup:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
# A dataframe with random numbers to run tests on
np.random.seed(123456)
rows = 100
df = pd.DataFrame(np.random.randint(-10,10,size=(rows, 1)), columns=['error'])
# pd.datetime is no longer available in current pandas; pass the start date as a string
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
df.index = pd.to_datetime(df.index)
test_df = df.copy(deep = True)
# Plot of data that mimics the structure of your dataset
# (figsize is passed to plot(); a later plt.figure() call would open a new, empty figure)
ax = test_df[(test_df.index.year ==2017) ]['error'].plot(kind="bar", figsize=(15,8))
ax.figure.autofmt_xdate()
A possible solution:
test_df = df.copy(deep = True)
# Pass figsize to plot() so the size applies to the figure that holds the bars
ax = test_df[(test_df.index.year ==2017) ]['error'].plot(kind="bar", figsize=(15,8))
# Make a list of empty myLabels
myLabels = ['']*len(test_df.index)
# Set labels on every 20th element in myLabels
myLabels[::20] = [item.strftime('%Y - %m') for item in test_df.index[::20]]
ax.xaxis.set_major_formatter(ticker.FixedFormatter(myLabels))
ax.figure.autofmt_xdate()
# Tilt the labels
plt.setp(ax.get_xticklabels(), rotation=30, fontsize=10)
plt.show()
You can easily change the formatting of labels by checking strftime.org
