How to create "Weekly Boxplots"? - python

I have a dataset that looks like this. It holds one month of data for two categories: 62 rows, 31 per category. I would like to create weekly boxplots with the week number and month on the y-axis [like 01-12, 02-12, 03-12 and so on].
So far I have come up with the following code.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
fig, ax = plt.subplots(figsize=(18,6))
df.index = pd.to_datetime(df.Timestamp)
sns.boxplot(x=df.index.week, y='Values', data=df, hue='Category', ax=ax)
Using df.index.week does not give me the week value I expect; instead it gives me the week number of the year, like this.
Any guidance, please?

You can create a grouping column in your df by formatting values from the Date column. First, some sample data:
import numpy as np
import pandas as pd
date_range = pd.date_range(start='2013-12-01', end='2013-12-31').to_list()
df = pd.DataFrame(
{
"Date": date_range + date_range,
"Values": np.random.randint(1000, 20000, 62),
"Category": ["anti"] * 31 + ["pro"] * 31,
}
)
Use pandas.Series.dt.strftime to get the week of the year (%U) and month (%m) joined by a -:
df["week_month"] = df["Date"].dt.strftime("%U-%m")
(Thanks for the better method, @Cameron Riddell)
Then plot:
sns.boxplot(x="week_month", y="Values", data=df, hue="Category")
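Note that %U is the week of the year, so the labels for December 2013 start at 48-12 rather than 01-12. If the goal is a week number that restarts each month, one possible sketch is below. It assumes "week of month" means consecutive 7-day blocks counted from the 1st (days 1-7 are week 01, days 8-14 are week 02, and so on), which may not match every calendar convention; the helper name week_of_month_label is hypothetical.

```python
import pandas as pd

def week_of_month_label(dates: pd.Series) -> pd.Series:
    """Label each date as 'WW-MM', where WW counts 7-day blocks from the 1st."""
    # days 1-7 -> 1, days 8-14 -> 2, ..., days 29-31 -> 5
    week = (dates.dt.day - 1) // 7 + 1
    return week.astype(str).str.zfill(2) + "-" + dates.dt.strftime("%m")

dates = pd.Series(pd.date_range("2013-12-01", "2013-12-31"))
labels = week_of_month_label(dates)
print(labels.unique())
```

The result could be assigned to df["week_month"] and passed to sns.boxplot in the same way as the strftime version.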

Related

Seaborn heatmap change date frequency of yticks

My problem is similar to the one encountered on this topic: Change heatmap's yticks for multi-index dataframe
I would like to have yticks every 6 months, with them being the index of my dataframe. But I can't manage to make it work.
The issue is that my dataframe is 13500*290 and the answer given in the link takes a long time and doesn't really work (see image below).
This is an example of my code without the solution from the link, this part works fine for me:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
df = pd.DataFrame(index = pd.date_range(datetime(1984, 6, 10), datetime(2021, 1, 14), freq='1D') )
for i in range(0,290):
    df['Pt{0}'.format(i)] = np.random.random(size=len(df))
f, ax = plt.subplots(figsize=(20,20))
sns.heatmap(df, cmap='PuOr', vmin = np.min(np.min(df)), vmax = np.max(np.max(df)), cbar_kws={"label": "Ice Velocity (m/yr)"})
This part does not work for me and produces the figure below, which shouldn't have the stack of ylabels on the yaxis:
f, ax = plt.subplots(figsize=(20,20))
years = df.index.get_level_values(0)
ytickvalues = [year if index in (2, 7, 12) else '' for index, year in enumerate(years)]
sns.heatmap(df, cmap='PuOr', vmin = np.min(np.min(df)), vmax = np.max(np.max(df)), cbar_kws={"label": "Ice Velocity (m/yr)"}, yticklabels = ytickvalues)
Here are a couple ways to adapt that link for your use case (1 label per 6 months):
Either: Show an empty string except on Jan 1 and Jul 1 (i.e., when %m%d evals to 0101 or 0701)
labels = [date if date.strftime('%m%d') in ['0101', '0701'] else ''
          for date in df.index.date]
Or: Show an empty string except every ~365/2 days (i.e., when row % 183 == 0)
labels = [date if row % 183 == 0 else ''
          for row, date in enumerate(df.index.date)]
Note that you don't have a MultiIndex, so you can just use df.index.date (no need for get_level_values).
Here is the output with a minimized version of your df:
sns.heatmap(df, cmap='PuOr', cbar_kws={'label': 'Ice Velocity (m/yr)'},
            vmin=df.values.min(), vmax=df.values.max(),
            yticklabels=labels)
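The first option can also be wrapped in a small helper that works on the DatetimeIndex directly. This is only a sketch: the function name sparse_date_labels and its month_days parameter are made up for illustration, and it formats the labels as ISO date strings rather than date objects.

```python
import pandas as pd

def sparse_date_labels(index: pd.DatetimeIndex, month_days=("0101", "0701")):
    """Return one label per row: an ISO date on the chosen month-days, '' elsewhere."""
    keys = index.strftime("%m%d")      # e.g. '0101' for Jan 1
    iso = index.strftime("%Y-%m-%d")   # text that will be shown on the axis
    return [d if k in month_days else "" for d, k in zip(iso, keys)]

idx = pd.date_range("1984-06-10", "2021-01-14", freq="D")
labels = sparse_date_labels(idx)
shown = [s for s in labels if s]
print(len(labels), len(shown), shown[:2])
```

The returned list has one entry per row, so it can be passed as yticklabels=labels to sns.heatmap.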

Plotting a pandas Series using dates and values too squished

I am trying to plot a simple pandas Series object. It looks something like this:
2018-01-01 10
2018-01-02 90
2018-01-03 79
...
2020-01-01 9
2020-01-02 72
2020-01-03 65
It includes only the first month of each year, so it contains only January and the values for each of its days.
When I try to plot it
# suppose the name of the series is dates_and_values
dates_and_values.plot()
It returns a plot like this (made using my current data)
It is clearly plotting by year and then by month, so it looks squished and small because I have no months other than January. Is there a way to plot it by year and day, so that the plot makes the individual days easier to observe?
The x-axis is the index of the dataframe.
Dates are a continuous series, so the x-axis is continuous.
Changing the index to strings means it is no longer continuous, which squeezes the empty gaps out of your graph.
I have generated some sample data that only has January to demonstrate.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# custom "business day" frequency that treats every non-January date as a holiday
cf = pd.tseries.offsets.CustomBusinessDay(weekmask="Sun Mon Tue Wed Thu Fri Sat",
                                          holidays=[d for d in pd.date_range("01-jan-1990",periods=365*50, freq="D")
                                                    if d.month!=1])
d = pd.date_range("01-jan-2015", periods=200, freq=cf)
df = pd.DataFrame({"Values":np.random.randint(20,70,len(d))}, index=d)
fig, ax = plt.subplots(2, figsize=[14,6])
# top: string index (categorical axis); bottom: datetime index (continuous axis)
df.set_index(df.index.strftime("%Y %d")).plot(ax=ax[0])
df.plot(ax=ax[1])
I suggest that you convert the series to a dataframe and then pivot it to get one column for each year. This lets you plot the data for each year with a separate line, either in the same plot using different colors or in subplots. Here is an example:
import numpy as np # v 1.19.2
import pandas as pd # v 1.2.3
# Create sample series
rng = np.random.default_rng(seed=123) # random number generator
dt = pd.date_range('2018-01-01', '2020-01-31', freq='D')
dt_jan = dt[dt.month == 1]
series = pd.Series(rng.integers(20, 90, size=dt_jan.size), index=dt_jan)
# Convert series to dataframe and pivot it
df_raw = series.to_frame()
df_pivot = df_raw.pivot_table(index=df_raw.index.day, columns=df_raw.index.year)
df = df_pivot.droplevel(axis=1, level=0)
df.head()
# Plot all years together in different colors
ax = df.plot(figsize=(10,4))
ax.set_xlim(1, 31)
ax.legend(frameon=False, bbox_to_anchor=(1, 0.65))
ax.set_xlabel('January', labelpad=10, size=12)
for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)
# Plot years separately
axs = df.plot(subplots=True, color='tab:blue', sharey=True,
              figsize=(10,8), legend=None)
for ax in axs:
    ax.set_xlim(1, 31)
    ax.grid(axis='x', alpha=0.3)
    handles, labels = ax.get_legend_handles_labels()
    ax.text(28.75, 80, *labels, size=14)
    # Axes.is_last_row() is gone in newer matplotlib releases; ask the subplot spec instead
    if ax.get_subplotspec().is_last_row():
        ax.set_xlabel('January', labelpad=10, size=12)
ax.figure.subplots_adjust(hspace=0)

hourly heatmap from multi years timeseries python

I need to create an hourly-mean, multi-panel heatmap of temperature for several years, as in the reference image.
The data to plot are read from an Excel sheet. The sheet is formatted as "year", "month", "day", "hour", "Temp".
I created a monthly mean heatmap using the seaborn library with this code:
df = pd.read_excel('D:\\Users\\CO2_heatmap.xlsx')
co2=df.pivot_table(index="month",columns="year",values='CO2',aggfunc="mean")
ax = sns.heatmap(co2,cmap='bwr',vmin=370,vmax=430, cbar_kws={'label': r'$\mathregular{CO_2}$ [ppm]', 'orientation': 'vertical'})
Obtaining this graph:
How can I generate a
co2=df.pivot_table(index="hour",columns="day",values='CO2',aggfunc="mean")
for each month and for each year?
The seaborn heatmap function cannot by itself lay out multiple graphs on different axes, so I created one figure with a grid of subplots and drew a separate seaborn heatmap on each axis. It is not as customizable as the reference graph, sorry.
import pandas as pd
import numpy as np
date_rng = pd.date_range('2018-01-01', '2019-12-31',freq='1H')
# one random integer value per hourly timestamp
temp = np.random.randint(-30, 40, len(date_rng))
df = pd.DataFrame({'CO2':temp},index=pd.to_datetime(date_rng))
df.insert(1, 'year', df.index.year)
df.insert(2, 'month', df.index.month)
df.insert(3, 'day', df.index.day)
df.insert(4, 'hour', df.index.hour)
df = df.copy()
yyyy = df['year'].unique()
month = df['month'].unique()
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(figsize=(20,10), nrows=2, ncols=12)
# m = 1-12 fills the first row (first year), m = 13-24 the second row (second year)
for m, ax in zip(range(1,25), axes.flat):
    if m <= 12:
        y = yyyy[0]
        df1 = df[(df['year'] == y) & (df['month'] == m)]
    else:
        y = yyyy[1]
        m -= 12
        df1 = df[(df['year'] == y) & (df['month'] == m)]
    # rows = hour of day, columns = day of month
    df1 = df1.pivot_table(index="hour",columns="day",values='CO2',aggfunc="mean")
    sns.heatmap(df1, cmap='RdBu', cbar=False, ax=ax)
This might help- /hourly-heatmap-graph-using-python-s-ggplot2-implementation-plotnine
There's also a guide to producing this exact plot (for two years of data) on the
Python graph gallery-heatmap-for-timeseries-matplotlib
I'm afraid I don't know any Python, so didn't want to copy/paste in case I missed anything. I did, however, create the original plot in R :) The main trick was to use facet_grid to split the data by year and month, and reverse the y axis labels.
It looks like
fig, axes = plt.subplots(2, 12, figsize=(14, 10), sharey=True)
for i, year in enumerate([2004, 2005]):
    for j, month in enumerate(range(1, 13)):
        single_plot(data, month, year, axes[i, j])
does the work of splitting by year and month.
I hope this helps you get further forward
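The splitting step can be tried without any plotting. The sketch below is an assumption-laden illustration, not the gallery's code: monthly_hour_day_tables is a hypothetical helper, the data are random, and it only builds the hour-by-day table that each panel would display.

```python
import numpy as np
import pandas as pd

def monthly_hour_day_tables(series: pd.Series) -> dict:
    """Map (year, month) -> DataFrame of mean values, rows = hour, columns = day."""
    frame = pd.DataFrame({
        "value": series.to_numpy(),
        "year": series.index.year,
        "month": series.index.month,
        "day": series.index.day,
        "hour": series.index.hour,
    })
    tables = {}
    for (year, month), part in frame.groupby(["year", "month"]):
        tables[(int(year), int(month))] = part.pivot_table(
            index="hour", columns="day", values="value", aggfunc="mean")
    return tables

# Two years of made-up hourly data
rng = np.random.default_rng(0)
idx = pd.date_range("2018-01-01", "2019-12-31 23:00", freq="60min")
tables = monthly_hour_day_tables(pd.Series(rng.normal(size=len(idx)), index=idx))
print(len(tables), tables[(2018, 2)].shape)
```

Each table could then be drawn on its own axis, for example with sns.heatmap(tables[(year, month)], ax=axes[i, j], cbar=False).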

Python convert the day of year to month on an axis

I have a time series that I would like to plot year on year. I want the data to be daily, but the axis to show each month as "Jan", "Feb" etc.
At the moment I can get the daily data, BUT the axis is 1-366 (the day of the year).
Or I can get the monthly axis as 1, 2, 3 etc (by changing the index to df.index.month), BUT then the data is monthly.
How can I convert the day of year axis into months? Or how can I do this?
Code showing the daily data, but the axis is wrong:
# import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create fake time series dataframe
index = pd.date_range(start='01-Jan-2012', end='31-Dec-2018', freq='D')
data = np.random.randn(len(index))
df = pd.DataFrame(data, index, columns=['Data'])
# pivot to get by day in rows, then year in columns
df_pivot = pd.pivot_table(df, index=df.index.dayofyear, columns=df.index.year, values='Data')
df_pivot.plot()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
This can be done using the xticks function. Simply add the following code before plt.show():
plt.xticks(np.linspace(0,365,13)[:-1], ('Jan', 'Feb' ... 'Nov', 'Dec'))
Or the following to have the month names appear in the middle of the month:
plt.xticks(np.linspace(15,380,13)[:-1], ('Jan', 'Feb' ... 'Nov', 'Dec'))
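The tuple of month names above is abbreviated with an ellipsis, so it will not run as typed. A possible way to build both the labels and the tick positions is sketched here. It assumes a non-leap year for the month boundaries and an English locale for calendar.month_abbr; the names month_names and month_starts are just illustrative.

```python
import calendar
import numpy as np

# Abbreviated month names; calendar.month_abbr[0] is an empty string, so skip it.
month_names = list(calendar.month_abbr)[1:]

# Day-of-year on which each month starts in a non-leap year (1, 32, 60, ...).
days_in_month = [calendar.monthrange(2019, m)[1] for m in range(1, 13)]
month_starts = 1 + np.concatenate(([0], np.cumsum(days_in_month[:-1])))

print(month_names)
print(month_starts)
```

Calling plt.xticks(month_starts, month_names) before plt.show() would then place each label at the first day of its month; in leap years the ticks after February are off by one day.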
It may be more straightforward to simply add a datetime index to your pivoted dataframe.
df_pivot.index = pd.date_range(
    df.index.max() - pd.Timedelta(days=df_pivot.shape[0]),
    freq='D', periods=df_pivot.shape[0])
df_pivot.plot()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
The resulting plot has the axis as desired:
This method also has the advantage over the accepted answer of working irrespective of your start and end date. For example, if you change your index's end date to end='30-Jun-2018', the axis adapts nicely to fit the data:

how to analyze time-series data as a function of the time of day in pandas

Suppose I have a random sample of data collected every minute for a month, and I want to use pandas to analyze it as a function of the time of day and to see the differences between weekends and weekdays. If my index is a DatetimeIndex, I can do this by calculating the time of day as a decimal value from 0 to 1, manually binning the results in intervals of 10 minutes (or whatever), plotting the results using the bins column to calculate averages over the time intervals of the day, and then manually setting my tick positions and labels to something understandable.
However, this feels a little bit hacky and I am wondering if there are built-in pandas functions to achieve this same kind of analysis. I haven't been able to find them so far.
import numpy as np
import pandas as pd
dates = pd.date_range(start='2018-10-01', end='2018-11-01', freq='min')
vals = np.random.rand(len(dates))
df = pd.DataFrame(data={'dates': dates, 'vals': vals})
df.set_index('dates', inplace=True)
# set up a column to make the time of day a value from 0 to 1
df['day_fraction'] = (df.index.hour + df.index.minute / 60) / 24
# bin the time of day to analyze data during 10 minute intervals
df['day_bins'] = df['day_fraction'] - df['day_fraction'] % (1 / 24 / 6)
ax = df.plot('day_fraction', 'vals', marker='o', color='pink', alpha=0.05, label='')
df.groupby('day_bins')['vals'].mean().plot(ax=ax, label='average')
df[df.index.weekday < 5].groupby('day_bins')['vals'].mean().plot(ax=ax, label='weekday average')
df[df.index.weekday >= 5].groupby('day_bins')['vals'].mean().plot(ax=ax, label='weekend average')
xlabels = [label if label else 12 for label in [i % 12 for i in range(0, 25, 2)]]
xticks = [i / 24 for i in range(0, 25, 2)]
ax.set_xticks(xticks)
ax.set_xticklabels(xlabels)
ax.set_xlabel('time of day')
ax.legend()
I think you just need to use groupby with a lot of the built in .dt accessors. Group based on weekday or weekend and then form bins every 10 minutes (with .floor) and calculate the mean.
Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = pd.date_range(start='2018-10-01', end='2018-11-01', freq='min')
vals = np.random.rand(len(dates))
df = pd.DataFrame(data={'dates': dates, 'vals': vals})
df.set_index('dates', inplace=True)
Plot
df1 = (df.groupby([np.where(df.index.weekday < 5, 'weekday', 'weekend'),
                   df.index.floor('10min').time])
         .mean()
         .rename(columns={'vals': 'average'}))
fig, ax = plt.subplots(figsize=(12,7))
df1.unstack(0).plot(ax=ax)
# Plot Full Average
df.groupby(df.index.floor('10min').time).mean().rename(columns={'vals': 'average'}).plot(ax=ax)
plt.show()
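To see what the grouping produces without plotting, here is a small sketch with deterministic values instead of random ones (minute number modulo 10, chosen only so the averages are easy to predict). The variable names day_type, time_bin and profile are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2018-10-01", "2018-11-01", freq="min")
# Values cycle 0..9 every ten minutes, so a full 10-minute bin averages 4.5
df = pd.DataFrame({"vals": np.arange(len(dates)) % 10}, index=dates)

day_type = np.where(df.index.weekday < 5, "weekday", "weekend")
time_bin = df.index.floor("10min").time   # time of day, floored to 10 minutes

# One row per 10-minute bin of the day, one column per day type
profile = df.groupby([day_type, time_bin])["vals"].mean().unstack(0)
print(profile.shape)
print(profile.head())
```

There are 144 ten-minute bins in a day, so the table has 144 rows and one column each for weekday and weekend.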
