Ensuring first and last date ticks in x-axis - Matplotlib - python

Currently I am charting data from some historical point to a point in current time. For example, January 2019 to TODAY (February 2021). However, my matplotlib chart only shows dates from January 2019 to January 2021 on the x-axis (with the last February tick missing) even though the data is charted to today's date on the actual plot.
Is there any way to ensure that the first and last month is always reflected on the x-axis chart? In other words, I would like the x-axis to have the range displayed (inclusive).
Picture of x axis (missing February 2021)
The data charted here is from January 2019 to TODAY (February 12th).
Here is my code for the date format:
fig.autofmt_xdate()
date_format = mdates.DateFormatter("%b-%y")
ax.xaxis.set_major_formatter(date_format)
EDIT: The numbers after each month represent years.

I am not aware of any way to do this other than by creating the ticks from scratch.
In the following example, a list of all first-DatetimeIndex-timestamp-of-the-month is created from the DatetimeIndex of a pandas dataframe, starting from the month of the first date (25th of Jan.) up to the start of the last ongoing month. An appropriate number of ticks is automatically selected by the step variable and the last month is appended and then removed with np.unique when it is a duplicate. The labels are formatted from the tick timestamps.
This solution works for any frequency smaller than yearly:
import numpy as np # v 1.19.2
import pandas as pd # v 1.1.3
import matplotlib.pyplot as plt # v 3.3.2
# Create sample dataset
start_date = '2019-01-25'
end_date = '2021-02-12'
rng = np.random.default_rng(seed=123) # random number generator
dti = pd.date_range(start_date, end_date, freq='D')
variable = 100 + rng.normal(size=dti.size).cumsum()
df = pd.DataFrame(dict(variable=variable), index=dti)
# Create matplotlib plot
fig, ax = plt.subplots(figsize=(10, 2))
ax.plot(df.index, df.variable)
# Create list of monthly ticks made of timestamp objects
monthly_ticks = [timestamp for idx, timestamp in enumerate(df.index)
                 if (timestamp.month != df.index[idx-1].month) | (idx == 0)]
# Select appropriate number of ticks and include last month
step = 1
while len(monthly_ticks[::step]) > 10:
    step += 1
ticks = np.unique(np.append(monthly_ticks[::step], monthly_ticks[-1]))
# Create tick labels from tick timestamps
labels = [timestamp.strftime('%b\n%Y') if timestamp.year != ticks[idx-1].year
          else timestamp.strftime('%b') for idx, timestamp in enumerate(ticks)]
plt.xticks(ticks, labels, rotation=0, ha='center');
As you can see, the first and last months are located at an irregular distance from the neighboring tick.
If you are plotting a time series with a discontinuous date range (e.g. weekends and holidays not included) and you are not using the DatetimeIndex for the x-axis (for example: ax.plot(range(df.index.size), df.variable)), so as to avoid straight lines bridging the gaps on short time series and/or very wide plots, then replace the last line of code with this:
plt.xticks([df.index.get_loc(tick) for tick in ticks], labels, rotation=0, ha='center');

Matplotlib uses a limited number of ticks, and it just happens that none of them falls on February 2021. There are two things you could try. First, try setting the axis limits so that they extend past today:
ax.set_xlim(start_date, end_date)
What you could also try, is using even more ticks:
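ax.set_xticks(np.linspace(np.min(x), np.max(x), n_ticks))
Here n_ticks is the number of ticks and x holds the x-axis values in numeric form (for dates, for example, the output of matplotlib.dates.date2num). Note that np.linspace takes a count and includes both end points, whereas the third argument of np.arange is a step size and its range excludes the end point.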
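As a rough sketch of the "more ticks" idea, the snippet below places a fixed number of evenly spaced ticks that always include the first and last date. The sample data, the variable names and the choice of 8 ticks are assumptions made for illustration; only date2num, np.linspace, set_xticks and DateFormatter are relied on.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical daily data covering the period from the question
dates = pd.date_range("2019-01-25", "2021-02-12", freq="D")
values = np.linspace(0.0, 1.0, dates.size)

fig, ax = plt.subplots(figsize=(10, 2))
ax.plot(dates, values)

# Convert the end points to Matplotlib's numeric date representation
first = mdates.date2num(dates[0].to_pydatetime())
last = mdates.date2num(dates[-1].to_pydatetime())

# n_ticks evenly spaced positions, both end points included
n_ticks = 8
tick_positions = np.linspace(first, last, n_ticks)
ax.set_xticks(tick_positions)
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b-%y"))
```

The ticks produced this way fall on arbitrary days rather than on month starts, which is the trade-off compared with building the monthly ticks by hand.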

Related

Resampling timestamps with pandas: Why are Mondays counted for a wrong week?

I'm having a hard time trying to take my first steps with pandas.
I would like to create a bar diagram showing how often something has happened a week.
I want to identify the week by the first day of the week, which is a Monday in my case.
Also, I want to ensure that the last week displayed is always the current week, even if nothing has happened this week.
import datetime
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.ticker import MultipleLocator
# create some example data frame
timestamps = [
    "2021-01-08 11:21:14",
    "2021-02-15 08:04:46",
    "2021-02-18 16:49:39",
    "2021-02-24 11:59:39",
    "2021-03-03 08:29:39",
]
df = pd.DataFrame(dict(timestamp=timestamps))
df.timestamp = df.timestamp.astype('datetime64')
# ensure that this week is contained
df = df.append(dict(timestamp=datetime.datetime.now()), ignore_index=True)
# process data to histogram
# TODO: Mondays are counted as the week before
histogram_df = df.resample('W-MON', label='left', on='timestamp').count()
# remove fake entry which I added to ensure that the current week appears
histogram_df['timestamp'][-1] -= 1
# plot the data
ax = histogram_df.plot(y='timestamp', legend=False, kind='bar', rot=0, title='number')
ax.set_xlabel('')
ax.set_xticklabels(map(lambda t: t.strftime('KW %V\n%d.%m.%Y'), histogram_df.index))
ax.yaxis.set_major_locator(MultipleLocator(1))
plt.show()
This does almost what I want but Mondays (see February 15th) are counted for the week before. Why is that? How can I get Mondays to be counted for the week they are in?
The documentation of resample does not say what str values its first argument, called rule, accepts or what they mean.
'W-MON' is mentioned here but without much explanation.
My initial understanding was 'W-MON' means "weekly with weeks starting on Mondays" and label='left' means "take the first day of the week instead of the last day". But that has proven to be wrong. So what do 'W-MON' and label='left' really mean?
Bonus question: Appending a row to ensure that the current week appears in the diagram and then decrementing the last count is not exactly safe in case a value from the future were to appear in the data. Is there a better way to do this?
Try using also closed='left' in the df.resample() call, like below:
histogram_df = df.resample('W-MON', label='left', closed='left', on='timestamp').count()
From the documentation of the closed parameter (extracted below), the default is right for frequency 'W':
closed : {‘right’, ‘left’}, default None
Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
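To see the effect of closed in isolation, here is a minimal sketch with only two events from the question's data: one on Monday 2021-02-15 and one on the following Thursday. The variable names are made up for illustration, and size() is used instead of count() simply to get one number per bin.

```python
import pandas as pd

# Two events from the same calendar week; the first one is on a Monday
events = pd.DataFrame({"timestamp": pd.to_datetime([
    "2021-02-15 08:04:46",  # Monday
    "2021-02-18 16:49:39",  # Thursday
])})

# Default for weekly rules: closed='right', so a Monday ends the previous bin
default_bins = events.resample("W-MON", label="left", on="timestamp").size()

# closed='left': a Monday starts its own bin
left_closed = events.resample(
    "W-MON", label="left", closed="left", on="timestamp"
).size()

print(default_bins)
print(left_closed)
```

With the default, the Monday event should land in the bin labelled 2021-02-08; with closed='left' both events should end up in the bin labelled 2021-02-15.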
@SeaBean has correctly pointed out that I need to add closed='left' to achieve what I wanted, but without really explaining what is going on or what this does.
I think I am starting to understand what's going on here so let me give it a try at explaining it.
We have a timeline of events:
With df.resample(...).count() I am splitting the timeline into several intervals and count the number of events in each interval.
Several questions arise on how to do the resampling and their answers lead to the arguments we need to pass to the function call.
The first question is: How big are the time intervals and where do they start/end?
rule='W-MON' means "weekly, on every Monday".
The second question is: How do I label these time intervals?
label='left' means "label them by the left border of the interval", in this case the start of the week.
The default (for weekly intervals) label='right' would mean "label the intervals by their right border, i.e. the next Monday".
This makes sense if you don't explicitly specify a weekday because rule='W' is equivalent to rule='W-SUN'.
So if you use rule='W' without other arguments that means "label the intervals by the end of the week (Sunday)".
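The equivalence of 'W' and 'W-SUN' can be checked quickly with date_range, which accepts the same frequency strings as resample. This is only a small sketch; the start date and number of periods are arbitrary.

```python
import pandas as pd

# Weekly anchors generated with the plain alias and with the explicit one
plain = pd.date_range("2021-02-01", periods=3, freq="W")
explicit = pd.date_range("2021-02-01", periods=3, freq="W-SUN")

# Both should contain the same Sundays
print(plain.day_name().tolist())
print(plain.equals(explicit))
```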
The third question is: Which interval does an event belong to which is on the border between two intervals?
The time of the day does not seem to matter for weekly intervals, so let me "normalize" the time stamps first (i.e. set the time to 00:00) to make it clearer that an event is on the border between two intervals:
The answer to this question is the closed parameter, and I am getting the feeling that it makes sense to always pass it the same value as label.
closed='left' means "the left border of the interval belongs to the interval and the right border of the interval belongs to the next interval".
I have tried to visualize that with parentheses and square brackets:
For more about open and closed intervals see Wikipedia.
I have created the graphics with the following code:
#!/usr/bin/env python3
import datetime
import pandas as pd
from matplotlib import pyplot as plt
FIGURE_TIMELINE = 1
FIGURE_TIMELINE_WEEKS = 2
FIGURE_TIMELINE_NORMALIZED = 3
FIGURE_TIMELINE_INTERVALS = 4
FIGURE = FIGURE_TIMELINE_INTERVALS
# create some example data frame
timestamps = [
    "2021-01-08 11:21:14",
    "2021-02-15 08:04:46",
    "2021-02-18 16:49:39",
    "2021-02-24 11:59:39",
    "2021-03-03 08:29:39",
]
df = pd.DataFrame(dict(timestamp=timestamps))
df.timestamp = df.timestamp.astype('datetime64')
if FIGURE >= FIGURE_TIMELINE_NORMALIZED:
    df.timestamp = df.timestamp.dt.normalize()
# draw time line
x0 = df.timestamp.min().normalize() - pd.offsets.Week(weekday=0)
x1 = datetime.datetime.now()
df.insert(0, 'zero', 0)
ax = df.plot(x='timestamp', y='zero', style='D', markersize=10, xlabel='', legend=False, xlim=(x0, x1), rot=0)
ax.spines['left'].set_position('zero')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.spines['bottom'].set_color('none')
if FIGURE == FIGURE_TIMELINE:
    ax.grid(which='major', axis='y')
    ax.set_yticks([0])
    ax.set_xticks([])
else:
    ax.grid(which='major', axis='both')
    ax.set_yticks([0])
    xticks = pd.date_range(x0, x1, freq='W-MON')
    ax.set_xticks(xticks)
    label_fmt = '%A\n%d.%m.%Y'
    if FIGURE >= FIGURE_TIMELINE_INTERVALS:
        label_fmt += '\n\n)['
    ax.set_xticklabels(xticks.strftime(label_fmt))
    for label in ax.xaxis.get_majorticklabels():
        label.set_horizontalalignment('center')
    week_labels = pd.date_range(x0+pd.offsets.Hour(12), x1, freq='W-THU')
    ax.set_xticks(week_labels, minor=True)
    ax.set_xticklabels(week_labels.strftime('CW%V'), minor=True)
    ax.tick_params(axis='x', which='minor', length=0, pad=-20)
border = .03
border_bottom = .3 if FIGURE >= FIGURE_TIMELINE_INTERVALS else .2 if FIGURE >= FIGURE_TIMELINE_WEEKS else border
plt.subplots_adjust(left=border, right=1-border, top=1-border, bottom=border_bottom)
plt.show()

Plotting pd.Series object does not show year correctly

I am graphing the results of the measurements of a humidity sensor over time.
I'm using Python 3.7.1 and Pandas 0.24.2.
I have a list called dateTimeList with date and time strings:
dateTimeList = ['15.3.2019 11:44:27', '15.3.2019 12:44:33', '15.3.2019 13:44:39']
I wrote this code where index is a DatetimeIndex object and humList is a list of floats.
index = pd.to_datetime(dateTimeList, format='%d.%m.%Y %H:%M:%S')
ts = pd.Series(humList, index)
plt.figure(figsize=(12.80, 7.20))
ts.plot(title='Gráfico de Humedad en el Tiempo', style='g', marker='o')
plt.xlabel('Tiempo [días]')
plt.ylabel('Humedad [V]')
plt.grid()
plt.savefig('Hum_General'+'.png', bbox_inches='tight')
plt.show()
I have these two results, one with data from February and the other with data from March.
The problem is that in March instead of leaving the year 2019, sequences of 00 12 00 12 appear on the x axis. I think it is important to note that this only happens on the data of March, since February is ok, and the data of both months have the same structure. Day and Month are shown correctly on both plots.
I also tried with:
index = [ pd.to_datetime(date, format='%d.%m.%Y %H:%M:%S') for date in dateTimeList]
Now index is a list of Timestamp objects. Same results.
Add this immediately after creating the plot
import matplotlib.dates as mdates # this should be on the top of the script
xfmt = mdates.DateFormatter('%Y-%m-%d')
ax = plt.gca()
ax.xaxis.set_major_formatter(xfmt)
My guess is that since March has fewer data points, Matplotlib prefers to label dates as month-day-hour instead of year-month-day, so the issue will probably fix itself once you have more data for March. The code I posted should keep a year-month-day format regardless of the number of data points plotted.
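For reference, a self-contained sketch of the same fix is shown below. The humidity values are invented, the format string is extended with hours and minutes as one possible choice, and the series is drawn with ax.plot rather than ts.plot so that Matplotlib's own date units are used; these are assumptions, not part of the original code.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

date_time_list = ['15.3.2019 11:44:27', '15.3.2019 12:44:33', '15.3.2019 13:44:39']
hum_list = [1.20, 1.25, 1.22]  # hypothetical sensor readings

index = pd.to_datetime(date_time_list, format='%d.%m.%Y %H:%M:%S')
ts = pd.Series(hum_list, index=index)

fig, ax = plt.subplots(figsize=(12.80, 7.20))
ax.plot(ts.index, ts.values, 'g', marker='o')

# Force an explicit label format so the year is always shown
formatter = mdates.DateFormatter('%Y-%m-%d %H:%M')
ax.xaxis.set_major_formatter(formatter)
fig.autofmt_xdate()  # rotate the longer labels so they do not overlap
```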

Remove Saturdays (but not Sundays or other dataless periods) from Timeserie plot

I am plotting a financial time series (see below, here 1 month's worth of data)
I would like to remove the periods I show with red cross etc., which are Saturdays. Note that those periods are not all the time periods without data but only the Saturdays.
I know there are some example of how to remove the gaps , for instance: http://matplotlib.org/examples/api/date_index_formatter.html.
This is not what I am after since they remove all the gaps. (NOT MY INTENT!).
I was thinking that the way to go might be to create a custom sequence of values for the x axis. Since the days are ordinals (i.e. 1 day = a value of 1), it might be possible to create a sequence such as 1,2,3,4,5,6,8,9,10,11,12,13,15,16, etc., skipping 1 every seven days, with the skipped day being a Saturday of course.
I can imagine how to skip every Saturday using rrule from dateutil. It is done here (see below), where every Monday is marked with a stronger vertical line. But how would I go about passing it to the tick locator? There is in fact an RRuleLocator class in the matplotlib API, but the doc gives no indication of how to use it: http://matplotlib.org/api/dates_api.html#matplotlib.dates.RRuleLocator.
Every suggestion welcome.
Here the code that I use for the current chart:
fig, axes = plt.subplots(2, figsize=(20, 6))
quotes = price_data.as_matrix() # as matrix() to remove the columns header of the df
mpf.candlestick_ohlc(axes[0], quotes, width=0.01)
plt.bar(quotes[:,0] , quotes[:,5], width = 0.01)
for i, axes[i] in enumerate(axes):
    axes[i].xaxis.set_major_locator(mdates.DayLocator(interval=1))
    axes[i].xaxis.set_major_formatter(mdates.DateFormatter('%a, %b %d'))
    axes[i].grid(True)
    # show night times with a grey shade
    majors = axes[i].xaxis.get_majorticklocs()
    chart_start, chart_end = (axes[i].xaxis.get_view_interval()[0],
                              axes[i].xaxis.get_view_interval()[1])
    for major in majors:
        # 0.33 corresponds to 1/3 of a day i.e. 8h
        axes[i].axvspan(max(chart_start, major-(0.3333)),
                        min(chart_end, major+(0.3333)), color="0.95", zorder=-1)
    # show mondays with a line
    mondays = list(rrule(WEEKLY, byweekday=MO, dtstart=mdates.num2date(chart_start),
                         until=mdates.num2date(chart_end)))
    for j, monday in enumerate(mondays):
        axes[i].axvline(mdates.date2num(mondays[j]), linewidth=0.75, color='k', zorder=1)
If your dates are datetime objects, or a DatetimeIndex in a pandas DataFrame, you can check which weekday a given date falls on using .weekday. Then you just set the data on Saturdays to nan. See the example below.
Code
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import random
import datetime
import numpy as np
# generate some data with a datetime index
x = 400
data = pd.DataFrame(
    [random.random() for i in range(x)],
    index=[datetime.datetime(2018, 1, 1, 0)
           + datetime.timedelta(hours=i) for i in range(x)])
# Set all data on a Saturday (5) to nan, so it doesn't show in the graph
data[data.index.weekday == 5] = np.nan
# Plot the data
fig, ax = plt.subplots(figsize=(12, 2.5))
ax.plot(data)
# Set a major tick on each weekday
days = mdates.DayLocator()
daysFmt = mdates.DateFormatter('%a')
ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(daysFmt)
Result
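Setting Saturdays to nan blanks the data but leaves the empty Saturday on the axis. If the goal is to close that gap while keeping Sundays, one possible sketch is to drop the Saturday rows and plot against integer positions, mapping positions back to dates in a FuncFormatter. The random data and the names no_saturdays, positions and format_position are hypothetical; this is a variation on the index-formatter idea mentioned in the question, not the method of the answer above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Hypothetical hourly data starting on Monday 2018-01-01
index = pd.date_range("2018-01-01", periods=400, freq="60min")
data = pd.Series(np.random.default_rng(0).random(index.size), index=index)

# Drop Saturdays (weekday 5) only; Sundays and any other gaps are kept
no_saturdays = data[data.index.weekday != 5]
positions = np.arange(no_saturdays.size)

def format_position(pos, _tick_number=None):
    """Translate an integer x position back into a date label."""
    i = int(round(pos))
    if 0 <= i < no_saturdays.size:
        return no_saturdays.index[i].strftime("%a %d")
    return ""

fig, ax = plt.subplots(figsize=(12, 2.5))
ax.plot(positions, no_saturdays.values)
ax.xaxis.set_major_formatter(FuncFormatter(format_position))
```

Because the x values are plain integers, each Friday is followed directly by Sunday on the axis, while periods that are still present in the index keep their width.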

Formatting X axis labels Pandas time series plot

I am trying to plot a DataFrame of multiple time series in pandas. Each time series covers 1 year of daily points (length 365). The figure comes out all right, but I want to suppress the year tick shown on the x axis.
I want to suppress the 1950 label shown in the left corner of the x axis. Can anybody suggest something? My code:
dates = pandas.date_range('1950-01-01', '1950-12-31', freq='D')
data_to_plot12 = pandas.DataFrame(data=data_array,      # values
                                  index=homo_regions)   # 1st column as index
dataframe1 = pandas.DataFrame.transpose(data_to_plot12)
dataframe1.index = dates
ax = dataframe1.plot(lw=1.5, marker='.', markersize=2, title='PRECT time series PI Slb Ocn CNTRL 60 years')
ax.set(xlabel="Months", ylabel="PRECT (mm/day)")
fig_name = 'dataframe1.pdf'
plt.savefig(fig_name)
You should be able to specify the xaxis major formatter like so
import matplotlib.dates as mdates
...
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
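A complete sketch of this approach might look as follows. The sine curve stands in for the real data, and the lines are drawn with ax.plot instead of DataFrame.plot so that the axis uses Matplotlib's own date units; both are assumptions for the sake of a runnable example.

```python
import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dates = pd.date_range('1950-01-01', '1950-12-31', freq='D')
values = np.sin(np.linspace(0, 2 * np.pi, dates.size))  # placeholder data

fig, ax = plt.subplots()
ax.plot(dates, values, lw=1.5)

# One tick per month, labelled with the abbreviated month name only
month_formatter = mdates.DateFormatter('%b')
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(month_formatter)
ax.set(xlabel='Months', ylabel='PRECT (mm/day)')
```

When plotting through DataFrame.plot instead, passing x_compat=True is one way to keep Matplotlib's date units so that formatters from matplotlib.dates behave as expected.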

Pandas Tick Data Averaging By Hour and Plotting For Each Week Of History

I have been following the answer here:
Pandas: how to plot yearly data on top of each other
Which takes a time series and plots the last data point for each day on a new plot. Each line on the plot represents a week's worth of data (so for example 5 data points per week):
I used the following code to do this:
#Chart by last price
daily = ts.groupby(lambda x: x.isocalendar()[1:]).agg(lambda s: s[-1])
daily.index = pd.MultiIndex.from_tuples(daily.index, names=['W', 'D'])
dofw = "Mon Tue Wed Thu Fri Sat Sun".split()
grid = daily.unstack('D').rename(columns=lambda x: dofw[x-1])
grid[-5:].T.plot()
What I would like to do is, instead of aggregating by the last data point in a day, aggregate by hour (so averaging the data for each hour) and chart the hourly data for each week. The chart will look similar to the one in the linked image, only it will have 24 data points per day per line and not just one data point per day per line.
Is there any way that I can paste the Pandas DataFrame into this post? When I click copy paste it pastes as a list
EDIT:
Final code, taking into account incomplete data for the latest week for charting purposes:
# First we read the DataFrame and resample it to get a mean on every hour
df = pd.read_csv(r"MYFILE.csv", header=None,
                 parse_dates=[0], index_col=0).resample('H', how='mean').dropna()
# Then we add a week field so we can filter it by the week
df['week']= df.index.map(lambda x: x.isocalendar()[1])
start_range = list(set(df['week']))[-3]
end_range = list(set(df['week']))[-1]
# Create week labels
weekdays = 'Mon Tue Wed Thu Fri Sat Sun'.split()
# Create the figure
fig, ax = plt.subplots()
# For every week we want to plot
for week in range(start_range, end_range+1):
    # Select out the week
    dfw = df[df['week'] == week].copy()
    # Here we align all the weeks to span over the same time period so they
    # can be shown on the graph one over the other, and not one next to
    # the other.
    dfw['timestamp'] = dfw.index.values - (week * np.timedelta64(1, 'W'))
    dfw = dfw.set_index(['timestamp'])
    # Then we plot our data
    ax.plot(dfw.index, dfw[1], label='week %s' % week)
    # Now to set the x labels. First we resample the timestamp to have
    # a date frequency, and set it to be the xtick values
    if week == end_range:
        resampled = resampled.index + pd.DateOffset(weeks=1)
    else:
        resampled = dfw.resample('D')
    # newresampled = resampled.index + pd.DateOffset(weeks=1)
    ax.set_xticks(resampled.index.values)
    # But change the xtick labels to be the weekdays.
    ax.set_xticklabels(weekdays)
# Plot the legend
plt.legend()
The solution is explained in the code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# First we read the DataFrame and resample it to get a mean on every hour
# (the how='mean' keyword of older pandas is written as .mean() in current versions)
df = pd.read_csv('trayport.csv', header=None,
                 parse_dates=[0], index_col=0).resample('H').mean().dropna()
# Then we add a week field so we can filter it by the week
df['week']= df.index.map(lambda x: x.isocalendar()[1])
# Create week labels
weekdays = 'Mon Tue Wed Thu Fri Sat Sun'.split()
# Create the figure
fig, ax = plt.subplots()
# For every week we want to plot
for week in range(1, 4):
    # Select out the week
    dfw = df[df['week'] == week].copy()
    # Here we align all the weeks to span over the same time period so they
    # can be shown on the graph one over the other, and not one next to
    # the other.
    dfw['timestamp'] = dfw.index.values - (week * np.timedelta64(1, 'W'))
    dfw = dfw.set_index(['timestamp'])
    # Then we plot our data
    ax.plot(dfw.index, dfw[1], label='week %s' % week)
    # Now to set the x labels. First we resample the timestamp to have
    # a date frequency, and set it to be the xtick values
    # (current pandas needs an explicit aggregation such as .mean() here)
    resampled = dfw.resample('D').mean()
    ax.set_xticks(resampled.index.values)
    # But change the xtick labels to be the weekdays.
    ax.set_xticklabels(weekdays)
# Plot the legend
plt.legend()
The result looks like:
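You can use the resample (DataFrame or Series) method:
df.resample('H').mean()
In pandas versions before 0.18, df.resample('H') used how='mean' by default (i.e. it averaged the results by hour); in later versions the aggregation is called explicitly, as above.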
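A tiny worked example of hourly averaging is sketched below with made-up tick values. The rule is spelled "60min", which is equivalent to an hourly rule and avoids the differences between pandas versions in how the hourly alias is written.

```python
import pandas as pd

# Four hypothetical ticks: two in the 09:00 hour, two in the 10:00 hour
ticks = pd.Series(
    [10.0, 12.0, 20.0, 30.0],
    index=pd.to_datetime([
        "2021-03-01 09:05",
        "2021-03-01 09:45",
        "2021-03-01 10:10",
        "2021-03-01 10:50",
    ]),
)

# Average all ticks that fall within the same hour
hourly = ticks.resample("60min").mean()
print(hourly)
```

Each output row is labelled with the start of its hour, so the two rows here are 09:00 and 10:00.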
