Produce daily forecasts from monthly averages using Python Pandas

I have daily data going back years. I first want to compute the monthly average of these values, and then project that monthly average forward as a forecast for the next few years. I have written the code below to do this.
For example, my forecast for the next few Januarys will be the average of the last few Januarys, and the same for Feb, Mar, etc. Over the past few years my January number is 51.8111, so I want every day in every January of my forecast period to be 51.8111 (i.e. moving from monthly to daily granularity).
My question: my code seems a bit long-winded, and the loop could potentially be slow. For my own learning, what is a better way of taking daily data, averaging it by a time period, and then projecting that average forward? I looked at the map and apply functions in Pandas, but couldn't quite work it out.
import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(0)
# create random dataframe of daily values
# size follows the length of the index, so the example runs on any day
idx = pd.date_range(start='2014-01-01', end=dt.date.today()-dt.timedelta(days=1), freq='D')
df = pd.DataFrame(np.random.randint(low=0, high=100, size=len(idx)),
                  columns=['value'],
                  index=idx)
# gain average by month over entire date range
df_by_month = df.groupby(df.index.month).mean()
# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = 0.0
# project forward the monthly average to each day
for val in df_forecast.index:
    # a single .loc call avoids assigning to a temporary copy
    df_forecast.loc[val, 'value'] = df_by_month.loc[val.month, 'value']
# create new dataframe joining together the historical value and forecast
df_complete = pd.concat([df, df_forecast])  # DataFrame.append was removed in pandas 2.0

I think you need Index.map: map the month number of each forecast date onto the 'value' column of df_by_month:
# create new dataframe with date range for forecast
df_forecast = pd.DataFrame(index=pd.date_range(start=dt.date.today(), periods=1095, freq='D'))
df_forecast['value'] = df_forecast.index.month.map(df_by_month['value'])
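For reference, here is a minimal self-contained sketch of the same idea without the loop. The fixed dates and the names hist, monthly_mean, forecast and complete are assumptions made for reproducibility; the question itself uses today's date.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed date range so the result does not depend on the current day.
hist_index = pd.date_range("2014-01-01", "2019-12-31", freq="D")
rng = np.random.default_rng(0)
hist = pd.DataFrame({"value": rng.integers(0, 100, size=len(hist_index))},
                    index=hist_index)

# One average per calendar month (index 1..12).
monthly_mean = hist.groupby(hist.index.month)["value"].mean()

# Daily forecast index; each day looks up the average of its own month.
forecast_index = pd.date_range("2020-01-01", periods=1095, freq="D")
forecast = pd.DataFrame(index=forecast_index)
forecast["value"] = forecast_index.month.map(monthly_mean)

# Historical values followed by the forecast.
complete = pd.concat([hist, forecast])
```

Every forecast day in a given month ends up with that month's historical average, which is the monthly-to-daily step described in the question.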

Related

Creating a matplotlib line graph using datetime objects while ignoring the year value

I have a dataset of the highest and lowest temperatures recorded for each day of the year, for the years 2005-2014. I want to create a graph where I plot the max and min temperatures for each day of the year over this period (so there will be only one max and one min temperature plotted for each day). I was able to create a DataFrame of the absolute minimum and maximum for each day; here is the code for the maximum:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
# splitting 2005-2014 df dates into separate columns for easier analysis
weather_05_14['Year'] = weather_05_14['Date'].dt.strftime('%Y')
weather_05_14['Month'] = weather_05_14['Date'].dt.strftime('%m')
weather_05_14['Day'] = weather_05_14['Date'].dt.strftime('%d')
# extracting the min and max temperatures for each day, regardless of year
max_temps = weather_05_14.loc[weather_05_14.groupby(['Day', 'Month'], sort=False)
                              ['Data_Value'].idxmax()][['Data_Value', 'Date']]
max_temps.rename(columns={'Data_Value': 'Max'}, inplace=True)
This is what the data frame looks like:
Now here's where my issue is. I want to plot this data in a line plot based on month/day, disregarding the year so it's in order. My thought was that I could do this by changing the year to be the same for every data point (as it won't be data that will be in the final graph anyway) and this is what I did to try to accomplish that:
max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005))
but I got this error:
ValueError: day is out of range for month
I have also tried to take my separate Day, Month, Year columns that I used to group by, include those with the max_temps df, change the year, and then move those all to a new column and convert them to a datetime object, but I get a similar error
max_temps['Year'] = 2005
max_temps['New Date'] = pd.to_datetime(max_temps[['Year', 'Month', 'Day']])
Error: ValueError: cannot assemble the datetimes: day is out of range for month
I have also tried to ignore this issue and then plot with the pandas plot function like:
max_temps.plot(x=['Month', 'Day'], y=['Max'])
Which does work but then I don't get the full functionality of matplotlib (as far as I can tell anyway, I'm new to these libraries).
It gives me this graph:
This is close to the result I'm looking for, but I'd like to use matplotlib to do it.
I feel like I'm making the problem harder than it needs to be but I don't know how. If anyone has any advice or suggestions I would greatly appreciate it, thanks!
As @Jody Klymak pointed out, max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005)) isn't working because your full dataset probably includes a leap year, and with it February 29th. When you try to set the year to 2005, pandas tries to create the date 2005-02-29, which throws ValueError: day is out of range for month. You can fix this by choosing a leap year such as 2004 instead of 2005.
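A small sketch of that failure and the fix, using three made-up dates around a leap day (the dates and the variable names are only for illustration):

```python
import pandas as pd

# Hypothetical dates around the leap day of 2008.
dates = pd.Series(pd.to_datetime(["2008-02-28", "2008-02-29", "2008-03-01"]))

# 2005 has no 29 February, so the replacement fails on the leap day.
try:
    dates.apply(lambda d: d.replace(year=2005))
    failed = False
except ValueError:
    failed = True

# 2004 is a leap year, so every month/day combination remains valid.
same_year = dates.apply(lambda d: d.replace(year=2004))
```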
My solution would be to disregard the year entirely, and create a new column that holds the month and day in the format "01-01". Since the month comes first, sorting these strings puts them in calendar order regardless of the year.
Here's an example:
import pandas as pd
import matplotlib.pyplot as plt
max_temps = pd.DataFrame({
'Max': [15.6,13.9,13.3,10.6,12.8,18.9,21.7],
'Date': ['2005-01-01','2005-01-02','2005-01-03','2007-01-04','2007-01-05','2008-01-06','2008-01-07']
})
max_temps['Date'] = pd.to_datetime(max_temps['Date'])
## use string formatting to create a new column with Month-Day
max_temps['Month_Day'] = max_temps['Date'].dt.strftime('%m-%d')
plt.plot(max_temps['Month_Day'], max_temps['Max'])
plt.show()
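One caveat: matplotlib draws string x-values in the order they appear, so rows that are not already in calendar order should be sorted by the new column before plotting. A minimal sketch with made-up, deliberately shuffled rows:

```python
import pandas as pd

# Hypothetical rows from different years, deliberately out of calendar order.
max_temps = pd.DataFrame({
    "Max": [21.7, 15.6, 18.9, 13.9],
    "Date": pd.to_datetime(["2008-03-07", "2005-01-01", "2007-02-06", "2012-01-02"]),
})

# Zero-padded month-day strings sort in calendar order.
max_temps["Month_Day"] = max_temps["Date"].dt.strftime("%m-%d")
ordered = max_temps.sort_values("Month_Day").reset_index(drop=True)
```

Passing ordered['Month_Day'] and ordered['Max'] to plt.plot then draws the line from January to December.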

How to create an exogenous day of work week index variable from decomposition of daily time-series data?

I want to extract a day-of-workweek index that repeats itself every 5 (business) days. A monthly index would give 12 index values that repeat every 12 periods; instead of months, I want to create the same kind of index for the workweek. I have time-series data with business-day frequency only (Monday-Friday), so annually I use 252 as the freq. If I decompose this daily series, it creates a daily index by year, with one index value for every business day of the year. In Python/Statsmodels, is there a way to create an index from this time series that gives the relative index value every 5 days?
I've tried various decompositions, first changing the resampling frequency before decomposing the time series, but I don't have a clue how to do this. The pseudocode below represents the workweek data I have for 2 years.
import numpy as np
from scipy import stats
import pandas as pd
import statsmodels.tsa.api as tsa
import statsmodels.api as sm
vals = np.random.rand(504)
ts = pd.Series(vals)
df = pd.DataFrame(ts, columns=["Stock"])
df.index = pd.Index(pd.date_range("2012/01/01", periods = len(vals), freq = 'D'))
comp_Stock = tsa.seasonal_decompose(df, model='additive', period=252)  # 'freq' is called 'period' in newer statsmodels
comp_Stock.seasonal[:5]
It's hard to describe the expected results other than to say that the seasonal index above (comp_Stock.seasonal) repeats every 252 values; using this same time-series data, I want an index that repeats every 5 days. The end goal is to extract this index as an exogenous variable for forecasting.
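One possible way to get such a repeating 5-value index, sketched with plain pandas rather than a statsmodels decomposition: take the average deviation of each weekday from the overall mean and repeat it along the series. This is an assumption about what "index" should mean here, and the names weekday_effect and weekday_index are made up. Passing period=5 to seasonal_decompose may give a comparable seasonal component.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Business days only; 2012-01-02 is a Monday.
idx = pd.bdate_range("2012-01-02", periods=504)
stock = pd.Series(rng.random(504), index=idx, name="Stock")

# Average deviation from the overall mean for each weekday (0=Monday .. 4=Friday).
weekday_effect = stock.groupby(stock.index.dayofweek).mean() - stock.mean()

# Repeat the five values along the whole series so it can serve as a regressor.
weekday_index = pd.Series(idx.dayofweek.map(weekday_effect).to_numpy(),
                          index=idx, name="weekday_index")
```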

Python netcdf - Monthly median over the all time period of daily data

I have a NetCDF file input.nc. This file represents nearly 18 years of data sampled every 4 days. From this file, I would like to calculate the monthly median value over the whole time period, so the output file should contain only 12 time steps.
I am using the following code:
import os
import xarray as xr
os.chdir(inbasedir)
data = xr.open_dataset('input.nc')
monthly_data = data.resample(time='MS').median()  # older xarray: data.resample(freq='m', dim='time', how='median')
monthly_data.to_netcdf("test.nc")
Unfortunately, when I look at the output file, my code has computed the median for each month of the whole time series, so I end up with more than 200 values. How can I change my code to calculate the 12 monthly medians over the whole time period?
You want to use the groupby method:
monthly_data = data.groupby('time.month').median()
There are some good examples of how to use xarray with timeseries data here: http://xarray.pydata.org/en/stable/time-series.html
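To see the difference between the two operations without a NetCDF file, here is a pandas analogue (not xarray) on a made-up series sampled every 4 days; the same resample-versus-groupby distinction applies.

```python
import numpy as np
import pandas as pd

# Hypothetical series: one sample every 4 days for 18 years.
time = pd.date_range("2000-01-01", "2017-12-31", freq="4D")
series = pd.Series(np.arange(len(time), dtype=float), index=time)

# resample: one median per calendar month of every year.
per_month_of_each_year = series.resample("MS").median()

# groupby on the month number: one median per month, pooled over all years.
monthly_climatology = series.groupby(series.index.month).median()
```

The resampled result has one entry for each of the 216 calendar months, while the grouped result has exactly 12.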

Dataset statistics with custom begin of the year

I would like to do some annual statistics (cumulative sum) on a daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible, and the time series contains leap years.
I tried e.g. the following:
import numpy as np
import pandas as pd
import xarray as xr
rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly for two reasons:
(1) the rolling doesn't account for leap years, so I get an offset of one day per leap year, and
(2) the beginning of the first year (until the end of June) is appended to the end of the rolled time series, which creates a "fake year" in which the cumulative sums no longer make sense.
I also tried cutting off the ends of the time series first, but then the rolling doesn't work anymore. Resampling did not seem to be an option either, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use an xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray([t.year if ((t.month < 9) or ((t.month==9) and (t.day < 15))) else (t.year + 1) for t in foo.indexes['time']],
                        dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()
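The year labels can also be built without the list comprehension. The sketch below uses plain pandas (not xarray) and the same 15 September cut-off as the example above; the names starts_next_year and cumsum are made up. The resulting my_years values could equally be wrapped in an xr.DataArray.

```python
import numpy as np
import pandas as pd

dr = pd.date_range("2015-01-01", "2017-08-23")
data = pd.Series(np.ones(len(dr)), index=dr)

# Days on or after 15 September belong to the following custom year.
starts_next_year = (dr.month > 9) | ((dr.month == 9) & (dr.day >= 15))
my_years = dr.year + starts_next_year.astype(int)

# Cumulative sum that restarts at the beginning of every custom year.
cumsum = data.groupby(my_years).cumsum()
```

Because the labels come from the calendar date itself, leap years need no special handling: the custom year that contains 29 February 2016 simply has 366 days.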

Matplotlib: plot number of observations per minute across all Mondays in a year

I have a pandas DataFrame where one of the columns is a bunch of dates (datetime type). I am trying to plot the number of observations per minute across all Mondays in a year against the minutes in the day.
For example, suppose I have two Mondays in my data and there are 3 observations at 09:01 on the first Monday and 4 observations at 09:01 on the second Monday. I would want to plot 7 (3+4) against 9*60+1=541 (That is, 09:01 is the 541st minute since the start of the day). Here is how I started:
def minutes_in_day(arg):
    #returns minute number in day
    return arg.hour*60+arg.minute
def get_day(arg):
    return arg.isocalendar()[2]
# df is my pandas dataframe
df['day']=df['my_datetime_variable'].apply(get_day)
df['minute']=df['my_datetime_variable'].apply(minutes_in_day)
group=df.groupby(['day','minute'])
my_data=group['whatever_variable'].count()
my_data has two indices: a day index going from 1(Monday) to 7(Sunday) and a minute index going from potentially 0 to potentially 24*60-1=1439. How could I use matplotlib(pyplot) to plot the observation count against the minute index only when day index is 1?
I think this is more or less what you want:
#import modules
import random as randy
import numpy as np
import pandas as pd
#create sample dataset
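#(random.sample needs a sequence, so the index is converted to a list first)
idx=randy.sample(list(pd.date_range(start='1/1/2015',end='5/5/2015',freq='min')),2000)
idx.sort()
dfm=pd.DataFrame({'data':np.random.randint(0,2,len(idx))},index=idx)
#resample to fill in the gaps and groupby day of the week (0-6) and time
#(newer pandas needs an explicit aggregation after resample)
dfm=dfm.resample('min').mean()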
dfm=dfm.groupby([dfm.index.dayofweek,dfm.index.time]).count()
#Select monday (the '0th' day of the week)
dfm=dfm.loc[0]
#plot
dfm.plot(title="Number of observations on Mondays",figsize=[12,5])
Gives
As you can read in the pandas.DatetimeIndex docs, the dayofweek attribute returns the day of the week with Monday=0 through Sunday=6, and the time attribute returns a numpy array of datetime.time.
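If you would rather keep the day/minute grouping from the question, the Monday rows can be pulled out of the two-level index with xs. A minimal sketch on made-up observations that mirror the 3 + 4 example from the question (the timestamps are hypothetical; the column names are the question's):

```python
import pandas as pd

# Hypothetical observations: 2015-01-05 and 2015-01-12 are Mondays, 2015-01-06 is a Tuesday.
stamps = pd.to_datetime(
    ["2015-01-05 09:01"] * 3
    + ["2015-01-12 09:01"] * 4
    + ["2015-01-12 10:30", "2015-01-06 09:01"]
)
df = pd.DataFrame({"my_datetime_variable": stamps, "whatever_variable": 1})

# 1=Monday .. 7=Sunday, matching isocalendar(); minute counts from midnight.
df["day"] = df["my_datetime_variable"].dt.dayofweek + 1
df["minute"] = df["my_datetime_variable"].dt.hour * 60 + df["my_datetime_variable"].dt.minute
my_data = df.groupby(["day", "minute"])["whatever_variable"].count()

# Keep only Monday; the result is indexed by minute of the day.
monday = my_data.xs(1, level="day")
```

monday.index and monday.values can then be passed straight to plt.plot to draw the counts against the minute of the day.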
