I have data with regular gaps (hourly data from 6am – 8pm) for several years. I created a future dataframe with the same pattern, so the predictions are only made for 6am – 8pm for the future.
Now I want to cross-validate my predictions. How would I “instruct” the cross-validation to only predict for 6am – 8pm? (I would like to get hourly forecasts for several weeks.)
If I set all values from 8pm – 6am to 0, I can use this:
df_cv = cross_validation(m, initial = '43722 hours', period = '1 hours', horizon = '1 hours')
However, this takes very long to execute, as I’m (unnecessarily) predicting for hours I know will be 0.
Any suggestions?
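One way to avoid zero-filling is to pass explicit cutoffs so cross-validation only produces forecasts at hours you actually observe. This is a minimal sketch with hypothetical dates; with horizon='1 hours', a cutoff at hour H yields a forecast for hour H+1, so the cutoffs run from 5am to 7pm:

```python
import pandas as pd

# Hypothetical one-week range; replace with the span of your history.
all_hours = pd.date_range("2020-03-01", "2020-03-07 23:00", freq="h")
# Keep cutoffs whose next hour falls inside the observed 6am-8pm window.
cutoffs = all_hours[(all_hours.hour >= 5) & (all_hours.hour <= 19)]
# df_cv = cross_validation(m, cutoffs=list(cutoffs), horizon='1 hours')
```

This skips the overnight hours entirely instead of wasting time predicting values you know are 0.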
I am working with the prophet library for educational purposes on a classic dataset:
the air passenger dataset available on Kaggle.
The data are at monthly frequency, which Prophet's cross-validation cannot handle as a standard frequency, based on that discussion.
For the time-series cross-validation I used the prophet function cross_validation(), passing the arguments at weekly frequency.
But when I call the function performance_metrics, it returns the horizon column at daily frequency.
How can I get it at weekly frequency?
I also tried to read the documentation and the function description:
Metrics are calculated over a rolling window of cross validation
predictions, after sorting by horizon. Averaging is first done within each
value of horizon, and then across horizons as needed to reach the window
size. The size of that window (number of simulated forecast points) is
determined by the rolling_window argument, which specifies a proportion of
simulated forecast points to include in each window. rolling_window=0 will
compute it separately for each horizon. The default of rolling_window=0.1
will use 10% of the rows in df in each window. rolling_window=1 will
compute the metric across all simulated forecast points. The results are
set to the right edge of the window.
Here is how I modelled the dataset:
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics

model = Prophet()
model.fit(df)
future_dates = model.make_future_dataframe(periods=36, freq='MS')
df_cv = cross_validation(model,
                         initial='300 W',
                         period='5 W',
                         horizon='52 W')
df_cv.head()
And then I call performance_metrics:
df_p = performance_metrics(df_cv)
df_p.head()
This is the output that I get with a daily frequency.
I am probably missing something or I made a mistake in the code.
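If the goal is simply to report the horizon in weeks, one workaround is to convert the Timedelta column after the fact. A minimal sketch, where df_p stands in for the real performance_metrics output:

```python
import pandas as pd

# performance_metrics returns `horizon` as a Timedelta column, so it can
# be rescaled to weeks after the fact. df_p here is a small stand-in.
df_p = pd.DataFrame({
    "horizon": pd.to_timedelta(["7 days", "14 days", "35 days"]),
    "rmse": [1.2, 1.5, 1.9],
})
df_p["horizon_weeks"] = df_p["horizon"] / pd.Timedelta(weeks=1)
```

The daily display is just how pandas renders Timedeltas; dividing by a one-week Timedelta gives a plain float column in weeks.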
This is a question concerning forecasting using Facebook Prophet.
I have a 10-year dataset with daily data that are integers ranging from 0 to 100+.
On certain days, y = 0.
I am looking to produce monthly forecasts for the next 3 years. Should I:
a) Aggregate the data by months before running Prophet? (120 data points); or
b) Run Prophet using daily data (3652 data points)?
Thanks.
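For option (a), a minimal aggregation sketch with pandas; the ds/y layout and the choice of sum() are assumptions, and mean() may fit better depending on what y measures:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series in Prophet's ds/y layout, aggregated to
# month-start timestamps ('MS') before fitting.
df = pd.DataFrame({
    "ds": pd.date_range("2013-01-01", periods=90, freq="D"),
    "y": np.arange(90.0),
})
monthly = (
    df.set_index("ds")["y"]
      .resample("MS").sum()   # or .mean(), depending on what y measures
      .reset_index()
)
```

Using 'MS' (month start) keeps the timestamps compatible with make_future_dataframe(freq='MS').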
I have around 23,300 hourly datapoints in my dataset, and I am trying to forecast using Facebook Prophet.
To fine-tune the hyperparameters one can use cross validation:
from fbprophet.diagnostics import cross_validation
The whole procedure is shown here:
https://facebook.github.io/prophet/docs/diagnostics.html
Using cross_validation one needs to specify initial, period and horizon:
df_cv = cross_validation(m, initial='xxx', period='xxx', horizon = 'xxx')
I am now wondering how to configure these three values in my case. As stated, I have about 23,300 hourly datapoints. Should I take a fraction of that as the horizon, or is it not that important to use exact fractions of the data, so that I can take whatever value seems appropriate?
Furthermore, cutoffs can also be defined, as below:
cutoffs = pd.to_datetime(['2013-02-15', '2013-08-15', '2014-02-15'])
df_cv2 = cross_validation(m, cutoffs=cutoffs, horizon='365 days')
Should these cutoffs be equally distributed as above, or can we set the cutoffs individually, however we like?
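For what it's worth, evenly spaced cutoffs like the ones above can also be generated programmatically rather than listed by hand; the start date and spacing here are illustrative, not taken from any particular dataset:

```python
import pandas as pd

# Cutoffs on a regular ~6-month grid (26 weeks apart).
cutoffs = pd.date_range(start="2013-02-15", periods=3, freq="26W")
# df_cv2 = cross_validation(m, cutoffs=list(cutoffs), horizon='365 days')
```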
initial is the first training period. It is the minimum
amount of data needed to begin training on.
horizon is the length of time you want to evaluate your forecast
over. Let's say that a retail outlet is building their model so
that they can predict sales over the next month. A horizon set to 30
days would make sense here, so that they are evaluating their model
on the same parameter setting that they wish to use it on.
period is the amount of time between each fold. It can be either
greater than the horizon or less than it, or even equal to it.
cutoffs are the dates where each horizon will begin.
You can understand these terms by looking at this image (credits: Forecasting Time Series Data with Facebook Prophet by Greg Rafferty).
Let's imagine that a retail outlet wants a model that is able to predict the next month
of daily sales, and they plan on running the model at the beginning of each quarter. They
have 3 years of data.
They would set their initial training data to be 2 years, then. They want to predict the
next month of sales, and so would set horizon to 30 days. They plan to run the model
each business quarter, and so would set the period to be 90 days.
This is also shown in the image above.
Let's apply these parameters into our model:
df_cv = cross_validation(model,
                         horizon='30 days',
                         period='90 days',
                         initial='730 days')
I am working on time-series classification problem using CNN. The dataset used is financial stock market data (like Yahoo Finance). I am using some technical indicators calculated using raw values high,low,volume,open,close.
One of the technical indicators is MACD (Moving Average Convergence Divergence), computed with the TA library. However, in most places it is written that MACD is calculated with n_fast = 12 and n_slow = 26 periods and n_sign = 9 (a parameter of macd_diff() in the ta library), while RSI (Relative Strength Index) is calculated over 14 days.
So, if I am calculating RSI over a 5-day period, how should I set the n_fast and n_slow values to match? Should these be n_fast = 3 and n_slow = 8? Also, what should the value of n_sign be then? I am new to the finance domain.
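It may help to see what the three parameters actually do. Below is MACD rebuilt from scratch with pandas exponential moving averages, which matches the standard definition the ta library implements; the 3/8/5 values are illustrative guesses for a shorter lookback, not a recommendation, and the price series is a stand-in:

```python
import numpy as np
import pandas as pd

# MACD = EMA(n_fast) - EMA(n_slow); the signal line is an EMA(n_sign)
# of the MACD line, and macd_diff is their difference (the histogram).
close = pd.Series(np.linspace(100.0, 110.0, 60))  # stand-in price series
n_fast, n_slow, n_sign = 3, 8, 5

ema_fast = close.ewm(span=n_fast, adjust=False).mean()
ema_slow = close.ewm(span=n_slow, adjust=False).mean()
macd = ema_fast - ema_slow                             # MACD line
signal = macd.ewm(span=n_sign, adjust=False).mean()    # signal line
macd_diff = macd - signal                              # histogram (ta's macd_diff)
```

Since n_fast and n_slow are independent lookbacks of the MACD itself, nothing forces them to track the RSI window; shortening them only makes the indicator react faster (and noisier).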
For a machine learning problem, I want to derive the hourly PV power of a specific system given various weather parameters, including hourly GHI and DHI, but no DNI. If I were to use one of the pvlib DNI estimation models, I would always need the zenith angle. Since I only have hourly values for irradiance, I cannot be very specific regarding the angle. Would you take an hourly average? There is always the problem that angles close to 90° result in extremely high DNI values.
So far I have tried to manually calculate hourly DNI = (GHI - DHI)/cos(zenith), taking the mean of 5-minute-resolution zenith angles as the hourly zenith. Sunrise at the location is almost always before 7 am, so I should get some very small PV power in hour 6 of the day. However, because the average angle I take is almost always over 90°, I get 0 kW AC power, or, for the few days when the mean angle is just below 90°, I get 40 kW AC power, which is the system's maximum as limited by the inverters, and in these early hours that is even more unrealistic.
ModelChain Parameters:
pvsys_ref=pvsyst
loc_ref=loc
orient_strat_ref=None
sky_mod_ref='ineichen'
transp_mod_ref='haydavies'
sol_pos_mod_ref='nrel_numpy'
airm_mod_ref='kastenyoung1989'
dc_mod_ref='cec'
ac_mod_ref=None
aoi_mod_ref='physical'
spec_mod_ref='no_loss'
temp_mod_ref='sapm'
loss_mod_ref='no_loss'
The required weather DataFrame consists of the hourly simulated ghi, dhi, temp and wind speed, as well as the manually calculated dni.
Usually the midpoint of the hour is used to calculate the sun position/sun zenith, and for the sunset and sunrise hours, the midpoint of the period when the sun is above the horizon.
To calculate DNI from GHI and DHI, try using the function dni in pvlib.irradiance:
https://pvlib-python.readthedocs.io/en/latest/generated/pvlib.irradiance.dni.html#pvlib.irradiance.dni
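The midpoint convention above can be sketched in plain numpy; all numbers are illustrative, and the 88° cutoff matches pvlib.irradiance.dni's default zenith threshold for forcing DNI to zero near the horizon:

```python
import numpy as np

# DNI from hourly GHI/DHI using the zenith at the middle of the hour,
# with near-horizon zeniths zeroed out to avoid the 1/cos blow-up.
ghi = np.array([5.0, 120.0, 450.0])        # W/m^2, illustrative
dhi = np.array([4.0, 80.0, 150.0])         # W/m^2, illustrative
zenith_mid = np.array([89.0, 75.0, 40.0])  # degrees at mid-hour

dni = (ghi - dhi) / np.cos(np.radians(zenith_mid))
dni[zenith_mid > 88.0] = 0.0               # clip unphysical sunrise/sunset values
```

This clipping is why the library function is preferable to the bare formula: it avoids exactly the spurious 40 kW spikes described in the question.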