I have around 23,300 hourly datapoints in my dataset, and I am trying to forecast using Facebook Prophet.
To fine-tune the hyperparameters, one can use cross-validation:
from fbprophet.diagnostics import cross_validation
The whole procedure is shown here:
https://facebook.github.io/prophet/docs/diagnostics.html
Using cross_validation one needs to specify initial, period and horizon:
df_cv = cross_validation(m, initial='xxx', period='xxx', horizon = 'xxx')
I am now wondering how to configure these three values in my case. As stated, I have about 23,300 hourly datapoints. Should I take some fraction of that as the horizon, or is it not important that the horizon be an exact fraction of the data, so that I can take whatever value seems appropriate?
Furthermore, cutoffs can also be defined explicitly, as below:
cutoffs = pd.to_datetime(['2013-02-15', '2013-08-15', '2014-02-15'])
df_cv2 = cross_validation(m, cutoffs=cutoffs, horizon='365 days')
Should these cutoffs be evenly spaced, as above, or can they be set at whatever individual dates one likes?
initial is the first training period: the minimum amount of data needed to begin training on.
horizon is the length of time you want to evaluate your forecast over. Let's say a retail outlet is building a model to predict sales over the next month. A horizon of 30 days would make sense there, so that the model is evaluated under the same conditions in which it will actually be used.
period is the amount of time between each fold (cutoff). It can be greater than, less than, or equal to the horizon.
cutoffs are the dates at which each horizon begins.
You can understand these terms by looking at this image (credits: Forecasting Time Series Data with Facebook Prophet by Greg Rafferty).
Let's imagine that a retail outlet wants a model that can predict the next month of daily sales, and they plan on running the model at the beginning of each quarter. They have 3 years of data.
They would therefore set their initial training data to 2 years. They want to predict the next month of sales, so they would set the horizon to 30 days. They plan to run the model each business quarter, so they would set the period to 90 days.
This is also shown in the image above.
Let's apply these parameters to our model:
df_cv = cross_validation(model,
                         horizon='30 days',
                         period='90 days',
                         initial='730 days')
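Applied back to the original question's data (about 23,300 hourly points, i.e. roughly 2.7 years), a purely illustrative configuration might look like the sketch below; the right horizon depends on how far ahead you actually need to forecast, not on any fixed fraction of the data:
# Illustrative values only, assuming ~2.7 years of hourly data:
# train on the first two years, place a cutoff every week,
# and evaluate two-day-ahead forecasts.
df_cv = cross_validation(m,
                         initial='17520 hours',  # 2 years
                         period='168 hours',     # 1 week between cutoffs
                         horizon='48 hours')     # evaluate 2-day forecasts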
I am working with the Prophet library, for educational purposes, on a classic dataset: the air passengers dataset available on Kaggle.
The data are at monthly frequency, which Prophet's cross-validation cannot handle as a standard frequency, based on that discussion.
For the time-series cross-validation I therefore used the Prophet function cross_validation(), passing its arguments at weekly frequency.
But when I call the performance_metrics function, it returns the horizon column at daily frequency.
How can I get it at weekly frequency?
I also tried to read the documentation and the function description:
Metrics are calculated over a rolling window of cross validation
predictions, after sorting by horizon. Averaging is first done within each
value of horizon, and then across horizons as needed to reach the window
size. The size of that window (number of simulated forecast points) is
determined by the rolling_window argument, which specifies a proportion of
simulated forecast points to include in each window. rolling_window=0 will
compute it separately for each horizon. The default of rolling_window=0.1
will use 10% of the rows in df in each window. rolling_window=1 will
compute the metric across all simulated forecast points. The results are
set to the right edge of the window.
Here is how I modelled the dataset:
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation, performance_metrics

model = Prophet()
model.fit(df)
future_dates = model.make_future_dataframe(periods=36, freq='MS')
df_cv = cross_validation(model,
                         initial='300 W',
                         period='5 W',
                         horizon='52 W')
df_cv.head()
And then I call performance_metrics:
df_p = performance_metrics(df_cv)
df_p.head()
This is the output that I get, with the horizon at daily frequency.
I am probably missing something, or I made a mistake in the code.
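One possible workaround, assuming the horizon column really is a pandas Timedelta (which is how performance_metrics returns it): convert it to weeks yourself after the fact.
import pandas as pd

df_p = performance_metrics(df_cv)
# The horizon column holds Timedelta values; dividing by a one-week
# Timedelta expresses each horizon as a float number of weeks.
df_p['horizon_weeks'] = df_p['horizon'] / pd.Timedelta(weeks=1)
df_p.head()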
I have a high-frequency time series (observations every 3 seconds), which I'd like to analyse and eventually use to forecast short-term periods (10/20/30 minutes ahead) with different models. My whole dataset contains 20K observations. My goal is to draw conclusions about how well the different models can forecast the data.
I first tried to plot the whole dataset, but I couldn't identify anything:
Whole dataset
Then I plotted only the first 500 observations, and this is the result:
First 500 observations
I don't know why it looks just like white noise!
After running the ADF test on the whole dataset, I get a p-value of 0.0. This means that my dataset is stationary, right?
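For reference, a minimal sketch of how such an ADF test is typically run with statsmodels (series here is a placeholder name for the observed values):
from statsmodels.tsa.stattools import adfuller

# adfuller returns (test statistic, p-value, lags used, n obs,
# critical values, icbest); a p-value near 0 rejects the unit-root
# null, i.e. the series looks stationary.
adf_stat, p_value, *_ = adfuller(series)
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}')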
I decided to try the ARIMA model first, but from the ACF and PACF plots I can't identify p and q:
ACF
PACF
1- Is the dataset white noise? Is it possible to predict this time series?
2- I tried to downsample the dataset (taking the mean over each 4-minute window; see the sketch after this list), but it's the same thing: I couldn't identify anything. And I think this results in a loss of information, no?
3- How long should the training set be when fitting the ARIMA? Does it make sense to use a short training set for a short-term forecasting period?
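For what it's worth, the 4-minute downsampling from point 2 is typically a one-liner with pandas, assuming the observations sit in a Series with a DatetimeIndex (series is a placeholder name):
# Downsample the 3-second observations to 4-minute means; this smooths
# noise at the cost of the within-window variation mentioned above.
downsampled = series.resample('4min').mean()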
I have data with regular gaps (hourly data from 6am to 8pm) for several years. I created a future dataframe with the same pattern, so that predictions are only made for 6am to 8pm in the future.
Now I want to cross-validate my predictions. How would I “instruct” the cross-validation to only predict for 6am-8pm? (I would like to get hourly forecasts for several weeks.)
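For reference, the gap-preserving future dataframe described above might be built like this (a sketch, assuming m is the fitted Prophet model and the business window is 6am to 8pm):
# Build an hourly future dataframe, then keep only the 6am-8pm rows
# so predictions are only made for hours that exist in the data.
future = m.make_future_dataframe(periods=14 * 24, freq='H')  # e.g. two weeks ahead
future = future[future['ds'].dt.hour.between(6, 20)]
forecast = m.predict(future)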
If I set all values from 8pm to 6am to 0, I can use this:
df_cv = cross_validation(m, initial = '43722 hours', period = '1 hours', horizon = '1 hours')
However, this takes very long to execute, as I'm (unnecessarily) predicting for hours I know will be 0.
Any suggestions?
Thank you for taking a look at this. I have failure data for tires over a 5-year period. For each tire, I have the start date (day 0), the end date (day n), and the number of miles driven each day. I used the total miles each car drove to create two distributions, one Weibull and one ECDF. My hope is to use those distributions to predict the probability that a tire will fail 50 miles into the future at any point during its life. As an example: if it's 2 weeks into the life of a tire, the total mileage is currently 100 miles, and the average is 50 miles per week, I want to predict the probability that it will fail at 150 miles, i.e. within a week.
My thinking is that if I can get these probabilities for all tires active on a given day, I can sum each tire's failure probability to get a prediction of how many tires will need to be replaced over a given future period.
My current methodology is to fit a distribution to 3 years of failure data using scipy's weibull_min and statsmodels' ECDF. Then, if a tire is currently at 100 miles and we expect the next week to add 50 miles, I take the CDF at 150.
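In code, that methodology presumably looks something like the sketch below (names are illustrative; miles_at_failure would hold the total miles at which each failed tire was replaced):
from scipy.stats import weibull_min

# Fit a two-parameter Weibull (location pinned at zero) to the observed
# failure mileages, then evaluate the CDF at 150 miles. Note this is the
# unconditional probability of failing by 150 miles; it does not condition
# on the tire having already survived to 100 miles.
shape, loc, scale = weibull_min.fit(miles_at_failure, floc=0)
p_fail_by_150 = weibull_min.cdf(150, shape, loc=loc, scale=scale)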
However, when I run this across all tires that are on the road on the date I'm predicting from and sum their respective probabilities, I get a prediction that is ~50% higher than the actual number of tire replacements. My first thought is that it's an issue with my methodology. Does it sound valid, or am I doing something dumb?
This might be too late of a reply but perhaps it will help someone in the future reading this.
If you are looking to make predictions, you need to fit a parametric model (like the Weibull Distribution). The ecdf (Empirical CDF / Nonparametric model) will give you an indication of how well the parametric model fits but it will not allow you to make any future predictions.
To fit the parametric model, I recommend you use the Python reliability library.
This library makes it fairly straightforward to fit a parametric model (especially if you have right censored data) and then use the fitted model to make the kind of predictions you are trying to make. Scipy won't handle censored data.
If you have failure data for a population of tires then you will be able to fit a model. The question you asked (about the probability of failure in the next week given that it has survived 2 weeks) is called conditional survival. Essentially you want CS(1|2) which means the probability it will survive 1 more week given that it has survived to week 2. You can find this as the ratio of the survival functions (SF) at week 3 and week 2: CS(1|2) = SF(2+1)/SF(2).
Let's take a look at some code using the Python reliability library. I'll assume we have 10 failure times that we will use to fit our distribution and from that I'll find CS(1|2):
from reliability.Fitters import Fit_Weibull_2P
data = [113, 126, 91, 110, 146, 147, 72, 83, 57, 104] # failure times (in weeks) of some tires from our vehicle fleet
fit = Fit_Weibull_2P(failures=data, show_probability_plot=False)
CS_1_2 = fit.distribution.SF([3])[0] / fit.distribution.SF([2])[0] # conditional survival
CF_1_2 = 1 - CS_1_2 # conditional failure
print('Probability of failure in the next week given it has survived 2 weeks:', CF_1_2)
'''
Results from Fit_Weibull_2P (95% CI):
            Point Estimate  Standard Error   Lower CI    Upper CI
Parameter
Alpha           115.650803        9.168086  99.008075  135.091084
Beta              4.208001        1.059183   2.569346    6.891743
Log-Likelihood: -47.5428956288772
Probability of failure in the next week given it has survived 2 weeks: 1.7337430857633507e-07
'''
Let's now assume you have 250 vehicles in your fleet, each with 4 tires (1000 tires in total). The probability of 1 tire failing is CF_1_2 = 1.7337430857633507e-07
We can find the probability of X tires failing (throughout the fleet of 1000 tires) like this:
from scipy.stats import poisson

X = [0, 1, 2, 3, 4, 5]
print('n failed probability')
for x in X:
    # mu is the expected number of failures across all 1000 tires
    PF = poisson.pmf(k=x, mu=CF_1_2 * 1000)
    print(x, ' ', PF)
'''
n failed probability
0 0.9998266407198806
1 0.00017334425253100934
2 1.502671996412269e-08
3 8.684157279833254e-13
4 3.764024409898102e-17
5 1.305170259061071e-21
'''
These numbers make sense because I generated the data from a Weibull distribution with a characteristic life (alpha) of 100 weeks, so we'd expect the probability of failure during week 3 to be very low.
If you have further questions, feel free to email me directly.
I've spent months reading an endless number of posts and I still feel as confused as I initially was. Hopefully someone can help.
Problem: I want to use time series to make predictions of weather data at a particular location.
Set-up:
X1 and X2 are both vectors containing daily values of indices for 10 years (3650 total values in each vector).
Y is a time series of temperature at Newark airport (T), every day for 10 years (3650 days).
There's a strong case to be made that X1 and X2 can be used as predictors for Y. So I break everything into windows of 100 days and create the following:
X1 = (3650,100,1)
X2 = (3650,100,1)
Such that window 1 includes the values from t=0 to t=99, window 2 includes values from t=1 to t=100, etc. (Assume that I have enough extra data at the end that we still have 3650 windows).
What I've learned from other tutorials is that to go into Keras I'd do this:
X = (3650,100,2) = (#_of_windows,window_length,#_of_predictors) which I get by merging X1 and X2.
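The merge itself would presumably just be a concatenation along the feature axis:
import numpy as np

# Stack the two (3650, 100, 1) predictor arrays along the last axis
# to get a single (3650, 100, 2) input tensor.
X = np.concatenate([X1, X2], axis=-1)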
Then I have this code:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(1, return_sequences=True, input_shape=(100, 2)))
model.add(LSTM(4))
model.add(Dropout(0.2))
model.add(Dense(1))  # output layer needed so predictions match Y's shape (3650,)
model.compile(loss='mean_squared_error', optimizer='rmsprop')  # 'mean_square_error' is not a valid loss name
model.fit(X, Y, batch_size=128, epochs=2, shuffle=True)  # shuffle belongs to fit(), not compile(); Y is shape (3650,)
predictions = model.predict(?????????????)
So my question is: how do I set up model.predict to get back forecasts N days into the future? Sometimes I might want 2 days, sometimes I might need 2 weeks. I only need to get back N values (shape: [N,]); I don't need to get back windows or anything like that.
Thanks so much!
The only format in which you can predict is the format in which you trained the model. If I understand correctly, you trained the model as follows:
You used windows of size 100 (that is, features at times T-99,T-98,...,T) to predict the value of the target at time T.
If this is indeed the case, then the only thing that you can do with the model is the same type of prediction. That is, you can provide the values of your features for 100 days, and ask the model to predict the value of the target for the last day among the 100.
If you want it to be able to forecast N days, you have to train your model accordingly: every element in Y should be a sequence of N days. Here is a blog post that describes how to do that.
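A minimal sketch of that idea, with illustrative layer sizes and N chosen here as an assumption (Y_seq would hold, for each window, the N target values that follow it):
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

N = 14  # forecast length in days; set to whatever horizon you need

model = Sequential()
model.add(LSTM(32, input_shape=(100, 2)))
model.add(Dense(N))  # one output unit per future day
model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.fit(X, Y_seq, batch_size=128, epochs=2)  # Y_seq has shape (samples, N)

# Predict the next N days from the most recent 100-day window:
last_window = X[-1:]                           # shape (1, 100, 2)
forecast = model.predict(last_window).ravel()  # shape (N,)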