How to get performance_metrics() on weekly frequency in facebook-prophet? - python

I am working with the Prophet library for educational purposes on a classic dataset:
the air passenger dataset available on Kaggle.
The data are at monthly frequency, which Prophet's cross validation cannot handle as a standard frequency, based on that discussion.
During cross validation for the time series I used the Prophet function cross_validation(), passing its arguments at weekly frequency.
But when I call the function performance_metrics, it returns the horizon column at daily frequency.
How can I get it at weekly frequency?
I also tried to read the documentation and the function description:
Metrics are calculated over a rolling window of cross validation
predictions, after sorting by horizon. Averaging is first done within each
value of horizon, and then across horizons as needed to reach the window
size. The size of that window (number of simulated forecast points) is
determined by the rolling_window argument, which specifies a proportion of
simulated forecast points to include in each window. rolling_window=0 will
compute it separately for each horizon. The default of rolling_window=0.1
will use 10% of the rows in df in each window. rolling_window=1 will
compute the metric across all simulated forecast points. The results are
set to the right edge of the window.
Here is how I modelled the dataset:
from prophet import Prophet
from prophet.diagnostics import cross_validation

model = Prophet()
model.fit(df)
future_dates = model.make_future_dataframe(periods=36, freq='MS')
df_cv = cross_validation(model,
                         initial='300 W',
                         period='5 W',
                         horizon='52 W')
df_cv.head()
And then I call performance_metrics:
from prophet.diagnostics import performance_metrics

df_p = performance_metrics(df_cv)
df_p.head()
This is the output that I get, with the horizon at daily frequency.
I am probably missing something, or I made a mistake in the code.
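One workaround, sketched below, is to rescale the horizon column after the fact; performance_metrics returns horizon as a pandas Timedelta column, so it can be converted to a float number of weeks (the horizon_weeks name is mine, not a Prophet option):
import pandas as pd

# horizon is a timedelta64 column; dividing by one week gives a float
# number of weeks (equivalently: df_p['horizon'].dt.days / 7)
df_p['horizon_weeks'] = df_p['horizon'] / pd.Timedelta(weeks=1)
df_p.head()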

Related

High frequency time series forecasting

I have a high-frequency time series (observations separated by 3 seconds), which I'd like to analyse and eventually use to forecast short-term periods (10/20/30 min ahead) with different models. My whole dataset contains 20K observations. My goal is to draw conclusions about how well the different models can forecast the data.
I first tried to plot the whole dataset, but I couldn't identify anything:
Whole dataset
Then I plotted only the first 500 observations, and this is the result:
First 500 observations
I don't know why it looks just like white noise!
After running the ADF test on the whole dataset, it gives me a p-value of 0.0! This means that my dataset is stationary, right?
I decided to try the ARIMA model first, but from the ACF and PACF plots I can't identify p and q:
ACF
PACF
1- Is the dataset white noise? Is it possible to make predictions on this time series?
2- I tried to downsample the dataset (taking the mean over each 4 minutes), but it's the same thing: I couldn't identify anything, and I think this results in a loss of information, no? (A sketch of the ADF test and this downsampling follows this list.)
3- What length of data should I fit the ARIMA on in the training set? Does it make sense to use a short training set for a short-term forecasting period?
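For reference, a minimal sketch of the ADF test and the 4-minute downsampling mentioned above (assuming the series lives in a DataFrame df with a DatetimeIndex and a column 'value'; both names are placeholders):
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# ADF test: a p-value near 0 rejects the unit-root null hypothesis,
# i.e. the series is stationary in the ADF sense
adf_stat, p_value, *_ = adfuller(df['value'])
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3g}")

# Downsample the 3-second observations to 4-minute means
df_4min = df['value'].resample('4min').mean()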

How to determine multiple Periodicities present in Timeseries data?

My objective is to detect all the seasonalities, and their time periods, that are present in a time series waveform.
I'm currently using the following dataset:
https://www.kaggle.com/rakannimer/air-passengers
At the moment, I've tried the following approaches:
1) Use of FFT:
import pandas as pd
import numpy as np
import scipy.fft
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# https://www.kaggle.com/rakannimer/air-passengers
df = pd.read_csv('AirPassengers.csv')
df.head()

frequency_eval_max = 100
A_signal_rfft = scipy.fft.rfft(df['#Passengers'], n=frequency_eval_max)
n = np.shape(A_signal_rfft)[0]  # number of frequency bins
frequencies_rel = len(A_signal_rfft) / frequency_eval_max * np.linspace(0, 1, int(n))

fig = plt.figure(3, figsize=(15, 6))
plt.clf()
plt.plot(frequencies_rel, np.abs(A_signal_rfft), lw=1.0, c='paleturquoise')
plt.stem(frequencies_rel, np.abs(A_signal_rfft))
plt.xlabel("frequency")
plt.ylabel("amplitude")
This results in the following plot:
But it doesn't show anything conclusive or comprehensible.
Ideally, I wish to see peaks representing daily, weekly, monthly and yearly seasonality.
Could anyone point out what I am doing wrong?
2) Autocorrelation:
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

plt.rcParams.update({'figure.figsize': (10, 6), 'figure.dpi': 120})
autocorrelation_plot(df['#Passengers'].tolist())
After doing this I get a plot like the following:
But how do I read this plot, and how can I derive the presence of the various seasonalities and their periods from it? (One programmatic reading is sketched below.)
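One hedged way to read the autocorrelation plot programmatically (an assumption on my part, not something from the pandas docs): detrend the series first, since the trend otherwise dominates the autocorrelations, then take the lag of the strongest remaining peak as a candidate period.
import numpy as np
from statsmodels.tsa.stattools import acf

# Difference once to remove the trend, then look for the lag (beyond the
# first couple of lags) with the largest autocorrelation
detrended = df['#Passengers'].diff().dropna()
vals = acf(detrended, nlags=48)
candidate_period = int(np.argmax(vals[2:]) + 2)
print(candidate_period)  # expected to be ~12 (yearly cycle in monthly data)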
3) STL Decomposition Algorithm
df.set_index('Month', inplace=True)
df.index = pd.to_datetime(df.index)
# drop null values
df.dropna(inplace=True)
df.plot()
result = seasonal_decompose(df['#Passengers'], model='multiplicative', period=12)
result.seasonal.plot()
This gives the following plot:
But here I can only see one kind of seasonality.
So how do we detect all the types of seasonality present, and their time periods, using this method? (One untried option is sketched below.)
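One option I haven't tried (an assumption on my part: it needs a statsmodels version recent enough to ship MSTL) would be MSTL, which fits one seasonal component per candidate period you pass:
from statsmodels.tsa.seasonal import MSTL

# One seasonal component per candidate period (here: half-yearly and yearly)
result = MSTL(df['#Passengers'], periods=(6, 12)).fit()
result.plot()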
Hence, I've tried 3 different approaches, but they all seem either erroneous or incomplete.
Could anyone please help me out with the most effective approach (even apart from the ones I've tried) to detect all the seasonalities and their time periods in any given time series?
I still think a Fourier analysis is the way to go; it's just that the 0-frequency result is shadowing any insight.
This is essentially the square of the average of your data set, and all your records are positive, far from the typical sinusoidal function you would analyze with Fourier transforms. So simply subtract the average of your dataset from your dataset before doing the FFT and see how it looks. This would also help with the autocorrelation technique.
Also, you MUST give units to your frequency values. Do not settle for the raw values from the FFT; those are related to the sampling frequency and span of your dataset. Reason about it and adequately label the daily, weekly, monthly and annual frequencies in your chart.
Using the FFT, you can get the fundamental frequency. You can then use a low-pass filter, or just manually select the first n frequencies; these frequencies will correspond to the 'seasonalities'. Transform your filtered FFT back into the time domain and you can visualize the most basic underlying repetitions; you can easily calculate the time period of those repetitions and visualize them by individually plotting F0, F1, ... in the time domain.
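A minimal sketch of that advice for the monthly AirPassengers series used above (one assumption: with 12 samples per year, passing d=1/12 to rfftfreq puts the frequency axis in cycles per year):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('AirPassengers.csv')
x = df['#Passengers'].to_numpy(dtype=float)

# Subtract the mean so the 0-frequency (DC) term no longer dominates
x = x - x.mean()

# Monthly sampling: d = 1/12 year per sample, so frequencies come out in cycles/year
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1/12)

plt.stem(freqs, spectrum)
plt.xlabel("frequency (cycles/year)")
plt.ylabel("amplitude")
plt.show()
# The dominant peak should land near 1 cycle/year, i.e. the yearly seasonality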

How to choose initial, period, horizon and cutoffs with Facebook Prophet?

I have around 23,300 hourly datapoints in my dataset and I am trying to forecast using Facebook Prophet.
To fine-tune the hyperparameters one can use cross validation:
from fbprophet.diagnostics import cross_validation
The whole procedure is shown here:
https://facebook.github.io/prophet/docs/diagnostics.html
Using cross_validation one needs to specify initial, period and horizon:
df_cv = cross_validation(m, initial='xxx', period='xxx', horizon = 'xxx')
I am now wondering how to configure these three values in my case. As stated, I have about 23,300 hourly datapoints. Should I take a fraction of that as the horizon, or is it not that important to use exact fractions of the data as the horizon, so that I can take whatever value seems appropriate?
Furthermore, cutoffs can also be defined, as below:
cutoffs = pd.to_datetime(['2013-02-15', '2013-08-15', '2014-02-15'])
df_cv2 = cross_validation(m, cutoffs=cutoffs, horizon='365 days')
Should these cutoffs be equally spaced, as above, or can they be set individually, however one likes?
initial is the first training period. It is the minimum amount of data needed to begin your training on.
horizon is the length of time you want to evaluate your forecast over. Let's say that a retail outlet is building their model so that they can predict sales over the next month. A horizon set to 30 days would make sense here, so that they are evaluating their model on the same parameter setting that they wish to use it on.
period is the amount of time between each fold. It can be either greater than the horizon or less than it, or even equal to it.
cutoffs are the dates where each horizon will begin.
You can understand these terms by looking at this image (credits: Forecasting Time Series Data with Facebook Prophet by Greg Rafferty).
Let's imagine that a retail outlet wants a model that is able to predict the next month of daily sales, and they plan on running the model at the beginning of each quarter. They have 3 years of data.
They would set their initial training data to be 2 years, then. They want to predict the next month of sales, and so would set the horizon to 30 days. They plan to run the model each business quarter, and so would set the period to be 90 days.
This is also shown in the image above.
Let's apply these parameters to our model:
df_cv = cross_validation(model,
                         horizon='30 days',
                         period='90 days',
                         initial='730 days')
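Applying the same reasoning to the roughly 23,300 hourly points in the question (about 2.7 years of data) might look like the sketch below. The concrete values are illustrative assumptions, not recommendations from the Prophet docs:
# ~2 years of hourly data to train on, re-fit monthly, evaluate 3 days ahead
df_cv = cross_validation(m,
                         initial='730 days',
                         period='30 days',
                         horizon='3 days')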

Time series feature extraction using Fourier transformation

A dataset of 921 rows x 10166 columns is used to predict bacteria plate count based on water temperature. Each row is an observation (the first 10080 columns being the time series of water temperature and the remaining 2 columns being y labels: 1 means high bacteria count, 0 means low bacteria count).
There is fluctuation in the temperature for each activation. For the rest of the time, the water temperature remains constant at 25°C. Since there are too many features in the time series, I am thinking about extracting some relevant features from the time series data, such as the first 3 lowest frequency values or amplitudes, using fft or ifft etc. from scipy.fftpack, and then fitting a logistic regression model. However, due to limited background knowledge in waves/signals, I am confused about a few things:
1) Does applying fft on the time series produce an array of the frequencies of the time series data? If not, which function should I use instead?
2) I've done a forward fill on my time series data (i.e. data points are spaced at fixed time intervals) and the number of data points for each time series is the same. If 1) is correct, will the number of frequencies returned for the different time series be the same? (A sketch of what I have in mind follows at the end of this question.)
Below is a basic visualisation of my original data.
Any help is appreciated. Thank you.
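For concreteness, a minimal sketch of the feature extraction I have in mind (assumptions: X is a (921, 10080) NumPy array of the temperature series, y holds the 0/1 labels, and scipy.fft is used in place of the older scipy.fftpack):
import numpy as np
from scipy.fft import rfft
from sklearn.linear_model import LogisticRegression

def fft_features(X, n_freqs=3):
    # rfft yields one complex coefficient per frequency bin; equal-length
    # series therefore get the same number of frequencies (question 2)
    spectra = np.abs(rfft(X, axis=1))
    # keep the amplitudes of the lowest non-DC frequency bins
    return spectra[:, 1:1 + n_freqs]

# Hypothetical usage:
# clf = LogisticRegression().fit(fft_features(X), y)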

Time series analysis: forecast the real scale of values

I have a time series problem. Based on the stationarity test, the data needs to be made stationary; here are the details.
I need to remove the trend to make the series stationary for accurate training and forecasting. This operation rescales the real values, and the forecast comes out on that scale, so how can I convert the predicted values back to their real scale? (N.B.: I used ARIMA for prediction, and I made the series stationary as per this:)
import numpy as np

DS_log = np.log(DS["Value"])
expwighted_avg = DS_log.ewm(halflife=1).mean()
DS_log_ewma_diff = DS_log - expwighted_avg
Then I pass DS_log_ewma_diff to ARIMA.
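A minimal sketch of inverting those transforms, under the assumption that pred_diff holds the ARIMA predictions on the DS_log_ewma_diff scale, aligned to the same index (pred_diff is a placeholder name):
import numpy as np

# Undo the transforms in reverse order:
# 1) add back the EWMA that was subtracted (still on the log scale);
#    for out-of-sample steps the future EWMA is not observed, so it must be
#    extrapolated (e.g. carried forward), which is an extra assumption
pred_log = pred_diff + expwighted_avg
# 2) undo the log to recover the original scale
pred_real = np.exp(pred_log)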
