I have a time series problem: based on a stationarity test, the data need to be made stationary. Here are the details.
I need to remove the non-stationarity coming from the time trend for accurate training and forecasting. This operation rescales the real values, and the forecasts come out on that scale, so how can I convert the predicted values back to their real scale? (N.B.: I used ARIMA for prediction and removed the non-stationarity as follows.)
DS_log = np.log(DS["Value"])                    # log-transform to stabilize the variance
expwighted_avg = DS_log.ewm(halflife=1).mean()  # exponentially weighted moving average of the log series
DS_log_ewma_diff = DS_log - expwighted_avg      # subtract the EWMA to remove the trend
Then I pass DS_log_ewma_diff to ARIMA.
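A minimal sketch of inverting the transform, assuming preds holds the ARIMA forecasts on the DS_log_ewma_diff scale (the name preds is hypothetical): add back the EWMA that was subtracted, then undo the log. Out of sample there is no future EWMA, so one simple assumption is to carry the last observed EWMA value forward.
import numpy as np

# in-sample: restore the log scale, then the original scale
preds_log = preds + expwighted_avg
preds_real = np.exp(preds_log)

# out-of-sample (assumption): hold the last observed EWMA level constant
last_level = expwighted_avg.iloc[-1]
future_real = np.exp(preds + last_level)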
I am working with the Prophet library, for educational purposes, on a classic dataset:
the air passenger dataset available on Kaggle.
The data are at monthly frequency, which Prophet's cross-validation cannot handle as a standard frequency, based on that discussion.
For time series cross-validation I used Prophet's cross_validation() function, passing the arguments at weekly frequency.
But when I call performance_metrics, it returns the horizon column at daily frequency.
How can I get it at weekly frequency?
I also read the documentation and the function's docstring:
Metrics are calculated over a rolling window of cross validation
predictions, after sorting by horizon. Averaging is first done within each
value of horizon, and then across horizons as needed to reach the window
size. The size of that window (number of simulated forecast points) is
determined by the rolling_window argument, which specifies a proportion of
simulated forecast points to include in each window. rolling_window=0 will
compute it separately for each horizon. The default of rolling_window=0.1
will use 10% of the rows in df in each window. rolling_window=1 will
compute the metric across all simulated forecast points. The results are
set to the right edge of the window.
Here is how I modelled the dataset:
from prophet import Prophet  # in older releases: from fbprophet import Prophet
from prophet.diagnostics import cross_validation

model = Prophet()
model.fit(df)
future_dates = model.make_future_dataframe(periods=36, freq='MS')  # 36 months ahead
df_cv = cross_validation(model,
                         initial='300 W',
                         period='5 W',
                         horizon='52 W')
df_cv.head()
And then I call performance_metrics:
from prophet.diagnostics import performance_metrics

df_p = performance_metrics(df_cv)
df_p.head()
This is the output that I get, with the horizon column at daily frequency.
I am probably missing something, or I made a mistake in the code.
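For what it's worth, a sketch of one possible workaround, assuming (as in current Prophet releases) that the horizon column is a pandas timedelta: convert it to weeks after computing the metrics.
import pandas as pd

df_p = performance_metrics(df_cv)
# horizon is a timedelta64 column measured in days; express it in weeks
df_p['horizon_weeks'] = df_p['horizon'] / pd.Timedelta(weeks=1)
df_p.head()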
I have a high-frequency time series (observations spaced 3 seconds apart), which I'd like to analyse and eventually forecast over short horizons (10/20/30 min ahead) using different models. My whole dataset contains 20K observations. My goal is to draw conclusions about how well the different models can forecast the data.
I first tried to plot the whole dataset, but I couldn't identify anything:
Whole dataset
Then I plotted only the first 500 observations, and this is the result:
First 500 observations
I don't know why, but it looks just like white noise!
After running the ADF test on the whole dataset, I get a p-value of 0.0. This means that my dataset is stationary, right?
I decided to try the ARIMA model first, but from the ACF and PACF plots I can't identify p and q:
ACF
PACF
1- Is the dataset white noise? Is it possible to make predictions on this time series?
2- I tried to downsample the dataset (taking the mean over each 4-minute window; see the sketch after this list), but same thing: I couldn't identify anything, and I think this results in a loss of information, no?
3- How much data should I fit the ARIMA on in the training set? Does it make sense to use a short training set for a short-term forecast horizon?
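A minimal sketch of the 4-minute downsampling from point 2, assuming the data are a pandas Series named series with a DatetimeIndex at 3-second spacing (the name is hypothetical):
import pandas as pd

# mean over each 4-minute window; 4 min = 80 observations at 3-second spacing
series_4min = series.resample('4min').mean()
print(series_4min.head())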
To my naked eye the time series looks seasonal, yet when I use adfuller(), the results say the series is stationary based on the p-value.
I have also applied seasonal_decompose() to it. The results were pretty much what I expected:
tb3['percent'].plot(figsize=(18,8))
What the series looks like
One thing to note is that my data is collected every minute.
from statsmodels.tsa.seasonal import seasonal_decompose

tb3.index.freq = 'T'  # minutely observations

# period=24*60: one full day of minutely data
# (older statsmodels versions call this argument freq)
result = seasonal_decompose(tb3['percent'].values, period=24*60, model='additive')
result.plot();
The result of the ETS decomposition is shown in the figure below.
ETS decomposition
We can see clear seasonality, which is what I expected.
But when I use adfuller():
from statsmodels.tsa.stattools import adfuller
result = adfuller(tb3['percent'], autolag='AIC')
the p-value is less than 0.05, which means the series is stationary.
Can anyone tell me why that happened? How can I fix it?
I ask because I want to use a SARIMA model to predict future values, whereas an ARIMA model always predicts a constant future value.
An Augmented Dickey-Fuller test examines whether the coefficient c in the regression
y_t - y_{t-1} = <deterministic terms> + c * y_{t-1} + <lagged differences>
is equal to 0, which is equivalent to the autoregressive coefficient in levels being equal to 1 (a unit root). The test says nothing about seasonal deterministic terms, so it is not surprising that adfuller rejects the unit root and reports the series as stationary: deterministic seasonality does not make a series non-stationary in the unit-root sense.
You can use a stationary SARIMA model, for example
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(y, order=(p, 0, q), seasonal_order=(ps, 0, qs, 24*60))
where you set the AR, MA, seasonal AR, and seasonal MA orders as needed.
This model will be quite slow and memory-intensive, since you have 24 hours of minutely data and therefore a seasonal period of 1440 lags.
The next version of statsmodels, which has been released as statsmodels 0.12.0rc0, adds initial support for deterministic processes in time series models which may simplify modeling this type of series. In particular, it would be tempting to use a low order Fourier deterministic sequence. Below is an example notebook.
https://www.statsmodels.org/devel/examples/notebooks/generated/deterministics.html
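For illustration, a minimal sketch of that idea, assuming statsmodels >= 0.12 and a series y with a regular DatetimeIndex; the Fourier order of 3 is an arbitrary choice:
from statsmodels.tsa.deterministic import DeterministicProcess, Fourier
from statsmodels.tsa.statespace.sarimax import SARIMAX

# daily seasonality in minutely data: period = 24 * 60 = 1440
fourier = Fourier(period=24 * 60, order=3)  # 3 sine/cosine pairs (arbitrary order)
dp = DeterministicProcess(y.index, constant=True, additional_terms=[fourier])

# capture the seasonality with Fourier regressors instead of a 1440-lag seasonal ARMA
model = SARIMAX(y, exog=dp.in_sample(), order=(1, 0, 1))
res = model.fit()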
A dataset of 921 rows x 10166 columns is used to predict bacteria plate count from water temperature. Each row is an observation (the first 10080 columns are the time series of water temperature, and the remaining 2 columns are y labels: 1 means high bacteria count, 0 means low bacteria count).
The temperature fluctuates during each activation; for the rest of the time, the water temperature remains constant at 25°C. Since there are too many features in the time series, I am thinking about extracting some relevant features from the time series data, such as the 3 lowest frequency values or amplitudes, using fft, ifft, etc. from scipy.fftpack, and then fitting a logistic regression model. However, due to my limited background in waves/signals, I am confused about a few things:
1) Does applying fft to the time series produce an array of the frequencies in the time series data? If not, which function should I use instead? (See the sketch after this list.)
2) I've forward-filled my time series data (i.e., data points are spaced at fixed time intervals), and the number of data points in each time series is the same. If 1) is correct, will the number of frequencies returned for different time series be the same?
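For reference, a minimal sketch of the kind of feature extraction described above, assuming each observation's series is a NumPy array x of minutely samples (the names and the sampling interval are illustrative):
import numpy as np

x = np.random.rand(10080)                  # stand-in for one row's temperature series

spectrum = np.fft.rfft(x)                  # complex spectrum of a real-valued signal
freqs = np.fft.rfftfreq(len(x), d=60.0)    # frequencies in Hz (60 s between samples)
amplitudes = np.abs(spectrum)              # amplitude at each frequency

features = amplitudes[1:4]                 # e.g. the 3 lowest non-zero frequencies
Because rfft on a length-n signal always returns n//2 + 1 bins, series of equal length yield the same number of frequencies, which speaks to question 2).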
Below is a basic visualisation of my original data.
Any help is appreciated. Thank you.
The summary of an ARMA time series prediction (print(arma_mod.summary())) shows some numbers about the confidence interval. Is it possible to use these numbers as prediction intervals in the plot that shows the predicted values?
ax = indexed_df.plot(figsize=(12, 8))  # .ix is removed in modern pandas
ax = predict_price.plot(ax=ax, style='rx', label='Dynamic Prediction');
ax.legend();
I guess the code:
from statsmodels.sandbox.regression.predstd import wls_prediction_std
prstd, iv_l, iv_u = wls_prediction_std(results)
found here: Confidence intervals for model prediction
...does not apply here, as it is made for OLS rather than for ARMA forecasting. I also checked GitHub but did not find anything new related to time series prediction.
(Making forecasts requires forecast intervals, I guess, especially for an out-of-sample forecast.)
Help appreciated.
I suppose that for out-of-sample ARMA prediction you can use ARMA.forecast from statsmodels.tsa.
It returns three arrays: the predicted values, the standard errors, and the confidence intervals for the prediction.
Example with ARMA(1,1), a time series y, and a prediction 1 step ahead:
import statsmodels.api as sm  # statsmodels.api exposes the tsa namespace

arma_res = sm.tsa.ARMA(y, order=(1, 1)).fit()
preds, stderr, ci = arma_res.forecast(1)  # forecasts, std. errors, conf. intervals
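To address the plotting part of the question, a minimal sketch of drawing the interval with matplotlib, assuming a 10-step forecast (the step count and x-axis handling are illustrative):
import matplotlib.pyplot as plt

steps = 10
preds, stderr, ci = arma_res.forecast(steps)

plt.plot(range(steps), preds, 'rx', label='forecast')
# ci has shape (steps, 2): lower and upper bounds of the interval
plt.fill_between(range(steps), ci[:, 0], ci[:, 1], alpha=0.2, label='95% interval')
plt.legend()
plt.show()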