High frequency time series forecasting - python

I have a high-frequency time series (observations 3 seconds apart) that I'd like to analyse and eventually forecast over short horizons (10/20/30 minutes ahead) using different models. My whole dataset contains 20K observations. My goal is to draw conclusions about how well the different models forecast the data.
I first tried plotting the whole dataset, but I couldn't identify anything:
Whole dataset
Then I plotted only the first 500 observations, and this is the result:
First 500 observations
I don't know why it looks just like white noise!
Running the ADF test on the whole dataset gives a p-value of 0.0. This means that my dataset is stationary, right?
I decided to try an ARIMA model first, but I can't identify p and q from the ACF and PACF plots:
1- Is the dataset white noise? Is it possible to forecast this time series at all?
2- I tried downsampling the dataset (taking the mean over each 4-minute window), but it was the same thing: I couldn't identify anything. I also think this loses information, doesn't it?
3- How much data should I use as the training set when fitting the ARIMA model? Does it make sense to use a short training set for a short-term forecasting horizon?
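One way to make question 1 concrete is to compare the sample autocorrelations against the approximate 95% band that white noise would produce, which is roughly ±1.96/sqrt(n). The following is only a minimal sketch using NumPy on simulated data; the helper names `sample_acf` and `share_outside_band` and the AR(1) comparison series are assumptions made for illustration, not part of the original dataset.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])


def share_outside_band(x, max_lag=40):
    """Fraction of lags whose autocorrelation lies outside the ~95% white-noise band."""
    band = 1.96 / np.sqrt(len(x))
    return float(np.mean(np.abs(sample_acf(x, max_lag)) > band))


rng = np.random.default_rng(0)
n = 20_000  # same order of magnitude as the dataset in the question

# Simulated white noise: no lag should carry usable information.
noise = rng.normal(size=n)

# Simulated AR(1) series for contrast: it is stationary, yet predictable.
ar1 = np.empty(n)
ar1[0] = noise[0]
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]

noise_share = share_outside_band(noise)
ar1_share = share_outside_band(ar1)
print("white noise, share of lags outside band:", noise_share)
print("AR(1), share of lags outside band:", ar1_share)
```

If the real series behaves like the first case (only about 5% of lags outside the band, with no pattern), there is little linear structure for ARIMA to exploit. A stationary ADF result alone does not settle this, because the AR(1) series above is stationary too.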


How to choose initial, period, horizon and cutoffs with Facebook Prophet?

I have around 23300 hourly datapoints in my dataset and I try to forecast using Facebook Prophet.
To fine-tune the hyperparameters one can use cross validation:
from fbprophet.diagnostics import cross_validation
The whole procedure is shown here:
Using cross_validation one needs to specify initial, period and horizon:
df_cv = cross_validation(m, initial='xxx', period='xxx', horizon = 'xxx')
I am now wondering how to configure these three values in my case. As stated, I have about 23300 hourly datapoints. Should I take a fraction of that as the horizon, or is it not that important to use a particular fraction of the data, so that I can take whatever value seems appropriate?
Furthermore, cutoffs can also be defined as below:
cutoffs = pd.to_datetime(['2013-02-15', '2013-08-15', '2014-02-15'])
df_cv2 = cross_validation(m, cutoffs=cutoffs, horizon='365 days')
Should these cutoffs be equally spaced as above, or can we set them individually however we like?
initial is the first training period. It is the minimum amount of data needed to begin your training on.
horizon is the length of time you want to evaluate your forecast over. Let's say that a retail outlet is building their model so that they can predict sales over the next month. A horizon set to 30 days would make sense here, so that they are evaluating their model on the same parameter setting that they wish to use it on.
period is the amount of time between each fold. It can be either greater than the horizon or less than it, or even equal to it.
cutoffs are the dates where each horizon will begin.
You can understand these terms by looking at this image:
Credits: Forecasting Time Series Data with Facebook Prophet by Greg Rafferty
Let's imagine that a retail outlet wants a model that is able to predict the next month of daily sales, and they plan on running the model at the beginning of each quarter. They have 3 years of data.
They would then set their initial training data to be 2 years. They want to predict the next month of sales, and so would set horizon to 30 days. They plan to run the model each business quarter, and so would set the period to be 90 days.
This is also shown in the image above.
Let's apply these parameters into our model:
df_cv = cross_validation(model,
                         horizon='30 days',
                         period='90 days',
                         initial='730 days')
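To see how initial, period and horizon interact, it can help to compute the cutoff dates by hand. The sketch below uses only pandas and does not call Prophet. The function `sketch_cutoffs` is a hypothetical helper, and the backwards walk from the end of the data reflects my understanding of how Prophet places cutoffs when none are passed explicitly, so treat it as an assumption to verify against the installed version.

```python
import pandas as pd


def sketch_cutoffs(dates, initial, period, horizon):
    """Walk backwards from the last date, spacing cutoffs by `period`.

    A cutoff is kept only if at least `initial` worth of history precedes it.
    """
    dates = pd.Series(pd.to_datetime(dates))
    initial, period, horizon = (pd.Timedelta(v) for v in (initial, period, horizon))
    start, end = dates.min(), dates.max()

    cutoff = end - horizon  # the last fold ends exactly at the last observation
    cutoffs = []
    while cutoff >= start + initial:
        cutoffs.append(cutoff)
        cutoff = cutoff - period
    return sorted(cutoffs)


# Three years of daily data, as in the retail example above (dates are made up).
dates = pd.date_range("2020-01-01", periods=3 * 365, freq="D")
cutoffs = sketch_cutoffs(dates, initial="730 days", period="90 days", horizon="30 days")

for c in cutoffs:
    print("train up to", c.date(), "-> evaluate the following 30 days")
```

For the hourly dataset in the question (about 23300 points, roughly 970 days), the same reasoning applies: pick horizon to match the forecast length you actually care about (for example '48 hours'), and choose initial and period so that the number of folds is large enough to be informative but still affordable to fit.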

Why does seasonal_decompose() show clear seasonality in my time series, while adfuller() says it is stationary?

To the naked eye my time series looks seasonal, yet when I run adfuller() the p-value indicates that the series is stationary.
I have also applied seasonal_decompose() to it, and the results were pretty much what I expected.
What the series looks like
One thing to note is that my data is collected every minute.
tb3.index.freq = 'T'
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(tb3['percent'].values,freq=24*60, model='additive')
The result of the ETS decomposition is shown in the figure below:
ETS decompose
We can see clear seasonality, which is what I expected.
But when I use adfuller():
from statsmodels.tsa.stattools import adfuller
result = adfuller(tb3['percent'], autolag='AIC')
the p-value is less than 0.05, which means the series is stationary.
Can anyone tell me why this happens, and how I can fix it?
I want to use a SARIMA model to predict future values, because the ARIMA model always predicts a constant future value.
An Augmented Dickey-Fuller test examines whether the coefficient c in the regression
y_t - y_{t-1} = <deterministic terms> + c y_{t-1} + <lagged differences>
is equal to 0, which is the same as a coefficient of 1 on y_{t-1} when the regression is written in levels. It does not usually have power against seasonal deterministic terms, so it is not surprising that adfuller rejects the unit-root null here even though the series is clearly seasonal.
You can use a stationary SARIMA model, for example
SARIMAX(y, order=(p,0,q), seasonal_order=(ps, 0, qs, 24*60))
where you set the AR, MA, seasonal AR, and seasonal MA orders as needed.
This model will be quite slow and memory-intensive, since minutely data with a daily cycle implies a seasonal lag of 1440.
The next version of statsmodels, which has been released as statsmodels 0.12.0rc0, adds initial support for deterministic processes in time series models which may simplify modeling this type of series. In particular, it would be tempting to use a low order Fourier deterministic sequence. Below is an example notebook.
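The notebook itself is not reproduced here. As a separate, minimal sketch of the low-order Fourier idea, the code below builds sine/cosine regressors for a daily period of 1440 minutes by hand with NumPy and fits them by ordinary least squares. The helper `fourier_terms`, the simulated series and its coefficients are all assumptions made for illustration; statsmodels' own deterministic-process classes are deliberately not used so that the example stays self-contained.

```python
import numpy as np


def fourier_terms(n_obs, period, order):
    """Return an (n_obs, 2 * order) array of sin/cos pairs for harmonics 1..order."""
    t = np.arange(n_obs)
    cols = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)


period = 24 * 60       # one day of minutely observations
n_obs = 3 * period     # three simulated days
rng = np.random.default_rng(1)
t = np.arange(n_obs)

# Simulated series: constant level + deterministic daily pattern + noise.
y = (50.0
     + 10.0 * np.sin(2 * np.pi * t / period)
     + 3.0 * np.cos(4 * np.pi * t / period)
     + rng.normal(scale=1.0, size=n_obs))

# Design matrix: intercept plus two harmonics (columns: 1, sin1, cos1, sin2, cos2).
X = np.column_stack([np.ones(n_obs), fourier_terms(n_obs, period, order=2)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

seasonal_fit = X @ coef
residual = y - seasonal_fit
print("estimated coefficients:", np.round(coef, 2))
print("residual std:", round(float(residual.std()), 3))
```

Four columns replace a 1440-lag seasonal component here. In practice one could model the residual with a small ARMA, or pass the Fourier columns as exogenous regressors to SARIMAX with a non-seasonal order; which of the two works better depends on the data.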

Time series feature extraction using Fourier transformation

The dataset of 921 rows x 10166 columns is used to predict bacteria plate count based on water temperature. Each row is an observation (the first 10080 columns are the water temperature time series and the remaining 2 columns are y labels: 1 means high bacteria count, 0 means low bacteria count).
The temperature fluctuates during each activation. The rest of the time, the water temperature remains constant at 25°C. Since the time series has too many features, I am thinking about extracting some relevant ones, such as the 3 lowest frequency values or the amplitude of the time series, using fft or ifft etc. from scipy.fftpack, and then fitting a logistic regression model. However, due to my limited background in waves/signals, I am confused about a few things:
1) Does applying fft to the time series produce an array of the frequencies in the time series data? If not, which function should I use instead?
2) I've forward-filled my time series data (i.e. data points are spaced at fixed time intervals) and every time series has the same number of data points. If 1) is correct, will the number of frequencies returned for different time series be the same?
Below is a basic visualisation of my original data.
Any help is appreciated. Thank you.
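On the two questions above: an FFT returns complex coefficients (one per frequency bin), not the frequencies themselves. The frequencies come from a companion function, and the number of bins depends only on the series length, so equally long series give equally many values. A minimal sketch, assuming NumPy's `numpy.fft` in place of `scipy.fftpack`; the helper name `lowest_frequency_features` and the simulated temperature traces are made up for illustration:

```python
import numpy as np


def lowest_frequency_features(series, n_features=3, sample_spacing=1.0):
    """Return (frequencies, amplitudes) of the n lowest non-zero frequency bins."""
    series = np.asarray(series, dtype=float)
    # Remove the mean so the constant 25 degree baseline does not dominate bin 0.
    spectrum = np.fft.rfft(series - series.mean())
    freqs = np.fft.rfftfreq(series.size, d=sample_spacing)
    amplitudes = 2.0 * np.abs(spectrum) / series.size
    # Skip bin 0 (the mean) and keep the next n_features bins.
    return freqs[1:n_features + 1], amplitudes[1:n_features + 1]


n = 10080  # same length as one row of the dataset
t = np.arange(n)

flat = np.full(n, 25.0)                                  # no activation at all
active = 25.0 + 2.0 * np.sin(2 * np.pi * 2 * t / n)      # two slow cycles, amplitude 2

freqs_a, amps_a = lowest_frequency_features(active)
freqs_f, amps_f = lowest_frequency_features(flat)

print("number of rfft bins:", np.fft.rfft(active).size)
print("frequencies (cycles per sample):", freqs_a)
print("amplitudes, active row:", np.round(amps_a, 3))
print("amplitudes, flat row:", np.round(amps_f, 3))
```

The amplitude arrays could then be stacked row by row into a feature matrix for logistic regression. Whether the three lowest bins are the informative ones for bacteria count is an empirical question that the sketch does not answer.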

Time series analysis: converting forecasts back to the real scale of values

I have a time series problem. Based on a stationarity test, the data needs to be made stationary; here are the details.
I need to remove the time trend (the non-stationarity) for accurate training and forecasting. This operation rescaled the real values, and the forecasts are on that transformed scale, so how can I convert the predicted values back to their real scale? (N.B.: I used ARIMA for prediction and removed the non-stationarity as follows.)
DS_log = np.log(DS["Value"])
expwighted_avg = DS_log.ewm(halflife=1).mean()
DS_log_ewma_diff = DS_log - expwighted_avg
Then I pass DS_log_ewma_diff to ARIMA.
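One possible way to undo this transform is sketched below. Because the exponentially weighted mean at time t includes the value at time t itself, the inversion has to proceed step by step, carrying the running weighted sums forward from the known history. The sketch assumes pandas' default `adjust=True` behaviour for `ewm`, uses a made-up series in place of `DS["Value"]`, and uses the true held-out differences where ARIMA forecasts would normally go; `invert_log_ewma_diff` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def invert_log_ewma_diff(diff_values, log_history, halflife=1.0):
    """Map values of (log y - ewma(log y)) back to y, one step at a time.

    Assumes the EWMA was computed with pandas' default adjust=True.
    """
    alpha = 1.0 - np.exp(np.log(0.5) / halflife)
    decay = 1.0 - alpha

    # Running numerator and denominator of the adjusted EWMA over the history.
    num, den = 0.0, 0.0
    for x in np.asarray(log_history, dtype=float):
        num = x + decay * num
        den = 1.0 + decay * den

    recovered = []
    for d in np.asarray(diff_values, dtype=float):
        new_den = 1.0 + decay * den
        # Solve d = x - (x + decay * num) / new_den for x.
        x = (d * new_den + decay * num) / (decay * den)
        num = x + decay * num
        den = new_den
        recovered.append(np.exp(x))  # undo the log
    return np.array(recovered)


# Made-up positive series standing in for DS["Value"].
rng = np.random.default_rng(2)
values = pd.Series(100.0 * np.exp(0.02 * np.arange(60) + 0.05 * rng.normal(size=60)))

log_values = np.log(values)
diff = log_values - log_values.ewm(halflife=1).mean()

# Pretend the last 10 points are the forecast horizon.
history = log_values.iloc[:50]
recovered = invert_log_ewma_diff(diff.iloc[50:], history, halflife=1.0)
print(np.round(recovered[:3], 3), np.round(values.iloc[50:53].to_numpy(), 3))
```

With real ARIMA output, the forecasted differences would replace `diff.iloc[50:]`. Note that any forecast error is carried into later steps through the running sums, so long horizons may drift.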

How do I denormalize the sklearn diabetes dataset?

There is a nice example of linear regression in sklearn using a diabetes dataset.
I copied the notebook version and played with it a bit in Jupyterlab. Of course, it works just like the example. But I wondered what I was really seeing.
There is a chart with unlabeled axes.
I wondered what the label (dependent variable) was.
I wondered which of the 10 independent variables was being used.
So I played around with the nice features provided by ipython/jupyter:
Diabetes dataset
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of
n = 442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Body mass index:
:Average blood pressure:
Note: Each of these 10 feature variables have been mean centered and scaled by the standard
deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004)
"Least Angle Regression," Annals of Statistics (with discussion), 407-499.
From the Source URL, we are led to the original raw data which is a tab-separated unnormalized copy of the data. It also further explains what the "S" features were in the problem domain.
Interestingly, sex was one of [1,2], with a guess as to what they meant.
But my real question is whether there is a way within sklearn to denormalize the data.
Is there a way to denormalize the coefficients and intercept so that one could express the fit algebraically?
Or is this just a demonstration of linear regression?
There is no way to denormalize data without any information about the data prior to the normalization. However, note that the sklearn.preprocessing classes MinMaxScaler, StandardScaler, etc. do include inverse_transform methods (example), so if this were also provided in the example it would be easy to do. As it stands, as you say, this is just a regression demonstration.
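If the centering and scaling constants are known (for example, recomputed from the raw tab-separated file mentioned in the question), the coefficients can be converted back algebraically. The sketch below uses NumPy only and synthetic stand-in columns rather than the real diabetes data. It assumes the normalization described in the dataset note, i.e. mean centering followed by division by a factor that makes each column's sum of squares equal 1; the column meanings and true coefficients are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples = 442

# Synthetic "raw" features standing in for two unnormalized columns.
raw = np.column_stack([
    rng.normal(50.0, 12.0, n_samples),   # age-like column (made up)
    rng.normal(26.0, 4.0, n_samples),    # BMI-like column (made up)
])
target = 150.0 + 0.5 * raw[:, 0] + 4.0 * raw[:, 1] + rng.normal(scale=5.0, size=n_samples)

# Normalize: mean center, then scale so each column's sum of squares is 1.
center = raw.mean(axis=0)
scale = raw.std(axis=0) * np.sqrt(n_samples)
scaled = (raw - center) / scale

# Fit ordinary least squares on the normalized features.
design = np.column_stack([np.ones(n_samples), scaled])
beta, *_ = np.linalg.lstsq(design, target, rcond=None)
intercept_scaled, coef_scaled = beta[0], beta[1:]

# Convert back: y = b0 + sum_j b_j * (x_j - m_j) / s_j
#                 = (b0 - sum_j b_j * m_j / s_j) + sum_j (b_j / s_j) * x_j
coef_raw = coef_scaled / scale
intercept_raw = intercept_scaled - np.dot(coef_raw, center)

pred_scaled = intercept_scaled + scaled @ coef_scaled
pred_raw = intercept_raw + raw @ coef_raw
print("coefficients on the raw scale:", np.round(coef_raw, 3))
print("intercept on the raw scale:", round(float(intercept_raw), 3))
```

Both forms give identical predictions, so the fit can be written directly in the original units. Newer scikit-learn releases also appear to accept a `scaled` argument on `load_diabetes` for loading the unscaled features; I would check that against the installed version before relying on it.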
