I have a dataset of 427 days of daily temperature data. I am training an ARIMA model on the first 360 days and trying to predict the remaining 67 days, then comparing the results. When I generate predictions for the test data, I just get a straight line. Am I doing something wrong?
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(train['max'], order=(1, 1, 2))
results = model.fit()
results.summary()

# forecast over the test period
start = len(train)
end = len(train) + len(test) - 1
predictions = pd.DataFrame()
# the new-style ARIMA returns predictions on the original scale,
# so the old typ='levels' argument is no longer needed
predictions['pred'] = results.predict(start=start, end=end).rename('ARIMA(1,1,2) Predictions')
Your ARIMA model uses the last observations to make a prediction, which means the prediction for t(361) is based on the true values at t(360) and t(359). The prediction for t(362) is based on the already predicted t(361) and the true t(360). The prediction for t(363) is based on two predicted values, t(362) and t(361). Each prediction is based on previous predictions, which means forecasting errors propagate into the new predictions. The prediction for t(400) is based on predictions that are based on predictions that are based on predictions, and so on. If your prediction deviates by only 1% at each time step, the forecasting error grows larger and larger the more steps you try to predict. In such cases the predictions often flatten into a straight line at some point.
If you use an ARIMA(p, d, q) model, the moving-average part only directly informs the first q forecast steps; after that, the forecast decays toward the mean of the (differenced) series, which is why long-horizon forecasts flatten into a line. Predicting 67 steps into the future is a very far horizon, and ARIMA is most likely not able to do that usefully. Instead, try to predict only the next one or a few time steps, folding each new true observation back into the model as it arrives.
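A minimal sketch of that rolling approach, assuming train and test are the frames from your code (with a 'max' column) and that test's index continues train's:

from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(train['max'], order=(1, 1, 2))
results = model.fit()

preds = []
for i in range(len(test)):
    # one-step-ahead forecast from everything observed so far
    preds.append(results.forecast(steps=1).iloc[0])
    # fold the true observation back in (a state update, no refit)
    results = results.append(test['max'].iloc[i:i + 1], refit=False)

Each forecast now leans on real observations rather than on a chain of earlier forecasts, so the flat-line effect disappears.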
I'm trying to do demand sensing for a dataset. I have 157 weeks of data (~3 years) and I have to predict the next month (8 weeks). Within the training dataset, I'm using 149 weeks for training and the last 8 weeks for validation to pick the best hyperparameters. But I have observed a huge gap in WMAPE between the validation and prediction results, and I'm not sure if I'm overfitting, because the validation WMAPE is good.
The aim is to get the best parameters such that the prediction will be good for the last month (the last 4 or 8 weeks).
Note: there is a gap between training and prediction, i.e. if the training data runs until 31st Jan '22, prediction starts from 1st Mar '22.
How can I overcome this problem?
Details: dataset: time series; algorithm: TCNModel (darts library); language: Python.
How you should split the data depends on a few factors:
If you have a seasonal influence over a year, you can take a complete year for validation and two years for training.
If your data can be predicted from the last n weeks, you can take some random n-week splits from the full dataset.
What's more important here is that I think there's an error in your training pipeline: you should slide an n-week window over the full training data and always predict the next 8 weeks from every n-week sequence, as in the sketch below.
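A minimal sketch of that setup, assuming series is a 1-D array of weekly demand; the function and the 24-week input length are illustrative, not darts API:

import numpy as np

def sliding_windows(series, n_input=24, n_output=8, stride=1):
    # pair every n_input-week history window with the n_output weeks after it
    X, y = [], []
    for start in range(0, len(series) - n_input - n_output + 1, stride):
        mid = start + n_input
        X.append(series[start:mid])            # model input
        y.append(series[mid:mid + n_output])   # the next 8 weeks as target
    return np.array(X), np.array(y)

In darts itself, TCNModel's input_chunk_length and output_chunk_length parameters play these two roles.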
I've built an auto_arima model and it has been trained.
I'm at the stage where I need to use the model to make some predictions
(the model was trained using 5 years of data and I need to forecast the next year).
The initial dataset was a simple time series:

Date          Volume
01-01-1995    345
...
31-12-2000    4783
Steps so far:
import numpy as np
import pandas as pd
from pmdarima import auto_arima
from sklearn.metrics import mean_absolute_error, mean_squared_error

df_train = df[df.Date < "2019"]
df_test = df[df.Date >= "2019"]

exogenous_features = ["Consumption_mean_lag30", "Consumption_std_lag30",
                      "Consumption_mean_lag182", "Consumption_std_lag182",
                      "Consumption_mean_lag365", "Consumption_std_lag365",
                      "month", "week", "day", "day_of_week"]

model = auto_arima(df_train['Volume'], exogenous=df_train[exogenous_features],
                   trace=True, error_action="ignore", suppress_warnings=True)
model.fit(df_train['Volume'], exogenous=df_train[exogenous_features])

forecast = model.predict(n_periods=len(df_test), exogenous=df_test[exogenous_features])
df_test["Forecast_ARIMAX"] = forecast
df_test[["Consumption", "Forecast_ARIMAX"]].plot(figsize=(14, 7))

print("RMSE of Auto ARIMAX:", np.sqrt(mean_squared_error(df_test.Consumption, df_test.Forecast_ARIMAX)))
print("\nMAE of Auto ARIMAX:", mean_absolute_error(df_test.Consumption, df_test.Forecast_ARIMAX))
The above gives me a satisfactory model.
When I try to predict using the following:
model.predict(n_periods=365)
I keep getting the error:
ValueError: When an ARIMA is fit with an X array, it must also be provided one for predicting or updating observations.
I have tried to troubleshoot everything but can't seem to work out how to provide an 'X array', or what the error is telling me.
If anyone has any insights or can help in any way I'd really appreciate it.
Thanks.
You trained your model with exogenous data, so you have your time series plus additional data. When you make predictions, you have to provide that additional, exogenous data for the time frame you are trying to predict.
This is the correct way to generate predictions, by providing the exogenous data:
forecast = model.predict(n_periods=len(df_test), exogenous=df_test[exogenous_features])
Here you are missing the exogenous data, hence the error (the X array should contain your exogenous_features):
model.predict(n_periods=365)
The point of exogenous data is that it may improve your model significantly, but you need to know this data in advance to make predictions.
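A minimal sketch of what providing that X array could look like, assuming daily data and that df_test has a DatetimeIndex. The calendar features are known in advance; the Consumption_*_lag* features are not, which is exactly the catch:

import pandas as pd

# hypothetical future index: the 365 days after the last observed date
future_idx = pd.date_range(df_test.index[-1] + pd.Timedelta(days=1),
                           periods=365, freq="D")

future_exog = pd.DataFrame(index=future_idx)
# calendar features can always be computed for future dates
future_exog["month"] = future_idx.month
future_exog["week"] = future_idx.isocalendar().week.astype(int)
future_exog["day"] = future_idx.day
future_exog["day_of_week"] = future_idx.dayofweek

# the lag-based consumption features must be filled from known or
# already-forecast consumption before this call can succeed:
# forecast = model.predict(n_periods=365, exogenous=future_exog[exogenous_features])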
I'm making a model to predict the irradiance value on a solar field. The thing is that my model, despite being very simple (code below), performs quite well. The problem is that, for some reason, it predicts on a different scale: the outputs almost always have lower values but follow the same trend. I have attached plots comparing the model output with the real data on both the train and test sets, and linked the dataset.
Some details: the dataset has a total of 24 columns, corresponding to 24 pyranometers, which are the instruments that measure solar irradiance. The model has been trained on just the first one for simplicity, so with more data we could achieve better performance. Also, I'm processing my data into windows of 15 steps back in time with a prediction window of 20 steps forward.
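Roughly, the windowing looks like this (an illustrative sketch, not my exact preprocessing code):

import numpy as np

LAG, HORIZON = 15, 20  # steps back, steps forward

def make_windows(series):
    X, y = [], []
    for t in range(LAG, len(series) - HORIZON + 1):
        X.append(series[t - LAG:t])      # 15 past steps as input
        y.append(series[t:t + HORIZON])  # next 20 steps as target
    # shapes: (n_samples, LAG, 1) and (n_samples, HORIZON)
    return np.asarray(X)[..., None], np.asarray(y)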
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

inputs = Input((LAG, 1))  # LAG is the number of steps taken backward
hidden = LSTM(32, return_sequences=True)(inputs)  # one output per input step
output = Dense(1, activation='linear')(hidden)
model = Model(inputs, output)
[Dataset link]
[Plot: model output vs. real data, train set]
[Plot: model output vs. real data, test set]
I have 3 months of data (each row corresponding to one day) and I want to perform a multivariate time series analysis on it.
The available columns are:
Date Capacity_booked Total_Bookings Total_Searches %Variation
Each date has one entry in the dataset, and I want to fit a multivariate time series model to forecast the other variables as well.
So far this is my attempt, put together from reading articles:
import pandas as pd
from statsmodels.tsa.vector_ar.vecm import coint_johansen
from statsmodels.tsa.vector_ar.var_model import VAR

df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
data = df.drop(['Date'], axis=1)
data.index = df.Date

# Johansen cointegration test
johan_test_temp = data
coint_johansen(johan_test_temp, -1, 1).eig

# creating the train and validation set
train = data[:int(0.8 * len(data))]
valid = data[int(0.8 * len(data)):]

model = VAR(endog=train, freq=train.index.inferred_freq)
model_fit = model.fit()

# make predictions on the validation set: forecast() needs the last
# k_ar observations as its starting point
prediction = model_fit.forecast(train.values[-model_fit.k_ar:], steps=len(valid))
pred = pd.DataFrame(prediction, index=valid.index, columns=data.columns)
I now have a validation set and a prediction set. However, the predictions are far worse than expected.
[Plots of the dataset: % Variation, Capacity_Booked, Total Bookings and Searches]
[Output screenshots: prediction dataframe and validation dataframe]
As you can see, the predictions are way off from what is expected. Can anyone advise a way to improve the accuracy? Also, if I fit the model on the whole data and then print the forecasts, it doesn't take into account that a new month has started and predict accordingly. How can that be incorporated here? Any help is appreciated.
EDIT
Link to the dataset - Dataset
Thanks
One way to improve your accuracy is to look at the autocorrelation of each variable, as suggested in the VAR documentation page:
https://www.statsmodels.org/dev/vector_ar.html
The larger the autocorrelation at a specific lag, the more useful that lag will be to the process.
Another good idea is to look at the AIC and BIC criteria to assess your model (the link above has an example of usage). Smaller values indicate a higher probability that you have found the true model.
This way, you can vary the order of your autoregressive model and find the one that yields the lowest AIC and BIC, analyzed together. If the AIC indicates the best model has a lag of 3 and the BIC indicates the best model has a lag of 5, you should analyze lags 3, 4 and 5 to see which gives the best results, as in the sketch below.
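A minimal sketch of that lag search with statsmodels, assuming train is the DataFrame from the question; maxlags=8 is an arbitrary illustrative cap:

from statsmodels.tsa.vector_ar.var_model import VAR

model = VAR(endog=train, freq=train.index.inferred_freq)

# compare information criteria across candidate lag orders
print(model.select_order(maxlags=8).summary())  # AIC, BIC, FPE, HQIC per lag

# or let statsmodels pick the lag that minimizes AIC directly
model_fit = model.fit(maxlags=8, ic='aic')
print(model_fit.k_ar)  # the selected order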
The best scenario would be to have more data (3 months is not much), but you can try these approaches to see if they help.
I am currently trying to implement both direct and recursive multi-step forecasting strategies using the statsmodels ARIMA library, which has raised a few questions.
A recursive multi-step forecasting strategy would be training a one-step model, predicting the next value, appending the predicted value onto the end of the series fed into the forecast method, and repeating. This is my recursive implementation:
def arima_forecast_recursive(history, horizon=1, config=None):
    # make a list so we can append elements as we go
    history = history.tolist()
    predictions = []
    for _ in range(horizon):
        model = ARIMA(history, order=config)
        model_fit = model.fit(trend='nc', disp=0)
        # one-step-ahead forecast from everything seen so far
        fc, _, _ = model_fit.forecast(steps=1)
        predictions.append(fc[0])
        # feed the prediction back in as a pseudo-observation
        history.append(fc[0])
    return np.array(predictions)
def walk_forward_validation(dataframe, config=None):
    n_train = 52  # give a minimum of 2 forecasting periods to capture any seasonality
    n_test = 26   # test set should be the size of one forecasting horizon
    n_records = len(dataframe)
    tuple_list = []
    for i in range(n_train, n_records):
        # create the train-test split
        train, test = dataframe[0:i], dataframe[i:i + n_test]
        # test set is smaller than the forecasting horizon, so stop here
        if len(test) < n_test:
            break
        yhat = arima_forecast_recursive(train, n_test, config)
        results = smape3(test, yhat)
        tuple_list.append(results)
    return tuple_list
Similarly, to perform a direct strategy I would just fit my model on the available training data and use it to predict the total multi-step forecast at once. I am not sure how to achieve this properly using the statsmodels library.
My attempt (which does produce results) is below:
def walk_forward_validation(dataframe, config=None):
    # this currently implements a direct forecasting strategy
    n_train = 52  # give a minimum of 2 forecasting periods to capture any seasonality
    n_test = 26   # test set should be the size of one forecasting horizon
    n_records = len(dataframe)
    tuple_list = []
    for i in range(n_train, n_records):
        # create the train-test split
        train, test = dataframe[0:i], dataframe[i:i + n_test]
        # test set is smaller than the forecasting horizon, so stop here
        if len(test) < n_test:
            break
        yhat = arima_forecast_direct(train, n_test, config)
        results = smape3(test, yhat)
        tuple_list.append(results)
    return tuple_list
def arima_forecast_direct(history, horizon=1, config=None):
    model = ARIMA(history, order=config)
    model_fit = model.fit(trend='nc', disp=0)
    # forecast the whole horizon at once; [0] is the point-forecast array
    return model_fit.forecast(steps=horizon)[0]
What confuses me specifically is whether the model should be fit once for all predictions, or fit multiple times to make each single prediction within the multi-step forecast. Souhaib Ben Taieb's doctoral thesis (page 35, paragraph 3) presents the direct strategy as estimating H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one. As shown above, my current implementation only fits one model.
What I do not understand is how, if I call the ARIMA.fit() method multiple times on the same training data, I will get a fit that differs beyond the expected normal stochastic variation.
My final question is with regard to optimisation. Using a method such as walk-forward validation gives me statistically very significant results, but for many time series it is very computationally expensive. Both of the above implementations are already called using joblib's parallel loop execution, which significantly reduced the runtime on my laptop. However, I would like to know if there is anything that can be done to make them even more efficient. When running these methods for ~2000 separate time series (~500,000 data points total across all series), the runtime is 10 hours. I have profiled the code, and most of the execution time is spent in the statsmodels library, which is fine, but there is a discrepancy between the runtime of the walk_forward_validation() method and ARIMA.fit(). This is expected, as the walk_forward_validation() method obviously does things other than just call the fit method, but if anything in it can be changed to speed up execution time, please let me know.
The idea of this code is to find an optimal ARIMA order per time series, as it isn't feasible to investigate 2000 time series individually; as such, the walk_forward_validation() method is called 27 times per time series, so roughly 54,000 times overall. Therefore any performance saving found within this method will have an impact, no matter how small it is.
Normally, ARIMA can only perform recursive forecasting, not direct forecasting. There might be some research on variations of ARIMA for direct forecasting, but they wouldn't be implemented in statsmodels. In statsmodels (or in R's auto.arima()), when you set a horizon h > 1, it simply performs a recursive forecast to get there.
As far as I know, none of the standard forecasting libraries have direct forecasting implemented yet; you're going to have to code it yourself.
Taken from Souhaib Ben Taieb's doctoral thesis (page 35, paragraph 3), the direct strategy estimates H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one.
I haven't read Ben Taieb's thesis, but from his paper "Machine Learning Strategies for Time Series Forecasting", for direct forecasting there is only one model per value of H. So for H = 26, there will be only one model, trained to predict step 26 directly. There will be H models if you need a forecast for every step between 1 and H, but for a single H there is only one model. A sketch of the H-model variant is below.
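A minimal sketch of the direct strategy, assuming series is a 1-D array; scikit-learn's LinearRegression stands in for any one-step regressor, and both helper names are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_direct_models(series, n_lags=52, horizon=26):
    # train one separate model per horizon step h: lag window -> y[t + h]
    models = []
    for h in range(1, horizon + 1):
        X = [series[t - n_lags:t] for t in range(n_lags, len(series) - h + 1)]
        y = [series[t + h - 1] for t in range(n_lags, len(series) - h + 1)]
        models.append(LinearRegression().fit(np.array(X), np.array(y)))
    return models

def predict_direct(models, series, n_lags=52):
    window = np.asarray(series[-n_lags:]).reshape(1, -1)
    # no prediction is ever fed back in: each horizon step has its own model
    return np.array([m.predict(window)[0] for m in models])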