Python - How to use fitted ARIMA model on unseen data

I am using statsmodels.tsa.arima.model.ARIMA to fit an ARIMA model on a timeseries.
How can I use this model to make predictions on unseen data? It seems that the predict and forecast functions can only make predictions from the last seen data in the training set the model was fitted to.
So, for instance, I want to use a static model to keep making predictions into the future. This is for the purpose of real-time multi-step forecasting where re-fitting the model isn't an option.
E.g.,
Say we have a dataset size of 10,000 split into train and test (70/30).
The last reading we train on is 7,000
Is it possible to, say, use the trained model and pass in readings 6997 to 7000 to predict 7001 to 7004?
And then in the following iteration pass it 6998 to 7001 to predict 7002 to 7005, using the same model?
This type of prediction is common in ML workflows, but it is not apparent to me how to perform it with ARIMA.
The predict and forecast functions only ask for index parameters; there is no parameter for fresh data.

You can easily do this with the predict method, which was created for exactly this purpose. You first train your ARIMA model on all of your data (without splits). When generating forecasts you use the predict method and set the start and end parameters, e.g. when you want to predict 7001 to 7004, like this:
model.predict(start=7000, end=7004)
The predict method will use all the data available up to the start point (including that one) and then make a prediction. That way you do not have to train your model again and again as new data comes in.
The start/end parameters also work with datetimes or strings (like "2021-06-30" to "2021-07-31").
https://www.statsmodels.org/dev/generated/statsmodels.tsa.arima.model.ARIMAResults.predict.html
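A minimal sketch of that workflow; the series, the ARIMA order and the index positions below are illustrative assumptions, not taken from the question:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# hypothetical series of 10,000 observations (a random walk stands in for the real data)
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(size=10000)))

# fit once on the full history
res = ARIMA(y, order=(1, 1, 1)).fit()

# in-sample one-step-ahead predictions for positions 7001..7004; no re-fitting needed
print(res.predict(start=7001, end=7004))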

Related

How do I use my supervised ML model with unsupervised data?

I made decision tree and logistic regression models. I am satisfied with the results. How do I use them on unsupervised data?
Also: will I always need to apply StandardScaler to new data?
While your question is too broad for SO, I still want to give some short advice:
You need supervised data only for the training stage of your model. Once you have a trained model, you can make predictions on unsupervised data (i.e. data that has no labels/targets) and the model returns the predicted labels. Usually you do this with the predict method.
Important point: to use the predict method, the data must be passed to the model in the same form as during training - the same set of features and the same number of features (excluding labels/targets, of course).
The same goes for preprocessing - if you used StandardScaler on the training data you must use it on the new data too, and it must be the SAME StandardScaler (i.e. call the transform method of the scaler that was already fitted on the training data).
The philosophy of using StandardScaler or some other normalisation, in short: use it for linear models (including your logistic regression). Read about it here, for example: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
But for trees it is not necessary. Example: https://towardsdatascience.com/do-decision-trees-need-feature-scaling-97809eaa60c6
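A small sketch of those points; the model, data shapes and feature count below are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# hypothetical labeled training data and new, unlabeled data with the SAME 4 features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] > 0).astype(int)
X_new = rng.normal(size=(10, 4))

# fit the scaler on the training data only
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)

# reuse the SAME fitted scaler (transform, not fit_transform) before predicting
print(clf.predict(scaler.transform(X_new)))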

Understanding Cross Validation for Machine learning

Is the following correct about cross validation?:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning. Once the most optimal hyperparameters have been chosen the test data is applied to the model to give a result which is then compared to other models that have undergone a similar process but with different combinations of training data sets. The model with the best results on the test data is then chosen.
I don't think it is correct. You wrote:
Once the model is trained the ‘left out’ training data is used to perform hyperparameter tuning
You tune the model by picking (manually, or using a method like grid search or random search) a set of the model's hyperparameters (parameters whose values you set before you even fit the model to the data). Then, for the selected set of hyperparameter values, you calculate the validation set error using cross-validation.
So it should be like this:
The training data is divided into different groups, all but one of the training data sets is used for training the model. Once the model is trained the ‘left out’ training data is used to ...
... calculate the error. At the end of the cross-validation you will have k errors, each calculated on one of the k left-out sets. You then take the mean of these k errors, which gives you a single value - the validation set error.
If you have n sets of hyperparameters, you simply repeat the procedure n times, which gives you n validation set errors. You then pick the set that gave you the smallest validation error.
At the end, you typically calculate the test set error to see what the model's performance on unseen data is (which simulates putting the model into production) and whether there is a difference between the test set error and the validation set error. A significant difference indicates over-fitting.
Just to add something on cross-validation itself: the reason we use k-fold CV or LOOCV is that it gives a good estimate of the test set error. That means that when I change the hyperparameters and the validation set error drops, I know I have really improved the model rather than being lucky and simply fitting the model better to the training set.
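A short sketch of that procedure using scikit-learn's GridSearchCV; the model, hyperparameter grid and dataset are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# hold out a test set that plays no part in tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n sets of hyperparameters -> n mean validation errors; the best set is kept
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print("best hyperparameters:", grid.best_params_)
print("mean validation accuracy:", grid.best_score_)
print("test accuracy:", grid.score(X_test, y_test))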

Use same data for test and validation

Hey, I am training a CNN model and was wondering what will happen if I use the same data for validation and test.
Does the model train on the validation data as well? (Does my model see the validation data?) Or are just the error and accuracy calculated and taken into account during training?
You use your validation set to tune your model. That means you don't train on this data, but the model takes it into account; for example, you use it to tune the model's hyperparameters.
In order to have a good evaluation, the test set should be data that is totally unknown to the model.
Take a look at this article for more information; the parts most relevant to your question are:
A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning the model's hyperparameters.
The validation dataset is different from the test dataset that is also held back from the training of the model, but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.
If you use the same set for validation and test, your model may overfit (since it has seen the test data before the final test stage).
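A minimal sketch of keeping the three sets separate; the data shapes and the two-step split are illustrative assumptions:

import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical image data
X = np.random.rand(1000, 32, 32, 3)
y = np.random.randint(0, 10, size=1000)

# first split off the training set, then split the remainder into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# model.fit(X_train, y_train, validation_data=(X_val, y_val))  # tune on the validation set
# model.evaluate(X_test, y_test)                               # final, unbiased estimate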

Will it lead to Overfitting / Curse of Dimensionality

Dataset contains:
15000 Observations/Rows
3000 Features/Columns
Can I train a Machine Learning model on this dataset?
Yes, you can apply an ML model, but before that you should understand your problem statement and the meaning of all the features available in the dataset. If the dataset is big, try clustering it or working with a smaller sample first to analyze what your data is telling you.
That is why population and sampling come into practical use.
You should check whether the accuracy on the training set and the test set are comparable; if not, your model is memorizing instead of learning, and this is where regularization in machine learning comes into the picture.
No one can answer this based on the information you provided. The simplest approach is to run a sanity check in the form of cross validation. Does your model perform well on unseen data? If it does, it is probably not overfit. If it does not, check if the model is performing well on the training data. A model that performs well on training data but not on unseen data is the definition of a model being overfit.
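A sketch of that sanity check using cross_validate; the dataset below is a small synthetic stand-in, not the 15,000 x 3,000 set from the question:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# small synthetic stand-in for the real dataset
X, y = make_classification(n_samples=2000, n_features=300, n_informative=20, random_state=0)

scores = cross_validate(RandomForestClassifier(n_estimators=100, random_state=0),
                        X, y, cv=3, return_train_score=True)
print("train accuracy:     ", scores["train_score"].mean())
print("validation accuracy:", scores["test_score"].mean())
# a large gap between the two suggests over-fitting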

How do I train the Convolutional Neural Network with negative and positive elements as the input of the first layer?

I am just curious why I have to scale the testing set using the testing set itself, and not using the training set, when I'm training a model such as a CNN.
Or am I wrong, and I still have to scale it using the training set?
Also, can I train a CNN on a dataset that contains both positive and negative elements as the input to the first layer?
Any answers with references will be really appreciated.
We usually have 3 types of datasets for getting a model trained,
Training Dataset
Validation Dataset
Test Dataset
Training Dataset
This should be an evenly distributed dataset which covers all varieties of data. If you train for too many epochs, the model will get used to the training dataset and will only give proper predictions on the training dataset; this is called overfitting. The only way to keep a check on overfitting is to have other datasets which the model has never been trained on.
Validation Dataset
This can be used to fine-tune the model's hyperparameters.
Test Dataset
This is the dataset which the model has not been trained on and which has never been part of deciding the hyperparameters; it shows how the model really performs.
If scaling or normalization is used, the testing set should be transformed with the same parameters that were computed during training.
A good answer that links to that: https://datascience.stackexchange.com/questions/27615/should-we-apply-normalization-to-test-data-as-well
Also, some models tend to require normalization and others do not.
Neural network architectures are usually robust and might not need normalization.
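A small sketch of the point above; the array shapes are illustrative assumptions. The scaling statistics are computed on the training set and reused on the test set, and the standardized values can be negative, which a CNN handles without problems:

import numpy as np

# hypothetical image arrays
X_train = np.random.randint(0, 256, size=(1000, 32, 32, 3)).astype("float32")
X_test = np.random.randint(0, 256, size=(200, 32, 32, 3)).astype("float32")

# compute scaling statistics on the training set only...
mean, std = X_train.mean(), X_train.std()

# ...and apply the SAME statistics to both sets (values may become negative)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std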
Scaling data depends on the requirement as well as the data you have. Test data gets scaled with test data only, because the test data doesn't have the target variable (one less feature in the test data). If we scale our training data together with new test data, our model will not be able to correlate with any target variable and will thus fail to learn. So the key difference is the existence of the target variable.
