Python/Pandas - confusion around ARIMA forecasting to get simple predictions

Trying to wrap my head around how to implement an ARIMA model to produce (arguably) simple forecasts. Essentially, what I'm looking to do is forecast this year's bookings up to the end of the year and export the result as a CSV, looking something like this:
date bookings
2017-01-01 438
2017-01-02 167
...
2017-12-31 45
2018-01-01 748
...
2018-11-29 223
2018-11-30 98
...
2018-12-30 73
2018-12-31 100
Where anything after today's date (28/11/18) is a forecasted value.
What I've tried to do:
This gives me my dataset, which is basically two columns: daily dates for the whole of 2017, and bookings:
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA  # needed for the modelling below
# from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
df = pd.read_csv('data.csv',names = ["date","bookings"],index_col=0)
df.index = pd.to_datetime(df.index)
This is the 'modelling' bit:
X = df.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
for t in range(len(test)):
    model = ARIMA(history, order=(1,1,0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    # print('predicted=%f, expected=%f' % (yhat, obs))
#error = mean_squared_error(test, predictions)
#print(error)
#print('Test MSE: %.3f' % error)
# plot
plt.figure(num=None, figsize=(15, 8))
plt.plot(test)
plt.plot(predictions, color='red')
plt.show()
Exporting results to a csv:
df_forecast = pd.DataFrame(predictions)
df_test = pd.DataFrame(test)
result = pd.merge(df_test, df_forecast, left_index=True, right_index=True)
result.rename(columns = {'0_x': 'Test', '0_y': 'Forecast'}, inplace=True)
result.to_csv('forecast_vs_test.csv')  # write the comparison out, as described above (file name is a placeholder)
The trouble I'm having is:
Understanding the train/test subsets. Correct me if I'm wrong but the Train set is used to train the model and produce the 'predictions' data and then the Test is there to compare the predictions against the test?
2017 data looked good, but how do I implement it on 2018 data? How do I get the Train/Test sets? Do I even need it?
What I think I need to do:
Grab my bookings dataset of 2017 and 2018 data from my database
Split it by 2017 and 2018
Produce some forecasts on 2018
Append this 2018+forecast data to 2017 and export as csv
The how-to and why is the problem I'm having.
Any help would be much appreciated

Here are some thoughts:
Understanding the train/test subsets. Correct me if I'm wrong but the Train set is used to train the model and produce the 'predictions' data and then the Test is there to compare the predictions against the test?
Yes, that is correct. The idea is the same as for any machine learning model: the data is split into train/test sets, a model is fit on the train data, and the test set is used to compare the model's predictions against the real data using some error metric. However, as you are dealing with time series data, the train/test split must respect the time ordering, as you already do.
2017 data looked good, but how do I implement it on 2018 data? How do I get the Train/Test sets? Do I even need it?
Do you actually have a CSV with the 2018 data? To split it into train/test, do the same as you did for the 2017 data, i.e. keep everything up to some size as train and leave the end to test your predictions: train, test = X[0:size], X[size:len(X)]. However, if what you want is a prediction from today's date onwards, why not use all historical data as input to the model and use that to forecast?
What I think I need to do
Split it by 2017 and 2018
Why would you want to split it? Simply feed your ARIMA model all your data as a single time series sequence (i.e. append both years) and use the last size samples as the test set. Bear in mind that the estimates get better the larger the sample size. Once you've validated the model's performance, use it to predict from today onwards, as in the sketch below.
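Here is a minimal sketch of that last step, assuming the same data.csv layout and the (1,1,0) order from your code (the order, file names and end date are placeholders to adjust, not tested choices):
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA  # same (older) statsmodels API as above

# full daily history: 2017 plus 2018-to-date, as in the question
df = pd.read_csv('data.csv', names=["date", "bookings"], index_col=0)
df.index = pd.to_datetime(df.index)

# fit on the whole history instead of a train split
model_fit = ARIMA(df['bookings'], order=(1, 1, 0)).fit(disp=0)

# forecast every remaining day of the year
future_index = pd.date_range(df.index[-1] + pd.Timedelta(days=1), '2018-12-31')
forecast, stderr, conf_int = model_fit.forecast(steps=len(future_index))

# append the forecasted days to the history and export
df_future = pd.DataFrame({'bookings': forecast}, index=future_index)
pd.concat([df, df_future]).to_csv('bookings_with_forecast.csv')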

Related

How to use .predict() in a Linear Regression model?

I'm trying to predict what a 15-minute delay in flight departure does to the flight's arrival time. I have thousands of rows as well as several columns in a DF. Two of these columns are dep_delay and arr_delay for departure delay and arrival delay. I have built a simple LinearRegression model:
y = nyc['dep_delay'].values.reshape((-1, 1))
arr_dep_model = LinearRegression().fit(y, nyc['arr_delay'])
And now I'm trying to find out the predicted arrival delay if the flights departure was delayed 15 minutes. How would I use the model above to predict what the arrival delay would be?
My first thought was to use a for loop / if statement, but then I came across .predict() and now I'm even more confused. Does .predict work like a boolean, where I would use "if departure delay is equal to 15, then arrival delay equals y"? Or is it something like:
arr_dep_model.predict(y)?
When working with LinearRegression models in sklearn, you perform inference with the predict() function, and you have to ensure the input you pass to it has the correct shape (the same number of features as the training data). You can learn more about the proper use of the predict function in the official documentation.
arr_dep_model.predict(yourInput)
This line of code outputs the value the model predicts for the corresponding input. You can place it inside a for loop to traverse a set of input values; it depends on the needs of your project and the data you are working with. For the question's 15-minute case, see the sketch below.
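Here is a self-contained sketch of the 15-minute case (the delay numbers are made up for illustration; only the shapes matter):
import numpy as np
from sklearn.linear_model import LinearRegression

# toy stand-ins for nyc['dep_delay'] / nyc['arr_delay'] (hypothetical values)
dep_delay = np.array([0, 5, 10, 20, 30]).reshape(-1, 1)
arr_delay = np.array([-2, 4, 9, 22, 35])
arr_dep_model = LinearRegression().fit(dep_delay, arr_delay)

# predict the arrival delay for a 15-minute departure delay;
# the input must be 2-D, (n_samples, n_features), just like the training data
print(arr_dep_model.predict([[15]]))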
Check the code below for an example:
import pandas as pd
import random
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'x1':random.choices(range(0, 100), k=10), 'x2':random.choices(range(0, 100), k=10)})
df['y'] = df['x2'] * .5
X_train = df[['x1','x2']][:-3].values #Training on top 7 rows
y_train = df['y'][:-3].values #Training on top 7 rows
X_test = df[['x1','x2']][-3:].values # Values on which prediction will happen - bottom 3 rows
regr = LinearRegression()
regr.fit(X_train, y_train)
regr.predict(X_test)
If you look at X_test, you'll notice that the data on which prediction happens has the same shape (number of columns) as X_train: both have the two columns ['x1','x2']. Each is converted to an array when .values is used. You can create your own data (a 2-column dataframe for the current example) and use it for prediction (because the 3rd column is the one to be predicted).
The output will be three values, one predicted for each of the three bottom rows.

How to create a model on time series data and update it?

I have a large dataset of 23k rows. The data looks something like this:
import pandas as pd
d = {'Date': ["1-1-2020", "1-1-2020", "1-2-2020", "1-2-2020"], 'Stock_id': [5, 41, 5, 41],
     "last_price": [230, 8, 241, 9], "price": [241, 9, 240, 8.5]}
df = pd.DataFrame(data=d)
Date Stock_id last_price price
0 1-1-2020 5 230 241.0
1 1-1-2020 41 8 9.0
2 1-2-2020 5 241 240.0
3 1-2-2020 41 9 8.5
Note that the data includes many stocks on many different dates. How can I create a model that uses features such as last_price and stock id to predict the next-day price, and that uses the old data to re-train itself?
Now, this is the best I could do. I used LinearRegression, but advice for any other model works too.
X = df[['Stock_id', 'last_price']]
y = df[['price']]
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import linear_model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
result = pd.DataFrame({'Actual': y_test['price'].values, 'Predicted': y_pred.ravel()})
Index Actual Predicted
487 45 32
4154 420 512
Is there a way to train the model on the first 3000 rows, have it make a prediction for, say, 12-11-2020, and then add the 12-11-2020 info to make the prediction for 12-12-2020, and so on?
I was hoping to get something like this.
Date Actual Predicted
12-11-2020 45 32
12-11-2020 420 512
12-12-2020 43 34
12-12-2020 423 513
I don't think having the id in your training dataset is appropriate, since comparing ids gives no usable information and may lead to a badly calculated linear function for your model. The id just signifies which specific stock you are talking about and is constant for that stock across the whole dataset. The value of Stock_id also has no meaning that can be used for comparing stocks: having Stock_id = 1 and Stock_id = 2 doesn't mean those two are closer together than Stock_id = 1 and Stock_id = 100; they are just names. So I think you should split your original dataset based on Stock_id and only include last_price in each of these new training datasets (X). You can do that in several ways, one of them being pandas' groupby function:
grouped = df.groupby(df.Stock_id)
stock_1= grouped.get_group(1)
After that, you can use a for loop over the unique values of your Stock_id column to get all the ids and their dataframes. Then you define a regression model for each of these new datasets and use the fit method to train it, as sketched below.
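A minimal sketch of that loop, assuming the df from the question with last_price as the only feature:
from sklearn.linear_model import LinearRegression

# one fitted model per stock, keyed by Stock_id
models = {}
for stock_id, group in df.groupby('Stock_id'):
    X = group[['last_price']].values
    y = group['price'].values
    models[stock_id] = LinearRegression().fit(X, y)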
To retrain or update your regression model: LinearRegression does not support partial fitting, so I think you need to call the fit method again each time you want to update the model. You can use the first N rows of each stock to fit the model, then predict the value for the next last_price, add the predicted value to the N rows, and re-fit the model on the extended dataset. However, if your model actually calculates a good line for your data, I don't think you will see much of a difference from adding new predictions to the training dataset.
Another option is to use SGDRegressor instead of LinearRegression, since it has a partial_fit() method that allows incremental training, letting you update your model on new data without re-training on the whole dataset. You can find the documentation for this model here. Also, this answer explains the difference between SGDRegressor and LinearRegression.
If you still want to use LinearRegression and retrain the model, I suggest you use batches of data for updating it, instead of retraining on each new predicted value. You can wait for your predicted values to reach a certain number, for example 10, then add these 10 new values to your training dataset and retrain the model just once. This answer explains 3 approaches to retraining a model which might be useful for you. A sketch of the incremental SGDRegressor option follows.
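Here is a minimal sketch of the incremental SGDRegressor option, on made-up prices for a single stock (the numbers and the 70-row warm-up are placeholders, not tuned choices):
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# hypothetical last_price -> next-day price pairs for one stock
X = np.arange(100, dtype=float).reshape(-1, 1)
y = X.ravel() * 1.01 + np.random.normal(scale=0.5, size=100)

scaler = StandardScaler().fit(X)  # SGD is sensitive to feature scale
model = SGDRegressor()

# warm up on the first 70 rows
model.partial_fit(scaler.transform(X[:70]), y[:70])

predictions = []
# walk forward: predict one day, then update the model with the observed value
for t in range(70, 100):
    x_t = scaler.transform(X[t:t + 1])
    predictions.append(model.predict(x_t)[0])
    model.partial_fit(x_t, y[t:t + 1])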

Fitting Regression Model to Time-Series Data

I am trying to fit a regression model to time series data in Python (basically to predict the trend). I have applied seasonal decomposition using statsmodels earlier, which decomposes the data into its three components, including the trend.
However, I would like to know how I can come up with the best fit to my data using statistical-based regressions (by defining any functions) and check the sum of squares to compare various models and select the best one which fits my data. I should mention that I am not looking for learning-based regressions which rely on training/testing data.
I would appreciate if anyone can help me with this or even introduces a tutorial for this issue.
Since you mentioned:
I would like to know how I can come up with the best fit to my data using statistical-based regressions (by defining any functions) and check the sum of squares to compare various models and select the best one which fits my data. I should mention that I am not looking for learning-based regressions which rely on training/testing data.
Maybe an ARIMA (AutoRegressive Integrated Moving Average) model with a given setup (P,D,Q), which can learn on the history and then predict()/forecast(). Note that the data is split into train and test sets only for the sake of evaluation, using a walk-forward validation approach:
from pandas import read_csv
from datetime import datetime
from matplotlib import pyplot
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
# load dataset
def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')
series = read_csv('/content/shampoo.txt', header=0, index_col=0, parse_dates=True, squeeze=True, date_parser=parser)
series.index = series.index.to_period('M')
# split into train and test sets
X = series.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
# walk-forward validation
for t in range(len(test)):
    model = ARIMA(history, order=(5,1,0))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))
# evaluate forecasts
rmse = sqrt(mean_squared_error(test, predictions))
rmse_ = 'Test RMSE: %.3f' % rmse
# plot forecasts against actual outcomes
pyplot.plot(test, label='test')
pyplot.plot(predictions, color='red', label='predict')
pyplot.xlabel('Months')
pyplot.ylabel('Sale')
pyplot.title(f'ARIMA model performance with {rmse_}')
pyplot.legend()
pyplot.show()
I used the same library package you mentioned (version below); the output includes a Root Mean Square Error (RMSE) evaluation:
import statsmodels as sm
sm.__version__ # '0.10.2'
Please see these other posts (post1 & post2) for further info. Maybe you can add a trend line too.

Time Series Classification

you can access the data set at this link https://drive.google.com/file/d/0B9Hd-26lI95ZeVU5cDY0ZU5MTWs/view?usp=sharing
My task is to predict the price movement of a sector fund. How much it goes up or down doesn't really matter; I only want to know whether it's going up or down, so I define it as a classification problem.
Since this is time-series data, I ran into many problems. I have read articles about them, e.g. that I can't use k-fold cross-validation since the order of the data can't be ignored.
my code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from sklearn.linear_model import LinearRegression
from math import sqrt
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
lag1 = pd.read_csv('...', parse_dates=['Date'])  # local file path
# Trend: True if the price is going up, otherwise False
lag1['Trend'] = lag1.XLF > lag1.XLF.shift()
train_size = round(len(lag1)*0.50)
train = lag1[0:train_size]
test = lag1[train_size:]
variable_to_use= ['rGDP','interest_rate','private_auto_insurance','M2_money_supply','VXX']
y_train = train['Trend']
X_train = train[variable_to_use]
y_test = test['Trend']
X_test = test[variable_to_use]
#SVM Lag1
this_C = 1.0
clf = SVC(kernel = 'linear', C=this_C).fit(X_train, y_train)
print('XLF Lag1 dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
#Check prediction results
clf.predict(X_test)
First of all, is my method right here: first generating a column of True and False? I am afraid the machine can't understand this column if I simply feed it in. Should I first perform a regression, then compare the numeric results to generate a list of up/down movements?
The accuracy on the training set is very low: 0.58. Also, clf.predict(X_test) gives me an array of all Trues, and I don't know why.
I also don't know how the resulting accuracy is calculated. I think my current accuracy only counts the number of Trues and Falses while ignoring their order? Since this is time-series data, ignoring the order is not right and gives me no information about predicting price movement. Say I have 40 examples in the test set and got 20 Trues: I would get 50% accuracy. But I guess the Trues are not in the same positions as they appear in the ground-truth set. (Tell me if I am wrong.)
I am also considering using a gradient-boosted tree for the classification; would that be better?
Some preprocessing of this data would probably be helpful. Step one might go something like:
import numpy as np

df = pd.read_csv('YOURLOCALFILEPATH', header=0)
# more code than your method, but it labels rows as 0 or 1 and is easy to output to a new file for later reference
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df['compare'] = df['XLF'].shift(-1)
df['Label'] = np.where(df['XLF'] > df['compare'], 1, 0)
df.drop('compare', axis=1, inplace=True)
Step two can use one of sklearn's built-in scalers, such as the MinMaxScaler, to preprocess the data by scaling your feature inputs before feeding them into your model, e.g.:
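A sketch of what that could look like, assuming the X_train / X_test / y_train / y_test splits from the question (fitting the scaler on the training set only, so nothing from the test period leaks into training):
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

scaler = MinMaxScaler()
# fit on the training features only, then apply the same scaling to both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf = SVC(kernel='linear', C=1.0).fit(X_train_scaled, y_train)
print(clf.score(X_test_scaled, y_test))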

Unable to use Datetime data in SVM model

I have a dataframe to predict the energy consumption. The columns are Timestamp and Daily_KWH_System.
When used in the SVM model, I'm getting a ValueError as below:
ValueError: Unknown label type: array([ 0. , 127.2264855 , 80.74373643, ..., 7.67569729,
3.32998307, 2.08538807])
The dataset consists of energy consumption readings for every half hour from September to December.
Here's a sample dataset
Timestamp Daily_KWH_System
0 2016-09-07 19:47:07 148.978580
1 2016-09-07 19:47:07 104.084760
2 2016-09-07 19:47:07 111.850947
3 2016-09-07 19:47:07 8.421390
4 2016-12-15 02:48:07 13.778317
5 2016-12-15 02:48:07 0.637790
So far I have done:
Read the CSV
data = pd.read_csv('C:/Users/anagha/Documents/Python Scripts/Half_Ho.csv')
Indexing
data['Timestamp'] = pd.to_datetime(data['Timestamp'])
data.index = data['Timestamp']
del data['Timestamp']
data
Plot the graph
data.resample('D').mean().plot()
Splitting into Train and Test
from sklearn.utils import shuffle
# assume an 80/20 split of the data into train and test
size = int(len(data) * 0.8)
train, test = data[:size], data[size:]
test = shuffle(test)
train = shuffle(train)
trainData = train.drop('Daily_KWH_System' , axis=1).values
trainLabel = train.Daily_KWH_System.values
testData = test.drop('Daily_KWH_System' , axis=1).values
testLabel = test.Daily_KWH_System.values
SVM Model
from sklearn import svm
model = svm.SVC(kernel='linear', gamma=1)
model.fit(trainData,trainLabel)
model.score(trainData,trainLabel)
Predict Output
predicted= model.predict(testData)
print(predicted)
SVC is Support Vector Classification; using it will treat your labels categorically. It looks like you're actually trying to do regression (note your error, "unknown label type"). A good first step would be to check out SVR, sketched below. Alternatively, you could bin your values into classes, e.g. 0-10, 10-20, etc.:
sklearn SVR
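A minimal sketch of the SVR route, assuming the trainData / trainLabel / testData arrays from the question:
from sklearn import svm

# SVR has the same interface as SVC but accepts continuous targets
model = svm.SVR(kernel='linear')
model.fit(trainData, trainLabel)
predicted = model.predict(testData)
print(predicted)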
