Future prediction using time series data set with Tensorflow - python

I have almost 5 years of time series data, and I want to use it to forecast the next 2 years. How can I do this?
I have read many websites on this topic. I noticed that most examples only make predictions on the same data set that was used for training; they don't forecast into the future, such as the next 30 days. Is it possible to achieve this with TensorFlow? If so, how?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
dataset_train = pd.read_csv(r'C:\Users\Kavin\source\repos\SampleTensorFlow\SampleTensorFlow\data\traindataset.csv')
training_set = dataset_train.iloc[:, 1:2].values
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
X_train = []
y_train = []
for i in range(60, 2035):
    X_train.append(training_set_scaled[i-60:i, 0])
    y_train.append(training_set_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
regressor = Sequential()
regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1)))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.2))
regressor.add(Dense(units = 1))
regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')
regressor.fit(X_train, y_train, epochs = 100, batch_size = 32)
dataset_test = pd.read_csv(r'C:\Users\Kavin\source\repos\SampleTensorFlow\SampleTensorFlow\data\testdataset.csv')
result = dataset_test[['Date','Open']]
real_stock_price = dataset_test.iloc[:, 1:2].values
dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis = 0)
inputs = dataset_total[len(dataset_total) - len(dataset_test) - 60:].values
inputs = inputs.reshape(-1,1)
inputs = sc.transform(inputs)
X_test = []
for i in range(60, 76):
    X_test.append(inputs[i-60:i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
predicted_stock_price = regressor.predict(X_test)
predicted_stock_price = sc.inverse_transform(predicted_stock_price)
result['PredictedResult'] = pd.Series(predicted_stock_price.ravel(), index=result.index)
result.to_csv(r"C:\Users\Kavin\Downloads\PredictedStocks.csv", index=False)
ax = plt.gca()
result.plot(kind='line', x='Date', y='Open', color='red', label = 'Real Stock Price', ax=ax)
result.plot(kind='line', x='Date', y='PredictedResult', color='blue', label = 'Predicted Stock Price', ax=ax)
plt.show()

For any machine learning problem, you want to ask yourself: "What do I want to predict, and what data do I have?"
In your case, you want to predict values at an undefined time in the future; let's call that offset T.
We suppose that your current data is labelled, i.e. for each sample/row (x) you have a corresponding value (y). Let xt be the timestamp of your x data.
If you want to predict y at time xt + T, then you must feed your algorithm data such that for each sample x, the corresponding label is the y at time xt + T.
This way your algorithm will "learn" to predict the value of y at time xt + T from the data at time xt.
With Pandas, this can be achieved with shift.
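For example, a minimal sketch of building shifted labels with pandas (the frame and column names are made up):
import pandas as pd

# Hypothetical daily series; 'value' is the quantity to forecast
df = pd.DataFrame({'value': range(10)},
                  index=pd.date_range('2020-01-01', periods=10, freq='D'))

T = 3  # predict 3 steps ahead
df['label'] = df['value'].shift(-T)  # label = the value observed T steps later
df = df.dropna()  # the last T rows have no future value yet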

Time is mostly an abstraction here; it means nothing by itself, so it is better to think in terms of sequences. To predict the next, as-yet-unknown step of a sequence, give the DL model the correct input_shape, and pass the predict() method the same set of NEW features that you consider the basis for predicting the next moment... e.g. here or here - ED
Though I still think that an encoder-decoder seq2seq model gives decoded output ONLY if it was present in the past (before encoding), and only if the decoder's task of reconstructing features from the encoded data is feasible (it is not always possible to reconstruct something similar to what was encoded).
So I still consider the example in TF to be the best for your goal, though I am not sure about the adequacy of the prediction (that it will come true), since even DL gives only a likelihood, just like ML based on Bayesian statistics.
If your dependency is continuous in time and you have found, or know, the function that describes it, then of course you can get a prediction any number of steps forward, for any horizon you'd like; e.g. you may have discovered a tendency or a cyclicity (e.g. daily), in which case time can be treated as a feature.
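To illustrate treating time as a feature, a common sketch encodes the position in the cycle with sine/cosine pairs (the hourly index here is a made-up example):
import numpy as np
import pandas as pd

# Hypothetical hourly index; encode the daily cycle so 23:00 and 00:00 end up close
idx = pd.date_range('2020-01-01', periods=48, freq='H')
features = pd.DataFrame(index=idx)
features['hour_sin'] = np.sin(2 * np.pi * idx.hour / 24)
features['hour_cos'] = np.cos(2 * np.pi * idx.hour / 24)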
Another approach is differencing: a technique that removes the trend and seasonality of a time series in order to make it stationary.
That's all; there is nothing else to the mystery of dependency and backpropagation.
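As an illustration, a minimal differencing sketch with pandas (the numbers are made up):
import pandas as pd

# Toy series with a trend; first-order differencing removes it
s = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])
diff = s.diff().dropna()  # s[t] - s[t-1], now closer to stationary

# ... fit a model on the stationary differences and forecast them ...

# Invert the transform: add the differences back onto the last known level
restored = s.iloc[0] + diff.cumsum()  # reproduces s[1:]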

Related

Training an autoencoder for variable-length time series - Tensorflow

I am trying to train an LSTM model to reconstruct time series data. I have a data set of ~1800 univariate time series.
Basically I'm trying to solve a problem similar to this one (Anomaly detection in ECG plots), but my time series have different lengths.
I used this approach to deal with the variable lengths:
How to apply LSTM-autoencoder to variant-length time-series data?
and this approach to split the input data based on shape:
Keras misinterprets training data shape
When looping over the data and fitting a model for every shape, is the final model based only on the last shape it trained on, or does it use all the data?
How would I train the model on all input data regardless of its shape?
I know I could add padding, but I am trying to use the data as-is at this point.
Any suggestions or other approaches to deal with the different lengths of the time series?
(It is not an issue of time sampling; it is more that one time series started recording on day X and some only on day X+100.)
Here is the code I am using for my autoencoder:
import keras.backend as K
from keras.models import Model  # needed for Model() below
from keras.layers import (Input, Dense, TimeDistributed, LSTM, GRU, Dropout, merge,
                          Flatten, RepeatVector, Bidirectional, SimpleRNN, Lambda)
def encoder(model_input, layer, size, num_layers, drop_frac=0.0, output_size=None,
            bidirectional=False):
    """Encoder module of autoencoder architecture"""
    if output_size is None:
        output_size = size
    encode = model_input
    for i in range(num_layers):
        wrapper = Bidirectional if bidirectional else lambda x: x
        encode = wrapper(layer(size, name='encode_{}'.format(i),
                               return_sequences=(i < num_layers - 1)))(encode)
        if drop_frac > 0.0:
            encode = Dropout(drop_frac, name='drop_encode_{}'.format(i))(encode)
    encode = Dense(output_size, activation='linear', name='encoding')(encode)
    return encode

def repeat(x):
    stepMatrix = K.ones_like(x[0][:, :, :1])    # matrix of ones, shaped (batch, steps, 1)
    latentMatrix = K.expand_dims(x[1], axis=1)  # latent vars, shaped (batch, 1, latent_dim)
    return K.batch_dot(stepMatrix, latentMatrix)

def decoder(encode, layer, size, num_layers, drop_frac=0.0, aux_input=None,
            bidirectional=False):
    """Decoder module of autoencoder architecture"""
    decode = Lambda(repeat)([inputs, encode])  # note: refers to the global `inputs` tensor defined below
    if aux_input is not None:
        decode = merge([aux_input, decode], mode='concat')
    for i in range(num_layers):
        if drop_frac > 0.0 and i > 0:  # skip these for first layer for symmetry
            decode = Dropout(drop_frac, name='drop_decode_{}'.format(i))(decode)
        wrapper = Bidirectional if bidirectional else lambda x: x
        decode = wrapper(layer(size, name='decode_{}'.format(i),
                               return_sequences=True))(decode)
    decode = TimeDistributed(Dense(1, activation='linear'), name='time_dist')(decode)
    return decode
inputs = Input(shape=(None, 1))
encoded = encoder(inputs,LSTM,128, 2, drop_frac=0.0, output_size=None, bidirectional=False)
decoded = decoder(encoded, LSTM, 128, 2, drop_frac=0.0, aux_input=None,
                  bidirectional=False)
sequence_autoencoder = Model(inputs, decoded)
sequence_autoencoder.compile(optimizer='adam', loss='mae')
trainByShape = {}
for item in train_data:
    if item.shape in trainByShape:
        trainByShape[item.shape].append(item)
    else:
        trainByShape[item.shape] = [item]

for shape in trainByShape:
    modelHistory = sequence_autoencoder.fit(
        np.asarray(trainByShape[shape]),
        np.asarray(trainByShape[shape]),
        epochs=100, batch_size=1, validation_split=0.15)
Use a bidirectional LSTM and increase the number of parameters to gain accuracy. I increased latent_dim to 1000 and it fit the data closely, at the cost of more hardware and more memory.
def create_dataset(dataset, look_back=3):
    dataX, dataY = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back)]
        dataX.append(a)
        dataY.append(dataset[i + look_back])
    return np.array(dataX), np.array(dataY)
COLUMNS = ['Open']
dataset = eqix_df[COLUMNS]  # eqix_df: the answerer's stock-price DataFrame, loaded elsewhere
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(np.array(dataset).reshape(-1,1))
train_size = int(len(dataset) * 0.70)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size], dataset[train_size:len(dataset)]
look_back=10
trainX=[]
testX=[]
y_train=[]
trainX, y_train = create_dataset(train, look_back)
testX, y_test = create_dataset(test, look_back)
X_train = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
X_test = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
latent_dim=700
n_future=1
model = Sequential()
model.add(Bidirectional(LSTM(units=latent_dim, return_sequences=True,
                             input_shape=(X_train.shape[1], 1))))
#LSTM 1
model.add(Bidirectional(LSTM(latent_dim,return_sequences=True,dropout=0.4,recurrent_dropout=0.4,name='lstm1')))
#LSTM 2
model.add(Bidirectional(LSTM(latent_dim,return_sequences=True,dropout=0.2,recurrent_dropout=0.4,name='lstm2')))
#LSTM 3
model.add(Bidirectional(LSTM(latent_dim, return_sequences=False,dropout=0.2,recurrent_dropout=0.4,name='lstm3')))
model.add(Dense(units = n_future))
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["acc"])
history=model.fit(X_train, y_train,epochs=50,verbose=0)
plt.plot(history.history['loss'])
plt.title('loss accuracy')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
#print(X_test)
prediction = model.predict(X_test)
# shift train predictions for plotting
trainPredictPlot = np.empty_like(dataset)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(prediction)+look_back, :] = prediction
# shift test predictions for plotting
#plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot, color='red')
#plt.plot(testPredictPlot)
#plt.legend(['Actual','Train','Test'])
x=np.linspace(look_back,len(prediction)+look_back,len(y_test))
plt.plot(x,y_test)
plt.show()
Keras' LSTM implementation expects input of shape (Batch, Timesteps, Features).
One solution would be to set Timesteps = 1 and pass the sequence lengths as the Batch dimension.
If the sampling procedure is the same (no need for resampling), and the difference in length only comes from when the recording starts (X+100 instead of X), I would try to get rid of the lag in the pre-processing stage and keep only the section of interest.
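A minimal sketch of that pre-processing idea (the frame layout is an assumption): align the series and keep only the overlapping section, so every input ends up the same length.
import numpy as np
import pandas as pd

# Hypothetical layout: one column per sensor, leading NaNs where recording hadn't started
df = pd.DataFrame({
    's1': [1.0, 2.0, 3.0, 4.0, 5.0],
    's2': [np.nan, np.nan, 3.1, 4.1, 5.1],  # started recording two steps later
})

common = df.dropna()  # keep only the span where every series has data
sequences = [common[c].to_numpy() for c in common.columns]  # equal-length series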
Part 1: plotting the irregular heartbeats. Part 2: a Dense network that classifies incoming heartbeat voltages to predict irregular beat patterns, with 94% accuracy!
from scipy.io import arff
import pandas as pd
from scipy.misc import electrocardiogram
import matplotlib.pyplot as plt
import numpy as np
data = arff.loadarff('ECG5000_TRAIN.arff')
df = pd.DataFrame(data[0])
#for column in df.columns:
# print(column)
columns=[x for x in df.columns if x!="target"]
print(columns)
#print(df[df.target == "b'1'"].drop(labels='target', axis=1).mean(axis=0).to_numpy())
normal=df.query("target==b'1'").drop(labels='target', axis=1).mean(axis=0).to_numpy()
rOnT=df.query("target==b'2'").drop(labels='target', axis=1).mean(axis=0).to_numpy()
pcv=df.query("target==b'3'").drop(labels='target', axis=1).mean(axis=0).to_numpy()
sp=df.query("target==b'4'").drop(labels='target', axis=1).mean(axis=0).to_numpy()
ub=df.query("target==b'5'").drop(labels='target', axis=1).mean(axis=0).to_numpy()
plt.plot(normal,label="Normal")
plt.plot(rOnT,label="R on T",alpha=.3)
plt.plot(pcv, label="PCV",alpha=.3)
plt.plot(sp, label="SP",alpha=.3)
plt.plot(ub, label="UB",alpha=.3)
plt.legend()
plt.title("ECG")
plt.show()
Frame-by-frame comparison for normal beats. There are bands of operation within which a normal heart stays:
def PlotTheFrames(df, title, color):
    fig, ax = plt.subplots(figsize=(140, 50))
    for key, item in df.iterrows():
        array = []
        for value in np.array(item).flatten():
            array.append(value)
        x = np.linspace(0, 100, len(array))
        ax.plot(x, array, c=color)
    plt.title(title)
    plt.show()
normal=df.query("target==b'1'").drop(labels='target', axis=1)
PlotTheFrames(normal,"Normal Heart beat",'r')
With R on T, the valves don't seem to be operating correctly:
rOnT=df.query("target==b'2'").drop(labels='target', axis=1)
PlotTheFrames(rOnT,"R on T Heart beat","b")
Use a deep dense network instead of an LSTM! I used LeakyReLU, which keeps a small gradient for negative inputs.
# (imports needed by this snippet)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation

X = df[columns]
y = pd.get_dummies(df['target'])
model = Sequential()
model.add(Dense(440, input_shape=(len(columns),),activation='LeakyReLU'))
model.add(Dropout(0.4))
model.add(Dense(280, activation='LeakyReLU'))
model.add(Dropout(0.2))
model.add(Dense(240, activation='LeakyReLU'))
model.add(Dense(32, activation='LeakyReLU'))
model.add(Dense(16, activation='LeakyReLU'))
model.add(Dense(5))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=42)
scaler = StandardScaler()
scaler.fit(X_train)
X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)
history=model.fit(X_train, y_train,epochs = 1000,verbose=0)
model.evaluate(X_test, y_test)
plt.plot(history.history['loss'])
plt.title('loss accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

How to predict a certain time span into the future with recurrent neural networks in Keras

I have the following code for time series prediction with RNNs, and I would like to know whether for the testing I am predicting one day in advance:
# -*- coding: utf-8 -*-
"""
Time Series Prediction with RNN
"""
import pandas as pd
import numpy as np
from tensorflow import keras
#%% Configure parameters
epochs = 5
batch_size = 50
steps_backwards = int(1* 4 * 24)
steps_forward = int(1* 4 * 24)
split_fraction_trainingData = 0.70
split_fraction_validatinData = 0.90
#%% "Reading the data"
dataset = pd.read_csv('C:/User1/Desktop/TestValues.csv', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0]}, index_col=['datetime'])
df = dataset
data = df.values
indexWithYLabelsInData = 0
data_X = data[:, 0:2]
data_Y = data[:, indexWithYLabelsInData].reshape(-1, 1)
#%% Prepare the input data for the RNN
series_reshaped_X = np.array([data_X[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
series_reshaped_Y = np.array([data_Y[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
timeslot_x_train_end = int(len(series_reshaped_X)* split_fraction_trainingData)
timeslot_x_valid_end = int(len(series_reshaped_X)* split_fraction_validatinData)
X_train = series_reshaped_X[:timeslot_x_train_end, :steps_backwards]
X_valid = series_reshaped_X[timeslot_x_train_end:timeslot_x_valid_end, :steps_backwards]
X_test = series_reshaped_X[timeslot_x_valid_end:, :steps_backwards]
indexWithYLabelsInSeriesReshapedY = 0
lengthOfTheYData = len(data_Y)-steps_backwards -steps_forward
Y = np.empty((lengthOfTheYData, steps_backwards, steps_forward))
for step_ahead in range(1, steps_forward + 1):
    Y[..., step_ahead - 1] = series_reshaped_Y[..., step_ahead:step_ahead + steps_backwards, indexWithYLabelsInSeriesReshapedY]
Y_train = Y[:timeslot_x_train_end]
Y_valid = Y[timeslot_x_train_end:timeslot_x_valid_end]
Y_test = Y[timeslot_x_valid_end:]
#%% Build the model and train it
model = keras.models.Sequential([
    keras.layers.SimpleRNN(90, return_sequences=True, input_shape=[None, 2]),
    keras.layers.SimpleRNN(60, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(steps_forward))
    #keras.layers.Dense(steps_forward)
])
model.compile(loss="mean_squared_error", optimizer="adam", metrics=['mean_absolute_percentage_error'])
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,
                    validation_data=(X_valid, Y_valid))
#%% #Predict the test data
Y_pred = model.predict(X_test)
prediction_lastValues_list=[]
for i in range(0, len(Y_pred)):
    prediction_lastValues_list.append(Y_pred[i][0][steps_forward-1])
#%% Create the dataframe for the whole data
wholeDataFrameWithPrediciton = pd.DataFrame((X_test[:,0]))
wholeDataFrameWithPrediciton.rename(columns = {indexWithYLabelsInData:'actual'}, inplace = True)
wholeDataFrameWithPrediciton.rename(columns = {1:'Feature 1'}, inplace = True)
wholeDataFrameWithPrediciton['predictions'] = prediction_lastValues_list
wholeDataFrameWithPrediciton['difference'] = (wholeDataFrameWithPrediciton['predictions'] - wholeDataFrameWithPrediciton['actual']).abs()
wholeDataFrameWithPrediciton['difference_percentage'] = ((wholeDataFrameWithPrediciton['difference'])/(wholeDataFrameWithPrediciton['actual']))*100
I define steps_forward = int(1* 4 * 24), which is basically one full day (in 15-minute resolution, which makes 1 * 4 * 24 = 96 time stamps). I predict the test data by using Y_pred = model.predict(X_test), and I create a list with the predicted values using for i in range(0, len(Y_pred)): prediction_lastValues_list.append((Y_pred[i][0][steps_forward-1])).
As the input and output data of RNNs are quite confusing to me, I am not sure whether for the test dataset I am predicting one day in advance, meaning 96 time steps into the future. What I actually want is to read historic data and then predict the next 96 time steps based on the historic 96 time steps. Can anyone tell me whether this code does that or not?
Here I have a link to some test data that I just created randomly. Don't pay attention to the actual values, just to the structure of the prediction: Download Test Data
Am I forecasting 96 steps in advance with the given code? (My code is based on a tutorial that can be found here: Tutorial RNN for electricity price prediction.)
Reminder: Can anyone tell me something about my question? Or do you need further information? If so, please tell me.
I'll highly appreciate your comments and will be quite thankful for your help. I will also award a bounty for a useful answer.
So if your goal is to predict the next 96 steps given the previous 96 steps, I think you are over-complicating it with your current model. Why not start with something simple like this:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
np.random.seed(42)
tf.random.set_seed(42)
df = pd.read_csv('TestValues.csv', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0]}, index_col=['datetime'])
df = df.drop('value', 1)
steps = 96
scaler = MinMaxScaler()
data = scaler.fit_transform(df.values)
series_reshaped = np.array([data[i:i + (steps+steps)].copy() for i in range(len(data) - (steps + steps))])
x_train_index = int(len(series_reshaped)* .80)
x_valid_index = int(len(series_reshaped)* .10)
x_test_index = x_train_index + x_valid_index
X_train = series_reshaped[:x_train_index, :steps]
X_valid = series_reshaped[x_train_index: x_test_index, :steps]
X_test = series_reshaped[x_test_index:, :steps]
Y_train = series_reshaped[:x_train_index, steps:]
Y_valid = series_reshaped[x_train_index: x_test_index, steps:]
Y_test = series_reshaped[x_test_index:, steps:]
model = tf.keras.models.Sequential([
    tf.keras.layers.SimpleRNN(96, return_sequences=True, input_shape=(None, 1)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))
])
model.compile(loss='mae', optimizer=tf.keras.optimizers.Adam(0.001))
history = model.fit(X_train, Y_train, epochs=20,
                    validation_data=(X_valid, Y_valid))
You simply split your data into the 96 past steps as training input and the 96 steps forward as your "labels". After training, just make your predictions with your test data:
import matplotlib.pyplot as plt
Y_pred = model.predict(X_test)
prediction_list = []
for i in range(0, len(Y_pred)):
    prediction_list.append(Y_pred[i][0])
prediction_df = pd.DataFrame((Y_test[:, 0]))
prediction_df.rename(columns = {0:'actual'}, inplace = True)
prediction_df['predictions'] = prediction_list
prediction_df['difference'] = (prediction_df['predictions'] - prediction_df['actual']).abs()
prediction_df['difference_percentage'] = ((prediction_df['difference'])/(prediction_df['actual']))*100
print(prediction_df)
fig, ax = plt.subplots(figsize = (24,12))
ax.set_title('Temperatures across time', fontsize=20)
ax.set_xlabel('Timesteps', fontsize=20)
ax.tick_params(axis='both', which='major', labelsize=20)
ax.set_ylabel('Temperature', fontsize=20)
plt1 = ax.plot(prediction_df['predictions'][steps:], color = 'g', label='predictions')
plt2 = ax.plot(prediction_df['actual'][steps:], color = 'r', label='actual')
ax.legend(loc='upper left', prop={'size': 20})
actual predictions difference difference_percentage
0 0.540650 [0.52996427] [0.010686159] [1.9765377]
1 0.550813 [0.5463712] [0.0044417977] [0.8064075]
2 0.544715 [0.54527795] [0.00056248903] [0.1032629]
3 0.543360 [0.5469178] [0.003557384] [0.65470064]
4 0.547425 [0.5332471] [0.014178336] [2.590003]
.. ... ... ... ...
977 0.410569 [0.440537] [0.029967904] [7.2991133]
978 0.395664 [0.44218686] [0.046522915] [11.758189]
979 0.414634 [0.448785] [0.03415087] [8.236386]
980 0.414634 [0.43778685] [0.023152709] [5.5838885]
981 0.409214 [0.45098385] [0.041769773] [10.207315]
Note that this model can be improved in a lot of ways, but I want you to understand the basics, which is why I tried to make it as simple as possible. After you have understood this approach, you can try the autoregressive approach mentioned by elbe. Also note that I have not de-normalised your data, which is why you get very low values.
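If you want the outputs back on the original scale, a minimal sketch reusing the scaler fitted above:
# Undo the MinMax scaling for predictions and actuals (both were scaled from one feature)
Y_pred_denorm = scaler.inverse_transform(Y_pred.reshape(-1, 1)).reshape(Y_pred.shape)
Y_test_denorm = scaler.inverse_transform(Y_test.reshape(-1, 1)).reshape(Y_test.shape)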
First, I suggest you read Tensorflow's tutorial on time series forecasting.
I played around a bit with your code and the data provided.
The first important thing is that only the temperature column contains information.
In the code below, I prepare the data so that X holds a time window of 96 samples/steps and Y holds the next step. X is of dimension (n_samples, 96, 1) and Y of dimension (n_samples,). I use only steps_backwards points for the past (and discard the future part for simplicity, without affecting the generality).
I have tried different models (a simple fully connected network, RNN + FC, etc.).
I'm doing mean pooling (with the functional API rather than the sequential model definition approach) so that I have a single predicted value at the end.
X_train = series_reshaped_X[:timeslot_x_train_end, :steps_backwards, 1][:, :, np.newaxis]
X_valid = series_reshaped_X[timeslot_x_train_end:timeslot_x_valid_end, :steps_backwards, 1][:, :, np.newaxis]
X_test = series_reshaped_X[timeslot_x_valid_end:, :steps_backwards, 1][:, :, np.newaxis]
Y_train = series_reshaped_X[:timeslot_x_train_end, steps_backwards, 1]
Y_valid = series_reshaped_X[timeslot_x_train_end:timeslot_x_valid_end, steps_backwards, 1]
Y_test = series_reshaped_X[timeslot_x_valid_end:, steps_backwards, 1]
# define the model
input = tf.keras.Input(shape=(96, 1))
x = input
x = keras.layers.SimpleRNN(10, return_sequences=False, input_shape=[96, 1])(x)
x = keras.layers.Dense(5)(x)
x = tf.reduce_mean(x, axis=1)
model = tf.keras.Model(inputs=input, outputs=x)
model.compile(loss="mean_squared_error", optimizer="adam", metrics=['mae'])
with return_sequences=False, the RNN outputs only the last predicted value.
Model:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_5 (InputLayer) [(None, 96, 1)] 0
_________________________________________________________________
simple_rnn_27 (SimpleRNN) (None, 10) 120
_________________________________________________________________
dense_21 (Dense) (None, 5) 55
_________________________________________________________________
tf.math.reduce_mean_3 (TFOpL (None,) 0
=================================================================
Total params: 175
Trainable params: 175
Non-trainable params: 0
If you set return_sequences=True, the entire output sequence is output, but the prediction time step inside the RNN is still one. It is explained here.
One way to predict more steps is the autoregressive approach, i.e. concatenating the n-1 previous data points and the predicted value to get the next value (a minimal sketch of such a loop appears after the second model summary below). Another (better) way is to consider that the RNN captures the time dependency in the input, so another possible model could be, if we consider that the input and the output data are of the same shape:
input = tf.keras.Input(shape=(96, 1))
x = input
x = keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[96, 1])(x)
x = keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs=input, outputs=x)
model.compile(loss="mean_squared_error", optimizer="adam", metrics=['mae'])
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_7 (InputLayer) [(None, 96, 1)] 0
_________________________________________________________________
simple_rnn_29 (SimpleRNN) (None, 96, 10) 120
_________________________________________________________________
dense_23 (Dense) (None, 96, 1) 11
=================================================================
Total params: 131
Trainable params: 131
Non-trainable params: 0
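As referenced above, a minimal sketch of the autoregressive loop; this assumes a model that maps a (96, 1) window to a single next value, like the first model above (the helper name is made up):
import numpy as np

def forecast_autoregressive(model, last_window, n_steps):
    # last_window: array of shape (96, 1) holding the most recent observations
    window = last_window.copy()
    preds = []
    for _ in range(n_steps):
        # one scalar prediction for the current window
        next_val = float(model.predict(window[np.newaxis, :, :]).ravel()[0])
        preds.append(next_val)
        # drop the oldest value, append the prediction as the newest input
        window = np.concatenate([window[1:], [[next_val]]], axis=0)
    return np.array(preds)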
In a way, you can think of the RNN as being able to capture the temporal dependencies in the sequence. It can be combined with other layers to give a better predictor (e.g. the dense layer as you did, or stacked RNNs, etc.).
Note that the number of parameters in the model summary gives you an idea of the ability of the network to learn complex relationships between inputs and outputs (and of the overfitting problem if the number of parameters is too high).
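As a quick check of those numbers: a SimpleRNN layer has units * (units + n_features + 1) parameters, so 10 units on 1 input feature gives 10 * (10 + 1 + 1) = 120, and a Dense(5) on 10 inputs gives 10 * 5 + 5 = 55, matching the summaries above.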

Forecasting stocks with LSTM in Keras (Python 3.7, Tensorflow 2.1.0)

I'm trying to use LSTM to predict how the Dow Jones Industrial Average will perform in coming months. I think it is appropriate to frame this as a time series scenario since the DJIA behaves like a stock, with my data values spread evenly in time. I'm only a beginner, so starting simply with only one feature (daily close value). Now I know that stocks are very random and it's hard to predict them well. And, the close value alone is not very informative... but I'll add other features later.
Dataset: DJIA historical data, Jan 28, 1985 - Jun 24, 2020, downloadable here: https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI.
Visualization with matplotlib:
I use a series of close values (number = 'sequence_length') to predict the close value that immediately follows the series (sequence_length + 1). For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, etc. Put another way, I partition the data such that x_train[0] contains close values for days 0-29, and y_train[0] contains the single value for day 30. OK. So this is the result I get after running the model on my test data:
Ostensibly great, but I'm wondering if this whole concept is flawed: is the model merely seeing the data repetitively, and not learning any underlying pattern? See below for DJIA close predictions for 7/2020 through 4/2021. It seems to me that the prediction curve mimics the exact shape of the testing data, falling below 20,000 points and all...
Questions
Is this model valid? Is it a matter of changing parameters or reformatting data?
How the heck do you evaluate a model like this? Apparently 'accuracy' is an invalid metric. See below for loss curve
It was suggested that instead of using scalar close values for labels, I use sequences instead. For example, x_train[0] might include close values for days 0-29, and y_train[0] would include close values for days 30-60. I have been trying in vain to make this work and apparently have no idea how. I tried to make y_test and y_train Numpy arrays including arrays of sequence data - like this:
y_train, y_test = [], []
for i in range(sequence_length, len(training_set_scaled)):
    y_train.append(training_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
    y_test.append(testing_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
y_train = np.array(list(y_item for y_item in y_train))
y_test = np.array(list(y_item for y_item in y_test))
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
Any help would be SO greatly appreciated, and perhaps we can all benefit ($). Joking... sort of.
The Code
# (imports required by the snippet below)
from math import ceil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

df = pd.read_csv('DJIA_historical_data.csv') # 2D. Shape: (8924 examples, 7 features)
close_data = df['Close'] # 1D (examples, )
dates = df['Date'] # 1D (examples, )
adj_dates = mdates.datestr2num(dates) # Convert Pandas series to np array so matplotlib can plot
# Important parameter
sequence_length: int = 90 # Aka 'timesteps', or number of close values used to make each new prediction
# Split off the training set and scale it.
percent_training: float = 0.80
num_training_samples = ceil(percent_training*len(df)) # A whole number
training_set = df.iloc[:num_training_samples, 5:6].values # 2D, shape: (samples, 1 feature)
scaler = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = scaler.fit_transform(training_set) #Shape is 2D: (num_training_samples, 1)
# Build 3D training set. Final shape: (examples, sequence_length, 1)
x_train = np.array([training_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(training_set_scaled))])
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
# Build test sets
num_testing_samples: int = len(df) - x_train.shape[0] # Scalar value
testing_set = df.iloc[-num_testing_samples:, 5:6].values # 2D (examples, 1)
testing_set_scaled = scaler.fit_transform(testing_set) # 2D ndarray (examples, 1)
x_test = np.array([testing_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(testing_set_scaled))])
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1)) #3D shape: (examples-sequence_length, sequence_length, 1).
# Build 1D training labels (examples, )
y_train = np.array([training_set_scaled[i, 0] for i in range(sequence_length, len(training_set_scaled))])
y_test = np.array([testing_set_scaled[i, 0] for i in range(sequence_length, len(testing_set_scaled))]) # (examples-sequence_length, 1)
y_test = np.reshape(y_test, (y_test.shape[0])) #1D (examples, )
# Build Model
epochs: int = 150
batch_size: int = 32
LSTM_1 = LSTM(
    units = 5,  # I reduced model complexity because I thought it would reduce overfitting. No such luck
    input_shape = (x_train.shape[1], 1),
    return_sequences = False,
)
LSTM_2 = LSTM(
    units = 10
)
model = Sequential()
model.add(LSTM_1) # Output shape: (batch_size, sequence_length, units)
model.add(Dropout(0.4))
# model.add(LSTM_2) # Output shape: ?
# model.add(Dropout(0.2))
model.add(Dense(1)) # Is linear activation appropriate here?
model.compile(loss = 'mean_squared_error',
              optimizer = 'adam',
              )
early_stopping = EarlyStopping(monitor='val_loss',
                               mode='min',
                               verbose = 1,
                               patience = 9,
                               restore_best_weights = False
                               )
history = model.fit(x_train,
                    y_train,
                    epochs = epochs,
                    batch_size = batch_size,
                    verbose = 2,
                    validation_split = 0.20,
                    # validation_data = (x_test, y_test),
                    callbacks = [early_stopping],
                    )
# Evaluate performance
model.summary()
loss = model.evaluate(x_test, y_test, batch_size = batch_size)
# early_stopping.stopped_epoch returns 0 if training didn't stop early.
print('Training stopped after',early_stopping.stopped_epoch,'epochs.')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss vs. Epoch')
plt.ylabel('loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
prediction = model.predict(x_test)
prediction = scaler.inverse_transform(prediction)
y_test2 = np.reshape(y_test, (y_test.shape[0], 1))
y_test = scaler.inverse_transform(y_test2)
test_dates = adj_dates[-x_test.shape[0]:]
# Visualizing the results
plt.plot_date(test_dates, y_test, '-', linewidth = 2, color = 'red', label = 'Real DJIA Close')
plt.plot(test_dates, prediction, color = 'blue', label = 'Predicted Close')
plt.title('Close Prediction')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()
# Generate future data
time_horizon = sequence_length
# future_lookback = adj_dates[-time_horizon:]
last_n = x_test[-time_horizon:,:,:] # Find last n number of days
future_prediction = model.predict(last_n)
future_prediction2 = np.reshape(future_prediction, (future_prediction.shape[0], 1))
future_prediction3 = scaler.inverse_transform(future_prediction2)
future_prediction3 = np.reshape(future_prediction3, (future_prediction3.shape[0]))
full_dataset_numpy = np.array(close_data)
all_data = np.append(full_dataset_numpy, future_prediction3)
plt.plot(all_data, color = 'blue', label = 'All data')
plt.title('All data including predictions')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()
# Generate dates for future predictions
# Begin at the last date in the dataset, then add 'time_horizon' many new dates
last_date = dates.iloc[-1] # String
timestamp_list = pd.date_range(last_date, periods = time_horizon).tolist() #List of timestamps
# Convert list of timestamps to list of strings
datestring_list = [i.strftime("%Y-%m-%d") for i in timestamp_list] #List of strings
# Clip first value, which is already included in the dataset
datestring2 = mdates.datestr2num(datestring_list)
plt.plot_date(datestring2, future_prediction3, '-', color = 'blue', label = 'Predicted Close')
plt.title('DJIA Close Prediction')
plt.xlabel('Date')
plt.ylabel('Predicted Close')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
Case 1: At the start of your question, you mentioned "For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, etc. ".
Case 2: But in Question 3, you mentioned "For example, x_train[0] might include close values for days 0-29, and y_train[0] would include close values for days 30-60.".
Do you want to predict the Close value of the next day, or the Close values of the next 30 days?
To generate the data for X and Y (train and test), you can use the function below:
def univariate_data(dataset, start_index, end_index, history_size, target_size):
    data = []
    labels = []
    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index):
        indices = range(i-history_size, i)
        # Reshape data from (history_size,) to (history_size, 1)
        data.append(np.reshape(dataset[indices], (history_size, 1)))
        labels.append(dataset[i+target_size])
    return np.array(data), np.array(labels)
The value of the history_size argument will be 30, and the value of target_size will be 1 for Case 1 and 30 for Case 2 (mentioned above).
You need to call that function once for training and once for testing, as shown below:
univariate_past_history = 30
univariate_future_target = 1  # Case 1; use 30 for Case 2
x_train_uni, y_train_uni = univariate_data(data, 0, TRAIN_SPLIT,
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(data, TRAIN_SPLIT, None,
                                       univariate_past_history,
                                       univariate_future_target)
Please see this Tensorflow Tutorial, which comprehensively explains both univariate (one column) and multivariate (multiple columns) time series analysis, with step-by-step code.
Answering your questions in the sequence you asked them:
Yes. Referring to the tutorial will help.
Yes, accuracy is an invalid metric here. You can use MAE or MSE instead, as shown below:
simple_lstm_model.compile(optimizer='adam', loss='mae')
We should use NumPy arrays (of uniform shape) instead of ragged sequences; a sketch follows below.
Please let me know if you face any other issue, and we will be happy to help you.
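For instance, a minimal sketch of building equal-length label windows, so that np.array receives a rectangular list and the ValueError from the question disappears (variable names follow the question):
import numpy as np

sequence_length = 90
# Stop early enough that every label window holds exactly sequence_length values;
# x_train must be built over the same reduced range so inputs and labels stay aligned
y_train = np.array([
    training_set_scaled[i:i + sequence_length, 0]
    for i in range(sequence_length, len(training_set_scaled) - sequence_length + 1)
])
# y_train now has shape (n_examples, sequence_length) and converts cleanly to a tensor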

LSTM produces identical forecast for each input

I've been working for the past four weeks on reproducing a CNN-LSTM model for PV power forecasting from the literature (http://www.mdpi.com/2076-3417/8/8/1286) for my Master's thesis in Energy Science. However, I've been stuck on a seemingly simple issue: any configuration of the LSTM model that I've tried yields one of two things:
Ridiculous output that makes no sense whatsoever (flat line, complete stochasticity, negative values, you name it)
Exactly the same (very believable) PV power forecast for every input.
I've done my best to reproduce the issue with as little code as possible:
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.layers import *
from tensorflow.keras.models import Sequential
from tensorflow.python.keras.layers import CuDNNLSTM
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from time import time
SUN_UP, SUN_DOWN = '03:00:00', '23:00:00'
df = pd.read_csv('../Model_Xander/CNN-LSTM-wang/pv_data/all_data_resample-15T_interpolate-4.csv',
                 index_col = 0,
                 parse_dates = True)
df = pd.DataFrame(df['151'])
df = df.between_time(SUN_UP, SUN_DOWN)
TIME_STEPS_PER_DAY = len(df.loc['1-1-2016'])
print('each day consists of ' + str(TIME_STEPS_PER_DAY) + ' time steps of 15 minutes')
df = df.values
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df = np.nan_to_num(df_scaled, nan = -1)
#df = np.float16(df)
def multivariate_data(dataset, target, start_index, end_index, history_size,
                      target_size, step, single_step=False):
    data = []
    labels = []
    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index, step):
        indices = range(i-history_size, i)
        data.append(dataset[indices])
        if single_step:
            labels.append(target[i+target_size])
        else:
            labels.append(target[i:i+target_size])
    return np.array(data), np.array(labels)
TRAIN_TEST_SPLIT = round(((2/3)*len(df)))
TARGET_COL = df[:,0]
HISTORY_SIZE = TIME_STEPS_PER_DAY * 10
TARGET_SIZE = TIME_STEPS_PER_DAY
STEP = TIME_STEPS_PER_DAY
x_train, y_train = multivariate_data(df, TARGET_COL, 0, TRAIN_TEST_SPLIT, HISTORY_SIZE, TARGET_SIZE, STEP)
x_test, y_test = multivariate_data(df, TARGET_COL, TRAIN_TEST_SPLIT, None, HISTORY_SIZE, TARGET_SIZE, STEP)
lstm = Sequential()
lstm.add(Input(shape = (x_train.shape[1], x_train.shape[2])))
lstm.add(Masking(mask_value = -1))
lstm.add(LSTM(units = 100,
              kernel_initializer = keras.initializers.Orthogonal(),
              bias_initializer = keras.initializers.Constant(value=0.1),
              return_sequences = True))
lstm.add(LSTM(units = 100,
              kernel_initializer = keras.initializers.Orthogonal(),
              bias_initializer = keras.initializers.Constant(value=0.1),
              return_sequences = False))
lstm.add(Dense(units = 100, activation = 'relu',
               kernel_initializer = keras.initializers.TruncatedNormal(mean=0, stddev=0.05),
               bias_initializer = keras.initializers.Constant(value=0.1)))
lstm.add(Dense(units = y_test.shape[1], activation = 'relu',
               kernel_initializer = keras.initializers.TruncatedNormal(mean=0, stddev=0.05),
               bias_initializer = keras.initializers.Constant(value=0.1)))
lstm.compile(loss = 'mse', optimizer = 'adam')
lstm.summary()
begin = time()
history = lstm.fit(x_train, y_train,
                   epochs = 5,
                   batch_size = 24,
                   validation_data = (x_test, y_test),
                   verbose = 1,
                   shuffle = False)
end = time()
print('it took ' + str(round(end-begin)) + ' seconds to train 5 epochs')
print(history.history)
predict = lstm.predict(x_test)
print(predict.shape)
plt.figure()
for i in range(10, 20):
    plt.plot(predict[i,:])
plt.figure()
for i in range(0, x_test.shape[0]):
    plt.plot(predict[i,:])
The problem is clearly seen in the last plot:
Plot of 350 predictions overlaid on top of one another
As you can see, all forecasts are identical. I have run out of ideas on how to combat this issue.
As far as I could deduce, there are a number of possible causes. First, my dataset contains a large number of NaNs. I've done my best to combat that issue with three methods:
Resampling from very high resolution (10 seconds) to standard resolution (15 min)
Interpolating up to 4 consecutive NaNs with linear interpolation (any more seems unwise to me)
The masking layer an observant reader might have noticed in the model definition in the code
Even after these steps, my dataset still contains a large number of NaNs, and I'm not really sure what to do about that, or whether the Masking layer is even doing its intended job. I do know for sure that the masking layer cannot play nicely with CuDNNLSTM, and my normal LSTM model runs a LOT slower with the masking layer.
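(For reference, a minimal pandas sketch of the resample-then-interpolate steps described above; the file name and column layout are assumptions:)
import pandas as pd

# Hypothetical raw 10-second PV readings indexed by timestamp
raw = pd.read_csv('pv_raw.csv', index_col=0, parse_dates=True)

resampled = raw.resample('15T').mean()                    # 10 s -> 15 min resolution
filled = resampled.interpolate(method='linear', limit=4)  # bridge gaps of up to 4 steps
masked = filled.fillna(-1)                                # sentinel value for the Masking layer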
The best I've been able to accomplish in terms of obtaining differently shaped predictions for differently shaped inputs is this: differently shaped output for differently shaped inputs. However, as you can see, this is just the same shape with a slightly different amplitude.
Another thing I've noticed is that when I input data from 9 other sensors as features (each with a similar amount and location of NaNs), the amplitude changes per prediction (yay), but the shape remains the same across all predictions: yay, different amplitude! Aww, same shape :(.
I will be uploading my model to my university's cluster (for the 200th time) to train for more than 5 epochs; who knows, maybe today is my lucky day. If anyone knows how to combat these issues, I would be very glad and thankful to hear your thoughts.
EDIT:
In light of the lessons learned from the response below, I made the following changes:
Regularization and dropout to combat overfitting (which, if left unchecked, leads to the average being forecast for every input).
Last LSTM layer with return_sequences = True.
Added a Flatten layer after the last LSTM layer.
Removed NaN values from my dataset, removing the need for the masking layer and enabling the use of the CuDNNLSTM layer (which trains on the GPU, if I understand correctly).
However, now that each day has a unique forecast, I noticed that increasing the number of units in the LSTM layer beyond somewhere between 20 and 50 (I tested 20 and 50) brings back the problem of each day getting the exact same forecast. I am still stumped as to why this is. (See below for the model I used to produce unique forecasts for each day.)
lstm = Sequential()
lstm.add(Input(shape = (x_train.shape[1], x_train.shape[2])))
lstm.add(CuDNNLSTM(units = 50,
                   kernel_initializer = keras.initializers.Orthogonal(),
                   kernel_regularizer = keras.regularizers.l1_l2(l1=1e-5, l2=1e-4),
                   bias_initializer = keras.initializers.Constant(value=0.1),
                   return_sequences = True))
#lstm.add(Dropout(rate=0.2))
lstm.add(CuDNNLSTM(units = 50,
                   kernel_initializer = keras.initializers.Orthogonal(),
                   kernel_regularizer = keras.regularizers.l1_l2(l1=1e-5, l2=1e-4),
                   bias_initializer = keras.initializers.Constant(value=0.1),
                   return_sequences = True))
lstm.add(Dropout(rate = 0.2))
lstm.add(Flatten())
lstm.add(Dense(units = int(0.5*x_train.shape[1]), activation = 'relu',
               kernel_initializer = keras.initializers.TruncatedNormal(mean=0, stddev=0.05),
               bias_initializer = keras.initializers.Constant(value=0.1)))
lstm.add(Dropout(rate = 0.2))
lstm.add(Dense(units = y_test.shape[1], activation = 'relu',
               kernel_initializer = keras.initializers.TruncatedNormal(mean=0, stddev=0.05),
               bias_initializer = keras.initializers.Constant(value=0.1)))
lstm.compile(loss = 'mse', optimizer = 'adam')
lstm.summary()

Using simple models on Google Trends data to predict something doesn't work as expected

I'm using Google Trends to develop a simple model to predict the future trend for a set of search terms. I took inspiration from this blog post and tried to do basically the same thing with other search terms, trying to find the best models for this kind of task.
The problem is: the predictions for the other search terms are completely wrong. I only used terms with a regular pattern, sometimes less regular than the pattern in the example from the blog. Here is my adapted code:
import numpy as np
import pandas as pd
from datetime import date
from matplotlib import pyplot as plt
from keras.models import Sequential
from keras.layers import InputLayer, Reshape, Conv1D, MaxPool1D, Flatten, Dense, LSTM
from keras.callbacks import EarlyStopping, ModelCheckpoint
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
def prepare_data(target, window_X, window_y):
    """ Data preprocessing for multistep forecast """
    X, y = [], []
    start_X = 0
    end_X = start_X + window_X
    start_y = end_X
    end_y = start_y + window_y
    for _ in range(len(target)):
        if end_y < len(target):
            X.append(target[start_X:end_X])
            y.append(target[start_y:end_y])
        start_X += 1
        end_X = start_X + window_X
        start_y += 1
        end_y = start_y + window_y
    X = np.array(X)
    y = np.array(y)
    return np.array(X), np.array(y)
def fit_model(type, X_train, y_train, X_test, y_test, batch_size, epochs):
    """ Training function for network """
    # Model input
    model = Sequential()
    model.add(InputLayer(input_shape=(X_train.shape[1], )))
    if type == 'mlp':
        model.add(Reshape(target_shape=(X_train.shape[1], )))
        model.add(Dense(units=64, activation='relu'))
    if type == 'cnn':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(Conv1D(filters=64, kernel_size=4, activation='relu'))
        model.add(MaxPool1D())
        model.add(Flatten())
    if type == 'lstm':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(LSTM(units=64, return_sequences=False))
    # Output layer
    model.add(Dense(units=64, activation='relu'))
    model.add(Dense(units=y_train.shape[1], activation='sigmoid'))
    # Compile
    model.compile(optimizer='adam', loss='mse')
    # Callbacks
    early_stopping = EarlyStopping(monitor='val_loss', patience=10)
    model_checkpoint = ModelCheckpoint(filepath='model.h5', save_best_only=True)
    callbacks = [early_stopping, model_checkpoint]
    # Fit model
    model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
              batch_size=batch_size, epochs=epochs, callbacks=callbacks, verbose=2)
    # Load best model
    model.load_weights('model.h5')
    # Return
    return model
# Define windows
window_X = 12
window_y = 6
# Load data
data = pd.read_csv('data/holocaust-world.csv', sep=',')
data = data.set_index(keys=pd.to_datetime(data['month']), drop=True).drop('month', axis=1)
# Scale data
data['y'] = data['y'] / 100.
# Prepare tensors
X, y = prepare_data(target=data['y'].values, window_X=window_X, window_y=window_y)
# Training and test
train = 100
X_train = X[:train]
y_train = y[:train]
X_valid = X[train:]
y_valid = y[train:]
# Train models
models = ['mlp', 'cnn', 'lstm']
# Test data
X_test = data['y'].values[-window_X:].reshape(1, -1)
# Predictions
preds = pd.DataFrame({'mlp': [np.nan]*6, 'cnn': [np.nan]*6, 'lstm': [np.nan]*6})
preds = preds.set_index(pd.date_range(start=date(2018, 11, 1), end=date(2019, 4, 1), freq='MS'))
# Fit models and plot
for mod in models:
    # Train models
    model = fit_model(type=mod, X_train=X_train, y_train=y_train, X_test=X_valid, y_test=y_valid, batch_size=16, epochs=1000)
    # Predict
    p = model.predict(x=X_test)
    # Fill
    preds[mod] = p[0]
# Plot the entire timeline, including the predicted segment
idx = pd.date_range(start=date(2004, 1, 1), end=date(2019, 4, 1), freq='MS')
data = data.reindex(idx)
plt.plot(data['y'], label='Google')
# Plot
plt.plot(preds['mlp'], label='MLP')
plt.plot(preds['cnn'], label='CNN')
plt.plot(preds['lstm'], label='LSTM')
plt.legend()
plt.show()
Here I tried evaluating the interest in the theme of the Holocaust, which is also periodic (with a peak in January; you can grab the CSV from the Google Trends site). Here are the results:
So the questions are:
How can I adapt this model to use every month available (at the time of writing, until August 2019)? When I try to do that, I get weird behaviour, so for now I have manually deleted everything after October 2018 from the CSV.
How can I improve these simple models so that they actually give useful and meaningful results? I wonder why the example in the blog post works perfectly while my attempts fail miserably.
Thanks in advance!
Increase the number of predictions you test and you should get better results.
window_y = 49
...
# Predictions
preds = pd.DataFrame({'mlp': [np.nan]*49, 'cnn': [np.nan]*49, 'lstm': [np.nan]*49})
preds = preds.set_index(pd.date_range(start=date(2015, 1, 1), end=date(2019, 1, 1), freq='MS'))
Playing with the training/test set will also help:
# Training and test
train = 50
X_train = X[:train]
y_train = y[:train]
X_valid = X[train:]
y_valid = y[train:]
However, this particular trend is periodic but also affected by other factors. Prophet can help you deal with this kind of trend better than a simple machine learning model.
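A minimal Prophet sketch, assuming the same CSV as above with a 'month' date column and a 'y' value column (the package name depends on your installation; older releases shipped as fbprophet):
import pandas as pd
from prophet import Prophet

data = pd.read_csv('data/holocaust-world.csv', sep=',')
df = pd.DataFrame({'ds': pd.to_datetime(data['month']), 'y': data['y']})

m = Prophet(yearly_seasonality=True)  # monthly data with a yearly cycle
m.fit(df)

future = m.make_future_dataframe(periods=6, freq='MS')  # 6 months ahead
forecast = m.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(6))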
