Time series prediction with Keras - values close to average

Time series prediction with Keras - values close to average - python

I am doing a one step ahead prediction using 15 previous samples on a dataset using LSTMs in Keras.
The data csv file can be found here:
(https://drive.google.com/file/d/0Byiipc0dArG0LVZJelB4NFBucms/view?usp=sharing)
The second column col[1] values are used. The values in the first column (timestamps) are not used at all.
I use the following code:
# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
dataX, dataY = [], []
for i in range(len(dataset)-look_back-1):
a = dataset[i:(i+look_back), 0]
dataX.append(a)
dataY.append(dataset[i + look_back, 0])
return numpy.array(dataX), numpy.array(dataY)
# fix random seed for reproducibility
numpy.random.seed(7)
# load the dataset
dataframe = pandas.read_csv('node70-3000.csv', usecols=[1],
engine='python', skipfooter=3)
dataset = dataframe.values
dataset = dataset.astype('float32')
# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = int(len(dataset) * 0.7)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:],
dataset[train_size:len(dataset),:]
# reshape into X=t and Y=t+1
look_back = 15
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = numpy.reshape(testX, (testX.shape[0], testX.shape[1], 1))
# create and fit the LSTM network
batch_size = 11
model = Sequential()
model.add(LSTM(32, batch_input_shape=(batch_size, look_back, 1),
stateful=True))
#model.add(LSTM(32, stateful = True))
model.add(Dense(32))
model.add(Dense(1))
# default lr=0.001
optim = Adam(lr=0.05, beta_1=0.9, beta_2=0.999, epsilon=1e-08,
decay=0.1)
model.compile(loss='mean_squared_error', optimizer=optim)
for i in range(50):
model.fit(trainX, trainY, nb_epoch=1, batch_size=batch_size,
verbose=2, shuffle=False)
model.reset_states()
# make predictions
The problem:
I have used this code to predict few more periodical and clean time series and it works well. For this set of data however, I used different parameters of Adam (learning rate, etc). Still, I get the prediction with a large offset from the actual values. It seems like the predicted values are always close to the average of the data values. Please see the following graph. I have 1850 data points in the csv file. These are grouped into sequences of size 15. The input is a sequence of size 15. The output should be the next value predicted. 70% of the data is used for training and the rest is the test data set. Prediction is done on train and test datasets shown in green and red respectively in the following image.
(https://drive.google.com/file/d/0Byiipc0dArG0OEN5el9lc0puNGM/view?usp=sharing)
Do you have any idea why this is happening and what can be causing it?
Thanks!

Your data looks a lot like binary data + noise.
Below is a histogram of the raw data and a histogram of the first difference of the raw data. If the exact value of the prediction is not important to you, I would suggest making the data binary and using a different cost function, e.g. binary crossentropy, though I'm skeptical that it will work giving the explanation below.
If there are similar sequences with multiple possible next values, the network will try to predict the average value. For example, consider the sequences (0,0,1) and (0,0,0) and (0,0,-1) and a network trying to learn their last values. With this toy data, any prediction model's best MSE is to predict the average of the last values, 0 in this case...
I suggest checking what the network is learning by plotting target values, Y, and predicted target values, Y_hat.
Hope this helps!

Related

Not able to get correct results even though mean absolute error is low

import pandas as pd
df=pd.read_csv('final sheet for project.csv')
features=['moisture','volatile matter','fixed carbon','calorific value','carbon %','oxygen%']
train_data=df[features]
target_data=df.pop('Activation energy')
X_train, X_test, y_train, y_test = train_test_split(train_data,target_data, test_size=0.09375, random_state=1)
standard_X_train=pd.DataFrame(StandardScaler().fit_transform(X_train))
standard_X_test=pd.DataFrame(StandardScaler().fit_transform(X_test))
y_train=y_train.values
y_train = y_train.reshape((-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(y_train)
normalized_y_train = scaler.transform(y_train)
y_test=y_test.values
y_test = y_test.reshape((-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(y_test)
normalized_y_test = scaler.transform(y_test)
model=keras.Sequential([layers.Dense(units=20,input_shape=[6,]),layers.Dense(units=1,activation='tanh')])
model.compile(
optimizer='adam',
loss='mae',
)
history = model.fit(standard_X_train,normalized_y_train, validation_data=(standard_X_test,normalized_y_test),epochs=200)
I wish to create a model to predict activation energy using some features . I am getting training loss: 0.0629 and val_loss: 0.4213.
But when I try to predict the activation energies of some other unseen data ,I get bizarre results. I am a beginner in ML.
Can someone please help what changes can be made in the code. ( I want to make a model with one hidden layer of 20 units that has activation function tanh.)

You should not use fit_transform for test data. You should use fit_transform for training data and apply just transform to test data, in order to use the same parameters for training data, on the test data.
So, the transformation part of your code should change like this:
scaler_x = StandardScaler()
standard_X_train = pd.DataFrame(scaler_x.fit_transform(X_train))
standard_X_test = pd.DataFrame(scaler_x.transform(X_test))
y_train=y_train.values
y_train = y_train.reshape((-1, 1))
y_test=y_test.values
y_test = y_test.reshape((-1, 1))
scaler_y = MinMaxScaler(feature_range=(0, 1))
normalized_y_train = scaler_y.fit_transform(y_train)
normalized_y_test = scaler_y.transform(y_test)
Furthermore, since you are scaling your data, you should do the same thing for any prediction. So, your prediction line should be something like:
preds = scaler_y.inverse_transform(
model.predict(scaler_x.transform(pred_input)) #if it is standard_X_test you don't need to transform again, since you already did it.
)
Additionally, since you are transforming your labels in range 0 and 1, you may need to change your last layer activation function to sigmoid instead of tanh, or even may better to use an activation function like relu in your first layer if you are still getting poor results after above modifications.
model=keras.Sequential([
layers.Dense(units=20,input_shape=[6,],activation='relu'),
layers.Dense(units=1,activation='sigmoid')
])

Why does shuffling sequences of data in tf.keras.dataset affect the order of sequences differently between tf.fit and tf.predict?

I am training an LSTM deep learning model with time series sequences and labels.
I generate a tensorflow dataset "train_data" and "test_data"
train_data = tf.keras.preprocessing.timeseries_dataset_from_array(
data=data,
targets=None,
sequence_length=total_window_size,
sequence_stride=1,
batch_size=batch_size,
shuffle=is_shuffle).map(split_window).prefetch(tf.data.AUTOTUNE)
I then train the model with the above datasets
model.fit(train_data, epochs=epochs, validation_data = test_data, callbacks=callbacks)
And then run predictions to obtain the predicted values
train_labels = np.concatenate([y for x, y in train_data], axis=0)
train_predictions = model.predict(train_data)
test_labels = np.concatenate([y for x, y in test_data], axis=0)
test_predictions = model.predict(test_data)
Here is my question: When I plot the train/test label data against the predicted values I get the following plot when I do not shuffle the sequences in the dataset building step:
Here the output with shuffling:
Question Why is this the case? I use the exact same source dataset for training and prediction. The dataset should be shuffled. Is there a chance that TensorFlow shuffles the data twice randomly, once during training and another time for predictions? I tried to supply a shuffle seed but that did not change things either.

The dataset gets shuffled everytime you iterate through it. What you get after your list comprehension isn't in the same order as when you write predict. If you don't want that, pass:
shuffle(buffer_size=BUFFER_SIZE, reshuffle_each_iteration=False)

How to re-frame the data-frame with multiple inputs for LSTM in Keras?

I am trying to predict the temperature for the given area (its integer number from 1 to 142) for the given date and time.
The problem is that I have CSV with the following columns:
DateTime,AreaID,Temperature
How to reframe the data-frame for LSTM (Apologise as I am a new bee for the LSTM)?
For the information, I have data for two months with a measured by the period of every 5 minutes.
I have coded LSTM for Input DateTime. But I want to include AreaID too. to predict Temperature.
The dataset created for the Training and Testing sets are using the following code block:
dataset = dataset.temperature.values #numpy.ndarray
dataset = dataset.astype('float32')
dataset = np.reshape(dataset, (-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
train_size = int(len(dataset) * 0.80)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
def create_dataset(dataset, look_back=1):
X, Y = [], []
for i in range(len(dataset)-look_back-1):
a = dataset[i:(i+look_back), 0]
X.append(a)
Y.append(dataset[i + look_back, 0])
return np.array(X), np.array(Y)
look_back = 30
X_train, Y_train = create_dataset(train, look_back)
X_test, Y_test = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))
Before this, The sample code have sorted the data frame based on DateTime like:
dataset.sort_values('timestamp', inplace=True, ascending=True)
I want to change LSTM to take two inputs
1. DateTime
2. AreaID
& One Output :
1. Temperature
How to code LSTM for this requirements? (Please help me I am a new bee in the area of neural network)

Just for hint.
You prepare new dataset into x_train and y_train
Take an starting 60 days to train the model and predict 61th days thats my logic
X_train=[]
y_train=[]
count=0
for i in range(60,train.shape[0]):
count=count+1
X_train.append(df[i-60:i])
y_train.append(train['targetcol'][i])

Keras Prediction after test values

I am currently trying to build neuronal network to be able to predict time series, but the question is, is it possible to predict further than just the test dataset. I mean, for my example, I have a dataset of about 3000 values, from which I keep 90% for training and 10% for testing. Then When I compare the prediction with the actual test value, it corresponds, but is it possible for instance to ask the program to predict the next 500 values (i.e. from 3001 to 3500) ?
Here is a snipper of the code I use.
import csv
import numpy as np
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM, GRU
from keras.models import Sequential
from keras import optimizers
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve
from sklearn.kernel_ridge import KernelRidge
import time
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (-1, 1))
def load_data(datasetname, column, seq_len, normalise_window):
# A support function to help prepare datasets for an RNN/LSTM/GRU
data = datasetname.loc[:,column]
sequence_length = seq_len + 1
result = []
for index in range(len(data) - sequence_length):
result.append(data[index: index + sequence_length])
result = np.array(result)
result.reshape(-1,1)
training_set_scaled = sc.fit_transform(result)
print (result)
#Last 10% is used for validation test, first 90% for training
row = round(0.9 * training_set_scaled.shape[0])
train = training_set_scaled[:int(row), :]
#np.random.shuffle(train)
x_train = train[:, :-1]
y_train = train[:, -1]
X_test = training_set_scaled[int(row):, :-1]
y_test = training_set_scaled[int(row):, -1]
print ("shape train", x_train)
print ("shape train", X_test)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
return [x_train, X_test, y_train, y_test]
def build_model():
model = Sequential()
layers = {'input': 100, 'hidden1': 150, 'hidden2': 256, 'hidden3': 100, 'output': 10}
model.add(LSTM(
50,
return_sequences=True,
input_shape=(200,1)
))
model.add(Dropout(0.2))
model.add(LSTM(
layers['hidden2'],
return_sequences=True,
))
model.add(Dropout(0.2))
model.add(LSTM(
layers['hidden3'],
return_sequences=False,
))
model.add(Dropout(0.2))
model.add(Activation("linear"))
model.add(Dense(
output_dim=layers['output']))
start = time.time()
model.compile(loss="mean_squared_error", optimizer="adam")
print ("Compilation Time : ", time.time() - start)
return model
dataset = pd.read_csv(
'data.csv')
X_train, X_test, y_train, y_test = load_data(dataset, 'mean anomaly', 200, False)
model = build_model()
print ("train",X_train)
print ("test",X_test)
model.fit(X_train, y_train, batch_size=256, epochs=1, validation_split=0.05)
predictions = model.predict(X_test)
predictions = np.reshape(predictions, (predictions.size,))
plt.figure(1)
plt.subplot(311)
plt.title("Actual Test Signal w/Anomalies & noise")
plt.plot(y_test)
plt.subplot(312)
plt.title("predicted signal")
plt.plot(predictions, 'g')
plt.subplot(313)
plt.title("training signal")
plt.plot(y_train, 'b')
plt.plot(y_test, 'y')
plt.legend(['train', 'test'])
plt.show()
I have read that I should increase the output dim of the dense layer to get more than 1 predicted value, or increase the size of my window in the load data function ?
Here is the result, the yellow plot is supposed to be after the blue one, it respresents my input test data, the first plot is a zoom on this data and the second one the prediction.

If you want to predict the output value of your serie at t+x based on data at time t, the data you need to feed to the network should already have this format.
Time series data formating :
If you have 3000 data point and want to predict the output value for the next "virtual" 500 point you should offset the output value by this amount. For exemple :
In your dataset, your 500th data point correspond to the 500th output value. If you want to predict "future" values then the 500th data point should have the 1000th output value. You can do this in pandas with the shift function. Be aware that you will loose the last 500 data point by doing so, has they will no longer have an output value.
Then when you predict on data point xi you'll have the output value yi+500. You should find some basic guides for time serie forecasting on sites like machinelearningmastery
Good pratice for model evaluation :
If you want to better evaluate the quality of your model, first find some metrics that suits your problem and try to increase test set percenatage. While graphics are a good way to visualise result, they can be deceiving, try combining them with some metrics ! (be carefull with Mean Squarred Error, it can give you a biased score with value in the range [-1;1] as the square of an error in this range will always be less than the acutal error, try Mean Absolute Error instead)
Data leakage when scalling data :
While scalling data is usually a good thing you need to be carefull doing so. You comited something called a data leak. You used scalling on the whole data set before splitting into training and test set. Further reading about this data leak.
Update
I think i misunderstood your problem.
If you want to "predict further than just the test dataset" you will need some unseen/new data to make more prediction. The test set is only made to evaluate the performance of the learning phase.
Now if you want to predict further than just the next step (this won't allow you to "predict further than just the test dataset" because of the way you change your dataset, see bellow) :
Your model as it's made will only ever predict the next step.
In your example you feed to the algorithm series of lenght 'seq_len' and give them as output the value right after the end of those series. If you want your algorithm to learn to predict in more than one step into the future you y_train must have value at the corresponding time in the future, example :
x = [0,1,2,3,4,5,6,7,8,9,10,...]
seq_len = 5
step_to_predict = 5
So to predict not one step into the future but five, your series will have to look like this :
x_serie_1 = [0,1,2,3,4]
y_serie_1 = [9]
x_serie_2 = [1,2,3,4,5]
y_serie_2 = [10]
This is a way to get your model to learn how to make predictions further into the future than just the next step.

Why can model not even predict sine

I am trying to generate a learned timeseries with an LSTM RNN using Keras, so I want to predict a datapoint, and feed it back in as input to predict the next one and so on, so that I can actually generate the timeseries (for example given 2000 datapoints, predict the next 2000)
I am trying it like this, but the Test score RMSE is 1.28 and the prediction is basically a straight line
# LSTM for international airline passengers problem with regression framing
import numpy
import matplotlib.pyplot as plt
from pandas import read_csv
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
dataX, dataY = [], []
for i in range(len(dataset)-look_back-1):
a = dataset[i:(i+look_back), 0]
dataX.append(a)
dataY.append(dataset[i + look_back, 0])
return numpy.array(dataX), numpy.array(dataY)
# fix random seed for reproducibility
numpy.random.seed(7)
# load the dataset
dataset = np.sin(np.linspace(0,35,10000)).reshape(-1,1)
print(type(dataset))
print(dataset.shape)
dataset = dataset.astype('float32')
# normalize the dataset
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
# split into train and test sets
train_size = int(len(dataset) * 0.5)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
# reshape into X=t and Y=t+1
look_back = 1
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
# create and fit the LSTM network
model = Sequential()
model.add(LSTM(16, input_shape=(1, look_back)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=10, batch_size=1, verbose=2)
# make predictions
trainPredict = model.predict(trainX)
testPredict = list()
prediction = model.predict(testX[0].reshape(1,1,1))
for i in range(trainX.shape[0]):
prediction = model.predict(prediction.reshape(1,1,1))
testPredict.append(prediction)
testPredict = np.array(testPredict).reshape(-1,1)
# invert predictions
trainPredict = scaler.inverse_transform(trainPredict)
trainY = scaler.inverse_transform([trainY])
testPredict = scaler.inverse_transform(testPredict)
testY = scaler.inverse_transform([testY])
# calculate root mean squared error
trainScore = math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(testY[0], testPredict[:,0]))
print('Test Score: %.2f RMSE' % (testScore))
# shift train predictions for plotting
trainPredictPlot = numpy.empty_like(dataset)
trainPredictPlot[:, :] = numpy.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
# shift test predictions for plotting
testPredictPlot = numpy.empty_like(dataset)
testPredictPlot[:, :] = numpy.nan
testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict
# plot baseline and predictions
plt.plot(scaler.inverse_transform(dataset))
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()
What am I doing wrong?

I see multiple issues with your code. Your value for look_back is 1, which means the LSTM sees only one Sample at a time, which is obviously not sufficient to learn anything about the sequence.
You probably did this so that you can make the final prediction at the end by feeding the prediction from the previous step as the new input. To correct way to make this work is to train with more timesteps and then change to network to a stateful LSTM with a single timestep.
Also, when you do the final prediction you have to show the network more than one ground truth sample. Otherwise the position on the sine is ambigious. (is it going up or down in the next step?)
I slapped together q quick example. Here is how I generated the data:
import numpy as np
numSamples = 1000
numTimesteps = 50
width = np.pi/2.0
def getRandomSine(numSamples = 100, width = np.pi):
return np.sin(np.linspace(0,width,numSamples) + (np.random.rand()*np.pi*2))
trainX = np.stack([getRandomSine(numSamples = numTimesteps+1) for _ in range(numSamples)])
valX = np.stack([getRandomSine(numSamples = numTimesteps+1) for _ in range(numSamples)])
trainX = trainX.reshape((numSamples,numTimesteps+1,1))
valX = valX.reshape((numSamples,numTimesteps+1,1))
trainY = trainX[:,1:,:]
trainX = trainX[:,:-1,:]
valY = valX[:,1:,:]
valX = valX[:,:-1,:]
Here I trained the model:
import keras
from keras.models import Sequential
from keras import layers
model = Sequential()
model.add(layers.recurrent.LSTM(32,return_sequences=True,input_shape=(numTimesteps, 1)))
model.add(layers.recurrent.LSTM(32,return_sequences=True))
model.add(layers.wrappers.TimeDistributed(layers.Dense(1,input_shape=(1,10))))
model.compile(loss='mean_squared_error',
optimizer='adam')
model.summary()
model.fit(trainX, trainY, nb_epoch=50, validation_data=(valX, valY), batch_size=32)
And here I changed the trained model to allow the continues prediction:
# serialize the model and get its weights, for quick re-building
config = model.get_config()
weights = model.get_weights()
config[0]['config']['batch_input_shape'] = (1, 1, 1)
config[0]['config']['stateful'] = True
config[1]['config']['stateful'] = True
from keras.models import model_from_config
new_model = Sequential().from_config(config)
new_model.set_weights(weights)
#create test sine
testX = getRandomSine(numSamples = numTimesteps*10, width = width*10)
new_model.reset_states()
testPredictions = []
# burn in
for i in range(numTimesteps):
prediction = new_model.predict(np.array([[[testX[i]]]]))
testPredictions.append(prediction[0,0,0])
# prediction
for i in range(numTimesteps, len(testX)):
prediction = new_model.predict(prediction)
testPredictions.append(prediction[0,0,0])
# plot result
import matplotlib.pyplot as plt
plt.plot(np.stack([testPredictions,testX]).T)
plt.show()
Here is what the result looks like. The prediction errors add up and very quickly it diverges from the input sine. But it clearly learned the general shape of sines. You can now try to improve on this by trying different layers, activation functions etc.

I was working a bit on a different architecture and uploaded it on github.
So for all people who are looking into predicting a time series point by point, I hope this helps.
The results look like this:

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.