Keras Recurrent Neural Networks For Multivariate Time Series - python

I have been reading about Keras RNN models (LSTMs and GRUs), and authors seem to largely focus on language data or univariate time series that use training instances composed of previous time steps. The data I have is a bit different.
I have 20 variables measured every year for 10 years for 100,000 persons as input data, and the 20 variables measured for year 11 as output data. What I would like to do is predict the value of one of the variables (not the other 19) for the 11th year.
I have my data structured as X.shape = [persons, years, variables] = [100000, 10, 20] and Y.shape = [persons, variable] = [100000, 1]. Below is my Python code for an LSTM model.
## LSTM model.
from keras import models
from keras import layers

# Define model.
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(128, activation = 'tanh',
                             input_shape = (X.shape[1], X.shape[2])))
network_lstm.add(layers.Dense(1, activation = None))
# Compile model.
network_lstm.compile(optimizer = 'adam', loss = 'mean_squared_error')
# Fit model.
history_lstm = network_lstm.fit(X, Y, epochs = 25, batch_size = 128)
I have four (related) questions, please:

1. Have I coded the Keras model correctly for the data structure I have? The performance I get from a fully-connected network (using flattened data) and from LSTM, GRU, and 1D CNN models is nearly identical, and I don't know whether I have made an error in Keras or whether a recurrent model is simply not helpful in this case.
2. Should I have Y as a series with shape Y.shape = [persons, years] = [100000, 11], rather than including the variable in X, which would then have shape X.shape = [persons, years, variables] = [100000, 10, 19]? If so, how can I get the RNN to output the predicted sequence? When I use return_sequences = True, Keras returns an error.
3. Is this the best way to predict with the data I have? Are there better option choices available in the Keras RNN models, or even other models?
4. How could I simulate data resembling the data structure I have so that an RNN model would outperform a fully-connected network?
UPDATE:
I have tried a simulation, with what I hope is a very simple case where an RNN should be expected to outperform an FNN.
While the LSTM tends to outperform the FNN when both have few hidden nodes (4), the performance becomes identical with more hidden nodes (8+). Can anyone think of a better simulation where an RNN would be expected to outperform an FNN with a similar data structure?
from keras import models
from keras import layers
from keras.layers import Dense, LSTM
import numpy as np
import matplotlib.pyplot as plt
The code below simulates data for 10,000 instances, 10 time steps, and 2 variables. If the second variable has a 0 in the very first time step, then Y is the value of the first variable for the very last time step multiplied by 3. If the second variable has a 1 in the very first time step, then Y is the value of the first variable for the very last time step multiplied by 9.
My hope was that the RNN would keep the value of the second variable at the very first time step in memory and use it to know which value (3 or 9) to multiply the first variable at the very last time step by.
## Simulate data.
instances = 10000
sequences = 10
X = np.zeros((instances, sequences * 2))
X[:int(instances / 2), 1] = 1
for i in range(instances):
    for j in range(0, sequences * 2, 2):
        X[i, j] = np.random.random()
Y = np.zeros((instances, 1))
for i in range(len(Y)):
    if X[i, 1] == 0:
        Y[i] = X[i, -2] * 3
    if X[i, 1] == 1:
        Y[i] = X[i, -2] * 9
Below is code for an FNN:
## Densely connected model.
# Define model.
network_dense = models.Sequential()
network_dense.add(layers.Dense(4, activation = 'relu',
                               input_shape = (X.shape[1],)))
network_dense.add(Dense(1, activation = None))
# Compile model.
network_dense.compile(optimizer = 'rmsprop', loss = 'mean_absolute_error')
# Fit model.
history_dense = network_dense.fit(X, Y, epochs = 100, batch_size = 256, verbose = False)
plt.scatter(Y[X[:, 1] == 0, :], network_dense.predict(X[X[:, 1] == 0, :]), alpha = 0.1)
plt.plot([0, 3], [0, 3], color = 'black', linewidth = 2)
plt.title('FNN, Second Variable has a 0 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
plt.scatter(Y[X[:, 1] == 1, :], network_dense.predict(X[X[:, 1] == 1, :]), alpha = 0.1)
plt.plot([0, 9], [0, 9], color = 'black', linewidth = 2)
plt.title('FNN, Second Variable has a 1 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
Below is code for an LSTM:
## Structure X data for LSTM.
X_lstm = X.reshape(X.shape[0], X.shape[1] // 2, 2)
X_lstm.shape
## LSTM model.
# Define model.
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(4, activation = 'relu',
                             input_shape = (X_lstm.shape[1], 2)))
network_lstm.add(layers.Dense(1, activation = None))
# Compile model.
network_lstm.compile(optimizer = 'rmsprop', loss = 'mean_squared_error')
# Fit model.
history_lstm = network_lstm.fit(X_lstm, Y, epochs = 100, batch_size = 256, verbose = False)
plt.scatter(Y[X[:, 1] == 0, :], network_lstm.predict(X_lstm[X[:, 1] == 0, :]), alpha = 0.1)
plt.plot([0, 3], [0, 3], color = 'black', linewidth = 2)
plt.title('LSTM, Second Variable has a 0 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
plt.scatter(Y[X[:, 1] == 1, :], network_lstm.predict(X_lstm[X[:, 1] == 1, :]), alpha = 0.1)
plt.plot([0, 9], [0, 9], color = 'black', linewidth = 2)
plt.title('LSTM, Second Variable has a 1 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

Yes, the code you used is correct for what you are trying to do. Ten years is the time window used to predict the following year, so that should be the number of time steps fed into your model for each of the 20 variables. The sample size of 100,000 observations is not relevant to the input shape of your model.
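A quick, optional sanity check (a sketch only, reusing X, Y, and network_lstm from your question) is to print the shapes the model actually sees:

# Sanity check of shapes: the sample dimension (100,000) is implicit (None).
print(X.shape)                     # expected (100000, 10, 20)
print(Y.shape)                     # expected (100000, 1)
print(network_lstm.input_shape)    # expected (None, 10, 20)
print(network_lstm.output_shape)   # expected (None, 1)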
The way you originally shaped the dependent variable Y is correct. You are predicting a window of 1 year for 1 variable, and you have 100,000 observations. The keyword argument return_sequences=True throws an error because you only have a single LSTM layer. Set this parameter to True only when you are stacking multiple LSTM layers and the layer in question is followed by another LSTM layer.
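For reference, a minimal sketch of a stacked version (not something your data necessarily needs, and the 64-unit second layer is an arbitrary choice), reusing the models/layers imports from your question:

# Sketch: stacked LSTMs; only the layer that feeds another recurrent layer
# sets return_sequences=True, the last recurrent layer returns a single vector.
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(128, activation = 'tanh', return_sequences = True,
                             input_shape = (X.shape[1], X.shape[2])))
network_lstm.add(layers.LSTM(64, activation = 'tanh'))
network_lstm.add(layers.Dense(1, activation = None))
network_lstm.compile(optimizer = 'adam', loss = 'mean_squared_error')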
I wish I could offer some guidance on question 3, but without your actual dataset I don't know whether it's possible to answer this with any certainty.
I will say that LSTMs were designed to address what is known as the long-term dependency problem present in regular RNNs. What this problem boils down to is that as the gap grows between when the relevant information was observed and the point where that information becomes useful, the standard RNN has a harder time learning the relationship between them. Think of predicting a stock price based on 3 days of activity versus an entire year.
This leads into question 4. If I use the term 'resembling' loosely and stretch your time window further out, to say 50 years instead of 10, the advantages gained from using an LSTM would become more apparent. Although I'm sure that someone more experienced will be able to offer a better answer, and I look forward to seeing it.
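As a rough sketch only, building on the simulation code in your update, stretching the window looks like this (the rule that generates Y is unchanged; only the gap between the flag at the first time step and the value it modifies grows):

# Sketch: same simulation as in the update, but with a 50-step window so the
# flag observed at the first time step must be remembered much longer.
instances = 10000
sequences = 50                      # instead of 10
X = np.zeros((instances, sequences * 2))
X[:instances // 2, 1] = 1
for i in range(instances):
    for j in range(0, sequences * 2, 2):
        X[i, j] = np.random.random()
Y = np.zeros((instances, 1))
for i in range(len(Y)):
    Y[i] = X[i, -2] * (3 if X[i, 1] == 0 else 9)
X_lstm = X.reshape(X.shape[0], sequences, 2)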
I found this page helpful for understanding LSTMs:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Related

Deep Neural Network to learn a function f(x) = x^2 [duplicate]

Currently, I have a function f(x) = x^2.
I have a dataset, whose feature is x, and the corresponding label is x^2.
I would like my machine learning model to somewhat accurately predict new values.
For example, the prediction of 300 should be close to 300*300 = 90000
In my code, I first create my training data features and labels, which look like
features: [0, 1, 2, ... 999]
labels: [0, 1, 4, ... 999*999]
import tensorflow as tf
import numpy as np
import logging
import matplotlib.pyplot as plt
logger = tf.get_logger()
logger.setLevel(logging.ERROR)
val = np.empty([1000], dtype = float)
val_squared = np.empty([1000], dtype = float)
# Create training data
for i in range(1000):
    val[i] = i
    val_squared[i] = i * i
#Create layers of Deep Neural Network
l0 = tf.keras.layers.Dense(units = 500,input_shape=[1])
l1 = tf.keras.layers.Dense(units = 500, activation = 'sigmoid')
l2 = tf.keras.layers.Dense(units = 500, activation = 'sigmoid')
l3 = tf.keras.layers.Dense(units = 1)
model = tf.keras.Sequential([l0, l1, l2, l3])
model.compile(loss='mean_squared_error', optimizer = tf.keras.optimizers.Adam(lr=10))
history = model.fit(val,val_squared,epochs = 500, verbose = False, batch_size = 500)
plt.xlabel('Epoch Number')
plt.ylabel("Loss Magnitude")
plt.plot(history.history['loss'])
print("Prediction of 200: {}".format(model.predict([200.0])))
plt.show()
When the graph is plotted, we can see that the loss converges, which is a sign that the model is learning. However, the actual prediction is very different from our expected value - 332823.16 as opposed to 40000.
The plotted graph can be seen here: https://imgur.com/a/GJMSrbV
I have tried changing the activation function to relu and tanh, and tweaked hyperparameters to make sure the loss converged, but to no effect. Are there any other ways I can improve the neural network's performance?
Your loss graph is showing an error of around 0.8e11, about 80 billion, which is a very large loss, equivalent to an error of around 300,000 in your predictions.
The reason is probably that your learning rate is 10, which is very high (tf.keras.optimizers.Adam(lr=10)). Normally with Adam one uses a learning rate of around 1e-3 (0.001) or 1e-4 (0.0001).
A couple more points: you shouldn't even need a multi-layer model to solve y = x^2; try a model with a single hidden layer of, say, 500 nodes to start. Smaller models converge faster.
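A minimal sketch of both suggestions (single hidden layer, conventional Adam learning rate); the layer width and epoch count here are just illustrative choices:

# Sketch: one hidden layer and Adam with a standard learning rate.
# Scaling the inputs/targets (e.g. dividing val by 1000) often helps further.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(500, activation='relu', input_shape=[1]),
    tf.keras.layers.Dense(1)
])
model.compile(loss='mean_squared_error',
              optimizer=tf.keras.optimizers.Adam(1e-3))
model.fit(val, val_squared, epochs=500, verbose=False, batch_size=500)
print(model.predict([200.0]))   # ideally much closer to 40000 than 332823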

Tensorflow / Keras predict function output length does not match input length

I am using Ubuntu 19.04 (Disco Dingo), Python 3.7.3, and TensorFlow 1.14.0.
I noticed that the number of outputs given by the tensorflow.keras.Sequential.predict function is different than the number of inputs. Furthermore, it appears that there is no relation between the inputs and outputs.
Example:
import tensorflow as tf
import math
import numpy as np
import json
# We will train the model to recognize an XOR
x = [ [0,0], [0,1], [1,0], [1,1] ]
y = [ 0, 1, 1, 0 ]
xt = tf.cast(x, tf.float64)
yt = tf.cast(y, tf.float64)
# This model should be more than enough to learn an XOR
L0 = tf.keras.layers.Dense(2)
L1 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
L2 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
L3 = tf.keras.layers.Dense(2, activation=tf.nn.softmax)
model = tf.keras.Sequential([L0,L1,L2,L3])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
model.fit(
    x=xt,
    y=yt,
    batch_size=32,
    epochs=1000,  # Try to overfit data
    shuffle=False,
    steps_per_epoch=math.ceil(len(x)/32)
)
# While it is training, the loss drops to near zero
# and the accuracy goes to 100%.
# The large number of epochs and the small number of training examples
# should mean that the network is overtrained.
print("testing")
for i in range(len(y)):
    m = tf.cast([x[i]], tf.float64)
    # m should be the ith training example
    values = model.predict(m, steps=1)
    best = np.argmax(values[0])
    print(x[i], y[i], best)
The output I always get is:
(input, correct answer, predicted answer)
[0, 0] 0 0
[0, 1] 1 0
[1, 0] 1 0
[1, 1] 0 0
or
[0, 0] 0 1
[0, 1] 1 1
[1, 0] 1 1
[1, 1] 0 1
So, even though I thought that the network would be overtrained, even though the program said that the accuracy was 100% and the loss was virtually zero, the output looks as though the network hadn't trained at all.
Stranger yet is when I replace the testing section with the following:
print("testing")
m = tf.cast([], tf.float64)
values = model.predict(m, steps=1)
print(values)
I would think that this would return an empty array or throw an exception. Instead it gives:
[[0.9979249 0.00207507]
[0.10981816 0.89018184]
[0.10981816 0.89018184]
[0.9932179 0.0067821 ]]
This corresponds to [0,1,1,0]
So even though it was given nothing to predict on, it still gives out predictions for something. And it appears as though the predictions match up with what we would expect from sending the entire training set into the predict method.
Replacing the testing section again:
print("testing")
m = tf.cast([[0,0]], tf.float64)
# [0,0] is the first training example
# the output should be something close to [[1.0,0.0]]
values = model.predict(m, steps=1)
for j in range(len(values)):
    print(values[j])
exit()
I get:
[0.9112452 0.08875483]
[0.00552484 0.9944752 ]
[0.00555605 0.99444395]
[0.9112452 0.08875483]
This corresponds to [0,1,1,0]
So asking it to predict on zero inputs, gives out 4 predictions. Asking it to predict on one input gives out 4 predictions. Furthermore, the predictions it gives out looks like what we would expect if we put the entire training set into the predict function.
Any ideas as to what's going on? How do I get my network to give exactly one prediction for each input given?
Providing the solution here (answer section), even though it is present in the comments, for the benefit of the community.
Upgrading TensorFlow from 1.14.0 to >= 2.0 resolved the issue.
After upgrading, the test section works as expected:
m = tf.cast([[0,0]], tf.float64)
# [0,0] is the first training example
# the output should be something close to [[1.0,0.0]]
values = model.predict(m, steps=1)
for j in range(len(values)):
    print(values[j])
exit()
Output:
[0.9921625 0.00783745]
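As a hedged sketch of the fixed behaviour under TF >= 2.0: passing a plain NumPy array and omitting steps, predict returns exactly one row per input row:

# Sketch (TF >= 2.x): one prediction row per input row.
import numpy as np
m = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
values = model.predict(m)
print(values.shape)               # expected: (4, 2)
print(np.argmax(values, axis=1))  # expected: [0 1 1 0] for a well-trained model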

LSTM giving same prediction for numerical data

I created an LSTM model for intraday stock predictions. The training data has shape (290, 4). I did all the preprocessing: normalizing the data, taking differences, and using a window size of 4.
This is a sample of my input data.
X = array([[0, 0, 0, 0],
[array([ 0.19]), 0, 0, 0],
[array([-0.35]), array([ 0.19]), 0, 0],
...,
[array([ 0.11]), array([-0.02]), array([-0.13]), array([-0.09])],
[array([-0.02]), array([ 0.11]), array([-0.02]), array([-0.13])],
[array([ 0.07]), array([-0.02]), array([ 0.11]), array([-0.02])]], dtype=object)
y = array([[array([ 0.19])],
[array([-0.35])],
[array([-0.025])],
.....,
[array([-0.02])],
[array([ 0.07])],
[array([-0.04])]], dtype=object)
Note: I am giving as well as predicting the difference value. So input value is between range (-0.5,0.5)
Here is my Keras LSTM model:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

dim_in = 4
dim_out = 1
model = Sequential()
model.add(LSTM(input_shape=(1, dim_in),
               return_sequences=True,
               units=6))
model.add(Dropout(0.2))
model.add(LSTM(batch_input_shape=(1, features.shape[1], features.shape[2]),
               return_sequences=False, units=6))
model.add(Dropout(0.3))
model.add(Dense(activation='linear', units=dim_out))
model.compile(loss = 'mse', optimizer = 'rmsprop')
for i in range(300):
    # print("Completed :", i + 1, "/", 300, "Steps")
    model.fit(X, y, epochs=1, batch_size=1, verbose=2, shuffle=False)
    model.reset_states()
I am feeding the last sequence value of shape=(1,4) and predict the output.
This is my prediction :
base_value = df.iloc[290]['Close']
prediction = []
orig_pred = []
input_data = np.copy(test[0, :])
input_data = input_data.reshape(len(input_data), 1)
for i in range(100):
    inp = input_data[i:, :]
    inp = inp.reshape(1, 1, inp.shape[0])
    y = model.predict(inp)
    orig_pred.append(y[0][0])
    input_data = np.insert(input_data, [i + 4], y[0][0], axis=0)
    base_value = base_value + y
    prediction.append(base_value[0][0])
sqrt(mean_squared_error(test_output, orig_pred))
RMSE = 0.10592485833344527
Here is the prediction visualization along with the stock price prediction (figures omitted):
fig. 1: the LSTM prediction (differences)
fig. 2: the stock price prediction
I am not sure why it predicts the same output value after 10 iterations. Maybe it is the vanishing gradient problem, or I am feeding too little input data (about 290 samples), or there is a problem in the model architecture; I am not sure.
Please help me understand how to get a reasonable result.
Thank you !!!
I don't work with Keras, but looking through your code and plots it seems the complexity of your network might not be high enough to fit the data. Try enlarging the network with more units, and also try larger window sizes.
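As a sketch of that suggestion only (the window length of 16 and the layer sizes are arbitrary choices, and the data would need to be re-windowed so each past difference is its own time step):

# Sketch only: a somewhat larger LSTM over a longer window of differences.
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

window = 16  # hypothetical; the original used 4
model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(window, 1)))
model.add(Dropout(0.2))
model.add(LSTM(32))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='rmsprop')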
Your regressor is minimizing the cost function by simply replicating the feature you give as input. For example, if you have a BTC closing value of $6340 at time t, it will predict that, or something close to it, at t+1. Make sure you are not giving the regressor a direct numerical hint of what the predicted label might be, especially when working with time-series data.

Tensorflow ReLu doesn't work?

I have written a convolutional network in TensorFlow with ReLU as the activation function; however, it is not learning (the loss is constant for both the eval and train datasets).
For different activation functions everything works as it should.
Here is code where the nn is created:
def _create_nn(self):
    current = tf.layers.conv2d(self.input, 20, 3, activation=self.activation)
    current = tf.layers.max_pooling2d(current, 2, 2)
    current = tf.layers.conv2d(current, 24, 3, activation=self.activation)
    current = tf.layers.conv2d(current, 24, 3, activation=self.activation)
    current = tf.layers.max_pooling2d(current, 2, 2)
    self.descriptor = current = tf.layers.conv2d(current, 32, 5, activation=self.activation)
    if not self.drop_conv:
        current = tf.layers.conv2d(current, self.layer_7_filters_n, 3, activation=self.activation)
    if self.add_conv:
        current = tf.layers.conv2d(current, 48, 2, activation=self.activation)
    self.descriptor = current
    last_conv_output_shape = current.get_shape().as_list()
    self.descr_size = last_conv_output_shape[1] * last_conv_output_shape[2] * last_conv_output_shape[3]
    current = tf.layers.dense(tf.reshape(current, [-1, self.descr_size]), 100, activation=self.activation)
    current = tf.layers.dense(current, 50, activation=self.last_activation)
    return current
self.activation is set to tf.nn.relu and self.last_activation is set to tf.nn.softmax.
The loss function and optimizer are created here:
self._nn = self._create_nn()
self._loss_function = tf.reduce_sum(tf.squared_difference(self._nn, self.Y), 1)
optimizer = tf.train.AdamOptimizer()
self._train_op = optimizer.minimize(self._loss_function)
I tried changing the variable initialization by passing tf.random_normal_initializer(0.1, 0.1) as the initializer, however it did not result in any change in the loss.
I would be grateful for help in making this neural network work with ReLu.
Edit: Leaky ReLU has the same problem.
Edit: A small example where I managed to reproduce the same error:
x = tf.constant([[3., 211., 123., 78.]])
v = tf.Variable([0.5, 0.5, 0.5, 0.5])
h_d = tf.layers.Dense(4, activation=tf.nn.leaky_relu)
h = h_d(x)
y_d = tf.layers.Dense(4, activation=tf.nn.softmax)
y = y_d(h)
d = tf.constant([[.5, .5, 0, 0]])
The gradients (as calculated with tf.gradients) for the h_d and y_d kernels and biases are either equal to 0 or very close to 0.
In an improbable case, all activations in some layer can be negative for all samples. ReLU sets them to zero, and there is no learning progress because the gradient is zero in the negative part of ReLU.
Things that make this more probable are a small dataset, weird scaling of the input features, inappropriate weight initialization, and/or few channels in the intermediate layers.
Here you use random_normal_initializer with mean=0.1, so maybe your inputs are all negative and thus get mapped to negative values. Try mean=0, or rescale the input features.
You can also try a leaky ReLU. Also, maybe the learning rate is too small or too large.
Looks like the problem was with the scale of the input data. With values between 0 and 255, that scale was more or less kept in the following layers, giving pre-activation outputs of the last layer with differences large enough to shrink the softmax gradient to (almost) 0.
It was observable only with ReLU-like activation functions because others, such as sigmoid or softsign, kept the value ranges in the network smaller, with an order of magnitude of 1 instead of tens or hundreds.
The solution was simply to rescale the input to 0-1, in the case of bytes by multiplying by 1/255.
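In code, the fix is a single rescaling step before the first layer; a minimal sketch (the placeholder shape is only an example):

# Sketch: bring byte-valued pixels from [0, 255] down to [0, 1].
images = tf.placeholder(tf.float32, shape=[None, 64, 64, 3])  # example shape
scaled_images = images / 255.0
# feed scaled_images (not images) into the network's first conv layer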

How can I make TensorFlow RNN training more robust?

I am training an RNN on a time series. I subclassed RNNCell and I use it in dynamic_rnn. The topology of the RNNCell is as follows:
input (shape [15, 100, 3])
1x3 convolution (5 kernels), ReLu (shape [15, 98, 5])
1x(remaining) convolution (20 kernels), ReLu (shape [15, 1, 20])
concatenate previous output (shape [15, 1, 21])
squeeze and 1x1 convolution (1 kernel), ReLu (shape [15, 1])
squeeze and softmax (shape [15])
The batch size for dynamic_rnn is around 100 (not the same 100 of the description above, that's the number of time periods in a window of data). Epochs are made of about 200 batches.
I would like to experiment with hyperparameters and regularization, but too often what I try stops the learning entirely and I don't understand why. These are some of the weird things that happen:
Adagrad works, but if I use Adam or Nadam the gradients are all zero.
I am forced to set a huge learning rate (~1.0) to see learning from epoch to epoch.
If I try to add dropout after any of the convolutions, even if I set keep_prob to 1.0 it stops learning.
If I tweak the number of kernels in the convolutions, for some choices that would seem just as good (e.g. 5, 25, 1 vs 5, 20, 1) again the network stops learning entirely.
Why is this model so fragile? Is it the topology of the RNNCell?
EDIT:
This is the code of the RNNCell:
class RNNCell(tf.nn.rnn_cell.RNNCell):
    def __init__(self):
        super(RNNCell, self).__init__()
        self._output_size = 15
        self._state_size = 15

    def __call__(self, X, prev_state):
        network = X
        # ------ 2 convolutional layers ------
        network = tflearn.layers.conv_2d(network, 5, [1, 3], activation='relu', weights_init=tflearn.initializations.variance_scaling(), padding="valid", regularizer=None)
        width = network.get_shape()[2]
        network = tflearn.layers.conv_2d(network, 20, [1, width], [1, 1], activation='relu', weights_init=tflearn.initializations.variance_scaling(), padding="valid", regularizer=None)
        # ------ concatenate the previous state ------
        _, height, width, features = network.get_shape()
        network = tf.reshape(network, [-1, int(height), 1, int(width * features)])
        network = tf.concat([network, prev_state[..., None, None]], axis=3)
        # ------ last convolution and softmax ------
        network = tflearn.layers.conv_2d(network, 1, [1, 1], activation='relu', weights_init=tflearn.initializations.variance_scaling(), padding="valid", regularizer=None)
        network = network[:, :, 0, 0]
        predictions = tflearn.layers.core.activation(network, activation="softmax")
        return predictions, predictions

    @property
    def output_size(self):
        return self._output_size

    @property
    def state_size(self):
        return self._state_size
Most probably you are facing the vanishing gradients problem.
Potentially the instability is caused by using ReLU in combination with a pretty small number of parameters to tune. As far as I understand from the description, there are only 1x3x5 = 15 trainable parameters in the first layer, for instance. If we suppose that the initialization is around zero, then the gradients of, on average, 50% of the parameters will always stay zero. Generally speaking, ReLU on small networks is trouble, especially in the case of RNNs.
Try a leaky ReLU (but you can face exploding gradients with it).
Try tanh, but check that the initial values of the parameters really are around zero, otherwise your gradients will vanish very quickly as well.
Retrieve the results of the untrained, just-initialized network at step 0. With the right initialization and network construction you should get normally distributed values around 0.5. If you get strictly ones, strictly zeros, or a mix of them, your architecture is wrong. Values all strictly at 0.5 are also bad.
Consider a more robust approach such as an LSTM; a sketch follows below.
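As a sketch of that last suggestion, assuming the per-step features have already been flattened into a [batch, time, features] tensor named inputs (a hypothetical name), a standard TF 1.x LSTM would replace the hand-built cell like this:

# Sketch: standard LSTM cell instead of the custom convolutional RNNCell.
cell = tf.nn.rnn_cell.LSTMCell(num_units=32)          # 32 units is arbitrary
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
logits = tf.layers.dense(outputs, 15)                 # back to 15 outputs per step
predictions = tf.nn.softmax(logits)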
