I am attempting to train an LSTM that reads a variable-length input sequence and has a custom loss function applied to it. To be able to train on batches, I pad all of my inputs to the maximum length.
My input data is a float tensor of shape (7789, 491, 11) where the form is (num_samples, max_sequence_length, dimension).
Any sample that is shorter than the maximum length I pad with -float('inf'), so a sequence with 10 values would start with 481 sets of 11 '-inf' values followed by the real values at the end.
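The pre-padding layout described above can be sketched in plain Python on nested lists. This is only an illustration: `pad_front` and the toy sample are hypothetical and not taken from the original code, which works on tensors.

```python
NEG_INF = -float("inf")

def pad_front(sequence, max_len, dim, pad_value=NEG_INF):
    """Left-pad a list of `dim`-sized rows so that it has `max_len` rows."""
    if len(sequence) > max_len:
        raise ValueError("sequence is longer than max_len")
    padding = [[pad_value] * dim for _ in range(max_len - len(sequence))]
    return padding + [list(row) for row in sequence]

# Hypothetical sample with 10 real timesteps of 11 features each.
real_rows = [[float(t)] * 11 for t in range(10)]
padded = pad_front(real_rows, max_len=491, dim=11)

# Rows 0..480 are padding, rows 481..490 hold the real values.
print(len(padded), padded[480][0], padded[481][0])
```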
The way I am attempting to evaluate this model doesn't fit into any standard loss functions, so I had to make my own. I've tested it and it performs as expected on sample tensors. I don't believe this is the source of the issue so I won't go into details, but I could be wrong.
The problem I'm having comes from the model itself. Here is how I define and train it:
import tensorflow as tf

model = tf.keras.Sequential()
# Skip timesteps whose features all equal the padding value.
model.add(tf.keras.layers.Masking(mask_value=-float('inf'),
                                  input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(tf.keras.layers.LSTM(32))
model.add(tf.keras.layers.Dense(64, activation='relu'))
# Output layer with a zero-initialized kernel.
model.add(tf.keras.layers.Dense(30,
                                kernel_initializer=tf.keras.initializers.zeros()))
model.add(tf.keras.layers.Reshape((3, 10)))
model.compile(loss=batched_custom_loss, optimizer='rmsprop', run_eagerly=True)
model.fit(x=train_X, y=train_y, validation_data=val, epochs=5, batch_size=32)
No errors are thrown when I try to fit the model, but it only works on the first batch of training. As soon as the second batch starts, the loss becomes 'nan'. Upon closer inspection, it seems like the LSTM layer is outputting 'nan' after the first epoch of training.
My two guesses for what is going on are:
1. I set up the masking layer wrong, and for some reason it fails to mask out all of the -inf values after the first training iteration. Thus, -inf gets passed through the LSTM and it goes haywire.
2. I did something wrong with the format of my loss function, and when the optimizer applies my calculated loss to the model, it ruins the weights of the LSTM. For reference, my loss function outputs a 1D tensor with length equal to the number of samples in the batch. Each item in the output is a float with the loss of that sample.
I know that the math in my loss function is good since I've tested it on sample data, but maybe the output format is wrong even though it seems to match what I've found online.
Let me know if the problem is obvious from what I've shown or if you need more information.
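As an illustration of the first guess only, here is a minimal plain-Python sketch of how -inf turns into NaN as soon as it takes part in ordinary arithmetic. It does not diagnose the model above: whether the padded values actually reach any arithmetic inside the Masking/LSTM path is an assumption, and `weighted_sum` and the weights are hypothetical.

```python
import math

NEG_INF = -float("inf")

def weighted_sum(values, weights):
    """Dot product of one timestep with one column of weights."""
    total = 0.0
    for v, w in zip(values, weights):
        total += v * w
    return total

# Hypothetical weights: one zero and mixed signs are enough to trigger NaN,
# because -inf * 0.0 and (-inf) + inf are both undefined.
weights = [0.5, 0.0, -0.25]

inf_padded = weighted_sum([NEG_INF] * 3, weights)
zero_padded = weighted_sum([0.0] * 3, weights)

print(math.isnan(inf_padded))      # True
print(math.isfinite(zero_padded))  # True
```

If this turns out to be the cause, padding with a finite sentinel (for example 0.0, provided a real timestep can never consist entirely of that value) and passing the same value as mask_value would be one thing to try.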
I am currently building a neural network to predict features such as temperature. So the output for this could be a positive or negative value. I am normalizing my input data and using the tanh activation function in each hidden layer.
Should I use a linear activation function for the output layer to get an unbounded continuous output, or should I use tanh for the output layer and then inverse-normalize the output? Could someone explain this? I don't think my understanding is correct.
You are on the right track.
Option 1:
Normalize the temperatures first and then feed them to the model. Say your temperature ranges over [-100, 100]: convert it to a scale of [-1, 1], then use this scaled version of the temperature as your target variable.
At prediction time, just inverse-transform the output and you will get your desired result.
Option 2:
Create a regression-style neural network and don't apply any activation function to the output layer (meaning there are no bounds on the values; they can be positive or negative).
In this case you are not required to normalize the target values.
Sample NN spec:
Input layer ==> one neuron per feature
Hidden layers ==> relu/selu as the activation function | number of neurons and hidden layers as per your requirements
Output layer ==> 1 neuron, no activation function required
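The scaling in Option 1 can be sketched as a pair of plain-Python helpers. The [-100, 100] range is the example from the answer; the function names `scale` and `unscale` and the sample value are hypothetical.

```python
def scale(value, low, high):
    """Map a value from [low, high] to [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0

def unscale(scaled, low, high):
    """Inverse of scale(): map a value from [-1, 1] back to [low, high]."""
    return (scaled + 1.0) / 2.0 * (high - low) + low

LOW, HIGH = -100.0, 100.0

target = scale(37.5, LOW, HIGH)        # value the network is trained on
restored = unscale(target, LOW, HIGH)  # applied to predictions afterwards
print(target, restored)
```

With a tanh output layer, predictions stay inside [-1, 1], so anything outside the chosen range can never be predicted. With the linear output of Option 2 there is no such limit.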
I am experimenting with time series forecasting in TensorFlow (Python 3.7).
I understand the second line creates a LSTM RNN i.e. a Recurrent Neural Network of type Long Short Term Memory.
Why do we need to add a Dense(1) layer in the end?
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
Tutorial for Dense() says
Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Could you rephrase or elaborate on the need for Dense() here?
The following line
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
creates an LSTM layer which transforms each input step of size #features into a latent representation of size 32. You want to predict a single value so you need to convert this latent representation of size 32 into a single value. Hence, you add the following line
single_step_model.add(tf.keras.layers.Dense(1))
which adds a Dense Layer (Fully-Connected Neural Network) with one neuron in the output which, obviously, produces a single value. Look at it as a way to transform an intermediate result of higher dimensionality into the final result.
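To make the reduction from 32 values to 1 concrete, here is a minimal plain-Python sketch of what a Dense layer without activation computes, i.e. dot(input, kernel) + bias. The `dense` helper, the latent vector, and the weights are hypothetical stand-ins; in the real model the kernel and bias are learned during training.

```python
def dense(inputs, kernel, bias):
    """output[j] = sum_i inputs[i] * kernel[i][j] + bias[j] (no activation)."""
    units = len(bias)
    return [
        sum(x * row[j] for x, row in zip(inputs, kernel)) + bias[j]
        for j in range(units)
    ]

latent = [1.0] * 32                    # stand-in for the LSTM(32) output
kernel = [[0.25] for _ in range(32)]   # shape (32, 1): 32 inputs, 1 unit
bias = [0.5]

prediction = dense(latent, kernel, bias)
print(prediction)  # [8.5]
```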
In the time series forecasting tutorial you are following, they are trying to forecast the temperature 6 hours ahead, and for that they use an LSTM followed by a Dense layer.
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
A Dense layer is nothing but a regular fully connected NN layer. In this case you are bringing the output dimensionality down to 1, which should represent some proportionality (not necessarily linear) to the temperature you are trying to predict. There are other layers you could use as well; check out Keras Layers.
If you are confused about the input and output shape of the LSTM, check out I/O Shape.
I built an LSTM in Keras. It reads observations of 9 time-lags, and predicts the next label. For some reason, the model I trained is predicting something that is nearly a straight line. What issue might there be in the model architecture that is creating such a bad regression result?
Input Data: Hourly financial time-series, with a clear upward trend 1200+ records
Input Data Dimensions:
- originally:
X_train.shape (1212, 9)
- reshaped for LSTM:
Z_train.shape (1212, 1, 9)
array([[[0.45073171, 0.46783444, 0.46226164, ..., 0.47164819,
0.47649667, 0.46017738]],
[[0.46783444, 0.46226164, 0.4553289 , ..., 0.47649667,
0.46017738, 0.47167775]],
Target data: y_train
69200 0.471678
69140 0.476364
69080 0.467761
...
7055 0.924937
7017 0.923651
7003 0.906253
Name: Close, Length: 1212, dtype: float64
type(y_train)
<class 'pandas.core.series.Series'>
LSTM design:
my = Sequential()
my.add(LSTM((20),batch_input_shape=(None,1,9), return_sequences=True))
my.add(LSTM(20, return_sequences=True))
my.add(LSTM(20, return_sequences=True))
my.add(LSTM(1))
An input layer of 9 nodes, 3 hidden layers of 20 units each, and 1 output layer of 1 unit.
The Keras default is return_sequences=False
Model is compiled with mse loss, and adam or sgd optimizer.
curr_model.compile(optimizer=optmfunc, loss="mse")
Model is fit in this manner. Batch is 32, shuffle can be True/False
curr_model.fit(Z_train, y_train,
validation_data=(Z_validation,y_validation),
epochs=noepoch, verbose=0,
batch_size=btchsize,
shuffle=shufBOOL)
Config and Weights are saved to disk. Since I'm training several models, I load them afterward to test certain performance metrics.
spec_model.model.save_weights(mname_trn)
mkerascfg = spec_model.model.to_json()
with open(mname_cfg, "w") as json_file:
json_file.write(mkerascfg)
When I trained an MLP, I got this result against the validation set:
I've trained several of the LSTMs, but the result against the validation set looks like this:
The 2nd plot (LSTM plot) is of the validation data. It shows y_validation versus the predictions on Z_validation. They are the last 135 records in the respective arrays. These were split out of the full data (i.e. validation) and have the same type and properties as Z_train and y_train. The x-axis is just the index, numbered 0 to 134, and the y-axis is the value of y_validation or the prediction. Both arrays are normalized, so all the units are the same. The "straight" line is the prediction.
What idea could you suggest on why this is happening?
- I've changed batch sizes. Similar result.
- I've tried changing the return_sequences, but it leads to various errors around shape for subsequent layers, etc.
Information about LSTM progression of MSE loss
There are 4 models trained, all with the same issue of course. We'll just focus on the 3 hidden layer, 20-unit per layer, LSTM, as defined above.(Mini-batch size was 32, and shuffling was disabled. But enabling changed nothing).
This is a slightly zoomed-in image of the loss progression for the first model (adam optimizer).
From what I can tell by inspecting the index, the bounce in the loss values (which creates the thick area) starts somewhere in the 500s of epochs.
Your code has a single critical problem: dimensionality shuffling. LSTM expects inputs to be shaped as (batch_size, timesteps, channels) (or (num_samples, timesteps, features)) - whereas you're feeding one timestep with nine channels. Backpropagation through time never even takes place.
Fix: reshape inputs as (1212, 9, 1).
Suggestion: read this answer. It's long, but could save you hours of debugging; this information isn't available elsewhere in such a compact form, and I wish I'd had it when starting out with LSTMs.
The answer to a related question may also prove useful, but the previous link is more important.
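As a rough illustration of the fix, here is a plain-Python sketch that turns samples shaped (1, 9) into samples shaped (9, 1) on nested lists. The `to_timesteps` helper and the integer toy data are hypothetical.

```python
def to_timesteps(batch):
    """Convert samples shaped (1, lags) into samples shaped (lags, 1)."""
    return [[[value] for value in sample[0]] for sample in batch]

# Two toy samples in the layout from the question: (2, 1, 9).
wrong = [[list(range(0, 9))],
         [list(range(1, 10))]]

fixed = to_timesteps(wrong)  # now (2, 9, 1)
print(len(fixed), len(fixed[0]), len(fixed[0][0]))  # 2 9 1
```

With NumPy arrays the same change is a single reshape, for example Z_train.reshape(-1, 9, 1).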
OverLordGoldDragon is right: the problem is with the dimensionality of the input.
As you can see in the Keras documentation all recurrent layers expect the input to be a 3D tensor with shape: (batch_size, timesteps, input_dim).
In your case:
the input has 9 time lags that need to be fed to the LSTM in sequence, so they are timesteps
the time series contains only one financial instrument, so the input_dim is 1
Hence, the correct way to reshape it is: (1212, 9, 1)
Also, make sure to respect the order in which data is fed to the LSTM. For forecasting problems it is better to feed the lags from the oldest to the most recent, since we are going to predict the next value after the most recent one.
Since the LSTM reads the input from left to right, the 9 values should be ordered as: x_t-9, x_t-8, ...., x_t-1 from left to right, i.e. the input and output tensors should look like this:
Z = [[[0], [1], [2], [3], [4], [5], [6], [7], [8]],
[[1], [2], [3], [4], [5], [6], [7], [8], [9]],
...
]
y = [9, 10, ...]
If they are not oriented as such you can always set the LSTM flag go_backwards=True to have the LSTM read from right to left.
Also, make sure to pass numpy arrays and not pandas series as X and y as Keras sometimes gets confused by Pandas.
For a full example of doing time series forecasting with Keras take a look at this notebook
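The windowing and ordering described above can be sketched in plain Python as follows. The `make_windows` helper and the toy series are hypothetical; in practice the nested lists would be converted to NumPy arrays before being passed to Keras.

```python
def make_windows(series, n_lags):
    """Build (samples, n_lags, 1) inputs and next-step targets, oldest lag first."""
    inputs, targets = [], []
    for start in range(len(series) - n_lags):
        window = series[start:start + n_lags]
        inputs.append([[value] for value in window])  # one feature per timestep
        targets.append(series[start + n_lags])        # value right after the window
    return inputs, targets

series = list(range(12))  # toy series 0, 1, ..., 11
Z, y = make_windows(series, n_lags=9)

print(Z[0])  # [[0], [1], [2], [3], [4], [5], [6], [7], [8]]
print(y)     # [9, 10, 11]
```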
I have a trained TF model which has the following architecture:
Inputs:
word_a, one-hot representation, vocab-size: 50000
word_b, one-hot representation, vocab-size: 50
Output:
probs, size: 1x10000
The network looks up a 1x100 embedding of word_a (dense_word_a) from an embedding matrix. word_b is transformed by a character CNN into a dense vector of size 1x250. The two vectors are concatenated into a 1x350 vector, which a decoder layer followed by a sigmoid maps to the output layer of size 1x10000.
I need to run this model on the client, and for this I'm converting it to TFLite.
However, I also need to break the model into two sub-models with the following inputs and outputs:
Model 1:
Inputs:
word_a: one-hot representation, (1x50000) vocab-size: 50000
Output:
dense_word_a: dense word-embedding looked up from embedding matrix (1x100)
Network:
Simple embedding lookup for word_a from embedding matrix.
Model 2:
Inputs:
dense_word_a: embedding for word_a received from Model 1. (1x100).
word_b, one-hot representation, vocab-size: 50 (1x50)
Output:
probs, size: 1x10000
In Model 1, the input word_a is a placeholder and dense_word_a is a variable. In Model 2, dense_word_a is a placeholder and its value is concatenated with word_b's embedding.
The embeddings in my model are not pre-trained, and are trained as part of the model training process itself. So I need to train the model as a combined model but during inference I want to break it up into Model 1 and Model 2 as described above.
The idea is to run the Model 1 on server side and pass the embedding values to client so it can perform inference using word_b and not have a 5MB embedding matrix on the client. So, I'm not constrained on the size of Model 1, but since Model 2 runs on the client I need it to be small.
Here's what I've tried:
1. When freezing the model, I freeze the full model, but in the list of output nodes I also add the variable name dense_word_a along with probs. I then convert the model to TFLite. During inference, I'm able to see the dense_word_a output as a 1x100 vector, and I'm also getting probs as output. This seems to work fine.
2. To generate Model 2, I just remove the dense_word_a variable and convert it into a placeholder (using tf.placeholder), remove the placeholder for word_a, and freeze the graph again.
However, I'm not able to match the probs values. The probs vector generated by the full model doesn't match the probs vector generated by Model 2.
How can I go about breaking the model into sub-models and also match the results?
I think what you described should work.
Is it easy to reproduce the problem that you're seeing? If you can isolate the reproducible steps and you believe there's a bug, could you file a bug on GitHub? Thanks!
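One way to isolate the problem is to check the split on a tiny stand-in before involving freezing or TFLite. The sketch below is plain Python with toy sizes and hypothetical weights and function names. It assumes the character CNN output for word_b is already available as a vector, and it only shows that composing Model 1 and Model 2 reproduces the full model when both use the same weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(joined, decoder_columns):
    """One sigmoid output per decoder column."""
    return [sigmoid(sum(x * w for x, w in zip(joined, col))) for col in decoder_columns]

def full_model(word_a_id, word_b_vec, embedding_matrix, decoder_columns):
    joined = list(embedding_matrix[word_a_id]) + list(word_b_vec)
    return decode(joined, decoder_columns)

def model_1(word_a_id, embedding_matrix):
    """Server side: embedding lookup only."""
    return list(embedding_matrix[word_a_id])

def model_2(dense_word_a, word_b_vec, decoder_columns):
    """Client side: takes the embedding as an input instead of looking it up."""
    return decode(list(dense_word_a) + list(word_b_vec), decoder_columns)

# Toy weights: vocabulary of 3, embedding size 2, word_b vector size 2, 2 outputs.
embedding_matrix = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
decoder_columns = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.5, -0.75, 0.0]]
word_b_vec = [0.7, -0.3]

probs_full = full_model(1, word_b_vec, embedding_matrix, decoder_columns)
probs_split = model_2(model_1(1, embedding_matrix), word_b_vec, decoder_columns)
print(probs_full == probs_split)
```

Running the same kind of comparison on the real graphs, first before and then after conversion, would show at which step the probs values start to differ.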