RNN fails to fit a linear trend (Keras BPTT issue?) - python

I am trying to train a simple LSTM to fit a line. My hypothesis is that I should be able to fit a linearly decreasing trend with zero input since the LSTM can decide how much it listens to its input vs. internal state, and can thus learn to just operate on the internal state. Basically a degenerate case for testing whether the LSTM can fit an expected result with zero input.
I create my input and target data:
seq_len = 1000
x_train = np.zeros((1, seq_len, 1)) # [batch_size, seq_len, num_feat]
target = np.linspace(100, 0, num=seq_len).reshape(1, -1, 1)
I create a pretty simple network:
from keras.models import Model
from keras.layers import LSTM, Dense, Input, TimeDistributed
x_in = Input((seq_len, 1))
seq1 = LSTM(8, return_sequences=True)(x_in)
dense1 = TimeDistributed(Dense(8))(seq1)
seq2 = LSTM(8, return_sequences=True)(dense1)
dense2 = TimeDistributed(Dense(8))(seq2)
out = TimeDistributed(Dense(1))(dense2)
model = Model(inputs=x_in, outputs=out)
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(x_train, target, batch_size=1, epochs=1000,
I also created a custom callback that calls model.predict(x_train) after every epoch and adds the results to an array so I can see how my model's output is evolving over time. Basically the model just learns to predict a constant value which gradually (asymptotically) approaches the mean of my target line (target line is in red, not sure why the legend didn't show):
So basically nothing is driving my response to fit the actual line, I'm just gradually approaching the mean of the line. I suspect I am not getting any gradient with respect to time (data index), just an average gradient over time. But I would have thought LSTM losses would automagically give you gradient through time.
I've tried:
different activation functions for the LSTM layers (None, 'relu' for both the regular activation and recurrent activation)
different optimizers ('nadam', 'adadelta', 'rmsprop')
the 'mean_absolute_error' loss function, which I didn't expect to improve the results, and it acted about the same
passing sequences of random numbers drawn from a normal distribution as input
replacing LSTM with GRU
Nothing seems to do it.
Anybody have a suggestion as to how I can force this thing to train on the gradient as a function of my sequence index, i.e. g(t)? Or any other suggestions on how I can get this to work?
Note: with the trend as shown, if the LSTM results in a constant value at exactly the mean (50), the minimum mean absolute error will be about 25 and the minimum mean squared error will be about 835.0. So if we don't see any better than that, we probably aren't fitting the line, just the mean.
Just some references in case you run this yourself.
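For anyone who wants to check those baseline numbers, here is a minimal sketch in plain Python. It rebuilds the target line without NumPy and scores a model that always predicts the mean; the variable names are my own.

```python
# Baseline errors for a model that always predicts the mean of the target line.
seq_len = 1000

# Pure-Python equivalent of np.linspace(100, 0, num=seq_len)
target = [100.0 - 100.0 * i / (seq_len - 1) for i in range(seq_len)]

mean_value = sum(target) / seq_len

# Errors of the constant prediction "mean_value" at every time step
baseline_mae = sum(abs(t - mean_value) for t in target) / seq_len
baseline_mse = sum((t - mean_value) ** 2 for t in target) / seq_len

print(f"mean = {mean_value:.3f}, MAE = {baseline_mae:.3f}, MSE = {baseline_mse:.3f}")
```

Any training run whose loss plateaus near these values is most likely fitting the mean rather than the slope.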


Keras LSTM trained with masking and custom loss function breaks after first iteration

I am attempting to train an LSTM that reads a variable length input sequence and has a custom loss function applied to it. In order to be able to train on batches, I pad my inputs to all be the maximum length.
My input data is a float tensor of shape (7789, 491, 11) where the form is (num_samples, max_sequence_length, dimension).
Any sample that is shorter than the maximum length I pad with -float('inf'), so a sequence with 10 values would start with 481 sets of 11 '-inf' values followed by the real values at the end.
The way I am attempting to evaluate this model doesn't fit into any standard loss functions, so I had to make my own. I've tested it and it performs as expected on sample tensors. I don't believe this is the source of the issue so I won't go into details, but I could be wrong.
The problem I'm having comes from the model itself. Here is how I define and train it:
model = tf.keras.Sequential()
input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Reshape((3, 10)))
model.compile(loss=batched_custom_loss, optimizer='rmsprop', run_eagerly=True)
model.fit(x=train_X, y=train_y, validation_data=val, epochs=5, batch_size=32)
No errors are thrown when I try to fit the model, but it only works on the first batch of training. As soon as the second batch starts, the loss becomes 'nan'. Upon closer inspection, it seems like the LSTM layer is outputting 'nan' after the first iteration of training.
My two guesses for what is going on are:
I set up the masking layer wrong, and it for some reason fails to mask out all of the -inf values after the first training iteration. Thus, -inf gets passed through the LSTM and it goes haywire.
I did something wrong with the format of my loss function, and when the optimizer applies my calculated loss to the model it ruins the weights of the LSTM. For reference, my loss function outputs a 1D tensor with length equal to the number of samples in the batch. Each item in the output is a float with the loss of that sample.
I know that the math in my loss function is good since I've tested it on sample data, but maybe the output format is wrong even though it seems to match what I've found online.
Let me know if the problem is obvious from what I've shown or if you need more information.
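One thing worth ruling out for the first guess is the padding value itself. If a masking step blanks padded timesteps by multiplying them with a 0/1 flag (an assumption about the implementation on my part, not something confirmed above), then -inf times 0 is NaN rather than 0. The sketch below shows that arithmetic in plain Python; apply_mask and PAD_VALUE are hypothetical names for illustration.

```python
import math

NEG_INF = -float("inf")
PAD_VALUE = -1e9  # hypothetical finite sentinel; assumes real data never takes this value

def apply_mask(timestep, mask_value):
    """Zero out a padded timestep by multiplying it with a 0/1 keep flag.

    A simplified stand-in for how a masking layer could blank padded steps
    (an assumption for illustration, not a copy of Keras code).
    """
    keep = 0.0 if all(v == mask_value for v in timestep) else 1.0
    return [v * keep for v in timestep]

padded_with_inf = apply_mask([NEG_INF] * 3, NEG_INF)
padded_with_finite = apply_mask([PAD_VALUE] * 3, PAD_VALUE)
real_step = apply_mask([0.5, -1.2, 3.0], PAD_VALUE)

print(padded_with_inf)     # every entry is nan, because -inf * 0.0 is nan
print(padded_with_finite)  # zeros (displayed as -0.0)
print(real_step)           # real values pass through unchanged
```

If this is what is happening, padding with a finite value that never occurs in the real data would be a cheap experiment to try.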

Searching for an LSTM architecture to be used for regression

The lack of good intuition on LSTMs and how they work, paired with an awkward dataset and regression problem, leaves me with questions on how to approach and solve my scenario.
I don't want any in-depth answers; I am just looking for intuition and suggestions.
My dataset consists of:
X flights, each flight has Y timesteps where each timestep has Z features. Every flight is characterized by K which is a 2 value vector (K_1, K_2) and that's the target of regression, predicting these 2 variables.
I've tried several different regression methods and they happen to perform really well. Because I have time dependent trajectories, the other methods calculated stats across each trajectory for the Z features and transformed each whole trajectory to just [Z*l,] - [K_1, K_2] supervised data, where l is just a factor that implies that we have new calculated features. (for example one of these features could be the mean of a feature across all trajectory).
I want to implement an LSTM regression pipeline that takes as input the raw dataset (X,Y,Z) and, after a dense layer, outputs 2 values, K_1 and K_2 (with 1 model or 2 separate models, one for each value), and correctly backpropagates the loss against K_1_target and K_2_target.
I've tried it and it seems to perform really poorly, and I don't know if it's a technical mistake or a theoretical mistake.
Below I provide the architecture I use at the moment.
samples, timesteps, features = x_train.shape[0], x_train.shape[1], x_train.shape[2]
model1 = Sequential()
model1.add(LSTM(hidden_units, return_sequences = True))
model1.add(Dense(hidden_units, activation = "relu"))
model1.add(Dense(1, activation = "linear"))
model1.compile(loss='mse', optimizer=Adam(learning_rate=0.0001))
model1.fit(x_train, y_train[:,0], validation_data=(x_test, y_test[:,0]), epochs=epochs, batch_size=batch, shuffle=False)
model2 = Sequential()
model2.add(LSTM(hidden_units, return_sequences=True))
model2.add(Dense(hidden_units, activation = "relu"))
model2.add(Dense(1, activation = "linear"))
model2.compile(loss='mse', optimizer=Adam(learning_rate=0.0001))
model2.fit(x_train, y_train[:,1], validation_data=(x_test, y_test[:,1]), epochs=epochs, batch_size=batch, shuffle=False)
In my mind it makes sense but it seems that it is not working well..
I'm not entirely sure how the trainable params of LSTM are learning on backpropagation using a dense layer after and backwarding the loss of 2 values which are the same again and again for the same trajectory.
Any kind of clarification, correction or intuition will help me a lot!
Lastly, I'll provide some real details.
K_1 takes discrete values from 0 to 100
K_2 takes 1 floating point precision values from 0 to 1
x_input is a subset of (6991, 527, 6) using k-fold CV, using k = 10, so about (6292, 527, 6) and y_input is of shape (6292, 2). Accordingly for testing.
I've used pre-padding to make all trajectories the same length and a masking layer that ignores rows with no data.
I've normalized all my features and target values independently with MinMax normalization, and inverse-transformed the model's output and y_test for correct loss calculation.
The best MAE loss I've got so far is:
K_1 (whose range is 0 - 100) = ~6.0 (Even lasso performs better, while in a non linear problem)
K_2 (whose range is 0 - 1) = ~0.003 (Pretty good)
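To make the normalization and inverse-transform step concrete, here is a minimal sketch in plain Python of MinMax scaling a target and reporting the MAE back in original units. The target values and predictions are made up for illustration, and the helper names are mine, not scikit-learn's.

```python
def fit_minmax(values):
    """Return the (min, max) pair used for scaling."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Map values from [lo, hi] to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def unscale(values, lo, hi):
    """Map values from [0, 1] back to [lo, hi]."""
    return [v * (hi - lo) + lo for v in values]

def mae(y_true, y_pred):
    """Mean absolute error between two equally long lists."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical K_1 targets (0-100 range) and normalized model outputs
k1_true = [0.0, 25.0, 50.0, 100.0]
lo, hi = fit_minmax(k1_true)
k1_pred_scaled = [0.05, 0.20, 0.55, 0.90]

mae_scaled = mae(scale(k1_true, lo, hi), k1_pred_scaled)
mae_original = mae(k1_true, unscale(k1_pred_scaled, lo, hi))

print(mae_scaled, mae_original)
```

The two numbers differ only by the factor (hi - lo), which is why an MAE of ~6.0 on a 0-100 target and ~0.003 on a 0-1 target are not directly comparable until both are expressed relative to their ranges.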

Keras RNN accuracy doesn't improve

I'm trying to improve my model so it can become a bit more accurate. Right now I'm training the model and get this as my training and validation accuracy.
For every epoch I get a training accuracy of 0.0003 and a validation accuracy of 0. I know this isn't good but I don't know how I can fix this.
Data is normalized with the minmax scaler. 4 of the 8 features are normalized (other 4 are hour, day, day_of_week and month)
I've also tried to normalize the entire dataset and it doesn't make a difference
scaling = MinMaxScaler(feature_range=(0,1)).fit(df[cols])
df[cols] = scaling.transform(df[cols])
My model: The shape is (5351, 1, 8)
and the input_shape is (1, 8)
model = keras.Sequential()
model.add(keras.layers.Bidirectional(keras.layers.LSTM(2,input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True, activation='linear')))
model.compile(loss='mean_squared_error', optimizer='Adamax', metrics=['acc'])
history = model.fit(
X_train, y_train,
I tried using the answer to this question:
Keras model accuracy not improving
but it didn't work
A mean_squared_error loss is for regression tasks while an acc metric is for classification problems, so it makes no sense to use them together.
If you are working on a classification problem, use binary_crossentropy or categorical_crossentropy as the loss and keep the metrics parameter as you did.
If it is a regression task, change the metric to ['mse'] (mean squared error) instead of ['acc'].
Your model "works" and you have applied the standard formula for backpropagation by using the mean squared error loss. But measuring the accuracy will make Keras check whether your model's output is EXACTLY equal to the expected values. Since the model is doing regression, the outputs will hardly ever be exactly equal.
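To illustrate the point, here is a small sketch in plain Python comparing an exact-match accuracy with the mean squared error on the same predictions. It is a simplified stand-in for the idea, not Keras's actual metric code, and the numbers are made up.

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions that are exactly equal to the target."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between target and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical regression targets and predictions that are all very close
y_true = [0.20, 0.45, 0.70, 0.90]
y_pred = [0.21, 0.44, 0.71, 0.89]

accuracy = exact_match_accuracy(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

print(accuracy, mse)
```

The predictions are off by only 0.01 each, so the MSE is tiny, yet the exact-match accuracy is zero. That is the kind of reading you get when a classification metric is attached to a regression output.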
Three last points because that little change won't correct everything.
Firstly, your last dense layer should have an activation function. (It's safer.)
Secondly, I'm pretty sure a Bidirectional+LSTM layer placed before a Dense layer should have return_sequences=False. An LSTM layer (with or without Bidirectional) can return the full sequence of vectors (like a matrix), but a dense layer takes vectors as input. In this case it will still work because of the third point.
The last point is about the shape of your data. You have 5351 examples of shape (1, 8), each of which is a vector of size 8. An LSTM layer takes a sequence of vectors, but the length of your sequence is one. I don't know if it is relevant to use an RNN-type layer here.

from_logits=True and from_logits=False get different training result for tf.losses.CategoricalCrossentropy for UNet

I am doing the image semantic segmentation job with unet, if I set the Softmax Activation for last layer like this:
conv9 = Conv2D(n_classes, (3,3), padding = 'same')(conv9)
conv10 = (Activation('softmax'))(conv9)
model = Model(inputs, conv10)
return model
and then using loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
The training will not converge even for only one training image.
But if I do not set the Softmax Activation for last layer like this:
conv9 = Conv2D(n_classes, (3,3), padding = 'same')(conv9)
model = Model(inputs, conv9)
return model
and then using loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
The training will converge for one training image.
My groundtruth dataset is generated like this:
X = []
Y = []
im = cv2.imread(impath)
seg_labels = np.zeros((height, width, n_classes))
for spath in segpaths:
    mask = cv2.imread(spath, 0)
    seg_labels[:, :, c] += mask
Y.append(seg_labels.reshape(width*height, n_classes))
Why? Is there something wrong for my usage?
This is my experiment code of git: https://github.com/honeytidy/unet
You can checkout and run (can run on cpu). You can change the Activation layer and from_logits of CategoricalCrossentropy and see what i said.
Pushing the "softmax" activation into the cross-entropy loss layer significantly simplifies the loss computation and makes it more numerically stable.
It might be the case that in your example the numerical issues are significant enough to render the training process ineffective for the from_logits=False option.
You can find a derivation of the cross entropy loss (a special case of "info gain" loss) in this post. This derivation illustrates the numerical issues that are averted when combining softmax with cross entropy loss.
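A rough way to see the stability argument is to compute the loss both ways in plain Python. The sketch below assumes the probability-based path clips values with a small epsilon before taking the log (eps=1e-7 is my assumption, chosen for illustration), while the logits-based path uses the log-sum-exp form directly.

```python
import math

def softmax(logits):
    """Softmax with the usual max-subtraction to avoid overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_from_probs(probs, true_idx, eps=1e-7):
    """Cross-entropy computed from probabilities, clipped to avoid log(0)."""
    p = min(max(probs[true_idx], eps), 1.0 - eps)
    return -math.log(p)

def ce_from_logits(logits, true_idx):
    """Cross-entropy computed from raw logits via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_idx]

mild = [3.2, 1.3, 0.2, 0.8]          # true class is index 0
extreme = [1000.0, 0.0, 0.0, 0.0]    # confidently wrong: true class is index 1

mild_probs_loss = ce_from_probs(softmax(mild), 0)
mild_logits_loss = ce_from_logits(mild, 0)
extreme_probs_loss = ce_from_probs(softmax(extreme), 1)
extreme_logits_loss = ce_from_logits(extreme, 1)

print(mild_probs_loss, mild_logits_loss)
print(extreme_probs_loss, extreme_logits_loss)
```

For mild logits the two paths agree. For the extreme case the probability underflows to zero, so the clipped loss saturates at -log(eps) and no longer reflects how wrong the prediction is, while the logits-based loss keeps the full value.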
from_logits=True signifies that the values the model passes to the loss are not normalized; it is used when the model has no softmax function at the end. For example, in the model at https://www.tensorflow.org/tutorials/generative/dcgan they have not used a softmax activation function. In other words, it helps with numerical stability.
By default, all of the loss functions implemented in TensorFlow for classification problems use from_logits=False. Remember that in a classification problem, at the end of the prediction, one usually wants to produce output in terms of probabilities.
Just look at the image below, the last layer of the network(just before softmax function)
So the sequence is Neural Network ⇒ Last layer output ⇒ Softmax or Sigmoid function ⇒ Probability of each class.
For example in the case of a multi-class classification problem, where output can be y1, y2, ....... yn one wants to produce each output with some probability. (see the output layer). Now, this output layer will get compared in cross-entropy loss function with the true label.
Let us take an example where our network produced the output for the classification task. Assume your Neural Network is producing output, then you convert that output into probabilities using softmax function and calculate loss using a cross-entropy loss function
# output produced by the last layer of NN
nn_output_before_softmax = [3.2, 1.3, 0.2, 0.8]
# converting output of last layer of NN into probabilities by applying softmax
nn_output_after_softmax = tf.nn.softmax(nn_output_before_softmax)
# output after applying softmax
[0.77514964 0.11593805 0.03859243 0.07031998]
y_true = [1.0, 0.0, 0.0, 0.0]
Now there are two scenarios:
One is explicitly using the softmax (or sigmoid) function
One is not using softmax function separately and wants to include in the calculation of loss function
1) One is explicitly using the softmax (or sigmoid) function
When one is explicitly using the softmax (or sigmoid) function for the classification task, the default option in the TensorFlow loss function applies, i.e. from_logits=False. Here TensorFlow assumes that whatever input you feed to the loss function is already a set of probabilities, so there is no need to apply the softmax function.
# By default from_logits=False
loss_taking_prob = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
loss_1 = loss_taking_prob(y_true, nn_output_after_softmax)
tf.Tensor(0.25469932, shape=(), dtype=float32)
2) One is not using the softmax function separately and wants to include it in the calculation of the loss function. This means that the inputs you provide to the loss function are not scaled (the inputs are just numbers from -inf to +inf, not probabilities). Here you are letting TensorFlow perform the softmax operation for you.
loss_taking_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
loss_2 = loss_taking_logits(y_true, nn_output_before_softmax)
tf.Tensor(0.2546992, shape=(), dtype=float32)
Please do remember that using from_logits=True when the model already applies softmax (so it should be False) leads to taking the softmax of probabilities and produces an incorrect model.
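If the loss is told to expect logits while the model already outputs probabilities, softmax effectively gets applied twice. A minimal sketch of what that does to the numbers from the example above, in plain Python rather than TensorFlow:

```python
import math

def softmax(values):
    """Plain softmax with max-subtraction for stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_idx):
    """Cross-entropy for a one-hot label at position true_idx."""
    return -math.log(probs[true_idx])

logits = [3.2, 1.3, 0.2, 0.8]
probs = softmax(logits)          # what a softmax output layer produces
double_probs = softmax(probs)    # what the loss sees if it applies softmax again

loss_correct = cross_entropy(probs, 0)
loss_double = cross_entropy(double_probs, 0)

print(probs, loss_correct)
print(double_probs, loss_double)
```

The second softmax flattens the distribution, so the loss stays high even when the model is confident and correct, and training is driven by a distorted signal.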
I guess the problem comes from the softmax activation function. Looking at the docs, I found that softmax is applied to the last axis by default. Can you look at model.summary() and check if that is what you want?
For softmax to work properly, you must make sure that:
You are using 'channels_last' as Keras default channel config.
This means the shapes in the model will be like (None, height, width, channels)
This seems to be your case because you are putting n_classes in the last axis. But it's also strange because you are using Conv2D and your output Y should be (1, height, width, n_classes) and not that strange shape you are using.
Your Y has only zeros and ones (not 0 and 255 as usually happens to images)
Check that Y.max() == 1 and Y.min() == 0
You may need to have Y = Y / 255.
Only one class is correct (your data does not have more than one path/channel with value = 1).
Check that (Y.sum(axis=-1) == 1).all() is True
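The checks above can be sketched without any libraries. In this version plain nested lists stand in for the NumPy arrays, the two 2x2 masks are made-up data, and the helper names are hypothetical.

```python
def normalize_mask(mask):
    """Map 0/255 mask pixels to 0/1."""
    return [[pixel / 255 for pixel in row] for row in mask]

def stack_masks(masks):
    """Build labels[y][x] = list of per-class values from one mask per class."""
    height, width = len(masks[0]), len(masks[0][0])
    return [[[m[y][x] for m in masks] for x in range(width)] for y in range(height)]

def is_valid_one_hot(labels):
    """True if every pixel has only 0/1 entries and exactly one class set."""
    for row in labels:
        for pixel in row:
            if any(v not in (0, 1) for v in pixel):
                return False
            if sum(pixel) != 1:
                return False
    return True

# Hypothetical masks as read from grayscale images (values 0 or 255)
mask_a = [[255, 0], [0, 255]]
mask_b = [[0, 255], [255, 0]]

raw_labels = stack_masks([mask_a, mask_b])
clean_labels = stack_masks([normalize_mask(mask_a), normalize_mask(mask_b)])

print(is_valid_one_hot(raw_labels))    # fails: entries are 255, not 1
print(is_valid_one_hot(clean_labels))  # passes after dividing by 255
```

With real data the same checks are one-liners on the array, as written in the list above.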

Understanding Regularization in Keras

I am trying to understand why regularization syntax in Keras looks the way that it does.
Roughly speaking, regularization is a way to reduce overfitting by adding a penalty term to the loss function proportional to some function of the model weights. Therefore, I would expect that regularization would be defined as part of the specification of the model's loss function.
However, in Keras the regularization is defined on a per-layer basis. For instance, consider this regularized DNN model:
input = Input(name='the_input', shape=(None, input_shape))
x = Dense(units = 250, activation='tanh', name='dense_1', kernel_regularizer=l2, bias_regularizer=l2, activity_regularizer=l2)(input)
x = Dense(units = 28, name='dense_2',kernel_regularizer=l2, bias_regularizer=l2, activity_regularizer=l2)(x)
y_pred = Activation('softmax', name='softmax')(x)
mymodel= Model(inputs=input, outputs=y_pred)
mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
I would have expected that the regularization arguments in the Dense layer were not needed and I could just write the last line more like:
mymodel.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'], regularization='l2')
This is obviously wrong syntax, but I was hoping someone could elaborate for me a bit on why the regularizers are defined this way and what is actually happening when I use layer-level regularization.
The other thing I don't understand is under what circumstances would I use each or all of the three regularization options: (kernel_regularizer, activity_regularizer, bias_regularizer)?
Let's break down the components of your question:
Your expectation of regularisation is probably in line with a feed-forward network, where yes, the penalty term is applied to the weights of the overall network. But this is not necessarily the case when you have RNNs mixed with CNNs etc., so Keras opts to give fine-grained control. Perhaps, for easy setup, a model-level regularisation covering all weights could be added to the API.
When you use layer regularisation, the base Layer class actually adds the regularising term to the loss, which at training time penalises the corresponding layer's weights etc.
Now in Keras you can often apply regularisation to 3 different things, as in the Dense layer. Different layers have different kernels, such as the recurrent kernel, so for the question let's look at the ones you are interested in, but the same roughly applies to all layers:
kernel: this applies to actual weights of the layer, in Dense it is the W of Wx+b.
bias: this is the bias vector of the layer, so you can apply a different regulariser to it, the b in Wx+b.
activity: is applied to the output vector, the y in y = f(Wx + b).
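As a rough illustration of what "adds the regularising term to the loss" means, here is a hand-computed sketch for one tiny Dense layer with an L2 penalty on each of the three targets. It is plain Python, not Keras internals; the weights, input, factor and data loss are made-up numbers, and the layer has no activation function to keep the arithmetic short.

```python
def l2_penalty(values, factor):
    """L2 regularisation term: factor times the sum of squared values."""
    return factor * sum(v * v for v in values)

# Tiny Dense layer: 2 inputs, 2 units
W = [[0.5, -1.0],
     [2.0, 0.0]]      # kernel, W[i][j] connects input i to unit j
b = [0.1, -0.2]       # bias
x = [1.0, 2.0]        # one input sample

# Layer output y = Wx + b
y = [sum(x[i] * W[i][j] for i in range(2)) + b[j] for j in range(2)]

factor = 0.01
kernel_term = l2_penalty([w for row in W for w in row], factor)   # penalises W
bias_term = l2_penalty(b, factor)                                 # penalises b
activity_term = l2_penalty(y, factor)                             # penalises the output

data_loss = 0.3  # hypothetical value of the unregularised loss for this sample
total_loss = data_loss + kernel_term + bias_term + activity_term

print(kernel_term, bias_term, activity_term, total_loss)
```

Each layer contributes its own terms in this way, which is why the regularisers are attached per layer: the optimiser still minimises a single number, but you decide layer by layer which quantities are penalised and how strongly.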
