Python Keras LSTM learning converges too fast on high loss

This is more of a deep learning conceptual problem, and if this is not the right platform I'll take it elsewhere.
I'm trying to use a Keras LSTM sequential model to learn sequences of text and map them to a numeric value (a regression problem).
The thing is, the learning always converges too fast on a high loss (both training and testing). I've tried all possible hyperparameters, and I have a feeling it's a local minimum issue that causes the model's high bias.
My questions are basically:
How to initialize weights and bias given this problem?
Which optimizer to use?
How deep should I extend the network? (I'm afraid that if I use a very deep network, the training time will be unbearable and the model variance will grow.)
Should I add more training data?
Input and output are normalized with minmax.
I am using SGD with momentum, currently 3 LSTM layers (126,256,128) and 2 dense layers (200 and 1 output neuron)
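For reference, a rough Keras sketch of the setup just described; the layer sizes and optimizer follow the description above, while the input shape, momentum value and hidden activations are assumptions for illustration:
import tensorflow as tf

# Rough reconstruction of the described setup; n_steps/n_features are hypothetical.
n_steps, n_features = 50, 1

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(126, return_sequences=True,
                         input_shape=(n_steps, n_features)),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(200, activation='relu'),
    tf.keras.layers.Dense(1),  # single regression output
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss='mse')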
I have printed the weights after a few epochs and noticed that many weights are zero and the rest basically have a value of 1 (or very close to it).
Here are some plots from TensorBoard:

Faster convergence with a very high loss could possibly mean you are facing an exploding gradients problem. Try to use a much lower learning rate like 1e-5 or 1e-6. You can also try techniques like gradient clipping to limit your gradients in case of high learning rates.
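A minimal sketch of both suggestions in Keras, assuming a model compiled for regression like the one described above (values are illustrative, not tuned):
import tensorflow as tf

# Much lower learning rate plus gradient clipping; clipnorm caps the norm of
# each gradient tensor so a single bad batch cannot blow the weights up.
opt = tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.9, clipnorm=1.0)
# Alternatively, clip each gradient element to a fixed range:
# opt = tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.9, clipvalue=0.5)
model.compile(optimizer=opt, loss='mse')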
Answer 1
Another reason could be the initialization of the weights; try the three methods below:
He initialization, the method described in this paper: https://arxiv.org/abs/1502.01852
Xavier (Glorot) initialization
Random initialization
In many cases the first initialization method works best (see the sketch below).
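In Keras these roughly correspond to built-in initializers; a hedged sketch applied to a dense layer (the layer size is illustrative):
import tensorflow as tf

# 1. He initialization (the method from arXiv:1502.01852), suited to ReLU layers
dense_he = tf.keras.layers.Dense(200, activation='relu',
                                 kernel_initializer='he_normal')
# 2. Xavier/Glorot initialization (the Keras default for Dense layers)
dense_xavier = tf.keras.layers.Dense(200, kernel_initializer='glorot_uniform')
# 3. Plain random initialization from a normal distribution
dense_random = tf.keras.layers.Dense(
    200, kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.05))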
Answer 2
You can try different optimizers like
Momentum optimizer
SGD or Gradient descent
Adam optimizer
The choice of optimizer should be based on the choice of your loss function. For example, for a logistic regression problem with MSE as the loss, the loss surface is non-convex, so plain gradient-based optimizers may converge poorly or get stuck.
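Swapping optimizers in Keras is a one-line change at compile time; a small sketch, assuming an existing model:
import tensorflow as tf

# SGD with momentum (roughly what the question currently uses)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
              loss='mse')
# Adam, an adaptive alternative that is often less sensitive to the learning rate
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss='mse')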
Answer 3
How deep or wide your network should be again depends entirely on which type of network you are using and on the problem.
As you said, you are using a sequential model with LSTMs to learn sequences of text. Your choice of model is a good fit for this problem; you can also try stacking 4-5 LSTM layers.
Answer 4
If your gradients are going to either zero or infinity, you are facing vanishing or exploding gradients, which can show up as early convergence. Try gradient clipping with a proper learning rate and the first weight-initialization technique.
I am sure this will definitely solve your problem.

Consider reducing your batch_size.
With a large batch_size, it could be that at some point the gradient no longer finds enough variation in the data's stochasticity, and for that reason it converges earlier.

Related

Cannot improve the accuracy of my Multilayer Perceptron (MLP) model in classification

I am trying to predict labels for building performance: {1, 0}. Since this is binary classification, I tried sigmoid and identity activation functions with Xavier initialization. However, I cannot improve the accuracy of my models, as the loss and accuracy stay still after each training epoch. This is a very imbalanced dataset where the ones make up about 90% of the samples, so I assume this might be due to the initial bias. Can you help me with this one? You can see the setup of the training process in the attached images (model definition, hyperparameters, results).
Here are several suggestions which may help:
Use an activation function after each hidden layer.
A learning rate of 0.1 is too high for Adam. Try a smaller one (3e-4, for example).
You are printing the loss value incorrectly: currently the loss is taken from the last iteration only. Calculate the mean epoch loss instead.
Minor suggestion: since the task is binary classification, it's better to use torch.nn.BCELoss, or torch.nn.BCEWithLogitsLoss if you don't apply a sigmoid on the last layer. The last linear layer must have output_size=1 in this case.
The best model checkpoint may be missed with the code you provided: you calculate accuracy only every 10 epochs, but the accuracy > best_accuracy check runs on every epoch, which is inconsistent (see the sketch after this list).
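A hedged sketch of those suggestions in PyTorch; the model, optimizer, data loaders, epoch count and the evaluate helper are assumed here, not taken from the question:
import torch
import torch.nn as nn

# Sketch: BCEWithLogitsLoss on a 1-unit output, mean loss per epoch, and a
# checkpoint check that runs every epoch. `model`, `optimizer`, `num_epochs`,
# `train_loader`, `val_loader` and `evaluate` are assumed to exist.
criterion = nn.BCEWithLogitsLoss()          # no sigmoid on the last layer needed
best_accuracy = 0.0

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for x, y in train_loader:
        optimizer.zero_grad()
        logits = model(x).squeeze(1)        # last Linear layer has output_size=1
        loss = criterion(logits, y.float())
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"epoch {epoch}: mean loss {epoch_loss / len(train_loader):.4f}")

    accuracy = evaluate(model, val_loader)  # hypothetical validation helper
    if accuracy > best_accuracy:            # checked every epoch, consistently
        best_accuracy = accuracy
        torch.save(model.state_dict(), "best_model.pt")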

How to update a keras LSTM weights to avoid Concept Drift

I'm trying to update a Keras LSTM to deal with concept drift. For that I'm following the approach proposed in this paper [1], in which an anomaly score is computed and used to update the network weights. In the paper they use the L2 norm to compute the anomaly score and then update the model weights. As stated in the paper:
RNN Update: The anomaly score a_t is then used to update the network W_{t-1} to obtain W_t using backpropagation through time (BPTT):
W_t = W_{t-1} - η ∇a_t(W_{t-1}), where η is the learning rate.
I'm trying to update the LSTM network weights, and although I have seen some improvement in the model's performance for forecasting multi-step-ahead multi-sensor data, I'm not sure whether the improvement comes from the updates dealing with concept drift or simply from the model being refitted with the newest data.
Here is an example model:
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(n_neurons, input_shape=(n_seq, n_features)))
model.add(tf.keras.layers.Dense(n_pred_seq * n_features))
model.add(tf.keras.layers.Reshape((n_pred_seq, n_features)))  # multi-step, multi-feature output
model.compile(optimizer='adam', loss='mse')
And here is the way in which I'm updating the model:
from math import sqrt
from sklearn.metrics import mean_squared_error

y_pred = model.predict_on_batch(x_batch)
up_y = data_y[i,]
# Anomaly score: RMSE between the newest observation and the prediction
a_score = sqrt(mean_squared_error(data_y[i,].flatten(), y_pred[0, :]))
w = model.layers[0].get_weights()  # only get weights for the LSTM layer
for l in range(len(w)):
    w[l] = w[l] - (w[l] * 0.001 * a_score)  # 0.001 = learning rate
model.layers[0].set_weights(w)
model.fit(x_batch, up_y, epochs=1, verbose=1)
model.reset_states()
I'm wondering if this is the correct way to update the LSTM neural network, and how BPTT is applied after updating the weights.
P.S.: I have also seen other methods to detect concept drift, such as the ADWIN method from the skmultiflow package, but I found this one especially interesting because it also handles anomalies: the model is updated slightly when new data with concept drift arrives, and the updates are almost ignored when anomalous data arrives.
[1] Saurav, S., Malhotra, P., TV, V., Gugulothu, N., Vig, L., Agarwal, P., & Shroff, G. (2018). Online Anomaly Detection with Concept Drift Adaptation using Recurrent Neural Networks. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (pp. 78-87). ACM.
I personally think it's a valid method. How you update the network weights depends on what you're trying to do, so doing it the way you do is fine.
Maybe another way to do it is to implement your own loss function and embed the anti-drift parameter into it, but that might be a little complicated.
Regarding BPTT, I think it's applied as normal, but with different "starting points": the weights you've just updated.
Looking at the second block of your code, I believe you are not calculating the gradient properly. Specifically, the gradient update w[l] = w[l] - (w[l]*0.001*a_score) seems to be wrong to me.
Here you are multiplying the weights by the anomaly score. However, the original update equation
W_t = W_{t-1} - η ∇a_t(W_{t-1})
means computing the gradient of the loss a_t with respect to W_{t-1}; it does not mean multiplying a_t by W_{t-1}.
To apply the online update correctly, you just need to sample your stream sequentially and call model.fit() as usual.
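As a hedged sketch of what "take the gradient of a_t with respect to the weights" could look like with tf.GradientTape, reusing the model from the question; x_batch and y_batch stand for one new window from the stream:
import tensorflow as tf

# Compute the anomaly score a_t (RMSE of the new batch) inside a GradientTape
# so its gradient w.r.t. the weights is available, then take one plain step
# W_t = W_{t-1} - eta * grad.
eta = 0.001  # learning rate

def online_update(model, x_batch, y_batch):
    with tf.GradientTape() as tape:
        y_pred = model(x_batch, training=True)
        a_t = tf.sqrt(tf.reduce_mean(tf.square(y_batch - y_pred)))  # anomaly score
    grads = tape.gradient(a_t, model.trainable_variables)
    for w, g in zip(model.trainable_variables, grads):
        if g is not None:
            w.assign_sub(eta * g)
    return float(a_t)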
Hope this helps.

Interpreting tensorboard plots

I'm still a newbie in TensorFlow and I'm trying to understand in detail what happens while my models train. Briefly, I'm using the slim models pretrained on ImageNet to do fine-tuning on my dataset. Here are some plots extracted from TensorBoard for two separate models:
Model_1 (InceptionResnet_V2)
Model_2 (InceptionV4)
So far, both models have poor results on the validation set (Average Az (area under the ROC curve) = 0.7 for Model_1 and 0.79 for Model_2). My interpretation of these plots is that the weights are not changing over the mini-batches; only the biases change, and this might be the problem. But I don't know where to look to verify this point. This is the only interpretation I can think of, but it might be wrong given that I'm still a newbie. Can you please share your thoughts? Don't hesitate to ask for more plots if needed.
EDIT:
As you can see in the plots below, it seems the weights are barely changing over time. This applies to all the other weights in both networks. It led me to think that there is a problem somewhere, but I don't know how to interpret it.
InceptionV4 weights
InceptionResnetV2 weights
EDIT2:
These models were first trained on ImageNet, and these plots are the results of fine-tuning them on my dataset. I'm using a dataset of 19 classes with roughly 800,000 images. This is a multi-label classification problem and I'm using sigmoid cross-entropy as the loss function. The classes are highly unbalanced. The table below shows the percentage of presence of each class in the two subsets (train, validation):
Objects train validation
obj_1 3.9832 % 0.0000 %
obj_2 70.6678 % 33.3253 %
obj_3 89.9084 % 98.5371 %
obj_4 85.6781 % 81.4631 %
obj_5 92.7638 % 71.4327 %
obj_6 99.9690 % 100.0000 %
obj_7 90.5899 % 96.1605 %
obj_8 77.1223 % 91.8368 %
obj_9 94.6200 % 98.8323 %
obj_10 88.2051 % 95.0989 %
obj_11 3.8838 % 9.3670 %
obj_12 50.0131 % 24.8709 %
obj_13 0.0056 % 0.0000 %
obj_14 0.3237 % 0.0000 %
obj_15 61.3438 % 94.1573 %
obj_16 93.8729 % 98.1648 %
obj_17 93.8731 % 97.5094 %
obj_18 59.2404 % 70.1059 %
obj_19 8.5414 % 26.8762 %
The values of the hyperparams:
batch_size=32
weight_decay = 0.00004 #'The weight decay on the model weights.'
optimizer = rmsprop
rmsprop_momentum = 0.9
rmsprop_decay = 0.9 #'Decay term for RMSProp.'
learning_rate_decay_type = exponential #Specifies how the learning rate is decayed
learning_rate = 0.01 #Initial learning rate.
learning_rate_decay_factor = 0.94 #Learning rate decay factor
num_epochs_per_decay = 2.0 #Number of epochs after which the learning rate decays
Concerning the sparsity of the layers, here are some samples of the sparsity of the layers for both networks:
sparsity (InceptionResnet_V2)
sparsity (InceptionV4)
EDIT3:
Here are the plots of the losses for both models:
Losses and regularization loss (InceptionResnet_V2)
Losses and regularization loss (InceptionV4)
I agree with your assessment - the weights aren't changing very much across the minibatches. It does appear they are changing somewhat.
As I'm sure you're aware, you are doing fine tuning with very large models. As such, backprop can sometimes take a while. But, you're running many training iterations. I don't really think this is the problem.
If I'm not mistaken, both of these were originally trained on ImageNet. If your images are in a totally different domain than something in ImageNet, that could explain the problem.
The backprop equations do make it easier for biases to change in certain activation ranges. ReLU can be one culprit if the model is highly sparse (i.e. if many layers have activation values of 0, the weights will struggle to adjust but the biases will not). Also, if activations are in the range [0, 1], the gradient with respect to a weight will be smaller than the gradient with respect to a bias. (This is part of why sigmoid is a poor activation function.)
It could also be related to your readout layer - specifically the activation function. How are you calculating the error? Is this a classification or a regression problem? If at all possible, I recommend using something other than sigmoid as your final activation function; tanh could be marginally better. A linear readout sometimes speeds up training too: all the gradients have to "pass through" the readout layer, and if the derivative of the readout layer is always 1 (linear), you're "letting more gradient through" to adjust the weights further down the model.
Lastly, I notice your weight histograms are pushing towards negative weights. Sometimes, especially in models with a lot of ReLU activations, that can be an indicator of the model learning sparsity, or an indicator of the dead-neuron problem, or both - the two are somewhat linked.
Ultimately, I think your model is just struggling to learn. I've encountered very similar histograms retraining Inception. I was using a dataset of about 2000 images, and I was struggling to push it over 80% accuracy (as it happens, the dataset was heavily biased - that accuracy was roughly as good as random guessing). It helped when I made the convolution variables constant and only made changes to the fully connected layer.
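In tf.keras terms, that idea (freeze the convolutional base, train only the classifier head) looks roughly like the sketch below; the slim-based setup in the question would instead control which variables are trainable, and the 19-unit sigmoid head matches the multi-label setup described above:
import tensorflow as tf

# Hedged sketch: freeze the pretrained convolutional base and train only a new
# classification head (19 sigmoid outputs for the multi-label problem above).
base = tf.keras.applications.InceptionResNetV2(include_top=False, pooling='avg',
                                               weights='imagenet')
base.trainable = False                       # convolution variables held constant

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(19, activation='sigmoid'),  # multi-label head
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
              loss='binary_crossentropy')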
Indeed this is a classification problem, and sigmoid cross-entropy is the appropriate loss function. And you do have a sizable dataset - certainly big enough to fine-tune these models.
With this new information, I would suggest lowering the initial learning rate. I have a two-fold reasoning here:
(1) is my own experience. As I mentioned, I'm not especially familiar with RMSprop. I've only used it in the context of DNCs (though, DNCs with convolutional controllers), but my experience there backs up what I'm about to say. I think .01 is high for training a model from scratch, let alone fine tuning. It's definitely high for Adam. In some sense, starting with a small learning rate is the "fine" part of fine tuning. Don't force the weights to shift quite so much. Especially if you're adjusting the whole model rather than the last (few) layer(s).
(2) is the increasing sparsity and shift toward negative weights. Based on your sparsity plots (good idea btw), it looks to me like some weights might be getting stuck in a sparse configuration as a result of overcorrection. I.e., as a result of a high initial rate, the weights are "overshooting" their optimal position and getting stuck somewhere that makes it hard for them to recover and contribute to the model. That is, slightly negative and close to zero is not good in a ReLU network.
As I've mentioned (repeatedly) I'm not very familiar with RMSprop. But, since you're already running lots of training iterations, give low, low, low initial rates a shot and work your way up. I mean, see how 1e-8 works. It's possible the model won't respond to training with a rate that low, but do something of an informal hyperparameter search with the learning rate. In my experience with Inception using Adam, 1e-4 to 1e-8 worked well.
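An informal search can be as simple as a short loop over candidate rates; a hedged sketch, where build_model, train_ds and val_ds are assumed helpers and datasets rather than anything from the question:
import tensorflow as tf

# Informal learning-rate search: train briefly at each candidate rate and keep
# whichever gives the best validation AUC.
results = {}
for lr in [1e-8, 1e-7, 1e-6, 1e-5, 1e-4]:
    model = build_model()  # hypothetical model factory
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr),
                  loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.AUC(name='auc')])
    history = model.fit(train_ds, validation_data=val_ds, epochs=2, verbose=0)
    results[lr] = max(history.history['val_auc'])

print(sorted(results.items(), key=lambda kv: kv[1], reverse=True))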

TensorFlow RandomForest vs Deep learning

I am using TensorFlow to train a model which has 1 output for 4 inputs. It is a regression problem.
I found that when I use a random forest to train the model, it converges quickly and also performs well on the test data. But when I use a simple neural network for the same problem, the loss (root mean squared error) does not converge; it gets stuck at a particular value.
I tried increasing/decreasing the number of hidden layers and increasing/decreasing the learning rate. I also tried multiple optimizers and tried to train the model on both normalized and non-normalized data.
I am new to this field, but the literature I have read so far strongly asserts that a neural network should work better than a random forest.
What could be the reason behind non-convergence of the model in this case?
If your model is not converging, it may mean that the optimizer is stuck in a local minimum of your loss function.
I don't know which optimizer you are using, but try increasing the momentum or even the learning rate slightly.
Another strategy often employed is learning rate decay, which reduces your learning rate by a factor every several epochs. This can also help you avoid getting stuck in a local minimum early in the training phase, while achieving maximum accuracy towards the end of training.
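For example, tf.keras exposes exponential decay as a schedule that can be passed straight to the optimizer; a small sketch with illustrative numbers:
import tensorflow as tf

# Exponential learning-rate decay: every `decay_steps` optimizer steps the rate
# is multiplied by `decay_rate`, so it shrinks gradually over training.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=10_000, decay_rate=0.94)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)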
Otherwise you could try an adaptive optimizer (Adam, Adagrad, Adadelta, etc.) that takes care of the hyperparameter selection for you.
This is a very good post comparing different optimization techniques.
Deep neural networks need a significant amount of data to perform adequately. Be sure you have lots of training data, or your model will overfit.
A useful rule when you begin training models is not to start with the more complex methods: begin with, for example, a linear model, which you will be able to understand and debug more easily.
In case you continue with the current methods, some ideas:
Check the initial weight values (initialize them with a normal distribution).
As a previous poster said, lower the learning rate.
Do some additional checking on the data: check for NaNs and outliers, since the current models could be more sensitive to noise. Remember, garbage in, garbage out (see the sketch below).
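A quick sanity check for NaNs and extreme outliers before training might look like this hedged sketch, assuming the features sit in a NumPy array X and the targets in y:
import numpy as np

# Hypothetical check on features X and targets y before training.
print("NaNs in X:", np.isnan(X).sum(), " NaNs in y:", np.isnan(y).sum())

# Flag values more than 5 standard deviations from the column mean as outliers.
z = np.abs((X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12))
print("outlier rows:", np.where((z > 5).any(axis=1))[0])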

Tensorflow neural network loss value NaN

I'm trying to build a simple multilayer perceptron model on a large data set, but I'm getting the loss value as NaN. The weird thing is: after the first training step, the loss value is not NaN and is about 46 (which is oddly low; when I run a logistic regression model, the first loss value is about ~3600). But right after that, the loss value is constantly NaN. I used tf.Print to try and debug it as well.
The goal of the model is to predict ~4500 different classes - so it's a classification problem. When using tf.Print, I see that after the first training step (or feed-forward through the MLP), the predictions coming out of the last fully connected layer seem right (all varying numbers between 1 and 4500). But after that, the outputs from the last fully connected layer go to either all 0's or some other constant number (0 0 0 0 0).
For some information about my model:
3 layer model. all fully connected layers.
batch size of 1000
learning rate of .001 (i also tried .1 and .01 but nothing changed)
using CrossEntropyLoss (I did add an epsilon value to prevent log(0))
using AdamOptimizer
learning rate decay is .95
The exact code for the model is below: (I'm using the TF-Slim library)
input_layer = slim.fully_connected(model_input, 5000, activation_fn=tf.nn.relu)
hidden_layer = slim.fully_connected(input_layer, 5000, activation_fn=tf.nn.relu)
output = slim.fully_connected(hidden_layer, vocab_size, activation_fn=tf.nn.relu)
output = tf.Print(output, [tf.argmax(output, 1)], 'out = ', summarize = 20, first_n = 10)
return {"predictions": output}
Any help would be greatly appreciated! Thank you so much!
A few (possibly more) reasons why it doesn't work:
You skipped or inappropriately applied feature scaling of your inputs and outputs. Consequently, the data may be difficult for TensorFlow to handle.
Using ReLU, which is unbounded above and not differentiable at zero, may raise issues. Try using other activation functions, such as tanh or sigmoid.
For some reason, your training process has diverged, and you may have infinite values in your weights, which gives NaN losses. The reasons can be many; try changing your training parameters (use smaller batches as a test).
Also, using a ReLU on the last output layer of a classifier is not the usual method; try using a sigmoid.
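A hedged sketch of that kind of change against the question's TF-Slim snippet: drop the ReLU on the output layer so it emits raw logits, and let a built-in, numerically stable cross-entropy apply the squashing. Softmax is shown here as one common choice for a single-label problem with ~4500 classes; the sigmoid suggested above would use tf.nn.sigmoid_cross_entropy_with_logits instead. hidden_layer and vocab_size come from the question's code, and labels is assumed to hold integer class IDs:
# Reuses the question's TF-Slim imports (tf, slim) and variable names.
# Output layer emits raw logits (no ReLU); the loss applies softmax in a
# numerically stable way, so no hand-rolled epsilon is needed.
logits = slim.fully_connected(hidden_layer, vocab_size, activation_fn=None)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))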
From my understanding, ReLU doesn't put a cap on the upper bound of the activations, so the network is more likely to diverge depending on the implementation.
Try switching all the activation functions to tanh or sigmoid. ReLU is generally used for convolutions in CNNs.
It's also difficult to determine whether you're diverging due to the cross entropy, as we don't know how you affected it with your epsilon value. Try just using the residual; it's much simpler but still effective.
Also, a 5000-5000-4500 neural network is huge. It's unlikely you actually need a network that large.
