My loss first decreased for a few epochs, then started increasing, rose to a certain point, and then stopped moving. I think it has now converged. Can we say that my model is underfitting? My interpretation (slide 93 link) is that a loss that goes down and then back up indicates a learning rate that is too high. I decay the learning rate every 2 epochs, so after a few epochs the loss stopped increasing because the learning rate is low by then. Since I keep decaying the learning rate, the loss should, according to slide 93, start decreasing again, but it doesn't. Can we say that the loss is not decreasing further because my model is underfitting?
So, to summarize, the loss on the training data:
first went down
then it went up again
then it remains at the same level
then the learning rate is decayed
and the loss doesn't go back down again (still stays the same with decayed learning rate)
To me it sounds like the learning rate was too high initially, and the optimization got stuck in a local minimum afterwards. Decaying the learning rate at that point, once it's already stuck in a local minimum, is not going to help it escape that minimum. Setting the initial learning rate to a lower value is more likely to be beneficial, so that you don't end up in the "bad" local minimum to begin with.
It is possible that your model is now underfitting, and that making the model more complex (more nodes in hidden layers, for instance) would help. This is not necessarily the case though.
Are you using any techniques to avoid overfitting? For example, regularization and/or dropout? If so, it is also possible that your model was initially overfitting (when the loss was going down, before it went back up again). To get a better idea of what's going on, it would be beneficial to plot not only your loss on the training data, but also loss on a validation set. If the loss on your training data drops significantly below the loss on the validation data, you know it's overfitting.
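If it helps, here is a minimal sketch of tracking both curves in Keras (the model, X_train/y_train and X_val/y_val names are placeholders, not from your setup):

import matplotlib.pyplot as plt

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50)

# Compare the two curves: a training loss that drops far below the
# validation loss is the classic sign of overfitting.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()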
Related
I'm working on a regression problem using a neural network. The MSE loss decreases at the beginning of training and the accuracy is satisfactory, yet as training goes on the loss makes a huge jump and then stays at a certain value, like the curve in the picture. I don't know why this happens or how to fix it. Can I use the trained coefficients from before the jump, e.g. at training step 8000, as my final result?
This is a typical case of model training where the accuracy metric stops improving (and even gets worse) after a certain number of training epochs.
I suggest you implement Early Stopping. That means that, yes, you can take the model from training step 8000 as your final result if your only goal is to minimize the training loss.
This TF documentation explains how to implement Early Stopping with Tensorflow's tf.keras.callbacks.EarlyStopping() function.
However, if your goal is a model that generalizes well on unseen data (test/validation data), as is generally the case, you might want to evaluate your model's validation accuracy and take it into account when implementing Early Stopping.
This article gives a very good example of end-to-end implementation of early stopping with Tensorflow.
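As a rough sketch of what that looks like in code (the model and data names below are placeholders; monitor the validation loss if generalization is your goal):

import tensorflow as tf

# Stop once the monitored quantity has not improved for `patience` epochs,
# and roll back to the weights of the best epoch seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  patience=10,
                                                  restore_best_weights=True)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=1000,                 # upper bound; early stopping usually ends sooner
          callbacks=[early_stopping])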
I am going through Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron and I'm trying to make sense of what I'm doing wrong while solving an exercise. It's exercise 8 from Chapter 11. What I have to do is train a neural network with 20 hidden layers, 100 neurons each, with the activation function ELU and weight initializer He Normal on the CIFAR10 dataset (I know 20 hidden layers of 100 neurons is a lot, but that's the point of the exercise, so bear with me). I have to use Early Stopping and Nadam optimizer.
The problem that I have is that I didn't know what learning rate to use. In the solutions notebook, the author listed a bunch of learning rates that he tried and used the best one he found. I wasn't satisfied by this and I decided to try to find the best learning rate myself. So I used a technique that was recommended in the book: train the network for one epoch, exponentially increasing the learning rate at each iteration. Then plot the loss as a function of the learning rate, see where the loss hits its minimum, and choose a slightly smaller learning rate (since that's the upper bound).
This is the code from my model:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 activation="elu",
                                 kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))
optimizer = keras.optimizers.Nadam(lr=1e-5)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
(Ignore the value of the learning rate, it doesn't matter yet since I'm trying to find the right one.)
Here is the code that was used to find the optimal learning rate:
class ExponentialLearningRate(keras.callbacks.Callback):
    def __init__(self, factor):
        self.factor = factor
        self.rates = []
        self.losses = []

    def on_batch_end(self, batch, logs):
        self.rates.append(keras.backend.get_value(self.model.optimizer.lr))
        self.losses.append(logs["loss"])
        keras.backend.set_value(self.model.optimizer.lr, self.model.optimizer.lr * self.factor)

def find_learning_rate(model, X, y, epochs=1, batch_size=32, min_rate=10**-5, max_rate=10):
    init_weights = model.get_weights()
    init_lr = keras.backend.get_value(model.optimizer.lr)
    iterations = len(X) // batch_size * epochs
    factor = np.exp(np.log(max_rate / min_rate) / iterations)
    keras.backend.set_value(model.optimizer.lr, min_rate)
    exp_lr = ExponentialLearningRate(factor)
    history = model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=[exp_lr])
    keras.backend.set_value(model.optimizer.lr, init_lr)
    model.set_weights(init_weights)
    return exp_lr.rates, exp_lr.losses
def plot_lr_vs_losses(rates, losses):
    plt.figure(figsize=(10, 5))
    plt.plot(rates, losses)
    plt.gca().set_xscale("log")
    plt.hlines(min(losses), min(rates), max(rates))
    plt.axis([min(rates), max(rates), min(losses), losses[0] + min(losses) / 2])
    plt.xlabel("Learning rate")
    plt.ylabel("Loss")
The find_learning_rate() function exponentially increases the learning rate at each iteration, going from the minimum learning rate of 10^(-5) to the maximum learning rate of 10. After that, I plotted the curve using the function plot_lr_vs_losses() and this is what I got:
Looks like using a learning rate of 1e-2 would be great, right? But when I re-compile the model with a learning rate of 1e-2, the model's accuracy on both the training set and the validation set is about 10%, which is like choosing randomly, since we have 10 classes. I used early stopping, so I can't say that I let the model train for too many epochs (I used 100). But even during training, the model doesn't learn anything: the accuracy on both the training set and the validation set stays at around 10%.
This whole problem disappears when I use a much smaller learning rate (the one used by the author in the solutions notebook). When I use a learning rate of 5e-5 the model is learning and reaches around 50% accuracy on the validation set (which is what the exercise expects, that's the same accuracy the author got). But how is it that using the learning rate indicated by the plot is so bad? I read a little bit on the internet and this method of exponentially increasing the learning rate seems to be used by many people, so I really don't understand what I did wrong.
You're using a heuristic search method on an unknown exploration space. Without more information on the model/data characteristics, it's hard to say what went wrong.
My first worry is the abrupt rise to effective infinity for the loss; you have an edge in your exploration space that is not smooth, suggesting that the larger space (including many epochs of training) has a highly irregular boundary. It's possible that any learning rate near the epoch = 1 boundary will stumble over the cliff at later epochs, leaving you with random classifications.
The heuristic you used is based on a couple of assumptions.
Convergence speed as a function of learning rate is relatively smooth
Final accuracy is virtually independent of the learning rate.
It appears that your model does not exhibit these characteristics.
The heuristic trains on only one epoch; how many epochs does it take to converge the model at various learning rates? If the learning rate is too large, the model may do that final convergence very slowly, as it circles the optimum point. It's also possible that you never got close to that point with a rate that was too large.
Without mapping the convergence space with respect to that epoch-1 test, we can't properly analyze the problem. However, you can try a related experiment: starting at, perhaps, 10^-4, fully train your model (detect convergence and stop). Repeat, multiplying the LR by 3 each time. When you cross into non-convergence, around 0.0081, you have a feel for where you no longer converge.
Now subdivide that range [0.0027, 0.0081] as you see fit. Once you find an upper endpoint that does converge, you can use that to guide a final search for the optimal learning rate.
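A rough sketch of that coarse sweep (build_model() and the dataset names are hypothetical stand-ins for your own code, and early stopping is used here as the convergence detector):

from tensorflow import keras

results = {}
for lr in (1e-4, 3e-4, 9e-4, 2.7e-3, 8.1e-3):       # each step multiplies the LR by 3
    model = build_model()                            # hypothetical helper that rebuilds the 20-layer net
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.Nadam(learning_rate=lr),
                  metrics=["accuracy"])
    early_stop = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
    history = model.fit(X_train, y_train,
                        validation_data=(X_valid, y_valid),
                        epochs=100,
                        callbacks=[early_stop])
    results[lr] = min(history.history["val_loss"])   # did it converge, and how well?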
I am training a NN and getting this result on loss and validation loss:
These are 200 epochs, a batch size of 16, 500 training samples and 200 validation samples.
As you can see, after about 20 epochs, the validation loss begins to do a very exaggerated zig-zagging.
Do you know which could be the reason for that behavior?
I tried to increase the number of validation samples but that just increased the zig-zagging and made it more exaggerated.
Also, I added a decay value to the optimizer, but the loss and validation loss did not look so good.
I was looking for another way to improve it.
Any idea on which is the zig-zagging reason and how could I minimize it?
This might be a case of overfitting:
Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
Basically, you have a very small training sample (500), but are training for a very long time (200 epochs!).
The network will start learning your training data by heart and won't learn to generalise. It will thus seem to be very good during training, but will fail miserably on the test set.
Early stopping is a nice way to avoid overfitting: basically, stop as soon as the validation loss becomes erratic or starts increasing. Another way to lower the chances of overfitting is to use techniques such as dropout, or simply to increase the amount of training data.
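If you want to try dropout, here is a minimal Keras sketch (the layer sizes, input shape and output layer are placeholders, not taken from your model):

from tensorflow import keras

# n_features is a placeholder for the number of input features in your data.
model = keras.models.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=[n_features]),
    keras.layers.Dropout(0.5),   # randomly zeroes 50% of the activations, during training only
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1)        # adapt the output layer to your task
])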
tl;dr: you are overfitting. To avoid this issue there are many possibilities: drastically reduce the number of epochs, use a dev set and a stopping criterion, get more training data, ...
For alternative explanations, see also this question on Quora.
I would suggest not worrying about the zigzag pattern of the validation loss or validation accuracy. As training of the neural network goes on, it makes mistakes and updates the weights (if you know the math behind it). So it is to be expected that the validation data produces a zigzag while the model is still in its learning stage. Once the model is fully trained, you will notice that the zigzag decreases (if you have chosen the correct number of epochs).
So don't worry about this.
I'm building a Keras sequential model to do binary image classification. When I use around 70 to 80 epochs I start getting good validation accuracy (81%). But I was told that this is a very large number of epochs, and that it would affect the performance of the network.
My question is: is there a limit on the number of epochs that I shouldn't exceed? Note that I have 2000 training images and 800 validation images.
If the number of epochs is very high, your model may overfit and your training accuracy will reach 100%. To decide when to stop, plot the error rate on the training and validation data: the horizontal axis is the number of epochs and the vertical axis is the error rate. You should stop training when the error rate on the validation data is at its minimum.
You need to find a trade-off with your regularization parameters. A major problem in Deep Learning is overfitting the model. Various regularization techniques are used, such as:
i) Reducing the batch size
ii) Data augmentation (only if your data is not diverse)
iii) Batch normalization
iv) Reducing complexity in the architecture (mainly the convolutional layers)
v) Introducing dropout layers (only if you are using dense layers)
vi) Reducing the learning rate
vii) Transfer learning
The batch-size vs. epoch trade-off is quite important. It also depends on your data and varies from application to application, so you have to play with your data a little to find the right figure. Normally, a batch size of 32 medium-sized images requires about 10 epochs for good feature extraction by the convolutional layers. Again, this is relative.
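As an illustration of item ii), a minimal data-augmentation sketch with Keras' ImageDataGenerator (the transform values and the X_train/y_train, X_val/y_val names are just examples):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Generates randomly transformed copies of the training images on the fly,
# which effectively enlarges the training set and reduces overfitting.
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

model.fit(datagen.flow(X_train, y_train, batch_size=32),
          validation_data=(X_val, y_val),
          epochs=20)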
There's an Early Stopping callback that Keras supplies, which you simply define:
EarlyStopping(patience=self.patience, verbose=self.verbose, monitor=self.monitor)
Let's say that the epochs parameter equals 80, as you said before. When you use the EarlyStopping callback, that number of epochs becomes the maximum number of epochs.
You can define the EarlyStopping callback to monitor the validation loss, for example. Whenever this loss stops improving, it will give the model a few last chances (the number you put in the patience parameter), and if after those last chances the monitored value still hasn't improved, the training process will stop.
The best practice, in my opinion, is to use both EarlyStopping and ModelCheckpoint, which is another callback supplied by Keras' API that simply saves your best model so far (you decide what "best" means: best loss or another value you use to evaluate your results).
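Roughly, that combination looks like this (the file name, patience value and data names are just examples):

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop once val_loss has not improved for `patience` epochs.
    EarlyStopping(monitor="val_loss", patience=10),
    # Keep a copy of the best model seen so far on disk.
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
]

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=80,                 # now only an upper bound
          callbacks=callbacks)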
This is the Keras solution for the problem you're trying to deal with. In addition, there is a lot of online material that you can read about how to deal with overfitting.
Yes! There is a solution to your problem: select a large number of epochs, e.g. 1k or 2k, and just use early stopping on your neural net.
Early Stopping:
Keras supports early stopping of training via a callback called EarlyStopping.
This callback allows you to specify the performance measure to monitor and the trigger; once triggered, it will stop the training process. For example, you can set a trigger that stops the training if accuracy has not increased in the previous 5 epochs. Keras will then look back over the previous 5 epochs through the callback and stop training if your accuracy is not increasing.
Early Stopping link :
I am using TensorFlow to train a model which has 1 output for its 4 inputs. The problem is one of regression.
I found that when I use a random forest to train the model, it quickly converges and also runs well on the test data. But when I use a simple neural network for the same problem, the loss (root mean square error) does not converge; it gets stuck at a particular value.
I tried increasing/decreasing the number of hidden layers and increasing/decreasing the learning rate. I also tried multiple optimizers, and I tried training the model on both normalized and non-normalized data.
I am new to this field, but the literature that I have read so far asserts quite strongly that the neural network should work at least marginally better than the random forest.
What could be the reason behind non-convergence of the model in this case?
If your model is not converging, it means that the optimizer is stuck in a local minimum of your loss function.
I don't know what optimizer you are using but try increasing the momentum or even the learning rate slightly.
Another strategy that is often employed is learning rate decay, which reduces your learning rate by a factor every few epochs. This can also help you avoid getting stuck in a local minimum early in the training phase, while still reaching maximum accuracy towards the end of training.
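A minimal sketch of such a decay schedule in Keras (the schedule values and the use of SGD with momentum are just examples, not a recommendation for your model):

from tensorflow import keras

# Start at 1e-2 and multiply the learning rate by 0.9 every 1000 optimizer steps.
schedule = keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=1e-2,
                                                       decay_steps=1000,
                                                       decay_rate=0.9)

optimizer = keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
model.compile(loss="mse", optimizer=optimizer)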
Otherwise, you could try selecting an adaptive optimizer (Adam, Adagrad, Adadelta, etc.) that takes care of the hyperparameter selection for you.
This is a very good post comparing different optimization techniques.
Deep neural networks need a significant amount of data to perform adequately. Be sure you have lots of training data, or your model will overfit.
A useful rule when beginning to train models is not to start with the more complex methods; begin with, for example, a linear model, which you will be able to understand and debug more easily.
In case you continue with the current methods, some ideas:
Check the initial weight values (init them with a normal distribution)
As a previous poster said, diminish the learning rate
Do some additional checking on the data: check for NaNs and outliers, since the current models could be more sensitive to noise. Remember: garbage in, garbage out.
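As a quick sketch of that last check (assuming the inputs and targets are NumPy arrays X and y):

import numpy as np

# Look for NaNs and infinities in the inputs and targets.
print("NaNs in X:", np.isnan(X).sum(), " NaNs in y:", np.isnan(y).sum())
print("Infs in X:", np.isinf(X).sum(), " Infs in y:", np.isinf(y).sum())

# Crude outlier check: values more than 5 standard deviations from the feature mean.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print("Potential outliers per feature:", (z > 5).sum(axis=0))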