Using the optimal learning rate results in random guessing accuracy - python

I am going through Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron and I'm trying to make sense of what I'm doing wrong while solving an exercise. It's exercise 8 from Chapter 11. What I have to do is train a neural network with 20 hidden layers, 100 neurons each, with the activation function ELU and weight initializer He Normal on the CIFAR10 dataset (I know 20 hidden layers of 100 neurons is a lot, but that's the point of the exercise, so bear with me). I have to use Early Stopping and Nadam optimizer.
The problem that I have is that I didn't know what learning rate to use. In the solutions notebook, the author listed a bunch of learning rates that he tried and used the best one he found. I wasn't satisfied by this and I decided to try to find the best learning rate myself. So I used a technique that was recommended in the book: train the network for one epoch, exponentially increasing the learning rate at each iteration. Then plot the loss as a function of the learning rate, see where the loss hits its minimum, and choose a slightly smaller learning rate (since the rate at the minimum is an upper bound).
This is the code from my model:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 activation="elu",
                                 kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))
optimizer = keras.optimizers.Nadam(lr=1e-5)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
(Ignore the value of the learning rate, it doesn't matter yet since I'm trying to find the right one.)
Here is the code that was used to find the optimal learning rate:
class ExponentialLearningRate(keras.callbacks.Callback):
    def __init__(self, factor):
        self.factor = factor
        self.rates = []
        self.losses = []

    def on_batch_end(self, batch, logs):
        self.rates.append(keras.backend.get_value(self.model.optimizer.lr))
        self.losses.append(logs["loss"])
        keras.backend.set_value(self.model.optimizer.lr, self.model.optimizer.lr * self.factor)

def find_learning_rate(model, X, y, epochs=1, batch_size=32, min_rate=10**-5, max_rate=10):
    init_weights = model.get_weights()
    init_lr = keras.backend.get_value(model.optimizer.lr)
    iterations = len(X) // batch_size * epochs
    factor = np.exp(np.log(max_rate / min_rate) / iterations)
    keras.backend.set_value(model.optimizer.lr, min_rate)
    exp_lr = ExponentialLearningRate(factor)
    history = model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=[exp_lr])
    keras.backend.set_value(model.optimizer.lr, init_lr)
    model.set_weights(init_weights)
    return exp_lr.rates, exp_lr.losses

def plot_lr_vs_losses(rates, losses):
    plt.figure(figsize=(10, 5))
    plt.plot(rates, losses)
    plt.gca().set_xscale("log")
    plt.hlines(min(losses), min(rates), max(rates))
    plt.axis([min(rates), max(rates), min(losses), losses[0] + min(losses) / 2])
    plt.xlabel("Learning rate")
    plt.ylabel("Loss")
The find_learning_rate() function exponentially increases the learning rate at each iteration, going from the minimum learning rate of 10^(-5) to the maximum learning rate of 10. After that, I plotted the curve using the function plot_lr_vs_losses() and this is what I got:
Looks like using a learning rate of 1e-2 would be great, right? But when I re-compile the model with a learning rate of 1e-2, its accuracy on both the training set and the validation set is about 10%, which is like choosing randomly, since we have 10 classes. I used early stopping, so I can't say that I let the model train for too many epochs (I used 100). But even during training the model doesn't learn anything: the accuracy on both the training set and the validation set stays at around 10%.
This whole problem disappears when I use a much smaller learning rate (the one used by the author in the solutions notebook). When I use a learning rate of 5e-5 the model is learning and reaches around 50% accuracy on the validation set (which is what the exercise expects, that's the same accuracy the author got). But how is it that using the learning rate indicated by the plot is so bad? I read a little bit on the internet and this method of exponentially increasing the learning rate seems to be used by many people, so I really don't understand what I did wrong.

You're using a heuristic search method on an unknown exploration space. Without more information on the model/data characteristics, it's hard to say what went wrong.
My first worry is the abrupt rise to effective infinity in the loss; you have an edge in your exploration space that is not smooth, suggesting that the larger space (including many epochs of training) has a sharp, cliff-like boundary. It's possible that any learning rate near that epoch = 1 boundary will stumble across the cliff at later epochs, leaving you with random classifications.
The heuristic you used is based on a couple of assumptions:
(1) Convergence speed as a function of learning rate is relatively smooth.
(2) Final accuracy is virtually independent of the learning rate.
It appears that your model does not exhibit these characteristics.
The heuristic trains on only one epoch; how many epochs does it take to converge the model at various learning rates? If the learning rate is too large, the model may do that final convergence very slowly, as it circles the optimum point. It's also possible that you never got close to that point with a rate that was too large.
Without mapping the convergence space with respect to that one-epoch test, we can't properly analyze the problem. However, you can try a related experiment: starting at, perhaps, 1e-4, fully train your model (detect convergence and stop). Repeat, multiplying the learning rate by 3 each time. When you cross into non-convergence, somewhere around 0.0081, you will have a feel for where the model stops converging.
Now subdivide that range [.0027, .0081] as you see fit. Once you find an upper endpoint that does converge, you can use that to guide a final search for the optimal learning rate.
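A minimal sketch of that coarse-to-fine search, assuming a hypothetical build_model() helper that rebuilds the untrained 20x100 ELU network, and that X_train, y_train, X_valid, y_valid are already loaded; the "better than random guessing" threshold is only an illustrative convergence criterion:
from tensorflow import keras

def converges(lr, max_epochs=30):
    model = build_model()  # hypothetical helper returning a fresh, untrained network
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.Nadam(learning_rate=lr),
                  metrics=["accuracy"])
    early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
    history = model.fit(X_train, y_train, epochs=max_epochs,
                        validation_data=(X_valid, y_valid),
                        callbacks=[early_stop], verbose=0)
    # treat "clearly better than the 10% random-guessing baseline" as convergence
    return max(history.history["val_accuracy"]) > 0.15

lr = 1e-4
while converges(lr):
    print(f"{lr:.1e} converges")
    lr *= 3  # coarse step: multiply the learning rate by 3 each time
print(f"{lr:.1e} does not converge; subdivide [{lr / 3:.1e}, {lr:.1e}]")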

Related

Number of epochs to be used in a Keras sequential model

I'm building a Keras sequential model to do a binary image classification. Now when I use like 70 to 80 epochs I start getting good validation accuracy (81%). But I was told that this is a very big number to be used for epochs which would affect the performance of the network.
My question is: is there a limit on the number of epochs that I shouldn't exceed? Note that I have 2000 training images and 800 validation images.
If the number of epochs is very high, your model may overfit and your training accuracy will reach 100%. A common approach is to plot the error rate on the training and validation data: the horizontal axis is the number of epochs and the vertical axis is the error rate. You should stop training when the error rate on the validation data is at its minimum.
You need to find a trade-off with your regularization parameters; a major problem in deep learning is overfitting. Various regularization techniques are used, such as:
i) Reducing the batch size
ii) Data augmentation (only if your data is not diverse)
iii) Batch normalization
iv) Reducing complexity in the architecture (mainly the convolutional layers)
v) Introducing dropout layers (only if you are using dense layers)
vi) Reducing the learning rate
vii) Transfer learning
The batch-size vs. epochs tradeoff is quite important. It also depends on your data and varies from application to application; you have to play with your data a little bit to find the right figures. Normally a batch size of 32 medium-sized images requires about 10 epochs for good feature extraction from the convolutional layers. Again, it is relative.
There's the EarlyStopping callback that Keras supplies, which you simply define:
EarlyStopping(patience=self.patience, verbose=self.verbose, monitor=self.monitor)
Let's say the epochs parameter equals 80, like you said before. When you use the EarlyStopping callback, that number of epochs becomes the maximum number of epochs.
You can define the EarlyStopping callback to monitor the validation loss, for example. Whenever this loss stops improving it will give it a few last chances (the number you put in the patience parameter), and if after those last chances the monitored value still hasn't improved, the training process stops.
The best practice, in my opinion, is to use both EarlyStopping and ModelCheckpoint, which is another callback function supplied in Keras' API that simply saves your last best model (you decide what best means, best loss or other value that you test your results with).
This is the Keras solution for the problem you're trying to deal with. In addition, there is a lot of online material that you can read about how to deal with overfitting.
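A minimal sketch of using both callbacks together, assuming a compiled model and training/validation arrays named X_train, y_train, X_valid, y_valid; the monitored quantity, filename, and patience value are illustrative choices:
from tensorflow import keras

callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                  restore_best_weights=True, verbose=1),
    keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                    save_best_only=True, verbose=1),
]
# epochs=80 is now only an upper bound; training stops earlier if val_loss stalls
model.fit(X_train, y_train, validation_data=(X_valid, y_valid),
          epochs=80, callbacks=callbacks)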
Yes, there is a solution for your problem: pick a large number of epochs, e.g. 1k or 2k, and just use early stopping on your neural net.
Early Stopping:
Keras supports early stopping of training via a callback called EarlyStopping.
This callback allows you to specify the performance measure to monitor and the trigger; once triggered, it stops the training process. For example, you can apply a trigger that stops training if accuracy has not increased over the previous 5 epochs: Keras will look back over the previous 5 epochs through the callback and stop training if your accuracy is not increasing.
Early Stopping link :

Interpreting tensorboard plots

I'm still a newbie in TensorFlow and I'm trying to understand in detail what's happening while my models train. Briefly, I'm using the slim models pretrained on ImageNet to do fine-tuning on my dataset. Here are some plots extracted from TensorBoard for 2 separate models:
Model_1 (InceptionResnet_V2)
Model_2 (InceptionV4)
So far, both models have poor results on the validation sets (Average Az (Area under the ROC curve) = 0.7 for Model_1 & 0.79 for Model_2). My interpretation of these plots is that the weights are not changing over the mini-batches; only the biases change over the mini-batches, and this might be the problem. But I don't know where to look to verify this point. This is the only interpretation I can think of, but it might be wrong considering that I'm still a newbie. Can you please share your thoughts? Don't hesitate to ask for more plots if needed.
EDIT:
As you can see in the plots below, it seems the weights are barely changing over time. This applies to all the other weights in both networks as well. This led me to think that there is a problem somewhere, but I don't know how to interpret it.
InceptionV4 weights
InceptionResnetV2 weights
EDIT2:
These models were first trained on ImageNet and these plots are the results of finetuning them on my dataset. I'm using a dataset of 19 classes with roughly 800000 images in it. I'm doing a multi-label classification problem and I'm using sigmoid_crossentropy as a loss function. The classes are highly unbalanced. In the table below, we're showing the percentage of presence of each class in the 2 subsets (train, validation):
Objects train validation
obj_1 3.9832 % 0.0000 %
obj_2 70.6678 % 33.3253 %
obj_3 89.9084 % 98.5371 %
obj_4 85.6781 % 81.4631 %
obj_5 92.7638 % 71.4327 %
obj_6 99.9690 % 100.0000 %
obj_7 90.5899 % 96.1605 %
obj_8 77.1223 % 91.8368 %
obj_9 94.6200 % 98.8323 %
obj_10 88.2051 % 95.0989 %
obj_11 3.8838 % 9.3670 %
obj_12 50.0131 % 24.8709 %
obj_13 0.0056 % 0.0000 %
obj_14 0.3237 % 0.0000 %
obj_15 61.3438 % 94.1573 %
obj_16 93.8729 % 98.1648 %
obj_17 93.8731 % 97.5094 %
obj_18 59.2404 % 70.1059 %
obj_19 8.5414 % 26.8762 %
The values of the hyperparams:
batch_size=32
weight_decay = 0.00004 #'The weight decay on the model weights.'
optimizer = rmsprop
rmsprop_momentum = 0.9
rmsprop_decay = 0.9 #'Decay term for RMSProp.'
learning_rate_decay_type = exponential #Specifies how the learning rate is decayed
learning_rate = 0.01 #Initial learning rate.
learning_rate_decay_factor = 0.94 #Learning rate decay factor
num_epochs_per_decay = 2.0 #'Number of epochs after which learning rate decays.'
Concerning the sparsity of the layers, here are some samples of the sparsity of the layers for both networks:
sparsity (InceptionResnet_V2)
sparsity (InceptionV4)
EDIT3:
Here are the plots of the losses for both models:
Losses and regularization loss (InceptionResnet_V2)
Losses and regularization loss (InceptionV4)
I agree with your assessment - the weights aren't changing very much across the minibatches. It does appear they are changing somewhat.
As I'm sure you're aware, you are doing fine tuning with very large models. As such, backprop can sometimes take a while. But, you're running many training iterations. I don't really think this is the problem.
If I'm not mistaken, both of these were originally trained on ImageNet. If your images are in a totally different domain than something in ImageNet, that could explain the problem.
The backprop equations do make it easier for biases to change under certain activation ranges. ReLU can be one such case if the model is highly sparse (i.e. if many layers have activation values of 0, then weights will struggle to adjust but biases will not). Also, if activations are in the range [0, 1], the gradient with respect to a weight (the upstream gradient scaled by the input activation) will be no larger, and usually smaller, than the gradient with respect to a bias (the upstream gradient alone). (This is part of why sigmoid is a bad activation function.)
It could also be related to your readout layer - specifically the activation function. How are you calculating error? Is this a classification or regression problem? If at all possible, I recommend using something other than sigmoid as your final activation function. tanh could be marginally better. Linear readout sometimes speeds up training, too (all the gradients have to "pass through" the readout layer. If the derivative of the readout layer is always 1 - linear - you're "letting more gradient through" to adjust the weights further down the model).
Lastly I notice your weights histograms are pushing towards negative weights. Sometimes, especially with models that have a lot of ReLU activation, that can be an indicator of the model learning sparsity. Or an indicator of the dead neuron problem. Or both - the two are somewhat linked.
Ultimately, I think your model is just struggling to learn. I've encountered very similar histograms retraining Inception. I was using a dataset of about 2000 images, and I was struggling to push it over 80% accuracy (as it happens, the dataset was heavily biased - that accuracy was roughly as good as random guessing). It helped when I made the convolution variables constant and only made changes to the fully connected layer.
Indeed this is a classification problem, and sigmoid cross-entropy is the appropriate loss function (paired with sigmoid outputs). And you do have a sizable dataset - certainly big enough to fine-tune these models.
With this new information, I would suggest lowering the initial learning rate. I have a two-fold reasoning here:
(1) is my own experience. As I mentioned, I'm not especially familiar with RMSprop. I've only used it in the context of DNCs (though, DNCs with convolutional controllers), but my experience there backs up what I'm about to say. I think .01 is high for training a model from scratch, let alone fine tuning. It's definitely high for Adam. In some sense, starting with a small learning rate is the "fine" part of fine tuning. Don't force the weights to shift quite so much. Especially if you're adjusting the whole model rather than the last (few) layer(s).
(2) is the increasing sparsity and shift toward negative weights. Based on your sparsity plots (good idea btw), it looks to me like some weights might be getting stuck in a sparse configuration as a result of overcorrection. I.e., as a result of a high initial rate, the weights are "overshooting" their optimal position and getting stuck somewhere that makes it hard for them to recover and contribute to the model. That is, slightly negative and close to zero is not good in a ReLU network.
As I've mentioned (repeatedly) I'm not very familiar with RMSprop. But, since you're already running lots of training iterations, give low, low, low initial rates a shot and work your way up. I mean, see how 1e-8 works. It's possible the model won't respond to training with a rate that low, but do something of an informal hyperparameter search with the learning rate. In my experience with Inception using Adam, 1e-4 to 1e-8 worked well.
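As a concrete illustration, here is a hedged Keras sketch (not the TF-Slim script from the question) that keeps the question's exponential-decay schedule and RMSProp settings but starts from a much lower initial rate such as 1e-4:
from tensorflow import keras

steps_per_epoch = 800000 // 32          # dataset size / batch_size from the question
schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,         # instead of 0.01
    decay_steps=2 * steps_per_epoch,    # num_epochs_per_decay = 2.0
    decay_rate=0.94,                    # learning_rate_decay_factor
    staircase=True)
optimizer = keras.optimizers.RMSprop(learning_rate=schedule, rho=0.9, momentum=0.9)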

TensorFlow RandomForest vs Deep learning

I am using TensorFlow to train a model which has 1 output for 4 inputs. It is a regression problem.
I found that when I use RandomForest to train the model, it quickly converges and also runs well on the test data. But when I use a simple neural network for the same problem, the loss (root mean squared error) does not converge. It gets stuck at a particular value.
I tried increasing/decreasing number of hidden layers, increasing/decreasing learning rate. I also tried multiple optimizers and tried to train the model on both normalized and non-normalized data.
I am new to this field but the literature that I have read so far vehemently asserts that the neural network should marginally and categorically work better than the random forest.
What could be the reason behind non-convergence of the model in this case?
If your model is not converging, it means that the optimizer is stuck in a local minimum of your loss function.
I don't know what optimizer you are using but try increasing the momentum or even the learning rate slightly.
Another strategy often employed is learning rate decay, which reduces your learning rate by a factor every several epochs. This can also help you avoid getting stuck in a local minimum early in the training phase, while still achieving maximum accuracy towards the end of training.
Otherwise you could try selecting an adaptive optimizer (Adam, Adagrad, Adadelta, etc.) that takes care of the hyperparameter selection for you.
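A minimal sketch of the step-decay idea, assuming a Keras model object named model and training arrays X, y; the decay factor (0.5) and interval (every 10 epochs) are illustrative choices:
from tensorflow import keras

def step_decay(epoch, lr):
    # halve the learning rate every 10 epochs, leave it unchanged otherwise
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
model.fit(X, y, epochs=100,
          callbacks=[keras.callbacks.LearningRateScheduler(step_decay, verbose=1)])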
This is a very good post comparing different optimization techniques.
Deep neural networks need a significant amount of data to perform adequately. Make sure you have lots of training data or your model will overfit.
A useful rule when starting to train models is not to begin with the more complex methods; start with, for example, a linear model, which you will be able to understand and debug more easily.
In case you continue with the current methods, some ideas:
Check the initial weight values (init them with a normal distribution)
As a previous poster said, diminish the learning rate
Do some additional checking on the data: check for NaNs and outliers; the current models could be more sensitive to noise. Remember: garbage in, garbage out.

Getting started with Keras for machine learning

I'm getting started with machine learning tools and I'd like to learn more about what the heck I'm doing. For instance, the script:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, BatchNormalization
from keras.initializers import RandomUniform
import numpy
numpy.random.seed(13)
RandomUniform(seed=13)
model = Sequential()
model.add(Dense(6, input_dim=6))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.01))
model.add(Dense(11))
model.add(Activation('tanh'))
model.add(Dropout(0.01))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(optimizer='sgd', loss='mean_absolute_error', metrics=['accuracy'])
data = numpy.loadtxt('train', delimiter=' ')
X = data[:, 0:6]
Y = data[:, 6]
model.fit(X, Y, batch_size=1, epochs=1000)
data = numpy.loadtxt('test', delimiter=' ')
X = data[:, 0:6]
Y = data[:, 6]
score = model.evaluate(X, Y, verbose=1)
print ('\n\nThe error is:\n', score, "\n")
print('\n\nPrediction:\n')
Y = model.predict(X, batch_size=1, verbose=1)
print('\nResult:\n', Y, '\n')
It's a Frankenstein I made from some examples I found on the internet and I have many unanswered questions about it:
The file train has 60 rows. Is 1000 epochs too little? Is it too much? Can I get an Underfit/Overfit?
What does the result I get from model.evaluate() mean? I know it's the loss but, if I get a [7.0506157875061035, 0.0], does it mean that my model has a 7% error?
And last, I'm getting a prediction of 0.99875391, 0.99875391, 0.9362126, 0.99875391, 0.99875391, 0.99875391, 0.93571019 when the expected values were anything close to 7.86, 3.57, 8.93, 6.57, 11.7, 8.53, 9.06, which means it's a real bad prediction. Clearly there's a lot of things I'm doing wrong. Could you guys give me a few pointers?
I know it all depends on the type of data I'm using, but is there anything I shouldn't do at all? Or maybe something I should be doing?
1
There is never a ready answer for how many epochs is a good number. It varies wildly depending on the size of your data, your model, and what you want to achieve. Normally, small models require fewer epochs and bigger models require more. Yours seems small enough, and 1000 epochs seems way too much.
It also depends on the learning rate, a parameter given to the optimizer that defines how large the steps are that your model takes to update its weights. Bigger learning rates mean fewer epochs, but there is a chance that you simply never find a good point because you're adjusting the weights beyond what is good. Smaller learning rates mean more epochs and better learning.
Normally, if the loss reaches a limit, you're approaching a point where training is not useful anymore. (Of course, there may be problems with the model too, there is really no simple answer for this one).
To detect overfitting, you need besides the training data (X and Y), another group with test data (say Xtest and Ytest, for instance).
Then you use it in model.fit(X,Y, validation_data=(Xtest,Ytest), ...)
Test data is not given for training, it's kept separate just to see if your model can predict good things from data it has never seen in training.
If the training loss goes down, but the validation loss doesn't, you're overfitting (roughly, your model is capable of memorizing the training data without really understanding it).
An underfit, on the contrary, happens when you never achieve the accuracy you expect (of course we always want 100% accuracy with no mistakes, but good models get around the 90s; some applications do better than 99%, some worse; again, it's very subjective).
2
model.evaluate() gives you the losses and the metrics you added in the compile method.
The loss value is something your model will always try to decrease during training. It roughly means how distant your model is from the exact values. There is no rule for what the loss value means, it could even be negative (but usually keras uses positive losses). The point is: it must decrease during training, that means your model is evolving.
The accuracy value means how many right predictions your model outputs compared to the true values (Y). It seems your accuracy is 0%, your model is getting everything wrong. (You can see that from the values you typed).
3
In your model, you used activation functions. These normalize the results so they don't get too big. This avoids overflowing problems, numeric errors propagating, etc.
It's very very usual to work with values within such bounds.
tanh - outputs values between -1 and 1
sigmoid - outputs values between 0 and 1
Well, if you used a sigmoid activation in the last layer, your model will never output 3 for instance. It tries, but the maximum value is 1.
What you should do is prepare your data (Y) so it's contained between 0 and 1. (This is the best thing to do in classification problems, and it is often done with images too.)
But if you actually want numerical values, then you should just remove the activation and let the output be free to reach higher values. (It all depends on what you want to achieve with your model)
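A minimal sketch of that regression variant, reusing the question's layer sizes purely as an assumption (the exact architecture is up to you):
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization

model = Sequential()
model.add(Dense(6, input_dim=6))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dense(11, activation='tanh'))
model.add(Dense(1))  # no sigmoid here, so the output is free to reach values like 7.86 or 11.7
model.compile(optimizer='sgd', loss='mean_squared_error')  # a regression loss, no accuracy metric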
An epoch is a single pass through the full training set. In my mind 1000 epochs seems like a lot, but you'd have to check for overfitting and evaluate the predictions. There are many ways of checking and controlling for overfitting in a model. If you understand the methods of doing so from here, coding them in Keras should be no problem.
According to the documentation .evaluate returns:
Scalar test loss (if the model has no metrics) or list of scalars (if the model computes other metrics)
so these are the evaluation metrics of your model, they tell you how good your model is given some notion of good. Those metrics depend on the model and type of data that you've used. Some explanation on those can be found here and here. As mentioned in the documentation,
The attribute model.metrics_names will give you the display labels for the scalar outputs.
So you can know what metric you are looking at. It is easier to do that interactively through the console (ipython, bpython) or Jupyter notebook.
I can't see your data, but if you are doing a classification problem, as suggested by metrics=['accuracy'], then loss=mean_absolute_error doesn't make sense, since it is meant for regression problems. To learn more about those, I refer you to here and here, which discuss classification and regression problems with Keras.
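For instance, a sketch of compile calls whose loss and metrics are consistent with the task (the 'sgd' optimizer is just carried over from the question's script):
# binary classification with a single sigmoid output
model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])

# regression with a single linear output
model.compile(optimizer='sgd', loss='mean_absolute_error', metrics=['mae'])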
PS: question 3 is not related to software per se, but to the theoretical construct supporting the software. In such cases, I'd recommend asking them at Cross Validated.

Is my model underfitting, tensorflow?

My loss first decreased for a few epochs, but then started increasing, rose to a certain point, and then stopped moving. I think it has now converged. Can we say that my model is underfitting? My interpretation (based on slide 93, link) is that if the loss goes down and then starts increasing, the learning rate is too high. I decay the learning rate every 2 epochs, so after a few epochs the loss stopped increasing because the learning rate is now low. Since I'm still decaying the learning rate, the loss should start decreasing again according to slide 93, but it doesn't. Can we say that the loss is not decreasing further because my model is underfitting?
So, to summarize, the loss on the training data:
first went down
then it went up again
then it remains at the same level
then the learning rate is decayed
and the loss doesn't go back down again (still stays the same with decayed learning rate)
To me it sounds like the learning rate was too high initially, and it got stuck in a local minimum afterwards. Decaying the learning rate at that point, once it's already stuck in a local minimum, is not going to help it escape that minimum. Setting the initial learning rate at a lower value is more likely to be beneficial, so that you don't end up in the ''bad'' local minimum to begin with.
It is possible that your model is now underfitting, and that making the model more complex (more nodes in hidden layers, for instance) would help. This is not necessarily the case though.
Are you using any techniques to avoid overfitting? For example, regularization and/or dropout? If so, it is also possible that your model was initially overfitting (when the loss was going down, before it went back up again). To get a better idea of what's going on, it would be beneficial to plot not only your loss on the training data, but also loss on a validation set. If the loss on your training data drops significantly below the loss on the validation data, you know it's overfitting.
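A minimal sketch of that diagnostic, assuming a Keras-style model (the question's TensorFlow setup may differ) and held-out arrays X_valid, y_valid:
import matplotlib.pyplot as plt

history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=50)
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()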
