I am using Stochastic Weight Averaging (SWA) with Batch Normalization layers in TensorFlow 2.2. For Batch Norm I use tf.keras.layers.BatchNormalization. For SWA I use my own code to average the weights (I wrote my code before tfa.optimizers.SWA appeared). I have read in multiple sources that when using batch norm with SWA we must run a forward pass to make certain data (the running mean and std dev of the activations, and/or momentum values?) available to the batch norm layers. What I do not understand - despite a lot of reading - is exactly what needs to be done and how. Specifically:
When must the forward/prediction pass be run? At the end of each mini-batch, end of each epoch, end of all training?
When the forward pass is run, how are the running mean & stdev values made available to the batch norm layers?
Is this process performed magically by the tfa.optimizers.SWA class?
When must the forward/prediction pass be run? At the end of each mini-batch, end of each epoch, end of all training?
At the end of training. Think of it like this: SWA is performed by swapping your final weights with a running average, but all batch norm layers still hold statistics calculated from your old weights. So we need to run a forward pass to let them catch up.
When the forward pass is run, how are the running mean & stdev values made available to the batch norm layers?
During a normal forward pass (prediction) the running mean and standard deviation will not be updated. So what we actually need to do is to train the network, but not update the weights. This is what the paper refers to when it says to run the forward pass in "training mode".
The easiest way to achieve this (that I know) is to reset the batch normalization layers and train one additional epoch with learning rate set to 0.
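As a rough sketch of what that looks like in tf.keras (assuming `model` is compiled and already holds the SWA-averaged weights, and `train_ds` is your training dataset - both names are placeholders):

import tensorflow as tf

# Assumption: `model` is a compiled tf.keras model whose weights have already
# been replaced by the SWA average, and `train_ds` is the training tf.data.Dataset.

# 1) Reset the batch norm statistics to their initial values.
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.moving_mean.assign(tf.zeros_like(layer.moving_mean))
        layer.moving_variance.assign(tf.ones_like(layer.moving_variance))

# 2) One extra epoch in training mode with learning rate 0: plain SGD with lr=0
#    leaves the weights untouched, but the BN moving averages are still updated.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.0),
              loss="categorical_crossentropy")  # use whatever loss you trained with
model.fit(train_ds, epochs=1)

Note that this only walks the top level of model.layers (nested sub-models would need a recursive walk), and any other stochastic layers such as dropout will also be active during that extra epoch.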
Is this process performed magically by the tfa.optimizers.SWA class?
I don't know. But if you are using TensorFlow Keras then I have made this Keras SWA callback that does it as in the paper, including the learning rate schedules.
Related
I searched all over the web but I couldn't find it: how does Keras calculate the loss if we have multiple outputs?
It depends on which loss function you are using.
Usually, for each batch during training, Keras will call the loss function to compute the loss and use it to perform a gradient descent step. Moreover, it will keep track of the total loss since the beginning of the epoch, and it will display the mean loss.
Regarding multiple outputs, the same process happens for each output: Keras computes a loss per output and combines them into a single total loss (by default a weighted sum, with the weights given by loss_weights). Calculation-wise you can pick any loss function mentioned in the document here and check the examples.
Let's say for SparseCategoricalCrossentropy you can check this document.
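To make the multi-output case concrete, here is a small illustrative sketch (the layer and output names are made up): Keras applies one loss per output and optimizes their weighted sum, with the weights taken from loss_weights.

import tensorflow as tf

inputs = tf.keras.Input(shape=(32,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
class_out = tf.keras.layers.Dense(10, activation="softmax", name="class_out")(x)
reg_out = tf.keras.layers.Dense(1, name="reg_out")(x)
model = tf.keras.Model(inputs, [class_out, reg_out])

# Total loss = 1.0 * crossentropy(class_out) + 0.5 * mse(reg_out);
# each per-output loss is also reported separately during training.
model.compile(
    optimizer="adam",
    loss={"class_out": tf.keras.losses.SparseCategoricalCrossentropy(),
          "reg_out": tf.keras.losses.MeanSquaredError()},
    loss_weights={"class_out": 1.0, "reg_out": 0.5},
)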
I have a deep convolutional network with approximately 120 layers. I can train on [7,368,368,1] batches successfully. Also, I can easily give [1,2700,2700,1] as input at the validation step. But I cannot increase the batch size from 7 to anything higher (for example 20); TensorFlow fails with
OOM when allocating tensor with shape[20,10,368,368].
How is this possible, when on validation we have a [1,10,2700,2700] tensor?
You only do a forward pass on the validation dataset, but you do forward and backward passes during training. The backward pass uses far more resources.
During validation we just have a forward pass, so the data simply flows forward and there is no need to keep the outputs of previous layers. But in training you need to keep the outputs of some layers and reuse them in the backward pass (and, on top of that, I think TensorFlow keeps the parameter gradients in each iteration to pass them to the optimizer). As a result, training will use much more RAM.
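A minimal TF 2 illustration of that difference (the model and tensors below are just placeholders): inside a tf.GradientTape the intermediate activations have to be kept alive until tape.gradient runs backpropagation, whereas a plain forward call can discard them layer by layer.

import tensorflow as tf

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(256, activation="relu") for _ in range(10)]
    + [tf.keras.layers.Dense(1)]
)
x = tf.random.normal([32, 256])
y = tf.random.normal([32, 1])

# Validation/inference: each activation can be freed as soon as the next layer used it.
preds = model(x, training=False)

# Training: the tape records every intermediate activation so backprop can reuse it,
# which is where the extra memory goes.
with tf.GradientTape() as tape:
    preds = model(x, training=True)
    loss = tf.reduce_mean(tf.keras.losses.mse(y, preds))
grads = tape.gradient(loss, model.trainable_variables)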
We set model.train() during training, but during my training iterations, I also want to do a forward pass of the training dataset to see what my new loss is. When doing this, should I temporarily set model.eval()?
If your network has layers which act differently during inference (torch.nn.BatchNormNd and torch.nn.DropoutNd are examples; in the second case all neurons are used but scaled by the inverted probability of keeping neurons, see here or here for example) and you want to test how your network currently performs (which is usually called a validation step), then it is mandatory to use module.eval().
It is a common (and very good!) practice to always switch to eval mode when doing inference-like things no matter if this changes your actual model.
EDIT:
You should also use a with torch.no_grad(): block during inference (see the official tutorial code), as gradients are not needed during this phase and it is wasteful to compute them.
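A minimal sketch of that pattern (the model, loader, and loss_fn names are placeholders):

import torch

def evaluate(model, loader, loss_fn, device="cpu"):
    model.eval()                     # BatchNorm/Dropout switch to inference behaviour
    total_loss, n = 0.0, 0
    with torch.no_grad():            # gradients are not needed: saves memory and time
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            preds = model(x)
            total_loss += loss_fn(preds, y).item() * x.size(0)
            n += x.size(0)
    model.train()                    # switch back before resuming training
    return total_loss / n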
I am trying to write some custom layers in Keras. The ultimate goal is that certain parameters (updated according to a fixed formula after each batch of data is optimized over in the training process) be passed to the loss function. I do not believe it is possible to use dynamic loss functions in Keras, but that I should be able to pass these parameters to the loss function using multiple inputs and a custom layer.
I want to know whether it is possible to create a layer in Keras having parameters that are not trainable (and not optimized over at all in the training process), but instead updated according to a fixed formula at the end of each batch optimization in the training process.
The simplest example I can give: instead of optimizing a generic cost function (like cross-entropy), I want to optimize something proportional to the cross entropy (c*cross_entropy). After one batch of data is processed in the training procedure, I want to set, for example, c = 1.2*c, and have this be used as the c value for the next batch of data.
(This should be more or less useless in this case as a positive constant times the loss function shouldn't affect the minima but it's fairly close to what I actually need to do).
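For what it's worth, one way this could be sketched in tf.keras (purely illustrative, not necessarily what the question's author ended up doing) is to keep c as a non-trainable tf.Variable, close over it in the loss function, and update it from a callback at the end of every batch:

import tensorflow as tf

# Non-trainable scalar: read by the loss, never touched by the optimizer.
c = tf.Variable(1.0, trainable=False, dtype=tf.float32)

def scaled_crossentropy(y_true, y_pred):
    # c is captured by reference, so assigning to it changes subsequent batches' losses.
    return c * tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)

class UpdateC(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        c.assign(1.2 * c)  # the fixed per-batch update rule from the example above

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss=scaled_crossentropy)
# model.fit(x_train, y_train, callbacks=[UpdateC()])  # x_train/y_train are placeholders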
I have to say this might be one of the weirdest problems I've ever met.
I was implementing ResNet to perform 10-class classification on CIFAR-10 with TensorFlow. Everything seemed to be fine during the training phase -- the loss decreased steadily, and accuracy on the training set kept increasing to over 90% -- however, the results were totally abnormal during inference.
I have analyzed my code very carefully and ruled out the possibility of making mistakes when feeding the data or saving/loading the model. So the only difference between the training phase and the test phase lies in batch normalization layers.
For BN layers, I used tf.layers.batch_normalization directly and I thought I had paid attention to every pitfall in using tf.layers.batch_normalization.
Specifically, I've included the dependency for train_op as follows,
# Make sure the BN moving mean/variance update ops run before each training step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    self.train_op = optimizer.minimize(self.losses)
Also, for saving and loading the model, I've specified var_list as tf.global_variables(). Moreover, I used training=True for training and training=False for test.
Nevertheless, the accuracy during inference was only around 10%, even when applied to the same data used for training. And when I output the last layer of the network (i.e., the 10-dimension vector input to softmax), I found that the magnitude of each item in the 10-dimension vector during training was always 1e0 or 1e-1, while for inference, it could be 1e4 or even 1e5. The strangest part was that I found the magnitude of the 10-dimension vector during inference correlated with the batch size used in training, i.e., the bigger the batch size, the smaller the magnitude.
Besides, I also found that the magnitudes of moving_mean and moving_variance of the BN layers correlated with the batch size too, but why was this even possible? I thought moving_mean was the mean of the entire training population, and so was moving_variance. So why would they have anything to do with the batch size?
I think there must be something that I don't know about using BN with tensorflow. This problem is really gonna drive me crazy! I've never expected to deal with such a problem in tensorflow, considering how convenient it is to use BN with PyTorch!
The problem has been solved!
I read the source code of TensorFlow. Based on my understanding, the value of momentum in tf.layers.batch_normalization should be roughly 1 - 1/num_of_batches. The default value is 0.99, which is most suitable when there are about 100 batches in the training data. With momentum fixed at 0.99, a bigger batch size means fewer update steps, so the moving averages are updated fewer times and stay further from the true population statistics - which would explain why their magnitudes correlated with the batch size.
I didn't find any documentation mentioning this. Hope this can be helpful to someone who runs into the same problem with BN in TensorFlow!
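To make the momentum point concrete, here is a rough sketch (num_batches is a placeholder for your own batch count; the commented formulas are the exponential-moving-average updates BN applies once per training batch):

import tensorflow as tf

num_batches = 500                        # placeholder: number of training batches
bn_momentum = 1.0 - 1.0 / num_batches    # e.g. 0.998 instead of the default 0.99

# Each training step the BN layer updates its statistics as:
#   moving_mean     = momentum * moving_mean     + (1 - momentum) * batch_mean
#   moving_variance = momentum * moving_variance + (1 - momentum) * batch_variance
# so the momentum needs to match how many update steps you actually run.

x = tf.placeholder(tf.float32, [None, 32, 32, 64])
is_training = tf.placeholder(tf.bool)
bn_out = tf.layers.batch_normalization(x, momentum=bn_momentum, training=is_training)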