I'm trying to train a model using transfer learning, which involves loading and freezing a very large embedding layer. Since training speed is also a concern in this task, I expected freezing layers to boost training speed as well. However, I haven't observed any speed improvement at all.
I freeze the embedding layer using
self.embedding_layer.trainable = False
and I did observe the number of non-trainable parameters increase in the model summary.
As the frozen layers may occur in the middle of a model, and the gradients need to be passed back to the first several trainable layers, I reckon TensorFlow still calculates the gradients of the frozen layers and just skips them during the update stage.
If my guess is correct, is there any way to remove the frozen layer from gradient calculation?
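For reference, here is a minimal sketch of the setup described above (the layer sizes and surrounding layers are made-up assumptions, just for illustration). It freezes an Embedding layer before building the model and confirms via model.summary() that its weights are counted under non-trainable parameters:

import tensorflow as tf

# Hypothetical sizes, just for illustration.
vocab_size, embed_dim = 500_000, 300

inputs = tf.keras.Input(shape=(None,), dtype="int32")
embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_dim)
embedding_layer.trainable = False          # freeze the (large) embedding
x = embedding_layer(inputs)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()   # the embedding weights now appear under "Non-trainable params"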
Related
I'm reading the code of a Keras implementation of YOLOv4 object detector.
It uses a custom Batch Norm layer, like this:
class BatchNormalization(tf.keras.layers.BatchNormalization):
    """
    "Frozen state" and "inference mode" are two separate concepts.
    `layer.trainable = False` is to freeze the layer, so the layer will use
    stored moving `var` and `mean` in the "inference mode", and both `gamma`
    and `beta` will not be updated!
    """
    def call(self, x, training=False):
        if not training:
            training = tf.constant(False)
        training = tf.logical_and(training, self.trainable)
        return super().call(x, training)
Even though I understand how the usual Batch Norm layer works during training and inference, I don't understand the comment nor the need for this modification. What do you think?
Usually (in other types of layers), freezing a layer does not necessarily mean that the layer runs in inference mode.
Inference mode is normally controlled by the training argument.
In the case of the batchnorm layer, when the layer is frozen we want not only that the layer parameters are not modified during the training process; we additionally want the model to use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch.
So although the difference between freezing a layer and running it in inference mode is the usual situation for most layers, in the case of batchnorm there is more similarity between the two modes.
In both, in addition to freezing the parameters of the layer, we want the model to use the general (moving) mean and standard deviation and not be affected by the current batch statistics.
(It's important for the stability of the model).
From BatchNormalization layer guide - Keras
About setting layer.trainable = False on a BatchNormalization layer:
Freezing a layer (this applies to all layer types):
The meaning of setting layer.trainable = False is to freeze the layer, i.e. its internal state will not change during training: its trainable weights will not be updated during fit() or train_on_batch(), and its state updates will not be run.
Additional behavior when freezing a batchnorm layer:
However, in the case of the BatchNormalization layer, setting trainable = False on the layer means that the layer will be subsequently run in inference mode (meaning that it will use the moving mean and the moving variance to normalize the current batch, rather than using the mean and variance of the current batch).
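A small sketch that illustrates this documented behavior, using the built-in tf.keras.layers.BatchNormalization (so it assumes the TF 2.x semantics described above): once trainable is set to False, calling the layer even with training=True leaves the moving statistics untouched.

import numpy as np
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = np.random.randn(8, 4).astype("float32")
bn(x, training=True)                       # build the layer and update the moving stats once

bn.trainable = False
before = bn.moving_mean.numpy().copy()
bn(x, training=True)                       # frozen: the layer runs in inference mode anyway
after = bn.moving_mean.numpy()
print(np.allclose(before, after))          # True: moving_mean was not updated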
TensorFlow's official tutorial says that we should pass base_model(training=False) during training so that the BN layers do not update their mean and variance. My question is: why? Why don't we need to update the mean and variance? I mean, BN has ImageNet's mean and variance, so why is it useful to keep ImageNet's mean and variance instead of updating them on the new data? Even during fine-tuning, the whole model updates its weights, but the BN layers will still have ImageNet's mean and variance.
Edit: I am using this tutorial: https://www.tensorflow.org/tutorials/images/transfer_learning
When a model is trained from initialization, batchnorm should be enabled so that it tunes its mean and variance, as you mentioned. Fine-tuning or transfer learning is a bit different: you already have a model that can do more than you need, and you want to perform a particular specialization of the pre-trained model to do your task / work on your data set. In this case part of the weights are frozen and only some layers closest to the output are changed. Since BN layers are used all around the model, you should freeze them as well. Check again this explanation:
Important note about BatchNormalization layers
Many models contain tf.keras.layers.BatchNormalization layers. This layer is a special case and precautions should be taken in the context of fine-tuning, as shown later in this tutorial.
When you set layer.trainable = False, the BatchNormalization layer will run in inference mode, and will not update its mean and variance statistics.
When you unfreeze a model that contains BatchNormalization layers in order to do fine-tuning, you should keep the BatchNormalization layers in inference mode by passing training = False when calling the base model. Otherwise, the updates applied to the non-trainable weights will destroy what the model has learned.
Source: transfer learning, details regarding freeze
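To make the quoted advice concrete, here is a minimal sketch of the pattern used in that tutorial (the choice of MobileNetV2, the input size, and the classification head are illustrative assumptions): the base model is frozen and, crucially, called with training=False so its BatchNormalization layers stay in inference mode even once layers are later unfrozen for fine-tuning.

import tensorflow as tf

# Assumed base model; any tf.keras.applications model containing BN layers would do.
base_model = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                               include_top=False,
                                               weights="imagenet")
base_model.trainable = False               # freeze all base weights

inputs = tf.keras.Input(shape=(160, 160, 3))
# training=False keeps the BN layers in inference mode, even during fine-tuning
x = base_model(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)

# Later, for fine-tuning, the top of the base model can be unfrozen:
base_model.trainable = True
for layer in base_model.layers[:100]:      # keep the earlier layers frozen (boundary is arbitrary)
    layer.trainable = False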
The following content comes from the Keras tutorial:
This behavior has been introduced in TensorFlow 2.0, in order to enable layer.trainable = False to produce the most commonly expected behavior in the convnet fine-tuning use case.
Why should we freeze the layer when fine-tuning a convolutional neural network? Is it because of some mechanism in tensorflow keras or because of the algorithm of batch normalization? I ran an experiment myself and found that if trainable is not set to false, the model tends to catastrophically forget what has been learned before and returns a very large loss in the first few epochs. What's the reason for that?
During training, varying batch statistics act as a regularization mechanism that can improve ability to generalize. This can help to minimize overfitting when training for a high number of iterations. Indeed, using a very large batch size can harm generalization as there is less variation in batch statistics, decreasing regularization.
When fine-tuning on a new dataset, batch statistics are likely to be very different if the fine-tuning examples have different characteristics from the examples in the original training dataset. Therefore, if batch normalization is not frozen, the network will learn new batch normalization parameters (gamma and beta in the batch normalization paper) that are different from what the other network parameters have been optimised for during the original training. Relearning all the other network parameters is often undesirable during fine-tuning, either due to the required training time or the small size of the fine-tuning dataset. Freezing batch normalization avoids this issue.
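A quick way to see which BN parameters are involved (a sketch, nothing model-specific assumed): gamma and beta are the trainable weights, while the moving mean and variance are non-trainable state; setting trainable = False stops gamma/beta updates and, since TF 2.0, also stops the moving statistics from being updated.

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
bn.build((None, 8))                                 # create the layer's variables

print([w.name for w in bn.trainable_weights])       # gamma, beta
print([w.name for w in bn.non_trainable_weights])   # moving_mean, moving_variance

bn.trainable = False
print(len(bn.trainable_weights))                    # 0: gamma and beta are now frozen too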
I have designed a convolutional neural network (tf.keras) which has a few parallel convolutional units with different kernel sizes. The output of each of those convolution layers is fed into another set of convolutional units, which are also in parallel. Then all the outputs are concatenated. Next, flattening is done. After that I added a fully connected layer and connected it to the final softmax layer for multi-class classification. I trained it and had good results on the validation set.
However, when I removed the fully connected layer, the accuracy was higher than before.
Could someone please explain how this happens? It would be very helpful.
Thank you for your valuable time.
Parameters as follows.
When you remove a layer, your model will have less chance of over-fitting the training set. Consequently, by making the network shallower, you make your model more robust to unknown examples and the validation accuracy increases.
Since your training accuracy also increased, it can be an indication of one of the following:
Exploding or vanishing gradients. You can try solving this problem using careful weight initialization, proper regularization, adding shortcuts, or gradient clipping.
You are not training for enough epochs to learn a deeper network. You can try a few more epochs.
You do not have enough data to train a deeper network.
Eliminating the dense layer reduces the tendency to overfit. Therefore your validation loss should improve as long as your model's training accuracy remains high. Alternatively, you can add a dropout layer. You can also reduce the tendency to overfit by using regularizers. Documentation for that is here.
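For instance, a sketch of the two alternatives mentioned above (the layer sizes, dropout rate, and regularization factor are arbitrary assumptions): keeping the dense layer but adding dropout and an L2 kernel regularizer to it.

import tensorflow as tf

head = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                             # randomly drop activations during training
    tf.keras.layers.Dense(
        128,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),   # penalize large weights
    tf.keras.layers.Dense(10, activation="softmax"),          # multi-class output
])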
I'm trying to implement this paper in Keras, with a tensorflow backend.
In my understanding, they progressively grow a GAN, fading in additional blocks of layers as the model is trained. New layers are faded in linearly over iterations.
I'm not sure how to introduce the "fading in" they used.
To do this Keras, I'm thinking I'll probably need a Lambda layer -- but that's about as much as I know.
Any suggestions?
Thanks!
I think you should use Keras Functional API. That way you can rearrange the inputs to the layer as well as the outputs and interconnect them the way you want to while the layers keep the weights they learned. You will have to have a few nested if-statements, but I think it should work.
Or you could have all the models prepared (functions that return the model architecture and can set layer weights) and just populate the layers in the new model with weights from the corresponding layers of the old model.
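For the fade-in itself, one common approach (a sketch under my own assumptions, not the paper's reference code) is a small custom layer that outputs a weighted sum of the old block's upsampled output and the new block's output, with the blending weight alpha ramped from 0 to 1 over training iterations:

import tensorflow as tf

class WeightedSum(tf.keras.layers.Layer):
    """Blends two tensors of identical shape: (1 - alpha) * old + alpha * new."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Non-trainable scalar; the training loop increases it from 0.0 to 1.0.
        self.alpha = tf.Variable(0.0, trainable=False, dtype=tf.float32)

    def call(self, inputs):
        old, new = inputs
        return (1.0 - self.alpha) * old + self.alpha * new

# Usage sketch in the functional API (shapes and blocks are illustrative stand-ins):
prev = tf.keras.Input(shape=(8, 8, 64))
x_old = tf.keras.layers.UpSampling2D()(prev)                  # old, smaller-resolution path
x_new = tf.keras.layers.Conv2D(64, 3, padding="same")(
    tf.keras.layers.UpSampling2D()(prev))                     # stand-in for the newly added block
fade = WeightedSum()
outputs = fade([x_old, x_new])
model = tf.keras.Model(prev, outputs)

# During training, ramp alpha from 0 to 1, e.g. after each batch:
# fade.alpha.assign(min(1.0, current_step / fade_in_steps))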