TensorFlow He initializer for layers with skip connections - python

In TensorFlow, the He initializer is implemented here. My network has additive skip connections (like residual networks), so the fan_in should change accordingly if unit variance is to be maintained through the network. Does the TensorFlow initializer take care of that, or do I need to write my own initializer?

Usually, in ResNets with skip connections, the He initializer is used, which addresses the issue you mentioned.
However, if you want to make sure the variance is maintained, you can use Glorot initialization, which was developed for exactly that purpose, as explained below.
Glorot and Bengio propose a way to significantly alleviate the unstable gradients problem. They point out that we need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we need the gradients to have equal variance before and after flowing through a layer in the reverse direction (please check out the paper if you are interested in the mathematical details). It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the fan-in and fan-out of the layer), but Glorot and Bengio proposed a good compromise that has proven to work very well in practice: the connection weights of each layer must be initialized randomly, with a variance scaled by the average of the fan-in and fan-out, fan_avg = (fan_in + fan_out) / 2. This is what Glorot (Xavier) initialization does.
Since the initialization parameters depend on the activation function of the layer, the usual recommendations are: Glorot initialization for layers with no activation, tanh, sigmoid/logistic, or softmax; He initialization for ReLU and its variants; and LeCun initialization for SELU.
I hope this answers your question. Happy learning!
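If you do want the initializer to account for the skip connections yourself, one option (a rough sketch, not an official recipe; the n_branches factor is my own illustration) is to build on tf.keras.initializers.VarianceScaling, which is what the built-in He and Glorot initializers use under the hood:

import tensorflow as tf

# The built-in He/Glorot initializers only look at the shape of the weight
# tensor, so they know nothing about additive skip connections.
def he_residual_initializer(n_branches=2):
    # Plain He init uses scale=2.0; dividing by the number of branches that
    # are summed is a crude way to keep the block's output variance in check.
    return tf.keras.initializers.VarianceScaling(
        scale=2.0 / n_branches, mode='fan_in', distribution='truncated_normal')

conv = tf.keras.layers.Conv2D(
    64, 3, padding='same', activation='relu',
    kernel_initializer=he_residual_initializer())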

Related

Keras - Hyper Tuning the initial state of the model

I've written an LSTM model that predicts sequential data.
def get_model(config, num_features, output_size):
    opt = Adam(learning_rate=get_deep(config, 'hp.learning_rate'), beta_1=get_deep(config, 'hp.beta_1'))
    inputs = Input(shape=[None, num_features], dtype=tf.float32, ragged=True)
    layers = LSTM(get_deep(config, 'hp.lstm_neurons'), activation=get_deep(config, 'hp.lstm_activation'))(
        inputs.to_tensor(), mask=tf.sequence_mask(inputs.row_lengths()))
    layers = BatchNormalization()(layers)
    if 'dropout_rate' in config['hp']:
        layers = Dropout(get_deep(config, 'hp.dropout_rate'))(layers)
    for layer in get_deep(config, 'hp.dense_layers'):
        layers = Dense(layer['neurons'], activation=layer['activation'])(layers)
        layers = BatchNormalization()(layers)
        if 'dropout_rate' in layer:
            layers = Dropout(layer['dropout_rate'])(layers)
    layers = Dense(output_size, activation='sigmoid')(layers)
    model = Model(inputs, layers)
    model.compile(loss='mse', optimizer=opt, metrics=['mse'])
    model.summary()
    return model
I've tuned some of the layers' parameters using AWS SageMaker. While validating the model, I ran it with a specific configuration several times. Most of the time the results are similar; however, one run was much better than the others, which led me to think that the initial state of the model is probably crucial in order to get the best performance.
As suggested in this video, weight initialization can provide some performance boost.
I've googled around and found layer weight initializers, but I'm not sure what ranges I should tune.
Update:
As suggested in some of the comments / answers I'm using a fixed seed to "lock" the model results:
# Set `python` built-in pseudo-random generator at a fixed value
random.seed(seed_value)
# Set `numpy` pseudo-random generator at a fixed value
np.random.seed(seed_value)
# Set `tensorflow` pseudo-random generator at a fixed value
tf.random.set_seed(seed_value)
The results are reproducible for each new training run; however, different seeds can produce much better results than others. So how do I find/tune the best seed?
... which led me to think that the initial state of the model is probably crucial in order to get the best performance.
... As suggested in this video, weight initialization can provide some performance boost. I've googled around and found layer weight initializers, but I'm not sure what ranges I should tune.
Firstly, in that video, apart from the state or weight initializer, all the other factors such as learning rate, schedule, optimizer, batch size, loss function, model depth, etc. are things you should play with to find the best set (we will talk about the role of the seed later). Normally, we don't need to tune the default weight or state initializer, as the defaults are currently among the best choices; and, as usual, state initialization is a research problem in its own right.
Secondly, in Keras, the default weight initializer for Convolution, Dense, and RNN-GRU/LSTM layers is glorot_uniform, also known as the Xavier uniform initializer, and the default bias initializer is zeros. If you follow the source code of LSTM (in your case), you will find them. According to the doc:
Draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
Now, you may already have noticed that this initializer is inherited from VarianceScaling; and, like GlorotUniform, others such as GlorotNormal, LecunNormal, LecunUniform, HeNormal, and HeUniform also inherit from it. The supported parameters of VarianceScaling are listed here. For example, technically, the following two are the same.
# in case you want to try various initializers -
# use VarianceScaling by passing the proper parameters,
# i.e. tf.keras.layers.LSTM(..., kernel_initializer=initializer)
# but it is recommended to stick with glorot_uniform (default)
initializer = tf.keras.initializers.VarianceScaling(scale=1.,
                                                    mode='fan_avg', seed=101,
                                                    distribution='uniform')
print(initializer(shape=(2, 2)))

initializer = tf.keras.initializers.GlorotUniform(seed=101)
print(initializer(shape=(2, 2)))
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
In short, you can play with tf.keras.initializers.VarianceScaling (at the bottom of the page). Additionally, you can make your own initializer by defining a callable function or by subclassing the Initializer class. For example:
def conv_kernel_initializer(shape, dtype=None):
    kernel_height, kernel_width, _, out_filters = shape
    fan_out = int(kernel_height * kernel_width * out_filters)
    return tf.random.normal(
        shape, mean=0.0, stddev=np.sqrt(2.0 / fan_out), dtype=dtype)

def dense_kernel_initializer(shape, dtype=None):
    init_range = 1.0 / np.sqrt(shape[1])
    return tf.random.uniform(shape, -init_range, init_range, dtype=dtype)
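And, for completeness, a minimal sketch of the subclassing approach mentioned above (the class ScaledNormal and its stddev value are just an illustration):

import tensorflow as tf

class ScaledNormal(tf.keras.initializers.Initializer):
    # hypothetical example: draws from a normal distribution with a chosen stddev
    def __init__(self, stddev=0.05):
        self.stddev = stddev

    def __call__(self, shape, dtype=None):
        return tf.random.normal(shape, mean=0.0, stddev=self.stddev, dtype=dtype)

    def get_config(self):
        # needed so a model using this initializer can be saved and reloaded
        return {'stddev': self.stddev}

# usage, e.g. tf.keras.layers.Dense(64, kernel_initializer=ScaledNormal(0.02))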
Here is one good article about initializing the weights which you may enjoy reading. But again, it is better to go with the defaults.
Thirdly, for setting different seed values, different sets of hyper-parameters, etc., I'll leave one of my old answers here; the first diagram in it will probably come in handy for your experiments. One approach that I follow is to keep the seed the same (say, for the first 5 experiments), change one other factor at a time, and log the results. After 5 iterations we hopefully have a good set and can proceed from there.
Update
Find/Tune Seed. Before searching for a method to find the best seed, one must understand that the seed is not a hyper-parameter to be tuned alongside other hyper-parameters such as the learning rate, scheduler, optimizer, etc.
Here is one scenario: let's say you split the data randomly into two parts with seed 42, a train set (70%) and a test set (30%), and after training on the train set you evaluate the model on the test set and get a score of 80. Then you change your seed to 101, do the same again, and now get a score of 50. This doesn't mean that picking seed 42 is better; it simply means your model is unstable and most likely won't do well on unseen data. This is actually a well-known issue when someone randomly splits their data set for training and testing. Why does it happen? Because, when you split the data randomly, there can be a mismatch in class distribution between the splits. Please check the following two closely related discussions:
Is random seed a hyper-parameter to tune in training deep neural network?
How to choose the random seed?
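As a minimal sketch of how to avoid that class-distribution mismatch (assuming scikit-learn is available; X and y are illustrative placeholders):

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions identical in both splits,
# so the score should vary much less across different seeds
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)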
I don't think there is a "one size fits all" solution to this issue. The initial weights heavily depend on the kind of problem at hand and on the data we are using to solve it. All we can do is point you towards a good resource where you can see which of the approaches mentioned fits your problem.
The following article is a good resource that not only provides you with a detailed understanding of how and why to initialize weights but also points towards peer reviewed research that can help build an academic understanding.
Maybe you are looking for an exponentially decaying learning rate.
Let me explain:
for example, your first epoch sometimes has a loss of 3000 or 4000, and sometimes just 500.
If you run a model often, you probably recognize a "real barrier" beyond which you no longer say "that's because of the initial state".
You want to get there fast, but you don't want to keep the bad side effects of a high learning rate (e.g. 1e-3); later you want something more like 1e-5.
That is where exponential decay comes in:
create an instance, e.g. myLr = tf.train.exponential_decay(...), and pass it instead of the numerical learning rate parameter to your optimizer,
for example Adam(myLr).
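For reference, a rough sketch of the same idea with the TF2-style API (the schedule numbers below are placeholders, not a recommendation):

import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,   # start fast
    decay_steps=10000,            # how often to decay
    decay_rate=0.9)               # multiply the rate by 0.9 every decay_steps
opt = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# then pass it to your model: model.compile(loss='mse', optimizer=opt)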
Indeed, the initial state of the model is crucial for getting the best performance. Deep learning works by optimizing a non-convex loss function in order to find a good local minimum.
The initial weights define the starting point of the optimization, and training the model moves it from that point towards a local minimum. Some starting weight configurations allow reaching the global minimum, while others do not.
It is sometimes possible to get a better weight initialization with transfer learning, which reuses the weights of a model trained on another task (for example VGG-16 for image classification, or BERT for NLP).
In your case, you should not try to fine-tune the weight initialization, as it is meant to be random. Changing the architecture of your neural network or its hyper-parameters is far more likely to lead to a performance improvement.
Short answer: you can neither efficiently nor effectively tune the seed for a pseudo-random number generator. It is not only infeasible due to the extremely large search space, but also impractical for many other reasons, including the fact that pseudo-random number generator implementations change from time to time so you would need to start over every time that happened.
If, for some reason, you are hell-bent on discovering this for yourself, I recommend using NumPy's default_rng object to be the single source of all pseudo-randomness in your algorithm. Then, based on a single seed, you can produce other seeds deterministically for use with, say, tf.random.set_seed.
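A minimal sketch of that idea (the helper make_seeds is my own illustration):

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(42)          # single source of pseudo-randomness

def make_seeds(n):
    # derive n reproducible child seeds from the master generator
    return rng.integers(0, 2**31 - 1, size=n).tolist()

for seed in make_seeds(5):
    tf.random.set_seed(seed)             # seed TensorFlow for this run
    # ... build and train the model here ...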

Implementing LeNet in Pytorch

Sorry if this question is incredibly basic. I feel like there is a wealth of resources online, but most of them are half-complete or skip over the details that I want to know.
I am trying to implement LeNet with Pytorch for practice.
https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html
How come in this example and many examples online, they define the convolutional layers and the fc layers in init, but the subsampling and activation functions in forward?
What is the purpose of using torch.nn.functional for some functions, and torch.nn for others? For example, you have convolution with torch.nn (https://pytorch.org/docs/stable/nn.html#conv1d) and convolution with torch.nn.functional (https://pytorch.org/docs/stable/nn.functional.html#conv1d). Why choose one or the other?
Let's say I want to try different image sizes, like 28x28 (MNIST). The tutorial recommends I resize MNIST. Is there a way to instead change the values of LeNet? What happens if I don't change them?
What is the purpose of num_flat_features? If you wanted to flatten the features, couldn't you just do x = x.view(-1, 16*5*5)?
How come in this examples and many examples online, they define the
convolutional layers and the fc layers in init, but the subsampling
and activation functions in forward?
Any layer with trainable parameters should be defined in __init__. Subsampling, certain activations, dropout, etc. don't have any trainable parameters, so they can be defined either in __init__ or used directly via the torch.nn.functional interface during forward.
What is the purpose of using torch.nn.functional for some functions, and torch.nn for others?
The torch.nn.functional functions are the actual functions that are used at the heart of the majority of torch.nn layers, they call into C++ compiled code. For example nn.Conv2d subclasses nn.Module, as should any custom layer or model which contains trainable parameters. The class handles registering parameters and encapsulates some other necessary functionality required for training and testing. During forward it actually uses nn.functional.conv2d to apply the convolution operation. As mentioned in the first question, when performing a parameterless operation like ReLU there's effectively no difference between using the nn.ReLU class and the nn.functional.relu function.
The reason they are provided is they give some freedom to do unconventional things. For example in this answer which I wrote the other day, providing a solution without nn.functional.conv2d would have been difficult.
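To make the module-vs-functional equivalence concrete, here is a minimal sketch (my own illustration, not from the tutorial); both activations in forward compute the same thing:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 6, 5)   # has trainable parameters -> __init__
        self.relu = nn.ReLU()            # parameterless, module style

    def forward(self, x):
        a = self.relu(self.conv(x))      # module-style activation
        b = F.relu(self.conv(x))         # functional-style activation
        return a, b                      # a and b are identical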
Let's say I want to try different image sizes, like 28x28 (MNIST). The
tutorial recommends I resize MNIST. Is there a way to instead change
the values of LeNet? What happens if I don't change them?
There's no obvious way to change an existing, trained model to support different image sizes. The size of the input to the linear layer is necessarily fixed, and the number of features at that point in the model is generally determined by the size of the input to the network. If the size of the input differs from the size the model was designed for, then when the data reaches the linear layers it will have the wrong number of elements and cause the program to crash. Some models can handle a range of input sizes, usually by using something like an nn.AdaptiveAvgPool2d layer before the linear layer to ensure the input shape to the linear layer is always the same. Even so, if the input image size is too small then the downsampling and/or pooling operations in the network will cause the feature maps to vanish at some point, causing the program to crash.
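A minimal sketch of the adaptive-pooling trick mentioned above (the channel counts and sizes are arbitrary placeholders):

import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((5, 5)))   # output is always 16 x 5 x 5, whatever the input size

head = nn.Linear(16 * 5 * 5, 10)

x = torch.randn(1, 1, 28, 28)       # any reasonable H x W works here
out = head(torch.flatten(features(x), start_dim=1))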
What is the purpose of num_flat_features? If you wanted to flatten the
features, couldn't you just do x = x.view(-1, 16*5*5)?
When you define the linear layer you need to tell it how large the weight matrix is. A linear layer's weights are simply an unconstrained matrix (and bias vector). The shape of the weight matrix therefore is determined by the input shape, but you don't know the input shape before you run forward so it needs to be provided as an additional parameter (or hard coded) when you initialize the model.
To get to the actual question. Yes, during forward you could simply use
x = x.view(-1, 16*5*5)
Better yet, use
x = torch.flatten(x, start_dim=1)
This tutorial was written before the .flatten function was added to the library. The authors effectively just wrote their own flatten functionality which could be used regardless of the shape of x. This was probably so you had some portable code that could be used in your model without hard coding sizes. From a programming perspective it's nice to generalize such things since it means you wouldn't have to worry about changing those magic numbers if you decide to change part of the model (though this concern didn't appear to extend to the initialization).
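Putting it together, here is a rough sketch of the tutorial's LeNet forward pass with torch.flatten in place of num_flat_features (layer sizes follow the linked tutorial, which expects 32x32 inputs):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)   # hard-coded, as in the tutorial
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, start_dim=1)        # replaces num_flat_features
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)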

Pytorch method for conditional use of an intermediate layer instead of the final CNN layer output, i.e. allow the network to learn to use more layers or fewer

I'm implementing a residual CNN (a modified, smaller version of Xception) in a low-latency environment. I've done a lot of manual tuning to minimize the runtime of my network (reducing the number of filters, removing layers, etc.).
But now I want to try allowing my network to make its classification prediction (the final fcnn layer) on the residual connection after each residual block.
basic logic -
attempt final prediction with residual connection as input
if this fcnn layer predicts a certain class with a probability > a set threshold:
    return fcnn output as if it was the normal final layer
else:
    do the next residual block like normal and try the previous conditional again, unless we are already at the final block
My hope is this will allow my network to learn to solve easier problems with less computation while allowing it to still do the additional layers if it is still unsure of the classification.
So my basic question is: in PyTorch, what's the best way to implement this conditional so that my network can decide at run time whether to do more processing or not?
Currently I've tested returning the intermediate x's after the blocks in the forward function, but I don't know how best to set up the conditional that chooses which x to return.
Also, a side note: I believe I may end up needing another CNN layer between the residual and the fcnn to serve as a function that converts the internal representation used for processing into a representation the fcnn understands for classification.
It has already been done and presented in ICLR 2018.
It appears as if in ResNets the first few bottlenecks learn representations (and therefore cannot be skipped) while the remaining bottlenecks refine the features and therefore can be skipped at a moderate loss of accuracy. (Stanisław Jastrzebski, Devansh Arpit, Nicolas Ballas, Vikas Verma, Tong Che, Yoshua Bengio Residual Connections Encourage Iterative Inference, ICLR 2018).
This idea was taken to the extreme with sharing weights across bottlenecks in Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, Jan Kautz IamNN: Iterative and Adaptive Mobile Neural Network for efficient image classification, ICLR 2018.
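For reference, here is a minimal sketch of the early-exit forward pass described in the question (the blocks, exit heads and threshold are placeholders supplied by the caller, not the architecture from either paper; the confidence check assumes batch size 1 at inference):

import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitNet(nn.Module):
    def __init__(self, blocks, exit_heads, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)          # residual blocks
        self.exit_heads = nn.ModuleList(exit_heads)  # one classifier per block
        self.threshold = threshold

    def forward(self, x):
        logits = None
        for block, head in zip(self.blocks, self.exit_heads):
            x = block(x)
            logits = head(x)
            # at inference time (batch size 1), stop as soon as the classifier is confident
            if not self.training and F.softmax(logits, dim=1).max() >= self.threshold:
                return logits
        return logits   # fall through to the last head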

Default Initialization for Tensorflow LSTM states and weights?

I am using the LSTM cell in Tensorflow.
lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
I was wondering how the weights and states are initialized or rather what the default initializer is for LSTM cells (states and weights) in Tensorflow?
And is there an easy way to manually set an Initializer?
Note: For tf.get_variable() the glorot_uniform_initializer is used as far as I could find out from the documentation.
First of all, there is a difference between the weights of an LSTM (the usual parameter set of an ANN) and its states. The weights are by default initialized by the Glorot, also known as Xavier, initializer (as mentioned in the question).
A different aspect is the cell state and the state of the initial recurrent input to the LSTM. Those are initialized by a matrix usually denoted as initial_state.
Leaving us with the question, how to initialize this initial_state:
Zero State Initialization is good practice if the impact of initialization is low
The default approach to initializing the state of an RNN is to use a zero state. This often works well, particularly for sequence-to-sequence tasks like language modeling where the proportion of outputs that are significantly impacted by the initial state is small.
Zero State Initialization in each batch can lead to overfitting
Zero Initialization for each batch will lead to the following: Losses at the early steps of a sequence-to-sequence model (i.e., those immediately after a state reset) will be larger than those at later steps, because there is less history. Thus, their contribution to the gradient during learning will be relatively higher. But if all state resets are associated with a zero-state, the model can (and will) learn how to compensate for precisely this. As the ratio of state resets to total observations increases, the model parameters will become increasingly tuned to this zero state, which may affect performance on later time steps.
Do we have other options?
One simple solution is to make the initial state noisy (to decrease the loss for the first time step). Look here for details and other ideas
I don't think you can initialize an individual cell, but when you execute the LSTM with tf.nn.static_rnn or tf.nn.dynamic_rnn, you can set the initial_state argument to a tensor containing the LSTM's initial values.
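A rough sketch of that in the TF1-style API used in the question (the learnable initial state in option 2 is just one way to implement the non-zero/noisy initialization suggested above; the sizes are placeholders):

import tensorflow as tf

lstm_units = 128                                    # placeholder size
batch_size = 32                                     # placeholder size
inputs = tf.placeholder(tf.float32, [batch_size, None, 10])

cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)

# option 1: the default zero state
init_state = cell.zero_state(batch_size, tf.float32)

# option 2: a learnable initial state, tiled across the batch
c = tf.get_variable('init_c', [1, lstm_units])
h = tf.get_variable('init_h', [1, lstm_units])
init_state = tf.contrib.rnn.LSTMStateTuple(
    tf.tile(c, [batch_size, 1]), tf.tile(h, [batch_size, 1]))

outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, initial_state=init_state, dtype=tf.float32)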

Does convolution kernel need to be designed in CNN (Convolutional Neural Networks)?

I am new to Convolutional Neural Networks. I am reading some tutorial and testing some sample codes using Keras. To add a convolution layer, basically I just need to specify the number of kernels and the size of the kernel.
My question is what each kernel looks like? Are they generic to all computer vision applications?
My question is what each kernel looks like?
This depends on the parameters you chose for your Convolutional Layer:
It will indeed depend on the kernel_size parameter you mentioned, as it will determine the shape and size of your kernel. Say you pass this parameter as (3,3) (on a Conv2D layer naturally), you will then obtain a 3x3 Kernel Matrix.
It will depend on your kernel_initializer parameter, which determines the way that MxN Kernel Matrix is going to be filled. Its default value is "glorot_uniform", which is explained on its doc page:
Glorot uniform initializer, also called Xavier uniform initializer. It draws samples from a uniform distribution within [-limit, limit] where limit is sqrt(6 / (fan_in + fan_out)) where fan_in is the number of input units in the weight tensor and fan_out is the number of output units in the weight tensor.
This is telling us the specific way it fills that kernel matrix. You may well select any other kernel initializer you desire to fit your needs. You may even build Custom Initializers, also exemplified in that doc page:
from keras import backend as K

def my_init(shape, dtype=None):
    # or whatever you want to customize
    return K.random_normal(shape, dtype=dtype)

model.add(Dense(64, kernel_initializer=my_init))
Furthermore, it will depend on your kernel_regularizer parameter, which defines the regularization function applied to the weights of your kernel. Its default value is None, but you can select from the available ones. You can also define your own custom regularizer in a similar fashion:
def l1_reg(weight_matrix):
    # same here, fit your own needs
    return 0.01 * K.sum(K.abs(weight_matrix))

model.add(Dense(64, input_dim=64,
                kernel_regularizer=l1_reg))
Are they generic to all computer vision applications?
This I think may be a bit broad; however, I would venture to say yes. Keras offers many kernel initializers that were designed specifically for deep learning applications, including the ones most commonly used throughout the literature and in well-known applications.
The good thing is that, as illustrated before, if any of those initializers does not fit your needs you can define your own custom initializer, or enhance it by using regularizers. This enables you to tackle those really specific CV problems you may have.
The actual kernel values are learned during training, which is why you only need to set the number of kernels and their size.
What might be confusing is that the learned kernel values actually mimic things like Gabor and edge detection filters. These are generic to many computer vision applications, but instead of being engineered manually, they are learned from a big classification dataset (like ImageNet).
Also, the kernel values are part of a feature hierarchy that can be used directly as features for a variety of computer vision problems. In that sense, they are also generic.
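To see what learned kernels actually look like, here is a minimal sketch that inspects the first convolutional layer of a pretrained Keras model (VGG16 is just one convenient example; downloading the ImageNet weights requires internet access):

import numpy as np
from tensorflow.keras.applications import VGG16

model = VGG16(weights='imagenet')               # pretrained on ImageNet
first_conv = next(l for l in model.layers if 'conv' in l.name)
kernels, biases = first_conv.get_weights()      # kernels shape: (3, 3, 3, 64)
print(kernels.shape)

# normalize one kernel to [0, 1] so it can be plotted as a tiny RGB image;
# many of these look like edge or color-blob detectors
k0 = kernels[:, :, :, 0]
k0 = (k0 - k0.min()) / (k0.max() - k0.min())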
