Default Initialization for Tensorflow LSTM states and weights?


I am using the LSTM cell in Tensorflow.
lstm_cell = tf.contrib.rnn.BasicLSTMCell(lstm_units)
I was wondering how the weights and states are initialized or rather what the default initializer is for LSTM cells (states and weights) in Tensorflow?
And is there an easy way to manually set an Initializer?
Note: For tf.get_variable() the glorot_uniform_initializer is used as far as I could find out from the documentation.

First of all, there is a difference between the weights of an LSTM (the usual parameter set of an ANN) and its states. The weights are by default initialized by the Glorot initializer, also known as the Xavier initializer (as mentioned in the question).
A different aspect is the cell state and the state of the initial recurrent input to the LSTM. Those are initialized by a matrix usually denoted as initial_state.
This leaves us with the question of how to initialize this initial_state:
Zero State Initialization is good practice if the impact of initialization is low
The default approach to initializing the state of an RNN is to use a zero state. This often works well, particularly for sequence-to-sequence tasks like language modeling where the proportion of outputs that are significantly impacted by the initial state is small.
Zero State Initialization in each batch can lead to overfitting
Zero Initialization for each batch will lead to the following: Losses at the early steps of a sequence-to-sequence model (i.e., those immediately after a state reset) will be larger than those at later steps, because there is less history. Thus, their contribution to the gradient during learning will be relatively higher. But if all state resets are associated with a zero-state, the model can (and will) learn how to compensate for precisely this. As the ratio of state resets to total observations increases, the model parameters will become increasingly tuned to this zero state, which may affect performance on later time steps.
Do we have other options?
One simple solution is to make the initial state noisy (to decrease the loss for the first time step). Look here for details and other ideas
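A minimal NumPy sketch of the two options discussed above, a zero initial state and a noisy one. The helper names, the shapes and the noise_std value are illustrative assumptions and not part of the TensorFlow API; the resulting arrays would be what you pass as the initial state of the RNN.

```python
import numpy as np

def zero_state(batch_size, num_units):
    """Return zero-filled (cell, hidden) states, the default choice."""
    c = np.zeros((batch_size, num_units), dtype=np.float32)
    h = np.zeros((batch_size, num_units), dtype=np.float32)
    return c, h

def noisy_state(batch_size, num_units, noise_std=0.1, seed=None):
    """Return zero-mean Gaussian (cell, hidden) states.

    noise_std is a hypothetical hyperparameter chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    shape = (batch_size, num_units)
    c = rng.normal(0.0, noise_std, size=shape).astype(np.float32)
    h = rng.normal(0.0, noise_std, size=shape).astype(np.float32)
    return c, h

c0, h0 = zero_state(batch_size=4, num_units=8)
c1, h1 = noisy_state(batch_size=4, num_units=8, noise_std=0.1, seed=0)
print(c0.shape, c1.shape)
```

Drawing fresh noise at every state reset means the model never sees exactly the same starting state twice, which is the point of the suggestion.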

I don't think you can initialize an individual cell, but when you execute the LSTM with tf.nn.static_rnn or tf.nn.dynamic_rnn, you can set the initial_state argument to a tensor containing the LSTM's initial values.

Related

Do LSTMs remember previous windows or is the hidden state reset?

I am training an LSTM to forecast the next value of a timeseries. Let's say I have training data with the given shape (2345, 95) and a total of 15 files with this data. This means that I have 2345 windows with 50% overlap between them (the timeseries was divided into windows). Each window has 95 timesteps. If I use the following model:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model
input1 = Input(shape=(95, 1))
lstm1 = LSTM(units=100, return_sequences=False,
             activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
model = Model(inputs=input1, outputs=outputs)
model.compile(loss=keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
I am feeding this data using a generator where it passes a whole file each time, therefore one epoch will have 15 steps. Now my question is, in a given epoch, for a given step, does the LSTM remember the previous window that it saw or is the memory of the LSTM reset after seeing each window? If it remembers the previous windows, then is the memory reset only at the end of an epoch?
I have seen similar questions like this TensorFlow: Remember LSTM state for next batch (stateful LSTM) or https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm but I either did not quite understand the explanation or I was unsure if what I wanted was explained. I'm looking for more of a technical explanation as to where in the LSTM architecture is the whole memory/hidden state reset.
EDIT:
So from my understanding there are two concepts we can call "memory"
here. The weights that are updated through BPTT and the hidden state
of the LSTM cell. For a given window of timesteps the LSTM can
remember what the previous timestep was, this is what the hidden
state is for I think. Now the weight update does not directly
reflect memory if I'm understanding this correctly.
The size of the hidden state, in other words how much the LSTM
remembers is determined by the batch size, which in this case is one
whole file, but other question/answers
(https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm and https://stackoverflow.com/a/50235563/13469674) state that if we
have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not
know that 4 comes after 3 because they are in different windows,
even though they belong to the same batch. So I'm still unsure how
exactly memory is maintained in the LSTM
It makes some sense that the hidden state is reset between windows when we look at the LSTM cell diagram. But then the weights are only updated after each step, so where does the hidden state come into play?

What you are describing is called "Back Propagation Through Time"; you can search for tutorials that describe the process.
Your concern is justified in one respect and unjustified in another respect.
The LSTM is capable of learning across multiple training iterations (e.g. multiple 15-step intervals). This is because the LSTM state is passed forward from one iteration to the next, which feeds information forward across multiple training iterations.
Your concern is justified in that the model's weights are only updated with respect to the 15 steps (plus any batch size you have). As long as 15 steps is long enough for the model to catch valuable patterns, it will generally learn a good set of weights that generalize well beyond 15 steps. A good example of this is the character-level Shakespeare model described in Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks".
In summary, the model is learning to create a good hidden state for the next step averaged over sets of 15 steps as you have defined. It is common that an LSTM will produce a good generalized solution by looking at data in these limited segments. Akin to batch training, but sequentially over time.
I might note that 100 is a more typical upper limit for the number of steps in an LSTM. At ~100 steps you start to see a vanishing gradient problem in which the earlier steps contribute nearly nothing to the gradient.
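Note that it is important to ensure you are passing the LSTM state forward from training step to training step over the course of an episode (any contiguous sequence). In Keras this does not happen by default: unless the layer is created with stateful=True, the state is reset to zeros for every sample in a batch. If this step were missed, the model would certainly suffer.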
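To make the difference between resetting and carrying the state concrete, here is a toy sketch. It uses a plain tanh RNN in NumPy rather than an LSTM, with made-up random weights, and the windows [1,2,3] and [4,5,6] from the question scaled by 0.1 so that tanh does not saturate; all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.normal(size=(1, 4)) * 0.5   # input-to-hidden weights (toy values)
W_h = rng.normal(size=(4, 4)) * 0.5   # hidden-to-hidden weights (toy values)

def run_window(window, h):
    """Run a plain tanh RNN over one window, starting from state h."""
    for x in window:
        h = np.tanh(np.array([[x]]) @ W_x + h @ W_h)
    return h

window_a = [0.1, 0.2, 0.3]
window_b = [0.4, 0.5, 0.6]
h_zero = np.zeros((1, 4))

# State reset between windows: window_b starts from zeros.
h_reset = run_window(window_b, h_zero)

# State carried forward: window_b starts from the final state of window_a.
h_carried = run_window(window_b, run_window(window_a, h_zero))

print(np.allclose(h_reset, h_carried))
```

With a reset, the result for the second window is the same no matter what came before it; with a carried state, the first window leaves a trace in the final hidden state.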

Keras - Hyper Tuning the initial state of the model

I've written an LSTM model that predicts the sequential data.
def get_model(config, num_features, output_size):
    opt = Adam(learning_rate=get_deep(config, 'hp.learning_rate'), beta_1=get_deep(config, 'hp.beta_1'))
    inputs = Input(shape=[None, num_features], dtype=tf.float32, ragged=True)
    layers = LSTM(get_deep(config, 'hp.lstm_neurons'), activation=get_deep(config, 'hp.lstm_activation'))(
        inputs.to_tensor(), mask=tf.sequence_mask(inputs.row_lengths()))
    layers = BatchNormalization()(layers)
    if 'dropout_rate' in config['hp']:
        layers = Dropout(get_deep(config, 'hp.dropout_rate'))(layers)
    for layer in get_deep(config, 'hp.dense_layers'):
        layers = Dense(layer['neurons'], activation=layer['activation'])(layers)
        layers = BatchNormalization()(layers)
        if 'dropout_rate' in layer:
            layers = Dropout(layer['dropout_rate'])(layers)
    layers = Dense(output_size, activation='sigmoid')(layers)
    model = Model(inputs, layers)
    model.compile(loss='mse', optimizer=opt, metrics=['mse'])
    model.summary()
    return model
I've tuned some of the layer's params using AWS SageMaker. While validating the model I've run a model with a specific configuration several times. Most of the time the results are similar, however, one run was much better than others, which led me to think that the initial state of the model is probably crucial in order to get the best performance.
As suggested in this video, weight initialization can provide some performance boost.
I've googled around and found layer weight initializers, but I'm not sure what ranges should I tune.
Update:
As suggested in some of the comments / answers I'm using a fixed seed to "lock" the model results:
# Set `python` built-in pseudo-random generator at a fixed value
random.seed(seed_value)
# Set `numpy` pseudo-random generator at a fixed value
np.random.seed(seed_value)
# Set `tensorflow` pseudo-random generator at a fixed value
tf.random.set_seed(seed_value)
The results replicate for each new train, however, different seeds can produce much better results than others. So how do I find/tune the best seed?

... which led me to think that the initial state of the model is probably crucial in order to get the best performance.
..... As suggested in this video, weight initialization can provide some performance boost. I've googled around and found layer weight initializers, but I'm not sure what ranges should I tune.
Firstly, in that video, apart from the state or weight initializer, all the other factors such as learning rate, schedule, optimizer, batch size, loss function, model depth, etc. are things you should experiment with to find the best set (we will talk about the role of the seed later). Normally, we don't need to tune the default weight or state initializer, as the defaults work well in most cases; how best to initialize the state remains a research problem.
Secondly, in Keras, the default weight initializer for Convolution, Dense and RNN-GRU/LSTM layers is glorot_uniform, also known as the Xavier uniform initializer, and the default bias initializer is zeros. (For the LSTM this applies to the input kernel; the recurrent kernel defaults to orthogonal, and with unit_forget_bias=True the forget-gate bias starts at one.) If you follow the source code of LSTM (in your case), you will find them. According to the documentation, glorot_uniform:
Draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
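The formula quoted above can be written out in a few lines of NumPy. This is only a sketch of the arithmetic, with a hypothetical helper name; it does not reproduce the exact random values that TensorFlow draws for a given seed.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, seed=None):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    return weights, limit

weights, limit = glorot_uniform(fan_in=2, fan_out=2, seed=0)
print(round(float(limit), 4))  # sqrt(6 / 4) = 1.2247
```

For a (2, 2) weight tensor the limit is about 1.2247, so every sampled weight lies within that range.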
Now, you may already notice that this initializer inherits from VarianceScaling; like GlorotUniform, others such as GlorotNormal, LecunNormal, LecunUniform, HeNormal and HeUniform also inherit from it. The VarianceScaling documentation lists the supported parameters. For example, technically, the following two are the same.
import tensorflow as tf
# In case you want to try various initializers,
# use VarianceScaling and pass the proper parameters,
# i.e. tf.keras.layers.LSTM(..., kernel_initializer=initializer),
# but it is recommended to stick with glorot_uniform (the default).
initializer = tf.keras.initializers.VarianceScaling(scale=1.,
                                                    mode='fan_avg', seed=101,
                                                    distribution='uniform')
print(initializer(shape=(2, 2)))
initializer = tf.keras.initializers.GlorotUniform(seed=101)
print(initializer(shape=(2, 2)))
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
In short, you can play with tf.keras.initializers.VarianceScaling (at the bottom of the page). Additionally, you can make your own initializer by defining a callable function or by subclassing the Initializer class. For example:
import numpy as np
import tensorflow as tf
def conv_kernel_initializer(shape, dtype=None):
    # He-style normal initializer based on the fan-out of a conv kernel.
    kernel_height, kernel_width, _, out_filters = shape
    fan_out = int(kernel_height * kernel_width * out_filters)
    return tf.random.normal(
        shape, mean=0.0, stddev=np.sqrt(2.0 / fan_out), dtype=dtype)
def dense_kernel_initializer(shape, dtype=None):
    # Uniform initializer scaled by the number of output units.
    init_range = 1.0 / np.sqrt(shape[1])
    return tf.random.uniform(shape, -init_range, init_range, dtype=dtype)
Here is one good article about initializing the weights that you may enjoy reading. But again, it is better to go with the default ones.
Thirdly, for setting different seed values and different sets of hyper-parameters, etc., I'd rather leave one of my old answers here; the first diagram in particular will probably come in handy for your experiments. One approach that I follow is to keep my seed the same (let's say for the first 5 experiments), change another factor and log the results. After 5 iterations, hopefully, we would have a best set and could proceed from there.
Update
Find/Tune Seed. Before searching for a method to find the best seed, one must understand that the seed is not a hyper-parameter that needs to be tuned along with other hyperparameters such as learning rate, scheduler, optimizer, etc.
Here is one scenario. Let's say you split the data randomly into two parts with seed 42: a train set (70%) and a test set (30%). After training on the train set, you evaluate your model on the test set and get a score of 80. Then you change your seed to 101 and do the same again, but now you get a score of 50. This doesn't mean that picking seed 42 is better; it simply means your model is unstable and most likely won't do well on unseen data. This is actually a well-known issue when someone randomly splits their data set for training and testing. Why does it happen? Because when you split the data randomly, it's possible that there is a mismatch in class distribution. Please check the following two closely related discussions on this:
Is random seed a hyper-parameter to tune in training deep neural network?
How to choose the random seed?
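In the spirit of the scenario above, one common way to judge stability is to repeat the same experiment with several seeds and look at the mean and spread of the score, rather than to pick the best seed. The sketch below assumes a hypothetical train_and_score function; here it only returns a fake seed-dependent number so that the loop is runnable.

```python
import random
import statistics

def train_and_score(seed):
    """Hypothetical stand-in for a full train/evaluate run.

    It returns a fake score that depends on the seed, only so that
    the loop below is runnable.
    """
    rng = random.Random(seed)
    return 70.0 + rng.uniform(-10.0, 10.0)

seeds = [0, 1, 2, 3, 4]
scores = [train_and_score(s) for s in seeds]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)
print(f"mean={mean_score:.2f} std={std_score:.2f} over {len(seeds)} seeds")
```

A large standard deviation relative to the mean is the sign of the instability described above.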

I don't think there is a "one size fits all" solution to this issue. The initial weights heavily depend on the kind of problem at hand and the data that we are using to solve that problem. All we can do is point you towards a good resource from which you can try to see which of the approaches mentioned fits your problem.
The following article is a good resource that not only provides you with a detailed understanding of how and why to initialize weights but also points towards peer reviewed research that can help build an academic understanding.

Maybe what you are looking for is an exponentially decaying learning rate.
Let me explain.
For example, your first epoch sometimes has a loss of 3000 or 4000, and sometimes just 500.
If you run a model often, you will probably recognize a "real barrier", beyond which you no longer say "that's because of the initial state".
You want to get there fast, but without keeping the bad side effects of a high learning rate (e.g. 1E-3); later on you would rather have 1E-5.
That is where exponential decay comes in.
Create the schedule with myLr = tf.train.exponential_decay(...) (tf.keras.optimizers.schedules.ExponentialDecay in TF 2.x) and pass it to your optimizer instead of the numerical learning rate parameter,
for example Adam(myLr).
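For reference, the non-staircase schedule computes learning_rate * decay_rate ** (step / decay_steps). A plain-Python sketch of that formula, with illustrative values that go from 1E-3 to 1E-5 over 2000 steps:

```python
def exponential_decay(initial_lr, step, decay_steps, decay_rate):
    """Return initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Illustrative values: start at 1e-3 and shrink by 10x every 1000 steps.
for step in (0, 1000, 2000):
    lr = exponential_decay(1e-3, step, decay_steps=1000, decay_rate=0.1)
    print(step, lr)
```

The decay_steps and decay_rate values here are assumptions; in practice they depend on how many steps it takes to reach the "barrier" mentioned above.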

Indeed, the initial state of the model is crucial in order to get the best performance. Deep learning works by optimizing a non-convex loss function in order to find a good local minimum.
The initial weights define the starting location of the optimization, and training moves the model from that starting point to a local minimum. Some starting weight configurations allow the optimizer to reach the global minimum.
It is sometimes possible to have better weights initialization with Transfer Learning which is reusing the weights of a trained model on a downstream task. (For example VGG-16 in image classification, or NLP with BERT).
In your case, you should not try to fine-tune the weight initialization, as it is meant to be random. Changing the architecture of your neural network or its hyperparameters is more likely to lead to a performance improvement.

Short answer: you can neither efficiently nor effectively tune the seed for a pseudo-random number generator. It is not only infeasible due to the extremely large search space, but also impractical for many other reasons, including the fact that pseudo-random number generator implementations change from time to time so you would need to start over every time that happened.
If, for some reason, you are hell-bent on discovering this for yourself, I recommend using NumPy's default_rng object to be the single source of all pseudo-randomness in your algorithm. Then, based on a single seed, you can produce other seeds deterministically for use with, say, tf.random.set_seed.
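A minimal sketch of that idea, assuming a hypothetical helper derive_seeds: one master seed feeds NumPy's default_rng, which then produces the seeds for the other libraries deterministically.

```python
import numpy as np

def derive_seeds(master_seed, n):
    """Deterministically derive n non-negative integer seeds from one master seed."""
    rng = np.random.default_rng(master_seed)
    return [int(s) for s in rng.integers(0, 2**31 - 1, size=n)]

python_seed, numpy_seed, tf_seed = derive_seeds(master_seed=42, n=3)
print(python_seed, numpy_seed, tf_seed)
```

The derived values could then be handed to random.seed, np.random.seed and tf.random.set_seed respectively, so that a whole run is reproducible from the single master seed.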

Should I set model.eval for getting the current training loss in Pytorch?

We set model.train() during training, but during my training iterations, I also want to do a forward pass of the training dataset to see what my new loss is. When doing this, should I temporarily set model.eval()?

If your network has layers which act differently during inference (torch.nn.BatchNormNd and torch.nn.DropoutNd are examples; in the latter case all neurons are used in eval mode, and the scaling by the inverse of the keep probability is applied during training instead, see here or here for example) and you want to test how your network currently performs (which is usually called a validation step), then it is mandatory to use module.eval().
It is a common (and very good!) practice to always switch to eval mode when doing inference-like things no matter if this changes your actual model.
EDIT:
You should also use with torch.no_grad(): block during inference, see official tutorial code as gradients are not needed during this phase and it's wasteful to compute them.
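To see why the mode matters for a layer such as dropout, here is a NumPy sketch of inverted dropout, the scheme torch.nn.Dropout follows: units are dropped and the survivors rescaled while training, and the layer is an identity in evaluation mode. The function name and values are assumptions for illustration, not PyTorch code.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: drop with probability p and rescale while training."""
    if not training:
        return x  # evaluation mode: identity, every unit is used
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)       # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
x = np.ones(10_000)

train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)

print(sorted(set(train_out.tolist())))     # [0.0, 2.0]
print(bool(np.array_equal(eval_out, x)))   # True
```

A loss computed in training mode is therefore noisier than, and generally different from, the loss the same weights give in evaluation mode.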

A2C algorithm in tf.keras: actor loss function

I'm learning about Actor-Critic Reinforcement Learning techniques, in particular the A2C algorithm.
I've found a good description of a simple version of the algorithm (i.e. without experience replay, batching or other tricks) with implementation here: https://link.medium.com/yi55uKWwV2. The complete code from that article is available on GitHub.
I think I understand ok-ish what's happening here, but to make sure I actually do, I'm trying to reimplement it from scratch using higher-level tf.keras APIs. Where I'm getting stuck is how do I implement training loop correctly, and how do I formulate actor's loss function.
What is the correct way to pass action and advantage into the loss function?
Actor's loss function involves computing the probability of the action taken given a normal distribution. How can I ensure that mu and sigma of the normal distribution during loss function computation actually match the ones used during prediction?
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems to be wrong, but I'm not sure how.
The code I have so far: https://gist.github.com/nevkontakte/beb59f29e0a8152d99003852887e7de7
Edit: I suppose some of my confusion stems from a poor understanding of magic behind gradient computation in Keras/TensorFlow, so any pointers there would be appreciated.

First, credit where credit is due: information provided by ralf htp and Simon was instrumental in helping me to figure out the right answers eventually.
Before I go into detailed answers to my own questions, here's the original code I was trying to rewrite in tf.keras terms, and here's my result.
What is the correct way to pass action and advantage into a loss function in Keras?
There is a difference between what raw TF optimizer considers a loss function and what Keras does. When using an optimizer directly, it simply expects a tensor (lazy or eager depending on your configuration), which will be evaluated under tf.GradientTape() to compute the gradient and update weights.
Example from https://medium.com/#asteinbach/actor-critic-using-deep-rl-continuous-mountain-car-in-tensorflow-4c1fb2110f7c:
# Below norm_dist is the output tensor of the neural network we are training.
loss_actor = -tfc.log(norm_dist.prob(action_placeholder) + 1e-5) * delta_placeholder
training_op_actor = tfc.train.AdamOptimizer(
    lr_actor, name='actor_optimizer').minimize(loss_actor)
# Later, in the training loop...
_, loss_actor_val = sess.run([training_op_actor, loss_actor],
                             feed_dict={action_placeholder: np.squeeze(action),
                                        state_placeholder: scale_state(state),
                                        delta_placeholder: td_error})
In this example, sess.run() computes the whole graph, including making an inference, capturing the gradient and adjusting the weights. So to pass whatever values you need into the loss function/gradient computation, you just feed the necessary values into the computation graph.
Keras is a bit more formal in what loss function should look like:
loss: String (name of objective function), objective function or tf.keras.losses.Loss instance. See tf.keras.losses. An objective function is any callable with the signature scalar_loss = fn(y_true, y_pred). If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses.
Keras will do the inference (forward pass) for you and pass the output into the loss function. The loss function is supposed to do some extra computation on the predicted value and y_true label, and return the result. This whole process will be tracked for the purpose of gradient computation.
Although it is very convenient for traditional training, this is a bit restrictive when we want to pass some extra data in, like TD error. It is possible to work around that and shove all the extra data into y_true, and pull it apart inside the loss function (I found this trick somewhere on the web, but unfortunately lost the link to source).
Here's how I rewrote the above in the end:
def loss(y_true, y_pred):
    action_true = y_true[:, :n_outputs]
    advantage = y_true[:, n_outputs:]
    return -tfc.log(y_pred.prob(action_true) + 1e-5) * advantage
# Below, in the training loop...
# A trick to pass TD error *and* actual action to the loss function: join them into a tensor and split it apart
# inside the loss function.
annotated_action = tf.concat([action, td_error], axis=1)
actor_model.train_on_batch([scale_state(state)], [annotated_action])
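The packing trick itself can be checked in isolation with NumPy. The sketch below uses made-up values and assumes a one-dimensional action (n_outputs = 1); it only shows that the slicing done in the loss function recovers exactly what was concatenated.

```python
import numpy as np

n_outputs = 1  # assumed action dimensionality for this illustration

action = np.array([[0.3], [-0.7]], dtype=np.float32)    # shape (batch, n_outputs)
td_error = np.array([[1.5], [-0.2]], dtype=np.float32)  # shape (batch, 1)

# Pack: one "label" tensor carries both pieces of data.
annotated_action = np.concatenate([action, td_error], axis=1)

# Unpack: the same slicing the loss function performs on y_true.
action_true = annotated_action[:, :n_outputs]
advantage = annotated_action[:, n_outputs:]

print(annotated_action.shape)  # (2, 2)
```

Both inputs need to be rank-2 with the same batch dimension for the concatenation along axis 1 to work.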
Actor's loss function involves computing the probability of the action taken given a normal distribution. How can I ensure that mu and sigma of the normal distribution during loss function computation actually match the ones used during prediction?
When I asked this question, I didn't understand well enough how the TF compute graph works. So the answer is simple: every time sess.run() is invoked, it must compute the whole graph from scratch. Parameters of the distribution would be the same (or similar) as long as graph inputs (e.g. observed state) and NN weights are the same (or similar).
The way it is in the original, the actor's loss function doesn't care about y_pred; it only cares about the action that was chosen while interacting with the environment. This seems to be wrong, but I'm not sure how.
What's wrong is the assumption "the actor's loss function doesn't care about y_pred" :) Actor's loss function involves norm_dist (which is action probability distribution), which is effectively an analog of y_pred in this context.

As far as I understand A2C, it is the machine learning implementation of activator-inhibitor systems, which are also called two-component reaction-diffusion systems (https://en.wikipedia.org/wiki/Reaction%E2%80%93diffusion_system). Activator-inhibitor models are important in many fields of science as they describe pattern formation, e.g. the Turing mechanism (simply search the net for "activator-inhibitor model" and you will find a vast amount of information; a very common application is predator-prey models). Also compare the graphic
source of graphic : https://www.researchgate.net/figure/Activator-Inhibitor-System_fig1_23671770/
with the explanatory graphic of the A2C algorithm in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69
Activator-inhibitor models are closely linked to the theory of nonlinear dynamical systems (or 'chaos theory'). This also becomes obvious when comparing the bifurcation tree-like structure in https://medium.com/#asteinbach/rl-introduction-simple-actor-critic-for-continuous-actions-4e22afb712 with the bifurcation tree of a nonlinear dynamical system such as the logistic map (https://en.wikipedia.org/wiki/Logistic_map; the logistic map is one of the simplest predator-prey or activator-inhibitor models). Another similarity is the sensitivity to initial conditions in A2C models, which is described as
This introduces in inherent high variability in log probabilities (log of the policy distribution) and cumulative reward values, because each trajectories during training can deviate from each other at great degrees.
in https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f, and the curse of dimensionality also appears in chaos theory, e.g. in attractor reconstruction.
From the viewpoint of systems theory, the A2C algorithm tries to adapt the initial value (start state) in such a way that it ends up at a given endpoint when the growth rate of a dynamical system such as the logistic map is increased (the r-value is increased and the initial value (start state) is constantly re-adapted to choose the correct bifurcations (actions) in the bifurcation tree).
So A2C tries to numerically solve a chaos theory problem, namely finding the initial value for a given outcome of a nonlinear dynamical system in its chaotic region. Analytically, this problem is in most cases not solvable.
The actions are the bifurcation points in the bifurcation tree; the states are the future bifurcations.
Both, actions and states, are modeled by two coupled neural networks and this coupling of two neural nets is the great innovation of A2C algorithms.
In https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 is well documented keras code for implementing A2C, so you have a possible implementation there.
The loss function here is defined as the temporal difference (TD) function, that is, the exact difference between the state at the actual bifurcation point and the state at the estimated future one. However, this mathematically exact definition is prone to stochastic error (or noise), so the stochastic error is included in the definition of exact, because in the end machine learning is based on stochastic systems or error calculus, meaning systems that are composed of a deterministic and a stochastic component. To bring this error to zero, stochastic gradient descent is used. In Keras this is simply implemented by choosing optimizer='sgd'.
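For reference, the one-step TD error as it is usually defined in actor-critic methods is r + gamma * V(s') - V(s). The sketch below is a plain-Python illustration with made-up numbers, not code taken from the linked article; the function name and the done handling are assumptions.

```python
def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD error: r + gamma * V(s') - V(s)."""
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

print(td_error(reward=1.0, value_s=0.5, value_next=1.0, gamma=0.9))
print(td_error(reward=1.0, value_s=0.5, value_next=1.0, gamma=0.9, done=True))
```

Here value_s and value_next stand for the critic's estimates for the current and the next state.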
This interaction of actual and future step is implemented as memory on https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69 in the function remember and this function also links the actor and the critic network (or activator and inhibitor network). This general structure of trial (action), call predict (TD function ), remember and train (i.e. stochastic gradient descent) is fundamental to all reinforcement learning algorithms, and is linked to the structure actual state, action, reward, new state :
The prediction code is also very much the same as it was in previous reinforcement learning algorithms. That is, we just have to iterate through the trial and call predict, remember, and train on the agent:
In that implementation, your first question is solved by applying remember on the critic and then training the critic with these values (this is in the main function). Training always evaluates the loss function, so action and reward are passed to the loss function by remember in this implementation:
actor_critic.remember(cur_state, action, reward, new_state, done)
actor_critic.train()
Regarding your second question: I am not sure, but I think this is achieved by the optimization algorithm (i.e. stochastic gradient descent).
Third question: In the predator-prey model the actor or activator is the prey, and the behavior of the prey is only determined by the size or capacity of the habitat (the amount of grass) and the size of the predator (inhibitor) population, so modelling it in this way is consistent with nature, or with an activator-inhibitor system again. In the main function in https://towardsdatascience.com/reinforcement-learning-w-keras-openai-actor-critic-models-f084612cfd69, also only the critic or inhibitor / predator is trained.

Tensorflow He initializer for layers with skip connections

In TensorFlow, the He initializer is implemented here. My network has additive skip connections (like residual networks) in it, so the fan_in should change accordingly if unit variance is to be maintained by the network. Does the TensorFlow initializer take care of that, or do I need to write my own initializer for that?
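The concern can be illustrated numerically. The sketch below assumes, for simplicity, that the skip branch and the residual branch are independent and each have unit variance; under that assumption their sum has a variance of about 2. An initializer only sees the shape of the weight tensor it is asked to fill, so it has no information about such an addition further down the graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(0.0, 1.0, size=n)         # skip branch, unit variance
residual = rng.normal(0.0, 1.0, size=n)  # residual branch, assumed independent

y = x + residual                         # additive skip connection
print(round(float(x.var()), 2), round(float(y.var()), 2))
```

In a real network the two branches are correlated, so the factor will not be exactly 2, but the variance still grows with every additive connection unless something compensates for it.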

Usually, in a ResNet with skip connections, the He initializer is used, which is intended to address the issue you mention.
However, if you want to make sure the variance is maintained, you can use Glorot initialization, which was developed for the same purpose, as explained below.
Glorot and Bengio propose a way to significantly alleviate
the unstable gradients problem. They point out that we need the signal to
flow properly in both directions: in the forward direction when making
predictions, and in the reverse direction when backpropagating gradients.
We don’t want the signal to die out, nor do we want it to explode and
saturate. For the signal to flow properly, the authors argue that we need the
variance of the outputs of each layer to be equal to the variance of its
inputs, and we need the gradients to have equal variance before and after
flowing through a layer in the reverse direction (please check out the paper
if you are interested in the mathematical details). It is actually not
possible to guarantee both unless the layer has an equal number of inputs
and neurons (these numbers are called the fan-in and fan-out of the layer),
but Glorot and Bengio proposed a good compromise that has proven to
work very well in practice: the connection weights of each layer must be
initialized randomly.
Since the initialization parameters depend on the activation function in the layer, the table below suggests an initializer for each activation function.
I hope this answers your question, Happy Learning!
