Training a modified fully-connected neural network - python

Take a simple 3-layer MLP such as this. Each hidden layer implements y = xw + b, where y is the layer's output activation matrix of shape [batch_size, output_size], x is the input activation matrix of shape [batch_size, input_size], w is the trainable weight matrix of shape [input_size, output_size], and b is the trainable bias vector of shape [output_size].
Now modify the layer definition so that each layer implements y = x(w mod m) + b, where m is a trainable matrix of the same shape as w. Since TensorFlow implements a gradient for the modulo op, propagating gradients through the added modulo shouldn't be an issue. Yet this fairly trivial modification breaks the MLP and the network stops learning altogether: accuracy on MNIST (10-way digit classification) falls to ~10%, equivalent to random guessing.
Does anyone have a guess as to why the network fails to learn with the added mod operator? I am able to implement y = xw + (b mod m), which works just fine. The problem appears only when mod is applied to the xw term.
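A minimal sketch of the modification being described, written as a custom Keras layer (the class name ModDense and the initializers are my own assumptions, not taken from the original code):

import tensorflow as tf

class ModDense(tf.keras.layers.Layer):
    # Dense layer computing y = x @ (w mod m) + b, as described in the question.
    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(in_dim, self.units),
                                 initializer="glorot_uniform", trainable=True)
        # m is a trainable matrix of the same shape as w.
        self.m = self.add_weight(name="m", shape=(in_dim, self.units),
                                 initializer="ones", trainable=True)
        self.b = self.add_weight(name="b", shape=(self.units,),
                                 initializer="zeros", trainable=True)

    def call(self, x):
        # Per the question, TensorFlow defines gradients for the mod op,
        # but note the op itself is piecewise and discontinuous in w and m.
        return tf.matmul(x, tf.math.floormod(self.w, self.m)) + self.b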

Related

Weights and Biases of LSTM Layer Python

I have developed an LSTM Model with 1 LSTM layer and 3 dense layers as shown below
model = Sequential()
model.add(LSTM(units=120, activation='relu', return_sequences=False,
               input_shape=(train_in.shape[1], 5)))
model.add(Dense(100,activation='relu'))
model.add(Dense(50,activation='relu'))
model.add(Dense(1))
I have trained the model and obtained the trained weights and biases of the model. The details are shown below.
w = model.get_weights()
w[0].shape, w[1].shape,w[2].shape,w[3].shape,w[4].shape,w[5].shape,w[6].shape,w[7].shape,w[8].shape
The output I got is,
((5, 480),(120, 480),(480,),(120, 100),(100,),(100, 50),(50,),(50, 1),(1,))
It has given out two weight matrices of dimensions (5, 480) and (120, 480) and one bias vector of dimension (480,) corresponding to the LSTM layer. The others relate to the dense layers.
What I want to know is this: an LSTM has 4 layers, so how can I get the weights and biases of these 4 layers separately? Can I divide the total weights (5, 480) into 4 equal parts and take the first 120 as corresponding to the 1st layer of the LSTM, the next 120 to the 2nd layer, and so on?
Please share your thoughts on this, along with any good references.
An LSTM doesn't have 4 layers but 4 weight matrices due to its internal gate-cell structure. If this is confusing, it is helpful to read some resources on how an LSTM works. To summarize, the internals consist of 3 gates and 1 cell state which are used to calculate the final hidden state.
If you check the underlying implementation, you can see in which order they are concatenated:
[i, f, c, o]
# i is input gate weights (W_i).
# f is forget gate weights (W_f).
# c is cell gate weights (W_c).
# o is output gate weights (W_o).
Taking your bias tensor of shape (480,) as an example, you can divide it into 4 sub-tensors of size 120, where b[:120] holds the input gate bias, b[120:240] the forget gate bias, and so on. The two kernel matrices split the same way along their last axis.
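A short sketch of that slicing, assuming the weight list w from your snippet (units = 120, gate order [i, f, c, o]):

units = 120
kernel, recurrent_kernel, bias = w[0], w[1], w[2]  # shapes (5, 480), (120, 480), (480,)

# Split each tensor into the four gate blocks along the last axis.
gates = {}
for idx, name in enumerate(["input", "forget", "cell", "output"]):
    sl = slice(idx * units, (idx + 1) * units)
    gates[name] = {
        "W": kernel[:, sl],            # (5, 120)   input-to-gate weights
        "U": recurrent_kernel[:, sl],  # (120, 120) hidden-to-gate weights
        "b": bias[sl],                 # (120,)     gate bias
    }

print(gates["forget"]["W"].shape)  # (5, 120)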

Kernel and Recurrent Kernel in Keras LSTMs

I'm trying to picture the structure of an LSTM and I don't understand what the kernel and recurrent kernel are. According to this post, in the LSTM section, the kernel is the four matrices that are multiplied by the inputs and the recurrent kernel is the four matrices that are multiplied by the hidden state, but what are those 4 matrices in this diagram?
Are they the gates?
I was testing with this app how the units variable in the code below affects the kernel, recurrent kernel and bias:
model = Sequential()
model.add(LSTM(units=1, input_shape=(1, look_back)))
With look_back = 1 it returns this:
With units = 2 it returns this:
With units = 3 this:
Testing with these values I could deduce these expressions,
but I don't know how this works inside. What do <1x(4u)> and <ux(4u)> mean? (u = units)
The kernels are basically the weights handled by the LSTM cell.
units = neurons, as in a classic multilayer perceptron.
It is not shown in your diagram, but the input is a vector X with 1 or more values, and each value is sent to each neuron with its own weight w (which we are going to learn with backpropagation).
The four matrices are these (expressed as Wf, Wi, Wc, Wo):
When you add a neuron, you are adding another 4 weights/kernels.
So for your input vector X you have four matrices, and therefore
1 * 4 * units = kernel
Regarding the recurrent_kernel, you can find the answer here.
Basically, in Keras the input and hidden state are not concatenated as in the example diagrams (W[h_{t-1}, x_t]); they are kept separate and handled with four other matrices called U:
Because the hidden state (one value per neuron) feeds back into every neuron, the weights U (all four U matrices) amount to:
units * (4 * units) = recurrent kernel
h_{t-1} comes back recurrently from all your neurons. As in a multilayer perceptron, the output of each neuron feeds into all the neurons of the next recurrent step.
source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
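A quick sketch to verify those counts, assuming a standalone Keras LSTM with look_back = 1 (the variable names are only illustrative):

import tensorflow as tf

units, look_back = 3, 1
lstm = tf.keras.layers.LSTM(units, input_shape=(1, look_back))
model = tf.keras.Sequential([lstm])  # building the model creates the weights

kernel, recurrent_kernel, bias = lstm.get_weights()
print(kernel.shape)            # (1, 12)  = look_back x (4 * units)
print(recurrent_kernel.shape)  # (3, 12)  = units x (4 * units)
print(bias.shape)              # (12,)    = 4 * units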

Tensorflow - building LSTM model - need for tf.keras.layers.Dense()

Python 3.7 tensorflow
I am experimenting with time series forecasting in TensorFlow.
I understand the second line creates an LSTM RNN, i.e. a recurrent neural network of type Long Short-Term Memory.
Why do we need to add a Dense(1) layer at the end?
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
Tutorial for Dense() says
Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Could you rephrase or elaborate on the need for Dense() here?
The following line
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
creates an LSTM layer which transforms the input window (each step of size #features) into a single latent representation of size 32. You want to predict a single value, so you need to convert this latent representation of size 32 into a single value. Hence, you add the following line
single_step_model.add(tf.keras.layers.Dense(1))
which adds a Dense layer (a fully-connected layer) with one output neuron, which produces a single value. Look at it as a way to transform an intermediate result of higher dimensionality into the final result.
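A minimal sketch of the resulting shapes, assuming an input window of 120 timesteps with 7 features (the numbers are chosen only for illustration):

import numpy as np
import tensorflow as tf

x_train_single = np.zeros((100, 120, 7), dtype="float32")  # (samples, timesteps, features)

single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))

# LSTM output:  (None, 32)  latent representation of the whole window
# Dense output: (None, 1)   the single forecast value
print(single_step_model(x_train_single[:4]).shape)  # (4, 1)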
Well, in the tutorial you are following (Time series forecasting), they are trying to forecast temperature (6 hrs ahead), for which they use an LSTM followed by a Dense layer.
single_step_model = tf.keras.models.Sequential()
single_step_model.add(tf.keras.layers.LSTM(32, input_shape=x_train_single.shape[-2:]))
single_step_model.add(tf.keras.layers.Dense(1))
A Dense layer is nothing but a regular fully-connected NN layer. In this case you are bringing the output dimensionality down to 1, which should bear some relationship (not necessarily linear) to the temperature you are trying to predict. There are other layers you could use as well; check out Keras Layers.
If you are confused about the input and output shape of LSTM, check out
I/O Shape.

Implementing weight normalization using TensorFlow layers' `kernel_constraint`

Some of the TensorFlow layers, such as tf.layers.dense and tf.layers.conv2d, take in a kernel_constraint argument, which according to the TF API docs implements an
Optional projection function to be applied to the kernel after being updated by an Optimizer (e.g. used to implement norm constraints or value constraints for layer weights).
In [1], Salimans et al. present a neural network normalization technique called weight normalization, which normalizes the weight vectors of the network layers, in contrast to, for example, batch normalization [2], which normalizes the actual data batch flowing through the layer. In some cases the computational overhead of weight normalization is lower, and it can also be used in cases where batch normalization is not feasible.
My question is: is it possible to implement the weight normalization using the abovementioned TensorFlow layers' kernel_constraint? Assuming x is an input with shape (batch, height, width, channels), I thought I could implement it as follows:
x = tf.layers.conv2d(
    inputs=x,
    filters=16,
    kernel_size=(3, 3),
    strides=(1, 1),
    kernel_constraint=lambda kernel: (
        tf.nn.l2_normalize(kernel, list(range(kernel.shape.ndims - 1)))))
What would be a simple test case to validate/invalidate my solution?
[1] SALIMANS, Tim; KINGMA, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems. 2016. p. 901-909.
[2] IOFFE, Sergey; SZEGEDY, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Despite the title, the paper by Salimans and Kingma suggests decoupling the norm of the weights from their direction, rather than actually normalising the weights (i.e. setting their L2 norm to one, as you suggested).
If you want to verify that your code has the intended effect even if it is not what they proposed, you can get the weights of the model and check their norm.
In pseudo-code:
import numpy as np

model = tf.keras.Model(inputs=inputs, outputs=x)
weights = model.get_weights()[i]  # checking the weights of the i-th layer
flat_weights = weights.flatten()
print(np.linalg.norm(flat_weights, 2))
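For contrast, a rough sketch of the reparameterization the paper actually proposes, w = g * v / ||v||, written as a custom dense layer (the class name WeightNormDense and the variables v, g, b are my own; this is not the authors' reference implementation):

import tensorflow as tf

class WeightNormDense(tf.keras.layers.Layer):
    # Dense layer with weight normalization: w = g * v / ||v||, per output unit.
    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.v = self.add_weight(name="v", shape=(in_dim, self.units),
                                 initializer="glorot_uniform", trainable=True)
        # g holds one learnable scale per output unit, decoupled from v's direction.
        self.g = self.add_weight(name="g", shape=(self.units,),
                                 initializer="ones", trainable=True)
        self.b = self.add_weight(name="b", shape=(self.units,),
                                 initializer="zeros", trainable=True)

    def call(self, x):
        w = self.g * tf.math.l2_normalize(self.v, axis=0)  # normalize each column of v
        return tf.matmul(x, w) + self.b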

tensorflow basic word2vec example: Shouldn't we be using weights [nce_weight Transpose] for the representation and not embedding matrix?

I am referring to this sample code
in the code snippet below:
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
Now the NCE loss function is nothing but a single-hidden-layer neural network with softmax at the output layer [except that it takes only a few negative samples].
This part of the graph will only update the weights of the network; it is not doing anything to the "embeddings" matrix/tensor.
So ideally, once the network is trained, we must pass each word through the embeddings matrix first and then multiply by the transpose of the "nce_weights" [treating it as an auto-encoder with the same weights at the input and output layers] to reach the hidden-layer representation of each word, which we are calling word2vec (?)
But if you look at the later part of the code, the value of the embeddings matrix is being used as the word representation.
Even the TensorFlow doc for NCE loss describes inputs (to which we are passing embed, which uses embeddings) as just the first layer's input activation values.
inputs: A Tensor of shape [batch_size, dim]. The forward activations of the input network.
A normal backpropagation stops at the first layer of the network.
Does this implementation of NCE loss go beyond that and propagate the loss to the input values (and hence to the embeddings)?
This seems like an extra step?
Refer to this for why I am calling it an extra step; he gives the same explanation.
What I have figured out from reading and going through the TensorFlow code is that,
though the entire thing is a single-hidden-layer neural network, an auto-encoder indeed, the weights are not tied, which I had assumed.
The encoder is made of the weight matrix embeddings and the decoder is made of the nce_weights. embed is nothing but the hidden-layer output, given by multiplying the input with embeddings.
With this, embeddings and nce_weights will both be updated in the graph, and we can choose either of the two weight matrices; embeddings is preferred here.
Edit1:
Actually, for both tf.nn.nce_loss and tf.nn.sampled_softmax_loss, the parameters weights and biases belong to the weights(transpose) * X + bias input of the objective function, which can be a logistic regression / softmax function [refer].
But backpropagation / gradient descent runs all the way down to the very base of the graph you are building and does not stop at the weights and biases of the loss function only. Hence the inputs parameter of both tf.nn.nce_loss and tf.nn.sampled_softmax_loss is also updated, and it in turn is built from the embeddings matrix.
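A quick sketch of how one might confirm this, assuming the graph variables from the snippet above (TF1-style, to match the sample code):

# Gradients of the loss w.r.t. all three variables; none should be None,
# which shows that backprop reaches the embeddings through the `inputs` argument.
grads = tf.gradients(loss, [embeddings, nce_weights, nce_biases])
print([g is not None for g in grads])  # expected: [True, True, True]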
