Kernel and Recurrent Kernel in Keras LSTMs

Kernel and Recurrent Kernel in Keras LSTMs - python

I'm trying to draw in my mind the structure of the LSTMs and I don't understand what are the Kernel and Recurrent Kernel. According to this post in LSTMs section, the Kernel it's the four matrices that are multiplied by the inputs and Recurrent Kernel it's the four matrices that are multiplied by the hidden state, but, what are those 4 matrices in this diagram?
Are the gates?
I was testing with this app how the unit variable of the code below affect the kernel, recurrent kernel and bias:
model = Sequential()
model.add(LSTM(unit = 1, input_shape=(1, look_back)))
with look_back = 1 it returns me that:
with unit = 2 it returns me this
With unit = 3 this
Testing with this values I could deducted this expressions
but I don't know how this works by inside. What does mean <1x(4u)> or <ux(4u)>? u = units

The kernels are basically the weights handled by the LSTM cell
units = neurons, like the classic multilayer perceptron
It is not shown in your diagram, but the input is a vector X with 1 or more values, and each value is sent in a neuron with its own weight w (the which we are going to learn with backpropagation)
The four matrices are these (expressed as Wf, Wi, Wc, Wo):
When you add a neuron, you are adding other 4 weights\kernel
So for your input vector X you have four matrix. And therefore
1 * 4 * units = kernel
Regarding the recurrent_kernel here you can find the answer.
Basically in keras input and hidden state are not concatenated like in the example diagrams (W[ht-1, t]) but they are split and handled with other four matrices called U:
Because you have a hidden state x neuron, the weights U (all four U) are:
units * (4 * units) = recurrent kernel
ht-1 comes in a recurrent way from all your neurons. Like in a multilayer perceptron, each output of a neuron goes in all the next recurrent layer neurons
source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Related

Weights and Biases of LSTM Layer Python

I have developed an LSTM Model with 1 LSTM layer and 3 dense layers as shown below
model = Sequential()
model.add(LSTM(units = 120, activation ='relu', return_sequences = False,input_shape
(train_in.shape[1],5)))
model.add(Dense(100,activation='relu'))
model.add(Dense(50,activation='relu'))
model.add(Dense(1))
I have trained the model and obtained the trained weights and biases of the model. The details are shown below.
w = model.get_weights()
w[0].shape, w[1].shape,w[2].shape,w[3].shape,w[4].shape,w[5].shape,w[6].shape,w[7].shape,w[8].shape
The output I got is,
((5, 480),(120, 480),(480,),(120, 100),(100,),(100, 50),(50,),(50, 1),(1,))
It has given out 2 weight matrices of dimensions (5,480)&(120,480) and one bias matrix of dim (480,) corresponding to the LSTM layer. the others are related to the dense layers.
The thing I want to know is that, LSTM has 4 layers. So How can I get the weights and biases of these 4 layers separately? Can I divide the total weights(5,480) into 4 equal parts and consider the 1st 120 correspond to the 1st layer of LSTM, 2nd 120 belong to the 2nd layer of LSTM and so on??
Please share your valuable thoughts on this. Also any good references please

An LSTM doesn't have 4 layers but 4 weight matrices due to its internal gate-cell structure. If this is confusing, it is helpful to read some resources on how an LSTM works. To summarize, the internals consist of 3 gates and 1 cell state which are used to calculate the final hidden state.
If you check the underlying implementation, you can see in which order they are concatenated:
[i, f, c, o]
# i is input gate weights (W_i).
# f is forget gate weights (W_f).
# o is output gate weights (W_o).
# c is cell gate weights (W_c).
So on the example of your bias tensor (480,), you can divide this into 4 subtensors with size 120, where w[:120] represents the input gate weights, w[120:240] represents the forget gate weights and so on.

Define a network in pytorch with incomplete connections, like convolution

I'd like to train a small neural network in Pytorch that takes as an input an 8-dimensional vector and predicts one of three possible categories. The first hidden layer should contain 6 neurons, where each neuron takes the activations of only 3 consecutive dimensions in the input layer. The second hidden layer should also contain 6 nodes and be fully connected, and the last layer should be the output layer with 3 neurons. Thus the topology is:
network topology
Let's say that a mini batch consists of 64 (8-dimensional) data points.
I tried to implement the first layer using 1D convolution. Since a 1D convolution filter assumes the input is a sequence of points, I thought a good approach is to define 6 filters operating on 8 1-dimensional points:
import torch.nn as nn
import torch.nn.functional as functional
class ExampleNet(nn.Module):
def __init__(self, batch_size, input_channels, output_channels):
super(ExampleNet, self).__init__()
self._layer1 = nn.Conv1d(in_channels=1, out_channels=input_channels - 2, kernel_size=3, stride=1)
self._layer2 = nn.Linear(in_features=input_channels - 2, out_features=input_channels - 2)
self._layer3 = nn.Linear(in_features=input_channels - 2, out_features=output_channels)
def forward(self, x):
x = functional.relu(self._layer1(x))
x = functional.relu(self._layer2(x))
x = functional.softmax(self._layer3(x))
return x
net = ExampleNet(64, 8, 3)
I know that Pytorch expects a sequence of arrays of size 64 x 8 x 1 each when training the network. However, since I apply 1D convolutional filters in an untraditional way, I think I should have input arrays of size 64 x 1 x 8, and I am expecting an output of size 64 x 3. I use the following mini-batch of random points to run through the network:
# Generate a mini-batch of 64 samples
input = torch.randn(64, 1, 8)
out = net(input)
print(out.size())
And the output I get tells me that I defined a wrong topology. How would you advise me to define the layers I need? Is using Conv1d a good approach in my case? I saw that another approach is to use a masked layer but I don't know how to define it.

Can you reverse a PyTorch neural network and activate the inputs from the outputs?

Can we activate the outputs of a NN to gain insight into how the neurons are connected to input features?
If I take a basic NN example from the PyTorch tutorials. Here is an example of a f(x,y) training example.
import torch
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
model = torch.nn.Sequential(
torch.nn.Linear(D_in, H),
torch.nn.ReLU(),
torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4
for t in range(500):
y_pred = model(x)
loss = loss_fn(y_pred, y)
model.zero_grad()
loss.backward()
with torch.no_grad():
for param in model.parameters():
param -= learning_rate * param.grad
After I've finished training the network to predict y from x inputs. Is it possible to reverse the trained NN so that it can now predict x from y inputs?
I don't expect y to match the original inputs that trained the y outputs. So I expect to see what features the model activates on to match x and y.
If it is possible, then how do I rearrange the Sequential model without breaking all the weights and connections?

It is possible but only for very special cases. For a feed-forward network (Sequential) each of the layers needs to be reversible; that means the following arguments apply to each layer separately. The transformation associated with one layer is y = activation(W*x + b) where W is the weight matrix and b the bias vector. In order to solve for x we need to perform the following steps:
Reverse activation; not all activation functions have an inverse though. For example the ReLU function does not have an inverse on (-inf, 0). If we used tanh on the other hand we can use its inverse which is 0.5 * log((1 + x) / (1 - x)).
Solve W*x = inverse_activation(y) - b for x; for a unique solution to exist W must have similar row and column rank and det(W) must be non-zero. We can control the former by choosing a specific network architecture while the latter depends on the training process.
So for a neural network to be reversible it must have a very specific architecture: all layers must have the same number of input and output neurons (i.e. square weight matrices) and the activation functions all need to be invertible.
Code: Using PyTorch we will have to do the inversion of the network manually, both in terms of solving the system of linear equations as well as finding the inverse activation function. Consider the following example of a 1-layer neural network (since the steps apply to each layer separately extending this to more than 1 layer is trivial):
import torch
N = 10 # number of samples
n = 3 # number of neurons per layer
x = torch.randn(N, n)
model = torch.nn.Sequential(
torch.nn.Linear(n, n), torch.nn.Tanh()
)
y = model(x)
z = y # use 'z' for the reverse result, start with the model's output 'y'.
for step in list(model.children())[::-1]:
if isinstance(step, torch.nn.Linear):
z = z - step.bias[None, ...]
z = z[..., None] # 'torch.solve' requires N column vectors (i.e. shape (N, n, 1)).
z = torch.solve(z, step.weight)[0]
z = torch.squeeze(z) # remove the extra dimension that we've added for 'torch.solve'.
elif isinstance(step, torch.nn.Tanh):
z = 0.5 * torch.log((1 + z) / (1 - z))
print('Agreement between x and z: ', torch.dist(x, z))

If I've understood correctly, there are two questions here:
Is it possible to determine what features in the input have activated neurons?
If so, is it possible to use this information to generate samples from p(x|y)?
Regarding 1, a basic way to determine if a neuron is sensitive to an input feature x_i is to compute the gradient of this neuron's output w.r.t x_i. A high gradient will indicate sensitivity to a particular input element. There is a rich literature on the subject, for example, you can have a look at guided backpropagation or at GradCam (the latter is about classification with convnets, but it does contain useful ideas).
As for 2, I don't think that your approach to "reversing the problem" is correct. The problem is that your network is discriminative and what it outputs can be seen as argmax_y p(y|x). Note that this is a point-wise estimation, not a full modeling of the distribution. However, the inverse problem that you're interested in seems to be sampling from
p(x|y)=constant*p(y|x)p(x).
You don't know how to sample from p(y|x) and you don't know anything about p(x). Even if you use a method to discover correlations between the neurons and specific input features, you have only discovered which features where more important to the networks prediction, but depending on the nature of y this might be insufficiant. Consider a toy example where your inputs x are 2d points distributed according to some distribution in R^2 and where the output y is binary, such that any (a,b) in R^2 is classified as 1 if a<1 and it is classified as 0 if a>1. Then a discriminative network could learn the vertical line x=1 as its decision boundary. Inspecting correlations between neurons and input features will reveal that only the first coordinate was useful in this prediction, but this information is not sufficient for sampling from the full 2d distribution of inputs.
I think that Variational autoencoders could be what you're looking for.

Training a modified fully-connected neural network

Take a simple 3 layer MLP neural network such as this. Each hidden layer implements y=xw+b where y is the output activations matrix of the layer of shape [batch_size, output_size], x is the input activations matrix of shape [batch_size, input_size], w is the trainable weights matrix of shape [input_size, output_size] and b is the trainable bias vector of shape [output_size].
Now modify the layer definition so each layer implements y = x(w mod m) + b where m is a trainable matrix similar to w and of same shape as w. Since tensorflow implements gradient of the modulo function for backprop, propagating gradients due to the added modulo shouldn't be an issue. Making this fairly trivial modification in the network breaks the MLP and the network stops learning altogether. In other words, the accuracy falls to ~10% for MNIST (10 digit classification) equivalent to random guessing.
Would anyone have any guesses as to why the network fails to learn with the added mod operator? I am able to implement y=xw + (b mod m) which works just fine. The problem seems to appear only when mod is used with the xw term.

How to interpret clearly the meaning of the units parameter in Keras?

I am wondering how LSTM work in Keras. In this tutorial for example, as in many others, you can find something like this :
model.add(LSTM(4, input_shape=(1, look_back)))
What does the "4" mean. Is it the number of neuron in the layer. By neuron, I mean something that for each instance gives a single output ?
Actually, I found this brillant discussion but wasn't really convinced by the explanation mentioned in the reference given.
On the scheme, one can see the num_unitsillustrated and I think I am not wrong in saying that each of this unit is a very atomic LSTM unit (i.e. the 4 gates). However, how these units are connected ? If I am right (but not sure), x_(t-1)is of size nb_features, so each feature would be an input of a unit and num_unit must be equal to nb_features right ?
Now, let's talk about keras. I have read this post and the accepted answer and get trouble. Indeed, the answer says :
Basically, the shape is like (batch_size, timespan, input_dim), where input_dim can be different from the unit
In which case ? I am in trouble with the previous reference...
Moreover, it says,
LSTM in Keras only define exactly one LSTM block, whose cells is of unit-length.
Okay, but how do I define a full LSTM layer ? Is it the input_shape that implicitely create as many blocks as the number of time_steps (which, according to me is the first parameter of input_shape parameter in my piece of code ?
Thanks for lighting me
EDIT : would it also be possible to detail clearly how to reshape data of, say, size (n_samples, n_features) for a stateful LSTM model ? How to deal with time_steps and batch_size ?

First, units in LSTM is NOT the number of time_steps.
Each LSTM cell(present at a given time_step) takes in input x and forms a hidden state vector a, the length of this hidden unit vector is what is called the units in LSTM(Keras).
You should keep in mind that there is only one RNN cell created by the code
keras.layers.LSTM(units, activation='tanh', …… )
and RNN operations are repeated by Tx times by the class itself.
I've linked this to help you understand it better in with a very simple code.

You can (sort of) think of it exactly as you think of fully connected layers. Units are neurons.
The dimension of the output is the number of neurons, as with most of the well known layer types.
The difference is that in LSTMs, these neurons will not be completely independent of each other, they will intercommunicate due to the mathematical operations lying under the cover.
Before going further, it might be interesting to take a look at this very complete explanation about LSTMs, its inputs/outputs and the usage of stative = true/false: Understanding Keras LSTMs. Notice that your input shape should be input_shape=(look_back, 1). The input shape goes for (time_steps, features).
While this is a series of fully connected layers:
hidden layer 1: 4 units
hidden layer 2: 4 units
output layer: 1 unit
This is a series of LSTM layers:
Where input_shape = (batch_size, arbitrary_steps, 3)
Each LSTM layer will keep reusing the same units/neurons over and over until all the arbitrary timesteps in the input are processed.
The output will have shape:
(batch, arbitrary_steps, units) if return_sequences=True.
(batch, units) if return_sequences=False.
The memory states will have a size of units.
The inputs processed from the last step will have size of units.
To be really precise, there will be two groups of units, one working on the raw inputs, the other working on already processed inputs coming from the last step. Due to the internal structure, each group will have a number of parameters 4 times bigger than the number of units (this 4 is not related to the image, it's fixed).
Flow:
Takes an input with n steps and 3 features
Layer 1:
For each time step in the inputs:
Uses 4 units on the inputs to get a size 4 result
Uses 4 recurrent units on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps
output features = 4
Layer 2:
Same as layer 1
Layer 3:
For each time step in the inputs:
Uses 1 unit on the inputs to get a size 1 result
Uses 1 unit on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps

The number of units is the size (length) of the internal vector states, h and c of the LSTM. That is no matter the shape of the input, it is upscaled (by a dense transformation) by the various kernels for the i, f, and o gates. The details of how the resulting latent features are transformed into h and c are described in the linked post. In your example, the input shape of data
(batch_size, timesteps, input_dim)
will be transformed to
(batch_size, timesteps, 4)
if return_sequences is true, otherwise only the last h will be emmited making it (batch_size, 4). I would recommend using a much higher latent dimension, perhaps 128 or 256 for most problems.

I would put it this way - there are 4 LSTM "neurons" or "units", each with 1 Cell State and 1 Hidden State for each timestep they process. So for an input of 1 timestep processing , you will have 4 Cell States, and 4 Hidden States and 4 Outputs.
Actually the correct way to say this is - for one timestep sized input you 1 Cell State (a vector of size 4) and 1 Hidden State (a vector of size 4) and 1 Output (a vector of size 4).
So if you feed in a timeseries with 20 steps, you will have 20 (intermediate) Cell States, each of size 4. That is because the inputs in LSTM are processed sequentially, 1 after the other. Similarly you will have 20 Hidden States, each of size 4.
Usually, your output will be the output of the LAST step (a vector of size 4). However in case you want the outputs of each intermediate step(remember you have 20 timesteps to process), you can make return_sequences = TRUE. In which case you will have 20 , 4 sized vectors each telling you what was the output when each of those steps got processed as those 20 inputs came one after the other.
In case when you put return_states = TRUE , you get the last Hidden State of size = 4 and last Cell State of size 4.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.