Output from LSTM not changing for different inputs

I have an LSTM implemented in PyTorch as below.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class LSTM(nn.Module):
    """
    Defines an LSTM.
    """

    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super(LSTM, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)

    def forward(self, input_data):
        lstm_out_pre, _ = self.lstm(input_data)
        return lstm_out_pre


model = LSTM(input_dim=2, hidden_dim=2, output_dim=1, num_layers=8)
random_data1 = torch.Tensor(np.random.standard_normal(size=(1, 5, 2)))
random_data2 = torch.Tensor(np.random.standard_normal(size=(1, 5, 2)))
out1 = model(random_data1).detach().numpy()
out2 = model(random_data2).detach().numpy()
print(out1)
print(out2)
I am simply creating an LSTM network and passing two random inputs into it. The outputs do not make sense: no matter what random_data1 and random_data2 are, out1 and out2 are always the same. I would expect random inputs multiplied by random weights to give different outputs.
This does not seem to be the case if I use fewer layers. With num_layers=2, the effect seems to be nil, and as num_layers increases, out1 and out2 keep getting closer. This does not make sense to me either: with more LSTM layers stacked on top of each other, the input is multiplied by more random weights, which should magnify the differences in the input and give very different outputs.
Can someone please explain this behavior? Is there something wrong with my implementation?
In one particular run, random_data1 is
tensor([[[-2.1247, -0.1857],
[ 0.0633, -0.1089],
[-0.6460, -0.1079],
[-0.2451, 0.9908],
[ 0.4027, 0.3619]]])
random_data2 is
tensor([[[-0.9725, 1.2400],
[-0.4309, -0.7264],
[ 0.5053, -0.9404],
[-0.6050, 0.9021],
[ 1.4355, 0.5596]]])
out1 is
[[[0.12221643 0.11449362]
[0.18342148 0.1620608 ]
[0.2154751 0.18075559]
[0.23373817 0.18768947]
[0.24482158 0.18987371]]]
out2 is
[[[0.12221643 0.11449362]
[0.18342148 0.1620608 ]
[0.2154751 0.18075559]
[0.23373817 0.18768945]
[0.24482158 0.18987371]]]
EDIT:
I am running on the following configurations -
PyTorch - 1.0.1.post2
Python - 3.6.8 with GCC 7.3.0
OS - Pop!_OS 18.04 (Ubuntu 18.04, more-or-less)
CUDA - 9.1.85
Nvidia driver - 410.78

Initial weights for an LSTM are small numbers close to 0, and every added layer is initialized the same way: all the weights and biases are drawn uniformly from -sqrt(k) to sqrt(k), where k = 1/hidden_size (https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM)
By adding more layers you effectively multiply the input by many small numbers, so the effect of the input is basically 0 and only the biases in the later layers matter.
If you try the LSTM with bias=False, you will see the output getting closer and closer to 0 as you add more layers.
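To see the effect without depending on a particular PyTorch version, here is a minimal pure-Python sketch of a stacked LSTM. It is an illustration, not PyTorch's implementation: the helper names (make_layer, run_layer, run_stack, max_output_gap) are made up for this example, it uses a single bias vector per layer instead of PyTorch's two, and it only mimics the U(-sqrt(k), sqrt(k)) initialization described above. It feeds two different random sequences through the same randomly initialized stack and reports the largest difference between the two outputs.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def uniform_matrix(rows, cols, bound):
    return [[random.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


def make_layer(input_dim, hidden_dim):
    """One LSTM layer with parameters drawn from U(-sqrt(k), sqrt(k)), k = 1 / hidden_dim."""
    bound = math.sqrt(1.0 / hidden_dim)
    return {
        "w_ih": uniform_matrix(4 * hidden_dim, input_dim, bound),
        "w_hh": uniform_matrix(4 * hidden_dim, hidden_dim, bound),
        "bias": [random.uniform(-bound, bound) for _ in range(4 * hidden_dim)],
        "hidden_dim": hidden_dim,
    }


def run_layer(layer, sequence):
    """Run one layer over a list of input vectors; return the hidden state at each step."""
    n = layer["hidden_dim"]
    h, c = [0.0] * n, [0.0] * n
    outputs = []
    for x in sequence:
        pre = [
            sum(w * v for w, v in zip(layer["w_ih"][r], x))
            + sum(w * v for w, v in zip(layer["w_hh"][r], h))
            + layer["bias"][r]
            for r in range(4 * n)
        ]
        i = [sigmoid(p) for p in pre[0:n]]              # input gate
        f = [sigmoid(p) for p in pre[n:2 * n]]          # forget gate
        g = [math.tanh(p) for p in pre[2 * n:3 * n]]    # candidate cell values
        o = [sigmoid(p) for p in pre[3 * n:4 * n]]      # output gate
        c = [f[j] * c[j] + i[j] * g[j] for j in range(n)]
        h = [o[j] * math.tanh(c[j]) for j in range(n)]
        outputs.append(h)
    return outputs


def run_stack(layers, sequence):
    for layer in layers:
        sequence = run_layer(layer, sequence)
    return sequence


def max_output_gap(num_layers, seq_a, seq_b):
    """Largest absolute difference between the outputs for two inputs, same weights."""
    layers = [make_layer(2, 2) for _ in range(num_layers)]
    out_a, out_b = run_stack(layers, seq_a), run_stack(layers, seq_b)
    return max(abs(p - q) for ha, hb in zip(out_a, out_b) for p, q in zip(ha, hb))


random.seed(0)
seq_a = [[random.gauss(0, 1) for _ in range(2)] for _ in range(5)]
seq_b = [[random.gauss(0, 1) for _ in range(2)] for _ in range(5)]
gaps = {n: max_output_gap(n, seq_a, seq_b) for n in (1, 2, 4, 8)}
for n, gap in gaps.items():
    print(n, gap)
```

With this kind of initialization the gap typically shrinks sharply as layers are added, which is consistent with the nearly identical out1 and out2 in the question.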

I tried lowering the number of layers and the values do differ. This is because the values are multiplied by a small number over and over again, which reduces the significance of the input.

I initialized all the weights using kaiming_normal and it works fine.

Related

python keras tensorflow - change Dense layer dot product to cosine distance

Creating a small example of the following will be a little bit difficult, so I will give a more abstract example. If it will be needed I can construct a reproducible example.
In general, I'm trying to build a 'fine tuning' model.
I have an 'embedding' architecture that takes an image and outputs a 256 vector.
I have 22 classes I try to predict.
After some initial work I have come up with a (22, 256) matrix representing the embeddings of those classes.
So, now that I have my embedding layer, I'm adding a Dense layer (named 'layert') after it. This dense layer's kernel (weights) will hold the above (22, 256) matrix, which represents my classes.
I will set 'layert' weights to be this matrix, and its biases to be 0.
The key question here is: how do I correctly make that Dense layer do a cosine similarity computation (between whatever comes from the embedding and this Dense layer's kernel)?
I have worked around this problem by building my own Dense layer that inherits from keras.layers.Dense, but I feel this is not a good solution and wouldn't hold up in production.
Let's show some code:
# self.embedding is a working keras model as written above
# self.mean_embd_logos is the matrix (22, 256) described above (it's of tf.Tensor type)
inputs = self.embedding.inputs
x = self.embedding(inputs)
# This is the Dense layer which will get 256 sized tensor and outputs a 22 sized tensor
layert = keras.layers.Dense(units=self.mean_embd_logos.shape[0], name='mean_logos_tensor',
                            bias_initializer='zeros', kernel_initializer='zeros')
output = layert(x)
# Here we override the initial weights and initialize them with the (22, 256) matrix vector
w = self.mean_embd_logos.numpy().T
b = layert.get_weights()[1]
layert.set_weights([w, b])
output = keras.layers.activation.softmax.Softmax()(output)
self.finetune_model = Model(inputs=inputs, outputs=output)
Of course, with this code we get a plain dot product between whatever comes out of the embedding layer and the Dense layer's kernel.
How I solved it:
As described above, inherit from Dense and override call:
from keras.layers import Dense
import tensorflow as tf


class DenseCosineSimilarity(Dense):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def call(self, inputs):
        # Everything is copied until this part:
        # ...
        a = tf.nn.l2_normalize(inputs, -1)
        b = tf.nn.l2_normalize(self.kernel, 0)
        outputs = tf.matmul(a=a, b=b)
So this is how I build a "Dense layer" that calculates a cosine similarity instead of dot product.
Original keras Dense layer has something like this:
outputs = tf.matmul(inputs, self.kernel)
Every comment/suggestion will be much appreciated.
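For what it is worth, the normalization used in call can be sanity-checked without Keras. The sketch below is plain Python with made-up helper names (l2_normalize, cosine_dense) and tiny made-up numbers. It normalizes each input row (axis -1) and each kernel column (axis 0) before the matrix product, so every output entry is the cosine between one input and one class vector. Unlike tf.nn.l2_normalize, it has no epsilon guard, so it assumes no all-zero vectors.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def cosine_dense(inputs, kernel):
    """inputs: batch x dim rows; kernel: dim x units, one class vector per column."""
    units = len(kernel[0])
    rows = [l2_normalize(row) for row in inputs]
    columns = [l2_normalize([kernel_row[u] for kernel_row in kernel]) for u in range(units)]
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in rows]


# dim = 2, units = 3: the class vectors are the columns (1, 0), (0, 2) and (3, 4).
kernel = [[1.0, 0.0, 3.0],
          [0.0, 2.0, 4.0]]
inputs = [[2.0, 0.0],
          [3.0, 4.0]]

scores = cosine_dense(inputs, kernel)
for row in scores:
    print(row)
```

An input that points in the same direction as a class vector scores 1 regardless of its length, which is the behaviour a plain dot product does not give.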

How to train the Shared Layers in PyTorch

I have the following code
import torch
import torch.nn as nn
from torchviz import make_dot, make_dot_from_trace


class Net(nn.Module):
    def __init__(self, input, output):
        super(Net, self).__init__()
        self.fc = nn.Linear(input, output)

    def forward(self, x):
        x = self.fc(x)
        x = self.fc(x)
        return x


model = Net(12, 12)
print(model)
x = torch.rand(1, 12)
y = model(x)
make_dot(y, params = dict(model.named_parameters()))
Here I reuse self.fc twice in forward.
I am confused about the resulting computational graph, and I am curious how to train this model with backpropagation. It seems to me that the gradient will live in a loop forever. Thanks a lot.

There are no issues with your graph. You can train it the same way as any other feed-forward model.
Regarding looping: Since it is a directed acyclic graph, there are no actual loops (check out the arrow directions).
Regarding backprop: Let’s consider the fc.bias parameter. Since you are reusing the same layer twice, the bias has two outgoing arrows (it is used in two places in your net). During the backpropagation stage the direction is reversed: the bias gets gradients from two places, and these gradients add up.
Regarding the graph: An FC layer can be represented as Addmm(bias, x, T(weight)), where T is transposition and Addmm is matrix multiplication plus adding a vector. So you can see how data (weight, bias) is passed into functions (Addmm, T).
https://pytorch.org/docs/stable/generated/torch.addmm.html
https://pytorch.org/docs/stable/generated/torch.t.html
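The "gradients add up" point can be checked by hand on a scalar stand-in for the shared layer. This is a sketch, not what autograd does internally: forward, grads_by_path and numeric_grad are illustrative names, and the layer is reduced to y = w * x + b with single numbers. The gradient of each parameter is written as the sum of one term per use of the layer and then compared against a finite-difference estimate.

```python
def forward(x, w, b):
    """Apply the same scalar 'layer' y = w * x + b twice, like Net.forward."""
    h = w * x + b      # first use of the shared parameters
    return w * h + b   # second use of the same parameters


def grads_by_path(x, w, b):
    """Gradient of the output w.r.t. w and b, one contribution per use of the layer."""
    h = w * x + b
    db_second_use = 1.0      # b added directly in the second call
    db_first_use = w * 1.0   # b inside h, scaled by w in the second call
    dw_second_use = h        # w multiplying h in the second call
    dw_first_use = w * x     # w inside h, scaled by w in the second call
    return dw_first_use + dw_second_use, db_first_use + db_second_use


def numeric_grad(f, value, eps=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)


x, w, b = 1.5, 0.8, -0.3
dw, db = grads_by_path(x, w, b)
dw_numeric = numeric_grad(lambda v: forward(x, v, b), w)
db_numeric = numeric_grad(lambda v: forward(x, w, v), b)
print(dw, dw_numeric)
print(db, db_numeric)
```

The summed per-use contributions match the numerical derivative, so reusing a layer needs no special handling during training.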

Run multiple models of an ensemble in parallel with PyTorch

My neural network has the following architecture:
input -> 128x (separate fully connected layers) -> output averaging
I am using a ModuleList to hold the list of fully connected layers. Here's how it looks at this point:
class MultiHead(nn.Module):
    def __init__(self, dim_state, dim_action, hidden_size=32, nb_heads=1):
        super(MultiHead, self).__init__()
        self.networks = nn.ModuleList()
        for _ in range(nb_heads):
            network = nn.Sequential(
                nn.Linear(dim_state, hidden_size),
                nn.Tanh(),
                nn.Linear(hidden_size, dim_action)
            )
            self.networks.append(network)
        self.cuda()
        self.optimizer = optim.Adam(self.parameters())
Then, when I need to calculate the output, I use a for ... in construct to perform the forward and backward pass through all the layers:
q_values = torch.cat([net(observations) for net in self.networks])
# skipped code which ultimately computes the loss I need
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
This works! But I am wondering if I couldn't do this more efficiently. I feel like by doing a for...in, I am actually going through each separate FC layer one by one, while I'd expect this operation could be done in parallel.

If you used Convnd in place of Linear, you could use the groups argument for "grouped convolutions" (depthwise convolutions are the special case where groups equals the number of input channels). This lets you handle all parallel networks simultaneously.
If you use a convolution kernel of size 1, the convolution does nothing other than apply a Linear layer, where each channel is considered an input dimension. So the rough structure of your network would look like this:
Modify the input tensor of shape B x dim_state as follows: add an additional dimension and replicate it nb_heads times, going from B x dim_state to B x (dim_state * nb_heads) x 1.
Replace the two Linear layers with
nn.Conv1d(in_channels=dim_state * nb_heads, out_channels=hidden_size * nb_heads, kernel_size=1, groups=nb_heads)
and
nn.Conv1d(in_channels=hidden_size * nb_heads, out_channels=dim_action * nb_heads, kernel_size=1, groups=nb_heads)
We now have a tensor of size B x (dim_action * nb_heads) x 1, which you can reshape however you want (e.g. B x nb_heads x dim_action).
While CUDA natively supports grouped convolutions, there were some issues in PyTorch with the speed of grouped convolutions (see e.g. here), but I think that has been solved by now.
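Why this works can be seen in a small framework-free sketch: a grouped convolution with kernel size 1 behaves like one multiplication by a block-diagonal weight matrix, where each block is one head's Linear weight. The code below is plain Python with hypothetical helper names (matvec, block_diagonal), handles a single sample, and leaves out biases and the Tanh for brevity. It compares the per-head loop with the single block-diagonal product over the replicated input.

```python
import random


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def block_diagonal(blocks):
    """Place per-head weight matrices along the diagonal of one larger matrix."""
    total_cols = sum(len(block[0]) for block in blocks)
    big, col_offset = [], 0
    for block in blocks:
        width = len(block[0])
        for row in block:
            big.append([0.0] * col_offset + list(row)
                       + [0.0] * (total_cols - col_offset - width))
        col_offset += width
    return big


random.seed(0)
dim_state, hidden_size, nb_heads = 3, 4, 5

# One (hidden_size x dim_state) weight matrix per head, as in the first nn.Linear.
head_weights = [[[random.uniform(-1, 1) for _ in range(dim_state)]
                 for _ in range(hidden_size)]
                for _ in range(nb_heads)]
x = [random.uniform(-1, 1) for _ in range(dim_state)]

# Reference: loop over the heads one by one and concatenate the results.
looped = [value for weights in head_weights for value in matvec(weights, x)]

# Grouped version: replicate the input nb_heads times, then do a single product.
x_replicated = x * nb_heads
grouped = matvec(block_diagonal(head_weights), x_replicated)

max_diff = max(abs(a - b) for a, b in zip(looped, grouped))
print(len(grouped), max_diff)
```

The grouped layer never materializes the zero blocks, which is what makes it cheaper than a dense layer of the same outer size.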

How to verify structure a neural network in keras model?

I'm new to Keras and neural networks. I'm writing a thesis and trying to create a SimpleRNN in Keras as illustrated below:
As shown in the picture, I need to create a model with 4 inputs and 2 outputs, and with any number of neurons in the hidden layer.
This is my code:
model = Sequential()
model.add(SimpleRNN(4, input_shape=(1, 4), activation='sigmoid', return_sequences=True))
model.add(Dense(2))
model.compile(loss='mean_absolute_error', optimizer='adam')
model.fit(data, target, epochs=5000, batch_size=1, verbose=2)
predict = model.predict(data)
1) Does my model implement the graph?
2) Is it possible to specify connections between neurons Input and Hidden layers or Output and Input layers?
Explanation:
I am going to use backpropagation to train my network.
I have input and target values
Input is a 10*4 array and target is a 10*2 array which I then reshape:
input = input.reshape((10, 1, 4))
target = target.reshape((10, 1, 2))
It is crucial for me to be able to specify connections between neurons, as they can be different. For instance, here is an example:

1) Not really. But I'm not sure about what exactly you want in that graph. (Let's see how Keras recurrent layers work below)
2) Yes, it's possible to connect every layer to every layer, but you can't use Sequential for that, you must use Model.
This answer may not be what you're looking for. What exactly do you want to achieve? What kind of data you have, what output you expect, what is the model supposed to do? etc...
1 - How does a recurrent layer work?
Documentation
Recurrent layers in Keras work with an "input sequence" and may output a single result or a sequence of results. Their recurrence is entirely contained within the layer and doesn't interact with other layers.
You should have inputs with shape (NumberOfExamples, TimeStepsInSequence, DimensionOfEachStep). This means input_shape=(TimeSteps,Dimension).
The recurrent layer will work internally with each time step. The cycles happen from step to step and this behavior is totally invisible. The layer seems to work just like any other layer.
This doesn't seem to be what you want, unless you have a "sequence" to input. The only way I know of using recurrent layers in Keras that is similar to your graph is when you have a segment of a sequence and want to predict the next step. If that's the case, see some examples by searching for "predicting the next element" in Google.
2 - How to connect layers using Model:
Instead of adding layers to a sequential model (which will always follow a straight line), start using the layers independently, starting from an input tensor:
from keras.layers import *
from keras.models import Model
inputTensor = Input(shapeOfYourInput) #it seems the shape is "(2,)", but we must see your data.
#A dense layer with 2 outputs:
myDense = Dense(2, activation=ItsAGoodIdeaToUseAnActivation)
#The output tensor of that layer when you give it the input:
denseOut1 = myDense(inputTensor)
#You can do as many cycles as you want here:
denseOut2 = myDense(denseOut1)
#you can even make a loop:
denseOut = Activation(ItsAGoodIdeaToUseAnActivation)(inputTensor) #you may create a layer and call it with the input tensor in just one line if you're not going to reuse the layer
#I'm applying this activation layer here because since we defined an activation for the dense layer and we're going to cycle it, it's not going to behave very well receiving huge values in the first pass and small values the next passes....
for i in range(n):
    denseOut = myDense(denseOut)
This kind of usage allows you to create any kind of model, with branches, alternative ways, connections from anywhere to anywhere, provided you respect the shape rules. For a cycle like that, inputs and outputs must have the same shape.
At the end, you must define a model from one or many inputs to one or many outputs (you must have training data to match all inputs and outputs you choose):
model = Model(inputTensor,denseOut)
But notice that this model is static. If you want to change the number of cycles, you will have to create a new model.
In this case, it would be as simple as repeating the loop step denseOut = myDense(denseOut) and creating another model2=Model(inputTensor,denseOut).
3 - Trying to create something like the image below:
I am supposing C and F will participate in all iterations. If not,
Since there are four actual inputs, and we are going to treat them all separately, let's create 4 inputs instead, all like (1,).
Your input array should be divided in 4 arrays, all being (10,1).
from keras.models import Model
from keras.layers import *
inputA = Input((1,))
inputB = Input((1,))
inputC = Input((1,))
inputF = Input((1,))
Now the layers N2 and N3, that will be used only once, since C and F are constant:
outN2 = Dense(1)(inputC)
outN3 = Dense(1)(inputF)
Now the recurrent layer N1, without giving it the tensors yet:
layN1 = Dense(1)
For the loop, let's create outA and outB. They start as actual inputs and will be given to the layer N1, but in the loop they will be replaced
outA = inputA
outB = inputB
Now in the loop, let's do the "passes":
for i in range(n):
    #unite A and B in one
    inputAB = Concatenate()([outA,outB])
    #pass through N1
    outN1 = layN1(inputAB)
    #sum results of N1 and N2 into A
    outA = Add()([outN1,outN2])
    #this is constant for all the passes except the first
    outB = outN3 #looks like B is never changing in your image....
Now the model:
finalOut = Concatenate()([outA,outB])
model = Model([inputA,inputB,inputC,inputF], finalOut)

Keras: zero division error

I'm trying to get the activation values for each layer in this baseline autoencoder built using Keras, since I want to add a sparsity penalty to the loss function based on the Kullback-Leibler (KL) divergence, as shown here, p. 14.
In this scenario, I'm going to calculate the KL divergence for each layer and then add all of them to the main loss function, e.g. mse.
I therefore wrote a script in Jupyter to do that, but every time I try to compile I get ZeroDivisionError: integer division or modulo by zero.
This is the code
import numpy as np
from keras.layers import Conv2D, Activation
from keras.models import Sequential
from keras import backend as K
from keras import losses
x_train = np.random.rand(128,128).astype('float32')
kl = K.placeholder(dtype='float32')
beta = K.constant(value=5e-1)
p = K.constant(value=5e-2)
# encoder
model = Sequential()
model.add(Conv2D(filters=16,kernel_size=(4,4),padding='same',
                 name='encoder',input_shape=(128,128,1)))
model.add(Activation('relu'))
# get the average activation
A = K.mean(x=model.output)
# calculate the value for the KL divergence
kl = K.concatenate([kl, losses.kullback_leibler_divergence(p, A)],axis=0)
# decoder
model.add(Conv2D(filters=1,kernel_size=(4,4),padding='same', name='encoder'))
model.add(Activation('relu'))
B = K.mean(x=model.output)
kl = K.concatenate([kl, losses.kullback_leibler_divergence(p, B)],axis=0)
Here seems the cause
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py in _normalize_axis(axis, ndim)
989 else:
990 if axis is not None and axis < 0:
991 axis %= ndim <----------
992 return axis
993
so there might be something wrong in the mean calculation. If I print the value I get
Tensor("Mean_10:0", shape=(), dtype=float32)
which is quite strange because the weights and the biases are initialised to non-zero values. Thus, there might also be something wrong in the way I get the activation values.
I really don't know how to fix it; I'm not much of a skilled programmer.
Could anyone help me understand where I'm going wrong?
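One possible reading of the traceback, offered as a hedged sketch rather than a confirmed diagnosis: the printed tensor has shape=(), so its ndim is 0. If a reduction is then requested over a negative axis such as axis=-1, the highlighted line computes -1 % 0. The assumption here is that losses.kullback_leibler_divergence sums over the last axis, which is worth checking in your Keras version. The function below mirrors the lines from the traceback in plain Python; normalize_axis is just an illustrative name.

```python
def normalize_axis(axis, ndim):
    """Plain-Python mirror of the lines shown in the traceback."""
    if axis is not None and axis < 0:
        axis %= ndim  # with ndim == 0 this is a modulo by zero
    return axis


# A negative axis on a 4-D tensor is mapped to a valid non-negative index.
print(normalize_axis(-1, 4))

# The same call on a scalar (0-D) tensor fails, as in the question.
try:
    normalize_axis(-1, 0)
    raised = False
except ZeroDivisionError as error:
    raised = True
    print("ZeroDivisionError:", error)
```

If that is indeed the cause, the error says nothing about the activation values themselves. It only reflects that a scalar has no last axis to reduce over.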

First, you shouldn't be doing calculations outside layers. The model must keep track of all calculations.
If you need a specific calculation to be done in the middle of the model, you should use a Lambda layer.
If you need that a specific output be used in the loss function, you should split your model for that output and do calculations inside a custom loss function.
Here, I used Lambda layer to calculate the mean, and a customLoss to calculate the kullback-leibler divergence.
import numpy as np
from keras.layers import *
from keras.models import Model
from keras import backend as K
from keras import losses
x_train = np.random.rand(128,128).astype('float32')
kl = K.placeholder(dtype='float32') #you'll probably not need this anymore, since losses will be treated individually in each output.
beta = K.constant(value=5e-1)
p = K.constant(value=5e-2)
# encoder
inp = Input((128,128,1))
lay = Convolution2D(filters=16,kernel_size=(4,4),padding='same', name='encoder',activation='relu')(inp)
#apply the mean using a lambda layer:
intermediateOut = Lambda(lambda x: K.mean(x),output_shape=(1,))(lay)
# decoder
finalOut = Convolution2D(filters=1,kernel_size=(4,4),padding='same', name='decoder',activation='relu')(lay) #layer names must be unique within a model
#but from that, let's also calculate a mean output for loss:
meanFinalOut = Lambda(lambda x: K.mean(x),output_shape=(1,))(finalOut)
#Now, you have to create a model taking one input and those three outputs:
splitModel = Model(inp,[intermediateOut,meanFinalOut,finalOut])
And finally, compile your model with your custom loss function (we will define that later). But since I don't know if you're actually using the final output (not mean) for training, I'll suggest creating one model for training and another for predicting:
trainingModel = Model(inp,[intermediateOut,meanFinalOut])
trainingModel.compile(...,loss=customLoss)
predictingModel = Model(inp,finalOut)
#you don't need to compile the predicting model since you're only training the trainingModel
#both will share the same weights, you train one, and predict in the other
Our custom loss function should then deal with the kullback.
def customLoss(p,mean):
    return #your own kullback expression (I don't know how it works, but maybe keras' one can be used with single values?)
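As a starting point for that expression, here is a plain-Python sketch of the sparsity penalty commonly used with sparse autoencoders: the KL divergence between a Bernoulli variable with mean p (the target sparsity) and one whose mean is the average activation. It is an assumption that this is the formula the question refers to. kl_sparsity is a hypothetical helper, not a Keras function, and the clamping exists because a relu mean is not guaranteed to stay inside (0, 1). In a real loss the same arithmetic would be written with backend ops so that it stays differentiable.

```python
import math


def kl_sparsity(p, mean_activation, eps=1e-7):
    """KL(Bernoulli(p) || Bernoulli(mean_activation)) for scalar values."""
    # Clamp so both logarithms stay finite.
    rho_hat = min(max(mean_activation, eps), 1.0 - eps)
    return (p * math.log(p / rho_hat)
            + (1.0 - p) * math.log((1.0 - p) / (1.0 - rho_hat)))


p = 0.05  # same target sparsity as in the question
for rho_hat in (0.05, 0.2, 0.5):
    print(rho_hat, kl_sparsity(p, rho_hat))

# Weighted sum over two layers, mirroring beta and the two means in the question.
beta = 0.5
layer_means = [0.2, 0.5]  # made-up mean activations, for illustration only
penalty = beta * sum(kl_sparsity(p, m) for m in layer_means)
print(penalty)
```

The penalty is zero when the mean activation equals p and grows as it moves away, which is what pushes the layer toward sparse activations.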
Alternatively, if you want a single loss function to be called instead of two:
summedMeans = Add()([intermediateOut,meanFinalOut])
trainingModel = Model(inp, summedMeans)

We Keep Coding
