We are trying to reconstruct a model using LSTM.
[Image: LSTM model]
In PyTorch, an LSTM expects 3D input.
Based on the picture above, I currently plan to feed it data in the following shape.
(batch, lstm_num, dv_batch, dvector)
lstm_num: Number of LSTMs used
I am wondering whether there is another way to process this 4-dimensional data, either by looping over lstm_num or by passing the tensor as it is.
The original processing code is as follows.
In the original, we get 3-dimensional data of shape (batch, dv_batch, dvector).
for epoch in range(init_epoch, max_epochs):
    for i, (dvec_batch, prob_batch) in enumerate(data_loader):
        dvec_batch = torch.reshape(dvec_batch,
                                   (-1, dvec_batch.size(2))).to(device)
        prob_batch = torch.reshape(prob_batch, (-1, )).to(device)
        outputs = model(dvec_batch).squeeze()
        loss = criterion(outputs, prob_batch)
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    adjust_learning_rate(optimizer, epoch)
An LSTM in PyTorch takes an input tensor of shape
(seq_len, batch, input_size)
or, if you've set batch_first=True,
(batch, seq_len, input_size)
The shape of the input tensor is not related to the number of LSTM layers.
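As a rough sketch of how the 4-dimensional tensor from the question could be handled, here are two options: looping over the lstm_num axis with one LSTM per slice, or folding lstm_num into the batch axis and using a single LSTM with shared weights. The sizes and the names lstms, shared, looped and folded are assumptions made for illustration; which option fits depends on whether the LSTMs in the picture are meant to share weights.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
batch, lstm_num, dv_batch, dvector, hidden = 4, 3, 5, 16, 8

x = torch.randn(batch, lstm_num, dv_batch, dvector)

# Option 1: one independent LSTM per slice along the lstm_num axis.
lstms = nn.ModuleList([
    nn.LSTM(input_size=dvector, hidden_size=hidden, batch_first=True)
    for _ in range(lstm_num)
])
# x[:, k] has shape (batch, dv_batch, dvector), the 3D shape an LSTM expects.
per_lstm = [lstm(x[:, k])[0] for k, lstm in enumerate(lstms)]
looped = torch.stack(per_lstm, dim=1)   # (batch, lstm_num, dv_batch, hidden)

# Option 2: a single LSTM with shared weights; fold lstm_num into the batch axis.
shared = nn.LSTM(input_size=dvector, hidden_size=hidden, batch_first=True)
flat = x.reshape(batch * lstm_num, dv_batch, dvector)
out, _ = shared(flat)
folded = out.reshape(batch, lstm_num, dv_batch, hidden)

print(looped.shape, folded.shape)
```

Option 2 avoids the Python loop, but it only applies if every slice should be processed by the same weights.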
This seems to be one of the most common questions about LSTMs in PyTorch, but I am still unable to figure out what should be the input shape to PyTorch LSTM.
Even after following several posts (1, 2, 3) and trying out the solutions, it doesn't seem to work.
Background: I have encoded text sequences (variable length) in a batch of size 12 and the sequences are padded and packed using pad_packed_sequence functionality. MAX_LEN for each sequence is 384 and each token (or word) in the sequence has a dimension of 768. Hence my batch tensor could have one of the following shapes: [12, 384, 768] or [384, 12, 768].
The batch will be my input to the PyTorch rnn module (lstm here).
According to the PyTorch documentation for LSTMs, the input dimensions are (seq_len, batch, input_size), which I understand as follows.
seq_len - the number of time steps in each input stream (feature vector length).
batch - the size of each batch of input sequences.
input_size - the dimension for each input token or time step.
lstm = nn.LSTM(input_size=?, hidden_size=?, batch_first=True)
What should be the exact input_size and hidden_size values here?
You have explained the structure of your input, but you haven't made the connection between your input dimensions and the LSTM's expected input dimensions.
Let's break down your input (assigning names to the dimensions):
batch_size: 12
seq_len: 384
input_size / num_features: 768
That means the input_size of the LSTM needs to be 768.
The hidden_size is not dependent on your input, but rather how many features the LSTM should create, which is then used for the hidden state as well as the output, since that is the last hidden state. You have to decide how many features you want to use for the LSTM.
Finally, for the input shape, setting batch_first=True requires the input to have the shape [batch_size, seq_len, input_size], in your case that would be [12, 384, 768].
import torch
import torch.nn as nn
# Size: [batch_size, seq_len, input_size]
input = torch.randn(12, 384, 768)
lstm = nn.LSTM(input_size=768, hidden_size=512, batch_first=True)
output, _ = lstm(input)
output.size() # => torch.Size([12, 384, 512])
When an image is passed through a CNN layer and then an LSTM layer, the feature map shape changes like this:
BCHW -> BCHW (B x C x 1 x W)
The CNN's output should have a height of 1.
Then squeeze the height dimension:
BCHW -> BCW
In an RNN the dimension names change: [batch, seq_len, input_size] corresponds to [batch, width, channel] for an image.
BCW -> BWC, which is the batch_first tensor layout for an LSTM layer (as in PyTorch).
Finally:
BWC is [batch, seq_len, channel].
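The shape changes above can be sketched in PyTorch as follows. The feature map is a random stand-in for a CNN output, and all sizes (64 channels, width 25, hidden size 32) are assumptions picked for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical CNN output: batch=2, channels=64, height=1, width=25.
feature_map = torch.randn(2, 64, 1, 25)            # BCHW with H == 1

sequence = feature_map.squeeze(2)                  # BCW: (2, 64, 25)
sequence = sequence.permute(0, 2, 1).contiguous()  # BWC: (2, 25, 64)

# Width plays the role of seq_len, channels the role of input_size.
lstm = nn.LSTM(input_size=64, hidden_size=32, batch_first=True)
output, _ = lstm(sequence)

print(sequence.shape, output.shape)
```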
I am learning deep learning and am trying to understand the PyTorch code given below. I'm struggling to understand how the probability calculation works. Can someone break it down in layman's terms? Thanks a ton.
ps = model.forward(images[0,:])
# Hyperparameters for our network
input_size = 784
hidden_sizes = [128, 64]
output_size = 10
# Build a feed-forward network
model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[0], hidden_sizes[1]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[1], output_size),
                      nn.Softmax(dim=1))
print(model)
# Forward pass through the network and display output
images, labels = next(iter(trainloader))
images.resize_(images.shape[0], 1, 784)
print(images.shape)
ps = model.forward(images[0,:])
I'm a layman so I'll help you with the layman's terms :)
input_size = 784
hidden_sizes = [128, 64]
output_size = 10
These are parameters for the layers in your network. Each neural network consists of layers, and each layer has an input and an output shape.
Specifically, input_size is the input shape of the first layer, and therefore the input size of the entire network. Each sample fed into the network is a 1-dimensional vector of length 784 (an array that is 784 long).
hidden_sizes holds the shapes inside the network. We will cover this a little later.
output_size is the output shape of the last layer. This means that our network will output a 1-dimensional vector of length 10 for each sample.
Now to break up model definition line by line:
model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
The nn.Sequential part simply defines a network, each argument that is input defines a new layer in that network in that order.
nn.Linear(input_size, hidden_sizes[0]) is an example of such a layer. It is the first layer of our network: it takes in an input of size input_size and outputs a vector of size hidden_sizes[0]. The size of the output is considered "hidden" in that it is not the input or the output of the whole network. It is "hidden" because it's located inside the network, far from the input and output ends that you interact with when you actually use it.
This is called Linear because it applies a linear transformation, multiplying the input by its weight matrix and adding its bias vector to the result (Y = Ax + b, where Y = output, x = input, A = weights, b = bias).
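To make Y = Ax + b concrete, here is a tiny plain-Python sketch of what a linear layer computes for one sample. The function linear and the toy numbers are made up for illustration; nn.Linear does the same arithmetic with learned weights.

```python
def linear(x, A, b):
    """Return y = A x + b, where each row of A produces one output value."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]

# Hypothetical toy layer: 3 inputs -> 2 outputs.
A = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.1, -0.2]
x = [2.0, 4.0, 6.0]

y = linear(x, A, b)
print(y)  # two values, one per output neuron
```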
nn.ReLU(),
ReLU is an example of an activation function. What this function does is apply some sort of transformation to the output of the last layer (the layer discussed above), and outputs the result of that transformation. In this case the function being used is the ReLU function, which is defined as ReLU(x) = max(x, 0). Activation functions are used in neural networks because they create non-linearities. This allows your model to model non-linear relationships.
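A minimal plain-Python sketch of ReLU(x) = max(x, 0), applied element by element to a layer's output (the helper name relu and the sample values are illustrative only):

```python
def relu(values):
    """Replace every negative entry with 0 and keep the rest unchanged."""
    return [max(v, 0.0) for v in values]

activated = relu([-3.9, 0.0, 5.8])
print(activated)
```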
nn.Linear(hidden_sizes[0], hidden_sizes[1]),
From what we discussed above, this is another example of a layer. It takes an input of size hidden_sizes[0] (the same shape as the output of the last layer) and outputs a 1D vector of length hidden_sizes[1].
nn.ReLU(),
Applies the ReLU function again.
nn.Linear(hidden_sizes[1], output_size)
Same as the above two layers, but our output shape is the output_size this time.
nn.Softmax(dim=1))
Another activation function. This activation function turns the logits outputted by nn.Linear into an actual probability distribution. This lets the model output the probability for each class. At this point our model is built.
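As an illustration of how softmax turns logits into probabilities, here is a plain-Python sketch. The function name softmax and the example logits are assumptions; subtracting the maximum is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Exponentiate each logit and divide by the total so the outputs sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest logit gets the largest probability, and the probabilities add up to 1 (up to floating-point rounding).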
# Forward pass through the network and display output
images, labels = next(iter(trainloader))
images.resize_(images.shape[0], 1, 784)
print(images.shape)
These lines simply preprocess the training data and put it into the correct format.
ps = model.forward(images[0,:])
This passes the first image through the model (forward pass), applying the operations discussed above layer by layer. The resulting output, ps, holds the probability for each class.
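Putting the pieces together, the following sketch runs the same architecture end to end. It uses a random tensor as a stand-in for the batch from trainloader (an assumption, since no dataset is loaded here), so the probabilities themselves are meaningless, but the shapes and the sum-to-one property can be checked.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.Softmax(dim=1))

# Random stand-in for a batch of 64 flattened 28x28 images.
images = torch.rand(64, 1, 784)

with torch.no_grad():                 # no gradients needed just to inspect the output
    ps = model(images[0, :])          # input shape (1, 784)

predicted_class = ps.argmax(dim=1).item()
print(ps.shape)                       # one row with 10 class probabilities
print(ps.sum().item())                # approximately 1
print(predicted_class)
```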
The output layer of my CNN should use the RBF function, described as "each neuron outputs the square of the Euclidean distance between its input vector and its weight vector". I've implemented this as
dense2 = tf.square(tf.norm(dense1 - tf.transpose(dense2_W)))
where dense1 is a tensor of shape (?, 84). I've tried declaring dense2_W, the weights, as a variable of shape (84, 10) since it's doing number classification and should have 10 outputs. Running the code with a batch of 100 I get this error: InvalidArgumentError: Incompatible shapes: [100,84] vs. [10,84]. I believe it is due to the subtraction.
I train the network by iterating this code:
x_batch, y_batch = mnist.train.next_batch(100)
x_batch = tf.pad(x_batch, [[0,0],[2,2],[2,2],[0,0]]).eval() # Pad 28x28 -> 32x32
sess.run(train_step, {X: x_batch, Y: y_batch})
and then test it using the entire test set, thus the batch size in the network must be dynamic.
How can I work around this? The batch size must be dynamic, as in dense1's case, but I don't understand how to make a variable with dynamic size and transposing it (dense2_W).
You need the shapes of the two tensors to match. Assuming you want to share the weights across the batch and also have a separate set of weights for each output class, you can reshape both tensors so that they broadcast correctly, e.g.:
# dense1 is [batch, 84] -> [batch, 84, 1]; broadcasting will copy the input to every output class neuron
input_dense = tf.expand_dims(dense1, axis=2)
# dense2_W is [84, 10] -> [1, 84, 10], so no transpose is needed; broadcasting here will copy the weights across the batch
weights = tf.expand_dims(dense2_W, axis=0)
# the difference is [batch, 84, 10]; taking the norm over axis=1 reduces the 84 features
dense2 = tf.square(tf.norm(input_dense - weights, axis=1))
The resulting tensor dense2 should have shape of [batch_size, num_classes], which is [100, 10] in your case (so it will hold logits for every data instance over the number of output classes)
EDIT: added axis argument to the tf.norm call so that the distance is computed in the hidden dimension (not over the whole matrices).
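The broadcasting idea can be checked outside of TensorFlow with a small NumPy sketch. It assumes dense2_W has shape (84, 10) as stated in the question, uses random data in place of the real activations, and compares one entry against the definition (squared Euclidean distance between an input vector and one class's weight vector).

```python
import numpy as np

rng = np.random.default_rng(0)
dense1 = rng.normal(size=(100, 84))     # stand-in for the layer input, [batch, 84]
dense2_W = rng.normal(size=(84, 10))    # one weight column per output class

# [100, 84, 1] - [1, 84, 10] broadcasts to [100, 84, 10].
diff = dense1[:, :, None] - dense2_W[None, :, :]
dense2 = np.sum(diff ** 2, axis=1)      # squared distance per (sample, class): [100, 10]

# Check one entry directly against the definition.
expected = np.sum((dense1[3] - dense2_W[:, 7]) ** 2)
print(dense2.shape, np.isclose(dense2[3, 7], expected))
```

Because the batch axis is never hard-coded, the same expression works for any batch size.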
My current LSTM network looks like this.
rnn_cell = tf.contrib.rnn.BasicRNNCell(num_units=CELL_SIZE)
init_s = rnn_cell.zero_state(batch_size=1, dtype=tf.float32) # very first hidden state
outputs, final_s = tf.nn.dynamic_rnn(
    rnn_cell,               # cell you have chosen
    tf_x,                   # input
    initial_state=init_s,   # the initial hidden state
    time_major=False,       # False: (batch, time step, input); True: (time step, batch, input)
)
# reshape 3D output to 2D for fully connected layer
outs2D = tf.reshape(outputs, [-1, CELL_SIZE])
net_outs2D = tf.layers.dense(outs2D, INPUT_SIZE)
# reshape back to 3D
outs = tf.reshape(net_outs2D, [-1, TIME_STEP, INPUT_SIZE])
Usually, I apply tf.layers.batch_normalization for batch normalization. But I am not sure whether this works in an LSTM network.
b1 = tf.layers.batch_normalization(outputs, momentum=0.4, training=True)
d1 = tf.layers.dropout(b1, rate=0.4, training=True)
# reshape 3D output to 2D for fully connected layer
outs2D = tf.reshape(d1, [-1, CELL_SIZE])
net_outs2D = tf.layers.dense(outs2D, INPUT_SIZE)
# reshape back to 3D
outs = tf.reshape(net_outs2D, [-1, TIME_STEP, INPUT_SIZE])
If you want to use batch norm for an RNN (LSTM or GRU), you can check out this implementation, or read the full description in the blog post.
However, layer normalization has advantages over batch norm for sequence data. Specifically, "the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent networks" (from the paper Ba, et al. Layer normalization).
Layer normalization normalizes the summed inputs within each layer. You can check out the implementation of layer normalization for a GRU cell:
Based on this paper: "Layer Normalization" - Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
TensorFlow now comes with tf.contrib.rnn.LayerNormBasicLSTMCell, an LSTM unit with layer normalization and recurrent dropout.
Find the documentation here.
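To illustrate why layer normalization does not depend on the mini-batch, here is a small NumPy sketch of the core computation. It is not the TensorFlow cell mentioned above; the function layer_norm, the gain and bias arguments, and the sizes are assumptions made for the example.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize each sample over its own features (last axis), then scale and shift."""
    mean = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    return gain * (a - mean) / np.sqrt(var + eps) + bias

rng = np.random.default_rng(0)
summed_inputs = rng.normal(size=(4, 16)) * 3.0 + 5.0   # [batch, units]
gain = np.ones(16)
bias = np.zeros(16)

normalized = layer_norm(summed_inputs, gain, bias)

# Normalizing a single sample on its own gives the same result as inside the batch.
alone = layer_norm(summed_inputs[:1], gain, bias)
print(np.allclose(alone, normalized[:1]))
```

The statistics are computed per sample, so the result is the same for a batch of 1 or 100, which is what makes it easy to apply at every time step of a recurrent network.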
I am referring to this sample code,
in the code snippet below:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)
# Construct the variables for the NCE loss
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
Now, the NCE loss function is nothing but a single-hidden-layer neural network with softmax at the output layer [knowing that it takes only a few negative samples].
This part of the graph will only update the weights of the network; it does nothing to the "embeddings" matrix/tensor.
So ideally, once the network is trained, we must pass the input through the embeddings matrix once more and then multiply by the transpose of "nce_weights" [treating it as an auto-encoder with the same weights at the input and output layers] to reach the hidden layer representation of each word, which we are calling word2vec (?)
But if you look at the later part of the code, the value of the embeddings matrix is being used as the word representation.
Even the TensorFlow doc for NCE loss mentions inputs (to which we are passing embed, which uses embeddings) as just the first-layer input activation values.
inputs: A Tensor of shape [batch_size, dim]. The forward activations of the input network.
Normal backpropagation stops at the first layer of the network.
Does this implementation of NCE loss go beyond that and propagate the loss to the input values (and hence to the embeddings)?
This seems like an extra step?
Refer to this for why I am calling it an extra step; it gives the same explanation.
What I have figured out from reading and going through TensorFlow is that
the entire thing is indeed a single-hidden-layer neural network, an auto-encoder. But the weights are not tied, as I had assumed.
The encoder is made of the weight matrix embeddings and the decoder is made of nce_weights. And embed is nothing but the hidden layer output, obtained by multiplying the input with embeddings.
So with this, both embeddings and nce_weights will be updated in the graph. We can choose either of the two weight matrices; embeddings is preferred here.
Edit1:
Actually, for both tf.nn.nce_loss and tf.nn.sampled_softmax_loss, the weights and biases parameters define the input, Weights(transpose) X + bias, to the objective function, which can be logistic regression or softmax [refer].
But backpropagation/gradient descent runs down to the very base of the graph you are building and does not stop at the weights and biases of the function. Hence the inputs parameter of both tf.nn.nce_loss and tf.nn.sampled_softmax_loss is also updated, and it is in turn built from the embeddings matrix.
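The point that gradients flow past the output weights into the embedding matrix can be checked with a small sketch. Note the swaps: it uses PyTorch rather than TensorFlow, and a full softmax cross-entropy in place of NCE, purely to keep the example short. The sizes and the names out_layer and updated_rows are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocabulary_size, embedding_size = 20, 8      # hypothetical toy sizes

embeddings = nn.Embedding(vocabulary_size, embedding_size)   # the "encoder"
out_layer = nn.Linear(embedding_size, vocabulary_size)       # plays the role of nce_weights / nce_biases

train_inputs = torch.tensor([1, 5, 5, 7])
train_labels = torch.tensor([2, 4, 6, 8])

embed = embeddings(train_inputs)             # hidden layer output
loss = nn.functional.cross_entropy(out_layer(embed), train_labels)
loss.backward()

# Rows of the embedding matrix that received a non-zero gradient.
grad = embeddings.weight.grad
updated_rows = grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(updated_rows)                          # expected to be the looked-up word ids
print(out_layer.weight.grad is not None)
```

Only the embedding rows that were looked up in the batch receive a gradient, while the output weights are updated as well, which matches the untied encoder/decoder picture above.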