I have developed an encoder (CNN)-decoder (RNN) network for image captioning in PyTorch. During training, the decoder takes two inputs: the context feature vector from the encoder and the word embeddings of the caption. The context feature vector has size embed_size, which is also the embedding size of each word in the caption. My question concerns the output of the DecoderRNN class. Please refer to the code below.
class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super(DecoderRNN, self).__init__()
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.num_layers = num_layers
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first = True)

    def forward(self, features, captions):
        embeddings = self.embed(captions)
        embeddings = torch.cat((features.unsqueeze(1), embeddings),1)
        hiddens,_ = self.lstm(embeddings)
        outputs = self.linear(hiddens)
        return outputs
In the forward function, I send in a sequence of shape (batch_size, caption_length+1, embed_size) (the concatenation of the context feature vector and the embedded caption). The output should be the caption scores, with shape (batch_size, caption_length, vocab_size), but I still receive an output of shape (batch_size, caption_length+1, vocab_size). Can anyone please suggest what I should alter in my forward function so that the extra step in the second dimension is not produced? Thanks in advance.
An LSTM (like any RNN) produces one output per time step, so an input of length caption_length + 1 yields an output of length caption_length + 1; I do not see any problem here. To get the required output, make the second dimension of the input caption_length. Alternatively, people usually add an < END of SENTENCE > tag to the target, so that the target length is caption_length + 1 and matches the output.
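A minimal sketch of the first option, assuming captions holds token ids of shape (batch_size, caption_length): dropping the last token before embedding makes the concatenated input caption_length steps long, so the output has one score vector per caption position. The sizes below are illustrative and not taken from the question.

```python
import torch
import torch.nn as nn

class DecoderRNN(nn.Module):
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the last token: (batch, caption_length - 1, embed_size)
        embeddings = self.embed(captions[:, :-1])
        # Prepend the image feature: (batch, caption_length, embed_size)
        embeddings = torch.cat((features.unsqueeze(1), embeddings), dim=1)
        hiddens, _ = self.lstm(embeddings)
        return self.linear(hiddens)  # (batch, caption_length, vocab_size)

# Illustrative sizes (assumptions, not from the question)
batch_size, caption_length, embed_size, hidden_size, vocab_size = 4, 7, 16, 32, 50
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
features = torch.randn(batch_size, embed_size)
captions = torch.randint(0, vocab_size, (batch_size, caption_length))
outputs = decoder(features, captions)
print(outputs.shape)
```

With this layout the output at step t is trained to predict captions[:, t], so the full captions tensor can serve as the target.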
I recently did a massive refactor to my PyTorch LSTM code, in order to support multitask learning. I created an MTLWrapper, which holds a BaseModel (which can be one of several variations on a regular LSTM network), which remained the same as it was before the refactor, minus a linear hidden2tag layer (takes hidden sequence and converts to tag space), which now sits in the wrapper. The reason for this is that for multitask learning, all the parameters are shared, except for the final linear layer, which I have one of for each task. These are stored in a nn.ModuleList, not just a regular python list.
What happens now is that my forward pass returns a list of tag scores tensors (one for each task), rather than a single tensor of the tag scores for a single task. I compute the losses for each of these tasks and then try to backpropagate with the average of these losses (technically also averaged over all the sentences of a batch, but this was true before the refactor too). I call model.zero_grad() before running the forward pass on each sentence in a batch.
I don't know exactly where it happened, but after this refactor, I started getting this error (on the second batch):
RuntimeError: Trying to backward through the graph a second time, but
the buffers have already been freed. Specify retain_graph=True when
calling backward the first time.
Following the advice, I added the retain_graph=True flag, but now I get the following error instead (also on the second backward step):
RuntimeError: one of the variables needed for gradient computation has
been modified by an inplace operation: [torch.FloatTensor [100, 400]],
which is output 0 of TBackward, is at version 2; expected version 1
instead. Hint: the backtrace further above shows the operation that
failed to compute its gradient. The variable in question was changed
in there or anywhere later. Good luck!
The hint in the backtrace is not actually helpful, because I have no idea where a tensor of the shape [100, 400] even came from - I don't have any parameters of size 400.
I have a sneaky suspicion that the problem is actually that I shouldn't need the retain_graph=True, but I have no way to confirm that vs. finding the mystery variable that is being changed according to the second error. Either way, I'm at a complete loss how to solve this issue. Any help is appreciated!
Code snippets:
import torch
import torch.nn as nn
import torch.nn.functional as F
class MTLWrapper(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, dropout,..., directions=1, device='cpu', model_type):
        super(MTLWrapper, self).__init__()
        self.base_model = model_type(embedding_dim, hidden_dim, dropout, ..., directions, device)
        # One linear tagger per task; everything else is shared
        self.linear_taggers = []
        for tagset_size in tagset_sizes:
            self.linear_taggers.append(nn.Linear(hidden_dim*directions, tagset_size))
        self.linear_taggers = nn.ModuleList(self.linear_taggers)

    def init_hidden(self, hidden_dim):
        return self.base_model.init_hidden(hidden_dim)

    def forward(self, sentence):
        lstm_out = self.base_model.forward(sentence)
        tag_scores = []
        for linear_tagger in self.linear_taggers:
            tag_space = linear_tagger(lstm_out.view(len(sentence), -1))
            # dim=1 normalises over the tag dimension (explicit dim avoids the deprecation warning)
            tag_scores.append(F.log_softmax(tag_space, dim=1))
        tag_scores = torch.stack(tag_scores)
        return tag_scores
Inside the train function:
for i in range(math.ceil(len(train_sents)/batch_size)):
    batch = r[i*batch_size:(i+1)*batch_size]
    losses = []
    for j in batch:
        sentence = train_sents[j]
        tags = train_tags[j]
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()
        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden(hidden_dim)
        sentence_in = sentence
        targets = tags
        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)
        loss = [loss_function(tag_scores[i], targets[i]) for i in range(len(tag_scores))]
        loss = torch.stack(loss)
        avg_loss = sum(loss)/len(loss)
        losses.append(avg_loss)
    losses = torch.stack(losses)
    total_loss = sum(losses)/len(losses) # average over all sentences in batch
    total_loss.backward(retain_graph=True)
    running_loss += total_loss.item()
    optimizer.step()
    count += 1
And code for one possible BaseModel (the others are practically identical):
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, dropout, vocab_size, alphabet_size,
                 directions=1, device='cpu'):
        super(LSTMTagger, self).__init__()
        self.device = device
        self.hidden_dim = hidden_dim
        self.directions = directions
        self.dropout = nn.Dropout(dropout)
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, dropout=dropout, bidirectional=directions == 2)
        # The linear layer that maps from hidden state space to tag space
        self.hidden = self.init_hidden(hidden_dim)

    def init_hidden(self, dim):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the PyTorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(self.directions, 1, dim).to(device=self.device),
                torch.zeros(self.directions, 1, dim).to(device=self.device))

    def forward(self, sentence):
        word_idxs = []
        for word in sentence:
            word_idxs.append(word[0])
        embeds = self.word_embeddings(torch.LongTensor(word_idxs).to(device=self.device))
        lstm_out, self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1), self.hidden)
        lstm_out = self.dropout(lstm_out)
        return lstm_out
The problem was that when I reset the hidden state of the model (model.hidden = model.init_hidden(hidden_dim)), I assigned the reinitialized state only to the MTLWrapper (which doesn't technically even use a hidden state) and not to the BaseModel. The BaseModel's self.hidden therefore still pointed at the graph of the previous sentence.
I amended my MTLWrapper's init_hidden() function as follows:
class MTLWrapper(nn.Module):
    def init_hidden(self, hidden_dim):
        self.base_model.hidden = self.base_model.init_hidden(hidden_dim)
        return self.base_model.init_hidden(hidden_dim)
This resolved the first error, and my code runs without the retain_graph=True flag.
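The failure mode can be reproduced in isolation. The sketch below is not the code from the question; it uses a bare nn.LSTM with made-up sizes to show that carrying a hidden state over from a previous forward pass makes the second backward() reach into a graph whose buffers were already freed, while a freshly initialised state avoids this. The helper names fresh_hidden and second_backward_fails are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3)
x = torch.randn(5, 1, 4)  # (seq_len, batch, input_size)

def fresh_hidden():
    # Zero (h_0, c_0) with no autograd history
    return (torch.zeros(1, 1, 3), torch.zeros(1, 1, 3))

def second_backward_fails(reset_hidden):
    hidden = fresh_hidden()
    out, hidden = lstm(x, hidden)
    out.sum().backward()          # frees the buffers of the first graph
    if reset_hidden:
        hidden = fresh_hidden()   # what the corrected init_hidden() achieves
    # Without the reset, `hidden` is still attached to the first graph
    out, hidden = lstm(x, hidden)
    try:
        out.sum().backward()
    except RuntimeError:
        return True
    return False

stale_fails = second_backward_fails(reset_hidden=False)
reset_fails = second_backward_fails(reset_hidden=True)
print(stale_fails, reset_fails)
```

Calling .detach() on the carried-over state would also cut the link to the old graph, if keeping the state values across sentences were actually desired.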
My model takes an image as input, but I also need to feed in some floats as supporting information about the image. I don't want them to go through all the convolutions; I want them to go directly to my dense layers. I know about the concatenate layer, but I don't know how to use it on the input, or whether that is how it should be done.
Assume you have a backbone, which can be any convolutional neural network (VGG, ResNet, etc.). Before the dense layer you usually have a Flatten() layer (or, in modern networks, a pooling layer such as GAP or GeM) that prepares a 1D feature vector per image as input to your Dense layer. That is where you can concatenate your floats.
Code example using model subclassing:
import tensorflow as tf

class MyModel(tf.keras.Model):
    def __init__(self, num_output_classes):
        super().__init__()
        self.backbone = tf.keras.applications.ResNet50(
            input_shape=(224, 224, 3), include_top=False)
        self.pool = tf.keras.layers.GlobalAveragePooling2D()
        self.concat = tf.keras.layers.Concatenate(axis=-1)
        # The activation is set on the layer; Dense does not accept it at call time
        self.dense = tf.keras.layers.Dense(num_output_classes, activation='softmax')

    def call(self, inputs):
        # Unpack the inputs. `additional_floats` should have shape (batch_size, num_floats)
        image, additional_floats = inputs
        # Run image through backbone and get a feature vector
        x = self.backbone(image)
        x = self.pool(x)
        # Concatenate with your additional floats
        x = self.concat([x, additional_floats])
        # Classification, or whatever you might need on top
        return self.dense(x)
I'm new to PyTorch. I followed a tutorial on sentence generation with an RNN and I'm trying to modify it to generate sequences of positions. However, I'm having trouble defining the correct model parameters such as input_size, output_size, hidden_dim and batch_size.
Background:
I have 596 sequences of x,y positions, each looking like [[x1,y1],[x2,y2],...,[xn,yn]]. Each sequence represents the 2D path of a vehicle. I would like to train a model that, given a starting point (or a partial sequence), could generate one of these sequences.
-I have padded/truncated the sequences so that they all have length 50, meaning each sequence is an array of shape [50,2]
-I then divided this data into input_seq and target_seq:
input_seq: tensor of torch.Size([596, 49, 2]). contains all the 596 sequences, each without its last position.
target_seq: tensor of torch.Size([596, 49, 2]). contains all the 596 sequences, each without its first position.
The model class:
class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(Model, self).__init__()
        # Defining some parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers
        #Defining the layers
        # RNN Layer
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)
        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        batch_size = x.size(0)
        # Initializing hidden state for first input using method defined below
        hidden = self.init_hidden(batch_size)
        # Passing in the input and hidden state into the model and obtaining outputs
        out, hidden = self.rnn(x, hidden)
        # Reshaping the outputs such that it can be fit into the fully connected layer
        out = out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we'll use in the forward pass
        # We'll send the tensor holding the hidden state to the device we specified earlier as well
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim)
        return hidden
I instantiate the model with the following parameters:
input_size of 2 (an [x,y] position)
output_size of 2 (an [x,y] position)
hidden_dim of 2 (an [x,y] position) (or should this be 50 as in the length of a full sequence?)
model = Model(input_size=2, output_size=2, hidden_dim=2, n_layers=1)
n_epochs = 100
lr=0.01
# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# Training Run
for epoch in range(1, n_epochs + 1):
    optimizer.zero_grad() # Clears existing gradients from previous epoch
    output, hidden = model(input_seq)
    loss = criterion(output, target_seq.view(-1).long())
    loss.backward() # Does backpropagation and calculates gradients
    optimizer.step() # Updates the weights accordingly
    if epoch%10 == 0:
        print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
        print("Loss: {:.4f}".format(loss.item()))
When I run the training loop, it fails with this error:
ValueError Traceback (most recent call last)
<ipython-input-9-ad1575e0914b> in <module>
3 optimizer.zero_grad() # Clears existing gradients from previous epoch
4 output, hidden = model(input_seq)
----> 5 loss = criterion(output, target_seq.view(-1).long())
6 loss.backward() # Does backpropagation and calculates gradients
7 optimizer.step() # Updates the weights accordingly
...
ValueError: Expected input batch_size (29204) to match target batch_size (58408).
I tried modifying input_size, output_size, hidden_dim and batch_size and reshaping the tensors, but the more I try the more confused I get. Could someone point out what I am doing wrong?
Furthermore, since batch size is defined as x.size(0) in Model.forward(self,x), this means I only have a single batch of size 596 right? What would be the correct way to have multiple smaller batches?
The output has size [batch_size * seq_len, 2] = [29204, 2], whereas the flattened target_seq has size [batch_size * seq_len * 2] = [58408]. The two tensors hold the same total number of elements but have a different number of dimensions, so their first dimensions do not match.
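The two numbers in the error message follow directly from the shapes given in the question. As a quick check in plain Python (the variable names are illustrative):

```python
batch_size, seq_len, coords = 596, 49, 2

# Model output after .view(-1, hidden_dim) and the linear layer
output_shape = (batch_size * seq_len, coords)
# Target after .view(-1)
target_shape = (batch_size * seq_len * coords,)

print(output_shape)  # (29204, 2)
print(target_shape)  # (58408,)
```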
Regardless of the dimension mismatch, nn.CrossEntropyLoss is a classification loss, which means it expects the output to score a set of classes. You don't have any classes; you are trying to predict coordinates, which are continuous values. For this you need a regression loss such as nn.MSELoss, which computes the mean squared error between the predicted and target coordinates.
criterion = nn.MSELoss()
# .flatten() does the same thing as .view(-1) but is more descriptive
loss = criterion(output.flatten(), target_seq.flatten())
The flattening can be avoided altogether, because both the loss functions and the linear layer operate on multidimensional inputs. That removes the risk of getting lost while flattening and restoring dimensions, and it makes the output easier to inspect or to use later outside of training. For the linear layer, only the last dimension of the input needs to match the in_features of nn.Linear, which is hidden_dim in your case.
def forward(self, x):
    batch_size = x.size(0)
    # Initializing hidden state for first input using method defined below
    hidden = self.init_hidden(batch_size)
    # Passing in the input and hidden state into the model and obtaining outputs
    # out size: [batch_size, seq_len, hidden_dim]
    out, hidden = self.rnn(x, hidden)
    # out size: [batch_size, seq_len, output_size]
    out = self.fc(out)
    return out, hidden
Now the output of the model has the same size as the target_seq and you can directly call the loss function without flattening:
loss = criterion(output, target_seq)
hidden_dim of 2 (an [x,y] position) (or should this be 50 as in the length of a full sequence?)
The hidden_dim is not a pair of [x, y] and is completely unrelated to both input_size and output_size. It defines the number of hidden features of the RNN, which roughly determines its capacity: larger sizes have more room to retain essential information but also require more computation. There is no perfect hidden size; it largely depends on the use case. You can experiment with different sizes, e.g. 100, 256, etc., and see whether that improves your results.
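To get a feel for how hidden_dim drives model size, the sketch below counts the trainable parameters of a single-layer nn.RNN followed by an nn.Linear, based on the weight shapes PyTorch documents for these layers (weight_ih, weight_hh and two bias vectors for the RNN; a weight matrix and a bias for the linear layer). The helper name count_params is made up for illustration.

```python
def count_params(input_size, hidden_dim, output_size):
    # nn.RNN, one layer: weight_ih + weight_hh + bias_ih + bias_hh
    rnn = hidden_dim * input_size + hidden_dim * hidden_dim + 2 * hidden_dim
    # nn.Linear: weight + bias
    fc = output_size * hidden_dim + output_size
    return rnn + fc

for hidden_dim in (2, 100, 256):
    print(hidden_dim, count_params(2, hidden_dim, 2))
```

The hidden-to-hidden matrix grows quadratically with hidden_dim, which is why larger sizes become noticeably more expensive.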
Furthermore, since batch size is defined as x.size(0) in Model.forward(self,x), this means I only have a single batch of size 596 right? What would be the correct way to have multiple smaller batches?
Yes, you only have a single batch of size 596. If you want to use smaller batches, for example if you cannot fit all of them into a more complex model, you could easily use slices of them, but it would be better to use PyTorch's data utilities: torch.utils.data.TensorDataset to get a dataset, where each sequence of the input has a corresponding target, in combination with torch.utils.data.DataLoader to create batches for you.
from torch.utils.data import DataLoader, TensorDataset
# Match each sequence of the input_seq to the corresponding target_seq.
# e.g. dataset[0] == (input_seq[0], target_seq[0])
dataset = TensorDataset(input_seq, target_seq)
# Randomly shuffle the data and load it in batches of 16
data_loader = DataLoader(dataset, batch_size=16, shuffle=True)
# Process one batch at a time
# (named `inputs` to avoid shadowing the built-in `input`)
for inputs, target in data_loader:
    output, hidden = model(inputs)
    loss = criterion(output, target)
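Putting the pieces of this answer together, a complete mini-batch training loop might look like the sketch below. The data is a random stand-in for the 596 padded paths, and hidden_dim=32 and the two epochs are arbitrary choices for illustration, not recommendations.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        out, hidden = self.rnn(x)  # the hidden state defaults to zeros
        return self.fc(out), hidden

torch.manual_seed(0)
# Random stand-in for the 596 padded paths of shape [50, 2]
paths = torch.randn(596, 50, 2)
input_seq, target_seq = paths[:, :-1], paths[:, 1:]

model = Model(input_size=2, output_size=2, hidden_dim=32, n_layers=1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
data_loader = DataLoader(TensorDataset(input_seq, target_seq),
                         batch_size=16, shuffle=True)

n_batches = 0
for epoch in range(2):
    for inputs, targets in data_loader:
        optimizer.zero_grad()              # clear gradients from the previous batch
        output, _ = model(inputs)          # [batch, 49, 2]
        loss = criterion(output, targets)  # no flattening needed
        loss.backward()
        optimizer.step()
        n_batches += 1
print(n_batches, output.shape, loss.item())
```

Note that the last batch of each epoch is smaller than 16 because 596 is not a multiple of 16; pass drop_last=True to DataLoader if a fixed batch size is required.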
class LSTM(nn.Module):
    def __init__(self, input_size=1, output_size=1, hidden_size=100, num_layers=16):
        super().__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        self.linear = nn.Linear(hidden_size, output_size)
        self.num_layers = num_layers
        self.hidden_cell = (torch.zeros(self.num_layers,12 ,self.hidden_size).to(device),
                            torch.zeros(self.num_layers,12 ,self.hidden_size).to(device))

    def forward(self, input_seq):
        #lstm_out, self.hidden_cell = self.lstm(input_seq.view(len(input_seq) ,1, -1), self.hidden_cell)
        lstm_out, self.hidden_cell = self.lstm(input_seq, self.hidden_cell)
        predictions = self.linear(lstm_out[:,-1,:])
        return predictions
This is my LSTM model. The input is a 4-dimensional vector, the batch size is 16, and there are 12 time steps. I want to predict the 13th vector from a sequence of 12 vectors. My LSTM block produces an output of shape [16,12,48]. I do not understand why I have to choose the last one:
out[:,-1,:]
From how it looks, your problem is like a text (i.e., sequence) classification problem, with output_size being the number of classes that you want to assign text to. By choosing lstm_out[:,-1,:], you actually intend to predict the label associated with the input text only using the last hidden state of your LSTM network, which totally makes sense. This is what people commonly do for text classification problems. Your linear layer, thereafter, will output logits for each class, and then you can use nn.Softmax() to get the probabilities of those.
The last hidden state of the LSTM network is the propagation of all previous hidden states of the LSTM, meaning that it has the aggregated information of previous input states that it has encoded (let's consider that you're using uni-directional LSTM as in your example). So for classifying an input text, you will have to do the classification based on the overall information of all tokens within the input text (which is encoded in the last hidden state of your LSTM). That is why you feed only the last hidden state to the linear layer that is upon your LSTM network.
Note: If you intended to do sequence-tagging (such as Named-entity recognition), then you would use all the hidden state outputs from your LSTM network. In such tasks, you would actually need information about a specific token within the input.
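The claim that the last time step summarises the sequence can be checked directly: for a unidirectional LSTM, the output at the last step is the final hidden state of the top layer. The sketch below assumes batch_first=True, so that the tensor layout is [batch, time, features] as in the shapes quoted in the question; the sizes are taken from the question, the data is random.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=48, num_layers=1, batch_first=True)
x = torch.randn(16, 12, 4)  # (batch, time steps, features)

lstm_out, (h_n, c_n) = lstm(x)
last_step = lstm_out[:, -1, :]             # (16, 48): output at time step 12
same = torch.allclose(last_step, h_n[-1])  # compare with the final hidden state
print(lstm_out.shape, last_step.shape, same)
```

For sequence tagging, by contrast, the whole lstm_out tensor would be passed to the linear layer, giving one prediction per time step.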
I'm trying to create a model which has words as inputs. Most of those words are in the glove word vector set (~50000). However, some of the frequent words are not (~1000). The question is, how do I concatenate the following two embedding layers to create one giant Embedding lookup table?
trained_em = Embedding(50000, 50,
weights=np.array([word2glove[w] for w in words_in_glove]),
trainable=False)
untrained_em = Embedding(1000, 50)
As far as I understand these are simply two lookup tables with same number of dimensions. So I'm hoping that there is a way to stack these two lookup tables.
Edit 1:
I just realised that this is probably going to be more than stacking Embedding layers because the input sequence would be a number from 0-50999. However untrained_em above only expect a number from 0-999. So perhaps a different solution is required.
Edit 2:
This is what I would expect to do in a numpy array representing the Embedding:
np.random.seed(42) # Set seed for reproducibility
pretrained = np.random.randn(15,3)
untrained = np.random.randn(5,3)
final_embedding = np.vstack([pretrained, untrained])
word_idx = [2, 5, 19]
np.take(final_embedding, word_idx, axis=0)
I believe the last bit can be done with something to do with keras.backend.gather but not sure how to put it all together.
It turns out that I needed to implement a custom layer, which I did by tweaking the original Embedding class.
The two most important parts shown in the class below are self.embeddings = K.concatenate([fixed_weight, variable_weight], axis=0) and out = K.gather(self.embeddings, inputs). The first is hopefully self explanatory while the second picks out the relevant input rows from the embeddings table.
However, in the particular application that I'm working on, it turns out that a plain Embedding layer works better than the modified layer, perhaps because the learning rate is too high. I will report back on this after I have experimented more.
from keras.engine.topology import Layer
import keras.backend as K
from keras import initializers
import numpy as np
class Embedding2(Layer):
    def __init__(self, input_dim, output_dim, fixed_weights, embeddings_initializer='uniform',
                 input_length=None, **kwargs):
        kwargs['dtype'] = 'int32'
        if 'input_shape' not in kwargs:
            if input_length:
                kwargs['input_shape'] = (input_length,)
            else:
                kwargs['input_shape'] = (None,)
        super(Embedding2, self).__init__(**kwargs)
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.embeddings_initializer = embeddings_initializer
        self.fixed_weights = fixed_weights
        self.num_trainable = input_dim - len(fixed_weights)
        self.input_length = input_length

    def build(self, input_shape, name='embeddings'):
        initializer = initializers.get(self.embeddings_initializer)
        shape1 = (self.num_trainable, self.output_dim)
        variable_weight = K.variable(initializer(shape1), dtype=K.floatx(), name=name+'_var')
        fixed_weight = K.variable(self.fixed_weights, name=name+'_fixed')
        self._trainable_weights.append(variable_weight)
        self._non_trainable_weights.append(fixed_weight)
        self.embeddings = K.concatenate([fixed_weight, variable_weight], axis=0)
        self.built = True

    def call(self, inputs):
        if K.dtype(inputs) != 'int32':
            inputs = K.cast(inputs, 'int32')
        out = K.gather(self.embeddings, inputs)
        return out

    def compute_output_shape(self, input_shape):
        if not self.input_length:
            input_length = input_shape[1]
        else:
            input_length = self.input_length
        return (input_shape[0], input_length, self.output_dim)
So, my suggestion is to use only one Embedding layer (taking into consideration your indexing problem), and transfer the weights from the old layer to the new one.
So, what you're going to do in this suggestion is...
Create your new model with 51000 words:
inp = Input((1,))
emb = Embedding(51000,50)(inp)
out = the rest of the model.....
model = Model(inp,out)
Now take the embedding layer and give it the weights you had:
weights = np.array([word2glove[w] for w in words_in_glove])
newWeights = model.layers[1].get_weights()[0]
newWeights[:50000,:] = weights
model.layers[1].set_weights([newWeights])
This will give you a new embedding, larger than the previous one, with a great part of its weights already trained, and the remaining randomly initialized.
Unfortunately, you will have to let everything be trained.
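The weight-transfer step can be checked on a toy scale with plain NumPy. In the sketch below, small random arrays stand in for the GloVe matrix and for the array returned by get_weights()[0]; the sizes (15 pretrained rows, 20 rows in total) mirror the question's own NumPy example and are not the real vocabulary sizes.

```python
import numpy as np

rng = np.random.default_rng(42)
pretrained = rng.standard_normal((15, 3))   # stand-in for the GloVe rows
new_weights = rng.standard_normal((20, 3))  # stand-in for layer.get_weights()[0]
initial_tail = new_weights[15:].copy()      # rows for words without a GloVe vector

# Overwrite the first rows with the pretrained vectors, keep the rest random
new_weights[:15, :] = pretrained

word_idx = [2, 5, 19]
looked_up = np.take(new_weights, word_idx, axis=0)
print(looked_up.shape)
```

Indices below 15 now return pretrained vectors and indices from 15 upwards return the randomly initialised rows, which is the behaviour the single larger Embedding layer provides.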