I am looking at an implementation of RCNN for text classification using PyTorch. Full Code. There are two points where the dimensions of tensors are permuted using the permute function. The first is after the LSTM layer and before tanh. The second is after a linear layer and before a max pooling layer.
Could you please explain why the permutation is necessary or useful?
Relevant Code
def forward(self, x):
    # x.shape = (seq_len, batch_size)
    embedded_sent = self.embeddings(x)
    # embedded_sent.shape = (seq_len, batch_size, embed_size)
    lstm_out, (h_n, c_n) = self.lstm(embedded_sent)
    # lstm_out.shape = (seq_len, batch_size, 2 * hidden_size)
    input_features = torch.cat([lstm_out, embedded_sent], 2).permute(1, 0, 2)
    # input_features.shape = (batch_size, seq_len, embed_size + 2*hidden_size)
    linear_output = self.tanh(
        self.W(input_features)
    )
    # linear_output.shape = (batch_size, seq_len, hidden_size_linear)
    linear_output = linear_output.permute(0, 2, 1)  # Reshaping for max_pool
    max_out_features = F.max_pool1d(linear_output, linear_output.shape[2]).squeeze(2)
    # max_out_features.shape = (batch_size, hidden_size_linear)
    max_out_features = self.dropout(max_out_features)
    final_out = self.fc(max_out_features)
    return self.softmax(final_out)
Similar Code in other Repositories
Similar implementations of RCNN use permute or transpose. Here are examples:
https://github.com/prakashpandey9/Text-Classification-Pytorch/blob/master/models/RCNN.py
https://github.com/jungwhank/rcnn-text-classification-pytorch/blob/master/model.py
permute rearranges the dimensions of a tensor into the order you specify. It differs from reshape: after permute, each element keeps its coordinates, only listed in the new dimension order (for a 2-D tensor this is a transpose), whereas reshape keeps the elements in their original row-major order and only changes how that sequence is divided into dimensions.
Example code:
import torch
var = torch.randn(2, 4)
pe_var = var.permute(1, 0)
re_var = torch.reshape(var, (4, 2))
print("Original size:\n{}\nOriginal var:\n{}\n".format(var.size(), var) +
"Permute size:\n{}\nPermute var:\n{}\n".format(pe_var.size(), pe_var) +
"Reshape size:\n{}\nReshape var:\n{}\n".format(re_var.size(), re_var))
Outputs:
Original size:
torch.Size([2, 4])
Original var:
tensor([[ 0.8250, -0.1984, 0.5567, -0.7123],
[-1.0503, 0.0470, -1.9473, 0.9925]])
Permute size:
torch.Size([4, 2])
Permute var:
tensor([[ 0.8250, -1.0503],
[-0.1984, 0.0470],
[ 0.5567, -1.9473],
[-0.7123, 0.9925]])
Reshape size:
torch.Size([4, 2])
Reshape var:
tensor([[ 0.8250, -0.1984],
[ 0.5567, -0.7123],
[-1.0503, 0.0470],
[-1.9473, 0.9925]])
With that in mind, the first permute reorders the concatenated tensor so that the batch comes first, which is the layout the code feeds to self.W. The second permute serves a similar purpose: we want to max-pool linear_output along the sequence, and F.max_pool1d pools along the last dimension, so the sequence dimension has to be moved to the end.
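A quick way to check the second point is to compare F.max_pool1d on the permuted tensor with a plain max over the sequence dimension. This is only a sketch: the sizes are made up, and a random tensor stands in for the output of the linear layer.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, n_features = 3, 5, 4  # illustrative sizes only

# Stand-in for the output of the linear layer: (batch, seq_len, features)
linear_output = torch.randn(batch_size, seq_len, n_features)

# max_pool1d pools over the last dimension, so move seq_len to the end first
pooled = F.max_pool1d(linear_output.permute(0, 2, 1), kernel_size=seq_len).squeeze(2)

# The same result, computed directly as a max over the sequence dimension
direct = linear_output.max(dim=1).values

print(pooled.shape)                 # torch.Size([3, 4])
print(torch.equal(pooled, direct))  # True
```

Both routes keep, for every feature, the largest value found anywhere in the sequence.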
I am adding this answer to provide additional PyTorch-specific details.
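The permute between nn.LSTM and nn.Linear is there because the output layout of the LSTM does not match the input layout documented for Linear.
nn.LSTM returns output, (h_n, c_n). The tensor output has shape (seq_len, batch, num_directions * hidden_size) (see the nn.LSTM documentation). nn.Linear is documented as taking an input of shape (N, ∗, H), where N is the batch size and H is the number of input features (see the nn.Linear documentation). Because nn.Linear only transforms the last dimension, it would also run on the sequence-first tensor; the permute mainly establishes the batch-first layout that the rest of the forward pass relies on.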
It is necessary to use permute between nn.Linear and nn.MaxPool1d because the output of nn.Linear has shape (N, L, C), where N is the batch size, L is the sequence length, and C is the number of features, while nn.MaxPool1d expects an input tensor of shape (N, C, L) (see the nn.MaxPool1d documentation).
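The shapes can be traced end to end with a small stand-alone sketch. The layer sizes below are arbitrary, and lstm and W are stand-ins for the model's own layers rather than the code from the repository. As a side note, it also shows that nn.Linear acts on the last dimension only, so permuting before or after that layer yields the same values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, batch_size, embed_size = 7, 2, 8   # illustrative sizes only
hidden_size, hidden_size_linear = 6, 5

lstm = nn.LSTM(embed_size, hidden_size, bidirectional=True)
W = nn.Linear(embed_size + 2 * hidden_size, hidden_size_linear)

embedded = torch.randn(seq_len, batch_size, embed_size)
lstm_out, _ = lstm(embedded)                       # (seq_len, batch, 2 * hidden_size)
features = torch.cat([lstm_out, embedded], dim=2)  # (seq_len, batch, embed + 2 * hidden)

# nn.Linear transforms only the last dimension, so the order of the
# leading dimensions does not change the computed values.
before = W(features.permute(1, 0, 2))              # (batch, seq_len, hidden_size_linear)
after = W(features).permute(1, 0, 2)
print(torch.allclose(before, after, atol=1e-5))

# max_pool1d needs (batch, channels, length), so seq_len must come last
pooled = F.max_pool1d(torch.tanh(before).permute(0, 2, 1), seq_len).squeeze(2)
print(pooled.shape)                                # torch.Size([2, 5])
```

The second permute is the one that cannot be skipped: without it, max_pool1d would pool over the feature axis instead of over the sequence.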
I reviewed seven implementations of RCNN for text classification with PyTorch on GitHub and gitee and found that permute and transpose are the normal ways to convert the output of one layer to the input of a subsequent layer.
Related
I am trying to train a pretty simple 2-layer neural network for a multi-class classification task. I am using CrossEntropyLoss and I get the following error: ValueError: Expected target size (128, 4), got torch.Size([128]) in my training loop at the point where I am trying to calculate the loss.
My last layer is a softmax so it outputs the probabilities of each of the 4 classes. My target values are a vector of dimension 128 (just the class values). Am I initializing the CrossEntropyLoss object incorrectly?
I looked up existing posts, this one seemed the most relevant:
https://discuss.pytorch.org/t/valueerror-expected-target-size-128-10000-got-torch-size-128-1/29424 However, if I had to squeeze my target values, how would that work? Like right now they are just class values for e.g., [0 3 1 0]. Is that not how they are supposed to look? I would think that the loss function maps the highest probability from the last layer and associates that to the appropriate class index.
Details:
This is using PyTorch
Python version is 3.7
NN architecture is: embedding -> pool -> h1 -> relu -> h2 -> softmax
Model Def (EDITED):
self.embedding_layer = create_embedding_layer(embeddings)
self.pool = nn.MaxPool1d(1)
self.h1 = nn.Linear(embedding_dim, embedding_dim)
self.h2 = nn.Linear(embedding_dim, 4)
self.s = nn.Softmax(dim=2)
forward pass:
x = self.embedding_layer(x)
x = self.pool(x)
x = self.h1(x)
x = F.relu(x)
x = self.h2(x)
x = self.s(x)
return x
The issue is that the output of your model is a tensor shaped as (batch, seq_length, n_classes). Each sequence element in each batch is a four-element tensor corresponding to the predicted probability associated with each class (0, 1, 2, and 3). Your target tensor is shaped (batch,) which is usually the correct shape (you didn't use one-hot-encodings). However, in this case, you need to provide a target for each one of the sequence elements.
Assuming the target is the same for each element of your sequence (this might not be true though and is entirely up to you to decide), you may repeat the targets seq_length times. nn.CrossEntropyLoss allows you to provide additional axes, but you have to follow a specific shape layout:
Input: (N, C) where C = number of classes, or (N, C, d_1, d_2, ..., d_K) with K≥1 in the case of K-dimensional loss.
Target: (N) where each value is 0 ≤ targets[i] ≤ C−1 , or (N, d_1, d_2, ..., d_K) with K≥1 in the case of K-dimensional loss.
In your case, C=4 and seq_length (what you referred to as D) would be d_1.
>>> seq_length = 10
>>> out = torch.rand(128, seq_length, 4) # mocking model's output
>>> y = torch.randint(0, 4, (128,)) # target tensor of class indices in [0, 3]
>>> criterion = nn.CrossEntropyLoss()
>>> out_perm = out.permute(0, 2, 1)
>>> out_perm.shape
torch.Size([128, 4, 10]) # (N, C, d_1)
>>> y_rep = y[:, None].repeat(1, seq_length)
>>> y_rep.shape
torch.Size([128, 10]) # (N, d_1)
Then call your loss function with criterion(out_perm, y_rep).
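Put together as a script, the steps above might look like the sketch below. The tensors are random stand-ins for the model output and the labels. The final comparison is an addition for illustration: it treats every sequence element as its own sample, which should give the same loss under the default mean reduction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_length, n_classes = 128, 10, 4

out = torch.randn(batch_size, seq_length, n_classes)  # mock model output (raw scores)
y = torch.randint(0, n_classes, (batch_size,))        # one class index per sample

criterion = nn.CrossEntropyLoss()

out_perm = out.permute(0, 2, 1)           # (N, C, d_1)
y_rep = y[:, None].repeat(1, seq_length)  # (N, d_1)
loss = criterion(out_perm, y_rep)

# Equivalent formulation: flatten batch and sequence into one sample axis
loss_flat = criterion(out.reshape(-1, n_classes), y_rep.reshape(-1))

print(loss.item(), loss_flat.item())
```

Note that nn.CrossEntropyLoss applies log-softmax internally, so it is normally fed raw scores rather than the output of a Softmax layer.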
Some notes: I'm using tensorflow 2.3.0, python 3.8.2, and numpy 1.18.5 (not sure if that one matters though)
I'm writing a custom layer that stores a non-trainable tensor N of shape (a, b) internally, where a, b are known values (this tensor is created during init). When called on an input tensor, it flattens the input tensor, flattens its stored tensor, and concatenates the two together. Unfortunately, I can't seem to figure out how to preserve the unknown batch dimension during this concatenation. Here's minimal code:
import tensorflow as tf
from tensorflow.keras.layers import Layer, Flatten
class CustomLayer(Layer):
    def __init__(self, N):  # N is a tensor of shape (a, b), where a, b > 1
        super(CustomLayer, self).__init__()
        self.N = self.add_weight(name="N", shape=N.shape, trainable=False, initializer=lambda *args, **kwargs: N)
        # correct me if I'm wrong in using this initializer approach, but for some reason, when I
        # just do self.N = N, this variable would disappear when I saved and loaded the model

    def build(self, input_shape):
        pass  # my reasoning is that all the necessary stuff is handled in init

    def call(self, input_tensor):
        input_flattened = Flatten()(input_tensor)
        N_flattened = Flatten()(self.N)
        return tf.concat((input_flattened, N_flattened), axis=-1)
The first problem I noticed was that Flatten()(self.N) would return a tensor with the same shape (a, b) as the original self.N, and as a result, the returned value would have a shape of (a, num_input_tensor_values+b). My reasoning for this was that the first dimension, a, was treated as the batch size. I modified the call function:
def call(self, input_tensor):
    input_flattened = Flatten()(input_tensor)
    N = tf.expand_dims(self.N, axis=0)  # N would now be shape (1, a, b)
    N_flattened = Flatten()(N)
    return tf.concat((input_flattened, N_flattened), axis=-1)
This would return a tensor with shape (1, num_input_vals + a*b), which is great, but now the batch dimension is permanently 1, which I realized when I started training a model with this layer and it would only work for a batch size of 1. This is also really apparent in the model summary - if I were to put this layer after an input and add some other layers afterwards, the first dimension of the output tensors goes like None, 1, 1, 1, 1.... Is there a way to store this internal tensor and use it in call while preserving the variable batch size? (For example, with a batch size of 4, a copy of the same flattened N would be concatenated onto the end of each of the 4 flattened input tensors.)
You need as many flattened N vectors as there are samples in your input, because you are concatenating one onto every sample. Think of it as pairing up rows and concatenating them: if you have only one N vector, only one pair can be concatenated.
To solve this, you should use tf.tile() to repeat N as many times as there are samples in your batch.
Example:
def call(self, input_tensor):
    input_flattened = Flatten()(input_tensor)  # input_flattened shape: (None, ..)
    N = tf.expand_dims(self.N, axis=0)  # N shape: (1, a, b)
    N_flattened = Flatten()(N)  # N_flattened shape: (1, a*b)
    # repeat along the first dim once per sample in the batch and leave the second dim alone
    N_tiled = tf.tile(N_flattened, [tf.shape(input_tensor)[0], 1])
    return tf.concat((input_flattened, N_tiled), axis=-1)
I am currently working on a neural network that takes some inputs and returns 2 outputs. I used 2 outputs in a regression problem where they both are 2 coordinates, X and Y.
My problem doesn't need the X and Y values themselves but the angle the vector is facing, which is atan2(y, x).
I am trying to create a custom Keras metric and a loss function that perform an atan2 operation on the elements of the predicted tensor and the true tensor, so as to better train the network on my task.
The shape of the output tensor in the metric is [?, 2], and I want a function that loops through the tensor and applies atan2(tensor[itr, 1], tensor[itr, 0]) to each row to get another tensor of angles.
I have tried using tf.split and tf.slice.
I don't want to convert it into a numpy array and back to tensorflow due to performance reasons.
I have tried to get the shape of tensors using tensor.get_shape().as_list() and iterate through it.
self.model.compile(loss="mean_absolute_error",
                   optimizer=tf.keras.optimizers.Adam(lr=0.01),
                   metrics=[vect2d_to_angle_metric])
# This is the function I want to work on
def vect2d_to_angle_metric(y_true, y_predicted):
    print("y_true = ", y_true)
    print("y_predicted = ", y_predicted)
    print("y_true shape = ", tf.shape(y_true))
    print("y_predicted shape = ", tf.shape(y_predicted))
The print out of the above function being
y_true = Tensor("dense_2_target:0", shape=(?, ?), dtype=float32)
y_predicted = Tensor("dense_2/BiasAdd:0", shape=(?, 2), dtype=float32)
y_true shape = Tensor("metrics/vect2d_to_angle_metric/Shape:0", shape=(2,), dtype=int32)
y_predicted shape = Tensor("metrics/vect2d_to_angle_metric/Shape_1:0", shape=(2,), dtype=int32)
Python pseudo-code of the functionality I want to apply to the tensorflow function
def evaluate(self):
    mean_array = []
    for i in range(len(x_test)):
        inputs = x_test[i]
        prediction = self.model.getprediction(i)
        predicted_angle = np.arctan2(result[i][1], result[i][0])
        real_angle = np.arctan2(float(self.y_test[i][1]), float(self.y_test[i][0]))
        mean_array.append(([abs(predicted_angle - real_angle)]/real_angle) * 100)
        i += 1
I expect to slice out the two columns of the tensor, [i][0] and [i][1], apply tf.atan2() to them, and build another tensor from the result, so that I can continue with other calculations and pass it to the custom loss.
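One possible shape for such a metric is sketched below; it is not taken from the thread. It assumes TensorFlow 2 with eager execution and inputs of shape (batch, 2) holding (x, y). Two choices are assumptions of this sketch rather than part of the question: the metric returns the mean absolute angular error in radians instead of the percentage error in the pseudo-code, and the difference is wrapped so that angles just either side of ±pi count as close. No Python loop is needed, because tf.math.atan2 works element-wise on whole columns.

```python
import math
import tensorflow as tf

def vect2d_to_angle_metric(y_true, y_pred):
    """Mean absolute angular error in radians; inputs assumed to be (batch, 2) as (x, y)."""
    # Slice out the columns; atan2 is applied element-wise to the whole batch
    true_angle = tf.math.atan2(y_true[:, 1], y_true[:, 0])
    pred_angle = tf.math.atan2(y_pred[:, 1], y_pred[:, 0])
    diff = pred_angle - true_angle
    # Assumption of this sketch: wrap the difference into [-pi, pi]
    diff = tf.math.atan2(tf.sin(diff), tf.cos(diff))
    return tf.reduce_mean(tf.abs(diff))

y_true = tf.constant([[1.0, 0.0], [0.0, 1.0]])  # angles 0 and pi/2
y_pred = tf.constant([[0.0, 1.0], [0.0, 1.0]])  # angles pi/2 and pi/2
print(float(vect2d_to_angle_metric(y_true, y_pred)))  # approximately pi/4
```

Since every operation here is a differentiable TensorFlow op, the same function could in principle be passed as loss= as well as in metrics=.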
The output layer of my CNN should use the RBF function, described as "each neuron outputs the square of the Euclidean distance between its input vector and its weight vector". I've implemented this as
dense2 = tf.square(tf.norm(dense1 - tf.transpose(dense2_W)))
where dense1 is a tensor of shape (?, 84). I've tried declaring dense2_W, the weights, as a variable of shape (84, 10) since it's doing number classification and should have 10 outputs. Running the code with a batch of 100 I get this error: InvalidArgumentError: Incompatible shapes: [100,84] vs. [10,84]. I believe it is due to the subtraction.
I train the network by iterating this code:
x_batch, y_batch = mnist.train.next_batch(100)
x_batch = tf.pad(x_batch, [[0,0],[2,2],[2,2],[0,0]]).eval() # Pad 28x28 -> 32x32
sess.run(train_step, {X: x_batch, Y: y_batch})
and then test it using the entire test set, thus the batch size in the network must be dynamic.
How can I work around this? The batch size must be dynamic, as in dense1's case, but I don't understand how to make a variable with dynamic size and transposing it (dense2_W).
You need the shapes of the two tensors to be broadcast-compatible. Assuming you want to share the weights across the batch and have a separate set of weights for each output class, you could reshape both tensors so that they broadcast correctly, e.g.:
# broadcasting will copy the input to every output class neuron
input_dense = tf.expand_dims(dense1, axis=2)
# broadcasting here will copy the weights across the batch
weights = tf.expand_dims(dense2_W, axis=0)  # (1, 84, 10); no transpose, so the hidden axis lines up with input_dense
dense2 = tf.square(tf.norm(input_dense - weights, axis=1))
The resulting tensor dense2 should have shape of [batch_size, num_classes], which is [100, 10] in your case (so it will hold logits for every data instance over the number of output classes)
EDIT: added axis argument to the tf.norm call so that the distance is computed in the hidden dimension (not over the whole matrices).
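The broadcasting pattern can be checked outside a TensorFlow graph. The sketch below uses NumPy as a stand-in, since its broadcasting rules are the same, and assumes dense2_W has shape (84, 10) as stated in the question. The random arrays and the loop-based reference are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_hidden, n_classes = 100, 84, 10

dense1 = rng.normal(size=(batch_size, n_hidden))   # stand-in for the layer input
dense2_W = rng.normal(size=(n_hidden, n_classes))  # one weight vector per class

# (batch, hidden, 1) - (1, hidden, classes) -> (batch, hidden, classes)
diff = dense1[:, :, None] - dense2_W[None, :, :]
dense2 = np.sum(diff ** 2, axis=1)                 # squared distance, (batch, classes)

# Reference: explicit loop over samples and classes
expected = np.array([[np.sum((x - dense2_W[:, k]) ** 2) for k in range(n_classes)]
                     for x in dense1])

print(dense2.shape)                   # (100, 10)
print(np.allclose(dense2, expected))  # True
```

Summing the squared differences directly is equivalent to squaring the Euclidean norm and avoids taking a square root only to undo it.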
After reading several articles, I am still quite confused about correctness of my implementation of getting last hidden states from BiLSTM.
1. Understanding Bidirectional RNN in PyTorch (TowardsDataScience)
2. PackedSequence for seq2seq model (PyTorch forums)
3. What's the difference between “hidden” and “output” in PyTorch LSTM? (StackOverflow)
4. Select tensor in a batch of sequences (PyTorch forums)
The approach from the last source (4) seems to be the cleanest for me, but I am still uncertain if I understood the thread correctly. Am I using the right final hidden states from LSTM and reversed LSTM? This is my implementation
# pos contains indices of words in embedding matrix
# seqlengths contains info about sequence lengths
# so for instance, if batch_size is 2 and pos=[4,6,9,3,1] and
# seqlengths contains [3,2], we have batch with samples
# of variable length [4,6,9] and [3,1]
all_in_embs = self.in_embeddings(pos)
in_emb_seqs = pack_sequence(torch.split(all_in_embs, seqlengths, dim=0))
output,lasthidden = self.rnn(in_emb_seqs)
if not self.data_processor.use_gru:
    lasthidden = lasthidden[0]
# u_emb_batch has shape batch_size x embedding_dimension
# sum last state from forward and backward direction
u_emb_batch = lasthidden[-1,:,:] + lasthidden[-2,:,:]
Is it correct?
In a general case if you want to create your own BiLSTM network, you need to create two regular LSTMs, and feed one with the regular input sequence, and the other with inverted input sequence. After you finish feeding both sequences, you just take the last states from both nets and somehow tie them together (sum or concatenate).
As I understand, you are using built-in BiLSTM as in this example (setting bidirectional=True in nn.LSTM constructor). Then you get the concatenated output after feeding the batch, as PyTorch handles all the hassle for you.
If it is the case, and you want to sum the hidden states, then you have to
u_emb_batch = (lasthidden[0, :, :] + lasthidden[1, :, :])
assuming you have only one layer. If you have more layers, your variant seems better.
This is because the result is structured (see documentation):
h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len
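By the way, for an unpacked batch in which all sequences have the same length,
u_emb_batch_2 = output[-1, :, :HIDDEN_DIM] + output[0, :, HIDDEN_DIM:]
should provide the same result: the forward direction finishes at the last time step, whereas the backward direction finishes at the first one.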
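The layout of h_n can be verified on a small multi-layer model. This is a sketch with arbitrary sizes, using an unpacked batch of equal-length sequences. It splits the first axis of h_n into (num_layers, num_directions) and compares the top layer's states with the corresponding slices of output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, input_size = 5, 3, 4  # illustrative sizes only
hidden_size, num_layers = 6, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
x = torch.randn(seq_len, batch_size, input_size)
output, (h_n, c_n) = lstm(x)

# Separate the layer axis from the direction axis
h_n = h_n.reshape(num_layers, 2, batch_size, hidden_size)
last_fwd = h_n[-1, 0]  # top layer, forward direction
last_bwd = h_n[-1, 1]  # top layer, backward direction

# Forward direction finishes at the last time step, backward at the first
print(torch.allclose(last_fwd, output[-1, :, :hidden_size]))
print(torch.allclose(last_bwd, output[0, :, hidden_size:]))
```

After the reshape, h_n[-1, 0] and h_n[-1, 1] are the same tensors as h_n[-2] and h_n[-1] in the original layout, which is why indexing from the end works for any number of layers.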
Here's a detailed explanation for those working with unpacked sequences:
output is of shape (seq_len, batch, num_directions * hidden_size) (see documentation). This means that the output of the forward and backward passes of your GRU are concatenated along the 3rd dimension.
Assuming batch=2 and hidden_size=256 in your example, you can easily separate the outputs of both forward and backward passes by doing:
output = output.view(-1, 2, 2, 256) # (seq_len, batch_size, num_directions, hidden_size)
output_forward = output[:, :, 0, :] # (seq_len, batch_size, hidden_size)
output_backward = output[:, :, 1, :] # (seq_len, batch_size, hidden_size)
(Note: the -1 tells pytorch to infer that dimension from the others. See this question.)
Equivalently, you can use the torch.chunk function on the original output of shape (seq_len, batch, num_directions * hidden_size):
# Split in 2 tensors along dimension 2 (num_directions)
output_forward, output_backward = torch.chunk(output, 2, 2)
Now you can torch.gather the last hidden state of the forward pass using seqlengths (after reshaping it), and take the last hidden state of the backward pass by selecting the element at position 0:
# First we unsqueeze seqlengths two times so it has the same number
# of dimensions as output_forward
# (batch_size) -> (1, batch_size, 1)
lengths = seqlengths.unsqueeze(0).unsqueeze(2)
# Then we expand it accordingly
# (1, batch_size, 1) -> (1, batch_size, hidden_size)
lengths = lengths.expand((1, -1, output_forward.size(2)))
last_forward = torch.gather(output_forward, 0, lengths - 1).squeeze(0)
last_backward = output_backward[0, :, :]
Note that I subtracted 1 from lengths because of the 0-based indexing.
At this point both last_forward and last_backward are of shape (batch_size, hidden_dim).
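The gather step can be tried in isolation. In the sketch below a random tensor stands in for the forward half of a padded RNN output, and seqlengths is assumed to be an integer tensor (in the question it is a Python list, which would need torch.tensor(...) first). The advanced-indexing line is an alternative added for comparison.

```python
import torch

torch.manual_seed(0)
seq_len, batch_size, hidden_size = 6, 3, 4  # illustrative sizes only

# Stand-in for the forward half of a padded (unpacked) RNN output
output_forward = torch.randn(seq_len, batch_size, hidden_size)
seqlengths = torch.tensor([6, 2, 4])        # true length of each sequence

# (batch_size) -> (1, batch_size, 1) -> (1, batch_size, hidden_size)
index = (seqlengths - 1).view(1, -1, 1).expand(1, -1, hidden_size)
last_forward = torch.gather(output_forward, 0, index).squeeze(0)

# Same selection with advanced indexing
last_forward_alt = output_forward[seqlengths - 1, torch.arange(batch_size)]

print(last_forward.shape)                           # torch.Size([3, 4])
print(torch.equal(last_forward, last_forward_alt))  # True
```

Either form picks, for each sequence in the batch, the row at its own last valid time step.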
I tested the biLSTM output and h_n:
# shape of x is size(batch_size, time_steps, input_size)
# shape of output (batch_size, time_steps, hidden_size * num_directions)
# shape of h_n is size(num_directions, batch_size, hidden_size)
output, (h_n, _c_n) = biLSTM(x)
print('first step (element) of output from reverse == h_n from reverse?',
output[:, 0, hidden_size:] == h_n[1])
print('last step (element) of output from reverse == h_n from reverse?',
output[:, -1, hidden_size:] == h_n[1])
Output:
first step (element) of output from reverse == h_n from reverse? True
last step (element) of output from reverse == h_n from reverse? False
This confirmed that the h_n of the reverse direction is the hidden state of the first time step.
So, if you really need the hidden state of the last time step from both forward and reverse direction, you should use:
sum_lasthidden = output[:, -1, :hidden_size] + output[:, -1, hidden_size:]
not
h_n[0,:,:] + h_n[1,:,:]
As h_n[1,:,:] is the hidden state of the first time step from the reverse direction.
So the answer from @igrinis
u_emb_batch = (lasthidden[0, :, :] + lasthidden[1, :, :])
does not give the last-time-step states of both directions.
Keep in mind, though, that the reverse direction's hidden state at the last time step has only seen the last element of the sequence.
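With variable-length sequences, as in the original question, the picture can be checked using packed input. The sketch below uses made-up sizes and a padded batch sorted by decreasing length. It compares h_n with the padded output: the forward state at each sequence's own last valid step, and the backward state at step 0.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
batch_size, max_len, input_size, hidden_size = 3, 5, 4, 6  # illustrative sizes only
lengths = torch.tensor([5, 3, 2])                          # sorted, longest first

lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
x = torch.randn(batch_size, max_len, input_size)

packed = pack_padded_sequence(x, lengths, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)
output, _ = pad_packed_sequence(packed_out, batch_first=True)  # (batch, max_len, 2 * hidden)

batch_idx = torch.arange(batch_size)
# Forward direction: state at each sequence's own last valid step
fwd_last = output[batch_idx, lengths - 1, :hidden_size]
# Backward direction: state at step 0, after reading the sequence in reverse
bwd_last = output[:, 0, hidden_size:]

print(torch.allclose(h_n[0], fwd_last))
print(torch.allclose(h_n[1], bwd_last))
```

With packed input, output[:, -1] lands in the zero padding for the shorter sequences, so it is not a reliable way to read off final states there.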