XLM/BERT sequence outputs to pooled output with weighted average pooling - python

Let's say I have a tokenized sentence of length 10, and I pass it to a BERT model.
bert_out = bert(**bert_inp)
hidden_states = bert_out[0]
hidden_states.shape
>>>torch.Size([1, 10, 768])
This returns me a tensor of shape [batch_size, seq_length, d_model], where each token in the sequence is encoded as a 768-dimensional vector.
In TensorFlow, BERT also returns a so-called pooled output, which corresponds to a vector representation of the whole sentence.
I want to obtain it by taking a weighted average of sequence vectors and the way I do it is:
hidden_states.view(-1, 10).shape
>>> torch.Size([768, 10])
pooled = nn.Linear(10, 1)(hidden_states.view(-1, 10))
pooled.shape
>>> torch.Size([768, 1])
Is this the right way to proceed, or should I just flatten the whole thing and then apply a linear layer?
Any other ways to obtain a good sentence representation?

There are two simple ways to get a sentence representation:
Get the vector for the CLS token.
Get the pooler_output
Assuming the input is [batch_size, seq_length, d_model], where batch_size is the number of sentences, then to get the CLS token for every sentence:
bert_out = bert(**bert_inp)
hidden_states = bert_out['last_hidden_state']
cls_tokens = hidden_states[:, 0, :] # 0 for the CLS token for every sentence.
You will have a tensor with shape (batch_size, d_model).
To get the pooler_output:
bert_out = bert(**bert_inp)
pooler_output = bert_out['pooler_output']
Again you get a tensor with shape (batch_size, d_model).
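As for the weighted-average idea from the question: note that hidden_states.view(-1, 10) only reinterprets memory, it does not transpose to (d_model, seq_length), so the linear layer ends up mixing consecutive features of one token rather than weighting positions. A common alternative is to learn one scalar score per token and take a softmax-weighted average over the sequence. A minimal sketch (new code, ignoring padding masks for brevity):
import torch
import torch.nn as nn

batch_size, seq_length, d_model = 1, 10, 768
hidden_states = torch.rand(batch_size, seq_length, d_model)  # mock BERT output

scorer = nn.Linear(d_model, 1)                         # one scalar score per token
weights = torch.softmax(scorer(hidden_states), dim=1)  # (batch, seq_length, 1)
pooled = (weights * hidden_states).sum(dim=1)          # (batch, d_model)
print(pooled.shape)  # torch.Size([1, 768])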

Related

Why do I get a ValueError when using CrossEntropyLoss

I am trying to train a pretty simple 2-layer neural network for a multi-class classification task. I am using CrossEntropyLoss and I get the following error: ValueError: Expected target size (128, 4), got torch.Size([128]) in my training loop, at the point where I am trying to calculate the loss.
My last layer is a softmax so it outputs the probabilities of each of the 4 classes. My target values are a vector of dimension 128 (just the class values). Am I initializing the CrossEntropyLoss object incorrectly?
I looked up existing posts, this one seemed the most relevant:
https://discuss.pytorch.org/t/valueerror-expected-target-size-128-10000-got-torch-size-128-1/29424 However, if I had to squeeze my target values, how would that work? Right now they are just class values, e.g. [0, 3, 1, 0]. Is that not how they are supposed to look? I would think that the loss function takes the highest probability from the last layer and associates it with the appropriate class index.
Details:
This is using PyTorch
Python version is 3.7
NN architecture is: embedding -> pool -> h1 -> relu -> h2 -> softmax
Model Def (EDITED):
self.embedding_layer = create_embedding_layer(embeddings)
self.pool = nn.MaxPool1d(1)
self.h1 = nn.Linear(embedding_dim, embedding_dim)
self.h2 = nn.Linear(embedding_dim, 4)
self.s = nn.Softmax(dim=2)
forward pass:
x = self.embedding_layer(x)
x = self.pool(x)
x = self.h1(x)
x = F.relu(x)
x = self.h2(x)
x = self.s(x)
return x
The issue is that the output of your model is a tensor shaped (batch, seq_length, n_classes). Each sequence element in each batch is a four-element tensor corresponding to the predicted probability of each class (0, 1, 2, and 3). Your target tensor is shaped (batch,), which is usually the correct shape (you didn't use one-hot encodings). However, in this case, you need to provide a target for each one of the sequence elements.
Assuming the target is the same for each element of your sequence (this might not be true though and is entirely up to you to decide), you may repeat the targets seq_length times. nn.CrossEntropyLoss allows you to provide additional axes, but you have to follow a specific shape layout:
Input: (N, C) where C = number of classes, or (N, C, d_1, d_2, ..., d_K) with K≥1 in the case of K-dimensional loss.
Target: (N) where each value is 0 ≤ targets[i] ≤ C−1 , or (N, d_1, d_2, ..., d_K) with K≥1 in the case of K-dimensional loss.
In your case, C=4 and seq_length (what you referred to as D) would be d_1.
>>> seq_length = 10
>>> out = torch.rand(128, seq_length, 4) # mocking model's output
>>> y = torch.randint(0, 4, (128,)) # mock target tensor of class indices
>>> criterion = nn.CrossEntropyLoss()
>>> out_perm = out.permute(0, 2, 1)
>>> out_perm.shape
torch.Size([128, 4, 10]) # (N, C, d_1)
>>> y_rep = y[:, None].repeat(1, seq_length)
>>> y_rep.shape
torch.Size([128, 10]) # (N, d_1)
Then call your loss function with criterion(out_perm, y_rep).
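A side note: nn.CrossEntropyLoss already combines log-softmax and NLL loss internally, so the model should return raw logits; with the trailing nn.Softmax(dim=2) in the posted model, the outputs are effectively softmaxed twice. A minimal sketch of the adjusted forward pass, reusing the asker's layer names:
import torch.nn.functional as F

def forward(self, x):
    x = self.embedding_layer(x)
    x = self.pool(x)
    x = self.h1(x)
    x = F.relu(x)
    return self.h2(x)  # raw logits; nn.CrossEntropyLoss applies log-softmax itself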

BERT sentence embedding by summing last 4 layers

I used Chris McCormick's tutorial on BERT, using pytorch-pretrained-bert, to get a sentence embedding as follows:
tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)

batch_i = 0  # single sentence in the batch
# Holds the list of 12 layer embeddings for each token
# Will have the shape: [# tokens, # layers, # features]
token_embeddings = []
# For each token in the sentence...
for token_i in range(len(tokenized_text)):
    # Holds 12 layers of hidden states for each token
    hidden_layers = []
    # For each of the 12 layers...
    for layer_i in range(len(encoded_layers)):
        # Look up the vector for `token_i` in `layer_i`
        vec = encoded_layers[layer_i][batch_i][token_i]
        hidden_layers.append(vec)
    token_embeddings.append(hidden_layers)
Now, I am trying to get the final sentence embedding by summing the last 4 layers as follows:
summed_last_4_layers = [torch.sum(torch.stack(layer)[-4:], 0) for layer in token_embeddings]
But instead of getting a single torch vector of length 768 I get the following:
[tensor([-3.8930e+00, -3.2564e+00, -3.0373e-01, 2.6618e+00, 5.7803e-01,
-1.0007e+00, -2.3180e+00, 1.4215e+00, 2.6551e-01, -1.8784e+00,
-1.5268e+00, 3.6681e+00, ...., 3.9084e+00]), tensor([-2.0884e+00, -3.6244e-01, ....2.5715e+00]), tensor([ 1.0816e+00,...-4.7801e+00]), tensor([ 1.2713e+00,.... 1.0275e+00]), tensor([-6.6105e+00,..., -2.9349e-01])]
What did I get here? How do I pool the sum of the last four layers?
Thank you!
You create a list using a list comprehension that iterates over token_embeddings. It is a list that contains one tensor per token - not one tensor per layer as you probably thought (judging from your for layer in token_embeddings). You thus get a list with a length equal to the number of tokens. For each token, you have a vector that is a sum of BERT embeddings from the last 4 layers.
It would be more efficient to avoid the explicit for loops and list comprehensions:
summed_last_4_layers = torch.stack(encoded_layers[-4:]).sum(0)
Now, variable summed_last_4_layers contains the same data, but in the form of a single tensor of dimension: length of the sentence × 768.
To get a single (i.e., pooled) vector, you can pool over the first dimension of the tensor. Max-pooling or average-pooling might make much more sense here than summing all the token embeddings: with summing, the vectors of sentences of different lengths end up in different ranges and are not really comparable.
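For example, a minimal sketch of both pooling options, assuming summed_last_4_layers is the (sentence length × 768) tensor from above:
# Average-pool over the token dimension -> a single 768-dim sentence vector
sentence_avg = summed_last_4_layers.mean(dim=0)
# Max-pool over the token dimension (torch.max returns (values, indices))
sentence_max = summed_last_4_layers.max(dim=0)[0]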

Product merge layers with Keras functional API for Word2Vec model

I am trying to implement a Word2Vec CBOW with negative sampling with Keras, following the code found here:
EMBEDDING_DIM = 100

sentences = SentencesIterator('test_file.txt')
v_gen = VocabGenerator(sentences=sentences, min_count=5, window_size=3,
                       sample_threshold=-1, negative=5)
v_gen.scan_vocab()
v_gen.filter_vocabulary()
reverse_vocab = v_gen.generate_inverse_vocabulary_lookup('test_lookup')

# Generate embedding matrix with all values between -1/2d, 1/2d
embedding = np.random.uniform(-1.0 / (2 * EMBEDDING_DIM),
                              1.0 / (2 * EMBEDDING_DIM),
                              (v_gen.vocab_size + 3, EMBEDDING_DIM))

# Creating the CBOW model
# The model has 3 inputs:
# current word index, context word indexes, and negative-sampled word indexes
word_index = Input(shape=(1,))
context = Input(shape=(2*v_gen.window_size,))
negative_samples = Input(shape=(v_gen.negative,))

# All inputs are processed through a common embedding layer
shared_embedding_layer = Embedding(input_dim=(v_gen.vocab_size + 3),
                                   output_dim=EMBEDDING_DIM,
                                   weights=[embedding])
word_embedding = shared_embedding_layer(word_index)
context_embeddings = shared_embedding_layer(context)
negative_words_embedding = shared_embedding_layer(negative_samples)

# Now the context words are averaged to get the CBOW vector
cbow = Lambda(lambda x: K.mean(x, axis=1),
              output_shape=(EMBEDDING_DIM,))(context_embeddings)

# The context is multiplied (dot product) with the current word and the
# negative-sampled words
word_context_product = merge([word_embedding, cbow], mode='dot')
negative_context_product = merge([negative_words_embedding, cbow],
                                 mode='dot',
                                 concat_axis=-1)

# The dot products are the outputs
model = Model(input=[word_index, context, negative_samples],
              output=[word_context_product, negative_context_product])

# Binary cross-entropy is applied on the outputs
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
print(model.summary())

model.fit_generator(v_gen.pretraining_batch_generator(reverse_vocab),
                    samples_per_epoch=10,
                    nb_epoch=1)
However, I get an error during the merge part, because the output of the Embedding layer is a 3D tensor while cbow has only 2 dimensions. I assume I need to reshape the embedding (which is [?, 1, 100]) to [1, 100], but I can't find how to reshape with the functional API.
I am using the Tensorflow backend.
Also, if someone can point to another implementation of CBOW with Keras (Gensim-free), I would love to have a look at it!
Thank you!
EDIT: Here is the error
Traceback (most recent call last):
File "cbow.py", line 48, in <module>
word_context_product = merge([word_embedding, cbow], mode='dot')
.
.
.
ValueError: Shape must be rank 2 but is rank 3 for 'MatMul' (op: 'MatMul') with input shapes: [?,1,100], [?,100].
Indeed, you need to reshape the word_embedding tensor. There are two ways to do it:
Either you use the Reshape() layer, imported from keras.layers.core, like this:
word_embedding = Reshape((100,))(word_embedding)
The argument of Reshape is a tuple with the target shape.
Or you can use the Flatten() layer, also imported from keras.layers.core, used like this:
word_embedding = Flatten()(word_embedding)
It takes no arguments and will just remove the "empty" dimensions.
Does this help you?
EDIT:
Indeed, the second merge() is a bit more tricky. The dot merge in Keras only accepts tensors of the same rank, i.e. the same len(shape).
So use a Reshape() layer to add back the empty dimension, then use the dot_axes argument instead of concat_axis, which is not relevant for a dot merge.
This is what I propose as a solution:
word_embedding = shared_embedding_layer(word_index)
# Shape output = (None, 1, emb_size)
context_embeddings = shared_embedding_layer(context)
# Shape output = (None, 2*window_size, emb_size)
negative_words_embedding = shared_embedding_layer(negative_samples)
# Shape output = (None, negative, emb_size)

# Now the context words are averaged to get the CBOW vector
cbow = Lambda(lambda x: K.mean(x, axis=1),
              output_shape=(EMBEDDING_DIM,))(context_embeddings)
# Shape output = (None, emb_size)
cbow = Reshape((1, EMBEDDING_DIM))(cbow)
# Shape output = (None, 1, emb_size)

# The context is multiplied (dot product) with the current word and the
# negative-sampled words
word_context_product = merge([word_embedding, cbow], mode='dot')
# Shape output = (None, 1, 1)
word_context_product = Flatten()(word_context_product)
# Shape output = (None, 1)
negative_context_product = merge([negative_words_embedding, cbow],
                                 mode='dot', dot_axes=[2, 2])
# Shape output = (None, negative, 1)
negative_context_product = Flatten()(negative_context_product)
# Shape output = (None, negative)
Is it working? :)
The problem comes from the rigidity of TF regarding matrix multiplication. Merge with "dot" mode calls the backend batch_dot() function and, as opposed to Theano, TensorFlow requires the matrices to have the same rank: read here.
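For what it's worth, merge() has since been removed from Keras; in the current functional API the same two products can be written with the Dot layer. A sketch under the same shape assumptions as above (not the code the answer ran):
from keras.layers import Dot, Flatten

# (None, 1, emb_size) . (None, 1, emb_size) -> (None, 1, 1) -> (None, 1)
word_context_product = Flatten()(Dot(axes=2)([word_embedding, cbow]))
# (None, negative, emb_size) . (None, 1, emb_size) -> (None, negative, 1) -> (None, negative)
negative_context_product = Flatten()(Dot(axes=2)([negative_words_embedding, cbow]))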

TensorFlow: How to add bias to outputs from an RNN where the sequences have varying length

First let me explain the input and target values of the RNN. My dataset consists of sequences (e.g. 4, 7, 1, 23, 42, 69). The RNN is trained to predict the next value in each sequence. So all values except the last are input, and all values except the first are target values. Each value is represented as a 1-HOT vector.
I have an RNN in TensorFlow where the outputs from the RNN (tf.dynamic_rnn) are sent through a feedforward layer. The input sequences have varying length, so I use the sequence_length parameter to specify the length of each sequence in a batch. The output of the RNN layer is a tensor of outputs for each timestep. Most sequences have the same length, but some are shorter. When a shorter sequence is sent through, I get additional all-zero vectors (as padding).
The problem is that I want to send the output from the RNN layer through a feedforward layer. If I add bias in this feedforward layer, then the additional all-zero vectors become non-zero. With no bias, only weights, this works fine, since the all-zero vectors are not affected by multiplication. So without bias, I can set the target vectors as all-zero as well and thus they will not affect the backward pass. But if bias is added, I don't know what to put in the padded/dummy target vectors.
So the network looks like this:
[INPUT (1-HOT vectors, one vector for each value in the sequence)]
V
[GRU layer (smaller size than the input layer)]
V
[Feedforward layer (outputs vectors of the same size as the input)]
And here is the code:
# [batch_size, max_sequence_length, size of 1-HOT vectors]
x = tf.placeholder(tf.float32, [None, max_length, n_classes])
y = tf.placeholder(tf.int32, [None, max_length, n_classes])
session_length = tf.placeholder(tf.int32, [None])

outputs, state = rnn.dynamic_rnn(
    rnn_cell.GRUCell(num_hidden),
    x,
    dtype=tf.float32,
    sequence_length=session_length
)

layer = {'weights': tf.Variable(tf.random_normal([n_hidden, n_classes])),
         'biases': tf.Variable(tf.random_normal([n_classes]))}

# Flatten to apply the same weights to all timesteps
outputs = tf.reshape(outputs, [-1, n_hidden])
prediction = tf.matmul(outputs, layer['weights'])  # + layer['biases']
error = tf.nn.softmax_cross_entropy_with_logits(prediction, y)
You can add the bias, but mask out the non-relevant sequence elements from the loss function.
See an example from the im2txt project:
weights = tf.to_float(tf.reshape(self.input_mask, [-1]))  # these are the masks

# Compute losses.
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, targets)
batch_loss = tf.div(tf.reduce_sum(tf.mul(losses, weights)),
                    tf.reduce_sum(weights),
                    name="batch_loss")  # the padded sequence elements are masked out here
Also, for generating the mask, see the function batch_with_dynamic_pad in the same project, under ops/inputs.
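Adapted to the code in the question, a minimal sketch of the same idea (assuming a TensorFlow version that provides tf.sequence_mask; error is the per-timestep loss from above):
# 1.0 on real timesteps, 0.0 on padding; shape (batch, max_length)
mask = tf.sequence_mask(session_length, maxlen=max_length, dtype=tf.float32)
mask = tf.reshape(mask, [-1])  # flatten to line up with the flattened outputs

# Average the per-timestep losses over the real (unmasked) elements only
masked_loss = tf.div(tf.reduce_sum(error * mask), tf.reduce_sum(mask))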

TensorFlow bidirectional LSTM encoding of word embeddings

I have a word embedding matrix containing a vector for each word. I am trying to use TensorFlow to get the bidirectional LSTM encoding of each word given the embedding vectors. Unfortunately, I get the following error message:
ValueError: Shapes (1, 125) and () must have the same rank
Exception TypeError: TypeError("'NoneType' object is not callable",) in ignored
Here is the code I used:
# Declare max number of words in a sentence
self.max_len = 100
# Declare number of dimensions for word embedding vectors
self.wdims = 100
# Indices of words in the sentence
self.wrd_holder = tf.placeholder(tf.int32, [self.max_len])
# Embedding Matrix
wrd_lookup = tf.Variable(tf.truncated_normal([len(vocab)+3, self.wdims], stddev=1.0 / np.sqrt(self.wdims)))
# Declare forward and backward cells
forward = rnn_cell.LSTMCell(125, (self.wdims))
backward = rnn_cell.LSTMCell(125, (self.wdims))
# Perform lookup
wrd_embd = tf.nn.embedding_lookup(wrd_lookup, self.wrd_holder)
embd = tf.split(0, self.max_len, wrd_embd)
# run bidirectional LSTM
boutput = rnn.bidirectional_rnn(forward, backward, embd, dtype=tf.float32, sequence_length=self.max_len)
The sequence_length passed to the RNN must be a vector of length batch_size, not a scalar.
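In the posted code, self.max_len is a Python scalar; since the split embeddings form a batch of one sentence, a minimal sketch of the fix is to pass a length-1 vector instead (seq_len is a hypothetical name):
# sequence_length needs one entry per batch element; the batch size is 1 here
seq_len = tf.constant([self.max_len], dtype=tf.int64)
boutput = rnn.bidirectional_rnn(forward, backward, embd,
                                dtype=tf.float32, sequence_length=seq_len)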
