BERT sentence embedding by summing last 4 layers


I used Chris McCormick's tutorial on BERT, using pytorch-pretrained-bert, to get a sentence embedding as follows:
import torch
from pytorch_pretrained_bert import BertModel

tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
# Only one sentence is fed in, so the batch index is 0
batch_i = 0
# Holds the list of 12 layer embeddings for each token
# Will have the shape: [# tokens, # layers, # features]
token_embeddings = []
# For each token in the sentence...
for token_i in range(len(tokenized_text)):
    # Holds 12 layers of hidden states for each token
    hidden_layers = []
    # For each of the 12 layers...
    for layer_i in range(len(encoded_layers)):
        # Lookup the vector for `token_i` in `layer_i`
        vec = encoded_layers[layer_i][batch_i][token_i]
        hidden_layers.append(vec)
    token_embeddings.append(hidden_layers)
Now, I am trying to get the final sentence embedding by summing the last 4 layers as follows:
summed_last_4_layers = [torch.sum(torch.stack(layer)[-4:], 0) for layer in token_embeddings]
But instead of getting a single torch vector of length 768 I get the following:
[tensor([-3.8930e+00, -3.2564e+00, -3.0373e-01, 2.6618e+00, 5.7803e-01,
-1.0007e+00, -2.3180e+00, 1.4215e+00, 2.6551e-01, -1.8784e+00,
-1.5268e+00, 3.6681e+00, ...., 3.9084e+00]), tensor([-2.0884e+00, -3.6244e-01, ....2.5715e+00]), tensor([ 1.0816e+00,...-4.7801e+00]), tensor([ 1.2713e+00,.... 1.0275e+00]), tensor([-6.6105e+00,..., -2.9349e-01])]
What did I get here? How do I pool the sum of the last four layers?
Thank you!

You create a list using a list comprehension that iterates over token_embeddings. It is a list that contains one tensor per token - not one tensor per layer as you probably thought (judging from your for layer in token_embeddings). You thus get a list with a length equal to the number of tokens. For each token, you have a vector that is a sum of BERT embeddings from the last 4 layers.
It would be more efficient to avoid the explicit for loops and list comprehensions:
summed_last_4_layers = torch.stack(encoded_layers[-4:]).sum(0)
Now the variable summed_last_4_layers contains the same data, but as a single tensor of shape batch size × sentence length × 768 (here 1 × number of tokens × 768).
To get a single (i.e., pooled) vector, pool over the token dimension of that tensor. Max-pooling or average-pooling probably makes more sense here than summing all the token embeddings: when summing, vectors for sentences of different lengths end up in different ranges and are not really comparable.
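As a rough sketch of the pooling suggested above, the snippet below uses random tensors as a stand-in for encoded_layers (an assumption, so that it runs without downloading BERT). The shapes mirror bert-base-uncased: 12 layers, each [batch, tokens, 768]. The names mean_pooled, max_pooled and sentence_embedding are illustrative only.

```python
import torch

torch.manual_seed(0)

# Stand-in for the `encoded_layers` list returned by BertModel (assumption):
# 12 layers, each of shape [batch_size, num_tokens, hidden_size].
num_layers, batch_size, num_tokens, hidden_size = 12, 1, 5, 768
encoded_layers = [torch.randn(batch_size, num_tokens, hidden_size)
                  for _ in range(num_layers)]

# Sum the last 4 layers element-wise -> [batch_size, num_tokens, hidden_size]
summed_last_4_layers = torch.stack(encoded_layers[-4:]).sum(0)

# Pool over the token dimension (dim=1) to get one vector per sentence.
mean_pooled = summed_last_4_layers.mean(dim=1)        # [batch_size, hidden_size]
max_pooled = summed_last_4_layers.max(dim=1).values   # [batch_size, hidden_size]

# Single vector of length 768 for the only sentence in the batch.
sentence_embedding = mean_pooled[0]
print(summed_last_4_layers.shape, sentence_embedding.shape)
```

Mean pooling keeps the result on a comparable scale regardless of sentence length, which is the concern raised about plain summation.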

Related

XLM/BERT sequence outputs to pooled output with weighted average pooling

Let's say I have a tokenized sentence of length 10, and I pass it to a BERT model.
bert_out = bert(**bert_inp)
hidden_states = bert_out[0]
hidden_states.shape
>>>torch.Size([1, 10, 768])
This returns a tensor of shape [batch_size, seq_length, d_model], where each word in the sequence is encoded as a 768-dimensional vector.
In TensorFlow, BERT also returns a so-called pooled output, which corresponds to a vector representation of the whole sentence.
I want to obtain it by taking a weighted average of sequence vectors and the way I do it is:
hidden_states.view(-1, 10).shape
>>> torch.Size([768, 10])
pooled = nn.Linear(10, 1)(hidden_states.view(-1, 10))
pooled.shape
>>> torch.Size([768, 1])
Is it the right way to proceed, or should I just flatten the whole thing and then apply linear?
Any other ways to obtain a good sentence representation?

There are two simple ways to get a sentence representation:
Get the vector for the CLS token.
Get the pooler_output
Assuming the input is [batch_size, seq_length, d_model], where batch_size is the number of sentences, then to get the CLS token for every sentence:
bert_out = bert(**bert_inp)
hidden_states = bert_out['last_hidden_state']
cls_tokens = hidden_states[:, 0, :] # 0 for the CLS token for every sentence.
You will have a tensor with shape (batch_size, d_model).
To get the pooler_output:
bert_out = bert(**bert_inp)
pooler_output = bert_out['pooler_output']
Again you get a tensor with shape (batch_size, d_model).
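If a weighted average over tokens is still wanted, as in the question, one possible sketch follows. It uses a random tensor in place of real BERT output and a hypothetical nn.Linear(d_model, 1) scorer whose softmax gives per-token weights; neither is part of the original answer. Note that view(-1, 10) on a [1, 10, 768] tensor only reinterprets memory and does not transpose, so the rows of the resulting [768, 10] tensor are not per-feature token sequences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_length, d_model = 2, 10, 768
# Stand-in for bert_out['last_hidden_state'] (assumption: random values).
hidden_states = torch.randn(batch_size, seq_length, d_model)
# 1 for real tokens, 0 for padding; the second sentence has 3 padded positions.
attention_mask = torch.ones(batch_size, seq_length)
attention_mask[1, 7:] = 0

# CLS vector: the first token of every sentence.
cls_tokens = hidden_states[:, 0, :]                          # [batch, d_model]

# Hypothetical learned scorer: one scalar score per token.
scorer = nn.Linear(d_model, 1)
scores = scorer(hidden_states).squeeze(-1)                   # [batch, seq]
# Padding positions get -inf so that softmax assigns them zero weight.
scores = scores.masked_fill(attention_mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=1)                       # sums to 1 per sentence

# Weighted average of the token vectors.
pooled = (weights.unsqueeze(-1) * hidden_states).sum(dim=1)  # [batch, d_model]
print(cls_tokens.shape, pooled.shape)
```

Scoring each token from its own vector keeps the layer independent of sequence length, unlike nn.Linear(10, 1), which ties the model to sentences of exactly 10 tokens.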

Input to Bidirectional LSTM in tensorflow

Normally all inputs fed to BiLSTM are of shape [batch_size, time_steps, input_size].
However, I'm working on a problem of Automatic Grading of an Essay in which there's an extra dimension called number of sentences in each essay. So in my case, a typical batch after embedding using word2vec, is of shape [2,16,25,300].
Here, there are 2 essays in each batch (batch_size=2), each essay has 16 sentences, each sentence is 25 words long(time_step=25) and I'm using word2vec of 300 dimensions (input_size=300).
So clearly I need to loop over dimension 16 of this batch somehow, such that the shape of the input becomes [2,25,300] in each iteration. I have tried for a long time but I haven't been able to find a way to do it. For example, if you loop over tf.nn.bidirectional_dynamic_rnn(), it gives an error in the second iteration saying that the tf.nn.bidirectional_dynamic_rnn() kernel already exists. I can't directly write a for loop over the sentence dimension because these are tensors of shape [None,None,None,300]; I gave concrete values just for the sake of simplicity. Is there any other way to do it? Thanks. Please note that I am not using Keras or any other framework.
Here's a sample encoding layer for reference.
def bidirectional_lstm(input_data):
    cell = tf.nn.rnn_cell.LSTMCell(num_units=200, state_is_tuple=True)
    outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw=cell,
                                                      cell_bw=cell,
                                                      dtype=tf.float32,
                                                      inputs=input_data)
    return tf.concat(outputs,2)
embedded is of shape [2,16,25,300].
And here's a sample input
for batch_i, texts_batch in enumerate(get_batches(X_train, batch_size)):  ## get_batches() is yielding batches one by one
    ## X_train is of shape [2,16,25] ([batch_size,sentence_length,num_words])
    ## word2vec
    embeddings = tf.nn.embedding_lookup(word_embedding_matrix, texts_batch)
    ## embeddings shape is now [2,16,25,300] ([batch_size,sentence_length,num_words,word2vec_dim])
    ## Now I need some kind of loop here to loop it over sentence dimension. Can't use for loop since these are all placeholders with dimensions None
    ## ??? output = bidirectional_lstm(embeddings,3,200,0.7) ??? This would be correct if there was no sentence dimension.
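One common way around an explicit loop (a sketch, not something from the original post) is to fold the sentence dimension into the batch dimension, run the BiLSTM once, and unfold afterwards. The NumPy snippet below only demonstrates the reshape bookkeeping with the example sizes; in TensorFlow the same idea would presumably be expressed with tf.reshape and dynamic sizes taken from tf.shape. The fake_bilstm function is a hypothetical placeholder for the real encoder.

```python
import numpy as np

batch_size, num_sentences, num_words, embed_dim = 2, 16, 25, 300
embeddings = np.random.rand(batch_size, num_sentences, num_words, embed_dim)

# Fold sentences into the batch: [2, 16, 25, 300] -> [32, 25, 300]
folded = embeddings.reshape(-1, num_words, embed_dim)

def fake_bilstm(x, num_units=200):
    """Hypothetical stand-in for the BiLSTM: returns [batch, time, 2 * num_units]."""
    return np.zeros((x.shape[0], x.shape[1], 2 * num_units))

encoded = fake_bilstm(folded)                      # [32, 25, 400]

# Unfold back to one row of sentences per essay: [2, 16, 25, 400]
unfolded = encoded.reshape(batch_size, num_sentences, num_words, -1)

# The fold keeps sentences in order: row 17 is essay 1, sentence 1.
same_sentence = np.array_equal(folded[17], embeddings[1, 1])
print(folded.shape, unfolded.shape, same_sentence)
```

Because every sentence is encoded independently, treating the 32 sentences as one larger batch gives the encoder the [batch_size, time_steps, input_size] input it expects.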

How to structure an LSTM neural network for classification

I have data containing various conversations between two people. Each sentence has some type of classification. I am attempting to use an NLP net to classify each sentence of the conversation. I tried a convolutional net and got decent results (not groundbreaking, though). I figured that since this is a back-and-forth conversation, an LSTM net may produce better results, because what was previously said may have a large impact on what follows.
If I follow the structure above, I would assume that I am doing a many-to-many. My data looks like.
X_train = [[sentence 1],
[sentence 2],
[sentence 3]]
Y_train = [[0],
[1],
[0]]
Data has been processed using word2vec. I then design my network as follows..
model = Sequential()
model.add(Embedding(len(vocabulary),embedding_dim,
                    input_length=X_train.shape[1]))
model.add(LSTM(88))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_train,Y_train,verbose=2,nb_epoch=3,batch_size=15)
I assume that this setup will feed one batch of sentences in at a time. However, unless shuffle is set to False in model.fit, it's receiving shuffled batches, so why is an LSTM net even useful in this case? From research on the subject, to achieve a many-to-many structure one would need to change the LSTM layer to
model.add(LSTM(88,return_sequences=True))
and the output layer would need to be...
model.add(TimeDistributed(Dense(1,activation='sigmoid')))
When switching to this structure I get an error on the input size. I'm unsure of how to reformat the data to meet this requirement, and also how to edit the embedding layer to receive the new data format.
Any input would be greatly appreciated. Or if you have any suggestions on a better method, I am more than happy to hear them!

Your first attempt was good. The shuffling takes place between sentences: it only shuffles the training samples so that they don't always arrive in the same order. The words inside sentences are not shuffled.
Or maybe I didn't understand the question correctly?
EDIT :
After a better understanding of the question, here is my proposition.
Data preparation : You slice your corpus in blocks of n sentences (they can overlap).
You should then have a shape like (number_blocks_of_sentences, n, number_of_words_per_sentence), so basically a list of 2D arrays, each containing a block of n sentences. n shouldn't be too big, because an LSTM can't handle a huge number of elements in the sequence when training (vanishing gradient).
Your targets should be an array of shape (number_blocks_of_sentences, n, 1) so also a list of 2D arrays containing the class of each sentence in your block of sentences.
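The data preparation described above could look roughly like the following sketch. The helper name make_blocks, the stride parameter and the toy sizes are illustrative assumptions; sentences are assumed to be already padded to a fixed number of word indices.

```python
import numpy as np

def make_blocks(sentences, labels, n, stride=1):
    """Slice a conversation into (possibly overlapping) blocks of n sentences.

    sentences: int array [num_sentences, n_words] of padded word indices
    labels:    int array [num_sentences] with one class per sentence
    Returns X of shape [num_blocks, n, n_words] and Y of shape [num_blocks, n, 1].
    """
    starts = range(0, len(sentences) - n + 1, stride)
    X = np.stack([sentences[s:s + n] for s in starts])
    Y = np.stack([labels[s:s + n] for s in starts])[..., np.newaxis]
    return X, Y

# Toy corpus: 10 sentences of 6 word indices each, with binary labels.
sentences = np.arange(60).reshape(10, 6)
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])

X_train, Y_train = make_blocks(sentences, labels, n=4, stride=2)
print(X_train.shape, Y_train.shape)
```

A stride smaller than n produces overlapping blocks, which gives more training samples at the cost of repeating sentences across them.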
Model :
n_sentences = X_train.shape[1] # number of sentences in a sample (n)
n_words = X_train.shape[2] # number of words in a sentence
model = Sequential()
# Reshape the input because Embedding only accepts shape (batch_size, input_length), so we just transform the list of sentences into one long list of words
model.add(Reshape((n_sentences * n_words,),input_shape = (n_sentences, n_words)))
# Embedding layer - output shape will be (batch_size, n_sentences * n_words, embedding_dim), so each sample in the batch is a big 2D array of embedded words
model.add(Embedding(len(vocabulary), embedding_dim, input_length = n_sentences * n_words ))
# Recreate the sentence-shaped array
model.add(Reshape((n_sentences, n_words, embedding_dim)))
# Encode each sentence - output shape is (batch_size, n_sentences, 88)
model.add(TimeDistributed(LSTM(88)))
# Go over the sentences and output the hidden state, which contains info about previous sentences - output shape is (batch_size, n_sentences, hidden_dim)
model.add(LSTM(hidden_dim, return_sequences=True))
# Predict output binary class - output shape is (batch_size, n_sentences, 1)
model.add(TimeDistributed(Dense(1,activation='sigmoid')))
...
This should be a good start.
I hope this helps

Confusion with weights dumping from neural net in keras

I created a simple 2-layer network, one hidden layer. I am dumping the weights from the middle layer to visualize what the hidden neurons are learning.
I am using
weights = model.layers[0].get_weights()
When I look at the weights structure I get:
So len(weights) = 2, len(weights[0]) = 500, len(weights[1]) = 100.
I want to create an array m of size (500,100), so that m.shape = (500,100).
I tried numpy.reshape(weights, 500, 100), zip(weights[0], weights[1]), then, by chance, I wrote numpy.array(weights[0]) and this came back with shape (500,100).
Can someone explain why?

get_weights() returns a plain Python list of NumPy arrays (for a dense layer, the weight matrix followed by the bias vector), so it behaves like a nested list rather than a single array. To illustrate the concept, consider the list:
>>> list=[[[1,2,3],[1,2,3],[1,2,3],[1,2,3],[1,2,3]],[1,2,3]]
Here, the first element of list contains five elements of length 3, and the second element is itself a list of length 3. When you do:
>>> len(list)
Output is:
2 (which is 2 in your case)
Also,
>>> len(list[0])
5 (which is 500 in your case)
>>> len(list[1])
3 (which is 100 in your case)
But when you try to convert to array:
>>> np.array(list[0]).shape
The answer is:
(5, 3) (which is 500,100 in your case)
This is because every element of list[0] (which is weights[0] in your case) is itself a list of the same length. So when I asked you to return
len(weights[0][0])
it returned:
100
because weights[0] contains 500 elements, each of length 100. If you are wondering what each set of 100 values means, they are the corresponding weights of the connections, i.e.
weights[0][0] = weights from the first input to all 100 hidden neurons
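To make the structure concrete, the sketch below builds a stand-in for what get_weights() returns for a dense layer with 500 inputs and 100 units: a list holding a weight matrix and a bias vector. The arrays are placeholders (an assumption), not weights from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for model.layers[0].get_weights() on a dense layer
# with 500 inputs and 100 hidden units (placeholder values).
kernel = rng.normal(size=(500, 100))   # one row per input, one column per hidden unit
bias = np.zeros(100)                   # one bias per hidden unit
weights = [kernel, bias]

print(len(weights), len(weights[0]), len(weights[1]))   # 2 500 100

m = np.array(weights[0])               # the weight matrix alone is already 2-D
print(m.shape)                         # (500, 100)

# Weights from the first input to all 100 hidden neurons.
first_input_weights = m[0]
print(first_input_weights.shape)       # (100,)
```

The two list elements have different shapes, which is why only weights[0], and not the whole list, converts to a (500, 100) array.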

TensorFlow bidirectional LSTM encoding of word embeddings

I have a word embedding matrix containing a vector for each word. I am trying to use TensorFlow to get the bidirectional LSTM encoding of each word given the embedding vectors. Unfortunately, I get the following error message:
ValueError: Shapes (1, 125) and () must have the same rank
Exception TypeError: TypeError("'NoneType' object is not callable",) in ignored
Here is the code I used:
# Declare max number of words in a sentence
self.max_len = 100
# Declare number of dimensions for word embedding vectors
self.wdims = 100
# Indices of words in the sentence
self.wrd_holder = tf.placeholder(tf.int32, [self.max_len])
# Embedding Matrix
wrd_lookup = tf.Variable(tf.truncated_normal([len(vocab)+3, self.wdims], stddev=1.0 / np.sqrt(self.wdims)))
# Declare forward and backward cells
forward = rnn_cell.LSTMCell(125, (self.wdims))
backward = rnn_cell.LSTMCell(125, (self.wdims))
# Perform lookup
wrd_embd = tf.nn.embedding_lookup(wrd_lookup, self.wrd_holder)
embd = tf.split(0, self.max_len, wrd_embd)
# run bidirectional LSTM
boutput = rnn.bidirectional_rnn(forward, backward, embd, dtype=tf.float32, sequence_length=self.max_len)

The sequence_length passed to the RNN must be a vector of length batch size (one entry per sequence in the batch), not a scalar such as self.max_len.
