I have a word embedding matrix containing a vector for each word. I am trying to use TensorFlow to get the bidirectional LSTM encoding of each word given the embedding vectors. Unfortunately, I get the following error message:
ValueError: Shapes (1, 125) and () must have the same rank
Exception TypeError: TypeError("'NoneType' object is not callable",) in ignored
Here is the code I used:
# Imports assumed by the snippet (TF 0.x-era paths for rnn and rnn_cell)
import numpy as np
import tensorflow as tf
from tensorflow.python.ops import rnn, rnn_cell

# Declare max number of words in a sentence
self.max_len = 100
# Declare number of dimensions for word embedding vectors
self.wdims = 100
# Indices of words in the sentence
self.wrd_holder = tf.placeholder(tf.int32, [self.max_len])
# Embedding Matrix
wrd_lookup = tf.Variable(tf.truncated_normal([len(vocab)+3, self.wdims], stddev=1.0 / np.sqrt(self.wdims)))
# Declare forward and backward cells
forward = rnn_cell.LSTMCell(125, (self.wdims))
backward = rnn_cell.LSTMCell(125, (self.wdims))
# Perform lookup
wrd_embd = tf.nn.embedding_lookup(wrd_lookup, self.wrd_holder)
embd = tf.split(0, self.max_len, wrd_embd)
# run bidirectional LSTM
boutput = rnn.bidirectional_rnn(forward, backward, embd, dtype=tf.float32, sequence_length=self.max_len)
The sequence_length passed to rnn must be a vector of length batch_size, not a Python scalar.
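For example, with a batch of a single sentence as in the code above, you can pass a one-element vector (a minimal sketch; seq_len is a name introduced here):
# sequence_length must be a 1-D int32/int64 tensor with one entry per batch
# element; the list produced by tf.split above has batch size 1
seq_len = tf.constant([self.max_len], dtype=tf.int64)
boutput = rnn.bidirectional_rnn(forward, backward, embd,
                                dtype=tf.float32, sequence_length=seq_len)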
Let's say I have a tokenized sentence of length 10, and I pass it to a BERT model.
bert_out = bert(**bert_inp)
hidden_states = bert_out[0]
hidden_states.shape
>>>torch.Size([1, 10, 768])
This returns a tensor of shape [batch_size, seq_length, d_model], where each word in the sequence is encoded as a 768-dimensional vector.
In TensorFlow, BERT also returns a so-called pooled output, which corresponds to a vector representation of the whole sentence.
I want to obtain it by taking a weighted average of the sequence vectors, and the way I do it is:
hidden_states.view(-1, 10).shape
>>> torch.Size([768, 10])
pooled = nn.Linear(10, 1)(hidden_states.view(-1, 10))
pooled.shape
>>> torch.Size([768, 1])
Is this the right way to proceed, or should I just flatten the whole thing and then apply a linear layer?
Any other ways to obtain a good sentence representation?
There are two simple ways to get a sentence representation:
Get the vector for the CLS token.
Get the pooler_output
Assuming the input is [batch_size, seq_length, d_model], where batch_size is the number of sentences, then to get the CLS token for every sentence:
bert_out = bert(**bert_inp)
hidden_states = bert_out['last_hidden_state']
cls_tokens = hidden_states[:, 0, :] # 0 for the CLS token for every sentence.
You will have a tensor with shape (batch_size, d_model).
To get the pooler_output:
bert_out = bert(**bert_inp)
pooler_output = bert_out['pooler_output']
Again you get a tensor with shape (batch_size, d_model).
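If you specifically want to build the sentence vector from all the token vectors, as in the question, a simple alternative to a learned projection is to average over the sequence dimension (a minimal sketch):
# mean pooling: (batch_size, seq_length, d_model) -> (batch_size, d_model),
# averaging each sentence's token vectors
mean_pooled = hidden_states.mean(dim=1)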
I used Chris McCormick's tutorial on BERT with pytorch-pretrained-bert to get a sentence embedding as follows:
tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)

batch_i = 0  # only one sentence in the batch

# Holds the list of 12 layer embeddings for each token
# Will have the shape: [# tokens, # layers, # features]
token_embeddings = []

# For each token in the sentence...
for token_i in range(len(tokenized_text)):
    # Holds 12 layers of hidden states for each token
    hidden_layers = []
    # For each of the 12 layers...
    for layer_i in range(len(encoded_layers)):
        # Look up the vector for `token_i` in `layer_i`
        vec = encoded_layers[layer_i][batch_i][token_i]
        hidden_layers.append(vec)
    token_embeddings.append(hidden_layers)
Now, I am trying to get the final sentence embedding by summing the last 4 layers as follows:
summed_last_4_layers = [torch.sum(torch.stack(layer)[-4:], 0) for layer in token_embeddings]
But instead of getting a single torch vector of length 768 I get the following:
[tensor([-3.8930e+00, -3.2564e+00, -3.0373e-01, 2.6618e+00, 5.7803e-01,
-1.0007e+00, -2.3180e+00, 1.4215e+00, 2.6551e-01, -1.8784e+00,
-1.5268e+00, 3.6681e+00, ...., 3.9084e+00]), tensor([-2.0884e+00, -3.6244e-01, ....2.5715e+00]), tensor([ 1.0816e+00,...-4.7801e+00]), tensor([ 1.2713e+00,.... 1.0275e+00]), tensor([-6.6105e+00,..., -2.9349e-01])]
What did I get here? How do I pool the sum of the last four layers?
Thank you!
You create a list using a list comprehension that iterates over token_embeddings. It is a list that contains one tensor per token - not one tensor per layer as you probably thought (judging from your for layer in token_embeddings). You thus get a list with a length equal to the number of tokens. For each token, you have a vector that is a sum of BERT embeddings from the last 4 layers.
It would be more efficient to avoid the explicit for loops and list comprehensions:
summed_last_4_layers = torch.stack(encoded_layers[-4:]).sum(0)
Now, the variable summed_last_4_layers contains the same data, but in the form of a single tensor of dimension: length of the sentence × 768.
To get a single (i.e., pooled) vector, you can pool over the first dimension of the tensor. Max-pooling or average-pooling makes much more sense here than summing all the token embeddings: when you sum, the vectors of sentences of different lengths lie in different ranges and are not really comparable.
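For instance, a minimal sketch, assuming the batch dimension has been squeezed out so that summed_last_4_layers has shape [sentence_length, 768]:
sentence_avg = summed_last_4_layers.mean(dim=0)    # average pooling -> (768,)
sentence_max = summed_last_4_layers.max(dim=0)[0]  # max pooling -> (768,)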
I am new to deep learning. I am trying to build a very basic LSTM network on word embedding features. I have written the following code for the model but I am unable to run it.
from keras.layers import Dense, Dropout, Input, LSTM
from keras.models import Model
max_sequence_size = 14
classes_num = 2
LSTM_word_1 = LSTM(100, activation='relu',recurrent_dropout = 0.25, dropout = 0.25)
lstm_word_input_1 = Input(shape=(max_sequence_size, 300))
lstm_word_out_1 = LSTM_word_1(lstm_word_input_1)
merged_feature_vectors = Dense(50, activation='sigmoid')(Dropout(0.2)(lstm_word_out_1))
predictions = Dense(classes_num, activation='softmax')(merged_feature_vectors)
my_model = Model(input=[lstm_word_input_1], output=predictions)
print my_model.summary()
The error I am getting is ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (3019, 300). On searching, I found that people have used Flatten() to compress the 2-D features (3019, 300) for the dense layer, but I am unable to fix the issue.
While explaining, kindly let me know how the dimensions work out.
Upon request:
My X_training had dimension issues, so I am providing the code below to clear up the confusion.
def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    #
    nwords = 0.
    #
    # Index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model[word])
    #
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec, nwords)
    return featureVec
I think the following code is giving 2-D numpy array as I am initializing it that way
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Initialize a counter
    counter = 0.
    #
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        if counter % 1000. == 0.:
            print "Question %d of %d" % (counter, len(reviews))
        reviewFeatureVecs[int(counter)] = makeFeatureVec(review, model,
                                                         num_features)
        counter = counter + 1.
    return reviewFeatureVecs
def getCleanReviews(reviews):
    clean_reviews = []
    for review in reviews["question"]:
        clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, remove_stopwords=True))
    return clean_reviews
My objective is just to use a pretrained gensim model for an LSTM on some comments that I have.
trainDataVecs = getAvgFeatureVecs( getCleanReviews(train), model, num_features )
You should try using an Embedding layer before the LSTM layer. Also, since you have pre-trained 300-dimensional vectors for 3019 comments, you can initialize the weights of the embedding layer with this matrix.
inp_layer = Input((maxlen,))
x = Embedding(max_features, embed_size, weights=[trainDataVecs])(inp_layer)
x = LSTM(50, dropout=0.1)(x)
Here, maxlen is the maximum length of your comments, max_features is the number of unique words (the vocabulary size) of your dataset, and embed_size is the dimensionality of your vectors, 300 in your case.
Note that the shape of trainDataVecs should be (max_features, embed_size), so if you have pre-trained word vectors loaded into trainDataVecs, this should work.
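As a sketch of how such a weight matrix is usually built from a gensim model (word_index here is a hypothetical word-to-integer-id mapping for your dataset; it is not defined in the code above):
import numpy as np

embedding_matrix = np.zeros((max_features, embed_size), dtype="float32")
for word, i in word_index.items():
    # copy the pretrained 300-d vector for every word that has an id
    if i < max_features and word in model.wv.vocab:
        embedding_matrix[i] = model.wv[word]
# then pass it to the layer: Embedding(max_features, embed_size, weights=[embedding_matrix])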
I'm working on text summarization, starting with the pointer-generator network described in the paper: https://arxiv.org/pdf/1704.04368.pdf, with a code release at: https://github.com/abisee/pointer-generator
I want to add hierarchical encoding to this network so it can handle larger input documents for summarization. Right now, input documents are truncated at 400 words because LSTMs have trouble keeping memory over very long inputs. We want to reweight the attention distribution by taking the sentence-level structure of the input into account.
The reweighting I am considering is from the paper: https://arxiv.org/pdf/1602.06023.pdf where the final distribution is

P_a(j) = P_a^w(j) * P_a^s(s(j)) / sum_{k=1}^{N_d} P_a^w(k) * P_a^s(s(k))    (1)

where "P_a^w(j) is the word-level attention weight at jth position of the source document, and s(j) is the ID of the sentence at jth word position, P_a^s(l) is the sentence-level attention weight for the lth sentence in the source, N_d is the number of words in the source document, and P_a(j) is the re-scaled attention at the jth word position"
The Pointer-Generator attention is calculated as

e^t_i = v^T tanh(W_h h_i + W_s s_t + b_attn)

then

a^t = softmax(e^t)
So I think I need to add another LSTM for the sentence-level attention, feed its outputs through the same attention calculation with softmax as for the words, and then at the end reweight as described in equation (1); a sketch of that reweighting follows the code below.
I have added a new bidirectional LSTM under def _add_encoder. I am wondering how I should handle input into the sentence LSTM. The model needs some indication of the position of each sentence and of the words in it. How should I structure the sentences to be processed by the sentence LSTM?
def _add_encoder(self, encoder_inputs, seq_len):
    """Add a single-layer bidirectional LSTM encoder to the graph.

    Args:
      encoder_inputs: A tensor of shape [batch_size, <=max_enc_steps, emb_size].
      seq_len: Lengths of encoder_inputs (before padding). A tensor of shape [batch_size].

    Returns:
      encoder_outputs:
        A tensor of shape [batch_size, <=max_enc_steps, 2*hidden_dim]. It's 2*hidden_dim because it's the concatenation of the forwards and backwards states.
      fw_state, bw_state:
        Each are LSTMStateTuples of shape ([batch_size, hidden_dim], [batch_size, hidden_dim])
    """
    with tf.variable_scope('encoder'):
        cell_fw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        cell_bw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        (encoder_outputs, (fw_st, bw_st)) = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, encoder_inputs, dtype=tf.float32, sequence_length=seq_len, swap_memory=True)
        encoder_outputs = tf.concat(axis=2, values=encoder_outputs)  # concatenate the forwards and backwards states

        sentence_encoder_inputs = ???
        sentence_lens = len(sentence_encoder_inputs)

        ### NEW CODE ###
        sentence_cell_fw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        sentence_cell_bw = tf.contrib.rnn.LSTMCell(self._hps.hidden_dim, initializer=self.rand_unif_init, state_is_tuple=True)
        (sentence_encoder_outputs, (fw_st, bw_st)) = tf.nn.bidirectional_dynamic_rnn(sentence_cell_fw, sentence_cell_bw, sentence_encoder_inputs, dtype=tf.float32, sequence_length=sentence_lens, swap_memory=True)
        sentence_encoder_outputs = tf.concat(axis=2, values=sentence_encoder_outputs)  # concatenate the forwards and backwards states

    return encoder_outputs, sentence_encoder_outputs, fw_st, bw_st
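For reference, here is a minimal sketch of the equation (1) reweighting I have in mind; word_attn, sent_attn, sent_ids, and num_sents are hypothetical names (word_attn is P_a^w with shape [batch_size, num_words], sent_attn is P_a^s with shape [batch_size, num_sents], and sent_ids holds s(j), the sentence id of each word position):
# P_a^s(s(j)) for every word position, fetched with a one-hot gather over sentences
sent_onehot = tf.one_hot(sent_ids, depth=num_sents)  # [batch_size, num_words, num_sents]
sent_attn_per_word = tf.reduce_sum(sent_onehot * tf.expand_dims(sent_attn, 1), axis=2)
# equation (1): rescale each word's attention by its sentence's attention, renormalize
rescaled = word_attn * sent_attn_per_word
attn_dist = rescaled / tf.reduce_sum(rescaled, axis=1, keep_dims=True)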
I am trying to implement a Word2Vec CBOW with negative sampling with Keras, following the code found here:
# Imports assumed by the snippet (Keras 1.x-style functional API);
# SentencesIterator and VocabGenerator come from the linked code
import numpy as np
from keras import backend as K
from keras.layers import Input, Embedding, Lambda, merge
from keras.models import Model

EMBEDDING_DIM = 100
sentences = SentencesIterator('test_file.txt')
v_gen = VocabGenerator(sentences=sentences, min_count=5, window_size=3,
sample_threshold=-1, negative=5)
v_gen.scan_vocab()
v_gen.filter_vocabulary()
reverse_vocab = v_gen.generate_inverse_vocabulary_lookup('test_lookup')
# Generate embedding matrix with all values between -1/2d, 1/2d
embedding = np.random.uniform(-1.0 / (2 * EMBEDDING_DIM),
1.0 / (2 * EMBEDDING_DIM),
(v_gen.vocab_size + 3, EMBEDDING_DIM))
# Creating CBOW model
# Model has 3 inputs
# Current word index, context words indexes and negative sampled word indexes
word_index = Input(shape=(1,))
context = Input(shape=(2*v_gen.window_size,))
negative_samples = Input(shape=(v_gen.negative,))
# All inputs are processed through a common embedding layer
shared_embedding_layer = (Embedding(input_dim=(v_gen.vocab_size + 3),
output_dim=EMBEDDING_DIM,
weights=[embedding]))
word_embedding = shared_embedding_layer(word_index)
context_embeddings = shared_embedding_layer(context)
negative_words_embedding = shared_embedding_layer(negative_samples)
# Now the context words are averaged to get the CBOW vector
cbow = Lambda(lambda x: K.mean(x, axis=1),
output_shape=(EMBEDDING_DIM,))(context_embeddings)
# Context is multiplied (dot product) with current word and negative
# sampled words
word_context_product = merge([word_embedding, cbow], mode='dot')
negative_context_product = merge([negative_words_embedding, cbow],
mode='dot',
concat_axis=-1)
# The dot products are outputted
model = Model(input=[word_index, context, negative_samples],
output=[word_context_product, negative_context_product])
# Binary crossentropy is applied on the output
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
print(model.summary())
model.fit_generator(v_gen.pretraining_batch_generator(reverse_vocab),
samples_per_epoch=10,
nb_epoch=1)
However, I get an error during the merge part, because the embedding layer outputs a 3D tensor while cbow is only 2-dimensional. I assume I need to reshape the embedding (which is [?, 1, 100]) to [?, 100], but I can't find how to do the reshape with the functional API.
I am using the TensorFlow backend.
Also, if someone can point me to another implementation of CBOW with Keras (gensim-free), I would love to have a look at it!
Thank you!
EDIT: Here is the error
Traceback (most recent call last):
File "cbow.py", line 48, in <module>
word_context_product = merge([word_embedding, cbow], mode='dot')
.
.
.
ValueError: Shape must be rank 2 but is rank 3 for 'MatMul' (op: 'MatMul') with input shapes: [?,1,100], [?,100].
Indeed you need to reshape the word_embedding tensor. There are two ways to do it:
Either you use the Reshape() layer, imported from keras.layers.core, like this:
word_embedding = Reshape((100,))(word_embedding)
The argument of Reshape is a tuple with the target shape.
Or you can use the Flatten() layer, also imported from keras.layers.core, used like this:
word_embedding = Flatten()(word_embedding)
It takes no arguments; it just removes the "empty" dimensions.
Does this help you?
EDIT:
Indeed the second merge() is a bit trickier. The dot merge in Keras only accepts tensors of the same rank, i.e. the same len(shape).
So what you can do is use a Reshape() layer to add back the empty dimension of size 1, then use dot_axes instead of concat_axis, which is not relevant for a dot merge.
This is the solution I propose:
word_embedding = shared_embedding_layer(word_index)
# Shape output = (None,1,emb_size)
context_embeddings = shared_embedding_layer(context)
# Shape output = (None, 2*window_size, emb_size)
negative_words_embedding = shared_embedding_layer(negative_samples)
# Shape output = (None, negative, emb_size)
# Now the context words are averaged to get the CBOW vector
cbow = Lambda(lambda x: K.mean(x, axis=1),
output_shape=(EMBEDDING_DIM,))(context_embeddings)
# Shape output = (None, emb_size)
cbow = Reshape((1, EMBEDDING_DIM))(cbow)
# Shape output = (None, 1, emb_size)
# Context is multiplied (dot product) with current word and negative
# sampled words
word_context_product = merge([word_embedding, cbow], mode='dot')
# Shape output = (None, 1, 1)
word_context_product = Flatten()(word_context_product)
# Shape output = (None,1)
negative_context_product = merge([negative_words_embedding, cbow], mode='dot',dot_axes=[2,2])
# Shape output = (None, negative, 1)
negative_context_product = Flatten()(negative_context_product)
# Shape output = (None, negative)
Is it working? :)
The problem comes from the rigidity of TF regarding matrix multiplication. Merge with the "dot" mode calls the backend batch_dot() function and, as opposed to Theano, TensorFlow requires the matrices to have the same rank: read here.
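For instance, a minimal sketch of that constraint at the backend level (names are illustrative):
from keras import backend as K

a = K.placeholder(shape=(None, 1, 100))  # rank 3
b = K.placeholder(shape=(None, 1, 100))  # rank 3
prod = K.batch_dot(a, b, axes=[2, 2])    # works on TensorFlow: equal ranks -> (None, 1, 1)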