Calculating embedding overload problems with BERT - python

I'm trying to calculate the embedding of a sentence using BERT. After I feed the sentence into BERT, I take the mean-pooled output and use it as the sentence embedding.
Problem
My code can compute sentence embeddings, but the memory cost is very high. I don't know what's wrong, and I hope someone can help me.
Load BERT
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
Get Embedding Function
# get the word embedding from BERT
def get_word_embedding(text: str):
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # Batch size 1
    outputs = model(input_ids)
    last_hidden_states = outputs[1]
    # The last hidden-state is the first element of the output tuple
    return last_hidden_states[0]
Data
The maximum number of words in a text is 50. I calculate the embedding of entity + text.
Run code
entity_desc is my data.
It's this step that overloads my machine every time I run it.
Please help me!
I was using a Colab machine with 80 GB of RAM.
entity_embedding = {}
for i in range(len(entity_desc)):
    entity = entity_desc['entity'][i]
    text = entity_desc['text'][i]
    entity += ' ' + text
    entity_embedding[entity_desc['entity_id'][i]] = get_word_embedding(entity)

I fixed the problem.
The reason for the memory overload was that I wasn't moving the tensors to the GPU (and I wasn't detaching them from the computation graph), so I made the following changes to the code.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# get the word embedding from BERT
def get_word_embedding(text: str):
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # Batch size 1
    input_ids = input_ids.to(device)
    outputs = model(input_ids)
    last_hidden_states = outputs[1]
    last_hidden_states = last_hidden_states.to(device)
    # The last hidden-state is the first element of the output tuple
    return last_hidden_states[0].detach().to(device)

You might be storing the sentence embeddings on the GPU. Try moving each one to the CPU before returning it.
# get the word embedding from BERT
def get_word_embedding(text: str):
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # Batch size 1
    outputs = model(input_ids)
    last_hidden_states = outputs[1]
    # The last hidden-state is the first element of the output tuple
    return last_hidden_states[0].detach().cpu()
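For reference, here is a minimal sketch that combines both suggestions with the mean-pooling described in the question: the forward pass runs under torch.no_grad() so no autograd graph is kept around, and the pooled vector is moved to the CPU before it is stored. The attention-mask handling and the device selection are assumptions, not part of the original code.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def get_sentence_embedding(text: str):
    encoded = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():  # no gradients, so no computation graph is retained
        outputs = model(**encoded)
    hidden = outputs.last_hidden_state                     # (1, seq_len, hidden_size)
    mask = encoded["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens only
    return pooled.squeeze(0).cpu()                         # move off the GPU before storing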

Related

Huggingface: How to use bert-large-uncased in hugginface for long text classification?

I am trying to use bert-large-uncased for long sequence encoding, but it gives an error.
Code:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained("bert-large-uncased")
text = "Replace me by any text you'd like."*1024
encoded_input = tokenizer(text, truncation=True, max_length=1024, return_tensors='pt')
output = model(**encoded_input)
It gives the following error:
~/.local/lib/python3.6/site-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
218 if self.position_embedding_type == "absolute":
219 position_embeddings = self.position_embeddings(position_ids)
--> 220 embeddings += position_embeddings
221 embeddings = self.LayerNorm(embeddings)
222 embeddings = self.dropout(embeddings)
RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 1
I also tried to change the default size of the positional embedding:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained("bert-large-uncased")
model.config.max_position_embeddings = 1024
text = "Replace me by any text you'd like."*1024
encoded_input = tokenizer(text, truncation=True, max_length=1024, return_tensors='pt')
output = model(**encoded_input)
But the error persists. How can I use the large model for sequences of length 1024?
I might be wrong, but I think you already have your answers here: How to use Bert for long text classification?
Basically you will need some kind of truncation on your text, or you will need to handle it in chunks and stitch them back together (a rough sketch of the chunking idea follows below).
Side note: the large model is not called large because of its sequence length; the max sequence length is still 512 tokens (tokens from your tokenizer, not words in your sentence).
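A minimal sketch of the chunking idea, assuming you mean-pool each 512-token window and then average the windows back together. The window size, the pooling strategy, and the function name are assumptions; adapt them to your task.
import torch

def encode_long_text(text, tokenizer, model, window=512):
    # Tokenize without truncation, then split into windows that leave room for [CLS]/[SEP]
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = window - 2
    chunks = [ids[i:i + step] for i in range(0, len(ids), step)]
    chunk_vectors = []
    for chunk in chunks:
        input_ids = torch.tensor([[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
        with torch.no_grad():
            out = model(input_ids)
        chunk_vectors.append(out.last_hidden_state.mean(dim=1))  # mean-pool one window
    return torch.cat(chunk_vectors).mean(dim=0)  # stitch the windows back together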
EDIT:
The pretrained model you would like to use was trained on a maximum of 512 tokens. When you download it from Hugging Face, you can see max_position_embeddings in the configuration, which is 512. That means you cannot really extend it (actually, that is not entirely true).
However, you can always tweak your configuration.
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
model = BertModel.from_pretrained(
    'bert-large-uncased',
    max_position_embeddings=1024,
    ignore_mismatched_sizes=True
)
Note that this is very ill-advised, since it will ruin your pretrained model: the resized position embeddings are randomly initialized. Maybe it will go rogue, planets will start to collide, or pigs will start to fall out of the sky. No one can really tell. Use it at your own risk.

How to use a batch size bigger than zero in Bert Sequence Classification

The Hugging Face documentation describes how to do sequence classification using a BERT model:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0) # Batch size 1
labels = torch.tensor([1]).unsqueeze(0) # Batch size 1
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
However, there is only an example for batch size 1. How do I implement it when I have a list of phrases and want to use a bigger batch size?
In that example, unsqueeze is used to add a dimension to the input/labels so that they form an array of size (batch_size, sequence_length). If you want to use a batch size > 1, you can build an array of sequences instead, as in the following example:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
sequences = ["Hello, my dog is cute", "My dog is cute as well"]
input_ids = torch.tensor([tokenizer.encode(sequence, add_special_tokens=True) for sequence in sequences])
labels = torch.tensor([[1], [0]]) # Labels depend on the task
outputs = model(input_ids, labels=labels)
loss, logits = outputs[:2]
In that example, both sequences get encoded to the same number of tokens, so it's easy to build a tensor containing both. If they had a differing number of tokens, you would need to pad the sequences and tell the model which tokens it should attend to (so that it ignores the padded values) using an attention mask.
There is an entry in the glossary concerning attention masks which explains their purpose and usage. You pass this attention mask to the model when calling its forward method.
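A minimal sketch of the variable-length case, letting the tokenizer do the padding and build the attention mask (the example sentences and labels are just illustrative):
from transformers import BertTokenizer, BertForSequenceClassification
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

sequences = ["Hello, my dog is cute", "A much longer sentence that needs more tokens"]
# padding=True pads to the longest sequence and returns the matching attention_mask
encoded = tokenizer(sequences, padding=True, return_tensors='pt')
labels = torch.tensor([1, 0])  # labels depend on the task

outputs = model(input_ids=encoded['input_ids'],
                attention_mask=encoded['attention_mask'],
                labels=labels)
loss, logits = outputs[:2]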

LSTM network on pre trained word embedding gensim

I am new to deep learning. I am trying to build a very basic LSTM network on word embedding features. I have written the following code for the model, but I am unable to run it.
from keras.layers import Dense, LSTM, merge, Input,Concatenate
from keras.layers.recurrent import LSTM
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Flatten
max_sequence_size = 14
classes_num = 2
LSTM_word_1 = LSTM(100, activation='relu',recurrent_dropout = 0.25, dropout = 0.25)
lstm_word_input_1 = Input(shape=(max_sequence_size, 300))
lstm_word_out_1 = LSTM_word_1(lstm_word_input_1)
merged_feature_vectors = Dense(50, activation='sigmoid')(Dropout(0.2)(lstm_word_out_1))
predictions = Dense(classes_num, activation='softmax')(merged_feature_vectors)
my_model = Model(input=[lstm_word_input_1], output=predictions)
print my_model.summary()
The error I am getting is ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (3019, 300). While searching, I found that people have used Flatten() to compress the 2-D features (3019, 300) for the dense layer, but I am unable to fix the issue.
While explaining, kindly let me know how the dimensions work out.
Upon request:
My X_training had dimension issues, so I am providing the code below to clear up the confusion:
def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0.
    # index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed
    index2word_set = set(model.wv.index2word)
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model[word])
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec, nwords)
    return featureVec
I think the following code gives a 2-D numpy array, since I am initializing it that way:
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Initialize a counter
    counter = 0.
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        if counter % 1000. == 0.:
            print "Question %d of %d" % (counter, len(reviews))
        reviewFeatureVecs[int(counter)] = makeFeatureVec(review, model, num_features)
        counter = counter + 1.
    return reviewFeatureVecs

def getCleanReviews(reviews):
    clean_reviews = []
    for review in reviews["question"]:
        clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, remove_stopwords=True))
    return clean_reviews
My objective is just to use a pretrained gensim model with an LSTM on some comments that I have.
trainDataVecs = getAvgFeatureVecs(getCleanReviews(train), model, num_features)
You should try using an Embedding layer before the LSTM layer. Also, since you have pre-trained 300-dimensional vectors for 3019 comments, you can initialize the weights of the embedding layer with this matrix.
inp_layer = Input((maxlen,))
x = Embedding(max_features, embed_size, weights=[trainDataVecs])(inp_layer)
x = LSTM(50, dropout=0.1)(x)
Here, maxlen is the maximum length of your comments, max_features is the maximum number of unique words (the vocabulary size of your dataset), and embed_size is the dimensionality of your vectors, which is 300 in your case.
Note that the shape of trainDataVecs should be (max_features, embed_size), so if you have the pre-trained word vectors loaded into trainDataVecs, this should work.
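A minimal end-to-end sketch of that idea. The embedding matrix, the vocabulary size, and the two output classes below are assumptions built from the question's gensim model, and the matrix rows must line up with whatever word-to-index mapping you use to turn comments into integer sequences:
import numpy as np
from keras.layers import Input, Embedding, LSTM, Dense, Dropout
from keras.models import Model

maxlen = 14            # max_sequence_size from the question
embed_size = 300       # dimensionality of the gensim vectors
max_features = len(model.wv.index2word)  # vocabulary size of the gensim model

# Build the (max_features, embed_size) matrix, one row per vocabulary word
embedding_matrix = np.zeros((max_features, embed_size), dtype="float32")
for i, word in enumerate(model.wv.index2word):
    embedding_matrix[i] = model.wv[word]

inp_layer = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp_layer)
x = LSTM(50, dropout=0.1)(x)
x = Dropout(0.2)(x)
predictions = Dense(2, activation='softmax')(x)

my_model = Model(inputs=inp_layer, outputs=predictions)
my_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])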

Making predictions on text data using keras

I trained and tested a CNN for sentiment analysis. The train and test data were prepared the same way, by tokenizing the sentences and assigning unique integers:
tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(text)
Then I used a pre-trained GloVe model to create the embedding matrix for the CNN:
filepath_glove = 'glove.twitter.27B.200d.txt'
glove_vocab = []
glove_embd = []
embedding_dict = {}
file = open(filepath_glove, 'r', encoding='UTF-8')
for line in file.readlines():
    row = line.strip().split(' ')
    vocab_word = row[0]
    glove_vocab.append(vocab_word)
    embed_vector = [float(i) for i in row[1:]]  # convert to list of float
    embedding_dict[vocab_word] = embed_vector
file.close()

for word, index in tokenizer.word_index.items():
    embedding_matrix[index] = embedding_dict[word]
At this point I also used the test sentences to create this matrix, which was later passed as weights into the embedding layer:
e= Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)
Now I want to reload my model and test it with some new data, but that would mean the embedding matrix won't include some words from the new data. This makes me wonder whether I should have included the test data when creating the embedding matrix in the first place. And if not, how does the embedding layer work for those new words? This part is similar to this question, but I couldn't find an answer:
How does the Keras Embedding Layer work if word is not found?
Thanks
It's quite simple. You are providing vocab_size, which is the number of words the embedding layer knows. If you pass an index that is out of bounds of vocab_size (a new word), it will either be ignored or an error will be thrown by Keras.
This answers your question about whether you should include all data in your embedding matrix: yes, you should.
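In practice, a common way to keep unseen words from being silently dropped is to reserve an out-of-vocabulary index when fitting the tokenizer. A minimal sketch of that approach (the oov_token choice and the zero rows for unknown words are assumptions, not part of the original code):
import numpy as np
from keras.preprocessing.text import Tokenizer

# Every word the tokenizer has never seen is mapped to the reserved <OOV> index
tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n', oov_token='<OOV>')
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1

embedding_matrix = np.zeros((vocab_size, 200))   # 200 = GloVe dimension used above
for word, index in tokenizer.word_index.items():
    vector = embedding_dict.get(word)            # unknown words (incl. <OOV>) keep a zero row
    if vector is not None:
        embedding_matrix[index] = vector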

Injecting pre-trained word2vec vectors into TensorFlow seq2seq

I was trying to inject pretrained word2vec vectors into an existing TensorFlow seq2seq model.
Following this answer, I produced the following code. But it doesn't seem to improve performance as it should, although the values in the variable are updated.
My understanding is that the problem might be that EmbeddingWrapper or embedding_attention_decoder creates its embeddings independently of the vocabulary order. Is that right?
What would be the best way to load pretrained vectors into tensorflow model?
SOURCE_EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
TARGET_EMBEDDING_KEY = "embedding_attention_seq2seq/embedding_attention_decoder/embedding"

def inject_pretrained_word2vec(session, word2vec_path, input_size, dict_dir, source_vocab_size, target_vocab_size):
    word2vec_model = word2vec.load(word2vec_path, encoding="latin-1")
    print("w2v model created!")
    session.run(tf.initialize_all_variables())
    assign_w2v_pretrained_vectors(session, word2vec_model, SOURCE_EMBEDDING_KEY, source_vocab_path, source_vocab_size)
    assign_w2v_pretrained_vectors(session, word2vec_model, TARGET_EMBEDDING_KEY, target_vocab_path, target_vocab_size)

def assign_w2v_pretrained_vectors(session, word2vec_model, embedding_key, vocab_path, vocab_size):
    vectors_variable = [v for v in tf.trainable_variables() if embedding_key in v.name]
    if len(vectors_variable) != 1:
        print("Word vector variable not found or too many. key: " + embedding_key)
        print("Existing embedding trainable variables:")
        print([v.name for v in tf.trainable_variables() if "embedding" in v.name])
        sys.exit(1)
    vectors_variable = vectors_variable[0]
    vectors = vectors_variable.eval()
    with gfile.GFile(vocab_path, mode="r") as vocab_file:
        counter = 0
        while counter < vocab_size:
            vocab_w = vocab_file.readline().replace("\n", "")
            # for each word in the vocabulary, check if a w2v vector exists and inject it;
            # otherwise don't change the value
            if word2vec_model.__contains__(vocab_w):
                w2w_word_vector = word2vec_model.get_vector(vocab_w)
                vectors[counter] = w2w_word_vector
            counter += 1
    session.run([vectors_variable.initializer],
                {vectors_variable.initializer.inputs[1]: vectors})
I am not familiar with the seq2seq example, but in general you can use the following code snippet to inject your embeddings:
Where you build your graph:
with tf.device("/cpu:0"):
    embedding = tf.get_variable("embedding", [vocabulary_size, embedding_size])
    inputs = tf.nn.embedding_lookup(embedding, input_data)
When you execute (after building your graph and before starting the training), just assign your saved embeddings to the embedding variable:
session.run(tf.assign(embedding, embeddings_that_you_want_to_use))
The idea is that embedding_lookup will replace the input_data indices with the corresponding vectors from the embedding variable.
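One practical note: if the pretrained matrix is large, feeding it through a placeholder keeps it from being baked into the graph as a constant. A minimal TF1-style sketch (the placeholder and op names are assumptions):
import tensorflow as tf

# Build an assign op once, then feed the pretrained matrix at run time
embedding_ph = tf.placeholder(tf.float32, shape=[vocabulary_size, embedding_size])
assign_embedding_op = tf.assign(embedding, embedding_ph)

# After building the graph and creating the session:
session.run(assign_embedding_op,
            feed_dict={embedding_ph: embeddings_that_you_want_to_use})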
