Injecting pre-trained word2vec vectors into TensorFlow seq2seq - python

I was trying to inject pre-trained word2vec vectors into an existing TensorFlow seq2seq model.
Following this answer, I produced the following code. But it doesn't seem to improve performance as it should, although the values in the variable are updated.
In my understanding, the error might be due to the fact that EmbeddingWrapper or embedding_attention_decoder creates the embeddings independently of the vocabulary order?
What would be the best way to load pretrained vectors into a TensorFlow model?
SOURCE_EMBEDDING_KEY = "embedding_attention_seq2seq/RNN/EmbeddingWrapper/embedding"
TARGET_EMBEDDING_KEY = "embedding_attention_seq2seq/embedding_attention_decoder/embedding"

def inject_pretrained_word2vec(session, word2vec_path, input_size, dict_dir, source_vocab_size, target_vocab_size):
    word2vec_model = word2vec.load(word2vec_path, encoding="latin-1")
    print("w2v model created!")
    session.run(tf.initialize_all_variables())

    assign_w2v_pretrained_vectors(session, word2vec_model, SOURCE_EMBEDDING_KEY, source_vocab_path, source_vocab_size)
    assign_w2v_pretrained_vectors(session, word2vec_model, TARGET_EMBEDDING_KEY, target_vocab_path, target_vocab_size)

def assign_w2v_pretrained_vectors(session, word2vec_model, embedding_key, vocab_path, vocab_size):
    vectors_variable = [v for v in tf.trainable_variables() if embedding_key in v.name]
    if len(vectors_variable) != 1:
        print("Word vector variable not found or too many. key: " + embedding_key)
        print("Existing embedding trainable variables:")
        print([v.name for v in tf.trainable_variables() if "embedding" in v.name])
        sys.exit(1)

    vectors_variable = vectors_variable[0]
    vectors = vectors_variable.eval()

    with gfile.GFile(vocab_path, mode="r") as vocab_file:
        counter = 0
        while counter < vocab_size:
            vocab_w = vocab_file.readline().replace("\n", "")
            # for each word in the vocabulary, check if a w2v vector exists and inject it;
            # otherwise don't change the value.
            if word2vec_model.__contains__(vocab_w):
                w2w_word_vector = word2vec_model.get_vector(vocab_w)
                vectors[counter] = w2w_word_vector
            counter += 1

    session.run([vectors_variable.initializer],
                {vectors_variable.initializer.inputs[1]: vectors})

I am not familiar with the seq2seq example, but in general you can use the following code snippet to inject your embeddings:
Where you build your graph:
with tf.device("/cpu:0"):
    embedding = tf.get_variable("embedding", [vocabulary_size, embedding_size])
    inputs = tf.nn.embedding_lookup(embedding, input_data)
When you execute (after building your graph and before starting the training), just assign your saved embeddings to the embedding variable:
session.run(tf.assign(embedding, embeddings_that_you_want_to_use))
The idea is that embedding_lookup will replace the input_data indices with the corresponding rows of the embedding variable.
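Putting the two pieces together, a minimal end-to-end sketch might look like this (written in TF 1.x graph-mode style; with older versions, as in the question, tf.initialize_all_variables() plays the same role, and pretrained_matrix stands in for a numpy array you build from your word2vec model):

import numpy as np
import tensorflow as tf

vocabulary_size = 10000   # assumed size of your vocabulary
embedding_size = 300      # assumed dimensionality of the word2vec vectors

# pretrained_matrix is a placeholder name: a numpy array of shape
# (vocabulary_size, embedding_size) built from your word2vec model
pretrained_matrix = np.random.rand(vocabulary_size, embedding_size).astype(np.float32)

input_data = tf.placeholder(tf.int32, shape=[None, None])  # batch of word ids

with tf.device("/cpu:0"):
    embedding = tf.get_variable("embedding", [vocabulary_size, embedding_size])
    inputs = tf.nn.embedding_lookup(embedding, input_data)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    # overwrite the randomly initialized embedding with the pretrained vectors
    session.run(tf.assign(embedding, pretrained_matrix))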

Related

Calculating embedding overload problems with BERT

I'm trying to calculate the embedding of a sentence using BERT. After I input the sentence into BERT, I compute the mean pooling of the output, which is used as the embedding of the sentence.
Problem
My code can calculate the embedding of sentences, but the computational cost is very high. I don't know what's wrong and I hope someone can help me.
Install BERT
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
Get Embedding Function
# get the word embedding from BERT
def get_word_embedding(text:str):
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # Batch size 1
    outputs = model(input_ids)
    last_hidden_states = outputs[1]
    # The last hidden-state is the first element of the output tuple
    return last_hidden_states[0]
Data
The maximum number of words in the text is 50. I calculate the entity+text embedding
Run code
entity_desc is my data.
It's this step that overloads my computer every time I run it.
Please help me!!!
I was using an 80 GB RAM machine in Colab.
entity_embedding = {}
for i in range(len(entity_desc)):
    entity = entity_desc['entity'][i]
    text = entity_desc['text'][i]
    entity += ' ' + text
    entity_embedding[entity_desc['entity_id'][i]] = get_word_embedding(entity)
I fixed the problem.
The reason for the memory overload was that I wasn't moving the tensors to the GPU, so I made the following changes to the code.
# device is assumed to be defined earlier, e.g. device = torch.device("cuda")
model = model.to(device)

import torch

# get the word embedding from BERT
def get_word_embedding(text:str):
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # Batch size 1
    input_ids = input_ids.to(device)
    outputs = model(input_ids)
    last_hidden_states = outputs[1]
    last_hidden_states = last_hidden_states.to(device)
    # The last hidden-state is the first element of the output tuple
    return last_hidden_states[0].detach().to(device)
You might be storing the sentence embedding on the GPU. Try to move it to the CPU before returning it.
# get the word embedding from BERT
def get_word_embedding(text:str):
    input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # Batch size 1
    outputs = model(input_ids)
    last_hidden_states = outputs[1]
    # The last hidden-state is the first element of the output tuple
    return last_hidden_states[0].detach().cpu()
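As a further usage sketch (this part is an assumption, not from the original answers): since the embeddings are only needed for inference, wrapping the loop in torch.no_grad() keeps PyTorch from building an autograd graph for every sentence, which is another common source of memory growth. entity_desc is assumed to be the pandas DataFrame from the question.

import torch

entity_embedding = {}
with torch.no_grad():  # no gradients needed when only extracting embeddings
    for i in range(len(entity_desc)):
        entity = entity_desc['entity'][i] + ' ' + entity_desc['text'][i]
        # get_word_embedding returns a CPU tensor (see the function above)
        entity_embedding[entity_desc['entity_id'][i]] = get_word_embedding(entity)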

LSTM network on pre trained word embedding gensim

I am new to deep learning. I am trying to make a very basic LSTM network on word embedding features. I have written the following code for the model but I am unable to run it.
from keras.layers import Dense, LSTM, merge, Input,Concatenate
from keras.layers.recurrent import LSTM
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Flatten
max_sequence_size = 14
classes_num = 2
LSTM_word_1 = LSTM(100, activation='relu',recurrent_dropout = 0.25, dropout = 0.25)
lstm_word_input_1 = Input(shape=(max_sequence_size, 300))
lstm_word_out_1 = LSTM_word_1(lstm_word_input_1)
merged_feature_vectors = Dense(50, activation='sigmoid')(Dropout(0.2)(lstm_word_out_1))
predictions = Dense(classes_num, activation='softmax')(merged_feature_vectors)
my_model = Model(input=[lstm_word_input_1], output=predictions)
print my_model.summary()
The error I am getting is ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (3019, 300). On searching, I found that people have used Flatten(), which will compress all the 2-D features (3019, 300) for the dense layer. But I am unable to fix the issue.
While explaining, kindly let me know how the dimensions work out.
Upon request:
My X_training had dimension issues, so I am providing the code below to clear up the confusion:
def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,), dtype="float32")
    #
    nwords = 0.
    #
    # Index2word is a list that contains the names of the words in
    # the model's vocabulary. Convert it to a set, for speed
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model[word])
    #
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec, nwords)
    return featureVec
I think the following code gives a 2-D numpy array, since I initialize it that way:
def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate
    # the average feature vector for each one and return a 2D numpy array
    #
    # Initialize a counter
    counter = 0.
    #
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        if counter % 1000. == 0.:
            print "Question %d of %d" % (counter, len(reviews))
        reviewFeatureVecs[int(counter)] = makeFeatureVec(review, model, num_features)
        counter = counter + 1.
    return reviewFeatureVecs
def getCleanReviews(reviews):
    clean_reviews = []
    for review in reviews["question"]:
        clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, remove_stopwords=True))
    return clean_reviews
My objective is just to use a gensim pretrained model for the LSTM on some comments that I have.
trainDataVecs = getAvgFeatureVecs( getCleanReviews(train), model, num_features )
You should try using an Embedding layer before the LSTM layer. Also, since you have pre-trained 300-dimensional vectors for 3019 comments, you can initialize the weights of the embedding layer with this matrix.
inp_layer = Input((maxlen,))
x = Embedding(max_features, embed_size, weights=[trainDataVecs])(inp_layer)
x = LSTM(50, dropout=0.1)(x)
Here, maxlen is the maximum length of your comments, max_features is the maximum number of unique words (the vocabulary size) of your dataset, and embed_size is the dimensionality of your vectors, which is 300 in your case.
Note that the shape of trainDataVecs should be (max_features, embed_size), so if you have pre-trained word vectors loaded into trainDataVecs, this should work.
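For completeness, here is a minimal sketch of how the pieces could fit together, assuming a gensim KeyedVectors object kv and a dictionary word_index mapping words to integer ids (both names are placeholders, not from the question); X and y stand for the padded integer sequences and one-hot labels:

import numpy as np
from keras.layers import Input, Embedding, LSTM, Dense, Dropout
from keras.models import Model

maxlen = 14            # maximum comment length, as in the question
embed_size = 300       # dimensionality of the pretrained vectors
max_features = len(word_index) + 1   # word_index: {word: integer id}, assumed to exist

# build the embedding matrix row by row from the gensim vectors
embedding_matrix = np.zeros((max_features, embed_size))
for word, i in word_index.items():
    if word in kv:
        embedding_matrix[i] = kv[word]

inp_layer = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp_layer)
x = LSTM(50, dropout=0.1)(x)
x = Dense(50, activation='sigmoid')(Dropout(0.2)(x))
predictions = Dense(2, activation='softmax')(x)

model = Model(inputs=inp_layer, outputs=predictions)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
# model.fit(X, y, epochs=5)   # X: padded integer sequences, y: one-hot labels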

Using pretrained gensim Word2vec embedding in keras

I have trained word2vec in gensim. In Keras, I want to use it to make a matrix of sentences using that word embedding. Storing the matrix of all the sentences is very space- and memory-inefficient, so I want to make an embedding layer in Keras to achieve this, so that it can be used in further layers (LSTM). Can you tell me in detail how to do this?
PS: It is different from other questions because I am using gensim for word2vec training instead of Keras.
Let's say you have the following data that you need to encode:
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
You must then tokenize it using the Tokenizer from Keras like this and find the vocab_size:
from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
You can then encode it to sequences like this:
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
You can then pad the sequences so that all the sequences are of a fixed length:
from keras.preprocessing.sequence import pad_sequences

max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
Then use the word2vec model to make embedding matrix
from numpy import asarray, zeros

# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory, skip first line
    file = open(filename, 'r')
    lines = file.readlines()[1:]
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is string word, value is numpy array for vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        weight_matrix[i] = embedding.get(word)
    return weight_matrix

# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, t.word_index)
Once you have the embedding matrix you can use it in Embedding layer like this
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
This layer can be used in making a model like this
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
All the code is adapted from this awesome blog post. Follow it to learn more about embeddings using GloVe.
For using word2vec, see this post.
With the new Gensim version this is pretty easy:
w2v_model.wv.get_keras_embedding(train_embeddings=False)
There you have your Keras embedding layer.
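A small usage sketch, assuming a gensim 3.x Word2Vec model named w2v_model (get_keras_embedding was removed in gensim 4.x); note that the inputs to this layer must be gensim's own word indices (w2v_model.wv.vocab[word].index), not indices from a Keras Tokenizer:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# Embedding layer whose weights are the gensim vectors
embedding_layer = w2v_model.wv.get_keras_embedding(train_embeddings=False)

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(50))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])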
My code for a gensim-trained w2v model. Assume all words trained in the w2v model are now in a list variable called all_words.
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding
import gensim
import pandas as pd
import numpy as np
from itertools import chain

w2v = gensim.models.Word2Vec.load("models/w2v.model")
vocab = w2v.wv.vocab
t = Tokenizer()
vocab_size = len(all_words) + 1
t.fit_on_texts(all_words)

def get_weight_matrix():
    # define weight matrix dimensions with all 0
    weight_matrix = np.zeros((vocab_size, w2v.vector_size))
    # step vocab, store vectors using the Tokenizer's integer mapping
    for i in range(len(all_words)):
        weight_matrix[i + 1] = w2v[all_words[i]]
    return weight_matrix

embedding_vectors = get_weight_matrix()
emb_layer = Embedding(vocab_size, output_dim=w2v.vector_size, weights=[embedding_vectors], input_length=FIXED_LENGTH, trainable=False)

Making predictions on text data using keras

I had trained and tested a CNN for sentiment analysis. The train and test data were prepared the same way, tokenizing the sentences and giving unique integers:
tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n')
tokenizer.fit_on_texts(text)
vocab_size = len(tokenizer.word_index) + 1
sequences = tokenizer.texts_to_sequences(text)
Then I used a pre-trained GloVe model to create the embedding matrix for the CNN:
filepath_glove = 'glove.twitter.27B.200d.txt'
glove_vocab = []
glove_embd = []
embedding_dict = {}

file = open(filepath_glove, 'r', encoding='UTF-8')
for line in file.readlines():
    row = line.strip().split(' ')
    vocab_word = row[0]
    glove_vocab.append(vocab_word)
    embed_vector = [float(i) for i in row[1:]]  # convert to list of float
    embedding_dict[vocab_word] = embed_vector
file.close()

for word, index in tokenizer.word_index.items():
    embedding_matrix[index] = embedding_dict[word]
At this point I also used the test sentences to create this matrix, which was later passed as weights into the embedding layer:
e = Embedding(vocab_size, 200, input_length=maxSeqLength, weights=[embedding_matrix], trainable=False)(inp)
Now I want to reload my model and test it with some new data, but that would mean the embedding matrix won't include some words from the new data. This makes me wonder whether I should not have included the test data while creating the embedding matrix in the first place. And if not, how does the embedding layer work for those new words? This part is similar to this question, but I couldn't find an answer:
How does the Keras Embedding Layer work if word is not found?
Thanks
It's quite simple. You are providing the vocab_size, which is the number of words the embedding layer knows. If you pass an index that is out of bounds of the vocab_size (a new word), it will be ignored, or an error will be thrown by Keras.
This answers your question regarding whether you should include all data for your embedding matrix: yes, you should.
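One common way to make unseen words explicit, shown here as a sketch rather than as part of the original answer, is to reserve a dedicated out-of-vocabulary token when fitting the Keras Tokenizer, so that new words at prediction time all map to one known index (and therefore to one row of the embedding matrix):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# oov_token reserves an index for words not seen during fitting
tokenizer = Tokenizer(filters='$%&()*/:;<=>#[\\]^`{|}~\t\n', oov_token='<OOV>')
tokenizer.fit_on_texts(text)  # text: the training sentences from the question
vocab_size = len(tokenizer.word_index) + 1

# new sentences at prediction time: unknown words become the <OOV> index
new_sequences = tokenizer.texts_to_sequences(["some unseen words here"])
new_padded = pad_sequences(new_sequences, maxlen=maxSeqLength)  # maxSeqLength as in the question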

PyTorch / Gensim - How do I load pre-trained word embeddings?

I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.
How do I get the embedding weights loaded by gensim into the PyTorch embedding layer?
I just wanted to report my findings about loading a gensim embedding with PyTorch.
Solution for PyTorch 0.4.0 and newer:
From v0.4.0 there is a new function from_pretrained() which makes loading an embedding very comfortable.
Here is an example from the documentation.
import torch
import torch.nn as nn
# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)
The weights from gensim can easily be obtained by:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors) # formerly syn0, which is soon deprecated
As noted by @Guglie: in newer gensim versions the weights can be obtained from model.wv:
weights = torch.FloatTensor(model.wv.vectors)
Solution for PyTorch version 0.3.1 and older:
I'm using version 0.3.1 and from_pretrained() isn't available in this version.
Therefore I created my own from_pretrained so I can also use it with 0.3.1.
Code for from_pretrained for PyTorch versions 0.3.1 or lower:
def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
        'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding
The embedding can then be loaded just like this:
embedding = from_pretrained(weights)
I hope this is helpful for someone.
I think it is easy. Just copy the embedding weights from gensim to the corresponding weights in the PyTorch embedding layer.
You need to make sure two things are correct: first, the weight shape has to be correct; second, the weights have to be converted to the PyTorch FloatTensor type.
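A minimal sketch of that idea, assuming a gensim Word2Vec model named w2v_model (the variable names are placeholders, not from the answer):

import torch
import torch.nn as nn

# gensim stores the vectors as a (vocab_size, embedding_dim) float32 numpy array
vectors = w2v_model.wv.vectors

embedding = nn.Embedding(num_embeddings=vectors.shape[0],
                         embedding_dim=vectors.shape[1])
# copy the gensim weights into the layer; from_numpy on a float32 array
# yields a FloatTensor, so both requirements above are satisfied
embedding.weight.data.copy_(torch.from_numpy(vectors))
embedding.weight.requires_grad = False  # optional: freeze the embeddings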
I had the same question except that I use torchtext library with pytorch as it helps with padding, batching, and other things. This is what I've done to load pre-trained embeddings with torchtext 0.3.0 and to pass them to pytorch 0.4.1 (the pytorch part uses the method mentioned by blue-phoenox):
import torch
import torch.nn as nn
import torchtext.data as data
import torchtext.vocab as vocab
# use torchtext to define the dataset field containing text
text_field = data.Field(sequential=True)
# load your dataset using torchtext, e.g.
dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])
# build vocabulary
text_field.build_vocab(dataset)
# I use embeddings created with
# model = gensim.models.Word2Vec(...)
# model.wv.save_word2vec_format(path_to_embeddings_file)
# load embeddings using torchtext
vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
# when defining your network you can then use the method mentioned by blue-phoenox
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))
# pass data to the layer
dataset_iter = data.Iterator(dataset, ...)
for batch in dataset_iter:
    ...
    embedding(batch.text)
from gensim.models import Word2Vec

model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)
# gensim model created

import torch
import torch.nn as nn

weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
Had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"
I just saved the file to txt format and then followed the superb tutorial on loading custom word embeddings.
import os
from os.path import basename

import torch
from gensim.models import KeyedVectors
from torchtext import vocab

def convert_bin_emb_txt(out_path, emb_file):
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file

emb_txt_file = convert_bin_emb_txt(out_path, emb_bin_file)

custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)
Tested with PyTorch 1.2.0 and TorchText 0.4.0.
I added this answer because with the accepted answer I was not sure how to follow the linked tutorial, initialize all words not in the embeddings using the normal distribution, and make the <unk> and <pad> vectors equal to zero.
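For reference, a short sketch of that last step (zeroing the special-token vectors after build_vocab; EMBEDDING_DIM is a placeholder for the dimensionality of the loaded embeddings):

import torch

EMBEDDING_DIM = 300  # assumed dimensionality of the loaded embeddings

UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

# overwrite the (normally distributed) rows for <unk> and <pad> with zeros
TEXT.vocab.vectors[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
TEXT.vocab.vectors[PAD_IDX] = torch.zeros(EMBEDDING_DIM)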
I had quite some problems understanding the documentation myself and there aren't that many good examples around. Hopefully this example helps other people. It is a simple classifier that takes the pretrained embeddings in the matrix_embeddings argument. By setting requires_grad to False we make sure that we are not changing them.
class InferClassifier(nn.Module):

    def __init__(self, input_dim, n_classes, matrix_embeddings):
        """initializes a 2 layer MLP for classification.
        There are no non-linearities in the original code, Katia instructed us
        to use tanh instead"""
        super(InferClassifier, self).__init__()

        # dimensionalities
        self.input_dim = input_dim
        self.n_classes = n_classes
        self.hidden_dim = 512

        # embedding
        self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
        self.embeddings.requires_grad = False

        # creates a MLP
        self.classifier = nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Tanh(),  # not present in the original code.
            nn.Linear(self.hidden_dim, self.n_classes))

    def forward(self, sentence):
        """forward pass of the classifier
        I am not sure it is necessary to make this explicit."""
        # get the embeddings for the inputs
        u = self.embeddings(sentence)
        # forward to the classifier
        return self.classifier(u)
sentence is a tensor of indices into matrix_embeddings instead of words.
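A hypothetical usage sketch (the shapes and names are assumptions, not from the answer); note that with the module exactly as written, input_dim has to equal the embedding dimension and the classifier produces one prediction per token:

import torch

# matrix_embeddings: FloatTensor of shape (vocab_size, embedding_dim),
# e.g. built from gensim as in the answers above (w2v_model is assumed)
matrix_embeddings = torch.FloatTensor(w2v_model.wv.vectors)

model = InferClassifier(input_dim=matrix_embeddings.size(1),
                        n_classes=3,
                        matrix_embeddings=matrix_embeddings)

# a batch of two "sentences" given as word indices into matrix_embeddings
sentences = torch.LongTensor([[1, 5, 2], [4, 0, 7]])
logits = model(sentences)  # shape: (2, 3, n_classes) with the module as written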
