I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.
How do I get the embedding weights loaded by gensim into the PyTorch embedding layer?
I just wanted to report my findings about loading a gensim embedding with PyTorch.
Solution for PyTorch 0.4.0 and newer:
From v0.4.0 there is a new function from_pretrained() which makes loading an embedding very comfortable.
Here is an example from the documentation.
import torch
import torch.nn as nn
# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)
The weights from gensim can easily be obtained by:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors) # formerly syn0, which is soon deprecated
As noted by @Guglie: in newer gensim versions the weights live under model.wv, so they can be obtained with:
weights = torch.FloatTensor(model.wv.vectors)
Solution for PyTorch version 0.3.1 and older:
I'm using version 0.3.1 and from_pretrained() isn't available in this version.
Therefore I created my own from_pretrained so I can also use it with 0.3.1.
Code for from_pretrained for PyTorch versions 0.3.1 or lower:
def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
        'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding
The embedding can then be loaded just like this:
embedding = from_pretrained(weights)
I hope this is helpful for someone.
I think it is easy. Just copy the embedding weights from gensim into the corresponding weights of the PyTorch embedding layer.
You need to make sure two things are correct: first, the weight shape has to match; second, the weights have to be converted to the PyTorch FloatTensor type.
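A minimal sketch of that manual copy (the sizes are made up here, and a random numpy array stands in for the gensim matrix model.wv.vectors):

```python
import numpy as np
import torch
import torch.nn as nn

# stand-in for model.wv.vectors; a real run would use the gensim matrix
weights_np = np.random.rand(10, 4).astype("float32")

rows, cols = weights_np.shape
emb = nn.Embedding(num_embeddings=rows, embedding_dim=cols)
with torch.no_grad():
    # the FloatTensor conversion happens here; shapes must already match
    emb.weight.copy_(torch.from_numpy(weights_np))

# row i of the layer now equals row i of the gensim-style matrix
assert torch.allclose(emb.weight[3], torch.from_numpy(weights_np[3]))
```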
I had the same question except that I use torchtext library with pytorch as it helps with padding, batching, and other things. This is what I've done to load pre-trained embeddings with torchtext 0.3.0 and to pass them to pytorch 0.4.1 (the pytorch part uses the method mentioned by blue-phoenox):
import torch
import torch.nn as nn
import torchtext.data as data
import torchtext.vocab as vocab
# use torchtext to define the dataset field containing text
text_field = data.Field(sequential=True)
# load your dataset using torchtext, e.g.
dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])
# build vocabulary
text_field.build_vocab(dataset)
# I use embeddings created with
# model = gensim.models.Word2Vec(...)
# model.wv.save_word2vec_format(path_to_embeddings_file)
# load embeddings using torchtext
vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
# when defining your network you can then use the method mentioned by blue-phoenox
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))
# pass data to the layer
dataset_iter = data.Iterator(dataset, ...)
for batch in dataset_iter:
    ...
    embedding(batch.text)
from gensim.models import Word2Vec
model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)
# gensim model created
import torch
import torch.nn as nn
weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
Had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"
I just saved the file in txt format and then followed the superb tutorial on loading custom word embeddings.
import os
from os.path import basename
from gensim.models import KeyedVectors

def convert_bin_emb_txt(out_path, emb_file):
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file
emb_txt_file = convert_bin_emb_txt(out_path,emb_bin_file)
custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)
tested for: PyTorch: 1.2.0 and TorchText: 0.4.0.
I added this answer because, with the accepted answer, I was not sure how to follow the linked tutorial, initialize all words not present in the embeddings from a normal distribution, and make the vectors for `<unk>` and `<pad>` equal to zero.
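For completeness, here is a hedged sketch of that zeroing step. The index positions 0 and 1 for `<unk>` and `<pad>` are torchtext's usual defaults and are assumed here; a random tensor stands in for TEXT.vocab.vectors:

```python
import torch

vectors = torch.randn(100, 300)   # stand-in for TEXT.vocab.vectors
UNK_IDX, PAD_IDX = 0, 1           # assumed default torchtext special-token indexes
vectors[UNK_IDX] = torch.zeros(300)
vectors[PAD_IDX] = torch.zeros(300)
```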
I had quite some problems understanding the documentation myself, and there aren't that many good examples around. Hopefully this example helps other people. It is a simple classifier that takes the pretrained embeddings in matrix_embeddings. By setting requires_grad to False we make sure that we are not changing them.
class InferClassifier(nn.Module):
    def __init__(self, input_dim, n_classes, matrix_embeddings):
        """initializes a 2 layer MLP for classification.
        There are no non-linearities in the original code, Katia instructed us
        to use tanh instead"""
        super(InferClassifier, self).__init__()
        # dimensionalities
        self.input_dim = input_dim
        self.n_classes = n_classes
        self.hidden_dim = 512
        # embedding
        self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
        self.embeddings.weight.requires_grad = False
        # creates a MLP
        self.classifier = nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Tanh(),  # not present in the original code
            nn.Linear(self.hidden_dim, self.n_classes))

    def forward(self, sentence):
        """forward pass of the classifier
        I am not sure it is necessary to make this explicit."""
        # get the embeddings for the inputs
        u = self.embeddings(sentence)
        # forward to the classifier
        return self.classifier(u)
sentence is a vector of indexes into matrix_embeddings, not the words themselves.
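To illustrate what that means, with a made-up vocabulary (the words and index values here are purely hypothetical):

```python
stoi = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}   # hypothetical string-to-index map
indexes = [stoi[w] for w in ["the", "cat", "sat"]]
# wrap this in torch.LongTensor(indexes) before passing it to the classifier
assert indexes == [1, 2, 3]
```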
I am trying to import a pretrained model from Huggingface's transformers library and extend it with a few layers for classification using tensorflow keras. When I directly use transformers model (Method 1), the model trains well and reaches a validation accuracy of 0.93 after 1 epoch. However, when trying to use the model as a layer within a tf.keras model (Method 2), the model can't get above 0.32 accuracy. As far as I can tell based on the documentation, the two approaches should be equivalent. My goal is to get Method 2 working so that I can add more layers to it instead of directly using the logits produced by Huggingface's classifier head but I'm stuck at this stage.
import tensorflow as tf
from transformers import TFRobertaForSequenceClassification
Method 1:
model = TFRobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=6)
Method 2:
input_ids = tf.keras.Input(shape=(128,), dtype='int32')
attention_mask = tf.keras.Input(shape=(128, ), dtype='int32')
transformer = TFRobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=6)
encoded = transformer([input_ids, attention_mask])
logits = encoded[0]
model = tf.keras.models.Model(inputs = [input_ids, attention_mask], outputs = logits)
The rest of the code for either method is identical:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])
I am using Tensorflow 2.3.0 and have tried with transformers versions 3.5.0 and 4.0.0.
Answering my own question here. I posted a bug report on HuggingFace GitHub and they fixed this in the new dev version (4.1.0.dev0 as of December 2020). The snippet below now works as expected:
input_ids = tf.keras.Input(shape=(128,), dtype='int32')
attention_mask = tf.keras.Input(shape=(128, ), dtype='int32')
transformer = TFRobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=6)
encoded = transformer({"input_ids": input_ids, "attention_mask": attention_mask})
logits = encoded[0]
model = tf.keras.models.Model(inputs = {"input_ids": input_ids, "attention_mask": attention_mask}, outputs = logits)
I'm trying to create document context vectors from sentence vectors via an LSTM using Keras (so each document consists of a sequence of sentence vectors).
My goal is to replicate the following blog post using keras: https://andriymulyar.com/blog/bert-document-classification
I have a (toy) tensor that looks like this: X = np.array(features).reshape(5, 200, 768). So 5 documents, each with a sequence of 200 sentence vectors, and each sentence vector having 768 features.
So to get an embedding from my sentence vectors, I encoded my documents as one-hot vectors to learn an LSTM:
y = [1,2,3,4,5] # 5 documents in toy-tensor
y = np.array(y)
yy = to_categorical(y)
yy = yy[0:5,1:6]
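What that slicing does can be checked with numpy alone (a sketch; for labels in 0..5, to_categorical is equivalent to indexing an identity matrix):

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5])
yy = np.eye(6)[y]        # same one-hot rows as to_categorical(y)
yy = yy[0:5, 1:6]        # drop the unused class-0 column -> one row per document
assert yy.shape == (5, 5)
assert (yy == np.eye(5)).all()   # each document gets its own one-hot label
```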
So far, my code looks like this:
inputs1=Input(shape=(200,768))
lstm1, states_h, states_c =LSTM(5,dropout=0.3,recurrent_dropout=0.2, return_state=True)(inputs1)
model1=Model(inputs1,lstm1)
model1.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])
model1.summary()
model1.fit(x=X,y=yy,batch_size=100,epochs=10,verbose=1,shuffle=True,validation_split=0.2)
When I print states_h I get a tensor of shape=(?, 5) and I don't really know how to access the vectors inside the tensor, which should represent my documents.
print(states_h)
Tensor("lstm_51/while/Exit_3:0", shape=(?, 5), dtype=float32)
Or am I doing something wrong? To my understanding there should be 5 document vectors, e.g. doc1=[...]; ...; doc5=[...], so that I can reuse the document vectors for a classification task.
Well, printing a tensor shows exactly this: it's a tensor, it has that shape and that type.
If you want to see data, you need to feed data.
States are not weights, they are not persistent, they only exist with input data, just as any other model output.
You should create a model that outputs this information (yours doesn't) in order to grab it. You can have two models:
#this is the model you compile and train - exactly as you are already doing
training_model = Model(inputs1,lstm1)
#this is just for getting the states, nothing else, don't compile, don't train
state_getting_model = Model(inputs1, [lstm1, states_h, states_c])
(Don't worry, these two models will share the same weights and be updated together, even if you only train the training_model)
Now you can:
With eager mode off (and probably "on" too):
lstm_out, states_h_out, states_c_out = state_getting_model.predict(X)
print(states_h_out)
print(states_c_out)
With eager mode on:
lstm_out, states_h_out, states_c_out = state_getting_model(X)
print(states_h_out.numpy())
print(states_c_out.numpy())
TF 1.x with tf.keras (Tested with TF 1.15)
Keras does operations using symbolic tensors. Therefore, print(states_h) won't give you anything unless you pass data to the placeholders states_h depends on (in this case inputs1). You can do that as follows.
import tensorflow.keras.backend as K
inputs1=Input(shape=(200,768))
lstm1, states_h, states_c =LSTM(5,dropout=0.3,recurrent_dropout=0.2, return_state=True)(inputs1)
model1=Model(inputs1,lstm1)
model1.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])
model1.summary()
model1.fit(x=X,y=yy,batch_size=100,epochs=10,verbose=1,shuffle=True,validation_split=0.2)
sess = K.get_session()
out = sess.run(states_h, feed_dict={inputs1:X})
Then out will be (batch_size, 5) sized output.
TF 2.x with tf.keras
The above code won't work as it is. And I still haven't found how to get this to work with TF 2.0 (even though TF 2.0 will still produce a placeholder according to docs). I will edit my answer when I find how to fix this for TF 2.x.
I'm quite new to TensorFlow and trying to do a multitask classification with BERT (I have done this with GloVe in another part of the project). My problem is with the concept of placeholder in TensorFlow. I know that it is just a placeholder of some variables and will be filled. See this is the part of my classification model that I have problem with. I'll explain the exact problem down here.
def bert_emb_lookup(input_ids):
    # TODO to be implemented;
    """
    X is the input IDs, but a placeholder
    """
    pass

class BertClassificationModel(object):
    def __init__(self, num_class, args):
        self.embedding_size = args.embedding_size
        self.num_layers = args.num_layers
        self.num_hidden = args.num_hidden
        self.input_ids = tf.placeholder(tf.int32, [None, args.max_document_len])
        self.Y1 = tf.placeholder(tf.int32, [None])
        self.Y2 = tf.placeholder(tf.int32, [None])
        self.dropout = tf.placeholder(tf.float64, [])
        self.input_len = tf.reduce_sum(tf.sign(self.input_ids), 1)

        with tf.name_scope("embedding"):
            self.input_emb = bert_emb_lookup(self.input_ids)
        ...
It was easy to get the word embeddings from GloVe; I first loaded the glove vectors and then simply used tf.nn.embedding_lookup(embeddings, self.input_ids) to fetch the embeddings.
So in the BERT classification model, I'm trying to do something similar by defining a function whose argument is input_ids, where I want to match input IDs with their associated vocab (strings). Thereafter, I'll use an API (bert-as-service) that gives BERT embeddings for any given list of strings at string level/token level. The problem is that since self.input_ids is just a placeholder, it shows up as a NULL object. Is there any workaround that helps me with this?
Thanks!
You cannot use bert-as-service as a tensor directly. So you have two options:
Use bert-as-service to look up the embeddings. You give the sentences as input and get a numpy array of embeddings as output. You then feed the numpy array of embeddings to a placeholder: self.embeddings = tf.placeholder(tf.float32, [None, 768])
Then you use self.embeddings wherever you would have used tf.nn.embedding_lookup(embeddings, self.input_ids).
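As a sketch of why that substitution works (numpy only; the sizes here are made up), the lookup is just a row gather, so precomputed bert-as-service vectors fed through a placeholder behave the same way:

```python
import numpy as np

emb = np.random.rand(5, 768).astype("float32")   # stand-in for per-token BERT vectors
ids = np.array([0, 2, 2])
looked_up = emb[ids]      # the numpy equivalent of tf.nn.embedding_lookup(emb, ids)
assert looked_up.shape == (3, 768)
assert (looked_up[1] == looked_up[2]).all()      # repeated id -> identical rows
```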
The other option is probably overkill in this case, but it may give you some context. Here you do not use bert-as-service to get embeddings. Instead, you use the bert model graph directly. You would use BertModel from https://github.com/google-research/bert/blob/master/run_classifier.py to create a tensor which again can be used wherever you would use tf.nn.embedding_lookup(embeddings, self.input_ids).
When doing training, I initialize my embedding matrix, using the pretrained embeddings picked for words in training set vocabulary.
import torchtext as tt
contexts = tt.data.Field(lower=True, sequential=True, tokenize=tokenizer, use_vocab=True)
contexts.build_vocab(data, vectors="fasttext.en.300d",
vectors_cache=config["vectors_cache"])
In my model I pass contexts.vocab as parameter and initialize embeddings:
embedding_dim = vocab.vectors.shape[1]
self.embeddings = nn.Embedding(len(vocab), embedding_dim)
self.embeddings.weight.data.copy_(vocab.vectors)
self.embeddings.weight.requires_grad=False
I train my model, and during training I save its 'best' state via torch.save(model, f).
Then I want to test/create a demo for the model in a separate file for evaluation. I load the model via torch.load. How do I extend the embedding matrix to contain the test vocabulary? I tried to replace the embedding matrix:
# data is TabularDataset with test data
contexts.build_vocab(data, vectors="fasttext.en.300d",
vectors_cache=config["vectors_cache"])
model.embeddings = torch.nn.Embedding(len(contexts.vocab), contexts.vocab.vectors.shape[1])
model.embeddings.weight.data.copy_(contexts.vocab.vectors)
model.embeddings.weight.requires_grad = False
But the results are terrible (almost 0 accuracy). The model was doing well during training. What is the 'correct' way of doing this?
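One likely culprit is that calling build_vocab on the test data produces a different index-to-word mapping, so the trained weights no longer line up with the indexes. A hedged sketch of an alternative, appending rows for unseen tokens so that every training-time index keeps its trained vector (all sizes made up):

```python
import torch
import torch.nn as nn

trained = nn.Embedding(4, 3)           # stand-in for the trained embedding layer
new_rows = torch.randn(2, 3)           # vectors for 2 tokens unseen during training
grown = nn.Embedding.from_pretrained(
    torch.cat([trained.weight.data, new_rows], dim=0), freeze=True)

# the first 4 indexes still map to the trained vectors
assert torch.equal(grown.weight[:4], trained.weight.data)
```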
I've recently reviewed an interesting implementation for convolutional text classification. However, all TensorFlow code I've reviewed uses randomly initialized (not pre-trained) embedding vectors like the following:
with tf.device('/cpu:0'), tf.name_scope("embedding"):
W = tf.Variable(
tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
name="W")
self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
Does anybody know how to use the results of Word2vec or a GloVe pre-trained word embedding instead of a random one?
There are a few ways that you can use a pre-trained embedding in TensorFlow. Let's say that you have the embedding in a NumPy array called embedding, with vocab_size rows and embedding_dim columns and you want to create a tensor W that can be used in a call to tf.nn.embedding_lookup().
Simply create W as a tf.constant() that takes embedding as its value:
W = tf.constant(embedding, name="W")
This is the easiest approach, but it is not memory efficient because the value of a tf.constant() is stored multiple times in memory. Since embedding can be very large, you should only use this approach for toy examples.
Create W as a tf.Variable and initialize it from the NumPy array via a tf.placeholder():
W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)
# ...
sess = tf.Session()
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
This avoids storing a copy of embedding in the graph, but it does require enough memory to keep two copies of the matrix in memory at once (one for the NumPy array, and one for the tf.Variable). Note that I've assumed that you want to hold the embedding matrix constant during training, so W is created with trainable=False.
If the embedding was trained as part of another TensorFlow model, you can use a tf.train.Saver to load the value from the other model's checkpoint file. This means that the embedding matrix can bypass Python altogether. Create W as in option 2, then do the following:
W = tf.Variable(...)
embedding_saver = tf.train.Saver({"name_of_variable_in_other_model": W})
# ...
sess = tf.Session()
embedding_saver.restore(sess, "checkpoint_filename.ckpt")
I use this method to load and share embedding.
W = tf.get_variable(name="W", shape=embedding.shape, initializer=tf.constant_initializer(embedding), trainable=False)
2.0 Compatible Answer: There are many pre-trained embeddings developed by Google which have been open-sourced.
Some of them are Universal Sentence Encoder (USE), ELMo, BERT, etc., and it is very easy to reuse them in your code.
Code to reuse the Pre-Trained Embedding, Universal Sentence Encoder is shown below:
!pip install "tensorflow_hub>=0.6.0"
!pip install "tensorflow>=2.0.0"
import tensorflow as tf
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.KerasLayer(module_url)
embeddings = embed(["A long sentence.", "single-word",
"http://example.com"])
print(embeddings.shape)  # (3, 512)
For more information on the pre-trained embeddings developed and open-sourced by Google, refer to the TF Hub link.
@mrry's answer is not quite right, because it causes the embedding weights to be overwritten each time the network is run, so if you are following a minibatch approach to train your network, you are overwriting the weights of the embeddings. So, in my view, the right way to use pre-trained embeddings is:
embeddings = tf.get_variable("embeddings", shape=[dim1, dim2], initializer=tf.constant_initializer(np.array(embeddings_matrix)))
With TensorFlow version 2 it's quite easy if you use the Embedding layer (note that the pretrained matrix has to be wrapped in a Constant initializer):
X = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=300,
                              input_length=Length_of_input_sequences,
                              embeddings_initializer=tf.keras.initializers.Constant(matrix_of_pretrained_weights)
                              )(ur_inp)
I was also facing an embedding issue, so I wrote a detailed tutorial with a dataset.
Here I would like to add what I tried; you can also try this method:
import tensorflow as tf

tf.reset_default_graph()
input_x = tf.placeholder(tf.int32, shape=[None, None])

# you have to edit shape according to your embedding size
Word_embedding = tf.get_variable(name="W", shape=[400000, 100], initializer=tf.constant_initializer(np.array(word_embedding)), trainable=False)
embedding_lookup = tf.nn.embedding_lookup(Word_embedding, input_x)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for ii in final_:
        print(sess.run(embedding_lookup, feed_dict={input_x: [ii]}))
Here is a working, detailed tutorial IPython example if you want to understand it from scratch. Take a look.