I'm looking for a way to dynamically add pre-trained word vectors to a word2vec gensim model.
I have a pre-trained word2vec model in a txt file (words and their embeddings) and I need to compute the Word Mover's Distance (for example via gensim.models.Word2Vec.wmdistance) between documents in a specific corpus and a new document.
To avoid loading the whole vocabulary, I want to load only the subset of the pre-trained model's words that appear in the corpus. But if the new document contains words that are not in the corpus yet are in the original model's vocabulary, I want to add them to the model so they are considered in the computation.
What I want is to save RAM, so any of the following would help me:
Is there a way to add word vectors directly to the model?
Is there a way to load into gensim from a matrix or another object? I could keep that object in RAM and append the new words to it before loading it into the model.
I don't need it to be in gensim, so if you know a different WMD implementation that takes the vectors as input, that would work (though I do need it in Python).
Thanks in advance.
METHOD 1:
You can just use keyedvectors from gensim.models.keyedvectors. They are very easy to use.
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
w2v = WordEmbeddingsKeyedVectors(50) # 50 = vec length
w2v.add(new_words, their_new_vecs) # parallel lists: words and their vectors
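For the original question (loading only the needed subset of a large pre-trained txt file), here is a minimal sketch of that approach. It assumes the file is in word2vec text format with a "vocab_size dim" header line; load_subset and pretrained.txt are made-up names:
import numpy as np
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

def load_subset(path, wanted_words):
    # stream the pre-trained txt file and keep only the words we care
    # about (corpus vocabulary plus the new document's words)
    words, vecs = [], []
    with open(path, encoding='utf-8') as f:
        dim = int(f.readline().split()[1]) # header: "vocab_size dim"
        for line in f:
            parts = line.rstrip().split(' ')
            if parts[0] in wanted_words:
                words.append(parts[0])
                vecs.append(np.asarray(parts[1:], dtype=np.float32))
    kv = WordEmbeddingsKeyedVectors(dim)
    kv.add(words, vecs)
    return kv

subset = load_subset('pretrained.txt', {'dog', 'cat', 'house'})
print(subset.wmdistance(['dog', 'house'], ['cat', 'house'])) # WMD needs pyemd in older gensim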
METHOD 2:
And if you have already built a model using gensim.models.Word2Vec, you can just do this. Suppose I want to add the token <UNK> with a random vector:
model.wv["<UNK>"] = np.random.rand(100) # 100 is the vector length
The complete example would be like this:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load("text8") # load dataset as iterable
model = Word2Vec(dataset)
model.wv["<UNK>"] = np.random.rand(100)
Related
Hi, I am looking to generate similar words for a given word using a BERT model, the same way we use gensim's most_similar to find similar words. I found this approach:
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
word = "Hello"
inputs = tokenizer(word, return_tensors="pt")
outputs = model(**inputs)
word_vect = outputs.pooler_output.detach().numpy() # pooled [CLS] embedding, shape (1, 768)
Okay, this gives me the embedding for a word supplied by the user. Can we compare this embedding against the complete BERT vocabulary using cosine similarity to find the top N closest embeddings, and then map those embeddings back to words via the model's vocab.txt file? Is that possible?
It seems like you need to store embeddings for all the words in your vocabulary.
After that, you can use some tools to find the closest embedding to the target embedding.
For example, you can use NearestNeighbors from scikit-learn.
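A minimal sketch of that approach, assuming vocab_embeddings is an (n_words, dim) matrix of precomputed BERT embeddings for the whole vocabulary and vocab_words is the matching word list (both names are placeholders, not part of the snippet above):
from sklearn.neighbors import NearestNeighbors

# fit on one precomputed vector per vocabulary word
nn = NearestNeighbors(n_neighbors=10, metric='cosine')
nn.fit(vocab_embeddings)

# word_vect is the query embedding from the snippet above, shape (1, dim)
distances, indices = nn.kneighbors(word_vect)
print([vocab_words[i] for i in indices[0]])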
Another option you might consider is HNSW, a data structure specially designed for fast approximate nearest-neighbour search. Faiss is a good implementation of HNSW by Facebook.
I have a pretrained embeddings file in the quantized .ftz format. I need it to look up words and find their nearest neighbours, but I cannot find any toolkit that can do both: FastText can load the embeddings file, yet is not able to look up the nearest neighbour; Gensim can look up the nearest neighbour, but is not able to load the model...
Or it's me not finding the right function?
Thank you!
FastText models come in two flavours:
unsupervised models that produce word embeddings and can find similar words. The native Facebook package does not support quantization for them.
supervised models that are used for text classification and can be quantized natively, but generally do not produce meaningful word embeddings.
To compress unsupervised models, I have created a package, compress-fasttext, which is a wrapper around Gensim that can reduce the size of unsupervised models by pruning and quantization. This post describes it in more detail.
With this package, you can lookup similar words in small models as follows:
import compress_fasttext
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
print(small_model.most_similar('Python'))
# [('PHP', 0.5253), ('.NET', 0.5027), ('Java', 0.4897), ... ]
Of course, it works only if the model has been compressed using the same package. You can compress your own unsupervised model this way:
import compress_fasttext
from gensim.models.fasttext import load_facebook_model
big_model = load_facebook_model('path-to-original-model').wv
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')
With things like neural networks (NNs) in Keras it is very clear how to use word embeddings within the training of the NN; you can simply do something like
embeddings = ...
model = Sequential([Embedding(...),
                    layer1,
                    layer2, ...])
But I'm unsure of how to do this with algorithms in sklearn such as SVMs, Naive Bayes, and logistic regression. I understand that there is a Pipeline method, which works simply (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) like
pip = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', Classifier())])
pip.fit(X_train, y_train)
But how can I include loaded word embeddings in this pipeline? Or should it somehow be included outside the pipeline? I can't find much documentation online about how to do this.
Thanks.
You can use the FunctionTransformer class.
If your goal is to have a transformer that takes a matrix of indexes and outputs a 3d tensor with word vectors, then this should suffice:
from sklearn.preprocessing import FunctionTransformer

# this assumes you're using numpy ndarrays
word_vecs_matrix = get_wv_matrix() # pseudo-code

def transform(x):
    return word_vecs_matrix[x]

transformer = FunctionTransformer(transform)
Be aware that, unlike Keras, the word vectors will not be fine-tuned by any kind of gradient descent.
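If what you actually need is one fixed-size feature vector per document for a sklearn classifier, a common trick is to average the word vectors of each document. A minimal sketch, where w2v (a dict-like word-to-vector lookup), dim, X_train and y_train are all assumptions:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def mean_embedding(docs):
    # average the vectors of the in-vocabulary words of each document;
    # fall back to a zero vector for documents with no known words
    return np.array([
        np.mean([w2v[w] for w in doc.split() if w in w2v] or [np.zeros(dim)],
                axis=0)
        for doc in docs
    ])

pipe = Pipeline([
    ('embed', FunctionTransformer(mean_embedding, validate=False)),
    ('clf', LogisticRegression()),
])
pipe.fit(X_train, y_train) # X_train: list of raw text documents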
There is an easy way to get word embedding transformers with the Zeugma package.
It handles the downloading of the pre-trained embeddings and returns a "Transformer interface" for the embeddings.
For example, if you want to use the average of the GloVe embeddings as sentence representations, you'd just have to write:
from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')
Here glove is a sklearn transformer with the standard transform method, which takes a list of sentences as input and outputs a design matrix, just like TfidfTransformer. You can get the resulting embeddings with embeddings = glove.transform(['first sentence of the corpus', 'another sentence']); embeddings would then contain a 2 x N matrix, where N is the dimension of the chosen embedding. Note that you don't have to bother with downloading the embeddings, or loading them locally if you've already done it; Zeugma handles this transparently.
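Since it is a standard sklearn transformer, it also slots straight into a Pipeline. A quick sketch (the LogisticRegression stage and the training data names are assumptions):
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from zeugma.embeddings import EmbeddingTransformer

pipe = Pipeline([
    ('embed', EmbeddingTransformer('glove')), # average of GloVe vectors per sentence
    ('clf', LogisticRegression()),
])
pipe.fit(X_train, y_train) # X_train: list of raw sentences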
Hope this helps
The last part of the code:
lda = models.LdaModel(corpus_tfidf, id2word=dic, num_topics=64)
corpus_lda = lda[corpus_tfidf]
I am wondering how to save corpus_lda for further use?
Gensim has functions for writing corpora to disk:
from gensim import corpora
corpora.MmCorpus.serialize('pathandfilename.mm', corpus_lda)
To load a saved corpus use:
corpus_lda = corpora.MmCorpus('pathandfilename.mm')
There are similar functions for saving models (check the tutorials or the references).
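For the model itself, a minimal save/load sketch (the file name is a placeholder):
lda.save('lda_model.gensim') # save the trained LDA model
lda = models.LdaModel.load('lda_model.gensim') # load it back later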
There are different corpus formats available, I believe matrix market used to be the standard format used by Gensim but recently the indexedcorpus format was added, which has some additional functionality (an index, as you may have guessed).
I am using the Gensim Python package to learn a neural language model, and I know that you can provide a training corpus to learn the model. However, there already exist many precomputed word vectors available in text format (e.g. http://www-nlp.stanford.edu/projects/glove/). Is there some way to initialize a Gensim Word2Vec model that just makes use of some precomputed vectors, rather than having to learn the vectors from scratch?
Thanks!
The GloVe dump from the Stanford site is in a format slightly different from the word2vec format. You can convert the GloVe file into word2vec format using:
python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
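After the conversion, the result loads like any other text-format word2vec file. A quick sketch (the file name follows from the command above):
from gensim.models import KeyedVectors

glove_model = KeyedVectors.load_word2vec_format('glove.840B.300d.w2vformat.txt')
print(glove_model.most_similar('frog'))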
You can download pre-trained word vectors from here (get the file 'GoogleNews-vectors-negative300.bin'):
word2vec
Extract the file and then you can load it in python like:
model = gensim.models.word2vec.Word2Vec.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
model.most_similar('dog')
EDIT (May 2017):
As the above code is now deprecated, this is how you'd load the vectors now:
model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
As far as I know, Gensim can load two binary formats, word2vec and fastText, plus a generic plain text format which can be created by most word embedding tools. The generic plain text format looks like this (in this example, 20000 is the size of the vocabulary and 100 is the length of each vector):
20000 100
the 0.476841 -0.620207 -0.002157 0.359706 -0.591816 [98 more numbers...]
and 0.223408 0.231993 -0.231131 -0.900311 -0.225111 [98 more numbers..]
[19998 more lines...]
Chaitanya Shivade has explained in his answer here, how to use a script provided by Gensim to convert the Glove format (each line: word + vector) into the generic format.
Loading the different formats is easy, but it is also easy to get them mixed up:
import gensim
model_file = 'path/to/model/file'
1) Loading binary word2vec
model = gensim.models.word2vec.Word2Vec.load_word2vec_format(model_file, binary=True)
2) Loading binary fastText
model = gensim.models.fasttext.FastText.load_fasttext_format(model_file)
3) Loading the generic plain text format (which has been introduced by word2vec)
model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
If you only plan to use the word embeddings and not to continue to train them in Gensim, you may want to use the KeyedVector class. This will reduce the amount of memory you need to load the vectors considerably (detailed explanation).
The following will load the binary word2vec format as keyedvectors:
model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file, binary=True)