I am using the Gensim Python package to learn a neural language model, and I know that you can provide a training corpus to learn the model. However, there already exist many precomputed word vectors available in text format (e.g. http://www-nlp.stanford.edu/projects/glove/). Is there some way to initialize a Gensim Word2Vec model that just makes use of some precomputed vectors, rather than having to learn the vectors from scratch?
Thanks!
The GloVe dump from the Stanford site is in a format that is slightly different from the word2vec format. You can convert the GloVe file into word2vec format using:
python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
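Once converted, the result is an ordinary text-format word2vec file, so it can be loaded directly; a minimal sketch, assuming the output filename from the command above:
from gensim.models import KeyedVectors
# Text format, hence binary=False.
glove_model = KeyedVectors.load_word2vec_format('glove.840B.300d.w2vformat.txt', binary=False)
print(glove_model.most_similar('dog'))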
You can download pre-trained word vectors from here (get the file 'GoogleNews-vectors-negative300.bin'):
word2vec
Extract the file and then you can load it in python like:
model = gensim.models.word2vec.Word2Vec.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
model.most_similar('dog')
EDIT (May 2017):
As the above code is now deprecated, this is how you'd load the vectors now:
model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
As far as I know, Gensim can load two binary formats, word2vec and fastText, and a generic plain text format which can be created by most word embedding tools. The generic plain text format looks like this (in this example 20000 is the size of the vocabulary and 100 is the length of each vector):
20000 100
the 0.476841 -0.620207 -0.002157 0.359706 -0.591816 [98 more numbers...]
and 0.223408 0.231993 -0.231131 -0.900311 -0.225111 [98 more numbers..]
[19998 more lines...]
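Just to illustrate the format, here is a rough sketch of writing such a file yourself; the toy vectors and filename are made up for the example:
# Toy vectors: in practice these would be full-length embeddings.
vectors = {'the': [0.476841, -0.620207], 'and': [0.223408, 0.231993]}
with open('my_vectors.txt', 'w', encoding='utf8') as f:
    # Header line: vocabulary size and vector length.
    f.write('%d %d\n' % (len(vectors), len(next(iter(vectors.values())))))
    for word, vec in vectors.items():
        f.write(word + ' ' + ' '.join('%f' % x for x in vec) + '\n')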
Chaitanya Shivade has explained in his answer here how to use a script provided by Gensim to convert the GloVe format (each line: word + vector) into the generic format.
Loading the different formats is easy, but it is also easy to get them mixed up:
import gensim
model_file = 'path/to/model/file'
1) Loading binary word2vec
model = gensim.models.word2vec.Word2Vec.load_word2vec_format(model_file, binary=True)
2) Loading binary fastText
model = gensim.models.fasttext.FastText.load_fasttext_format(model_file)
3) Loading the generic plain text format (which has been introduced by word2vec)
model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
If you only plan to use the word embeddings and not to continue to train them in Gensim, you may want to use the KeyedVectors class. This will reduce the amount of memory you need to load the vectors considerably (detailed explanation).
The following will load the binary word2vec format as keyedvectors:
model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file, binary=True)
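Once loaded as KeyedVectors, the vectors can be queried directly (though the model can no longer be trained further); a quick usage sketch:
print(model.most_similar('dog', topn=5))  # nearest neighbours
print(model['dog'])  # the raw vector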
I have a pretrained embeddings file in .ftz format, which was quantized. I need it to look up words and find their nearest neighbours, but I can't find any toolkit that can do that: FastText can load the embeddings file but cannot look up the nearest neighbours, while Gensim can look up the nearest neighbours but cannot load the model...
Or it's me not finding the right function?
Thank you!
FastText models come in two flavours:
unsupervised models that produce word embeddings and can find similar words. The native Facebook package does not support quantization for them.
supervised models that are used for text classification and can be quantized natively, but generally do not produce meaningful word embeddings.
To compress unsupervised models, I have created a package compress-fasttext, which is a wrapper around Gensim that can reduce the size of unsupervised models by pruning and quantization. This post describes it in more detail.
With this package, you can look up similar words in small models as follows:
import compress_fasttext
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
print(small_model.most_similar('Python'))
# [('PHP', 0.5253), ('.NET', 0.5027), ('Java', 0.4897), ... ]
Of course, it works only if the model has been compressed using the same package. You can compress your own unsupervised model this way:
import compress_fasttext
from gensim.models.fasttext import load_facebook_model
big_model = load_facebook_model('path-to-original-model').wv
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')
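The compressed model saved this way can then be reloaded with the same class as in the first snippet (the path is a placeholder):
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load('path-to-new-model')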
As the title says, I would like to load custom word vectors built from gensim to the SpaCy Vector class.
I have found several other questions where folks have successfully loaded vectors to the nlp object itself, but I have a current project where I would like to have a separate Vectors object.
Specifically, I am using BioWordVec to generate my word vectors, which serializes the vectors using methods from gensim.models.fasttext.
On the gensim end I am:
calling model.wv.save_word2vec_format(output/bin/path, binary=True)
saving the model -> model.save(path/to/model)
On the SpaCy side:
I can either use the from_disk or from_bytes methods to load the word vectors
there is also a from_glove method that expects a vocab.txt file and a binary file (I already have the binary file)
Link to Vectors Documentation
just for reference, here is my code to test the load process:
import spacy
from spacy.vectors import Vectors
vecs = Vectors()
path = '/home/medmison690/pyprojects/BioWordVec/pubmed_mesh_test.bin'
dir_path = '/home/medmison690/Desktop/tuned_vecs'
vecs.from_disk(dir_path)
print(vecs.shape)
I have tried various combinations of from_disk and from_bytes with no success. Any help or advice would be greatly appreciated!
It's unfortunate the Spacy docs don't clearly state what formats are used by its various reading functions, nor implement an import that's clearly based on the format written by the original Google word2vec.c code.
It seems the from_disk expects things in Spacy's own multi-file format. The from_bytes might expect a raw version of the vectors. Neither would be useful for data saved from gensim's FastText model.
The from_glove method might in fact expect a compatible format. You could try using the save_word2vec_format() method with its optional fvocab argument (to specify a vocab.txt file with the words), binary=True, and a filename following Spacy's conventions. For example, if you have 300-dimensional vectors:
ft_model.wv.save_word2vec_format('vectors.300.f.bin', fvocab='vocab.txt', binary=True)
Then, see if that directory works for Spacy's from_glove. (I'm not sure it will.)
Alternatively, you could possibly use a gensim utility class (such as its KeyedVectors) to load the vectors into memory, then manually add each vector, one-by-one, into a pre-allocated Spacy Vectors object.
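A rough sketch of that one-by-one approach, not tested against any particular Spacy version (the file path is the asker's; gensim 4.x uses index_to_key, older versions use index2word):
import numpy as np
from gensim.models import KeyedVectors
from spacy.vectors import Vectors

kv = KeyedVectors.load_word2vec_format('pubmed_mesh_test.bin', binary=True)
vecs = Vectors(shape=(len(kv.index_to_key), kv.vector_size))  # pre-allocate rows x dims
for word in kv.index_to_key:  # kv.index2word on gensim < 4.0
    vecs.add(word, vector=np.asarray(kv[word], dtype='float32'))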
Note that by saving FastText vectors to the plain, vectors-only word2vec_format, you'll be losing anything the model learned about subwords (which is what FastText-capable models use to synthesize vectors for out-of-vocabulary words).
I recently installed gensim and glove in my mac and am trying to get word embedding for textual data I have. However, I'm having trouble finding the right function for it. I've only come across methods to get similarity metrics between two words. How do I train a glove object with data present in the library and use it to obtain embeddings for words in my dataset? Or is there any other library in python to do this? Thanks!
Actually, the format of GloVe is different from word2vec's. You can convert GloVe files to word2vec format using this script: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
Say the converted GloVe file is glove_changed.txt:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('glove_changed.txt', binary=False)
print(model['cat'])  # This will give the word vector for the word 'cat'
I'm looking for a way to dynamically add pre-trained word vectors to a word2vec gensim model.
I have a pre-trained word2vec model in a .txt file (words and their embeddings) and I need to get Word Mover's Distance (for example via gensim.models.Word2Vec.wmdistance) between documents in a specific corpus and a new document.
To avoid loading the whole vocabulary, I want to load only the subset of the pre-trained model's words that are found in the corpus. But if the new document has words that are not in the corpus yet are in the original model's vocabulary, I want to add them to the model so they are considered in the computation.
What I want is to save RAM, so possible things that would help me:
Is there a way to add the word vectors directly to the model?
Is there a way to load to gensim from a matrix or another object? I could have that object in RAM and append to it the new words before loading them in the model
I don't need it to be on gensim, so if you know a different implementation for WMD that gets the vectors as input that would work (though I do need it in Python)
Thanks in advance.
METHOD 1:
You can just use keyedvectors from gensim.models.keyedvectors. They are very easy to use.
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
w2v = WordEmbeddingsKeyedVectors(50) # 50 = vec length
w2v.add(new_words, their_new_vecs)
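Usage sketch for the Word Mover's Distance the question asks about; it assumes the tokens below were among new_words, and note that older gensim versions need the pyemd package installed for wmdistance:
doc1 = ['obama', 'speaks', 'media']
doc2 = ['president', 'greets', 'press']
print(w2v.wmdistance(doc1, doc2))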
METHOD 2:
And if you have already built a model using gensim.models.Word2Vec, you can just do this. Suppose I want to add the token <UNK> with a random vector.
model.wv["<UNK>"] = np.random.rand(100) # 100 is the vectors length
The complete example would be like this:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load("text8") # load dataset as iterable
model = Word2Vec(dataset)
model.wv["<UNK>"] = np.random.rand(100)
The last part of my code:
lda = models.LdaModel(corpus_tfidf, id2word = dic, num_topics = 64)
corpus_lda = lda[corpus_tfidf]
I am wondering how to save corpus_lda for further use?
Gensim has functions for writing corpora to disk:
from gensim import corpora
corpora.MmCorpus.serialize('pathandfilename.mm', corpus_lda)
To load a saved corpus use:
corpus_lda = corpora.MmCorpus('pathandfilename.mm')
There are similar functions for saving models (check the tutorials or the references).
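For example, the LDA model from the question can be saved and reloaded like this (the filename is a placeholder):
lda.save('lda_model.gensim')
lda = models.LdaModel.load('lda_model.gensim')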
There are different corpus formats available; I believe Matrix Market used to be the standard format used by Gensim, but more recently the IndexedCorpus format was added, which has some additional functionality (an index, as you may have guessed).