As the title says, I would like to load custom word vectors built from gensim to the SpaCy Vector class.
I have found several other questions where folks have successfully loaded vectors to the nlp object itself, but I have a current project where I would like to have a separate Vectors object.
Specifically, I am using BioWordVec to generate my word vectors, which serializes the vectors using methods from gensim.models.fasttext.
On the gensim end I am:
calling model.wv.save_word2vec_format(output/bin/path, binary=True)
saving the model -> model.save(path/to/model)
On the SpaCy side:
I can either use the from_disk or from_bytes methods to load the word vectors
there is also a from_glove method that expects a vocab.txt file and a binary file (I already have the binary file)
Link to Vectors Documentation
just for reference, here is my code to test the load process:
import spacy
from spacy.vectors import Vectors

vecs = Vectors()
path = '/home/medmison690/pyprojects/BioWordVec/pubmed_mesh_test.bin'  # gensim-saved binary (unused in this attempt)
dir_path = '/home/medmison690/Desktop/tuned_vecs'
vecs.from_disk(dir_path)
print(vecs.shape)
I have tried various combinations of from_disk and from_bytes with no success. Any help or advice would be greatly appreciated!
It's unfortunate the Spacy docs don't clearly state which formats its various reading functions expect, nor provide an import that's clearly based on the format written by the original Google word2vec.c code.
It seems from_disk expects data in Spacy's own multi-file format, and from_bytes might expect a raw dump of the vectors. Neither would be useful for data saved from gensim's FastText model.
The from_glove method might in fact accept a compatible format. You could try using the save_word2vec_format() method with its optional fvocab argument (to specify a vocab.txt file with the words), binary=True, and a filename matching Spacy's conventions. For example, if you have 300-dimensional vectors:
ft_model.wv.save_word2vec_format('vectors.300.f.bin', fvocab='vocab.txt', binary=True)
Then, see if that directory works for Spacy's from_glove. (I'm not sure it will.)
Alternatively, you could possibly use a gensim utility class (such as its KeyedVectors) to load the vectors into memory, then manually add each vector, one-by-one, into a pre-allocated Spacy Vectors object.
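A rough sketch of that manual transfer, assuming a recent gensim (4.x, where the vocabulary is exposed as index_to_key; older versions use index2word) and the spaCy Vectors API; the path is the one from the question:
from gensim.models import KeyedVectors
from spacy.vectors import Vectors

# Load the word2vec-format binary that gensim wrote out.
kv = KeyedVectors.load_word2vec_format(
    '/home/medmison690/pyprojects/BioWordVec/pubmed_mesh_test.bin', binary=True
)

# Pre-allocate a spaCy Vectors table of the right shape, then copy each vector over.
vecs = Vectors(shape=(len(kv.index_to_key), kv.vector_size))
for word in kv.index_to_key:
    vecs.add(word, vector=kv[word])

print(vecs.shape)
This sidesteps the file-format question entirely, at the cost of a one-time copy through memory.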
Note that by saving FastText vectors to the plain, vectors-only word2vec_format, you'll be losing anything the model learned about subwords (which is what FastText-capable models use to synthesize vectors for out-of-vocabulary words).
Related
I have downloaded the wikipedia glove vectors using the gensim API. I want to save them locally so I don't have to call the API every time to download them. How can I do this? I have looked, but I am not sure if this is the right way to save them.
import gensim.downloader as api
vectors = api.load('glove-wiki-gigaword-50')
vectors.save('vectors.bin')
The code you've shown in your question likely worked to save the vectors.
You can check the class type of the object that the api.load() gave you with code like:
print(type(vectors))
I think that will be shown to be an instance of KeyedVectors.
If so, then reloading the file you've already saved should be as simple as:
from gensim.models import KeyedVectors
reloaded_vectors = KeyedVectors.load('vectors.bin')
Note that the native Gensim .save(), for any vectors of significant size, will usually be split over multiple files that all start with vectors.bin – so if you then want to move the model elsewhere, move all those files together.
You can also save/load vectors in a simpler format with the .save_word2vec_format() & .load_word2vec_format() methods, e.g.:
vectors.save_word2vec_format('vectors.txt', binary=False)
vectors_reloaded = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
I have a pretrained embeddings file, which was quantized, in .ftz format. I need it to look up words and find the nearest neighbours, but I cannot find any toolkit that can do both. FastText can load the embeddings file, yet it cannot look up the nearest neighbours; Gensim can look up the nearest neighbours, but cannot load the model...
Or is it just me not finding the right function?
Thank you!
FastText models come in two flavours:
unsupervised models that produce word embeddings and can find similar words. The native Facebook package does not support quantization for them.
supervised models that are used for text classification and can be quantized natively, but generally do not produce meaningful word embeddings.
To compress unsupervised models, I have created a package compress-fasttext, which is a wrapper around Gensim that can reduce the size of unsupervised models by pruning and quantization. This post describes it in more detail.
With this package, you can lookup similar words in small models as follows:
import compress_fasttext
small_model = compress_fasttext.models.CompressedFastTextKeyedVectors.load(
'https://github.com/avidale/compress-fasttext/releases/download/v0.0.4/cc.en.300.compressed.bin'
)
print(small_model.most_similar('Python'))
# [('PHP', 0.5253), ('.NET', 0.5027), ('Java', 0.4897), ... ]
Of course, it works only if the model has been compressed using the same package. You can compress your own unsupervised model this way:
import compress_fasttext
from gensim.models.fasttext import load_facebook_model
big_model = load_facebook_model('path-to-original-model').wv
small_model = compress_fasttext.prune_ft_freq(big_model, pq=True)
small_model.save('path-to-new-model')
Does anyone know which function I should use if I want to use the pre-trained doc2vec models from this website https://github.com/jhlau/doc2vec?
I know we can use KeyedVectors.load_word2vec_format() to load the word vectors from pre-trained word2vec models, but is there a similar function in gensim for loading pre-trained doc2vec models?
Thanks a lot.
When a model like Doc2Vec is saved with gensim's native save(), it can be reloaded with the native load() method:
model = Doc2Vec.load(filename)
Note that large internal arrays may have been saved alongside the main filename, in other filenames with extra extensions – and all those files must be kept together to re-load a fully-functional model. (You still need to specify only the main save file, and the auxiliary files will be discovered at expected names alongside it in the same directory.)
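If the load does succeed, downstream use is the standard Doc2Vec workflow; a minimal sketch, where the filename and token list are placeholders (model.dv is the gensim 4.x spelling; use model.docvecs on older versions):
from gensim.models.doc2vec import Doc2Vec

# Reload a natively-saved model; any auxiliary .npy files must sit alongside it.
model = Doc2Vec.load('doc2vec.model')

# Infer a vector for a new, pre-tokenized document.
tokens = ['gene', 'expression', 'analysis']
vec = model.infer_vector(tokens)

# Look up the most similar trained document tags.
print(model.dv.most_similar([vec], topn=5))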
You may have other issues trying to use those pre-trained models. In particular:
as noted in the linked page, the author used a custom variant of gensim that forked off about 2 years ago; the files might not load in standard gensim, or later gensims
it's not completely clear which parameters were used to train those models (though, if you succeed in loading them, you could presumably inspect them as properties of the model), how much meta-optimization was done and for which purposes, and whether those purposes will match your own project
if the parameters are as shown in one of the repo files, train_model.py, some are inconsistent with best practices (a min_count=1 is usually bad for Doc2Vec) or apparent model-size (a mere 1.4GB model couldn't hold 300-dimensional vectors for all of the millions of documents or word-tokens in 2015 Wikipedia)
I would highly recommend training your own model, on a corpus you understand, with recent code, and using metaparameters optimized for your own purposes.
Try this:
import gensim.models as g

model_path = "model_folder/doc2vec.bin"  # point to the downloaded pre-trained doc2vec model

# load the model
m = g.Doc2Vec.load(model_path)
The last part of my code:
lda = models.LdaModel(corpus_tfidf, id2word=dic, num_topics=64)
corpus_lda = lda[corpus_tfidf]
I am wondering how to save corpus_lda for further use?
Gensim has functions for writing corpora to disk:
from gensim import corpora
corpora.MmCorpus.serialize('pathandfilename.mm', corpus_lda)
To load a saved corpus use:
corpus_lda = corpora.MmCorpus('pathandfilename.mm')
There are similar functions for saving models (check the tutorials or the references).
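For instance, the trained LDA model from the question can be saved and reloaded with gensim's native save()/load(); a minimal sketch, with placeholder filenames:
# Save the trained LDA model; gensim may write auxiliary files alongside this one.
lda.save('lda_model')

# Later, reload it and re-apply it to the tf-idf corpus.
lda = models.LdaModel.load('lda_model')
corpus_lda = lda[corpus_tfidf]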
There are different corpus formats available; I believe Matrix Market used to be the standard format used by Gensim, but more recently the IndexedCorpus format was added, which has some additional functionality (an index, as you may have guessed).
I am using the Gensim Python package to learn a neural language model, and I know that you can provide a training corpus to learn the model. However, there already exist many precomputed word vectors available in text format (e.g. http://www-nlp.stanford.edu/projects/glove/). Is there some way to initialize a Gensim Word2Vec model that just makes use of some precomputed vectors, rather than having to learn the vectors from scratch?
Thanks!
The GloVe dump from the Stanford site is in a format that differs only slightly from the word2vec format. You can convert the GloVe file into word2vec format using:
python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt
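Once converted, the file loads like any other plain-text word2vec file; a minimal sketch using the output filename from the command above:
from gensim.models import KeyedVectors

# Load the converted GloVe vectors (plain-text word2vec format).
glove_vectors = KeyedVectors.load_word2vec_format('glove.840B.300d.w2vformat.txt', binary=False)
print(glove_vectors.most_similar('computer', topn=3))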
You can download pre-trained word vectors from here (get the file 'GoogleNews-vectors-negative300.bin'):
word2vec
Extract the file, and then you can load it in Python like:
model = gensim.models.word2vec.Word2Vec.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
model.most_similar('dog')
EDIT (May 2017):
As the above code is now deprecated, this is how you'd load the vectors now:
model = gensim.models.KeyedVectors.load_word2vec_format(os.path.join(os.path.dirname(__file__), 'GoogleNews-vectors-negative300.bin'), binary=True)
As far as I know, Gensim can load two binary formats, word2vec and fastText, as well as a generic plain text format which can be created by most word embedding tools. The generic plain text format looks like this (in this example, 20000 is the size of the vocabulary and 100 is the length of each vector):
20000 100
the 0.476841 -0.620207 -0.002157 0.359706 -0.591816 [98 more numbers...]
and 0.223408 0.231993 -0.231131 -0.900311 -0.225111 [98 more numbers..]
[19998 more lines...]
Chaitanya Shivade has explained in his answer here, how to use a script provided by Gensim to convert the Glove format (each line: word + vector) into the generic format.
Loading the different formats is easy, but it is also easy to get them mixed up:
import gensim
model_file = 'path/to/model/file'
1) Loading binary word2vec
model = gensim.models.word2vec.Word2Vec.load_word2vec_format(model_file, binary=True)
2) Loading binary fastText
model = gensim.models.fasttext.FastText.load_fasttext_format(model_file)
3) Loading the generic plain text format (which has been introduced by word2vec)
model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file)
If you only plan to use the word embeddings and not to continue training them in Gensim, you may want to use the KeyedVectors class. This will considerably reduce the amount of memory needed to load the vectors (detailed explanation).
The following will load the binary word2vec format as KeyedVectors:
model = gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format(model_file, binary=True)
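Once loaded as KeyedVectors, the object can be queried directly; a short illustrative sketch (the word choices are arbitrary):
print(model.most_similar('dog', topn=5))   # nearest neighbours by cosine similarity
print(model.similarity('dog', 'cat'))      # pairwise similarity between two words
vector = model['dog']                      # the raw numpy vector for a single word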