I have downloaded the Wikipedia GloVe vectors using the Gensim API. I want to save them locally so I don't have to call the API every time to download them. How can I do this? I have looked around, but I am not sure whether this is the right way to save them.
import gensim.downloader as api
vectors = api.load('glove-wiki-gigaword-50')
vectors.save('vectors.bin')
The code you've shown in your question likely worked to save the vectors.
You can check the class type of the object that the api.load() gave you with code like:
print(type(vectors))
I think that will be shown to be an instance of KeyedVectors.
If so, then reloading the file you've already saved should be as simple as:
from gensim.models import KeyedVectors
reloaded_vectors = KeyedVectors.load('vectors.bin')
Note that the native Gensim .save(), for any vectors of significant size, will usually be split over multiple files that all start with vectors.bin – so if you then want to move the model elsewhere, move all those files together.
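To illustrate the "move all those files together" advice, here is a minimal sketch using only the standard library. The helper name move_saved_model and the auxiliary file name vectors.bin.vectors.npy are assumptions made for the demo; the actual extra file names depend on your Gensim version and model, so check what .save() really wrote.

```python
import glob
import os
import shutil
import tempfile


def move_saved_model(main_path, dest_dir):
    """Move main_path and every sibling file whose name starts with it."""
    os.makedirs(dest_dir, exist_ok=True)
    moved = []
    for path in sorted(glob.glob(glob.escape(main_path) + "*")):
        shutil.move(path, os.path.join(dest_dir, os.path.basename(path)))
        moved.append(os.path.basename(path))
    return moved


# Demo with empty placeholder files standing in for a real multi-file save.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(src)
    for name in ("vectors.bin", "vectors.bin.vectors.npy", "other.txt"):
        open(os.path.join(src, name), "w").close()
    moved = move_saved_model(os.path.join(src, "vectors.bin"),
                             os.path.join(tmp, "dst"))
    remaining = sorted(os.listdir(src))

print(moved)
```

After the move, pass the new location of the main file (the one without extra extensions) to KeyedVectors.load().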
You can also save/load vectors in a simpler, vectors-only format with the .save_word2vec_format() & .load_word2vec_format() methods. For example:
vectors.save_word2vec_format('vectors.txt', binary=False)
vectors_reloaded = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
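In case it helps to see why this format is "simpler": with binary=False the file is plain text, with a header line giving the number of words and the vector size, followed by one line per word. The sketch below parses such a file with only the standard library. The function read_word2vec_text and the two-word sample are made up for illustration, and in practice you would just use load_word2vec_format().

```python
import os
import tempfile


def read_word2vec_text(path):
    """Parse a word2vec-style text file into {word: [floats]}."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        # Header line: "<number of words> <vector size>"
        count, dims = (int(x) for x in fh.readline().split())
        for line in fh:
            parts = line.rstrip().split(" ")
            word, values = parts[0], [float(v) for v in parts[1:]]
            if len(values) != dims:
                raise ValueError(f"expected {dims} values for {word!r}")
            vectors[word] = values
    if len(vectors) != count:
        raise ValueError("header count does not match number of rows")
    return vectors


# Hypothetical two-word, three-dimensional file, written only for this demo.
sample = "2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n"
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "vectors.txt")
    with open(sample_path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    parsed = read_word2vec_text(sample_path)

print(parsed["queen"])
```

Because everything is in one human-readable file, this format is easy to inspect and to share with other tools, at the cost of larger files and slower loading than the native .save().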
Related
I have a Docker image which contains the en_core_web_lg model and also an API which returns the vector of a sentence or word when requested. Let's say that this API is called "spacy_vectors".
I have trained a custom NER model (which uses the same en_core_web_lg pipeline for vectors) in Google Colab, and after saving it to disk, I now want to reduce the size of the model on disk.
So I am asking whether there is any way I can remove the vectors saved on disk, so that when I load the model I can query the vectors from the API and pass them across to the NER model.
I looked through the spaCy documentation and could not find anything like this. The only thread that comes close is this one, but that question was asked for spaCy version 2 and a lot has changed since then.
I use FastText to generate the word embedding. I download the pre-trained model from https://fasttext.cc/docs/en/crawl-vectors.html
The model has 300 dimensions but I want 100 dimensions, so I tried the reduce_model command, but I got an error:
import gensim
model = gensim.models.fasttext.FastText.load_fasttext_format('cc.th.300.bin')
gensim.models.fasttext.utils.reduce_model(model, 100)
I got AttributeError: module 'gensim.utils' has no attribute 'reduce_model'
Here is the code from the FastText docs:
import fasttext
import fasttext.util
ft = fasttext.load_model('cc.en.300.bin')
fasttext.util.reduce_model(ft, 100)
How can I fix this error? I cannot find any docs for the new command.
Thank you
As the error message describes, the module gensim.utils (which is what gensim.models.fasttext.utils resolves to) does not have a function reduce_model().
That's not a common/standard operation - it's just something the Facebook wrapper decided to implement. (It looks like it's using standard PCA on a tiny subsample of the vectors, per source code here.)
Why do you want to reduce the dimensionality?
Note that you'll lose some of the model's expressiveness, and if you were able to load the model at all to do the reduction, it's not too big for your RAM. If your main goal is to save model size, there might be better ways, such as discarding more rare words, depending on your reasons.
If you absolutely need to perform such a reduction, some options could be:
Do it using the Facebook wrapper, then save the results in a form Gensim can load.
Reimplement the same operation for a Gensim model, perhaps using the FB code as a guide. (You'd have to make sure the Gensim model is updated in all ways that it considered the original dimensionality, which might be tricky – it's never been a goal or function of Gensim to enable after-the-fact model-shrinking.)
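For a rough idea of what the core of such a reduction looks like, here is a minimal sketch of plain PCA in NumPy. It is not the Facebook wrapper's exact procedure, the function name reduce_dimensions is hypothetical, and the random matrix only stands in for real word vectors. It also only handles a plain array of full-word vectors; a FastText model's subword (n-gram) vectors and other internal state would need separate care.

```python
import numpy as np


def reduce_dimensions(vectors, target_dim):
    """Project row vectors onto their top `target_dim` principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:target_dim].T


# Random stand-in data: 1000 "words" with 300 dimensions each.
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(1000, 300)).astype(np.float32)
reduced = reduce_dimensions(demo_vectors, 100)

print(reduced.shape)  # (1000, 100)
```

On a full vocabulary you would probably fit the projection on a sample of the vectors to keep memory use down, then apply it to all of them.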
As the title says, I would like to load custom word vectors built from gensim to the SpaCy Vector class.
I have found several other questions where folks have successfully loaded vectors to the nlp object itself, but I have a current project where I would like to have a separate Vectors object.
Specifically, I am using BioWordVec to generate my word vectors, which serializes the vectors using methods from gensim.models.FastText.
On the gensim end I am:
calling model.wv.save_word2vec_format(output/bin/path, binary=True)
saving the model -> model.save(path/to/model)
On the SpaCy side:
I can either use the from_disk or from_bytes methods to load the word vectors
there is also a from_glove method that expects a vocab.txt file and a binary file (I already have a binary file)
Link to Vectors Documentation
just for reference, here is my code to test the load process:
import spacy
from spacy.vectors import Vectors
vecs = Vectors()
path = '/home/medmison690/pyprojects/BioWordVec/pubmed_mesh_test.bin'
dir_path = '/home/medmison690/Desktop/tuned_vecs'
vecs.from_disk(dir_path)
print(vecs.shape)
I have tried various combinations of from_disk and from_bytes with no success. Any help or advice would be greatly appreciated!
It's unfortunate the Spacy docs don't clearly state what formats are used by its various reading functions, nor implement an import that's clearly based on the format written by the original Google word2vec.c code.
It seems the from_disk expects things in Spacy's own multi-file format. The from_bytes might expect a raw version of the vectors. Neither would be useful for data saved from gensim's FastText model.
The from_glove might in fact be a compatible format. You could try using the save_word2vec_format() method with its optional fvocab argument (to specify a vocab.txt file with words), binary=True, and a filename according to Spacy's conventions. For example, if you have 300 dimensional vectors:
ft_model.wv.save_word2vec_format('vectors.300.f.bin', fvocab='vocab.txt', binary=True)
Then, see if that directory works for Spacy's from_glove. (I'm not sure it will.)
Alternatively, you could possibly use a gensim utility class (such as its KeyedVectors) to load the vectors into memory, then manually add each vector, one-by-one, into a pre-allocated Spacy Vectors object.
Note that by saving FastText vectors to the plain, vectors-only word2vec_format, you'll be losing anything the model learned about subwords (which is what FastText-capable models use to synthesize vectors for out-of-vocabulary words).
In Gensim's documentation, it says:
You can save trained models to disk and later load them back, either to continue training on new training documents or to transform new documents.
I would like to do this with a dictionary, corpus and tf-idf model. However, the documentation seems to say that it is possible without explaining how to save these things and load them back up again.
How do you do this?
I've been using Pickle, but don't know if this is right...
import pickle
pickle.dump(tfidf, open("tfidf.p", "wb"))
tfidf_reloaded = pickle.load(open("tfidf.p", "rb"))
In general, you can save things with generic Python pickle, but most gensim models support their own native .save() method.
It takes a target filesystem path, and will save the model more efficiently than pickle – often by placing large component arrays in separate files, alongside the main file. (When you later move the saved model, keep all these files with the same root name together.)
In particular, some models which have multi-gigabyte subcomponents may not save at all with pickle – but gensim's native .save() will work.
Models saved with .save() can typically be loaded by using the appropriate class's .load() method. (For example, if you've saved an instance of gensim.corpora.dictionary.Dictionary, you'd load it with gensim.corpora.dictionary.Dictionary.load(filepath).)
Saving the Dict and Corpus to disk
dictionary.save(DICT_PATH)
corpora.MmCorpus.serialize(CORPUS_PATH, corpus)
Loading the Dict and Corpus from disk
loaded_dict = corpora.Dictionary.load(DICT_PATH)
loaded_corp = corpora.MmCorpus(CORPUS_PATH)
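Python's default pickle should be able to save most Python objects. As an example:
import pickle
file_name = 'myModel.sav'
# my_model is whatever model object you have already created
with open(file_name, 'wb') as f:
    pickle.dump(my_model, f)
with open(file_name, 'rb') as f:
    loaded_model = pickle.load(f)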
Does anyone know which function should I use if I want to use the pre-trained doc2vec models in this website https://github.com/jhlau/doc2vec?
I know we can use KeyedVectors.load_word2vec_format() to load the word vectors from pre-trained word2vec models, but is there a similar function in gensim to load pre-trained doc2vec models as well?
Thanks a lot.
When a model like Doc2Vec is saved with gensim's native save(), it can be reloaded with the native load() method:
model = Doc2Vec.load(filename)
Note that large internal arrays may have been saved alongside the main filename, in other filenames with extra extensions – and all those files must be kept together to re-load a fully-functional model. (You still need to specify only the main save file, and the auxiliary files will be discovered at expected names alongside it in the same directory.)
You may have other issues trying to use those pre-trained models. In particular:
as noted in the linked page, the author used a custom variant of gensim that forked off about 2 years ago; the files might not load in standard gensim, or in later gensim versions
it's not completely clear what parameters were used to train those models (though I suppose if you succeed in loading them you could see them as properties in the model), and how much meta-optimization was used for which purposes, and whether those purposes will match your own project
if the parameters are as shown in one of the repo files, train_model.py, some are inconsistent with best practices (a min_count=1 is usually bad for Doc2Vec) or apparent model-size (a mere 1.4GB model couldn't hold 300-dimensional vectors for all of the millions of documents or word-tokens in 2015 Wikipedia)
I would highly recommend training your own model, on a corpus you understand, with recent code, and using metaparameters optimized for your own purposes.
Try this:
import gensim.models as g
model="model_folder/doc2vec.bin" #point to downloaded pre-trained doc2vec model
#load model
m = g.Doc2Vec.load(model)