Mocking a FastText model for unit tests - Python

I am using FastText models in my Python library (from the official fasttext library). To run my unit tests, I need at some point a model (a fasttext.FastText._FastText object) that is as light as possible, so that I can version it in my repo.
I have tried to create a fake text dataset "fake.txt" with 5 lines and a few words, and called:
model = fasttext.train_unsupervised("./fake.txt")
fasttext.util.reduce_model(model, 2)
model.save_model("fake_model.bin")
It basically works, but the model is 16 MB. That is kind of OK for a unit-test resource, but do you think I can go below this?

Note that FastText (& similar dense word-vector models) don't perform meaningfully when using toy-sized data or parameters. (All their useful/predictable/testable benefits depend on large, varied datasets & the subtle arrangements of many final vectors.)
But, if you just need a relatively meaningless object/file of the right type, your approach should work. The main parameter that makes a FastText model large regardless of the tiny training set is the bucket parameter, with a default value of 2000000. It allocates that many character-ngram (word-fragment) hash slots, even if your actual words don't generate anywhere near that many ngrams.
Setting bucket to a far smaller value, at initial model creation, should make your stand-in file far smaller as well.
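For instance, a minimal sketch of the same approach with a shrunken bucket (and a tiny dim set at training time instead of calling reduce_model afterwards); the exact values here are illustrative, not tuned:
import fasttext

model = fasttext.train_unsupervised(
    "./fake.txt",
    dim=2,        # tiny vector size, replacing the post-hoc reduce_model(model, 2)
    bucket=1000,  # far below the default of 2,000,000 character-ngram hash slots
    minCount=1,   # keep even the rare words of the tiny corpus
)
model.save_model("fake_model.bin")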

Related

word embedding: Convert supervised model into unsupervised model

I want to load a pre-trained embedding to initialize my own unsupervised FastText model and retrain it with my dataset.
The trained embedding file I have loads fine with gensim.models.KeyedVectors.load_word2vec_format('model.txt'). But when I try:
FastText.load_fasttext_format('model.txt') I get: NotImplementedError: Supervised fastText models are not supported.
Is there any way to convert supervised KeyedVectors to unsupervised FastText? And if possible, is it a bad idea?
I know there is a great difference between supervised and unsupervised models, but I really want to try to use/convert this and retrain it. I can't find a trained unsupervised model to load for my case (it's a Portuguese dataset), and that is the best model I have found.
If your model.txt file loads OK with KeyedVectors.load_word2vec_format('model.txt'), then that's just a simple set of word-vectors. (That is, not a 'supervised' model.)
However, Gensim's FastText doesn't support preloading a simple set of vectors for further training - for continued training, it needs a full FastText model, either from Facebook's binary format or a prior Gensim FastText model's .save().
(That trying to load a plain-vectors file generates that error suggests the load_fasttext_format() method is momentarily mis-interpreting it as some other kind of binary FastText model it doesn't support.)
Update after comment below:
Of course you can mutate a model however you like, including ways not officially supported by Gensim. Whether that's helpful is another matter.
You can create an FT model with a compatible/overlapping vocabulary, load the old word-vectors separately, then copy each prior vector over to replace the corresponding (randomly-initialized) vector in the new model. (Note that the property that affects further training is actually ftModel.wv.vectors_vocab, the trained-up full-word vectors, not .vectors, which is composited from full-words & ngrams.)
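A rough sketch of that copy-over, assuming Gensim 4.x attribute names; 'model.txt' and my_corpus stand in for your prior vectors file and your new training data:
from gensim.models import FastText, KeyedVectors

old_vecs = KeyedVectors.load_word2vec_format('model.txt')  # the plain word-vectors

ft = FastText(vector_size=old_vecs.vector_size, min_count=5)
ft.build_vocab(corpus_iterable=my_corpus)  # my_corpus: your own tokenized sentences

# copy prior full-word vectors into vectors_vocab wherever the vocabularies overlap;
# the ngram vectors remain randomly initialized
for word in old_vecs.index_to_key:
    if word in ft.wv.key_to_index:
        ft.wv.vectors_vocab[ft.wv.key_to_index[word]] = old_vecs[word]

ft.train(corpus_iterable=my_corpus, total_examples=ft.corpus_count, epochs=ft.epochs)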
But the tradeoffs of such an ad-hoc strategy are many. The ngrams would still start random. Taking some prior model's just-word vectors isn't quite the same as a FastText model's full-words-to-be-later-mixed-with-ngrams.
You'd want to make sure your new model's sense of word-frequencies is meaningful, as those affect further training - but that data isn't usually available with a plain-text prior word-vector set. (You could plausibly synthesize a good-enough set of frequencies by assuming a Zipf distribution.)
Your further training might get a "running start" from such initialization - but that wouldn't necessarily mean the end-vectors remain comparable to the starting ones. (All positions may be arbitrarily changed by the volume of newer training, progressively diluting away most of the prior influence.)
So: you'd be in an improvised/experimental setup, somewhat far from usual FastText practices and thus where you'd want to re-verify lots of assumptions, and rigorously evaluate if those extra steps/approximations are actually improving things.

When should I consider using pre-trained word2vec model weights?

Suppose my corpus is reasonably large - having tens of thousands of unique words. I can either use it to build a word2vec model directly (Approach #1 in the code below) or initialize a new word2vec model with pre-trained model weights and fine-tune it with my own corpus (Approach #2). Is Approach #2 worth considering? If so, is there a rule of thumb on when I should consider using a pre-trained model?
# Approach #1
from gensim.models import Word2Vec
model = Word2Vec(my_corpus, vector_size=300, min_count=1)
# Approach #2
model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(my_corpus)
# in Gensim 4+ this experimental method lives on model.wv rather than the model itself
model.wv.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)
model.train(my_corpus, total_examples=len(my_corpus), epochs=model.epochs)
The general answer to this type of question is: you should try them both, and see which works better for your purposes.
No one without your exact data & project goals can be sure which will work better in your situation, and you'll need the same kind of ability to evaluate alternate choices to do all sorts of very basic, necessary tuning of your work.
Separately:
"fine-tuning" word2vec-vectors can mean many things, and can introduce a number of expert-leve thorny tradeoff-decisions - the sorts of tradeoffs that can only be navigated if you've got a robust way to test different choices against each other.
The specific simple tuning approach your code shows - which relies on an experimental method (intersect_word2vec_format()) that might not work in the latest Gensim - is pretty limited. Since it discards all the words in the outside vectors that aren't already in your own corpus, it also discards one of the major reasons people often want to mix older vectors in: to cover more words that aren't in their training data. (I doubt that approach will be useful in many cases, but as per above, to be sure you'd want to try it with respect to your data/goals.)
It's almost always a bad idea to use min_count=1 with word2vec & similar algorithms. If such rare words are truly important, find more training examples so good vectors can be trained for them. But without enough training examples, they're usually better ignored - keeping them can even make the vectors for surrounding words worse.

Spacy train ner using multiprocessing

I am trying to train a custom NER model using spaCy. Currently, I have more than 2k records for training, each text consists of more than 100 words, and there are at least 2 entities per record. I am running it for 50 iterations.
It is taking more than 2 hours to train completely.
Is there any way to train using multiprocessing? Will it improve the training time?
Short answer... probably not
It's very unlikely that you will be able to get this to work for a few reasons:
The network being trained is performing iterative optimization
Without knowing the results from the batch before, the next batch cannot be optimized
There is only a single network
Any parallel training would be creating divergent networks...
...which you would then somehow have to merge
Long answer... there's plenty you can do!
There are a few different things you can try however:
Get GPU training working if you haven't
It's a pain, but can speed up training time a bit
It will dramatically lower CPU usage however
Try to use spaCy command line tools
The JSON format is a pain to produce but...
The benefit is you get a well optimised algorithm written by the experts
It can have dramatically faster / better results than hand crafted methods
If you have different entities, you can train multiple specialised networks
Each of these may train faster
These networks could be done in parallel to each other (CPU permitting)
Optimise your python and experiment with parameters
Speed and quality are very dependent on parameter tweaking (batch size, repetitions, etc.)
Your python implementation providing the batches (make sure this is top notch)
Pre-process your examples
spaCy NER extraction requires a surprisingly small amount of context to work
You could try pre-processing your snippets to contain only 10 or 15 surrounding words and see how your training time and accuracy fare (see the sketch below)
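A rough sketch of that kind of pre-processing, with a hypothetical helper that keeps roughly 15 words on each side of an annotated entity (note it collapses whitespace, so the character offsets are re-computed for the shortened text):
def window_example(text, start, end, label, window=15):
    # keep ~window words before and after the entity span
    before = text[:start].split()[-window:]
    after = text[end:].split()[:window]
    entity = text[start:end]
    snippet = " ".join(before + [entity] + after)
    new_start = len(" ".join(before)) + (1 if before else 0)
    return snippet, (new_start, new_start + len(entity), label)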
Final thoughts... when is your network "done"?
I have trained networks with many entities on thousands of examples, for longer than you describe, and the long and short of it is: sometimes it takes time.
However 90% of the increase in performance is captured in the first 10% of training.
Do you need to wait for 50 batches?
... or are you looking for a specific level of performance?
If you monitor the quality every X batches, you can bail out when you hit a pre-defined level of quality.
You can also keep old networks you have trained on previous batches and then "top them up" with new training to get to a level of performance you couldn't by starting from scratch in the same time.
Good luck!
Hi, I did a similar project where I created a custom NER model using spaCy 3 and extracted 26 entities from a large dataset. It really depends on how you are passing your data. Follow the steps below; they might make it work on CPU:
Annotate your text files and save the annotations as JSON.
Convert your JSON files into the .spacy format, because this is the format spaCy accepts.
Now, the point to note is how you pass and serialize your .spacy data into spaCy Doc objects.
Passing all your JSON text at once will take more time in training, so split your data and pass it iteratively. Don't pass consolidated data. Split it.
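A minimal sketch of steps 2-3 for spaCy 3.x, assuming your annotations are a JSON list of (text, {"entities": [(start, end, label), ...]}) pairs; the file names are placeholders:
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

with open("annotations.json") as f:
    records = json.load(f)

for text, ann in records:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in ann["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # skip annotations that don't align to token boundaries
            ents.append(span)
    doc.ents = ents
    doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")
# then train from the command line:
# python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy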

How to load a pre-trained doc2vec model and use its vectors

Does anyone know which function should I use if I want to use the pre-trained doc2vec models in this website https://github.com/jhlau/doc2vec?
I know we can use KeyedVectors.load_word2vec_format() to load the word vectors from pre-trained word2vec models, but do we have a similar function to load pre-trained doc2vec models in gensim as well?
Thanks a lot.
When a model like Doc2Vec is saved with gensim's native save(), it can be reloaded with the native load() method:
model = Doc2Vec.load(filename)
Note that large internal arrays may have been saved alongside the main filename, in other files with extra extensions - and all those files must be kept together to re-load a fully-functional model. (You only need to specify the main save file; the auxiliary files will be discovered at the expected names alongside it in the same directory.)
You may have other issues trying to use those pre-trained models. In particular:
as noted in the linked page, the author used a custom variant of gensim that forked off about 2 years ago; the files might not load in standard gensim, or later gensims
it's not completely clear what parameters were used to train those models (though I suppose if you succeed in loading them you could see them as properties in the model), and how much meta-optimization was used for which purposes, and whether those purposes will match your own project
if the parameters are as shown in one of the repo files, train_model.py, some are inconsistent with best practices (a min_count=1 is usually bad for Doc2Vec) or with the apparent model size (a mere 1.4GB model couldn't hold 300-dimensional vectors for all of the millions of documents or word-tokens in 2015 Wikipedia)
I would highly recommend training your own model, on a corpus you understand, with recent code, and using metaparameters optimized for your own purposes.
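For reference, a minimal sketch of training your own model with recent Gensim, where my_texts is a hypothetical list of token lists:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(my_texts)]
model = Doc2Vec(corpus, vector_size=300, min_count=5, epochs=20, workers=4)
model.save("my_doc2vec.model")
# later: model = Doc2Vec.load("my_doc2vec.model")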
Try this:
import gensim.models as g
model="model_folder/doc2vec.bin" #point to downloaded pre-trained doc2vec model
#load model
m = g.Doc2Vec.load(model)

What reasonable options for model persistence for sklearn models exist?

joblib.dump() seems to be the intended method for storing trained sklearn models for later loading and use. I like the compression option and ease of use, but later loading with joblib.load() is slow. It takes 20 seconds to load an SVM model trained on a reasonably small dataset (~10k texts). The model (stored with compress=3, as a recommended compromise) takes 100 MB as a dumped file.
For my own use (analysis), I needn't worry about the speed of loading a model, but I have one I'd like to share with colleagues and would like to make it as easy and quick as possible for them. I find examples of alternatives to joblib, e.g. pickle, json or hdfs. All are basically the same idea: dump a binary object to disk.
I find langid.py as an example that I suspect was created in a manner similar to a sklearn model. To me it looks like the whole model is encoded as some kind of string (base64?) and just pasted into the script itself. Is this a solution? How is this done?
Are there any other strategies that might be viable?
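One sketch of that "model as a string pasted into the script" idea, assuming a picklable trained estimator clf (hypothetical); it just wraps the usual joblib serialization in base64, so it won't make loading any faster:
import base64, io, joblib

# serialize the trained estimator to bytes, then to a base64 text string
buf = io.BytesIO()
joblib.dump(clf, buf, compress=3)  # clf: your trained sklearn model (hypothetical)
MODEL_B64 = base64.b64encode(buf.getvalue()).decode("ascii")
# MODEL_B64 could now be pasted into a .py file as a constant

# later, in the shared script, rebuild the model from the embedded string:
restored = joblib.load(io.BytesIO(base64.b64decode(MODEL_B64)))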
