What parameters, when training a gensim FastText model, have the biggest effect on the resulting model's size in memory?
gojomo's answer to this question mentions ways to reduce a model's size during training, apart from reducing embedding dimensionality.
There seem to be a few parameters that might have an effect, especially the thresholds for including words in the vocabulary. Do other parameters, such as the n-gram range, also influence model size, and which parameters have the largest effect?
I hope this is not too lazy of a question :-)
The main parameters affecting FastText model size are:
vector_size (dimensionality) - the size of the model is overwhelmingly a series of vectors (both whole-word and n-gram) of this length. Thus, reducing vector_size has a direct, large effect on total model size.
min_count and/or max_final_vocab - by affecting how many whole words are considered 'known' (in-vocabulary) for the model, these directly influence how many bulk vectors are in the model. Especially if you have large enough training data that model size is an issue – & are using FastText – you should be considering higher values than the default min_count=5. Very-rare words with just a handful of usage examples typically don't learn good generalizable representations in word2vec-like models. (Good vectors come from many subtly-contrasting usage examples.) But because of the Zipfian distribution of word frequencies, there are typically a lot of such words in natural language data, so they do wind up taking a lot of the training time, tug against other words' training, and push more-frequent words out of each other's context windows. Hence this is a case where, counter to many people's intuition, throwing away some data (the rarest words) can often improve the final model.
bucket – which specifies exactly how many n-gram vectors will be learned by the model, because they all share a collision-oblivious hashmap. That is, no matter how many unique n-grams there really are in the training data, they'll all be forced into exactly this many vectors. (Essentially, rarer n-grams will often collide with more-frequent ones, and be just background noise.)
Notably, because of the collisions tolerated by the bucket-sized hashmap, the parameters min_n & max_n actually don't affect the model size at all. Whether they allow for lots of n-grams of many sizes, or far fewer from a single or smaller range of sizes, they'll be shoehorned into the same number of buckets. (If more n-grams are used, a larger bucket value may help reduce collisions, and with more n-grams, training time will be longer. But the model will only grow with a larger bucket value, not with different min_n & max_n values.)
You can get a sense of a model's RAM size by using .save() to save it to disk - the size of the multiple related files created (without compression) will roughly be of a similar magnitude as the RAM needed by the model. So, you can improve your intuition for how varying parameters changes the model size, by running varied-parameter experiments with smaller models, and watching their different .save()-sizes. (Note that you don't actually have to .train() these models - they'll take up their full allocated size once the .build_vocab() step has completed.)
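For example, a rough sketch of such an experiment (with a tiny, made-up corpus; only .build_vocab() is needed, not .train(), and the bucket values here are just illustrative):

import glob
import os

from gensim.models import FastText

# Hypothetical toy corpus of tokenized sentences, purely for illustration
sentences = [["human", "interface", "computer"], ["graph", "minors", "survey"]] * 1000

for bucket in (100_000, 1_000_000):
    model = FastText(vector_size=100, min_count=5, bucket=bucket)
    model.build_vocab(corpus_iterable=sentences)   # allocates the full-size arrays
    path = f"ft_bucket_{bucket}.model"
    model.save(path)
    size_mb = sum(os.path.getsize(f) for f in glob.glob(path + "*")) / (1024 * 1024)
    print(f"bucket={bucket}: ~{size_mb:.1f} MB on disk")

The glob catches the main save file plus any large arrays gensim stores in separate files alongside it.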
I have two corpora that are from the same field, but with a temporal shift, say one decade. I want to train Word2vec models on them, and then investigate the different factors affecting the semantic shift.
I wonder how I should initialize the second model with the first model's embeddings, to avoid as much as possible the effect of variance in co-occurrence estimates.
At a naive & easy level, you can just load one existing model, and .train() on new data. But note if doing that:
Any words not already known by the model will be ignored, and the word-frequencies that feed algorithmic steps will only be from the initial survey of the original corpus.
While all words in the current corpus will get as many training-updates as their appearances (& your epochs setting) dictate, and thus be nudged arbitrarily-far from their original-model locations, other words from the seed model will stay exactly where they were. But, it's only the interleaved tug-of-war between words in the same training session that makes them usefully comparable. So doing this sequential training – updating only some words in a new training session – is likely to degrade the meaningfulness of word-to-word comparisons, in hard-to-measure ways.
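A minimal sketch of that naive sequential approach (the saved-model filename and second-era corpus here are hypothetical):

from gensim.models import Word2Vec

# Load the model trained on the first-era corpus (hypothetical filename)
model = Word2Vec.load("era1.model")

# Hypothetical second-era corpus: a list of tokenized sentences
era2_sentences = [["gene", "therapy", "trial"], ["viral", "vector", "delivery"]] * 100

# Continue training: only words already in the era1 vocabulary get updated
model.train(era2_sentences, total_examples=len(era2_sentences), epochs=model.epochs)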
Another approach that might be worth trying could be to train a single model over the combined corpus – but transform/repeat the era-specific texts/words in certain ways so that earlier usages can be distinguished from later usages. There are more details about this suggestion, in the context of word-vectors varying over usage eras, in a couple of previous answers:
https://stackoverflow.com/a/57400356/130288
https://stackoverflow.com/a/59095246/130288
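One hedged sketch of that transform/repeat idea (the exact scheme in those answers may differ; the era-suffix convention and corpora below are made up for illustration):

# Keep each sentence once with plain tokens (shared senses) and once with
# era-suffixed tokens, so the model also learns era-specific vectors that can
# later be compared, e.g. model.wv.similarity('cell_era1', 'cell_era2').
era1_sentences = [["the", "cell", "divides"]]                 # hypothetical first-era corpus
era2_sentences = [["the", "cell", "expresses", "genes"]]      # hypothetical second-era corpus

def era_tagged(sentence, era):
    return [f"{word}_{era}" for word in sentence]

combined_corpus = []
for sentence in era1_sentences:
    combined_corpus.append(sentence)
    combined_corpus.append(era_tagged(sentence, "era1"))
for sentence in era2_sentences:
    combined_corpus.append(sentence)
    combined_corpus.append(era_tagged(sentence, "era2"))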
Python 3.9.6
I wrote the code to create word embeddings for my domain (medicine books). My data consists of 45,000 normal-length sentences (31,519 unique words, 591,347 words in total). When I create/train a model:
import multiprocessing

from gensim.models.word2vec import Word2Vec

model = Word2Vec(sentences,
                 min_count=5,
                 vector_size=200,
                 workers=multiprocessing.cpu_count(),
                 window=6)
model.save(full_path)
it trains in about 1-2 seconds, and the size of the saved model is about 15 MB.
How can I check the correctness of my created word embeddings?
There's not really 'correctness' for word-embeddings, just 'usefulness for desired purposes'.
Especially when you are training-up domain-specific vectors, for some specific intended task, you should try to create your own mix of evaluations.
Those might start as some ad hoc sanity checks, like: "if I look at model.most_similar('esophageal') (or many other probe words), do the results make sense to me?"
But it's even better if your evaluations are some repeatable quantitative scoring – such as a bunch of words that 'should' be closer-to-each-other than other words – that could be run, quickly, against a new model where you've tweaked parameters or added training data, to rank it against other models.
And best if your downstream application – info-retrieval, or classification, or recommendation, etc – itself has some robust scoring against ideal results, and you can apply that back to indicate whether one set of word-vectors is better than another for that use, and by how much.
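For instance, a tiny repeatable scorer might look like this (the probe pairs below are made-up placeholders; you'd pick pairs that are meaningful for your domain):

# Hypothetical probe pairs that 'should' be similar in a medical corpus
PROBE_PAIRS = [("esophageal", "gastric"), ("aspirin", "ibuprofen"), ("femur", "tibia")]

def probe_score(model, pairs=PROBE_PAIRS):
    # Average cosine similarity over probe pairs that are in the vocabulary;
    # higher scores suggest the model better captures your intended relationships
    sims = [model.wv.similarity(a, b) for a, b in pairs
            if a in model.wv and b in model.wv]
    return sum(sims) / len(sims) if sims else float("nan")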
Separately: that's not a very big training set for word2vec, but it might be enough for some useful results. Still, 1-2 seconds is a suspiciously fast completion. Try enabling logging at the INFO level, and watch the progress info to make sure it's making sense at each step. Test whether using more than the default epochs=5 has the expected effect of making training take more time. (One common mistake is to pass a mere iterator, that's only capable of producing the training data once, instead of a true iterable object that can be re-iterated many times, to the model. That error allows the model the one pass it needs to discover the vocabulary, but not the epochs additional passes it needs for real training.)
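Enabling that logging is one line before constructing the model:

import logging

# Show gensim's INFO-level progress logs, so each vocabulary scan and training epoch is visible
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)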
I have been working on the Grover model by rowanz. I was able to train Grover's large model with a batch size of 4, but got a memory-allocation error while fine-tuning the mega model. I then reduced the batch size to 1 and training is now ongoing. I also tried reducing max_seq_length to 512 with batch_size set back to 4, and that worked as well.
My question is: which parameter will have more of an effect on performance, reducing batch size or reducing max_seq_length?
Also, can I set the value of max_seq_length to something other than a power of 2, like some value between 512 and 1024?
My question is: which parameter will have more of an effect on performance,
reducing batch size or reducing max_seq_length?
Effects of batch size:
On performance: None. It is a big misconception that batch size in any way affects the end metrics (e.g. accuracy). Smaller batch sizes do mean metrics are reported at shorter intervals, which gives the illusion of much larger variability than there actually is; the effect is especially noticeable with batch size = 1, for obvious reasons. Larger batch sizes tend to report steadier metrics, since they are calculated over more data points. The end metrics are usually the same (allowing for the random initialization of weights).
On efficiency: Larger batch sizes mean metrics are calculated less often, but they also require more memory, since each batch's data points are processed and aggregated together – the same issue you were facing. So batch size is more of an efficiency concern than a performance one, plus a matter of how often you want to check the model's output.
Effects of max_seq_length:
On performance: Probably the most important parameter for the performance of language models like Grover. The reason is that the perplexity of human-written text is lower than that of randomly sampled text, and this gap increases with sequence length. Generally, the longer the sequence, the easier it is for a language model to stay consistent over the whole course of the output. So yes, it does help model performance. However, you might want to check the documentation for your particular model for any "Goldilocks zone" of sequence lengths, and for whether sequence lengths that are powers of 2 are more desirable than others.
On efficiency: Larger sequence lengths of course require more processing power and memory, so the higher you go with sequence length, the more resources you will need.
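As a very rough illustration (not Grover's actual memory model; it just assumes self-attention activations dominate, scaling with batch_size × seq_len²):

def relative_attention_memory(batch_size, seq_len):
    # Self-attention score tensors scale roughly with batch_size * seq_len**2
    return batch_size * seq_len ** 2

baseline = relative_attention_memory(batch_size=4, seq_len=1024)   # the failing config
print(relative_attention_memory(1, 1024) / baseline)   # 0.25 -> batch_size 4 -> 1
print(relative_attention_memory(4, 512) / baseline)    # 0.25 -> seq_len 1024 -> 512

Under that rough assumption, either change cuts activation memory to about a quarter, which is consistent with both of your workarounds avoiding the allocation error.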
Also, can I set the value of max_seq_length to something other than a power of 2,
like some value between 512 and 1024?
Yeah, why not? No model is designed to work with only a fixed set of values. Experiment with different sequence lengths and see whichever works best for you. Adjusting some parameters in powers of two has been a classical practice for gaining a small computational advantage from their simple binary representations, but that advantage is negligible for large models today.
I already have a model that has been trained on 130,000 sentences.
I want to categorize sentences with a bidirectional LSTM.
We plan to use this service.
However, the model must continue to be trained throughout the service.
So I'm thinking that, until the model's accuracy improves, I will look at the sentences the model has categorized, provide the correct answers myself, and train the model on those corrected sentences.
Is there a difference between training on the sentences one by one, as each arrives, and merging them into one file and training on them together? Does it matter?
Yes, there is a difference. Suppose, you have a dataset of 10,000 sentences.
If you train on one sentence at a time, then optimization (backpropagation) takes place after every sentence. This consumes more time and is not a good choice if you have a large dataset; computing the gradient on each individual instance is noisy, and convergence is slower.
If you train in batches – say the batch size is 1000 – then you have 10 batches. Each batch goes through the network together, and gradients are computed over the whole batch. The gradients still carry enough noise to avoid settling in poor local minima, while being far less erratic than per-instance updates; it is also memory-efficient and converges faster.
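A minimal sketch of the difference, assuming a Keras bidirectional LSTM classifier and already-tokenized, padded data (all names, shapes and labels below are made up for illustration):

import numpy as np
from tensorflow.keras import layers, models

# Toy stand-ins for the real data
num_sentences, max_len, vocab_size, num_classes = 10_000, 50, 20_000, 5
X = np.random.randint(1, vocab_size, size=(num_sentences, max_len))   # padded token ids
y = np.random.randint(0, num_classes, size=(num_sentences,))          # class labels

model = models.Sequential([
    layers.Embedding(vocab_size, 128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Batched training: each gradient update averages over 1000 sentences
model.fit(X, y, batch_size=1000, epochs=5)

# The slow, noisy alternative described in the question: one sentence per update
# model.fit(X, y, batch_size=1, epochs=5)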
You can check out answers from here, here and here.
I'm training a Word2Vec model like:
model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)
and Doc2Vec model like:
doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)
with the same data and comparable parameters.
After this I'm using these models for my classification task, and I have found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried with many more doc2vec iterations (25, 80 and 150), which makes no difference.
Any tips or ideas why and how to improve doc2vec results?
Update: This is how doc2vec_tagged_documents is created:
from gensim.models.doc2vec import TaggedDocument

doc2vec_tagged_documents = list()
counter = 0
for document in documents:
    doc2vec_tagged_documents.append(TaggedDocument(document, [counter]))
    counter += 1
Some more facts about my data:
My training data contains 4000 documents
with 900 words on average.
My vocabulary size is about 1000 words.
My data for the classification task is much shorter (12 words on average). I also tried splitting the training data into lines and training the doc2vec model on those, but the result is almost the same.
My data is not about natural language, please keep this in mind.
Summing/averaging word2vec vectors is often quite good!
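For reference, a simple (hypothetical) helper for that averaging baseline might look like:

import numpy as np

def average_doc_vector(w2v_model, words):
    # Average the vectors of in-vocabulary words; zero vector if none are known
    vectors = [w2v_model.wv[w] for w in words if w in w2v_model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v_model.wv.vector_size)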
It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)
If your main interest is the doc-vectors – and not the word-vectors that are in some Doc2Vec modes co-trained – definitely try the PV-DBOW mode (dm=0) as well. It'll train faster and is often a top-performer.
If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning – unless you have a larger collection of longer docs.
Other things that sometimes help improve Doc2Vec vectors for classification purposes:
re-inferring all document vectors, at the end of training, perhaps even using parameters different from infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025) – while quite slow, this means all docs get vectors from the same final model state, rather than what's left-over from bulk training
where classification labels are known, adding them as trained doc-tags, using the capability of TaggedDocument tags to be a list of tags
rare words are essentially just noise to Word2Vec or Doc2Vec - so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is also, in competition to the doc-vector, trying to make those singleton word-vectors predictive of their single-document neighborhoods... when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW, and increasing min_count.)
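A sketch combining a few of these suggestions, using the same (older) gensim API as the code above; the parameter values are just starting points, not tuned recommendations:

from gensim.models.doc2vec import Doc2Vec

# PV-DBOW mode (dm=0), more iterations, and a min_count above the minimum
doc2vec_model = Doc2Vec(dm=0, size=200, window=5, min_count=5, iter=20, workers=4)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents,
                    total_examples=doc2vec_model.corpus_count,
                    epochs=doc2vec_model.iter)

# Re-infer every document vector from the final model state
reinferred_vectors = [doc2vec_model.infer_vector(doc.words, steps=50, alpha=0.025)
                      for doc in doc2vec_tagged_documents]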
Hope this helps.