I am new to doc2vec. I was initially trying to understand it, and below is my code that uses Gensim. It works as I want: I get a trained model and document vectors for the two documents.
However, I would like to know the benefits of retraining the model over several epochs, and how to do that in Gensim. Can it be done with the iter or alpha parameter, or does it have to be trained in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs.
I am also interested in knowing whether multiple training iterations are needed for a word2vec model as well.
# Import libraries
from gensim.models import doc2vec
from collections import namedtuple
# Load data
doc1 = ["This is a sentence", "This is another sentence"]
# Transform data
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))
# Train model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)
# Get the vectors
model.docvecs[0]
model.docvecs[1]
Word2Vec and related algorithms (like 'Paragraph Vectors' aka Doc2Vec) usually make multiple training passes over the text corpus.
Gensim's Word2Vec/Doc2Vec allows the number of passes to be specified by the iter parameter, if you're also supplying the corpus in the object initialization to trigger immediate training. (Your code above does this by supplying docs to the Doc2Vec(docs, ...) constructor call.)
If unspecified, the default iter value used by gensim is 5, to match the default used by Google's original word2vec.c release. So your code above is already using 5 training passes.
Published Doc2Vec work often uses 10-20 passes. If you wanted to do 20 passes instead, you could change your Doc2Vec initialization to:
model = doc2vec.Doc2Vec(docs, iter=20, ...)
Because Doc2Vec often uses unique identifier tags for each document, more iterations can be more important, so that every doc-vector comes up for training multiple times over the course of training, as the model gradually improves. On the other hand, because the words in a Word2Vec corpus might appear anywhere throughout the corpus, each word's associated vectors will get multiple adjustments, early, middle, and late in the process as the model improves – even with just a single pass. (So with a giant, varied Word2Vec corpus, it's plausible to use fewer than the default number of passes.)
You don't need to do your own loop, and most users shouldn't. If you do manage the separate build_vocab() and train() steps yourself, instead of the easier step of supplying the docs corpus in the initializer call to trigger immediate training, then you must supply an epochs argument to train() – and it will perform that number of passes, so you still only need one call to train().
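For completeness, here's a minimal sketch of that explicit two-step route for 20 passes, reusing the parameters from your code (this assumes the older gensim API shown above, where the pass count is the iter parameter):
model = doc2vec.Doc2Vec(size=100, window=300, min_count=1, workers=4, iter=20)
model.build_vocab(docs)  # one pass over the corpus to discover the vocabulary
model.train(docs, total_examples=model.corpus_count, epochs=model.iter)  # 20 passes in a single call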
Related
I have two corpora that are from the same field, but with a temporal shift, say one decade. I want to train Word2vec models on them, and then investigate the different factors affecting the semantic shift.
I wonder how should I initialize the second model with the first model's embeddings to avoid as much as possible the effect of variance in co-occurrence estimates.
At a naive & easy level, you can just load one existing model, and .train() on new data. But note if doing that:
Any words not already known by the model will be ignored, and the word-frequencies that feed algorithmic steps will only reflect the initial survey.
While all words in the current corpus will get as many training-updates as their appearances (& your epochs setting) dictate, and thus be nudged arbitrarily-far from their original-model locations, other words from the seed model will stay exactly where they were. But, it's only the interleaved tug-of-war between words in the same training session that makes them usefully comparable. So doing this sequential training – updating only some words in a new training session – is likely to degrade the meaningfulness of word-to-word comparisons, in hard-to-measure ways.
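A minimal sketch of that naive continue-training route, under the caveats above (the filename and the era2_sentences list of tokenized sentences are placeholders, not anything from your setup):
from gensim.models import Word2Vec
base_model = Word2Vec.load("era1_word2vec.model")  # model trained on the first-era corpus
# only words already in the loaded vocabulary get updated; new words are silently skipped
base_model.train(era2_sentences, total_examples=len(era2_sentences), epochs=base_model.epochs)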
Another approach that might be worth trying could be to train a single model over the combined corpus – but transform/repeat the era-specific texts/words in certain ways so that earlier usages can be distinguished from later usages. There are more details about this suggestion, in the context of word-vectors varying over usage-eras, in a couple of previous answers:
https://stackoverflow.com/a/57400356/130288
https://stackoverflow.com/a/59095246/130288
Python 3.9.6
I wrote the code to create word embeddings for my domain (medicine books). My data consists of 45,000 normal-length sentences (31,519 unique words, 591,347 words in total). When I create/train a model:
import multiprocessing
from gensim.models.word2vec import Word2Vec
model = Word2Vec(sentences,
min_count = 5,
vector_size = 200,
workers = multiprocessing.cpu_count(),
window = 6
)
model.save(full_path)
it trains in about 1-2 seconds, and the size of the saved model is about 15 MB.
How can I check the correctness of my word embeddings?
There's not really 'correctness' for word-embeddings, just 'usefulness for desired purposes'.
Especially when you are training-up domain-specific vectors, for some specific intended task, you should try to create your own mix of evaluations.
Those might start as some ad hoc sanity checks, like: "if I look at model.wv.most_similar('esophageal') (or many other probe words), do the results make sense to me?"
But it's even better if your evaluations are some repeatable quantitative scoring – such as a bunch of words that 'should' be closer-to-each-other than other words – that could be run, quickly, against a new model where you've tweaked parameters or added training data, to rank it against other models.
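For instance, a tiny repeatable check might look like this sketch (the probe pairs here are invented placeholders; substitute domain terms you expect to be related):
probe_pairs = [("esophagus", "stomach"), ("aspirin", "ibuprofen")]
def mean_probe_similarity(model, pairs):
    # average cosine similarity over the probe pairs the model actually knows
    sims = [model.wv.similarity(a, b) for a, b in pairs if a in model.wv and b in model.wv]
    return sum(sims) / len(sims) if sims else float("nan")
print(mean_probe_similarity(model, probe_pairs))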
And best if your downstream application – info-retrieval, or classification, or recommendation, etc – itself has some robust scoring against ideal results, and you can apply that back to indicate whether one set of word-vectors is better than another for that use, and by how much.
Separately: that's not a very big training set for word2vec, but it might be enough for some useful results. Still, 1-2 seconds is suspiciously fast completion. Try enabling logging at the INFO level, and watch the progress info to make sure it's making sense at each step. Test whether using more than the default epochs=5 has the expected effect of making training take more time. (One common mistake is to pass the model a mere iterator, which can only produce the training data once, instead of a true iterable that can be re-iterated many times. That error allows the model the one pass it needs to discover the vocabulary, but not the epochs additional passes it needs for real training.)
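Enabling that logging is just the standard Python pattern, for example:
import logging
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)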
I need to train my own model with word2vec and fasttext. By reading different sources I found different information.
So I did the model and trained it like this:
from gensim.models import FastText, Word2Vec
model = FastText(all_words, size=300, min_count=3, sg=1)
model = Word2Vec(all_words, min_count=3, sg=1, size=300)
So I read that this should be enough to create and train the model. But then I saw that some people do it separately:
model = FastText(size=4, window=3, min_count=1) # instantiate
model.train(sentences=common_texts, total_examples=len(common_texts), epochs=10) # train
Now I am confused and don't know if what I did is correct. Can somebody help me make it clear?
Thank you
It's perfectly acceptable to supply your training corpus – all_words – when you instantiate the model object. In that case, the model will automatically perform all steps needed to train the model, using that data. So you can do this:
model = Word2Vec(all_words, ...) # where '...' is your non-default params
It's also acceptable to not provide the corpus when instantiating the model – but then the model is extremely minimal, with just your initial parameters. It still needs to discover the relevant vocabulary (which requires a single pass over the training data), then allocate some very large internal structures to accommodate those words, then do the actual training (which requires multiple additional passes over the training data).
So if you don't provide the corpus when the model is instantiated, you should do two extra method calls:
model = Word2Vec(...) # where '...' is your non-default params
model.build_vocab(all_words) # discover vocabulary & allocate model
# now train, with #-of-passes & #-of-texts set by earlier steps
model.train(all_words, epochs=model.iter, total_examples=model.corpus_count)
These two code blocks I've shown are equivalent. The top does the usual steps for you; the bottom breaks the steps out into your explicit control.
(The code you'd excerpted in your question, showing only a .train() call, would error for a number of reasons. The .build_vocab() is a necessary step to have a fully-allocated model, and the call to .train() must explicitly state the desired epochs and an accurate count total_examples of the number-of-items in the corpus. But, you can and typically should re-use values that were already cached into the model by the two previous steps.)
It's your choice which approach to use. Generally people only use the 3-separate-steps process if they want to do other output/logging between the steps, or something advanced between the steps that might tamper with the model state.
I wonder how to deploy a doc2vec model in production to create word vectors as input features to a classifier. To be specific, let say, a doc2vec model is trained on a corpus as follows.
dataset['tagged_descriptions'] = dataset.apply(lambda x: doc2vec.TaggedDocument(
    words=x['text_columns'], tags=[str(x.ID)]), axis=1)
model = doc2vec.Doc2Vec(vector_size=100, min_count=1, epochs=150, workers=cores,
window=5, hs=0, negative=5, sample=1e-5, dm_concat=1)
corpus = dataset['tagged_descriptions'].tolist()
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
and then it is dumped into a pickle file. The word vectors are used to train a classifier such as random forests to predict movies sentiment.
Now suppose that in production there is a document containing some totally new vocabulary, that is, words that were not present during the training of the doc2vec model. I wonder how to tackle such a case.
As a side note, I am aware of Updating training documents for gensim Doc2Vec model and Gensim: how to retrain doc2vec model using previous word2vec model. However, I would appreciate more lights to be shed on this matter.
A Doc2Vec model will only be able to report trained-up vectors for documents that were present during training, and only be able to infer_vector() new doc-vectors for texts containing words that were present during training. (Unrecognized words passed to .infer_vector() will be ignored, similar to the way any words appearing fewer than min_count times are ignored during training.)
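So for a new production document, the usual route is something like this sketch (the simple tokenization and the fitted classifier are placeholders for your real pipeline):
tokens = new_text.lower().split()  # any words unknown to the model are ignored by inference
doc_vector = model.infer_vector(tokens)
prediction = classifier.predict([doc_vector])  # e.g. the random forest trained on doc-vectors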
If over time you acquire many new texts with new vocabulary words, and those words are important, you'll have to occasionally re-train the Doc2Vec model. And, after re-training, the doc-vectors from the re-trained model will generally not be comparable to doc-vectors from the original model – so downstream classifiers and other applications using the doc-vectors will need updating, as well.
Your own production/deployment requirements will drive how often this re-training should happen, and old models replaced with newer ones.
(While a Doc2Vec model can be fed new training data at any time, doing so incrementally as a sort of 'fine-tuning' introduces hairy issues of balance between old and new data. And, there's no official gensim support for expanding the existing vocabulary of a Doc2Vec model. So, the most robust course is to retrain from scratch using all available data.)
A few side notes on your example training code:
it's rare for min_count=1 to be a good idea: rare words often lack sufficient usage examples to model well, and thus act as 'noise' that just slows/interferes with the patterns that can be learned from more common words
dm_concat=1 is best considered an experimental/advanced mode, as it makes models significantly larger and slower to train, with unproven benefits.
much published work uses just 10-20 training epochs; smaller datasets or smaller docs will sometimes benefit from more, but 150 may be taking a lot of time with very little marginal benefit.
I am a bit confused regarding an aspect of Doc2Vec. Basically, I am not sure if what I do makes sense. I have the following dataset :
train_doc_0 --> label_0
... ...
train_doc_99 --> label_0
train_doc_100 --> label_1
... ...
train_doc_199 --> label_1
... ...
... ...
train_doc_239999 --> label_2399
eval_doc_0
...
eval_doc_29
Where train_doc_n is a short document belonging to some label. There are 2400 labels, with 100 training documents per label. eval_doc_n are evaluation documents whose labels I would like to predict in the end (using a classifier).
I train a Doc2Vec model with these training documents & labels. Once the model is trained, I reproject each of the original training documents, as well as my evaluation documents (the ones I would like to classify in the end), into the model's space using infer_vector.
The result is a set of matrices:
X_train (240000,300) # doc2vec vectors for training documents
y_train (240000,) # corresponding labels
X_eval (30, 300) # doc2vec vectors for evaluation documents
My problem is the following: if I run a simple cross validation on X_train and y_train, I get decent accuracy. But once I try to classify my evaluation documents (even using only 50 randomly sampled labels) I get super bad accuracy, which makes me question my way of approaching this problem.
I followed this tutorial for the training of documents.
Does my approach make sense, especially with reprojecting all the training documents using infer_vector ?
I don't see anything blatantly wrong.
Are the evaluation documents similar to the training documents in length, vocabulary, etc? Ideally, they'd be a randomly-chosen subset of all available labeled examples. (If quite different, that might be a reason why cross-validation versus held-out-evaluation accuracy varies.)
When training the Doc2Vec model, are you giving each document a single unique ID as the only entry of its tags? Or are you using the label_n labels as the tags of your training examples? Or perhaps both? (Any of those are defensible choices, though I've found that mixing known labels into the usually 'unsupervised' Doc2Vec training, making it semi-supervised, often helps the model's vectors become more useful as input to later explicitly-supervised classifiers.)
When I get unexpectedly 'super-bad' accuracy at some step, it's often because some erroneous shuffling or re-ordering of the test examples has occurred, randomizing the real relationships. So it's worth double-checking for that, in code and by looking at a few examples in detail.
Re-inferring vectors for examples used in training, rather than simply asking for the trained-up vectors retained in the model, sometimes results in better vectors. However, many have observed that different-than-default parameters to infer_vector(), especially many more steps and perhaps a starting alpha closer to that used during training, may improve results. (Also, inference seems to work well in fewer steps in the simpler PV-DBOW, dm=0, mode; PV-DM, dm=1, may especially require more steps.)
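For example, something along these lines, where doc_words is the tokenized text (note that in older gensim versions the pass-count argument to infer_vector() is steps, while newer versions call it epochs):
vec = model.infer_vector(doc_words, steps=50, alpha=0.025)  # more passes, training-like starting alpha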
The tutorial you link shows a practice, calling train() multiple times while adjusting alpha yourself, that's generally unnecessary and error-prone – and specifically isn't likely to be doing the right thing in the latest gensim versions. You can leave the default alpha/min_alpha in place, and supply a preferred iter value during Doc2Vec initialization - and then one call to train() will automatically do that many passes, and glide the learning-rate down properly. And since the default iter is 5, if you don't set it, every call to train() is doing 5 passes - so doing your own external loop of 10 would mean 50 passes, and the code at that tutorial, with two calls to train() per loop for some odd reason, would mean 100 passes.
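In other words, a sketch of the simpler setup (again assuming the older gensim API where the pass count is iter, with train() given the explicit counts required in gensim 1.0+; train_corpus stands in for your tagged training documents):
model = doc2vec.Doc2Vec(size=300, iter=20)  # other parameters as desired
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)  # one call, 20 passes, alpha decayed automatically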