I wonder how to deploy a doc2vec model in production to create word vectors as input features to a classifier. To be specific, let's say a doc2vec model is trained on a corpus as follows.
import multiprocessing
from gensim.models import doc2vec

cores = multiprocessing.cpu_count()
# tag each row's tokenized text with its ID, so Doc2Vec learns one vector per document
dataset['tagged_descriptions'] = dataset.apply(lambda x: doc2vec.TaggedDocument(
    words=x['text_columns'], tags=[str(x.ID)]), axis=1)
model = doc2vec.Doc2Vec(vector_size=100, min_count=1, epochs=150, workers=cores,
                        window=5, hs=0, negative=5, sample=1e-5, dm_concat=1)
corpus = dataset['tagged_descriptions'].tolist()
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
and then it is dumped into a pickle file. The word vectors are used to train a classifier such as a random forest to predict movie sentiment.
Now suppose that, in production, a document contains some totally new vocabulary, i.e. words that were not present during the training of the doc2vec model. I wonder how to tackle such a case.
As a side note, I am aware of Updating training documents for gensim Doc2Vec model and Gensim: how to retrain doc2vec model using previous word2vec model. However, I would appreciate more light being shed on this matter.
A Doc2Vec model will only be able to report trained-up vectors for documents that were present during training, and only be able to infer_vector() new doc-vectors for texts containing words that were present during training. (Unrecognized words passed to .infer_vector() will be ignored, similar to the way any words appearing fewer than min_count times are ignored during training.)
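For example, a minimal sketch (assuming the trained model from the code above, gensim 3.x attribute names, and a hypothetical new document) of how unseen words simply drop out before inference:

new_doc = ['i', 'adore', 'this', 'unprecedented', 'blockbuster']   # hypothetical tokens from production
known = [w for w in new_doc if w in model.wv.vocab]                # gensim 3.x vocabulary check
vector = model.infer_vector(new_doc)   # internally, unrecognized tokens are skipped, as if only 'known' were passed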
If over time you acquire many new texts with new vocabulary words, and those words are important, you'll have to occasionally re-train the Doc2Vec model. And, after re-training, the doc-vectors from the re-trained model will generally not be comparable to doc-vectors from the original model – so downstream classifiers and other applications using the doc-vectors will need updating, as well.
Your own production/deployment requirements will drive how often this re-training should happen, and old models replaced with newer ones.
(While a Doc2Vec model can be fed new training data at any time, doing so incrementally as a sort of 'fine-tuning' introduces hairy issues of balance between old and new data. And, there's no official gensim support for expanding the existing vocabulary of a Doc2Vec model. So, the most robust course is to retrain from scratch using all available data.)
A few side notes on your example training code (a revised setup is sketched after these notes):
it's rare for min_count=1 to be a good idea: rare words lack sufficient usage examples to be modeled well, and so act as 'noise' that slows and interferes with the patterns that can be learned from more common words
dm_concat=1 is best considered an experimental/advanced mode, as it makes models significantly larger and slower to train, with unproven benefits.
much published work uses just 10-20 training epochs; smaller datasets or smaller docs will sometimes benefit from more, but 150 may be taking a lot of time with very little marginal benefit.
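Putting those notes together, a more conventional starting configuration might look something like the following sketch (parameter values are illustrative, not prescriptive):

model = doc2vec.Doc2Vec(vector_size=100, min_count=2, epochs=20, workers=cores,
                        window=5, hs=0, negative=5, sample=1e-5)  # default dm=1, without dm_concat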
Related
I have two corpora that are from the same field, but with a temporal shift, say one decade. I want to train Word2vec models on them, and then investigate the different factors affecting the semantic shift.
I wonder how I should initialize the second model with the first model's embeddings, to avoid as much as possible the effect of variance in co-occurrence estimates.
At a naive & easy level, you can just load one existing model and .train() it on new data (a rough sketch follows these notes). But note if doing that:
Any words not already known by the model will be ignored, and the word-frequencies that feed algorithmic steps will only reflect the initial survey.
While all words in the current corpus will get as many training-updates as their appearances (& your epochs setting) dictate, and thus be nudged arbitrarily-far from their original-model locations, other words from the seed model will stay exactly where they were. But, it's only the interleaved tug-of-war between words in the same training session that makes them usefully comparable. So doing this sequential training – updating only some words in a new training session – is likely to degrade the meaningfulness of word-to-word comparisons, in hard-to-measure ways.
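A rough sketch of that naive sequential approach, with hypothetical file and variable names (era1_model.model, era2_sentences), and subject to the caveats above:

from gensim.models import Word2Vec

model = Word2Vec.load('era1_model.model')      # hypothetical: model trained on the earlier corpus
model.train(era2_sentences,                    # hypothetical: tokenized texts from the later decade
            total_examples=len(era2_sentences),
            epochs=model.epochs)               # words unknown to the loaded model are simply ignored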
Another approach that might be worth trying could be to train a single model over the combined corpus – but transform/repeat the era-specific texts/words in certain ways so that earlier usages can be distinguished from later ones. There are more details about this suggestion, in the context of word-vectors varying over usage eras, in a couple of previous answers:
https://stackoverflow.com/a/57400356/130288
https://stackoverflow.com/a/59095246/130288
I want to load a pre-trained embedding to initialize my own unsupervised FastText model and retrain it with my dataset.
The trained embedding file I have loads fine with gensim.models.KeyedVectors.load_word2vec_format('model.txt'). But when I try:
FastText.load_fasttext_format('model.txt') I get: NotImplementedError: Supervised fastText models are not supported.
Is there any way to convert supervised KeyedVectors to unsupervised FastText? And if possible, is it a bad idea?
I know there is a great difference between supervised and unsupervised models. But I really want to try to use/convert this one and retrain it. I can't find a trained unsupervised model to load for my case (it's a Portuguese dataset), and the best model I've found is that one.
If your model.txt file loads OK with KeyedVectors.load_word2vec_format('model.txt'), then that's just a simple set of word-vectors. (That is, not a 'supervised' model.)
However, Gensim's FastText doesn't support preloading a simple set of vectors for further training - for continued training, it needs a full FastText model, either from Facebook's binary format, or a prior Gensim FastText model .save().
(That trying to load a plain-vectors file generates that error suggests the load_fasttext_format() method is momentarily mis-interpreting it as some other kind of binary FastText model it doesn't support.)
Update after comment below:
Of course you can mutate a model however you like, including ways not officially supported by Gensim. Whether that's helpful is another matter.
You can create an FT model with a compatible/overlapping vocabulary, load the old word-vectors separately, then copy each prior vector over to replace the corresponding (randomly-initialized) vector in the new model. (Note that the property that affects further training is actually ftModel.wv.vectors_vocab, the trained-up full-word vectors, not .vectors, which is composited from full words & ngrams.)
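A very rough sketch of that copy step, assuming gensim 3.x attribute names (wv.vocab, wv.vectors_vocab) and hypothetical my_corpus / model.txt inputs – experimental, not an officially supported path:

from gensim.models import FastText, KeyedVectors

kv = KeyedVectors.load_word2vec_format('model.txt')      # the plain prior word-vectors
ft = FastText(size=kv.vector_size, min_count=1)
ft.build_vocab(my_corpus)                                # hypothetical: your own tokenized sentences
for word, vocab_obj in ft.wv.vocab.items():
    if word in kv.vocab:                                 # copy overlapping full-word vectors only
        ft.wv.vectors_vocab[vocab_obj.index] = kv[word]  # ngram vectors stay randomly initialized
ft.train(my_corpus, total_examples=ft.corpus_count, epochs=ft.epochs)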
But the tradeoffs of such an ad-hoc strategy are many. The ngrams would still start out random. And taking some prior model's plain word-vectors isn't quite the same as having a FastText model's full-word vectors, which were trained to be later mixed with ngrams.
You'd want to make sure your new model's sense of word-frequencies is meaningful, as those affect further training - but that data isn't usually available with a plain-text prior word-vector set. (You could plausibly synthesize a good-enough set of frequencies by assuming a Zipf distribution.)
Your further training might get a "running start" from such initialization - but that wouldn't necessarily mean the end-vectors remain comparable to the starting ones. (All positions may be arbitrarily changed by the volume of newer training, progressively diluting away most of the prior influence.)
So: you'd be in an improvised/experimental setup, somewhat far from usual FastText practices and thus where you'd want to re-verify lots of assumptions, and rigorously evaluate if those extra steps/approximations are actually improving things.
I'm trying to classify whether or not I liked books that I've read this year based on the text in the books. I'm using the preprocessing described here, and a variety of sklearn classification models.
At first I was just feeding the models the raw text, but I cleaned it based on GloVe embeddings (a process described here). The text was improved from 40% vocab, 80% coverage to 80% vocab, 98% coverage based on GloVe embeddings. However, for some reason, after cleaning the text, the accuracy of the classification models seemed to be the same or lower.
(Uncleaned and cleaned text model results: accuracy tables not shown.)
One thing to note is that the classes are quite imbalanced (75% of books were good as compared to 25% bad), so accuracy above 75% should be expected, since 75% is what the model would get if it guessed good every single time.
I've linked my full notebook here so you can check out the specific code if that will be helpful for solving this issue. I'm incredibly confused; I can't see where I'm going wrong, but it can't be right that cleaning the text data has zero or negative impact on model accuracy.
I think the main point you are missing is that data cleaning is an empirical process. Text preprocessing may consist of removing stop words, punctuation, and numerals, or lowercasing, but whether this adds to the model's ability to learn and generalize remains to be seen through cross-validation, i.e. feeding the results of your preprocessing into model training and checking whether it generalizes well to the held-out folds.
In general, preprocessing (stop-word removal, etc.) works well for Bag-of-Words models because it reduces data dimensionality; BOW data is long and sparse (see, e.g., the curse of dimensionality for a possible theoretical foundation). The need for such preprocessing is diminished with word embeddings like word2vec or BERT.
In short, if you have any data preprocessing in mind, check whether it helps your model learn and generalize through properly constructed CV.
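For instance, a minimal sketch of such a check, with hypothetical raw_texts, cleaned_texts and labels lists, and a simple TF-IDF + logistic-regression pipeline standing in for your actual models:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
for name, texts in [('raw', raw_texts), ('cleaned', cleaned_texts)]:    # hypothetical lists of documents
    scores = cross_val_score(pipe, texts, labels, cv=5, scoring='f1')   # labels: hypothetical 0/1 good/bad; f1 is less misleading than accuracy with 75/25 classes
    print(name, scores.mean())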
I don't have a large corpus of data to train word similarities on, e.g. that 'hot' is more similar to 'warm' than to 'cold'. However, I'd like to train doc2vec on a relatively small corpus (~100 docs) so that it can classify my domain-specific documents.
To elaborate, let me use this toy example. Assume I have only 4 training docs, given by 4 sentences: "I love hot chocolate.", "I hate hot chocolate.", "I love hot tea.", and "I love hot cake.".
Given a test document "I adore hot chocolate", I would expect doc2vec to invariably return "I love hot chocolate." as the closest document. This expectation would hold if word2vec already supplied the knowledge that "adore" is very similar to "love". However, the most similar document I'm getting is "I hate hot chocolate" – which is bizarre!
Any suggestions on how to circumvent this, i.e. to use pre-trained word embeddings so that I don't need to train from scratch that "adore" is close to "love", "hate" is close to "detest", and so on?
Code (Jupyter Notebook, Python 3.7, Gensim 3.8.1):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love hot chocolate.",
"I hate hot chocolate",
"I love hot tea.",
"I love hot cake."]
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
print(tagged_data)
#Train and save
max_epochs = 10
vec_size = 5
alpha = 0.025
model = Doc2Vec(vector_size=vec_size,  # it was size earlier
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    if epoch % 10 == 0:
        print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)  # it was model.iter earlier
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
print("Model Ready")
test_sentence="I adore hot chocolate"
test_data = word_tokenize(test_sentence.lower())
v1 = model.infer_vector(test_data)
#print("V1_infer", v1)
# to find most similar doc using tags
sims = model.docvecs.most_similar([v1])
print("\nTest: %s\n" %(test_sentence))
for indx, score in sims:
    print("\t(score: %.4f) %s" % (score, data[int(indx)]))
Just ~100 documents is way too small to meaningfully train a Doc2Vec (or Word2Vec) model. Published Doc2Vec work tends to use tens-of-thousands to millions of documents.
To the extent you may be able to get slightly meaningful results from smaller datasets, you'll usually need to reduce the vector sizes a lot – to far smaller than the number of words/examples – and increase the training epochs. (Your toy data has 4 texts & 6 unique words. Even to get 5-dimensional vectors, you probably want something like 5^2 contrasting documents.)
Also, gensim's Doc2Vec doesn't offer any official option to import word-vectors from elsewhere. The internal Doc2Vec training is not a process where word-vectors are trained 1st, then doc-vectors calculated. Rather, doc-vectors & word-vectors are trained in a simultaneous process, gradually improving together. (Some modes, like the fast & often highly effective DBOW that can be enabled with dm=0, don't create or use word-vectors at all.)
There's not really anything bizarre about your 4-sentence results, when looking at the data as if we were the Doc2Vec or Word2Vec algorithms, which have no prior knowledge about words, only what's in the training data. In your training data, the token 'love' and the token 'hate' are used in nearly exactly the same way, with the same surrounding words. Only by seeing many subtly varied alternative uses of words, alongside many contrasting surrounding words, can these "dense embedding" models move the word-vectors to useful relative positions, where they are closer to related words & farther from other words. (And, since you've provided no training data with the token 'adore', the model knows nothing about that word – and if it's provided inside a test document, as if to the model's infer_vector() method, it will be ignored. So the test document it 'sees' is only the known words ['i', 'hot', 'chocolate'].)
But also, even if you did manage to train on a larger dataset, or somehow inject the knowledge from other word-vectors that 'love' and 'adore' are somewhat similar, it's important to note that antonyms are typically quite similar in sets of word-vectors, too – as they are used in the same contexts, and often syntactically interchangeable, and of the same general category. These models often aren't very good at detecting the flip-in-human-perceived meaning from the swapping of a word for its antonym (or insertion of a single 'not' or other reversing-intent words).
Ultimately if you want to use gensim's Doc2Vec, you should train it with far more data. (If you were willing to grab some other pre-trained word-vectors, why not grab some other source of somewhat-similar bulk sentences? The effect of using data that isn't exactly like your actual problem will be similar whether you leverage that outside data via bulk text or a pre-trained model.)
Finally: it's a bad, error-prone pattern to be calling train() more than once in your own loop, with your own alpha adjustments. You can just call it once, with the right number of epochs, and the model will perform the multiple training passes & manage the internal alpha smoothly over the right number of epochs.
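Concretely, a minimal sketch of the single-call pattern, reusing the variable names from the question's code (the epochs value is illustrative):

model = Doc2Vec(vector_size=vec_size, min_count=1, dm=1, epochs=100)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)  # one call; alpha decays internally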
I'm training a Word2Vec model like:
model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)
and Doc2Vec model like:
doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)
with the same data and comparable parameters.
After this I'm using these models for my classification task. And I have found out that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried with many more doc2vec iterations (25, 80 and 150 – it makes no difference).
Any tips or ideas why and how to improve doc2vec results?
Update: This is how doc2vec_tagged_documents is created:
doc2vec_tagged_documents = list()
counter = 0
for document in documents:
doc2vec_tagged_documents.append(TaggedDocument(document, [counter]))
counter += 1
Some more facts about my data:
My training data contains 4000 documents
with 900 words on average.
My vocabulary size is about 1000 words.
The documents in my classification task are much shorter (12 words on average), but I also tried splitting the training data into lines and training the doc2vec model on those, with almost the same result.
My data is not about natural language, please keep this in mind.
Summing/averaging word2vec vectors is often quite good!
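(For reference, a minimal sketch of that averaging baseline, assuming the question's trained Word2Vec model and gensim 3.x vocabulary lookups:)

import numpy as np

def avg_doc_vector(tokens, w2v_model):
    # average the vectors of in-vocabulary tokens; zero vector if none are known
    vecs = [w2v_model.wv[w] for w in tokens if w in w2v_model.wv.vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v_model.wv.vector_size)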
It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)
If your main interest is the doc-vectors – and not the word-vectors that are in some Doc2Vec modes co-trained – definitely try the PV-DBOW mode (dm=0) as well. It'll train faster and is often a top-performer.
If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning – unless you have a larger collections of longer docs.
Other things that sometimes help improve Doc2Vec vectors for classification purposes:
re-inferring all document vectors, at the end of training, perhaps even using parameters different from infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025) – while quite slow, this means all docs get vectors from the same final model state, rather than what's left over from bulk training (see the sketch after this list)
where classification labels are known, adding them as trained doc-tags, using the capability of TaggedDocument tags to be a list of tags
rare words are essentially just noise to Word2Vec or Doc2Vec – so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is also, in competition with the doc-vector, trying to make those singleton word-vectors predictive of their single-document neighborhoods... when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW, and increasing min_count.)
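As a rough sketch of the first suggestion above (re-inference from the final model state), using the question's doc2vec_tagged_documents:

reinferred = {doc.tags[0]: doc2vec_model.infer_vector(doc.words, steps=50, alpha=0.025)
              for doc in doc2vec_tagged_documents}
# each tag now maps to a vector from the same final model state,
# usable as classifier features instead of doc2vec_model.docvecs[tag]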
Hope this helps.