When should I consider using pretrained word2vec model weights?

Suppose my corpus is reasonably large, with tens of thousands of unique words. I can either use it to build a word2vec model directly (Approach #1 in the code below), or initialize a new word2vec model with pre-trained model weights and fine-tune it with my own corpus (Approach #2). Is Approach #2 worth considering? If so, is there a rule of thumb for when I should consider a pre-trained model?
# Approach #1
from gensim.models import Word2Vec
model = Word2Vec(my_corpus, vector_size=300, min_count=1)
# Approach #2
model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(my_corpus)
model.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)
model.train(my_corpus, total_examples=model.corpus_count, epochs=model.epochs)

The general answer to this type of question is: you should try them both, and see which works better for your purposes.
No one without your exact data & project goals can be sure which will work better in your situation, and you'll need the exact same kind of ability to evaluate alternate choices to do all sorts of very basic, necessary tuning of your work.
Separately:
"fine-tuning" word2vec-vectors can mean many things, and can introduce a number of expert-level thorny tradeoff-decisions - the sorts of tradeoffs that can only be navigated if you've got a robust way to test different choices against each other.
The specific simple tuning approach your code shows, which relies on an experimental method (intersect_word2vec_format()) that might not work in the latest Gensim, is pretty limited. Since it discards all the words in the outside vectors that aren't already in your own corpus, it also discards one of the major reasons people often want to mix older vectors in: to cover more words not in their training data. (I doubt that approach will be useful in many cases, but as per above, to be sure you'd want to try it with respect to your data/goals.)
It's almost always a bad idea to use min_count=1 with word2vec & similar algorithms. If such rare words are truly important, find more training examples so good vectors can be trained for them. But without enough training examples, they're usually better to ignore - keeping them even makes the vectors for surrounding words worse.
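To make the vocabulary point above concrete, here is a minimal pure-Python sketch of what an intersect-style merge keeps and drops. It is not Gensim code; the word sets and variable names are made up purely for illustration.

```python
# Toy illustration: an intersect-style merge only touches words that are
# already in your own vocabulary. All words below are made-up examples.
own_vocab = {"cat", "dog", "fintech", "blockchain"}
pretrained_vocab = {"cat", "dog", "horse", "zebra"}

# Words that would start from the pretrained vectors.
seeded = own_vocab & pretrained_vocab
# Own-corpus words that keep their fresh random initialization.
unseeded = own_vocab - pretrained_vocab
# Pretrained words that are dropped, so the model gains no extra coverage.
discarded = pretrained_vocab - own_vocab

print("seeded:", sorted(seeded))
print("unseeded:", sorted(unseeded))
print("discarded:", sorted(discarded))
```

Running the same set arithmetic on your real vocabulary and the pretrained vocabulary gives a quick sense of how much the merge could matter at all.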

Related

Using weight from a Gensim Word2Vec model as a starting point of another model

I have two corpora that are from the same field, but with a temporal shift, say one decade. I want to train Word2vec models on them, and then investigate the different factors affecting the semantic shift.
I wonder how I should initialize the second model with the first model's embeddings, to avoid as much as possible the effect of variance in co-occurrence estimates.

At a naive & easy level, you can just load one existing model, and .train() on new data. But note if doing that:
Any words not already known by the model will be ignored, and the word-frequencies that feed algorithmic steps will only be from the initial survey.
While all words in the current corpus will get as many training-updates as their appearances (& your epochs setting) dictate, and thus be nudged arbitrarily-far from their original-model locations, other words from the seed model will stay exactly where they were. But, it's only the interleaved tug-of-war between words in the same training session that makes them usefully comparable. So doing this sequential training – updating only some words in a new training session – is likely to degrade the meaningfulness of word-to-word comparisons, in hard-to-measure ways.
Another approach that might be worth trying could be to train a single model over the combined corpus - but transform/repeat the era-specific texts/words in certain ways to be able to distinguish earlier-usages from later-usages. There are more details about this suggestion in the context of word-vectors varying over usage-eras in a couple of previous answers:
https://stackoverflow.com/a/57400356/130288
https://stackoverflow.com/a/59095246/130288
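As one possible reading of the "transform the era-specific words" idea, here is a rough sketch that replaces some occurrences of chosen target words with era-suffixed tokens before the two corpora are combined. The function name tag_eras, the keep_prob parameter, the suffix format, and the toy sentences are all assumptions for illustration; the linked answers may describe different variants.

```python
import random

def tag_eras(sentences, era, target_words, keep_prob=0.5, seed=0):
    """Return copies of tokenized sentences in which some occurrences of
    the target words are replaced by an era-suffixed token ('cell_1990s').

    Each occurrence stays untagged with probability `keep_prob`, so the
    plain word still appears in both eras' texts.
    """
    rng = random.Random(seed)
    tagged = []
    for sentence in sentences:
        new_sentence = []
        for token in sentence:
            if token in target_words and rng.random() >= keep_prob:
                new_sentence.append(f"{token}_{era}")
            else:
                new_sentence.append(token)
        tagged.append(new_sentence)
    return tagged

# Made-up toy corpora from two eras.
old_texts = [["the", "cell", "was", "locked"], ["a", "prison", "cell"]]
new_texts = [["my", "cell", "has", "no", "signal"], ["call", "my", "cell"]]
targets = {"cell"}

combined = tag_eras(old_texts, "1990s", targets) + tag_eras(new_texts, "2000s", targets)
for sentence in combined:
    print(sentence)
```

A single model trained on `combined` would then hold era-specific tokens in the same coordinate space, which is what makes them comparable. Whether that helps for your data is something to evaluate rather than assume.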

'Doc2Vec' object has no attribute 'get_latest_training_loss'

I am pretty new to doc2vec, so I did a little research and found a couple of things. Here is my story: I am trying to train doc2vec on 2.4 million documents. At first, I tried it with a small model of only 12 documents. I checked the results by inferring a vector for the first document and found it was indeed similar to the first document, with a cosine similarity of 0.97-0.99. That seemed good, although when I entered a new document made of completely different words I still received a high similarity score of 0.8. However, I put that aside and went on to build the full model with the 2.4 million documents. At this point, my problems began. The results were complete nonsense: the most_similar function returned results with a similarity of 0.4-0.5 that were completely different from the new document I checked. I tried to tune parameters, with no success so far. I also tried to remove randomness from both the small and the big model, but I still got different vectors. Then I tried to use get_latest_training_loss on each epoch in order to see how the loss changes over the epochs. This is my code:
model = Doc2Vec(vector_size=300, alpha=0.025, min_alpha=0.025, pretrained_emb=".../glove.840B.300D/glove.840B.300d.txt", seed=1, workers=1, compute_loss=True)
model.build_vocab(documents)
for epoch in range(10):
    for i in range(model_glove.epochs):
        model.train(documents, total_examples = token_count, epochs=1)
        training_loss = model.get_latest_training_loss()
        print("Training Loss: " + str(training_loss))
    model.alpha -= 0.002 # decrease the learning rate
    model.min_alpha = model.alpha # fix the learning rate, no decay
I know this code is a bit awkward, but it is used here only to follow the loss.
The error I receive is:
AttributeError: 'Doc2Vec' object has no attribute 'get_latest_training_loss'
I tried looking at model. with auto-complete and found that indeed there is no such function. I found something with a similar name, training_loss, but it gives me the same error.
Can anyone here give me an idea?
Thanks in advance.

Especially as a beginner, there's no pressing need to monitor training-loss. For a long time, gensim didn't report it in any way for any models – and it was still possible to evaluate & tune models.
Even now, running-loss-reporting in gensim is kind of a rough, incomplete, advanced/experimental feature - and after a recent refactoring it doesn't seem to have full support in Doc2Vec. (Notably, while having the loss level reach a plateau can be a helpful indicator that further training can't help, it is most definitely not the case that a model with arbitrarily-lower-loss is better than others. In particular, a model that achieves near-zero loss would likely be extremely overfit, and probably of little use for downstream applications.)
Regarding your general aim, of getting good vectors, with regard to the process you've described/shown:
Tiny tests (as with your 12 documents) don't really work with these algorithms, except to check that you're calling the steps with legal parameters. You shouldn't expect the similarities in such toy-sized tests to mean anything, even if they superficially meet expectations in some cases. The algorithms need lots of training data & large vocabularies to train sensible models. (So, your full 2.4 million docs should work well.)
You generally shouldn't be changing the default alpha/min_alpha values, or calling train() multiple times in a loop. You can just leave those at their defaults, and call train() once with your desired number of training epochs - and it will do the right thing. The approach in your shown code is a suboptimal and fragile anti-pattern - whichever online source you learned it from is misguided and severely outdated.
You haven't shown your inference code, but note that it will re-use the epochs, alpha, and min_alpha cached in the model instance from original initialization, unless you supply other values. And, the default epochs if not-specified is a value inherited from shared code with Word2Vec of just 5. Doing a mere 5 epochs, and leaving the effective alpha at 0.025 the whole time (as alpha=0.025, min_alpha=0.025 does to inference), is unlikely to give good results, especially on short docs. Common epochs values from published work are 10-20 - and doing at least as many for inference as were used for training is typical.
You are showing the use of a pretrained_emb initialization parameter that is not part of the standard gensim library, so perhaps you're using some other fork, based on some older version of gensim. Note that it's not typical to initialize a Doc2Vec model with word-embeddings from elsewhere before training, so if doing that, you're already in advanced/experimental territory - which is premature if you're still trying to get some basic doc-vectors into reasonable shape. (And, usually people seek tricks like re-used word-vectors if they have a small corpus. With 2.4 million docs, you probably don't have such corpus problems - any word-vectors can be learned from your corpus along with doc-vectors, in the default way.)
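On the alpha/min_alpha point above: a single train() call decays the learning rate roughly linearly from alpha to min_alpha over the whole run, which is why managing it by hand is unnecessary. Here is a minimal plain-Python sketch of such a schedule. The function name linear_alpha is hypothetical, 0.025 and 0.0001 are Gensim's default alpha and min_alpha, and the real library adjusts the rate in finer steps than whole epochs.

```python
def linear_alpha(progress, start_alpha=0.025, end_alpha=0.0001):
    """Learning rate at a given training progress in [0, 1],
    decaying linearly from start_alpha to end_alpha."""
    return start_alpha - (start_alpha - end_alpha) * progress

epochs = 10
for epoch in range(epochs + 1):
    # With default-style settings the rate shrinks steadily toward end_alpha.
    decayed = linear_alpha(epoch / epochs)
    # With alpha == min_alpha (as in the question) the rate never decays.
    fixed = linear_alpha(epoch / epochs, start_alpha=0.025, end_alpha=0.025)
    print(epoch, round(decayed, 5), fixed)
```

The second column shows the intended shrinking rate. The third shows what alpha=0.025, min_alpha=0.025 implies when those values are reused later, for example during inference.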

gensim doc2vec Model doesn't learn some words

I'm currently learning gensim's Doc2Vec model in Python 3.6 to measure similarity between sentences.
I created a model, but it returns KeyError: "word 'WORD' not in vocabulary" when I input a word that obviously exists in the training dataset, to find a similar word/sentence.
Does it automatically skip some words that are not very important for defining sentences? Or is that simply a bug or something?
I would very much appreciate any way to cover all the words appearing in the dataset. Thanks.

If a word you expected to be learned in the model isn't in the model, the most likely causes are:
It wasn't really there, in the version the model saw, perhaps because your tokenization/preprocessing is broken. Enable logging at INFO level, and examine your corpus as presented to the model, to ensure it's tokenized as intended.
It wasn't part of the surviving vocabulary after the first vocabulary-survey of the corpus. The default min_count=5 discards words with fewer than 5 occurrences, as such words both fail to get good vectors for themselves, and effectively serve as 'noise' interfering with the improvement of other vectors.
You can set min_count=1 to retain all words, but it's more likely to hurt than help your overall vector quality. Word2Vec & Doc2Vec require large, varied corpuses – if you want a good vector for a word, find more diverse examples of its usage in an expanded corpus.
(Also note: one of the simple & fast Doc2Vec modes, that's also often a top-performer, especially on shorter texts, is plain PV-DBOW mode: dm=0. This mode will allocate/randomly-initialize word-vectors, but then ignores them for training, only training the doc-vectors. If you use that mode, you can still request word-vectors from the model at the end – but they'll just be random nonsense.)
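The two likely causes above can be told apart with a simple frequency count over the corpus exactly as it is handed to the model. This is a minimal sketch using only the standard library; the function name diagnose_missing_word and the toy documents are made up for illustration.

```python
from collections import Counter

def diagnose_missing_word(tokenized_docs, word, min_count=5):
    """Report whether `word` is absent from the tokenized corpus or
    merely rarer than the `min_count` threshold."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    seen = counts[word]
    if seen == 0:
        return "never seen: check tokenization/preprocessing"
    if seen < min_count:
        return f"seen {seen} time(s), below min_count={min_count}"
    return f"seen {seen} time(s), should survive min_count={min_count}"

# Toy corpus: note the untrimmed punctuation and casing in "Hello,".
docs = [["Hello,", "world"], ["hello", "again"]] + [["again"]] * 4

for word in ("Hello", "hello", "again"):
    print(word, "->", diagnose_missing_word(docs, word))
```

A word that "obviously exists" in the raw text can still be missing here because the tokens the model saw differ in case or attached punctuation.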

Text Preprocessing for classification - Machine Learning

What are the important steps for preprocessing Twitter texts for binary classification? So far I removed the hashtag symbol while keeping the tag text, and I used a regular expression to remove special characters. These are the two functions I used:
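import re
def removeusername(tweet):
    # Split on '#' and '_' and rejoin the stripped pieces with spaces.
    return " ".join(word.strip() for word in re.split('#|_', tweet))
def removingSpecialchar(text):
    # Replace hashtags, non-alphanumeric characters, and URLs with spaces, then collapse whitespace.
    return ' '.join(re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())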
What other preprocessing can be applied to text data? I have also used the NLTK stopword corpus to remove all stop words from the tokenized words.
I used the NaiveBayes classifier in TextBlob to train on the data, and I am getting 94% accuracy on the training data and 82% on the test data. Is there any other method to get better accuracy? By the way, I am new to machine learning and have only a limited idea about all of it!

Well then you can start by playing with the size of your vocabulary. You might exclude some of the words that are too frequent in your data (without being considered stop words), and do the same with words that appear in only one tweet (misspelled words, for example). Sklearn's CountVectorizer lets you do this easily; have a look at its min_df and max_df parameters.
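To show what that kind of document-frequency pruning does, here is a small standard-library sketch. It is a stand-in for illustration and not sklearn's implementation. In this sketch min_df is an absolute document count and max_df is a fraction of documents, and the function name and toy tweets are made up.

```python
from collections import Counter

def prune_by_doc_freq(tokenized_docs, min_df=2, max_df=0.8):
    """Keep tokens that occur in at least `min_df` documents and in at
    most a `max_df` fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {
        token
        for token, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    }

tweets = [
    ["rt", "great", "game"],
    ["rt", "great", "goal"],
    ["rt", "bad", "game"],
    ["rt", "gooal"],
]
print(sorted(prune_by_doc_freq(tweets)))
```

Here the ubiquitous "rt" is dropped for being too frequent, and one-off tokens such as the misspelled "gooal" are dropped for being too rare.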
Since you are working with tweets, you can also think about URL strings. Try to obtain some valuable information from links; there are lots of different options, from simple approaches based on regular expressions that retrieve the domain name of the page to more complex NLP-based methods that study the link content. Once more, it's up to you!
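As a sketch of the simple end of that range, the following pulls domain names out of a tweet with a regular expression plus urllib.parse from the standard library. The pattern is deliberately naive (it only looks for http/https links and does not strip trailing punctuation), and the function name and example URLs are illustrative assumptions.

```python
import re
from urllib.parse import urlparse

# Naive pattern: anything starting with http:// or https:// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def extract_domains(tweet):
    """Return the lower-cased host part of every http(s) link in the tweet."""
    return [urlparse(url).netloc.lower() for url in URL_PATTERN.findall(tweet)]

tweet = "Read this https://www.example.com/a?b=1 and http://News.Example.org/x"
print(extract_domains(tweet))
```

The resulting domains could then be added as extra tokens or features before the link text itself is stripped out.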
I would also have a look at pronouns (if you are lemmatizing with spaCy), since older spaCy versions by default replace all of them with the keyword -PRON-. This is a classic solution that simplifies things but might end in a loss of information.

For preprocessing raw data, you can try:
Stop word removal.
Stemming or Lemmatization.
Exclude terms that are either too common or too rare.
Then a second preprocessing step is possible:
Construct a TF-IDF matrix.
Construct or load pretrained word embeddings (Word2Vec, fastText, ...).
Then you can feed the result of the second step into your model.
These are just the most common methods; many others exist.
I will let you check each one of these methods by yourself, but it is a good base.
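For the TF-IDF step listed above, a minimal standard-library sketch of one common variant (term frequency times log of inverse document frequency) may help. Libraries such as sklearn use smoothed and normalized variants by default, so their numbers will differ; the function name and toy documents are made up.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {token: weight} dict per document, using
    tf = count / document length and idf = log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        rows.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return rows

docs = [["good", "game"], ["bad", "game"]]
rows = tfidf(docs)
for row in rows:
    print(row)
```

A token that appears in every document ("game" here) gets weight zero, which is the same intuition as excluding terms that are too common.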

There are no compulsory steps. For example, it is very common to remove stop words (also called function words) such as "yes", "no", "with". But in one of my pipelines, I skipped this step and the accuracy did not change. NLP is an experimental field, so the most important advice is to build a pipeline that runs as quickly as possible, to define your goal, and to train with different parameters.
Before you move on, you need to make sure your training set is proper. What are you training for? Is your set clean (e.g. the positive class has only positives)? How do you define accuracy, and why?
Now, the situation you described seems like a case of over-fitting. Why? Because you get 94% accuracy on the training set, but only 82% on the test set.
This problem happens when you have a lot of features but a relatively small training dataset, so the model is fitted best to the specific training set but fails to generalize.
Now, you did not specify how large your dataset is, so I'm guessing between 50 and 500 tweets, which is too small given an English vocabulary of some 200k words or more. I would try one of the following options:
(1) Get more training data (at least 2000)
(2) Reduce the number of features, for example by removing uncommon words and names - any words that appear only a small number of times
(3) Use a better classifier (Bayes is rather weak for NLP). Try SVM, or Deep Learning.
(4) Try regularization techniques

Looking to cluster short descriptions of reports. Should I use Word2Vec or Doc2Vec

So, I have close to 2000 reports and each report has an associated short description of the problem. My goal is to cluster all of these so that we can find distinct trends within these reports.
One of the features I'd like to use is some sort of contextual text vector. Now, I've used Word2Vec and think this would be a good option, but I also saw Doc2Vec and I'm not quite sure which would be the better option for this use case.
Any feedback would be greatly appreciated.

They're very similar, so just as you'd try tuning the parameters of a single approach to improve results in some rigorous manner, you should try them both and compare the results.
Your dataset sounds tiny compared to what either needs to induce good vectors – Word2Vec is best trained on corpuses of many millions to billions of words, while Doc2Vec's published results rely on tens-of-thousands to millions of documents.
If composing some summary-vector-of-the-document from word-vectors, you could potentially leverage word-vectors that are reused from elsewhere, but that will work best if the vectors' original training corpus is similar in vocabulary/domain-language-usage to your corpus. For example, don't expect words trained on formal news writing to work well with, or even cover the same vocabulary as, informal tweets, or vice-versa.
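One simple way to compose such a summary vector is to average the word-vectors of a document's known words and compare documents by cosine similarity. The sketch below uses only the standard library; the 3-dimensional toy_vectors are invented numbers standing in for real pretrained vectors, and the function names are illustrative.

```python
import math

# Invented toy vectors; real pretrained vectors have hundreds of dimensions.
toy_vectors = {
    "printer": [0.9, 0.1, 0.0],
    "jammed": [0.8, 0.2, 0.1],
    "paper": [0.7, 0.3, 0.0],
    "login": [0.0, 0.9, 0.4],
    "failed": [0.1, 0.8, 0.5],
}

def average_vector(tokens, vectors):
    """Average the vectors of known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_a = average_vector(["printer", "jammed"], toy_vectors)
doc_b = average_vector(["paper", "jammed", "again"], toy_vectors)  # "again" is unknown
doc_c = average_vector(["login", "failed"], toy_vectors)

print("a vs b:", round(cosine(doc_a, doc_b), 3))
print("a vs c:", round(cosine(doc_a, doc_c), 3))
```

Words missing from the borrowed vectors are silently skipped here, which is exactly where a vocabulary/domain mismatch between the original training corpus and your reports would show up.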
If you had a larger similar-text corpus of documents to train a Doc2Vec model, you could potentially train a good model on the full set of documents, but then just use your small subset, or re-infer vectors for your small subset, and get better results than a model that was only trained on your subset.
Strictly for clustering, and with your current small corpus of short texts, if you have good word-vectors from elsewhere, it may be worth looking at the "Word Mover's Distance" method of calculating pairwise document-to-document similarity. It can be expensive to calculate on larger docs and large document-sets, but might support clustering well.
