gensim doc2vec Model doesn't learn some words - python

I'm currently learning gensim doc2model in Python3.6 to see similarity between sentences.
I created a model but it returns KeyError: "word 'WORD' not in vocabulary" when I input a word which obviously exists in the training dataset, to find a similar word/sentence.
Does it automatically skip some words not very important to define sentences? or is that simply a bug or something?
Very appreciated if I could have any way out to cover all the appearing words in the dataset. thanks.

If a word you expected to be learned in the model isn't in the model, the most likely causes are:
it wasn't really there, in the version the model saw, perhaps because your tokenization/preprocessing is broken. Enable logging at INFO level, and examine your corpus as presented to the model, to ensure it's tokenized as intended
it wasn't part of the surviving vocabulary after the 1st vocabulary-survey of the corpus. The default min_count=5 discards words with fewer than 5 occurrences, as such words both fail to get good vectors for themselves, and effectively serve as 'noise' interfering with the improvement of other vectors.
You can set min_count=1 to retain all words, but it's more likely to hurt than help your overall vector quality. Word2Vec & Doc2Vec require large, varied corpuses – if you want a good vector for a word, find more diverse examples of its usage in an expanded corpus.
(Also note: one of the simple & fast Doc2Vec modes, that's also often a top-performer, especially on shorter texts, is plain PV-DBOW mode: dm=0. This mode will allocate/randomly-initialize word-vectors, but then ignores them for training, only training the doc-vectors. If you use that mode, you can still request word-vectors from the model at the end – but they'll just be random nonsense.)

Related

Inconsistencies between bigrams found by TfidfVectorizer and Word2Vec model

I am building a topic model from scratch, one step of which uses the TfidfVectorizer method to get unigrams and bigrams from my corpus of texts:
tfidf_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.9, ngram_range = (1,2))
After topics are created, I use the similarity scores provided by gensim's Word2Vec to determine coherence of topics. I do this by training on the same corpus:
bigram_transformer = Phrases(corpus)
model = Word2Vec(bigram_transformer[corpus], min_count=1)
For many of the bigrams in my topics however, I get a KeyError because that bigram was not picked up in the training of Word2Vec, despite them being trained on the same corpus. I think this is because Word2Vec decides on which bigrams to choose based on statistical analysis (Why aren't all bigrams created in gensim's `Phrases` tool?)
Is there a way to get the Word2Vec to include all those bigrams identified by TfidfVectorizer? I see trimming capabilities such as 'trim_rule' but not anything in the other direction.
The point of the Phrases model in Gensim is to pick some bigrams, which are calculated to be statistically-significant.
If you then apply that model's determinations as a preprocessing step on your corpus, certain pairs of unigrams will be outright replaced in your text with the combined bigram. (As such, it's possible some unigrams that were there originally will no longer appear even once.)
Thus the concepts of bigrams as used by Gensim's Phrases and the TfidfVectorizer's ngram_range facility are different. Phrases is meant for destructive replacements where specific bigrams are inferred to be more interesting than the unigrams. TfidfVectorizer will add extra bigrams as additional dimensional features.
I suppose the right tuning of Phrases could cause it to consider every bigram as significant. Without checking, it looks like a super-tiny value, like 0.0000000001, might have essentially that effect. (The Phrases class will reject a value of 0 as nonsensical given its usual use.)
But at that point, your later transformation (via bigram_transformer[corpus]) will combine every possible pair of words before Word2Vec training. For example, the sentence:
['the', 'skittish', 'cat', 'jumped', 'over', 'the', 'gap',]
...would indiscriminately become...
['the_skittish', 'cat_jumped', 'over_the', 'gap',]
It seems unlikely that you want that, for a number of reasons:
There might then be no training texts with the 'cat' unigram alone, leaving you with no word-vector for that word at all.
Bigrams that are rare or of little grammatical value (like 'the_skittish') will receive trained word-vectors, & take up space in the model.
The kinds of text corpus that are large enough for good Word2Vec results might have far more bigrams than are manageable. (A corpus small enought that you can afford to track every bigram may be on the thin side for good Word2Vec results.)
Further, to perform that greedy-combination of all bigrams, the Phrases frequency-survey & calculations aren't even necessary. (It can be done automatically with no preparation/analysis.)
So, you shouldn't expect every bigram of TfidfVectorizer to be get a word-vector, unless you take some extra steps, outside the normal behavior of Phrases, to ensure every such bigram was in the training texts.
To try to do so wouldn't necessarily need Phrases at all, and might be unmanageable, and involve other tradeoffs. (For example, I could imagine repeating the corpus many times, only combining a fraction of the bigrams each time – so that each is sometimes surrounded by other unigrams, and sometimes by other bigrams – to create a synthetic corpus with enough meaningful texts to create all your desired vectors. But the logic & storage space for that model would be larger & complicated, and without prominent precedent, so it'd be a novel experiment.)

When should I consider to use pretrain-model word2vec model weights?

Suppose my corpus is reasonably large - having tens-of-thousands of unique words. I can either use it to build a word2vec model directly(Approach #1 in the code below) or initialize a new word2vec model with pre-trained model weights and fine tune it with my own corpus(Approach #2). Is the approach #2 worth consideration? If so, is there a rule of thumb on when I should consider a pre-trained model?
# Approach #1
from gensim.models import Word2Vec
model = Word2Vec(my_corpus, vector_size=300, min_count=1)
# Approach #2
model = Word2Vec(vector_size=300, min_count=1)
model.build_vocab(my_corpus)
model.intersect_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True, lockf=1.0)
model.train(my_corpus, total_examples=len(my_corpus))
The general answer to this type of question is: you should try them both, and see which works better for your purposes.
No one without your exact data & project goals can be sure which will work better in your situation, and you'll need to exact same kind of ability-to-evaluate alterante choices to do all sorts of very basic, necessary tuning of your work.
Separately:
"fine-tuning" word2vec-vectors can mean many things, and can introduce a number of expert-leve thorny tradeoff-decisions - the sorts of tradeoffs that can only be navigated if you've got a robust way to test different choices against each other.
The specific simple tuning approach your code shows - which relies on an experimental method (intersect_word2vec_format()) that might not work in the latest Gensim – is pretty limited, and since it discards all the words in the outside vectors that aren't already in your own corpus, also discards one of the major reasons people often want to mix older vectors in - to cover more words not in their training data. (I doubt that approach will be useful in many cases, but as per above, to be sure you'd want to try it with respect to your data/goals.
It's almost always a bad idea to use min_count=1 with word2vec & similar algorithms. If such rare words are truly important, find more training examples so good vectors can be trained for them. But without enough training examples, they're usually better to ignore - keeping them even makes the vectors for surrounding words worse.

Gensim pretrained model similarity

Problem :
Im using glove pre-trained model with vectors to retrain my model with a specific domain say #cars, after training I want to find similar words within my domain but I got words not in my domain corpus, I believe it's from glove's vectors.
model_2.most_similar(positive=['spacious'], topn=10)
[('bedrooms', 0.6275501251220703),
('roomy', 0.6149100065231323),
('luxurious', 0.6105825901031494),
('rooms', 0.5935696363449097),
('furnished', 0.5897485613822937),
('cramped', 0.5892841219902039),
('courtyard', 0.5721820592880249),
('bathrooms', 0.5618442893028259),
('opulent', 0.5592212677001953),
('expansive', 0.555268406867981)]
Here I expect something like leg-room, car's spacious features mentioned in the domain's corpus. How can we exclude the glove vectors while having similar vectors?
Thanks
There may not be enough info in a simple set of generic word-vectors to filter neighbors by domain-of-use.
You could try using a mixed-weighting: combine the similarities to 'spacious', and to 'cars', and return the top results in that combination – and it might help a little.
Supplying more than one positive word to the most_similar() method might approximate this. If you're sure of some major sources of interference/overlap, you might even be able to use negative word examples, similar to how word2vec finds candidate answers for analogies (though this might also suppress useful results that are legitimately related to both domains, like 'roomy'). For example:
candidates = vec_model.most_similar(positive=['spacious', 'car'],
negative=['house'])
(Instead of using single words like 'car' or 'house' you could also try using vectors combined from many words that define a domain.)
But a sharp distinction sounds like a research project, rather than something easily possible with off-the-shelf libraries/vectors – and may requires more sophisticated approaches and datasets.
You could also try using a set of vectors trained only on a dataset of text from the domain of interest – thus ensuring the vocabulary, and senses, of words are all in that domain.
You cannot exclude the words from already trained model. I don't know in which framework you're working on but I'll give you the examples in Keras as it's simple to understand the intentions.
What you could do is use Embedding layer, populate it with GloVe "knowledge" and then resume training with your corpus so that layer will learn the words and fit them for your specific domain. You can read more about it in Keras blog

Text Preprocessing for classification - Machine Learning

what are important steps for preprocess our Twitter texts to classify between binary classes. what I did is that I removed hashtag and keep it without hashtag, I also used some regular expression to remove special char, these are two function I used.
def removeusername(tweet):
return " ".join(word.strip() for word in re.split('#|_', tweet))
def removingSpecialchar(text):
return ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",text).split())
what are other things to preprocess textdata. I have also used nltk stopword corpus to remove all stop words form the tokenize words.
I used NaiveBayes classifer in textblob to train data and I am getting 94% accuracy on training data and 82% on testing data. I want to know is there any other method to get good accuracies. By the way I am new in this Machine Learning field, I have a limited idea about all of it!
Well then you can start by play with the size of your vocabulary. You might exclude some of the words that are too frequent in your data (without being considered stop words). And also do the same with words that appear in only one tweet (misspelled words for example). Sklearn CountVectorizer allow to do this in an easy way have a look min_df and max_df parameters.
Since you are working with tweets you can also think in URL strings. Try to obtain some valuable information from links, there are lots of different options from simple stuff based on regular expressions that retrieve the domain name of the page to more complex NLP based methods that study the link content. Once more it's up to you!
I would also have a look at pronouns (if you are using sklearn) since by default replaces all of them to the keyword -PRON- . This is a classic solution that simplifies things but might end in a loss of information.
For preprocessing raw data, you can try:
Stop word removal.
Stemming or Lemmatization.
Exclude terms that are either too common or too rare.
Then a second step preprocessing is possible:
Construct a TFIDF matrix.
Construct or load pretrained wordEmbedding (Word2Vec, Fasttext, ...).
Then you can load result of the second steps into your model.
These are just the most common "method", many others exists.
I will let you check each one of these methods by yourself, but it is a good base.
There are no compulsory steps. For example, it is very common to remove stop words (also called functional words) such as "yes" , "no" , "with". But - in one of my pipelines, I skipped this step and the accuracy did not change. NLP is an experimental field , so the most important advice is to build a pipeline that run as quickly as possible, to define your goal, and to train with different parameters.
Before you move on, you need to make sure you training set is proper. What are you training for ? is your set clean (e.g the positive has only positives)? how do you define accuracy and why?
Now, the situation you described seems like a case of over-fitting. Why? because you get 94% accuracy on the training set, but only 82% on the test set.
This problem happens when you have a lot of features but relatively small training dataset - so the model is fitted best for the specific train set but fails to generalize.
Now, you did not specify the how large is your dataset, so I'm guessing between 50 and 500 tweets, which is too small given the English vocabulary of some 200k words or more. I would try one of the following options:
(1) Get more training data (at least 2000)
(2) Reduce the number of features, for example you can remove uncommon words, names - anything words that appears only small number of times
(3) Using a better classifier (Bayes is rather weak for NLP). Try SVM, or Deep Learning.
(4) Try regularization techniques

Looking to cluster short descriptions of reports. Should I use Word2Vec or Doc2Vec

So, I have close to 2000 reports and each report has an associated short description of the problem. My goal is to cluster all of these so that we can find distinct trends within these reports.
One of the features I'd like to use some sort of contextual text vector. Now, I've used Word2Vec and think this would be a good option but I also so Doc2Vec and I'm not quite sure what would be a better option for this use case.
Any feedback would be greatly appreciated.
They're very similar, so just as with a single approach, you'd try tuning parameters to improve results in some rigorous manner, you should try them both, and compare the results.
Your dataset sounds tiny compared to what either needs to induce good vectors – Word2Vec is best trained on corpuses of many millions to billions of words, while Doc2Vec's published results rely on tens-of-thousands to millions of documents.
If composing some summary-vector-of-the-document from word-vectors, you could potentially leverage word-vectors that are reused from elsewhere, but that will work best if the vectors' original training corpus is similar in vocabulary/domain-language-usage to your corpus. For example, don't expect words trained on formal news writing to work well with, or even cover the same vocabulary as, informal tweets, or vice-versa.
If you had a larger similar-text corpus of documents to train a Doc2Vec model, you could potentially train a good model on the full set of documents, but then just use your small subset, or re-infer vectors for your small subset, and get better results than a model that was only trained on your subset.
Strictly for clustering, and with your current small corpus of short texts, if you have good word-vectors from elsewhere, it may be worth looking at the "Word Mover's Distance" method of calculating pairwise document-to-document similarity. It can be expensive to calculate on larger docs and large document-sets, but might support clustering well.

Categories