Inconsistencies between bigrams found by TfidfVectorizer and Word2Vec model - python

I am building a topic model from scratch, one step of which uses scikit-learn's TfidfVectorizer to get unigrams and bigrams from my corpus of texts:
tfidf_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.9, ngram_range=(1, 2))
After topics are created, I use the similarity scores provided by gensim's Word2Vec to determine coherence of topics. I do this by training on the same corpus:
bigram_transformer = Phrases(corpus)
model = Word2Vec(bigram_transformer[corpus], min_count=1)
For many of the bigrams in my topics, however, I get a KeyError because those bigrams were not picked up during the training of Word2Vec, despite both being trained on the same corpus. I think this is because gensim's Phrases decides which bigrams to keep based on statistical analysis (see: Why aren't all bigrams created in gensim's `Phrases` tool?).
Is there a way to get Word2Vec to include all the bigrams identified by TfidfVectorizer? I see trimming capabilities such as trim_rule, but nothing in the other direction.

The point of the Phrases model in Gensim is to pick some bigrams, which are calculated to be statistically-significant.
If you then apply that model's determinations as a preprocessing step on your corpus, certain pairs of unigrams will be outright replaced in your text with the combined bigram. (As such, it's possible some unigrams that were there originally will no longer appear even once.)
Thus the concepts of bigrams as used by Gensim's Phrases and the TfidfVectorizer's ngram_range facility are different. Phrases is meant for destructive replacements where specific bigrams are inferred to be more interesting than the unigrams. TfidfVectorizer will add extra bigrams as additional dimensional features.
I suppose the right tuning of Phrases could cause it to consider every bigram as significant. Without checking, it looks like a super-tiny threshold value, like 0.0000000001, might have essentially that effect. (The Phrases class will reject a threshold of 0 as nonsensical given its usual use.)
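For illustration, a hedged sketch of that tuning, assuming a tokenized corpus (a list of token lists) and gensim's Phrases class with its min_count and threshold parameters; this is untested, per the answer's own caveat:
from gensim.models.phrases import Phrases

corpus = [
    ['the', 'skittish', 'cat', 'jumped', 'over', 'the', 'gap'],
    ['the', 'cat', 'jumped', 'over', 'the', 'mat'],
]  # placeholder corpus of tokenized sentences

# min_count=1 keeps even rare pairs in the survey; a near-zero threshold
# then accepts nearly any surveyed pair whose score is positive
greedy_phrases = Phrases(corpus, min_count=1, threshold=0.0000000001)

print(greedy_phrases[corpus[0]])  # accepted pairs now appear as 'w1_w2' tokens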
But at that point, your later transformation (via bigram_transformer[corpus]) will combine every possible pair of words before Word2Vec training. For example, the sentence:
['the', 'skittish', 'cat', 'jumped', 'over', 'the', 'gap',]
...would indiscriminately become...
['the_skittish', 'cat_jumped', 'over_the', 'gap',]
It seems unlikely that you want that, for a number of reasons:
There might then be no training texts with the 'cat' unigram alone, leaving you with no word-vector for that word at all.
Bigrams that are rare or of little grammatical value (like 'the_skittish') will receive trained word-vectors, & take up space in the model.
The kinds of text corpus that are large enough for good Word2Vec results might have far more bigrams than are manageable. (A corpus small enough that you can afford to track every bigram may be on the thin side for good Word2Vec results.)
Further, to perform that greedy-combination of all bigrams, the Phrases frequency-survey & calculations aren't even necessary. (It can be done automatically with no preparation/analysis.)
So, you shouldn't expect every bigram from TfidfVectorizer to get a word-vector, unless you take some extra steps, outside the normal behavior of Phrases, to ensure every such bigram was in the training texts.
To try to do so wouldn't necessarily need Phrases at all, and might be unmanageable, and involve other tradeoffs. (For example, I could imagine repeating the corpus many times, only combining a fraction of the bigrams each time – so that each is sometimes surrounded by other unigrams, and sometimes by other bigrams – to create a synthetic corpus with enough meaningful texts to create all your desired vectors. But the logic & storage space for that model would be larger & complicated, and without prominent precedent, so it'd be a novel experiment.)
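As an aside, a minimal sketch of that "no preparation/analysis needed" greedy pairing, using a hypothetical pair_up helper rather than anything from gensim:
def pair_up(tokens):
    # join consecutive tokens into bigram tokens, no statistics needed
    pairs = [f"{a}_{b}" for a, b in zip(tokens[0::2], tokens[1::2])]
    if len(tokens) % 2:          # keep a trailing unpaired token, if any
        pairs.append(tokens[-1])
    return pairs

print(pair_up(['the', 'skittish', 'cat', 'jumped', 'over', 'the', 'gap']))
# ['the_skittish', 'cat_jumped', 'over_the', 'gap']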

Related

Why does KNN algorithm perform better on Word2Vec than on TF-IDF vector representation?

I am doing a project on multi-class text classification and could do with some advice.
I have a dataset of reviews which are classified into 7 product categories.
Firstly, I create a term-document matrix using TF-IDF (TfidfVectorizer from sklearn). This generates an n x m matrix where n is the number of reviews in my dataset and m is the number of features.
Then, after splitting the term-document matrix 80:20 into train and test sets, I pass it through the K-Nearest Neighbours (KNN) algorithm and achieve an accuracy of 53%.
In another experiment, I used the Google News Word2Vec pretrained embedding (300 dimensional) and averaged all the word vectors for each review. So, each review consists of x words and each of the words has a 300 dimensional vector. Each of the vectors are averaged to produce one 300 dimensional vector per review.
Then I pass this matrix through KNN. I get an accuracy of 72%.
As for other classifiers that I tested on the same dataset, all of them performed better on the TF-IDF method of vectorization. However, KNN performed better on word2vec.
Can anyone help me understand why there is a jump in accuracy for KNN in using the word2vec method as compared to when using the tfidf method?
By using the external word-vectors, you've introduced extra info about the words into the word2vec-derived features – info that simply may not be deducible at all from the plain word-occurrence (TF-IDF) model.
For example, imagine that just a single review in your train set, and another single review in your test set, use some less-common word for car, like jalopy – but then zero other car-associated words.
A TFIDF model will have a weight for that unique term in a particular slot - but may have no other hints in the training dataset that jalopy is related to cars at all. In TFIDF space, that weight will just make those 2 reviews more-distant from all other reviews (which have a 0.0 in that dimension). It doesn't help or hurt much. A review 'nice jalopy' will be no closer to 'nice car' than it is to 'nice movie'.
On the other hand, if the GoogleNews set has a vector for that word, and that vector is fairly close to car, auto, wheels, etc., then reviews with all those words will be shifted a little in the same direction in the word2vec space, giving an extra hint to some classifiers, especially, perhaps, the KNN one. Now, 'nice jalopy' is quite a bit closer to 'nice car' than to 'nice movie' or most other 'nice [X]' reviews.
Using word-vectors from an outside source may not have great coverage of your dataset's domain words. (Words in GoogleNews, from a circa-2013 training run on news articles, might miss both words, and word-senses in your alternative & more-recent reviews.) And, summarizing a text by averaging all its words is a very crude method: it can learn nothing from word-ordering/grammar (that can often reverse intended sense), and aspects of words may all cancel-out/dilute each other in longer texts.
But still, it's bringing in more language info that otherwise wouldn't be in the data at all, so in some cases it may help.
If your dataset is sufficiently large, training your own word-vectors may help a bit, too. (Though, the gain you've seen so far suggests some useful patterns of word-similarities may not be well-taught from your limited dataset.)
Of course, also note that you can use blended techniques. Perhaps each text can be even better-represented by a concatenation of the N TF-IDF dimensions and the M word2vec-average dimensions. (If your texts have many significant 2-word phrases that mean things different than the individual words, adding in word 2-gram features may help. If your texts have many typos or rare word variants that still share word-roots with other words, then adding in character-n-grams – word fragments – may help.)
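As a rough sketch of that blended representation, assuming the GoogleNews vectors file and a few placeholder reviews; the average_vector helper is a simplification, not any standard API:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import KeyedVectors

reviews = ["nice jalopy", "nice car", "nice movie"]  # placeholder texts

# N TF-IDF dimensions per review
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(reviews).toarray()

# M word2vec-average dimensions per review (300 for the GoogleNews vectors)
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def average_vector(text, kv, dim=300):
    vecs = [kv[w] for w in text.split() if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_w2v = np.vstack([average_vector(r, kv) for r in reviews])

# each review is now represented by N + M concatenated dimensions
X_blended = np.hstack([X_tfidf, X_w2v])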

Can I match words or sentences to a pre-vectorized corpus of sentences in Python for NL processing?

I've been searching for an answer to this specific question for a few hours and while I've learned a lot, I still haven't figured it out.
I have a dataset of ~70,000 sentences, with a subset of about 4,000 sentences that have been appropriately categorized; the rest are uncategorized. Currently I'm using a scikit pipeline with CountVectorizer and TfidfTransformer to vectorize the data, however I'm only vectorizing based off the 4,000 sentences and then testing various models via cross-validation.
I'm wondering if there is a way to use Word2Vec or something similar to vectorize the entire corpus of data and then use these vectors with my subset of 4,000 sentences. My intention is to increase the accuracy of my model predictions by using word vectors that incorporate all of the semantic data in the corpus rather than just data from the 4,000 sentences.
The code I'm currently using is:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

svc = Pipeline([('vect', CountVectorizer(ngram_range=(3, 5))),
                ('tfidf', TfidfTransformer()),
                ('clf', LinearSVC()),
                ])
svc.fit(X_train, y_train)  # fit the pipeline on the training split
y_pred = svc.predict(X_test)
Where X_train and y_train are my features and labels, respectively. I also have a list z_all which includes all remaining uncategorized features.
Just getting pointed in the right direction (or told whether or not this is possible) would be super helpful.
Thank you!
I would say that the answer is yes: you can use Word2Vec or another similar word-embedding method to get vectors of each sentence in your data, and then use these vectors both as training and testing data in a linear Support Vector Machine (SVC).
And yes, you can first create those vectors for your entire corpus of ~70,000 sentences before actually doing any training on your data.
It is however not as straightforward as the approach you're currently using.
There are many different ways to do this so I'll just go through one of them to help you get the basics of how this can be done.
Before we start and see what possible steps you can follow, let's remember that the goal here is to get one vector for each and every sentence of your corpus.
If you don't know what word-embeddings are, I highly suggest you read about them, but in short this is just a way to link each word of a pre-defined vocabulary to a vector of a given dimension.
For instance, you would have:
# the vector associated with the word "cat" is the following vector of fixed-length
word_embeddings["cat"] = [0.0014, 0.6710, ..., 0.3281]
Now that you know this, here are the steps you could be following:
Tokenization - The first thing that you want to do is to tokenize each of your sentences. This can be done using an NLP library (SpaCy for instance) that will help you to:
split each sentence in a list of words
remove any punctuation from these words and convert them to lowercase
remove stopwords - optionally
lemmatize all the words - optionally
Train a word embedding model - Now that you have each sentence as a pre-processed list of words, you need to train a word-embedding model using your corpus. There are many different algorithms to do that. I would suggest using Gensim and Word2Vec or fastText. You can also use pre-trained word embeddings, like GloVe, or anything that best fits your corpus in terms of language/context (a rough sketch covering this step and the next two follows the list). Either way, this will allow you to:
have one vector of pre-defined size for each and every word in your corpus' vocabulary
get a list of equally-sized vectors for each sentence in your corpus
Adopting a weighting method - Once you have a list of vectors for each sentence in your corpus, and mainly because your sentences vary in length (some have 6 words, some others have 13 words, etc.), what you want to do is get a single vector for each and every sentence. To do this, you can simply weight the vectors corresponding to the words in each sentence. You can:
average all vectors
use weights like TF-IDF weights to give some words more importance than others
use other weighting methods...
Training and testing - Finally, all that's left to do is train a model using these vectors, for instance with a linear Support Vector Machine (SVC), and test the accuracy of your model on a test dataset (you can also use a validation dataset).
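A rough sketch of steps 2–4 under those assumptions, with plain averaging as the weighting method; tokenized_sentences, labelled_sentences, and labels are placeholders for your own data:
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC

# Step 2: train word vectors on the full ~70,000-sentence corpus
# tokenized_sentences: list of lowercase token lists (placeholder)
w2v = Word2Vec(tokenized_sentences, vector_size=100, min_count=2, epochs=10)

# Step 3: one vector per sentence, here by simple averaging
def sentence_vector(tokens, model, dim=100):
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Step 4: train a classifier on the ~4,000 labelled sentences only
X_labelled = np.vstack([sentence_vector(s, w2v) for s in labelled_sentences])
clf = LinearSVC()
clf.fit(X_labelled, labels)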
My opinion is: if you are going to use a word2vec embedding, use a pre-trained one or generate it from generic text.
Word2vec embeddings are usually used to give meaning and context to your text data; if you train an embedding using only your data, it might be biased and not represent the language. And that means its vectors don't carry any meaning.
After getting your embedding working, you also have to think about what to do with your words, because a sentence has 1 or more words (the embedding works at word level), and you want to feed your models 1 sentence -> 1 vector, not 1 sentence -> N vectors.
People usually average or multiply those vectors, so for example, for the sentence "Hello there" and an embedding of 5 dims:
Hello -> [0, 0, .2, .3, .8]
there -> [.1, .2, 0, 0, .5]
AVG Hello there -> [.05, .1, .1, .15, .65]
This is what you want to use for your models!
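A minimal numpy version of that averaging, using the toy 5-dim vectors above (the element-wise multiply variant is shown for comparison):
import numpy as np

hello = np.array([0, 0, .2, .3, .8])
there = np.array([.1, .2, 0, 0, .5])

sentence_avg = np.mean([hello, there], axis=0)  # [0.05, 0.1, 0.1, 0.15, 0.65]
sentence_mul = np.prod([hello, there], axis=0)  # the "multiply" alternative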
So instead of using TF-IDF to generate your sentence vectors, use word2vec like this and you shouldn't have any problem. I already worked on a text classification project and we ended up using a self-trained w2v embedding and an ExtraTrees model, with amazing results.

Use tf-idf with FastText vectors

I'm interested in using tf-idf with the FastText library, but haven't found a logical way to handle the n-grams. I have used tf-idf with SpaCy vectors already, for which I have found several examples like these ones:
http://dsgeek.com/2018/02/19/tfidf_vectors.html
https://www.aclweb.org/anthology/P16-1089
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
But for the FastText library it's not that clear to me, since it has a granularity that isn't that intuitive. E.g.:
For a general word2vec approach I will have one vector for each word; I can count the term frequency of that word and weight its vector accordingly.
But for fastText the same word will have several n-grams:
"Listen to the latest news summary" will have n-grams generated by a sliding window, like:
lis ist ste ten tot het...
These n-grams are handled internally by the model so when I try:
model["Listen to the latest news summary"]
I get the final vector directly, hence what I have thought is to split the text into n-grams before feeding the model, like:
model['lis']
model['ist']
model['ten']
And compute the tf-idf from there, but that seems like an inefficient approach. Is there a standard way to apply tf-idf to vector n-grams like these?
I would let FastText deal with trigrams, but keep building the tf-idf-weighted embeddings at the word level.
That is, you send
model["Listen"]
model["to"]
model["the"]
...
to FastText, and then use your old code to get the tf-idf weights.
In any case, it would be good to know whether FastText itself considers the word construct when processing a sentence, or whether it truly only treats it as a sequence of trigrams (blending consecutive words). If the latter is true, then for FastText you would lose information by breaking the sentence into separate words.
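A rough sketch of that word-level weighting, assuming a trained gensim FastText model named ft (a placeholder) and using sklearn's idf weights as a simple stand-in for full per-document tf-idf weighting:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Listen to the latest news summary"]  # placeholder corpus

tfidf = TfidfVectorizer()
tfidf.fit(texts)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def tfidf_weighted_vector(text, ft, idf):
    # ft.wv[word] returns a vector even for out-of-vocabulary words,
    # because gensim's FastText builds it from the word's char n-grams
    words = text.lower().split()
    vecs = [ft.wv[w] * idf.get(w, 1.0) for w in words]
    return np.mean(vecs, axis=0)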
You are talking about the fastText tokenization step (not fastText embeddings), which is a (3,6) char-n-gram tokenization, compatible with tf-idf. The full step can be computed outside of fastText quite easily; see: Calculate TF-IDF using sklearn for n-grams in python.
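A minimal sketch of computing that char-n-gram tf-idf with sklearn, using a (3, 6) range to mirror fastText's defaults:
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["Listen to the latest news summary"]  # placeholder corpus

# analyzer='char_wb' builds character n-grams only from text inside word
# boundaries, roughly matching fastText's within-word subword units
char_tfidf = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 6))
X = char_tfidf.fit_transform(texts)
print(char_tfidf.get_feature_names_out()[:10])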
From what I understood of your question, you are confusing word embedding methods (such as word2vec and many others) with Tf-Idf:
Basically, word embedding methods are unsupervised models for generating word vectors. The word vectors generated by this kind of model are now very popular in NLP tasks. This is because a word-embedding representation of a word captures more information about the word than a one-hot representation, since the former captures the semantic similarity of that word to other words, whereas the latter representation is equidistant from all other words. FastText is another way to implement word embeddings (recently open-sourced by Facebook researchers).
Tf-idf, instead, is a scoring scheme for words, that is, a measure of how important a word is to a document.
From a practical usage standpoint, while tf-idf is a simple scoring scheme and that is its key advantage, word embeddings may be a better choice for most tasks where tf-idf is used, particularly when the task can benefit from the semantic similarity captured by word embeddings (e.g. in information retrieval tasks).
Unlike Word2Vec, which learns a vector representation of the entire word, FastText learns a representation for each n-gram of the word, as you have already seen. So the overall word embedding is the sum of the n-gram representations. Because the FastText model works with more units (the number of n-grams is greater than the number of words), it performs better than Word2Vec and allows rare words to be represented appropriately.
From my standpoint, in general it does not make sense to use FastText (or any word embedding method) together with Tf-Idf. But if you want to use Tf-Idf with FastText, you must sum all the n-grams that compose your word and use this representation to calculate the Tf-Idf.

gensim doc2vec Model doesn't learn some words

I'm currently learning gensim's doc2vec model in Python 3.6 to see similarity between sentences.
I created a model but it returns KeyError: "word 'WORD' not in vocabulary" when I input a word which obviously exists in the training dataset, to find a similar word/sentence.
Does it automatically skip some words that aren't very important for defining sentences? Or is that simply a bug or something?
It would be very much appreciated if there's any way to cover all the words appearing in the dataset. Thanks.
If a word you expected to be learned in the model isn't in the model, the most likely causes are:
it wasn't really there, in the version the model saw, perhaps because your tokenization/preprocessing is broken. Enable logging at INFO level, and examine your corpus as presented to the model, to ensure it's tokenized as intended
it wasn't part of the surviving vocabulary after the 1st vocabulary-survey of the corpus. The default min_count=5 discards words with fewer than 5 occurrences, as such words both fail to get good vectors for themselves, and effectively serve as 'noise' interfering with the improvement of other vectors.
You can set min_count=1 to retain all words, but it's more likely to hurt than help your overall vector quality. Word2Vec & Doc2Vec require large, varied corpuses – if you want a good vector for a word, find more diverse examples of its usage in an expanded corpus.
(Also note: one of the simple & fast Doc2Vec modes, that's also often a top-performer, especially on shorter texts, is plain PV-DBOW mode: dm=0. This mode will allocate/randomly-initialize word-vectors, but then ignores them for training, only training the doc-vectors. If you use that mode, you can still request word-vectors from the model at the end – but they'll just be random nonsense.)
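A minimal sketch of those checks, assuming texts is a placeholder list of token lists; INFO logging reports the vocabulary survey, including how many words min_count discards:
import logging
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# texts: list of token lists, e.g. [['the', 'cat', 'sat'], ...] (placeholder)
tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(texts)]

# the default min_count=5 drops rare words; lower it only with care
model = Doc2Vec(tagged, vector_size=100, min_count=5, epochs=20)

print('cat' in model.wv)  # check whether a given word survived the vocabulary survey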

How to get list of context words in Gensim

How to get most frequent context words from pretrained fasttext model?
For example:
For word 'football' and corpus ["I like playing football with my friends"]
Get list of context words: ['playing', 'with','my','like']
I try to use
model_wiki = gensim.models.KeyedVectors.load_word2vec_format("wiki.ru.vec")
model_wiki.most_similar("блок")
But that isn't satisfactory for me.
The plain model doesn't retain any such co-occurrence statistics from the original corpus. It just has the trained results: vectors per word.
So, the ranked list of most_similar() vectors – which isn't exactly words that appeared-together, but strongly correlates to that – is the best you'll get from that file.
Only going back to the original training corpus would give you exactly what you've requested.
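If you do have the original corpus, a minimal plain-Python sketch of counting context words within a fixed window (the corpus, target word, and window size are placeholders):
from collections import Counter

corpus = [["i", "like", "playing", "football", "with", "my", "friends"]]  # placeholder
target, window = "football", 3

context_counts = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        if word == target:
            # words within `window` positions on either side of the target
            neighbours = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
            context_counts.update(neighbours)

print(context_counts.most_common())  # most frequent context words first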
