I am unsure how I should use the most_similar method of gensim's Word2Vec. Let's say you want to test the tried-and-true example: man is to king as woman is to X; find X. I thought that is what you could do with this method, but from the results I am getting I don't think that is true.
The documentation reads:
Find the top-N most similar words. Positive words contribute
positively towards the similarity, negative words negatively.
This method computes cosine similarity between a simple mean of the
projection weight vectors of the given words and the vectors for each
word in the model. The method corresponds to the word-analogy and
distance scripts in the original word2vec implementation.
I assume, then, that most_similar takes the positive examples and negative examples, and tries to find points in the vector space that are as close as possible to the positive vectors and as far away as possible from the negative ones. Is that correct?
Additionally, is there a method that allows us to map the relation between two points to another point and get the result (cf. the man-king woman-X example)?
You can view exactly what most_similar() does in its source code:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L485
It's not quite "find points in the vector space that are as close as possible to the positive vectors and as far away as possible from the negative ones". Rather, as described in the original word2vec papers, it performs vector arithmetic: adding the positive vectors, subtracting the negative, then from that resulting position, listing the known-vectors closest to that angle.
That is sufficient to solve man : king :: woman : ?-style analogies, via a call like:
sims = wordvecs.most_similar(positive=['king', 'woman'], negative=['man'])
(You can think of this as, "start at 'king'-vector, add 'woman'-vector, subtract 'man'-vector, from where you wind up, report ranked word-vectors closest to that point (while leaving out any of the 3 query vectors).")
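For concreteness, here is a minimal sketch of that same arithmetic done by hand, assuming gensim 4.x and that wordvecs is the same KeyedVectors object as above. (gensim's own most_similar() normalizes each query vector to unit length before combining them, so its exact rankings can differ slightly from this raw-vector version.)

import numpy as np

# Start at 'king', add 'woman', subtract 'man'.
target = wordvecs['king'] + wordvecs['woman'] - wordvecs['man']

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank all known words by cosine similarity to that point, skipping the
# three query words themselves, as most_similar() does.
ranked = sorted(
    (w for w in wordvecs.index_to_key if w not in {'king', 'woman', 'man'}),
    key=lambda w: cosine(wordvecs[w], target),
    reverse=True,
)
print(ranked[:5])  # 'queen' usually appears near the top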
Related
I've been working with facial embeddings but I think Word2Vec is a more common example.
Each entry in that matrix is a number that came from some prediction program/algorithm, but what are they? Are they learned features?
Those numbers are learned: each entry is the word's coordinate along one dimension, and the dimensions are chosen so that, within some limited number of them (normally a few hundred, e.g. ~200), words are separated from each other as well as possible. So if one group of words tends to appear in the same contexts, they will likely share a similar score on one or more dimensions.
For example, words like North, South, East, West are likely to be very close since they are interchangeable in many contexts.
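As a quick, hedged check of that claim, assuming wordvecs is a loaded gensim KeyedVectors model (for example, pretrained vectors whose vocabulary happens to include lowercase direction words):

# Direction words should score much higher with each other than with an
# unrelated word; exact numbers depend entirely on the training data.
for a, b in [('north', 'south'), ('north', 'east'), ('north', 'banana')]:
    print(a, b, wordvecs.similarity(a, b))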
The dimensions are chosen by the algorithm to maximize the variance they encode, and what they mean is not necessarily something we can put into words. But imagine a bag of fridge magnets, each representing a letter of the alphabet: if you shine a light on them so as to cast a shadow, there will be some orientations of the letters that yield more discriminatory information in the shadows than other orientations do.
The dimensions in a word embedding represent the best "orientations", the ones that cast the most discriminatory "shadows". Sometimes these dimensions might approximate things we recognise as having direct meaning, but very often they won't.
That being said, if you collect words that have similar functions, and find the vectors from those words to other words that are the endpoints of some fixed relationship - say England, France, Germany as one set of words (countries), and London, Paris, Berlin as another set (their respective capital cities) - you will find that the relative vectors between each country and its capital are often very, very similar in both direction and magnitude.
This has an application for search: you can start at a new word's location, say "Argentina", apply the relative "has_capital_city" vector, and the location you arrive at should be close to the word "Buenos Aires".
So the raw dimensions probably have little meaning of their own, but by performing these A is to B as X is to Y comparisons, it is possible to derive relative vectors that do have a meaning of sorts.
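A rough sketch of that country-to-capital offset, assuming wordvecs is a gensim KeyedVectors model whose vocabulary and casing happen to include these words (the "has_capital_city" vector is just the averaged offset computed below, not a named API):

import numpy as np

# Average the capital-minus-country offsets from a few known pairs.
pairs = [('England', 'London'), ('France', 'Paris'), ('Germany', 'Berlin')]
offset = np.mean([wordvecs[cap] - wordvecs[country] for country, cap in pairs],
                 axis=0)

# Apply the averaged offset to a new country and look up nearby words.
guess_point = wordvecs['Argentina'] + offset
print(wordvecs.similar_by_vector(guess_point, topn=5))
# With many pretrained vector sets, something like 'Buenos_Aires' shows up
# near the top, though this depends on the training corpus.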
I have tried an example: 'Positive' and 'Negative' are not similar words, they are opposites, but spaCy still gives me an 81% similarity score for them.
Here is my code:
import spacy
nlp = spacy.load('en_core_web_lg')
word1 = nlp(u'negative')
word2 = nlp(u'positive')
word1_word2 = word1.similarity(word2)
print(word1_word2)
Typically, word similarities like this are computed using cosine similarity between their corresponding word vectors. Words often used in the same contexts end up in similar locations in the vector space, on the assumption that words that get used similarly mean similar things. E.g., King and Queen might be similar, and King and Man might be similar, but Queen and Man should be a bit less similar (though they still both refer to "people", and they're both nouns, so they'll probably still be more similar than, say, Man and Combusted).
You want these words ('Positive' and 'Negative') to be opposites of each other (cosine similarity of -1), but they're similar because they're used in almost exactly the same contexts, with one being the negation of the other. The global semantic vector space incorporates many more ideas than just negation, so these two words end up being very similar in those other respects. What you can do is compute their average vector; the offset from "Positive" to that average, which is ("Negative" - "Positive") / 2 and is exactly the negative of the offset from "Negative" to the average, approximates the idea of negation that you're particularly interested in. That is, you could then add that direction to other words to negate them too, e.g. "Yes" + ("Negative" - "Positive") ~= "No".
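As a sketch of that arithmetic using the same spaCy vectors as in the question (the word choices and exact scores are illustrative only, and how cleanly this works varies by vector set):

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

pos = nlp.vocab['positive'].vector
neg = nlp.vocab['negative'].vector
negation = neg - pos  # the direction that (roughly) carries the shift

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Shift 'yes' along the negation direction; ideally the result lands closer
# to 'no' than to 'yes'.
shifted = nlp.vocab['yes'].vector + negation
print(cosine(shifted, nlp.vocab['no'].vector))
print(cosine(shifted, nlp.vocab['yes'].vector))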
All that just to say, the effect you're observing is not a fault of spaCy, and you won't avoid it by using gensim or scikit-learn; it's due to the nature of what "similarity" means in this context. If you want more comprehensible, human-designed semantic relationships between words, consider looking at WordNet, which is manually created and is more likely to explicitly encode a "negation" (antonym) relation between your two words.
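If you do want the explicit relation, one hedged option is a WordNet lookup via NLTK (this assumes nltk is installed and the wordnet corpus has been downloaded):

from nltk.corpus import wordnet as wn

# Collect antonym lemmas for the adjective senses of 'positive'.
antonyms = {
    ant.name()
    for syn in wn.synsets('positive', pos=wn.ADJ)
    for lemma in syn.lemmas()
    for ant in lemma.antonyms()
}
print(antonyms)  # typically includes 'negative'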
I have calculated document similarities using Doc2Vec.docvecs.similarity() in gensim. Now, I would either expect the cosine similarities to lie in the range [0.0, 1.0] if gensim used the absolute value of the cosine as the similarity metric, or roughly half of them to be negative if it does not.
However, what I am seeing is that some similarities are negative, but they are very rare – less than 1% of pairwise similarities in my set of 30000 documents.
Why are almost all of the similarities positive?
There's no inherent guarantee in Word2Vec/Doc2Vec that the generated set of vectors is symmetrically distributed around the origin point. They could be disproportionately concentrated in certain directions, which would yield the results you've seen.
In a few tests I just did on the toy-sized dataset ('lee corpus') used in the bundled gensim docs/notebooks/doc2vec-lee.ipynb notebook, checking the cosine-similarities of all documents against the first document, it vaguely seems that:
- using hierarchical softmax rather than negative sampling (hs=1, negative=0) yields a balance between >0.0 and <0.0 cosine-similarities that is closer to (but not quite) half-and-half
- using a smaller number of negative samples (such as negative=1) yields a more balanced set of results; using a larger number (such as negative=10) yields relatively more >0.0 cosine-similarities
While not conclusive, this is mildly suggestive that the arrangement of vectors may be influenced by the negative parameter. Specifically, typical negative-sampling parameters, such as the default negative=5, mean words will be trained more times as non-targets, than as positive targets. That might push the preponderance of final coordinates in one direction. (More testing on larger datasets and modes, and more analysis of how the model setup could affect final vector positions, would be necessary to have more confidence in this idea.)
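For reference, here is roughly the kind of check described above, assuming model is a trained gensim 4.x Doc2Vec, so that model.dv.vectors holds one row per document:

import numpy as np

vecs = model.dv.vectors
first = vecs[0]

# Cosine similarity of every doc-vector against the first one.
sims = vecs @ first / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(first))

print('> 0.0:', int((sims > 0.0).sum()), '< 0.0:', int((sims < 0.0).sum()))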
If for some reason you wanted a more balanced arrangement of vectors, you could consider transforming their positions, post-training.
There's an interesting recent paper in the word2vec space, "All-but-the-Top: Simple and Effective Postprocessing for Word Representations", that found sets of trained word-vectors don't necessarily have a 0-magnitude mean – they're on average in one direction from the origin. And further, this paper reports that subtracting the common mean (to 're-center' the set), and also removing a few other dominant directions, can improve the vectors' usefulness for certain tasks.
Intuitively, I suspect this 'all-but-the-top' transformation might serve to increase the discriminative 'contrast' in the resulting vectors.
A similar process might yield similar benefits for doc-vectors – and would likely make the full set of cosine-similarities, to any doc-vector, more balanced between >0.0 and <0.0 values.
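A minimal sketch of just the re-centering step (the paper also removes the top principal components, which is omitted here), again assuming a gensim 4.x Doc2Vec in model:

import numpy as np

vecs = model.dv.vectors
centered = vecs - vecs.mean(axis=0)  # subtract the common mean vector

# After centering, cosine-similarities against any one document tend to be
# more evenly split between positive and negative values.
first = centered[0]
sims = centered @ first / (np.linalg.norm(centered, axis=1) * np.linalg.norm(first))
print(int((sims > 0.0).sum()), int((sims < 0.0).sum()))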
I am trying to understand how python-glove computes its most-similar terms.
Is it using cosine similarity?
Example from the python-glove GitHub repo: https://github.com/maciejkula/glove-python/tree/master/glove
I know that from gensim's word2vec, the most_similar method computes similarity using cosine distance.
The project website is a bit unclear on this point:
The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words.
Euclidean distance is not the same as cosine similarity. It sounds like either works well enough, but it does not specify which is used.
However, we can look at the source in the repo you linked to see:
dst = (np.dot(self.word_vectors, word_vec)
/ np.linalg.norm(self.word_vectors, axis=1)
/ np.linalg.norm(word_vec))
It uses cosine similarity.
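As a quick sanity check (with stand-in data, assuming word_vectors is the model's (n_words, dim) array and word_vec is one row of it), the dst expression above is exactly row-wise cosine similarity:

import numpy as np

word_vectors = np.random.rand(5, 3)  # stand-in data for illustration
word_vec = word_vectors[0]

dst = (np.dot(word_vectors, word_vec)
       / np.linalg.norm(word_vectors, axis=1)
       / np.linalg.norm(word_vec))

# The same quantity computed one row at a time.
manual = np.array([
    np.dot(v, word_vec) / (np.linalg.norm(v) * np.linalg.norm(word_vec))
    for v in word_vectors
])
print(np.allclose(dst, manual))  # True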
On the GloVe project website, this is explained with a fair amount of clarity.
http://www-nlp.stanford.edu/projects/glove/
In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number to the word pair. A natural and simple candidate for an enlarged set of discriminative numbers is the vector difference between the two word vectors. GloVe is designed in order that such vector differences capture as much as possible the meaning specified by the juxtaposition of two words.
To read more about the math behind this, check the "Model overview" section on the website.
Yes, it uses cosine similarity.
The paper mentions this in the text: "... A similarity score is obtained from the word vectors by first normalizing each feature across the vocabulary and then calculating the cosine similarity ..."
I want to make sure I understand what the attributes use_idf and sublinear_tf do in the TfidfVectorizer object. I've been researching this for a few days. I am trying to classify documents of varied length and currently use tf-idf for feature selection.
I believe that when use_idf=true the algorithm, via the tf*idf formula, normalises against the inherent issue with raw term frequency, namely that a term that is X times more frequent shouldn't be X times as important. I then thought sublinear_tf=true, which applies 1+log(tf), normalises the bias against lengthy documents versus short documents.
I am dealing with an inherent bias towards lengthy documents (most belong to one class); does this normalisation really diminish that bias?
How can I make sure the length of the documents in the corpus is not integrated into the model?
I'm trying to verify that the normalisation is being applied in the model. I tried to extract the normalised vectors of the corpus, assuming I could just sum up each row of the TfidfVectorizer matrix. However, the sums are greater than 1; I thought a normalised corpus would transform all documents to a range between 0 and 1.
vect = TfidfVectorizer(max_features=20000, strip_accents='unicode',
                       stop_words=stopwords, analyzer='word', use_idf=True,
                       tokenizer=tokenizer, ngram_range=(1, 2),
                       sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(X_train)
# sum norm l2 documents
vect_sum = tfidf.sum(axis=1)
Neither use_idf nor sublinear_tf deals with document length. And actually your explanation for use_idf, "where a term that is X times more frequent shouldn't be X times as important", is a better description of sublinear_tf, since sublinear_tf makes the tf-idf score grow logarithmically with the term frequency rather than linearly.
use_idf means to use Inverse Document Frequency, so that terms that appear so frequently they show up in most documents (i.e., bad indicators) are weighted less than terms that appear less frequently but only in specific documents (i.e., good indicators).
To reduce document-length bias, you use normalisation (the norm parameter of TfidfVectorizer): each term's tf-idf score is scaled by an aggregate of that document's scores (the sum of absolute values for norm='l1', the Euclidean length for norm='l2').
By default, TfidfVectorizer already uses norm='l2', though, so I'm not sure what is causing the problem you are facing. Note that with norm='l2' it is the sum of the squared entries of each row that equals 1, not the plain sum, which is why your row sums come out greater than 1. Perhaps the longer documents really do contain similar words? Classification also depends a lot on the data, so I can't say much more here to solve your problem.
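Here is a small check of what norm='l2' actually guarantees, using toy texts rather than your corpus: each row of the resulting matrix has unit Euclidean length, so the squared entries of a row sum to 1 while the plain sum is usually greater than 1.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["short text", "a much longer document with many more words in it"]
vect = TfidfVectorizer(use_idf=True, sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(docs)

print(tfidf.power(2).sum(axis=1))  # each row sums to 1.0
print(tfidf.sum(axis=1))           # each row typically sums to > 1.0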
References:
TfidfVectorizer documentation
Wikipedia
Stanford Book
use_idf=true (the default) introduces a global component to the term-frequency component (the local component being the individual article). When looking at the similarity of two texts, instead of just counting the number of terms that each of them has and comparing them, introducing the idf helps to categorise these terms as relevant or not. According to Zipf's law, the "frequency of any word is inversely proportional to its rank". That is, the most common word will appear twice as many times as the second most common word, three times as many as the third most common word, and so on. Even after removing stop words, all words are subject to Zipf's law.
In this sense, imagine you have 5 articles describing a topic of automobiles. In this example the word "auto" will likely appear in all 5 texts, and therefore will not be a unique identifier of a single text. On the other hand, if only one article describes auto "insurance" while another describes auto "mechanics", these two words ("mechanics" and "insurance") will be unique identifiers of their respective texts. By using the idf, words that appear less commonly across texts ("mechanics" and "insurance", for example) receive a higher weight. Therefore, using idf does not tackle the bias generated by the length of an article, since it is, again, a global measure. If you want to reduce the bias generated by length then, as you said, using sublinear_tf=True is one way to address it, since you are transforming the local component (each article).
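A toy illustration of that point (the texts below are made up for the example): "auto" appears in every document, so it gets the minimum idf weight, while "insurance" and "mechanics" appear in only one document each and are weighted higher.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "auto insurance rates",
    "auto mechanics tools",
    "auto engines",
    "auto sales",
    "auto repairs",
]
vect = TfidfVectorizer(use_idf=True)
vect.fit(docs)

# idf_ holds the learned inverse-document-frequency weight per vocabulary term.
for term in ("auto", "insurance", "mechanics"):
    print(term, vect.idf_[vect.vocabulary_[term]])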
Hope it helps.