python glove similarity measure calculation

I am trying to understand how python-glove computes its most-similar terms.
Is it using cosine similarity?
Example from the python-glove GitHub repository: https://github.com/maciejkula/glove-python/tree/master/glove
I know that in gensim's word2vec, the most_similar method ranks results by cosine similarity.

The project website is a bit unclear on this point:
The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words.
Euclidean distance is not the same as cosine similarity. It sounds like either works well enough, but it does not specify which is used.
However, we can look at the source of the repo you linked to see:
dst = (np.dot(self.word_vectors, word_vec)
/ np.linalg.norm(self.word_vectors, axis=1)
/ np.linalg.norm(word_vec))
It uses cosine similarity.
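For reference, here is a minimal standalone sketch of that same computation in plain NumPy; the names (embeddings, query_vec, most_similar) are illustrative, not glove-python's internals:

```python
import numpy as np

def most_similar(embeddings, query_vec, topn=5):
    # cosine similarity: dot product divided by both vector norms
    sims = (embeddings @ query_vec
            / np.linalg.norm(embeddings, axis=1)
            / np.linalg.norm(query_vec))
    return np.argsort(-sims)[:topn]      # indices of the best matches, best first

embeddings = np.random.rand(1000, 100)   # stand-in for a trained embedding matrix
query_vec = embeddings[42]               # query with one of the "word" vectors
print(most_similar(embeddings, query_vec))   # row 42 itself should rank first
```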

On the GloVe project website, this is explained with a fair amount of clarity.
http://www-nlp.stanford.edu/projects/glove/
In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number to the word pair. A natural and simple candidate for an enlarged set of discriminative numbers is the vector difference between the two word vectors. GloVe is designed in order that such vector differences capture as much as possible the meaning specified by the juxtaposition of two words.
To read more about the math behind this, check the "Model overview" section on the website.

Yes, it uses cosine similarity.
The paper mentions this in the text: "... A similarity score is obtained from the word vectors by first normalizing each feature across the vocabulary and then calculating the cosine similarity. ..."
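A rough sketch of the procedure that quote describes, under the assumption that "normalizing each feature across the vocabulary" means column-wise normalization of the embedding matrix (all names and values below are illustrative):

```python
import numpy as np

# Stand-in for a GloVe embedding matrix: one row per vocabulary word.
vectors = np.random.rand(5000, 300)

# Normalize each feature (column) across the vocabulary, per the quoted paper.
vectors = vectors / np.linalg.norm(vectors, axis=0)

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity score between two words (here, rows 10 and 20).
print(cosine_similarity(vectors[10], vectors[20]))
```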

Related

What does Similarity Score mean in gensim?

I have used the Gensim library to find the similarity between a sentence and a collection of paragraphs (a dataset of texts). I have used cosine similarity, soft cosine similarity and Word Mover's Distance measures separately. Gensim returns a list of items including the docid and a similarity score. For cosine similarity and soft cosine similarity, I guess the similarity score is the cosine between the vectors. Am I right?
In the Gensim documentation they write that it is the semantic relatedness, with no extra explanation. I have searched a lot but did not find any answer. Any help, please?
Usually by 'similarity', people are seeking a measure of semantic relatedness - but whether the particular values calculated achieve that depends on many other factors, such as the sufficiency of the training data and the choice of appropriate parameters.
Within each code context, 'similarity' has no more and no less meaning than how it's calculated right there - usually, that's 'cosine similarity between vector representations'. (When there are no other hints that it means something different, 'cosine similarity' is typically a safe starting assumption.)
But really: the meaning of 'similarity' at each use is no more and no less than whatever that one code path's docs/source-code dictate.
(I realize that may seem an indirect & unsatisfying answer. If there are specific uses in context in Gensim source/docs/example where the meaning is unclear, you could point those out & I might be able to clarify those more.)
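For a concrete check of that 'cosine similarity between vector representations' reading, here is a small sketch with gensim word vectors; the file path is illustrative and assumes a pretrained word2vec-format model is available locally:

```python
import numpy as np
from gensim.models import KeyedVectors

# Illustrative path; any pretrained word2vec-format vectors will do.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                        binary=True)

# The score returned by most_similar() ...
word, score = kv.most_similar("father", topn=1)[0]

# ... should match a plain cosine similarity between the two word vectors.
a, b = kv["father"], kv[word]
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(score, cosine)   # effectively equal, up to floating-point precision
```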

Sentence meaning similarity in python

I want to calculate sentence meaning similarity. I am using cosine similarity, but this method does not fulfill my needs.
For example, if I have these two sentences:
He and his father are very close.
He shares a wonderful bond with his father.
What I need is to calculate the similarity between these sentences based on their meaning, not just on matching similar words.
Is there a way to do this?
One approach would be to represent each word using pre-trained word vectors ("embeddings"). These are vectors with a few hundred dimensions where words with similar meaning (e.g., "close", "bond") should have similar vectors. The key idea is that word embeddings can capture that the two sentences have similar meanings even though they use different words.
This can be done quickly with a package such as spaCy in Python. See https://spacy.io/usage/vectors-similarity
Common pre-trained vectors include the Google News word embeddings (https://github.com/mmihaltz/word2vec-GoogleNews-vectors) and GloVe embeddings (https://nlp.stanford.edu/projects/glove/).
Here's a simple approach: represent each word by its pretrained embedding and average the word vectors across the sentence. Then compare the sentence vectors using any reasonable distance measure (cosine similarity is standard).
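A minimal sketch of that averaged-embedding approach using spaCy, assuming a model that ships with word vectors (e.g. en_core_web_md) is installed:

```python
import spacy

# Requires a model with word vectors, e.g.:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc1 = nlp("He and his father are very close.")
doc2 = nlp("He shares a wonderful bond with his father.")

# Doc.vector is the average of the token vectors, and Doc.similarity()
# compares those averaged vectors with cosine similarity.
print(doc1.similarity(doc2))
```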

Normalization of Term-frequency and Inverse Document Frequency of varying documents lengths to calculate cosine similarity

I have been trying to compute the similarity of thousands of text documents against one single query, and the document lengths vary widely (from 20 words to 2,000 words).
I did refer to the question: tf-idf documents of different length
But that doesn't help me, because even small differences in the cosine value matter when ranking a pool of documents.
I then came across a helpful normalization blog post: Tf-Idf and Cosine similarity. But the problem there is having to tweak the term frequency of every document.
I am using sklearn to calculate tf-idf, and I am now looking for a utility with performance similar to sklearn's tf-idf. Iterating over all the documents to calculate TF and then modify it is both time-consuming and inefficient.
Any knowledge/suggestions are appreciated.
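For reference, a minimal sketch of the sklearn route: sublinear TF scaling and L2 normalization are the built-in TfidfVectorizer options that dampen document-length effects, so no per-document loop is needed. The document and query strings below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

documents = ["a short document", "a much longer document about many other things"]
query = ["short query document"]

# sublinear_tf=True replaces raw counts with 1 + log(tf);
# norm="l2" length-normalizes every document vector.
vectorizer = TfidfVectorizer(sublinear_tf=True, norm="l2")
doc_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform(query)

# With L2-normalized vectors, the dot product is the cosine similarity.
scores = linear_kernel(query_vec, doc_matrix).ravel()
print(scores.argsort()[::-1])   # document indices ranked by similarity to the query
```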

Finding the most similar documents (nearest neighbours) from a set of documents

I have 80,000 documents that cover a very wide range of topics. What I want to do is, for every article, provide links recommending other articles (something like the top 5 related articles) that are similar to the one a user is currently reading. If I don't have to, I'm not really interested in classifying the documents, just similarity or relatedness, and ideally I would like to output an 80,000 x 80,000 matrix of all the documents with the corresponding distance (or perhaps correlation? similarity?) to the other documents in the set.
I'm currently using NLTK to process the contents of the documents and get ngrams, but from there I'm not sure what approach I should take to calculate the similarity between documents.
I read about using tf-idf and cosine similarity; however, because of the vast number of topics I'm expecting a very high number of unique tokens, so multiplying two very long vectors might be a bad way to go about it. Also, 80,000 documents might call for a lot of multiplication between vectors. (Admittedly, it would only have to be done once, so it's still an option.)
Is there a better way to get the distance between documents without creating a huge vector of ngrams? Spearman correlation? Or would a more low-tech approach like taking the top ngrams and finding other documents with the same ngrams in their top k ngrams be more appropriate? I just feel like I must be going about the problem in the most brute-force way possible if I need to multiply possibly 10,000-element vectors together roughly 3.2 billion times (the sum of the arithmetic series 79,999 + 79,998 + ... + 1).
Any advice for approaches or what to read up on would be greatly appreciated.
So for K=5 you basically want to return the K nearest neighbours to a particular document? In that case you should use the k-nearest neighbours algorithm. Scikit-learn has some good text importing and normalizing routines (tf-idf), and then it's pretty easy to implement KNN.
The heuristics are basically just creating normalized word-count vectors from all of the words in a document and then comparing the distance between the vectors. I would definitely try out a few different distance metrics: Euclidean vs. Manhattan vs. cosine similarity, for instance. The vectors aren't really long; they just sit in a high-dimensional space. So you can address the unique-words issue you mentioned by doing some dimensionality reduction through PCA or your favorite algorithm.
It's probably equally easy to do this in another package, but the documentation of scikit-learn is top notch and makes it easy to learn quickly and thoroughly.
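A small sketch of that pipeline (TF-IDF plus cosine nearest neighbours in scikit-learn); the documents list is a toy stand-in for the 80,000 articles, and something like TruncatedSVD can be slotted in for the dimensionality-reduction step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the 80,000 article texts.
documents = [
    "the cat sat on the mat",
    "a cat and a dog played on the mat",
    "stock markets fell sharply today",
    "investors worried as markets dropped further",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)

# Each document's nearest neighbour is itself (distance 0), so ask for topn + 1.
topn = 2   # use 5 on the real corpus
nn = NearestNeighbors(n_neighbors=topn + 1, metric="cosine").fit(X)
distances, indices = nn.kneighbors(X)

for i, neighbours in enumerate(indices):
    related = [j for j in neighbours if j != i][:topn]
    print(i, "->", related)
```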
You should learn about hashing mechanisms that can be used to calculate similarity between documents.
Typical hash functions are designed to minimize collisions, mapping even near duplicates to very different hash keys. With cryptographic hash functions, if the data is changed by a single bit, the hash key changes to a completely different one.
Similarity hashing aims for the opposite: hash-based techniques for near-duplicate detection are designed so that very similar documents map to very similar hash keys, or even to the same key. The bitwise Hamming distance between keys is then a measure of similarity.
After calculating the hash keys, they can be sorted to speed up near-duplicate detection from O(n^2) to O(n log n). A threshold can be defined and tuned by analysing the accuracy on training data.
SimHash, MinHash and locality-sensitive hashing (LSH) are three families of hash-based methods. You can search for more information about these; there are a lot of research papers related to this topic.
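A brief sketch of the MinHash/LSH idea using the third-party datasketch package (the library choice is an assumption; SimHash or other LSH implementations work along the same lines):

```python
from datasketch import MinHash, MinHashLSH   # pip install datasketch

def minhash(text, num_perm=128):
    # Hash each token; the signature approximates the document's token set,
    # so similar documents get similar signatures.
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

docs = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on a mat",               # near duplicate of "a"
    "c": "stock markets fell sharply today",
}

# Index the signatures; the threshold is the estimated Jaccard similarity cut-off.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, text in docs.items():
    lsh.insert(key, minhash(text))

print(lsh.query(minhash(docs["a"])))   # expect something like ['a', 'b']
```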

Clustering words based on Distance Matrix

My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed the Jaccard similarity between every pair of words; in other words, I have a sparse distance matrix available. Can anyone point me to a clustering algorithm (and possibly a Python library for it) that takes a distance matrix as input? I also do not know the number of clusters beforehand. I only want to cluster these words and see which words are clustered together.
You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately you need the number of clusters for many algorithms.
DBSCAN is the only one that doesn't need the number of clusters and also uses arbitrary distance matrices.
You could also try MeanShift, but that will interpret the distances as coordinates - which might also work.
There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.
disclosure: I'm a scikit-learn core dev.
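A minimal sketch of the precomputed-distance route with DBSCAN mentioned above (the matrix values are made up for illustration; with Jaccard similarities, distance = 1 - similarity is a common choice):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy symmetric distance matrix for four words; zeros on the diagonal.
distance_matrix = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

# metric="precomputed" tells DBSCAN to treat the input as distances directly.
labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(distance_matrix)
print(labels)   # e.g. [0 0 1 1]; -1 would mark noise points
```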
The scipy clustering package could be useful (scipy.cluster). There are hierarchical clustering functions in scipy.cluster.hierarchy. Note, however, that those require a condensed matrix as input (the upper triangle of the distance matrix). Hopefully the documentation pages will help you along.
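And a sketch of that scipy route, converting the square distance matrix to condensed form and cutting the hierarchy at a distance threshold instead of fixing a cluster count (values are illustrative):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Toy symmetric distance matrix for four words.
distance_matrix = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

condensed = squareform(distance_matrix)              # upper triangle, flattened
Z = linkage(condensed, method="average")             # agglomerative, average linkage
labels = fcluster(Z, t=0.5, criterion="distance")    # cut the tree at distance 0.5
print(labels)   # e.g. [1 1 2 2] -- no cluster count needed up front
```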
I recommend taking a look at agglomerative clustering.
