What does Similarity Score mean in gensim? - python

I have used the Gensim library to find the similarity between a sentence and a collection of paragraphs (a dataset of texts). I have used cosine similarity, soft cosine similarity and Word Mover's Distance separately. Gensim returns a list of items, each containing a document id and a similarity score. For cosine similarity and soft cosine similarity, I guess the similarity score is the cosine between the vectors. Am I right?
In the Gensim documentation they write that it is the semantic relatedness, with no further explanation. I have searched a lot but did not find any answer. Any help, please?

Usually, by 'similarity', people are seeking a measure of semantic relatedness, but whether the particular values calculated achieve that will depend on lots of other factors, such as the sufficiency of the training data and the choice of other appropriate parameters.
Within each code context, 'similarity' has no more and no less meaning than how it's calculated right there; usually, that's 'cosine similarity between vector representations'. (When there's no hint that it means something different, cosine similarity is typically a safe starting assumption.)
But really: the meaning of 'similarity' at each use is no more and no less than whatever that one code path's docs/source-code dictate.
(I realize that may seem an indirect and unsatisfying answer. If there are specific uses in the Gensim source/docs/examples where the meaning is unclear, you could point those out and I might be able to clarify them further.)
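For concreteness, here is a minimal numpy sketch of what 'cosine similarity between vector representations' means; the vectors below are made-up stand-ins for whatever bag-of-words or embedding vectors a particular gensim similarity class uses:
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction,
    # 0.0 = orthogonal (nothing in common), -1.0 = opposite direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up example vectors standing in for a query and a document.
query_vec = np.array([0.2, 0.1, 0.0, 0.7])
doc_vec = np.array([0.1, 0.0, 0.1, 0.9])
print(cosine_similarity(query_vec, doc_vec))  # a score in [-1, 1]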

Related

Similarity check on python NLP

I have two columns, both retrieved from different sources but with the same identifier, and I need to check whether they are similar; there might be only differences in spelling, or they might be completely different.
If you want to check whether the two sentences are similar except for spelling differences, then you can use the normalized Levenshtein distance, or the string edit distance.
s1= "Quick brown fox"
s2= "Quiqk drown fox"
The Levenshtein distance between the two sentences is two.
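For illustration, here is a minimal dynamic-programming sketch of the (unnormalized) Levenshtein distance; dividing by the length of the longer string gives one common normalization:
def levenshtein(s1, s2):
    # Classic dynamic programming: after processing i characters of s1,
    # prev[j] holds the edit distance to the first j characters of s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

s1 = "Quick brown fox"
s2 = "Quiqk drown fox"
print(levenshtein(s1, s2))                           # 2
print(levenshtein(s1, s2) / max(len(s1), len(s2)))   # normalized variant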
If you want to check for semantic differences, then you will probably have to use a machine-learning-based model. The simplest thing you can do for semantic similarity is to use a model like Sentence2Vec or Doc2Vec, get semantic embeddings for the two sentences, and compute their dot product.
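As a rough sketch of that embedding route with gensim's Doc2Vec (the tiny corpus and the parameters below are made up; a real model needs far more training text before the scores mean anything):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

# Toy training corpus; in practice use thousands of documents.
corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(["quick brown fox jumps over the fence",
                                    "lazy dog sleeps all day",
                                    "the fox is quick and brown"])]
model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=50)

v1 = model.infer_vector("quick brown fox".lower().split())
v2 = model.infer_vector("quiqk drown fox".lower().split())
# Cosine of the two inferred vectors as a semantic similarity score.
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))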
As shubh gupta noted above me, there are measures of distance between strings. They usually return a magnitude related to the difference in characters or substrings. The Levenshtein distance is one of the most common ones. You can find a really good article that explains how it works here.
Looking at how your question is stated, I do not think you're looking for the semantic difference between your two input strings; you would need an NLP model to do that. Maybe you can restate your question and provide more information on exactly what kind of difference you want to measure.

using LDA for dimension reduction / clustering

I'm currently facing a text mining problem where my goal is to identify clusters within a corpus of short texts.
The idea is that these clusters represent some kind of technical/domain-specific content which all members of the respective cluster have in common.
The final evaluation of the clustering has to be made from a domain-knowledge-based perspective.
I worked my way through a bunch of different approaches.
Topic modeling with LDA seems a good one to start with.
So each of my documents is represented as a mixture of different topics (which are based on the co-occurrence of single words or n-grams).
My first idea was to use the resulting topics as clusters/groups to group my documents.
But one single document can consist of different topics, so I'm not sure whether this is a good idea.
Furthermore, as LDA does not use a distance measure to calculate its topics, I'm lacking some kind of metric to evaluate my LDA-based clusters. Because I'm missing a given ground truth, I'm bound to methods that do not require one. I used the silhouette score to evaluate my clusters, but this metric is based on distances while LDA is not, so I'm not sure whether it actually makes sense.
My second thought was to use the LDA results as a preprocessing step for dimension reduction.
On these new input vectors I could apply distance-based clustering methods like agglomerative clustering, k-means or DBSCAN (see the sketch after the questions below).
I also found some posts and papers which pointed to self-organizing maps for this kind of problem. Is this approach worth following, compared to the methods described above?
Is it a reasonable approach to use LDA topics as clusters or as a preprocessing step?
What are metrics to evaluate non-distance-based approaches like LDA?
Are there any other approaches which I should take into account?
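As a rough sketch of the second idea (LDA document-topic vectors as a reduced representation fed to a distance-based clusterer), here with scikit-learn, toy texts and made-up parameters:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = ["gearbox oil leak detected", "replace gearbox seal and oil",
        "network switch port failure", "switch firmware update failed",
        "pump bearing vibration high", "replace worn pump bearing"]

# Bag-of-words counts, then LDA to get a topic distribution per document.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_vecs = lda.fit_transform(counts)   # shape: (n_docs, n_topics)

# Distance-based clustering on the low-dimensional topic vectors.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(topic_vecs)
print(km.labels_)

# The silhouette score is now well defined, because the clustering
# happens in a space with an ordinary (Euclidean) distance.
print(silhouette_score(topic_vecs, km.labels_))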

python glove similarity measure calculation

I am trying to understand how python-glove computes the most similar terms.
Is it using cosine similarity?
Example from the python-glove GitHub repository: https://github.com/maciejkula/glove-python/tree/master/glove
I know that gensim's word2vec most_similar method computes similarity using cosine similarity.
The project website is a bit unclear on this point:
The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words.
Euclidean distance is not the same as cosine similarity. It sounds like either works well enough, but the site does not specify which one is used.
However, we can look at the source of the repo you are referring to and see:
dst = (np.dot(self.word_vectors, word_vec)
       / np.linalg.norm(self.word_vectors, axis=1)
       / np.linalg.norm(word_vec))
It uses cosine similarity.
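As a quick check, the same formula run on a made-up matrix reproduces plain cosine similarity, and ranking by it gives a most-similar list (the random matrix below just stands in for the trained word vectors):
import numpy as np

word_vectors = np.random.rand(5, 10)   # stand-in: 5 words, 10 dimensions
word_vec = word_vectors[2]             # the query word's vector

dst = (np.dot(word_vectors, word_vec)
       / np.linalg.norm(word_vectors, axis=1)
       / np.linalg.norm(word_vec))

# dst[i] is the cosine similarity between word i and the query word;
# sorting in descending order gives the most similar words (the query
# itself comes first with similarity ~1.0).
print(np.argsort(-dst))
print(dst[2])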
On the GloVe project website, this is explained with a fair amount of clarity:
http://www-nlp.stanford.edu/projects/glove/
In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number to the word pair. A natural and simple candidate for an enlarged set of discriminative numbers is the vector difference between the two word vectors. GloVe is designed in order that such vector differences capture as much as possible the meaning specified by the juxtaposition of two words.
To read more about the math behind this, check the "Model overview" section on the website.
Yes, it uses cosine similarity.
The paper mentions this in the text: "... A similarity score is obtained from the word vectors by first normalizing each feature across the vocabulary and then calculating the cosine similarity ..."

Similarity measure for documents based on controlled vocabularies

I have a list of controlled vocabulary terms, e.g., term1, term2, ..., termN. A document may have one or more controlled terms, but each term may occur only once per document.
Let's say the full set of controlled terms is Term1, Term2, Term3, Term4, Term5, Term6.
Doc 1 (4 terms) : term1, term2, term5, term6
Doc 2 (2 terms) : term2, term5
Option 1:
The Jaccard approach looks at the two sets and counts the positions where both values are equal to 1. Therefore, I could convert the presence of each controlled term (terms 1-6) in a document into a binary vector of 1s and 0s, and then compute the similarity based on Jaccard (http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.jaccard.html)
Doc1:{1,1,0,0,1,1}
Doc2:{0,1,0,0,1,0}
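A small sketch of Option 1 using the linked SciPy function; note that scipy.spatial.distance.jaccard returns a dissimilarity, so the similarity is one minus it:
from scipy.spatial.distance import jaccard

# Binary presence vectors over the six controlled terms.
doc1 = [1, 1, 0, 0, 1, 1]
doc2 = [0, 1, 0, 0, 1, 0]

distance = jaccard(doc1, doc2)   # fraction of disagreeing positions among the non-zero ones
similarity = 1 - distance
print(similarity)                # 2 shared terms / 4 terms in the union = 0.5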
Option 2: use cosine similarity based on tf-idf, as in http://brandonrose.org/clustering
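And a minimal sketch of Option 2, treating each document's term list as a short text and comparing tf-idf vectors with cosine similarity (scikit-learn is used here purely for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["term1 term2 term5 term6",   # Doc 1
        "term2 term5"]               # Doc 2

tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])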
Among these options (or perhaps other similarity measures), which is suitable for computing similarity between documents based on controlled vocabularies? I am new to data mining; any suggestion will be appreciated.
It won't let me leave a comment, so I will leave an answer. I do something similar, but in R, and find this helpful:
http://text2vec.org/similarity.html#cosine_similarity
I do not know if there is a "right answer". I would try the different approaches and see which yields the answer most similar to a human's judgment. I think Euclidean distance may be best, but I don't know if that is available to you.

Finding the most similar documents (nearest neighbours) from a set of documents

I have 80,000 documents about a very vast number of topics. What I want to do is, for every article, provide links recommending other articles (something like the top 5 related articles) that are similar to the one a user is currently reading. If I don't have to, I'm not really interested in classifying the documents, just in similarity or relatedness, and ideally I would like to output an 80,000 x 80,000 matrix of all the documents with the corresponding distance (or perhaps correlation? similarity?) to the other documents in the set.
I'm currently using NLTK to process the contents of the documents and get n-grams, but from there I'm not sure what approach I should take to calculate the similarity between documents.
I read about using tf-idf and cosine similarity; however, because of the vast number of topics I'm expecting a very high number of unique tokens, so multiplying two very long vectors might be a bad way to go about it. Also, 80,000 documents might call for a lot of multiplications between vectors. (Admittedly, it would only have to be done once, so it's still an option.)
Is there a better way to get the distance between documents without creating a huge vector of n-grams? Spearman correlation? Or would a more low-tech approach, like taking the top n-grams and finding other documents with the same n-grams among their top k n-grams, be more appropriate? I just feel like surely I must be going about the problem in the most brute-force way possible if I need to multiply possibly 10,000-element vectors together roughly 3.2 billion times (the sum of the arithmetic series 79,999 + 79,998 + ... + 1).
Any advice for approaches or what to read up on would be greatly appreciated.
So for K=5 you basically want to return the K nearest neighbours of a particular document? In that case you should use the K-Nearest Neighbors algorithm. Scikit-learn has some good text importing and normalizing routines (tf-idf), and then it's pretty easy to implement KNN.
The heuristics are basically just creating normalized word-count vectors from all of the words in a document and then comparing the distance between the vectors. I would definitely try a few different distance metrics: Euclidean vs. Manhattan vs. cosine similarity, for instance. The vectors aren't really long; they just sit in a high-dimensional space. So you can address the unique-words issue you wrote about by doing some dimensionality reduction through PCA or your favourite algorithm.
It's probably equally easy to do this in another package, but the documentation of scikit-learn is top notch and makes it easy to learn quickly and thoroughly.
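A minimal scikit-learn sketch along those lines (the documents and parameters are illustrative; with 80,000 documents the sparse tf-idf matrix and a brute-force cosine search are usually still workable):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = ["the cat sat on the mat",                 # stand-ins for the 80,000 articles
        "a cat and a dog played in the garden",
        "stock markets fell sharply today",
        "investors sold shares as markets dropped"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Ask for k+1 neighbours, since each document's nearest neighbour is itself.
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(tfidf)
distances, indices = nn.kneighbors(tfidf)
print(indices)   # per document: itself first, then its closest neighbours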
You should learn about hashing mechanisms that can be used to calculate similarity between documents.
Typical hash functions are designed to minimize collisions and map near-duplicates to very different hash keys. With cryptographic hash functions, if the data is changed by even one bit, the hash key changes to a completely different one.
The goal of similarity hashing is the opposite: hash-based techniques for near-duplicate detection are designed so that very similar documents map to very similar hash keys, or even to the same key. The bitwise Hamming distance between keys is then a measure of similarity.
After calculating the hash keys, they can be sorted to reduce the cost of near-duplicate detection from O(n²) to O(n log n). A threshold can be defined and tuned by analysing accuracy on training data.
SimHash, MinHash and locality-sensitive hashing are three hash-based methods of this kind. You can search for and find more information about these; there are a lot of research papers related to this topic.
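A toy sketch of the MinHash idea (the hash family here, Python's built-in hash XORed with random masks, is purely illustrative; real systems use proper MinHash/SimHash libraries plus LSH banding to avoid comparing all pairs):
import random

def shingles(text, k=5):
    # Character k-shingles of a document.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(shingle_set, num_hashes=64, seed=42):
    # One minimum per (illustrative) hash function; similar sets share many minima.
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(s) ^ m for s in shingle_set) for m in masks]

def signature_similarity(sig_a, sig_b):
    # The fraction of matching minima estimates the Jaccard similarity of the shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumped over a lazy dog"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
print(signature_similarity(sig1, sig2))   # high for near-duplicates, low for unrelated text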
