I have a list of controlled vocabulary terms, e.g., term1, term2, ..., termN. A document may have one or more of these terms, but each term may occur at most once per document.
Let's say the full controlled vocabulary is Term1, Term2, Term3, Term4, Term5, Term6.
Doc 1 (4 terms) : term1, term2, term5, term6
Doc 2 (2 terms) : term2, term5
Option 1:
The Jaccard approach looks at the two sets and counts the positions where both values equal 1. I could therefore encode the presence of each controlled term (terms 1-6) in a document as a binary vector of 1s and 0s, and then compute the similarity with Jaccard (http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.jaccard.html):
Doc1:{1,1,0,0,1,1}
Doc2:{0,1,0,0,1,0}
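For concreteness, a minimal sketch of Option 1 with SciPy (note that scipy.spatial.distance.jaccard returns the dissimilarity, so the similarity is one minus that value):

    # A minimal sketch of Option 1, using the binary vectors above.
    import numpy as np
    from scipy.spatial.distance import jaccard

    doc1 = np.array([1, 1, 0, 0, 1, 1], dtype=bool)  # term1, term2, term5, term6
    doc2 = np.array([0, 1, 0, 0, 1, 0], dtype=bool)  # term2, term5

    # scipy's jaccard() returns the Jaccard *dissimilarity*, so similarity = 1 - distance.
    similarity = 1 - jaccard(doc1, doc2)
    print(similarity)  # 0.5: 2 shared terms out of the 4 terms present in either document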
Option 2: use cosine similarity based on TF-IDF, as in http://brandonrose.org/clustering
Among these options (or perhaps other similarity measures), which is best suited to computing similarity between documents based on controlled vocabulary terms? I am new to data mining; any suggestion would be appreciated.
It won't let me leave a comment, so I will leave an answer. I do something similar, but in R, and find this helpful:
http://text2vec.org/similarity.html#cosine_similarity
I do not know if there is a "right answer". I would try the different approaches and see which yields results closest to a human's judgment. I think Euclidean distance may be best, but I don't know if that is available to you.
Related
I have two columns retrieved from different resources but with the same identifier, and I need to check whether the values are similar: they might differ only in spelling, or they might be completely different.
If you want to check whether the two sentences are similar except for spelling differences, then you can use the normalized Levenshtein distance, i.e. the string edit distance.
s1= "Quick brown fox"
s2= "Quiqk drown fox"
The Levenshtein distance between the two sentences is two (two single-character substitutions: c→q and b→d).
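For illustration, a minimal pure-Python sketch of the edit distance (no library assumed; dividing by the longer string's length gives the normalized version):

    def levenshtein(s1: str, s2: str) -> int:
        """Classic dynamic-programming edit distance (insert/delete/substitute, cost 1 each)."""
        if len(s1) < len(s2):
            s1, s2 = s2, s1
        previous = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, start=1):
            current = [i]
            for j, c2 in enumerate(s2, start=1):
                cost = 0 if c1 == c2 else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

    s1, s2 = "Quick brown fox", "Quiqk drown fox"
    print(levenshtein(s1, s2))                          # 2
    print(levenshtein(s1, s2) / max(len(s1), len(s2)))  # normalized: 2/15 ≈ 0.13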
If you want to check for semantic differences, then you will probably have to use a machine learning based model. The simplest thing you can do for semantic similarity is to use a model like Sentence2Vec or Doc2Vec, get semantic embeddings for the two sentences, and compute their dot product.
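As a rough sketch of that idea with gensim's Doc2Vec (the toy corpus, tokenization, and model parameters below are illustrative assumptions; a real model needs a much larger training corpus):

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus -- a real Doc2Vec model needs many documents to learn useful embeddings.
    corpus = [
        TaggedDocument("quick brown fox".split(), [0]),
        TaggedDocument("the lazy dog sleeps".split(), [1]),
    ]
    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

    v1 = model.infer_vector("quick brown fox".split())
    v2 = model.infer_vector("quiqk drown fox".split())

    # Cosine similarity (the dot product of the L2-normalized vectors).
    similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print(similarity)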
As shubh gupta noted above me, there are distance measures for strings. They usually return a magnitude related to the difference in characters or substrings. The Levenshtein distance is one of the most common. You can find a really good article that explains how it works here.
Looking at how your question is stated, I do not think you're looking for the semantic difference between your two input strings; you would need an NLP model to do that. Maybe you can restate your question and provide more information on exactly what kind of difference you want to measure.
I have used the Gensim library to find the similarity between a sentence and a collection of paragraphs (a dataset of texts). I have used cosine similarity, soft cosine similarity, and the Word Mover's Distance measures separately. Gensim returns a list of items containing the doc id and similarity score. For cosine similarity and soft cosine similarity, I guess the similarity score is the cosine between the vectors. Am I right?
In the Gensim documentation, they write that it is the semantic relatedness, with no extra explanation. I have searched a lot but did not find an answer. Any help, please?
Usually by 'similarity', people are seeking a measure of semantic relatedness - but whether the particular values calculated achieve that will depend on lots of other factors, such as the sufficiency of training data & the choice of other appropriate parameters.
Within each code context, 'similarity' has no more and no less meaning than how it's calculated right there - usually, that's 'cosine similarity between vector representations'. (When there are no other hints that it means something different, 'cosine similarity' is typically a safe starting assumption.)
But really: the meaning of 'similarity' at each use is no more and no less than whatever that one code path's docs/source-code dictate.
(I realize that may seem an indirect & unsatisfying answer. If there are specific uses in the Gensim source/docs/examples where the meaning is unclear, you could point those out & I might be able to clarify them further.)
I have two lists of sentences (list A and list B). I want to find which sentence in A is closest in meaning to the entirety of B.
This is not the same as the standard cosine similarity check you can do when comparing (in spaCy, for example) two doc objects: even if I iterate through A and compare each element of A to all elements of B, it leaves me with a number of cosine similarity scores, while I want just one number to represent the closeness of each element of A to all of B.
So far I have tried the following:
For every element in A, perform a cosine similarity check with every element in B, leaving me with a list of values equal in length to B. Then I calculate the average of this list, leaving a single value which would ideally represent how close that element of A is to all of B.
The issue with that approach is that the averaging results in too much information loss; by the time I've done this for all elements of A, there isn't much difference between these condensed averages, and it is therefore hard to conclude which element of A is closest to all of B.
P.S.
I can show code if asked but feel it's irrelevant because the issue is with the approach itself, not broken code.
I use a few approaches when I have a similar problem -- for me it's often comparing new documents to a cluster of documents and finding which cluster is "most similar."
First, a sidebar: you can totally do this in spaCy, but if you're dealing with sentences or shorter paragraphs, you might want to try embedding them with a model from SentenceTransformers. spaCy's document embeddings are just the average of the word embeddings, and embedding the full document with a model intended for that might give you better results.
Assuming you have lists of documents A and B, and embeddings for both, what I would do first instead of averaging cosine similarities is average the embeddings of B, then find the cosine similarity between each item in A and this average B embedding.
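A minimal sketch of this first approach (the sentences and the 'all-MiniLM-L6-v2' model name are just placeholder choices):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works here

    A = ["Stocks fell sharply on Monday.", "The cat sat on the mat."]
    B = ["Markets dropped after the announcement.", "Investors sold off their shares."]

    emb_a = model.encode(A)               # shape (len(A), dim)
    emb_b = model.encode(B).mean(axis=0)  # single averaged embedding for all of B

    # Cosine similarity of each sentence in A against the averaged B embedding.
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b)
    scores = emb_a @ emb_b                # one score per sentence in A
    print(scores, "->", A[int(np.argmax(scores))])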
Sometimes, as you experienced with your original approach, averaging results in a loss of information. Going back to having our lists of embeddings for A and B--another approach I take, especially if the documents in B are highly variable in content, is for each document in A find the document in B with the max cosine similarity value. The benefit here is that you might find "clusters" of similar documents and be able to evaluate those. This is nice because the idea of the "meaning of the entirety of B" isn't well defined, especially if B contains lots of documents. This is a nice way to decompose both A and B and better understand groups of documents similar between them.
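Continuing the sketch above, the max-similarity variant is just a pairwise similarity matrix followed by a row-wise max:

    # Reuses model, A, B, and the normalized emb_a from the previous sketch.
    emb_b_all = model.encode(B)
    emb_b_all = emb_b_all / np.linalg.norm(emb_b_all, axis=1, keepdims=True)

    sims = emb_a @ emb_b_all.T        # (len(A), len(B)) matrix of cosine similarities
    best_score = sims.max(axis=1)     # for each sentence in A, its single closest match in B
    best_match = sims.argmax(axis=1)  # ...and which sentence in B that was
    print(best_score, best_match)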
Whatever you choose, I hope you post back with the results!
I'm trying to compute the text similarity of a search term, A, like "How to make chickens" against a collection of other search terms. To compute similarity I'm using the cosine distance and TF-IDF to transform A into a vector. I'd like to compare the similarity of A against all documents at once.
Currently, my approach involves computing the cosine similarity for A against every other document one at a time, iteratively. I have 100 documents I'm comparing against. If the result of cos_sim(A, X) > 0.8 then I break and say "cool, this is similar".
However, I feel like this might not be a true representation of the overall similarity. Is there a way to pre-compute a vector (or vectors) for my 100 documents ahead of time, so that every time I see a new search query A, I can compare it against this pre-computed vector/document?
I believe I could achieve this by simply combining all documents into one... it feels rough, though. What are the pros and cons, and possible solutions? Extra points for efficiency!
This problem is essentially the traditional search problem: have you tried putting your documents into something like Lucene (Java) or Whoosh (Python)? I think they have a cosine-similarity model (but even if they don't, the default may be better).
The general trick all search engines use is that documents are sparse. This means that to compute the similarity (e.g., cosine similarity), all that matters are the lengths of the documents (known well ahead of time) and the terms they both contain; you can organize a data structure like a back-of-the-book index, called an inverted index, that can quickly tell you which documents will get at least a non-zero score.
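For intuition, a toy inverted index is just a mapping from term to the set of documents containing it; the documents below are made up, and this is nothing like what Lucene actually stores, just the core idea:

    from collections import defaultdict

    # Toy corpus: the document texts are made up.
    docs = {
        0: "how to make chickens",
        1: "how to make bread",
        2: "raising chickens at home",
        3: "weather forecast for tomorrow",
    }

    # Inverted index: term -> set of documents containing it.
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            inverted[term].add(doc_id)

    # Only documents sharing at least one query term can score above zero.
    query = "make chickens".split()
    candidates = set().union(*(inverted[t] for t in query if t in inverted))
    print(candidates)  # {0, 1, 2} -- document 3 never needs to be scored at all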
With only 100 documents, a search engine is probably overkill; you want to pre-compute the TF-IDF vectors and keep them in a numpy matrix. You can then use numpy operations to compute the dot product all at once for all the documents -- it will output a 1x100 vector of the numerators you need. The denominators can similarly be precomputed. A numpy.max(numpy.dot(query, docs)/denom) will then probably be fast enough.
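A sketch of that precomputed setup using scikit-learn's TfidfVectorizer (its rows are L2-normalized by default, so the dot product already is the cosine similarity); the documents here are placeholders:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "how to raise chickens at home",
        "how to bake sourdough bread",
        "chicken coop building guide",
    ]  # stand-ins for your 100 documents

    # Fit once, ahead of time; rows of the matrix are L2-normalized by default.
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)       # sparse (n_docs, n_terms)

    # At query time: one transform and one sparse matrix-vector product.
    query_vec = vectorizer.transform(["How to make chickens"])
    scores = (doc_matrix @ query_vec.T).toarray().ravel()  # cosine similarity per document
    print(scores, "best match:", documents[int(np.argmax(scores))])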
You should profile your code, but I would bet the vector extraction is the slow part; you should only have to do that once for all queries.
If you had thousands or millions of documents to compare against, you could look into scikit-learn's k-nearest-neighbor structures (e.g., BallTree or KDTree), or things like Facebook's FAISS library.
I have 80,000 documents covering a very large number of topics. For every article, I want to provide links recommending other articles (something like the top 5 related articles) that are similar to the one the user is currently reading. I'm not really interested in classifying the documents if I don't have to, just in similarity or relatedness, and ideally I would like to output an 80,000 x 80,000 matrix of all the documents with the corresponding distance (or perhaps correlation? similarity?) to every other document in the set.
I'm currently using NLTK to process the contents of the document and get ngrams, but from there I'm not sure what approach I should take to calculate the similarity between documents.
I read about using TF-IDF and cosine similarity; however, because of the vast number of topics I'm expecting a very high number of unique tokens, so multiplying two very long vectors might be a bad way to go about it. Also, 80,000 documents might call for a lot of multiplication between vectors. (Admittedly, it would only have to be done once, so it's still an option.)
Is there a better way to get the distance between documents without creating a huge vector of ngrams? Spearman correlation? Or would a more low-tech approach, like taking the top ngrams and finding other documents with the same ngrams among their top k ngrams, be more appropriate? I just feel like surely I must be going about the problem in the most brute-force way possible if I need to multiply possibly 10,000-element vectors together roughly 3.2 billion times (the sum of the arithmetic series 79,999 + 79,998 + ... + 1).
Any advice for approaches or what to read up on would be greatly appreciated.
So for K=5 you basically want to return the K nearest neighbors to a particular document? In that case you should use the K-Nearest Neighbors algorithm. Scikit-learn has some good text importing and normalizing routines (TF-IDF), and then it's pretty easy to implement KNN.
The heuristics basically amount to creating normalized word count vectors from all of the words in a document and then comparing the distance between the vectors. I would definitely try a few different distance metrics: Euclidean vs. Manhattan vs. cosine similarity, for instance. The vectors aren't really long; they just sit in a high-dimensional space. So you can fix the unique-words issue you wrote of by doing some dimensionality reduction through PCA or your favorite algorithm.
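A minimal scikit-learn sketch along those lines (the texts and the neighbor count are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    texts = [
        "raising backyard chickens for beginners",
        "building a chicken coop on a budget",
        "baking sourdough bread at home",
        "home fermentation basics",
        "keeping hens healthy through winter",
        "how search engines rank documents",
    ]  # stand-ins for the 80,000 articles

    X = TfidfVectorizer(stop_words="english").fit_transform(texts)

    # Cosine distance (1 - cosine similarity) works well on TF-IDF vectors.
    # n_neighbors includes the document itself, so ask for top-k + 1.
    knn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(X)
    distances, indices = knn.kneighbors(X)

    for i, neighbours in enumerate(indices):
        related = [texts[j] for j in neighbours if j != i]
        print(texts[i], "->", related)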
It's probably equally easy to do this in another package, but the documentation of scikit-learn is top notch and makes it easy to learn quickly and thoroughly.
You should learn about hashing mechanisms that can be used to calculate similarity between documents.
Typical hash functions are designed to minimize collisions, mapping near-duplicates to very different hash keys. With cryptographic hash functions, if the data is changed by even one bit, the hash key changes to a completely different one.
The goal of similarity hashing is to create a similarity hash function. Hash-based techniques for near-duplicate detection are designed with the opposite intent of cryptographic hash algorithms: very similar documents map to very similar hash keys, or even to the same key. The bitwise Hamming distance between keys is then a measure of similarity.
After calculating the hash keys, they can be sorted to speed up near-duplicate detection from O(n²) to O(n log n). A threshold can be defined and tuned by analysing accuracy on training data.
SimHash, MinHash, and locality-sensitive hashing (LSH) are three implementations of hash-based methods. You can google them for more information; there are a lot of research papers on this topic.
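As a rough illustration of the SimHash idea, a small sketch (the 64-bit fingerprint size and whitespace tokenization are arbitrary choices):

    import hashlib

    def simhash(text: str, bits: int = 64) -> int:
        """Similar texts tend to get fingerprints with a small Hamming distance."""
        v = [0] * bits
        for token in text.lower().split():
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
            for i in range(bits):
                v[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if v[i] > 0)

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    k1 = simhash("the quick brown fox jumps over the lazy dog")
    k2 = simhash("the quick brown fox jumped over a lazy dog")
    k3 = simhash("quarterly earnings reports for technology companies")

    print(hamming(k1, k2), hamming(k1, k3))  # the first distance should be noticeably smaller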