I have two columns retrieved from different sources but sharing the same identifier, and I need to check whether they are similar: the values might differ only in spelling, or they might be completely different.
If you want to check whether the two sentences are similar except for spelling differences, then you can use the normalized Levenshtein distance or the string edit distance.
s1 = "Quick brown fox"
s2 = "Quiqk drown fox"
The Levenshtein distance between the two sentences is two.
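For reference, a small pure-Python sketch of the edit-distance computation (libraries such as python-Levenshtein or RapidFuzz provide faster implementations):

# Dynamic-programming Levenshtein distance: number of single-character
# insertions, deletions, and substitutions needed to turn a into b.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 cost if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Quick brown fox", "Quiqk drown fox"))  # 2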
If you want to check for semantic differences, then you will probably have to use a machine-learning-based model. The simplest thing you can do for semantic similarity is to use a model like Sentence2Vec or Doc2Vec, get semantic embeddings for the two sentences, and compute their dot product.
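As a rough sketch of that idea, using gensim's Doc2Vec (my choice here; any sentence-embedding model would do, and a real use case needs a much larger training corpus):

# Train a tiny Doc2Vec model and compare two sentences via the cosine of
# their inferred embeddings. The corpus below is only a toy example.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a fast dark fox leaps over a sleepy dog",
    "a completely unrelated sentence about stock markets",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)

v1 = model.infer_vector("quick brown fox".split())
v2 = model.infer_vector("fast dark fox".split())

# Cosine similarity = dot product of the normalized embeddings
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))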
As shubh gupta noted above me, there are measures of distance between strings. They usually return a magnitude related to the difference in characters or substrings. The Levenshtein distance is one of the most common ones. You can find a really cool article that explains how it works here.
Looking at how your question is stated, I do not think you're looking for the semantic difference between your two input strings; you would need an NLP model to do that. Maybe you can restate your question and provide more information on exactly which difference you want to measure.
I have a dataset of words in a non-semantic context, basically names. I want to group all the similar ones (say samantha, samanta, sammanta, samaynta...) into the same groups.
Since it is a non-semantic context, I cannot vectorize the data using TF-IDF or something else, so I am using the data as it is.
Note that I have tried clustering: I used DBSCAN with a custom distance metric (Levenshtein), and PolyFuzz. Both gave some decent results, but they were not enough; the former produced a lot of misclusterings, and the latter missed a lot of data. I tried searching the internet for ways to approach this, but strangely couldn't find any: they were all in semantic contexts, using TF-IDF and NLP techniques.
Note: the dataset is relatively big (around 400,000 names or more).
I have been stuck on this, and would appreciate help, insight, or suggestions in this regard.
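For reference, the DBSCAN setup I described looks roughly like this on a tiny sample (the full pairwise distance matrix is of course infeasible for ~400,000 names; scikit-learn and rapidfuzz are assumed here):

# DBSCAN over a precomputed Levenshtein distance matrix -- only to show the
# mechanics, not a scalable solution for the full dataset.
import numpy as np
from rapidfuzz.distance import Levenshtein
from sklearn.cluster import DBSCAN

names = ["samantha", "samanta", "sammanta", "samaynta", "john", "jon"]

# Pairwise edit-distance matrix
dist = np.array([[Levenshtein.distance(a, b) for b in names] for a in names])

labels = DBSCAN(eps=2, min_samples=2, metric="precomputed").fit_predict(dist)
print(dict(zip(names, labels)))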
For clustering you need a distance metric, such as the Levenshtein distance. If that does not give you the desired result, you need to use another one.
I would start by defining what you mean by similar: obviously it is not similarity in spelling, as otherwise the Levenshtein distance should work. What else is there? From your example it seems like the initial characters are important, so maybe use a string comparison that is weighted towards the beginning of a word.
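For example, Jaro-Winkler gives extra weight to a shared prefix, so it is biased towards the beginning of the strings (a sketch assuming a recent version of the jellyfish library):

import jellyfish

# Same characters, but only the first pair shares a long prefix
print(jellyfish.jaro_winkler_similarity("samantha", "samanta"))   # high
print(jellyfish.jaro_winkler_similarity("samantha", "anthasam"))  # lower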
Another approach is to use an algorithm tailored to names, such as Soundex.
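A minimal sketch of grouping names by their Soundex code (again assuming jellyfish); names with the same code land in the same bucket:

from collections import defaultdict
import jellyfish

names = ["samantha", "samanta", "sammanta", "samaynta", "john", "jon"]

# Bucket names by their phonetic Soundex code
groups = defaultdict(list)
for name in names:
    groups[jellyfish.soundex(name)].append(name)

print(dict(groups))  # the samantha variants share one code, john/jon another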
I have used the Gensim library to find the similarity between a sentence and a collection of paragraphs (a dataset of texts). I have used cosine similarity, soft cosine similarity and mover measures separately. Gensim returns a list of items including the doc id and a similarity score. For cosine similarity and soft cosine similarity, I guess the similarity score is the cosine between the vectors. Am I right?
In the Gensim documents, they write that it is the semantic relatedness, with no extra explanation. I have searched a lot, but did not find any answer. Any help, please?
Usually by 'similarity', people are seeking a measure of semantic relatedness - but whether the particular values calculated achieve that will depend on lots of other factors, such as the sufficiency of training data & the choice of other appropriate parameters.
Within each code context, 'similarity' has no more and no less meaning than how it's calculated right there - usually, that's 'cosine similarity between vector representations'. (When there are no other hints that it means something different, 'cosine similarity' is typically a safe starting assumption.)
But really: the meaning of 'similarity' at each use is no more and no less than whatever that one code path's docs/source-code dictate.
(I realize that may seem an indirect & unsatisfying answer. If there are specific uses in context in Gensim source/docs/example where the meaning is unclear, you could point those out & I might be able to clarify those more.)
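For the plain cosine-similarity case, the score boils down to something like this toy sketch of the usual Dictionary / TfidfModel / MatrixSimilarity pipeline (the corpus here is made up):

from gensim import corpora, models, similarities

texts = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "for", "text"],
    ["cooking", "recipes", "and", "food"],
]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(bow_corpus)

# MatrixSimilarity returns the cosine between the query vector and each document
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
query = dictionary.doc2bow(["text", "learning"])
print(index[tfidf[query]])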
So imagine I have three text documents, for example (say, 3 randomly generated texts).
Document 1:
"Whole every miles as tiled at seven or. Wished he entire esteem mr oh by. Possible bed you pleasure civility boy elegance ham. He prevent request by if in pleased. Picture too and concern has was comfort. Ten difficult resembled eagerness nor. Same park bore on be...."
Document 2:
"Style too own civil out along. Perfectly offending attempted add arranging age gentleman concluded. Get who uncommonly our expression ten increasing considered occasional travelling. Ever read tell year give may men call its. Piqued son turned fat income played end wicket..."
If I want to obtain in Python (using libraries) a metric of how similar these 2 documents are to a third one (in other words, which of the 2 documents is more similar to the third one), what would be the best way to proceed?
Edit: I have seen other questions that are answered by comparing individual sentences to other sentences, but I am not interested in that, as I want to compare a full text (consisting of related sentences) against another full text and obtain a single number (which, for example, may be bigger than the number obtained from a comparison with a different, less similar document).
There is no simple answer to this question, as similarity measures will perform better or worse depending on the particular task you want to perform.
Having said that, you do have a couple of options for comparing blocks of text. This post compares and ranks several different ways of computing sentence similarity, which you can then aggregate to perform full-document similarity. How to aggregate will also depend on your particular task. A simple, but often well-performing, approach is to compute the average of the sentence similarities between the 2 (or more) documents, as sketched below.
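As a rough sketch of that averaging idea, using TF-IDF sentence vectors and scikit-learn's cosine similarity (just one possible choice; any sentence-embedding model could be substituted):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_a = "Whole every miles as tiled at seven or. Wished he entire esteem mr oh by."
doc_b = "Style too own civil out along. Perfectly offending attempted add arranging age."

# Naive sentence split on full stops, just for the sketch
sents_a = [s.strip() for s in doc_a.split(".") if s.strip()]
sents_b = [s.strip() for s in doc_b.split(".") if s.strip()]

vectorizer = TfidfVectorizer().fit(sents_a + sents_b)
vecs_a = vectorizer.transform(sents_a)
vecs_b = vectorizer.transform(sents_b)

# All pairwise sentence similarities, aggregated by a simple mean into one
# document-level score
print(cosine_similarity(vecs_a, vecs_b).mean())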
Other useful links for this topic include:
Introduction to Information Retrieval (free book)
Doc2Vec (from gensim, for paragraph embeddings, which is probably very suitable for your case)
You could try the Simphile NLP text similarity library (disclosure: I'm the author). It offers several language-agnostic methods: JaccardSimilarity, CompressionSimilarity, EuclidianSimilarity. Each has its advantages, but all work well on full-document comparison:
Install:
pip install simphile
This example shows Jaccard, but is exactly the same with Euclidian or Compression:
from simphile import jaccard_similarity
text_a = "I love dogs"
text_b = "I love cats"
print(f"Jaccard Similarity: {jaccard_similarity(text_a, text_b)}")
I would need to find something like the opposite of model.most_similar()
While most_similar() returns an array of words most similar to the one given as input, I need to find a sort of "center" of a list of words.
Is there a function in gensim or any other tool that could help me?
Example:
Given {'chimichanga', 'taco', 'burrito'}, the center would maybe be mexico or food, depending on the corpus that the model was trained on.
If you supply a list of words as the positive argument to most_similar(), it will report words closest to their mean (which would seem to be one reasonable interpretation of the words' 'center').
For example:
sims = model.most_similar(positive=['chimichanga', 'taco', 'burrito'])
(I somewhat doubt the top result sims[0] here will be either 'mexico' or 'food'; it's most likely to be another mexican-food word. There isn't necessarily a "more generic"/hypernym relation to be found either between word2vec words, or in certain directions... but some other embedding techniques, such as hyperbolic embeddings, might provide that.)
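A slightly fuller sketch, assuming a pretrained model fetched with gensim's downloader API (the model name, and the example words being in its vocabulary, are assumptions on my part):

import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # large download on first use

# Words closest to the mean of the three input vectors -- one reasonable
# notion of their "center"
sims = model.most_similar(positive=['chimichanga', 'taco', 'burrito'])
print(sims[:5])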
I have a list of controlled vocabularies, e.g., term1, term2, termN... A document may have one or more controlled vocabulary terms, but each term may occur only once per document.
Let's say the total set of controlled vocabulary terms is Term1, Term2, Term3, Term4, Term5, Term6.
Doc 1 (4 terms) : term1, term2, term5, term6
Doc 2 (2 terms) : term2, term5
Option 1:
The Jaccard approach looks at the two data sets and counts the instances where both values are equal to 1. Therefore, I could convert the presence of a controlled term (terms 1-6) in a document into a binary vector of 1s and 0s, then compute the similarity based on Jaccard (http://docs.scipy.org/doc/scipy-0.17.0/reference/generated/scipy.spatial.distance.jaccard.html).
Doc1:{1,1,0,0,1,1}
Doc2:{0,1,0,0,1,0}
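In code, Option 1 would be something like this (a sketch using the scipy function linked above; note that scipy's jaccard() returns a dissimilarity):

from scipy.spatial.distance import jaccard

doc1 = [1, 1, 0, 0, 1, 1]  # term1, term2, term5, term6
doc2 = [0, 1, 0, 0, 1, 0]  # term2, term5

# Jaccard similarity = 1 - Jaccard dissimilarity
print(1 - jaccard(doc1, doc2))  # 0.5 for these two vectors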
Option 2: use cosine similarity based on TF-IDF, as in http://brandonrose.org/clustering
Among these options (or perhaps other similarity measures), which measure is suitable for computing similarity between documents based on controlled vocabularies? I am new to data mining; any suggestions will be appreciated.
It won't let me leave a comment, so I will leave an answer. I do something similar, but in R, and find this helpful:
http://text2vec.org/similarity.html#cosine_similarity
I do not know if there is a "right answer". I would try the different approaches and see which yields the answer most similar to a human's judgment. I think "Euclidean distance" may be best, but I don't know if that is available to you.