What's the best WordNet function for similarity between words? - python

I'm trying to find the similarities between roughly 10,000 words. I'm using the "word.path_similarity(otherword)" method of the WordNet library, but the results I'm getting for path_similarity fall in the range 0-0.1 rather than being distributed over 0-1. How is it possible that similarities between 10,000 random words all end up in that narrow range?
Is there a better way to use WordNet for finding similarity between two words?

For context, here's how this is calculated:
Calculate the length of the shortest path between the two synsets/words (inclusive).
Return the score as 1 / path_length.
Therefore a score < 0.2 indicates a path length of more than 5 steps. Counting the two input synsets themselves, that means there are at least 4 synsets between them.
With that said: your complaint seems to be "according to this metric, two words chosen at random are pretty consistently unrelated! What's going on?" Well, your similarity metric is telling you that random words are generally not closely related. This shouldn't be surprising. Why are you calculating similarities between random words to begin with?
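For reference, here's a minimal sketch of how that looks with NLTK's WordNet interface (the example words and the max-over-senses helper are just illustrative):

from nltk.corpus import wordnet as wn

# path_similarity compares two specific synsets (word senses), not words
dog = wn.synsets('dog')[0]
cat = wn.synsets('cat')[0]
car = wn.synsets('car')[0]

print(dog.path_similarity(cat))  # related nouns: noticeably higher score
print(dog.path_similarity(car))  # unrelated nouns: well below 0.2

# A word-level score can take the best match over all sense pairs
def word_path_similarity(w1, w2):
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(w1)
              for s2 in wn.synsets(w2)]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None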

Related

Sentence meaning similarity in python

I want to calculate sentence similarity based on meaning. I am using cosine similarity, but this method does not fulfill my needs.
For example, if I have these two sentences.
He and his father are very close.
He shares a wonderful bond with his father.
What I need is to calculate the similarity between these sentences based on their meaning, not just by matching similar words.
Is there a way to do this?
One approach would be to represent each word using pre-trained word vectors ("embeddings"). These are vectors with a few hundred dimensions where words with similar meaning (e.g., "close", "bond") should have similar vectors. The key idea is that word embeddings could represent that the two sentences have similar meaning even though they use different words.
This could be done quickly in a package such as spaCy in Python. See https://spacy.io/usage/vectors-similarity
Common pre-trained vectors include the Google News word embeddings (https://github.com/mmihaltz/word2vec-GoogleNews-vectors) and GloVe embeddings (https://nlp.stanford.edu/projects/glove/).
Here's a simple approach: represent each word by its pre-trained embedding and average the word vectors across the sentence. Then compare the resulting sentence vectors using any reasonable distance measure (cosine similarity is standard).
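As a rough sketch of that averaging approach with spaCy (it assumes a model that ships with word vectors, such as en_core_web_md, is installed):

import spacy

# Needs a vectors-enabled model, e.g.: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

doc1 = nlp("He and his father are very close.")
doc2 = nlp("He shares a wonderful bond with his father.")

# Doc.similarity() compares the averaged token vectors with cosine similarity
print(doc1.similarity(doc2))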

How to choose a fuzzy matching algorithm?

I need to know the criteria that differentiate these three fuzzy matching algorithms:
Levenshtein distance Algorithm
Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
Damerau–Levenshtein distance
Damerau–Levenshtein distance is a distance (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters.
Bitap algorithm with modifications by Wu and Manber
The Bitap algorithm is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance: if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal.
My document is a table with company names; some companies appear two or three times because of misspellings. In this particular case, how do I group the companies by matching them? Which algorithm should I choose, and why? The file has 100k lines and it is growing.
If you are on Google Sheets, try using Flookup. You say that your list has over 100K rows so that might prove a bit of a challenge depending on Google's (timed) execution limits but I still encourage you to give it a try.
One function you might be interested in is:
FLOOKUP(lookupValue, tableArray, lookupCol, indexNum, [threshold], [rank])
Full Disclosure: I created Flookup for Google Sheets.
If you want to group the companies, you could use locality-sensitive hashing or a clustering method such as K-medoids with, e.g., Levenshtein edit distance as the metric. Alternatively, you can use SymSpell.
Levenshtein and Damerau–Levenshtein distance are both good metrics for string similarity, but make sure you use a fast implementation. There are too many popular and insanely slow implementations on GitHub. The best I know of are polyleven and editdistance.
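To illustrate the grouping idea, here is a rough sketch using the editdistance package; the distance threshold and the greedy strategy are assumptions you would adapt to your data:

import editdistance  # pip install editdistance

def group_companies(names, max_dist=2):
    # Greedy grouping: a name joins the first group whose representative
    # is within max_dist edits. This is O(n * groups); for 100k+ rows you
    # would want blocking, LSH or SymSpell rather than brute force.
    groups = []  # list of (representative, members) pairs
    for name in names:
        key = name.lower().strip()
        for rep, members in groups:
            if editdistance.eval(rep, key) <= max_dist:
                members.append(name)
                break
        else:
            groups.append((key, [name]))
    return [members for _, members in groups]

print(group_companies(["Acme Inc", "Acme Inc.", "ACME Incc", "Globex"]))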

Find similarity with doc2vec like word2vec

Is there a way to find similar docs, like we do in word2vec?
Like:
model2.most_similar(positive=['good', 'nice', 'best'],
                    negative=['bad', 'poor'],
                    topn=10)
I know we can use infer_vector and feed the results in to get similar ones, but I want to feed many positive and negative examples as we do in word2vec.
Is there any way we can do that? Thanks!
The doc-vectors part of a Doc2Vec model works just like word-vectors, with respect to a most_similar() call. You can supply multiple doc-tags or full vectors inside both the positive and negative parameters.
So you could call...
sims = d2v_model.docvecs.most_similar(positive=['doc001', 'doc009'], negative=['doc102'])
...and it should work. The elements of the positive or negative lists could be doc-tags that were present during training, or raw vectors (like those returned by infer_vector(), or your own averages of multiple such vectors).
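For documents not seen during training, a minimal sketch along the same lines (this assumes d2v_model is an already-trained gensim Doc2Vec model and uses gensim 3.x naming; in gensim 4.x docvecs is called dv):

# Infer vectors for new text (tokenisation here is deliberately naive)
pos_vec = d2v_model.infer_vector("he shares a wonderful bond with his father".split())
neg_vec = d2v_model.infer_vector("they barely know each other".split())

# Raw vectors can be mixed with training-time doc-tags in positive/negative
sims = d2v_model.docvecs.most_similar(positive=[pos_vec, 'doc001'],
                                      negative=[neg_vec],
                                      topn=10)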
I don't believe there is a pre-written function for this.
One approach would be to write a function that iterates through each word in the positive list to get top n words for a particular word.
So for positive words in your question example, you would end up with 3 lists of 10 words.
You could then identify words that are common across the 3 lists as the top n most similar to your positive list. Since not all words will be common across the 3 lists, you probably need to get the top 20 similar words when iterating so that you end up with the top 10 words, as in your example.
Then do the same for negative words.
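A rough sketch of that procedure for the positive side (the function name and the per-word cutoff of 20 are just placeholders; model.wv.most_similar is the standard gensim call for word vectors):

def common_similar_words(model, positive, topn=10, per_word=20):
    # For each positive word, collect its per_word nearest neighbours,
    # then keep only the words that appear in every neighbour list.
    neighbour_sets = []
    for word in positive:
        neighbours = {w for w, _ in model.wv.most_similar(word, topn=per_word)}
        neighbour_sets.append(neighbours)
    common = set.intersection(*neighbour_sets)
    return list(common)[:topn]

# e.g. common_similar_words(model2, ['good', 'nice', 'best'])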

How to use NLTK BigramAssocMeasures.chi_sq

I have a list of words, and I want to calculate the relatedness of two words by considering their co-occurrences. From a paper I found that it can be calculated using Pearson's chi-square test. I also found nltk.BigramAssocMeasures.chi_sq() for calculating the chi-square value.
Can I use this for my needs? How can I find the chi-square value using NLTK?
Have a look at this blog post from Streamhacker; it gives a good explanation with code examples.
One of the best metrics for information gain is chi square. NLTK includes this in the BigramAssocMeasures class in the metrics package. To use it, first we need to calculate a few frequencies for each word: its overall frequency and its frequency within each class. This is done with a FreqDist for overall frequency of words, and a ConditionalFreqDist where the conditions are the class labels. Once we have those numbers, we can score words with the BigramAssocMeasures.chi_sq function, then sort the words by score and take the top 10000. We then put these words into a set, and use a set membership test in our feature selection function to select only those words that appear in the set. Now each file is classified based on the presence of these high information words.
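In code, the scoring step described above might look roughly like this (labelled_docs, an iterable of (label, word_list) pairs, is an assumed input; the cut-off of 10,000 follows the quote above):

from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import BigramAssocMeasures

word_fd = FreqDist()
label_word_fd = ConditionalFreqDist()

for label, words in labelled_docs:
    for word in words:
        word_fd[word] += 1
        label_word_fd[label][word] += 1

total_count = word_fd.N()
word_scores = {}
for word, freq in word_fd.items():
    score = 0
    for label in label_word_fd.conditions():
        score += BigramAssocMeasures.chi_sq(
            label_word_fd[label][word],        # count of word within this label
            (freq, label_word_fd[label].N()),  # overall word count, total words in label
            total_count)                       # total words in the corpus
    word_scores[word] = score

best_words = set(sorted(word_scores, key=word_scores.get, reverse=True)[:10000])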

Using WordNet to determine semantic similarity between two texts?

How can you determine the semantic similarity between two texts in Python using WordNet?
The obvious preprocessing would be removing stop words and stemming, but then what?
The only way I can think of would be to calculate the WordNet path distance between each word in the two texts. This is standard for unigrams. But these are large (400-word) texts: natural language documents with words that are not in any particular order or structure (other than that imposed by English grammar). So, which words would you compare between the texts? How would you do this in Python?
One thing that you can do is:
Kill the stop words
Find as many words as possible that have maximal intersections of synonyms and antonyms with those of other words in the same doc. Let's call these "the important words"
Check to see whether the sets of important words of the two documents are the same. The closer the sets are, the more semantically similar your documents are (a rough sketch follows below).
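A very rough sketch of those steps with NLTK's WordNet (the overlap scoring and the Jaccard comparison at the end are assumptions, not a standard measure):

from nltk.corpus import wordnet as wn, stopwords

STOP = set(stopwords.words('english'))

def related_forms(word):
    # All synonym and antonym lemma names WordNet knows for this word
    forms = set()
    for syn in wn.synsets(word):
        for lemma in syn.lemmas():
            forms.add(lemma.name())
            forms.update(a.name() for a in lemma.antonyms())
    return forms

def important_words(text, top_k=20):
    words = {w for w in text.lower().split() if w.isalpha() and w not in STOP}
    forms = {w: related_forms(w) for w in words}
    # Score each word by how much its synonym/antonym set overlaps the others'
    scores = {w: sum(len(forms[w] & forms[v]) for v in forms if v != w) for w in forms}
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])

def doc_similarity(text1, text2):
    a, b = important_words(text1), important_words(text2)
    return len(a & b) / len(a | b) if (a | b) else 0.0  # Jaccard overlap of important words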
There is another way. Compute sentence trees out of the sentences in each doc. Then compare the two forests. I did some similar work for a course a long time ago. Here's the code (keep in mind this was a long time ago and it was for class. So the code is extremely hacky, to say the least).
Hope this helps
