How do I get the probability of a string being similar to another string in Python?
I want to get a decimal value like 0.9 (meaning 90%), etc. Preferably with standard Python and its standard library.
e.g.
similar("Apple","Appel") #would have a high prob.
similar("Apple","Mango") #would have a lower prob.
There is a built-in.
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
Solution #1: Python built-in
Use SequenceMatcher from difflib.
Pros: native Python library, no extra package needed.
Cons: too limited; there are many other good string-similarity algorithms out there.
Example:
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
Solution #2: jellyfish library
It's a very good library with good coverage and few issues.
It supports:
- Levenshtein Distance
- Damerau-Levenshtein Distance
- Jaro Distance
- Jaro-Winkler Distance
- Match Rating Approach Comparison
- Hamming Distance
Pros: easy to use, a gamut of supported algorithms, well tested.
Cons: not a native library.
Example:
>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1
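Since the question asks for a value between 0 and 1, one simple option (a small sketch, not something jellyfish provides directly) is to normalize the Levenshtein distance by the longer string's length:
import jellyfish

def similar(a, b):
    # 1 minus the edit distance scaled by the longer string's length
    return 1 - jellyfish.levenshtein_distance(a, b) / max(len(a), len(b))

# similar('Apple', 'Appel') -> 1 - 2/5 = 0.6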
I think you may be looking for an algorithm describing the distance between strings. Here are some you may refer to (a minimal pure-Python Levenshtein sketch follows the list):
Hamming distance
Levenshtein distance
Damerau–Levenshtein distance
Jaro–Winkler distance
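For reference, here is a minimal pure-Python Levenshtein distance using the standard dynamic-programming recurrence (a sketch, standard library only):
def levenshtein(a, b):
    # classic edit distance: insertions, deletions and substitutions all cost 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

# levenshtein("Apple", "Appel") == 2; 1 - 2/5 gives a 0.6 similarity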
TheFuzz is a package that implements Levenshtein distance in Python, with some helper functions to help in certain situations where you may want two distinct strings to be considered identical. For example:
>>> from thefuzz import fuzz
>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
100
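fuzz.ratio returns an integer between 0 and 100, so to get the decimal the question asks for, a small wrapper (a sketch, not part of TheFuzz itself) can rescale it:
from thefuzz import fuzz

def similar(a, b):
    # scale TheFuzz's 0-100 score down to the 0-1 range
    return fuzz.ratio(a, b) / 100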
You can create a function like:
def similar(w1, w2):
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))
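With the question's examples, this position-by-position comparison gives:
>>> similar("Apple", "Appel")
0.6
>>> similar("Apple", "Mango")
0.0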
Note that difflib.SequenceMatcher only finds the longest contiguous matching subsequence; this is often not what is desired. For example:
>>> a1 = "Apple"
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012 # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)] # only the first block is recorded
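A large part of that low ratio comes from difflib's autojunk heuristic, which treats characters that appear very frequently as junk once the sequence is at least 200 items long. Disabling it should help for this kind of repeated input, although the contiguous-block behaviour itself remains:
>>> SequenceMatcher(None, a1, a2, autojunk=False).ratio()  # noticeably higher than 0.012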
Finding the similarity between two strings is closely related to the concept of pairwise sequence alignment in bioinformatics. There are many dedicated libraries for this, including Biopython. This example implements the Needleman-Wunsch algorithm:
>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'
Using Biopython or another bioinformatics package is more flexible than any part of the Python standard library, since many different scoring schemes and algorithms are available. You can also get the actual alignment to visualise what is happening:
>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-
|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el
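With PairwiseAligner's default scoring (1 per matching letter, 0 otherwise), the raw score can be scaled into a 0-1 value by dividing by the longer sequence's length, e.g.:
>>> aligner.score(a1, a2) / max(len(a1), len(a2))
0.8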
The distance package includes Levenshtein distance:
import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3
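As with the other distance-based answers, this can be turned into the 0-1 similarity the question asks for by normalizing by the longer string (a small sketch on top of the package's levenshtein call shown above):
import distance

def similar(a, b):
    # 1 minus the edit distance scaled by the longer string's length
    return 1 - distance.levenshtein(a, b) / max(len(a), len(b))

# similar("lenvestein", "levenshtein") -> 1 - 3/11, roughly 0.73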
You can find most of the text similarity methods and how they are calculated at this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity
Here are some of the measures it covers (normalized and metric similarities and distances, as well as shingle/n-gram based ones):
Levenshtein
Normalized Levenshtein
Weighted Levenshtein
Damerau-Levenshtein
Optimal String Alignment
Jaro-Winkler
Longest Common Subsequence
Metric Longest Common Subsequence
N-Gram
Shingle (n-gram) based algorithms
Q-Gram
Cosine similarity
Jaccard index
Sorensen-Dice coefficient
Overlap coefficient (i.e., Szymkiewicz-Simpson)
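For reference, the library behind that link is published on PyPI as strsimpy (an assumption worth checking against the README), so a normalized Levenshtein similarity might look like:
from strsimpy.normalized_levenshtein import NormalizedLevenshtein

normalized_levenshtein = NormalizedLevenshtein()
print(normalized_levenshtein.similarity('Apple', 'Appel'))  # expected around 0.6 (1 - 2/5)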
The built-in SequenceMatcher is very slow on large input; here's how it can be done with diff-match-patch:
from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0
    diff = dmp.diff_main(text1, text2, False)
    # similarity
    common_text = sum(len(txt) for op, txt in diff if op == 0)
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length
    return sim, diff
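Used on the question's example it might look like this (the exact diff, and hence the value, depends on how diff-match-patch splits the strings):
sim, diff = compute_similarity_and_diff("Apple", "Appel")
print(sim)  # fraction of characters the diff marks as equal, roughly 0.8 here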
BLEUscore
BLEU, or the Bilingual Evaluation Understudy, is a score for comparing
a candidate translation of text to one or more reference translations.
A perfect match results in a score of 1.0, whereas a perfect mismatch
results in a score of 0.0.
Although developed for translation, it can be used to evaluate text
generated for a suite of natural language processing tasks.
Code:
import nltk
from nltk.translate import bleu
from nltk.translate.bleu_score import SmoothingFunction
smoothie = SmoothingFunction().method4
C1='Text'
C2='Best'
print('BLEUscore:',bleu([C1], C2, smoothing_function=smoothie))
Examples (updating C1 and C2):
C1='Test' C2='Test'
BLEUscore: 1.0
C1='Test' C2='Best'
BLEUscore: 0.2326589746035907
C1='Test' C2='Text'
BLEUscore: 0.2866227639866161
You can also compare sentence similarity:
C1='It is tough.' C2='It is rough.'
BLEUscore: 0.7348889200874658
C1='It is tough.' C2='It is tough.'
BLEUscore: 1.0
Textdistance:
TextDistance is a Python library for comparing the distance between two or more sequences with many algorithms. It has:
30+ algorithms
Pure Python implementation
Simple usage
Comparison of more than two sequences
Some algorithms have more than one implementation in one class
Optional numpy usage for maximum speed
Example 1:
import textdistance
textdistance.hamming('test', 'text')
Output:
1
Example 2:
import textdistance
textdistance.hamming.normalized_similarity('test', 'text')
Output:
0.75
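The same normalized_similarity helper is available for the other algorithms, so the question's examples can be answered directly, for instance:
import textdistance

print(textdistance.levenshtein.normalized_similarity('Apple', 'Appel'))  # 0.6
print(textdistance.levenshtein.normalized_similarity('Apple', 'Mango'))  # 0.0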
There are many metrics to define similarity and distance between strings, as mentioned above. I will give my two cents by showing an example of Jaccard similarity with Q-grams and an example of edit distance.
The libraries
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.metrics.distance import edit_distance
Jaccard Similarity
1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Appel', 2)))
and we get:
0.33333333333333337
And for the Apple and Mango
1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Mango', 2)))
and we get:
0.0
Edit Distance
edit_distance('Apple', 'Appel')
and we get:
2
And finally,
edit_distance('Apple', 'Mango')
and we get:
5
Cosine Similarity on Q-Grams (q=2)
Another solution is to work with the textdistance library. I will provide an example of Cosine Similarity
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
and we get:
0.5
Adding the spaCy NLP library to the mix as well:
import spacy
import jellyfish
from difflib import SequenceMatcher

#profile
def main():
    str1 = "Mar 31 09:08:41 The world is beautiful"
    str2 = "Mar 31 19:08:42 Beautiful is the world"
    print("NLP Similarity=", nlp(str1).similarity(nlp(str2)))
    print("Diff lib similarity", SequenceMatcher(None, str1, str2).ratio())
    print("Jellyfish lib similarity", jellyfish.jaro_distance(str1, str2))

if __name__ == '__main__':
    # python3 -m spacy download en_core_web_sm
    # nlp = spacy.load("en_core_web_sm")
    nlp = spacy.load("en_core_web_md")
    main()
Run with Robert Kern's line_profiler
kernprof -l -v ./python/loganalysis/testspacy.py
NLP Similarity= 0.9999999821467294
Diff lib similarity 0.5897435897435898
Jellyfish lib similarity 0.8561253561253562
However, the timings are revealing:
Function: main at line 32
Line # Hits Time Per Hit % Time Line Contents
==============================================================
32 #profile
33 def main():
34 1 1.0 1.0 0.0 str1= "Mar 31 09:08:41 The world is beautiful"
35 1 0.0 0.0 0.0 str2= "Mar 31 19:08:42 Beautiful is the world"
36 1 43248.0 43248.0 99.1 print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
37 1 375.0 375.0 0.9 print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio())
38 1 30.0 30.0 0.1 print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))
Here's what I thought of:
import string

def match(a, b):
    a, b = a.lower(), b.lower()
    error = 0
    for i in string.ascii_lowercase:
        error += abs(a.count(i) - b.count(i))
    total = len(a) + len(b)
    return (total - error) / total

if __name__ == "__main__":
    print(match("pple inc", "Apple Inc."))
Python 3.6+
No external library needed
Works well in most scenarios
On Stack Overflow, when you try to add a tag or post a question, it brings up all the relevant stuff. This is convenient and is exactly the kind of algorithm I was looking for. Therefore, I coded a query-set similarity filter.
def compare(qs, ip):
    al = 2
    v = 0
    for ii, letter in enumerate(ip):
        if letter == qs[ii]:
            v += al
        else:
            ac = 0
            for jj in range(al):
                if ii - jj < 0 or ii + jj > len(qs) - 1:
                    break
                elif letter == qs[ii - jj] or letter == qs[ii + jj]:
                    ac += jj
                    break
            v += ac
    return v

def getSimilarQuerySet(queryset, inp, length):
    scores = {it: compare(it, inp) for it in queryset}
    ranked = reversed(sorted(scores.items(), key=lambda item: item[1]))
    return [k for k, v in ranked][:length]
if __name__ == "__main__":
    print(compare('apple', 'mongo'))
    # 0
    print(compare('apple', 'apple'))
    # 10
    print(compare('apple', 'appel'))
    # 7
    print(compare('dude', 'ud'))
    # 1
    print(compare('dude', 'du'))
    # 4
    print(compare('dude', 'dud'))
    # 6
    print(compare('apple', 'mongo'))
    # 2
    print(compare('apple', 'appel'))
    # 8
    print(getSimilarQuerySet(
        [
            "java",
            "jquery",
            "javascript",
            "jude",
            "aja",
        ],
        "ja",
        2,
    ))
    # ['javascript', 'java']
Explanation
compare takes two strings and returns a positive integer.
You can edit the al (allowed) variable in compare; it indicates how large a range we need to search through. It works like this: the two strings are iterated over; if the same character is found at the same index, the accumulator gets the largest addition. Otherwise, we search within the allowed index range and, if a match is found, add to the accumulator based on how far away the letter is (the further, the smaller the addition).
length indicates how many items you want as the result, i.e. the entries most similar to the input string.
I have my own, for my purposes, which is 2x faster than difflib's SequenceMatcher.quick_ratio(), while providing similar results. a and b are strings:
score = 0
for ch in a:
    score += b.count(ch)
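The raw count above isn't bounded, so to get a 0-1 figure comparable with the other answers, a bounded variant of the same idea (essentially what quick_ratio() itself computes) uses a character multiset intersection; a sketch:
from collections import Counter

def quick_score(a, b):
    # shared character multiset, scaled like difflib's 2*M/T formula
    if not a and not b:
        return 1.0
    common = sum((Counter(a) & Counter(b)).values())
    return 2 * common / (len(a) + len(b))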
Related
This code takes 9 seconds, which is a very long time. I guess the problem is the two nested loops in my code:
for symptom in symptoms:
    # check if the symptom is mentioned in the user text
    norm_symptom = symptom.replace("_", " ")
    for combin in list_of_combinations:
        print(getSimilarity([combin, norm_symptom]))
        if getSimilarity([combin, norm_symptom]) > 0.25:
            if symptom not in extracted_symptoms:
                extracted_symptoms.append(symptom)
I tried to use zip like this:
for symptom, combin in zip(symptoms, list_of_combinations):
    norm_symptom = symptom.replace("_", " ")
    if (getSimilarity([combin, norm_symptom]) > 0.25 and symptom not in extracted_symptoms):
        extracted_symptoms.append(symptom)
Indeed, your algorithm is slow because of the two nested loops.
It performs in O(N*M) (see more here: https://www.freecodecamp.org/news/big-o-notation-why-it-matters-and-why-it-doesnt-1674cfa8a23c/),
N being the length of symptoms
and M being the length of list_of_combinations.
What can also take time is the getSimilarity computation itself; what is this operation?
Use a dict to store the result of getSimilarity for each combination and symptom, so you avoid calling getSimilarity multiple times for the same pair. This makes it more efficient, and thus faster.
import collections

similarity_results = collections.defaultdict(dict)

for symptom in symptoms:
    norm_symptom = symptom.replace("_", " ")
    for combin in list_of_combinations:
        # Check if the similarity has already been computed
        if combin in similarity_results[symptom]:
            similarity = similarity_results[symptom][combin]
        else:
            similarity = getSimilarity([combin, norm_symptom])
            similarity_results[symptom][combin] = similarity
        if similarity > 0.25:
            if symptom not in extracted_symptoms:
                extracted_symptoms.append(symptom)
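If getSimilarity is a pure function of its two strings, functools.lru_cache gives the same memoization with less bookkeeping, and any() lets the loop stop at the first match (a sketch under that assumption):
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_similarity(combin, norm_symptom):
    # assumes getSimilarity is deterministic and takes the two strings as a list
    return getSimilarity([combin, norm_symptom])

for symptom in symptoms:
    if symptom in extracted_symptoms:
        continue
    norm_symptom = symptom.replace("_", " ")
    if any(cached_similarity(combin, norm_symptom) > 0.25 for combin in list_of_combinations):
        extracted_symptoms.append(symptom)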
Update:
Alternatively, you could use an algorithm based on the Levenshtein distance, which is a measure of the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. The python-Levenshtein library does that:
import Levenshtein

def getSimilarity(s1, s2):
    distance = Levenshtein.distance(s1, s2)
    return 1.0 - (distance / max(len(s1), len(s2)))

extracted_symptoms = []
for symptom, combin in zip(symptoms, list_of_combinations):
    norm_symptom = symptom.replace("_", " ")
    if (getSimilarity(combin, norm_symptom) > 0.25) and (symptom not in extracted_symptoms):
        extracted_symptoms.append(symptom)
According to https://code.google.com/archive/p/word2vec/:
It was recently shown that the word vectors capture many linguistic
regularities, for example vector operations vector('Paris') -
vector('France') + vector('Italy') results in a vector that is very
close to vector('Rome'), and vector('king') - vector('man') +
vector('woman') is close to vector('queen') [3, 1]. You can try out a
simple demo by running demo-analogy.sh.
So we can try from the supplied demo script:
+ ../bin/word-analogy ../data/text8-vector.bin
Enter three words (EXIT to break): paris france berlin
Word: paris Position in vocabulary: 198365
Word: france Position in vocabulary: 225534
Word: berlin Position in vocabulary: 380477
Word Distance
------------------------------------------------------------------------
germany 0.509434
european 0.486505
Please note that paris france berlin is the input hint that the demo suggests. The problem is that I'm unable to reproduce this behavior if I open the same word vectors in Gensim and try to compute the vectors myself. For example:
>>> word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)
>>> v = word_vectors['paris'] - word_vectors['france'] + word_vectors['berlin']
>>> word_vectors.most_similar(np.array([v]))
[('berlin', 0.7331711649894714), ('paris', 0.6669869422912598), ('kunst', 0.4056406617164612), ('inca', 0.4025722146034241), ('dubai', 0.3934606909751892), ('natalie_portman', 0.3909246325492859), ('joel', 0.3843030333518982), ('lil_kim', 0.3784593939781189), ('heidi', 0.3782389461994171), ('diy', 0.3767407238483429)]
So, what is the word analogy actually doing? How should I reproduce it?
It should be just element-wise addition and subtraction of vectors, with cosine distance to find the most similar ones.
However, if you use the original word2vec embeddings, there is a difference between "paris" and "Paris" (the strings were not lowercased or lemmatised).
You may also try:
v = word_vectors['France'] - word_vectors['Paris'] + word_vectors['Berlin']
or
v = word_vectors['Paris'] - word_vectors['France'] + word_vectors['Germany']
because you should compare identical concepts (city - country + country -> another city)
You should be clear about exactly which word-vector set you're using: different sets will have a different ability to perform well on analogy tasks. (Those trained on the tiny text8 dataset might be pretty weak; the big GoogleNews set Google released would probably do well, at least under certain conditions like discarding low-frequency words.)
You're doing the wrong arithmetic for the analogy you're trying to solve. For an analogy "A is to B as C is to ?" often written as:
A : B :: C : _?_
You begin with 'B', subtract 'A', then add 'C'. So the example:
France : Paris :: Italy : _?_
...gives the formula in your excerpted text:
wv('Paris') - wv('France') + wv('Italy') = target_coordinates # close-to wv('Rome')
And to solve instead:
Paris : France :: Berlin : _?_
You would try:
wv('France') - wv('Paris') + wv('Berlin') = target_coordinates
...then see what's closest to target_coordinates. (Note the difference in operation-ordering to your attempt.)
You can think of it as:
start at a country-vector ('France')
subtract the (country&capital)-vector ('Paris'). This leaves you with an interim vector that's, sort-of, "zero" country-ness, and "negative" capital-ness.
add another (country&capital)-vector ('Berlin'). This leaves you with a result vector that's, again sort-of, "one" country-ness, and "zero" capital-ness.
Note also that gensim's most_similar() takes multiple positive and negative word-examples, to do the arithmetic for you. So you can just do:
sims = word_vectors.most_similar(positive=['France', 'Berlin'], negative=['Paris'])
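Putting that together with the loading step from the question (BIGDATA being whatever vector file was loaded there), a minimal sketch would be:
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format(BIGDATA, binary=True)
# most_similar works on unit-normalized vectors and excludes the input words from the results
print(word_vectors.most_similar(positive=['France', 'Berlin'], negative=['Paris'], topn=3))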
I have two sets of WordNet synsets (contained in two separate list objects, s1 and s2), from which I want to find, for each synset in the first set (s1), the maximum path-similarity score against the second set (s2), with the length of the output equal to that of s1. For example, if s1 contains 4 synsets, then the length of the output should be 4; conversely, when s2 enters the function first (meaning that s1 swaps position with s2), the output length should equal that of s2.
I have experimented with the following code (so far).
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
#two wordnet synsets (s1, s2)
s1 = [wn.synset('multiple_sclerosis.n.01'),
wn.synset('stewart.n.01'),
wn.synset('head.n.04'),
wn.synset('executive.n.01'),
wn.synset('washington.n.02'),
wn.synset('not.r.01'),
wn.synset('expect.v.01'),
wn.synset('attend.v.01')]
s2 = [wn.synset('multiple_sclerosis.n.01'),
wn.synset('stewart.n.01'),
wn.synset('sixty-one.s.01'),
wn.synset('information_technology.n.01'),
wn.synset('head.n.04'),
wn.synset('executive.n.01'),
wn.synset('military_officer.n.01'),
wn.synset('president.n.04'),
wn.synset('make.v.01'),
wn.synset('not.r.01'),
wn.synset('attend.v.01')]
# define a function to find the highest path similarity score for each synset in s1 onto s2,
# with the length of the output equal to that of s1
ps_list = []

def similarity_score(s1, s2):
    for word1 in s1:
        best = max(wn.path_similarity(word1, word2) for word2 in s2)
        ps_list.append(best)
    return ps_list
similarity_score(s1, s2) # this one works fine
similarity_score(s2, s1) # this one returns a nan
However, as noted at the last line of my code, when synset set s2 (containing 11 synsets) enters the function first, the function returns a nan. I could not figure out what caused the problem. I am not sure whether this is because I am dealing with synsets of different lengths and some synsets in the longer set cannot find a match, hence causing the nan, or whether there is something wrong with my for loop.
It would be really appreciated if someone could help clarify this to me and suggest an alternative solution (something that returns a number, a float, to be precise), so that I can apply this function to other synsets.
Thank you.
There's not much I can see wrong with your code, except for the usage of the ps_list variable (which doesn't get cleared between calls to similarity_score()).
If we change ps_list to a dictionary we can check the best score for each word. The following code does that as a test:
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
from pprint import pprint
nltk.download("wordnet", "C:/Users/MackayA/Documents/Visual Studio 2017/Projects/PythonApplication9/PythonApplication9/nltk_data")
nltk.data.path.append("C:/Users/MackayA/Documents/Visual Studio 2017/Projects/PythonApplication9/PythonApplication9/nltk_data")
#two wordnet synsets (s1, s2)
s1 = [wn.synset('multiple_sclerosis.n.01'),
wn.synset('stewart.n.01'),
wn.synset('head.n.04'),
wn.synset('executive.n.01'),
wn.synset('washington.n.02'),
wn.synset('not.r.01'),
wn.synset('expect.v.01'),
wn.synset('attend.v.01')]
s2 = [wn.synset('multiple_sclerosis.n.01'),
wn.synset('stewart.n.01'),
wn.synset('sixty-one.s.01'),
wn.synset('information_technology.n.01'),
wn.synset('head.n.04'),
wn.synset('executive.n.01'),
wn.synset('military_officer.n.01'),
wn.synset('president.n.04'),
wn.synset('make.v.01'),
wn.synset('not.r.01'),
wn.synset('attend.v.01')]
def similarity_score(set1, set2):
    ps_list = {}
    for word1 in set1:
        best = max(wn.path_similarity(word1, word2) for word2 in set2)
        ps_list[word1] = best
    return ps_list

pprint(similarity_score(s2, s1))
This gives the following results:
{Synset('attend.v.01'): 1.0,
Synset('executive.n.01'): 1.0,
Synset('head.n.04'): 1.0,
Synset('information_technology.n.01'): 0.07142857142857142,
Synset('make.v.01'): 0.3333333333333333,
Synset('military_officer.n.01'): 0.14285714285714285,
Synset('multiple_sclerosis.n.01'): 1.0,
Synset('not.r.01'): 1.0,
Synset('president.n.04'): 0.25,
Synset('sixty-one.s.01'): None,
Synset('stewart.n.01'): 1.0}
This would seem to suggest that the algorithm cannot find a match of any kind for the s2 'sixty-one.s.01' entry anywhere in s1. It might be worth running more tests on that single entry.
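If a plain float is needed even for synset pairs with no connecting path, one option (an assumption, not something the code above does) is to skip the None values and fall back to 0.0:
def similarity_score(set1, set2):
    # treat synset pairs with no WordNet path as 0.0 so max() always sees floats
    scores = {}
    for word1 in set1:
        sims = (wn.path_similarity(word1, word2) for word2 in set2)
        scores[word1] = max((s for s in sims if s is not None), default=0.0)
    return scores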
Hope this is useful.
I have a question about using the Natural Language Toolkit (NLTK). I'm trying to make an app to translate a natural-language question into its logical representation and query a database.
After using the simplify() method from the nltk.sem.logic package, I got the following expression:
exists z2.(owner(fido, z2) & (z0 = z2))
But what I need is to simplify it as follow:
owner(fido, z0)
Is there another method that could reduce the sentence as I want?
In NLTK, simplify() performs beta reduction (according to the book), which is not what you need. What you are asking for is only doable with theorem provers when you apply certain tactics. In this case, you either need to know what you expect to get at the end, or what kinds of axioms can be applied to get such a result.
The theorem prover in NLTK is Prover9 which provides tools to check entailment relations. Basically, you can only check if there is a proof with a limited number of steps from a list of expressions (premises) to a goal expression. In your case for example, this was the result:
============================== PROOF =================================
% -------- Comments from original proof --------
% Proof 1 at 0.00 (+ 0.00) seconds.
% Length of proof is 8.
% Level of proof is 4.
% Maximum clause weight is 4.
% Given clauses 0.
1 (exists x (owner(fido,x) & y = x)) # label(non_clause). [assumption].
2 owner(fido,x) # label(non_clause) # label(goal). [goal].
3 owner(fido,f1(x)). [clausify(1)].
4 x = f1(x). [clausify(1)].
5 f1(x) = x. [copy(4),flip(a)].
6 -owner(fido,c1). [deny(2)].
7 owner(fido,x). [back_rewrite(3),rewrite([5(2)])].
8 $F. [resolve(7,a,6,a)].
============================== end of proof ==========================
In NLTK python:
from nltk import Prover9
from nltk.sem import Expression
read_expr = Expression.fromstring
p1 = read_expr('exists z2.(owner(fido, z2) & (z0 = z2))')
c = read_expr('owner(fido, z0)')
result = Prover9().prove(c, [p1])
print(result)
# returns True
UPDATE
In case you insist on using the tools available in Python and want to manually check this particular pattern, you can probably do something like this with regular expressions (I don't approve, but let's try my nasty tactic):
import re

def my_nasty_tactic(exp):
    parameter = re.findall(r'exists ([^.]*)\..*', exp)
    if len(parameter) == 1:
        parameter = parameter[0]
        substitution = re.findall(r'&[ ]*\([ ]*([^ ]+)[ ]*=[ ]*'+parameter+r'[ ]*\)', exp)
        if len(substitution) == 1:
            substitution = substitution[0]
            exp_abs = re.sub(r'exists(?= [^.]*\..*)', r"\ ", exp)
            exp_abs = re.sub(r'&[ ]*\([ ]*' + substitution + '[ ]*=[ ]*'+parameter+r'[ ]*\)', '', exp_abs)
            return read_expr('(%s)(%s)' % (exp_abs, substitution)).simplify()
Then you can use it like this:
my_nasty_tactic('exists z2.(owner(fido, z2) & (z0 = z2))')
# <ApplicationExpression owner(fido,z0)>
I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example,
Term Group
NBA Basketball 1
Basketball NBA 1
Basketball 1
Baseball 2
It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought NLTK might have something along those lines, but I'm only barely familiar with it.
Thanks
You'll want to cluster these terms, and for the similarity metric I recommend Dice's Coefficient at the character-gram level. For example, partition the strings into two-letter sequences to compare (term1="NB", "BA", "A ", " B", "Ba"...).
nltk appears to provide dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement in a way that'll allow tuning. Here's how to compare these strings at the character rather than word level.
import sys, operator

def tokenize(s, glen):
    # collect the set of character n-grams of length glen
    g2 = set()
    for i in range(len(s) - (glen - 1)):
        g2.add(s[i:i + glen])
    return g2

def dice_grams(g1, g2):
    return (2.0 * len(g1 & g2)) / (len(g1) + len(g2))

def dice(n, s1, s2):
    return dice_grams(tokenize(s1, n), tokenize(s2, n))

def main():
    GRAM_LEN = 4
    scores = {}
    for i in range(1, len(sys.argv)):
        for j in range(i + 1, len(sys.argv)):
            s1 = sys.argv[i]
            s2 = sys.argv[j]
            score = dice(GRAM_LEN, s1, s2)
            scores[s1 + ":" + s2] = score
    for item in sorted(scores.items(), key=operator.itemgetter(1)):
        print(item)

if __name__ == "__main__":
    main()
When this program is run with your strings, the following similarity scores are produced:
./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"
('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.63636363636363635)
('NBA Basketball:Basketball', 0.77777777777777779)
('Basketball NBA:Basketball', 0.77777777777777779)
At least for this example, the margin between the basketball and baseball terms should be sufficient for clustering them into separate groups. Alternatively you may be able to use the similarity scores more directly in your code with a threshold.
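As a minimal sketch of that threshold idea (reusing the dice() function above; the 0.6 cutoff and the greedy first-fit grouping are assumptions to tune, not part of the code in this answer):
def group_terms(terms, n=2, threshold=0.6):
    # greedy grouping: join the first existing group whose representative is similar enough
    groups = []
    for term in terms:
        for group in groups:
            if dice(n, term, group[0]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups

# group_terms(["NBA Basketball", "Basketball NBA", "Basketball", "Baseball"])
# should give something like [['NBA Basketball', 'Basketball NBA', 'Basketball'], ['Baseball']]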