How to use NLTK to generate sentences from an induced grammar? - python

I have a (large) list of parsed sentences (which were parsed using the Stanford parser), for example, the sentence "Now you can be entertained" has the following tree:
(ROOT
(S
(ADVP (RB Now))
(, ,)
(NP (PRP you))
(VP (MD can)
(VP (VB be)
(VP (VBN entertained))))
(. .)))
I am using the set of sentence trees to induce a grammar using nltk:
import nltk
# ... for each sentence tree t, add its production to allProductions
allProductions += t.productions()
# Induce the grammar
S = nltk.Nonterminal('S')
grammar = nltk.induce_pcfg(S, allProductions)
Now I would like to use grammar to generate new, random sentences. My hope is that since the grammar was learned from a specific set of input examples, then the generated sentences will be semantically similar. Can I do this in nltk?
If I can't use nltk to do this, do any other tools exist that can take the (possibly reformatted) grammar and generate sentences?

In NLTK 2.0 you can use nltk.parse.generate to generate all possible sentences for a given grammar.
This code defines a function which should generate a single sentence based on the production rules in a (P)CFG.
# This example uses choice to choose from possible expansions
from random import choice
from nltk import Nonterminal
# This function is based on _generate_all() in nltk.parse.generate
def generate_sample(grammar, items=None):
    # Start from the grammar's start symbol unless told otherwise
    if items is None:
        items = [grammar.start()]
    frags = []
    for item in items:
        if isinstance(item, Nonterminal):
            # This is where the sampling happens: pick one expansion of
            # this nonterminal at random and expand it recursively
            chosen_expansion = choice(grammar.productions(lhs=item))
            frags.extend(generate_sample(grammar, chosen_expansion.rhs()))
        else:
            # Terminals are emitted unchanged
            frags.append(item)
    return frags
To make use of the weights in your PCFG, you'll obviously want to use a better sampling method than choice(), which implicitly assumes all expansions of the current node are equiprobable.
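As a rough illustration of weighted sampling, here is a minimal stand-alone sketch that does not import NLTK. The toy grammar, its probabilities, and the names TOY_PCFG and sample are made up for the example; with a real NLTK PCFG the weights would come from each production's prob() instead.

```python
import random

# Hypothetical toy PCFG (not from the question):
# nonterminal -> list of (right-hand side, probability)
TOY_PCFG = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("you",), 0.6), (("they",), 0.4)],
    "VP": [(("can", "VP"), 0.3), (("sing",), 0.7)],
}

def sample(grammar, symbol, rng):
    """Expand symbol into a list of words, honouring rule probabilities."""
    if symbol not in grammar:
        # Anything without rules is treated as a terminal
        return [symbol]
    expansions = grammar[symbol]
    options = [rhs for rhs, _ in expansions]
    weights = [prob for _, prob in expansions]
    # random.choices draws proportionally to the given weights
    rhs = rng.choices(options, weights=weights, k=1)[0]
    words = []
    for child in rhs:
        words.extend(sample(grammar, child, rng))
    return words

rng = random.Random(0)  # seeded only to make runs repeatable
sentence = sample(TOY_PCFG, "S", rng)
print(" ".join(sentence))
```

Note that a recursive rule such as VP -> can VP only terminates with probability 1 when its weight is low enough; an induced grammar may need a depth limit as well.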

First of all, if you generate random sentences, they may be syntactically correct, but they will probably not make sense.
(It sounds to me a bit like what those MIT students did with their SCIgen program, which auto-generates scientific papers. Very interesting, by the way.)
Anyway, I have never done it myself, but it seems possible with nltk.bigrams; you may want to have a look there under Generating Random Text with Bigrams.
You can also generate all subtrees of a given tree, though I'm not sure that is what you want either.

My solution to generate a random sentence from an existing nltk.CFG grammar:
import random
def generate_sample(grammar, prod, frags):
    if prod in grammar._lhs_index: # Derivation
        derivations = grammar._lhs_index[prod]
        derivation = random.choice(derivations)
        for d in derivation._rhs:
            generate_sample(grammar, d, frags)
    elif prod in grammar._rhs_index:
        # terminal
        frags.append(str(prod))
And now it can be used:
frags = []
generate_sample(grammar, grammar.start(), frags)
print( ' '.join(frags) )

With an nltk Text object you can call generate() on it, which will "Print random text, generated using a trigram language model." See http://nltk.org/_modules/nltk/text.html

Inspired by the above, here's one which uses iteration instead of recursion.
import random
def rewrite_at(index, replacements, the_list):
    del the_list[index]
    the_list[index:index] = replacements
def generate_sentence(grammar):
    sentence_list = [grammar.start()]
    all_terminals = False
    while not all_terminals:
        all_terminals = True
        for position, symbol in enumerate(sentence_list):
            if symbol in grammar._lhs_index:
                all_terminals = False
                derivations = grammar._lhs_index[symbol]
                derivation = random.choice(derivations) # or weighted_choice(derivations) if you have a function for that
                rewrite_at(position, derivation.rhs(), sentence_list)
    return sentence_list
Or, if you want the derivation tree, use this one.
import random
from nltk.tree import Tree
def tree_from_production(production):
    return Tree(production.lhs(), production.rhs())
def leaf_positions(the_tree):
    return [the_tree.leaf_treeposition(i) for i in range(len(the_tree.leaves()))]
def generate_tree(grammar):
    initial_derivations = grammar._lhs_index[grammar.start()]
    initial_derivation = random.choice(initial_derivations) # or weighted_choice if you have that function
    running_tree = tree_from_production(initial_derivation)
    all_terminals = False
    while not all_terminals:
        all_terminals = True
        for position in leaf_positions(running_tree):
            node_label = running_tree[position]
            if node_label in grammar._lhs_index:
                all_terminals = False
                derivations = grammar._lhs_index[node_label]
                derivation = random.choice(derivations) # or weighted_choice if you have that function
                running_tree[position] = tree_from_production(derivation)
    return running_tree
Here's a weighted_choice function for NLTK PCFG production rules to use with the above, adapted from Ned Batchelder's answer here for weighted choice functions in general:
def weighted_choice(productions):
    prods_with_probs = [(prod, prod.prob()) for prod in productions]
    total = sum(prob for prod, prob in prods_with_probs)
    r = random.uniform(0, total)
    upto = 0
    for prod, prob in prods_with_probs:
        if upto + prob >= r:
            return prod
        upto += prob
    assert False, "Shouldn't get here"

Related

Simplify a Logic Expression using NLTK

I have a question about using the Natural Language ToolKit (NLTK). I'm trying to build an app that translates a natural-language question into its logical representation and then queries a database.
After calling the simplify() method from the nltk.sem.logic package, I got the following expression:
exists z2.(owner(fido, z2) & (z0 = z2))
But what I need is to simplify it as follows:
owner(fido, z0)
Is there another method that could reduce the sentence the way I want?
In NLTK, simplify() performs beta reduction (according to the book), which is not what you need. What you are asking for is only doable with theorem provers, by applying certain tactics. In that case, you either need to know what you expect to get at the end, or you need to know which axioms can be applied to reach such a result.
The theorem prover in NLTK is Prover9 which provides tools to check entailment relations. Basically, you can only check if there is a proof with a limited number of steps from a list of expressions (premises) to a goal expression. In your case for example, this was the result:
============================== PROOF =================================
% -------- Comments from original proof --------
% Proof 1 at 0.00 (+ 0.00) seconds.
% Length of proof is 8.
% Level of proof is 4.
% Maximum clause weight is 4.
% Given clauses 0.
1 (exists x (owner(fido,x) & y = x)) # label(non_clause). [assumption].
2 owner(fido,x) # label(non_clause) # label(goal). [goal].
3 owner(fido,f1(x)). [clausify(1)].
4 x = f1(x). [clausify(1)].
5 f1(x) = x. [copy(4),flip(a)].
6 -owner(fido,c1). [deny(2)].
7 owner(fido,x). [back_rewrite(3),rewrite([5(2)])].
8 $F. [resolve(7,a,6,a)].
============================== end of proof ==========================
In NLTK python:
from nltk import Prover9
from nltk.sem import Expression
read_expr = Expression.fromstring
p1 = read_expr('exists z2.(owner(fido, z2) & (z0 = z2))')
c = read_expr('owner(fido, z0)')
result = Prover9().prove(c, [p1])
print(result)
# returns True
UPDATE
If you insist on using only the tools available in Python and want to check for this particular pattern by hand, you can probably do something like this with regular expressions (I don't approve, but let's try my nasty tactic). It reuses read_expr from the snippet above:
import re
def my_nasty_tactic(exp):
    parameter = re.findall(r'exists ([^.]*)\..*', exp)
    if len(parameter) == 1:
        parameter = parameter[0]
        substitution = re.findall(r'&[ ]*\([ ]*([^ ]+)[ ]*=[ ]*'+parameter+r'[ ]*\)', exp)
        if len(substitution) == 1:
            substitution = substitution[0]
            # turn the quantifier into a lambda abstraction
            exp_abs = re.sub(r'exists(?= [^.]*\..*)', r"\\ ", exp)
            exp_abs = re.sub(r'&[ ]*\([ ]*' + substitution + '[ ]*=[ ]*'+parameter+r'[ ]*\)', '', exp_abs)
            return read_expr('(%s)(%s)' % (exp_abs, substitution)).simplify()
Then you can use it like this:
my_nasty_tactic('exists z2.(owner(fido, z2) & (z0 = z2))')
# <ApplicationExpression owner(fido,z0)>

How to correct my Naive Bayes method returning extremely small conditional probabilities?

I'm trying to calculate the probability that an email is spam with Naive Bayes. I have a document class to create the documents (fed in from a website), and another class to train and classify documents. My train function calculates all the unique terms in all the documents, all documents in the spam class, all documents in the non-spam class, and computes prior probabilities (one for spam, another for ham). Then I use the following formula (written out here as the code below computes it) to store conditional probabilities for each term in a dict:
condprob(term, class) = (Tct + 1) / (Tct' + B')
Tct = the number of occurrences of a term in a given class
Tct' = the total number of terms in a given class
B' = the number of unique terms in all documents
classes = either spam or ham
spam = spam, ham = not spam
the issue is that when I use this formula in my code it gives me extremely small conditional probability scores such as 2.461114392596968e-05. I'm quite sure this is because the values for Tct are very small (like 5 or 8) compared to the denominator values of Tct' (which is 64878 for ham and 308930 for spam) and B' (which is 16386). I can't figure out how to get the condprob scores down to something like .00034155, as I can only assume my condprob scores aren't supposed to be as exponentially small as they are. Am I doing something wrong with my calculations? Are the values actually supposed to be this small?
If it helps, my goal is to score a test set of documents and get results like 327.82, 758.80, or 138.66
using this formula
however, using my small condprob values I only get negative numbers.
Code
-Create Document
class Document(object):
    """
    The instance variables are:
    filename....The path of the file for this document.
    label.......The true class label ('spam' or 'ham'), determined by whether the filename contains the string 'spmsg'
    tokens......A list of token strings.
    """
    def __init__(self, filename=None, label=None, tokens=None):
        """ Initialize a document either from a file, in which case the label
        comes from the file name, or from specified label and tokens, but not
        both.
        """
        if label: # specify from label/tokens, for testing.
            self.label = label
            self.tokens = tokens
        else: # specify from file.
            self.filename = filename
            self.label = 'spam' if 'spmsg' in filename else 'ham'
            self.tokenize()
    def tokenize(self):
        self.tokens = ' '.join(open(self.filename).readlines()).split()
-NaiveBayes
class NaiveBayes(object):
    def train(self, documents):
        """
        Given a list of labeled Document objects, compute the class priors and
        word conditional probabilities, following Figure 13.2 of your
        book. Store these as instance variables, to be used by the classify
        method subsequently.
        Params:
        documents...A list of training Documents.
        Returns:
        Nothing.
        """
        ###TODO
        unique = []
        proxy = []
        proxy2 = []
        proxy3 = []
        condprob = [{},{}]
        Tct = defaultdict()
        Tc_t = defaultdict()
        prior = {}
        count = 0
        oldterms = []
        old_terms = []
        for a in range(len(documents)):
            done = False
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
                if documents[a].label == "ham":
                    proxy2.append(item)
                    if done == False:
                        count += 1
                elif documents[a].label == "spam":
                    proxy3.append(item)
                done = True
        V = unique
        N = len(documents)
        print("N:",N)
        LB = len(unique)
        print("THIS IS LB:",LB)
        self.V = V
        print("THIS IS COUNT/NC", count)
        Nc = count
        prior["ham"] = Nc / N
        self.prior = prior
        Nc = len(documents) - count
        print("THIS IS SPAM COUNT/NC", Nc)
        prior["spam"] = Nc / N
        self.prior = prior
        text2 = proxy2
        text3 = proxy3
        TctTotal = len(text2)
        Tc_tTotal = len(text3)
        print("THIS IS TCTOTAL",TctTotal)
        print("THIS IS TC_TTOTAL",Tc_tTotal)
        for term in text2:
            if term not in oldterms:
                Tct[term] = text2.count(term)
                oldterms.append(term)
        for term in text3:
            if term not in old_terms:
                Tc_t[term] = text3.count(term)
                old_terms.append(term)
        for term in V:
            if term in text2:
                condprob[0].update({term: (Tct[term] + 1) / (TctTotal + LB)})
            if term in text3:
                condprob[1].update({term: (Tc_t[term] + 1) / (Tc_tTotal + LB)})
        print("This is condprob", condprob)
        self.condprob = condprob
    def classify(self, documents):
        """ Return a list of strings, either 'spam' or 'ham', for each document.
        Params:
        documents....A list of Document objects to be classified.
        Returns:
        A list of label strings corresponding to the predictions for each document.
        """
        ###TODO
        #return list["string1", "string2", "stringn"]
        # docs2 = ham, condprob[0] is ham
        # docs3 = spam, condprob[1] is spam
        unique = []
        ans = []
        hscore = 0
        sscore = 0
        for a in range(len(documents)):
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
            W = unique
            hscore = math.log(float(self.prior['ham']))
            sscore = math.log(float(self.prior['spam']))
            for t in W:
                try:
                    hscore += math.log(self.condprob[0][t])
                except KeyError:
                    continue
                try:
                    sscore += math.log(self.condprob[1][t])
                except KeyError:
                    continue
            print("THIS IS SSCORE",sscore)
            print("THIS IS HSCORE",hscore)
            unique = []
            if hscore > sscore:
                str = "Spam"
            elif sscore > hscore:
                str = "Ham"
            ans.append(str)
        return ans
-Test
if not os.path.exists('train'): # download data
    from urllib.request import urlretrieve
    import tarfile
    urlretrieve('http://cs.iit.edu/~culotta/cs429/lingspam.tgz', 'lingspam.tgz')
    tar = tarfile.open('lingspam.tgz')
    tar.extractall()
    tar.close()
train_docs = [Document(filename=f) for f in glob.glob("train/*.txt")]
test_docs = [Document(filename=f) for f in glob.glob("test/*.txt")]
test = train_docs
nb = NaiveBayes()
nb.train(train_docs[1500:])
#uncomment when testing classify()
#predictions = nb.classify(test_docs[:200])
#print("PREDICTIONS",predictions)
The eventual goal is to be able to classify documents as spam or ham, but I want to work on the conditional probability issue first.
The Issue
Are the conditional probability values supposed to be this small? If so, why am I getting strange scores from classify? If not, how do I fix my code so that it gives me the proper condprob values?
Values
The current condprob values that I am getting are along the lines of this:
'tradition': 2.461114392596968e-05, 'fillmore': 2.461114392596968e-05, '796': 2.461114392596968e-05, 'zann': 2.461114392596968e-05
condprob is a list containing two dictionaries; the first is for ham and the second is for spam. Each dictionary maps a term to its conditional probability. I want to have "normal" small values such as .00031235, not 3.1235e-05.
The reason for this is that when I run the condprob values through the classify method with some test documents I get scores like
THIS IS HSCORE -2634.5292392650663, THIS IS SSCORE -1707.983339196181
when they should look like
THIS IS HSCORE 327.82, THIS IS SSCORE 758.80
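For what it's worth, the quoted condprob value can be reproduced by hand from the totals given in the question, which suggests the arithmetic itself is behaving as written. The sketch below assumes the term occurs exactly once in the ham class (the post does not state the count), and it also shows why a sum of log probabilities comes out negative: the logarithm of any number below 1 is negative.

```python
import math

ham_total_terms = 64878  # Tct' for ham, as quoted in the question
vocabulary_size = 16386  # B', as quoted in the question
term_count = 1           # assumption: not stated in the question

# Add-one smoothed estimate, matching the expression used in train()
condprob = (term_count + 1) / (ham_total_terms + vocabulary_size)
log_condprob = math.log(condprob)

print(condprob)      # a value of the order of 2.46e-05
print(log_condprob)  # negative, because condprob < 1
```

Summing a few hundred such logarithms gives totals in the negative thousands, which is the range of the HSCORE and SSCORE values shown above.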
Running Time
~1 min, 30 sec
(You seem to be working with log probabilities, which is very sensible, but I am going to write most of the following for the raw probabilities, which you could get by taking the exponential of the log probabilities, because it makes the algebra easier even if it does in practice mean that you would probably get numerical underflow if you didn't use logs)
As far as I can tell from your code you start with prior probabilities p(Ham) and p(Spam) and then use probabilities estimated from previous data to work out p(Ham) * p(Observed data | Ham) and p(Spam) * p(Observed data | Spam).
Bayes Theorem rearranges p(Obs|Spam) = p(Obs & Spam) / p(Spam) = p(Obs) p(Spam|Obs) / p(Spam) to give you P(Spam|Obs) = p(Spam) p(Obs|Spam)/p(Obs) and you seem to have calculated p(Spam) p(Obs|Spam) = p(Obs & Spam) but not divided by p(Obs). Since there are only two possibilities, Ham and Spam, the easiest thing to do is probably to note that p(Obs) = p(Obs & Spam) + p(Obs & Ham) and so just divide each of your two calculated values by their sum, essentially scaling the values so that they do indeed sum to 1.0.
This scaling is trickier if you start off with log probabilities lA and lB. To scale these I would first bring them into range by subtracting a common rough value from both logarithms, computing the maximum once so that both subtractions use the same value:
m = max(lA, lB)
lA = lA - m
lB = lB - m
Now the larger of the two is exactly zero, so exponentiating it cannot overflow. The smaller might still underflow to zero, but I'd rather deal with underflow than overflow. Now turn them into not quite scaled probabilities:
pA = exp(lA)
pB = exp(lB)
and scale properly so they add to one:
truePA = pA / (pA + pB)
truePB = pB / (pA + pB)
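The steps above can be sketched in Python roughly as follows. The function name normalize_log_scores is made up for the illustration, and the two scores passed in are the ones printed in the question.

```python
import math

def normalize_log_scores(log_a, log_b):
    """Turn two unnormalized log scores into probabilities that sum to 1."""
    m = max(log_a, log_b)      # shift so the larger score becomes 0
    p_a = math.exp(log_a - m)  # at most 1.0, so no overflow
    p_b = math.exp(log_b - m)  # may underflow to 0.0, which is harmless here
    total = p_a + p_b          # at least 1.0, so the division is safe
    return p_a / total, p_b / total

# Scores quoted in the question (ham, spam)
p_ham, p_spam = normalize_log_scores(-2634.5292392650663, -1707.983339196181)
print(p_ham, p_spam)
```

With a gap of several hundred between the two log scores, the smaller one underflows to zero, so the larger class ends up with essentially all of the probability mass.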

Parsing an equation with sub-formulas in python

I'm trying to develop an equation parser using a compiler approach in Python. The main issue is that I often don't have all the variables and therefore need to look for sub-formulas. An example is worth a thousand words ;)
I have four variables whose values I know: vx, vy, vz and c:
list_know_var = ['vx', 'vy', 'vz', 'c']
and I want to compute the Mach number (M) defined as
equation = 'M = V / c'
I already know the c variable, but I don't know V. However, I know that the velocity V can be computed from vx, vy and vz, and this is stored in a dictionary with other formulas (only one sub-formula is shown here):
know_equations = {'V': '(vx ** 2 + vy ** 2 + vz ** 2) ** 0.5'}
Therefore, what I need is to parse the equation and check whether I have all the variables. If not, I should look in the know_equations dictionary to see whether an equation is defined for the missing variable, and repeat this recursively until my equation is fully defined.
So far, using the answer given here, I have been able to parse my equation and find out whether I know all the variables. The issue is that I have not found a way to replace the unknown variable (here V) with its expression from know_equations:
parsed_equation = compiler.parse(equation)
list_var = re.findall(r"Name\(\'(\w*)\'\)", str(parsed_equation.getChildNodes()[0]))
list_unknow_var = list(set(list_var) - set(list_know_var))
for var in list_unknow_var:
    if var in know_equations:
        # replace var in equation by its expression given in know_equations
        # and repeat until all the variables are known or raise Error
        pass
Thank you in advance for your help, much appreciated!
Adrien
So I'm spitballing a bit, but here goes.
The compiler.parse function returns an instance of compiler.ast.Module which contains an abstract syntax tree. You can traverse this instance using the getChildNodes method. By recursively examining the left and right attributes of the nodes as you traverse the tree you can isolate compiler.ast.Name instances and swap them out for your substitution expressions.
So a worked example might be:
import compiler
def recursive_parse(node,substitutions):
    # look for left hand side of equation and test
    # if it is a variable name
    if hasattr(node.left,"name"):
        if node.left.name in substitutions.keys():
            node.left = substitutions[node.left.name]
    else:
        # if not, go deeper
        recursive_parse(node.left,substitutions)
    # look for right hand side of equation and test
    # if it is a variable name
    if hasattr(node.right,"name"):
        if node.right.name in substitutions.keys():
            node.right = substitutions[node.right.name]
    else:
        # if not, go deeper
        recursive_parse(node.right,substitutions)
def main(input):
    substitutions = {
        "r":"sqrt(x**2+y**2)"
    }
    # each of the substitutions needs to be compiled/parsed
    for key,value in substitutions.items():
        # this is a quick ugly way of getting the data of interest
        # really this should be done in a programmatically cleaner manner
        substitutions[key] = compiler.parse(substitutions[key]).getChildNodes()[0].getChildNodes()[0].getChildNodes()[0]
    # compile the input expression.
    expression = compiler.parse(input)
    print "Input: ",expression
    # traverse the selected input, here we only pass the branch of interest.
    # again, as with above, this is done quick and dirty.
    recursive_parse(expression.getChildNodes()[0].getChildNodes()[0].getChildNodes()[1],substitutions)
    print "Substituted: ",expression
if __name__ == "__main__":
    input = "t = r*p"
    main(input)
I have admittedly only tested this on a handful of use cases, but I think the basis is there for a generic implementation that can handle a wide variety of inputs.
Running this, I get the output:
Input: Module(None, Stmt([Assign([AssName('t', 'OP_ASSIGN')], Mul((Name('r'), Name('p'))))]))
Substituted: Module(None, Stmt([Assign([AssName('t', 'OP_ASSIGN')], Mul((CallFunc(Name('sqrt'), [Add((Power((Name('x'), Const(2))), Power((Name('y'), Const(2)))))], None, None), Name('p'))))]))
EDIT:
So the compiler module is deprecated (and was removed in Python 3.0), so a better (and cleaner) solution would be to use the ast module:
import ast
from math import sqrt
# same as the previous recursion function, but looking for the 'id' attribute instead of 'name'
def recursive_parse(node,substitutions):
    if hasattr(node.left,"id"):
        if node.left.id in substitutions.keys():
            node.left = substitutions[node.left.id]
    else:
        recursive_parse(node.left,substitutions)
    if hasattr(node.right,"id"):
        if node.right.id in substitutions.keys():
            node.right = substitutions[node.right.id]
    else:
        recursive_parse(node.right,substitutions)
def main(input):
    substitutions = {
        "r":"sqrt(x**2+y**2)"
    }
    for key,value in substitutions.items():
        substitutions[key] = ast.parse(substitutions[key], mode='eval').body
    # As this is an assignment operation, mode must be set to exec
    module = ast.parse(input, mode='exec')
    print "Input: ",ast.dump(module)
    recursive_parse(module.body[0].value,substitutions)
    print "Substituted: ",ast.dump(module)
    # give some values for the equation
    x = 3
    y = 2
    p = 1
    code = compile(module,filename='<string>',mode='exec')
    exec(code)
    print input
    print "t =",t
if __name__ == "__main__":
    input = "t = r*p"
    main(input)
This will compile the expression and execute it in the local space. The output should be:
Input: Module(body=[Assign(targets=[Name(id='t', ctx=Store())], value=BinOp(left=Name(id='r', ctx=Load()), op=Mult(), right=Name(id='p', ctx=Load())))])
Substituted: Module(body=[Assign(targets=[Name(id='t', ctx=Store())], value=BinOp(left=Call(func=Name(id='sqrt', ctx=Load()), args=[BinOp(left=BinOp(left=Name(id='x', ctx=Load()), op=Pow(), right=Num(n=2)), op=Add(), right=BinOp(left=Name(id='y', ctx=Load()), op=Pow(), right=Num(n=2)))], keywords=[], starargs=None, kwargs=None), op=Mult(), right=Name(id='p', ctx=Load())))])
t = r*p
t = 3.60555127546
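For the original M = V / c example, the same substitution idea can be sketched in Python 3 with ast.NodeTransformer, which visits every node rather than only the left and right operands of a binary operation. This is only a sketch under some assumptions: the names Substituter and expand_and_evaluate are made up here, only the right-hand side of the equation is handled, the sample values are invented, and there is no protection against formulas that refer to each other in a cycle.

```python
import ast

class Substituter(ast.NodeTransformer):
    """Replace variable names that have a known sub-formula by that formula."""

    def __init__(self, formulas):
        self.formulas = formulas

    def visit_Name(self, node):
        if node.id in self.formulas:
            replacement = ast.parse(self.formulas[node.id], mode="eval").body
            # Visit the replacement too, so sub-formulas can be nested
            return self.visit(replacement)
        return node

def expand_and_evaluate(expression, formulas, values):
    """Substitute known formulas into expression, then evaluate it."""
    tree = ast.parse(expression, mode="eval")
    tree = ast.fix_missing_locations(Substituter(formulas).visit(tree))
    # Every remaining name must have a value, otherwise report it
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    missing = names - set(values)
    if missing:
        raise KeyError("no value or formula for: " + ", ".join(sorted(missing)))
    code = compile(tree, "<equation>", "eval")
    # Builtins are emptied so the expression can only see the given values
    return eval(code, {"__builtins__": {}}, dict(values))

know_equations = {"V": "(vx ** 2 + vy ** 2 + vz ** 2) ** 0.5"}
values = {"vx": 3.0, "vy": 4.0, "vz": 0.0, "c": 10.0}  # invented sample values
mach = expand_and_evaluate("V / c", know_equations, values)
print("M =", mach)
```

With these sample values V works out to 5.0, so the Mach number is 0.5. A variable that has neither a value nor a formula raises a KeyError that names it.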

Memory not being returned after function python call

I've got a function which parses a sentence by building up a chart. But Python holds on to whatever memory was allocated during that function call. That is, I do
best = translate(sentence, grammar)
and somehow my memory goes up and stays up. Here is the function:
from string import join
from heapq import nsmallest, heappush
from collections import defaultdict
MAX_TRANSLATIONS=4 # or choose something else
def translate(f, g):
    words = f.split()
    chart = {}
    for col in range(len(words)):
        for row in reversed(range(0,col+1)):
            # get rules for this subspan
            rules = g[join(words[row:col+1], ' ')]
            # ensure there's at least one rule on the diagonal
            if not rules and row==col:
                rules=[(0.0, join(words[row:col+1]))]
            # pick up rules below & to the left
            for k in range(row,col):
                if (row,k) and (k+1,col) in chart:
                    for (w1, e1) in chart[row, k]:
                        for (w2, e2) in chart[k+1,col]:
                            heappush(rules, (w1+w2, e1+' '+e2))
            # add all rules to chart
            chart[row,col] = nsmallest(MAX_TRANSLATIONS, rules)
    (w, best) = chart[0, len(words)-1][0]
    return best
g = defaultdict(list)
g['cela'] = [(8.28, 'this'), (11.21, 'it'), (11.57, 'that'), (15.26, 'this ,')]
g['est'] = [(2.69, 'is'), (10.21, 'is ,'), (11.15, 'has'), (11.28, ', is')]
g['difficile'] = [(2.01, 'difficult'), (10.08, 'hard'), (10.19, 'difficult ,'), (10.57, 'a difficult')]
sentence = "cela est difficile"
best = translate(sentence, g)
I'm using Python 2.7 on OS X.
Within the function, you set rules to an element of grammar; rules then refers to that element, which is a list. You then add items to rules with heappush, which (as lists are mutable) means grammar holds on to the pushed values via that list. If you don't want this to happen, use copy when assigning rules or deepcopy on the grammar at the start of translate. Note that even if you copy the list to rules, the grammar will record an empty list every time you retrieve an element for a missing key.
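The aliasing described above can be seen in isolation with a small sketch like the following. The grammar contents and the extra tuples are invented for the demonstration; only defaultdict, heappush and list copying are involved.

```python
from collections import defaultdict
from heapq import heappush

grammar = defaultdict(list)
grammar["est"] = [(2.69, "is")]

# Aliasing: rules is the very list stored in grammar, so pushes persist
rules = grammar["est"]
heappush(rules, (5.0, "extra"))
aliased_len = len(grammar["est"])  # the stored list has grown

# Copying first leaves the stored list untouched
rules = list(grammar["est"])
heappush(rules, (7.0, "another"))
copied_len = len(grammar["est"])   # unchanged by the second push

# Indexing a defaultdict with a missing key inserts an empty list,
# whereas .get() only returns a default without storing anything
_ = grammar["missing span"]
was_inserted = "missing span" in grammar
rules = grammar.get("other span", [])
was_not_inserted = "other span" not in grammar

print(aliased_len, copied_len, was_inserted, was_not_inserted)
```

In translate, this means every multi-word span that is looked up leaves a new key behind in g, and every heappush on an uncopied rules list grows the grammar itself.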
Try running gc.collect() after you run the function.

Grouping Similar Strings

I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example,
Term Group
NBA Basketball 1
Basketball NBA 1
Basketball 1
Baseball 2
It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought the nltk may have something along those lines, but I'm only barely familiar with it.
Thanks
You'll want to cluster these terms, and for the similarity metric I recommend Dice's Coefficient at the character-gram level. For example, partition the strings into two-letter sequences to compare (term1="NB", "BA", "A ", " B", "Ba"...).
nltk appears to provide dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement in a way that'll allow tuning. Here's how to compare these strings at the character rather than word level.
import sys, operator
def tokenize(s, glen):
    g2 = set()
    for i in xrange(len(s)-(glen-1)):
        g2.add(s[i:i+glen])
    return g2
def dice_grams(g1, g2): return (2.0*len(g1 & g2)) / (len(g1)+len(g2))
def dice(n, s1, s2): return dice_grams(tokenize(s1, n), tokenize(s2, n))
def main():
    GRAM_LEN = 4
    scores = {}
    for i in xrange(1,len(sys.argv)):
        for j in xrange(i+1, len(sys.argv)):
            s1 = sys.argv[i]
            s2 = sys.argv[j]
            score = dice(GRAM_LEN, s1, s2)
            scores[s1+":"+s2] = score
    for item in sorted(scores.iteritems(), key=operator.itemgetter(1)):
        print item
if __name__ == "__main__":
    main()
When this program is run with your strings, the following similarity scores are produced:
./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"
('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.63636363636363635)
('NBA Basketball:Basketball', 0.77777777777777779)
('Basketball NBA:Basketball', 0.77777777777777779)
At least for this example, the margin between the basketball and baseball terms should be sufficient for clustering them into separate groups. Alternatively you may be able to use the similarity scores more directly in your code with a threshold.
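One way the threshold idea might look in code is sketched below, in Python 3. The grouping rule (a term joins any group containing a term it is similar enough to, merging groups where necessary) and the 0.5 cut-off are assumptions chosen for this example, not something prescribed above; group_terms and char_ngrams are made-up names.

```python
def char_ngrams(s, n):
    """Set of all character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def dice(s1, s2, n=4):
    """Dice's coefficient on character n-grams."""
    g1, g2 = char_ngrams(s1, n), char_ngrams(s2, n)
    if not g1 and not g2:
        return 0.0  # both strings shorter than n: treat as dissimilar
    return 2.0 * len(g1 & g2) / (len(g1) + len(g2))

def group_terms(terms, threshold=0.5, n=4):
    """Single-link grouping: similar terms end up in the same group."""
    groups = []
    for term in terms:
        merged = [term]
        remaining = []
        for group in groups:
            if any(dice(term, member, n) >= threshold for member in group):
                merged.extend(group)     # absorb every group the term links to
            else:
                remaining.append(group)
        groups = remaining + [merged]
    return groups

terms = ["NBA Basketball", "Basketball NBA", "Basketball", "Baseball"]
for number, group in enumerate(group_terms(terms), start=1):
    print(number, sorted(group))
```

With the four example terms and a threshold of 0.5, the three basketball terms fall into one group and Baseball into another, matching the scores listed above. The threshold would need tuning on real search terms.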
