How to extract literal words from a consecutive string efficiently? [duplicate]

How to extract literal words from a consecutive string efficiently? [duplicate] - python

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to split text without spaces into list of words?
There are masses of text information in people's comments which are parsed from html, but there are no delimiting characters in them. For example: thumbgreenappleactiveassignmentweeklymetaphor. Apparently, there are 'thumb', 'green', 'apple', etc. in the string. I also have a large dictionary to query whether the word is reasonable.
So, what's the fastest way to extract these words?

I'm not really sure a naive algorithm would serve your purpose well, as pointed out by eumiro, so I'll describe a slightly more complex one.
The idea
The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.
Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.
The code
import math
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k,math.log((i+1)*math.log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
return " ".join(reversed(out))
which you can use with
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
Examples
I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.
Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.
Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen
odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho
rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery
whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.
After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.
Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.
After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.
Optimization
The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.

"Apparently" is good for humans, not for computers…
words = set(possible words)
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
for i in xrange(len(s) - 1):
for j in xrange(1, len(s) - i):
if s[i:i+j] in words:
print s[i:i+j]
For possible words in /usr/share/dict/words and for j in xrange(3, len(s) - i): (minimal words length of 3), it finds:
thumb
hum
green
nap
apple
plea
lea
act
active
ass
assign
assignment
sign
men
twee
wee
week
weekly
met
eta
tap

Related

Genetic Algorithm / AI; basically, am I on the right track?

I know Python isn't the best idea to be writing any kind of software of this nature. My reasoning is to use this type of algorithm for a Raspberry Pi 3 in it's decision making (still unsure how that will go), and the libraries and APIs that I'll be using (Adafruit motor HATs, Google services, OpenCV, various sensors, etc) all play nicely for importing in Python, not to mention I'm just more comfortable in this environment for the rPi specifically. Already I've cursed it as object oriented such as Java or C++ just makes more sense to me, but Id rather deal with its inefficiencies and focus on the bigger picture of integration for the rPi.
I won't explain the code here, as it's pretty well documented in the comment sections throughout the script. My questions is as stated above; can this be considered basically a genetic algorithm? If not, what must it have to be a basic AI or genetic code? Am I on the right track for this type of problem solving? I know usually there are weighted variables and functions to promote "survival of the fittest", but that can be popped in as needed, I think.
I've read up quite a bit of forums and articles about this topic. I didn't want to copy someone else's code that I barely understand and start using it as a base for a larger project of mine; I want to know exactly how it works so I'm not confused as to why something isn't working out along the way. So, I just tried to comprehend the basic idea of how it works, and write how I interpreted it. Please remember I'd like to stay in Python for this. I know rPi's have multiple environments for C++, Java, etc, but as stated before, most hardware components I'm using have only Python APIs for implementation. if I'm wrong, explain at the algorithmic level, not just with a block of code (again, I really want to understand the process). Also, please don't nitpick code conventions unless it's pertinent to my problem, everyone has a style and this is just a sketch up for now. Here it is, and thanks for reading!
# Created by X3r0, 7/3/2016
# Basic genetic algorithm utilizing a two dimensional array system.
# the 'DNA' is the larger array, and the 'gene' is a smaller array as an element
# of the DNA. There exists no weighted algorithms, or statistical tracking to
# make the program more efficient yet; it is straightforwardly random and solves
# its problem randomly. At this stage, only the base element is iterated over.
# Basic Idea:
# 1) User inputs constraints onto array
# 2) Gene population is created at random given user constraints
# 3) DNA is created with randomized genes ( will never randomize after )
# a) Target DNA is created with loop control variable as data (basically just for some target structure)
# 4) CheckDNA() starts with base gene from DNA, and will recurse until gene matches the target gene
# a) Randomly select two genes from DNA
# b) Create a candidate gene by splicing both parent genes together
# c) Check candidate gene against the target gene
# d) If there exists a match in gene elements, a child gene is created and inserted into DNA
# e) If the child gene in DNA is not equal to target gene, recurse until it is
import random
DNAsize = 32
geneSize = 5
geneDiversity = 9
geneSplit = 4
numRecursions = 0
DNA = []
targetDNA = []
def init():
global DNAsize, geneSize, geneDiversity, geneSplit, DNA
print("This is a very basic form of genetic software. Input variable constraints below. "
"Good starting points are: DNA strand size (array size): 32, gene size (sub array size: 5, gene diversity (randomized 0 - x): 5"
"gene split (where to split gene array for splicing): 2")
DNAsize = int(input('Enter DNA strand size: '))
geneSize = int(input('Enter gene size: '))
geneDiversity = int(input('Enter gene diversity: '))
geneSplit = int(input('Enter gene split: '))
# initializes the gene population, and kicks off
# checkDNA recursion
initPop()
checkDNA(DNA[0])
def initPop():
# builds an array of smaller arrays
# given DNAsize
for x in range(DNAsize):
buildDNA()
# builds the goal array with a recurring
# numerical pattern, in this case just the loop
# control variable
buildTargetDNA(x)
def buildDNA():
newGene = []
# builds a smaller array (gene) using a given geneSize
# and randomized with vaules 0 - [given geneDiversity]
for x in range(geneSize):
newGene.append(random.randint(0,geneDiversity))
# append the built array to the larger array
DNA.append(newGene)
def buildTargetDNA(x):
# builds the target array, iterating with x as a loop
# control from the call in init()
newGene = []
for y in range(geneSize):
newGene.append(x)
targetDNA.append(newGene)
def checkDNA(childGene):
global numRecursions
numRecursions = numRecursions+1
gene = DNA[0]
targetGene = targetDNA[0]
parentGeneA = DNA[random.randint(0,DNAsize-1)] # randomly selects an array (gene) from larger array (DNA)
parentGeneB = DNA[random.randint(0,DNAsize-1)]
pos = random.randint(geneSplit-1,geneSplit+1) # randomly selects a position to split gene for splicing
candidateGene = parentGeneA[:pos] + parentGeneB[pos:] # spliced gene given split from parentA and parentB
print("DNA Splice Position: " + str(pos))
print("Element A: " + str(parentGeneA))
print("Element B: " + str(parentGeneB))
print("Candidate Element: " + str(candidateGene))
print("Target DNA: " + str(targetDNA))
print("Old DNA: " + str(DNA))
# iterates over the candidate gene and compares each element to the target gene
# if the candidate gene element hits a target gene element, the resulting child
# gene is created
for x in range(geneSize):
#if candidateGene[x] != targetGene[x]:
#print("false ")
if candidateGene[x] == targetGene[x]:
#print("true ")
childGene.pop(x)
childGene.insert(x, candidateGene[x])
# if the child gene isn't quite equal to the target, and recursion hasn't reached
# a max (apparently 900), the child gene is inserted into the DNA. Recursion occurs
# until the child gene equals the target gene, or max recursuion depth is exceeded
if childGene != targetGene and numRecursions < 900:
DNA.pop(0)
DNA.insert(0, childGene)
print("New DNA: " + str(DNA))
print(numRecursions)
checkDNA(childGene)
init()
print("Final DNA: " + str(DNA))
print("Number of generations (recursions): " + str(numRecursions))

I'm working with evolutionary computation right now so I hope my answer will be helpful for you, personally, I work with java, mostly because is one of my main languages, and for the portability, because I tested in linux, windows and mac. In my case I work with permutation encoding, but if you are still learning how GA works, I strongly recommend binary encoding. This is what I called my InitialPopulation. I try to describe my program's workflow:
1-. Set my main variables
This are PopulationSize, IndividualSize, MutationRate, CrossoverRate. Also you need to create an objective function and decide the crossover method you use. For this example lets say that my PopulationSize is equals to 50, the IndividualSize is 4, MutationRate is 0.04%, CrossoverRate is 90% and the crossover method will be roulette wheel.
My objective function only what to check if my Individuals are capable to represent the number 15 in binary, so the best individual must be 1111.
2-. Initialize my Population
For this I create 50 individuals (50 is given by my PopulationSize) with random genes.
3-. Loop starts
For each Individuals in Population you need to:
Evaluate fitness according to the objective function. If an Individual is represented by the next characters: 00100 this mean that his fitness is 1. As you can see this is a simple fitness function. You can create your own while you are learning, like fitness = 1/numberOfOnes. Also you need to assign the sum of all the fitness to a variable called populationFitness, this will be useful in the next step.
Select the best individuals. For this task there's a lot of methods you can use, but we will use the roulette wheel method as we say before. In this method, You assign a value to every individual inside your population. This value is given by the next formula: (fitness/populationFitness) * 100. So, if your population fitness is 10, and a certain individual fitness is 3, this mean that this individual has a 30% chance to be selected to make a crossover with another individual. Also, if another individual have a 4 in his fitness, his value will be 40%.
Apply crossover. Once you have the "best" individuals of your population, you need to create a new population. This new population is formed by others individuals of the previous population. For each individual you create a random number from 0 to 1. If this numbers is in the range of 0.9 (since our crossoverRate = 90%), this individual can reproduce, so you select another individual. Each new individual has this 2 parents who inherit his genes. For example:
Lets say that parentA = 1001 and parentB = 0111. We need to create a new individual with this genes. There's a lot of methods to do this, uniform crossover, single point crossover, two point crossover, etc. We will use the single point crossover. In this method we choose a random point between the first gene and the last gene. Then, we create a new individual according to the first genes of parentA and the last genes of parentB. In a visual form:
parentA = 1001
parentB = 0111
crossoverPoint = 2
newIndividual = 1011
As you can see, the new individual share his parents genes.
Once you have a new population with new individuals, you apply the mutation. In this case, for each individual in the new population generate a random number between 0 and 1. If this number is in the range of 0.04 (since our mutationRate = 0.04), you apply the mutation in a random gene. In binary encoding the mutation is just change the 1's for 0's or viceversa. In a visual form:
individual = 1011
randomPoint = 3
mutatedIndividual = 1010
Get the best individual
If this individual has reached the solution stop. Else, repeat the loop
End
As you can see, my english is not very good, but I hope you understand the basic idea of a genetic algorithm. If you are truly interested in learning this, you can check the following links:
http://www.obitko.com/tutorials/genetic-algorithms/
This link explains in a clearer way the basics of a genetic algorithm
http://natureofcode.com/book/chapter-9-the-evolution-of-code/
This book also explain what a GA is, but also provide some code in Processing, basically java. But I think you can understand.
Also I would recommend the following books:
An Introduction to Genetic Algorithms - Melanie Mitchell
Evolutionary algorithms in theory and practice - Thomas Bäck
Introduction to genetic algorithms - S. N. Sivanandam
If you have no money, you can easily find all this books in PDF.
Also, you can always search for articles in scholar.google.com
Almost all are free to download.

Just to add a bit to Alberto's great answer, you need to watch out for two issues as your solution evolves.
The first one is Over-fitting. This basically means that your solution is complex enough to "learn" all samples, but it is not applicable outside the training set. To avoid this, your need to make sure that the "amount" of information in your training set is a lot larger than the amount of information that can fit in your solution.
The second problem is Plateaus. There are cases where you would arrive at certain mediocre solutions that are nonetheless, good enough to "outcompete" any emerging solution, so your progress stalls (one way to see this is, if you see your fitness "stuck" at a certain, less than optimal number). One method for dealing with this is Extinctions: You could track the rate of improvement of your optimal solution, and if the improvement has been 0 for the last N generations, you just Nuke your population. (That is, delete your population and the list of optimal individuals and start over). Randomness will make it so that eventually the Solutions will surpass the Plateau.
Another thing to keep in mind, is that the default Random class is really bad at Randomness. I have had solutions improve dramatically by simply using something like the Mesernne Twister Random generator or a Hardware Entropy Generator.
I hope this helps. Good luck.

How to create all possible sentence of length 100 characters from a list of strings in Python

I am trying to create a sentence of length 100 characters from a given list of strings. The length has to be exactly one hundred characters. We also have to find all possible sentences using permutation. There has to be a space between each word, and duplicate words are not allowed. The list is given below:
['saintliness', 'wearyingly', 'shampoo', 'headstone', 'dripdry', 'elapse', 'redaction', 'allegiance', 'expressionless', 'awesomeness', 'hearkened', 'aloneness', 'beheld', 'courtship', 'swoops', 'memphis', 'attentional', 'pintsized', 'rustics', 'hermeneutics', 'dismissive', 'delimiting', 'proposes', 'between', 'postilion', 'repress', 'racecourse', 'matures', 'directions', 'bloodline', 'despairing', 'syrian', 'guttering', 'unsung', 'suspends', 'coachmen', 'usurpation', 'convenience', 'portal', 'deferentially', 'tarmacadam', 'underlay', 'lifetime', 'nudeness', 'influences', 'unicyclists', 'endangers', 'unbridled', 'kennedy', 'indian', 'reminiscent', 'ravish', 'republics', 'nucleic', 'acacia', 'redoubled', 'minnows', 'bucklers', 'decays', 'garnered', 'aussies', 'harshen', 'monogram', 'consignments', 'continuum', 'pinion', 'inception', 'immoderate', 'reiterated', 'hipster', 'stridently', 'relinquished', 'microphones', 'righthanders', 'ethereally', 'glutted', 'dandies', 'entangle', 'selfdestructive', 'selfrighteous', 'rudiments', 'spotlessly', 'comradeinarms', 'shoves', 'presidential', 'amusingly', 'schoolboys', 'phlogiston', 'teachable', 'letting', 'remittances', 'armchairs', 'besieged', 'monophthongs', 'mountainside', 'aweless', 'redialling', 'licked', 'shamming', 'eigenstate']
Approach:
My first approach is to use backtracking and permutations to generate all sentences. But I think the complexity will be too high since my list is so big.
Is there any other method I can use here or some inbuilt functions/packages I can use here? What will be best way in python to do this? Any pointers will be helpful here.

You can't do it.
Think about it: even for selecting 4 words you already have 100 × 99 × 98 × 97 possibilities, almost 100 million.
Given the length of your words at least 8 of them will fit in the sentence. There is 100 × 99 × 98 … × 93 possibilities. That's approximately 7×10^15, a totally infeasible number.

This problem is similar to the problem of partitioning in number theory.
The complexity of the problem can (presumably) be reduced using some of the constraints that are encoded in the problem statement:
The lengths of the words in the words list.
Repeats of word lengths: for example a word of length 8 is repeated X times.
Here's a possible general approach (would take some refining):
Find all partitions for the number 100 using only the lengths of the words in the words list. (You would start with word lengths and their repeats, and not by brute forcing all possible partitions.)
Filter out partitions that have repeat length values exceeding repeat length values for words in the list.
Apply combinations of words onto the partitions. A set of words of equal length will be mapped to length values in a partition. Say for example you have the partition (15+15+15+10+10+10+10+5+5+5) then you would generate combinations for all length 15 words over 3, length 10 words over 4, and length 5 words over 3. (I'm ignoring the space separation issue here).
Generate permutations of all the combinations over all the partitions.

Simplify a bit: Change all the strings from "xxx" to "xxx ". Then set the sentence length to 101. This allows you to use len(x) instead of len(x)+1 and eliminates the edge case for the last word in the sentence. As you traverse, and build the sentence left to right, you can eliminate words that would overflow the length, based on the sentence you've just constructed.
UPDATE:
Consider this to be a base n number problem where n is the number of words you have. Create a vector initialized with 0 [NOTE: it's only fixed size to illustrate]:
acc = [0, 0, 0, 0]
This is your "accumulator".
Now construct your sentence:
dict[acc[0]] + dict[acc[1]] + dict[acc[2]] + dict[acc[3]]
So, you get able able able able
Now increment the most significant "digit" in the acc. This is denoted by "curpos". Here curpos is 3.
[0, 0, 0, 1]
Now you get able able able baker
You keep bumping acc[curpos] until you hit [0, 0, 0, n] Now you've got a "carry out". "Go left" by decrementing curpos to 2. increment acc[curpos]. If it doesn't "carry out", "go right" by incrementing curpos and set acc[curpos] = 0. If you had gotten a carry out, you'd do a "go left" by decrementing curpos to 1.
This is a form of backtracking (e.g. the "go left"), but you don't need a tree. Just this acc vector and a state machine with three states: goleft, goright, test/trunc/output/inc.
After the "go right" curpos will be back to the "most significant" position. That is, the sentence length constructed from acc[0 to curpos - 1] (the length without adding the final word) is less than 100. If it's too long (e.g. it's already over 100), do a "go left". If it's too short (e.g. you've got to add another word to get near [enough] to 100), do a "go right"
When you get a carry out and curpos==0, you're done
I recently devised this as a solution to the "vampire number challenge" and the traversal you need is very similar.

I am not going to provide a complete solution, but I'll walk through my thinking.
Constraints:
A permutation of your complete list that exceeds 100 characters can be immediately thrown out. (Ok, 99 + len(longest_word)).)
You are essentially dealing with a subset of the power set of elements in your list.
Given that:
Build the power set, but discard any sentences that exceed your maximum
Filter the final set for sentences that exactly match your needs
So you can have the following:
def construct_sentences(dictionary: list, length: int) -> list:
if not dictionary:
return [(0, [])]
else:
word = dictionary[0]
word_length = len(word) + 1
subset_length = length - word_length
sentence_subset = construct_sentences(dictionary[1:], subset_length)
new_sentences = []
for sentence_length, sentence in sentence_subset:
if sentence_length + word_length <= length:
new_sentences = new_sentences + [(sentence_length + word_length, sentence + [word])]
return new_sentences + sentence_subset
I'm using tuples to write-aside the length of the list and make it easily available for comparison. The result of the above function will give you a list of sentences that are all less than the length (which is key when considering potential permutations: 100 is fairly short so there is a vast number of permutations that can be readily discarded). The next step would be to simply filter any sentence that isn't long enough (i.e. 100 characters).
Note that at this point you have every possible list filtering your criteria, but that list may be reordered 2^n ways. Still, that becomes a more manageable situation. With a list of 100 words, averaging under 9 characters a word, you have a average number of words in a sentence equal to 10. 2^10 isn't the worst situation in the world...
You'll have to modify it for your truncation case, of course, but this gets you in the ballpark. Unless I completely missed something, which is always possible. I doubly think something is wrong because running this produces a surprisingly short list.

Your problem size is way too large, but if 1) your actual problem is much smaller in scope, and/or 2) you have a lot of time and a very fast computer, you can generate these permutations using a recursive generator.
def f(string, list1):
for word in list1:
new_string = string + (' ' if string else '') + word
# If there are other constraints that will allow you to prune branches,
# you can add those conditions here and break out of the for loop
if len(new_string) >= 100:
yield new_string[:100]
else:
list2 = list1[:]
list2.remove(word)
for item in f(new_string, list2):
yield item
x = f('', list1)
for sentence in x:
check(sentence)
One caveat is that this may produce identical sentences if two words at the end get truncated to look the same.

Splitting long string without breaking words fulfilling lines

Before you think that it's duplicated (there are many question asking how to split long strings without breaking words) take in mind that my problem is a bit different: order is not important and I've to fit the words in order to use every line as much as possible.
I've a unordered set of words and I want to combine them without using more than 253 characters.
def compose(words):
result = " ".join(words)
if len(result) > 253:
pass # this should not happen!
return result
My problem is that I want to try to fill the line as much as possible. For example:
words = "a bc def ghil mno pq r st uv"
limit = 5 # max 5 characters
# This is good because it's the shortest possible list,
# but I don't know how could I get it
# Note: order is not important
good = ["a def", "bc pq", "ghil", "mno r", "st uv"]
# This is bad because len(bad) > len(good)
# even if the limit of 5 characters is respected
# This is equivalent to:
# bad = ["a bc", "def", "ghil", "mno", "pq r", "st uv"]
import textwrap
bad = textwrap.wrap(words, limit)
How could I do?

This is the bin packing problem; the solution is NP-hard, although there exist non-optimal heuristic algorithms, principally first fit decreasing and best fit decreasing. See https://github.com/type/Bin-Packing for implementations.

Non-optimal offline fast 1D bin packing Python algorithm
def binPackingFast(words, limit, sep=" "):
if max(map(len, words)) > limit:
raise ValueError("limit is too small")
words.sort(key=len, reverse=True)
res, part, others = [], words[0], words[1:]
for word in others:
if len(sep)+len(word) > limit-len(part):
res.append(part)
part = word
else:
part += sep+word
if part:
res.append(part)
return res
Performance
Tested over /usr/share/dict/words (provided by words-3.0-20.fc18.noarch) it can do half million words in a second on my slow dual core laptop, with an efficiency of at least 90% with those parameters:
limit = max(map(len, words))
sep = ""
With limit *= 1.5 I get 92%, with limit *= 2 I get 96% (same execution time).
Optimal (theoretical) value is calculated with: math.ceil(len(sep.join(words))/limit)
no efficient bin-packing algorithm can be guaranteed to do better
Source: http://mathworld.wolfram.com/Bin-PackingProblem.html
Moral of the story
While it's interesting to find the best solution, I think that for the most cases it would be much better to use this algorithm for 1D offline bin packing problems.
Resources
http://mathworld.wolfram.com/Bin-PackingProblem.html
https://github.com/hudora/pyShipping/
Notes
I didn't use textwrap for my implementation because it's slower than my simple Python code.
Maybe it's related with: Why are textwrap.wrap() and textwrap.fill() so slow?
It seems to work perfectly even if the sorting is not reversed.

detect allusions (e.g. very fuzzy matches) in language of inaugural addresses

I'm trying to develop a Python script to examine every sentence in Barack Obama's second inaugural address and find similar sentences in past inaugurals. I've developed a very crude fuzzy match, and I'm hoping to improve it.
I start by reducing all inaugurals to lists of stopword-free sentences. I then build a frequency index.
Next, I compare each sentence in Obama's 2013 address to each sentence of every other address, and evaluate the similarity like so:
#compare two lemmatized sentences. Assumes stop words already removed. frequencies is dict of frequencies across all inaugural
def compare(sentA, sentB, frequencies):
intersect = [x for x in sentA if x in sentB]
N = [frequencies[x] for x in intersect]
#calculate sum that weights uncommon words based on frequency inaugurals
n = sum([10.0 / (x + 1) for x in N])
#ratio of matches to total words in both sentences. (John Adams and William Harrison both favored loooooong sentences that tend to produce matches by sheer probability.)
c = float(len(intersect)) / (len(sentA) + len(sentB))
return (intersect, N, n, c)
Last, I filter out results based on arbitrary cutoffs for n and c.
It works better than one might think, identifying sentences that share uncommon words in a non-negligible proportion to total words.
For example, it picked up these matches:
Obama, 2013:
For history tells us that while these truths may be self-evident, they have never been self-executing; that while freedom is a gift from God, it must be secured by His people here on Earth.
Kennedy, 1961:
With a good conscience our only sure reward, with history the final judge of our deeds, let us go forth to lead the land we love, asking His blessing and His help, but knowing that here on earth God's work must truly be our own.
Obama, 2013
Through blood drawn by lash and blood drawn by sword, we learned that no union founded on the principles of liberty and equality could survive half-slave and half-free.
Lincoln, 1861
Yet, if God wills that it continue until all the wealth piled by the bondsman's two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said "the judgments of the Lord are true and righteous altogether.
Obama, 2013
This generation of Americans has been tested by crises that steeled our resolve and proved our resilience
Kennedy, 1961
Since this country was founded, each generation of Americans has been summoned to give testimony to its national loyalty.
But it's very crude.
I don't have the chops for a major machine-learning project, but I do want to apply more theory if possible. I understand bigram searching, but I'm not sure that will work here -- it's not so much exact bigrams we're interested in as general proximity of two words that are shared between quotes. Is there a fuzzy sentence comparison that looks at probability and distribution of words without being too rigid? The nature of allusion is that it's very approximate.
Current effort available on Cloud9IDE
UPDATE, 1/24/13
Per the accepted answer, here's a simple Python function for bigram windows:
def bigrams(tokens, blur=1):
grams = []
for c in range(len(tokens) - 1):
for i in range(c + 1, min(c + blur + 1, len(tokens))):
grams.append((tokens[c], tokens[i]))
return grams

If you are inspired to use bigrams, you could build your bigrams while allowing gaps of one, two, or even three words so as to loosen up the definition of bigram a little bit. This could work since allowing n gaps means not even n times as many "bigrams", and your corpus is pretty small. With this, for example, a "bigram" from your first paragraph could be (similar, inaugurals).

Generating a list of distinct (distant, by edit distance) words by filtering

I have a long (> 1000 items) list of words, from which I would like to remove words that are "too similar" to other words, until the remaining words are all "significantly different". For example, so that no two words are within an edit distance D.
I do not need a unique solution, and it doesn't have to be exactly optimal, but it should be reasonably quick (in Python) and not discard way too many entries.
How can I achieve this? Thanks.
Edit: to be clear, I can google for a python routine that measures edit distance. The problem is how to do this efficiently, and, perhaps, in some way that finds a "natural" value of D. Maybe by constructing some kind of trie from all words and then pruning?

You can use a bk-tree, and before each item is added check that it is not within distance D of any others (thanks to #DietrichEpp in the comments for this idea.
You can use this recipe for a bk-tree (though any similar recipes are easily modified). Simply make two changes: change the line:
def __init__(self, items, distance, usegc=False):
to
def __init__(self, items, distance, threshold=0, usegc=False):
And change the line
if el not in self.nodes: # do not add duplicates
to
if (el not in self.nodes and
(threshold == None or len(self.find(el, threshold)) == 0)):
This makes sure there are no duplicates when an item is added. Then, the code to remove duplicates from a list is simply:
from Levenshtein import distance
from bktree import BKtree
def remove_duplicates(lst, threshold):
tr = BKtree(iter(lst), distance, threshold)
return tr.nodes.keys()
Note that this relies on the python-Levenshtein package for its distance function, which is much faster than the one provided by bk-tree. python-Levenshtein has C-compiled components, but it's worth the installation.
Finally, I set up a performance test with an increasing number of words (grabbed randomly from /usr/share/dict/words) and graphed the amount of time each took to run:
import random
import time
from Levenshtein import distance
from bktree import BKtree
with open("/usr/share/dict/words") as inf:
word_list = [l[:-1] for l in inf]
def remove_duplicates(lst, threshold):
tr = BKtree(iter(lst), distance, threshold)
return tr.nodes.keys()
def time_remove_duplicates(n, threshold):
"""Test using n words"""
nwords = random.sample(word_list, n)
t = time.time()
newlst = remove_duplicates(nwords, threshold)
return len(newlst), time.time() - t
ns = range(1000, 16000, 2000)
results = [time_remove_duplicates(n, 3) for n in ns]
lengths, timings = zip(*results)
from matplotlib import pyplot as plt
plt.plot(ns, timings)
plt.xlabel("Number of strings")
plt.ylabel("Time (s)")
plt.savefig("number_vs_time.pdf")
Without confirming it mathematically, I don't think it's quadratic, and I think it might actually be n log n, which would make sense if inserting into a bk-tree is a log time operation. Most notably, it runs pretty quickly with under 5000 strings, which hopefully is the OP's goal (and it's reasonable with 15000, which a traditional for loop solution would not be).

Tries will not be helpful, nor will hash maps. They are simply not useful for spatial, high-dimensional problems like this one.
But the real problem here is the ill-specified requirement of "efficient". How fast is "efficient"?
import Levenshtein
def simple(corpus, distance):
words = []
while corpus:
center = corpus[0]
words.append(center)
corpus = [word for word in corpus
if Levenshtein.distance(center, word) >= distance]
return words
I ran this on 10,000 words selected uniformly from the "American English" dictionary I have on my hard drive, looking for sets with a distance of 5, yielding around 2,000 entries.
real 0m2.558s
user 0m2.404s
sys 0m0.012s
So, the question is, "How efficient is efficient enough"? Since you didn't specify your requirements, it's really hard for me to know if this algorithm works for you or not.
The rabbit hole
If you want something faster, here's how I would do it.
Create a VP tree, BK tree, or other suitable spatial index. For each word in the corpus, insert that word into the tree if it has a suitable minimum distance from every word in the index. Spatial indexes are specifically designed to support this kind of query.
At the end, you will have a tree containing nodes with the desired minimum distance.

Your trie thought is definitely and interesting one. This page has a great setup for fast edit distance calculations in a trie and would definitely be efficient if you needed to expand your wordlist to millions rather than a thousand, which is pretty small in the corpora linguistics business.
Good luck, it sounds like a fun representation of the problem!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.