Detect allusions (i.e. very fuzzy matches) in the language of inaugural addresses - Python

I'm trying to develop a Python script to examine every sentence in Barack Obama's second inaugural address and find similar sentences in past inaugurals. I've developed a very crude fuzzy match, and I'm hoping to improve it.
I start by reducing all inaugurals to lists of stopword-free sentences. I then build a frequency index.
Next, I compare each sentence in Obama's 2013 address to each sentence of every other address, and evaluate the similarity like so:
# Compare two lemmatized sentences. Assumes stop words have already been removed.
# `frequencies` is a dict of word frequencies across all inaugurals.
def compare(sentA, sentB, frequencies):
    intersect = [x for x in sentA if x in sentB]
    N = [frequencies[x] for x in intersect]
    # Weighted sum that rewards words that are uncommon across the inaugurals.
    n = sum([10.0 / (x + 1) for x in N])
    # Ratio of matches to total words in both sentences. (John Adams and William
    # Harrison both favored loooooong sentences that tend to produce matches by
    # sheer probability.)
    c = float(len(intersect)) / (len(sentA) + len(sentB))
    return (intersect, N, n, c)
Last, I filter out results based on arbitrary cutoffs for n and c.
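Roughly, the driver loop looks something like this (the cutoff values shown here are just placeholders, not tuned numbers):

N_CUTOFF, C_CUTOFF = 20.0, 0.1

def find_allusions(obama_sents, older_addresses, frequencies):
    """older_addresses maps a label like 'Kennedy 1961' to its list of
    lemmatized, stopword-free sentences."""
    hits = []
    for sentA in obama_sents:
        for label, sentences in older_addresses.items():
            for sentB in sentences:
                intersect, N, n, c = compare(sentA, sentB, frequencies)
                if n >= N_CUTOFF and c >= C_CUTOFF:
                    hits.append((n, c, label, sentA, sentB))
    # Strongest candidates first.
    return sorted(hits, reverse=True)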
It works better than one might think, identifying sentences that share uncommon words in a non-negligible proportion to total words.
For example, it picked up these matches:
Obama, 2013:
For history tells us that while these truths may be self-evident, they have never been self-executing; that while freedom is a gift from God, it must be secured by His people here on Earth.
Kennedy, 1961:
With a good conscience our only sure reward, with history the final judge of our deeds, let us go forth to lead the land we love, asking His blessing and His help, but knowing that here on earth God's work must truly be our own.
Obama, 2013:
Through blood drawn by lash and blood drawn by sword, we learned that no union founded on the principles of liberty and equality could survive half-slave and half-free.
Lincoln, 1865:
Yet, if God wills that it continue until all the wealth piled by the bondsman's two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said "the judgments of the Lord are true and righteous altogether."
Obama, 2013:
This generation of Americans has been tested by crises that steeled our resolve and proved our resilience.
Kennedy, 1961:
Since this country was founded, each generation of Americans has been summoned to give testimony to its national loyalty.
But it's very crude.
I don't have the chops for a major machine-learning project, but I do want to apply more theory if possible. I understand bigram searching, but I'm not sure that will work here -- it's not so much exact bigrams we're interested in as general proximity of two words that are shared between quotes. Is there a fuzzy sentence comparison that looks at probability and distribution of words without being too rigid? The nature of allusion is that it's very approximate.
Current effort available on Cloud9IDE
UPDATE, 1/24/13
Per the accepted answer, here's a simple Python function for bigram windows:
def bigrams(tokens, blur=1):
    grams = []
    for c in range(len(tokens) - 1):
        for i in range(c + 1, min(c + blur + 1, len(tokens))):
            grams.append((tokens[c], tokens[i]))
    return grams

If you are inspired to use bigrams, you could build your bigrams while allowing gaps of one, two, or even three words so as to loosen up the definition of bigram a little bit. This could work, since allowing gaps of up to n words yields only about n times as many "bigrams", and your corpus is pretty small. With this, for example, a "bigram" from your first paragraph could be (similar, inaugurals).
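For example, with the bigrams() helper from the update above and a slightly wider window, you get skip pairs like these (quick interpreter check):

>>> from pprint import pprint
>>> tokens = ['find', 'similar', 'sentences', 'in', 'past', 'inaugurals']
>>> pprint(bigrams(tokens, blur=2))
[('find', 'similar'),
 ('find', 'sentences'),
 ('similar', 'sentences'),
 ('similar', 'in'),
 ('sentences', 'in'),
 ('sentences', 'past'),
 ('in', 'past'),
 ('in', 'inaugurals'),
 ('past', 'inaugurals')]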

Related

Getting rid of duplicates in text strings in new column by identifying differences in original data and using this difference in new column

I have a somewhat more general question about the process of working with text data.
My goal is to create UNIQUE short labels/descriptions for products from existing long descriptions, based on specific rules.
In practice it looks like this: I get the data that you see in the Existing_Long_Description column and, based on rules and loops in Python, I change it into the data in the New_Label column.
Existing_Long_Description                                                          | New_Label
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm blac   | Edge protector BLACK RNG 1-2MM L=10M
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm red    | Edge protector RED RNG 1-2MM L=10M
Shortening to the desired format is not a problem. The problem starts when checking the uniqueness of the New_Label column: due to the shortening, I might create duplicates:
Existing_Long_Description                                                      | New_Label
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1     | Draw-in collet chuck dm 1-10MM
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6     | Draw-in collet chuck dm 1-10MM
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=8     | Draw-in collet chuck dm 1-10MM
To solve this I need to add some distinguishing factor to my New_Label column, based on the difference in Existing_Long_Description.
The complication is that a collision might not be between just two articles but an unknown number of them.
I thought about the following process:
1. Identify the duplicates in Existing_Long_Description: if there are duplicates there, I know those can't be resolved in New_Label.
2. Identify the duplicates in the New_Label column that are not in the selection above: I know these can be resolved.
3. For those that can be resolved, run some distinguisher to find where they differ and extract that difference into another column, to decide later what to add to the New_Label column.
Does what I want to do make sense? As I am doing this for the first time, I am wondering: is there a way of working that you would recommend?
I have read some articles, such as Find the similarity metric between two strings, and elsewhere on Stack Overflow I read about https://docs.python.org/3/library/difflib.html, which I am planning to use, but it still feels rather inefficient to me, and maybe someone here can help.
Thanks!
A relational database would be a good fit for this problem,
with appropriate UNIQUE indexes configured.
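For illustration only, a minimal sqlite3 sketch of that idea (the table and column names here are invented):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        long_description  TEXT NOT NULL UNIQUE,
        new_label         TEXT NOT NULL UNIQUE
    )
""")

def try_insert(long_description, new_label):
    """Returns True if stored, False if either column collides with an existing row."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO product (long_description, new_label) VALUES (?, ?)",
                (long_description, new_label),
            )
        return True
    except sqlite3.IntegrityError:
        return False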
But let's assume you're going to solve it in memory, rather than on disk.
Assume that get_longs() will read long descriptions from your data source.
dup long descriptions
Avoid processing like this:
longs = []
for long in get_longs():
    if long not in longs:
        longs.append(long)
Why?
It is quadratic, running in O(N^2) time, for N descriptions.
Each in takes linear O(N) time,
and we perform N such operations on the list.
To process 1000 parts would regrettably require a million operations.
Instead, take care to use an appropriate data structure, a set:
longs = set(get_longs())
That's enough to quickly de-dup the long descriptions, in linear time.
dup short descriptions
Now the fun begins.
You explained that you already have a function that works like a champ.
But we must adjust its output in the case of collisions.
class Dedup:
    def __init__(self):
        self.short_to_long = {}

    def get_shorts(self):
        """Produces unique short descriptions."""
        for long in sorted(set(get_longs())):
            short = summary(long)
            orig_long = self.short_to_long.get(short)
            if orig_long:
                short = self.deconflict(short, orig_long, long)
            self.short_to_long[short] = long
            yield short

    def deconflict(self, short, orig_long, long):
        """Produces a novel short description that won't conflict with existing ones."""
        for word in sorted(set(long.split()) - set(orig_long.split())):
            short += f' {word}'
            if short not in self.short_to_long:  # Yay, we win!
                return short
        # Boo, we lose.
        raise ValueError(f"Sorry, can't find a good description: {short}\n{orig_long}\n{long}")
The expression that subtracts one set from another is answering the question,
"What words in long would help me to uniqueify this result?"
Now of course, some of them may have already been used
by other short descriptions, so we take care to check for that.
Given several long descriptions
that collide in the way you're concerned about,
the 1st one will have the shortest description,
and ones appearing later will tend to have longer "short" descriptions.
The approach above is a bit simplistic, but it should get you started.
It does not, for example, distinguish between "claw hammer" and "hammer claw".
Both strings survive initial uniqueification,
but then there are no more words to help with deconflicting.
For your use case the approach above is likely to be "good enough".
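As a quick smoke test against the colliding rows above (get_longs() and summary() here are throwaway stubs standing in for your real data source and shortening rules):

def get_longs():
    return [
        "Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1",
        "Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6",
        "Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=8",
    ]

def summary(long):
    # Stand-in for your real shortening rules.
    return "Draw-in collet chuck dm 1-10MM"

for short in Dedup().get_shorts():
    print(short)
# Draw-in collet chuck dm 1-10MM
# Draw-in collet chuck dm 1-10MM L=6
# Draw-in collet chuck dm 1-10MM L=8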

Too many lists of Unique Words

This is a homework project from last week. I had problems, so I did not turn it in, but I like to go back and see if I can make these work. Now that I have it printing the right words in alphabetical order, I have the problem that it prints 3 separate lists of unique words, each with a different number of words. How can I fix this?
import string

def process_line(line_str, word_set):
    line_str = line_str.strip()
    list_of_words = line_str.split()
    for word in list_of_words:
        if word != "--":
            word = word.strip()
            word = word.strip(string.punctuation)
            word = word.lower()
            word_set.add(word)

def pretty_print(word_set):
    list_of_words = []
    for w in word_set:
        list_of_words.append(w)
    list_of_words.sort()
    for w in list_of_words:
        print(w, end=" ")

word_set = set([])
fObject = open("gettysburg.txt")
for line_str in fObject:
    process_line(line_str, word_set)
    print("\nlength of the word set: ", len(word_set))
    print("\nUnique words in set: ")
    pretty_print(word_set)
Below is the output I get; I only want it to give me the last one, with the 138 words. I appreciate any help.
length of the word set: 29
Unique words in set:
a ago all and are brought conceived continent created dedicated equal fathers forth four in liberty men nation new on our proposition score seven that the this to years
length of the word set: 71
Unique words in set:
a ago all altogether and any are as battlefield brought can civil come conceived continent created dedicate dedicated do endure engaged equal fathers field final fitting for forth four gave great have here in is it liberty live lives long men met might nation new now of on or our place portion proper proposition resting score seven should so testing that the their this those to war we whether who years
length of the word set: 138
Unique words in set:
a above add advanced ago all altogether and any are as battlefield be before birth brave brought but by can cause civil come conceived consecrate consecrated continent created dead dedicate dedicated detract devotion did died do earth endure engaged equal far fathers field final fitting for forget forth fought four freedom from full gave god government great ground hallow have here highly honored in increased is it larger last liberty little live lives living long measure men met might nation never new nobly nor not note now of on or our people perish place poor portion power proper proposition rather remaining remember resolve resting say score sense seven shall should so struggled take task testing that the their these they this those thus to under unfinished us vain war we what whether which who will work world years
Move the last 3 lines out of the for loop (dedent them), so they run once after the whole file has been processed:
....
for line_str in fObject:
    process_line(line_str, word_set)

print("\nlength of the word set: ", len(word_set))
print("\nUnique words in set: ")
pretty_print(word_set)

Calculating the Letter Frequency in Python

I need to define a function that will slice a string according to a certain character, sum up those indices, divide by the number of times the character occurs in the string and then divide all that by the length of the text.
Here's what I have so far:
def ave_index(char):
    passage = "string"
    if char in passage:
        word = passage.split(char)
        words = len(word)
        number = passage.count(char)
        answer = word / number / len(passage)
        return(answer)
    elif char not in passage:
        return False
So far, the answers I've gotten when running this have been quite off the mark.
EDIT: The passage we were given to use as a string -
'Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.'
when char = 's' the answer should be 0.5809489252885479
You can use Counter to check frequencies:
from collections import Counter
words = 'Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people\'s hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.'
freqs = Counter(list(words)) # list(words) returns a list of all the characters in words, then Counter will calculate the frequencies
print(float(freqs['s']) / len(words))
The problem is how you are counting the letters. Take the string 'hello world' and suppose you are trying to count how many l there are. We know there are 3, but if you split on 'l':
>>> s.split('l')
['he', '', 'o wor', 'd']
Taking the length of that list gives a count of 4. Further, we have to get the position of each instance of the character in the string.
The enumerate built-in helps us out here:
>>> s = 'hello world'
>>> c = 'l' # The letter we are looking for
>>> results = [k for k,v in enumerate(s) if v == c]
>>> results
[2, 3, 9]
Now we have the total number of occurrences len(results), and the positions in the string where the letter occurs.
The final "trick" to this problem is to make sure you divide by a float, in order to get the proper result.
Working against your sample text (stored in s):
>>> c = 's'
>>> results = [k for k,v in enumerate(s) if v == c]
>>> results_sum = sum(results)
>>> (results_sum / len(results)) / float(len(s))
0.5804132973944295
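Putting those pieces together, a fixed-up ave_index() might look something like this (a sketch; passing the passage in as an argument is my own choice):

def ave_index(char, passage):
    positions = [k for k, v in enumerate(passage) if v == char]
    if not positions:              # the character never occurs
        return False
    # Sum of indices / number of occurrences / length of the text.
    return sum(positions) / float(len(positions)) / len(passage)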

Comparing similarity between multiple strings with a random starting point

I have a bunch of people's names that are tied to their respective identifying numbers (e.g. Social Security Number/National ID/Passport Number). Due to duplication, though, one identity number can have up to 100 names, which could be similar or totally different. E.g. ID 221 could have the names Richard Parker, Mary Parker, Aunt May, Parker Richard, M#rrrrryy Richard, etc. Some are typos, but some are totally different names.
Initially, I want to display only 3 (or a similarly small number) of the names that are as different as possible from the rest, so as to alert the viewer that the multiple names may not just be typos but could even be a case of identity theft, negligent data capture, or anything else!
I've read up on algorithms to detect similarity and am currently looking at this one, which lets you compute a score: a score of 1 means the two strings are the same, while a lower score means they are dissimilar. In my use case, how can I go through, say, the 100 names and display the 3 that are most dissimilar? The algorithm for that just escapes my mind, as I feel like I need a starting point, then have to look and compare among all the others, and loop again, etc.
Take the function from https://stackoverflow.com/a/14631287/1082673 as you mentioned and iterate over all combinations in your list. This will work if you don't have too many entries; otherwise the computation time can increase pretty fast…
Here is how to generate the pairs for a given list:
import itertools

persons = ['person1', 'person2', 'person3']

for p1, p2 in itertools.combinations(persons, 2):
    print "Compare", p1, "and", p2

How to extract literal words from a consecutive string efficiently? [duplicate]

Possible Duplicate:
How to split text without spaces into list of words?
There are masses of text information in people's comments which are parsed from html, but there are no delimiting characters in them. For example: thumbgreenappleactiveassignmentweeklymetaphor. Apparently, there are 'thumb', 'green', 'apple', etc. in the string. I also have a large dictionary to query whether the word is reasonable.
So, what's the fastest way to extract these words?
I'm not really sure a naive algorithm would serve your purpose well, as pointed out by eumiro, so I'll describe a slightly more complex one.
The idea
The best way to proceed is to model the distribution of the output. A good first approximation is to assume all words are independently distributed. Then you only need to know the relative frequency of all words. It is reasonable to assume that they follow Zipf's law, that is the word with rank n in the list of words has probability roughly 1/(n log N) where N is the number of words in the dictionary.
Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows.
The code
import math

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("words-by-frequency.txt").read().split()
wordcost = dict((k, math.log((i+1)*math.log(len(words)))) for i, k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k, c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s)+1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))
which you can use with
s = 'thumbgreenappleactiveassignmentweeklymetaphor'
print(infer_spaces(s))
Examples
I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.
Before: thumbgreenappleactiveassignmentweeklymetaphor.
After: thumb green apple active assignment weekly metaphor.
Before: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen
odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho
rapparentlytherearethumbgreenappleetcinthestringialsohavealargedictionarytoquery
whetherthewordisreasonablesowhatsthefastestwayofextractionthxalot.
After: there is masses of text information of peoples comments which is parsed from html but there are no delimited characters in them for example thumb green apple active assignment weekly metaphor apparently there are thumb green apple etc in the string i also have a large dictionary to query whether the word is reasonable so what s the fastest way of extraction thx a lot.
Before: itwasadarkandstormynighttherainfellintorrentsexceptatoccasionalintervalswhenitwascheckedbyaviolentgustofwindwhichsweptupthestreetsforitisinlondonthatoursceneliesrattlingalongthehousetopsandfiercelyagitatingthescantyflameofthelampsthatstruggledagainstthedarkness.
After: it was a dark and stormy night the rain fell in torrents except at occasional intervals when it was checked by a violent gust of wind which swept up the streets for it is in london that our scene lies rattling along the housetops and fiercely agitating the scanty flame of the lamps that struggled against the darkness.
As you can see it is essentially flawless. The most important part is to make sure your word list was trained to a corpus similar to what you will actually encounter, otherwise the results will be very bad.
Optimization
The implementation consumes a linear amount of time and memory, so it is reasonably efficient. If you need further speedups, you can build a suffix tree from the word list to reduce the size of the set of candidates.
If you need to process a very large consecutive string it would be reasonable to split the string to avoid excessive memory usage. For example you could process the text in blocks of 10000 characters plus a margin of 1000 characters on either side to avoid boundary effects. This will keep memory usage to a minimum and will have almost certainly no effect on the quality.
"Apparently" is good for humans, not for computers…
words = set(possible_words)  # possible_words: your dictionary of valid words
s = 'thumbgreenappleactiveassignmentweeklymetaphor'

for i in xrange(len(s) - 1):
    for j in xrange(1, len(s) - i):
        if s[i:i+j] in words:
            print s[i:i+j]
With possible_words taken from /usr/share/dict/words and with for j in xrange(3, len(s) - i): (a minimum word length of 3), it finds:
thumb
hum
green
nap
apple
plea
lea
act
active
ass
assign
assignment
sign
men
twee
wee
week
weekly
met
eta
tap
