I'm trying to find a good fuzzy string matching algorithm. Direct matching doesn't work for me: unless the strings are 100% identical, the match fails. Levenshtein distance doesn't work well either, because it operates at the character level. I'm looking for something along the lines of word-level matching, e.g.
String A: The quick brown fox.
String B: The quick brown fox jumped over the lazy dog.
These should match, as all the words in string A are in string B.
Now, this is an oversimplified example, but does anyone know a good fuzzy string matching algorithm that works at the word level?
I like Drew's answer.
You can use difflib to find the longest match:
>>> a = 'The quick brown fox.'
>>> b = 'The quick brown fox jumped over the lazy dog.'
>>> import difflib
>>> s = difflib.SequenceMatcher(None, a, b)
>>> s.find_longest_match(0,len(a),0,len(b))
Match(a=0, b=0, size=19) # returns NamedTuple (new in v2.6)
Or pick some minimum matching threshold. Example:
>>> difflib.SequenceMatcher(None, a, b).ratio()
0.61538461538461542
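Since SequenceMatcher works on any sequence of hashable items, not just characters, you can also feed it word lists to get a word-level ratio (my addition, not part of the original answer):
>>> difflib.SequenceMatcher(None, a.split(), b.split()).ratio()  # roughly 0.46 here; 'fox.' and 'fox' don't match because of the period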
Take a look at this Python library, which SeatGeek open-sourced yesterday. Obviously most of these kinds of problems are very context dependent, but it might help you.
from fuzzywuzzy import fuzz
s1 = "the quick brown fox"
s2 = "the quick brown fox jumped over the lazy dog"
s3 = "the fast fox jumped over the hard-working dog"
fuzz.partial_ratio(s1, s2)
> 100
fuzz.token_set_ratio(s2, s3)
> 73
SeatGeek website and GitHub repo
If all you want to do is test whether or not all the words in one string appear in the other, that's a one-liner:
if not [word for word in b.split(' ') if word not in a.split(' ')]:
    print('Match!')
If you want to score them instead of a binary test, why not just do something like:
((# of matching words) / (# of words in bigger string)) * ((# of words in smaller string) / (# of words in bigger string))?
If you wanted to, you could get fancier and do fuzzy match on each string.
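For what it's worth, here is a minimal sketch of that scoring idea (my own code, not part of the answer above; it assumes plain whitespace tokenization and counts unique words via sets):

def word_overlap_score(a, b):
    # Naive whitespace tokenization; punctuation handling is left out for brevity.
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    smaller, bigger = sorted([words_a, words_b], key=len)
    matching = len(smaller & bigger)
    # (matching words / words in bigger) * (words in smaller / words in bigger)
    return (matching / len(bigger)) * (len(smaller) / len(bigger))

print(word_overlap_score("The quick brown fox", "The quick brown fox jumped over the lazy dog"))  # 0.25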
You can try this Python package, which does fuzzy name matching with machine learning.
pip install hmni
Initialize a Matcher Object
import hmni
matcher = hmni.Matcher(model='latin')
Single Pair Similarity
matcher.similarity('Alan', 'Al')
# 0.6838303319889133
matcher.similarity('Alan', 'Al', prob=False)
# 1
matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133
Note: I did not build this package. I'm sharing it here because it was quite useful for my use case.
GitHub
You could modify the Levenshtein algorithm to compare words rather than characters. It's not a very complex algorithm and the source is available in many languages online.
Levenshtein works by comparing two arrays of chars. There is no reason that the same logic could not be applied against two arrays of strings.
I did this some time ago in C#; my previous question about it is here. There is a starter algorithm there for your interest, and you can easily translate it to Python.
The ideas you should use when writing your own algorithm are something like this (a rough Python sketch follows the list):
Have a list with the original "titles" (the words/sentences you want to match against).
Each title item should have a minimal match score per word/sentence; below that, ignore the title.
You should also have a global minimal match percentage for the final result.
You should calculate the Levenshtein distance for each word-to-word pair.
You should increase the total match weight if the words appear in the same order ("quick brown" vs. "quick brown" should definitely have a higher weight than "quick brown" vs. "brown quick").
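Here is a rough Python translation of those ideas (my own sketch, not the original C# code; I use difflib's ratio as the per-word similarity, and min_word_score and order_bonus are illustrative parameters):

import difflib

def word_similarity(w1, w2):
    # Character-level similarity for a single pair of words (0.0 - 1.0).
    return difflib.SequenceMatcher(None, w1.lower(), w2.lower()).ratio()

def title_match_score(title, candidate, min_word_score=0.8, order_bonus=0.1):
    title_words = title.split()
    cand_words = candidate.split()
    if not title_words or not cand_words:
        return 0.0
    total = 0.0
    last_pos = -1
    for tw in title_words:
        best_score, best_pos = max((word_similarity(tw, cw), pos) for pos, cw in enumerate(cand_words))
        if best_score < min_word_score:
            best_score = 0.0               # ignore weak word matches
        elif best_pos > last_pos:
            best_score += order_bonus      # reward words that appear in the same order
            last_pos = best_pos
        total += best_score
    return total / len(title_words)

print(title_match_score("quick brown", "The quick brown fox"))   # roughly 1.1 with these defaults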
You can try FuzzySearchEngine from https://github.com/frazenshtein/fastcd/blob/master/search.py.
This fuzzy search only supports searching for words, and it has a fixed admissible error per word (one substitution or a transposition of two adjacent characters).
For example, you can try something like:
import search
string = "Chapter I. The quick brown fox jumped over the lazy dog."
substr = "the qiuck broqn fox."
def fuzzy_search_for_sentences(substr, string):
    start = None
    pos = 0
    for word in substr.split(" "):
        if not word:
            continue
        match = search.FuzzySearchEngine(word).search(string, pos=pos)
        if not match:
            return None
        if start is None:
            start = match.start()
        pos = match.end()
    return start
print(fuzzy_search_for_sentences(substr, string))
11 will be printed
Levenshtein should work OK if you compare words (strings separated by sequences of stop characters) instead of individual letters.
def ld(s1, s2):  # Levenshtein Distance
    len1 = len(s1)+1
    len2 = len(s2)+1
    lt = [[0 for i2 in range(len2)] for i1 in range(len1)]  # lt - levenshtein_table
    lt[0] = list(range(len2))
    i = 0
    for l in lt:
        l[0] = i
        i += 1
    for i1 in range(1, len1):
        for i2 in range(1, len2):
            if s1[i1-1] == s2[i2-1]:
                v = 0
            else:
                v = 1
            lt[i1][i2] = min(lt[i1][i2-1]+1, lt[i1-1][i2]+1, lt[i1-1][i2-1]+v)
    return lt[-1][-1]
str1 = "The quick brown fox"
str2 = "The quick brown fox jumped over the lazy dog"
print("{} words need to be added, deleted or replaced to convert string 1 into string 2".format(ld(str1.split(),str2.split())))
I'm trying to compare the output of a speech-to-text API with a ground truth transcription. What I'd like to do is capitalize the words in the ground truth which the speech-to-text API either missed or misinterpreted.
For Example:
Truth:
The quick brown fox jumps over the lazy dog.
Speech-to-text Output:
the quick brown box jumps over the dog
Desired Result:
The quick brown FOX jumps over the LAZY dog.
My initial instinct was to remove the capitalization and punctuation from the ground truth and use difflib. This gets me an accurate diff, but I'm having trouble mapping the output back to positions in the original text. I would like to keep the ground truth capitalization and punctuation to display the results, even if I'm only interested in word errors.
Is there any way to express difflib output as word-level changes on an original text?
I would also like to suggest a solution using difflib, but I'd prefer using regular expressions for word detection, since they will be more precise and more tolerant of weird characters and other issues.
I've added some weird text to your original strings to show what I mean:
import re
import difflib
truth = 'The quick! brown - fox jumps, over the lazy dog.'
speech = 'the quick... brown box jumps. over the dog'
truth = re.findall(r"[\w']+", truth.lower())
speech = re.findall(r"[\w']+", speech.lower())
for d in difflib.ndiff(truth, speech):
    print(d)
Output
  the
  quick
  brown
- fox
+ box
  jumps
  over
  the
- lazy
  dog
Another possible output:
diff = difflib.unified_diff(truth, speech)
print(''.join(diff))
Output
---
+++
@@ -1,9 +1,8 @@
the quick brown-fox+box jumps over the-lazy dog
Why not just split the sentence into words then use difflib on those?
import difflib
truth = 'The quick brown fox jumps over the lazy dog.'.lower().strip('.').split()
speech = 'the quick brown box jumps over the dog'.lower().strip('.').split()
for d in difflib.ndiff(truth, speech):
    print(d)
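If you also want to map the diff back onto the original, capitalized ground truth (the desired result in the question), here is a rough sketch I'd try (my own addition, not part of this answer; it assumes the whitespace tokens of the truth line up one-to-one with the normalized words), using SequenceMatcher.get_opcodes():

import re
import difflib

truth = 'The quick brown fox jumps over the lazy dog.'
speech = 'the quick brown box jumps over the dog'

def norm(token):
    # Lowercase and strip punctuation so tokens can be compared word by word.
    return re.sub(r"[^\w']", '', token).lower()

truth_tokens = truth.split()                      # keeps capitalization and punctuation
truth_words = [norm(t) for t in truth_tokens]
speech_words = [norm(t) for t in speech.split()]

out = list(truth_tokens)
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, truth_words, speech_words).get_opcodes():
    if tag in ('replace', 'delete'):              # words the speech-to-text missed or misheard
        for i in range(i1, i2):
            out[i] = out[i].upper()
print(' '.join(out))                              # The quick brown FOX jumps over the LAZY dog.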
So I think I've solved the problem. I realised that difflib's context_diff provides indices of lines that have changes in them. To get the indices for the "ground truth" text, I remove the capitalization/punctuation, split the text into individual words, and then do the following:
altered_word_indices = []
diff = difflib.context_diff(transformed_ground_truth, transformed_hypothesis, n=0)
for line in diff:
    if line.startswith('*** ') and line.endswith(' ****\n'):
        line = line.replace(' ', '').replace('\n', '').replace('*', '')
        if ',' in line:
            split_line = line.split(',')
            for i in range(0, (int(split_line[1]) - int(split_line[0])) + 1):
                altered_word_indices.append((int(split_line[0]) + i) - 1)
        else:
            altered_word_indices.append(int(line) - 1)
Following this, I print it out with the changed words capitalized:
split_ground_truth = ground_truth.split(' ')
for i in range(0, len(split_ground_truth)):
    if i in altered_word_indices:
        print(split_ground_truth[i].upper(), end=' ')
    else:
        print(split_ground_truth[i], end=' ')
This allows me to print out "The quick brown FOX jumps over the LAZY dog." (capitalization / punctuation included) instead of "the quick brown FOX jumps over the LAZY dog".
This is...not a super elegant solution, and it's subject to testing, cleanup, error handling, etc. But it seems like a decent start and is potentially useful for someone else running into the same problem. I'll leave this question open for a few days in case someone comes up with a less gross way of getting the same result.
I have a string that has around 10 lines of text. What I am trying to do is find a sentence that contains a specific word (or words), and display the word that follows.
Example String:
The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat
I want the script to search for 'The slow', then print the following word, so in this case, 'donkey'.
I have tried using the find() function, but that just prints the location of the word(s).
Example code:
sSearch = output.find("destination-pattern")
print(sSearch)
Any help would be greatly appreciated.
output = "The slow donkey brown fox"
patt = "The slow"
sSearch = output.find(patt)
print(output[sSearch+len(patt)+1:].split(' ')[0])
output:
donkey
You could work with regular expressions. Python has a built-in library called re.
Example usage:
s = "The slow donkey some more text"
finder = "The slow"
idx_finder_end = s.find(finder) + len(finder)
next_word_match = re.match(r"\s\w*\s", s[idx_finder_end:])
next_word = next_word_match.group().strip()
# donkey
I would do it using regular expressions (re module) following way:
import re
txt = '''The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat'''
words = re.findall(r'(?<=The slow) (\w*)',txt)
print(words) # prints ['donkey']
Note that words is now a list of words; if you are sure that there is exactly one word to be found, you could then do:
word = words[0]
print(word) # prints donkey
Explanation: I used a so-called lookbehind assertion in the first argument of re.findall, which means I am looking for something that comes after The slow. \w* means any substring consisting of letters, digits, and underscores (_). I enclosed it in a group (the brackets) because the space before it is not part of the word and should not appear in the result.
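Alternatively (my variation, not part of the answer above), you can put the space inside the lookbehind and drop the group entirely:
words = re.findall(r'(?<=The slow )\w+', txt)
print(words)  # ['donkey']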
You can do it using regular expressions:
>>> import re
>>> r=re.compile(r'The slow\s+\b(\w+)\b')
>>> r.match('The slow donkey')[1]
'donkey'
>>>
I have a message of text and a list of terms. I'd like to create an array that shows which terms are found in the message. For instance:
message = "the quick brown fox jumps over the lazy dog"
terms = ["quick", "fox", "horse", "Lorem", "Ipsum", "the"]
result = idealMethod(message, terms)
=> [1,1,0,0,0,1]
Because "quick" was the first item in the list of terms and was also in the message a 1 is placed in the first position in the result. Here's another example:
message2 = "Every Lorem has a fox"
result2 = idealMethod(message2, terms)
=> [0,1,0,1,0,0]
Update:
The terms need to be exact matches. For instance, if my search terms include sam, I don't want a match for same.
I think you want:
words = set(message.split(" "))
result = [int(word in words) for word in terms]
Note that split() splits on whitespace by default, so you could leave out the " ".
You can use a list comprehension and leverage the fact that True evaluates to 1 in a numeric context:
words = set(message.split())
result = [int(term in words) for term in terms]
Out[24]: [1, 1, 0, 0, 0, 1]
EDIT changed to look only for whole word matches, after clarification.
I have a list L of around 40,000 phrases and a document of around 10 million words. What I want to check is which pairs of these phrases co-occur within a window of 4 words. For example, consider L = ["brown fox", "lazy dog"]. The document contains the words "a quick brown fox jumps over the lazy dog". I want to see how many times "brown fox" and "lazy dog" appear within a window of four words and store that count in a file. I have the following code for doing this:
content=open("d.txt","r").read().replace("\n"," ");
for i in range(len(L)):
for j in range(i+1,len(L)):
wr=L[i]+"\W+(?:\w+\W+){1,4}"+L[j]
wrev=L[j]+"\W+(?:\w+\W+){1,4}"+L[i]
phrasecoccur=len(re.findall(wr, content))+len(re.findall(wrev,content))
if (phrasecoccur>0):
f.write(L[i]+", "+L[j]+", "+str(phrasecoccur)+"\n")
Essentially, for each pair of phrases in the list L, I am checking how many times those phrases appear within a window of 4 words in the document content. However, this method is computationally inefficient when the list L is pretty large, like 40K elements. Is there a better way of doing this?
You could use something similar to the Aho-Corasick string matching algorithm. Build the state machine from your list of phrases. Then start feeding words into the state machine. Whenever a match occurs, the state machine will tell you which phrase matched and at what word number. So your output would be something like:
"brown fox", 3
"lazy dog", 8
etc.
You can either capture all of the output and post-process it, or you can process the matches as they're found.
It takes a little time to build the state machine (a few seconds for 40,000 phrases), but after that it's linear in the number of input tokens, number of phrases, and number of matches.
I used something similar to match 50 million YouTube video titles against the several million song titles and artist names in the MusicBrainz database. Worked great. And very fast.
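For reference, here is a rough sketch of that approach using the pyahocorasick package (my choice of library; the answer itself doesn't name one). It scans the raw text rather than a word stream, then converts each match's character offset back to a word number:

import bisect
import re
import ahocorasick   # pip install pyahocorasick

phrases = ["brown fox", "lazy dog"]
doc = "a quick brown fox jumps over the lazy dog"

# Build the automaton once from the full phrase list.
automaton = ahocorasick.Automaton()
for phrase in phrases:
    automaton.add_word(phrase, phrase)
automaton.make_automaton()

# Character offset of each word start, so matches can be reported by word number.
word_starts = [m.start() for m in re.finditer(r'\S+', doc)]

# Note: iter() matches raw substrings; add word-boundary checks if you need them.
for end_index, phrase in automaton.iter(doc):
    start_index = end_index - len(phrase) + 1
    word_number = bisect.bisect_right(word_starts, start_index)   # 1-based word position
    print(phrase, word_number)   # prints: brown fox 3, lazy dog 8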
It should be possible to assemble your 40000 phrases into a big regular expression pattern, and use that to match against your document. It might not be as fast as something more job-specific, but it does work. Here's how I'd do it:
import re
class Matcher(object):
    def __init__(self, phrases):
        phrase_pattern = "|".join("(?:{})".format(phrase) for phrase in phrases)
        gap_pattern = r"\W+(?:\w+\W+){0,4}?"
        full_pattern = "({0}){1}({0})".format(phrase_pattern, gap_pattern)
        self.regex = re.compile(full_pattern)

    def match(self, doc):
        return self.regex.findall(doc)  # or use finditer to generate match objs
Here's how you can use it:
>>> L = ["brown fox", "lazy dog"]
>>> matcher = Matcher(L)
>>> doc = "The quick brown fox jumps over the lazy dog."
>>> matcher.match(doc)
[('brown fox', 'lazy dog')]
This solution does have a few limitations. One is that it won't detect overlapping pairs of phrases. So in the example, if you added the phrase "jumps over" to the phrase list, you would still only get one matched pair, ("brown fox", "jumps over"). It would miss both ("brown fox", "lazy dog") and ("jumps over", "lazy dog"), since they include some of the same words.
Expanding on Joel's answer, your iterator could be something like this:
def doc_iter(doc):
    words = doc[0:4]
    yield words
    for i in range(4, len(doc)):   # start at 4 so the first window isn't repeated
        words = words[1:]
        words.append(doc[i])
        yield words
Put your phrases in a dict and use the iterator over the doc, checking for the phrases at each iteration. This should give you performance between O(n) and O(n*log(n)).
I'm having trouble with the NLTK under Python, specifically the .generate() method.
generate(self, length=100)
Print random text, generated using a trigram language model.
Parameters:
* length (int) - The length of text to generate (default=100)
Here is a simplified version of what I am attempting.
import nltk
words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print text.generate(3)
This will always generate
Building ngram index...
The quick brown
None
As opposed to building a random phrase out of the words.
Here is my output when I do
print text.generate()
Building ngram index...
The quick brown fox jumps over the lazy dog fox jumps over the lazy
dog dog The quick brown fox jumps over the lazy dog dog brown fox
jumps over the lazy dog over the lazy dog The quick brown fox jumps
over the lazy dog fox jumps over the lazy dog lazy dog The quick brown
fox jumps over the lazy dog the lazy dog The quick brown fox jumps
over the lazy dog jumps over the lazy dog over the lazy dog brown fox
jumps over the lazy dog quick brown fox jumps over the lazy dog The
None
Again starting out with the same text, but then varying it. I've also tried using the first chapter from Orwell's 1984. Again that always starts with the first 3 tokens (one of which is a space in this case) and then goes on to randomly generate text.
What am I doing wrong here?
To generate random text, you need to use Markov chains.
Code to do that (from here):
import random
class Markov(object):

    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()

    def file_to_words(self):
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words

    def triples(self):
        """ Generates triples from the given data string. So if our string were
        "What a lovely day", we'd generate (What, a, lovely) and then
        (a, lovely, day).
        """
        if len(self.words) < 3:
            return
        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i+1], self.words[i+2])

    def database(self):
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

    def generate_markov_text(self, size=25):
        seed = random.randint(0, self.word_size-3)
        seed_word, next_word = self.words[seed], self.words[seed+1]
        w1, w2 = seed_word, next_word
        gen_words = []
        for i in xrange(size):
            gen_words.append(w1)
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
        gen_words.append(w2)
        return ' '.join(gen_words)
Explanation:
Generating pseudo random text with Markov chains using Python
You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.
In the case of Orwell's 1984 you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens) and then feed each sentence separately to the Markov model. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
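As a concrete sketch of that preprocessing step (the corpus path is a placeholder, and feeding the result to the model depends on which Markov implementation you use):

import nltk

# nltk.download('punkt') may be needed the first time
raw = open('1984.txt').read()    # placeholder path for your corpus
sentences = [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(raw)]
# 'sentences' is a list of token lists; train the Markov model on each
# sentence separately so that sentence starts are sampled properly.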
Your sample corpus is most likely too small. I don't know exactly how NLTK builds its trigram model, but it is common practice to handle the beginnings and ends of sentences somehow. Since there is only one sentence beginning in your corpus, this might be why every generated sentence has the same beginning.
Are you sure that using word_tokenize is the right approach?
This Google groups page has the example:
>>> import nltk
>>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
>>> text.generate()
But I've never used nltk, so I can't say whether that works the way you want.
Maybe you can shuffle the tokens array randomly before generating a sentence.