How to highlight (only) word errors using difflib? - python

I'm trying to compare the output of a speech-to-text API with a ground truth transcription. What I'd like to do is capitalize the words in the ground truth which the speech-to-text API either missed or misinterpreted.
For Example:
Truth:
The quick brown fox jumps over the lazy dog.
Speech-to-text Output:
the quick brown box jumps over the dog
Desired Result:
The quick brown FOX jumps over the LAZY dog.
My initial instinct was to remove the capitalization and punctuation from the ground truth and use difflib. This gets me an accurate diff, but I'm having trouble mapping the output back to positions in the original text. I would like to keep the ground truth capitalization and punctuation to display the results, even if I'm only interested in word errors.
Is there any way to express difflib output as word-level changes on an original text?

I would also like to suggest a solution using difflib but I'd prefer using RegEx for word detection since it will be more precise and more tolerant to weird characters and other issues.
I've added some weird text to your original strings to show what I mean:
import re
import difflib
truth = 'The quick! brown - fox jumps, over the lazy dog.'
speech = 'the quick... brown box jumps. over the dog'
truth = re.findall(r"[\w']+", truth.lower())
speech = re.findall(r"[\w']+", speech.lower())
for d in difflib.ndiff(truth, speech):
print(d)
Output
the
quick
brown
- fox
+ box
jumps
over
the
- lazy
dog
Another possible output:
diff = difflib.unified_diff(truth, speech)
print(''.join(diff))
Output
---
+++
## -1,9 +1,8 ##
the quick brown-fox+box jumps over the-lazy dog

Why not just split the sentence into words then use difflib on those?
import difflib
truth = 'The quick brown fox jumps over the lazy dog.'.lower().strip(
'.').split()
speech = 'the quick brown box jumps over the dog'.lower().strip('.').split()
for d in difflib.ndiff(truth, speech):
print(d)

So I think I've solved the problem. I realised that difflib's "contextdiff" provides indices of lines that have changes in them. To get the indices for the "ground truth" text, I remove the capitalization / punctuation, split the text into individual words, and then do the following:
altered_word_indices = []
diff = difflib.context_diff(transformed_ground_truth, transformed_hypothesis, n=0)
for line in diff:
if line.startswith('*** ') and line.endswith(' ****\n'):
line = line.replace(' ', '').replace('\n', '').replace('*', '')
if ',' in line:
split_line = line.split(',')
for i in range(0, (int(split_line[1]) - int(split_line[0])) + 1):
altered_word_indices.append((int(split_line[0]) + i) - 1)
else:
altered_word_indices.append(int(line) - 1)
Following this, I print it out with the changed words capitalized:
split_ground_truth = ground_truth.split(' ')
for i in range(0, len(split_ground_truth)):
if i in altered_word_indices:
print(split_ground_truth[i].upper(), end=' ')
else:
print(split_ground_truth[i], end=' ')
This allows me to print out "The quick brown FOX jumps over the LAZY dog." (capitalization / punctuation included) instead of "the quick brown FOX jumps over the LAZY dog".
This is...not a super elegant solution, and it's subject to testing, cleanup, error handling, etc. But it seems like a decent start and is potentially useful for someone else running into the same problem. I'll leave this question open for a few days in case someone comes up with a less gross way of getting the same result.

Related

How to extract the text between a word and its next occurrence?

I have the following sample text:
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\begin{itemize}
\item Item 1
\item Item 2
\end{itemize}
\end{document}'''
What I'm trying to accomplish is to create a regex expression which will extract the following lines from mystr:
['This is introduction paragraph','This is non-introduction paragraph',' This is sample section paragraph\n \begin{itemize}\n\item Item 1\n\item Item 2\n\end{itemize}']
For any reason you need to use regex. Perhaps the splitting string is more involved than just "a". The re module has a split function too:
import re
str_ = "a quick brown fox jumps over a lazy dog than a quick elephant"
print(re.split(r'\s?\ba\b\s?',str_))
# ['', 'quick brown fox jumps over', 'lazy dog than', 'quick elephant']
EDIT: expanded answer with the new information you provided...
After your edit in which you write a better description of your problem and you include a text that looks like LaTeX, I think you need to extract those lines that do not start with a \, which are the latex commands. In other words, you need the lines with only text. Try the following, always using regular expressions:
import re
mystr = r'''\documentclass[12pt]{article}
\usepackage{amsmath}
\title{\LaTeX}
\begin{document}
\section{Introduction}
This is introduction paragraph
\section{Non-Introduction}
This is non-introduction paragraph
\section{Sample section}
This is sample section paragraph
\end{document}'''
pattern = r"^[^\\]*\n"
matches = re.findall(pattern, mystr, flags=re.M)
print(matches)
# ['This is introduction paragraph\n', 'This is non-introduction paragraph\n', 'This is sample section paragraph\n']
You can use the split method from str:
my_string = "a quick brown fox jumps over a lazy dog than a quick elephant"
word = "a "
my_string.split(word)
Results in:
['', 'quick brown fox jumps over ', 'lazy dog than ', 'quick elephant']
Note: Don't use str as a variable name as it is a keyword in Python.

Python - removing some punctuation from text

I want Python to remove only some punctuation from a string, let's say I want to remove all the punctuation except '#'
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Here the output is
The quick brown fox like totally jumped man
But what I want is something like this
The quick brown fox like totally jumped #man
Is there a way to selectively remove punctuation from a text leaving out the punctuation that we want in the text intact?
str.punctuation contains all the punctuations. Remove # from it. Then replace with '' whenever you get that punctuation string.
>>> import re
>>> a = string.punctuation.replace('#','')
>>> re.sub(r'[{}]'.format(a),'','The quick brown fox, like, totally jumped, #man!')
'The quick brown fox like totally jumped #man'
Just remove the character you don't want to touch from the replacement string:
import string
remove = dict.fromkeys(map(ord, '\n' + string.punctuation.replace('#','')))
sample = 'The quick brown fox, like, totally jumped, #man!'
sample.translate(remove)
Also note that I changed '\n ' to '\n', as the former will remove spaces from your string.
Result:
The quick brown fox like totally jumped #man

How do I specify token ordering in pyparsing?

Suppose I'm parsing the following line:
The quick brown fox jumps over the lazy dog
I'd like to parse this as:
Words('The quick brown fox') + Literal('jumps') + Words('over the lazy dog')
My current pyparsing definition is:
some_words = OneOrMore(Word(alphas))
jumps = Literal('jumps')
sentence = some_words + jumps + some_words
What's happening is that the some_words swallows up the 'jumps', and I get a parsing error. How do I make pyparsing lex the jumps as a literal token?
You are already thinking like the parser, since you understand that OneOrMore(Word(alphas)) keeps going, even to reading the word "jumps". Now turn that around and write the parser to do things the way you think.
For every word up to "jumps", how do you know that it should be added to the leading set of words? You know for each word because it is not the word "jumps". Pyparsing does not automatically do this lookahead, but you can do it for yourself with NotAny (which can be abbreviated using the '~' operator):
JUMPS = Literal("jumps")
some_words = OneOrMore(~JUMPS + Word(alphas))
Now before matching another word, some_words first verifies that the word is not "jumps".

What is a simple fuzzy string matching algorithm in Python?

I'm trying to find some sort of a good, fuzzy string matching algorithm. Direct matching doesn't work for me — this isn't too good because unless my strings are a 100% similar, the match fails. The Levenshtein method doesn't work too well for strings as it works on a character level. I was looking for something along the lines of word level matching e.g.
String A: The quick brown fox.
String B: The quick brown fox jumped
over the lazy dog.
These should match as all words in
string A are in string B.
Now, this is an oversimplified example but would anyone know a good, fuzzy string matching algorithm that works on a word level.
I like Drew's answer.
You can use difflib to find the longest match:
>>> a = 'The quick brown fox.'
>>> b = 'The quick brown fox jumped over the lazy dog.'
>>> import difflib
>>> s = difflib.SequenceMatcher(None, a, b)
>>> s.find_longest_match(0,len(a),0,len(b))
Match(a=0, b=0, size=19) # returns NamedTuple (new in v2.6)
Or pick some minimum matching threshold. Example:
>>> difflib.SequenceMatcher(None, a, b).ratio()
0.61538461538461542
Take a look at this python library, which SeatGeek open-sourced yesterday. Obviously most of these kinds of problems are very context dependent, but it might help you.
from fuzzywuzzy import fuzz
s1 = "the quick brown fox"
s2 = "the quick brown fox jumped over the lazy dog"
s3 = "the fast fox jumped over the hard-working dog"
fuzz.partial_ratio(s1, s2)
> 100
fuzz.token_set_ratio(s2, s3)
> 73
SeatGeek website
and Github repo
If all you want to do is to test whether or not all the words in a string match another string, that's a one liner:
if not [word for word in b.split(' ') if word not in a.split(' ')]:
print 'Match!'
If you want to score them instead of a binary test, why not just do something like:
((# of matching words) / (# of words in bigger string)) *
((# of words in smaller string) / (# of words in bigger string))
?
If you wanted to, you could get fancier and do fuzzy match on each string.
You can try this python package which uses fuzzy name matching with machine learning.
pip install hmni
Initialize a Matcher Object
import hmni
matcher = hmni.Matcher(model='latin')
Single Pair Similarity
matcher.similarity('Alan', 'Al')
# 0.6838303319889133
matcher.similarity('Alan', 'Al', prob=False)
# 1
matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133
Note: I have not built this package. Sharing it here because it was quite useful for my use.
GitHub
You could modify the Levenshtein algorithm to compare words rather than characters. It's not a very complex algorithm and the source is available in many languages online.
Levenshtein works by comparing two arrays of chars. There is no reason that the same logic could not be applied against two arrays of strings.
I did this some time ago with C#, my previous question is here. There is starter algorith for your interest, you can easily transform it to python.
Ideas you should use writing your own
algorithm is something like this:
Have a list with original "titles" (words/sentences you want to match
with).
Each title item should have minimal match score on word/sentence, ignore
title as well.
You also should have global minimal match percentage of final result.
You should calculate each word - word Levenshtein distance.
You should increase total match weight if words are going in the same
order (quick brown vs quick brown,
should have definitively higher weight
than quick brown vs. brown quick.)
You can try FuzzySearchEngine from https://github.com/frazenshtein/fastcd/blob/master/search.py.
This fuzzy search supports only search for words and has a fixed admissible error for the word (only one substitution or transposition of two adjacent characters).
However, for example you can try something like:
import search
string = "Chapter I. The quick brown fox jumped over the lazy dog."
substr = "the qiuck broqn fox."
def fuzzy_search_for_sentences(substr, string):
start = None
pos = 0
for word in substr.split(" "):
if not word:
continue
match = search.FuzzySearchEngine(word).search(string, pos=pos)
if not match:
return None
if start is None:
start = match.start()
pos = match.end()
return start
print(fuzzy_search_for_sentences(substr, string))
11 will be printed
Levenshtein should work ok if you compare words (strings separated by sequences of stop charactes) instead of individual letters.
def ld(s1, s2): # Levenshtein Distance
len1 = len(s1)+1
len2 = len(s2)+1
lt = [[0 for i2 in range(len2)] for i1 in range(len1)] # lt - levenshtein_table
lt[0] = list(range(len2))
i = 0
for l in lt:
l[0] = i
i += 1
for i1 in range(1, len1):
for i2 in range(1, len2):
if s1[i1-1] == s2[i2-1]:
v = 0
else:
v = 1
lt[i1][i2] = min(lt[i1][i2-1]+1, lt[i1-1][i2]+1, lt[i1-1][i2-1]+v)
return lt[-1][-1]
str1 = "The quick brown fox"
str2 = "The quick brown fox jumped over the lazy dog"
print("{} words need to be added, deleted or replaced to convert string 1 into string 2".format(ld(str1.split(),str2.split())))

Generating random sentences from custom text in Python's NLTK?

I'm having trouble with the NLTK under Python, specifically the .generate() method.
generate(self, length=100)
Print random text, generated using a trigram language model.
Parameters:
* length (int) - The length of text to generate (default=100)
Here is a simplified version of what I am attempting.
import nltk
words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print text.generate(3)
This will always generate
Building ngram index...
The quick brown
None
As opposed to building a random phrase out of the words.
Here is my output when I do
print text.generate()
Building ngram index...
The quick brown fox jumps over the lazy dog fox jumps over the lazy
dog dog The quick brown fox jumps over the lazy dog dog brown fox
jumps over the lazy dog over the lazy dog The quick brown fox jumps
over the lazy dog fox jumps over the lazy dog lazy dog The quick brown
fox jumps over the lazy dog the lazy dog The quick brown fox jumps
over the lazy dog jumps over the lazy dog over the lazy dog brown fox
jumps over the lazy dog quick brown fox jumps over the lazy dog The
None
Again starting out with the same text, but then varying it. I've also tried using the first chapter from Orwell's 1984. Again that always starts with the first 3 tokens (one of which is a space in this case) and then goes on to randomly generate text.
What am I doing wrong here?
To generate random text, U need to use Markov Chains
code to do that: from here
import random
class Markov(object):
def __init__(self, open_file):
self.cache = {}
self.open_file = open_file
self.words = self.file_to_words()
self.word_size = len(self.words)
self.database()
def file_to_words(self):
self.open_file.seek(0)
data = self.open_file.read()
words = data.split()
return words
def triples(self):
""" Generates triples from the given data string. So if our string were
"What a lovely day", we'd generate (What, a, lovely) and then
(a, lovely, day).
"""
if len(self.words) < 3:
return
for i in range(len(self.words) - 2):
yield (self.words[i], self.words[i+1], self.words[i+2])
def database(self):
for w1, w2, w3 in self.triples():
key = (w1, w2)
if key in self.cache:
self.cache[key].append(w3)
else:
self.cache[key] = [w3]
def generate_markov_text(self, size=25):
seed = random.randint(0, self.word_size-3)
seed_word, next_word = self.words[seed], self.words[seed+1]
w1, w2 = seed_word, next_word
gen_words = []
for i in xrange(size):
gen_words.append(w1)
w1, w2 = w2, random.choice(self.cache[(w1, w2)])
gen_words.append(w2)
return ' '.join(gen_words)
Explaination:
Generating pseudo random text with Markov chains using Python
You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.
In the case of Orwell's 1984 you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens) and then feed each sentence separately to the Markov model. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
Your sample corpus is most likely to be too small. I don't know how exactly nltk builds its trigram model but it is common practice that beginning and end of sentences are handled somehow. Since there is only one beginning of sentence in your corpus this might be the reason why every sentence has the same beginning.
Are you sure that using word_tokenize is the right approach?
This Google groups page has the example:
>>> import nltk
>>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
>>> text.generate()
But I've never used nltk, so I can't say whether that works the way you want.
Maybe you can sort the tokens array randomly before generating a sentence.

Categories