I'm having trouble with NLTK under Python, specifically the .generate() method.
generate(self, length=100)
Print random text, generated using a trigram language model.
Parameters:
* length (int) - The length of text to generate (default=100)
Here is a simplified version of what I am attempting.
import nltk
words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print text.generate(3)
This will always generate
Building ngram index...
The quick brown
None
As opposed to building a random phrase out of the words.
Here is my output when I do
print text.generate()
Building ngram index...
The quick brown fox jumps over the lazy dog fox jumps over the lazy
dog dog The quick brown fox jumps over the lazy dog dog brown fox
jumps over the lazy dog over the lazy dog The quick brown fox jumps
over the lazy dog fox jumps over the lazy dog lazy dog The quick brown
fox jumps over the lazy dog the lazy dog The quick brown fox jumps
over the lazy dog jumps over the lazy dog over the lazy dog brown fox
jumps over the lazy dog quick brown fox jumps over the lazy dog The
None
Again it starts out with the same text, but then varies it. I've also tried using the first chapter of Orwell's 1984. Again, it always starts with the first 3 tokens (one of which is a space in this case) and then goes on to randomly generate text.
What am I doing wrong here?
To generate random text, you need to use Markov chains.
Code to do that, from here:
import random

class Markov(object):

    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()

    def file_to_words(self):
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words

    def triples(self):
        """ Generates triples from the given data string. So if our string were
        "What a lovely day", we'd generate (What, a, lovely) and then
        (a, lovely, day).
        """
        if len(self.words) < 3:
            return
        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i+1], self.words[i+2])

    def database(self):
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

    def generate_markov_text(self, size=25):
        seed = random.randint(0, self.word_size - 3)
        seed_word, next_word = self.words[seed], self.words[seed + 1]
        w1, w2 = seed_word, next_word
        gen_words = []
        for i in range(size):  # the original Python 2 code used xrange
            gen_words.append(w1)
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
        gen_words.append(w2)
        return ' '.join(gen_words)
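A minimal usage sketch (the corpus file name here is hypothetical; any plain-text file will do):

with open('corpus.txt') as f:  # hypothetical plain-text corpus
    markov = Markov(f)
    print(markov.generate_markov_text(25))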
Explanation:
Generating pseudo random text with Markov chains using Python
You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.
In the case of Orwell's 1984 you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens) and then feed each sentence separately to the Markov model. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
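For instance, a minimal sketch of that preprocessing (it assumes NLTK's punkt tokenizer data is installed, and the file name is hypothetical):

import nltk

raw = open('1984_chapter1.txt').read()  # hypothetical file
sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(raw)]
# 'sentences' is a list of token lists, one per sentence; feed each inner
# list to the model separately so sequence starts are sampled correctly.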
Your sample corpus is most likely too small. I don't know exactly how nltk builds its trigram model, but it is common practice that the beginnings and ends of sentences are handled somehow. Since there is only one sentence beginning in your corpus, this might be the reason why every sentence has the same beginning.
Are you sure that using word_tokenize is the right approach?
This Google groups page has the example:
>>> import nltk
>>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
>>> text.generate()
But I've never used nltk, so I can't say whether that works the way you want.
Maybe you can shuffle the token list randomly before generating a sentence.
Related
I'm given a list of annotations. The start and end position of each annotation correspond to the distance from the start of the corpus, but I want the positions to correspond to the distance from the start of the current sentence.
Input: (Spans that indicate "fox" are from 10-13, 35-38, 50-53)
example = DocumentMulti(
    "The brown fox jumped over the lazy fox. The brown fox jumped over the lazy dog",
    [Span(10, 13, "fox", None), Span(35, 38, "fox", None), Span(50, 53, "fox", None)]
)
Output
sentencize_pubtator(example)
[DocumentMulti(text='The brown fox jumped over the lazy fox.', denotations=[Span(start=10, end=13, mention='fox', descriptor_record=None), Span(start=35, end=38, mention='fox', descriptor_record=None)]),
DocumentMulti(text='The brown fox jumped over the lazy dog', denotations=[Span(start=10, end=13, mention='fox', descriptor_record=None)])]
I am trying to:
Split an abstract into multiple sentences
Match each given annotation to its respective sentence
Update the start and end character positions to be relative to the sentence rather than the abstract. In other words, the positions should represent the offset from the beginning of the sentence.
My current code looks un-Pythonic, and I was hoping there was another method that doesn't involve explicit loops. I was hoping to use a combination of groupby() to aggregate the spans based on the start and end indices of each sentence and map() to update the start and end of each span, but I cannot get a solution.
Colab Notebook with code
from dataclasses import dataclass
from typing import List

from spacy.lang.en import English

@dataclass
class Span:
    start: int
    end: int
    mention: str
    descriptor_record: DescriptorRecord  # DescriptorRecord is defined elsewhere in the linked notebook

@dataclass
class DocumentMulti:
    text: str
    denotations: List[Span]

def sentencize_pubtator(document):
    # Splits sentences
    nlp = English()
    nlp.add_pipe("sentencizer")
    doc = nlp(document.text)
    # Dataframe for sentence-level annotations
    df = []
    # Index of abstract-annotation
    i = 0
    for sentence in doc.sents:
        sent_denotations = []
        # (Step 2) Loops through each abstract-annotation that corresponds with the current sentence
        while (
            i < len(document.denotations)
            and int(document.denotations[i].start) >= sentence.start_char
            and int(document.denotations[i].end) <= sentence.end_char
        ):
            # (Step 3) Updates abstract-positions to sentence-positions
            document.denotations[i].start -= sentence.start_char
            document.denotations[i].end -= sentence.start_char
            sent_denotations.append(document.denotations[i])
            # Go to next abstract-annotation
            i += 1
        # Appends single sentence and corresponding annotations
        df.append(DocumentMulti(str(sentence), sent_denotations))
    return df
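For what it's worth, here is a loop-free sketch in the direction the question asks about (hypothetical, reusing the classes above; it assumes integer offsets and that each span falls entirely within a single sentence):

def sentencize_pubtator_nested(document):
    nlp = English()
    nlp.add_pipe("sentencizer")
    doc = nlp(document.text)
    # One DocumentMulti per sentence; the inner comprehension keeps only the
    # spans inside that sentence and shifts them to sentence-relative offsets.
    return [
        DocumentMulti(
            sent.text,
            [Span(s.start - sent.start_char, s.end - sent.start_char,
                  s.mention, s.descriptor_record)
             for s in document.denotations
             if sent.start_char <= s.start and s.end <= sent.end_char],
        )
        for sent in doc.sents
    ]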
I'm trying to compare the output of a speech-to-text API with a ground truth transcription. What I'd like to do is capitalize the words in the ground truth which the speech-to-text API either missed or misinterpreted.
For Example:
Truth:
The quick brown fox jumps over the lazy dog.
Speech-to-text Output:
the quick brown box jumps over the dog
Desired Result:
The quick brown FOX jumps over the LAZY dog.
My initial instinct was to remove the capitalization and punctuation from the ground truth and use difflib. This gets me an accurate diff, but I'm having trouble mapping the output back to positions in the original text. I would like to keep the ground truth capitalization and punctuation to display the results, even if I'm only interested in word errors.
Is there any way to express difflib output as word-level changes on an original text?
I would also like to suggest a solution using difflib, but I'd prefer using RegEx for word detection since it will be more precise and more tolerant of weird characters and other issues.
I've added some weird text to your original strings to show what I mean:
import re
import difflib
truth = 'The quick! brown - fox jumps, over the lazy dog.'
speech = 'the quick... brown box jumps. over the dog'
truth = re.findall(r"[\w']+", truth.lower())
speech = re.findall(r"[\w']+", speech.lower())
for d in difflib.ndiff(truth, speech):
    print(d)
Output
  the
  quick
  brown
- fox
+ box
  jumps
  over
  the
- lazy
  dog
Another possible output:
diff = difflib.unified_diff(truth, speech)
print(''.join(diff))
Output
---
+++
@@ -1,9 +1,8 @@
the quick brown-fox+box jumps over the-lazy dog
Why not just split the sentence into words then use difflib on those?
import difflib

truth = 'The quick brown fox jumps over the lazy dog.'.lower().strip('.').split()
speech = 'the quick brown box jumps over the dog'.lower().strip('.').split()
for d in difflib.ndiff(truth, speech):
    print(d)
So I think I've solved the problem. I realised that difflib's context_diff provides indices of lines that have changes in them. To get the indices for the "ground truth" text, I remove the capitalization / punctuation, split the text into individual words, and then do the following:
altered_word_indices = []

diff = difflib.context_diff(transformed_ground_truth, transformed_hypothesis, n=0)
for line in diff:
    if line.startswith('*** ') and line.endswith(' ****\n'):
        line = line.replace(' ', '').replace('\n', '').replace('*', '')
        if ',' in line:
            split_line = line.split(',')
            for i in range(0, (int(split_line[1]) - int(split_line[0])) + 1):
                altered_word_indices.append((int(split_line[0]) + i) - 1)
        else:
            altered_word_indices.append(int(line) - 1)
Following this, I print it out with the changed words capitalized:
split_ground_truth = ground_truth.split(' ')
for i in range(0, len(split_ground_truth)):
    if i in altered_word_indices:
        print(split_ground_truth[i].upper(), end=' ')
    else:
        print(split_ground_truth[i], end=' ')
This allows me to print out "The quick brown FOX jumps over the LAZY dog." (capitalization / punctuation included) instead of "the quick brown FOX jumps over the LAZY dog".
This is...not a super elegant solution, and it's subject to testing, cleanup, error handling, etc. But it seems like a decent start and is potentially useful for someone else running into the same problem. I'll leave this question open for a few days in case someone comes up with a less gross way of getting the same result.
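One possibly less gross alternative (a sketch, untested against the code above) is SequenceMatcher.get_opcodes(), which reports changed index ranges on the word lists directly:

import difflib

matcher = difflib.SequenceMatcher(None, transformed_ground_truth, transformed_hypothesis)
altered_word_indices = set()
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != 'equal':  # tag is one of 'replace', 'delete', 'insert', 'equal'
        altered_word_indices.update(range(i1, i2))  # indices into the ground truth words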
I am trying to use the word2vec module from gensim natural language processing library in Python.
The docs say to initialize the model:
from gensim.models import word2vec
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
What format does gensim expect for the input sentences? I have raw text
"the quick brown fox jumps over the lazy dogs"
"Then a cop quizzed Mick Jagger's ex-wives briefly."
etc.
What additional processing do I need to do to pass my raw text into word2vec?
UPDATE: Here is what I have tried. When it loads the sentences, I get nothing.
>>> sentences = ['the quick brown fox jumps over the lazy dogs',
"Then a cop quizzed Mick Jagger's ex-wives briefly."]
>>> x = word2vec.Word2Vec()
>>> x.build_vocab([s.encode('utf-8').split() for s in sentences])
>>> x.vocab
{}
A list of utf-8 sentences. You can also stream the data from the disk.
Make sure it's utf-8, and split it:
sentences = [ "the quick brown fox jumps over the lazy dogs",
"Then a cop quizzed Mick Jagger's ex-wives briefly." ]
word2vec.Word2Vec([s.encode('utf-8').split() for s in sentences], size=100, window=5, min_count=5, workers=4)
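A likely reason x.vocab came back empty in the question's update: gensim's min_count defaults to 5, so every word in a two-sentence toy corpus gets discarded. A sketch for testing (min_count lowered on purpose):

from gensim.models import Word2Vec

sentences = [s.split() for s in [
    "the quick brown fox jumps over the lazy dogs",
    "Then a cop quizzed Mick Jagger's ex-wives briefly.",
]]
model = Word2Vec(sentences, min_count=1)  # the default min_count=5 would drop every word here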
Like alKid pointed out, make it utf-8.
Here are two additional things you might have to worry about:
Input is too large and you're loading it from a file.
Removing stop words from the sentences.
Instead of loading a big list into memory, you can do something like:
import nltk, gensim

class FileToSent(object):
    def __init__(self, filename):
        self.filename = filename
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        for line in open(self.filename, 'r'):
            # Python 2 idiom; under Python 3, open the file with
            # encoding='utf-8' and drop the unicode() call.
            ll = [i for i in unicode(line, 'utf-8').lower().split() if i not in self.stop]
            yield ll
And then,
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
I have a list L of around 40,000 phrases and a document of around 10 million words. What I want to check is which pairs of these phrases co-occur within a window of 4 words. For example, consider L = ["brown fox", "lazy dog"]. The document contains the words "a quick brown fox jumps over the lazy dog". I want to see how many times "brown fox" and "lazy dog" appear within a window of four words, and store that in a file. I have the following code for doing this:
import re

content = open("d.txt", "r").read().replace("\n", " ")
f = open("out.txt", "w")  # hypothetical output file; not defined in the original snippet
for i in range(len(L)):
    for j in range(i + 1, len(L)):
        wr = L[i] + r"\W+(?:\w+\W+){1,4}" + L[j]
        wrev = L[j] + r"\W+(?:\w+\W+){1,4}" + L[i]
        phrasecoccur = len(re.findall(wr, content)) + len(re.findall(wrev, content))
        if phrasecoccur > 0:
            f.write(L[i] + ", " + L[j] + ", " + str(phrasecoccur) + "\n")
Essentially, for each pair of phrases in the list L, I am checking how many times these phrases appear in the document content within a window of 4 words. However, this method is computationally inefficient when the list L is pretty large, like 40K elements. Is there a better way of doing this?
You could use something similar to the Aho-Corasick string matching algorithm. Build the state machine from your list of phrases. Then start feeding words into the state machine. Whenever a match occurs, the state machine will tell you which phrase matched and at what word number. So your output would be something like:
"brown fox", 3
"lazy dog", 8
etc.
You can either capture all of the output and post-process it, or you can process the matches as they're found.
It takes a little time to build the state machine (a few seconds for 40,000 phrases), but after that it's linear in the number of input tokens, number of phrases, and number of matches.
I used something similar to match 50 million YouTube video titles against the several million song titles and artist names in the MusicBrainz database. Worked great. And very fast.
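If you would rather not write the automaton yourself, the third-party pyahocorasick package implements Aho-Corasick; a minimal sketch (it matches on characters, so mapping end positions back to word numbers would take an extra pass):

import ahocorasick  # pip install pyahocorasick

A = ahocorasick.Automaton()
for phrase in ["brown fox", "lazy dog"]:
    A.add_word(phrase, phrase)
A.make_automaton()

doc = "a quick brown fox jumps over the lazy dog"
for end_index, phrase in A.iter(doc):
    print(phrase, end_index)  # end_index is the index of the match's last character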
It should be possible to assemble your 40000 phrases into a big regular expression pattern, and use that to match against your document. It might not be as fast as something more job-specific, but it does work. Here's how I'd do it:
import re

class Matcher(object):
    def __init__(self, phrases):
        phrase_pattern = "|".join("(?:{})".format(phrase) for phrase in phrases)
        gap_pattern = r"\W+(?:\w+\W+){0,4}?"
        full_pattern = "({0}){1}({0})".format(phrase_pattern, gap_pattern)
        self.regex = re.compile(full_pattern)

    def match(self, doc):
        return self.regex.findall(doc)  # or use finditer to generate match objs
Here's how you can use it:
>>> L = ["brown fox", "lazy dog"]
>>> matcher = Matcher(L)
>>> doc = "The quick brown fox jumps over the lazy dog."
>>> matcher.match(doc)
[('brown fox', 'lazy dog')]
This solution does have a few limitations. One is that it won't detect overlapping pairs of phrases. So in the example, if you added the phrase "jumps over" to the phrase list, you would still only get one matched pair, ("brown fox", "jumps over"). It would miss both ("brown fox", "lazy dog") and ("jumps over", "lazy dog"), since they include some of the same words.
Expanding on Joel's answer, your iterator could be something like this:
def doc_iter(doc):
    # Yield a sliding window of four consecutive tokens.
    words = doc[0:4]
    yield words
    for i in range(4, len(doc)):  # the original started at 3, duplicating a token
        words = words[1:]
        words.append(doc[i])
        yield words
Put your phrases in a dict and use the iterator over the doc, checking the phrases at each iteration, as in the sketch below. This should give you performance between O(n) and O(n*log(n)).
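A rough sketch of that suggestion (hypothetical names; it assumes two-word phrases stored as tuples, and it ignores double counting across overlapping windows):

phrase_set = {("brown", "fox"), ("lazy", "dog")}  # hypothetical phrase set

def find_phrases_in_windows(doc):
    for window in doc_iter(doc):
        for k in range(len(window) - 1):
            if (window[k], window[k + 1]) in phrase_set:
                yield (window[k], window[k + 1])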
I'm trying to find some sort of a good fuzzy string matching algorithm. Direct matching doesn't work for me; unless my strings are 100% similar, the match fails. The Levenshtein method doesn't work too well for my strings, as it works on a character level. I was looking for something along the lines of word-level matching, e.g.
String A: The quick brown fox.
String B: The quick brown fox jumped over the lazy dog.
These should match, as all words in string A are in string B.
Now, this is an oversimplified example, but does anyone know a good fuzzy string matching algorithm that works at the word level?
I like Drew's answer.
You can use difflib to find the longest match:
>>> a = 'The quick brown fox.'
>>> b = 'The quick brown fox jumped over the lazy dog.'
>>> import difflib
>>> s = difflib.SequenceMatcher(None, a, b)
>>> s.find_longest_match(0,len(a),0,len(b))
Match(a=0, b=0, size=19) # returns NamedTuple (new in v2.6)
Or pick some minimum matching threshold. Example:
>>> difflib.SequenceMatcher(None, a, b).ratio()
0.61538461538461542
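Since the question asks for word-level matching, note that SequenceMatcher accepts arbitrary sequences, so you can compare token lists instead of raw strings (a sketch; 'fox.' and 'fox' then count as different whole tokens):

word_ratio = difflib.SequenceMatcher(None, a.split(), b.split()).ratio()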
Take a look at this Python library, which SeatGeek open-sourced yesterday. Obviously most of these kinds of problems are very context-dependent, but it might help you.
from fuzzywuzzy import fuzz
s1 = "the quick brown fox"
s2 = "the quick brown fox jumped over the lazy dog"
s3 = "the fast fox jumped over the hard-working dog"
fuzz.partial_ratio(s1, s2)
> 100
fuzz.token_set_ratio(s2, s3)
> 73
SeatGeek website
and Github repo
If all you want to do is to test whether or not all the words in a string match another string, that's a one liner:
if not [word for word in b.split(' ') if word not in a.split(' ')]:
    print('Match!')
If you want to score them instead of a binary test, why not just do something like:
((# of matching words) / (# of words in bigger string)) * ((# of words in smaller string) / (# of words in bigger string))?
If you wanted to, you could get fancier and do fuzzy match on each string.
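As a rough sketch of that scoring idea (the helper name is hypothetical, and it only counts exact whole-word matches):

def word_overlap_score(a, b):  # Python 3 true division assumed
    small, big = sorted((a.split(), b.split()), key=len)
    matching = sum(1 for w in small if w in big)
    return (matching / len(big)) * (len(small) / len(big))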
You can try this Python package, which does fuzzy name matching with machine learning.
pip install hmni
Initialize a Matcher Object
import hmni
matcher = hmni.Matcher(model='latin')
Single Pair Similarity
matcher.similarity('Alan', 'Al')
# 0.6838303319889133
matcher.similarity('Alan', 'Al', prob=False)
# 1
matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133
Note: I have not built this package. I'm sharing it here because it was quite useful for my use case.
GitHub
You could modify the Levenshtein algorithm to compare words rather than characters. It's not a very complex algorithm and the source is available in many languages online.
Levenshtein works by comparing two arrays of chars. There is no reason that the same logic could not be applied against two arrays of strings.
I did this some time ago in C#; my previous question is here. There is a starter algorithm there for your interest; you can easily transform it to Python.
Ideas you should use when writing your own algorithm:
* Have a list with original "titles" (words/sentences you want to match with).
* Each title item should have a minimal match score on the word/sentence; titles below it are ignored.
* You should also have a global minimal match percentage for the final result.
* You should calculate each word-to-word Levenshtein distance.
* You should increase the total match weight if words occur in the same order ("quick brown" vs. "quick brown" should have a definitively higher weight than "quick brown" vs. "brown quick").
You can try FuzzySearchEngine from https://github.com/frazenshtein/fastcd/blob/master/search.py.
This fuzzy search only supports searching for words, and it has a fixed admissible error per word (a single substitution or a transposition of two adjacent characters).
For example, you can try something like:
import search

string = "Chapter I. The quick brown fox jumped over the lazy dog."
substr = "the qiuck broqn fox."

def fuzzy_search_for_sentences(substr, string):
    start = None
    pos = 0
    for word in substr.split(" "):
        if not word:
            continue
        match = search.FuzzySearchEngine(word).search(string, pos=pos)
        if not match:
            return None
        if start is None:
            start = match.start()
        pos = match.end()
    return start

print(fuzzy_search_for_sentences(substr, string))
11 will be printed
Levenshtein should work OK if you compare words (strings separated by sequences of stop characters) instead of individual letters.
def ld(s1, s2):  # Levenshtein Distance
    len1 = len(s1) + 1
    len2 = len(s2) + 1
    lt = [[0 for i2 in range(len2)] for i1 in range(len1)]  # lt - levenshtein_table
    lt[0] = list(range(len2))
    i = 0
    for l in lt:
        l[0] = i
        i += 1
    for i1 in range(1, len1):
        for i2 in range(1, len2):
            if s1[i1-1] == s2[i2-1]:
                v = 0
            else:
                v = 1
            lt[i1][i2] = min(lt[i1][i2-1] + 1, lt[i1-1][i2] + 1, lt[i1-1][i2-1] + v)
    return lt[-1][-1]

str1 = "The quick brown fox"
str2 = "The quick brown fox jumped over the lazy dog"

print("{} words need to be added, deleted or replaced to convert string 1 into string 2".format(ld(str1.split(), str2.split())))