Quick question here: if you run the code below, you get a list of bigram frequencies per sentence from the corpus.
I would like to be able to display and keep a total running tally. That is, instead of the frequencies of 1 or maybe 2 that you currently see (because each sentence is so short), it should count through the whole corpus and display the accumulated frequencies.
I then basically need to generate text from those frequencies that models the original corpus.
#---------------------------------------------------------
#!/usr/bin/env python
#Ngram Project
#Import all of the libraries we will need for the program to function
import nltk
import nltk.collocations
from collections import defaultdict
import nltk.corpus as corpus
from nltk.corpus import brown
#---------------------------------------------------------
#create our list with the Brown corpus inside variable called "news"
news = corpus.brown.sents(categories = 'editorial')
#This will display the type of variable Python recognizes this as
print "News Is Of The Variable Type : ",type(news),'\n'
#---------------------------------------------------------
#This function takes in the corpus one sentence (list of tokens) at a time
#It adds a '<s>' to the beginning of each list item and replaces a trailing period with '</s>'
def alter_list(corpus_list):
    #Simply check for an instance of a period, and if so, replace it with '</s>'
    if corpus_list[-1] == '.':
        corpus_list[-1] = '</s>'
        #strip() removes surrounding special characters, e.g. '\n'
        corpus_list[-1].strip()
    #Else add '</s>' to the end of the list item
    else:
        corpus_list.append('</s>')
    return ['<s>'] + corpus_list
#Displays the length of the list 'news'
print "The Length of News is : ",len(news),'\n'
#Allows the user to choose how much of the annotated corpus they would like to see
print "How many lines of the <s> // </s> annotated corpus would you like to see? ", '\n'
user = input()
#Takes user input to determine how many lines to display if any
if (user >= 1):
    print "The Corpus Annotated with <s> and </s> looks like : "
    print "Displaying [", user, "] rows of the corpus : ", '\n'
    for corpus_list in news[:user]:
        print(alter_list(corpus_list), '\n')
#Non-positive number catch
else:
    print "Fine I Won't Show You Any... ", '\n'
#---------------------------------------------------------
print '\n'
#Again allows the user to choose the number of lists from Brown corpus to be displayed in
# Unigram, bigram, trigram and quadgram format
user2 = input("How many list sequences would you like to see broken into bigrams, trigrams, and quadgrams? ")
count = 0
#The 'ngrams' function is run in a loop so that each sentence in the list can be broken into n-grams
#and displayed to the user
while (count < user2):
    passer = news[count]

    def ngrams(passer, n = 2, padding = True):
        #Padding refers to the same idea demonstrated above: 'None' placeholders are added
        #at the start and end of each sentence so that boundary n-grams can be counted
        pad = [] if not padding else [None] * (n - 1)
        grams = pad + passer + pad
        return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))

    #In this case the arguments are, first, the n-gram size (uni, bi, tri, quad),
    #followed by whether padding is applied
    #Padding is used in every case here because we need it for the frequency calculations
    #This function structure allows us to pull in corpus parts without the added annotations if need be
    for size, padding in ((1, 1), (2, 1), (3, 1), (4, 1)):
        print '\n%d - grams || padding = %d' % (size, padding)
        print list(ngrams(passer, size, padding))

    # show frequency
    counts = defaultdict(int)
    for n_gram in ngrams(passer, 2, False):
        counts[n_gram] += 1
    print ("======================================================================================")
    print '\nFrequencies Of Bigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram
    print '\nFrequencies Of Trigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram
    count = count + 1
#---------------------------------------------------------
I'm not sure I understand the question. nltk has a generate function. The book that nltk comes from is available online.
http://nltk.org/book/ch01.html
Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
The problem is that you define the dict counts anew for each sentence, so the n-gram counts get reset to zero. Define it above the while loop and the counts will accumulate over the entire Brown corpus.
Bonus advice: you should also move the definition of ngrams outside the loop; it's nonsensical to define the same function over and over (though it does no harm, except to performance). Better yet, use nltk's own ngrams function and read about FreqDist, which is like a dict counter on steroids. It will come in handy when you tackle the statistical text generation.
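For illustration, here is a minimal sketch of that fix, reusing the news and ngrams names from the question (the FreqDist variant assumes a reasonably recent NLTK):
from collections import defaultdict

counts = defaultdict(int)              # created once, before the loop
for sentence in news:                  # every sentence in the corpus
    for n_gram in ngrams(sentence, 2, False):
        counts[n_gram] += 1            # so the tally accumulates across sentences

# Or, with nltk's own helpers:
from nltk import FreqDist
from nltk.util import ngrams as nltk_ngrams

fdist = FreqDist(gram for sentence in news for gram in nltk_ngrams(sentence, 2))
print fdist.most_common(10)            # ten most frequent bigrams over the whole corpus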
Related
I have a problem which I solved, but not in an efficient manner. I have a list of strings, which are captions for images. I need to go through every word in this list of strings and create a dictionary containing the following information:
The word, if that word appears 5 times or more in the list
A simple id for that word
Therefore, my vocabulary in a Python dictionary would contain word:id entries.
First, I have an auxiliary function to divide a string into tokens, or words
def split_sentence(sentence):
    return list(filter(lambda x: len(x) > 0, re.split('\W+', sentence.lower())))
Then, I will generate the vocabulary like this, which works
def generate_vocabulary(train_captions):
    """
    Return {token: index} for all train tokens (words) that occur 5 times or more,
    `index` should be from 0 to N, where N is the number of unique tokens in the resulting dictionary.
    """
    #convert the list of whole captions to one string
    string = ' '.join([str(elem) for elem in train_captions])
    #divide the string into tokens (individual words) by calling the previous function
    individual_words = split_sentence(string)
    #create a list of words that appear 5 times or more in that string
    more_than_5 = list(set([x for x in individual_words if individual_words.count(x) >= 5]))
    #generate ids
    ids = [i for i in range(0, len(more_than_5))]
    #generate the vocabulary (dictionary)
    vocab = dict(zip(more_than_5, ids))
    return {token: index for index, token in enumerate(sorted(vocab))}
The code works like a charm for relatively small lists of captions. However, with lists thousands of captions long (e.g., 80,000), it takes forever; I have been running this code for an hour now.
Is there any way to speed up my code? How can I compute my more_than_5 variable faster?
EDIT: I forgot to mention that, in a few specific members of this list of strings, there are \n symbols at the beginning of the sentence. Is it possible to eliminate just this symbol from my list and then apply the algorithm again?
You can count each word's occurrences once, instead of recounting it on every step of the list comprehension, by using Counter from the collections package.
import re
from collections import Counter

def split_sentence(sentence):
    return list(filter(lambda x: len(x) > 0, re.split('\W+', sentence.lower())))

def generate_vocabulary(train_captions, min_threshold):
    """
    Return {token: index} for all train tokens (words) that occur min_threshold times or more,
    `index` should be from 0 to N, where N is a number of unique tokens in the resulting dictionary.
    """
    #convert the list of whole captions to one string
    concat_str = ' '.join([str(elem).strip('\n') for elem in train_captions])
    #divide the string into tokens (individual words) by calling the split_sentence function
    individual_words = split_sentence(concat_str)
    #create a sorted list of words that happen min_threshold times or more in that string
    condition_keys = sorted([key for key, value in Counter(individual_words).items() if value >= min_threshold])
    #generate the vocabulary (dictionary)
    result = dict(zip(condition_keys, range(len(condition_keys))))
    return result
train_captions = ['Nory was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been.',
'I felt happy because I saw the others were happy and because I knew I should feel happy, but I wasn’t really happy.',
'Almost nothing was more annoying than having our wasted time wasted on something not worth wasting it on.']
generate_vocabulary(train_captions, min_threshold=5)
# {'a': 0, 'because': 1, 'catholic': 2, 'i': 3, 'was': 4}
As @Eduard Ilyasov said, the Counter class is the best tool when you need to count things.
Here's my solution:
import re
import collections
original_text = (
"I say to you today, my friends, though, even though ",
"we face the difficulties of today and tomorrow, I still have ",
"a dream. It is a dream deeply rooted in the American ",
"dream. I have a dream that one day this nation will rise ",
'up, live out the true meaning of its creed: "We hold these ',
'truths to be self-evident, that all men are created equal."',
"",
"I have a dream that one day on the red hills of Georgia ",
"sons of former slaves and the sons of former slave-owners ",
"will be able to sit down together at the table of brotherhood. ",
"I have a dream that one day even the state of ",
"Mississippi, a state sweltering with the heat of injustice, ",
"sweltering with the heat of oppression, will be transformed ",
"into an oasis of freedom and justice. ",
"",
"I have a dream that my four little chi1dren will one day ",
"live in a nation where they will not be judged by the color ",
"of their skin but by the content of their character. I have ",
"a dream… I have a dream that one day in Alabama, ",
"with its vicious racists, with its governor having his lips ",
"dripping with the words of interposition and nullification, ",
"one day right there in Alabama little black boys and black ",
"girls will he able to join hands with little white boys and ",
"white girls as sisters and brothers. "
)
def split_sentence(sentence):
    return (x.lower() for x in re.split('\W+', sentence.strip()) if x)

def generate_vocabulary(train_captions):
    word_count = collections.Counter()
    for current_sentence in train_captions:
        word_count.update(split_sentence(str(current_sentence)))
    return {key: value for (key, value) in word_count.items() if value >= 5}

print(generate_vocabulary(original_text))
I made some assumptions that you didn't specify:
I assumed that a word would not span two sentences.
I kept the handling for captions that aren't always strings. If you know they always will be, you can simplify the code by changing word_count.update(split_sentence(str(current_sentence))) to word_count.update(split_sentence(current_sentence)).
I am building word frequency, and relative frequency, lists for a collection of text files. Having discovered, by hand, that a couple of texts can overly influence the frequency of a word, one of the things I want to be able to do is count the number of texts in which a word occurs. It strikes me that there are two ways to do this:
First, compile a word frequency dictionary (as below; I'm not using the NLTK FreqDist because this code actually runs more quickly, but if FreqDist has the above functionality built in and I just didn't know it, I'll take it):
import nltk
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
freq_dic = {}
for text in ftexts:
    words = tokenizer.tokenize(text)
    for word in words:
        # form dictionary
        try:
            freq_dic[word] += 1
        except:
            freq_dic[word] = 1
From there, I assume I'll need to write another loop that uses the keys above as keywords:
# This is just scratch code
for text in ftexts:
    while True:
        if keyword not in line:
            continue
        else:
            break
    count = count + 1
And then I'll find some way to mesh these two dictionaries into a tuple or, possibly, a pandas dataframe by word, such that:
word1, frequency, # of texts in which it occurs
word2, frequency, # of texts in which it occurs
The other thing that occurred to me as I was writing this question was to use scikit-learn's term-frequency matrix and then count the rows in which a word occurs. Is that possible?
ADDED TO CLARIFY:
Imagine three sentences:
["I need to keep count of the children.",
"If you want to know what the count is, just ask."
"There is nothing here but chickens, chickens, chickens."]
"count" occurs 2x but is in two different texts; "chickens" occurs three times, but is in only one text. What I want is a report that looks like this:
WORD, FREQ, TEXTS
count, 2, 2
chicken, 3, 1
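A minimal sketch of the two-tally idea described above, reusing the question's ftexts and tokenizer; Counter comes from the standard library:
import nltk
from collections import Counter

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
freq = Counter()       # total occurrences of each word
texts = Counter()      # number of texts each word appears in

for text in ftexts:                # ftexts as in the question
    words = tokenizer.tokenize(text)
    freq.update(words)
    texts.update(set(words))       # each word counted at most once per text

for word in freq:
    print("%s, %d, %d" % (word, freq[word], texts[word]))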
Are there any Python libraries which can generate random titles and random descriptions?
Random title: a grammatically correct (but random) English sentence with fewer than 5 words.
Random description: a grammatically correct (but random) English sentence with fewer than 20 words.
I am testing a product which has title and description fields. I want to create multiple objects with random titles and random descriptions instead of "Title 1" and "Description 1".
For a fairly simple solution, just find matches for a regex like [A-Z][a-z'\-]+[, ]([a-zA-Z'\-]+[;,]? ){15,25}[a-zA-Z'\-]+[.?!] (match a capitalized word, followed by 15-25 words potentially with commas or semicolons after them, then a final word and an ending punctuation mark) in some large block of text. To get shorter, title-like phrases, you could just match any sequence of about 5 words (probably without punctuation between them):
([a-zA-Z'\-]+ ){4,6}
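A rough sketch of how that might be wired up (corpus.txt is a placeholder for whatever large block of English text you feed it):
import re

sentence_pattern = r"[A-Z][a-z'\-]+[, ]([a-zA-Z'\-]+[;,]? ){15,25}[a-zA-Z'\-]+[.?!]"
title_pattern = r"([a-zA-Z'\-]+ ){4,6}"

with open('corpus.txt') as f:          # any large block of English text
    blob = f.read()

# group(0) is the full match; findall would only return the repeated group
descriptions = [m.group(0) for m in re.finditer(sentence_pattern, blob)]
titles = [m.group(0).strip() for m in re.finditer(title_pattern, blob)]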
From Generating pseudo random text with Markov chains using Python:
You can use Markov chains to achieve this. To do that, you'll need to do the following steps (from the page I linked):
1. Have a text which will serve as the corpus from which we choose the next transitions.
2. Start with two consecutive words from the text. The last two words constitute the present state.
3. Generating the next word is the Markov transition. To generate the next word, look in the corpus, and find which words are present after the given two words. Choose one of them randomly.
4. Repeat 2, until text of the required size is generated.
The code they supply to accomplish this:
import random

class Markov(object):

    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()

    def file_to_words(self):
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words

    def triples(self):
        """ Generates triples from the given data string. So if our string were
        "What a lovely day", we'd generate (What, a, lovely) and then
        (a, lovely, day).
        """
        if len(self.words) < 3:
            return
        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i+1], self.words[i+2])

    def database(self):
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

    def generate_markov_text(self, size=25):
        seed = random.randint(0, self.word_size-3)
        seed_word, next_word = self.words[seed], self.words[seed+1]
        w1, w2 = seed_word, next_word
        gen_words = []
        for i in xrange(size):
            gen_words.append(w1)
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
        gen_words.append(w2)
        return ' '.join(gen_words)
With this code, you then do something like the following example, replacing their jeeves.txt with some seed text of your choice (longer is better).
In [1]: file_ = open('/home/shabda/jeeves.txt')
In [2]: import markovgen
In [3]: markov = markovgen.Markov(file_)
In [4]: markov.generate_markov_text()
Out[4]: 'Can you put a few years of your twin-brother Alfred,
who was apt to rally round a bit. I should strongly advocate
the blue with milk'
After In[1] through In[3], you'd just need to call markov.generate_markov_text() with the proper arguments to generate sequences of 5 and 20 words as you needed them.
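So, for the title/description use case, something along these lines should do (the sizes are approximate word counts passed to the method shown above):
title = markov.generate_markov_text(size=5)         # short, title-like snippet
description = markov.generate_markov_text(size=20)  # longer, description-like snippet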
I have extracted a list of sentences from a document. I am pre-processing this list of sentences to make it more sensible, and I am faced with the following problem.
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a lookup dictionary to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straightforward task in text preprocessing. I need help with some pointers to look for such approaches. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem using Google Ngram corpus:
http://jeremykun.com/2012/01/15/word-segmentation/
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
https://gist.github.com/miku/7279824
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']
Using data, even this can be reordered:
$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/
As for your problem, you could just concatenate all the string parts you have to get a single string and then run a segmentation algorithm on it.
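For instance, a minimal sketch of that concatenate-then-segment idea, using the third-party wordsegment package (which also appears in a later answer) rather than the prototype linked above; punctuation handling is left out:
from wordsegment import load, segment

load()
broken = "more recen t ly the develop ment wh ich is a po ten t"
joined = ''.join(broken.split())   # "morerecentlythedevelopmentwhichisapotent"
print(segment(joined))             # e.g. ['more', 'recently', 'the', 'development', ...]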
Your goal is to improve text, not necessarily to make it perfect; so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: Start with the first fragment and stick pieces to it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you'll make a mistake with cases like the me thod, so if you'll be using this a lot, you could look for something more sophisticated. However, it's probably good enough.
Mainly what you require is a large dictionary. If you'll be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out if a fragment is the start of a real word. The nltk provides a Trie implementation.
Since these kinds of spurious word breaks are inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it's broken up.
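One possible reading of that greedy idea as code, assuming dictionary is a set of lowercase English words loaded from a word list; it inherits the "the me thod" limitation mentioned above:
def greedy_join(fragments, dictionary):
    words = []
    current = ''
    for frag in fragments:
        candidate = current + frag
        if candidate.lower() in dictionary:
            current = candidate          # keep growing while the result is a word
        else:
            if current:
                words.append(current)    # emit what we had so far
            current = frag               # start over from this fragment
    if current:
        words.append(current)
    return words

broken = "more recen t ly the develop ment wh ich is a po ten t"
print(' '.join(greedy_join(broken.split(), dictionary)))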
--Solution 1:
Let's think of the chunks in your sentence as beads on an abacus, with each bead consisting of a partial string; the beads can be moved left or right to generate the permutations. The position of each fragment is fixed between the two adjacent fragments.
In the current case, the beads would be:
(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)
This solves 2 subproblems:
a) A bead is a single unit, so we do not care about permutations within the bead, i.e. permutations of "more" are not possible.
b) The order of the beads is constant; only the spacing between them changes, i.e. "more" will always be before "recen" and so on.
Now, generate all the permutations of these beads, which will give output like:
morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent
Then score these permutations based on how many words from your relevant dictionary they contain; the most correct results can easily be filtered out.
"more recently the development, which is a potent" will score higher than "morerecentlythedevelop ment, wh ich is a po ten t"
Code which does the permutation part of the beads:
import re

def gen_abacus_perms(frags):
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]
    prefix_1 = "{0}{1}".format(frags[0], frags[1])
    prefix_2 = "{0} {1}".format(frags[0], frags[1])
    if len(frags) == 2:
        nres = [prefix_1, prefix_2]
        return nres
    rem_perms = gen_abacus_perms(frags[2:])
    res = ["{0}{1}".format(prefix_1, x) for x in rem_perms] + ["{0} {1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0}{1}".format(prefix_2, x) for x in rem_perms] + ["{0} {1}".format(prefix_2, x) for x in rem_perms]
    return res

broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split("\s+", broken)
perms = gen_abacus_perms(frags)
print("\n".join(perms))
Demo: http://ideone.com/pt4PSt
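The scoring step could look something like this (english_words is a placeholder for a set of lowercase dictionary words, and perms comes from the code above):
import re

def score(candidate, english_words):
    tokens = re.split(r"\W+", candidate.lower())
    return sum(1 for t in tokens if t in english_words)

best = max(perms, key=lambda p: score(p, english_words))
print(best)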
--Solution#2:
I would suggest an alternate approach which makes use of the text-analysis intelligence already developed by folks working on similar problems over big corpora of data, relying on dictionaries and grammar, e.g. search engines.
I am not well aware of such public/paid APIs, so my example is based on Google results.
Let's try to use Google:
You can keep putting your invalid terms into Google, for multiple passes, and keep evaluating the results for some score based on your lookup dictionary.
Here are two relevant outputs from 2 passes over your text:
This output is used for the second pass:
Which gives you the conversion "more recently the development, which is a potent".
To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid / not so good results.
One raw technique could be using a comparison of normalized strings using difflib.
>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
I would recommend stripping away the spaces and looking for dictionary words to break the string down into. There are a few things you can do to make it more accurate. To get the first word in text with no spaces, try taking the entire string and going through dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), the longest ones first, and taking off letters from the end of the string you want to segment. If you want it to work on a big string, you can make it automatically take off letters from the back so that the string in which you are looking for the first word is only as long as the longest dictionary word. This should result in finding the longest words, and makes it less likely to do something like classify "asynchronous" as "a synchronous". Here is an example that uses raw_input to take in the text to correct and a dictionary file called dictionary.txt:
dict = open("dictionary.txt",'r') #loads a file with a list of words to break string up into
words = raw_input("enter text to correct spaces on: ")
words = words.strip() #strips away spaces
spaced = [] #this is the list of newly broken up words
parsing = True #this represents when the while loop can end
while parsing:
    if len(words) == 0: #checks if all of the text has been broken into words; if it has, end the while loop
        parsing = False
    iterating = True
    for iteration in range(45): #goes through each of the possible word lengths, starting from the biggest
        if iterating == False:
            break
        word = words[:45-iteration] #each iteration, the word has one letter removed from the back, starting with the longest possible number of letters, 45
        for line in dict:
            line = line[:-1] #this deletes the last character of the dictionary word, which will be a newline. delete this line of code if it is not a newline, or change it to [1:] if the newline character is at the beginning
            if line == word: #this finds if this is the word we are looking for
                spaced.append(word)
                words = words[-(len(word)):] #takes away the word from the text list
                iterating = False
                break
print ' '.join(spaced) #prints the output
If you want it to be even more accurate, you could try using a natural language parsing program; there are several available for Python, free, online.
Here's something really basic:
chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print joined,
        del chunks[:]

# deal with left overs
if chunks:
    print ''.join(chunks)
I assume you have a set of valid words somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here's one way to do that:
def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words
You can iterate through a dictionary of words to find the best fit, adding fragments together when a match is not found.
def iterate(word, dictionary):
    for word in dictionary:
        if words in possibleWord:
            finished_sentence.append(words)
            added = True
        else:
            added = False
    return [added, finished_sentence]

sentence = "more recen t ly the develop ment, wh ich is a po ten t "
finished_sentence = ""
sentence = sentence.split()
for word in sentence:
    added, new_word = iterate(word, dictionary)
    while True:
        if added == False:
            word += possible[sentence.find(possibleWord)]
            iterate(word, dictionary)
        else:
            break
    finished_sentence.append(word)
This should work. For the variable dictionary, download a txt file of every single English word, then open it in your program.
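For example, loading such a word list into a set keeps the membership tests fast (english_words.txt is a placeholder name for a file with one word per line):
with open('english_words.txt') as f:
    dictionary = set(line.strip().lower() for line in f)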
My index.py file looks like this:
from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))
My index.php file looks like this:
<html>
<head>
<title>py script</title>
</head>
<body>
<h1>Hey There!Python Working Successfully In A PHP Page.</h1>
<?php
$python = `python index.py`;
echo $python;
?>
</body>
</html>
Hope this will work
Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up Python, and I'm working on a project which aims to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10-syllable line with a 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with a 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
            #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))
    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during tokenization, and I really need the line breaks to be able to identify the form. This should not be too hard to deal with, though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllable count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'). Which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101.
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n");
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a post-grad for a couple of months to help you, instead of some workshop requirement.
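Purely as an illustration of that top-down split (the helper functions here are placeholders, not working scansion code):
def scan_line(line):
    # placeholder: return (syllable_count, stress_pattern) for one line
    syllables = len(line.split())
    return syllables, "01" * (syllables // 2)

def classify_stress_pattern(stress):
    # placeholder: name the form if the pattern matches a known one
    return "iambic pentameter" if stress == "0101010101" else "unknown"

def analyze_poem(corpus):
    for line in corpus.split("\n"):          # step 1: split corpus into lines
        syllables, stress = scan_line(line)  # step 2: per-line scansion
        print line, "->", classify_stress_pattern(stress)  # step 3: classify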
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last-year undergrad should not have too much trouble with it (hint hint).
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 when nltk returns 1 (it looks like nltk already returns 0 for some words that would never get stressed, like "the"). So you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permutate all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
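A rough sketch of that permutation idea, treating each word's stress pattern as a string and trying both options for one-syllable words (purely illustrative):
import itertools

def line_permutations(word_stresses):
    # word_stresses: one stress string per word, e.g. ['1', '01', '1', '0101']
    options = [('0', '1') if s in ('0', '1') else (s,) for s in word_stresses]
    return [''.join(p) for p in itertools.product(*options)]

print line_permutations(['1', '01', '1', '0101'])
# ['00100101', '00110101', '10100101', '10110101']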
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on stackoverflow, and I'm a Python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems, and the code included in this question helped me, so I post what I came up with that builds on that foundation. It is one way to extract the stress as a single string, correct with a 'fudging factor' for the cmudict bias, and not lose words that are not in cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()
#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print word
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in the array returned from prondict
                # if it exists use it
                # print strip_letters(s),word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}

# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}

# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws",ws
        for ch in list(ws):
            #print "ch",ch
            if ch.isdigit():
                nm = nm + ch
                #print "ad to nm",nm, type(nm)
    return nm

# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)

"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}