I'm preparing text for a word cloud, but I'm stuck.
I need to remove all digits and all punctuation marks like . , - ? = / ! # etc., but I don't know how. I don't want to call replace again and again. Is there a method for that?
Here is my concept and what I have to do:
Concatenate texts in one string
Set chars to lowercase <--- I'm here
Now I want to delete specific characters and split the text into a list of words
calculate the frequency of each word
next do the stopwords script...
abstracts_list = open('new','r')
abstracts = []
allab = ''
for ab in abstracts_list:
    abstracts.append(ab)
for ab in abstracts:
    allab += ab
Lower = allab.lower()
Text example:
MicroRNAs (miRNAs) are a class of noncoding RNA molecules
approximately 19 to 25 nucleotides in length that downregulate the
expression of target genes at the post-transcriptional level by
binding to the 3'-untranslated region (3'-UTR). Epstein-Barr virus
(EBV) generates at least 44 miRNAs, but the functions of most of these
miRNAs have not yet been identified. Previously, we reported BRUCE as
a target of miR-BART15-3p, a miRNA produced by EBV, but our data
suggested that there might be other apoptosis-associated target genes
of miR-BART15-3p. Thus, in this study, we searched for new target
genes of miR-BART15-3p using in silico analyses. We found a possible
seed match site in the 3'-UTR of Tax1-binding protein 1 (TAX1BP1). The
luciferase activity of a reporter vector including the 3'-UTR of
TAX1BP1 was decreased by miR-BART15-3p. MiR-BART15-3p downregulated
the expression of TAX1BP1 mRNA and protein in AGS cells, while an
inhibitor against miR-BART15-3p upregulated the expression of TAX1BP1
mRNA and protein in AGS-EBV cells. Mir-BART15-3p modulated NF-κB
activity in gastric cancer cell lines. Moreover, miR-BART15-3p
strongly promoted chemosensitivity to 5-fluorouracil (5-FU). Our
results suggest that miR-BART15-3p targets the anti-apoptotic TAX1BP1
gene in cancer cells, causing increased apoptosis and chemosensitivity
to 5-FU.
So to convert upper-case characters to lower-case characters, just store your text in a string variable, for example STRING, and use:
STRING = STRING.lower()
Now your string will be free of capital letters. (str.lower() does all the work here; no regex is needed for this step.)
To remove the special characters, the re module's sub function can help:
STRING = re.sub(r'[^a-zA-Z0-9-_*.]', ' ', STRING)
With this command your string will be free of special characters. Note that this character class still keeps digits and the characters - _ * . ; since you also want the digits gone, r'[^a-z]' would do after lowercasing.
And to determine the word frequencies you can use Counter from the collections module.
Then use the following command to determine the frequency with which the words occur:
Counter(STRING.split()).most_common()
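Putting the pieces together, here is a minimal end-to-end sketch (assuming, as in the question, that the file 'new' holds the abstracts):

import re
from collections import Counter

with open('new', 'r') as f:
    allab = f.read()                         # concatenate the texts into one string

lowered = allab.lower()                      # set chars to lowercase
cleaned = re.sub(r'[^a-z]+', ' ', lowered)   # drop digits and punctuation in one pass
words = cleaned.split()                      # divide the text into a list of words

print(Counter(words).most_common(10))        # ten most frequent words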
I'd probably try to use string.isalpha():
abstracts = []
with open('new','r') as abstracts_list:
    for ab in abstracts_list: # this gives one line of text.
        # keep letters and whitespace so that words don't run together
        ab = ''.join(c for c in ab if c.isalpha() or c.isspace())
        abstracts.append(ab.lower())

# now assuming you want the text in one big string like allab was
long_string = ''.join(abstracts)
I have the following text where I want to get rid of everything of the form '(number)' or '(number; ... ; number)':
In all living organisms, from bacteria to man, DNA and chromatin are
invariably associated with binding proteins, which organize their
structure (1; 2 ; 3). Many of these architectural proteins are
molecular bridges that can bind at two or more distinct DNA sites to
form loops. For example, bacterial DNA is looped and compacted by the
histonelike protein H-NS, which has two distinct DNA-binding domains
(4). In eukaryotes, complexes of transcription factors and RNA
polymerases stabilize enhancer-promoter loops (5; 6; 7 ; 8), while
HP1 (9), histone H1 (10), and the polycomb-repressor complex PRC1/2
(11 ; 12) organize inactive chromatin. Proteins also bind to specific
DNA sequences to form larger structures, like nucleoli and the
histone-locus, or Cajal and promyeloleukemia bodies (13; 14; 15; 16;
17 ; 18). The selective binding of molecular bridges to active and
inactive regions of chromatin has also been highlighted as one
possible mechanism underlying the formation of topologically
associated domains (TADs)—regions rich in local DNA interactions (6; 8
; 19).
I want it to be in the form:
In all living organisms, from bacteria to man, DNA and chromatin are
invariably associated with binding proteins, which organize their
structure . Many of these architectural proteins are molecular bridges
that can bind at two or more distinct DNA sites to form loops. For
example, bacterial DNA is looped and compacted by the histonelike
protein H-NS, which has two distinct DNA-binding domains . In
eukaryotes, complexes of transcription factors and RNA polymerases
stabilize enhancer-promoter loops , while HP1 , histone H1 , and the
polycomb-repressor complex PRC1/2 organize inactive chromatin.
Proteins also bind to specific DNA sequences to form larger
structures, like nucleoli and the histone-locus, or Cajal and
promyeloleukemia bodies . The selective binding of molecular bridges
to active and inactive regions of chromatin has also been highlighted
as one possible mechanism underlying the formation of topologically
associated domains (TADs)—regions rich in local DNA interactions .
My attempt was as follows:
import re
x=re.sub(r'\(.+; \d+\)', '', x) # eliminate brackets with multiple numbers
#### NOTE: there are 2 spaces between the last ';' and the last digit
x=re.sub(r'\d+\)', '', x) # eliminate brackets with single number
My output was this:
In all living organisms, from bacteria to man, DNA and chromatin are
invariably associated with binding proteins, which organize their
structure .
So clearly my code is missing something. I thought that '\(.+\)' would match brackets containing arbitrary characters, and that I could then further specify that I want the ones ending in '; number'.
I just want a flexible way of finding every place in a sentence with '(number' and 'number)' and eliminating everything in between.
Maybe you can try the pattern
re.sub(r'\([0-9; ]+\)', '', x)
which removes all parentheses that contain only numbers, semicolons, and spaces.
Strictly speaking the r prefix isn't needed for this particular pattern, but it's a good habit for regexes.
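A quick check of that pattern on a snippet of the text:

import re

x = "organize their structure (1; 2 ; 3). DNA-binding domains (4)."
print(re.sub(r'\([0-9; ]+\)', '', x))
# organize their structure . DNA-binding domains .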
Try the following regex:
r'\s\((\d+\s?;?\s?)+\)'
This regex will match one or more groups of numbers (each followed by optional spaces/semicolons) inside parentheses.
There always seems to be a space before the collection of numbers, so matching that should help with the trailing space.
You can use a pattern like \(\d+(?:;\s?\d+\s?)*\), which matches an opening parenthesis and digits (i.e. "(<number>"), then any number of repeating "; <number>" groups, ending in ")". Test it.
Or if you're feeling brave you can use \([;\d\s]+\) which just matches everything with digits/spaces/semicolons between two parentheses. Test it.
I found a glitch in your expected text: there's a space missing after PRC1/2. But this code works, with that space added back in:
text="""
In all living organisms, from bacteria to man, DNA and chromatin are invariably
associated with binding proteins, which organize their structure (1; 2 ; 3).
Many of these architectural proteins are molecular bridges that can bind at two
or more distinct DNA sites to form loops. For example, bacterial DNA is looped
and compacted by the histonelike protein H-NS, which has two distinct
DNA-binding domains (4). In eukaryotes, complexes of transcription factors and
RNA polymerases stabilize enhancer-promoter loops (5; 6; 7 ; 8), while HP1 (9),
histone H1 (10), and the polycomb-repressor complex PRC1/2 (11 ; 12) organize
inactive chromatin. Proteins also bind to specific DNA sequences to form larger
structures, like nucleoli and the histone-locus, or Cajal and promyeloleukemia
bodies (13; 14; 15; 16; 17 ; 18). The selective binding of molecular bridges to
active and inactive regions of chromatin has also been highlighted as one
possible mechanism underlying the formation of topologically associated domains
(TADs)—regions rich in local DNA interactions (6; 8 ; 19).
""".replace('\n', ' ')
expected="""
In all living organisms, from bacteria to man, DNA and chromatin are invariably
associated with binding proteins, which organize their structure . Many of
these architectural proteins are molecular bridges that can bind at two or more
distinct DNA sites to form loops. For example, bacterial DNA is looped and
compacted by the histonelike protein H-NS, which has two distinct DNA-binding
domains . In eukaryotes, complexes of transcription factors and RNA polymerases
stabilize enhancer-promoter loops , while HP1 , histone H1 , and the
polycomb-repressor complex PRC1/2 organize inactive chromatin. Proteins also
bind to specific DNA sequences to form larger structures, like nucleoli and the
histone-locus, or Cajal and promyeloleukemia bodies . The selective binding of
molecular bridges to active and inactive regions of chromatin has also been
highlighted as one possible mechanism underlying the formation of topologically
associated domains (TADs)—regions rich in local DNA interactions .
""".replace('\n', ' ')
import re
cites = r"\(\s*\d+(?:\s*;\s+\d+)*\s*\)"
edited = re.sub(cites, '', text)
i = 0
while i < len(edited):
    if edited[i] == expected[i]:
        print(edited[i], sep='', end='')
    else:
        print('[', edited[i], ']', sep='', end='')
    i += 1
print('')
The regex I'm using is cites, and it looks like this:
r"\(\s*\d+(?:\s*;\s+\d+)*\s*\)"
The syntax r"..." means "raw", which for our purposes means "leave backslashes alone!" It's what you should (nearly) always use for regexes.
The outer \( and \) match the actual parens in the citation.
The \s* matches zero or more "white-space" characters, which include spaces, tabs, and newlines.
The \d+ matches one or more digits [0..9].
So to start with, there's a regex like:
\( \s* \d+ \s* \)
Which is just "parens around a number, maybe with spaces before or after."
The inner part,
(?:\s*;\s+\d+)*
says "don't capture": (?:...) is a non-capturing group, because we don't care about \1 or getting anything out of the pattern, we just want to delete it.
The \s*; matches optional spaces before a semicolon.
The \s+\d+ matches required spaces before another number - you might have to make those optional spaces if you have something like "(1;3;5)".
The * after the non-capturing group means zero or more occurrences.
Put it all together and you have:
open-paren
optional spaces
number
followed by zero or more of:
    optional spaces
    semicolon
    required spaces
    number
optional spaces
close-paren
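If it helps, the same pattern can be spelled out with re.VERBOSE, which lets you annotate each piece inline (a direct transcription of the breakdown above):

import re

cites = re.compile(r"""
    \(            # open-paren
    \s* \d+       # optional spaces, then a number
    (?:           # zero or more of:
        \s* ;     #   optional spaces, then a semicolon
        \s+ \d+   #   required spaces, then another number
    )*
    \s* \)        # optional spaces, then close-paren
""", re.VERBOSE)

print(cites.sub('', 'structure (1; 2 ; 3). domains (4).'))
# structure . domains .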
I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words of length 3 to 5, but I haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches a capital letter followed by any 2 to 4 characters and a literal dot. You can tighten that up if required, e.g. using [a-z]: the pattern [A-Z][a-z]{2,4}\. would match an upper-case character followed by between 2 and 4 lower-case characters, followed by a literal dot/period.
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
set(['Rosse.', 'King.'])
You may have your own reasons for wanting to use regexes here, but Python provides a rich set of string methods, and (IMO) it's easier to understand code that uses them:
matched_words = []
for line in open('text.txt'):
    words = line.split()
    for word in words:
        if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
            matched_words.append(word)
print matched_words
I have extracted a list of sentences from a document, and I am pre-processing it to make it more sensible. I am faced with the following problem:
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a lookup dictionary, to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straightforward task in text preprocessing. I need some pointers on where to look for such approaches. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem using Google Ngram corpus:
http://jeremykun.com/2012/01/15/word-segmentation/
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
https://gist.github.com/miku/7279824
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']
Using data, even this can be reordered:
$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/
As for your problem, you could just concatenate all the string parts you have to get a single string and then run a segmentation algorithm on it.
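To make the idea concrete, here is a minimal segmentation sketch using dynamic programming with memoization; WORDS is a tiny stand-in, and a real version would score candidates by corpus frequency as the articles above describe:

# minimal sketch: split s into the fewest dictionary words (None if impossible)
WORDS = {'the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog'}

def segment(s, memo=None):
    if memo is None:
        memo = {}
    if not s:
        return []
    if s in memo:
        return memo[s]
    best = None
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in WORDS:
            rest = segment(s[i:], memo)
            if rest is not None and (best is None or len(rest) + 1 < len(best)):
                best = [prefix] + rest
    memo[s] = best
    return best

print(segment('thequickbrownfoxjumpsoverthelazydog'))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']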
Your goal is to improve text, not necessarily to make it perfect; so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: Start with the first fragment and stick pieces to it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you'll make a mistake with cases like the me thod, so if you'll be using this a lot, you could look for something more sophisticated. However, it's probably good enough.
Mainly what you require is a large dictionary. If you'll be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out if a fragment is the start of a real word. The nltk provides a Trie implementation.
Since these kinds of spurious word breaks are inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it's broken up.
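A minimal sketch of that greedy approach (word_set is a small placeholder for your large dictionary, and punctuation handling is skipped):

word_set = {'more', 'recently', 'the', 'development', 'which',
            'is', 'a', 'potent'}   # placeholder dictionary

def rejoin(fragments, words):
    out, current = [], ''
    for frag in fragments:
        candidate = current + frag
        if candidate in words:
            out.append(candidate)   # completed a word; start over
            current = ''
        else:
            current = candidate     # keep sticking pieces on
    if current:
        out.append(current)         # spit out whatever is left over
    return out

frags = "more recen t ly the develop ment wh ich is a po ten t".split()
print(' '.join(rejoin(frags, word_set)))
# more recently the development which is a potent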
Solution 1:
Let's think of the chunks in your sentence as beads on an abacus: each bead consists of a partial string, and the beads can be moved left or right to generate the permutations. The position of each fragment is fixed between its two adjacent fragments.
In the current case, the beads would be:
(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)
This solves 2 subproblems:
a) A bead is a single unit, so we do not care about permutations within a bead, i.e. permutations of "more" are not possible.
b) The order of the beads is constant; only the spacing between them changes, i.e. "more" will always be before "recen", and so on.
Now, generate all the permutations of these beads, which will give output like:
morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent
Then score these permutations based on how many words from your relevant dictionary they contain; the most correct results can then easily be filtered out.
more recently the development, which is a potent will score higher than morerecentlythedevelop ment, wh ich is a po ten t
Code which does the permutation part of the beads:
import re

def gen_abacus_perms(frags):
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]
    prefix_1 = "{0}{1}".format(frags[0], frags[1])
    prefix_2 = "{0} {1}".format(frags[0], frags[1])
    if len(frags) == 2:
        return [prefix_1, prefix_2]
    rem_perms = gen_abacus_perms(frags[2:])
    res = ["{0}{1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0} {1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0}{1}".format(prefix_2, x) for x in rem_perms] + \
          ["{0} {1}".format(prefix_2, x) for x in rem_perms]
    return res

broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split(r"\s+", broken)
perms = gen_abacus_perms(frags)
print("\n".join(perms))
Demo: http://ideone.com/pt4PSt
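A sketch of the scoring step described above might look like this (dictionary_words is a placeholder for your relevant lookup dictionary):

dictionary_words = {'more', 'recently', 'the', 'development',
                    'which', 'is', 'a', 'potent'}   # placeholder

def score(candidate):
    tokens = candidate.replace(',', '').split()
    return sum(1 for t in tokens if t in dictionary_words)

best = max(perms, key=score)   # perms from gen_abacus_perms above
print(best)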
Solution 2:
I would suggest an alternate approach which makes use of text-analysis intelligence already developed by people working on similar problems with big corpora of data, dictionaries, and grammars, e.g. search engines.
I am not well aware of such public/paid APIs, so my example is based on Google results.
Let's try to use Google:
You can keep putting your invalid terms into Google for multiple passes, and keep evaluating the results for some score based on your lookup dictionary.
Here are two relevant outputs from 2 passes of your text (the Google search screenshots are omitted here); the output of the first pass is used for the second pass, which gives you the conversion "more recently the development, which is a potent".
To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid / not-so-good results.
One rough technique could be a comparison of normalized strings using difflib.
>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
I would recommend stripping away the spaces and looking for dictionary words to break the text down into. There are a few things you can do to make it more accurate.
To get the first word in a text with no spaces, try taking the entire string and going through dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), the longest ones first, then taking letters off the end of the string you want to segment.
If you want it to work on a big string, you can make it automatically take letters off the back so that the string in which you are looking for the first word is only as long as the longest dictionary word. This should result in finding the longest words, and make it less likely to do something like classify "asynchronous" as "a synchronous".
Here is an example that uses raw_input to take in the text to correct, and a dictionary file called dictionary.txt:
word_list = [line.strip() for line in open("dictionary.txt")]   # loads a file with a list of words to break the string up into
words = raw_input("enter text to correct spaces on: ")
words = words.strip()   # strips away spaces
spaced = []             # this is the list of newly broken up words
parsing = True          # this represents when the while loop can end
while parsing:
    if len(words) == 0:   # checks if all of the text has been broken into words; if it has, end the while loop
        parsing = False
        break
    iterating = True
    for iteration in range(45):   # goes through each of the possible word lengths, starting from the biggest
        if iterating == False:
            break
        word = words[:45 - iteration]   # each iteration, the candidate has one letter removed from the back, starting with the longest possible word length, 45
        if word in word_list:           # this finds if this is the word we are looking for
            spaced.append(word)
            words = words[len(word):]   # takes the matched word off the front of the text
            iterating = False
    if iterating:   # no dictionary word matched at all; stop rather than loop forever
        parsing = False
print ' '.join(spaced)   # prints the output
If you want it to be even more accurate, you could try using a natural-language parsing program; there are several available for Python, free, online.
Here's something really basic:
chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print joined,
        del chunks[:]

# deal with left overs
if chunks:
    print ''.join(chunks)
I assume you have a set of valid words somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here's one way to do that:
def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words
You can iterate through the fragments, looking each one up in a dictionary of words to find the best fit, and adding fragments together when a match is not found.
def iterate(word, dictionary):
    # check whether the accumulated fragment is a dictionary word
    return word in dictionary

sentence = "more recen t ly the develop ment, wh ich is a po ten t"
fragments = sentence.split()
finished_sentence = []
i = 0
while i < len(fragments):
    word = fragments[i]
    # keep gluing the next fragment on until the result is in the dictionary
    while not iterate(word, dictionary) and i + 1 < len(fragments):
        i += 1
        word += fragments[i]
    finished_sentence.append(word)
    i += 1
print ' '.join(finished_sentence)
This should work. For the variable dictionary, download a txt file of every single English word, then load it in your program.
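For example, assuming a newline-delimited word file (the filename here is hypothetical), loading it as a set makes the lookups fast:

dictionary = set(line.strip().lower() for line in open('english_words.txt'))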
My index.py file looks like this:
from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))
and my index.php file looks like this:
<html>
<head>
<title>py script</title>
</head>
<body>
<h1>Hey There!Python Working Successfully In A PHP Page.</h1>
<?php
$python = `python index.py`;
echo $python;
?>
</body>
</html>
Hope this works.
Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up Python, and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10-syllable line with a 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with a 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
        #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))
    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'), which is why the output looks like 1101111101 when it should be the iambic pentameter stress pattern 0101010101.
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to Stack Overflow. I'm not that familiar with Python, but I see you have not received many answers yet, so I'll try to help you with your queries.
First some advice: you'll find that if you focus your questions, your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n")
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a post-grad for a couple of months to help you, instead of some workshop requirement.
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary or change existing ones; look on their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement, but gives you normalized values (0.0 = completely different, 1.0 = the same) and is suitable for classifying the stress patterns. A CS postgrad or final-year undergrad should not have too much trouble with it (hint hint). A sketch of the simpler Levenshtein option follows.
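Here is a minimal sketch of Levenshtein distance, with a normalization step bolted on so that scores are comparable across pattern lengths (addressing the con noted above):

def levenshtein(a, b):
    # classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # normalize: 1.0 = identical, 0.0 = completely different
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / float(max(len(a), len(b)))

print similarity('1101111101', '0101010101')   # how close is this line to iambic pentameter?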
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
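A sketch of that permutation idea: mark each syllable's stress as either fixed or ambiguous, then branch on every ambiguous one (the marking scheme here is just an assumption for illustration):

from itertools import product

def stress_candidates(per_syllable):
    # per_syllable holds one string per syllable: '1' or '0' if certain,
    # '01' if the syllable could be either unstressed or stressed
    return [''.join(p) for p in product(*per_syllable)]

# single-syllable words marked ambiguous ('01'), the rest fixed
line = ['01', '1', '0', '1', '01', '1', '01', '1', '0', '1']
for pattern in stress_candidates(line):
    print pattern   # compare each candidate against known forms like '0101010101'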
I've never worked with other kinds of fuzzy matching, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on Stack Overflow, and I'm a Python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems, and the code included in this question helped me, so I post what I came up with, building on that foundation. It is one way to extract the stress as a single string, correct the cmudict bias toward '1' with a 'fudging factor', and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print word
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                #print strip_letters(s), word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}

# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}

# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws", ws
        for ch in list(ws):
            #print "ch", ch
            if ch.isdigit():
                nm = nm + ch
                #print "ad to nm", nm, type(nm)
    return nm

# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)

"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}
"""
So, this is the script that has very kindly been given to me as a starter:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from __future__ import with_statement # needed for Python 2.5
from itertools import chain

def chunk(s):
    """Split a string on whitespace or hyphens"""
    return chain(*(c.split("-") for c in s.split()))

def process(latin, gloss, trans):
    chunks = zip(chunk(latin), chunk(gloss))
    # now you have to DO SOMETHING with the chunks!

def main():
    with open("examples.txt") as inf:
        try:
            while True:
                latin = inf.next().strip()
                gloss = inf.next().strip()
                trans = inf.next().strip()
                process(latin, gloss, trans)
                inf.next() # skip blank line
        except StopIteration:
            # reached end of file
            pass

if __name__ == "__main__":
    main()
However, I've just spoken to my lecturer, who has let me know that he doesn't want us using the
__ x __
function, as it is "too advanced for the students' needs at this point in the course".
I'm absolutely stumped as to what I need to put into the "chunks" or "process" parts; up until now I've been able to figure most of the other exercises out (with a few hints), but this one is just way beyond me. This particular part is worth 15 points out of 20, and it's making me feel just a little bit sick!
Any further help would be greatly appreciated.
Original post (sorry it's so long!)
I'm trying to do the following: I have a text in a language other than English, broken up into morphemes (the parts of each word) using hyphens, with the English gloss (the linguistic translation of each morpheme) and a direct translation below, e.g.
Itali-am fat-o profug-us Lavini-a-que ven-it
Italy-Fem:Sg:Acc fate-Neut:Sg:Abl fleeing-Masc:Sg:Nom Lavinian-Neut:Pl:Acc come:Perf-3-Sg:Indic:Act
'in flight [driven] by fate came to Italy and the Lavinian [shores]'
I'll have several texts such as the above in one file - i.e.
blank line
a line of latin broken up with hyphens
a line of gloss broken up with corresponding hyphens, using colons to join elements
a line of translation
blank line
latin
gloss
translation
ad infinitum.
What I need to do is write a file that gives me the following output:
Itali: 1 Italy
am: 1 Fem:Sg:Acc
fat: 1 fate
o: 1 Neut:Sg:Abl
profug: 1 fleeing
us: 1 Masc:Sg:Nom
Lavini: 1 Lavinian
a: 1 Neut:Pl:Acc
que: 1 come:Perf
ven: 1 3
it: 1 Sg:Indic:Act
where the first column represents the first line of text without hyphens; the second column indicates the number of occurrences (it's only 1 each in this example), and the third column is the English translation of the first column, as written in the text.
If there's a latin morpheme with no corresponding English gloss/translation, the Latin column will be as normal but the English column will print [unknown], like:
a: 1 [unknown]
And if the opposite, i.e. an English morpheme with no corresponding Latin, it should print
[unknown]: 1 kitten
Finally, the program needs to be able to deal with homophonous morphemes (i.e. two identically spelled Latin morphemes with different meanings), e.g.
a: 16 Neuter:Plural
a: 28 Feminine:Singular
Whenever you need to count occurrences, you need a dictionary.
Create a dictionary where the key is the tuple generated by zip, and the value is a list of [latin, amount, translation]. Each time you encounter the same tuple, you increment the amount.
The dictionary has to outlive the function, so you probably want to pass it in as a parameter.
Once you are done, you can do: result = dict.keys(); result.sort().
I'm not sure I understand the part about the unknowns. If this does not solve that part, you might need to show a relevant example.
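A minimal sketch of that counting idea, fleshing out the process function from the starter script (it reuses chunk from above; output formatting and the [unknown] cases are left as an exercise):

def process(latin, gloss, trans, counts):
    # pair each latin morpheme with its gloss and count the pair
    for latin_morph, gloss_morph in zip(chunk(latin), chunk(gloss)):
        key = (latin_morph, gloss_morph)
        counts[key] = counts.get(key, 0) + 1

counts = {}
latin = "Itali-am fat-o profug-us Lavini-a-que ven-it"
gloss = ("Italy-Fem:Sg:Acc fate-Neut:Sg:Abl fleeing-Masc:Sg:Nom "
         "Lavinian-Neut:Pl:Acc come:Perf-3-Sg:Indic:Act")
process(latin, gloss, "", counts)

for (morph, meaning) in sorted(counts):
    print "%s: %d %s" % (morph, counts[(morph, meaning)], meaning)
# Itali: 1 Italy
# Lavini: 1 Lavinian
# ...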