Related
For lemmatization, spaCy has lists of words: adjectives, adverbs, verbs... and also lists for exceptions: adverbs_irreg... For the regular ones there is a set of rules.
Let's take as an example the word "wider".
As it is an adjective, the rule for lemmatization should be taken from this list:
ADJECTIVE_RULES = [
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
]
As I understand it, the process will be like this:
1) Get the POS tag of the word to know whether it is a noun, a verb...
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.
Now, how is it decided to use "er" -> "e" instead of "er" -> "" to get "wide" and not "wid"?
Here it can be tested.
Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
Class
It starts off with initializing 3 variables:
class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules
Now, looking at self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py where it loads files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer
Why doesn't spaCy just read a file?
Most probably because declaring the strings in-code is faster than streaming strings through I/O.
Where do these index, exceptions and rules come from?
Looking at it closely, they all seem to come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html
Rules
Looking at it even closer, the rules in https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules from nltk https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749
And these rules originally come from the Morphy software https://wordnet.princeton.edu/man/morphy.7WN.html
Additionally, spaCy has included some punctuation rules that aren't from Princeton Morphy:
PUNCT_RULES = [
["“", "\""],
["”", "\""],
["\u2018", "'"],
["\u2019", "'"]
]
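To make that concrete, here is a tiny hedged sketch of how such a suffix rule would normalize a curly quote, with PUNCT_RULES as defined just above (apply_punct_rules is an illustrative helper, not spaCy's actual function):
def apply_punct_rules(token, rules=PUNCT_RULES):
    # mirror the suffix-replacement mechanism used for the word rules
    for old, new in rules:
        if token.endswith(old):
            return token[:len(token) - len(old)] + new
    return token

print(apply_punct_rules("\u201c"))  # -> '"'
print(apply_punct_rules("\u2019"))  # -> "'"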
Exceptions
As for the exceptions, they are stored in the *_irreg.py files in spaCy, and they look like they also come from the Princeton WordNet.
This is evident if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc); if you download the wordnet package from nltk, we see that it's the same list:
alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc cntlist.rev data.noun index.adv index.verb noun.exc
adv.exc data.adj data.verb index.noun lexnames README
citation.bib data.adv index.adj index.sense LICENSE verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc
1490 adj.exc
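For illustration, a minimal sketch of how such an .exc file maps onto the *_IRREG dictionaries: each line is simply an inflected form followed by one or more lemmas (this parser is illustrative, not spaCy's or NLTK's actual loading code):
def load_exc(path):
    exc = {}
    with open(path) as fin:
        for line in fin:
            inflected, *lemmas = line.split()
            exc[inflected] = lemmas
    return exc

# load_exc('adj.exc') gives entries along the lines of {'worse': ['bad'], ...}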
Index
If we look at the spaCy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py and the redistributed copy of WordNet in nltk:
alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj
1 This software and database is being provided to you, the LICENSEE, by
2 Princeton University under the following license. By obtaining, using
3 and/or copying this software and database, you agree that you have
4 read, understood, and will comply with these terms and conditions.:
5
6 Permission to use, copy, modify and distribute this software and
7 database and its documentation for any purpose and without fee or
8 royalty is hereby granted, provided that you agree to comply with
9 the following copyright notice and statements, including the disclaimer,
10 and that the same appear on ALL copies of the software, database and
11 documentation, including modifications that you make for internal
12 use or for distribution.
13
14 WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved.
15
16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
18 IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
23 OTHER RIGHTS.
24
25 The name of Princeton University or Princeton may not be used in
26 advertising or publicity pertaining to distribution of the software
27 and/or database. Title to copyright in this software, database and
28 any associated documentation shall at all times remain with
29 Princeton University and LICENSEE agrees to preserve same.
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 | being born or beginning; "the nascent chicks"; "a nascent insurgency"
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels
On the basis that the dictionary, exceptions and rules that the spaCy lemmatizer uses are largely from Princeton WordNet and its Morphy software, we can move on to see how spaCy actually applies the rules using the index and exceptions.
We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
The main action comes from this function rather than the Lemmatizer class:
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    # TODO: Is this correct? See discussion in Issue #435.
    #if string in index:
    #    forms.append(string)
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)
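A quick illustration of how the function above behaves, using toy data in place of spaCy's real WordNet-derived index and exceptions (the values below are made up for the example):
index = {"wide", "nice"}                                      # known adjectives
exceptions = {"worse": ["bad"]}                               # irregular forms
rules = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]

print(lemmatize("wider", index, exceptions, rules))   # {'wide'}  -- 'wid' is OOV, 'wide' is in the index
print(lemmatize("worse", index, exceptions, rules))   # {'bad'}   -- straight from the exceptions
print(lemmatize("zorpy", index, exceptions, rules))   # {'zorpy'} -- no rule matches, fall back to the string
So both "er" -> "" and "er" -> "e" are tried; the index is what decides that "wide" survives and "wid" does not.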
Why is the lemmatize method outside of the Lemmatizer class?
That I'm not exactly sure, but perhaps it's to ensure that the lemmatization function can be called without a class instance. Given that @staticmethod and @classmethod exist, though, perhaps there are other considerations as to why the function and the class have been decoupled.
Morphy vs Spacy
Comparing spaCy's lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main processes in Oliver Steele's Python port of the WordNet morphy are (a quick usage sketch follows this list):
Check the exception lists
Apply rules once to the input to get y1, y2, y3, etc.
Return all that are in the database (and check the original too)
If there are no matches, keep applying rules until we find a match
Return an empty list if we can't find anything
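As a quick usage sketch of that NLTK port (assuming the WordNet corpus has been downloaded via nltk.download('wordnet')):
from nltk.corpus import wordnet as wn

print(wn.morphy('wider', wn.ADJ))   # 'wide' -- the "er" -> "e" rule yields a form that is in the database
print(wn.morphy('alvations'))       # None  -- nothing matches, so morphy itself returns nothing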
For spaCy, it's possibly still under development, given the TODO at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76
But the general process seems to be:
Look at the exceptions, and get the lemma from the exception list if the word is in it.
Apply the rules
Save the ones that are in the index lists
If there is no lemma from steps 1-3, then keep track of the out-of-vocabulary (OOV) forms and also append the original string to the lemma forms
Return the lemma forms
In terms of OOV handling, spaCy returns the original string if no lemmatized form is found. In that respect, the nltk implementation of morphy does the same, e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'
Checking for infinitive before lemmatization
Possibly another point of difference is how morphy and spaCy decide which POS to assign to a word. In that respect, spaCy puts some linguistic rules in the Lemmatizer() to decide whether a word is the base form, and skips lemmatization entirely if the word is already in the infinitive form (is_base_form()). This saves quite a bit if lemmatization is to be done for all words in the corpus and quite a chunk of them are infinitives (already the lemma form).
But that's possible in spaCy because it allows the lemmatizer to access the POS, which is tied closely to some morphological rules. For morphy, although it's possible to figure out some of the morphology using the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitives.
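A rough sketch of the kind of check is_base_form() performs (this is a simplified illustration based on the description above, not spaCy's actual code; the feature names follow Universal Dependencies conventions):
def looks_like_base_form(univ_pos, morphology=None):
    # if the POS plus morphological features already indicate a base form,
    # skip the exception/rule machinery and return the string as its own lemma
    morphology = morphology or {}
    if univ_pos == 'verb' and morphology.get('VerbForm') == 'inf':
        return True
    if univ_pos == 'noun' and morphology.get('Number') == 'sing':
        return True
    if univ_pos == 'adj' and morphology.get('Degree') == 'pos':
        return True
    return False

print(looks_like_base_form('verb', {'VerbForm': 'inf'}))   # True -> no lemmatization needed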
Generally, the three primary signals of morphological features that need to be teased out of the POS tag are:
person
number
gender
Updated
spaCy made changes to its lemmatizer after the initial answer (12 May 2017). I think the purpose was to make lemmatization faster by skipping the index/exception look-ups and rule processing.
So they pre-lemmatize words and keep them in a lookup hash table, making retrieval O(1) for words they have pre-lemmatized: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py
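In other words, something along these lines (the table contents here are invented for the example):
LOOKUP = {"wider": "wide", "mice": "mouse", "was": "be"}

def lookup_lemmatize(word):
    # one hash lookup; out-of-table words fall back to themselves
    return LOOKUP.get(word, word)

print(lookup_lemmatize("wider"))       # 'wide'
print(lookup_lemmatize("alvations"))   # 'alvations'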
Also, in efforts to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92
But the underlying lemmatization steps discussed above are still relevant to the current spaCy version (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc)
Epilogue
I guess now that we know it works with linguistic rules and all, the other question is "are there any non-rule-based methods for lemmatization?"
But before answering that, "What exactly is a lemma?" might be the better question to ask.
TLDR: spaCy checks whether the lemma it's trying to generate is in the known list of words or exceptions for that part of speech.
Long Answer:
Check out the lemmatizer.py file, specifically the lemmatize function at the bottom.
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)
For English adjectives, for instance, it takes in the string we're evaluating, the index of known adjectives, the exceptions, and the rules, as you've referenced, from this directory (for the English model).
The first thing we do in lemmatize, after making the string lower case, is check whether the string is in our list of known exceptions, which includes lemma rules for words like "worse" -> "bad".
Then we go through our rules and apply each one to the string if it is applicable. For the word wider, we would apply the following rules:
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
and we would output the following forms: ["wid", "wide"].
Then, we check if this form is in our index of known adjectives. If it is, we append it to the forms. Otherwise, we add it to oov_forms, which I'm guessing is short for out of vocabulary. wide is in the index, so it gets added. wid gets added to oov_forms.
Lastly, we return a set of either the lemmas found, or any lemmas that matched rules but weren't in our index, or just the word itself.
The word-lemmatize link you posted above works for wider, because wide is in the word index. Try something like He is blandier than I. spaCy will mark blandier (word I made up) as an adjective, but it's not in the index, so it will just return blandier as the lemma.
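For example, something like this (the exact model name and output can vary between spaCy versions, so treat it as a sketch):
import spacy

nlp = spacy.load('en_core_web_sm')
for token in nlp("He is blandier than I"):
    print(token.text, token.pos_, token.lemma_)
# 'blandier' gets tagged as an adjective, but since it is not in the index,
# its lemma comes back essentially unchanged.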
There is a set of rules and a set of known words for each word type (adjective, noun, verb, adverb). The mapping happens here:
INDEX = {
"adj": ADJECTIVES,
"adv": ADVERBS,
"noun": NOUNS,
"verb": VERBS
}
EXC = {
"adj": ADJECTIVES_IRREG,
"adv": ADVERBS_IRREG,
"noun": NOUNS_IRREG,
"verb": VERBS_IRREG
}
RULES = {
"adj": ADJECTIVE_RULES,
"noun": NOUN_RULES,
"verb": VERB_RULES,
"punct": PUNCT_RULES
}
Then on this line in lemmatizer.py the correct index, rules and exc (exc I believe stands for exceptions, e.g. irregular examples) get loaded:
lemmas = lemmatize(string, self.index.get(univ_pos, {}),
self.exc.get(univ_pos, {}),
self.rules.get(univ_pos, []))
All the remaining logic is in the function lemmatize and is surprisingly short. We perform the following operations:
1. If there is an exception (i.e. the word is irregular) for the provided string, use it and add it to the lemmatized forms
2. For each rule, in the order they are given for the selected word type, check if it matches the given word. If it does, try to apply it.
2a. If after applying the rule the word is in the list of known words (i.e. the index), add it to the lemmatized forms of the word
2b. Otherwise add the word to a separate list called oov_forms (here I believe oov stands for "out of vocabulary")
In case we've found at least one form using the rules above, we return the list of forms found; otherwise we return the oov_forms list.
I have the following text where I want to get rid of everything with the format '(number)' / '(... ; number)':
In all living organisms, from bacteria to man, DNA and chromatin are
invariably associated with binding proteins, which organize their
structure (1; 2 ; 3). Many of these architectural proteins are
molecular bridges that can bind at two or more distinct DNA sites to
form loops. For example, bacterial DNA is looped and compacted by the
histonelike protein H-NS, which has two distinct DNA-binding domains
(4). In eukaryotes, complexes of transcription factors and RNA
polymerases stabilize enhancer-promoter loops (5; 6; 7 ; 8), while
HP1 (9), histone H1 (10), and the polycomb-repressor complex PRC1/2
(11 ; 12) organize inactive chromatin. Proteins also bind to specific
DNA sequences to form larger structures, like nucleoli and the
histone-locus, or Cajal and promyeloleukemia bodies (13; 14; 15; 16;
17 ; 18). The selective binding of molecular bridges to active and
inactive regions of chromatin has also been highlighted as one
possible mechanism underlying the formation of topologically
associated domains (TADs)—regions rich in local DNA interactions (6; 8
; 19).
I want it to be in the form:
In all living organisms, from bacteria to man, DNA and chromatin are
invariably associated with binding proteins, which organize their
structure . Many of these architectural proteins are molecular bridges
that can bind at two or more distinct DNA sites to form loops. For
example, bacterial DNA is looped and compacted by the histonelike
protein H-NS, which has two distinct DNA-binding domains . In
eukaryotes, complexes of transcription factors and RNA polymerases
stabilize enhancer-promoter loops , while HP1 , histone H1 , and the
polycomb-repressor complex PRC1/2 organize inactive chromatin.
Proteins also bind to specific DNA sequences to form larger
structures, like nucleoli and the histone-locus, or Cajal and
promyeloleukemia bodies . The selective binding of molecular bridges
to active and inactive regions of chromatin has also been highlighted
as one possible mechanism underlying the formation of topologically
associated domains (TADs)—regions rich in local DNA interactions .
My attempt was as follows:
import re
x=re.sub(r'\(.+; \d+\)', '', x) # eliminate brackets with multiple numbers
#### NOTE: there are 2 spaces between the last ';' and the last digit
x=re.sub(r'\d+\)', '', x) # eliminate brackets with single number
My output was this:
In all living organisms, from bacteria to man, DNA and chromatin are
invariably associated with binding proteins, which organize their
structure .
So clearly my code is missing something. I thought that '(.+)' would identify all brackets containing arbitrary characters, and that I could then further specify that I want all the ones ending in a '; number'.
I just want a flexible way of indexing the sentence at all places with '(number' and 'number)' and eliminating everything in between....
Maybe you can try to use the pattern
re.sub('\([0-9; ]+\)', '', x)
which removes all parentheses that contain only digits, ";" or spaces.
I don't think the r prefix is necessary in this case.
Try the following regex:
r'\s\((\d+\s?;?\s?)+\)'
This regex will match one or more groups of numbers (followed by spaces/semicolons) inside of parentheses.
There seems to always be a space before the collection of numbers, so matching that should help with the "trailing space".
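For example, a quick check of this pattern on a fragment of the text above:
import re

sample = "stabilize enhancer-promoter loops (5; 6; 7 ; 8), while HP1 (9) organizes it"
print(re.sub(r'\s\((\d+\s?;?\s?)+\)', '', sample))
# -> 'stabilize enhancer-promoter loops, while HP1 organizes it'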
You can use a pattern like \(\d+(?:;\s?\d+\s?)*\), which matches an opening parenthesis and digits (( <number>), then any number of repeated ; <number> groups, ending with ). Test it.
Or if you're feeling brave you can use \([;\d\s]+\) which just matches everything with digits/spaces/semicolons between two parentheses. Test it.
I found a glitch in your expected text: there's one space missing after PRC1/2. But this code works, with that space added back in:
text="""
In all living organisms, from bacteria to man, DNA and chromatin are invariably
associated with binding proteins, which organize their structure (1; 2 ; 3).
Many of these architectural proteins are molecular bridges that can bind at two
or more distinct DNA sites to form loops. For example, bacterial DNA is looped
and compacted by the histonelike protein H-NS, which has two distinct
DNA-binding domains (4). In eukaryotes, complexes of transcription factors and
RNA polymerases stabilize enhancer-promoter loops (5; 6; 7 ; 8), while HP1 (9),
histone H1 (10), and the polycomb-repressor complex PRC1/2 (11 ; 12) organize
inactive chromatin. Proteins also bind to specific DNA sequences to form larger
structures, like nucleoli and the histone-locus, or Cajal and promyeloleukemia
bodies (13; 14; 15; 16; 17 ; 18). The selective binding of molecular bridges to
active and inactive regions of chromatin has also been highlighted as one
possible mechanism underlying the formation of topologically associated domains
(TADs)—regions rich in local DNA interactions (6; 8 ; 19).
""".replace('\n', ' ')
expected="""
In all living organisms, from bacteria to man, DNA and chromatin are invariably
associated with binding proteins, which organize their structure . Many of
these architectural proteins are molecular bridges that can bind at two or more
distinct DNA sites to form loops. For example, bacterial DNA is looped and
compacted by the histonelike protein H-NS, which has two distinct DNA-binding
domains . In eukaryotes, complexes of transcription factors and RNA polymerases
stabilize enhancer-promoter loops , while HP1 , histone H1 , and the
polycomb-repressor complex PRC1/2 organize inactive chromatin. Proteins also
bind to specific DNA sequences to form larger structures, like nucleoli and the
histone-locus, or Cajal and promyeloleukemia bodies . The selective binding of
molecular bridges to active and inactive regions of chromatin has also been
highlighted as one possible mechanism underlying the formation of topologically
associated domains (TADs)—regions rich in local DNA interactions .
""".replace('\n', ' ')
import re
cites = r"\(\s*\d+(?:\s*;\s+\d+)*\s*\)"
edited = re.sub(cites, '', text)
i = 0
while i < len(edited):
    if edited[i] == expected[i]:
        print(edited[i], sep='', end='')
    else:
        print('[', edited[i], ']', sep='', end='')
    i += 1
print('')
The regex I'm using is cites, and it looks like this:
r"\(\s*\d+(?:\s*;\s+\d+)*\s*\)"
The syntax r"..." means "raw", which for our purposes means "leave backslashes alone!" It's what you should (nearly) always use for regexes.
The outer \( and \) match the actual parens in the citation.
The \s* matches zero or more "white-space" characters, which include spaces, tabs, and newlines.
The \d+ matches one or more digits [0..9].
So to start with, there's a regex like:
\( \s* \d+ \s* \)
Which is just "parens around a number, maybe with spaces before or after."
The inner part,
(?:\s*;\s+\d+)*
says "don't capture": (?:...) is a non-capturing group, because we don't care about \1 or getting anything out of the pattern, we just want to delete it.
The \s*; matches optional spaces before a semicolon.
The \s+\d+ matches required spaces before another number - you might have to make those optional spaces if you have something like "(1;3;5)".
The * after the non-capturing group means zero or more occurrences.
Put it all together and you have:
open-paren
optional spaces
number
followed by zero or more of:
optional spaces
semicolon
required spaces
number
optional spaces
close-paren
I'm preparing text for a word cloud, but I'm stuck.
I need to remove all digits and all signs like . , - ? = / ! # etc., but I don't know how. I don't want to call replace again and again. Is there a method for that?
Here is my concept and what I have to do:
Concatenate texts in one string
Set chars to lowercase <--- I'm here
Now I want to delete specific signs and divide the text into words (list)
calculate freq of words
next do the stopwords script...
abstracts_list = open('new','r')
abstracts = []
allab = ''
for ab in abstracts_list:
    abstracts.append(ab)
for ab in abstracts:
    allab += ab
Lower = allab.lower()
Text example:
MicroRNAs (miRNAs) are a class of noncoding RNA molecules
approximately 19 to 25 nucleotides in length that downregulate the
expression of target genes at the post-transcriptional level by
binding to the 3'-untranslated region (3'-UTR). Epstein-Barr virus
(EBV) generates at least 44 miRNAs, but the functions of most of these
miRNAs have not yet been identified. Previously, we reported BRUCE as
a target of miR-BART15-3p, a miRNA produced by EBV, but our data
suggested that there might be other apoptosis-associated target genes
of miR-BART15-3p. Thus, in this study, we searched for new target
genes of miR-BART15-3p using in silico analyses. We found a possible
seed match site in the 3'-UTR of Tax1-binding protein 1 (TAX1BP1). The
luciferase activity of a reporter vector including the 3'-UTR of
TAX1BP1 was decreased by miR-BART15-3p. MiR-BART15-3p downregulated
the expression of TAX1BP1 mRNA and protein in AGS cells, while an
inhibitor against miR-BART15-3p upregulated the expression of TAX1BP1
mRNA and protein in AGS-EBV cells. Mir-BART15-3p modulated NF-κB
activity in gastric cancer cell lines. Moreover, miR-BART15-3p
strongly promoted chemosensitivity to 5-fluorouracil (5-FU). Our
results suggest that miR-BART15-3p targets the anti-apoptotic TAX1BP1
gene in cancer cells, causing increased apoptosis and chemosensitivity
to 5-FU.
So to convert upper case characters to lower case you could do the following:
just store your text in a string variable, for example STRING, and then use the command (after import re)
STRING = re.sub('([A-Z]{1})', r'\1', STRING).lower()
Now your string will be free of capital letters (it's really the .lower() call doing the work here).
To remove the special characters, the re module can again help you with the sub command:
STRING = re.sub('[^a-zA-Z0-9-_*.]', ' ', STRING)
With this command your string will be free of special characters.
And to determine the word frequency you could use the Counter class from the collections module.
Then use the following command to determine the frequency with which the words occur:
Counter(STRING.split()).most_common()
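Putting the three steps together on a short snippet of the abstract (the variable names here are just for the example, and the character class is the one from above):
import re
from collections import Counter

text = "MicroRNAs (miRNAs) are a class of noncoding RNA molecules approximately 19 to 25 nucleotides in length"
cleaned = re.sub('[^a-zA-Z0-9-_*.]', ' ', text.lower())
print(Counter(cleaned.split()).most_common(5))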
I'd probably try to use string.isalpha():
abstracts = []
with open('new', 'r') as abstracts_list:
    for ab in abstracts_list:  # this gives one line of text.
        if not ab.isalpha():
            ab = ''.join(c for c in ab if c.isalpha())
        abstracts.append(ab.lower())

# now assuming you want the text in one big string like allab was
long_string = ''.join(abstracts)
I have a list of strings and I want to find popular prefixes. The prefixes are special in that they occur as strings in the input list.
I found a similar question here but the answers are geared to find the one most common prefix:
Find *most* common prefix of strings - a better way?
While my problem is similar, it differs in that I need to find all popular prefixes. Or to maybe state it a little simplistically, rank prefixes from most common to least.
As an example, consider the following list of strings:
in, india, indian, indian flag, bull, bully, bullshit
Prefixes rank:
in - 4 times
india - 3 times
bull - 3 times
...and so on. Please note - in, bull, india are all present in the input list.
The following are not valid prefixes:
ind
bu
bul
...since they do not occur in the input list.
What data structure should I be looking at to model my solution? I'm inclined to use a "trie" with a counter on each node that tracks how many times that node has been touched during the creation of the trie.
All suggestions are welcome.
Thanks.
p.s. - I love python and would love if someone could post a quick snippet that could get me started.
words = [ "in", "india", "indian", "indian", "flag", "bull", "bully", "bullshit"]
Result = sorted([ (sum([ w.startswith(prefix) for w in words ]) , prefix ) for prefix in words])[::-1]
it goes through every word as a prefix and checks how many of the other words start with it and then sorts the result. the[::-1] simply reverses that order
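With "indian flag" kept as a single item, as in the question, Result comes out as:
[(4, 'in'), (3, 'india'), (3, 'bull'), (2, 'indian'), (1, 'indian flag'), (1, 'bully'), (1, 'bullshit')]
which matches the ranking asked for (in - 4, india - 3, bull - 3).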
If we know the length of the prefix (say 3):
from nltk import FreqDist

prefixDist = FreqDist()
for word in vocabulary:   # vocabulary is the input list of words
    prefixDist[word[:3]] += 1
commonPrefixes = [prefix for (prefix, count) in prefixDist.most_common(150)]
print(commonPrefixes)
Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
            #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))
    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'), which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n");
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia you might be able to convince the CS faculty to lend you a post-grad for a couple of months to help you, instead of some workshop requirement.
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are, as the number of changes needed to turn one string into the other (a rough sketch follows this list). Pros: easy to implement. Cons: not normalized; a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last-year undergrad should not have too much trouble with it (hint hint).
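Here's a rough, self-contained sketch of the Levenshtein option, normalized by pattern length so scores are comparable across line lengths (purely illustrative; the pattern and any acceptance threshold are up to you):
def levenshtein(a, b):
    # classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def pentameter_score(stress, pattern="0101010101"):
    # 1.0 means identical; a score above some threshold could count as "close enough"
    return 1.0 - levenshtein(stress, pattern) / max(len(stress), len(pattern))

print(pentameter_score("1101111101"))   # imperfect but recognisably close to iambic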
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick ones that look like a known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
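A small sketch of that fallback, with the crude vowel-counting heuristic and the silent trailing e rule mentioned above (all of it is a rough approximation by design):
from itertools import product

def rough_syllable_count(word):
    vowels = "aeiouy"
    count, prev_was_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev_was_vowel:
            count += 1                     # count runs of vowels, not single vowels
        prev_was_vowel = is_vowel
    if word.lower().endswith("e") and count > 1:
        count -= 1                         # "trailing e is silent"
    return max(count, 1)

def stress_candidates(word):
    # every possible 0/1 stress string of the estimated length
    n = rough_syllable_count(word)
    return [''.join(bits) for bits in product("01", repeat=n)]

print(stress_candidates("corinna"))   # estimated 3 syllables -> 8 candidate patterns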
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on stackoverflow, and I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
The code included in this question helped me, so I post what I came up with, which builds on that foundation. It is one way to extract the stress as a single string, correct for the cmudict bias with a 'fudging factor', and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()
#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print word
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                # print strip_letters(s),word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}
# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?@#$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}
# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws",ws
        for ch in list(ws):
            #print "ch",ch
            if ch.isdigit():
                nm = nm + ch
                #print "ad to nm",nm, type(nm)
    return nm
# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}