Match longest substring in Python

Suppose I have the following string, with a tab between the left and right parts, in a text file:
The dreams of REM (Geo) sleep The sleep paralysis
I want to match the above string, requiring that both the left part and the right part match, against each line of another file like the following:
The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
If the full string cannot be matched, then try to match a substring.
I want to search with the leftmost and rightmost patterns.
e.g. (leftmost cases):
The dreams of REM sleep paralysis
The dreams of REM sleep The sleep
e.g. (rightmost cases):
REM sleep The sleep paralysis
The dreams of The sleep paralysis
Thanks a lot again for any kind of help.

(Ok, you clarified most of what you want. Let me restate, then clarify the points I listed below as remaining unclear... Also take the starter code I show you, adapt it, post us the result.)
You want to search, line-by-line, case-insensitive, for the longest contiguous matches to each of a pair of match-patterns. All the patterns seem to be disjoint (impossible to get a match on both patternX and patternY, since they use different phrases, e.g. can't match both 'frontal lobe' and 'prefrontal cortex').
Your patterns are supplied as a sequence of pairs ('dom','rang') => let's just refer to them by their subscripts [0] and [1]; you can use string.split('\t') to get them.
The important thing is a matching line must match both the dom and rang patterns (fully or partially).
Order is independent, so we can match rang then dom, or vice versa => use 2 separate regexes per line, and test that both d and r matched.
Patterns have optional parts, in parentheses => so just write/convert them to regex syntax using the (optionaltext)? syntax, e.g.: re.compile('Frontallobes of (leftside)? the brain', re.IGNORECASE)
The return value should be the string buffer with the longest substring match so far.
Now this is where several things remain to be clarified - please edit your question to explain the following:
If you find full matches to any pair of patterns, then return that.
If you can't find any full matches, then search for partial matches of both of the pair of patterns. Where 'partial match' means 'the most words' or 'the highest proportion(%) of words' from a pattern? Presumably we exclude spurious matches to words like 'the', in which case we lose nothing by simply omitting 'the' from your search patterns, then this guarantees that all partial matches to any pattern are significant.
We score the partial matches (somehow), e.g. 'contains most words from pattern X', or 'contains highest % of words from pattern X'. We should do this for all patterns, then return the pattern with the highest score. You'll need to think about this a little, is it better to match 2 words of a 5-word pattern (40%) e.g. 'dreams of', or 1 of 2 (50%) e.g. 'prefrontal BUT NOT cortex'? How do we break ties, etc? What happens if we match 'sleep' but nothing else?
Each of the above questions will affect the solution, so you need to answer them for us. There's no point in writing pages of code to solve the most general case when you only needed something simple.
In general this is called 'NLP' (natural language processing). You might end up using an NLP library.
The general structure of the code so far is sounding like:
import re

# normally, read your input directly from file, but this allows us to test:
input = """The pons also contains the sleep paralysis center of the brain as well as generating the dreams of REM sleep.
The optic tract is a part of the visual system in the brain.
The inferior frontal gyrus is a gyrus of the frontal lobe of the human brain.
The prefrontal cortex (PFC) is the anterior part of the frontallobes of the brain, lying in front of the motor and premotor areas.
There are three possible ways to define the prefrontal cortex as the granular frontal cortex as that part of the frontal cortex whose electrical stimulation does not evoke movements.
This allowed the establishment of homologies despite the lack of a granular frontal cortex in nonprimates.
Modern tracing studies have shown that projections of the mediodorsal nucleus of the thalamus are not restricted to the granular frontal cortex in primates.
""".split('\n')

patterns = [
    ('(dreams of REM (Geo)? sleep)', '(sleep paralysis)'),
    ('(frontal lobe)', '(inferior frontal gyrus)'),
    ('(prefrontal cortex)', '(frontallobes of (leftside )?(the )?brain)'),
    ('(modern tract)', '(probably mediodorsal nucleus)') ]

# Compile the patterns as regexes
patterns = [ (re.compile(dstr), re.compile(rstr)) for (dstr, rstr) in patterns ]

def longest(t):
    """Get the longest from a tuple of strings."""
    l = list(t) # tuples can't be sorted (immutable), so convert to list...
    l.sort(key=len, reverse=True)
    return l[0]

def custommatch(line):
    for (d, r) in patterns:
        # If got full match to both (d,r), return it immediately...
        (dm, rm) = (d.findall(line), r.findall(line))
        # Slight design problem: we get tuples like: [('frontallobes of the brain', '', 'the ')]
        # ... so return the longest match strings for each of dm, rm
        if dm and rm: # must match both dom & rang
            return [longest(dm), longest(rm)]
        # else score any partial matches to (d,r) - how exactly?
        # TBD...
    # We got here because we only have partial matches (or none)
    # TBD: return the 'highest-scoring' partial match
    return ('TBD... partial match')

for line in input:
    print custommatch(line)
and running on the 7 lines of input you supplied (plus the empty line that split('\n') leaves at the end) currently gives:
TBD... partial match
TBD... partial match
['frontal lobe', 'inferior frontal gyrus']
['prefrontal cortex', ('frontallobes of the brain', '', 'the ')]
TBD... partial match
TBD... partial match
TBD... partial match
TBD... partial match
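The partial-match scoring left as TBD could look something like the sketch below; this is only an illustration assuming a 'most words from the pattern' criterion, not part of the original answer:
import re

def word_score(pattern_text, line):
    # crude partial-match score: the fraction of words from the raw pattern
    # text (parentheses and '?' stripped, 'the' ignored) that appear in the line
    words = [w for w in re.sub(r'[()?]', '', pattern_text).lower().split()
             if w != 'the']
    hits = sum(1 for w in words if w in line.lower())
    return hits / float(len(words)) if words else 0.0

# e.g. compute word_score for both halves of every pattern pair on each line,
# then return the pair (and its matched words) with the highest combined score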

Using regex to match adjacent words using first letter of abbreviations in text python

I have a list of abbreviations that I am trying to find in my text using regex. However, I am struggling to find the adjacent words by matching single letters, and have only achieved this with full-word matching. Here is my text:
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
My list is as such: [USNS, UGF, NFZ, AR]
I would like to find the corresponding long forms in the text using the first letter of each abbreviation. It would also need to be case-insensitive. My attempt so far:
re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
which returns United States Navy Seals. However, when I try to use just the first letters:
re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)
It then returns nothing. Furthermore, some of the abbreviations take more than just the initial letter of a word in the text, such as UGF for underground facility.
My actual goal is to eventually replace all abbreviations in the text (USNS, UGF, NFZ, AR) with their corresponding long forms (United States Navy Seals, underground facility, no-fly-zone, assault-rifle).
In your last regex [1]
re.search(r'\bU\W+?S\b\W+?N\b\W+?S\b', text)
you get no match because you made several mistakes:
\w+ means one or more word characters, \W+ is for one or more non-word characters.
the \b boundary anchor is sometimes in the wrong place (i.e. between the initial letter and the rest of the word)
re.search(r'\bU\w+\sS\w+?\sN\w+?\sS\w+', text)
should match.
And, well,
print(re.search(r'\bu\w+?g\w+\sf\w+', text))
matches of course underground facility, but in a long text there will be many more irrelevant matches.
Approach to generalization
Finally I built a little "machine" that dynamically creates regular expressions from the known abbreviations:
import re
text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''
abbrs = ['USNS', 'UGF', 'NFZ', 'AR']
for abbr in abbrs:
    pattern = ''.join(map(lambda i: '['+i.upper()+i.lower()+'][a-z]+[ a-z-]', abbr))
    print(pattern)
    print(re.search(pattern, text, flags=re.IGNORECASE))
The output of above script is:
[Uu][a-z]+[ a-z-][Ss][a-z]+[ a-z-][Nn][a-z]+[ a-z-][Ss][a-z]+[ a-z-]
<re.Match object; span=(20, 45), match='United States Navy Seals '>
[Uu][a-z]+[ a-z-][Gg][a-z]+[ a-z-][Ff][a-z]+[ a-z-]
<re.Match object; span=(89, 110), match='underground facility '>
[Nn][a-z]+[ a-z-][Ff][a-z]+[ a-z-][Zz][a-z]+[ a-z-]
<re.Match object; span=(140, 152), match='no-fly-zone '>
[Aa][a-z]+[ a-z-][Rr][a-z]+[ a-z-]
<re.Match object; span=(170, 184), match='assault-rifle '>
Further generalization
If we assume that in a text each abbreviation is introduced after the first occurrence of the corresponding long form, and we further assume that the way it is written definitely starts with a word boundary and definitely ends with a word boundary (no assumptions about capitalization and the use of hyphens), we can try to extract a glossary automatically like this:
import re

text = '''They posted out the United States Navy Seals (USNS) to the area.
Entrance was through an underground facility (UGF) as they has to bypass a no-fly-zone (NFZ).
I found an assault-rifle (AR) in the armoury.'''

# build a regex for an initial
def init_re(i):
    return f'[{i.upper()+i.lower()}][a-z]+[ -]??'

# build a regex for an abbreviation
def abbr_re(abbr):
    return r'\b'+''.join([init_re(i) for i in abbr])+r'\b'

# build an inverse glossary from a text
def inverse_glossary(text):
    abbreviations = set(re.findall(r'\([A-Z]+\)', text))
    igloss = dict()
    for pabbr in abbreviations:
        abbr = pabbr[1:-1]
        pattern = '('+abbr_re(abbr)+') '+r'\('+abbr+r'\)'
        m = re.search(pattern, text)
        if m:
            longform = m.group(1)
            igloss[longform] = abbr
    return igloss

igloss = inverse_glossary(text)
for long in igloss:
    print('{} -> {}'.format(long, igloss[long]))
The output is
no-fly-zone -> NFZ
United States Navy Seals -> USNS
assault-rifle -> AR
underground facility -> UGF
By using an inverse glossary you can easily replace all long forms with their corresponding abbreviations. It is a bit harder to do this for all but the first occurrence. There is much room for refinement, for example to correctly handle line breaks within long forms (and to use re.compile).
To replace the abbreviations with the long forms, you have to build a normal glossary instead of an inverse one:
# build a glossary from a text
def glossary(text):
    abbreviations = set(re.findall(r'\([A-Z]+\)', text))
    gloss = dict()
    for pabbr in abbreviations:
        abbr = pabbr[1:-1]
        pattern = '('+abbr_re(abbr)+') '+r'\('+abbr+r'\)'
        m = re.search(pattern, text)
        if m:
            longform = m.group(1)
            gloss[abbr] = longform
    return gloss

gloss = glossary(text)
for abbr in gloss:
    print('{}: {}'.format(abbr, gloss[abbr]))
The output here is
AR: assault-rifle
NFZ: no-fly-zone
UGF: underground facility
USNS: United States Navy Seals
The replacement itself is left to the reader.
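A minimal sketch of that replacement step, using the gloss dict built by glossary(text) above (the function name is illustrative, not from the original answer):
import re

def expand_abbreviations(text, gloss):
    for abbr, longform in gloss.items():
        # \b keeps short abbreviations like 'AR' from matching inside other words;
        # note that this also rewrites the defining '(USNS)'-style occurrences themselves
        text = re.sub(r'\b' + re.escape(abbr) + r'\b', longform, text)
    return text

print(expand_abbreviations(text, gloss))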
[1]
Let's take a closer look at your first regex again:
re.search(r'\bUnited\W+?States\b\W+?Navy\b\W+?Seals\b', text)
The boundary anchors (\b) are redundant. They can be removed without changing anything in the result, because \W+? means at least one non-word character after the last character of States and Navy. They cause no problems here, but I guess they led to the confusion when you started modifying this regex to get a more general one.
You could use the regex below, which would take care of the case sensitivity as well.
This would just find United States Navy Seals.
\s[u|U].*?[s|S].*?[n|N].*?[s|S]\w+
Similarly, for UGF,
You can use - \s[u|U].*?[g|G].*?[f|F]\w+
Please find the patterns above. The characters are just joined with .*?, and each character is written as a class like [a|A], which matches either lower case or upper case. The pattern starts with \s since it should be a separate word, and ends with \w+.
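A tiny helper that builds such a pattern programmatically might look like this (a sketch, not part of the original answer; the pipes inside the character classes are unnecessary, so they are dropped here):
def loose_pattern(abbr):
    # join one character class per abbreviation letter with a lazy .*?
    return r'\s' + r'.*?'.join('[%s%s]' % (c.lower(), c.upper()) for c in abbr) + r'\w+'

# loose_pattern('USNS') -> '\\s[uU].*?[sS].*?[nN].*?[sS]\\w+'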
Play around.

How can I get the full match on Python re module without including keyword simultaneously?

In the following example:
"noun 1 left and right sides 左右摇摆 zuǒ-yòu yáobǎi vacillating; unsteady; hesitant 主席台左右, 红旗迎风飘扬。 Zhǔxítái zuǒyòu, hóngqí yíngfēng piāoyáng. Red flags are fluttering on both sides of the rostrum. 2 [after a numeral] about; or so 八点钟左右 bā diǎn zhōng zuǒyòu around eight o'clock 一个月左右 yī ge yuè zuǒyòu a month or so 身高一米七左右 Shēngāo yī mǐ qī zuǒyòu be about 1.70 metres in height 价值十元左右。 Jiàzhí shí yuán zuǒyòu. It's worth about 10 yuan. 3 those in close attendance; retinue 屏退左右 Píng tuì zuǒyòu order one's attendants to clear out verb master; control; influence 左右局势 zuǒyòu júshì be master of the situation; in control 为人所左右 wéi rén suǒ zuǒyòu controlled by another; fall under another’s influence 他这个人不是别人能左右得了的。 Tā zhège rén bù shì biéren néng zuǒyòu déle de. He is not a man to be influenced by others. adverb dialect anyway; anyhow; in any case 左右闲没事, 我就陪你走一趟吧。 Zuǒyòu xiánzhe méishì, wǒ jiù péi nǐ zǒu yī tàng ba. Ānyway I’m free now so let me go with you."
I would like to get the string separated based on noun, adjective, adverb, etc., and also based on the number, if there are multiple senses.
So the final result should be:
noun
["left and right sides", "左右摇摆 zuǒ-yòu yáobǎi vacillating; unsteady; hesitant 主席台左右, 红旗迎风飘扬。 Zhǔxítái zuǒyòu, hóngqí yíngfēng piāoyáng. Red flags are fluttering on both sides of the rostrum."]
["[after a numeral] about; or so", "八点钟左右 bā diǎn zhōng zuǒyòu around eight o'clock 一个月左右 yī ge yuè zuǒyòu a month or so 身高一米七左右 Shēngāo yī mǐ qī zuǒyòu be about 1.70 metres in height 价值十元左右。 Jiàzhí shí yuán zuǒyòu. It's worth about 10 yuan."]
["those in close attendance; retinue", "屏退左右 Píng tuì zuǒyòu order one's attendants to clear out"]
verb
["master; control; influence", "左右局势 zuǒyòu júshì be master of the situation; in control 为人所左右 wéi rén suǒ zuǒyòu controlled by another; fall under another’s influence 他这个人不是别人能左右得了的。 Tā zhège rén bù shì biéren néng zuǒyòu déle de. He is not a man to be influenced by others."]
adverb
["dialect anyway; anyhow; in any case", "左右闲没事, 我就陪你走一趟吧。 Zuǒyòu xiánzhe méishì, wǒ jiù péi nǐ zǒu yī tàng ba. Ānyway I’m free now so let me go with you"]
The noun, verb, and adverb should be keys, while the value might be a dict. Since noun has three senses here, it should have three distinct results.
So the first step is to take the component from noun, adjective, adverb, verb, etc. and store it in some variables. But in this case, I fail to get the relevant result for the specific string. For example:
re.findall("(noun|verb|adverb|adjective)", s)
This returns ['noun', 'verb', 'adverb'] as it only focuses on the exact match.
So I added .+ to make it re.findall("(noun|verb|adverb|adjective).+", s) and get the words after noun, but then it captured all the text after noun, including everything after verb or adverb (and returned just ['noun']).
So I hit a wall. Is it possible to get the relevant part but also get the full result except for the keyword match?
This is not a job for a regular expression. What you are trying to match is too variable.
Write a proper grammar for the dictionary entry, as if it were a programming language, and then parse your data according to that grammar.
Like this:
Your language keywords are noun, verb, adverb.
Each introduces one unnumbered or several numbered definitions.
Numbering of numbered definitions increases monotonically, so other numbers appearing inside a definition should be treated as part of the definition and not start a new one.
As a sometime lexicographer I would also recommend that you should treat labels like dialect (which are generally drawn from a standard vocabulary) as optional keywords rather than as part of the definition.
You may use
(?s)(noun|verb|adverb|adjective)(.*?)(?=(?:noun|verb|adverb|adjective|$))
See the regex demo
Details
(?s) - an inline re.DOTALL equivalent
(noun|verb|adverb|adjective) - Group 1: a word noun, verb, adverb or adjective
(.*?) - Group 2: any 0+ chars as few as possible, up to (but excluding) the first occurrence of:
(?=(?:noun|verb|adverb|adjective|$)) - either noun, verb, adverb, adjective or end of string (as it is a positive lookahead, (?=...), the texts matched do not become part of a match).
In Python, use with re.findall:
re.findall(r'(?s)(noun|verb|adverb|adjective)(.*?)(?=(?:noun|verb|adverb|adjective|$))', s)
Probably the easiest thing will be to re.split the string on the part-of-speech pattern first: re.split('(noun|adjective|verb|adverb)', s). For the provided input, this includes an empty item at the start, and then the rest will alternate between part-of-speech labels and the bits in between, which you can then process further.
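A small sketch of that pairing step (s is the dictionary-entry string from the question; the entries name is illustrative):
import re

parts = re.split('(noun|adjective|verb|adverb)', s)
entries = {}
# parts[0] is any text before the first label; after that the list alternates
# between a part-of-speech label and the text that follows it
for label, body in zip(parts[1::2], parts[2::2]):
    entries.setdefault(label, []).append(body.strip())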

fixing words with spaces using a dictionary look up in python?

I have extracted a list of sentences from a document. I am pre-processing this list of sentences to make it more sensible. I am faced with the following problem:
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a lookup dictionary to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straightforward task in text preprocessing. I need some pointers on approaches to look for. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem using Google Ngram corpus:
http://jeremykun.com/2012/01/15/word-segmentation/
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
https://gist.github.com/miku/7279824
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']
Using data, even this can be reordered:
$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/
As for your problem, you could just concatenate all the string parts you have to get a single string and then run a segmentation algorithm on it.
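For example, one way that last step could look, using the third-party wordsegment package mentioned further down this page (a sketch, not the answer author's code):
from wordsegment import load, segment   # pip install wordsegment

broken = "more recen t ly the develop ment, wh ich is a po ten t"
load()
joined = "".join(broken.split())   # concatenate all fragments into one string
print(segment(joined))             # e.g. ['more', 'recently', 'the', 'development', ...]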
Your goal is to improve text, not necessarily to make it perfect; so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: Start with the first fragment and stick pieces to it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you'll make a mistake with cases like the me thod, so if you'll be using this a lot, you could look for something more sophisticated. However, it's probably good enough.
Mainly what you require is a large dictionary. If you'll be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out if a fragment is the start of a real word. The nltk provides a Trie implementation.
Since these kinds of spurious word breaks are inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it's broken up.
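A rough sketch of that greedy idea, assuming valid_words is a set of lower-cased dictionary words (plus words already seen in the document):
def greedy_join(fragments, valid_words):
    fixed = []
    current = fragments[0] if fragments else ""
    for frag in fragments[1:]:
        if (current + frag).lower() in valid_words:
            current += frag          # the glued result is a known word, so keep it
        else:
            fixed.append(current)    # spit out what we have and start over
            current = frag
    if current:
        fixed.append(current)
    return fixed

# greedy_join("more recen t ly".split(), {"more", "recent", "recently"})
# -> ['more', 'recently']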
--Solution 1:
Let's think of these chunks in your sentence as beads on an abacus, with each bead consisting of a partial string; the beads can be moved left or right to generate the permutations. The position of each fragment is fixed between its two adjacent fragments.
In the current case, the beads would be:
(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)
This solves 2 subproblems:
a) A bead is a single unit, so we do not care about permutations within the bead, i.e. permutations of "more" are not possible.
b) The order of the beads is constant, only the spacing between them changes. i.e. "more" will always be before "recen" and so on.
Now, generate all the permutations of these beads, which will give output like:
morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent
Then score these permutations based on how many words from your relevant dictionary they contain; the most correct results can easily be filtered out.
more recently the development, which is a potent will score higher than morerecentlythedevelop ment, wh ich is a po ten t
Code which does the permutation part of the beads:
import re

def gen_abacus_perms(frags):
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]
    prefix_1 = "{0}{1}".format(frags[0], frags[1])
    prefix_2 = "{0} {1}".format(frags[0], frags[1])
    if len(frags) == 2:
        nres = [prefix_1, prefix_2]
        return nres
    rem_perms = gen_abacus_perms(frags[2:])
    res = ["{0}{1}".format(prefix_1, x) for x in rem_perms] + ["{0} {1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0}{1}".format(prefix_2, x) for x in rem_perms] + ["{0} {1}".format(prefix_2, x) for x in rem_perms]
    return res

broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split(r"\s+", broken)
perms = gen_abacus_perms(frags)
print("\n".join(perms))
demo: http://ideone.com/pt4PSt
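The scoring step described above is not shown in the answer; a minimal sketch, assuming dictionary is a set of lower-cased known words:
import re

def score(candidate, dictionary):
    # count how many whitespace-separated tokens are known dictionary words
    tokens = re.split(r"\s+", candidate.strip())
    return sum(1 for t in tokens if t.strip(",.").lower() in dictionary)

# best = max(perms, key=lambda p: score(p, dictionary))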
--Solution#2:
I would suggest an alternative approach which makes use of the text-analysis intelligence already developed by folks working on similar problems over big corpora of data, relying on dictionaries and grammar, e.g. search engines.
I am not well aware of such public/paid APIs, so my example is based on Google results.
Let's try to use Google:
You can keep putting your invalid terms to Google, for multiple passes, and keep evaluating the results for some score based on your lookup dictionary.
Here are two relevant outputs from using 2 passes of your text:
This output is used for a second pass:
This gives you the conversion "more recently the development, which is a potent".
To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid / not so good results.
One raw technique could be using a comparison of normalized strings using difflib.
>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
I would recommend stripping away the spaces and looking for dictionary words to break it down into. There are a few things you can do to make it more accurate. To get the first word in a text with no spaces, try taking the entire string and going through dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), the longest ones first, then taking letters off the end of the string you want to segment. If you want it to work on a big string, you can make it automatically take letters off the back so that the string you are looking for the first word in is only as long as the longest dictionary word. This should result in finding the longest words, and makes it less likely to do something like classify "asynchronous" as "a synchronous". Here is an example that uses raw_input to take in the text to correct and a dictionary file called dictionary.txt:
dictionary = [line.strip() for line in open("dictionary.txt", 'r')] #loads a file with a list of words to break the string up into
words = raw_input("enter text to correct spaces on: ")
words = words.replace(" ", "") #removes the existing (possibly wrong) spaces
spaced = [] #this is the list of newly broken up words
parsing = True #this represents when the while loop can end
while parsing:
    if len(words) == 0: #checks if all of the text has been broken into words; if it has, end the while loop
        parsing = False
    iterating = True
    for iteration in range(45): #goes through each of the possible word lengths, starting from the biggest
        if iterating == False:
            break
        word = words[:45-iteration] #each iteration, the candidate has one letter removed from the back, starting with the longest possible number of letters, 45
        for line in dictionary:
            if line == word: #this finds if this is the word we are looking for
                spaced.append(word)
                words = words[len(word):] #takes the found word off the front of the remaining text
                iterating = False
                break
    if iterating and len(words) > 0: #nothing matched; move one character over so the loop cannot run forever
        spaced.append(words[0])
        words = words[1:]
print ' '.join(spaced) #prints the output
If you want it to be even more accurate, you could try using a natural language parsing program; there are several available for Python free online.
Here's something really basic:
chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print joined,
        del chunks[:]
# deal with left overs
if chunks:
    print ''.join(chunks)
I assume you have a set of valid words somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here's one way to do that:
def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words
You can iterate through a dictionary of words to find the best fit, adding fragments together when a match is not found.
def iterate(word, dictionary):
    #check whether the accumulated fragment is a dictionary word
    if word in dictionary:
        finished_sentence.append(word)
        return True
    return False

sentence = "more recen t ly the develop ment, wh ich is a po ten t "
finished_sentence = []
fragments = sentence.split()
i = 0
while i < len(fragments):
    word = fragments[i]
    added = iterate(word, dictionary)
    #if no match was found, keep gluing on the following fragments
    #until the combined string is a dictionary word (or we run out)
    while not added and i + 1 < len(fragments):
        i += 1
        word += fragments[i]
        added = iterate(word, dictionary)
    if not added:
        finished_sentence.append(word)
    i += 1
This should work. For the dictionary variable, download a txt file of every single English word, then open it in your program.
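One way to load such a word list into the dictionary variable (the file name is illustrative):
with open("words.txt") as f:
    dictionary = set(line.strip() for line in f)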
My index.py file looks like this:
from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))
My index.php file looks like this:
<html>
<head>
<title>py script</title>
</head>
<body>
<h1>Hey There!Python Working Successfully In A PHP Page.</h1>
<?php
$python = `python index.py`;
echo $python;
?>
</body>
</html>
Hope this will work

Discovering Poetic Form with NLTK and CMU Dict

Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...

def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
            #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))
    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word - '1' - when there is not - '0' - which is why the output looks like this: 1101111101, when it should be the stress pattern of iambic pentameter: 0101010101
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n")
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia you might be able to convince the CS faculty to lend you a post-grad for a couple of months to help you instead of some workshop requirement.
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or final-year undergrad should not have too much trouble with it (hint hint).
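A small sketch of that fuzzy-classification idea, using difflib's normalized ratio in place of a hand-rolled Levenshtein or Jaro-Winkler implementation (the known forms listed are only illustrative):
import difflib

KNOWN_FORMS = {
    'iambic pentameter': '0101010101',
    'iambic tetrameter': '01010101',
}

def classify(stress_line, threshold=0.8):
    # score the line's stress string against each known form and pick the best
    scores = {name: difflib.SequenceMatcher(None, stress_line, pattern).ratio()
              for name, pattern in KNOWN_FORMS.items()}
    name, best = max(scores.items(), key=lambda kv: kv[1])
    return name if best >= threshold else 'unknown'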
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
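A small sketch of that permutation idea (names are illustrative): for each word, list its candidate stresses, e.g. ['0', '1'] for a one-syllable word, and enumerate every combination for the line.
from itertools import product

def line_candidates(per_word_stresses):
    # per_word_stresses is a list like [['1', '0'], ['1'], ['0']]
    return [''.join(combo) for combo in product(*per_word_stresses)]

# line_candidates([['1', '0'], ['1'], ['0']]) -> ['110', '010']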
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
The code included in this question helped me, so I am posting what I came up with, building on that foundation. It is one way to extract the stress as a single string, correct for the cmudict bias with a 'fudging factor', and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print word
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                # print strip_letters(s), word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}

# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}

# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws", ws
        for ch in list(ws):
            #print "ch", ch
            if ch.isdigit():
                nm = nm + ch
    #print "ad to nm", nm, type(nm)
    return nm

# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)

"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}
"""

Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
  abstract: "...long abstract here..."
  highlights: [
    {
      concept: 'a word',
      start: 1,
      end: 10
    }
    {
      concept: 'cancer',
      start: 123,
      end: 135
    }
  ]
}
I am looping over each highlight, locating its start index in the abstract (the end doesn't really matter as I just need to get a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tokenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option, but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it. Or should I somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?
You are right, the NLTK tokenizer is really what you should be using in this situation, since it is robust enough to handle delimiting almost all sentences, including sentences ending with a "quotation." You can do something like this (paragraph from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
highlightIndices = [100,169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
    for index in highlightIndices:
        if 0 < index - subtractFromIndex < len(sentence):
            sentencesWithHighlights.append(sentence)
            break
    subtractFromIndex += len(sentence)
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).
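A rough Python equivalent of that pattern, reusing the paragraph and highlights variables from the first answer (a sketch, not the original poster's code):
import re

pattern = (r'(?:[.!?]|^)\s*([^.!?]*(?:%s)[^.!?]*?)(?=\s*[.!?])'
           % '|'.join(map(re.escape, highlights)))
sentences = re.findall(pattern, paragraph, flags=re.IGNORECASE)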
Another option (though it's tough to say how reliable it would be with variably defined text) would be to split the text into a list of sentences and test against them:
re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
