How to use regex to extract citation/reference context?

How to use regex to extract citation/reference context? - python

I'm working with quotations/citations/references. In particular, given a text, I would like to extract citations and references, and a context for each of them. In my project, context is defined as the string of at most 10 characters to the left or right of a quote/citation/reference.
This is my code:
# some toy text
text = 'Once upon a time a cat says «gross!». A long story you can check here (ref. 11). People witnessed the scene [...]'
quoting_pattern = '\([^\(]*\)|„[^„]*"|<<.*>>|«[^«]*»|“[^“]*”|‹[^‹]*›|"[^"]*"|›[^›]*‹|»[^»]*«'
context_pattern = ".{0,100}(?:" + quoting_pattern + ").{0,100}"
# get all quotations
quotations = re.findall(r'{}'.format(quoting_pattern), text, re.DOTALL)
# get all contexts
contexts = re.findall(r'{}'.format(context_pattern), text, re.DOTALL)
for i, q in enumerate(quotations):
print(q, contexts[i])
My expected result is this one:
"«gross!»", " cat says «gross!». A long s"
"(ref. 11)", "heck here (ref. 11)"
However, I got an error: IndexError: list index out of range.
Even if "«gross!»" and "(ref. 11)" are extracted in the 'quotations' variable and I'm able to extract the context for "«gross!»", I can't find any context for "(ref. 11)".
Why does this happen? How can I solve this issue?
Thanks in advance

Try re.finditer. Match objects have .start() and .end() methods you can use to get context:
import re
text = "Once upon a time a cat says «gross!». A long story you can check here (ref. 11). People witnessed the scene [...]"
pat = re.compile(r"«[^»]*»|\"[^\"]*\"|\([^)]*\)")
for m in pat.finditer(text):
ctx = text[max(m.start() - 10, 0) : min(m.end() + 10, len(text))]
print(m.group(0), ctx)
Prints:
«gross!» cat says «gross!». A long s
(ref. 11) heck here (ref. 11). People w

Related

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!

Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print " ".join(y[:y.index('1080p')-1])
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".

The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything:
.*(?=.\d{4}.*\d{3,}p)
Regex example (try the unit tests to see the regex in action)
Explanation:

\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub.See demo.
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)

Assuming, there is always a four-digit-year, or a four-digit-resolution notation within the movie's file name, a simple solution replaces the not-wanted parts as this:
"(?:\.|\d{4,4}.+$)"
by a blank, strip()'ing them afterwards ...
For example:
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
>>> Star Wars Episode 4 A New Hope
>>> Interstellar

Creating a regular expression for parsing IUPAC organic compound names

I am trying to create a parser in my free time that could parse out all the functional groups from the name of a organic compound. Side by side, I am also trying to make a display program which can read data from files to draw a visual representation of the compound on screen. Both are being done in python.
Right now the displayer is using a coordinate system to store the positions of atoms, which is why I am making the parser.
Here's the code so far:
import re
main_pattern = r"(.*)(meth|eth|prop|but|pent|hex|hept|oct|non|dec|isodec|dodec)-?([,?\d+,?]*)?-?(di|tri|tetra|penta)?(ane|ene|yne)(.*)"
prefix_patterns = [r"(?<!\d-\()(?<!-\()-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(methyl|ethyl|propyl|butyl|pentyl|hexyl|heptyl|octyl|nonyl|decyl)(?!\))",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(bromo|chloro|iodo|flouro)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(cyano)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(oxo|keto)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(alkoxy)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(hydroxy)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(formyl)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(carboxy)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(alkoxycabonyl)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(halocarbonyl)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(amino)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(carbamoyl)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(nitro)",
r"-?([,?\d+,?]*)?-(di|tri|tetra|penta)?(suplho)"]
branch_pattern = r"-?(\d+,?)*?-\((.*?)\)"
compound_name = r"1-methyl-2-pentyl-3,64,7-trihexyl-5-oxo-12,6,7-triketo-23-(siugvuis)-68-(asdlkhdrjnkln)-42-(3,4-dimethylpentyl)pent-5,2,7-triyne"
prefixes = list(prefix_patterns)
print compound_name
print '\n\n'
main=re.findall(main_pattern,compound_name)
print main
print '\n\n'
for x in prefix_patterns:
prefixes = re.findall(x,main[0][0])
print prefixes
branches = re.findall(branch_pattern,main[0][0])
print branches
In the example when the re matches the prefixes methyl in "1-methyl" , it also matches methyl from
-42-(3,4-dimethylpentyl). I looked up on negative lookahead/lookbehind. but couldn't get satisfying results.
Could you kindly point out the problem, and guide me to the answer?

Output surrounded by [' and ]' - How to stop?

I am pulling information down from an rss feed. Due to further analysis,, I don't particularly want to use the likes of beautiful soup or feedparser. The explanation is kind of out of scope for this question.
The output is generating the text covered in [' and ']. For example
Title:
['The Morning Download: Apple Stumbles but Mobile Soars']
Published:
['Tue, 28 Jan 2014 13:09:04 GMT']
Why is this output like this? How do I stop this?
try:
#This is the RSS Feed that is being scraped
page = 'http://finance.yahoo.com/rss/headline?s=aapl'
yahooFeed = opener.open(page).read()
try:
items = re.findall(r'<item>(.*?)</item>', yahooFeed)
for item in items:
# Prints the title
title = re.findall(r'<title>(.*?)</title>', item)
print "Title:"
print title
# Prints the Date / Time Published
print "Published:"
datetime = re.findall(r'<pubDate>(.*?)</pubDate>', item)
print datetime
print "\n"
except Exception, e:
print str(e)
I am grateful of any criticism, advise and best practice information.
I'm a Java / Perl programmer so still getting used to Python, so any great resources you know of, are greatly appreciated.

Use re.search instead of re.findall, re.findall always returns a list of all matches.
datetime = re.search(r'<pubDate>(.*?)</pubDate>', item).group(1)
Note that the difference between re.findall and re.search is that the former returns a list(Python's array data-structure) of all matches, while re.search will only return the first match found.
In case of a no match re.search returns None, so to handle that as well:
match = re.search(r'<pubDate>(.*?)</pubDate>', item)
if match is not None:
datetime = match.group(1)

Discovering Poetic Form with NLTK and CMU Dict

Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
d = cmudict.dict()
text = nltk.Text(tokens)
words = [w.lower() for w in text]
regexp = "[A-Za-z]+"
exp = re.compile(regexp)
def nsyl(word):
lowercase = word.lower()
if lowercase not in d:
return 0
else:
first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
second = ''.join(first)
third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
return third
#return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])
sum1 = 0
for a in words:
if exp.match(a):
print a,nsyl(a),
sum1 = sum1 + len(str(nsyl(a)))
print "\nTotal syllables:",sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word - '1' - when there is not - '0 - . Which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.

Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n");
and can remain in the main function but the second step would be better placed in its own function and the third step would require to be split up itself, and would probably be better tackled with an object oriented approach. If you're in academy you might be able to convince the CS faculty to lend you a post-grad for a couple of months and help you instead of some workshop requirement.
Now to your other questions:
Not loosing line breaks: as #ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last year undergrad should not have too much trouble with it ( hint hint ).
I think those were all your questions. Hope this helps a bit.

To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick ones that look like a known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permutate all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.

This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
And the code included in this question helped me, so I post what I came up with that builds on that foundation. It is one way to extract the stress as a single string, correct with a 'fudging factor' for the cmudict bias, and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()
#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
stress=""
stress_no_punct=""
print line
tokens = [words.lower() for words in nltk.word_tokenize(line)]
for word in tokens:
word_punct = strip_punctuation_stressed(word.lower())
word = word_punct['word']
punct = word_punct['punct']
#print word
if word not in prondict:
# if word is not in dictionary
# add it to the string that includes punctuation
stress= stress+"*"+word+"*"
else:
zero_bool=True
for s in prondict[word]:
# oppose the cmudict bias toward 1
# search for a zero in array returned from prondict
# if it exists use it
# print strip_letters(s),word
if strip_letters(s)=="0":
stress = stress + "0"
stress_no_punct = stress_no_punct + "0"
zero_bool=False
break
if zero_bool:
stress = stress + strip_letters(prondict[word][0])
stress_no_punct=stress_no_punct + strip_letters(prondict[word][0])
if len(punct)>0:
stress= stress+"*"+punct+"*"
return {'stress':stress,'stress_no_punct':stress_no_punct}
# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
# define punctuations
punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
my_str = word
# remove punctuations from the string
no_punct = ""
punct=""
for char in my_str:
if char not in punctuations:
no_punct = no_punct + char
else:
punct = punct+char
return {'word':no_punct,'punct':punct}
# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
#print "strip_letters"
nm = ''
for ws in ls:
#print "ws",ws
for ch in list(ws):
#print "ch",ch
if ch.isdigit():
nm=nm+ch
#print "ad to nm",nm, type(nm)
return nm
# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}

How to save a regular expression user input value (Python)

I am making a simple chat bot in Python. It has a text file with regular expressions which help to generate the output. The user input and the bot output are separated by a | character.
my name is (?P<'name'>\w*) | Hi {'name'}!
This works fine for single sets of input and output responses, however I would like the bot to be able to store the regex values the user inputs and then use them again (i.e. give the bot a 'memory'). For example, I would like to have the bot store the value input for 'name', so that I can have this in the rules:
my name is (?P<'word'>\w*) | You said your name is {'name'} already!
my name is (?P<'name'>\w*) | Hi {'name'}!
Having no value for 'name' yet, the bot will first output 'Hi steve', and once the bot does have this value, the 'word' rule will apply. I'm not sure if this is easily feasible given the way I have structured my program. I have made it so that the text file is made into a dictionary with the key and value separated by the | character, when the user inputs some text, the program compares whether the user input matches the input stored in the dictionary, and prints out the corresponding bot response (there is also an 'else' case if no match is found).
I must need something to happen at the comparing part of the process so that the user's regular expression text is saved and then substituted back into the dictionary somehow. All of my regular expressions have different names associated with them (there are no two instances of 'word', for example...there is 'word', 'word2', etc), I did this as I thought it would make this part of the process easier. I may have structured the thing completely wrong to do this task though.
Edit: code
import re
io = {}
with open("rules.txt") as brain:
for line in brain:
key, value = line.split('|')
io[key] = value
string = str(raw_input('> ')).lower()+' word'
x = 1
while x == 1:
for regex, output in io.items():
match = re.match(regex, string)
if match:
print(output.format(**match.groupdict()))
string = str(raw_input('> ')).lower()+' word'
else:
print ' Sorry?'
string = str(raw_input('> ')).lower()+' word'

I had some difficulty to understand the principle of your algorithm because I'm not used to employ the named groups.
The following code is the way I would solve your problem, I hope it will give you some ideas.
I think that having only one dictionary isn't a good principle, it increases the complexity of reasoning and of the algorithm. So I based the code on two dictionaries: direg and memory
Theses two dictionaries have keys that are indexes of groups, not all the indexes, only some particular ones, the indexes of the groups being the last in each individual patterns.
Because, for the fun, I decided that the regexes must be able to have several groups.
What I call individual patterns in my code are the following strings:
"[mM]y name [Ii][sS] (\w*)"
"[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
"[Ii] [wW][aA][nN][tT] to ([ \w]*)"
You see that the second individual pattern has 2 capturing groups: consequently there are 3 individual patterns, but a total of 4 groups in all the individual groups.
So the creation of the dictionaries needs some additional care to take account of the fact that the index of the last matching group ( which I use with help of the attribute of name lastindex of a regex MatchObject ) may not correspond to the numbering of individual regexes present in the regex pattern: it's harder to explain than to understand. That's the reason why I count in the function distr() the occurences of strings {0} {1} {2} {3} {4} etc whose number MUST be the same as the number of groups defined in the corresponding individual pattern.
I found the suggestion of Laurence D'Oliveiro to use '||' instead of '|' as separator interesting.
My code simulates a session in which several inputs are done:
import re
regi = ("[mM]y name [Ii][sS] (\w*)"
"||Hi {0}!"
"||You said that your name was {0} !!!",
"[Ii]n repertory (\w*) I [wW][aA][nN][tT] file (\w*)"
"||OK here's your file {0}\\{1} :"
"||I already gave you the file {0}\\{1} !",
"[Ii] [wW][aA][nN][tT] to ([ \w]*)"
"||OK, I will do {0}"
"||You already did {0}. Do yo really want again ?")
direg = {}
memory = {}
def distr(regi,cnt = 0,di = direg,mem = memory,
regnb = re.compile('{\d+}')):
for i,el in enumerate(regi,start=1):
sp = el.split('||')
cnt += len(regnb.findall(sp[1]))
di[cnt] = sp[1]
mem[cnt] = sp[2]
yield sp[0]
regx = re.compile('|'.join(distr(regi)))
print 'direg :\n',direg
print
print 'memory :\n',memory
for inp in ('I say that my name is Armano the 1st',
'In repertory ONE I want file SPACE',
'I want to record music',
'In repertory ONE I want file SPACE',
'I say that my name is Armstrong',
'But my name IS Armstrong now !!!',
'In repertory TWO I want file EARTH',
'Now my name is Helena'):
print '\ninput ==',inp
mat = regx.search(inp)
if direg[mat.lastindex]:
print 'output ==',direg[mat.lastindex]\
.format(*(d for d in mat.groups() if d))
direg[mat.lastindex] = None
memory[mat.lastindex] = memory[mat.lastindex]\
.format(*(d for d in mat.groups() if d))
else:
print 'output ==',memory[mat.lastindex]\
.format(*(d for d in mat.groups() if d))
if not memory[mat.lastindex].startswith('Sorry'):
memory[mat.lastindex] = 'Sorry, ' \
+ memory[mat.lastindex][0].lower()\
+ memory[mat.lastindex][1:]
result
direg :
{1: 'Hi {0}!', 3: "OK here's your file {0}\\{1} :", 4: 'OK, I will do {0}'}
memory :
{1: 'You said that your name was {0} !!!', 3: 'I already gave you the file {0}\\{1} !', 4: 'You already did {0}. Do yo really want again ?'}
input == I say that my name is Armano the 1st
output == Hi Armano!
input == In repertory ONE I want file SPACE
output == OK here's your file ONE\SPACE :
input == I want to record music
output == OK, I will do record music
input == In repertory ONE I want file SPACE
output == I already gave you the file ONE\SPACE !
input == I say that my name is Armstrong
output == You said that your name was Armano !!!
input == But my name IS Armstrong now !!!
output == Sorry, you said that your name was Armano !!!
input == In repertory TWO I want file EARTH
output == Sorry, i already gave you the file ONE\SPACE !
input == Now my name is Helena
output == Sorry, you said that your name was Armano !!!

OK, let me see if I understand this:
You want to a dictionary of key-value pairs. This will be the “memory” of the chatbot.
You want to apply regular-expression rules to user input. But which rules might apply is conditional on which keys are already present in the memory dictionary: if “name” is not yet defined, then the rule that defines “name” applies; but if it is, then the rule that mentions “word” applies.
Seems to me you need more information attached to your rules. For example, the “word” rule you gave above shouldn’t actually add “word” to the dictionary, otherwise it would only apply once (imagine if the user keeps trying to say “my name is x” more than twice).
Does that give you a bit more idea about how to proceed?
Oh, by the way, I think “|” is a poor choice for a separator character, because it can occur in regular expressions. Not sure what to suggest: how about “||”?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.