Regular expression for extracting fields from wiki template markup - python

I would like to use Python to extract content formatted in MediaWiki markup following a particular string. For example, the 2012 U.S. presidential election article, contains fields called "nominee1" and "nominee2". Toy example:
In [1]: markup = get_wikipedia_markup('United States presidential election, 2012')
In [2]: markup
Out[2]:
u"{{
| nominee1 = '''[[Barack Obama]]'''\n
| party1 = Democratic Party (United States)\n
| home_state1 = [[Illinois]]\n
| running_mate1 = '''[[Joe Biden]]'''\n
| nominee2 = [[Mitt Romney]]\n
| party2 = Republican Party (United States)\n
| home_state2 = [[Massachusetts]]\n
| running_mate2 = [[Paul Ryan]]\n
}}"
Using the election article above as an example, I would like to extract the information immediately following the "nomineeN" field but that exists before the invocation of the next field (demarcated by a pip "|"). Thus, given the example above, I would ideally like to extract "Barack Obama" and "Mitt Romney" -- or at least the syntax in which they're embedded ('''[[Barack Obama]]''' and [[Mitt Romney]]). Other regex has extracted links from the wikimarkup, but my (failed) attempts of using a positive lookbehind assertion have been something of the flavor of:
nominees = re.findall(r'(?<=\|nominee\d\=)\S+',markup)
My thinking is that it should find strings like "|nominee1=" and "|nominee2=" with some whitespace possible between "|", "nominee", "=" and then return the content following it like "Barack Obama" and "Mitt Romney".

Use mwparserfromhell! It condenses your code and is more reassuring for capturing the result. For usage with this example:
import mwparserfromhell as mw
text = get_wikipedia_markup('United States presidential election, 2012')
code = mw.parse(text)
templates = code.filter_templates()
for template in templates:
if template.name == 'Infobox election':
nominee1 = template.get('nominee1').value
nominee2 = template.get('nominee2').value
print nominee1
print nominee2
Very simple thing to do to capture the result.

Lookbehinds aren't necessary here—it's much easier to use matching groups to specify exactly what should be extracted from the string. (In fact, lookbehinds can't work here with Python's regular expression engine, since the optional spaces make the expression variable-width.)
Try this regex:
\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?
Results:
re.findall(r"\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?", markup)
# => ['Barack Obama', 'Mitt Romney']

For infobox data like this, it's best to use DBpedia. They've done all the extraction work for you :)
http://wiki.dbpedia.org/Downloads38
See the "Ontology Infobox Properties " file. You don't have to be an ontologies expert here. Just use simple tsv parser to find the info you need!

First of all, you're missing a space after nominee\d. You probably want nominee\d\s*\=. In addition, you really don't want to be parsing markup with regex. Try using one of the suggestions here instead.
If you must do it with regex, why not a slightly more readable multi line solution?
import re
markup_string = """{{
| nominee1 = '''[[Barack Obama]]'''
| party1 = Democratic Party (United States)
| home_state1 = [[Illinois]]
| running_mate1 = '''[[Joe Biden]]'''
| nominee2 = [[Mitt Romney]]
| party2 = Republican Party (United States)
| home_state2 = [[Massachusetts]]
| running_mate2 = [[Paul Ryan]]<br>
}}"""
for match in re.finditer(r'(nominee\d\s*\=)[^|]*', markup_string, re.S):
end_nominee, end_line = match.end(1), match.end(0)
print end_nominee, end_line
print markup_string[end_nominee:end_line]

Related

Search strings with python

I would like to know how can I search specific strings with python. Actually I opened a markdown file which contain a sheet like below:
| --------- | -------- | --------- |
|**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. |
|**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|
|**unscrewed**| - | -Slowly and very carefully he unscrewed the ink bottle, dipped his quill into it, and began to write,|
|**downtrodden**| - | -For years, Aunt Petunia and Uncle Vernon had hoped that if they kept Harry as downtrodden as possible, they would be able to squash the magic out of him.|
|**sheets,**| - | -As long as he didn’t leave spots of ink on the sheets, the Dursleys need never know that he was studying magic by night.|
|**flinch**| - | -But he hoped she’d be back soon — she was the only living creature in this house who didn’t flinch at the sight of him.|
And I have to get the strings from each lines which decorates with |** **|, like:
propped
Pointless
unscrewed
downtrodden
sheets
flinch
I tried to use the regular expression but failed to extract it.
import re
y = '(?<=\|\*{2}).+?(?=,{0,1}\*{2}\|)'
reg = re.compile(y)
a = '| --------- | -------- | --------- | |**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. | |**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|'
reg.findall(a)
Regex(y) above explained:
(?<=\|\*{2}) - Matches if the current position in the string is preceded by a match for \|\*{2} i.e. |**
.+? - Will try to find anything(except for new line) repeated 1 or more times. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
(?=,{0,1}\*{2}\|) - ?= matches any string preceding the regex mentioned. In this case I have mentioned ,{0,1}\*{2}\|, which means zero or one , and 2 * and ending |.
Try using the following regex :
(?<=\|)(?!\s).*?(?!\s)(?=\|)
see demo / explanation
If the asterisks are in the text you are searching and you do not want the comma after sheets. The pattern would be a pipe followed by two asterisks then anything that follows that is not an asterisk or a comma.
\|\*{2}([^*,]+)
If you can live with the comma or if there might be commas you want to catch
\|\*{2}([^*]+)
Use either pattern with re.findall or re.finditer to capture the text you want.
If using the second pattern, you would need to run through the groups and strip any unwanted commas.
I have wrote below program to achieve the required output. I created a file string_test where all raw strings I copied:
a=re.compile("^\|\*\*([^*,]+)")
with open("string_test","r") as file1:
for i in file1.readlines():
match=a.search(i)
if match:
print match.group(1)

How to extract text before a specific keyword in python?

import re
col4="""May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
b=re.findall(r'\sCiteSeerX',col4)
print b
I have to print "May god bless our families studied". I'm using pythton regular expressions to extract the file name but i'm only getting CiteSeerX as output.I'm doing this on a very large dataset so i only want to use regular expression if there is any other efficient and faster way please point out.
Also I want the last year 2004 as a output.
I'm new to regular expressions and I now that my above implementation is wrong but I can't find a correct one. This is a very naive question. I'm sorry and Thank you in advance.
Here is an answer that doesn't use regex.
>>> s = "now is the time for all good men"
>>> s.find("all")
20
>>> s[:20]
'now is the time for '
>>>
If the structure of all your data is similar to the sample you provided, this should get you going:
import re
data = re.findall("(.*?) CiteSeerX.*(\d{4})$", col4)
if data:
# we have a match extract the first capturing group
title, year = data[0]
print(title, year)
else:
print("Unable to parse the string")
# Output: May god bless our families studied. 2004
This snippet extracts everything before CiteSeerX as the title and the last 4 digits as the year (again, assuming that the structure is similar for all the data you have). The brackets mark the capturing groups for the parts that we are interested in.
Update:
For the case, where there is metadata following the year of publishing, use the following regular expression:
import re
YEAR = "\d{4}"
DATE = "\d\d\d\d-\d\d-\d\d"
def parse_citation(s):
regex = "(.*?) CiteSeerX\s+{date} {date} ({year}).*$".format(date=DATE, year=YEAR)
data = re.findall(regex, s)
if data:
# we have a match extract the first group
return data[0]
else:
return None
c1 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004"""
c2 = """May god bless our families studied. CiteSeerX 2009-05-24 2007-11-19 2004 application/pdf text http //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.1483 http //www.biomedcentral.com/content/pdf/1471-2350-5-20.pdf en Metadata may be used without restrictions as long as the oai identifier remains attached to it."""
print(parse_citation(c1))
print(parse_citation(c2))
# Output:
# ('May god bless our families studied.', '2004')
# ('May god bless our families studied.', '2004')

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print " ".join(y[:y.index('1080p')-1])
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything:
.*(?=.\d{4}.*\d{3,}p)
Regex example (try the unit tests to see the regex in action)
Explanation:
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub.See demo.
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
Assuming, there is always a four-digit-year, or a four-digit-resolution notation within the movie's file name, a simple solution replaces the not-wanted parts as this:
"(?:\.|\d{4,4}.+$)"
by a blank, strip()'ing them afterwards ...
For example:
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
>>> Star Wars Episode 4 A New Hope
>>> Interstellar

How can I substitute an expression {{ text }} with re.sub() when 'text' may include further {{ text }} blocks?

I'm trying to parse raw wikipedia article content, e.g. the article on Sweden, using re.sub(). However, I am running into problems trying to substitute blocks of {{some text}}, because they can contain further blocks of {{some text}}.
Here's an abbreviated example from the above article:
{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}
The curly braces within curly braces recursion could theoretically be arbitrarily nested to any number of levels.
If I match the greedy block of {{.+}}, everything is matched from {{Infobox to eo}}, including the text I do not want matched.
If I match the ungreedy block of {{.+}}, the part from {{Infobox to icon=no}} is matched, as is {{Link GA|eo}}. But then I'm left with the string | common_name [...] not want parsed.
I also tried variants of \{\{.+(\{\{.+\}\})*.+\}\} and \{\{[^\{]+(\{\{[^\{]+\}\})*[^\{]+\}\}, in the hopes of matching only sub-blocks within the larger block, but to no avail.
I'd list all of what I've tried, but I honestly can't remember half and I doubt it'd be of much use anyway. It always comes back to the same problem: that for the double curly end braces }} to match, there needs to have been the same number of {{ occurrences beforehand.
Is this even solvable using regular expressions, or do I need another solution?
Have you considered mwparserfromhell?
import mwparserfromhell
s = """{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}"""
wikicode = mwparserfromhell.parse(s)
print wikicode.filter_templates()[0]
Prints:
{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}

My regex is not working properly

My regex is not working properly. I'm showing you before regex text and after regex text. I'm using this regex re.search(r'(?ms).*?{{(Infobox film.*?)}}', text). You will see my regex not displaying the result after | country = Assam, {{IND . My regex stuck at this point. Will you please help me ? thanks
Before regex:
{{Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND}}
| language = [[Assamese language|Assamese]]
| budget =
| followed by = free
}}
After regex:
{Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND
Why regex stuck at this point? country = Assam, {{IND
Edit : Expecting Result
Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND}}
| language = [[Assamese language|Assamese]]
| budget =
| followed by = free
Your regex is catching everything between the first {{ and the first }}, which is in the "country" entry of the infobox. If you want everything between the first {{ and the last }}, then you want to make the .* inside the braces greedy by removing the ?:
re.search(r'(?ms).*?{{(Infobox film.*)}}', text)
Note that this will find the last }} in the input (eg. if there's another template far below the end of the infobox, it will find the end of that), so this may not be what you want. When you have nesting things like this, regex is not always the best way to search.

Categories