remove content between tags in python using regex - python

I was trying to clean up wikitext. Specifically I was trying to remove all the {{.....}} and <..>...</..> in the wikitext. For example, for this wikitext:
"{{Infobox UK place\n|country = England\n|official_name =
Morcombelake\n|static_image_name = Morecombelake from Golden Cap -
geograph.org.uk - 1184424.jpg\n|static_image_caption = Morcombelake as
seen from Golden Cap\n|coordinates =
{{coord|50.74361|-2.85153|display=inline,title}}\n|map_type =
Dorset\n|population = \n|population_ref = \n|shire_district = [[West
Dorset]]\n|shire_county = [[Dorset]]\n|region = South West
England\n|constituency_westminster = West Dorset\n|post_town =
\n|postcode_district = \n|postcode_area = DT\n|os_grid_reference =
SY405938\n|website = \n}}\n'''Morcombelake''' (also spelled
'''Morecombelake''') is a small village near [[Bridport]] in
[[Dorset]], [[England]], within the ancient parish of [[Whitchurch
Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World
Heritage Site, is nearby.{{cite
web|url=http://www.nationaltrust.org.uk/golden-cap/|title=Golden
Cap|publisher=National Trust|accessdate=2014-05-04}}\n\n==
References ==\n{{reflist}}\n\n{{West
Dorset}}\n\n\n{{Dorset-geo-stub}}\n[[Category:Villages in
Dorset]]\n\n== External Links
==\n\n*[http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html
Parish Church of St Gabriel]\n\n"
How can I use regular expressions in python to produce output like this:
\n'''Morcombelake''' (also spelled '''Morecombelake''') is a small
village near [[Bridport]] in [[Dorset]], [[England]], within the
ancient parish of [[Whitchurch Canonicorum]]. [[Golden Cap]], part of
the [[Jurassic Coast]] World Heritage Site, is nearby.\n\n==
References ==\n\n\n\n\n\n\n[[Category:Villages in Dorset]]\n\n==
External Links
==\n\n*[http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html
Parish Church of St Gabriel]\n\n

As the tags are nested into each other, you can find and remove them in a loop:
n = 1
while n > 0:
s, n = re.subn('{{(?!{)(?:(?!{{).)*?}}|<[^<]*?>', '', s, flags=re.DOTALL)
s is a string containing the wikitext.
There are no the <...> tags in your example, but they should be removed as well.

Related

How do I combine elements from two loops without issues?

When I execute this code...
from bs4 import BeautifulSoup
with open("games.html", "r") as page:
doc = BeautifulSoup(page, "html.parser")
titles = doc.select("a.title")
prices = doc.select("span.price-inner")
for game_soup in doc.find_all("div", {"class": "game-options-wrapper"}):
game_ids = (game_soup.button.get("data-game-id"))
for title, price_official, price_lowest in zip(titles, prices[::2], prices[1::2]):
print(title.text + ',' + str(price_official.text.replace('$', '').replace('~', '')) + ',' + str(
price_lowest.text.replace('$', '').replace('~', '')))
The output is...
153356
80011
130187
119003
73502
156474
96592
154207
155123
152790
165013
110837
Call of Duty: Modern Warfare II (2022),69.99,77.05
Red Dead Redemption 2,14.85,13.79
God of War,28.12,22.03
ELDEN RING,50.36,48.10
Cyberpunk 2077,29.99,28.63
EA SPORTS FIFA 23,41.99,39.04
Warhammer 40,000: Darktide,39.99,45.86
Marvels Spider-Man Remastered,30.71,27.07
Persona 5 Royal,37.79,43.32
The Callisto Protocol,59.99,69.41
Need for Speed Unbound,69.99,42.29
Days Gone,15.00,9.01
But I'm trying to get the value next to the other ones on the same line
Expected output:
Call of Duty: Modern Warfare II (2022),69.99,77.05,153356
Red Dead Redemption 2,14.85,13.79,80011
...
Even when adding game_ids to print(), it spams the same game id for each line.
How can I go about resolving this issue?
HTML file: https://jsfiddle.net/m3hqy54x/
I feel like all 3 details (title, price_official, price_lowest) are probably all in a shared container. It would be better to loop through these containers and select the details as sets from each container to make sure the wight prices and titles are being paired up, but I can't tell you how to do that without seeing at least a snippet from (or all of) "games.html"....
Anyway, assuming that '110837\nCall of Duty: Modern Warfare II (2022)' is from the first title here, you can rewrite your last loop as something like:
for z in zip(titles, prices[::2], prices[1::2]):
z, lw = list(z), ''
for i in len(z):
if i == 0: # title
z[0] = ' '.join(w for w in z[0].text.split('\n', 1)[-1] if w)
if '\n' in z[0].text: lw = z[0].text.split('\n', 1)[0]
continue
z[i] = z[i].text.replace('$', '').replace('~', '')
print(','.join(z+[lw]))
Added EDIT: After seeing the html, this is my suggested solution:
for g in doc.select('div[data-container-game-id]'):
gameId = g.get('data-container-game-id')
title = g.select_one('a.title')
if title: title = ' '.join(w for w in title.text.split() if w)
price_official = g.select_one('.price-wrap > div:first-child span.price')
price_lowest = g.select_one('.price-wrap > div:first-child+div span.price')
if price_official:
price_official = price_official.text.replace('$', '').replace('~', '')
if price_lowest:
price_lowest = price_lowest.text.replace('$', '').replace('~', '')
print(', '.join([title, price_official, price_lowest, gameId]))
prints
Call of Duty: Modern Warfare II (2022), 69.99, 77.05, 153356
Red Dead Redemption 2, 14.85, 13.79, 80011
God of War, 28.12, 22.03, 130187
ELDEN RING, 50.36, 48.10, 119003
Cyberpunk 2077, 29.99, 28.63, 73502
EA SPORTS FIFA 23, 41.99, 39.04, 156474
Warhammer 40,000: Darktide, 39.99, 45.86, 96592
Marvel's Spider-Man Remastered, 30.71, 27.07, 154207
Persona 5 Royal, 37.79, 43.32, 155123
The Callisto Protocol, 59.99, 69.41, 152790
Need for Speed Unbound, 69.99, 42.29, 165013
Days Gone, 15.00, 9.01, 110837
Btw, this might look ok for just four values, but if you have a large amount of details that you want to extract, you might want to consider using a function like this.

Error when creating dictionaries from text files

I've been working on a function which will update two dictionaries (similar authors, and awards they've won) from an open text file. The text file looks something like this:
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And so on. The first name is an authors name (last name first, first name last), followed by awards they may have won, and then authors who are similar to them. This is what I've got so far:
def load_author_dicts(text_file, similar_authors, awards_authors):
name_of_author = True
awards = False
similar = False
for line in text_file:
if name_of_author:
author = line.split(', ')
nameA = author[1].strip() + ' ' + author[0].strip()
name_of_author = False
awards = True
continue
if awards:
if ',' in line:
awards = False
similar = True
else:
if nameA in awards_authors:
listawards = awards_authors[nameA]
listawards.append(line.strip())
else:
listawards = []
listawards.append(line.strip()
awards_authors[nameA] = listawards
if similar:
if line == '\n':
similar = False
name_of_author = True
else:
sim_author = line.split(', ')
nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
if nameA in similar_authors:
similar_list = similar_authors[nameA]
similar_list.append(nameS)
else:
similar_list = []
similar_list.append(nameS)
similar_authors[nameA] = similar_list
continue
This works great! However, if the text file contains an entry with just a name (i.e. no awards, and no similar authors), it screws the whole thing up, generating an IndexError: list index out of range at this part Zname = sim_author[1].strip()+" "+sim_author[0].strip() )
How can I fix this? Maybe with a 'try, except function' in that area?
Also, I wouldn't mind getting rid of those continue functions, I wasn't sure how else to keep it going. I'm still pretty new to this, so any help would be much appreciated! I keep trying stuff and it changes another section I didn't want changed, so I figured I'd ask the experts.
How about doing it this way, just to get the data in, then manipulate the dictionary any ways you want.
test.txt contains your data
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And my code to parse it.
award_parse.py
data = {}
name = ""
awards = []
f = open("test.txt")
for l in f:
# make sure the line is not blank don't process blank lines
if not l.strip() == "":
# if this is a name and we're not already working on an author then set the author
# otherwise treat this as a new author and set the existing author to a key in the dictionary
if "," in l and len(name) == 0:
name = l.strip()
elif "," in l and len(name) > 0:
# check to see if recipient is already in list, add to end of existing list if he/she already
# exists.
if not name.strip() in data:
data[name] = awards
else:
data[name].extend(awards)
name = l.strip()
awards = []
# process any lines that are not blank, and do not have a ,
else:
awards.append(l.strip())
f.close()
for k, v in data.items():
print("%s got the following awards: %s" % (k,v))

Python 3: XML Tag Value not being written to csv file

My python 3 script takes an xml file and creates a csv file.
Small excerpt of xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<metadata>
<dc>
<title>Golden days for boys and girls, 1895-03-16, v. XVI #17</title>
<subject>Children's literature--Children's periodicals</subject>
<description>Archives & Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries</description>
<publisher>James Elverson, 1880-</publisher>
<date>1895-06-15</date>
<type>Text | periodicals</type>
<format>image/jp2</format>
<handle>http://hdl.handle.net/11134/20002:860074494</handle>
<accessionNumber/>
<barcode/>
<identifier>20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870 | hdl:  | http://hdl.handle.net/11134/20002:860074494</identifier>
<rights>These Materials are provided for educational and research purposes only. The University of Connecticut Libraries hold the copyright except where noted. Permission must be obtained in writing from the University of Connecticut Libraries and/or theowner(s) of the copyright to publish reproductions or quotations beyond "fair use." | The collection is open and available for research.</rights>
<creator/>
<relation/>
<coverage/>
<language/>
</dc>
</metadata>
Python3 code:
import csv
import xml.etree.ElementTree as ET
tree = ET.ElementTree(file='ctda_set1_uniqueTags.xml')
doc = ET.parse("ctda_set1_uniqueTags.xml")
root = tree.getroot()
oaidc_data = open('ctda_set1_uniqueTags.csv', 'w', encoding='utf-8')
titles = 'dc/title'
subjects = 'dc/subject'
csvwriter = csv.writer(oaidc_data)
oaidc_head = ['Title', 'Subject', 'Description', 'Publisher', 'Date', 'Type', 'Format', 'Handle', 'Accession Number', 'Barcode', 'Identifiers', 'Rights', 'Creator', 'Relation', 'Coverage', 'Language']
count = 0
for member in root.findall('dc'):
if count == 0:
csvwriter.writerow(oaidc_head)
count = count + 1
dcdata = []
titles = member.find('title').text
dcdata.append(titles)
subjects = member.find('subject').text
dcdata.append(subjects)
descriptions = member.find('description').text
dcdata.append(descriptions)
publishers = member.find('publisher').text
dcdata.append(publishers)
dates = member.find('date').text
dcdata.append(dates)
types = member.find('type').text
dcdata.append(types)
formats = member.find('format').text
dcdata.append(formats)
handle = member.find('handle').text
dcdata.append(handle)
accessionNo = member.find('accessionNumber').text
dcdata.append(accessionNo)
barcodes = member.find('barcode').text
dcdata.append(barcodes)
identifiers = member.find('identifier').text
dcdata.append(identifiers)
rt = member.find('rights').text
print(member.find('rights').text)
dcdata.append('rt')
ct = member.find('creator').text
dcdata.append('ct')
rt = member.find('relation').text
dcdata.append('rt')
ce = member.find('coverage').text
dcdata.append('ce')
lang = member.find('language').text
dcdata.append('lang')
csvwriter.writerow(dcdata)
oaidc_data.close()
Everything works as expected except for rt, ce, and lang. What happens is that in the csv, all the data is written with the comma delimiter. For rt, the value is always rt, for ce, ce, lang, lang, etc.
Here's a snippet of the output:
Title,Subject,Description,Publisher,Date,Type,Format,Handle,Accession Number,Barcode,Identifiers,Rights,Creator,Relation,Coverage,Language
"Golden days for boys and girls, 1895-03-16, v. XVI #17",Children's literature--Children's periodicals,"Archives & Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Libraries","James Elverson, 1880-",1895-06-15,Text | periodicals,image/jp2,hdl.handle.net/11134/20002:860074494,,,20002:860074494 | local: 868010272 | local: 997186613502432 | local: 39153019382870,**rt,ct,rt,ce,lang**
Some of the rights statements get very long - perhaps that's the issue. That's why I added the print(member.find('rights')) to see the output. The text is printed just fine. The text just isn't written to the csv. What I'd like is to have the value or text written for these xml tags. Any help would be appreciated.
Thanks.
Jennifer
In the line dcdata.append('rt') there is no need for the quotes. Try dcdata.append(rt). Similarly, there are unnecessary quotes in the ce and lang lines.

How can I do a non-greedy (backtracking) match with OneOrMore etc. in pyparsing?

I am trying to parse a partially standardized street address into it's components using pyparsing. I want to non-greedy match a street name that may be N tokens long.
For example:
444 PARK GARDEN LN
Should be parsed into:
number: 444
street: PARK GARDEN
suffix: LN
How would I do this with PyParsing? Here's my initial code:
from pyparsing import *
def main():
street_number = Word(nums).setResultsName('street_number')
street_suffix = oneOf("ST RD DR LN AVE WAY").setResultsName('street_suffix')
street_name = OneOrMore(Word(alphas)).setResultsName('street_name')
address = street_number + street_name + street_suffix
result = address.parseString("444 PARK GARDEN LN")
print result.dump()
if __name__ == '__main__':
main()
but when I try parsing it, the street suffix gets gobbled up by the default greedy parsing behavior.
Use the negation, ~, to check to see if the upcoming street_name is actually a street_suffix.
from pyparsing import *
street_number = Word(nums)('street_number')
street_suffix = oneOf("ST RD DR LN AVE WAY")('street_suffix')
street_name = OneOrMore(~street_suffix + Word(alphas))('street_name')
address = street_number + street_name + street_suffix
result = address.parseString("444 PARK GARDEN LN")
print result.dump()
In addition, you don't have to use setResultsName, you can simply use the syntax above. IMHO it leads to a much cleaner grammar definition.

Using regex to extract information from a string

This is a follow-up and complication to this question: Extracting contents of a string within parentheses.
In that question I had the following string --
"Will Farrell (Nick Hasley), Rebecca Hall (Samantha)"
And I wanted to get a list of tuples in the form of (actor, character) --
[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha')]
To generalize matters, I have a slightly more complicated string, and I need to extract the same information. The string I have is --
"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary),
with Stephen Root and Laura Dern (Delilah)"
I need to format this as follows:
[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
('Stephen Root',''), ('Lauren Dern', 'Delilah')]
I know I can replace the filler words (with, and, &, etc.), but can't quite figure out how to add a blank entry -- '' -- if there is no character name for the actor (in this case Stephen Root). What would be the best way to go about doing this?
Finally, I need to take into account if an actor has multiple roles, and build a tuple for each role the actor has. The final string I have is:
"Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"
And I need to build a list of tuples as follows:
[('Will Farrell', 'Nick Hasley'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'),
('Glenn Howerton', 'Brad'), ('Stephen Root',''), ('Lauren Dern', 'Delilah'), ('Lauren Dern', 'Stacy')]
Thank you.
import re
credits = """Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"""
# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"\s*(?:,(?![^()]*\))|\bwith\b|\band\b)\s*")
# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"([^(]*)(?:\(([^)]*)\))?")
# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")
characters = splitre.split(credits)
pairs = []
for character in characters:
if character:
match = matchre.match(character)
if match:
actor = match.group(1).strip()
if match.group(2):
parts = splitparts.split(match.group(2))
for part in parts:
pairs.append((actor, part))
else:
pairs.append((actor, ""))
print(pairs)
Output:
[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'),
('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''),
('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]
Tim Pietzcker's solution can be simplified to (note that patterns are modified too):
import re
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"""
# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*")
# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")
# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")
pairs = []
for character in splitre.split(credits):
gr = matchre.match(character).groups('')
for part in splitparts.split(gr[1]):
pairs.append((gr[0], part))
print(pairs)
Then:
import re
credits = """ Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with
Stephen Root and Laura Dern (Delilah, Stacy)"""
# split on commas (only if outside of parentheses), "with" or "and"
splitre = re.compile(r"(?:,(?![^()]*\))(?:\s*with)*|\bwith\b|\band\b)\s*")
# match the part before the parentheses (1) and what's inside the parens (2)
# (only if parentheses are present)
matchre = re.compile(r"\s*([^(]*)(?<! )\s*(?:\(([^)]*)\))?")
# split the parts inside the parentheses on commas
splitparts = re.compile(r"\s*,\s*")
gen = (matchre.match(character).groups('') for character in splitre.split(credits))
pp = [ (gr[0], part) for gr in gen for part in splitparts.split(gr[1])]
print pp
The trick is to use groups('') with an argument ''
What you want is identify sequences of words starting with a capital letter, plus some complications (IMHO you cannot assume each name is made of Name Surname, but also Name Surname Jr., or Name M. Surname, or other localized variation, Jean-Claude van Damme, Louis da Silva, etc.).
Now, this is likely to be overkill for the sample input you posted, but as I wrote above I assume things will soon get messy, so I would tackle this using nltk.
Here's a very crude and not very well tested snippet, but it should do the job:
import nltk
from nltk.chunk.regexp import RegexpParser
_patterns = [
(r'^[A-Z][a-zA-Z]*[A-Z]?[a-zA-Z]+.?$', 'NNP'), # proper nouns
(r'^[(]$', 'O'),
(r'[,]', 'COMMA'),
(r'^[)]$', 'C'),
(r'.+', 'NN') # nouns (default)
]
_grammar = """
NAME: {<NNP> <COMMA> <NNP>}
NAME: {<NNP>+}
ROLE: {<O> <NAME>+ <C>}
"""
text = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"
tagger = nltk.RegexpTagger(_patterns)
chunker = RegexpParser(_grammar)
text = text.replace('(', '( ').replace(')', ' )').replace(',', ' , ')
tokens = text.split()
tagged_text = tagger.tag(tokens)
tree = chunker.parse(tagged_text)
for n in tree:
if isinstance(n, nltk.tree.Tree) and n.node in ['ROLE', 'NAME']:
print n
# output is:
# (NAME Will/NNP Ferrell/NNP)
# (ROLE (/O (NAME Nick/NNP Halsey/NNP) )/C)
# (NAME Rebecca/NNP Hall/NNP)
# (ROLE (/O (NAME Samantha/NNP) )/C)
# (NAME Glenn/NNP Howerton/NNP)
# (ROLE (/O (NAME Gary/NNP ,/COMMA Brad/NNP) )/C)
# (NAME Stephen/NNP Root/NNP)
# (NAME Laura/NNP Dern/NNP)
# (ROLE (/O (NAME Delilah/NNP ,/COMMA Stacy/NNP) )/C)
You must then process the tagged output and put names and roles in a list instead of printing, but you get the picture.
What we do here is do a first pass where we tag each token according to the regex in _patterns, and then do a second pass to build more complex chunks according to your simple grammar. You can complicate the grammar and the patterns as you want, ie. catch variations of names, messy inputs, abbreviations, and so on.
I think doing this with a single regex pass is going to be a pain for non-trivial inputs.
Otherwise, Tim's solution is solving the issue nicely for the input you posted, and without the nltk dependency.
In case you want a non-regex solution ... (Assumes no nested parenthesis.)
in_string = "Will Ferrell (Nick Halsey), Rebecca Hall (Samantha), Glenn Howerton (Gary, Brad), with Stephen Root and Laura Dern (Delilah, Stacy)"
in_list = []
is_in_paren = False
item = {}
next_string = ''
index = 0
while index < len(in_string):
char = in_string[index]
if in_string[index:].startswith(' and') and not is_in_paren:
actor = next_string
if actor.startswith(' with '):
actor = actor[6:]
item['actor'] = actor
in_list.append(item)
item = {}
next_string = ''
index += 4
elif char == '(':
is_in_paren = True
item['actor'] = next_string
next_string = ''
elif char == ')':
is_in_paren = False
item['part'] = next_string
in_list.append(item)
item = {}
next_string = ''
elif char == ',':
if is_in_paren:
item['part'] = next_string
next_string = ''
in_list.append(item)
item = item.copy()
item.pop('part')
else:
next_string = "%s%s" % (next_string, char)
index += 1
out_list = []
for dict in in_list:
actor = dict.get('actor')
part = dict.get('part')
if part is None:
part = ''
out_list.append((actor.strip(), part.strip()))
print out_list
Output:
[('Will Ferrell', 'Nick Halsey'), ('Rebecca Hall', 'Samantha'), ('Glenn Howerton', 'Gary'), ('Glenn Howerton', 'Brad'), ('Stephen Root', ''), ('Laura Dern', 'Delilah'), ('Laura Dern', 'Stacy')]

Categories