regex + beautifulsoup - python

I've isolated a line of HTML obtained from BeautifulSoup that I want to run a regex on, but I keep getting AttributeError: 'NoneType' object has no attribute 'groups'
I read another Stack Overflow question (using regex on beautiful soup tags), but I can't see what I need to do to fix my version of this issue.
This is the relevant part of my code (url is provided).
With Rob's corrected regex applied, it still throws the same AttributeError:
soup = BeautifulSoup(urlopen(url).read()).find("div",{"id":"page"})
addy = soup.find("p","addy").em.encode_contents()
extracted_entities = re.match(r'\$([\d.]+)\. ([^,]+), ([\d-]+)', addy)
extracted_entities.groups()
price = extracted_entities[0]
location = extracted_entities[1]
phone = extracted_entities[2]
addy seems to be what I want, returning:
$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com
$9. 800 W. Randolph St., 312-929-4580, aucheval.tumblr.com
$9.50. 445 N. Clark St., 312-334-3688, rickbayless.com
and so on, when I print it.
What's going on here? Thanks in advance, all.

The issue seems to be a stray " in your RegEx pattern that I don't see in your example output.
match = re.match(r'\$([\d.]+)\. ([^,]+), ([\d-]+)', addy)
if match:
    extracted_entities = match.groups()
else:
    raise Exception("RegEx didn't match '%s'" % addy)
Should work:
>>> f = """$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com
... $9. 800 W. Randolph St., 312-929-4580, aucheval.tumblr.com
... $9.50. 445 N. Clark St., 312-334-3688, rickbayless.com"""
>>> l = f.splitlines()
>>> for i in l:
...     r = re.match(r'\$([\d.]+)\. ([^,]+), ([\d-]+)', i)
...     if r:
...         print "GOT IT", r.groups()
...     else:
...         print "NO GOT IT", i
...
GOT IT ('10', '2109 W. Chicago Ave.', '773-772-0406')
GOT IT ('9', '800 W. Randolph St.', '312-929-4580')
GOT IT ('9.50', '445 N. Clark St.', '312-334-3688')
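The AttributeError in the question appears whenever re.match finds nothing at the start of the string and returns None, so the result has to be checked before .groups() is called. Below is a minimal Python 3 sketch of that check. The helper name parse_listing and the named groups are illustrative, not from the original code. It also strips surrounding whitespace first, on the assumption that the scraped text may carry some. If addy is a bytes object (which encode_contents() returns under Python 3), it would need decoding before being passed in.

```python
import re

# Same pattern as above, with named groups so each field is self-describing.
LISTING = re.compile(r"\$(?P<price>[\d.]+)\. (?P<location>[^,]+), (?P<phone>[\d-]+)")


def parse_listing(text):
    """Return (price, location, phone), or None when the line does not match."""
    match = LISTING.match(text.strip())
    if match is None:
        # re.match returned None: report that instead of calling .groups() on it.
        return None
    return match.group("price", "location", "phone")


line = "$9.50. 445 N. Clark St., 312-334-3688, rickbayless.com"
print(parse_listing(line))
print(parse_listing("no price here"))
```

Returning None keeps the decision about a bad line with the caller, which can skip it or raise as the answer above does.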

Related

How do I combine elements from two loops without issues?

When I execute this code...
from bs4 import BeautifulSoup
with open("games.html", "r") as page:
    doc = BeautifulSoup(page, "html.parser")
    titles = doc.select("a.title")
    prices = doc.select("span.price-inner")
    for game_soup in doc.find_all("div", {"class": "game-options-wrapper"}):
        game_ids = (game_soup.button.get("data-game-id"))
    for title, price_official, price_lowest in zip(titles, prices[::2], prices[1::2]):
        print(title.text + ',' + str(price_official.text.replace('$', '').replace('~', '')) + ',' + str(
            price_lowest.text.replace('$', '').replace('~', '')))
The output is...
153356
80011
130187
119003
73502
156474
96592
154207
155123
152790
165013
110837
Call of Duty: Modern Warfare II (2022),69.99,77.05
Red Dead Redemption 2,14.85,13.79
God of War,28.12,22.03
ELDEN RING,50.36,48.10
Cyberpunk 2077,29.99,28.63
EA SPORTS FIFA 23,41.99,39.04
Warhammer 40,000: Darktide,39.99,45.86
Marvels Spider-Man Remastered,30.71,27.07
Persona 5 Royal,37.79,43.32
The Callisto Protocol,59.99,69.41
Need for Speed Unbound,69.99,42.29
Days Gone,15.00,9.01
But I'm trying to get each game id on the same line as its title and prices.
Expected output:
Call of Duty: Modern Warfare II (2022),69.99,77.05,153356
Red Dead Redemption 2,14.85,13.79,80011
...
Even when adding game_ids to print(), it spams the same game id for each line.
How can I go about resolving this issue?
HTML file: https://jsfiddle.net/m3hqy54x/
I suspect all three details (title, price_official, price_lowest) sit in a shared container. It would be better to loop through these containers and select the details as a set from each one, so that the right prices and titles are paired up, but I can't tell you how to do that without seeing at least a snippet of (or all of) "games.html".
Anyway, assuming that '110837\nCall of Duty: Modern Warfare II (2022)' is from the first title here, you can rewrite your last loop as something like:
for z in zip(titles, prices[::2], prices[1::2]):
    z, lw = list(z), ''
    for i in range(len(z)):
        if i == 0:  # title
            raw = z[0].text
            # text before the first newline (if any) is taken to be the game id
            if '\n' in raw: lw = raw.split('\n', 1)[0]
            z[0] = ' '.join(raw.split('\n', 1)[-1].split())
            continue
        z[i] = z[i].text.replace('$', '').replace('~', '')
    print(','.join(z + [lw]))
Added EDIT: After seeing the html, this is my suggested solution:
for g in doc.select('div[data-container-game-id]'):
    gameId = g.get('data-container-game-id')
    title = g.select_one('a.title')
    if title: title = ' '.join(w for w in title.text.split() if w)
    price_official = g.select_one('.price-wrap > div:first-child span.price')
    price_lowest = g.select_one('.price-wrap > div:first-child+div span.price')
    if price_official:
        price_official = price_official.text.replace('$', '').replace('~', '')
    if price_lowest:
        price_lowest = price_lowest.text.replace('$', '').replace('~', '')
    print(', '.join([title, price_official, price_lowest, gameId]))
prints
Call of Duty: Modern Warfare II (2022), 69.99, 77.05, 153356
Red Dead Redemption 2, 14.85, 13.79, 80011
God of War, 28.12, 22.03, 130187
ELDEN RING, 50.36, 48.10, 119003
Cyberpunk 2077, 29.99, 28.63, 73502
EA SPORTS FIFA 23, 41.99, 39.04, 156474
Warhammer 40,000: Darktide, 39.99, 45.86, 96592
Marvel's Spider-Man Remastered, 30.71, 27.07, 154207
Persona 5 Royal, 37.79, 43.32, 155123
The Callisto Protocol, 59.99, 69.41, 152790
Need for Speed Unbound, 69.99, 42.29, 165013
Days Gone, 15.00, 9.01, 110837
Btw, this might look ok for just four values, but if you have a large amount of details that you want to extract, you might want to consider using a function like this.
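As for why the original code repeated a single id: the first loop rebinds game_ids on every pass, so only the last id is left by the time the second loop prints. If you stay with the two-step approach, the ids need to be collected into a list and zipped with the other values. Here is a minimal sketch using plain lists as hypothetical stand-ins for the parsed tags. The names clean and build_rows are illustrative, and the sketch assumes every list is in the same order and has matching length, which is exactly the assumption the per-container loop above avoids.

```python
# Hypothetical stand-ins for the values pulled out of the parsed tags.
titles = ["Red Dead Redemption 2", "God of War"]
prices = ["$14.85", "~$13.79", "$28.12", "~$22.03"]  # official, lowest, official, lowest
game_ids = ["80011", "130187"]  # collected into a list, not overwritten per iteration


def clean(price):
    """Drop the currency and approximation markers from a price string."""
    return price.replace("$", "").replace("~", "")


def build_rows(titles, prices, game_ids):
    """Pair each title with its two prices and its id, one CSV-style row per game."""
    rows = []
    for title, official, lowest, game_id in zip(titles, prices[::2], prices[1::2], game_ids):
        rows.append(",".join([title, clean(official), clean(lowest), game_id]))
    return rows


for row in build_rows(titles, prices, game_ids):
    print(row)
```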

Error when creating dictionaries from text files

I've been working on a function which will update two dictionaries (similar authors, and awards they've won) from an open text file. The text file looks something like this:
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And so on. The first line of each entry is an author's name (last name first), followed by any awards they have won, and then the authors who are similar to them. This is what I've got so far:
def load_author_dicts(text_file, similar_authors, awards_authors):
    name_of_author = True
    awards = False
    similar = False
    for line in text_file:
        if name_of_author:
            author = line.split(', ')
            nameA = author[1].strip() + ' ' + author[0].strip()
            name_of_author = False
            awards = True
            continue
        if awards:
            if ',' in line:
                awards = False
                similar = True
            else:
                if nameA in awards_authors:
                    listawards = awards_authors[nameA]
                    listawards.append(line.strip())
                else:
                    listawards = []
                    listawards.append(line.strip())
                    awards_authors[nameA] = listawards
        if similar:
            if line == '\n':
                similar = False
                name_of_author = True
            else:
                sim_author = line.split(', ')
                nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
                if nameA in similar_authors:
                    similar_list = similar_authors[nameA]
                    similar_list.append(nameS)
                else:
                    similar_list = []
                    similar_list.append(nameS)
                    similar_authors[nameA] = similar_list
                continue
This works great! However, if the text file contains an entry with just a name (i.e. no awards and no similar authors), the whole thing breaks with IndexError: list index out of range at this line: nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
How can I fix this? Maybe with a try/except block in that area?
Also, I wouldn't mind getting rid of those continue statements; I wasn't sure how else to keep the loop going. I'm still pretty new to this, so any help would be much appreciated! I keep trying things and they change another section I didn't want changed, so I figured I'd ask the experts.
How about doing it this way, just to get the data in, then manipulate the dictionary any ways you want.
test.txt contains your data
Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur
Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas
And my code to parse it.
award_parse.py
data = {}
name = ""
awards = []
f = open("test.txt")
for l in f:
    # make sure the line is not blank; don't process blank lines
    if not l.strip() == "":
        # if this is a name and we're not already working on an author then set the author
        # otherwise treat this as a new author and set the existing author to a key in the dictionary
        if "," in l and len(name) == 0:
            name = l.strip()
        elif "," in l and len(name) > 0:
            # check to see if recipient is already in list, add to end of existing list if he/she already
            # exists.
            if not name.strip() in data:
                data[name] = awards
            else:
                data[name].extend(awards)
            name = l.strip()
            awards = []
        # process any lines that are not blank, and do not have a ,
        else:
            awards.append(l.strip())
f.close()
for k, v in data.items():
    print("%s got the following awards: %s" % (k,v))
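For the specific failure in the question (an entry that is just a name), one option is to let blank lines decide where an entry starts instead of tracking three flags. The sketch below is only an illustration under stated assumptions: entries are separated by blank lines (as the line == '\n' check in the question implies), award names never contain a comma, and the helper flip_name is a made-up name. It returns two new dictionaries instead of updating the ones passed in.

```python
def flip_name(line):
    """Turn 'Last, First' into 'First Last'."""
    last, _, first = line.partition(",")
    return (first.strip() + " " + last.strip()).strip()


def load_author_dicts(lines):
    """Return (awards_by_author, similar_by_author) from blank-line separated entries."""
    awards_by_author = {}
    similar_by_author = {}
    author = None  # None means the next non-blank line starts a new entry

    for raw in lines:
        line = raw.strip()
        if not line:
            author = None  # a blank line closes the current entry
        elif author is None:
            author = flip_name(line)
            # Register the author even if no awards or similar authors follow.
            awards_by_author.setdefault(author, [])
            similar_by_author.setdefault(author, [])
        elif "," in line:
            similar_by_author[author].append(flip_name(line))
        else:
            awards_by_author[author].append(line)
    return awards_by_author, similar_by_author


sample = ["Brabudy, Ray\n", "Hugo Award\n", "Ellison, Harlan\n", "\n", "Adams, Douglas\n"]
awards, similar = load_author_dicts(sample)
print(awards)
print(similar)
```

Because every branch is an if/elif arm, no continue statements are needed, and a name-only entry simply ends up with two empty lists.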

remove content between tags in python using regex

I was trying to clean up wikitext. Specifically I was trying to remove all the {{.....}} and <..>...</..> in the wikitext. For example, for this wikitext:
"{{Infobox UK place\n|country = England\n|official_name =
Morcombelake\n|static_image_name = Morecombelake from Golden Cap -
geograph.org.uk - 1184424.jpg\n|static_image_caption = Morcombelake as
seen from Golden Cap\n|coordinates =
{{coord|50.74361|-2.85153|display=inline,title}}\n|map_type =
Dorset\n|population = \n|population_ref = \n|shire_district = [[West
Dorset]]\n|shire_county = [[Dorset]]\n|region = South West
England\n|constituency_westminster = West Dorset\n|post_town =
\n|postcode_district = \n|postcode_area = DT\n|os_grid_reference =
SY405938\n|website = \n}}\n'''Morcombelake''' (also spelled
'''Morecombelake''') is a small village near [[Bridport]] in
[[Dorset]], [[England]], within the ancient parish of [[Whitchurch
Canonicorum]]. [[Golden Cap]], part of the [[Jurassic Coast]] World
Heritage Site, is nearby.{{cite
web|url=http://www.nationaltrust.org.uk/golden-cap/|title=Golden
Cap|publisher=National Trust|accessdate=2014-05-04}}\n\n==
References ==\n{{reflist}}\n\n{{West
Dorset}}\n\n\n{{Dorset-geo-stub}}\n[[Category:Villages in
Dorset]]\n\n== External Links
==\n\n*[http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html
Parish Church of St Gabriel]\n\n"
How can I use regular expressions in python to produce output like this:
\n'''Morcombelake''' (also spelled '''Morecombelake''') is a small
village near [[Bridport]] in [[Dorset]], [[England]], within the
ancient parish of [[Whitchurch Canonicorum]]. [[Golden Cap]], part of
the [[Jurassic Coast]] World Heritage Site, is nearby.\n\n==
References ==\n\n\n\n\n\n\n[[Category:Villages in Dorset]]\n\n==
External Links
==\n\n*[http://www.goldencapteamofchurches.org.uk/morcombelakechurch.html
Parish Church of St Gabriel]\n\n
As the tags are nested into each other, you can find and remove them in a loop:
n = 1
while n > 0:
    s, n = re.subn('{{(?!{)(?:(?!{{).)*?}}|<[^<]*?>', '', s, flags=re.DOTALL)
Here s is a string containing the wikitext.
Your example contains no <...> tags, but the pattern removes them as well.
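To see the loop at work, here is a small self-contained run on a made-up string with one nested template and one pair of tags. The function name strip_markup and the sample text are illustrative only. Each pass removes the innermost {{...}} blocks, so nested templates take more than one pass. Note that the tag alternative deletes the tags themselves and leaves any text between an opening and closing tag in place.

```python
import re

# Innermost {{...}} (no "{{" inside), or any single <...> tag.
PATTERN = re.compile(r"{{(?!{)(?:(?!{{).)*?}}|<[^<]*?>", flags=re.DOTALL)


def strip_markup(text):
    """Repeat the substitution until a pass removes nothing."""
    count = 1
    while count > 0:
        text, count = PATTERN.subn("", text)
    return text


sample = "a{{outer|{{inner}}}}b<ref>c</ref>d"
print(strip_markup(sample))  # -> abcd
```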

Append items to dictionary Python

I am trying to write a function in python that opens a file and parses it into a dictionary. I am trying to make the first item in the list block the key for each item in the dictionary data. Then each item is supposed to be the rest of the list block less the first item. For some reason though, when I run the following function, it parses it incorrectly. I have provided the output below. How would I be able to parse it like I stated above? Any help would be greatly appreciated.
Function:
def parseData() :
    filename="testdata.txt"
    file=open(filename,"r+")
    block=[]
    for line in file:
        block.append(line)
        if line in ('\n', '\r\n'):
            album=block.pop(1)
            data[block[1]]=album
            block=[]
    print data
Input:
Bob Dylan
1966 Blonde on Blonde
-Rainy Day Women #12 & 35
-Pledging My Time
-Visions of Johanna
-One of Us Must Know (Sooner or Later)
-I Want You
-Stuck Inside of Mobile with the Memphis Blues Again
-Leopard-Skin Pill-Box Hat
-Just Like a Woman
-Most Likely You Go Your Way (And I'll Go Mine)
-Temporary Like Achilles
-Absolutely Sweet Marie
-4th Time Around
-Obviously 5 Believers
-Sad Eyed Lady of the Lowlands
Output:
{'-Rainy Day Women #12 & 35\n': '1966 Blonde on Blonde\n',
'-Whole Lotta Love\n': '1969 II\n', '-In the Evening\n': '1979 In Through the Outdoor\n'}
You can use groupby to split the data into sections, with the empty lines as delimiters. For each section, take the first line as the key and extend a defaultdict(list) entry with the remaining lines, so that repeated keys accumulate their values.
from itertools import groupby
from collections import defaultdict
d = defaultdict(list)
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        # if k is True we have a section
        if k:
            # get key "k" which is the first line
            # from each section, val will be the remaining lines
            k,*v = val
            # add or add to the existing key/value pairing
            d[k].extend(map(str.rstrip,v))
from pprint import pprint as pp
pp(d)
Output:
{'Bob Dylan\n': ['1966 Blonde on Blonde',
'-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands'],
'Led Zeppelin\n': ['1979 In Through the Outdoor',
'-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl",
'1969 II',
'-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home']}
For python2 the unpack syntax is slightly different:
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, v = next(val), val
            d[k].extend(map(str.rstrip, v))
If you want to keep the newlines, remove the map(str.rstrip, ...) call and extend with v directly.
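The grouping step can be tried without a file, since a list of lines iterates the same way an open file does. The following is a minimal Python 3 sketch on a shortened, made-up sample. The function name group_sections is illustrative. Unlike the code above, it strips the trailing newline from the key as well, and it repeats one artist to show how a second section is appended to an existing key.

```python
from collections import defaultdict
from itertools import groupby

# Shortened, hypothetical input; blank lines separate the sections.
lines = [
    "Bob Dylan\n", "1966 Blonde on Blonde\n", "-I Want You\n",
    "\n",
    "Led Zeppelin\n", "1979 In Through the Outdoor\n", "-Hot Dog\n",
    "\n",
    "Led Zeppelin\n", "1969 II\n", "-Moby Dick\n",
]


def group_sections(lines):
    """Map each section's first line to the rest of its lines."""
    sections = defaultdict(list)
    # groupby yields alternating runs of non-blank (True) and blank (False) lines.
    for has_text, block in groupby(lines, key=lambda line: line.strip() != ""):
        if has_text:
            key, *rest = (line.rstrip() for line in block)
            sections[key].extend(rest)
    return dict(sections)


print(group_sections(lines))
```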
If you want the album and songs separately for each artist:
from itertools import groupby
from collections import defaultdict
d = defaultdict(lambda: defaultdict(list))
with open("file.txt") as f:
    for k, val in groupby(f, lambda x: x.strip() != ""):
        if k:
            k, alb, songs = next(val),next(val), val
            d[k.rstrip()][alb.rstrip()] = list(map(str.rstrip, songs))
from pprint import pprint as pp
pp(d)
{'Bob Dylan': {'1966 Blonde on Blonde': ['-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or '
'Later)',
'-I Want You',
'-Stuck Inside of Mobile with the '
'Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
'-Most Likely You Go Your Way '
"(And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': {'1969 II': ['-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': ['-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}
I guess this is what you want?
Even if this is not the format you wanted, there are a few things you might learn from the answer:
use with for file handling
nice to have:
PEP8-compliant code, see http://pep8online.com/
a shebang
numpydoc
if __name__ == '__main__'
And SE does not like a list being continued by code...
#!/usr/bin/env python
"""Parse text files with songs, grouped by album and artist."""


def add_to_data(data, block):
    """
    Parameters
    ----------
    data : dict
    block : list

    Returns
    -------
    dict
    """
    artist = block[0]
    album = block[1]
    songs = block[2:]
    if artist in data:
        data[artist][album] = songs
    else:
        data[artist] = {album: songs}
    return data


def parseData(filename='testdata.txt'):
    """
    Parameters
    ----------
    filename : string
        Path to a text file.

    Returns
    -------
    dict
    """
    data = {}
    with open(filename) as f:
        block = []
        for line in f:
            line = line.strip()
            if line == '':
                # skip empty blocks, e.g. two blank lines in a row
                if block:
                    data = add_to_data(data, block)
                block = []
            else:
                block.append(line)
        # the last block has no trailing blank line; skip it if the file ended on one
        if block:
            data = add_to_data(data, block)
    return data


if __name__ == '__main__':
    data = parseData()
    import pprint
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(data)
which gives:
{ 'Bob Dylan': { '1966 Blonde on Blonde': [ '-Rainy Day Women #12 & 35',
'-Pledging My Time',
'-Visions of Johanna',
'-One of Us Must Know (Sooner or Later)',
'-I Want You',
'-Stuck Inside of Mobile with the Memphis Blues Again',
'-Leopard-Skin Pill-Box Hat',
'-Just Like a Woman',
"-Most Likely You Go Your Way (And I'll Go Mine)",
'-Temporary Like Achilles',
'-Absolutely Sweet Marie',
'-4th Time Around',
'-Obviously 5 Believers',
'-Sad Eyed Lady of the Lowlands']},
'Led Zeppelin': { '1969 II': [ '-Whole Lotta Love',
'-What Is and What Should Never Be',
'-The Lemon Song',
'-Thank You',
'-Heartbreaker',
"-Living Loving Maid (She's Just a Woman)",
'-Ramble On',
'-Moby Dick',
'-Bring It on Home'],
'1979 In Through the Outdoor': [ '-In the Evening',
'-South Bound Saurez',
'-Fool in the Rain',
'-Hot Dog',
'-Carouselambra',
'-All My Love',
"-I'm Gonna Crawl"]}}

How can I do a non-greedy (backtracking) match with OneOrMore etc. in pyparsing?

I am trying to parse a partially standardized street address into its components using pyparsing. I want a non-greedy match for a street name that may be N tokens long.
For example:
444 PARK GARDEN LN
Should be parsed into:
number: 444
street: PARK GARDEN
suffix: LN
How would I do this with PyParsing? Here's my initial code:
from pyparsing import *
def main():
    street_number = Word(nums).setResultsName('street_number')
    street_suffix = oneOf("ST RD DR LN AVE WAY").setResultsName('street_suffix')
    street_name = OneOrMore(Word(alphas)).setResultsName('street_name')
    address = street_number + street_name + street_suffix
    result = address.parseString("444 PARK GARDEN LN")
    print result.dump()
if __name__ == '__main__':
    main()
but when I try parsing it, the street suffix gets gobbled up by the default greedy parsing behavior.
Use the negation, ~, to check to see if the upcoming street_name is actually a street_suffix.
from pyparsing import *
street_number = Word(nums)('street_number')
street_suffix = oneOf("ST RD DR LN AVE WAY")('street_suffix')
street_name = OneOrMore(~street_suffix + Word(alphas))('street_name')
address = street_number + street_name + street_suffix
result = address.parseString("444 PARK GARDEN LN")
print result.dump()
In addition, you don't have to use setResultsName, you can simply use the syntax above. IMHO it leads to a much cleaner grammar definition.
