RE works in pythex but doesn't work in Python

I am doing an assignment where I need to scrape information from live sites.
For this I am using https://www.nintendo.com/games/nintendo-switch-bestsellers, and I need to scrape the game titles, prices, and image sources. I have the titles working, but the prices and image sources just return an empty list, even though the same patterns return the right matches in pythex.
Here is my code:
from re import findall, finditer, MULTILINE, DOTALL
from urllib.request import urlopen
game_html_source = urlopen(
    'https://www.nintendo.com/games/nintendo-switch-bestsellers'
).read().decode("UTF-8")
# game titles - working
game_title = findall(r'<h3 class="b3">([A-Z a-z:0-9]+)</h3>', game_html_source)
print(game_title)
# game prices - returning empty list
game_prices = findall(r'<p class="b3 row-price">(\$[.0-9]+)</p>', game_html_source)
print(game_prices)
# game images - returning empty list
game_images = findall(r'<img alt="[A-Z a-z:]+" src=("https://media.nintendo.com/nintendo/bin/[A-Za-z0-9-\/_]+.png")>',game_html_source)
print(game_images)
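When a pattern matches in pythex but not against the live page, the usual culprit is that the HTML the server sends differs from the HTML pasted into pythex (the answer below, for instance, relies on the served src attributes being protocol-relative, i.e. starting with //media.nintendo.com rather than https://). A quick diagnostic sketch, not part of the original question, is to print the raw markup around the class being targeted:
# diagnostic sketch: inspect what urlopen actually returned
idx = game_html_source.find('row-price')
print(game_html_source[max(0, idx - 200):idx + 200])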

Parsing HTML with regex has too many pitfalls for reliable processing. BeautifulSoup and other HTML parsers work by building a complete document data structure, which you then navigate to extract the interesting bits. This is thorough and comprehensive, but if there is erroneous HTML anywhere in the source, even if it's in a part you don't care about, it can defeat the parsing process. Pyparsing takes a middle approach: you can define mini-parsers that match just the bits you want and skip over everything else (this simplifies the post-parsing navigation, too). To address some of the variability in HTML styles, pyparsing provides a function makeHTMLTags, which returns a pair of pyparsing expressions for the opening and closing tags:
foo_start, foo_end = pp.makeHTMLTags('foo')
foo_start will match:
<foo>
<foo/>
<foo class='bar'>
<foo href=something_not_in_quotes>
and many more variations of attributes and whitespace.
The foo_start expression (like all pyparsing expressions) will return a ParseResults object. This makes it easy to access the parts of the parsed tag:
foo_data = foo_start.parseString("<foo img='bar.jpg'>")
print(foo_data.img)
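Run as a script, together with the tag variations listed above, that looks roughly like this (a sketch; the original answer only shows attribute-style access, but ParseResults supports dict-style lookup too):
import pyparsing as pp

foo_start, foo_end = pp.makeHTMLTags('foo')

# all of the tag variations listed above parse with the same expression
for fragment in ["<foo>", "<foo/>", "<foo class='bar'>",
                 "<foo href=something_not_in_quotes>"]:
    foo_start.parseString(fragment)

foo_data = foo_start.parseString("<foo img='bar.jpg'>")
print(foo_data.img)      # -> bar.jpg
print(foo_data['img'])   # dict-style access works too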
For your Nintendo page scraper, see the annotated source below:
import pyparsing as pp
# define expressions to match opening and closing tags <h3>
h3, h3_end = pp.makeHTMLTags("h3")
# define a specific type of <h3> tag that has the desired 'class' attribute
h3_b3 = h3().addCondition(lambda t: t['class'] == "b3")
# similar for <p>
p, p_end = pp.makeHTMLTags("p")
p_b3_row_price = p().addCondition(lambda t: t['class'] == "b3 row-price")
# similar for <img>
img, _ = pp.makeHTMLTags("img")
img_expr = img().addCondition(lambda t: t.src.startswith("//media.nintendo.com/nintendo/bin"))
# define expressions to capture tag body for title and price - include negative lookahead for '<' so that
# tags with embedded tags are not matched
LT = pp.Literal('<')
title_expr = h3_b3 + ~LT + pp.SkipTo(h3_end)('title') + h3_end
price_expr = p_b3_row_price + ~LT + pp.SkipTo(p_end)('price') + p_end
# compose a scanner expression by '|'ing the 3 sub-expressions into one
scanner = title_expr | price_expr | img_expr
# not shown - read web page into variable 'html'
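# (a sketch filling that in - reusing the urlopen call from the question)
from urllib.request import urlopen
html = urlopen('https://www.nintendo.com/games/nintendo-switch-bestsellers').read().decode("UTF-8")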
# use searchString to search through the retrieved HTML for matches
for match in scanner.searchString(html):
    if 'title' in match:
        print("Title:", match.title)
    elif 'price' in match:
        print("Price:", match.price)
    elif 'src' in match:
        print("Img src:", match.src)
    else:
        print("???", match.dump())
The first few matches printed are:
Img src: //media.nintendo.com/nintendo/bin/SF6LoN-xgX1iT617eWfBrNcWH6RQXnSh/I_IRYaBzJ61i-3hnYt_k7hVxHtqGmM_w.png
Title: Hyrule Warriors: Definitive Edition
Price: $59.99
Img src: //media.nintendo.com/nintendo/bin/wcfCyAd7t2N78FkGvEwCOGzVFBNQRbhy/AvG-_d4kEvEplp0mJoUew8IAg71YQveM.png
Title: Donkey Kong Country: Tropical Freeze
Price: $59.99
Img src: //media.nintendo.com/nintendo/bin/QKPpE587ZIA5fUhUL4nSbH3c_PpXYojl/J_Wd79pnFLX1NQISxouLGp636sdewhMS.png
Title: Wizard of Legend
Price: $15.99

Related

How to get all attributes of section of text from a html string in Python?

I have a html string:
Normal<span style="font-weight: bold;">Bold <span style="font-style: italic;">BoldAndItalic</span></span><span style="font-style: italic;">Italic</span>
Which renders out to
NormalBold BoldAndItalicItalic
What I want to get is a python dictionary, that lists all the attributes given, to all the pieces of text, kind of like this:
[
{"text":"Normal","styles":{"color":None,"font-style":None,"font-weight":None}},
{"text":"Bold","styles":{"color":None,"font-style":None,"font-weight":"bold"}},
{"text":" ","styles":{"color":None,"font-style":None,"font-weight":None}},
{"text":"BoldAndItalic","styles":{"color":None,"font-style":"italic","font-weight":"bold"}},
{"text":"Italic","styles":{"color":None,"font-style":"italic","font-weight":None}}
]
Where None could be considered default/not given.
However, when I parse the HTML string with the BeautifulSoup library, I cannot find a way to access each section of text individually: I have to access the span tags first, and since there are span tags nested inside other span tags, writing a parser myself becomes very difficult, and I cannot seem to figure it out.
What I attempted was this:
def stylesandtext(obj):
    if obj.decomposed:
        return
    text = obj.text
    styles = {"color": None, "font-weight": None}
    stylestr = obj.attrs['style'].split(": ")
    styles[stylestr[0]] = stylestr[1].replace(";", "")
    if obj.find('span') != None:
        getsecstyle = stylesandtext(obj.find('span'))['styles']
        if getsecstyle['color'] != None:
            styles['color'] = getsecstyle['color']
        if getsecstyle['font-weight'] != None:
            styles['font-weight'] = getsecstyle['font-weight']
        obj.find('span').decompose()
    return {"text": text, "styles": styles}
Here, using BeautifulSoup, for every span tag I checked for inner span tags, got the attributes, and merged them into the dict storing everything. It kind of worked, but it did not pick up any text that was not inside a span tag.
How would I go about getting the attributes to a text section?
A few more details:
There are no tags other than span.
I only need to account for a few style attributes: color, font-weight, font-style, text-decoration.
Every tag has an attribute called children, which yields all the elements inside it (like soup.children or soup.span.children).
Then I can recursively, in a function, get all attributes and text, which I store in a list.
This is the code I figured out:
import bs4
def get_as_list(obj,extstyle=None):
alldata = []
style = {"color":None,"font-weight":None,"font-style":None,"text-decoration":None}
if extstyle != None:
style=extstyle
if 'style' in obj.attrs:
spanstyleaslist = obj.attrs['style'].split(": ")
#obj.attrs is like {'style': 'color: #55FF55'}
style[spanstyleaslist[0]] = spanstyleaslist[1]
stuffaslist = list(obj.children)
for x in stuffaslist:
if type(x) == bs4.element.NavigableString:
alldata.append({'text':str(x),'styles':style})
else:
alldata.extend(get_as_list(x,style))
return alldata
which I call from an external function, for every element in soup.children, excluding NavigableStrings (a sketch of that driver follows).
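The driver itself isn't shown in the answer; a minimal sketch of what it might look like, assuming html_string holds the markup from the question (bare text outside any span gets the default styles):
import bs4

html_string = ('Normal<span style="font-weight: bold;">Bold '
               '<span style="font-style: italic;">BoldAndItalic</span></span>'
               '<span style="font-style: italic;">Italic</span>')

soup = bs4.BeautifulSoup(html_string, "html.parser")
alldata = []
for element in soup.children:
    if isinstance(element, bs4.element.NavigableString):
        # bare text outside any span: all styles at their defaults
        alldata.append({'text': str(element),
                        'styles': {"color": None, "font-weight": None,
                                   "font-style": None, "text-decoration": None}})
    else:
        alldata.extend(get_as_list(element))
print(alldata)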

python: get opening and closing html tags

Question:
How can I find the text for all opening and closing HTML tags with python (3.6).
This needs to be the exact text, keeping spaces and potentially illegal html:
# input
html = """<p>This <a href="book"> book </a > will help you</p attr="e">"""
# desired output
output = ['<p>', '<a href="book">', '</a >', '</p attr="e">']
Attempt at solution:
Apparently this is not possible in BeautifulSoup; this question (How to get the opening and closing tag in beautiful soup from HTML string?) links to html.parser.
Implementing a custom parser is easy. You can use self.get_starttag_text() to get the text corresponding to the last opened tag. But for some reason, there is no analogous method get_endtag_text(), so the best I can do in handle_endtag is fall back to get_starttag_text(), which means that my parser produces this output:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def reset_stored_tags(self):
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # get_endtag_text() does not exist, so this repeats the last start tag
        self.tags.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.tags.append(self.get_starttag_text())

# input
input_doc = """<p>This <a href="book"> book </a > will help you</p attr="e">"""
parser = MyHTMLParser()
parser.feed(input_doc)
print(parser.tags)
# ['<p>', '<a href="book">', '<a href="book">', '<a href="book">']
The tag argument of handle_endtag is just a string such as "a" or "p", not some custom datatype that can provide the whole tag.
You can use recursion and iterate over the soup.contents attribute:
from bs4 import BeautifulSoup as soup

html = """<p>This <a href="book"> book </a> will help you</p>"""

def attrs(_d):
    if _d.name != '[document]':
        _attrs = ' '.join(f'{a}="{b}"' for a, b in getattr(_d, 'attrs', {}).items())
        yield f'<{_d.name}>' if not _attrs else f'<{_d.name} {_attrs}>'
    for i in _d.contents:
        if not isinstance(i, str):
            yield from attrs(i)
    if _d.name != '[document]':
        yield f'</{_d.name}>'

print(list(attrs(soup(html, 'html.parser'))))
Output:
['<p>', '<a href="book">', '</a>', '</p>']
Edit: for the invalid HTML, you can use re:
import re
html = """<p>This <a href="book"> book </a > will help you</p attr="e">"""
new_results = re.findall(r'\<[a-zA-Z]+.*?\>|\</[a-zA-Z]+.*?\>', html)
Output:
['<p>', '<a href="book">', '</a >', '</p attr="e">']
While the answer from @Ajax1234 contains some nice Python + BeautifulSoup, I found it to be very unstable, mostly because I need the exact string of the HTML tag: each tag found by the method must be present verbatim in the HTML text. This leads to the following problems:
It parses the tag names and attributes out of the HTML and plugs them back together to form the string of the tag (yield f'<{_d.name}>' if not _attrs else f'<{_d.name} {_attrs}>'). This gets rid of extra whitespace in the tag: <p > becomes <p>.
It always generates a closing tag, even if there is none in the markup.
It fails for attributes that are lists: <p class="a b"> becomes <p class="[a, b]">.
The whitespace problem can be partially solved by cleaning the HTML prior to processing it. I used bleach, but it can be too aggressive; notably, you have to specify a list of accepted tags before you use it.
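For reference, a hedged sketch of such a cleaning pass with bleach (the whitelist and flags here are illustrative; bleach re-serializes the markup, which is what normalizes tags like <p > to <p>):
import bleach

# hypothetical whitelist - bleach removes or escapes anything not listed here
allowed_tags = ['p', 'a']
cleaned = bleach.clean('<p > some <a href="book">text</a ></p>',
                       tags=allowed_tags,
                       attributes={'a': ['href']},
                       strip=True)
print(cleaned)  # re-serialized, normalized markup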
A better approach is a thin wrapper around html.parser.HTMLParser.
This is something I already started in my question; the difference here is that I automatically generate a closing tag.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")

parser = MyHTMLParser()
parser.feed("""<p > Argh, whitespace and p is not closed </a>""")
parser.tags  # ['<p >', '</a>']
This solves the problems mentioned above, but it has one shortcoming: it doesn't look at the actual text of the closing tag, so extra attributes or whitespace in a closing tag will not show up in the result.

Beautiful Soup - Get all text, but preserve link html?

I have to process a large archive of extremely messy HTML full of extraneous tables, spans and inline styles into markdown.
I am trying to use Beautiful Soup to accomplish this task, and my goal is basically the output of the get_text() function, except to preserve anchor tags with the href intact.
As an example, I would like to convert:
<td>
<font><span>Hello</span><span>World</span></font><br>
<span>Foo Bar <span>Baz</span></span><br>
<span>Example Link: <a href="https://google.com">Google</a></span>
</td>
Into:
Hello World
Foo Bar Baz
Example Link: Google
My thought process so far was to simply grab all the tags and unwrap them all if they aren't anchors, but this causes the text to be repeated several times as soup.find_all(True) returns recursively nested tags as individual elements:
#!/usr/bin/env python
from bs4 import BeautifulSoup
example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com">Google</a></span></td>'
soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(True)
for tag in tags:
    if tag.name == 'a':
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())
Which returns multiple fragments/duplicates as the parser moves down the tree:
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World
Foo Bar Baz
Baz
Example Link: Google
<a href='https://google.com'>Google</a>
One of the possible ways to tackle this problem would be to introduce some special handling for a elements when it comes to printing out a text of an element.
You can do it by overriding _all_strings() method and returning a string representation of an a descendant element and skip a navigable string inside an a element. Something along these lines:
from bs4 import BeautifulSoup, NavigableString, CData, Tag
class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)
            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue
            # default behavior
            if ((types is None and not isinstance(descendant, NavigableString))
                    or (types is not None and type(descendant) not in types)):
                continue
            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant
Demo:
In [1]: data = """
   ...: <td>
   ...:     <font><span>Hello</span><span>World</span></font><br>
   ...:     <span>Foo Bar <span>Baz</span></span><br>
   ...:     <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
   ...: </td>
   ...: """
In [2]: soup = MyBeautifulSoup(data, "lxml")
In [3]: print(soup.get_text())
HelloWorld
Foo Bar Baz
Example Link: Google
To only consider direct children, set recursive=False; then you need to process each 'td' and extract the text and anchor link individually.
#!/usr/bin/env python
from bs4 import BeautifulSoup
example_html = '<td><font><span>Some Example Text</span></font><br><span>Another Example Text</span><br><span>Example Link: <a href="https://google.com">Google</a></span></td>'
soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(recursive=False)
for tag in tags:
    print(tag.text)
    print(tag.find('a'))
If you want the text printed on separate lines you will have to process the spans individually.
for tag in tags:
    spans = tag.find_all('span')
    for span in spans:
        print(span.text)
    print(tag.find('a'))

How to extract text from a html table row

This is my string :
content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'
I have tried below regular expression to extract the text which is in between h5 element tag:
reg = re.search(r'<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>([A-Za-z0-9%s]+)</h5></span></td></tr>' % string.punctuation,content)
It returns exactly what I want.
Is there a more Pythonic way to get this?
Dunno whether this qualifies as more pythonic or not, but it handles it as HTML data.
from lxml import html
content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'
HtmlData = html.fromstring(content)
ListData = HtmlData.xpath('//text()')
And to get the last element:
ListData[-1]
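For the sample string above, ListData[-1] would be 'Yadgiri'. Since BeautifulSoup is already in play elsewhere on this page, an equivalent hedged sketch that selects the h5 text directly:
from bs4 import BeautifulSoup

content = ('<tr class="cart-subtotal"><th>RTO / Registration office :</th>'
           '<td><span class="amount"><h5>Yadgiri</h5></span></td></tr>')
print(BeautifulSoup(content, 'html.parser').find('h5').get_text())  # -> Yadgiri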

Python extracting data from HTML using split

A certain page retrieved from a URL has the following syntax:
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
I want to extract the data in Name, Surname etc. (I have to repeat this task for many pages)
For that I tried using the following code:
import urllib2
url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)
start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]
start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]
print(givenName)
print(surname)
When I call the source.read().split() chain only once, it works fine, but when I use it twice it gives a 'list index out of range' error.
Can someone suggest a solution?
You can use BeautifulSoup for parsing the HTML string.
Here is some code you might try. It uses BeautifulSoup (to get the text rendered from the HTML), then parses the string to extract the data.
from bs4 import BeautifulSoup as bs
dic = {}
data = \
"""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
soup = bs(data)
# Get the text on the html through BeautifulSoup
text = soup.get_text()
# parsing the text
lines = text.splitlines()
for line in lines:
    # check if line has ':', if it doesn't, move to the next line
    if line.find(':') == -1:
        continue
    # split the string at ':'
    parts = line.split(':')
    # You can add more tests here like
    # if len(parts) != 2:
    #     continue
    # stripping whitespace
    for i in range(len(parts)):
        parts[i] = parts[i].strip()
    # adding the values to a dictionary
    dic[parts[0]] = parts[1]
    # printing the data after processing
    print '%16s %20s' % (parts[0], parts[1])
A tip:
If you are going to use BeautifulSoup to parse HTML, you should have consistent attributes like class="input" or id="10"; that is, keep all tags of the same type under the same id or class.
Update
In response to your comment, see the code below.
It applies the tip above, making life (and coding) a lot easier.
from bs4 import BeautifulSoup as bs
c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
<p>
No. 4<br>
Private Drive,<br>
Sri Lanka ON K7L LK <br>
"""
soup = bs(data)
for i in soup.find_all('div'):
    # get data using "class" attribute
    addr = ""
    if i.get("class")[0] == u'address':  # unicode string
        text = i.get_text()
        for line in text.splitlines():  # line-wise
            line = line.strip()  # remove whitespace
            addr += line  # add to address string
        c_addr.append(addr)

    # get data using "id" attribute
    addr = ""
    if int(i.get("id")) == 10:  # integer
        text = i.get_text()
        # same processing as above
        for line in text.splitlines():
            line = line.strip()
            addr += line
        id_addr.append(addr)

print "id_addr"
print id_addr
print "c_addr"
print c_addr
You are calling read() twice. That is the problem. Instead of doing that you want to call read once, store the data in a variable, and use that variable where you were calling read(). Something like this:
fetched_data = source.read()
Then later...
givenName=(fetched_data.split(start))[1].split(end)[0]
and...
surname=(fetched_data.split(start))[1].split(end)[0]
That should work. The reason your code didn't work is that read() consumes the stream: the first call reads the whole content, leaving the position at the end, so the next call to read() returns an empty string. Splitting that empty string gives a one-element list, and indexing it with [1] raises the 'list index out of range' error.
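The stream behaviour is easy to demonstrate with any file-like object (a sketch using io.StringIO as a stand-in for the urlopen response):
from io import StringIO

f = StringIO("some content")
print(repr(f.read()))   # 'some content'
print(repr(f.read()))   # '' - the stream is already exhausted
print(''.split('<p>'))  # [''] - a one-element list, so [1] raises IndexError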
Check out the docs for urllib2 and methods on file objects
If you want to be quick, regexes are more useful for this kind of task. It can be a harsh learning curve at first but regexes will save your butt one day.
Try this code:
# read the whole document into memory
full_source = source.read()
import re

# flags belong in re.compile(); the second argument of a compiled
# pattern's search() is a start position, not a flags value
NAME_RE = re.compile('Name:.+?>(.*?)<', re.MULTILINE)
SURNAME_RE = re.compile('Surname:.+?>(.*?)<', re.MULTILINE)
name = NAME_RE.search(full_source).group(1).strip()
surname = SURNAME_RE.search(full_source).group(1).strip()
See here for more info on how to use regexes in python.
A more comprehensive solution would involve parsing the HTML (using a lib like BeautifulSoup), but that can be overkill depending on your particular application.
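For completeness, a hedged sketch of that BeautifulSoup route (assuming full_source holds the fetched page; next_sibling picks up the text node that follows each strong tag):
from bs4 import BeautifulSoup

soup = BeautifulSoup(full_source, 'html.parser')
fields = {}
for strong in soup.find_all('strong'):
    label = strong.get_text().strip().rstrip(':')  # e.g. 'Name', 'Surname'
    value = strong.next_sibling                    # the text right after </strong>
    fields[label] = str(value).strip()
print(fields.get('Name'), fields.get('Surname'))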
You can use HTQL:
page="""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
import htql
print(htql.query(page, "<p>.<strong> {a=:tx; b=:xx} "))
# [('Name:', ' Pasan '),
# ('Surname: ', ' Wijesingher '),
# ('Former/AKA Name:', ' No Former/AKA Name '),
# ('Gender:', ' Male '),
# ('Language Fluency:', ' ENGLISH ')
# ]
