Get following node in different ancestor using lxml and xpath - python

I'm writing a text-to-speech program that reads math equations. I have a thread that needs to pull math equations (as MathJax SVG's) and parse them to prose.
Because of how the content is laid out, the math equations can be arbitrarily nested in other elements, like paragraphs, bolds, tables, etc.
Using a reference to the current element, how do I get the next <span class="MathJax_SVG">, which may be embedded in some other parent/ancestor?
I tried to solve it using the following:
nextMath = currentElement.xpath('following::.//span[@class=\'MathJax_SVG\']')
This returns nothing, even though I can confirm visually that there is something following it. I tried removing the period, but lxml complains that my XPath is malformed.
Have you run into this before?
P.S. Here is a test document to show my point:
<html>
  <head>
    <title>Test Document</title>
  </head>
  <body>
    <h1 id="mainHeading">The Quadratic Formula</h1>
    <p>The quadratic formula is used to solve quadratic equations. Here is the formula:</p>
    <p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
    <p>Here are some possible values when you use the formula:</p>
    <p>
      <table>
        <tr>
          <td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
          <td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
        </tr>
        <tr>
          <td><span class="MathJax_SVG" id="MathJax_Element_Frame_4">removed the SVG</span></td>
          <td><span class="MathJax_SVG" id="MathJax_Element_Frame_5">removed the SVG</span></td>
        </tr>
      </table>
    </p>
  </body>
</html>
Updates
I learned that lxml doesn't support absolute XPath positions. This may be relevant.
Some Testing Code (assuming you saved HTML as test.html)
from lxml import html

# Get my html element
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the first MathJax element
start = myHtml.find('.//h1[@id=\'mainHeading\']')
print 'My start:', html.tostring(start)

# Get next math equation
nextXPath = 'following::.//span[@class=\'MathJax_SVG\']'
nextElem = start.xpath(nextXPath)
if len(nextElem) > 0:
    print 'Next equation:', html.tostring(nextElem[0])
else:
    print 'No next equation...'
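A note on the failing expression: in standard XPath, an axis like following:: must be followed by a node test, so following::.//span is not a valid step. A minimal sketch of the form that should work (reusing start and html from the testing code above):

nextElem = start.xpath("following::span[@class='MathJax_SVG']")
if len(nextElem) > 0:
    print 'Next equation:', html.tostring(nextElem[0])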

Do you need to iterate through the document? You could also search for span elements of the class MathJax_SVG directly:
from lxml import etree
doc = etree.parse(open("test-document.html")).getroot()
maths = doc.xpath("//span[@class='MathJax_SVG']")
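Since xpath() returns matches in document order, "the next equation" is then just the next list entry. A small sketch of how that could drive the reader thread:

# maths is in document order, so the equation after maths[i] is maths[i + 1]
for current, following in zip(maths, maths[1:]):
    pass  # speak `current`, look ahead to `following`, etc.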

I ended up creating my own function to get what I want. I called it getNext(elem, xpathString). If there is a more efficient way to do this, I'm all ears. I'm not confident in its performance.
from lxml import html

def getNext(elem, xpathString):
    '''
    Gets the next element defined by XPath. The element returned
    may be itself.
    '''
    myElem = elem
    nextElem = elem.find(xpathString)
    while nextElem is None:
        if myElem.getnext() is not None:
            myElem = myElem.getnext()
            nextElem = myElem.find(xpathString)
        else:
            if myElem.getparent() is not None:
                myElem = myElem.getparent()
            else:
                break
    return nextElem
# Get my html element
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the first MathJax element
start = myHtml.find('.//span[@id=\'MathJax_Element_Frame_1\']')
print 'My start:', html.tostring(start)

# Get next math equation
nextXPath = './/span[@class=\'MathJax_SVG\']'
nextElem = getNext(start, nextXPath)
if nextElem is not None:
    print 'Next equation:', html.tostring(nextElem)
else:
    print 'No next equation...'
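On the "more efficient way" point: lxml does support the following axis in xpath(), so the manual sibling/parent climb can probably be collapsed into a single query. A sketch (not benchmarked; note one behavioral difference, following:: never looks inside the starting element itself, whereas getNext can match a descendant of elem via .//):

def getNextByAxis(elem, nodeTest):
    # 'following::' covers everything after elem in document order,
    # however differently it is nested, so the first hit is the next match.
    matches = elem.xpath('following::' + nodeTest)
    return matches[0] if matches else None

nextElem = getNextByAxis(start, "span[@class='MathJax_SVG']")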

Related

Multi-level tag existence check in web scraping - improving readability in python

I'm working on a scraper running through many pages built from the same template. Each page holds some information regarding a specific item. In the optimistic case I want to get all available data; for simplicity, let's say that means name, price, and description.
Pages are structured as follows:
<div id="content">
<h1>Product name</h1>
<table id="properties">
<tbody>
<tr id="manufacturer-row">
<th>Manufacturer</th>
<td>Some-Mark</td>
</tr>
</tbody>
</table>
<p>Full description of the product</p>
</div>
Conditions that apply to this case:
Tags are nested, so I need to test for the existence of each level,
Some pages will be missing some data - an empty column in a table is just as possible as a missing table,
Some pages will have no content at all,
Empty text in a tag is a valid value, but a missing tag must be logged,
Missing data is not an exceptional situation.
In effect, I check for the presence of each piece of information, which leads to pretty hard-to-read code:
content = soup.select_one("#content")
if content:
    product_name_tag = content.select_one("h1")
    if product_name_tag:
        name = product_name_tag.text
    else:
        log("Product name tag not found")
    table = content.select_one("table")
    if table:
        manufacturer_tag = table.select_one("#manufacturer-row > td")
        if manufacturer_tag:
            manufacturer = manufacturer_tag.text
        else:
            log("Manufacturer tag not found")
    else:
        log("Table not found")
else:
    log("Tag '#content' not found")
return (
    name if 'name' in locals() else None,
    manufacturer if 'manufacturer' in locals() else None
)
In the actual application the code is even harder to read, as the properties I'm looking for are often more deeply nested and I need to check the existence of each tag before extracting its text. I was wondering if there is any neat way to handle this problem in terms of code readability and conciseness? My ideas:
Creating a function to extract a tag's text if the tag exists - this would save a few lines, but in the real application I must use regexes to extract some phrases from text, so a single function would not be enough.
Creating a wrapper to log missing pieces when None is returned, rather than under 'else' code - for improved readability.
Putting the extraction of each piece of data into a separate function, like _get_content_if_available, _get_name_if_available.
None of those solutions seems good and concise enough, so I'd like to ask you for ideas.
I am also wondering if my way of initializing variables only if some conditions are met, and then checking whether a variable exists in the current context, is a good idea.
It all depends on how you want to structure your code. My suggestion is to use ChainMap from collections. With ChainMap you can specify default values for your tags/keys and just parse the values that aren't missing. That way you won't have if/else clutter all over your codebase:
data = """<div id="content">
<h1>Product name</h1>
<table id="properties">
<tbody>
<tr id="manufacturer-row">
<th>Manufacturer</th>
<td>Some-Mark</td>
</tr>
</tbody>
</table>
<p>Full description of the product</p>
</div>"""
from bs4 import BeautifulSoup
from collections import ChainMap

def my_parse(soup):
    def is_value_missing(k, v):
        if v is None:
            print(f'Value "{k}" is missing!')  # or log it!
        return v is None

    d = {}
    d['product_name_tag'] = soup.select_one("h1")
    d['manufacturer_tag'] = soup.select_one("#manufacturer-row td")
    d['description'] = soup.select_one("p")
    d['other value'] = soup.select_one("nav")  # this is missing!
    return {k: v.text for k, v in d.items() if not is_value_missing(k, v)}

soup = BeautifulSoup(data, 'lxml')
c = ChainMap(my_parse(soup), {'product_name_tag': '-default name tag-',
                              'manufacturer_tag': '-default manufacturer tag-',
                              'description': '-default description-',
                              'other value': '-default other value-',
                              })
print("Product name = ", c['product_name_tag'])
print("Other value = ", c['other value'])
This will print:
Value "other value" is missing!
Product name = Product name
Other value = -default other value-
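The question's first two ideas can also be folded into one small helper that extracts and logs in one place. A minimal sketch (the select_text name and signature are hypothetical, not part of the answer above):

def select_text(root, selector, label, log=print):
    # Return the text of the first tag matching `selector`,
    # or log the miss and return None.
    tag = root.select_one(selector) if root is not None else None
    if tag is None:
        log(f"Tag '{label}' not found")
        return None
    return tag.text

content = soup.select_one("#content")
name = select_text(content, "h1", "product name")
manufacturer = select_text(content, "#manufacturer-row > td", "manufacturer")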

How to find text with a particular value BeautifulSoup python2.7

I have the following HTML, and I'm trying to get the following values saved as variables: Available Now, 7, 148.89, Hatchback, Good. The problem I'm running into is that I'm not able to pull them out independently, since they don't have a class attached. I'm wondering how to solve this. Below is the HTML, followed by my futile code.
</div>
<div class="car-profile-info">
<div class="col-md-12 no-padding">
<div class="col-md-6 no-padding">
<strong>Status:</strong> <span class="statusAvail"> Available Now </span><br/>
<strong>Min. Booking </strong>7 Days ($148.89)<br/>
<strong>Style: </strong>Hatchback<br/>
<strong>Transmission: </strong>Automatic<br/>
<strong>Condition: </strong>Good<br/>
</div>
Python 2.7 code (this gives me the entire HTML!):
soup = BeautifulSoup(html)
print soup.find("span", {"class": "statusAvail"}).getText()
for i in soup.select("strong"):
    if i.getText() == "Min. Booking ":
        print i.parent.getText().replace("Min. Booking ", "")
Find all the strong elements under the div element with class="car-profile-info" and, for each element found, get the .next_siblings until you meet the br element:
from bs4 import BeautifulSoup, Tag

for strong in soup.select(".car-profile-info strong"):
    label = strong.get_text()
    value = ""
    for elm in strong.next_siblings:
        if getattr(elm, "name") == "br":
            break
        if isinstance(elm, Tag):
            value += elm.get_text(strip=True)
        else:
            value += elm.strip()
    print(label, value)
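With the markup from the question, this should print something close to (modulo exact whitespace in the labels):

Status: Available Now
Min. Booking  7 Days ($148.89)
Style:  Hatchback
Transmission:  Automatic
Condition:  Good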
You can use ".next_sibling" to navigate to the text you want like this:
for i in soup.select("strong"):
    if i.get_text(strip=True) == "Min. Booking":
        print(i.next_sibling)  # this will print: 7 Days ($148.89)
See also http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways

Combine multiple tags with lxml

I have an html file which looks like:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is, if all the tags in a 'p' block are 'strong', then combine them into one line, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
Without touching the other block since it contains something else.
Any suggestions? I am using lxml.
UPDATE:
So far I have tried:
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # no text before first element
        children = p.getchildren()
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p, 'strong')
With this code I was able to strip off the strong tags in the desired part, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back in...
I was able to do this with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs
html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""
soup = bs(html)
s = ''
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s + t.strip('\n')
s = '<p><strong>' + s + '</strong></p>'
print s  # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print soup
prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
I have managed to solve my own problem.
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # some conditions specifically for my doc
        children = p.getchildren()
        if len(children) > 1:
            for child in children:
                # if other stuff is present, break
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # If we didn't break, we found a p block to fix:
                # get rid of everything inside p, and put a SubElement in
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text
Special thanks to @Scott, who helped me arrive at this solution. Although I cannot mark his answer as correct, I appreciate his guidance no less.
Alternatively, you can use a more specific XPath to get the targeted p elements directly:
p_target = """
//p[strong]
[not(*[not(self::strong)])]
[not(text()[normalize-space()])]
"""
for p in self.tree.xpath(p_target):
    # the logic inside the loop can also be the same as in your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
A brief explanation of the XPath used:
//p[strong] : find p elements, anywhere in the XML/HTML document, having a child element strong...
[not(*[not(self::strong)])] : ...and not having any child element other than strong...
[not(text()[normalize-space()])] : ...and not having any non-empty text node child.
normalize-space() : get all text nodes from the current context element, concatenated, with consecutive whitespace normalized to a single space
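As a quick sanity check (a standalone sketch, using a compacted version of the question's sample markup), the predicate matches only the all-strong paragraph:

from lxml import etree

tree = etree.HTML("""
<p><strong>This is </strong><strong>a line.</strong></p>
<p>2. <strong>But do not </strong><strong>touch this</strong></p>
""")
matched = tree.xpath("//p[strong][not(*[not(self::strong)])][not(text()[normalize-space()])]")
print(len(matched))  # 1 -- only the first <p> qualifies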

Python to get onclick values

I'm using Python and BeautifulSoup to scrape a web page for a small project of mine. The web page has multiple entries, each separated by a table row in HTML. My code partially works; however, a lot of the output is blank, and it won't fetch all of the results from the web page or even gather them onto the same line.
<html>
<head>
<title>Sample Website</title>
</head>
<body>
<table>
<tr><td class=channel>Artist</td><td class=channel>Title</td><td class=channel>Date</td><td class=channel>Time</td></tr>
<tr><td>35</td><td><a href="#" onclick="searchDB('LoremIpsum','FooWorld')">Lorem Ipsum</a></td><td>FooWorld</td><td>12/10/2014</td><td>2:53:17 PM</td></tr>
</table>
</body>
</html>
I want to only extract the values from the onclick action 'searchDB', so for example 'LoremIpsum' and 'FooWorld' are the only two results that I want.
Here is the code that I've written. So far, it properly pulls some of the right values, but sometimes the values are empty.
import re
import urllib2
import bs4

response = urllib2.urlopen(url)
html = response.read()
soup = bs4.BeautifulSoup(html)
properties = soup.findAll('a', onclick=True)
for eachproperty in properties:
    print re.findall("'([a-zA-Z0-9]*)'", eachproperty['onclick'])
What am I doing wrong?
Try it like this:
>>> import re
>>> for x in soup.find_all('a'):  # this will give you all the a tags
...     try:
...         if re.match('searchDB', x['onclick']):  # if the onclick attribute exists, match it against searchDB
...             print x['onclick']  # here you can do your stuff instead of print
...     except KeyError:
...         pass
...
searchDB('LoremIpsum','FooWorld')
Instead of printing, you can save it to a variable:
>>> k = x['onclick']
>>> re.findall("'(\w+)'",k)
['LoremIpsum', 'FooWorld']
\w is equivalent to [a-zA-Z0-9_]
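Putting the two steps together (a sketch; it assumes every onclick value has the form searchDB('...','...'), with no space after the comma):

>>> for x in soup.find_all('a', onclick=True):
...     m = re.match(r"searchDB\('(\w+)','(\w+)'\)", x['onclick'])
...     if m:
...         artist, title = m.groups()
...         print artist, title
...
LoremIpsum FooWorld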
Try this:
rows = soup.find_all('tr')  # the rows need to be gathered first; presumably something like this
for row in rows[1:]:
    cols = row.findAll('td')
    link = cols[1].find('a').get('onclick')

Reading web pages with Python

I'm trying to read and handle a web-page in Python which has lines like the following in it:
<div class="or_q_tagcloud" id="tag1611"></div></td></tr><tr><td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td><td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td><td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td><td class="or_q_tags_td">
I'm currently only interested in the artist name (AC/DC) and album name (Live). I can read and print them with libxml2dom but I can't figure out how I can distinguish between the links because the node value for every link is None.
One obvious way would be to read the input a line at a time, but is there a cleverer way of handling this HTML file, so that I can create either two separate lists whose indexes match each other, or a struct with this info?
import urllib
import sgmllib
import libxml2dom

def collect_text(node):
    "A function which collects text inside 'node', returning that text."
    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            s += collect_text(child_node)
    return s

f = urllib.urlopen("/home/x/Documents/rym_list.html")
s = f.read()
doc = libxml2dom.parseString(s, html=1)
links = doc.getElementsByTagName("a")
for link in links:
    print "--\nNode ", link.childNodes
    if link.localName == "artist":
        print "artist"
        print collect_text(link).encode('utf-8')
f.close()
Given the small snippet of HTML, I've no idea whether this would be effective on the full page, but here's how to extract 'AC/DC' and 'Live' using lxml.etree and XPath.
>>> from lxml import etree
>>> doc = etree.HTML("""<html>
... <head></head>
... <body>
... <tr>
... <td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
... <td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
... <td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td>
... <td class="or_q_tags_td">
... </tr>
... </body>
... </html>
... """)
>>> doc.xpath('//td[@class="or_q_artist"]/a/text()|//td[@class="or_q_album"]/a/text()')
['AC/DC', 'Live']
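To get the index-matched lists the question asks about, one option (a sketch, assuming every row has both an artist and an album cell) is to query the two classes separately and zip the results; xpath() preserves document order, so the nth artist lines up with the nth album:

>>> artists = doc.xpath('//td[@class="or_q_artist"]/a/text()')
>>> albums = doc.xpath('//td[@class="or_q_album"]/a/text()')
>>> zip(artists, albums)
[('AC/DC', 'Live')]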
See if you can solve the problem in javascript using jQuery style DOM/CSS selectors to get at the elements/text that you want.
If you can, then get a copy of BeautifulSoup for Python and you should be good to go in a matter of minutes.
