Reading web pages with Python

I'm trying to read and handle a web-page in Python which has lines like the following in it:
<div class="or_q_tagcloud" id="tag1611"></div></td></tr><tr><td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td><td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td><td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td><td class="or_q_tags_td">
I'm currently only interested in the artist name (AC/DC) and the album name (Live). I can read and print them with libxml2dom, but I can't figure out how to distinguish between the links, because the node value for every link is None.
One obvious way would be to read the input a line at a time, but is there a cleverer way of handling this HTML file, so that I can create either two separate lists where each index matches the other, or a struct with this info?
import urllib
import libxml2dom

def collect_text(node):
    "A function which collects text inside 'node', returning that text."
    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            s += collect_text(child_node)
    return s

f = urllib.urlopen("/home/x/Documents/rym_list.html")
s = f.read()
doc = libxml2dom.parseString(s, html=1)

links = doc.getElementsByTagName("a")
for artist in links:
    print "--\nNode ", artist.childNodes
    if artist.localName == "artist":
        print "artist"
        print collect_text(artist).encode('utf-8')
f.close()

Given the small snippet of HTML, I've no idea whether this would be effective on the full page, but here's how to extract 'AC/DC' and 'Live' using lxml.etree and XPath.
>>> from lxml import etree
>>> doc = etree.HTML("""<html>
... <head></head>
... <body>
... <tr>
... <td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
... <td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
... <td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td>
... <td class="or_q_tags_td">
... </tr>
... </body>
... </html>
... """)
>>> doc.xpath('//td[@class="or_q_artist"]/a/text()|//td[@class="or_q_album"]/a/text()')
['AC/DC', 'Live']

See if you can solve the problem in JavaScript using jQuery-style DOM/CSS selectors to get at the elements/text that you want.
If you can, then get a copy of BeautifulSoup for Python and you should be good to go in a matter of minutes.
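For instance, with a modern BeautifulSoup (bs4 here; the original era used BeautifulSoup 3, and the class names are taken from the snippet in the question) you can select the artist and album links by CSS class and zip them into index-matched pairs:

```python
from bs4 import BeautifulSoup

html = '''<tr>
<td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
<td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
</tr>'''

soup = BeautifulSoup(html, 'html.parser')
# Pair artist and album link texts so each index matches the other.
pairs = list(zip([a.get_text() for a in soup.select('a.artist')],
                 [a.get_text() for a in soup.select('a.album')]))
print(pairs)  # [('AC/DC', 'Live')]
```

Since artist and album sit in the same table row, zipping the two selections keeps the indices aligned.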

Related

Python `beautifulsoup` extraction on urls lacking `class`, other attributes?

Quick question (I am not very familiar with Python's BeautifulSoup): if I have the following element,
how can I extract/get "1 comment" (or "2 comments", etc.)? There is no class (or id, or other attribute) on that "a" tag.
<td class="subtext">
<a href="...">1 comment</a>
</td>
How about the following? Tested with a local HTML file:
from bs4 import BeautifulSoup

url = "D:\\Temp\\example.html"
with open(url, "r") as page:
    contents = page.read()

soup = BeautifulSoup(contents, 'html.parser')
element = soup.select('td.subtext')
value = element[0].get_text()
print(value)
example.html
<html>
<head></head>
<body>
<td class="subtext">
1 comment
</td>
</body>
</html>
You can use the select method to apply a CSS selector (querySelector-style) to your HTML, and then take the contents of the elements you found:
elements = soup.select(".subtext a")
[x.contents for x in elements]

Extracting data from an inconsistent HTML page using BeautifulSoup4 and Python

I'm trying to extract data from this webpage and I'm having some trouble due to inconsistencies within the page's HTML formatting. I have a list of OGAP IDs and I want to extract the Gene Name and any literature information (PMID #) for each OGAP ID I iterate through. Thanks to other questions on here, and the BeautifulSoup documentation, I've been able to consistently get the gene name for each ID, but I'm having trouble with the literature part. Here are a couple of search terms that highlight the inconsistencies.
HTML sample that works
Search term: OG00131
<tr>
<td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
<br> PMID:
20068230
[CAD, ETD MS/MS]; <br>
<br>
</td>
</tr>
HTML sample that doesn't work
Search term: OG00020
<td align="top" bgcolor="#FBFFCC">
<div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28">
<div class="STYLE28">PMID:
16408927
[Azide-tag, nano-HPLC/tandem MS]
</div>
<br>
Site has not yet been determined. Use
OGlcNAcScan
to predict the O-GlcNAc site. </div>
</td>
Here's the code I have so far
import urllib2
from bs4 import BeautifulSoup

# Initialize variables
gene_list = []
literature = []

# Test list
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]

for i in range(len(gene_listID)):
    print gene_listID[i]
    # Specifies the URL, using "%" to sub in different OGAP IDs from the list provided
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i]
    # Opens the URL as a page
    page = urllib2.urlopen(dbOGAP)
    # Reads the page and parses it with the "lxml" parser
    soup = BeautifulSoup(page, "lxml")
    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text
    print gene_name[1:]
    gene_list.append(gene_name[1:])
    # PubMed IDs are located near the <td> tag with the term "Data and Source"
    pmid = soup.find("span", text="Data and Source")
    # Based on inspection of the website, need to move up to the parent <td> tag
    pmid_p = pmid.parent
    # Then move to the next <td> tag, its sibling (they share a parent <tr> (table row) tag)
    pmid_s = pmid_p.next_sibling
    # Now search down the tree for the next table data (<td>) tag
    pmid_c = pmid_s.find("td")
    temp_lit = []
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # Print out a list of URLs for each PubMed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # The <a> tag includes more than just the link address;
            # link.string is the hyperlinked string -- in this case, the PubMed ID.
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
    literature.append(temp_lit)
    print "\n"
So it seems the nested <div> elements are what's throwing the code for a loop. Is there a way to search for any element with the text "PMID" and return the text that comes after it (and the URL, if there is a PMID number)? If not, would I just want to check each child, looking for the text I'm interested in?
I'm using Python 2.7.10.
import requests
from bs4 import BeautifulSoup
import re

gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)

regex = re.compile(r'http://www.ncbi.nlm.nih.gov/pubmed/\d+')
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    a_tag = soup.find('a', href=regex)
    has_pmid = 'PMID' in a_tag.previous_element
    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")
out:
18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927
Not available
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation
Find the first a tag whose href matches the target URL (it ends with digits), then check whether 'PMID' appears in its previous element.
This site is quite inconsistent; I tried many approaches, and I hope this helps.
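The asker's other idea -- searching for any element whose text contains "PMID" and taking what follows -- also works directly. A sketch (Python 3 / bs4 spellings, against a snippet modelled on the question's second sample):

```python
import re
from bs4 import BeautifulSoup

html = '''<div class="STYLE28">PMID:
<a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927">16408927</a>
[Azide-tag, nano-HPLC/tandem MS]</div>'''

soup = BeautifulSoup(html, 'html.parser')
# Find the text node containing "PMID", then the first link after it.
marker = soup.find(string=re.compile('PMID'))
link = marker.find_next('a')
print(link.get_text(), link['href'])
```

Because the search keys on the text rather than the surrounding tags, it is insensitive to how deeply the PMID happens to be nested in `<div>`s on any given page.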

Python to get onclick values

I'm using Python and BeautifulSoup to scrape a web page for a small project of mine. The webpage has multiple entries, each separated by a table row in HTML. My code partially works however a lot of the output is blank and it won't fetch all of the results from the web page or even gather them into the same line.
<html>
<head>
<title>Sample Website</title>
</head>
<body>
<table>
<tr><td class=channel>Artist</td><td class=channel>Title</td><td class=channel>Date</td><td class=channel>Time</td></tr>
<tr><td>35</td><td><a href="#" onclick="searchDB('LoremIpsum','FooWorld')">Lorem Ipsum</a></td><td>FooWorld</td><td>12/10/2014</td><td>2:53:17 PM</td></tr>
</table>
</body>
</html>
I want to extract only the values from the onclick action searchDB, so for example 'LoremIpsum' and 'FooWorld' are the only two results that I want.
Here is the code that I've written. So far it properly pulls some of the right values, but sometimes the values are empty.
import re
import urllib2
import bs4

response = urllib2.urlopen(url)
html = response.read()
soup = bs4.BeautifulSoup(html)
properties = soup.findAll('a', onclick=True)
for eachproperty in properties:
    print re.findall("'([a-zA-Z0-9]*)'", eachproperty['onclick'])
What am I doing wrong?
Try it like this:
>>> import re
>>> for x in soup.find_all('a'):  # will give you every <a> tag
...     try:
...         if re.match('searchDB', x['onclick']):  # if the onclick attribute exists and matches searchDB, print it
...             print x['onclick']  # here you can do your stuff instead of print
...     except: pass
...
searchDB('LoremIpsum','FooWorld')
Instead of print, you can save it to a variable:
>>> k = x['onclick']
>>> re.findall("'(\w+)'", k)
['LoremIpsum', 'FooWorld']
\w is equivalent to [a-zA-Z0-9_] (note it also matches the underscore).
Try this:
rows = soup.findAll('tr')
for row in rows[1:]:
    cols = row.findAll('td')
    link = cols[1].find('a').get('onclick')

Get following node in different ancestor using lxml and xpath

I'm writing a text-to-speech program that reads math equations. I have a thread that needs to pull math equations (as MathJax SVG's) and parse them to prose.
Because of how the content is laid out, the math equations can be arbitrarily nested in other elements, like paragraphs, bolds, tables, etc.
Using a reference to the current element, how do I get the next <span class="MathJax_SVG">, which may be embedded in some other parent/ancestor?
I tried to solve it using the following:
nextMath = currentElement.xpath('following::.//span[@class=\'MathJax_SVG\']')
Returns nothing, even though I can confirm visually that there is something following it. I tried removing the period, but lxml complains that my XPath is malformed.
Have you guys run into this before?
P.S. Here is a test document to show my point:
<html>
<head>
<title>Test Document</title>
</head>
<body>
<h1 id="mainHeading">The Quadratic Formula</h1>
<p>The quadratic formula is used to solve quadratic equations. Here is the formula:</p>
<p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
<p>Here are some possible values when you use the formula:</p>
<p>
<table>
<tr>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
</tr>
<tr>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_4">removed the SVG</span></td>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_5">removed the SVG</span></td>
</tr>
</table>
</p>
</body>
</html>
Updates
Learned that lxml doesn't support absolute positions. This may be relevant.
Some Testing Code (assuming you saved HTML as test.html)
from lxml import html

# Get my html element
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the first MathJax element
start = myHtml.find('.//h1[@id=\'mainHeading\']')
print 'My start:', html.tostring(start)

# Get next math equation
nextXPath = 'following::.//span[@class=\'MathJax_SVG\']'
nextElem = start.xpath(nextXPath)
if len(nextElem) > 0:
    print 'Next equation:', html.tostring(nextElem[0])
else:
    print 'No next equation...'
Do you need to iterate through the document? You could also search for span elements of the class MathJax_SVG directly:
from lxml import etree
doc = etree.parse(open("test-document.html")).getroot()
maths = doc.xpath("//span[@class='MathJax_SVG']")
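xpath returns matches in document order, so on the test document above you would get all five spans in sequence. A minimal self-contained check (with the SVG contents stubbed out and only two equations for brevity):

```python
from lxml import etree

html = '''<html><body>
<p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">eq1</span></p>
<p><span class="MathJax_SVG" id="MathJax_Element_Frame_2">eq2</span></p>
</body></html>'''

doc = etree.HTML(html)
# All MathJax spans, in document order.
maths = doc.xpath("//span[@class='MathJax_SVG']")
print([m.get('id') for m in maths])  # ['MathJax_Element_Frame_1', 'MathJax_Element_Frame_2']
```

If you process the list front to back, "the next equation after the current one" is just the next index, with no tree navigation needed.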
I ended up creating my own function to get what I want. I called it getNext(elem, xpathString). If there is a more efficient way to do this, I'm all ears. I'm not confident in its performance.
from lxml import html

def getNext(elem, xpathString):
    '''
    Gets the next element defined by XPath. The element returned
    may be itself.
    '''
    myElem = elem
    nextElem = elem.find(xpathString)
    while nextElem is None:
        if myElem.getnext() is not None:
            myElem = myElem.getnext()
            nextElem = myElem.find(xpathString)
        else:
            if myElem.getparent() is not None:
                myElem = myElem.getparent()
            else:
                break
    return nextElem

# Get my html element
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the first MathJax element
start = myHtml.find('.//span[@id=\'MathJax_Element_Frame_1\']')
print 'My start:', html.tostring(start)

# Get next math equation
nextXPath = './/span[@class=\'MathJax_SVG\']'
nextElem = getNext(start, nextXPath)
if nextElem is not None:
    print 'Next equation:', html.tostring(nextElem)
else:
    print 'No next equation...'

Extract content within a tag with BeautifulSoup

I'd like to extract the content Hello world. Please note that there are multiples <table> and similar <td colspan="2"> on the page as well:
<table border="0" cellspacing="2" width="800">
<tr>
<td colspan="2"><b>Name: </b>Hello world</td>
</tr>
<tr>
...
I tried the following:
hello = soup.find(text='Name: ')
hello.findPreviousSiblings()
But it returned nothing.
In addition, I'm also having problem with the following extracting the My home address:
<td><b>Address:</b></td>
<td>My home address</td>
I'm also using the same method to search for the text="Address: " but how do I navigate down to the next line and extract the content of <td>?
The contents attribute works well for extracting text from <tag>text</tag>.
<td>My home address</td> example:
s = '<td>My home address</td>'
soup = BeautifulSoup(s)
td = soup.find('td')  # <td>My home address</td>
td.contents  # [u'My home address']
<td><b>Address:</b></td> example:
s = '<td><b>Address:</b></td>'
soup = BeautifulSoup(s)
td = soup.find('td').find('b')  # <b>Address:</b>
td.contents  # [u'Address:']
Use next instead:
>>> s = '<table border="0" cellspacing="2" width="800"><tr><td colspan="2"><b>Name: </b>Hello world</td></tr><tr>'
>>> soup = BeautifulSoup(s)
>>> hello = soup.find(text='Name: ')
>>> hello.next
u'Hello world'
next and previous let you move through the document elements in the order they were processed by the parser, while the sibling methods work with the parse tree.
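The difference is easy to see on the question's markup. A sketch using the current bs4 spellings next_element and next_sibling (older BeautifulSoup called them next and nextSibling):

```python
from bs4 import BeautifulSoup

s = '<td colspan="2"><b>Name: </b>Hello world</td>'
soup = BeautifulSoup(s, 'html.parser')
label = soup.find(string='Name: ')

# Parse order steps out of <b> to the text that follows it...
print(label.next_element)   # Hello world
# ...but inside <b>, the text node has no sibling at all.
print(label.next_sibling)   # None
```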
Use the code below to extract text and content from HTML tags with Python BeautifulSoup:
s = '<td>Example information</td>'  # your raw html
soup = BeautifulSoup(s)  # parse html with BeautifulSoup
td = soup.find('td')  # tag of interest: <td>Example information</td>
td.text  # u'Example information' -- clean text from html
from bs4 import BeautifulSoup, Tag

def get_tag_html(tag: Tag):
    return ''.join([i.decode() if type(i) is Tag else i for i in tag.contents])
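Used like this (redefining the helper so the snippet runs on its own; Tag.decode() renders a child tag back to markup, while bare text nodes pass through unchanged):

```python
from bs4 import BeautifulSoup, Tag

def get_tag_html(tag: Tag):
    # Inner tags are re-serialized to markup; text nodes are kept as-is.
    return ''.join([i.decode() if type(i) is Tag else i for i in tag.contents])

soup = BeautifulSoup('<td colspan="2"><b>Name: </b>Hello world</td>', 'html.parser')
print(get_tag_html(soup.find('td')))  # <b>Name: </b>Hello world
```

Unlike get_text(), this keeps the inner markup, which is useful when you want the tag's HTML content rather than just its text.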
