Can't parse a table child with xpath - python

I'm parsing a site with some messy HTML. There are 130 subpages, and the only one that fails is the last one. The part that fails is the bolded line: I get an empty list when I should be getting 3 elements (the parent and 2 children). All the pages have the same structure, so I don't have a clue how to solve this.
from lxml.html import parse
# get a list of the urls of the foods to parse
main_site = "http://www.whfoods.com/foodstoc.php"
doc = parse(main_site).getroot()
doc.make_links_absolute()
sites = doc.xpath('/html/body//div[@class="full3col"]/ul/li/a/@href')
for site in sites:
    doc = parse(site).getroot()
    **table = doc.xpath("descendant::table[1]")[0]**
    # food info list
    table.xpath("//tr/td/table/tr/td/b/text()")
    # food nutrients list
    table.xpath("//tr/td/table[1]/tr/td/text()")
This is an HTML excerpt of the page that fails:
<html>
<head>
<body>
<div id=mainpage">
<div id="subcontent">
(40+ <p> tags with things inside)
<p>
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<b>Food's name<br>other things</b>
</td>
</tr>
<tr>
Heads of the table(not needed)
</tr>
<tr>
<td>nutrient name</td>
<td>dv</td>
<td>density</td>
<td>rating</td>
</tr>
</tbody>
</table>
<table> Not needed
...
All remaining closing tags

According to validator.w3.org when pointed at http://www.whfoods.com/genpage.php?tname=foodspice&dbid=97:
Line 253, column 147: non SGML character number 150
…ed mushrooms by Liquid Chromatography Mass Spectroscopy. The 230th ACS Natio…
The problem character is between "Chromatography" and "Mass". The page is declared to be encoded in ISO-8859-1, but as often happens in that case, it is lying:
>>> import unicodedata as ucd
>>> ucd.name(chr(150).decode('cp1252'))
'EN DASH'
Perhaps lxml is being picky about this also (Firefox doesn't care).
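A Python 3 version of the same check, as a minimal sketch: it decodes byte 150 both ways to show why a page declared as ISO-8859-1 can trip a strict parser. The sample byte string is made up for illustration (only the words around the problem byte come from the validator output), and whether lxml actually fails for this reason is still a guess.

```python
import unicodedata

# Hypothetical sample: byte 150 (0x96) sits between the two words,
# as reported by the validator.
raw = b"Liquid Chromatography \x96 Mass Spectroscopy"
pos = raw.index(b"\x96")

# Decoded as ISO-8859-1, byte 0x96 becomes the C1 control character U+0096.
as_latin1 = raw.decode("iso-8859-1")
# Decoded as Windows-1252, the same byte is the punctuation the author meant.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1[pos]), unicodedata.category(as_latin1[pos]))
print(repr(as_cp1252[pos]), unicodedata.name(as_cp1252[pos]))
```

If this is the cause, decoding the downloaded bytes as cp1252 yourself before handing the text to the parser would sidestep the declared encoding.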

Related

Use beautifulSoup to find a table after a header?

I am trying to scrape some data off a website. The data that I want is listed in a table, but there are multiple tables and no IDs. I then had the idea of finding the header just above the table I am looking for and using that as an indicator.
This has really troubled me, so as a last resort, I wanted to ask if someone knows how to use BeautifulSoup to find the table.
A snippet of the HTML code is provided below, thanks in advance :)
The table I am interested in is the table right beneath <h2>Mine neaste vagter</h2>:
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>
Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() and index the result with [-1] to get the last table:
print(soup.find_all("table", class_="list")[-1])
Find the h2 element by its text, and then use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
You can use :-soup-contains (or just :contains) to target the <h2> by its text and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.
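Once the table is located, its rows can be read cell by cell. The sketch below uses a trimmed-down copy of the HTML from the question (the arrow in the time range is replaced by a plain hyphen) and the built-in html.parser; the rows name is just for illustration.

```python
from bs4 import BeautifulSoup

html = """
<h2>Masker</h2>
<table class='list'><tr><td>07:00</td><td>11:00</td></tr></table>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr><th>Dato</th><th>Rolle</th><th>Tidsrum</th></tr>
<tr><td><h3>12/6</h3></td><td>Kundeservice</td><td>18:00 - 21:30</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its exact text, then take the first table after it.
heading = soup.find("h2", string="Mine neaste vagter")
table = heading.find_next("table") if heading is not None else None

rows = []
if table is not None:
    for tr in table.find_all("tr"):
        # Header cells and data cells are collected the same way.
        rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])

print(rows)
```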

Extracting string from an HTML table from a given tab using Python

I need to extract a string value from the HTML table below. I want to loop through the table from a particular tab, and copy the results horizontally into the command line or some file.
I am pasting only one row of information here.
This table gets updated based on the changes happening on the Gerrits.
The result that I want is all the Gerrit numbers under a given tab.
For example, if I want to get the Gerrit list from the approval queue, the values should be displayed like this:
7897423, 2423343, 34242342, 34234, 57575675
<ul>
<li><span>Review Queue</span></li>
<li><span>Approval Queue</span></li>
<li><span>Verification Queue</span></li>
<li><span>Merge Queue</span></li>
<li><span>Open Queue</span></li>
<li><span>Failed verification</span></li>
</ul>
<div id="tab1">
<h1>Review Queue</h1>
<table class="tablesorter" id="dashboardTable">
<thead>
<tr>
<th></th>
<th>Gerrit</th>
<th>Owner</th>
<th>CR(s)</th>
<th>Project</th>
<th>Dev Branch/PL</th>
<th>Subject</th>
<th>Status</th>
<th>Days in Queue</th>
</tr>
</thead>
<tbody>
<tr>
<td><input type="checkbox" /></td>
<td> 1696771 </td>
<td> ponga </td>
<td> 1055680 </td>
<td>platform/hardware/kiosk/</td>
<td> hidden-userspace.aix.2.0.dev </td>
<td>display: information regarding display</td>
<td> some info here </td>
<td> 2 </td>
</tr>
What stops you from leveraging BeautifulSoup for this?
Let's say you have already read the HTML (using sgmllib or any other library) into a string variable named html_doc.
Since you did not mention which column you want to get data from, I am extracting the Gerrit number column.
You can simply do:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
for tr in soup.tbody:
    # skip anything that is not a <tr> tag (e.g. whitespace between rows)
    if tr.name == 'tr':
        # contents[3] is the second <td> (Gerrit); contents[1] assumes the
        # cell wraps the number in a child tag such as <a>
        print(tr.contents[3].contents[1].string)
The loop above visits every tr tag inside the tbody and (assuming all the td tags inside each tr have the same structure) extracts the value, in this case the value of the a tag inside.
Read the quick start, it will make your life easier on parsing HTML documents.
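To pick the table under one particular tab, one option is to find the <h1> with the queue name and take the next table after it. This is only a sketch: the second tab and its Gerrit numbers are invented to match the expected output in the question, and gerrit_numbers is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="tab1"><h1>Review Queue</h1>
<table><thead><tr><th></th><th>Gerrit</th><th>Owner</th></tr></thead>
<tbody>
<tr><td><input type="checkbox" /></td><td> 1696771 </td><td> ponga </td></tr>
</tbody></table></div>
<div id="tab2"><h1>Approval Queue</h1>
<table><thead><tr><th></th><th>Gerrit</th><th>Owner</th></tr></thead>
<tbody>
<tr><td><input type="checkbox" /></td><td> 7897423 </td><td> ponga </td></tr>
<tr><td><input type="checkbox" /></td><td> 2423343 </td><td> ponga </td></tr>
</tbody></table></div>
"""


def gerrit_numbers(soup, queue_name):
    """Return the 'Gerrit' column of the table that follows the given <h1>."""
    heading = soup.find("h1", string=queue_name)
    if heading is None:
        return []
    table = heading.find_next("table")
    if table is None:
        return []
    # Look the column up by its header text instead of hard-coding an index.
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    if "Gerrit" not in headers:
        return []
    col = headers.index("Gerrit")
    numbers = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) > col:  # the header row has no <td> cells and is skipped
            numbers.append(cells[col].get_text(strip=True))
    return numbers


soup = BeautifulSoup(html_doc, "html.parser")
print(", ".join(gerrit_numbers(soup, "Approval Queue")))
```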

Pull out information from between tags without unique classes or IDs

I'm trying to scrape the content of a particular website and render the output so that it can be further manipulated / used in other mediums. The biggest challenge I'm facing is that few of the tags have unique IDs or classes, and some of the content is simply displayed in between tags, e.g., <br></br>TEXT<br></br> (see, for example, "Ranking" in the sample HTML below).
Somehow, I've created working code - even if commensurate with the skill of someone in the fourth grade - but this is the furthest I've gotten, and I was hoping to get some help on how to continue to pull out the relevant information. Ultimately, I'm looking to pull any plain text within tags, and plain text in between tags. The only exception is that, whenever there's an img of word_icon_b.gif or word_icon_R.gif, then text "active" or "inactive" gets substituted.
Below is the code I've managed to cobble together:
from bs4 import BeautifulSoup
import urllib2
pageFile = urllib2.urlopen("http://www.url.com")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
table = soup.findAll("td",{"class": ["row_col_1", "row_col_2"]})
print table
There are several tables in the page, but none have a unique ID or class, so I didn't reference them, and instead just pulled out the TDs - since these are the only tags with unique classes which sandwich the information I'm looking for. The HTML that the above pulls from is as follows:
<tr>
<td class="row_col_1" valign="top" nowrap="" scope="row">
5/22/2014
</td>
<td class="row_col_1" valign="top" scope="row">
100%
</td>
<td class="row_col_1" valign="top" nowrap="" scope="row">
<a target="_top" href="/NY/2014/N252726.DOC">
<img width="18" height="14" border="0" alt="Click here to download word document n252726.doc" src="images/word_icon_b.gif"></img>
</a>
<a target="_top" href="index.asp?ru=n252726&qu=ranking&vw=detail">
NY N252726
</a>
<br></br>
Ranking
<br></br>
<a href="javascript:disclaimer('EU Regulatory Body','h…cripts/listing_current.asp?Phase=List_items&lookfor=847720');">
8477.20.mnop
</a>
<br></br>
<a target="_new" href="http://dataweb.url.com/scripts/ranking_current.asp?Phase=List_items&lookfor=847759">
8477.59.abcd
</a>
<br></br>
</td>
<td class="row_col_1" valign="top" scope="row">
The ranking of a long-fanged monkey sock puppet that is coding-ly challenged
</td>
<td class="row_col_1" valign="top" nowrap="" scope="row">
</td>
</tr>
The reason why I have ["row_col_1", "row_col_2"] is because the data served up is presented as <td class="row_col_1" valign="top" nowrap="" scope="row"> for the odd rows, and <td class="row_col_2" valign="top" nowrap="" scope="row"> for the even rows. I have no control over the HTML that I'm attempting to pull from.
Also, the base links, such as javascript:disclaimer('EU Regulatory Body','h…cripts/listing_current.asp? and http://dataweb.url.com/scripts/ranking_current.asp?Phase=List_items& will always remain the same (though the specific links will change, e.g., current.asp?Phase=List_items&lookfor=847759" may next be on the next page as current.asp?Phase=List_items&lookfor=101010">).
EDIT: @Martijn: I'm hoping to have the following items returned to me from the HTML: 1) 5/22/2014, 2) 100%, 3) the image name, word_icon_b.gif (to substitute text for it) 4) NY N252726 (and the preceding link), 5) Ranking, 6) 8477.20.mnop (and the preceding link), 7) 8477.59.abcd (and the preceding link), and 8) 'The ranking of a long-fanged monkey sock puppet that is coding-ly challenged.'
I'd like the output to be wrapped in XML tags, but this is not excessively important; I imagine these tags can just be inserted into the bs4 code.
If you would like to try the lxml library and XPath, this is a hint of what your code might look like. You should probably make a better selection of the desired <tr>s than I did without seeing the HTML code. You should also handle any potential IndexErrors, etc.
import urllib2
from lxml import html
pageFile = urllib2.urlopen("http://www.url.com")
pageHtml = pageFile.read()
pageFile.close()
x = html.fromstring(pageHtml)
all_rows = x.xpath(".//tr")
results = []
for row in all_rows:
    # each [0] raises IndexError if the row has no matching node
    date = row.xpath(".//td[contains(@class, 'row_col')]/text()")[0]
    location = row.xpath(".//a[contains(@href, 'index.asp')]/text()")[0]
    rank = row.xpath(".//a[contains(@href, 'javascript:disclaimer(')]/text()")[0]
    results.append({'date': date, 'location': location, 'rank': rank})
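For the text that sits between tags, BeautifulSoup's stripped_strings may be enough. A rough sketch, assuming word_icon_b.gif means "active" and word_icon_R.gif means "inactive" (the question does not say which is which), run on a shortened version of the sample row with a placeholder href:

```python
from bs4 import BeautifulSoup

# Assumed mapping; swap the values if the icons mean the opposite.
ICON_TEXT = {"word_icon_b.gif": "active", "word_icon_R.gif": "inactive"}

html = """
<table><tr>
<td class="row_col_1">5/22/2014</td>
<td class="row_col_1">
<a href="/NY/2014/N252726.DOC"><img src="images/word_icon_b.gif"></a>
<a href="index.asp?ru=n252726">NY N252726</a>
<br>Ranking<br>
<a href="#placeholder">8477.20.mnop</a>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Replace every icon with its text so it shows up among the strings.
for img in soup.find_all("img"):
    file_name = img.get("src", "").rsplit("/", 1)[-1]
    img.replace_with(ICON_TEXT.get(file_name, file_name))

cells = []
for td in soup.find_all("td", class_=["row_col_1", "row_col_2"]):
    # stripped_strings yields text inside child tags and text between tags,
    # in document order, with surrounding whitespace removed.
    cells.append(list(td.stripped_strings))

print(cells)
```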

Add parent tags with beautiful soup

I have many pages of HTML with various sections containing these code snippets:
<div class="footnote" id="footnote-1">
<h3>Reference:</h3>
<table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
<tr>
<td valign="top" width="20px">
1.
</td>
<td>
<p> blah </p>
</td>
</tr>
</table>
</div>
I can parse the HTML successfully and extract these relevant tags
tags = soup.find_all(attrs={"footnote"})
Now I need to add new parent tags around these, such that the code snippet becomes:
<div class="footnote-out"><CODE></div>
But I can't find a way of adding parent tags in bs4 such that they brace the identified tags. insert()/insert_before add in after the identified tags.
I started by trying string manipulation:
for tags in soup.find_all(attrs={"footnote"}):
tags = BeautifulSoup("""<div class="footnote-out">"""+str(tags)+("</div>"))
but I believe this isn't the best course.
Thanks for any help. Just started using bs/bs4 but can't seem to crack this.
How about this:
def wrap(to_wrap, wrap_in):
    # put wrap_in where to_wrap was, then move to_wrap inside it
    contents = to_wrap.replace_with(wrap_in)
    wrap_in.append(contents)
Simple example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<body><a>Some text</a></body>")
wrap(soup.a, soup.new_tag("b"))
print(soup.body)
# <body><b><a>Some text</a></b></body>
Example with your document:
for footnote in soup.find_all("div", "footnote"):
    new_tag = soup.new_tag("div")
    new_tag['class'] = 'footnote-out'
    wrap(footnote, new_tag)
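Recent versions of bs4 also ship a wrap() method on tags that does the same thing as the helper above. A short sketch on a cut-down footnote, using html.parser:

```python
from bs4 import BeautifulSoup

html = '<div class="footnote" id="footnote-1"><h3>Reference:</h3></div>'
soup = BeautifulSoup(html, "html.parser")

for footnote in soup.find_all("div", class_="footnote"):
    wrapper = soup.new_tag("div")
    wrapper["class"] = "footnote-out"
    # wrap() puts the wrapper where the footnote was and moves the footnote inside it.
    footnote.wrap(wrapper)

print(soup)
```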

Using beautifulsoup, how do I reference table rows in html page

I have an HTML page that looks like:
<html>
..
<form post="/products.hmlt" ..>
..
<table ...>
<tr>...</tr>
<tr>
<td>part info</td>
..
</tr>
</table>
..
</form>
..
</html>
I tried:
form = soup.findAll('form')
table = form.findAll('table') # table inside form
But I get an error saying:
ResultSet object has no attribute 'findAll'
I guess the call to findAll doesn't return a 'beautifulsoup' object? What can I do then?
Update
There are many tables on this page, but only 1 table INSIDE the tag shown above.
findAll returns a list, so extract the element first:
form = soup.findAll('form')[0]
table = form.findAll('table')[0] # table inside form
Of course, you should do some error checking (i.e. make sure it's not empty) before indexing into the list.
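One way to do that error checking is to use find(), which returns a single element or None rather than a list. A minimal sketch with made-up HTML in the same shape as the page in the question, written against bs4 and html.parser:

```python
from bs4 import BeautifulSoup

html = """<table><tr><td>outside</td></tr></table>
<form><table><tr><td>part info</td></tr></table></form>"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match or None, so each step can be checked.
form = soup.find("form")
table = form.find("table") if form is not None else None

if table is None:
    cells = []
else:
    cells = [td.get_text(strip=True) for td in table.find_all("td")]

print(cells)
```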
I like ars's answer, and certainly agree w/ the need for error-checking;
especially if this is going to be used in any kind of production code.
Here's perhaps a more verbose / explicit way of finding the data you seek:
from BeautifulSoup import BeautifulSoup as bs
html = '''<html><body><table><tr><td>some text</td></tr></table>
<form><table><tr><td>some text we care about</td></tr>
<tr><td>more text we care about</td></tr>
</table></form></html></body>'''
soup = bs(html)
for tr in soup.form.findAll('tr'):
    print tr.text
# output:
# some text we care about
# more text we care about
For reference here is the cleaned-up HTML:
>>> print soup.prettify()
<html>
 <body>
  <table>
   <tr>
    <td>
     some text
    </td>
   </tr>
  </table>
  <form>
   <table>
    <tr>
     <td>
      some text we care about
     </td>
    </tr>
    <tr>
     <td>
      more text we care about
     </td>
    </tr>
   </table>
  </form>
 </body>
</html>
