Using BeautifulSoup, how do I reference table rows in an HTML page - python

I have a html page that looks like:
<html>
..
<form post="/products.hmlt" ..>
..
<table ...>
<tr>...</tr>
<tr>
<td>part info</td>
..
</tr>
</table>
..
</form>
..
</html>
I tried:
form = soup.findAll('form')
table = form.findAll('table') # table inside form
But I get an error saying:
ResultSet object has no attribute 'findAll'
I guess the call to findAll doesn't return a BeautifulSoup object? What can I do then?
Update
There are many tables on this page, but only 1 table INSIDE the form tag shown above.

findAll returns a list, so extract the element first:
form = soup.findAll('form')[0]
table = form.findAll('table')[0] # table inside form
Of course, you should do some error checking (i.e. make sure it's not empty) before indexing into the list.
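As a rough illustration of that error checking, here is a minimal sketch. It assumes the newer bs4 package rather than BeautifulSoup 3, and the sample HTML and the helper name rows_in_form_table are made up for the example. It relies on find() returning None when nothing matches, which avoids indexing into an empty list:

```python
from bs4 import BeautifulSoup

# Made-up sample page: one table outside the form, one inside it.
html = """<html><body>
<table><tr><td>outside the form</td></tr></table>
<form action="/products.html">
<table>
<tr><td>header</td></tr>
<tr><td>part info</td></tr>
</table>
</form>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")


def rows_in_form_table(soup):
    """Return the tr tags of the first table inside the first form, or []."""
    form = soup.find("form")  # find() returns None when nothing matches
    if form is None:
        return []
    table = form.find("table")  # searches only inside the form
    if table is None:
        return []
    return table.find_all("tr")


rows = rows_in_form_table(soup)
cells = [row.get_text(strip=True) for row in rows]
print(cells)
```

The table outside the form is never visited because the second search starts from the form tag, not from the whole document.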

I like ars's answer, and certainly agree w/ the need for error-checking;
especially if this is going to be used in any kind of production code.
Here's perhaps a more verbose / explicit way of finding the data you seek:
from BeautifulSoup import BeautifulSoup as bs
html = '''<html><body><table><tr><td>some text</td></tr></table>
<form><table><tr><td>some text we care about</td></tr>
<tr><td>more text we care about</td></tr>
</table></form></html></body>'''
soup = bs(html)
for tr in soup.form.findAll('tr'):
    print tr.text
# output:
# some text we care about
# more text we care about
For reference here is the cleaned-up HTML:
>>> print soup.prettify()
<html>
<body>
<table>
<tr>
<td>
some text
</td>
</tr>
</table>
<form>
<table>
<tr>
<td>
some text we care about
</td>
</tr>
<tr>
<td>
more text we care about
</td>
</tr>
</table>
</form>
</body>
</html>

Related

How to extract HTML table following a specific heading?

I am using BeautifulSoup to parse HTML files. I have a HTML file similar to this:
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
I want to extract the string "I WANT THIS STRING". The perfect solution would be to get the first table following the h3 heading called "THE GOOD STUFF". I have no idea how to do this with BeautifulSoup - I only know how to extract a table with a specific class, or a table nested within some particular tag, but not following a particular tag.
I think a fallback solution could make use of the string "Key C", assuming it's unique (it almost certainly is) and appears in only that one table, but I'd feel better with going for the specific h3 heading.
Following the logic of @Zroq's answer on another question, this code will give you the table following your defined header ("THE GOOD STUFF"). Please note I just put all your html in the variable called "html".
import re
from bs4 import BeautifulSoup, NavigableString, Tag
soup = BeautifulSoup(html, "lxml")
for header in soup.find_all('h3', text=re.compile('THE GOOD STUFF')):
    nextNode = header
    while True:
        # walk forward through the siblings of the matching header
        nextNode = nextNode.next_sibling
        if nextNode is None:
            break
        if isinstance(nextNode, Tag):
            # stop as soon as the next h3 heading starts
            if nextNode.name == "h3":
                break
            print(nextNode)
Output:
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
Cheers!
The docs explain that if you don't want to use find_all, you can do this:
for sibling in soup.a.next_siblings:
    print(repr(sibling))
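Along the same lines, bs4 also has find_next_sibling(), which skips the whitespace text nodes for you. The following is only a sketch: it assumes the h3 and the table are siblings, as in the question's HTML, and uses a shortened copy of that HTML with html.parser so that lxml is not required:

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question.
html = """
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key B</td></tr><tr><td>A value I don't want</td></tr></table>
<h3>THE GOOD STUFF</h3>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key A</td></tr><tr><td>A value I don't want</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its exact text, then take the first table sibling after it.
header = soup.find("h3", string="THE GOOD STUFF")
table = header.find_next_sibling("table") if header is not None else None

wanted = None
if table is not None:
    cells = table.find_all("td")
    if cells:
        wanted = cells[-1].get_text(strip=True)  # last cell of that table

print(wanted)
```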
I am sure there are many ways to do this more efficiently, but here is what I can think of right now:
from bs4 import BeautifulSoup
import os
os.chdir('/Users/Downloads/')
html_data = open("/Users/Downloads/train.html",'r').read()
soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")
flag = 'no_print'
for td in all_td:
    if flag == 'print':
        print(td.text)
        break
    if td.text == 'Key C':
        flag = 'print'
Output:
I WANT THIS STRING
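The same "find the key, then take the next cell" idea can probably be written without the flag variable by using find_next(). This is a sketch under the assumption that the key text is unique and exactly matches the cell content; value_after_key is a hypothetical helper name and the HTML is a shortened copy of the question's:

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question.
html = """
<table class="foo"><tr><td>Key B</td></tr><tr><td>A value I don't want</td></tr></table>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>
"""


def value_after_key(html, key):
    """Return the text of the first td that follows the td containing `key`."""
    soup = BeautifulSoup(html, "html.parser")
    key_cell = soup.find("td", string=key)  # exact match on the cell text
    if key_cell is None:
        return None
    value_cell = key_cell.find_next("td")  # next td in document order
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


print(value_after_key(html, "Key C"))
```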

BeautifulSoup to find text inside the table

Here's what my HTML looks like:
<head> ... </head>
<body>
<div>
<h2>Something really cool here<h2>
<div class="section mylist">
<table id="list_1" class="table">
<thead> ... not important <thead>
<tr id="blahblah1"> <td> ... </td> </tr>
<tr id="blah2"> <td> ... </td> </tr>
<tr id="bl3"> <td> ... </td> </tr>
</table>
</div>
</div>
</body>
Now there are four occurrences of this div in my HTML file; each table's content is different and each h2 text is different. Everything else is relatively the same. What I've been able to do so far is extract the parent of each h2. However, I am not sure how to then extract each tr, from which I can extract the td that I really need.
Here is the code I've written so far...
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myhtml.html'), 'html.parser')
currently_watching = soup.find('h2', text='Something really cool here')
parent = currently_watching.parent
I would suggest finding the parent div, which actually encloses the table, and then search for all td tags. Here's how you'd do it:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('myhtml.html'), 'lxml')
div = soup.find('div', class_='section mylist')
for td in div.find_all('td'):
    print(td.text)
Searched around a bit and realized that it was my parser that was causing the issue. I installed lxml and everything works fine now.
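Since the page has four such sections, it may help to start from the h2 text and then walk row by row. The sketch below is an assumption-laden illustration: the sample HTML is invented, its h2 tags are properly closed (unlike the excerpt above), and rows_for_heading is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Invented sample with two sections; h2 tags are closed properly here.
html = """
<div>
  <h2>Something really cool here</h2>
  <div class="section mylist">
    <table id="list_1" class="table">
      <tr id="blahblah1"><td>first</td></tr>
      <tr id="blah2"><td>second</td></tr>
    </table>
  </div>
</div>
<div>
  <h2>Another heading</h2>
  <div class="section mylist">
    <table id="list_2" class="table">
      <tr id="x1"><td>other</td></tr>
    </table>
  </div>
</div>
"""


def rows_for_heading(html, heading_text):
    """Map each tr id to its list of td texts, for the section under one h2."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h2", string=heading_text)
    if heading is None:
        return {}
    # class_="section" matches a div whose class list contains "section".
    section = heading.find_next_sibling("div", class_="section")
    if section is None:
        return {}
    return {
        tr.get("id"): [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in section.find_all("tr")
    }


print(rows_for_heading(html, "Something really cool here"))
```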
Why is BeautifulSoup not finding a specific table class?

Extracting string from an HTML table from a given tab using Python

I need to extract a string value from the HTML table below. I want to loop through the table from a particular tab, and copy the results horizontally into the command line or some file.
I am pasting only one row of information here.
This table gets updated based on the changes happening on the Gerrits.
The result that I want is all the Gerrit numbers under a given tab.
For example, if I want to get the Gerrit list from the approval queue, the values should display as shown below.
7897423, 2423343, 34242342, 34234, 57575675
<ul>
<li><span>Review Queue</span></li>
<li><span>Approval Queue</span></li>
<li><span>Verification Queue</span></li>
<li><span>Merge Queue</span></li>
<li><span>Open Queue</span></li>
<li><span>Failed verification</span></li>
</ul>
<div id="tab1">
<h1>Review Queue</h1>
<table class="tablesorter" id="dashboardTable">
<thead>
<tr>
<th></th>
<th>Gerrit</th>
<th>Owner</th>
<th>CR(s)</th>
<th>Project</th>
<th>Dev Branch/PL</th>
<th>Subject</th>
<th>Status</th>
<th>Days in Queue</th>
</tr>
</thead>
<tbody>
<tr>
<td><input type="checkbox" /></td>
<td> 1696771 </td>
<td> ponga </td>
<td> 1055680 </td>
<td>platform/hardware/kiosk/</td>
<td> hidden-userspace.aix.2.0.dev </td>
<td>display: information regarding display</td>
<td> some info here </td>
<td> 2 </td>
</tr>
What stops you from leveraging BeautifulSoup for this?
Let's say you have already read the HTML (using sgmllib or any other library) into a string variable named html_contents.
Since you are not mentioning which column you want to get data from, I am extracting the gerrit number column.
You can simply do:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_contents, 'html.parser')
for tr in soup.tbody:
    # iterating over tbody also yields whitespace text nodes, so keep only the tr tags
    if tr.name == 'tr':
        # assumes whitespace between the tags and an a tag inside each cell, as on the original page
        print(tr.contents[3].contents[1].string)
The loop above visits every tr tag inside the tbody and (assuming all the td tags contained inside the tr have the same structure) extracts their value, in this case the text of the a tag inside the cell.
Read the quick start; it will make your life easier when parsing HTML documents.
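If indexing into contents feels fragile, a column can also be picked by its header text. The following is a sketch, not the answerer's method: it assumes each tab's div has an h1 with the queue name followed by the table, the second data row is invented sample data, and gerrit_numbers is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table; the second data row is invented.
html_contents = """
<div id="tab1">
<h1>Review Queue</h1>
<table class="tablesorter" id="dashboardTable">
<thead><tr><th></th><th>Gerrit</th><th>Owner</th></tr></thead>
<tbody>
<tr><td><input type="checkbox" /></td><td> 1696771 </td><td> ponga </td></tr>
<tr><td><input type="checkbox" /></td><td> 1696772 </td><td> someone </td></tr>
</tbody>
</table>
</div>
"""


def gerrit_numbers(html, queue_name, column="Gerrit"):
    """Return the values of one column from the table under the given h1."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", string=queue_name)
    if heading is None:
        return []
    table = heading.find_next("table")
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    if column not in headers:
        return []
    col = headers.index(column)
    values = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")  # the header row has no td cells and is skipped
        if len(cells) > col:
            values.append(cells[col].get_text(strip=True))
    return values


# Print the numbers horizontally, separated by commas.
print(", ".join(gerrit_numbers(html_contents, "Review Queue")))
```

Because get_text() collects all text below the cell, this works whether or not the number is wrapped in an a tag.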

How can I prevent closing of tags in bad HTML using BeautifulSoup (python)?

I automatically translate the content of HTML pages into a different language, so I have to extract all text nodes from different HTML pages that are sometimes badly written (I have no possibility to edit these HTML files).
By using BeautifulSoup I can extract those texts easily and replace them with the translation, but when I display the HTML after this operation: html = BeautifulSoup(source_html) - it's sometimes broken because BeautifulSoup automatically closes tags (for instance, a table tag is closed in the wrong place).
Is there a way to prevent BeautifulSoup from closing these tags?
For instance this is my input:
html = "<table><tr><td>some text</td></table>" - the closing tr is missing
After soup = BeautifulSoup(html) I get "<table><tr><td>some text</td></tr></table>"
and I want to get the very same HTML as the input...
Is it possible at all?
BeautifulSoup excels in parsing and extracting data from badly formatted HTML/XML, but if the broken HTML is ambiguous then it uses a set of rules to interpret the tags (which may not be what you want). See the section on Parsing HTML in the docs which ends with an example that sounds very similar to your situation.
If you know what's wrong with your tags and understand the rules that BeautifulSoup uses, you may be able to adjust your HTML slightly (perhaps remove or move certain tags) to make BeautifulSoup return the output you want.
If you can post a short example, someone might be able to give you more specific help.
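To see why the original markup cannot simply be preserved, it may help to look at what a round trip through the parser does. This small sketch assumes the newer bs4 package with the built-in html.parser; the parsed document is a tree, and serializing a tree always writes a closing tag for every non-void element in it:

```python
from bs4 import BeautifulSoup

source_html = "<table><tr><td>some text</td></table>"  # closing </tr> is missing

soup = BeautifulSoup(source_html, "html.parser")
round_tripped = str(soup)  # serialize the tree back to a string

print(round_tripped)
print(round_tripped == source_html)
```

The missing tag is not remembered anywhere in the tree, so any fix has to happen before parsing (editing the string) or after it (rearranging tags).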
Update (some examples)
For example, consider the example given in the docs (linked above):
from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
<table>
<td><input name="input1">Row 1 cell 1
<tr><td>Row 2 cell 1
</form>
<td>Row 2 cell 2<br>This</br> sure is a long cell
</body>
</html>"""
print BeautifulSoup(html).prettify()
The <table> tag will be closed before </form> to ensure that the table is properly nested within the form, leaving the last <td> hanging.
If we understand the problem, we can get the correct closing tag (</table>) by removing "<form>" before parsing:
>>> html = html.replace("<form>", "")
>>> soup = BeautifulSoup(html)
>>> print soup.prettify()
<html>
<table>
<td>
<input name="input1" />
Row 1 cell 1
</td>
<tr>
<td>
Row 2 cell 1
</td>
<td>
Row 2 cell 2
<br />
This
sure is a long cell
</td>
</tr>
</table>
</html>
If the <form> tag IS important, you can still add it after parsing. For example:
>>> from BeautifulSoup import Tag
>>> new_form = Tag(soup, "form") # create form element
>>> soup.html.insert(0, new_form) # insert form as child of html
>>> new_form.insert(0, soup.table.extract()) # move table into form
>>> print soup.prettify()
<html>
<form>
<table>
<td>
<input name="input1" />
Row 1 cell 1
</td>
<tr>
<td>
Row 2 cell 1
</td>
<td>
Row 2 cell 2
<br />
This
sure is a long cell
</td>
</tr>
</table>
</form>
</html>
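The session above uses BeautifulSoup 3. With the newer bs4 package, a similar move can be sketched with new_tag() and wrap(). This is an illustration on a simplified, well-formed table (not the broken input above), parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Simplified, well-formed stand-in for the table parsed above.
html = "<html><table><tr><td>Row 1 cell 1</td></tr></table></html>"
soup = BeautifulSoup(html, "html.parser")

new_form = soup.new_tag("form")  # create the form element
soup.table.wrap(new_form)        # put the form where the table was, with the table inside it

print(soup.prettify())
```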

Can't parse a table child with xpath

I'm parsing a site with some messy HTML. There are 130 subpages, and the only one that fails is the last one. The line that fails is the bolded one: I get an empty list when I should be getting 3 (the parent and 2 children). All the pages have the same structure, so I don't have a clue how to solve this.
from lxml.html import parse
# get a list of the urls of the foods to parse
main_site = "http://www.whfoods.com/foodstoc.php"
doc = parse(main_site).getroot()
doc.make_links_absolute()
sites = doc.xpath('/html/body//div[@class="full3col"]/ul/li/a/@href')
for site in sites:
    doc = parse(site).getroot()
    **table = doc.xpath("descendant::table[1]")[0]**
    # food info list
    table.xpath("//tr/td/table/tr/td/b/text()")
    # food nutrients list
    table.xpath("//tr/td/table[1]/tr/td/text()")
This is an HTML excerpt of the page that fails:
<html>
<head>
<body>
<div id=mainpage">
<div id="subcontent">
(40+ <p> tags with things inside)
<p>
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<b>Food's name<br>other things</b>
</td>
</tr>
<tr>
Heads of the table(not needed)
</tr>
<tr>
<td>nutrient name</td>
<td>dv</td>
<td>density</td>
<td>rating</td>
</tr>
</tbody>
</table>
<table> Not needed
...
All remaining closing tags
According to validator.w3.org when pointed at http://www.whfoods.com/genpage.php?tname=foodspice&dbid=97:
Line 253, column 147: non SGML character number 150
…ed mushrooms by Liquid Chromatography Mass Spectroscopy. The 230th ACS Natio…
The problem character is between "Chromatography" and "Mass". The page is declared to be encoded in ISO-8859-1, but as often happens in that case, it is lying:
>>> import unicodedata as ucd
>>> ucd.name(chr(150).decode('cp1252'))
'EN DASH'
Perhaps lxml is being picky about this also (Firefox doesn't care).
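The snippet above is Python 2. A Python 3 version of the same check, using only the standard library, might look like the sketch below. It also shows what byte 150 becomes when the declared ISO-8859-1 encoding is taken at face value:

```python
import unicodedata

raw = bytes([150])  # the problem byte, 0x96

as_cp1252 = raw.decode("cp1252")      # what the page author almost certainly meant
as_latin1 = raw.decode("iso-8859-1")  # what the declared encoding says it is

print(unicodedata.name(as_cp1252))      # EN DASH
print(unicodedata.category(as_latin1))  # Cc, a control character
print(hex(ord(as_latin1)))              # 0x96
```

Under ISO-8859-1 the byte maps to a C1 control character, which is not valid in an HTML document; decoding the page as cp1252 before handing it to the parser is one thing to try.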
