Using BeautifulSoup To Extract Specific TD Table Elements Text? - python

I am trying to extract IP addresses from an autogenerated HTML table using the BeautifulSoup library, and I'm having a little trouble.
The HTML is structured like so:
<html>
<body>
<table class="mainTable">
<thead>
<tr>
<th>IP</th>
<th>Country</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="hello.html">127.0.0.1<a></td>
<td><img src="uk.gif" />uk</td>
</tr>
<tr>
<td><a href="hello.html">192.168.0.1<a></td>
<td><img src="uk.gif" />us</td>
</tr>
<tr>
<td><a href="hello.html">255.255.255.0<a></td>
<td><img src="uk.gif" />br</td>
</tr>
</tbody>
</table>
The short script below extracts the text from both td cells in each row, but I only need the IP data, not the IP and country data:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.htm"))
table = soup.find('table', {'class': 'mainTable'})
for row in table.findAll("a"):
    print(row.text)
This outputs:
127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br
What I need is the text of the IP elements (table.tbody.tr.td.a), not the country elements (table.tbody.tr.td.img.a).
Do any experienced BeautifulSoup users have an idea of how to do this selection and extraction?
Thanks.

This gives you the right list:
>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[127.0.0.1<a></a>, <a></a>, 192.168.0.1<a></a>, <a></a>, 255.255.255.0<a></a>, <a></a>]
Just apply .text to the elements of this list.
The list above contains several empty <a></a> tags because the <a> tags in the HTML are not closed properly. To get rid of them, you can use
pred = lambda tag: tag.parent.find('img') is None and tag.text
and ultimately:
>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']

You can use a small regular expression to extract the IP addresses. BeautifulSoup combined with regular expressions works well for scraping.
import re
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)
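Note that the pattern above also accepts strings such as 999.1.1.1, because \d{1,3} only checks the number of digits. If stricter validation matters, one option is the standard library's ipaddress module. This is a minimal sketch: is_ipv4 and the candidates list are illustrative names that are not part of the original answer, and the list stands in for the row.text values.

```python
import ipaddress


def is_ipv4(text):
    """Return True if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
    except ValueError:
        return False
    return True


# Hypothetical link texts, standing in for row.text values from the table.
candidates = ["127.0.0.1", "uk", "192.168.0.1", "999.1.1.1", "255.255.255.0"]
ips = [text for text in candidates if is_ipv4(text)]
print(ips)
```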

Search just the first <td> of each row in tbody:
# html should contain page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]
or maybe more readable:
rows = bs4.BeautifulSoup(html).find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
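For reference, here is a self-contained version of the same idea that can be run as is. It is only a sketch: the inline html string is a shortened stand-in for the question's table (with the <a> tags closed), and html.parser is chosen here just to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question's table (with the <a> tags closed).
html = """
<table class="mainTable">
  <tbody>
    <tr><td><a href="hello.html">127.0.0.1</a></td><td><img src="uk.gif" />uk</td></tr>
    <tr><td><a href="hello.html">192.168.0.1</a></td><td><img src="uk.gif" />us</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("tbody").find_all("tr")
# find("td") returns the first <td> in each row, i.e. the IP column.
ip_list = [row.find("td").get_text(strip=True) for row in rows]
print(ip_list)
```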

First, note that the HTML is not well-formed: the <a> tags are never closed. Two <a> tags are opened here:
<a href="hello.html">127.0.0.1<a>
If you print table you'll see BeautifulSoup is parsing the HTML as:
<td>
127.0.0.1<a></a>
</td>
...
Each a is followed by an empty a.
Given the presence of those extra <a> tags, if you want every third <a> tag, then
for row in table.findAll("a")[::3]:
    print(row.get_text())
suffices:
127.0.0.1
192.168.0.1
255.255.255.0
On the other hand, if the occurrence of <a> tags is not so regular and you only want those <a> tags with no previous sibling (such as, but not limited to, <img>), then
for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())
would work.
If you have lxml, the criteria can be expressed more succinctly using XPath:
import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)
The XPath used above has the following meaning:
//table                          select all <table> tags
[@class="mainTable"]             that have a class="mainTable" attribute
//                               from these tags select descendants
td/a                             which are td tags with a child <a> tag
[not(preceding-sibling::img)]    such that it does not have a preceding sibling <img> tag
/text()                          return the text of the <a> tag
It does take a little time to learn XPath, but once you learn it you may never want to use BeautifulSoup again.

Related

Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element

I'd like to detect the header of an HTML table when that table does not have <thead> elements. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this with python in both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to get out of it a thead object, a tbody object, and a tfoot object.
Currently, parse_thead does the following when the <thead> tag is present:
In BeautifulSoup, I get table objects with doc.find_all('table') and I can use table.find_all('thead')
In lxml, I get table objects with doc.xpath() on an xpath_expr on //table, and I can use table.xpath('.//thead')
and parse_tbody and parse_tfoot work in the same way. (I did not write this code and I am not experienced with either BS or lxml.) However, without a <thead>, parse_thead returns nothing and parse_tbody returns the header and the body together.
I append a wikitable instance below as an example. It lacks <thead> and <tbody>. Instead all rows, header or not, are enclosed in <tr>...</tr>, but header rows have <th> elements and body rows have <td> elements. Without <thead>, it seems like the right criterion for identifying the header is "from the start, put rows into the header until you find a row that has an element that's not <th>".
I'd appreciate suggestions on how I could write parse_thead and parse_tbody. Without much experience here, I would think I could either
Dive into the table object and manually insert thead and tbody tags before parsing it (this seems nice because then I wouldn't have to change any other code that recognizes tables with <thead>), or alternately
Change parse_thead and parse_tbody to recognize the table rows that have only <th> elements. (With either alternative, it seems like I really need to detect the head-body boundary in this way.)
I don't know how to do either of those things and I'd appreciate advice on both which alternative is more sensible and how I might go about it.
(Edit: Examples with no header rows and multiple header rows. I can't assume it has only one header row.)
<table class="wikitable">
<tr>
<th>Rank</th>
<th>Score</th>
<th>Overs</th>
<th><b>Ext</b></th>
<th>b</th>
<th>lb</th>
<th>w</th>
<th>nb</th>
<th>Opposition</th>
<th>Ground</th>
<th>Match Date</th>
</tr>
<tr>
<td>1</td>
<td>437</td>
<td>136.0</td>
<td><b>64</b></td>
<td>18</td>
<td>11</td>
<td>1</td>
<td>34</td>
<td>v West Indies</td>
<td>Manchester</td>
<td>27 Jul 1995</td>
</tr>
</table>
We can use <th> tags to detect headers, in case the table doesn't contain <thead> tags. If all columns of a row are <th> tags then we can assume that it is a header. Based on that I created a function that identifies the header and body.
Code for BeautifulSoup:
def parse_table(table):
    head_body = {'head':[], 'body':[]}
    for tr in table.select('tr'):
        if all(t.name == 'th' for t in tr.find_all(recursive=False)):
            head_body['head'] += [tr]
        else:
            head_body['body'] += [tr]
    return head_body
Code for lxml:
def parse_table(table):
    head_body = {'head':[], 'body':[]}
    for tr in table.cssselect('tr'):
        if all(t.tag == 'th' for t in tr.getchildren()):
            head_body['head'] += [tr]
        else:
            head_body['body'] += [tr]
    return head_body
The table parameter is either a Beautiful Soup Tag object or a lxml Element object. head_body is a dictionary that contains two lists of <tr> tags, the header and body rows.
Usage example:
html = '<table><tr><th>header</th></tr><tr><td>body</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
table_rows = parse_table(table)
print(table_rows)
#{'head': [<tr><th>header</th></tr>], 'body': [<tr><td>body</td></tr>]}
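The function above treats every all-<th> row as a header, wherever it appears. The question's own criterion is slightly different: collect rows into the header only until the first row that contains a non-<th> cell. A possible sketch of that variant for BeautifulSoup follows; is_header_row and split_leading_header are hypothetical names, and nested tables are not handled.

```python
from itertools import takewhile

from bs4 import BeautifulSoup


def is_header_row(tr):
    """True if the row has cells and every direct child tag is a <th>."""
    cells = tr.find_all(recursive=False)
    return bool(cells) and all(cell.name == "th" for cell in cells)


def split_leading_header(table):
    """Split rows into (head, body); the head ends at the first non-<th> row."""
    rows = table.find_all("tr")
    head = list(takewhile(is_header_row, rows))
    return head, rows[len(head):]


html = (
    "<table>"
    "<tr><th>Rank</th><th>Score</th></tr>"
    "<tr><td>1</td><td>437</td></tr>"
    "<tr><th>Total</th><th>437</th></tr>"
    "</table>"
)
table = BeautifulSoup(html, "html.parser").find("table")
head, body = split_leading_header(table)
print(len(head), len(body))
```

With this variant a table with no header rows gives an empty head, and a later all-<th> row (such as a totals row) stays in the body.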
Check whether each tr tag contains a th child; candidate.th returns None if there is no th inside candidate:
possibleHeaders = soup.find("table").findAll("tr")
Headers = []
for candidate in possibleHeaders:
    if candidate.th:
        Headers.append(candidate)

Python - beautifulsoup - how to deal with missing closing tags

I would like to scrape the table from html code using beautifulsoup. A snippet of the html is shown below. When using table.findAll('tr') I get the entire table and not only the rows. (probably because the closing tags are missing from the html code?)
<TABLE COLS=9 BORDER=0 CELLSPACING=3 CELLPADDING=0>
<TR><TD><B>Artikelbezeichnung</B>
<TD><B>Anbieter</B>
<TD><B>Menge</B>
<TD><B>Taxe-EK</B>
<TD><B>Taxe-VK</B>
<TD><B>Empf.-VK</B>
<TD><B>FB</B>
<TD><B>PZN</B>
<TD><B>Nachfolge</B>
<TR><TD>ACTIQ 200 Mikrogramm Lutschtabl.m.integr.Appl.
<TD>Orifarm
<TD ID=R> 30 St
<TD ID=R> 266,67
<TD ID=R> 336,98
<TD>
<TD>
<TD>12516714
<TD>
</TABLE>
Here is my python code to show what I am struggling with:
soup = BeautifulSoup(data, "html.parser")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.text)
As stated in its documentation, html5lib parses the document the way a web browser does (as does lxml in this case). It tries to fix the document tree by adding or closing tags where needed.
In your example I've used lxml as the parser and it gave the following result:
soup = BeautifulSoup(data, "lxml")
table = soup.findAll("table")[0]
rows = table.find_all('tr')
for tr in rows:
    print(tr.get_text(strip=True))
Note that lxml added html and body tags because they weren't present in the source (it tries to create a well-formed document, as previously stated).
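To see the difference between parsers on your own machine, a small comparison like the following may help. It is a sketch under assumptions: data is a shortened stand-in for the question's markup, row_texts is an illustrative helper, and parsers that are not installed are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened stand-in for the question's markup: no closing </tr> or </td> tags.
data = "<table><tr><td>a<td>b<tr><td>c<td>d</table>"


def row_texts(markup, parser):
    """Return the text of every <tr> as seen by the given parser."""
    soup = BeautifulSoup(markup, parser)
    return [tr.get_text(strip=True) for tr in soup.find_all("tr")]


results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = row_texts(data, parser)
    except FeatureNotFound:
        # The parser library is not installed; skip it.
        continue

for parser, texts in results.items():
    print(parser, texts)
```

html.parser does not close the open rows and cells implicitly, so its first row also contains the text of the later row, which matches the behaviour described in the question.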

Getting value from tag with BeautifulSoup

I'm trying to scrape movie information from the info box on Wikipedia using BeautifulSoup. I'm having trouble scraping movie budgets, as below.
For example, I want to scrape the '$25 million' budget value from the info box. How can I get the budget value, given that neither the th nor the td tags are unique? (See example HTML).
Say I have tag = soup.find('th') with the value
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th> - How can I get the value of '$25 million' from tag?
I thought I could do something like tag.td or tag.text but neither of these are working for me.
Do I have to loop over all tags and check if their text is equal to 'Budget', and if so get the following cell?
Example HTML Code:
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
<td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference">[2]</sup></td>
</tr>
<tr>
<th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
<td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference">[3]</sup></td>
</tr>
You can first find the th node whose text is Budget, then find its next sibling td and get the text from that node:
soup.find("th", text="Budget").find_next_sibling("td").get_text()
# u'$25 million[2]'
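The result above still carries the footnote marker [2]. If only '$25 million' is wanted, one option is to drop the <sup> reference tags before reading the text. This is a minimal sketch that assumes the markup looks like the example HTML in the question; the html string here is a trimmed copy of it.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><th scope="row">Budget</th>
<td>$25 million<sup id="cite_ref-2" class="reference">[2]</sup></td></tr>
<tr><th scope="row">Box office</th>
<td>$65.7 million<sup id="cite_ref-BOM_3-0" class="reference">[3]</sup></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("th", string="Budget").find_next_sibling("td")
# Remove footnote markers such as [2] before reading the text.
for sup in cell.find_all("sup", class_="reference"):
    sup.decompose()
budget = cell.get_text(strip=True)
print(budget)
```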
To get every amount in the <td> tags, you should use
tags = soup.findAll('td')
and then
for tag in tags:
    print(tag.get_text())  # the cell text, e.g. '$25 million[2]'
What you need is the find_all() method in BeautifulSoup.
For example:
tdTags = soup.find_all('td',{'class':'reference'})
This finds all 'td' tags whose class is 'reference'.
You can find whichever td tags you want, as long as the expected td tags have a unique attribute.
Then you can loop over them to get the content, as @Bijoy said.
The other possible way might be:
split_text = soup.get_text().split('\n')
# The next index from Budget is cost
split_text[split_text.index('Budget')+1]

Storing the unknown Id of an html tag

So I am trying to scrape an HTML page using BeautifulSoup, but I am having problems finding a tag id using Python 3.4. I know what the tag ("tr") is, but the id is constantly changing and I would like to save the id when it changes. For example:
<div class = "thisclass"
<table id = "thistable">
<tbody>
<tr id="what i want">
<td class = "someinfo">
<tbody>
<table>
<div>
I can find the div tag and the table, and I know the tr tag is there, but I want to extract the text next to id, without knowing what the text is going to say.
So far I have this code:
soup = BeautifulSoup(url.read())
divTag = soup.find_all("table",id ="thistable")
i = 0
for i in divTag:
    trtag = soup.find("tr", id)
    print(trtag)
    i = i+1
If anyone could help me solve this problem, I would appreciate it.
You can use a css selector:
print([element.get('id') for element in soup.select('table#thistable tr[id]')])
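A runnable sketch of the selector approach follows. The markup is a well-formed stand-in for the question's snippet, and "row-42" is a made-up id used only for illustration.

```python
from bs4 import BeautifulSoup

# Well-formed stand-in for the question's markup; "row-42" is a made-up id.
html = """
<div class="thisclass">
  <table id="thistable">
    <tbody>
      <tr id="row-42"><td class="someinfo">info</td></tr>
      <tr><td>row without an id</td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# tr[id] matches only rows that carry an id attribute, whatever its value.
ids = [tr.get("id") for tr in soup.select("table#thistable tr[id]")]
print(ids)
```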

How can I get the first and third td from a table with BeautifulSoup?

I am currently using Python and BeautifulSoup to scrape some website data.
I'm trying to pull cells from a table which is formatted like so:
<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>
The problem with the above HTML is that BeautifulSoup reads it as one tag. I need to pull the values from the first <td> and the third <td>, which would be 1 and 20, respectively.
Unfortunately, I have no idea how to go about this. How can I get BeautifulSoup to read the 1st and 3rd <td> tags of each row of the table?
Update:
I figured out the problem. I was using html.parser instead of the default for BeautifulSoup. Once I switched to the default the problems went away. Also I used the method listed in the answer.
I also found out that the different parsers are very temperamental with broken code. For instance, the default parser refused to read past row 192, but html5lib got the job done. So try using lxml, html.parser, and html5lib if you are having problems parsing the entire table.
That's a nasty piece of HTML you've got there. If we ignore the semantics of table rows and table cells for a moment and treat it as pure XML, its structure looks like this:
<tr>
<td>1
<td>
<td>20
<td>5%</td>
</td>
</td>
</td>
</tr>
BeautifulSoup, however, knows about the semantics of HTML tables, and instead parses it like this:
<tr>
<td>1 <!-- an IMPLICITLY (no closing tag) closed td element -->
<td> <!-- as above -->
<td>20 <!-- as above -->
<td>5%</td> <!-- an EXPLICITLY closed td element -->
</td> <!-- an error; ignore this -->
</td> <!-- as above -->
</td> <!-- as above -->
</tr>
... so that, as you say, 1 and 20 are in the first and third td elements (not tags) respectively.
You can actually get at the contents of these td elements like this:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>")
>>> tr = soup.find("tr")
>>> tr
<tr><td>1</td><td></td><td>20</td><td>5%</td></tr>
>>> td_list = tr.find_all("td")
>>> td_list
[<td>1</td>, <td></td>, <td>20</td>, <td>5%</td>]
>>> td_list[0] # Python starts counting list items from 0, not 1
<td>1</td>
>>> td_list[0].text
'1'
>>> td_list[2].text
'20'
>>> td_list[3].text
'5%'
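If only html.parser is available, the unclosed cells end up nested inside one another, so .text on the first cell also includes the text of the later ones. One possible workaround, sketched below, is to read only the strings that are direct children of each cell. own_text is an illustrative helper, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup

markup = "<table><tr><td>1<td><td>20<td>5%</td></td></td></td></tr></table>"
soup = BeautifulSoup(markup, "html.parser")


def own_text(tag):
    """Join only the strings that are direct children of tag."""
    return "".join(tag.find_all(string=True, recursive=False)).strip()


for tr in soup.find_all("tr"):
    cells = [own_text(td) for td in tr.find_all("td")]
    first, third = cells[0], cells[2]
    print(first, third)
```

The same loop should also work when a more forgiving parser has already flattened the cells into siblings, since each cell then holds only its own text.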
