BeautifulSoup Parsing with Bad HTML Tables

I'm trying to parse tables similar to the following with BeautifulSoup to extract the name, age, and position for each person.
<TABLE width="100%" align="center" cellspacing="0" cellpadding="0" border="0">
<TR>
<TD></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><FONT size="2"><B>Age as of</B></FONT></TD>
<TD></TD>
<TD></TD>
</TR>
<TR>
<TD align="center" nowrap><FONT size="2"><B>Name</B></FONT></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><FONT size="2"><B>November 1, 1999</B></FONT></TD>
<TD></TD>
<TD align="center" nowrap><FONT size="2"><B>Position</B></FONT></TD>
</TR>
<TR>
<TD align="center" nowrap><HR size="1"></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><HR size="1"></TD>
<TD></TD>
<TD align="center" nowrap><HR size="1"></TD>
</TR>
<TR>
<TD align="left" valign="top"><FONT size="2">
Terry S. Jacobs</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">57</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
Chairman of the Board, Chief Executive Officer, Treasurer and
director</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
William L. Stakelin</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">56</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
President, Chief Operating Officer, Secretary and director</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
Joel M. Fairman</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">70</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
Vice Chairman and director</FONT></TD>
</TR>
</TABLE>
My current attempt is as follows:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(in_file)
out = []
headers = soup.findAll(['td','th'])
for header in headers:
    # keep the header cells whose text starts with "age"
    if header.find(text = re.compile(r"^age( )?", re.I)):
        out.append(header)
table = out[0].find_parent("table")
rows = table.findAll('tr')
filter_regex = re.compile(r'[\w][\w .,]*', re.I)
data = [[td.find(text=filter_regex) for td in tr.findAll("td")] for tr in rows]
Things work fine for the first person, but the bad <tr><td><tr><td>... lines really mess things up from there. I am trying to do this for a few thousand HTML files, each with a slightly different table structure. That said, unclosed <tr> and <td> tags like these appear to be quite common across the files.
Does anyone have thoughts on how to generalize the above parsing to work with tables that have constructs such as these? Thanks a lot!

You can take advantage of the fact that the valign attribute is set to top in all of the fields you'd like to keep and none of the ones you don't:
soup = BeautifulSoup(in_file)
cells = [cell.text.strip() for cell in soup('td', valign='top')]
Then you can sort this list of cells into a two-dimensional structure. There are three cells per entry, so you can sort it out pretty simply by doing something like this:
entries = []
for i in range(0, len(cells), 3):
    entries.append(cells[i:i+3])
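For reference, here is a self-contained sketch of the approach above, run against a trimmed copy of the table from the question. The trimmed markup, the explicit 'html.parser' argument, and the whitespace normalization are assumptions added for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed sample: two people, with the unclosed <TR><TD> run in between.
html = """<TABLE>
<TR>
<TD align="center" nowrap><B>Name</B></TD>
<TD align="center" nowrap><B>Age</B></TD>
<TD align="center" nowrap><B>Position</B></TD>
</TR>
<TR>
<TD align="left" valign="top"><FONT size="2">
Terry S. Jacobs</FONT></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">57</FONT></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
Chairman of the Board, Chief Executive Officer, Treasurer and
director</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
William L. Stakelin</FONT></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">56</FONT></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
President, Chief Operating Officer, Secretary and director</FONT></TD>
</TR>
</TABLE>"""

soup = BeautifulSoup(html, "html.parser")

# Only the data cells carry valign="top"; collapse internal line breaks.
cells = [" ".join(td.get_text().split())
         for td in soup.find_all("td", valign="top")]

# Three cells per person: name, age, position.
entries = [cells[i:i + 3] for i in range(0, len(cells), 3)]

for name, age, position in entries:
    print(name, age, position, sep=" | ")
```

This relies on every data cell being closed properly even when the filler rows are not, which holds for the sample but may not hold for every file.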

On the off chance that anyone else gets stuck on this issue and stumbles in here, the modern solution is to change which parser you are using. Python's built-in parser, 'html.parser' (what BeautifulSoup falls back to when neither lxml nor html5lib is installed), is pretty good with HTML whose tags are properly closed, but the moment you have to deal with edge cases like Example 1 below (which is similar to the OP's issue), it still falls apart even 8 years later (see Example 2 below).
In the documentation for BeautifulSoup4 (current version 4.9.3), there is a section detailing parser selection: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
Example 1, the raw HTML:
<TABLE >
<TR VALIGN="top">
<td> <td><b>Title:</b>
<td> title is here <i>-subtitle</i><br>
<TR VALIGN="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
<TR VALIGN="top">
<td> <td><b>Type:</b>
<td> 61
<TR VALIGN="top">
<td>
<td><b>Status:</b>
<td> ACTIVE - ACTIVE
</TABLE>
Example 2, results when using BeautifulSoup(html, 'html.parser'):
<table>
<tr valign="top">
<td> <td><b>Title:</b>
<td> title is here <i>-subtitle</i><br/>
<tr valign="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
<tr valign="top">
<td> <td><b>Type:</b>
<td> 61
<tr valign="top">
<td>
<td><b>Status:</b>
<td> ACTIVE - ACTIVE
</td></td></td></tr></td></td></td></tr></td></td></td></tr></td></td></td></tr></table>
Example 3, results when using BeautifulSoup(html, 'html5lib'):
<table>
<tbody><tr valign="top">
<td> </td><td><b>Title:</b>
</td><td> title is here <i>-subtitle</i><br/>
</td></tr><tr valign="top">
<td>
</td><td><b>Date:</b>
</td><td> Thursday , August 27th, 2020
</td></tr><tr valign="top">
<td> </td><td><b>Type:</b>
</td><td> 61
</td></tr><tr valign="top">
<td>
</td><td><b>Status:</b>
</td><td> ACTIVE - ACTIVE
</td></tr></tbody></table>
There are also parsers backed by external C libraries, such as 'lxml', which according to the documentation is much faster.
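To see the difference between parsers without reading the serialized output, one option is to count how many <td> elements end up under each <tr>. The sketch below uses a shortened version of Example 1; the helper name cells_per_row is made up for illustration, and the html5lib branch only runs if that package is installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened version of Example 1: no closing </td> or </tr> tags.
html = """<TABLE>
<TR VALIGN="top">
<td> <td><b>Title:</b>
<td> title is here
<TR VALIGN="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
</TABLE>"""


def cells_per_row(markup, parser):
    """Return the number of <td> descendants found under each <tr>."""
    soup = BeautifulSoup(markup, parser)
    return [len(tr.find_all("td")) for tr in soup.find_all("tr")]


# html.parser nests every unclosed tag inside the previous one,
# so the first row appears to contain the cells of all later rows.
builtin_counts = cells_per_row(html, "html.parser")
print("html.parser:", builtin_counts)

try:
    # html5lib repairs the markup the way a browser would.
    print("html5lib:", cells_per_row(html, "html5lib"))
except FeatureNotFound:
    print("html5lib is not installed (pip install html5lib)")
```

If html5lib is available, each row should report the same number of cells, matching the structure shown in Example 3.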

Related

How to grab the number with the end of "0" from the website?

I use BeautifulSoup to grab some text from the URL "https://nature.altmetric.com/details/114136890", and get this response:
# The table is called twitterGeographical_TableChoice
<table>
<tr>
<th>Country</th>
<th class="num">Count</th>
<th class="num percent">As %</th>
</tr>
<tr>
<td>Japan</td>
<td class="num">3</td>
<td class="num">12%</td>
</tr>
<tr>
<td>Poland</td>
<td class="num">3</td>
<td class="num">12%</td>
</tr>
<tr>
<td>Spain</td>
<td class="num">3</td>
<td class="num">12%</td>
</tr>
<tr>
<td>El Salvador</td>
<td class="num">2</td>
<td class="num">8%</td>
</tr>
<tr>
<td>Ecuador</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr>
<td>Mexico</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr>
<td>Chile</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr>
<td>India</td>
<td class="num">1</td>
<td class="num">4%</td>
</tr>
<tr class="meta">
<td>Unknown</td>
<td class="num">10</td>
<td class="num">40%</td>
</tr>
</table>
Then I want to get the numbers from it. I use a regular expression to get them.
My pattern is
twitterGeographical_Table_Num_pattern = re.compile('<td class=\"num\">(\d*%)</td>',re.S)
twitterGeographical_Table_Num = twitterGeographical_Table_Num_pattern.findall(twitterGeographical_TableChoice)
But I can only get 4% instead of 40%. I am puzzled. Thanks for your help!

I am not sure why you would extract the numbers with the regex module when BeautifulSoup already offers plenty of ways to do this. Anyway, if you want to stick with regex, you can use this pattern instead:
<td class=\"num\">((\d+)(%)?)</td>
Then you can get the numbers (with the percent sign, where present) using the code below:
[x[0] for x in twitterGeographical_Table_Num]
Output
['10', '40%']
Side note: please consider giving the variables shorter, clearer names! :)
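As a sketch of the BeautifulSoup route mentioned above, the cells can be selected by their class and then split into counts and percentages with plain string checks. The shortened table and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
html = """<table>
<tr><th>Country</th><th class="num">Count</th><th class="num percent">As %</th></tr>
<tr><td>Japan</td><td class="num">3</td><td class="num">12%</td></tr>
<tr><td>India</td><td class="num">1</td><td class="num">4%</td></tr>
<tr class="meta"><td>Unknown</td><td class="num">10</td><td class="num">40%</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Every numeric cell is a <td class="num">; the <th> headers are skipped.
num_cells = [td.get_text(strip=True)
             for td in soup.find_all("td", class_="num")]

percents = [text for text in num_cells if text.endswith("%")]
counts = [int(text) for text in num_cells if text.isdigit()]

print(percents)
print(counts)
```

Working on parsed cells avoids depending on the exact attribute order and spacing of the raw HTML, which is where regex patterns tend to break.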

Some <td>'s Cannot Be Found by find_next()

This is a question about scraping with BS4. I am scraping a website that has barely any IDs on the elements I need, so I'm relying on find_next, find_next_siblings, and similar BS4 navigation methods.
I used find_next() to get some td values from my tables. It worked for some values, but for others it can't find them.
Here's the html:
<table style="max-width: 350px;" border="0">
<tbody><tr>
<td style="max-width: 215px;">REF. NO.</td>
<td style="max-width: 12px;" align="center"> </td>
<td align="right">000124 </td>
</tr>
<tr>
<td>REF. NO.</td>
<td align="center"> </td>
<td align="right"> </td>
</tr>
<tr>
<td>MANU</td>
<td align="center"> </td>
<td align="right"></td>
</tr>
<tr>
<td>STREAK</td>
<td align="center"> </td>
<td align="right">1075</td>
</tr>
<tr>
<td>PACK</td>
<td align="center"> </td>
<td align="right">1</td>
</tr>
<tr>
<td colspan="3">ON STOCK. </td>
</tr>
.... and so on
So I used this code to get what I want:
div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        price = element.find_next('td').find_next('td').text
        print(price + "price")
    else:
        print('NOT FOUND!')
I copied and pasted the text straight from the HTML many times to make sure I didn't mistype anything, but it still always ends up at NOT FOUND. If I try other labels from the table, such as PACK, I can get them.
By the way, I'm using two find_next() calls because the HTML has three td's in every <tr>.
Why does this work for some words but not for others? Any help is appreciated. Thank you very much!

I would rewrite it like this:
trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        price = tds[-1].get_text().strip()
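The same row-by-row idea can be extended to collect every label and value at once, so each lookup becomes a dictionary access. This is only a sketch: the wrapping <div id="infodata"> is taken from the question's code, the markup is a shortened copy of the sample, and the names values and price are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample; the wrapping div comes from the question's code.
html = """<div id="infodata">
<table border="0">
<tr><td>REF. NO.</td><td align="center"> </td><td align="right">000124 </td></tr>
<tr><td>MANU</td><td align="center"> </td><td align="right"></td></tr>
<tr><td>STREAK</td><td align="center"> </td><td align="right">1075</td></tr>
<tr><td>PACK</td><td align="center"> </td><td align="right">1</td></tr>
<tr><td colspan="3">ON STOCK. </td></tr>
</table>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "infodata"})

values = {}
for tr in div.find_all("tr"):
    tds = tr.find_all("td")
    # Only label / spacer / value rows have exactly three cells.
    if len(tds) == 3:
        label = tds[0].get_text(strip=True)
        values[label] = tds[-1].get_text(strip=True)

price = values.get("STREAK")
print(price)
```

Note that a label appearing in more than one row (as REF. NO. does in the full sample) would be overwritten by the later row, so a list of pairs may suit such pages better than a dictionary.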

Extract 2 pieces of information from html in python

I need help figuring out how to extract Grab and the number following data-b. There are many <tr>s in the complete, unmodified webpage, and I need to filter using the "Need" just before </a>. I've been trying to do this with Beautiful Soup, though it looks like lxml might work better. I can get either all of the <tr>s or only the <a>...</a> lines that contain Need, but not just the <tr>s that contain Need in that <a> line.
<tr >
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
<td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
<td class="text-right">7.38%</td>
<td class="text-right " >Recently</td>
</tr>
<tr >
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
<td class="text-right">Some more</td>
<td class="text-right " >Recently</td>
</tr>
Thanks for any help!

from bs4 import BeautifulSoup
data = '''<tr>
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
<td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
<td class="text-right">7.38%</td>
<td class="text-right " >Recently</td>
</tr>
<tr>
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
<td class="text-right">Some more</td>
<td class="text-right " >Recently</td>
</tr>
'''
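soup = BeautifulSoup(data, 'html.parser')
# The sample above contains no <a href="/local"> element, so guard the lookup
links = soup.findAll('a', {"href": "/local"})
if links:
    print(links[0].text)
for a in soup.findAll('span', {"class": ["bloat", "bloat2"]}):
    print(a['data-b'])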
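To keep only the rows that contain Need, one possible approach is to walk the rows, test the cell texts, and read the attributes from the same row. The sketch below assumes that "Need" is the full text of a cell, as in the sample, and that the wanted name sits in the cell just before it; the sample has been wrapped in a <table> and shortened.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample, wrapped in a <table>.
html = """<table>
<tr>
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
</tr>
<tr>
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

results = []
for tr in soup.find_all("tr"):
    texts = [td.get_text(strip=True) for td in tr.find_all("td")]
    if "Need" in texts:
        # Assumption: the name is the cell immediately before "Need".
        name = texts[texts.index("Need") - 1]
        # attrs={"data-b": True} matches any span that has the attribute.
        data_b = [span["data-b"]
                  for span in tr.find_all("span", attrs={"data-b": True})]
        results.append((name, data_b))

print(results)
```

Both data-b values in the row are returned because the question does not say which of the two spans is wanted.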

Options for using BeautifulSoup with basic table - no class ids,

Is there a recommended way to use BeautifulSoup 4 in Python when you have a table with no class or attribute values?
I was considering just using get_text() to dump the text out, but if I wanted to pick out individual values or break the table into more discrete sections, how would I go about it?
<table cellpadding="0" cellspacing="0" id="programmeDescriptor" width="100%">
<tr>
<td>
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<th colspan="1">
Awards
</th>
</tr>
<tr>
</tr>
<tr>
<td>
Ordinary Bachelor Degree
</td>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
Programme Code:
</th>
<td width="150">
CodeValue
</td>
</tr>
</table>
</td>
<td width="5">
</td>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
Mode of Delivery:
</th>
<td width="150">
Full Time
</td>
</tr>
</table>
</td>
<td width="5">
</td>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
No. of Semesters:
</th>
<td width="150">
6
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
NFQ Level:
</th>
<td width="150">
7
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
Embedded Award:
</th>
<td width="150">
No
</td>
</tr>
</table>
</td>
</tr>
</table>
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<th width="160">
Department:
</th>
<td>
Computing
</td>
</tr>
</table>
<div class="pageBreak">
</div>
<h3>
Programme Outcomes
</h3>
<p class="info">
On successful completion of this programme the learner will be able to :
</p>
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<th width="30">
PO1
</th>
<td class="head" colspan="2">
Knowledge - Breadth
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO2
</th>
<td class="head" colspan="2">
Knowledge - Kind
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO3
</th>
<td class="head" colspan="2">
Skill - Range
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO4
</th>
<td class="head" colspan="2">
Skill - Selectivity
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO5
</th>
<td class="head" colspan="2">
Competence - Context
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>Some block of text </td>
</tr>
<tr>
<th width="30">
PO6
</th>
<td class="head" colspan="2">
Competence - Role
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO7
</th>
<td class="head" colspan="2">
Competence - Learning to Learn
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO8
</th>
<td class="head" colspan="2">
Competence - Insight
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• The graduate will demonstrate the ability to specify, design and build an IT system or research & report on a current IT topic
</td>
</tr>
</table>
<div class="pageBreak">
</div>
<h3>
Semester Schedules
</h3>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 1 / Semester 1
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3897" target="_blank">
Web & User Experience
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3881" target="_blank">
Software Development 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1645" target="_blank">
Computer Architecture
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2328" target="_blank">
Discrete Mathematics 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3848" target="_blank">
Business & Information Systems
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2054" target="_blank">
Learning to Learn at Third Level
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 1 / Semester 2
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3886" target="_blank">
Software Development 2
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3895" target="_blank">
Object Oriented Systems Analysis
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3875" target="_blank">
Database Fundamentals
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3874" target="_blank">
Operating Systems Fundamentals
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2330" target="_blank">
Statistics
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2527" target="_blank">
Social Media Communications
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<div class="pageBreak">
</div>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 2 / Semester 1
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3877" target="_blank">
Web & Mobile Design & Development
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3876" target="_blank">
Database Design And Programming
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3869" target="_blank">
Software Development 3
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3873" target="_blank">
Software Quality Assurance and Testing
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3629" target="_blank">
Networking 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2477" target="_blank">
Discrete Mathematics 2
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 2 / Semester 2
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3862" target="_blank">
Project
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3911" target="_blank">
Object Oriented Analysis & Design 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3877" target="_blank">
Web & Mobile Design & Development
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3630" target="_blank">
Networking 2
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3870" target="_blank">
Software Development 4
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2476" target="_blank">
Management Science
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<div class="pageBreak">
</div>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 3 / Semester 1
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3911" target="_blank">
Object Oriented Analysis & Design 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3899" target="_blank">
Operating Systems
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1721" target="_blank">
Cloud Services & Distributed Computing
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2580" target="_blank">
Innovation & Entrepreneurship
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3878" target="_blank">
Web Application Development
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1689" target="_blank">
Algorithms and Data Structures 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2025" target="_blank">
Logic and Problem Solving
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3896" target="_blank">
Advanced Databases
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 3 / Semester 2
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2465" target="_blank">
Project
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1728" target="_blank">
Algorithms and Data Structures 2
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1675" target="_blank">
Network Management
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2025" target="_blank">
Logic and Problem Solving
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3899" target="_blank">
Operating Systems
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2580" target="_blank">
Innovation & Entrepreneurship
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1679" target="_blank">
Object Oriented Analysis & Design 2
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>

First of all, the table, parent of all tables, has an id attribute - let's make it the base for the search:
super_table = soup.find("table", id="programmeDescriptor")
Then, according to what you've mentioned in the comment, it looks like you can distinguish each inner table from the others by its headers. One option to implement this logic would be to find the header and then use find_parent() to find the parent table:
def get_table_by_header_name(super_table, header):
    return super_table.find("th", text=header).find_parent("table")
Usage:
desired_table = get_table_by_header_name(super_table, "Awards")
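In the sample above, the header text sits on its own line, so each <th> string is surrounded by newlines and an exact match on "Awards" may not find it. A whitespace-tolerant variant of the same helper could look like the sketch below; the shortened markup and the choice to return None when nothing matches are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the nested tables from the question.
html = """<table id="programmeDescriptor">
<tr><td>
<table class="borders">
<tr><th colspan="1">
Awards
</th></tr>
<tr><td>
Ordinary Bachelor Degree
</td></tr>
</table>
<table class="borders">
<tr><th width="160">
Department:
</th><td>
Computing
</td></tr>
</table>
</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
super_table = soup.find("table", id="programmeDescriptor")


def get_table_by_header_name(super_table, header):
    """Return the innermost table whose <th> text equals header, or None."""
    for th in super_table.find_all("th"):
        # Compare on stripped text so surrounding newlines do not matter.
        if th.get_text(strip=True) == header:
            return th.find_parent("table")
    return None


awards = get_table_by_header_name(super_table, "Awards")
award_values = [td.get_text(strip=True) for td in awards.find_all("td")]
print(award_values)
```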

You can iterate over specific tags. I don't know exactly what you would like to do, but if you want to get the text of every <th> tag, just iterate over them and use get_text().
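As a rough sketch of that idea, each <th> can be paired with the <td> that follows it in the same row, which turns the label/value tables into a dictionary. The markup below is a shortened copy of two of the inner tables, and the name info is illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of two of the label/value tables from the question.
html = """<table id="programmeDescriptor">
<tr><td>
<table class="borders">
<tr><th width="160">
Programme Code:
</th><td width="150">
CodeValue
</td></tr>
</table>
</td><td>
<table class="borders">
<tr><th width="160">
Mode of Delivery:
</th><td width="150">
Full Time
</td></tr>
</table>
</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

info = {}
for th in soup.find_all("th"):
    # The value, if any, is the next <td> sibling within the same row.
    value_cell = th.find_next_sibling("td")
    if value_cell is not None:
        info[th.get_text(strip=True)] = value_cell.get_text(strip=True)

print(info)
```

Headers without a sibling cell, such as a heading that spans the whole row, are skipped by the None check.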

BeautifulSoup scraping nested tables

I have been trying to scrape data from a website that uses a good number of tables. I have been going through the BeautifulSoup documentation as well as Stack Overflow, but I am still lost.
Here is the said table:
<form action="/rr/" class="form">
<table border="0" width="100%" cellpadding="2" cellspacing="0" align="left">
<tr bgcolor="#6699CC">
<td valign="top"><font face="arial"><b>Uesless Data</b></font></td>
<td width="10%"><br /></td>
<td align="right"><font face="arial">Uesless Data</font></td>
</tr>
<tr bgcolor="#DCDCDC">
<td> <input size="12" name="s" value="data:" onfocus=
"this.value = '';" /> <input type="hidden" name="d" value="research" />
<input type="submit" value="Date" /></td>
<td width="10%"><br /></td>
</tr>
</table>
</form>
<table border="0" width="100%">
<tr>
<td></td>
</tr>
</table><br />
<br />
<table border="0" width="100%">
<tr>
<td valign="top" width="99%">
<table cellpadding="2" cellspacing="0" border="0" width="100%">
<tr bgcolor="#A0B8C8">
<td colspan="6"><b>Data to be pulled</b></td>
</tr>
<tr bgcolor="#DCDCDC">
<td><font face="arial"><b>Data to be pulled</b></font></td>
<td><font face="arial"><b>Data to be pulled</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
<td align="center"><font face="arial"><b>Data to be pulled
</b></font></td>
</tr>
<tr>
<td>Data to be pulled</td>
<td align="center">Data to be pulled</td>
<td align="center">Data to be pulled</td>
<td align="center">Data to be pulled</td>
<td align="center"><br /></td>
</tr>
</table>
</td>
</tr>
</table>
There are quite a few tables, and none of them have any distinguishing ids or tags. My most recent attempt was:
table = soup.find('table', attrs={'border': '0', 'width': '100%'})
This pulls only the first empty table. I feel like the answer is simple and I am overthinking it.

If you're just looking for all of the tables, rather than the first one, you just want find_all instead of find.
If you're trying to find a particular table, like the one nested inside another one, and the page is using a 90s-style design that makes it impossible to find it via id or other attrs, the only option is to search by structure:
for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        # Found it!
        pass
And of course you can flatten this into a single comprehension if you really want to:
subtable = next(subtable for table in soup.find_all('table')
                for subtable in table.find_all('table'))
Notice that I left off the attrs. If every table on the page has a superset of the same attrs, you aren't helping anything by specifying them.
This whole thing is obviously ugly and brittle… but there's really no way not to be brittle with this kind of layout.
Using a different library, like lxml.html, that lets you search by XPath might make it a little more compact, but it's ultimately going to be doing the same thing.
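For a concrete picture of the structural search, here is a sketch that stays with BeautifulSoup rather than XPath: it keeps only the tables that have another table as an ancestor, then reads their rows. The markup is a shortened stand-in for the page in the question, and the cell texts are placeholders.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page: one empty table, one table nested in another.
html = """<table border="0" width="100%">
<tr><td></td></tr>
</table>
<table border="0" width="100%">
<tr>
<td valign="top" width="99%">
<table cellpadding="2" cellspacing="0" border="0" width="100%">
<tr bgcolor="#A0B8C8"><td colspan="6"><b>Header</b></td></tr>
<tr><td>Row value A</td><td align="center">Row value B</td></tr>
</table>
</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# A nested table is one that has another <table> among its ancestors.
inner_tables = [table for table in soup.find_all("table")
                if table.find_parent("table") is not None]

rows = [[td.get_text(" ", strip=True) for td in tr.find_all("td")]
        for tr in inner_tables[0].find_all("tr")]

print(rows)
```

Like the loops above, this depends on the page layout staying the same, so it is just as brittle if the nesting changes.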
