Extract 2 pieces of information from html in python - python

I need help figuring out how to extract Grab and the number following data-b. There are many <tr> elements in the complete, unmodified webpage, and I need to filter them using the "Need" that appears just before </a>. I've been trying to do this with Beautiful Soup, though it looks like lxml might work better. I can get either all of the <tr>s or only the <a>...</a> lines that contain Need, but not just the <tr>s that contain Need in that <a> line.
<tr >
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
<td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
<td class="text-right">7.38%</td>
<td class="text-right " >Recently</td>
</tr>
<tr >
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
<td class="text-right">Some more</td>
<td class="text-right " >Recently</td>
</tr>
Thanks for any help!

from bs4 import BeautifulSoup
data = '''<tr>
<td>3</td>
<td>Leave</td><td>Useless</td>
<td class="text-right"> <span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td>
<td class="text-right"> <span class="Float" data-a="3019" data-b="0.0635664" data-n="283">Garbage2</span></td>
<td class="text-right">7.38%</td>
<td class="text-right " >Recently</td>
</tr>
<tr>
<td>4</td>
<td>Grab</td><td>Need</td>
<td class="text-right"> <span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"> <span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td>
<td class="text-right">Some more</td>
<td class="text-right " >Recently</td>
</tr>
'''
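soup = BeautifulSoup(data, 'html.parser')
# The snippet above contains no <a> tags, so guard the lookup instead of indexing [0] blindly
links = soup.find_all('a', {"href": "/local"})
if links:
    print(links[0].text)
# Print data-b for every span whose class is "bloat" or "bloat2"
for a in soup.find_all('span', {"class": ["bloat", "bloat2"]}):
    print(a['data-b'])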
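To keep only the rows that contain Need, one option is to walk the <tr> elements and test the text of their cells. The following is a minimal sketch, assuming the marker text sits in its own <td> directly after the name cell, as in the snippet posted in the question; rows_matching is a hypothetical helper name, and the sample is wrapped in a <table> for illustration.

```python
from bs4 import BeautifulSoup

html = '''<table>
<tr><td>3</td><td>Leave</td><td>Useless</td>
<td class="text-right"><span class="float2" data-a="24608000.0" data-b="518" data-n="818">Garbage</span></td></tr>
<tr><td>4</td><td>Grab</td><td>Need</td>
<td class="text-right"><span class="bloat2" data="22435000.0" data-b="512" data-n="74491.2">More junk</span></td>
<td class="text-right"><span class="bloat" data-a="301.177" data-b="35.848" data-n="0.5848">More junk2</span></td></tr>
</table>'''


def rows_matching(html, marker):
    """Return (name, [data-b values]) for each row that has a cell equal to marker."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if marker not in cells:
            continue
        idx = cells.index(marker)
        # Assumption: the wanted name is the cell just before the marker cell
        name = cells[idx - 1] if idx > 0 else None
        # attrs={"data-b": True} matches any span that carries a data-b attribute
        data_b = [span["data-b"] for span in tr.find_all("span", attrs={"data-b": True})]
        results.append((name, data_b))
    return results


print(rows_matching(html, "Need"))
```

If the real page wraps the text in <a> tags, get_text() still returns the link text, so the same cell comparison should apply.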

Related

Element changes from display: none to display:block selenium python

I am trying to fill a form online using selenium and at some point I have to fill a date. I can't use send_keys() since it is not allowed by the page. Instead, when I click on the date field, it pops up a datepicker window that prompts to select the year, and I can do this successfully.
After I pick the year, the previous window is hidden and a new one that prompts for the month is displayed. This is done by switching the month window's style from display: none to display: block and the year window's style from display: block to display: none.
The problem is that although is_displayed() and is_enabled() both return True for the new window, is_displayed() returns False for the elements inside it, even though is_enabled() returns True for them.
I think I should refresh the DOM elements held by my driver, but driver.refresh() puts me back at step 0, where I have to pick the year again.
This is my code:
# Code for selecting year (Works)
dateWindow = driver.find_element_by_xpath('/html/body/div[9]/div[3]/table')
rows = dateWindow.find_elements_by_tag_name("tr")
rows[1].find_element_by_xpath('//span[text()="%s"]' % str_year).click()
# Code for selecting month (Does not work)
dateWindow = driver.find_element_by_xpath('/html/body/div[9]/div[2]/table')
rows = dateWindow.find_elements_by_tag_name("tr")
rows[1].find_element_by_xpath('//span[text()="%s"]' % str_month).click()
In the last line, I get this error:
selenium.common.exceptions.ElementNotInteractableException: Message: element not interactable
This is the html of the page before selecting the year:
<div class="datepicker-days" style="display: none;">
<table class=" table-condensed">
<thead>
<tr>
<th class="prev" style="visibility: visible;">«</th>
<th colspan="5" class="datepicker-switch">June 1993</th>
<th class="next" style="visibility: visible;">»</th>
</tr>
<tr>
<th class="dow">Su</th>
<th class="dow">Mo</th>
<th class="dow">Tu</th>
<th class="dow">We</th>
<th class="dow">Th</th>
<th class="dow">Fr</th>
<th class="dow">Sa</th>
</tr>
</thead>
<tbody>
<tr>
<td class="old day">30</td>
<td class="old day">31</td>
<td class="day">1</td>
<td class="day">2</td>
<td class="day">3</td>
<td class="day">4</td>
...
<td class="day">29</td>
<td class="day">30</td>
<td class="new day">1</td>
<td class="new day">2</td>
<td class="new day">3</td>
</tr>
<tr>
<td class="new day">4</td>
<td class="new day">5</td>
<td class="new day">6</td>
<td class="new day">7</td>
<td class="new day">8</td>
<td class="new day">9</td>
<td class="new day">10</td>
</tr>
</tbody>
<tfoot>
<tr>
<th colspan="7" class="today" style="display: none;">Today</th>
</tr>
<tr>
<th colspan="7" class="clear" style="display: none;">Clear</th>
</tr>
</tfoot>
</table>
</div>
<div class="datepicker-months" style="display: none;">
<table class="table-condensed">
<thead>
<tr>
<th class="prev" style="visibility: visible;">«</th>
<th colspan="5" class="datepicker-switch">1993</th>
<th class="next" style="visibility: visible;">»</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7">
<span class="month">Jan</span>
<span class="month">Feb</span>
<span class="month">Mar</span>
<span class="month">Apr</span>
<span class="month">May</span>
<span class="month">Jun</span>
<span class="month">Jul</span>
<span class="month">Aug</span>
<span class="month">Sep</span>
<span class="month">Oct</span>
<span class="month">Nov</span>
<span class="month">Dec</span>
</td>
</tr>
</tbody>
<tfoot>
<tr>
<th colspan="7" class="today" style="display: none;">Today</th>
</tr>
<tr>
<th colspan="7" class="clear" style="display: none;">Clear</th>
</tr>
</tfoot>
</table>
</div>
<div class="datepicker-years" style="display: block;">
<table class="table-condensed">
<thead>
<tr>
<th class="prev" style="visibility: visible;">«</th>
<th colspan="5" class="datepicker-switch">1990-1999</th>
<th class="next" style="visibility: visible;">»</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7">
<span class="year old">1989</span>
<span class="year">1990</span>
<span class="year">1991</span>
<span class="year">1992</span>
<span class="year">1993</span>
<span class="year active">1994</span>
<span class="year">1995</span>
<span class="year">1996</span>
<span class="year">1997</span>
<span class="year">1998</span>
<span class="year">1999</span>
<span class="year new">2000</span>
</td>
</tr>
</tbody>
<tfoot>
<tr>
<th colspan="7" class="today" style="display: none;">Today</th>
</tr>
<tr>
<th colspan="7" class="clear" style="display: none;">Clear</th>
</tr>
</tfoot>
</table>
</div>
This is the html of the page before selecting the month and after selecting the year:
<div class="datepicker-days" style="display: none;">
<table class=" table-condensed">
<thead>
<tr>
<th class="prev" style="visibility: visible;">«</th>
<th colspan="5" class="datepicker-switch">June 1993</th>
<th class="next" style="visibility: visible;">»</th>
</tr>
<tr>
<th class="dow">Su</th>
<th class="dow">Mo</th>
<th class="dow">Tu</th>
<th class="dow">We</th>
<th class="dow">Th</th>
<th class="dow">Fr</th>
<th class="dow">Sa</th>
</tr>
</thead>
<tbody>
<tr>
<td class="old day">30</td>
<td class="old day">31</td>
<td class="day">1</td>
<td class="day">2</td>
<td class="day">3</td>
<td class="day">4</td>
...
<td class="day">29</td>
<td class="day">30</td>
<td class="new day">1</td>
<td class="new day">2</td>
<td class="new day">3</td>
</tr>
<tr>
<td class="new day">4</td>
<td class="new day">5</td>
<td class="new day">6</td>
<td class="new day">7</td>
<td class="new day">8</td>
<td class="new day">9</td>
<td class="new day">10</td>
</tr>
</tbody>
<tfoot>
<tr>
<th colspan="7" class="today" style="display: none;">Today</th>
</tr>
<tr>
<th colspan="7" class="clear" style="display: none;">Clear</th>
</tr>
</tfoot>
</table>
</div>
<div class="datepicker-months" style="display: block;">
<table class="table-condensed">
<thead>
<tr>
<th class="prev" style="visibility: visible;">«</th>
<th colspan="5" class="datepicker-switch">1993</th>
<th class="next" style="visibility: visible;">»</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7">
<span class="month">Jan</span>
<span class="month">Feb</span>
<span class="month">Mar</span>
<span class="month">Apr</span>
<span class="month">May</span>
<span class="month">Jun</span>
<span class="month">Jul</span>
<span class="month">Aug</span>
<span class="month">Sep</span>
<span class="month">Oct</span>
<span class="month">Nov</span>
<span class="month">Dec</span>
</td>
</tr>
</tbody>
<tfoot>
<tr>
<th colspan="7" class="today" style="display: none;">Today</th>
</tr>
<tr>
<th colspan="7" class="clear" style="display: none;">Clear</th>
</tr>
</tfoot>
</table>
</div>
<div class="datepicker-years" style="display: none;">
<table class="table-condensed">
<thead>
<tr>
<th class="prev" style="visibility: visible;">«</th>
<th colspan="5" class="datepicker-switch">1990-1999</th>
<th class="next" style="visibility: visible;">»</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7">
<span class="year old">1989</span>
<span class="year">1990</span>
<span class="year">1991</span>
<span class="year">1992</span>
<span class="year">1993</span>
<span class="year active">1994</span>
<span class="year">1995</span>
<span class="year">1996</span>
<span class="year">1997</span>
<span class="year">1998</span>
<span class="year">1999</span>
<span class="year new">2000</span>
</td>
</tr>
</tbody>
<tfoot>
<tr>
<th colspan="7" class="today" style="display: none;">Today</th>
</tr>
<tr>
<th colspan="7" class="clear" style="display: none;">Clear</th>
</tr>
</tfoot>
</table>
</div>
Any ideas? Thanks in advance
The desired element is dynamic, so before selecting the month you have to induce WebDriverWait for element_to_be_clickable(). For example, using XPATH:
dateWindow = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[9]/div[2]/table")))
rows = dateWindow.find_elements(By.TAG_NAME, "tr")
# The leading "." keeps the search inside rows[1]; a bare "//" would search the whole document
rows[1].find_element(By.XPATH, './/span[text()="%s"]' % str_month).click()
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
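As a variation on the wait above, the month span itself can be targeted through the class names shown in the question's HTML rather than through an absolute path. This is only a sketch under assumptions: month_xpath and click_month are hypothetical helper names, the page is assumed to contain a single datepicker, and driver is an already-open WebDriver session.

```python
def month_xpath(month):
    """Build an XPath for a month label inside the datepicker-months panel."""
    return ('//div[contains(@class, "datepicker-months")]'
            '//span[contains(@class, "month") and normalize-space()="%s"]' % month)


def click_month(driver, month, timeout=20):
    """Wait until the month span is clickable, then click it."""
    # Imported here so the XPath helper can be used without selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, month_xpath(month))
    WebDriverWait(driver, timeout).until(EC.element_to_be_clickable(locator)).click()


print(month_xpath("Jun"))
```

Usage would be click_month(driver, str_month) right after the year has been clicked. If the page holds several datepickers, the locator would need to be narrowed to the one that is currently open.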

BS4 findall not returning all divs

I was trying to get to the bottom table on the site, but findAll() kept returning empty results, so I fetched all the divs on the same level one by one and noticed that the last two give me [].
the_page=urllib.request.urlopen("https://theunderminejournal.com/#eu/sylvanas/item/124105")
bsObj=BeautifulSoup(the_page,'html.parser')
test=bsObj.findAll('div',{'class':'page','id':"item-page"})
print(test)
I have gone through the bs4 object that I got, and the two divs I'm looking for aren't in it. What's happening?
The div I'm trying to extract is on https://theunderminejournal.com/#eu/sylvanas/item/124105
The content appears to be rendered by JavaScript after the page loads, so you will need to use selenium rather than a plain HTTP request with urllib.
Note that I couldn't post all of the output as the HTML parsed was huge.
Code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://theunderminejournal.com/#eu/sylvanas/item/124105")
bsObj = BeautifulSoup(driver.page_source,'html.parser')
test = bsObj.find('div', id='item-page')
print(test.prettify())
Output:
<div class="page" id="item-page" style="display: block;">
<div class="item-stats">
<table>
<tr class="available">
<th>
Available Quantity
</th>
<td>
<span>
30,545
</span>
</td>
</tr>
<tr class="spacer">
<td colspan="3">
</td>
</tr>
<tr class="current-price">
<th>
Current Price
</th>
<td>
<span class="money-gold">
27.34
</span>
</td>
</tr>
<tr class="median-price">
<th>
Median Price
</th>
<td>
<span class="money-gold">
30.11
</span>
</td>
</tr>
<tr class="mean-price">
<th>
Mean Price
</th>
<td>
<span class="money-gold">
30.52
</span>
</td>
</tr>
<tr class="standard-deviation">
<th>
Standard Deviation
</th>
<td>
<span class="money-gold">
.
.
.
</span>
</abbr>
</td>
</tr>
</table>
</div>
</div>
</div>
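One likely reason the plain urllib request misses the item divs is that everything after the # in the URL is a fragment, which is never sent to the server; the item is selected by script running in the browser. The short sketch below only illustrates that split with the standard library and makes no network request.

```python
from urllib.parse import urlsplit

url = "https://theunderminejournal.com/#eu/sylvanas/item/124105"
parts = urlsplit(url)

# The fragment stays on the client; only the rest of the URL is requested
requested = parts._replace(fragment="").geturl()

print(requested)       # what an HTTP client actually asks the server for
print(parts.fragment)  # what only in-browser JavaScript gets to see
```

A real browser driven by selenium runs that script, which is why driver.page_source contains the rendered div.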

Some <td>'s Cannot Be Found by find_next()

This is a question about scraping with BS4. The website I'm scraping has barely any IDs on the elements I need, so I'm relying on find_next(), find_next_siblings(), or other iterator-style BS4 methods.
I used find_next() to get some td values from my tables. It works for some values, but for others it can't find them.
Here's the html:
<table style="max-width: 350px;" border="0">
<tbody><tr>
<td style="max-width: 215px;">REF. NO.</td>
<td style="max-width: 12px;" align="center"> </td>
<td align="right">000124 </td>
</tr>
<tr>
<td>REF. NO.</td>
<td align="center"> </td>
<td align="right"> </td>
</tr>
<tr>
<td>MANU</td>
<td align="center"> </td>
<td align="right"></td>
</tr>
<tr>
<td>STREAK</td>
<td align="center"> </td>
<td align="right">1075</td>
</tr>
<tr>
<td>PACK</td>
<td align="center"> </td>
<td align="right">1</td>
</tr>
<tr>
<td colspan="3">ON STOCK. </td>
</tr>
.... and so on
So I used this code to get what I want:
div = soup.find('div', {'id': 'infodata'})
table_data = div.find_all('td')
for element in table_data:
    if "STREAK" in element.get_text():
        price= element.find_next('td').find_next('td').text
        print(price+ "price")
    else:
        print('NOT FOUND!')
I copied and pasted the text from the HTML many times to make sure I didn't mistype anything, but it still always ends up at NOT FOUND. If I try other row labels, such as PACK, I can get them.
By the way, I'm using two find_next() calls because the HTML has three td's in every <tr>.
Why does this work for some words but not for others? Any help is appreciated. Thank you very much!
I would rewrite it like this:
trs = div.find_all('tr')
for tr in trs:
    tds = tr.select('td')
    if len(tds) > 1 and 'STREAK' in tds[0].get_text().strip():
        price = tds[-1].get_text().strip()
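The row-wise approach above can be tried end to end on a cut-down copy of the table. This is a minimal sketch: value_for is a hypothetical helper name, and the surrounding <div id="infodata"> is assumed from the question's code rather than shown in its HTML.

```python
from bs4 import BeautifulSoup

html = '''<div id="infodata"><table border="0"><tbody>
<tr><td>REF. NO.</td><td align="center"> </td><td align="right">000124 </td></tr>
<tr><td>MANU</td><td align="center"> </td><td align="right"></td></tr>
<tr><td>STREAK</td><td align="center"> </td><td align="right">1075</td></tr>
<tr><td>PACK</td><td align="center"> </td><td align="right">1</td></tr>
<tr><td colspan="3">ON STOCK. </td></tr>
</tbody></table></div>'''


def value_for(soup, label):
    """Return the last cell of the first row whose first cell equals label."""
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td", recursive=False)  # only this row's own cells
        if len(tds) > 1 and tds[0].get_text(strip=True) == label:
            return tds[-1].get_text(strip=True)
    return None  # label not present in any row


soup = BeautifulSoup(html, "html.parser")
print(value_for(soup, "STREAK"))
print(value_for(soup, "PACK"))
```

Working per row avoids chaining find_next() calls, which walk the whole document and can drift into the following row when a cell is missing.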

Options for using BeautifulSoup with basic table - no class ids,

Is there a recommended way to use BeautifulSoup 4 in Python when you have a table with no class or attribute values?
I was considering just using get_text() to dump the text out, but if I wanted to pick out individual values or break the table into more discrete sections, how would I go about it?
<table cellpadding="0" cellspacing="0" id="programmeDescriptor" width="100%">
<tr>
<td>
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<th colspan="1">
Awards
</th>
</tr>
<tr>
</tr>
<tr>
<td>
Ordinary Bachelor Degree
</td>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
Programme Code:
</th>
<td width="150">
CodeValue
</td>
</tr>
</table>
</td>
<td width="5">
</td>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
Mode of Delivery:
</th>
<td width="150">
Full Time
</td>
</tr>
</table>
</td>
<td width="5">
</td>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
No. of Semesters:
</th>
<td width="150">
6
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
NFQ Level:
</th>
<td width="150">
7
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td>
<table cellpadding="5" cellspacing="0" class="borders">
<tr>
<th width="160">
Embedded Award:
</th>
<td width="150">
No
</td>
</tr>
</table>
</td>
</tr>
</table>
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<th width="160">
Department:
</th>
<td>
Computing
</td>
</tr>
</table>
<div class="pageBreak">
</div>
<h3>
Programme Outcomes
</h3>
<p class="info">
On successful completion of this programme the learner will be able to :
</p>
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<th width="30">
PO1
</th>
<td class="head" colspan="2">
Knowledge - Breadth
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO2
</th>
<td class="head" colspan="2">
Knowledge - Kind
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO3
</th>
<td class="head" colspan="2">
Skill - Range
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO4
</th>
<td class="head" colspan="2">
Skill - Selectivity
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO5
</th>
<td class="head" colspan="2">
Competence - Context
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>Some block of text </td>
</tr>
<tr>
<th width="30">
PO6
</th>
<td class="head" colspan="2">
Competence - Role
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO7
</th>
<td class="head" colspan="2">
Competence - Learning to Learn
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• Some block of text
</td>
</tr>
<tr>
<th width="30">
PO8
</th>
<td class="head" colspan="2">
Competence - Insight
</td>
</tr>
<tr>
<td class="head" width="30">
</td>
<td class="head" width="30">
(a)
</td>
<td>
• The graduate will demonstrate the ability to specify, design and build an IT system or research & report on a current IT topic
</td>
</tr>
</table>
<div class="pageBreak">
</div>
<h3>
Semester Schedules
</h3>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 1 / Semester 1
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3897" target="_blank">
Web & User Experience
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3881" target="_blank">
Software Development 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1645" target="_blank">
Computer Architecture
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2328" target="_blank">
Discrete Mathematics 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3848" target="_blank">
Business & Information Systems
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2054" target="_blank">
Learning to Learn at Third Level
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 1 / Semester 2
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3886" target="_blank">
Software Development 2
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3895" target="_blank">
Object Oriented Systems Analysis
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3875" target="_blank">
Database Fundamentals
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3874" target="_blank">
Operating Systems Fundamentals
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2330" target="_blank">
Statistics
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2527" target="_blank">
Social Media Communications
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<div class="pageBreak">
</div>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 2 / Semester 1
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3877" target="_blank">
Web & Mobile Design & Development
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3876" target="_blank">
Database Design And Programming
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3869" target="_blank">
Software Development 3
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3873" target="_blank">
Software Quality Assurance and Testing
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3629" target="_blank">
Networking 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2477" target="_blank">
Discrete Mathematics 2
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 2 / Semester 2
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3862" target="_blank">
Project
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3911" target="_blank">
Object Oriented Analysis & Design 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3877" target="_blank">
Web & Mobile Design & Development
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3630" target="_blank">
Networking 2
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3870" target="_blank">
Software Development 4
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2476" target="_blank">
Management Science
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<div class="pageBreak">
</div>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 3 / Semester 1
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3911" target="_blank">
Object Oriented Analysis & Design 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3899" target="_blank">
Operating Systems
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1721" target="_blank">
Cloud Services & Distributed Computing
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2580" target="_blank">
Innovation & Entrepreneurship
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3878" target="_blank">
Web Application Development
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1689" target="_blank">
Algorithms and Data Structures 1
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2025" target="_blank">
Logic and Problem Solving
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3896" target="_blank">
Advanced Databases
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td colspan="2">
<h4>
Stage 3 / Semester 2
</h4>
</td>
</tr>
<tr>
<td colspan="2">
<table cellpadding="5" cellspacing="0" class="borders" width="100%">
<tr>
<td class="head" colspan="2">
Mandatory
</td>
</tr>
<tr>
<th width="50">
Module Code
</th>
<th>
Module Title
</th>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2465" target="_blank">
Project
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1728" target="_blank">
Algorithms and Data Structures 2
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1675" target="_blank">
Network Management
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2025" target="_blank">
Logic and Problem Solving
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/3899" target="_blank">
Operating Systems
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/2580" target="_blank">
Innovation & Entrepreneurship
</a>
</td>
</tr>
<tr>
<td>
Code
</td>
<td>
<a href="index.cfm/page/module/moduleId/1679" target="_blank">
Object Oriented Analysis & Design 2
</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
First of all, the table, parent of all tables, has an id attribute - let's make it the base for the search:
super_table = soup.find("table", id="programmeDescriptor")
Then, according to what you've mentioned in the comment, it looks like you can distinguish the inner tables from one another by their headers. One option to implement this logic would be to find the header and then use find_parent() to find the parent table:
def get_table_by_header_name(super_table, header):
    return super_table.find("th", text=header).find_parent("table")
Usage:
desired_table = get_table_by_header_name(super_table, "Awards")
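If the header cells carry surrounding whitespace, as they do in the HTML posted in the question, an exact text match may find nothing. The following is a hedged variant that compares the stripped text instead; find_table_by_header is a hypothetical name, and the sample is a cut-down copy of the question's markup.

```python
from bs4 import BeautifulSoup

html = '''<table id="programmeDescriptor"><tr><td>
<table class="borders"><tr><th colspan="1">
Awards
</th></tr><tr><td>
Ordinary Bachelor Degree
</td></tr></table>
<table class="borders"><tr><th width="160">
Department:
</th><td>
Computing
</td></tr></table>
</td></tr></table>'''


def find_table_by_header(super_table, header):
    """Return the nearest table around the <th> whose stripped text equals header."""
    for th in super_table.find_all("th"):
        if th.get_text(strip=True) == header:
            return th.find_parent("table")
    return None  # no such header


soup = BeautifulSoup(html, "html.parser")
super_table = soup.find("table", id="programmeDescriptor")
awards = find_table_by_header(super_table, "Awards")
print(awards.find("td").get_text(strip=True))
```

Returning None for a missing header lets the caller decide what to do, instead of failing with an AttributeError on the chained call.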
You can also iterate over specific tags. I don't know exactly what you would like to do, but if you want the text of every <th> tag, just iterate over them and call get_text() on each.
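Iterating over the <th> tags can also pair each header with the value cell next to it. This is a sketch under the assumption that a value, when present, is the <td> sibling that follows the <th> in the same row; header_value_pairs is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

html = '''<table>
<tr><th colspan="1">
Awards
</th></tr>
<tr><th width="160">
Programme Code:
</th><td width="150">
CodeValue
</td></tr>
<tr><th width="160">
NFQ Level:
</th><td width="150">
7
</td></tr>
</table>'''


def header_value_pairs(soup):
    """Map each <th> label to the text of the <td> that follows it in the same row."""
    pairs = {}
    for th in soup.find_all("th"):
        td = th.find_next_sibling("td")
        if td is not None:  # headers such as "Awards" have no value cell beside them
            label = th.get_text(strip=True).rstrip(":")
            pairs[label] = td.get_text(strip=True)
    return pairs


pairs = header_value_pairs(BeautifulSoup(html, "html.parser"))
print(pairs)
```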

BeautifulSoup Parsing with Bad HTML Tables

I'm trying to parse tables similar to the following with BeautifulSoup to extract the name, age, and position for each person.
<TABLE width="100%" align="center" cellspacing="0" cellpadding="0" border="0">
<TR>
<TD></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><FONT size="2"><B>Age as of</B></FONT></TD>
<TD></TD>
<TD></TD>
</TR>
<TR>
<TD align="center" nowrap><FONT size="2"><B>Name</B></FONT></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><FONT size="2"><B>November 1, 1999</B></FONT></TD>
<TD></TD>
<TD align="center" nowrap><FONT size="2"><B>Position</B></FONT></TD>
</TR>
<TR>
<TD align="center" nowrap><HR size="1"></TD>
<TD></TD>
<TD align="center" nowrap colspan="3"><HR size="1"></TD>
<TD></TD>
<TD align="center" nowrap><HR size="1"></TD>
</TR>
<TR>
<TD align="left" valign="top"><FONT size="2">
Terry S. Jacobs</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">57</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
Chairman of the Board, Chief Executive Officer, Treasurer and
director</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
William L. Stakelin</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">56</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
President, Chief Operating Officer, Secretary and director</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
Joel M. Fairman</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">70</FONT></TD>
<TD></TD>
<TD></TD>
<TD align="left" valign="top"><FONT size="2">
Vice Chairman and director</FONT></TD>
</TR>
</TABLE>
My current attempt is as follows:
soup = BeautifulSoup(in_file)
out = []
headers = soup.findAll(['td','th'])
for header in headers:
    if header.find(text = re.compile(r"^age( )?", re.I)):
        out.append(header)
table = out[0].find_parent("table")
rows = table.findAll('tr')
filter_regex = re.compile(r'[\w][\w .,]*', re.I)
data = [[td.find(text=filter_regex) for td in tr.findAll("td")] for tr in rows]
Things work fine for the first person, but the bad <tr><td><tr><td>... lines really mess things up from there. I am trying to do this for a few thousand HTML files, each with a slightly different table structure. That said, unclosed <tr> and <td> tags like these appear to be quite common across the files.
Anyone have thoughts on how to generalize the above parsing to work with tables that have constructs such as these? Thanks a lot!
You can take advantage of the fact that the valign attribute is set to top in all of the fields you'd like to keep and none of the ones you don't:
soup = BeautifulSoup(in_file)
cells = [cell.text.strip() for cell in soup('td', valign='top')]
Then you can sort this list of cells into a two-dimensional structure. There are three cells per entry, so you can sort it out pretty simply by doing something like this:
entries = []
for i in range(0, len(cells), 3):
    entries.append(cells[i:i+3])
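Put together on a shortened copy of the table, the two steps above look roughly like this. It is a minimal sketch that assumes the built-in 'html.parser' and that every person contributes exactly three valign="top" cells; the position texts are abbreviated for illustration.

```python
from bs4 import BeautifulSoup

html = '''<TABLE>
<TR>
<TD align="left" valign="top"><FONT size="2">
Terry S. Jacobs</FONT></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">57</FONT></TD>
<TD align="left" valign="top"><FONT size="2">
Chairman of the Board</FONT></TD>
</TR>
<TR><TD><TR><TD><TR><TD><TR><TD>
<TR>
<TD align="left" valign="top"><FONT size="2">
William L. Stakelin</FONT></TD>
<TD></TD>
<TD align="right" valign="top" nowrap><FONT size="2">56</FONT></TD>
<TD align="left" valign="top"><FONT size="2">
President</FONT></TD>
</TR>
</TABLE>'''

soup = BeautifulSoup(html, "html.parser")

# The unclosed filler cells carry no valign attribute, so they are skipped
cells = [cell.text.strip() for cell in soup("td", valign="top")]

# Group the flat list into [name, age, position] triples
entries = [cells[i:i + 3] for i in range(0, len(cells), 3)]
print(entries)
```

If a file ever has a person with a missing cell, the fixed stride of three would misalign every later entry, so checking len(cells) % 3 == 0 first may be worthwhile.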
On the off chance anyone else gets stuck with this issue and stumbles in here, the modern solution is to change which parser you are using. Python's built-in parser, 'html.parser', is pretty good when working with close-enough HTML that has properly closed tags, but the second you have to deal with edge cases (like Example 1 below, which is similar to the OP's issue), it still breaks down even 8 years later (see Example 2 below).
In the documentation for BeautifulSoup4 (current version 4.9.3), there is a section detailing parser selection: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
Example 1, the raw HTML:
<TABLE >
<TR VALIGN="top">
<td> <td><b>Title:</b>
<td> title is here <i>-subtitle</i><br>
<TR VALIGN="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
<TR VALIGN="top">
<td> <td><b>Type:</b>
<td> 61
<TR VALIGN="top">
<td>
<td><b>Status:</b>
<td> ACTIVE - ACTIVE
</TABLE>
Example 2, results when using BeautifulSoup(html, 'html.parser'):
<table>
<tr valign="top">
<td> <td><b>Title:</b>
<td> title is here <i>-subtitle</i><br/>
<tr valign="top">
<td>
<td><b>Date:</b>
<td> Thursday , August 27th, 2020
<tr valign="top">
<td> <td><b>Type:</b>
<td> 61
<tr valign="top">
<td>
<td><b>Status:</b>
<td> ACTIVE - ACTIVE
</td></td></td></tr></td></td></td></tr></td></td></td></tr></td></td></td></tr></table>
Example 3, results when using BeautifulSoup(html, 'html5lib'):
<table>
<tbody><tr valign="top">
<td> </td><td><b>Title:</b>
</td><td> title is here <i>-subtitle</i><br/>
</td></tr><tr valign="top">
<td>
</td><td><b>Date:</b>
</td><td> Thursday , August 27th, 2020
</td></tr><tr valign="top">
<td> </td><td><b>Type:</b>
</td><td> 61
</td></tr><tr valign="top">
<td>
</td><td><b>Status:</b>
</td><td> ACTIVE - ACTIVE
</td></tr></tbody></table>
There are also parsers written in C, such as 'lxml', which according to the documentation is much faster.
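The difference between Example 2 and Example 3 can be checked programmatically by counting how many rows end up nested inside the first row. The sketch below uses a shortened version of Example 1 and only runs the built-in parser; rows_nested_in_first_row is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened version of Example 1: no </td> or </tr> closing tags at all
html = '''<TABLE>
<TR VALIGN="top">
<td> <td><b>Title:</b>
<td> title is here
<TR VALIGN="top">
<td> <td><b>Type:</b>
<td> 61
<TR VALIGN="top">
<td> <td><b>Status:</b>
<td> ACTIVE - ACTIVE
</TABLE>'''


def rows_nested_in_first_row(html, parser):
    """Count <tr> elements that the parser placed inside the first <tr>."""
    soup = BeautifulSoup(html, parser)
    return len(soup.find("tr").find_all("tr"))


# html.parser never closes the open cells, so later rows nest inside the first
print(rows_nested_in_first_row(html, "html.parser"))
```

Passing 'html5lib' or 'lxml' as the parser argument (each needs its own package installed) should bring the count down to zero, because those parsers close the open cells and rows before starting a new row.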
