Extracting data from a table using Python Beautiful soup - python

I'm trying to parse rows within a table (the departure board times) from the following:
buscms_widget_departureboard_ui_displayStop_Callback("
<div class='\"livetimes\"'>
<table class='\"busexpress-clientwidgets-departures-departureboard\"'>
<thead>
<tr class='\"rowStopName\"'>
<th colspan='\"3\"' data-bearing='\"SW\"' data-lat='\"51.7505683898926\"' data-lng='\"-1.225102186203\"' title='\"oxfajmwg\"'>
Divinity Road
</th>
<tr>
<tr class='\"textHeader\"'>
<th colspan='\"3\"'>
text 69325694 to 84637 for live times
</th>
<tr>
<tr class='\"rowHeaders\"'>
<th>
service
</th>
<th>
destination
</th>
<th>
time
</th>
<tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</thead>
<tbody>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 21:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"5'>
5 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:11:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"27'>
27 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4 (OBC)
</td>
<td class='\"colDestination\"' title='\"Abingdon\"'>
Abingdon
</td>
<td 22:29:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"22:29\"'>
22:29
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"65'>
65 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 23:09:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"23:09\"'>
23:09
</td>
</tr>
</tbody>
</table>
</div>
<div class='\"scrollmessage_container\"'>
<div class='\"scrollmessage\"'>
</div>
</div>
<div class='\"services\"'>
<a class='\"service' href='\"#\"' onclick="\"serviceNameClick('');\"" selected\"="">
all
</a>
<a class='\"service\"' href='\"#\"' onclick="\"serviceNameClick('4');\"">
4
</a>
</div>
<div class="dptime">
<span>
times generated at:
</span>
<span>
21:43
</span>
</div>
");
In particular, I'm trying to extract all the departure times - so I'd like to capture the minutes from departure - for example 12 minutes away.
I have the following code:
# import libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
I'm not sure how to find the minutes from departure in the above. Is it something like:
minutes_from_depart = soup.find("tbody", attrs={'td': 'mins'})

Could you try this?
import urllib.request
from bs4 import BeautifulSoup
import re
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
page = urllib.request.urlopen(quote_page).read()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())
minutes = soup.find_all("td", class_=re.compile(r"colDepartureTime"))
for elements in minutes:
    print(elements.getText())

So I got to my answer with the following code - which was actually quite easy once I had played around with the soup.find_all function:
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('td', class_='\\"colDepartureTime\\"'):
    print(link.get_text())
I get the following output:
10:40
10 mins
21 mins
30 mins
40 mins
50 mins
60 mins
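
The backslashes in the class name come from the JSONP wrapper: the HTML arrives as an escaped JavaScript string, so the parser sees the class value with the backslashes and quotes included. A minimal sketch of an alternative is to strip the callback wrapper and unescape the quotes before parsing, so the plain class name can be used. The sample string and the helper name extract_departures are illustrative assumptions, and the real response may contain other escape sequences that this does not handle.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the JSONP response shown in the question.
raw = (
    'buscms_widget_departureboard_ui_displayStop_Callback("'
    '<table><tbody>'
    '<tr class=\\"rowServiceDeparture\\">'
    '<td class=\\"colServiceName\\">4A (OBC)</td>'
    '<td class=\\"colDepartureTime\\" title=\\"5 mins\\">5 mins</td>'
    '</tr>'
    '<tr class=\\"rowServiceDeparture\\">'
    '<td class=\\"colServiceName\\">4 (OBC)</td>'
    '<td class=\\"colDepartureTime\\" title=\\"22:29\\">22:29</td>'
    '</tr>'
    '</tbody></table>'
    '");'
)


def extract_departures(jsonp_text):
    """Return the departure-time cell texts from a JSONP-wrapped HTML string."""
    # Keep only what sits between the opening '("' and the closing '")'.
    start = jsonp_text.index('("') + 2
    end = jsonp_text.rindex('")')
    # Turn the escaped quotes (\") back into plain quotes.
    html = jsonp_text[start:end].replace('\\"', '"')
    soup = BeautifulSoup(html, 'html.parser')
    cells = soup.find_all('td', class_='colDepartureTime')
    return [cell.get_text(strip=True) for cell in cells]


departures = extract_departures(raw)
print(departures)
```

With a live response, the text passed in would be the decoded body, for example urllib.request.urlopen(quote_page).read().decode('utf-8').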

Related

How to get a text of certain elements BeautifulSoup Python

I have this kind of html code
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
I need to get the text of every 3rd and 5th td of every tr
Apparently this doesn't work:)
from bs4 import BeautifulSoup
import index
soup = BeautifulSoup(index.index_doc, 'lxml')
for i in soup.find_all('tr')[2:]:
    print(i[2].text, i[4].text)
You could use CSS selectors and the pseudo-class :nth-of-type() to select your elements (I assumed you need the date, so I selected the 6th td):
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
And to get a list of (name, date) tuples, pair each even-indexed item with the item that follows it:
list(zip(data[::2], data[1::2]))
Example
from bs4 import BeautifulSoup
html = '''
<tr>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>
Name Name Name
</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
<td class="a">
<p>
<sup>25.01.1980</sup>
</p>
</td>
<td class="a">...</td>
<td class="a">...</td>
</tr>
<tr>...</tr>
<tr>...</tr>
'''
soup = BeautifulSoup(html, 'html.parser')
data = [e.get_text(strip=True) for e in soup.select('tr td:nth-of-type(3),tr td:nth-of-type(6)')]
list(zip(data[::2], data[1::2]))
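
The attempt in the question fails because indexing a Tag with i[2] looks up an attribute rather than a child cell. A row-by-row sketch that collects each row's own td elements and indexes that list could look like this. The sample rows and names are made up for illustration, and rows with fewer than six cells are simply skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two full rows and one short row.
html = """
<table>
<tr><td>a</td><td>b</td><td><p><sup> Name One </sup></p></td><td>d</td><td>e</td><td><p><sup>25.01.1980</sup></p></td></tr>
<tr><td>a</td><td>b</td><td><p><sup>Name Two</sup></p></td><td>d</td><td>e</td><td><p><sup>01.02.1990</sup></p></td></tr>
<tr><td>too short</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # recursive=False keeps only the cells that belong to this row.
    cells = tr.find_all("td", recursive=False)
    if len(cells) >= 6:
        # Index 2 is the 3rd cell, index 5 is the 6th cell.
        rows.append((cells[2].get_text(strip=True), cells[5].get_text(strip=True)))

print(rows)
```

Because each tuple is built inside one row, a row that lacks one of the cells cannot shift the pairing for the rows after it.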

How to read the data from HTML file and write the data to CSV file using python?

I have an .html report file that contains data in tables, along with pass/fail criteria. I want to write this data to a .csv file using Python 3.
How should I proceed?
For example, the data looks like this:
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
Assuming you know the header and really only need the associated percentage, with bs4 4.7.1 you can use :contains to target the header cell and then take the next td (newer Soup Sieve releases spell this pseudo-class :-soup-contains). You would read your HTML from the file into the html variable shown.
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
'''
soup = bs(html, 'lxml') # 'html.parser' if lxml not installed
header = 'Test Sequence Work Progress'
result = soup.select_one('td:contains("' + header + '") + td').text
df = pd.DataFrame([result], columns = [header])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )

from bs4 import BeautifulSoup
path = "my.html"  # add the path of your local file here
with open(path) as source:
    soup = BeautifulSoup(source, 'html.parser')
with open('out.csv', 'w', encoding='utf-8') as out:
    for link in soup.find_all('p'):  # add the tag which you want to extract
        out.write(link.get_text())
        out.write('\n')
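
Neither snippet above writes whole table rows, so here is a hedged sketch of one way to do that with the standard csv module. It assumes the outer table has class "coverage" as in the question, uses a shortened made-up copy of the HTML, and writes to an in-memory buffer instead of a real file. Cells that only hold the nested metricbar table come out empty.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened, hypothetical version of the report table from the question.
html = """
<table class="coverage">
<tr><th colspan="2">Metric</th><th colspan="2">Percentage</th><th>Target</th><th>Total</th></tr>
<tr><td colspan="2">Test Sequence Work Progress</td><td>100.0%</td>
<td><table class="metricbar"><tr><td class="covreached"></td></tr></table></td>
<td>100%</td><td>24</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="coverage")

rows = []
# recursive=False skips the rows and cells of the nested metricbar table.
for tr in table.find_all("tr", recursive=False):
    cells = tr.find_all(["th", "td"], recursive=False)
    rows.append([cell.get_text(" ", strip=True) for cell in cells])

# Write to a buffer here; open('out.csv', 'w', newline='') would write a file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

The recursive=False calls rely on html.parser leaving the rows as direct children of the table; a parser that inserts a tbody element would need the rows looked up under that element instead. The header row has fewer cells than the data row because of its colspan attributes, which this sketch does not expand.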

Finding test case and result with BeautifulSoup

I need a good way to find the names of all test cases and the result for every test case in an html file. I'm new to BeautifulSoup and need some good advice.
First I have done this, using BeautifulSoup to read the data and prettify it and put the data in a file:
from bs4 import BeautifulSoup
f = open('myfile','w')
soup = BeautifulSoup(open("C:\DEV\debugkod\data.html"))
fixedSoup = soup.prettify()
fixedSoup = fixedSoup.encode('utf-8')
f.write(fixedSoup)
f.close()
When I check parts in the prettify result in the file it will for example look like this (the file includes 100s of tc's and results):
<a name="1005">
</a>
<div class="Sequence">
<div class="Header">
<table class="Title">
<tr>
<td>
IAA REQPROD 55 InvPwrDownMode - Shut down communication (Sequence)
</td>
<td class="ResultStateIcon">
<img src="Resources/Passed.png"/>
</td>
</tr>
</table>
<table class="DynamicAttributes">
<colgroup>
<col width="20">
<col width="30">
<col width="20">
<col width="30">
</col>
</col>
</col>
</col>
</colgroup>
<tr>
<th>
Start time:
</th>
<td>
2014/09/23 09-24-31
</td>
<th>
Stop time:
</th>
<td>
2014/09/23 09-27-25
</td>
</tr>
<tr>
<th>
Execution duration:
</th>
<td>
173.461 sec.
</td>
*<th>
Name:
</th>
<td>
IAA REQPROD 55 InvPwrDownMode - Shut down communication
</td>*
</tr>
<tr>
<th>
Library link:
</th>
<td>
</td>
<th>
Creation date:
</th>
<td>
2013/4/11, 8-55-57
</td>
</tr>
<tr>
<th>
Modification date:
</th>
<td>
2014/9/23, 9-27-25
</td>
<th>
Author:
</th>
<td>
cnnntd
</td>
</tr>
<tr>
<th>
Hierarchy:
</th>
<td>
IAA. IAA REQPROD 55 InvPwrDownMode - Shut down communication
</td>
<td>
</td>
<td>
</td>
</tr>
</table>
<table class="StaticAttributes">
<colgroup>
<col width="20">
<col width="80">
</col>
</col>
</colgroup>
<tr>
<th>
Description:
</th>
<td>
</td>
</tr>
<tr>
<th>
*Result state:
</th>
<td>
Passed
</td>*
</tr>
</table>
</div>
<div class="BlockReport">
<a name="1007">
In this file I now want to find the info about "Name:" and "Result state:". If I check the prettify result I can see the labels "Name:" and "Result state:". Hopefully it is possible to use them to find the test case name and test result, so the printout should look something like this:
Name = IAA REQPROD 55 InvPwrDownMode - Shut down communication
Result = Passed
etc
Does anyone know how to do this using BeautifulSoup?
Using the html from your second Pastebin link, the following code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("beautifulsoup2.html"))
names = []
for table in soup.findAll('table', attrs={'class': 'Title'}):
    td = table.find('td')
    names.append(td.text.encode("ascii", "ignore").strip())
results = []
for table in soup.findAll(attrs={'class': 'StaticAttributes'}):
    tds = table.findAll('td')
    results.append(tds[1].text.strip())
for name, result in zip(names, results):
    print "Name = {}".format(name)
    print "Result = {}".format(result)
    print
Gives this result:
Name = IEM(Project)
Result = PassedFailedUndefinedError
Name = IEM REQPROD 132765 InvPwrDownMode - Shut down communication SN1(Sequence)
Result = Passed
Name = IEM REQPROD 86434 InvPwrDownMode - Time from shut down to sleep SN2(Sequence)
Result = PassedUndefined
Name = IEM Test(Sequence)
Result = Failed
Name = IEM REQPROD 86434 InvPwrDownMode - Time from shut down to sleep(Sequence)
Result = Error
I added the encode("ascii", "ignore") because otherwise I would get UnicodeDecodeError's. See this answer for how those characters possibly ended up in your html.
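
Pairing two separately collected lists with zip can drift if one block lacks a table. A sketch of an alternative is to look up the "Name:" and "Result state:" labels inside each Sequence block and read the td that follows each label. The sample HTML is a cut-down, made-up version of the prettified output, value_for is a hypothetical helper, and blocks that are not div.Sequence (such as project-level ones) are not covered.

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical version of one test-case block from the question.
html = """
<div class="Sequence"><div class="Header">
<table class="DynamicAttributes">
<tr><th> Execution duration: </th><td> 173.461 sec. </td>
<th> Name: </th><td> IAA REQPROD 55 InvPwrDownMode - Shut down communication </td></tr>
</table>
<table class="StaticAttributes">
<tr><th> Description: </th><td> </td></tr>
<tr><th> Result state: </th><td> Passed </td></tr>
</table>
</div></div>
"""


def value_for(block, label):
    """Return the text of the td that follows the th whose text equals label."""
    for th in block.find_all("th"):
        if th.get_text(strip=True) == label:
            td = th.find_next_sibling("td")
            return td.get_text(strip=True) if td else None
    return None


soup = BeautifulSoup(html, "html.parser")
results = [
    (value_for(seq, "Name:"), value_for(seq, "Result state:"))
    for seq in soup.find_all("div", class_="Sequence")
]

for name, result in results:
    print("Name = {}".format(name))
    print("Result = {}".format(result))
```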

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the ‘Model Type’ values on a company’s webpages, which are built from code like the below.
It forms 2 tables shown on the webpage, side by side.
Updated source code of the webpage:
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using the following, however it doesn’t get the ‘VIP QB662FG’ I want:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
How can I get it? I'm using Python 2.7.
You are looking for the next row, then the cell in the same position in that row. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
value = header_text.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
Find all the tr elements and print the third cell of each one, skipping the first (header) row:
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data, 'html.parser')
# The sample above has no enclosing <table>, so search the whole document here;
# on the full page, start from soup.find('table', {'class':'tableforms'}) instead.
for i, tr in enumerate(soup.find_all('tr')):
    if i > 0:  # skip the header row
        for idx, td in enumerate(tr.find_all('td')):
            if idx == 2:
                print(td.get_text().replace('(Registered)', '').strip())
I think you can do as follows:
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print dico['Model Type']
It prints : u'VIP QB662FG'
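
The answers above hard-code either the column number or the row numbers. A sketch that locates the header cell by its text and reads the cell at the same index in the next row avoids both. The helper name column_value and the trimmed sample HTML are illustrative assumptions, and the code assumes the value row has at least as many cells as the header row.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the table from the question.
html = """<table class="tableforms"><tbody>
<tr class="tableheader"><td colspan="4">Remarks</td></tr>
<tr class="tableheader"><td> </td><td>Group </td><td>Model Type </td><td>Design Year </td></tr>
<tr class="row1"><td> </td><td class="row1">South West</td><td>VIP QB662FG (Registered) </td><td>2013 (Registered) </td></tr>
</tbody></table>"""


def column_value(soup, header):
    """Return the text below the header cell, taken from the following row."""
    for row in soup.find_all("tr"):
        labels = [td.get_text(strip=True) for td in row.find_all("td", recursive=False)]
        if header in labels:
            value_row = row.find_next_sibling("tr")
            if value_row is None:
                return None
            cells = value_row.find_all("td", recursive=False)
            # Same position in the value row as the header had in its row.
            return cells[labels.index(header)].get_text(strip=True)
    return None


soup = BeautifulSoup(html, "html.parser")
model = column_value(soup, "Model Type").replace("(Registered)", "").strip()
print(model)
```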

scraping tables with beautifulsoup

I seem to be stuck. If I had the following table:
<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
What would be the ideal way to use find_all to pull the 44,000 td from the table?
If the value always sits at the same position in the table, I would use Beautiful Soup to extract all td elements in the table and then pick the one at that position. See the sketch below.
known_position = 5
tds = soup.find_all('td')  # soup is the parsed BeautifulSoup document
number = tds[known_position].get_text(strip=True)
On the other hand, if you're specifically searching for a given number, I would just iterate over the list.
tds = soup.find_all('td')
for td in tds:
    if td.get_text(strip=True) == 'number here':
        pass  # do your stuff
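
Put together with the table from the question, the positional approach might look like the sketch below. The closing table tag is added so the sample is complete, and converting the text to an integer by dropping the thousands separator is an extra step that the answer does not mention.

```python
from bs4 import BeautifulSoup

# The table from the question, with a closing </table> added.
html = """<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Text of every cell, with surrounding whitespace removed.
cells = [td.get_text(strip=True) for td in soup.find_all("td")]

known_position = 5  # zero-based index of the cell holding 44,000
value = int(cells[known_position].replace(",", ""))
print(value)
```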
