Finding test case and result with BeautifulSoup - python

I need a good way to find the names of all test cases and the result for every test case in an html file. I'm new to BeautifulSoup and need some good advice.
First I have done this, using BeautifulSoup to read the data and prettify it and put the data in a file:
from bs4 import BeautifulSoup
f = open('myfile','w')
soup = BeautifulSoup(open(r"C:\DEV\debugkod\data.html"))
fixedSoup = soup.prettify()
fixedSoup = fixedSoup.encode('utf-8')
f.write(fixedSoup)
f.close()
Part of the prettified output in the file looks like this, for example (the file contains hundreds of test cases and results):
<a name="1005">
</a>
<div class="Sequence">
<div class="Header">
<table class="Title">
<tr>
<td>
IAA REQPROD 55 InvPwrDownMode - Shut down communication (Sequence)
</td>
<td class="ResultStateIcon">
<img src="Resources/Passed.png"/>
</td>
</tr>
</table>
<table class="DynamicAttributes">
<colgroup>
<col width="20">
<col width="30">
<col width="20">
<col width="30">
</col>
</col>
</col>
</col>
</colgroup>
<tr>
<th>
Start time:
</th>
<td>
2014/09/23 09-24-31
</td>
<th>
Stop time:
</th>
<td>
2014/09/23 09-27-25
</td>
</tr>
<tr>
<th>
Execution duration:
</th>
<td>
173.461 sec.
</td>
<th>
Name:
</th>
<td>
IAA REQPROD 55 InvPwrDownMode - Shut down communication
</td>
</tr>
<tr>
<th>
Library link:
</th>
<td>
</td>
<th>
Creation date:
</th>
<td>
2013/4/11, 8-55-57
</td>
</tr>
<tr>
<th>
Modification date:
</th>
<td>
2014/9/23, 9-27-25
</td>
<th>
Author:
</th>
<td>
cnnntd
</td>
</tr>
<tr>
<th>
Hierarchy:
</th>
<td>
IAA. IAA REQPROD 55 InvPwrDownMode - Shut down communication
</td>
<td>
</td>
<td>
</td>
</tr>
</table>
<table class="StaticAttributes">
<colgroup>
<col width="20">
<col width="80">
</col>
</col>
</colgroup>
<tr>
<th>
Description:
</th>
<td>
</td>
</tr>
<tr>
<th>
Result state:
</th>
<td>
Passed
</td>
</tr>
</table>
</div>
<div class="BlockReport">
<a name="1007">
In this file I now want to find the values for "Name:" and "Result state:". If I check the prettified output, I can see the labels "Name:" and "Result state:". Hopefully it is possible to use them to find the test case name and the test result. The printout should look something like this:
Name = IAA REQPROD 55 InvPwrDownMode - Shut down communication
Result = Passed
etc
Does anyone know how to do this using BeautifulSoup?

Using the html from your second Pastebin link, the following code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("beautifulsoup2.html"))
names = []
for table in soup.findAll('table', attrs={'class': 'Title'}):
    td = table.find('td')
    names.append(td.text.encode("ascii", "ignore").strip())
results = []
for table in soup.findAll(attrs={'class': 'StaticAttributes'}):
    tds = table.findAll('td')
    results.append(tds[1].text.strip())
for name, result in zip(names, results):
    print "Name = {}".format(name)
    print "Result = {}".format(result)
    print
Gives this result:
Name = IEM(Project)
Result = PassedFailedUndefinedError
Name = IEM REQPROD 132765 InvPwrDownMode - Shut down communication SN1(Sequence)
Result = Passed
Name = IEM REQPROD 86434 InvPwrDownMode - Time from shut down to sleep SN2(Sequence)
Result = PassedUndefined
Name = IEM Test(Sequence)
Result = Failed
Name = IEM REQPROD 86434 InvPwrDownMode - Time from shut down to sleep(Sequence)
Result = Error
I added the encode("ascii", "ignore") because otherwise I would get UnicodeDecodeError's. See this answer for how those characters possibly ended up in your html.
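The question asks about using the "Name:" and "Result state:" labels directly. A minimal sketch of that idea follows, assuming each label sits in a th whose next sibling td holds the value, as in the prettified excerpt. The helper name labelled_values and the trimmed-down inline HTML are illustrative only, and this has not been run against the full report file.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the structure shown in the question (assumption:
# the real file repeats this pattern once per test case).
html = """
<div class="Sequence">
<table class="DynamicAttributes">
<tr><th> Execution duration: </th><td> 173.461 sec. </td>
<th> Name: </th><td> IAA REQPROD 55 InvPwrDownMode - Shut down communication </td></tr>
</table>
<table class="StaticAttributes">
<tr><th> Description: </th><td></td></tr>
<tr><th> Result state: </th><td> Passed </td></tr>
</table>
</div>
"""


def labelled_values(soup, label):
    """Return the text of the td that follows every th whose text equals label."""
    values = []
    for th in soup.find_all("th"):
        if th.get_text(strip=True) == label:
            td = th.find_next_sibling("td")
            values.append(td.get_text(strip=True) if td else "")
    return values


soup = BeautifulSoup(html, "html.parser")
names = labelled_values(soup, "Name:")
results = labelled_values(soup, "Result state:")
for name, result in zip(names, results):
    print("Name = {}".format(name))
    print("Result = {}".format(result))
```

Pairing by zip assumes every test case has exactly one "Name:" and one "Result state:" entry; if some blocks lack one of them, the two lists would drift out of step.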

Related

Python : Scrape each info in table without class using beautifulsoup4

I'm new to Python and I have a problem scraping a table of book information with beautifulsoup4, because the tr and td elements of the table don't have class names.
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Here is the table on the website:
<table class="table table-striped">
<tr>
<th>
UPC
</th>
<td>
a897fe39b1053632
</td>
</tr>
<tr>
<th>
Product Type
</th>
<td>
Books
</td>
</tr>
<tr>
<th>
Price (excl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Price (incl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Tax
</th>
<td>
£0.00
</td>
</tr>
<tr>
<th>
Availability
</th>
<td>
In stock (22 available)
</td>
</tr>
<tr>
<th>
Number of reviews
</th>
<td>
0
</td>
</tr>
</table>
The only approach I have learned relies on class names, for example: book_price = soup.find('td', class_='book-price').
But in this situation I am stuck...
Is there a way to pair the first th tag with the first td, the second th tag with the second td, and so on?
I imagine something like this:
import requests
from bs4 import BeautifulSoup
book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
page = requests.get(book_url)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('table').prettify()
table_infos = soup.find('table')
for info in table_infos.findAll('tr'):
    upc = ...
    price = ...
    tax = ...
Thank you!
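One way to pair each th with its td is to walk the rows and build a dictionary keyed by the header text. The sketch below uses a shortened inline copy of the table from the question instead of fetching the page; the variable name info is an arbitrary choice.

```python
from bs4 import BeautifulSoup

# Shortened copy of the product table from the question.
html = """
<table class="table table-striped">
<tr><th>UPC</th><td>a897fe39b1053632</td></tr>
<tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
<tr><th>Tax</th><td>£0.00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Each row holds one th (the label) and one td (the value).
info = {}
for row in soup.find("table").find_all("tr"):
    th = row.find("th")
    td = row.find("td")
    if th and td:
        info[th.get_text(strip=True)] = td.get_text(strip=True)

upc = info["UPC"]
price = info["Price (excl. tax)"]
tax = info["Tax"]
print(upc, price, tax)
```

Looking values up by label rather than by position keeps the code working if the site reorders the rows.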

How to beautifulsoup in this case without class or id

How can I get the text 'Wow, you get it!'? I can print the date, but I can't get the td that comes right after the date.
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Account Here
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td colspan="2">
There is nothing
</td>
</tr>
</table>
<br/>
<br/>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Death
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td valign="top" width="25%">
Aug 15 2021, 18:36:22 CEST
</td>
<td>
Wow, you get it!
</td>
</tr>
<tr bgcolor="#D4C0A1">
<td valign="top" width="25%">
Aug 01 2021, 21:25:39 CEST
</td>
<td>
Next Time
</td>
</tr>
</table>
I got the date with this code:
print(soup.find_all('td', {'valign': 'top'})[0].get_text())
which shows this:
Aug 15 2021, 18:36:22 CEST
But I can't find a way to get the td that follows the date.
If html_doc contains the HTML snippet from the question:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('td[valign="top"] + td').get_text(strip=True)
print(txt)
Prints:
Wow, you get it!
Or:
txt = soup.find("td", {"valign": "top"}).find_next("td").get_text(strip=True)
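Both variants above return only the first match. If every date/message pair is wanted, a sketch along the same lines could loop over all the date cells and take each one's next sibling td. The inline HTML is a shortened copy of the second table from the question, and the name entries is illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "Death" table from the question.
html_doc = """
<table>
<tr><td colspan="2"><b>Death</b></td></tr>
<tr><td valign="top" width="25%">Aug 15 2021, 18:36:22 CEST</td><td>Wow, you get it!</td></tr>
<tr><td valign="top" width="25%">Aug 01 2021, 21:25:39 CEST</td><td>Next Time</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect (date, message) tuples: the message is the td right after each date cell.
entries = []
for date_td in soup.find_all("td", {"valign": "top"}):
    message_td = date_td.find_next_sibling("td")
    if message_td is not None:
        entries.append((date_td.get_text(strip=True), message_td.get_text(strip=True)))

for date, message in entries:
    print(date, "->", message)
```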

Extracting multiple table data using python and beautiful soup

<div class="row margin_30">
<div class="col-md-12 col-sm-12 col-xs-12 col-lg-12">
<div class="table-responsive table-border-radius">
<table class="table table-hover result-table-new1 " style="margin:0">
<thead class="">
<tr class="">
<th style="text-align:center;">Pl</th>
<th>H.No</th>
<th>Horse/Pedigree</th>
<th>Desc</th>
<th>Trainer</th>
<th>Jockey</th>
<th>Wt</th>
<th>Al</th>
<th>Dr</th>
<th>Sh</th>
<th>Won By</th>
<th>Dist Win</th>
<th>Rtg</th>
<th>Odds</th>
<th>Time</th>
</tr>
</thead>
<tbody class="">
<tr class="dividend_tr" >
<td>1 </td>
<td style="text-align: center;">7 </td>
<td class="race_card_td"><h5 style="font-size:16px">
<a href="http://www.indiarace.com/Home/horseStatistics/55234/SILKEN
STRIKER">
SILKEN STRIKER </a></h5>
<h6 class="margin_remove">Sussex(GB)-Flying Rani </h6>
</td>
<td>
4y b g </td>
<td>
Irfan Ghatala </td>
<td>
Anjar Alam </td>
<td>
56 </td>
<td>
- </td>
<td>
6 </td>
<td>
A </td>
<td>
5 1/2 </td>
<td>
</td>
<td>
12 </td>
<td>
</td>
<td>
1:14.57 </td>
</tr>
<tr class="dividend_tr" >
<td>
2 </td>
<td style="text-align: center;">
5 </td>
<td class="race_card_td">
<h5 style="font-size:16px">
<a href="http://www.indiarace.com/Home/horseStatistics/55737/ULTIMATE
POWER">
ULTIMATE POWER </a>
</h5>
<h6 class="margin_remove">
Epicentre(USA)-Methodical </h6>
</td>
<td>
4y b g </td>
<td>
V Lokanath </td>
<td>
Darshan R N </td>
<td>
57 </td>
<td>
-1 </td>
<td>
3 </td>
<td>
A </td>
<td>
5 </td>
<td>
5.5 </td>
<td>
14 </td>
<td>
</td>
<td>
1:15.47 </td>
</tr>
</tbody>
</table>
</div>
I want the following output using Beautiful Soup and want to store it in a CSV file. The actual page [http://www.indiarace.com/Home/racingCenterEvent?venueId=3&event_date=2018-08-10&race_type=RESULTS] has multiple tables and many rows. I also need to write a function to get data from different pages.
[Result][1]
[1]: https://i.stack.imgur.com/4LYt8.jpg
Any help would be greatly appreciated.
It's pretty simple: find all the tables, then iterate over the tr and td elements as needed. You can use pandas to save the scraped data. I have parsed the tables for you (the rest is up to you); see the code below.
import requests
from bs4 import BeautifulSoup
url = 'http://www.indiarace.com/Home/racingCenterEvent?venueId=3&event_date=2018-08-10&race_type=RESULTS'
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')
table = soup.find_all('table', attrs={'class': 'result-table-new1'})
for i in table:
    tr = i.find_all('tr')
    for td in tr:
        print(td.text.replace('\n', ' '))
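To get from printed rows to CSV, one option is to split every row into cells and hand the resulting lists to the standard csv module. The sketch below is only an illustration: it runs on a shortened inline copy of the table from the question, the helper name table_rows is made up, and it writes to an in-memory buffer rather than a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of one result table from the question.
html = """
<table class="table table-hover result-table-new1">
<thead><tr><th>Pl</th><th>H.No</th><th>Horse/Pedigree</th></tr></thead>
<tbody>
<tr class="dividend_tr"><td>1 </td><td>7 </td>
<td class="race_card_td"><h5><a href="#">SILKEN STRIKER </a></h5>
<h6>Sussex(GB)-Flying Rani </h6></td></tr>
</tbody>
</table>
"""


def table_rows(table):
    """Return one list of cell texts per tr, with whitespace collapsed."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        rows.append([" ".join(cell.get_text().split()) for cell in cells])
    return rows


soup = BeautifulSoup(html, "html.parser")
rows = []
for table in soup.find_all("table", class_="result-table-new1"):
    rows.extend(table_rows(table))

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, the buffer could be swapped for open("results.csv", "w", newline="") (the file name is a placeholder). With several tables on the page, each table's header row is repeated in the output unless it is filtered out.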

Table extraction: BeautifulSoup vs. Pandas.read_html

I have an html file taken from this link, but I am not able to extract any table with either bs4.BeautifulSoup() or pandas.read_html. I understand that each row of my desired table starts with <tr class='odd'>. Despite that, something is not working when I pass soup.find({'class': 'odd'}) or pd.read_html(url, attrs = {'class': 'odd'}). Where is the mistake, or what should I do instead?
The beginning of the table apparently starts in requests.get(url).content[8359:].
<table style="background-color:#FFFEEE; border-width:thin; border-collapse:collapse; border-spacing:0; border-style:outset;" rules="groups" >
<colgroup>
<colgroup>
<colgroup>
<colgroup>
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup>
<tbody>
<tr style="vertical-align:middle; background-color:#177A9C">
<th scope="col" style="text-align:center">Ion</th>
<th scope="col" style="text-align:center"> Observed <br /> Wavelength <br /> Vac (nm) </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>g<sub>k</sub>A<sub>ki</sub></i><br /> (10<sup>8</sup> s<sup>-1</sup>) </th>
<th scope="col"> Acc. </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>i</sub></i> <br /> (eV) </th>
<th> </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>k</sub></i> <br /> (eV) </th>
<th scope="col" style="text-align:center" colspan="3"> Lower Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center" colspan="3"> Upper Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center"> <i>g<sub>i</sub></i> </th>
<th scope="col" style="text-align:center"> <b>-</b> </th>
<th scope="col" style="text-align:center"> <i>g<sub>k</sub></i> </th>
<th scope="col" style="text-align:center"> Type </th>
</tr>
</tbody>
<tbody>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr class='odd'>
<td class="lft1"><b>C I</b> </td>
<td class="fix"> 193.090540 </td>
<td class="lft1">1.02e+01 </td>
<td class="lft1"> A</td>
<td class="fix">1.2637284 </td>
<td class="dsh">- </td>
<td class="fix">7.68476771 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i><sup>2</sup> </td>
<td class="lft1"> <sup>1</sup>D </td>
<td class="lft1"> 2 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i>3<i>s</i> </td>
<td class="lft1"> <sup>1</sup>P° </td>
<td class="lft1"> 1 </td>
<td class="rgt"> 5</td>
<td class="dsh">-</td>
<td class="lft1">3 </td>
<td class="cnt"><sup></sup><sub></sub></td>
</tr>
This code can give you a jump start on this project. However, if you're looking for someone to build the whole project (request data, scrape, store, manipulate), I would suggest hiring someone or learning how to do it. HERE is the BeautifulSoup Documentation.
Go through the quickstart guide once and you'll know pretty much all there is to know about bs4.
import requests
from bs4 import BeautifulSoup
from time import sleep
url = 'https://physics.nist.gov/'
second_part = 'cgi-bin/ASD/lines1.pl?spectra=C%20I%2C%20Ti%20I&limits_type=0&low_w=190&upp_w=250&unit=1&de=0&format=0&line_out=0&no_spaces=on&remove_js=on&en_unit=1&output=0&bibrefs=0&page_size=15&show_obs_wl=1&unc_out=0&order_out=0&max_low_enrg=&show_av=2&max_upp_enrg=&tsb_value=0&min_str=&A_out=1&A8=1&max_str=&allowed_out=1&forbid_out=1&min_accur=&min_intens=&conf_out=on&term_out=on&enrg_out=on&J_out=on&g_out=on&submit=Retrieve%20Data%27'
page = requests.get(url+second_part)
soup = BeautifulSoup(page.content, "lxml")
whole_table = soup.find('table', rules='groups')
sub_tbody = whole_table.find_all('tbody')
# the two above lines are used to locate the table and the content
# we then continue to iterate through sub-categories i.e. tbody-s > tr-s > td-s
for tag in sub_tbody:
    if tag.find('tr').find('td'):
        table_rows = tag.find_all('tr')
        for tag2 in table_rows:
            if tag2.has_attr('class'):
                td_tags = tag2.find_all('td')
                print(td_tags[0].text, '<- Is the ion')
                print(td_tags[1].text, '<- Wavelength')
                print(td_tags[2].text, '<- Some formula gk Aki')
                # and so on...
                print('--'*40) # unnecessary but does print ----------...
    else:
        pass
You need to search for the tag first and then filter by class. So, using the lxml parser:
soup = BeautifulSoup(yourdata, 'lxml')
for i in soup.find_all('tr',attrs={'class':"odd"}):
    print(i.text)
From this point you can write this data directly to a file or generate an array (list of lists - your rows) then put into pandas etc etc.
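As a rough illustration of the "list of lists" step, the sketch below collects the cell texts of every tr with class "odd". It uses a three-column excerpt of the row from the question and the built-in html.parser rather than lxml, purely to stay self-contained.

```python
from bs4 import BeautifulSoup

# Three-column excerpt of the table from the question.
yourdata = """
<table rules="groups">
<tbody>
<tr><td> </td><td> </td><td> </td></tr>
<tr class='odd'>
<td class="lft1"><b>C I</b> </td>
<td class="fix"> 193.090540 </td>
<td class="lft1">1.02e+01 </td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(yourdata, "html.parser")

# One inner list per matching row, one string per cell.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr", class_="odd")
]
print(rows)
```

A list in this shape can be passed to csv.writer(...).writerows or to the pandas.DataFrame constructor.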

Extracting data from a table using Python Beautiful soup

I'm trying to parse rows within a table (the departure board times) from the following:
buscms_widget_departureboard_ui_displayStop_Callback("
<div class='\"livetimes\"'>
<table class='\"busexpress-clientwidgets-departures-departureboard\"'>
<thead>
<tr class='\"rowStopName\"'>
<th colspan='\"3\"' data-bearing='\"SW\"' data-lat='\"51.7505683898926\"' data-lng='\"-1.225102186203\"' title='\"oxfajmwg\"'>
Divinity Road
</th>
<tr>
<tr class='\"textHeader\"'>
<th colspan='\"3\"'>
text 69325694 to 84637 for live times
</th>
<tr>
<tr class='\"rowHeaders\"'>
<th>
service
</th>
<th>
destination
</th>
<th>
time
</th>
<tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</thead>
<tbody>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 21:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"5'>
5 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:11:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"27'>
27 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4 (OBC)
</td>
<td class='\"colDestination\"' title='\"Abingdon\"'>
Abingdon
</td>
<td 22:29:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"22:29\"'>
22:29
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"65'>
65 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 23:09:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"23:09\"'>
23:09
</td>
</tr>
</tbody>
</table>
</div>
<div class='\"scrollmessage_container\"'>
<div class='\"scrollmessage\"'>
</div>
</div>
<div class='\"services\"'>
<a class='\"service' href='\"#\"' onclick="\"serviceNameClick('');\"" selected\"="">
all
</a>
<a class='\"service\"' href='\"#\"' onclick="\"serviceNameClick('4');\"">
4
</a>
</div>
<div class="dptime">
<span>
times generated at:
</span>
<span>
21:43
</span>
</div>
");
In particular, I'm trying to extract all the departure times - so I'd like to capture the minutes from departure - for example 12 minutes away.
I have the following code:
# import libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
I'm not sure how to find the minutes from departure from the above? Is it something like:
minutes_from_depart = soup.find("tbody", attrs={'td': 'mins'})
Could you try this?
import urllib.request
from bs4 import BeautifulSoup
import re
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
page = urllib.request.urlopen(quote_page).read()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())
minutes = soup.find_all("td", class_=re.compile(r"colDepartureTime"))
for elements in minutes:
    print(elements.getText())
So I got to my answer with the following code - which was actually quite easy once I had played around with the soup.find_all function:
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('td',class_='\\"colDepartureTime\\"'):
    print(link.get_text())
I get the following output:
10:40
10 mins
21 mins
30 mins
40 mins
50 mins
60 mins
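The backslashes in that class name are there because the response is a JSONP call whose HTML argument has its double quotes escaped. An alternative, sketched below under assumptions, is to strip the callback wrapper and unescape the quotes before parsing, so the plain class name can be used. The helper unwrap_jsonp_html and the sample payload are made up for illustration, and the helper only handles escaped double quotes, not any other escape sequences the real response may contain.

```python
from bs4 import BeautifulSoup


def unwrap_jsonp_html(payload):
    """Return the HTML between callback(" and "), with \\" turned into "."""
    start = payload.index('("') + 2
    end = payload.rindex('")')
    return payload[start:end].replace('\\"', '"')


# Hypothetical payload in the same shape as the response in the question.
sample_payload = (
    'buscms_widget_departureboard_ui_displayStop_Callback("'
    '<table><tbody>'
    '<tr><td class=\\"colDepartureTime\\" title=\\"5 mins\\">5 mins</td></tr>'
    '<tr><td class=\\"colDepartureTime\\" title=\\"22:29\\">22:29</td></tr>'
    '</tbody></table>'
    '");'
)

soup = BeautifulSoup(unwrap_jsonp_html(sample_payload), "html.parser")
times = [
    td.get_text(strip=True)
    for td in soup.find_all("td", class_="colDepartureTime")
]
print(times)
```

With the quotes unescaped, attributes such as title and data-departuretime also parse as single values instead of being split into fragments.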
