Parsing webpage with robobrowser and beautifulsoup - python

I'm new to web scraping and am trying to parse a website after submitting a form with robobrowser. I get the correct data back (I can view it with print(browser.parsed)), but I'm having trouble parsing it. The relevant part of the page source looks like this:
<div id="ii">
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
If I do
in: browser.select('#t1b')
I get:
out: [<td id="t1b" scope="row">Participant Name</td>]
instead of JONES, JOHN.
The only way I've been able to get the relevant data is by doing:
browser.select('tr')
This returns a list of all 29 'tr' elements, each of which I can convert to text and search for the relevant info.
I've also tried creating a BeautifulSoup object:
x = browser.select('#ii')
soup = BeautifulSoup(x[0].text, "html.parser")
but it loses all tags/ids, so I can't figure out how to search within it.
Is there an easy way to loop through each 'tr' element and get the actual data rather than the label, as opposed to repeatedly converting each one to a string and searching through it?
Thanks

Get all the "label" td elements and, for each one, take the text of the next td sibling, collecting the results into a dict:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
</table>
"""
soup = BeautifulSoup(data, 'html5lib')
data = {
    label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
    for label in soup.select("tr > td[scope=row]")
}
pprint(data)
Prints:
{'Date of Birth': '11/25/2016',
'ID (ID Number)': '1234567',
'Participant Name': 'JONES, JOHN',
'Race / Ethnicity': 'White',
'Sex': 'MALE'}
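If only one value is needed, the headers attribute on each value cell can be matched directly, since it repeats the id of its label cell. The sketch below assumes that pairing holds for every row, as it does in the markup shown; value_for is a hypothetical helper name, and the surrounding table tag is added only to keep the sample well-formed. It also shows why the BeautifulSoup(x[0].text, ...) attempt lost the tags: .text keeps only the visible text, whereas str(element) keeps the markup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (the <table> wrapper is an assumption).
html = """
<div id="ii"><table>
<tr><td scope="row" id="t1b">Participant Name</td><td headers="t1b">JONES, JOHN </td></tr>
<tr><td scope="row" id="t1c">Sex</td><td headers="t1c">MALE </td></tr>
</table></div>
"""
soup = BeautifulSoup(html, "html.parser")


def value_for(tree, label_id):
    """Return the text of the cell whose headers attribute names label_id."""
    cell = tree.select_one('td[headers="%s"]' % label_id)
    return cell.get_text(strip=True) if cell is not None else None


# str(element) keeps the markup, so the re-parsed fragment is still searchable;
# element.text would have kept only the visible text.
fragment = BeautifulSoup(str(soup.select_one("#ii")), "html.parser")

print(value_for(soup, "t1b"))
print(value_for(fragment, "t1c"))
```

With robobrowser, the same selector string could presumably be passed to browser.select, since the question already uses that method with CSS selectors.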

Related

Can't scrape some data out of a table in a customized way

I'm trying to parse tabular content out of some HTML elements and arrange it in a customized manner so that I can write it to a CSV file later.
The table looks almost exactly like this.
The HTML elements look like this (truncated):
<tr>
<td align="center" colspan="4" class="header">ATLANTIC</td>
</tr>
<tr>
<td class="black10bold">Facility</td>
<td class="black10bold">Type</td>
<td class="black10bold">Funding</td>
</tr>
<tr>
<td style="width: 55%">
Complete Care at Linwood, LLC
</td>
</tr>
<tr>
<td style="width: 55%">
The Health Center At Galloway
</td>
</tr>
<tr>
<td align="center" colspan="4" class="header">BERGEN</td>
</tr>
<tr>
<td class="black10bold">Facility</td>
<td class="black10bold">Type</td>
<td class="black10bold">Funding</td>
</tr>
<tr>
<td style="width: 55%">
The Actors Fund Homes
</td>
</tr>
<tr>
<td style="width: 55%">
Actors Fund Home, The
</td>
</tr>
This is what I've tried so far:
for item in soup.select("tr"):
    try:
        header = item.select_one("td.header").text
    except AttributeError:
        header = ""
    try:
        item_name = item.select_one("td > a").text
    except AttributeError:
        item_name = ""
    print(item_name,header)
Output it produces:
ATLANTIC
Complete Care at Linwood, LLC
The Health Center At Galloway
BERGEN
The Actors' Fund Homes
Actors Fund Home, The
Output I would like to have:
Complete Care at Linwood, LLC ATLANTIC
The Health Center At Galloway ATLANTIC
The Actors' Fund Homes BERGEN
Actors Fund Home, The BERGEN
This should produce the output in the format you want. It remembers the most recent header and prints it next to every facility name that follows:
for item in soup.select("tr"):
    if item.select_one("td.header"):
        header = item.select_one("td.header").text
    elif item.select_one("td > a"):
        item_name = item.select_one("td > a").text
        print(item_name,header)
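For reference, here is a self-contained sketch of the same carry-the-header idea that collects the pairs in a list instead of printing them. The truncated HTML in the question shows no a tags, so the links in this sample are an assumption based on the "td > a" selector used in the question, and the variable name pairs is made up for the example.

```python
from bs4 import BeautifulSoup

# Sample rows; the <a> elements are assumed, the truncated source omits them.
html = """
<table>
<tr><td align="center" colspan="4" class="header">ATLANTIC</td></tr>
<tr><td class="black10bold">Facility</td><td class="black10bold">Type</td></tr>
<tr><td style="width: 55%"><a href="#">Complete Care at Linwood, LLC</a></td></tr>
<tr><td style="width: 55%"><a href="#">The Health Center At Galloway</a></td></tr>
<tr><td align="center" colspan="4" class="header">BERGEN</td></tr>
<tr><td style="width: 55%"><a href="#">Actors Fund Home, The</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
header = ""  # stays empty until the first header row is seen
for row in soup.select("tr"):
    header_cell = row.select_one("td.header")
    link = row.select_one("td > a")
    if header_cell is not None:
        # A county row: remember it for the facility rows that follow.
        header = header_cell.get_text(strip=True)
    elif link is not None:
        pairs.append((link.get_text(strip=True), header))

for name, county in pairs:
    print(name, county)
```

Initialising header before the loop avoids a NameError if a facility row ever appears before the first county row.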
Hope this helps you.
import csv
from bs4 import BeautifulSoup
html = """<tr>
<td align="center" colspan="4" class="header">ATLANTIC</td>
</tr>
<tr>
<td class="black10bold">Facility</td>
<td class="black10bold">Type</td>
<td class="black10bold">Funding</td>
</tr>
<tr>
<td style="width: 55%">
Complete Care at Linwood, LLC
</td>
</tr>
<tr>
<td style="width: 55%">
The Health Center At Galloway
</td>
</tr>
<tr>
<td align="center" colspan="4" class="header">BERGEN</td>
</tr>
<tr>
<td class="black10bold">Facility</td>
<td class="black10bold">Type</td>
<td class="black10bold">Funding</td>
</tr>
<tr>
<td style="width: 55%">
The Actors Fund Homes
</td>
</tr>
<tr>
<td style="width: 55%">
Actors Fund Home, The
</td>
</tr>"""
soup = BeautifulSoup(html, 'lxml')
output_rows = []
for table_row in soup.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)
print(output_rows)
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)
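One detail worth noting: column.text keeps the line breaks that surround each facility name in the source, so they end up inside the CSV fields. The sketch below uses get_text(strip=True) instead and writes to an in-memory buffer so it can run without touching the disk; the buffer and the shortened sample HTML are assumptions made for the example, not part of the answer above.

```python
import csv
import io

from bs4 import BeautifulSoup

# Two rows from the question, wrapped in <table> to keep the sample well-formed.
html = """
<table>
<tr><td align="center" colspan="4" class="header">ATLANTIC</td></tr>
<tr>
<td style="width: 55%">
Complete Care at Linwood, LLC
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# strip=True removes the newlines around each cell's text.
output_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

# newline="" leaves the csv module's own line endings untouched.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(output_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes the facility name automatically because it contains a comma, which is one reason to prefer it over joining strings by hand.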

How to read the data from HTML file and write the data to CSV file using python?

I have an .html report file that contains data in tables along with pass/fail criteria, and I want to write this data to a .csv file using Python 3.
Please suggest how to proceed.
For example, the data looks like this:
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
Assuming you know the header and really only need the associated percentage, with bs4 4.7.1 you can use :contains to target the header cell and then take the next td. You would read your HTML from the file into the html variable shown.
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
'''
soup = bs(html, 'lxml') # 'html.parser' if lxml not installed
header = 'Test Sequence Work Progress'
result = soup.select_one('td:contains("' + header + '") + td').text
df = pd.DataFrame([result], columns = [header])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
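As an alternative that does not depend on the :contains pseudo-class (newer Soup Sieve releases prefer the spelling :-soup-contains), the label cell can be found by its exact text and its next td sibling read from there. This is only a sketch on a shortened copy of the table; it assumes the label text matches exactly and that the percentage is always the cell immediately after it.

```python
from bs4 import BeautifulSoup

# Shortened copy of the report table from the question.
html = """
<table width="100%" class="coverage">
<tr><th colspan="2">Metric</th><th colspan="2">Percentage</th></tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td><table class="metricbar"><tr><td class="white"></td></tr></table></td>
<td>100%</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

header = "Test Sequence Work Progress"
# string= matches a tag whose only child is exactly this text.
label_cell = soup.find("td", string=header)
percentage = None
if label_cell is not None:
    percentage = label_cell.find_next_sibling("td").get_text(strip=True)

print(header, percentage)
```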
from bs4 import BeautifulSoup
path = "my.html"  # add the path of your local file here
with open(path) as source:
    soup = BeautifulSoup(source, 'html.parser')
with open('out.csv', 'w', encoding='utf-8') as out:
    for link in soup.find_all('p'):  # add the tag which you want to extract
        out.write(link.get_text())
        out.write('\n')
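If the whole table is wanted rather than one value, the nested metricbar table gets in the way, because a plain find_all('tr') also returns its rows. One possible approach, sketched here under the assumption that html.parser is used (html5lib, for example, inserts a tbody element, which would change the direct children), is to walk only the direct child rows and cells and to skip any cell that contains a nested table.

```python
from bs4 import BeautifulSoup

# Shortened copy of the report table, including the nested bar-chart table.
html = """
<table width="100%" class="coverage">
<tr><th colspan="2">Metric</th><th colspan="2">Percentage</th><th>Target</th><th>Total</th></tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td><table class="metricbar"><tr><td class="covreached" width="99%"></td></tr></table></td>
<td>100%</td>
<td>24</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="coverage")

rows = []
# recursive=False keeps the rows and cells of the nested table out of the result.
for tr in table.find_all("tr", recursive=False):
    cells = tr.find_all(["th", "td"], recursive=False)
    rows.append([c.get_text(strip=True) for c in cells if c.find("table") is None])

for row in rows:
    print(row)
```

The resulting list of lists can be handed to csv.writer(...).writerows(rows) as in the other answers.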

Determining the Class Inside a TD Tag

Using python beautifulsoup I am trying to find all of the <tr> tags of an HTML page. However I would like to filter out any <tr> tag that has a certain class inside one of the <td> tags.
I have tried to filter rows out that have the class "Warning" within the <td> tag with the below code.
soup = BeautifulSoup(data, 'html.parser')
print(soup.find_all('tr', class_=lambda c: 'Warning' not in c))
I know it is not filtering out the "Warning" class because I am using <tr> inside the find_all function, but if I try to use td instead, it gives me a TypeError: argument of type 'NoneType' is not iterable.
Any thoughts are appreciated.
from bs4 import BeautifulSoup
data = '''
<tr role="row" class="odd red" data-id="32">
<td role="gridcell" class="Warning">33</td>
<td role="gridcell">Ralph</td>
<td role="gridcell">List 2</td>
<td role="gridcell">FE</td>
<td role="gridcell">07/12/1996</td>
</tr>
<tr role="row" class="even red" data-id="33">
<td role="gridcell">34</td>
<td role="gridcell">Mary</td>
<td role="gridcell">List 2</td>
<td role="gridcell">SOTLTM</td>
<td role="gridcell">08/12/1996</td>
</tr>
<tr role="row" class="odd red" data-id="34">
<td role="gridcell">35</td>
<td role="gridcell">Tom</td>
<td role="gridcell">List 2</td>
<td role="gridcell">SOTLTM</td>
<td role="gridcell">09/12/1996</td>
</tr>
'''
soup = BeautifulSoup(data, 'html.parser')
print(soup.find_all('td', class_=lambda c: 'Warning' not in c))
class= is not an attribute on most of your <td> elements. This causes c to be set to None in your lambda, so you can automatically let those through the filter with a conditional test:
print(soup.find_all('td', class_=lambda c: not c or 'Warning' not in c))
#                                          ^^^^^^^^
Output
[<td role="gridcell">Ralph</td>,
<td role="gridcell">List 2</td>,
<td role="gridcell">FE</td>,
<td role="gridcell">07/12/1996</td>,
<td role="gridcell">34</td>,
<td role="gridcell">Mary</td>,
<td role="gridcell">List 2</td>,
<td role="gridcell">SOTLTM</td>,
<td role="gridcell">08/12/1996</td>,
<td role="gridcell">35</td>,
<td role="gridcell">Tom</td>,
<td role="gridcell">List 2</td>,
<td role="gridcell">SOTLTM</td>,
<td role="gridcell">09/12/1996</td>]
Moving on from there, we can apply this conditional to your primary concern, which is filtering the <tr> elements according to their children:
soup = BeautifulSoup(data, 'html.parser')
for tr in soup.find_all('tr'):
    if not bool(tr.find_all('td', class_=lambda c: c and 'Warning' in c)):
        print(tr)  # or print(tr.find_all('td')) if you'd like to
                   # access only the children of the filtered <tr>s
Output
<tr class="even red" data-id="33" role="row">
<td role="gridcell">34</td>
<td role="gridcell">Mary</td>
<td role="gridcell">List 2</td>
<td role="gridcell">SOTLTM</td>
<td role="gridcell">08/12/1996</td>
</tr>
<tr class="odd red" data-id="34" role="row">
<td role="gridcell">35</td>
<td role="gridcell">Tom</td>
<td role="gridcell">List 2</td>
<td role="gridcell">SOTLTM</td>
<td role="gridcell">09/12/1996</td>
</tr>
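A slightly shorter variant of the same filter, offered as a sketch: passing a plain string to class_ matches any one of an element's classes, so no lambda or None check is needed. The fourth row, with two classes on its td, is hypothetical data added to show that behaviour, and the cells are trimmed for brevity.

```python
from bs4 import BeautifulSoup

data = """
<table>
<tr role="row" class="odd red" data-id="32"><td class="Warning">33</td><td>Ralph</td></tr>
<tr role="row" class="even red" data-id="33"><td>34</td><td>Mary</td></tr>
<tr role="row" class="odd red" data-id="34"><td>35</td><td>Tom</td></tr>
<tr role="row" class="even red" data-id="35"><td class="Warning bold">36</td><td>Ann</td></tr>
</table>
"""
soup = BeautifulSoup(data, "html.parser")

# Keep a row only if none of its cells carries the Warning class.
kept = [tr for tr in soup.find_all("tr") if tr.find("td", class_="Warning") is None]
kept_ids = [tr["data-id"] for tr in kept]
print(kept_ids)
```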

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the ‘Model Type’ values on a company’s webpages, which are built from code like that below.
It forms 2 tables shown side by side on the webpage.
Updated source code of the webpage:
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using the following, but it doesn’t get the ‘VIP QB662FG’ value I want:
from bs4 import BeautifulSoup
import re
import urllib2
url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
How can I get it? I'm using Python 2.7.
You are looking for the next row, then the cell in the same position in that row. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
value = header_text.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
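If the column position should not be hard-coded, it can be looked up from the header row first. The sketch below assumes the header row has class tableheader, that the value row is its next tr sibling, and that both rows have the same number of cells; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# The two rows from the question, wrapped in the table that surrounds them.
html = """
<table class="tableforms">
<tr class="tableheader">
<td width="12%"> </td>
<td>Group </td>
<td>Model Type </td>
<td>Design Year </td></tr>
<tr class="row1">
<td width="10%"> </td>
<td class="row1">South West</td>
<td>VIP QB662FG (Registered) </td>
<td>2013 (Registered) </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

header_row = soup.find("tr", class_="tableheader")
headers = [td.get_text(strip=True) for td in header_row.find_all("td")]
column = headers.index("Model Type")  # raises ValueError if the header is absent

# find_next_sibling skips the whitespace between the two rows.
value_row = header_row.find_next_sibling("tr")
model_type = value_row.find_all("td")[column].get_text(strip=True)
print(model_type)
```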
Find all the tr's and output each one's third child, unless it's the first row:
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
soup = bs4.BeautifulSoup(data)
#table = soup.find('tr', {'class':'tableheader'}).parent
# the full page wraps these rows in <table class="tableforms">
table = soup.find('table', {'class':'tableforms'})
for i,tr in enumerate(table.findChildren()):
    if i>0:
        for idx,td in enumerate(tr.findChildren()):
            if idx==2:
                print(td.get_text().replace('(Registered)','').strip())
I think you can do it as follows:
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print(dico['Model Type'])
It prints: VIP QB662FG

How to get text content of multiple <td> tags inside a table using PyQuery?

How do I select the text of each attribute (both the key and its value) from the book-details table given below?
<table cellspacing="0" class="fk-specs-type2">
<tr>
<th class="group-head" colspan="2">Book Details</th>
</tr>
<tr>
<td class="specs-key">Publisher</td>
<td class="specs-value fk-data">HARPER COLLINS INDIA</td>
</tr>
<tr>
<td class="specs-key">ISBN-13</td>
<td class="specs-value fk-data">9789350291924</td>
</tr>
</table>
You can use the following code snippet to get the Publisher and ISBN-13 data:
from pyquery import PyQuery
html = """<table cellspacing="0" class="fk-specs-type2">
<tr>
<th class="group-head" colspan="2">Book Details</th>
</tr>
<tr>
<td class="specs-key">Publisher</td>
<td class="specs-value fk-data">HARPER COLLINS INDIA</td>
</tr>
<tr>
<td class="specs-key">ISBN-13</td>
<td class="specs-value fk-data">9789350291924</td>
</tr>
</table>
"""
doc = PyQuery(html)
for td in doc("table.fk-specs-type2").find("td.specs-key"):
    print(td.text, td.getnext().text)
It should print the following two lines:
Publisher HARPER COLLINS INDIA
ISBN-13 9789350291924
