scraping tables with beautifulsoup - python

I seem to be stuck. If I had the following table:
<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
What would be the ideal way to use find_all to pull the 44,000 td from the table?

If the value is always at the same position in the table, I would use Beautiful Soup to extract all the td elements and then index into that list. See the sketch below, where soup is the parsed BeautifulSoup object.
known_position = 5  # zero-based index of the 44,000 cell
tds = soup.find_all('td')
number = tds[known_position].get_text(strip=True)
On the other hand, if you're searching for a specific number, I would just iterate over the list.
tds = soup.find_all('td')
for td in tds:
    if td.get_text(strip=True) == 'number here':
        pass  # do your stuff
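For reference, here is a self-contained sketch of both approaches run against the table from the question. It assumes the built-in html.parser backend and that the wanted cell is always the sixth td; the closing </table> tag and the variable names are added for illustration.

```python
from bs4 import BeautifulSoup

html = """<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
tds = soup.find_all("td")

# Approach 1: the cell sits at a known, fixed position (zero-based index 5).
by_position = tds[5].get_text(strip=True)

# Approach 2: scan every cell and keep the ones whose text matches.
by_value = [td for td in tds if td.get_text(strip=True) == "44,000"]

print(by_position)
print(len(by_value))
```

get_text(strip=True) is used instead of .text so the newlines and trailing spaces inside each cell do not break the comparison.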

Related

BeautifulSoup4 extract multiple data from TD tags within TR

I am using beautifulsoup4 to scrape an HTML table.
I want to display the values from one of the table rows and remove any empty td fields.
The rows in the source being scraped share the same classes, so is there any way to pull the data from just one row, using data-name="Georgia" in the HTML source below?
Current code:
import bs4 as bs
from urllib.request import FancyURLopener
class MyOpener(FancyURLopener):
    version = 'My new User-Agent'  # Set this to a string you want for your user agent
myopener = MyOpener()
sauce = myopener.open('')
soup = bs.BeautifulSoup(sauce, 'lxml')
#table = soup.table
table = soup.find('table')
table_rows = table.find_all_next('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
HTML SOURCE
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Kazakhstan">★</span>
Kazakhstan
</td>
<td class="text--green">
81
</td>
<td class="text--green">
9
</td>
<td class="text--green">
12.5
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
0
</td>
<td class="text--yellow">
0
</td>
</tr>
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Georgia">★</span>
Georgia
</td>
<td class="text--green">
75
</td>
<td class="text--green">
0
</td>
<td class="text--green">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
10
</td>
<td class="text--yellow">
1
</td>
</tr>
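Are you talking about something like:
soup.find('span', {'data-name': 'Georgia'}).find_parent('tr')
In the HTML you posted, data-name is an attribute of the span inside the first td, not of the td itself, so this finds that span and then climbs to the row that contains it. I could be reading your question all wrong though.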
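A minimal sketch of pulling just the Georgia row, assuming the data-name attribute lives on the span in the first cell as in the posted HTML. The markup here is a shortened stand-in for the real page, with one empty td added to show the filtering, and the star is written as a character reference.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the posted rows.
html = """<table>
<tr>
<td class="text--gray">
<span class="save-button" data-type="country" data-name="Kazakhstan">&#9733;</span>
Kazakhstan
</td>
<td class="text--green">81</td>
<td class="text--green">9</td>
</tr>
<tr>
<td class="text--gray">
<span class="save-button" data-type="country" data-name="Georgia">&#9733;</span>
Georgia
</td>
<td class="text--green">75</td>
<td class="text--red"></td>
<td class="text--blue">10</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the span that carries the attribute, then climb to its row.
marker = soup.find("span", attrs={"data-name": "Georgia"})
row = marker.find_parent("tr")

# Take the name from the attribute and the numbers from the remaining cells.
name = marker["data-name"]
values = [td.get_text(strip=True) for td in row.find_all("td")[1:]]
values = [v for v in values if v]  # drop empty td fields

print(name, values)
```

Reading the name from the attribute avoids having to strip the star character out of the first cell's text.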

Parsing webpage with robobrowser and beautifulsoup

I'm new to web scraping and am trying to parse a website after doing a form submission with robobrowser. I get the correct data back (I can view it when I do print(browser.parsed)) but am having trouble parsing it. The relevant part of the source code of the webpage looks like this:
<div id="ii">
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
If I do
in: browser.select('#t1b')
I get:
out: [<td id="t1b" scope="row">Participant Name</td>]
instead of JONES, JOHN.
The only way I've been able to get the relevant data is doing:
browser.select('tr')
This returns a list of the 29 'tr' elements, each of which I can convert to text and search for the relevant info.
I've also tried creating a BeautifulSoup object:
x = browser.select('#ii')
soup = BeautifulSoup(x[0].text, "html.parser")
but it loses all tags/ids and so I can't figure out how to search within it.
Is there an easy way to have it loop through each 'tr' element and get the actual data rather than the label, as opposed to repeatedly converting to a string variable and searching through it?
Thanks
Get all the "label" td elements and, for each one, get the next td sibling's value, collecting the results into a dict:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
</table>
"""
soup = BeautifulSoup(data, 'html5lib')
data = {
label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
for label in soup.select("tr > td[scope=row]")
}
pprint(data)
Prints:
{'Date of Birth': '11/25/2016',
'ID (ID Number)': '1234567',
'Participant Name': 'JONES, JOHN',
'Race / Ethnicity': 'White',
'Sex': 'MALE'}
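If only one field is needed, the headers attribute on the value cells can be matched directly instead of building the whole dict. This is a sketch under the assumption that every value cell carries headers="<id of its label cell>", as in the posted HTML; value_for is a hypothetical helper name and the markup is a shortened copy.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label_id):
    """Return the text of the td whose headers attribute names label_id."""
    cell = soup.select_one('td[headers="{}"]'.format(label_id))
    return cell.get_text(strip=True) if cell is not None else None


name = value_for(soup, "t1b")
print(name)
```

This explains why selecting '#t1b' returns the label: the id belongs to the label cell, while the data cell only refers to it through headers.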

Using BeautifulSoup to pick up text in table, on webpages

I want to use BeautifulSoup to pick up the ‘Model Type’ values on a company’s webpages, which are built from code like that below. It forms 2 tables shown side by side on the webpage.
Updated source code of the webpage:
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>
I am using the following, but it doesn’t get the ‘VIP QB662FG’ value I want:
from bs4 import BeautifulSoup
import re
import urllib2
url = "http://www.thewebpage.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
find_it = soup.find_all(text=re.compile("Model Type "))
the_value = find_it[0].findNext('td').contents[0]
print the_value
How can I get it? I'm using Python 2.7.
You are looking for the next row, then the next cell in the same position. The latter is tricky; we could assume it is always the 3rd column:
header_text = soup.find(text=re.compile("Model Type "))
value = header_text.find_next('tr').select('td:nth-of-type(3)')[0].get_text()
If you just ask for the next td, you get the Design Year column instead.
There could well be better methods to get to your one cell; if we assume there is only one tr row with the class row1, for example, the following would get your value in one step:
value = soup.select('tr.row1 td:nth-of-type(3)')[0].get_text()
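If the column position is not guaranteed, one option is to work out the header cell's index and read the cell at the same index in the following row. The sketch below assumes the header row and the value row are siblings with the same number of plain td cells (no colspan); the enclosing TABLE tags are taken from the fuller HTML quoted elsewhere in this thread, and the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """<TABLE class=tableforms><TBODY>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE>"""

soup = BeautifulSoup(html, "html.parser")

# Find the header text, then the cell and row that contain it.
header_text = soup.find(string=re.compile("Model Type"))
header_cell = header_text.find_parent("td")
header_row = header_cell.find_parent("tr")

# Work out which column the header sits in (identity check, not equality).
cells = header_row.find_all("td")
column = next(i for i, td in enumerate(cells) if td is header_cell)

# Read the cell in the same column of the next row.
value_row = header_row.find_next_sibling("tr")
raw = value_row.find_all("td")[column].get_text(strip=True)
value = raw.replace("(Registered)", "").strip()

print(value)
```

Compared with td:nth-of-type(3), this keeps working if a column is added before Model Type, at the cost of a few more lines.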
Find all the tr elements and output each one's third child, skipping the first row:
import bs4
data = """
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD>
"""
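soup = bs4.BeautifulSoup(data, 'html.parser')
# The snippet above has no enclosing <table>, so start from the header row's parent.
table = soup.find('tr', {'class': 'tableheader'}).parent
#table = soup.find('table', {'class': 'tableforms'})
for i, tr in enumerate(table.find_all('tr')):
    if i > 0:  # skip the header row
        for idx, td in enumerate(tr.find_all('td')):
            if idx == 2:  # the third cell holds the Model Type
                print(td.get_text().replace('(Registered)', '').strip())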
I think you can do it as follows:
from bs4 import BeautifulSoup
html = """<TD colSpan=3>Desinger </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Gender </TD>
<TD class=row1 width="20%" align=left>Male </TD></TR>
<TR>
<TD class=row2bold width="5%"> </TD>
<TD class=row2bold width="30%" align=left>Born Country </TD>
<TD class=row1 width="20%" align=left>DE </TD></TR></TBODY></TABLE></TD>
<TD height="100%" vAlign=top>
<TABLE class=tableforms>
<TBODY>
<TR class=tableheader>
<TD colSpan=4>Remarks </TD></TR>
<TR class=tableheader>
<TD width="12%"> </TD>
<TD style="TEXT-ALIGN: left" width="12%">Group </TD>
<TD style="TEXT-ALIGN: left" width="15%">Model Type </TD>
<TD style="TEXT-ALIGN: left" width="15%">Design Year </TD></TR>
<TR class=row1>
<TD width="10%"> </TD>
<TD class=row1>South West</TD>
<TD>VIP QB662FG (Registered) </TD>
<TD>2013 (Registered) </TD></TR></TBODY></TABLE></TD></TR>"""
soup = BeautifulSoup(html, "html.parser")
soup = soup.find('table',{'class':'tableforms'})
dico = {}
l1 = soup.findAll('tr')[1].findAll('td')
l2 = soup.findAll('tr')[2].findAll('td')
for i in range(len(l1)):
    dico[l1[i].getText().strip()] = l2[i].getText().replace('(Registered)','').strip()
print(dico['Model Type'])
It prints: VIP QB662FG

BeautifulSoup help, how to extract content from not proper tags text in html file?

<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>
<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap=""> </td>
</tr0>
How can I parse it using Python? Please help.
I want to get the result as the list ['good1',1,'good2',None]
Find all tr tags and get all the td elements from them:
from bs4 import BeautifulSoup
page = """<tr>
<td nowrap> good1 </td>
<td nowrap class = "td_left"> 1 </td>
</tr>
<tr>
<td nowrap> good2 </td>
<td nowrap class = "td_left"> 2 </td>
</tr>"""
soup = BeautifulSoup(page, 'html.parser')
rows = soup.find_all('tr')
print([td.text.strip() for row in rows for td in row.find_all('td')])
prints (on Python 3):
['good1', '1', 'good2', '2']
Note, strip() helps to get rid of leading and trailing whitespace.
Hope that helps.
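To get closer to the list the question asks for, including the non-standard tr0 row and the empty cell, one possibility is to collect the td elements directly and convert each cell's text. This is a sketch assuming the built-in html.parser backend, which keeps the td elements inside the unknown tr0 tag; cell_value is a hypothetical helper.

```python
from bs4 import BeautifulSoup

page = """<tr>
<td nowrap> good1 </td>
<td class = "td_left" nowrap=""> 1 </td>
</tr>
<tr0>
<td nowrap> good2 </td>
<td class = "td_left" nowrap=""> </td>
</tr0>"""


def cell_value(td):
    """Map a cell to None if empty, int if all digits, else its stripped text."""
    text = td.get_text(strip=True)
    if not text:
        return None
    return int(text) if text.isdigit() else text


soup = BeautifulSoup(page, "html.parser")

# Search for td directly so the malformed <tr0> wrapper does not matter.
result = [cell_value(td) for td in soup.find_all("td")]
print(result)
```

Searching for td rather than tr sidesteps the improper row tag entirely; other parsers may treat tr0 differently, so the choice of html.parser is part of the assumption.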

python beautiful soup extract data

I am parsing an HTML document using Beautiful Soup 4.0.
Here is an example of a table in the document:
<tr>
<td class="nob"></td>
<td class="">Time of price</td>
<td class=" pullElement pullData-DE000BWB14W0.teFull">08/06/2012</td>
<td class=" pullElement pullData-DE000BWB14W0.PriceTimeFull">11:43:08 </td>
<td class="nob"></td>
</tr>
<tr>
<td class="nob"></td>
<td class="">Daily volume (units)</td>
<td colspan="2" class=" pullElement pullData-DE000BWB14W0.EWXlume">0</td>
<td class="nob"></td>
</tr>
I would like to extract 08/06/2012, 11:43:08, Daily volume (units), 0, etc.
This is my code to find the specific table and all of its data:
from bs4 import BeautifulSoup
html = open("some_file.html")
soup = BeautifulSoup(html, "html.parser")
t = soup.find(id="ctnt-2308")
dat = [[str(td) for td in row.find_all("td")] for row in t.find_all("tr")]
I get a list of data that needs to be organized.
Any suggestions for a simple way to do it?
Thank you
list(soup.stripped_strings)
will give you all the strings in that soup, with leading and trailing whitespace removed.
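To keep the values organized by row, stripped_strings can also be applied per tr rather than to the whole soup. The sketch below assumes the rows sit inside an element with id="ctnt-2308" (the id comes from the question's code; wrapping the rows in a table with that id is an assumption) and treats the first string in each row as its label.

```python
from bs4 import BeautifulSoup

html = """<table id="ctnt-2308">
<tr>
<td class="nob"></td>
<td class="">Time of price</td>
<td class=" pullElement pullData-DE000BWB14W0.teFull">08/06/2012</td>
<td class=" pullElement pullData-DE000BWB14W0.PriceTimeFull">11:43:08 </td>
<td class="nob"></td>
</tr>
<tr>
<td class="nob"></td>
<td class="">Daily volume (units)</td>
<td colspan="2" class=" pullElement pullData-DE000BWB14W0.EWXlume">0</td>
<td class="nob"></td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find(id="ctnt-2308")

# One list of non-empty strings per row; empty cells disappear on their own.
rows = [list(tr.stripped_strings) for tr in table.find_all("tr")]

# First string is the label, the rest are its values.
data = {row[0]: row[1:] for row in rows if row}
print(data)
```

Keeping the values as lists handles rows with a varying number of data cells, such as the date and time pair in the first row.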
