Navigating html table lxml

I have some html which looks like:
<html>
<body>
<table cellpadding="0" cellspacing="0" border="0" width="100%">
<tr>
<td align="left" colspan="4">
<!-- BEGIN NEXT PREV LINKS -->
<table cellspacing="2" cellpadding="0" border="0">
<tr>
<td align="left"><font style="color:gray">Previous</font> </td>
<td align="center" colspan="2" nowrap><b>1-100 of 273 employees</b></td>
<td align="right"> Next</td>
</tr>
<tr>
<td align="left" colspan="2"><font style="color:gray">First Page</font></td>
<td align="right" colspan="2"> Last Page</td>
</tr>
</table>
<!-- END NEXT PREV LINKS -->
</td>
<td colspan="9" align="right">
Add Checked to Favorites
<br>
Add Checked to Excluded
</td>
</tr>
<tr>
<td rowspan="2"></td><td rowspan="2"></td> <td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href=""/></td>
<td rowspan="2" valign="bottom" style="padding-right:5px;"><b>Position</b></td>
<td colspan="2" align="center" valign="bottom" height="16"><b>Ratings</b><br><img src="/images/shim_333333.gif" width="130" height="1" alt="" hspace="5"></td> <td rowspan="2"> </td> <td rowspan="2" valign="bottom" style="padding-right:5px;"><b>Birth Date</b></td>
<td rowspan="2" valign="bottom" style="padding-right:5px;"><b>States</b></td>
<td rowspan="2"> </td><td rowspan="2"></td> <td rowspan="2" colspan="3" align="right" valign="bottom">Clear All </td> </tr>
<tr>
<td align="center"><b>In-State<br>Rating</b></td>
<td align="center"><b>Out of State<br>Rating</b></td>
</tr>
<tr>
<td colspan="13" valign="bottom"><img src="/images/shim.gif" width="100%" height="1" alt=""></td>
</tr> <tr>
<td align="right" colspan=13><img src="/images/shim_dddddd.gif" width="100%" height="1" border="0" alt=""></td>
</tr> <tr >
<td></td><td><b style="">X</b></td>
<td nowrap><p>Cruise, Tom </p></td>
<td nowrap>Actor </td>
<td align="center"><img src="/images/stars_2_sm_green.gif" alt="instate
Recommendation
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td align="center"><img src="/images/stars_4_sm.gif" alt="Summary
Estimate
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td> </td>
<td nowrap>1948 </td>
<td nowrap>CA</td>
<td></td><td></td>
<td> </td>
<td align="right"><input type="checkbox" name="employee_cb" value="198720" style="height:15px"></td>
</tr> <tr>
<td align="right" colspan=13><img src="/images/shim_dddddd.gif" width="100%" height="1" border="0" alt=""></td>
</tr> <tr >
<td><b style="">X</b></td><td></td>
<td nowrap><p>Schwarzenegger, Arnold </p></td>
<td nowrap>Governor </td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="instate
Recommendation
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="Summary
Estimate
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td> </td>
<td nowrap>No Current Date </td>
<td nowrap>-</td>
<td></td><td></td>
<td> </td>
<td align="right"><input type="checkbox" name="employee_cb" value="61184" style="height:15px"></td>
</tr> <tr >
<td><b style="">X</b></td><td></td>
<td nowrap><p>Obama, Barack </p></td>
<td nowrap>President </td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="instate
Recommendation
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="Summary
Estimate
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td> </td>
<td nowrap>No Current Date </td>
<td nowrap>-</td>
<td></td><td></td>
<td> </td>
<td align="right"><input type="checkbox" name="employee_cb" value="225747" style="height:15px"></td>
</tr>
<tr height="15">
<td align="right" colspan="14">
<!-- BEGIN NEXT PREV LINKS -->
<table cellspacing="2" cellpadding="0" border="0">
<tr>
<td align="left"><font style="color:gray">Previous</font> </td>
<td align="center" colspan="2" nowrap><b>1-100 of 273 employees</b></td>
<td align="right"> Next</td>
</tr>
<tr>
<td align="left" colspan="2"><font style="color:gray">First Page</font></td>
<td align="right" colspan="2"> Last Page</td>
</tr>
</table>
<!-- END NEXT PREV LINKS -->
</td>
</tr> <tr>
<td colspan="12" valign="bottom" nowrap><br>
<b style="">X</bfdgdfgb style="">X</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br>
<b style="c">X</b>dfgfdg<b style="">X</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br> <b style="">F</b>: A dsd "<b style="">F</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br>
dfgdfg"<b style="">F</b>"Lorem ipsum dolor sit amet, consectetur adipiscing elit<br>
<b style="">E</b>gfhbgdfg"<b style="">E</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit
</td>
</tr><tr><td colspan="20">
<table cellpadding="0" cellspacing="0" border="0" width="100%" align="center">
<tr>
<td colspan="2"><img src="/images/shim.gif" width="100%" height="5" alt=""></td>
</tr>
<tr>
<td valign="top">States: </td>
<td>CA=California; ND=North Dakota</td>
</tr>
</table>
</td></tr>
</table></body>
</html>
Looking for similar questions, I was able to construct (noting that the table is always 17th in the full html code):
import lxml.html as lh
data = open("employeetest.htm",'r').read()
root = lh.fromstring(data)
rows = root.xpath("//table")[17].findall("tr")
data = list()
for row in rows:
    data.append([c.text_content() for c in row.getchildren()])
print(data)
This produces a very messy list. My end goal is just to get
[['Cruise, Tom', 'Actor', '1948', 'CA'], ['Schwarzenegger, Arnold', 'Governor', 'No Current Date', '-'], ...]
However, all this information contained in the table produces a lot of strange elements. I know I can clean the resultant \xa0 by replacing with a single space. I'm not really sure how to navigate this further. Thanks!

I'm not sure what the ... in your expected output should be, but to get the data in the first three sublists you can narrow the search to tr elements that contain a td with text whose only attribute is nowrap:
from lxml import html
# h is the HTML source as a string
root = html.fromstring(h)
rows = root.xpath("//tr[td[@nowrap and text() and count(@*)=1]]")
for row in rows:
    print(row.xpath(".//td[@nowrap]//text()"))
Output:
['Cruise, Tom', u'\xa0', u'Actor\xa0', u'1948\xa0', 'CA']
['Schwarzenegger, Arnold', u'\xa0', u'Governor\xa0', u'No Current Date\xa0', '-']
['Obama, Barack', u'\xa0', u'President\xa0', u'No Current Date\xa0', '-']
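The rows above still carry the non-breaking spaces (\xa0) and one empty cell each. A minimal sketch of the clean-up step, assuming the three lists printed above as input; clean_row is a made-up helper name, not part of lxml:

```python
# Rows as printed in the output above (u'' prefixes dropped for Python 3)
raw_rows = [
    ['Cruise, Tom', '\xa0', 'Actor\xa0', '1948\xa0', 'CA'],
    ['Schwarzenegger, Arnold', '\xa0', 'Governor\xa0', 'No Current Date\xa0', '-'],
    ['Obama, Barack', '\xa0', 'President\xa0', 'No Current Date\xa0', '-'],
]


def clean_row(cells):
    """Hypothetical helper: replace non-breaking spaces, trim, drop empty cells."""
    cleaned = (cell.replace('\xa0', ' ').strip() for cell in cells)
    return [cell for cell in cleaned if cell]


data = [clean_row(row) for row in raw_rows]
print(data)
```

Each sublist then holds only the name, position, birth date and state, which is the shape asked for in the question.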

You will have to traverse the html document and get a more refined XPath. Additionally, you face the challenge of related data in different elements requiring two XPath expressions. This will require some manipulation to get the final related results together:
import lxml.etree as et
with open("employeetest.htm",'r') as f:
    # Strip the &nbsp; entities (note: this also removes every other ';' in the file)
    text = f.read().replace('&nbsp', '').replace(';', '')
root = et.HTML(text)
# XPATH LISTS (W/ RELATED ITEMS)
items1 = root.xpath("//td/p/a/text()")
items2 = root.xpath("//td[p/a/text()]/following-sibling::td/text()")
# NUMBER OF ITEMS IN items2 RELATED TO EACH NAME IN items1
r = len(items2) // len(items1)
# ITERATE OVER THE NAMES AND APPEND THE FIRST TWO RELATED ITEMS OF EACH
data = []
for i in range(len(items1)):
    inner = [items1[i]]
    inner.extend(items2[i*r:i*r + 2])
    data.append(inner)
print(data)
# [['Cruise, Tom', 'Actor', '1948'],
# ['Schwarzenegger, Arnold', 'Governor', 'No Current Date'],
# ['Obama, Barack', 'President', 'No Current Date']]
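The slicing step above can be written more generally by cutting the flat list into equal chunks, one per name. This is only a sketch: the two sample lists are illustrative values copied from the table in the question rather than real XPath output, and chunk is a hypothetical helper.

```python
def chunk(items, size):
    """Hypothetical helper: split a flat list into consecutive pieces of length size."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Illustrative stand-ins for items1 and items2 (assumed, not actual XPath output)
names = ['Cruise, Tom', 'Schwarzenegger, Arnold', 'Obama, Barack']
details = ['Actor', '1948', 'CA',
           'Governor', 'No Current Date', '-',
           'President', 'No Current Date', '-']

per_name = len(details) // len(names)  # number of detail cells per name
data = [[name] + group for name, group in zip(names, chunk(details, per_name))]
print(data)
```

This only holds when every row contributes the same number of text cells; a row with a missing cell would shift all later groups.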

Related

How to find a value in a table with no identifiers? (Python, Selenium)

I have a webpage with a table with many rows. A user will give me a number (15308), which appears in the first <td> of a row, and this is the only information I will have. I want to use this number to find the data between the <th></th> tags (specifically the 0), but only for that table row. For example, I attached two table rows below; I want the <th> data for the row containing 15308, not the <th> data from the row that has 15309 in its first <td>. Any help is appreciated!
Desired Output: 0
<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>

Use the following code:
from selenium.webdriver.common.by import By
userValue = '15308'
# Select the <th> cells that follow the <td> holding the user's value, within the same row
th_cells = driver.find_elements(By.XPATH, "//td[normalize-space()='" + userValue + "']/following-sibling::th")
for cell in th_cells:
    print(cell.text)

Something I have always found beautiful, using BeautifulSoup:
Using xpath="1" as an attribute:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find('th', attrs={'xpath': '1'})
print(xpathTh.text.strip())
OUTPUT:
0
EDIT:
To get all the values from the attrib:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 2</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find_all('th', attrs={'xpath': '1'})
for elem in xpathTh:
    print(elem.text.strip())
OUTPUT:
0
1
2
EDIT 2:
Considering you only want the th value if the first td inside the tr has a value of 15308:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>22222</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
    # Compare against the text of the first <td> in the row
    val = tr.find('td').text.strip()
    if toFind == val:
        xpathTh = tr.find_all('th', attrs={'xpath': '1'})
        for elem in xpathTh:
            print(elem.text.strip())
OUTPUT:
0
EDIT 3:
Continuing from comments:
line = '''<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
    # Compare against the text of the first <td> in the row
    val = tr.find('td').text.strip()
    if toFind == val:
        xpathTh = tr.find_all('td')[7]
        print("For the value: {}, The result is {}".format(toFind, xpathTh.find_next('th').text.strip()))
OUTPUT:
For the value: 15308, The result is 0
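The row-scoped lookup can also be wrapped in a small function. The sketch below uses only the standard library and works here because the abbreviated rows happen to be well-formed XML; real pages usually are not, so an HTML parser such as BeautifulSoup would normally take the place of ElementTree. The function name th_for_row and the shortened rows are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copies of the two rows from the question, wrapped in one root element
markup = """<table>
<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>
</table>"""


def th_for_row(root, wanted):
    """Hypothetical helper: <th> text of the row whose first <td> equals wanted."""
    for tr in root.iter('tr'):
        first_td = tr.find('td')
        if first_td is not None and (first_td.text or '').strip() == wanted:
            th = tr.find('th')
            return (th.text or '').strip() if th is not None else None
    return None  # no row starts with the wanted value


root = ET.fromstring(markup)
print(th_for_row(root, '15308'))
```

Searching inside the matched tr, rather than across the whole document, is what keeps the 15309 row from contributing its th value.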

Remove text outside of tags with bs4

I want to delete the text Página consultada el: but I don't know how because it's outside any tag.
I've tried with this but nothing changes:
for b in soup.find('br'):
    if( b.nextSibling == 'Página consultada el:'):
        b.nextSibling.replaceWith('')
    if(b.previousSibling == 'Página consultada el:'):
        b.previousSibling.replaceWith('')
This is the html of the part I want to remove:
<br/>
<br/>Página consultada el:
<br/>
<strong>27/01/2018 21:42:14</strong>
Whole html:
<html xmlns="http://www.w3.org/1999/xhtml">
<body><strong></strong>
<center><strong></strong>
<br/><br/><br/><br/>
<center>
</center>
<table border="1" cellpadding="0" cellspacing="0" style="width:400px">
<tbody>
<tr>
<td align="CENTER">
<p>Turno: Matutino</p>
</td>
<td align="CENTER"> Grupo: 401 </td>
</tr>
<tr>
<td align="CENTER" colspan="2">
<p>Profesor tutor: <br/> MONICA OSORNIO PEREZ.</p>
</td>
</tr>
</tbody>
</table>
<br/><br/>
<table border="1" cellpadding="0" cellspacing="0" style="width:1000px">
<tbody>
<tr>
<td align="CENTER" style="width:70px;">
<p>Hora:</p>
</td>
<td align="CENTER" style="width:186px;">Lunes </td>
<td align="CENTER" style="width:186px;">Martes </td>
<td align="CENTER" style="width:186px;">Miércoles </td>
<td align="CENTER" style="width:186px;">Jueves </td>
<td align="CENTER" style="width:186px;">Viernes </td>
</tr>
<tr>
<td align="CENTER">
<p>7:00<br/>a<br/>7:50</p>
</td>
<td align="CENTER">
<p> ORI.EDU.IV(A): A204<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>7:50<br/>a<br/>8:40</p>
</td>
<td align="CENTER">
<p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>8:40<br/>a<br/>9:30</p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> INFORMATICA CC2 <br/></p>
</td>
<td align="CENTER">
<p> HISTORIA III B116<br/></p>
</td>
<td align="CENTER">
<p> ORI.EDU.IV(B): A205<br/></p>
</td>
<td align="CENTER">
<p> DIBUJO II(A): B-8 <br/>DIBUJO II(B): C101<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>9:30<br/>a<br/>10:20</p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> GEOGRAFIA A102<br/></p>
</td>
<td align="CENTER">
<p> FISICA III A303<br/></p>
</td>
<td align="CENTER">
<p> GEOGRAFIA A102<br/></p>
</td>
<td align="CENTER">
<p> DIBUJO II(A): B-8 <br/>DIBUJO II(B): C101<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>10:20<br/>a<br/>11:10</p>
</td>
<td align="CENTER">
<p> HISTORIA III B108<br/></p>
</td>
<td align="CENTER">
<p> INFORMATICA B108<br/></p>
</td>
<td align="CENTER">
<p> FISICA III A303<br/></p>
</td>
<td align="CENTER">
<p> FISICA III LACE<br/></p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>11:10<br/>a<br/>12:00</p>
</td>
<td align="CENTER">
<p> LOGICA B108<br/></p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> GEOGRAFIA A103<br/></p>
</td>
<td align="CENTER">
<p> FISICA III LACE<br/></p>
</td>
<td align="CENTER">
<p> LOGICA B108<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>12:00<br/>a<br/>12:50</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> LOGICA B108<br/></p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> HISTORIA III B108<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>12:50<br/>a<br/>13:40</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>13:40<br/>a<br/>14:30</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> ED FISICA IV GIM <br/></p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>14:30<br/>a<br/>15:20</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
</tbody>
</table><br/>
<table border="1" cellpadding="0" cellspacing="0" style="width:1000px">
<tbody>
<tr>
<td style="width:165px;">
<p>Asignatura:</p>
</td>
<td style="width:335px;">Nombre del Profesor:</td>
<td style="width:165px;">Asignatura:</td>
<td style="width:335px;">Nombre del Profesor:</td>
</tr>
<tr>
<td>
<p>ORI.EDU.IV(A):</p>
</td>
<td>BECERRA ALCANTARA IVONNE </td>
<td>
<p>INGLES IV(B):</p>
</td>
<td>CARRILLO SANCHEZ JACOBO </td>
</tr>
<tr>
<td>
<p>LENG. ESP.</p>
</td>
<td>ESTRADA GASCA SCARLETT </td>
<td>
<p>FISICA III</p>
</td>
<td>FLORES FLORES ANA </td>
</tr>
<tr>
<td>
<p>HISTORIA III</p>
</td>
<td>GONZALEZ GARCIA ANGELICA ARACELI </td>
<td>
<p>DIBUJO II(A):</p>
</td>
<td>JIMENEZ GENCHI ERIKA PAOLA </td>
</tr>
<tr>
<td>
<p>LOGICA</p>
</td>
<td>NAVARRO LOZANO JULIANA V. </td>
<td>
<p>MATEMAT. IV</p>
</td>
<td>OLVERA PEÑA ALEJANDRO </td>
</tr>
<tr>
<td>
<p>GEOGRAFIA</p>
</td>
<td>OSORNIO PEREZ MONICA </td>
<td>
<p>ORI.EDU.IV(B):</p>
</td>
<td>PINEDA VALLEJO MARIA GABRIELA </td>
</tr>
<tr>
<td>
<p>INGLES IV(A):</p>
</td>
<td>REYES CRUZ KIMBERLY </td>
<td>
<p>ED FISICA IV</p>
</td>
<td>SANCHEZ LUGO EDGARDO JAIME </td>
</tr>
<tr>
<td>
<p>INFORMATICA</p>
</td>
<td>SOTOMAYOR GUERRA JUAN CARLOS </td>
<td>
<p>DIBUJO II(B):</p>
</td>
<td>VILLANUEVA VILCHIS MONICA EDITH </td>
</tr>
<tr>
<td>
<p></p>
</td>
<td></td>
<td>
<p></p>
</td>
<td></td>
</tr>
</tbody>
</table>
<br/><br/>Página consultada el:<br/><strong>27/01/2018 21:42:14</strong>
</center>
</body>
</html>

This might accomplish what you need (html is the page source as a string):
import re
html = re.sub(r'</table>\n<br/><br/>.+<br/>', '</table>\n<br/><br/><br/>', html)
That removes the text "Página consultada el:" from html while leaving the timestamp in place.
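To see why that pattern only touches the intended line, it can be tried on a trimmed copy of the page. The fragment below is a shortened stand-in for the tail of the HTML in the question. Because . does not match a newline by default, .+ cannot run past the line that holds the label, so the earlier </table> followed by a bare <br/><br/> line is left alone.

```python
import re

# Shortened stand-in for the tail of the page from the question
html = (
    '</table>\n'
    '<br/><br/>\n'
    '<table border="1"></table>\n'
    '<br/><br/>Página consultada el:<br/><strong>27/01/2018 21:42:14</strong>\n'
    '</center>'
)

# Same substitution as in the answer above
cleaned = re.sub(r'</table>\n<br/><br/>.+<br/>', '</table>\n<br/><br/><br/>', html)
print(cleaned)
```

The pattern depends on the exact newline and <br/> layout, so it would need adjusting if the page were re-serialized with different whitespace.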

How to parse this html structure using BeautifulSoup?

I would like to parse this TABLE row by row and save it to a CSV file.
What I have so far writes nothing to the CSV file:
Django:
data_scrapper makes a request from Yahoo Finance.
def button_clicked(request):
    headers = []
    rows = []
    gen_table = data_scrapper(symbol)
    soup = BeautifulSoup(gen_table)
    table = soup.find_all('table')
    for table in soup.find_all('table'):
        headers.extend([header.text for header in table.find_all('th')])
    for row in soup.find_all('tr'):
        rows.extend([val.text for val in row.find_all('td')])
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename= "{}.csv"'.format(symbol)
    writer = csv.writer(response)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
    return response
html:
<TABLE class="yfnc_tabledata1" width="100%" cellpadding="0" cellspacing="0" border="0">
<TR>
<TD>
<TABLE width="100%" cellpadding="2" cellspacing="0" border="0">
<TR class="yfnc_modtitle1" style="border-top:none;">
<td colspan="2" style="border-top:2px solid #000;">
<small>
<span class="yfi-module-title">Period Ending</span>
</small>
</td>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2014</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2013</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2012</th>
</TR>
<tr>
<td colspan="2">
<strong>
Total Revenue
</strong>
</td>
<td align="right">
<strong>
4,479,648
</strong>
</td>
<td align="right">
<strong>
3,777,068
</strong>
</td>
<td align="right">
<strong>
3,209,782
</strong>
</td>
</tr>
<tr>
<td colspan="2">Cost of Revenue</td>
<td align="right">3,160,470 </td>
<td align="right">2,656,189 </td>
<td align="right">2,284,485 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Gross Profit
</strong>
</td>
<td align="right">
<strong>
1,319,178
</strong>
</td>
<td align="right">
<strong>
1,120,879
</strong>
</td>
<td align="right">
<strong>
925,297
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Operating Expenses</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Research Development</td>
<td align="right">148,458 </td>
<td align="right">139,193 </td>
<td align="right">127,361 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Selling General and Administrative</td>
<td align="right">456,030 </td>
<td align="right">403,772 </td>
<td align="right">319,511 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Non Recurring</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Others</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Operating Expenses</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Operating Income or Loss
</strong>
</td>
<td align="right">
<strong>
714,690
</strong>
</td>
<td align="right">
<strong>
577,914
</strong>
</td>
<td align="right">
<strong>
478,425
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Income from Continuing Operations</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Other Income/Expenses Net</td>
<td align="right">(10)</td>
<td align="right">5,139 </td>
<td align="right">7,529 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Earnings Before Interest And Taxes</td>
<td align="right">710,556 </td>
<td align="right">580,639 </td>
<td align="right">485,775 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Interest Expense</td>
<td align="right">11,239 </td>
<td align="right">6,210 </td>
<td align="right">5,932 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Before Tax</td>
<td align="right">699,317 </td>
<td align="right">574,429 </td>
<td align="right">479,843 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Tax Expense</td>
<td align="right">245,288 </td>
<td align="right">193,360 </td>
<td align="right">167,533 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Minority Interest</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Net Income From Continuing Ops</td>
<td align="right">454,029 </td>
<td align="right">381,069 </td>
<td align="right">312,310 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Non-recurring Events</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Discontinued Operations</td>
<td align="right">
-
</td>
<td align="right">(3,777)</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Extraordinary Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Effect Of Accounting Changes</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Other Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
<tr>
<td colspan="2">Preferred Stock And Other Adjustments</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income Applicable To Common Shares
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
</TABLE>
</TD>
</TR>
</TABLE>

Here's some code that makes a CSV that looks like the table. The CSVs I usually work with have each row as a complete record, so all the values in column one would become the CSV header. Just something to think about; it might be helpful.
Python 3.4
from bs4 import BeautifulSoup
import re
import csv
def button_clicked(request, filename):
    # `request` holds the HTML source; the data lives in the inner (nested) table
    soup = BeautifulSoup(request, 'html.parser')
    table = soup.find('table').find('table')
    t_rows = table.find_all('tr')
    # newline='' lets the csv module control line endings itself
    with open(filename, 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for t_row in t_rows:
            rec_as_str = t_row.getText()
            rec_as_str = rec_as_str.strip()
            rec_as_str = rec_as_str.replace('\xa0', '')
            # Collapse each run of newlines and surrounding whitespace into one '|'
            rec_as_str = re.sub(r'\n?\s*(\n)+\s*', '|', rec_as_str)
            if len(rec_as_str) > 0:
                a_list = rec_as_str.split("|")
                spamwriter.writerow(a_list)
Creates a file that looks like:
Period Ending,"Dec 31, 2014","Dec 31, 2013","Dec 31, 2012"
Total Revenue,"4,479,648","3,777,068","3,209,782"
Cost of Revenue,"3,160,470","2,656,189","2,284,485"
Gross Profit,"1,319,178","1,120,879","925,297"
Operating Expenses
Research Development,"148,458","139,193","127,361"
Selling General and Administrative,"456,030","403,772","319,511"
Non Recurring,-,-,-
Others,-,-,-
Total Operating Expenses,-,-,-
Operating Income or Loss,"714,690","577,914","478,425"
Income from Continuing Operations
Total Other Income/Expenses Net,(10),"5,139","7,529"
Earnings Before Interest And Taxes,"710,556","580,639","485,775"
Interest Expense,"11,239","6,210","5,932"
Income Before Tax,"699,317","574,429","479,843"
Income Tax Expense,"245,288","193,360","167,533"
Minority Interest,-,-,-
Net Income From Continuing Ops,"454,029","381,069","312,310"
Non-recurring Events
Discontinued Operations,-,"(3,777)",-
Extraordinary Items,-,-,-
Effect Of Accounting Changes,-,-,-
Other Items,-,-,-
Net Income,"454,029","377,292","312,310"
Preferred Stock And Other Adjustments,-,-,-
Net Income Applicable To Common Shares,"454,029","377,292","312,310"
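One thing worth checking in the question's code is the use of rows.extend(...), which flattens every cell into a single list of strings, so csv.writer never receives one list per table row. The sketch below contrasts the two shapes using a few cell values copied from the table. It is an illustration only: io.StringIO stands in for the HttpResponse or file object, and the variable names are made up.

```python
import csv
import io

# A few cell values copied from the table in the question
table_rows = [
    ['Period Ending', 'Dec 31, 2014', 'Dec 31, 2013', 'Dec 31, 2012'],
    ['Total Revenue', '4,479,648', '3,777,068', '3,209,782'],
    ['Non Recurring', '-', '-', '-'],
]

flat = []    # what extend() builds: one long list of cell strings
nested = []  # what csv.writer.writerows() expects: one list per row
for cells in table_rows:
    flat.extend(cells)
    nested.append(cells)

buffer = io.StringIO()  # stands in for the HttpResponse or an open file
writer = csv.writer(buffer)
writer.writerows(nested)
output = buffer.getvalue()
print(output)
```

With the default QUOTE_MINIMAL setting, only values that contain a comma are wrapped in quotes, which matches the file shown above.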

How to extract using beautifulsoup python [duplicate]

This question already has answers here:
python beautifulsoup extracting text
(2 answers)
Closed 9 years ago.
I only want to use BeautifulSoup to extract all the values of the 3-hr PSI Readings from 12AM to 11.59PM, such as the latest value, 82 at 5PM, which is shown in bold.
An example of the page is at http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours. Can anyone show me how? Thanks in advance!
<!-- start content -->
<h1 class="title" id="top">
PSI Readings over the last 24 Hours</h1>
<script type="text/javascript">
var baseUrl = '/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours';
function changetime(ddl) {
var strTime = ddl.options[ddl.selectedIndex].value;
if (strTime != null) {
var npage = baseUrl + "/time/" + strTime + "#psi24";
window.location = npage;
}
}
</script>
<h1 id="psi24">
24-hr PSI Readings on 24 Jun 2013
</h1>
<p>
View reading for:
<select class="default" id="ContentPlaceHolderContent_C001_DDLTime" name="ctl00$ContentPlaceHolderContent$C001$DDLTime" onchange="changetime(this);">
<option value="0000">12AM</option>
<option value="0100">1AM</option>
<option value="0200">2AM</option>
<option value="0300">3AM</option>
<option value="0400">4AM</option>
<option value="0500">5AM</option>
<option value="0600">6AM</option>
<option value="0700">7AM</option>
<option value="0800">8AM</option>
<option value="0900">9AM</option>
<option value="1000">10AM</option>
<option value="1100">11AM</option>
<option value="1200">12PM</option>
<option value="1300">1PM</option>
<option value="1400">2PM</option>
<option value="1500">3PM</option>
<option value="1600">4PM</option>
<option selected="selected" value="1700">5PM</option>
</select>
</p>
<table border="0" cellpadding="4" cellspacing="1" class="text_psinormal" width="100%">
<thead>
<tr>
<th width="33%">
<center><strong>Region</strong></center>
</th>
<th width="33%">
<center><strong>PSI</strong></center>
</th>
<th width="34%">
<center><strong>24-hr PM2.5 Concentration (µg/m<sup>3</sup>)</strong></center>
</th>
</tr>
</thead>
<tr>
<td align="center">North
</td>
<td align="center">
61
</td>
<td align="center">
47
</td>
</tr>
<tr>
<td align="center">South
</td>
<td align="center">
62
</td>
<td align="center">
46
</td>
</tr>
<tr>
<td align="center">East
</td>
<td align="center">
55
</td>
<td align="center">
39
</td>
</tr>
<tr>
<td align="center">West
</td>
<td align="center">
87
</td>
<td align="center">
83
</td>
</tr>
<tr>
<td align="center">Central
</td>
<td align="center">
58
</td>
<td align="center">
40
</td>
</tr>
<tr>
<td align="center">Overall Singapore
</td>
<td align="center">
55-87
</td>
<td align="center">
39-83
</td>
</tr>
</table>
<div>
</div>
<div>
<h1>3-hr PSI Readings from 12AM to 11.59PM on
24 Jun 2013</h1>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr>
<td align="center" width="16%">
<strong>Time</strong>
</td>
<td align="center" width="7%"><strong>12AM</strong>
</td>
<td align="center" width="7%"><strong>1AM</strong>
</td>
<td align="center" width="7%"><strong>2AM</strong>
</td>
<td align="center" width="7%"><strong>3AM</strong>
</td>
<td align="center" width="7%"><strong>4AM</strong>
</td>
<td align="center" width="7%"><strong>5AM</strong>
</td>
<td align="center" width="7%"><strong>6AM</strong>
</td>
<td align="center" width="7%"><strong>7AM</strong>
</td>
<td align="center" width="7%"><strong>8AM</strong>
</td>
<td align="center" width="7%"><strong>9AM</strong>
</td>
<td align="center" width="7%"><strong>10AM</strong>
</td>
<td align="center" width="7%"><strong>11AM</strong>
</td>
</tr>
<tr>
<td align="center">
<strong>3-hr PSI</strong>
</td>
<td align="center">
76
</td>
<td align="center">
70
</td>
<td align="center">
64
</td>
<td align="center">
59
</td>
<td align="center">
54
</td>
<td align="center">
51
</td>
<td align="center">
48
</td>
<td align="center">
47
</td>
<td align="center">
47
</td>
<td align="center">
47
</td>
<td align="center">
49
</td>
<td align="center">
52
</td>
</tr>
<tr>
<td align="center" width="16%">
<strong>Time</strong>
</td>
<td align="center" width="7%"><strong>12PM</strong>
</td>
<td align="center" width="7%"><strong>1PM</strong>
</td>
<td align="center" width="7%"><strong>2PM</strong>
</td>
<td align="center" width="7%"><strong>3PM</strong>
</td>
<td align="center" width="7%"><strong>4PM</strong>
</td>
<td align="center" width="7%"><strong>5PM</strong>
</td>
<td align="center" width="7%"><strong>6PM</strong>
</td>
<td align="center" width="7%"><strong>7PM</strong>
</td>
<td align="center" width="7%"><strong>8PM</strong>
</td>
<td align="center" width="7%"><strong>9PM</strong>
</td>
<td align="center" width="7%"><strong>10PM</strong>
</td>
<td align="center" width="7%"><strong>11PM</strong>
</td>
</tr>
<tr>
<td align="center">
<strong>3-hr PSI</strong>
</td>
<td align="center">
54
</td>
<td align="center">
59
</td>
<td align="center">
65
</td>
<td align="center">
72
</td>
<td align="center">
79
</td>
<td align="center">
<strong style="font-size:14px;">82</strong>
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
</tr>
</table>
</div>
<div class="sfContentBlock">
<p class="table-caption">Hourly updates of 3-hr PSI readings are provided from 12am to 11:59pm. The 3hr PSI readings are calculated based on PM10 concentrations only</p>
</div>
<div>
</div>
<div class="backToTop">
Back to Top
</div>
</div>
</div>
<!-- end content -->

You should have shown what you have tried yourself, but here is code that does it:
from pprint import pprint
import urllib2  # Python 2 module; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup as soup
url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urllib2.urlopen(url))
# The 3-hr PSI table is the first table inside the third div found under div.c1
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]
# Collect the stripped text of every cell, row by row
table_rows = []
for row in table.find_all('tr'):
    table_rows.append([td.text.strip() for td in row.find_all('td')])
# Rows alternate between times and readings: map each time to the cell below it
data = {}
for tr_index, tr in enumerate(table_rows):
    if tr_index % 2 == 0:
        for td_index, td in enumerate(tr):
            data[td] = table_rows[tr_index + 1][td_index]
pprint(data)
prints:
{'10AM': '49',
'10PM': '-',
'11AM': '52',
'11PM': '-',
'12AM': '76',
'12PM': '54',
'1AM': '70',
'1PM': '59',
'2AM': '64',
'2PM': '65',
'3AM': '59',
'3PM': '72',
'4AM': '54',
'4PM': '79',
'5AM': '51',
'5PM': '82',
'6AM': '48',
'6PM': '79',
'7AM': '47',
'7PM': '-',
'8AM': '47',
'8PM': '-',
'9AM': '47',
'9PM': '-',
'Time': '3-hr PSI'}
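
The pairing step above depends on the rows alternating between a row of times and a row of readings. As a minimal sketch of that step alone, the snippet below works on a hard-coded sample (the first three columns of the table in the question) instead of the live page. The helper name `pair_rows` is made up for illustration, and it uses `zip` in place of index arithmetic and skips the leading label cell so that the `'Time': '3-hr PSI'` entry does not appear in the result.

```python
# Rows as the scraping loop would produce them: a row of times, then a row of readings.
# Sample values are copied from the table in the question, truncated to three columns.
table_rows = [
    ["Time", "12AM", "1AM", "2AM"],
    ["3-hr PSI", "76", "70", "64"],
    ["Time", "12PM", "1PM", "2PM"],
    ["3-hr PSI", "54", "59", "65"],
]


def pair_rows(rows):
    """Map each time label to the reading directly below it (hypothetical helper)."""
    data = {}
    # rows[0::2] are the time rows, rows[1::2] are the reading rows
    for time_row, value_row in zip(rows[0::2], rows[1::2]):
        # Skip the first cell of each row, which only holds the row label
        for label, value in zip(time_row[1:], value_row[1:]):
            data[label] = value
    return data


readings = pair_rows(table_rows)
print(readings)
```

Because `zip` stops at the shorter sequence, a trailing time row without a matching reading row is ignored instead of raising an `IndexError`.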

beautiful soup get children that are Tags (not Navigable Strings) from a Tag

The Beautiful Soup documentation provides the attributes .contents and .children to access the children of a given tag (a list and an iterable respectively), and both include Navigable Strings as well as Tags. I want only the children of type Tag.
I'm currently accomplishing this using a list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better/more pythonic/built-in way to get just Tag children.

Thanks to J.F. Sebastian, the following will work:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed actual rows in the table, so I ended up using the following, which is more precise and I think more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag.
Worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100">
<div style="position:relative;">
President
<span class="pct-rpt">
99% reporting
</span>
</div>
</th>
</tr>
<tr class="header-row">
<th class="photo first">
</th>
<th class="candidate ">
Candidate
</th>
<th class="party ">
Party
</th>
<th class="votes ">
Votes
</th>
<th class="pct ">
Pct.
</th>
<th class="change ">
Change from ‘08
</th>
<th class="evotes last">
Electoral Votes
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first">
<div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div>
</td>
<td class="candidate ">
<div class="winner dem"><img alt="Hp-checkmark#2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark#2x.png?1352320690" width="10" />Barack Obama</div>
</td>
<td class="party ">
Dem.
</td>
<td class="votes ">
2,916,811
</td>
<td class="pct ">
57.3%
</td>
<td class="change ">
-4.6%
</td>
<td class="evotes last">
20
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Mitt Romney</div>
</td>
<td class="party ">
Rep.
</td>
<td class="votes ">
2,090,116
</td>
<td class="pct ">
41.1%
</td>
<td class="change ">
+4.3%
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Gary Johnson</div>
</td>
<td class="party ">
Lib.
</td>
<td class="votes ">
54,798
</td>
<td class="pct ">
1.1%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="last-row">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Jill Stein</div>
</td>
<td class="party ">
Green
</td>
<td class="votes ">
29,336
</td>
<td class="pct ">
0.6%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr>
<td class="footer" colspan="100">
President Map |
President Big Board |
Exit Polls
</td>
</tr>
</tbody>
</table>
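
To compare the approaches side by side, here is a small sketch on a cut-down version of the table above. It assumes the `bs4` package is installed and uses the built-in `html.parser` backend. The variable names are only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Cut-down sample: two rows, with the newlines between tags kept on purpose
html = """
<table>
<tbody>
<tr><td>Dem.</td><td>57.3%</td></tr>
<tr><td>Rep.</td><td>41.1%</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.table.tbody

# .children also yields the whitespace strings between the rows
all_children = list(tbody.children)

# Filtering by type, as in the question (isinstance instead of type() ==)
filtered = [child for child in tbody.children if isinstance(child, Tag)]

# Direct children that are Tags, whatever their name
tag_children = tbody.find_all(True, recursive=False)

# Every <tr> below tbody; searches all descendants by default
rows = tbody.find_all("tr")

# Only the <tr> elements that are direct children of tbody
direct_rows = tbody.find_all("tr", recursive=False)

print(len(all_children), len(filtered), len(tag_children), len(rows), len(direct_rows))
```

Note that `find_all('tr')` searches all descendants, so if a cell contained a nested table its rows would be returned as well. Passing `recursive=False` restricts the search to direct children.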
