How to extract using beautifulsoup python [duplicate] - python

I am only interested in using BeautifulSoup to extract all the values of the 3-hr PSI Readings from 12AM to 11.59PM, such as the latest reading, the bold 82 at 5PM.
An example of the page is at http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours. Can anyone teach me how? Thanks in advance!
<!-- start content -->
<h1 class="title" id="top">
PSI Readings over the last 24 Hours</h1>
<script type="text/javascript">
var baseUrl = '/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours';
function changetime(ddl) {
var strTime = ddl.options[ddl.selectedIndex].value;
if (strTime != null) {
var npage = baseUrl + "/time/" + strTime + "#psi24";
window.location = npage;
}
}
</script>
<h1 id="psi24">
24-hr PSI Readings on 24 Jun 2013
</h1>
<p>
View reading for:
<select class="default" id="ContentPlaceHolderContent_C001_DDLTime" name="ctl00$ContentPlaceHolderContent$C001$DDLTime" onchange="changetime(this);">
<option value="0000">12AM</option>
<option value="0100">1AM</option>
<option value="0200">2AM</option>
<option value="0300">3AM</option>
<option value="0400">4AM</option>
<option value="0500">5AM</option>
<option value="0600">6AM</option>
<option value="0700">7AM</option>
<option value="0800">8AM</option>
<option value="0900">9AM</option>
<option value="1000">10AM</option>
<option value="1100">11AM</option>
<option value="1200">12PM</option>
<option value="1300">1PM</option>
<option value="1400">2PM</option>
<option value="1500">3PM</option>
<option value="1600">4PM</option>
<option selected="selected" value="1700">5PM</option>
</select>
</p>
<table border="0" cellpadding="4" cellspacing="1" class="text_psinormal" width="100%">
<thead>
<tr>
<th width="33%">
<center><strong>Region</strong></center>
</th>
<th width="33%">
<center><strong>PSI</strong></center>
</th>
<th width="34%">
<center><strong>24-hr PM2.5 Concentration (µg/m<sup>3</sup>)</strong></center>
</th>
</tr>
</thead>
<tr>
<td align="center">North
</td>
<td align="center">
61
</td>
<td align="center">
47
</td>
</tr>
<tr>
<td align="center">South
</td>
<td align="center">
62
</td>
<td align="center">
46
</td>
</tr>
<tr>
<td align="center">East
</td>
<td align="center">
55
</td>
<td align="center">
39
</td>
</tr>
<tr>
<td align="center">West
</td>
<td align="center">
87
</td>
<td align="center">
83
</td>
</tr>
<tr>
<td align="center">Central
</td>
<td align="center">
58
</td>
<td align="center">
40
</td>
</tr>
<tr>
<td align="center">Overall Singapore
</td>
<td align="center">
55-87
</td>
<td align="center">
39-83
</td>
</tr>
</table>
<div>
</div>
<div>
<h1>3-hr PSI Readings from 12AM to 11.59PM on
24 Jun 2013</h1>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr>
<td align="center" width="16%">
<strong>Time</strong>
</td>
<td align="center" width="7%"><strong>12AM</strong>
</td>
<td align="center" width="7%"><strong>1AM</strong>
</td>
<td align="center" width="7%"><strong>2AM</strong>
</td>
<td align="center" width="7%"><strong>3AM</strong>
</td>
<td align="center" width="7%"><strong>4AM</strong>
</td>
<td align="center" width="7%"><strong>5AM</strong>
</td>
<td align="center" width="7%"><strong>6AM</strong>
</td>
<td align="center" width="7%"><strong>7AM</strong>
</td>
<td align="center" width="7%"><strong>8AM</strong>
</td>
<td align="center" width="7%"><strong>9AM</strong>
</td>
<td align="center" width="7%"><strong>10AM</strong>
</td>
<td align="center" width="7%"><strong>11AM</strong>
</td>
</tr>
<tr>
<td align="center">
<strong>3-hr PSI</strong>
</td>
<td align="center">
76
</td>
<td align="center">
70
</td>
<td align="center">
64
</td>
<td align="center">
59
</td>
<td align="center">
54
</td>
<td align="center">
51
</td>
<td align="center">
48
</td>
<td align="center">
47
</td>
<td align="center">
47
</td>
<td align="center">
47
</td>
<td align="center">
49
</td>
<td align="center">
52
</td>
</tr>
<tr>
<td align="center" width="16%">
<strong>Time</strong>
</td>
<td align="center" width="7%"><strong>12PM</strong>
</td>
<td align="center" width="7%"><strong>1PM</strong>
</td>
<td align="center" width="7%"><strong>2PM</strong>
</td>
<td align="center" width="7%"><strong>3PM</strong>
</td>
<td align="center" width="7%"><strong>4PM</strong>
</td>
<td align="center" width="7%"><strong>5PM</strong>
</td>
<td align="center" width="7%"><strong>6PM</strong>
</td>
<td align="center" width="7%"><strong>7PM</strong>
</td>
<td align="center" width="7%"><strong>8PM</strong>
</td>
<td align="center" width="7%"><strong>9PM</strong>
</td>
<td align="center" width="7%"><strong>10PM</strong>
</td>
<td align="center" width="7%"><strong>11PM</strong>
</td>
</tr>
<tr>
<td align="center">
<strong>3-hr PSI</strong>
</td>
<td align="center">
54
</td>
<td align="center">
59
</td>
<td align="center">
65
</td>
<td align="center">
72
</td>
<td align="center">
79
</td>
<td align="center">
<strong style="font-size:14px;">82</strong>
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
</tr>
</table>
</div>
<div class="sfContentBlock">
<p class="table-caption">Hourly updates of 3-hr PSI readings are provided from 12am to 11:59pm. The 3hr PSI readings are calculated based on PM10 concentrations only</p>
</div>
<div>
</div>
<div class="backToTop">
Back to Top
</div>
</div>
</div>
<!-- end content -->

You should have shown what you tried yourself, but here is the code:
from pprint import pprint
import urllib2  # Python 2 only; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup as soup
url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urllib2.urlopen(url))
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]
table_rows = []
for row in table.find_all('tr'):
    table_rows.append([td.text.strip() for td in row.find_all('td')])
data = {}
# Even rows hold the time labels; the row after each one holds the readings.
for tr_index, tr in enumerate(table_rows):
    if tr_index % 2 == 0:
        for td_index, td in enumerate(tr):
            data[td] = table_rows[tr_index + 1][td_index]
pprint(data)
prints:
{'10AM': '49',
'10PM': '-',
'11AM': '52',
'11PM': '-',
'12AM': '76',
'12PM': '54',
'1AM': '70',
'1PM': '59',
'2AM': '64',
'2PM': '65',
'3AM': '59',
'3PM': '72',
'4AM': '54',
'4PM': '79',
'5AM': '51',
'5PM': '82',
'6AM': '48',
'6PM': '79',
'7AM': '47',
'7PM': '-',
'8AM': '47',
'8PM': '-',
'9AM': '47',
'9PM': '-',
'Time': '3-hr PSI'}
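For anyone who wants to try the row-pairing logic on Python 3 without fetching the live page, here is a minimal sketch. It is not the answer's code: `SAMPLE_HTML` is a shortened stand-in for the markup in the question, the helper name `parse_psi` is made up for illustration, and the table is located through the `<h1>` text rather than by `div` position, which is an assumption about what stays stable on the page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question (an assumption, not the live site).
SAMPLE_HTML = """
<div>
<h1>3-hr PSI Readings from 12AM to 11.59PM on
24 Jun 2013</h1>
<table>
<tr><td><strong>Time</strong></td><td><strong>12AM</strong></td><td><strong>1AM</strong></td></tr>
<tr><td><strong>3-hr PSI</strong></td><td>
76
</td><td>
70
</td></tr>
<tr><td><strong>Time</strong></td><td><strong>4PM</strong></td><td><strong>5PM</strong></td><td><strong>6PM</strong></td></tr>
<tr><td><strong>3-hr PSI</strong></td><td>79</td><td><strong style="font-size:14px;">82</strong></td><td>
-
</td></tr>
</table>
</div>
"""


def parse_psi(html):
    """Return {time label: reading} for the 3-hr PSI table (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the heading by its text, then take the first table after it.
    heading = soup.find(
        lambda tag: tag.name == "h1" and tag.get_text().strip().startswith("3-hr PSI")
    )
    table = heading.find_next("table")
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]
    readings = {}
    # Rows alternate: a row of time labels, then a row of values.
    for labels, values in zip(rows[0::2], rows[1::2]):
        # Skip the first cell of each row ("Time" / "3-hr PSI").
        readings.update(zip(labels[1:], values[1:]))
    return readings


readings = parse_psi(SAMPLE_HTML)
print(readings)
```

Skipping the first cell of each row keeps the `'Time': '3-hr PSI'` pair out of the result, and hours without a reading yet come back as `'-'`.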

Related

Get <td> text using python selenium

<html>
<body>
<table style="border:0">
<tbody>
<tr class="">
<td class="pr10">Mon</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Tue</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="bold">
<td class="pr10">Wed</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Thu</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Fri</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Sat</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Sun</td>
<td class="pl10">11am – 11pm</td>
</tr>
</tbody>
</table>
</body>
</html>
Try 1:
driver.find_elements_by_xpath("//*[@class='pr10']")
Try 2:
driver.find_element_by_xpath("//tr[td='Mon']/td").text
But it is not fetching the text "Mon" "11am - 11pm":
text_area = driver.find_elements_by_xpath("//*[@class='pr10']")
for items2 in text_area:
    print(items2.text)
Try this instead:
text_area = driver.find_elements_by_xpath("""//*[@id="body"]/table/tbody/tr[1]/td[1]""")
print([elm.get_attribute('innerHTML') for elm in text_area])
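The XPath expressions themselves can be checked without a browser. As a rough sketch (an addition, not part of the answer), the standard library's `xml.etree.ElementTree` supports a small XPath subset that covers both `[@class='pr10']` and `[td='Mon']`. The markup below is a shortened, well-formed copy of the table from the question; running the same expressions through Selenium needs a live driver and is not shown here.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the table from the question.
TABLE = """
<table style="border:0">
  <tbody>
    <tr class=""><td class="pr10">Mon</td><td class="pl10">11am – 11pm</td></tr>
    <tr class=""><td class="pr10">Tue</td><td class="pl10">11am – 11pm</td></tr>
    <tr class="bold"><td class="pr10">Wed</td><td class="pl10">11am – 11pm</td></tr>
  </tbody>
</table>
"""

root = ET.fromstring(TABLE)

# Attribute tests use "@", as in //*[@class='pr10'].
days = [td.text for td in root.findall(".//*[@class='pr10']")]
print(days)

# All cells of the row that has a <td> whose text is exactly 'Mon'.
monday = [td.text for td in root.findall(".//tr[td='Mon']/td")]
print(monday)
```

If both expressions match here but Selenium still returns empty text, the cause is more likely to be page state (for example, elements that are not yet rendered or not visible) than the XPath; that is a guess, since the question does not show the page loading code.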

How to find a value in a table with no identifiers? (Python, Selenium)

I have a webpage with a table with many rows. A user will give me a number (15308), which can be found in the first <td> tag of a row, and this is the only information I will have. I want to use this number to find the data between the <th></th> tags (more specifically the 0), but only for that table row. For example, I attached two table rows; I want the <th> data for the row with the number 15308, but not the <th> data from the row that has 15309 in its first <td>. Any help is appreciated!
Desired Output: 0
<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>
Use the following code:
userValue = '15308'
# Select every <td> and <th> that follows the matching <td> within the same row.
all_td_th_of_row = driver.find_elements_by_xpath("//td[normalize-space()='" + userValue + "']/following-sibling::*[self::td or self::th]")
for cell in all_td_th_of_row:
    print(cell.text)
Something I have always found beautiful, using BeautifulSoup:
Using the xpath="1" as an attribute:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find('th', attrs={'xpath': '1'})
print(xpathTh.text.strip())
OUTPUT:
0
EDIT:
To get all the values from the attrib:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 2</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find_all('th', attrs={'xpath': '1'})
for elem in xpathTh:
    print(elem.text.strip())
OUTPUT:
0
1
2
EDIT 2:
Considering you only want the xpath value if the anchor tag inside the td (inside tr) has a value of 15308:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>22222</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
    val = tr.select('td a')[0].text
    if toFind == val:
        xpathTh = tr.find_all('th', attrs={'xpath': '1'})
        for elem in xpathTh:
            print(elem.text.strip())
OUTPUT:
0
EDIT 3:
Continuing from comments:
line = '''<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
    val = tr.select('td a')[0].text
    if toFind == val:
        xpathTh = tr.find_all('td')[7]
        print("For the value: {}, The result is {}".format(toFind, xpathTh.find_next('th').text.strip()))
OUTPUT:
For the value: 15308, The result is 0
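One caveat about the snippets above: the rows as shown contain no `<a>` element inside the first `<td>`, so `tr.select('td a')` would return an empty list for them. The sketch below assumes the markup exactly as shown (plain text in the first cell) and matches on that cell directly. `ROWS` is a shortened copy of the two rows and `th_for_row` is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Two shortened rows in the shape shown in the question (no <a> inside <td>).
ROWS = """
<table>
<tr>
<td>15308</td><td nowrap="">INFO 101 </td><td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
</tr>
<tr><td>15309</td><td nowrap="">INFO 101 </td><td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
</tr>
</table>
"""


def th_for_row(html, wanted):
    """Return the stripped <th> text of the row whose first <td> equals wanted."""
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find_all("tr"):
        first_td = tr.find("td")
        if first_td is not None and first_td.get_text(strip=True) == wanted:
            th = tr.find("th")
            return th.get_text(strip=True) if th is not None else None
    return None  # no row starts with the wanted value


print(th_for_row(ROWS, "15308"))
```

Returning `None` for a missing row or a missing `<th>` keeps the caller from hitting an `IndexError` when the user-supplied number is not in the table.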

Remove text outside of tags with bs4

I want to delete the text Página consultada el: but I don't know how because it's outside any tag.
I've tried with this but nothing changes:
for b in soup.find('br'):
    if( b.nextSibling == 'Página consultada el:'):
        b.nextSibling.replaceWith('')
    if(b.previousSibling == 'Página consultada el:'):
        b.previousSibling.replaceWith('')
This is the html of the part I want to remove:
<br/>
<br/>Página consultada el:
<br/>
<strong>27/01/2018 21:42:14</strong>
Whole html:
<html xmlns="http://www.w3.org/1999/xhtml">
<body><strong></strong>
<center><strong></strong>
<br/><br/><br/><br/>
<center>
</center>
<table border="1" cellpadding="0" cellspacing="0" style="width:400px">
<tbody>
<tr>
<td align="CENTER">
<p>Turno: Matutino</p>
</td>
<td align="CENTER"> Grupo: 401 </td>
</tr>
<tr>
<td align="CENTER" colspan="2">
<p>Profesor tutor: <br/> MONICA OSORNIO PEREZ.</p>
</td>
</tr>
</tbody>
</table>
<br/><br/>
<table border="1" cellpadding="0" cellspacing="0" style="width:1000px">
<tbody>
<tr>
<td align="CENTER" style="width:70px;">
<p>Hora:</p>
</td>
<td align="CENTER" style="width:186px;">Lunes </td>
<td align="CENTER" style="width:186px;">Martes </td>
<td align="CENTER" style="width:186px;">Miércoles </td>
<td align="CENTER" style="width:186px;">Jueves </td>
<td align="CENTER" style="width:186px;">Viernes </td>
</tr>
<tr>
<td align="CENTER">
<p>7:00<br/>a<br/>7:50</p>
</td>
<td align="CENTER">
<p> ORI.EDU.IV(A): A204<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>7:50<br/>a<br/>8:40</p>
</td>
<td align="CENTER">
<p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>8:40<br/>a<br/>9:30</p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> INFORMATICA CC2 <br/></p>
</td>
<td align="CENTER">
<p> HISTORIA III B116<br/></p>
</td>
<td align="CENTER">
<p> ORI.EDU.IV(B): A205<br/></p>
</td>
<td align="CENTER">
<p> DIBUJO II(A): B-8 <br/>DIBUJO II(B): C101<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>9:30<br/>a<br/>10:20</p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> GEOGRAFIA A102<br/></p>
</td>
<td align="CENTER">
<p> FISICA III A303<br/></p>
</td>
<td align="CENTER">
<p> GEOGRAFIA A102<br/></p>
</td>
<td align="CENTER">
<p> DIBUJO II(A): B-8 <br/>DIBUJO II(B): C101<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>10:20<br/>a<br/>11:10</p>
</td>
<td align="CENTER">
<p> HISTORIA III B108<br/></p>
</td>
<td align="CENTER">
<p> INFORMATICA B108<br/></p>
</td>
<td align="CENTER">
<p> FISICA III A303<br/></p>
</td>
<td align="CENTER">
<p> FISICA III LACE<br/></p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>11:10<br/>a<br/>12:00</p>
</td>
<td align="CENTER">
<p> LOGICA B108<br/></p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> GEOGRAFIA A103<br/></p>
</td>
<td align="CENTER">
<p> FISICA III LACE<br/></p>
</td>
<td align="CENTER">
<p> LOGICA B108<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>12:00<br/>a<br/>12:50</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> LOGICA B108<br/></p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> HISTORIA III B108<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>12:50<br/>a<br/>13:40</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>13:40<br/>a<br/>14:30</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> ED FISICA IV GIM <br/></p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>14:30<br/>a<br/>15:20</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
</tbody>
</table><br/>
<table border="1" cellpadding="0" cellspacing="0" style="width:1000px">
<tbody>
<tr>
<td style="width:165px;">
<p>Asignatura:</p>
</td>
<td style="width:335px;">Nombre del Profesor:</td>
<td style="width:165px;">Asignatura:</td>
<td style="width:335px;">Nombre del Profesor:</td>
</tr>
<tr>
<td>
<p>ORI.EDU.IV(A):</p>
</td>
<td>BECERRA ALCANTARA IVONNE </td>
<td>
<p>INGLES IV(B):</p>
</td>
<td>CARRILLO SANCHEZ JACOBO </td>
</tr>
<tr>
<td>
<p>LENG. ESP.</p>
</td>
<td>ESTRADA GASCA SCARLETT </td>
<td>
<p>FISICA III</p>
</td>
<td>FLORES FLORES ANA </td>
</tr>
<tr>
<td>
<p>HISTORIA III</p>
</td>
<td>GONZALEZ GARCIA ANGELICA ARACELI </td>
<td>
<p>DIBUJO II(A):</p>
</td>
<td>JIMENEZ GENCHI ERIKA PAOLA </td>
</tr>
<tr>
<td>
<p>LOGICA</p>
</td>
<td>NAVARRO LOZANO JULIANA V. </td>
<td>
<p>MATEMAT. IV</p>
</td>
<td>OLVERA PEÑA ALEJANDRO </td>
</tr>
<tr>
<td>
<p>GEOGRAFIA</p>
</td>
<td>OSORNIO PEREZ MONICA </td>
<td>
<p>ORI.EDU.IV(B):</p>
</td>
<td>PINEDA VALLEJO MARIA GABRIELA </td>
</tr>
<tr>
<td>
<p>INGLES IV(A):</p>
</td>
<td>REYES CRUZ KIMBERLY </td>
<td>
<p>ED FISICA IV</p>
</td>
<td>SANCHEZ LUGO EDGARDO JAIME </td>
</tr>
<tr>
<td>
<p>INFORMATICA</p>
</td>
<td>SOTOMAYOR GUERRA JUAN CARLOS </td>
<td>
<p>DIBUJO II(B):</p>
</td>
<td>VILLANUEVA VILCHIS MONICA EDITH </td>
</tr>
<tr>
<td>
<p></p>
</td>
<td></td>
<td>
<p></p>
</td>
<td></td>
</tr>
</tbody>
</table>
<br/><br/>Página consultada el:<br/><strong>27/01/2018 21:42:14</strong>
</center>
</body>
</html>
This might accomplish what you need:
import re
html = re.sub(r'</table>\n<br/><br/>.+<br/>', '</table>\n<br/><br/><br/>', html)
That removes the text "Página consultada el:" from html.
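It may also help to see why the loop in the question changes nothing: `soup.find('br')` returns a single `<br/>` tag, and iterating over a tag walks its children, of which `<br/>` has none, so the loop body never runs. As an alternative to editing the raw string, here is a minimal sketch that stays inside BeautifulSoup. It assumes the text appears once, and `HTML` is a shortened copy of the end of the page.

```python
import re
from bs4 import BeautifulSoup

# Shortened tail of the page from the question.
HTML = """<center>
<table><tbody><tr><td>x</td></tr></tbody></table>
<br/><br/>Página consultada el:<br/><strong>27/01/2018 21:42:14</strong>
</center>"""

soup = BeautifulSoup(HTML, "html.parser")

# Text that is not wrapped in a tag of its own is still a NavigableString in the tree.
# Match on a substring because the node may carry surrounding whitespace.
node = soup.find(string=re.compile("Página consultada el:"))
if node is not None:
    node.extract()  # detach the text node from the tree

result = str(soup)
print(result)
```

The timestamp inside `<strong>` is left alone because it is a separate node.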

How to parse this html structure using BeautifulSoup?

I would like to parse this TABLE row by row and save it to a CSV file.
What I have done so far returns nothing in the CSV file:
Django:
data_scrapper makes a request to Yahoo Finance.
def button_clicked(request):
    headers = []
    rows = []
    gen_table = data_scrapper(symbol)
    soup = BeautifulSoup(gen_table)
    table = soup.find_all('table')
    for table in soup.find_all('table'):
        headers.extend([header.text for header in table.find_all('th')])
    for row in soup.find_all('tr'):
        rows.extend([val.text for val in row.find_all('td')])
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename= "{}.csv"'.format(symbol)
    writer = csv.writer(response)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
    return response
html:
<TABLE class="yfnc_tabledata1" width="100%" cellpadding="0" cellspacing="0" border="0">
<TR>
<TD>
<TABLE width="100%" cellpadding="2" cellspacing="0" border="0">
<TR class="yfnc_modtitle1" style="border-top:none;">
<td colspan="2" style="border-top:2px solid #000;">
<small>
<span class="yfi-module-title">Period Ending</span>
</small>
</td>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2014</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2013</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2012</th>
</TR>
<tr>
<td colspan="2">
<strong>
Total Revenue
</strong>
</td>
<td align="right">
<strong>
4,479,648
</strong>
</td>
<td align="right">
<strong>
3,777,068
</strong>
</td>
<td align="right">
<strong>
3,209,782
</strong>
</td>
</tr>
<tr>
<td colspan="2">Cost of Revenue</td>
<td align="right">3,160,470 </td>
<td align="right">2,656,189 </td>
<td align="right">2,284,485 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Gross Profit
</strong>
</td>
<td align="right">
<strong>
1,319,178
</strong>
</td>
<td align="right">
<strong>
1,120,879
</strong>
</td>
<td align="right">
<strong>
925,297
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Operating Expenses</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Research Development</td>
<td align="right">148,458 </td>
<td align="right">139,193 </td>
<td align="right">127,361 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Selling General and Administrative</td>
<td align="right">456,030 </td>
<td align="right">403,772 </td>
<td align="right">319,511 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Non Recurring</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Others</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Operating Expenses</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Operating Income or Loss
</strong>
</td>
<td align="right">
<strong>
714,690
</strong>
</td>
<td align="right">
<strong>
577,914
</strong>
</td>
<td align="right">
<strong>
478,425
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Income from Continuing Operations</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Other Income/Expenses Net</td>
<td align="right">(10)</td>
<td align="right">5,139 </td>
<td align="right">7,529 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Earnings Before Interest And Taxes</td>
<td align="right">710,556 </td>
<td align="right">580,639 </td>
<td align="right">485,775 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Interest Expense</td>
<td align="right">11,239 </td>
<td align="right">6,210 </td>
<td align="right">5,932 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Before Tax</td>
<td align="right">699,317 </td>
<td align="right">574,429 </td>
<td align="right">479,843 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Tax Expense</td>
<td align="right">245,288 </td>
<td align="right">193,360 </td>
<td align="right">167,533 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Minority Interest</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Net Income From Continuing Ops</td>
<td align="right">454,029 </td>
<td align="right">381,069 </td>
<td align="right">312,310 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Non-recurring Events</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Discontinued Operations</td>
<td align="right">
-
</td>
<td align="right">(3,777)</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Extraordinary Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Effect Of Accounting Changes</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Other Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
<tr>
<td colspan="2">Preferred Stock And Other Adjustments</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income Applicable To Common Shares
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
</TABLE>
</TD>
</TR>
</TABLE>
Here's some code that makes a CSV that looks like the table. The CSVs I usually work with have each row as a complete record, so all the values in column one would be the CSV header. Just something to think about; it might be helpful.
Python 3.4
from bs4 import BeautifulSoup
import re
import csv
def button_clicked(request, filename):
    soup = BeautifulSoup(request)
    table = soup.find('table').find('table')
    t_rows = table.find_all('tr')
    # newline='' lets the csv module control the line endings
    with open(filename, 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for t_row in t_rows:
            rec_as_str = t_row.getText()
            rec_as_str = rec_as_str.strip()
            rec_as_str = rec_as_str.replace('\xa0', '')
            # Collapse each run of newlines and surrounding whitespace into one "|"
            rec_as_str = re.sub(r'\n?\s*(\n)+\s*', '|', rec_as_str)
            if len(rec_as_str) > 0:
                a_list = rec_as_str.split("|")
                spamwriter.writerow(a_list)
Creates a file that looks like:
Period Ending,"Dec 31, 2014","Dec 31, 2013","Dec 31, 2012"
Total Revenue,"4,479,648","3,777,068","3,209,782"
Cost of Revenue,"3,160,470","2,656,189","2,284,485"
Gross Profit,"1,319,178","1,120,879","925,297"
Operating Expenses
Research Development,"148,458","139,193","127,361"
Selling General and Administrative,"456,030","403,772","319,511"
Non Recurring,-,-,-
Others,-,-,-
Total Operating Expenses,-,-,-
Operating Income or Loss,"714,690","577,914","478,425"
Income from Continuing Operations
Total Other Income/Expenses Net,(10),"5,139","7,529"
Earnings Before Interest And Taxes,"710,556","580,639","485,775"
Interest Expense,"11,239","6,210","5,932"
Income Before Tax,"699,317","574,429","479,843"
Income Tax Expense,"245,288","193,360","167,533"
Minority Interest,-,-,-
Net Income From Continuing Ops,"454,029","381,069","312,310"
Non-recurring Events
Discontinued Operations,-,"(3,777)",-
Extraordinary Items,-,-,-
Effect Of Accounting Changes,-,-,-
Other Items,-,-,-
Net Income,"454,029","377,292","312,310"
Preferred Stock And Other Adjustments,-,-,-
Net Income Applicable To Common Shares,"454,029","377,292","312,310"
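A variation that may be worth trying is to build each CSV row from the cells themselves instead of splitting the row text on newlines. This is only a sketch under assumptions: `HTML` is a shortened stand-in for the nested table in the question, `table_to_csv` is a made-up name, and the output goes to an in-memory buffer. It also shows why the question's version misbehaves: `rows.extend(...)` flattens every cell into one list of strings, so `writerows` never receives one list per table row.

```python
import csv
import io
from bs4 import BeautifulSoup

# Shortened stand-in for the nested Yahoo Finance table in the question.
HTML = """
<TABLE class="yfnc_tabledata1"><TR><TD>
<TABLE>
<TR class="yfnc_modtitle1">
<td colspan="2"><small><span>Period Ending</span></small></td>
<th scope="col">Dec 31, 2014</th>
<th scope="col">Dec 31, 2013</th>
</TR>
<tr><td colspan="2"><strong>
Total Revenue
</strong></td>
<td align="right"><strong>
4,479,648
</strong></td>
<td align="right"><strong>
3,777,068
</strong></td></tr>
<tr><td colspan="5"><span style="display:block;"></span></td></tr>
<tr><td width="30"><spacer type="block" width="30" height="1" /></td>
<td>Research Development</td>
<td align="right">148,458 </td>
<td align="right">139,193 </td></tr>
</TABLE>
</TD></TR></TABLE>
"""


def table_to_csv(html):
    """Return CSV text with one line per non-empty table row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    inner = soup.find("table").find("table")  # the data table is nested
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for tr in inner.find_all("tr"):
        # Header cells are <th>, data cells are <td>; take both, in order.
        cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
        cells = [c for c in cells if c]  # drop spacer cells
        if cells:
            writer.writerow(cells)
    return out.getvalue()


csv_text = table_to_csv(HTML)
print(csv_text)
```

In the Django view from the question, the same `csv.writer` could be pointed at the `HttpResponse` instead of the `StringIO` buffer.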

beautiful soup get children that are Tags (not Navigable Strings) from a Tag

The Beautiful Soup documentation describes the attributes .contents and .children for accessing the children of a given tag (a list and an iterable, respectively); both include Navigable Strings as well as Tags. I want only the children of type Tag.
I'm currently accomplishing this using list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better/more pythonic/built-in way to get just Tag children.
Thanks to J.F. Sebastian, the following will work:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed actual rows in the table, so I ended up using the following, which is more precise and I think more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag.
Worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100">
<div style="position:relative;">
President
<span class="pct-rpt">
99% reporting
</span>
</div>
</th>
</tr>
<tr class="header-row">
<th class="photo first">
</th>
<th class="candidate ">
Candidate
</th>
<th class="party ">
Party
</th>
<th class="votes ">
Votes
</th>
<th class="pct ">
Pct.
</th>
<th class="change ">
Change from ‘08
</th>
<th class="evotes last">
Electoral Votes
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first">
<div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div>
</td>
<td class="candidate ">
<div class="winner dem"><img alt="Hp-checkmark@2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark@2x.png?1352320690" width="10" />Barack Obama</div>
</td>
<td class="party ">
Dem.
</td>
<td class="votes ">
2,916,811
</td>
<td class="pct ">
57.3%
</td>
<td class="change ">
-4.6%
</td>
<td class="evotes last">
20
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Mitt Romney</div>
</td>
<td class="party ">
Rep.
</td>
<td class="votes ">
2,090,116
</td>
<td class="pct ">
41.1%
</td>
<td class="change ">
+4.3%
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Gary Johnson</div>
</td>
<td class="party ">
Lib.
</td>
<td class="votes ">
54,798
</td>
<td class="pct ">
1.1%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="last-row">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Jill Stein</div>
</td>
<td class="party ">
Green
</td>
<td class="votes ">
29,336
</td>
<td class="pct ">
0.6%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr>
<td class="footer" colspan="100">
President Map |
President Big Board |
Exit Polls
</td>
</tr>
</tbody>
</table>
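To make the difference between the approaches concrete, here is a small sketch on a cut-down table (the markup is a shortened stand-in for the input above). It compares the type filter from the question with `find_all(True, recursive=False)`, and shows what happens when `recursive=False` is left out.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = """<table><tbody>
<tr><td>Barack Obama</td><td><span>Dem.</span></td></tr>
<tr><td>Mitt Romney</td><td><span>Rep.</span></td></tr>
</tbody></table>"""

tbody = BeautifulSoup(HTML, "html.parser").tbody

# .children mixes Tags with the whitespace strings between them.
all_children = list(tbody.children)

# Option 1: filter by type (isinstance also accepts Tag subclasses).
by_type = [child for child in tbody.children if isinstance(child, Tag)]

# Option 2: match any tag name, direct children only.
direct = tbody.find_all(True, recursive=False)

# Without recursive=False, nested tags (td, span) are returned as well.
nested = tbody.find_all(True)

print(len(all_children), len(by_type), len(direct), len(nested))
```

Note that `tbody.find_all('tr')` without `recursive=False` would also pick up rows of any table nested inside a cell, which does not matter for this input but can for others.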
