I have a webpage with a table with many rows. A user will give me a number (15308) which can be found in the top line with the first <td> tag, and this is the only information I will have. I want to be able to use this number to find the data between the <th></th> tag (more specifically the 0), but only for the table row. For example, I attached two table rows and I want the <th> data using the number 15308, but not the <th> data from the table row that has the number 15309 in it's first <td>. Any help is appreciated!
Desired Output: 0
<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>
Use Following code :
userValue='15308'
all_td_th_of_row = driver.find_elements_by_xpath("//td[normalize-space()='" + userValue + "']//following-sibling::td|th")
i = 0
while i<len(all_td_th_of_row) :
print(all_td_th_of_row[i].text)
i=i+1
Something I have always found beautiful, using beauitfulsoup:
Using the xpath="1" as an attribute:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find('th', attrs={'xpath': '1'})
print(xpathTh.text.strip())
OUTPUT:
0
EDIT:
To get all the values from the attrib:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 2</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find_all('th', attrs={'xpath': '1'})
for elem in xpathTh:
print(elem.text.strip())
OUTPUT:
0
1
2
EDIT 2:
Considering you only want the xpath value if the anchor tag inside the td (inside tr) has a value of 15308:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>22222</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
val = tr.select('td a')[0].text
if toFind == val:
xpathTh = tr.find_all('th', attrs={'xpath': '1'})
for elem in xpathTh:
print(elem.text.strip())
OUTPUT:
0
EDIT 3:
Continuing from comments:
line = '''<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
val = tr.select('td a')[0].text
if toFind == val:
xpathTh = tr.find_all('td')[7]
print("For the value: {}, The result is {}".format(toFind, xpathTh.find_next('th').text.strip()))
OUTPUT:
For the value: 15308, The result is 0
I would like to parse this TABLE line by line and save to a csv file.
What I have done so far, return nothing in the csv file:
Django:
data_scrapper makes a request from Yahoo Finance.
def button_clicked(request):
headers = []
rows = []
gen_table = data_scrapper(symbol)
soup = BeautifulSoup(gen_table)
table = soup.find_all('table')
for table in soup.find_all('table'):
headers.extend([header.text for header in table.find_all('th')])
for row in soup.find_all('tr'):
rows.extend([val.text for val in row.find_all('td')])
response = HttpResponse(content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename= "{}.csv"'.format(symbol)
writer = csv.writer(response)
writer.writerow(headers)
writer.writerows(row for row in rows if row)
return response
html:
<TABLE class="yfnc_tabledata1" width="100%" cellpadding="0" cellspacing="0" border="0">
<TR>
<TD>
<TABLE width="100%" cellpadding="2" cellspacing="0" border="0">
<TR class="yfnc_modtitle1" style="border-top:none;">
<td colspan="2" style="border-top:2px solid #000;">
<small>
<span class="yfi-module-title">Period Ending</span>
</small>
</td>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2014</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2013</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2012</th>
</TR>
<tr>
<td colspan="2">
<strong>
Total Revenue
</strong>
</td>
<td align="right">
<strong>
4,479,648
</strong>
</td>
<td align="right">
<strong>
3,777,068
</strong>
</td>
<td align="right">
<strong>
3,209,782
</strong>
</td>
</tr>
<tr>
<td colspan="2">Cost of Revenue</td>
<td align="right">3,160,470 </td>
<td align="right">2,656,189 </td>
<td align="right">2,284,485 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Gross Profit
</strong>
</td>
<td align="right">
<strong>
1,319,178
</strong>
</td>
<td align="right">
<strong>
1,120,879
</strong>
</td>
<td align="right">
<strong>
925,297
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Operating Expenses</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Research Development</td>
<td align="right">148,458 </td>
<td align="right">139,193 </td>
<td align="right">127,361 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Selling General and Administrative</td>
<td align="right">456,030 </td>
<td align="right">403,772 </td>
<td align="right">319,511 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Non Recurring</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Others</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Operating Expenses</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Operating Income or Loss
</strong>
</td>
<td align="right">
<strong>
714,690
</strong>
</td>
<td align="right">
<strong>
577,914
</strong>
</td>
<td align="right">
<strong>
478,425
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Income from Continuing Operations</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Other Income/Expenses Net</td>
<td align="right">(10)</td>
<td align="right">5,139 </td>
<td align="right">7,529 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Earnings Before Interest And Taxes</td>
<td align="right">710,556 </td>
<td align="right">580,639 </td>
<td align="right">485,775 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Interest Expense</td>
<td align="right">11,239 </td>
<td align="right">6,210 </td>
<td align="right">5,932 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Before Tax</td>
<td align="right">699,317 </td>
<td align="right">574,429 </td>
<td align="right">479,843 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Tax Expense</td>
<td align="right">245,288 </td>
<td align="right">193,360 </td>
<td align="right">167,533 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Minority Interest</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Net Income From Continuing Ops</td>
<td align="right">454,029 </td>
<td align="right">381,069 </td>
<td align="right">312,310 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Non-recurring Events</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Discontinued Operations</td>
<td align="right">
-
</td>
<td align="right">(3,777)</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Extraordinary Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Effect Of Accounting Changes</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Other Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
<tr>
<td colspan="2">Preferred Stock And Other Adjustments</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income Applicable To Common Shares
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
</TABLE>
</TD>
</TR>
</TABLE>
Here's some code that makes a csv that looks like the table. The csvs I usually work with have a row as a complete record. So all the values in column one would be the csv header. Just something to think about, it might be helpful
Python 3.4
from bs4 import BeautifulSoup
import re
import csv
def button_clicked(request, filename):
soup = BeautifulSoup(request)
table = soup.find('table').find('table')
t_rows = table.find_all('tr')
with open(filename, 'w') as csvfile:
spamwriter = csv.writer(csvfile, delimiter=',',
quotechar='"', quoting=csv.QUOTE_MINIMAL)
for t_row in t_rows:
rec_as_str = t_row.getText()
rec_as_str = rec_as_str.strip()
rec_as_str = rec_as_str.replace('\xa0', '')
rec_as_str = re.sub('\\n?\s*(\\n)+\s*', '|', rec_as_str)
if len(rec_as_str) > 0:
a_list = rec_as_str.split("|")
spamwriter.writerow(a_list)
Creates a file that looks like:
Period Ending,"Dec 31, 2014","Dec 31, 2013","Dec 31, 2012"
Total Revenue,"4,479,648","3,777,068","3,209,782"
Cost of Revenue,"3,160,470","2,656,189","2,284,485"
Gross Profit,"1,319,178","1,120,879","925,297"
Operating Expenses
Research Development,"148,458","139,193","127,361"
Selling General and Administrative,"456,030","403,772","319,511"
Non Recurring,-,-,-
Others,-,-,-
Total Operating Expenses,-,-,-
Operating Income or Loss,"714,690","577,914","478,425"
Income from Continuing Operations
Total Other Income/Expenses Net,(10),"5,139","7,529"
Earnings Before Interest And Taxes,"710,556","580,639","485,775"
Interest Expense,"11,239","6,210","5,932"
Income Before Tax,"699,317","574,429","479,843"
Income Tax Expense,"245,288","193,360","167,533"
Minority Interest,-,-,-
Net Income From Continuing Ops,"454,029","381,069","312,310"
Non-recurring Events
Discontinued Operations,-,"(3,777)",-
Extraordinary Items,-,-,-
Effect Of Accounting Changes,-,-,-
Other Items,-,-,-
Net Income,"454,029","377,292","312,310"
Preferred Stock And Other Adjustments,-,-,-
Net Income Applicable To Common Shares,"454,029","377,292","312,310"
This question already has answers here:
python beautifulsoup extracting text
(2 answers)
Closed 9 years ago.
I am only interested to use beautifulsoup to extract all the value of 3-hr PSI Readings from 12AM to 11.59PM. Such as the latest bold text of 82 at 5pm.
Example of website is at http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours. Can anyone teach me how ? Thanks in advance !
<!-- start content -->
<h1 class="title" id="top">
PSI Readings over the last 24 Hours</h1>
<script type="text/javascript">
var baseUrl = '/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours';
function changetime(ddl) {
var strTime = ddl.options[ddl.selectedIndex].value;
if (strTime != null) {
var npage = baseUrl + "/time/" + strTime + "#psi24";
window.location = npage;
}
}
</script>
<h1 id="psi24">
24-hr PSI Readings on 24 Jun 2013
</h1>
<p>
View reading for:
<select class="default" id="ContentPlaceHolderContent_C001_DDLTime" name="ctl00$ContentPlaceHolderContent$C001$DDLTime" onchange="changetime(this);">
<option value="0000">12AM</option>
<option value="0100">1AM</option>
<option value="0200">2AM</option>
<option value="0300">3AM</option>
<option value="0400">4AM</option>
<option value="0500">5AM</option>
<option value="0600">6AM</option>
<option value="0700">7AM</option>
<option value="0800">8AM</option>
<option value="0900">9AM</option>
<option value="1000">10AM</option>
<option value="1100">11AM</option>
<option value="1200">12PM</option>
<option value="1300">1PM</option>
<option value="1400">2PM</option>
<option value="1500">3PM</option>
<option value="1600">4PM</option>
<option selected="selected" value="1700">5PM</option>
</select>
</p>
<table border="0" cellpadding="4" cellspacing="1" class="text_psinormal" width="100%">
<thead>
<tr>
<th width="33%">
<center><strong>Region</strong></center>
</th>
<th width="33%">
<center><strong>PSI</strong></center>
</th>
<th width="34%">
<center><strong>24-hr PM2.5 Concentration (µg/m<sup>3</sup>)</strong></center>
</th>
</tr>
</thead>
<tr>
<td align="center">North
</td>
<td align="center">
61
</td>
<td align="center">
47
</td>
</tr>
<tr>
<td align="center">South
</td>
<td align="center">
62
</td>
<td align="center">
46
</td>
</tr>
<tr>
<td align="center">East
</td>
<td align="center">
55
</td>
<td align="center">
39
</td>
</tr>
<tr>
<td align="center">West
</td>
<td align="center">
87
</td>
<td align="center">
83
</td>
</tr>
<tr>
<td align="center">Central
</td>
<td align="center">
58
</td>
<td align="center">
40
</td>
</tr>
<tr>
<td align="center">Overall Singapore
</td>
<td align="center">
55-87
</td>
<td align="center">
39-83
</td>
</tr>
</table>
<div>
</div>
<div>
<h1>3-hr PSI Readings from 12AM to 11.59PM on
24 Jun 2013</h1>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr>
<td align="center" width="16%">
<strong>Time</strong>
</td>
<td align="center" width="7%"><strong>12AM</strong>
</td>
<td align="center" width="7%"><strong>1AM</strong>
</td>
<td align="center" width="7%"><strong>2AM</strong>
</td>
<td align="center" width="7%"><strong>3AM</strong>
</td>
<td align="center" width="7%"><strong>4AM</strong>
</td>
<td align="center" width="7%"><strong>5AM</strong>
</td>
<td align="center" width="7%"><strong>6AM</strong>
</td>
<td align="center" width="7%"><strong>7AM</strong>
</td>
<td align="center" width="7%"><strong>8AM</strong>
</td>
<td align="center" width="7%"><strong>9AM</strong>
</td>
<td align="center" width="7%"><strong>10AM</strong>
</td>
<td align="center" width="7%"><strong>11AM</strong>
</td>
</tr>
<tr>
<td align="center">
<strong>3-hr PSI</strong>
</td>
<td align="center">
76
</td>
<td align="center">
70
</td>
<td align="center">
64
</td>
<td align="center">
59
</td>
<td align="center">
54
</td>
<td align="center">
51
</td>
<td align="center">
48
</td>
<td align="center">
47
</td>
<td align="center">
47
</td>
<td align="center">
47
</td>
<td align="center">
49
</td>
<td align="center">
52
</td>
</tr>
<tr>
<td align="center" width="16%">
<strong>Time</strong>
</td>
<td align="center" width="7%"><strong>12PM</strong>
</td>
<td align="center" width="7%"><strong>1PM</strong>
</td>
<td align="center" width="7%"><strong>2PM</strong>
</td>
<td align="center" width="7%"><strong>3PM</strong>
</td>
<td align="center" width="7%"><strong>4PM</strong>
</td>
<td align="center" width="7%"><strong>5PM</strong>
</td>
<td align="center" width="7%"><strong>6PM</strong>
</td>
<td align="center" width="7%"><strong>7PM</strong>
</td>
<td align="center" width="7%"><strong>8PM</strong>
</td>
<td align="center" width="7%"><strong>9PM</strong>
</td>
<td align="center" width="7%"><strong>10PM</strong>
</td>
<td align="center" width="7%"><strong>11PM</strong>
</td>
</tr>
<tr>
<td align="center">
<strong>3-hr PSI</strong>
</td>
<td align="center">
54
</td>
<td align="center">
59
</td>
<td align="center">
65
</td>
<td align="center">
72
</td>
<td align="center">
79
</td>
<td align="center">
<strong style="font-size:14px;">82</strong>
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
<td align="center">
-
</td>
</tr>
</table>
</div>
<div class="sfContentBlock">
<p class="table-caption">Hourly updates of 3-hr PSI readings are provided from 12am to 11:59pm. The 3hr PSI readings are calculated based on PM10 concentrations only</p>
</div>
<div>
</div>
<div class="backToTop">
Back to Top
</div>
</div>
</div>
<!-- end content -->
Though you should have shown that you've tried to do it yourself, but here is the code:
from pprint import pprint
import urllib2
from bs4 import BeautifulSoup as soup
url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urllib2.urlopen(url))
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]
table_rows = []
for row in table.find_all('tr'):
table_rows.append([td.text.strip() for td in row.find_all('td')])
data = {}
for tr_index, tr in enumerate(table_rows):
if tr_index % 2 == 0:
for td_index, td in enumerate(tr):
data[td] = table_rows[tr_index + 1][td_index]
pprint(data)
prints:
{'10AM': '49',
'10PM': '-',
'11AM': '52',
'11PM': '-',
'12AM': '76',
'12PM': '54',
'1AM': '70',
'1PM': '59',
'2AM': '64',
'2PM': '65',
'3AM': '59',
'3PM': '72',
'4AM': '54',
'4PM': '79',
'5AM': '51',
'5PM': '82',
'6AM': '48',
'6PM': '79',
'7AM': '47',
'7PM': '-',
'8AM': '47',
'8PM': '-',
'9AM': '47',
'9PM': '-',
'Time': '3-hr PSI'}
Beautiful soup documentation provides attributes .contents and .children to access the children of a given tag (a list and an iterable respectively), and includes both Navigable Strings and Tags. I want only the children of type Tag.
I'm currently accomplishing this using list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better/more pythonic/built-in way to get just Tag children.
thanks to J.F.Sebastian , the following will work:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed actual rows in the table, so I ended up using the following, which is more precise and I think more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag.
Worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100">
<div style="position:relative;">
President
<span class="pct-rpt">
99% reporting
</span>
</div>
</th>
</tr>
<tr class="header-row">
<th class="photo first">
</th>
<th class="candidate ">
Candidate
</th>
<th class="party ">
Party
</th>
<th class="votes ">
Votes
</th>
<th class="pct ">
Pct.
</th>
<th class="change ">
Change from ‘08
</th>
<th class="evotes last">
Electoral Votes
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first">
<div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div>
</td>
<td class="candidate ">
<div class="winner dem"><img alt="Hp-checkmark#2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark#2x.png?1352320690" width="10" />Barack Obama</div>
</td>
<td class="party ">
Dem.
</td>
<td class="votes ">
2,916,811
</td>
<td class="pct ">
57.3%
</td>
<td class="change ">
-4.6%
</td>
<td class="evotes last">
20
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Mitt Romney</div>
</td>
<td class="party ">
Rep.
</td>
<td class="votes ">
2,090,116
</td>
<td class="pct ">
41.1%
</td>
<td class="change ">
+4.3%
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Gary Johnson</div>
</td>
<td class="party ">
Lib.
</td>
<td class="votes ">
54,798
</td>
<td class="pct ">
1.1%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="last-row">
<td class="photo first">
</td>
<td class="candidate ">
div class="not-winner">Jill Stein</div>
</td>
<td class="party ">
Green
</td>
<td class="votes ">
29,336
</td>
<td class="pct ">
0.6%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr>
<td class="footer" colspan="100">
President Map |
President Big Board |
Exit Polls
</td>
</tr>
</tbody>
</table>