How to parse this html structure using BeautifulSoup? - python
I would like to parse this table row by row and save it to a CSV file.
What I have done so far returns nothing in the CSV file:
Django:
data_scrapper makes a request to Yahoo Finance.
def button_clicked(request):
    headers = []
    rows = []
    gen_table = data_scrapper(symbol)
    soup = BeautifulSoup(gen_table)
    table = soup.find_all('table')
    for table in soup.find_all('table'):
        headers.extend([header.text for header in table.find_all('th')])
    for row in soup.find_all('tr'):
        rows.extend([val.text for val in row.find_all('td')])
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename= "{}.csv"'.format(symbol)
    writer = csv.writer(response)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
    return response
html:
<TABLE class="yfnc_tabledata1" width="100%" cellpadding="0" cellspacing="0" border="0">
<TR>
<TD>
<TABLE width="100%" cellpadding="2" cellspacing="0" border="0">
<TR class="yfnc_modtitle1" style="border-top:none;">
<td colspan="2" style="border-top:2px solid #000;">
<small>
<span class="yfi-module-title">Period Ending</span>
</small>
</td>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2014</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2013</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2012</th>
</TR>
<tr>
<td colspan="2">
<strong>
Total Revenue
</strong>
</td>
<td align="right">
<strong>
4,479,648
</strong>
</td>
<td align="right">
<strong>
3,777,068
</strong>
</td>
<td align="right">
<strong>
3,209,782
</strong>
</td>
</tr>
<tr>
<td colspan="2">Cost of Revenue</td>
<td align="right">3,160,470 </td>
<td align="right">2,656,189 </td>
<td align="right">2,284,485 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Gross Profit
</strong>
</td>
<td align="right">
<strong>
1,319,178
</strong>
</td>
<td align="right">
<strong>
1,120,879
</strong>
</td>
<td align="right">
<strong>
925,297
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Operating Expenses</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Research Development</td>
<td align="right">148,458 </td>
<td align="right">139,193 </td>
<td align="right">127,361 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Selling General and Administrative</td>
<td align="right">456,030 </td>
<td align="right">403,772 </td>
<td align="right">319,511 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Non Recurring</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Others</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Operating Expenses</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Operating Income or Loss
</strong>
</td>
<td align="right">
<strong>
714,690
</strong>
</td>
<td align="right">
<strong>
577,914
</strong>
</td>
<td align="right">
<strong>
478,425
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Income from Continuing Operations</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Other Income/Expenses Net</td>
<td align="right">(10)</td>
<td align="right">5,139 </td>
<td align="right">7,529 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Earnings Before Interest And Taxes</td>
<td align="right">710,556 </td>
<td align="right">580,639 </td>
<td align="right">485,775 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Interest Expense</td>
<td align="right">11,239 </td>
<td align="right">6,210 </td>
<td align="right">5,932 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Before Tax</td>
<td align="right">699,317 </td>
<td align="right">574,429 </td>
<td align="right">479,843 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Tax Expense</td>
<td align="right">245,288 </td>
<td align="right">193,360 </td>
<td align="right">167,533 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Minority Interest</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Net Income From Continuing Ops</td>
<td align="right">454,029 </td>
<td align="right">381,069 </td>
<td align="right">312,310 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Non-recurring Events</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Discontinued Operations</td>
<td align="right">
-
</td>
<td align="right">(3,777)</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Extraordinary Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Effect Of Accounting Changes</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Other Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
<tr>
<td colspan="2">Preferred Stock And Other Adjustments</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income Applicable To Common Shares
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
</TABLE>
</TD>
</TR>
</TABLE>
Here's some code that produces a CSV laid out like the table. The CSVs I usually work with have each row as a complete record, so all the values in column one would become the CSV header. Just something to think about; it might be helpful.
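Python 3.4
from bs4 import BeautifulSoup
import re
import csv

def button_clicked(request, filename):
    # `request` is the HTML text of the page; the data sits in the inner, nested table.
    soup = BeautifulSoup(request, 'html.parser')
    table = soup.find('table').find('table')
    t_rows = table.find_all('tr')
    # newline='' lets the csv module control line endings (avoids blank rows on Windows).
    with open(filename, 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for t_row in t_rows:
            rec_as_str = t_row.getText()
            rec_as_str = rec_as_str.strip()
            rec_as_str = rec_as_str.replace('\xa0', '')
            # Collapse each run of whitespace that contains a newline into one '|' separator.
            rec_as_str = re.sub(r'\n?\s*(\n)+\s*', '|', rec_as_str)
            # Spacer rows contain no text and are skipped.
            if len(rec_as_str) > 0:
                a_list = rec_as_str.split("|")
                spamwriter.writerow(a_list)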
Creates a file that looks like:
Period Ending,"Dec 31, 2014","Dec 31, 2013","Dec 31, 2012"
Total Revenue,"4,479,648","3,777,068","3,209,782"
Cost of Revenue,"3,160,470","2,656,189","2,284,485"
Gross Profit,"1,319,178","1,120,879","925,297"
Operating Expenses
Research Development,"148,458","139,193","127,361"
Selling General and Administrative,"456,030","403,772","319,511"
Non Recurring,-,-,-
Others,-,-,-
Total Operating Expenses,-,-,-
Operating Income or Loss,"714,690","577,914","478,425"
Income from Continuing Operations
Total Other Income/Expenses Net,(10),"5,139","7,529"
Earnings Before Interest And Taxes,"710,556","580,639","485,775"
Interest Expense,"11,239","6,210","5,932"
Income Before Tax,"699,317","574,429","479,843"
Income Tax Expense,"245,288","193,360","167,533"
Minority Interest,-,-,-
Net Income From Continuing Ops,"454,029","381,069","312,310"
Non-recurring Events
Discontinued Operations,-,"(3,777)",-
Extraordinary Items,-,-,-
Effect Of Accounting Changes,-,-,-
Other Items,-,-,-
Net Income,"454,029","377,292","312,310"
Preferred Stock And Other Adjustments,-,-,-
Net Income Applicable To Common Shares,"454,029","377,292","312,310"
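The remark above about a row being a complete record can be sketched as a transposition of the extracted rows. This is only an illustration under assumptions: the sample `rows` are copied from the output above, and `transpose_rows` and `to_csv_text` are hypothetical helper names, not part of the answer's code.

```python
import csv
import io

# Sample rows copied from the CSV output above (label first, then one value per period).
rows = [
    ["Period Ending", "Dec 31, 2014", "Dec 31, 2013", "Dec 31, 2012"],
    ["Total Revenue", "4,479,648", "3,777,068", "3,209,782"],
    ["Operating Expenses"],  # section heading without values
    ["Cost of Revenue", "3,160,470", "2,656,189", "2,284,485"],
]


def transpose_rows(rows):
    """Turn label-per-row data into one record per period (hypothetical helper)."""
    width = len(rows[0])
    # Keep only rows with a value for every period; headings are dropped.
    full = [r for r in rows if len(r) == width]
    return [list(col) for col in zip(*full)]


def to_csv_text(records):
    """Write the records with csv.writer into an in-memory buffer."""
    buf = io.StringIO()
    csv.writer(buf).writerows(records)
    return buf.getvalue()


records = transpose_rows(rows)
csv_text = to_csv_text(records)
print(csv_text)
```

Because `csv.writer` only needs an object with a `write` method, the same records could presumably be written to the `HttpResponse` used in the question instead of a buffer.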
Related
Extracting certain values from a row if a cell meets a certain condition
I have this HTML file which was obtained from a website that has financial data.
<table class="tableFile2" summary="Results">
  <tr>
    <td nowrap="nowrap"> 13F-HR </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-05-15 </td>
    <td nowrap="nowrap"> <a href="URL"> 028-10098 </a> <br/> 19827821 </td>
  </tr>
  <tr class="blueRow">
    <td nowrap="nowrap"> 13F-HR </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-02-14 </td>
    <td nowrap="nowrap"> <a href="URL"> 028-10098 </a> <br/> 19606811 </td>
  </tr>
  <tr>
    <td nowrap="nowrap"> SC 13G/A </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-02-13 </td>
    <td> </td>
  </tr>
  <tr class="blueRow">
    <td nowrap="nowrap"> SC 13G/A </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-02-13 </td>
    <td> </td>
  </tr>
  <tr>
    <td nowrap="nowrap"> SC 13G/A </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-02-13 </td>
    <td> </td>
  </tr>
</table>
I am trying to extract only the rows where one of the cells contains the word 13F. Once I have the correct rows, I want to save the date and the href into a list for later processing.
So far I have managed to build my scraper so that it locates the specific table, but I am having trouble filtering rows based on my criteria. When I add a conditional, it seems to be ignored and all rows are still included.
r = requests.get(url)
soup = BeautifulSoup(open("data/testHTML.html"), 'html.parser')
table = soup.find('table', {"class": "tableFile2"})
rows = table.findChildren("tr")
for row in rows:
    cell = row.findNext("td")
    if cell.text.find('13F'):
        print(row)
Ideally I am trying to get an output similar to this:
[13F-HR, URL, 2019-05-15],[13F-HR, URL, 2019-02-14]
Use a regular expression (re) to match the text of the cell.
from bs4 import BeautifulSoup
import re

data = '''<table class="tableFile2" summary="Results">
  <tr>
    <td nowrap="nowrap"> 13F-HR </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-05-15 </td>
    <td nowrap="nowrap"> <a href="URL"> 028-10098 </a> <br/> 19827821 </td>
  </tr>
  <tr class="blueRow">
    <td nowrap="nowrap"> 13F-HR </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-02-14 </td>
    <td nowrap="nowrap"> <a href="URL"> 028-10098 </a> <br/> 19606811 </td>
  </tr>
  <tr>
    <td nowrap="nowrap"> SC 13G/A </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-02-13 </td>
    <td> </td>
  </tr>
  <tr class="blueRow">
    <td nowrap="nowrap"> SC 13G/A </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-02-13 </td>
    <td> </td>
  </tr>
  <tr>
    <td nowrap="nowrap"> SC 13G/A </td>
    <td nowrap="nowrap"> <a href="URL" id="documentsbutton"> Documents </a> </td>
    <td> 2019-02-13 </td>
    <td> </td>
  </tr>
</table>'''

soup = BeautifulSoup(data, 'html.parser')
table = soup.find('table', {"class": "tableFile2"})
rows = table.find_all('tr')
final_items = []
for row in rows:
    items = []
    cell = row.find('td', text=re.compile('13F'))
    if cell:
        items.append(cell.text.strip())
        items.append(cell.find_next('a')['href'])
        items.append(cell.find_next('a').find_next('td').text.strip())
        final_items.append(items)
print(final_items)
Output:
[['13F-HR', 'URL', '2019-05-15'], ['13F-HR', 'URL', '2019-02-14']]
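A likely reason the conditional in the question appeared to be ignored is that `str.find` returns an index (or -1 when nothing matches) rather than a boolean, and -1 is truthy. The short sketch below uses made-up cell strings modelled on the table to show the difference between testing `find()` and using the `in` operator.

```python
# Cell texts modelled on the table; the last one has no leading space.
cells = [" 13F-HR ", " SC 13G/A ", "13F-HR"]

for text in cells:
    # str.find returns the index of the match, or -1 when there is no match.
    print(repr(text), text.find("13F"), bool(text.find("13F")), "13F" in text)

# Testing find() directly: index 1 and -1 are both truthy, only index 0 is falsy.
truthy_by_find = [t for t in cells if t.find("13F")]

# The membership test expresses the intended condition.
matched_by_in = [t for t in cells if "13F" in t]

print(truthy_by_find)
print(matched_by_in)
```

In the question's data every first cell starts with a space, so `find` returns either 1 or -1 and every row passes the test.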
Optimized solution:
...
for tr in soup.select('table.tableFile2 tr'):
    tds = tr.findChildren('td')
    if '13F' in tds[0].text:
        print([td.text.strip() for td in tds[:3]])
The output:
['13F-HR', 'Documents', '2019-05-15']
['13F-HR', 'Documents', '2019-02-14']
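The selector-based loop above prints the link text ('Documents') rather than the href the question asked for. A minimal variation, assuming the same column order as the sample table and a trimmed two-row copy of the HTML, might read the `href` attribute of the link in the second cell. The function name `filing_rows` is hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed sample modelled on the table in the question.
html = """<table class="tableFile2" summary="Results">
<tr><td> 13F-HR </td><td> <a href="URL" id="documentsbutton"> Documents </a> </td><td> 2019-05-15 </td></tr>
<tr><td> SC 13G/A </td><td> <a href="URL" id="documentsbutton"> Documents </a> </td><td> 2019-02-13 </td></tr>
</table>"""


def filing_rows(html, needle="13F"):
    """Return [form type, href, date] for rows whose first cell contains `needle`."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tr in soup.select("table.tableFile2 tr"):
        tds = tr.find_all("td")
        # Skip rows that are too short or whose first cell does not match.
        if len(tds) < 3 or needle not in tds[0].get_text():
            continue
        link = tds[1].find("a")
        results.append([
            tds[0].get_text(strip=True),
            link["href"] if link else None,
            tds[2].get_text(strip=True),
        ])
    return results


result = filing_rows(html)
print(result)
```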
Remove text outside of tags with bs4
I want to delete the text Página consultada el: but I don't know how because it's outside any tag. I've tried with this but nothing changes: for b in soup.find('br'): if( b.nextSibling == 'Página consultada el:'): b.nextSibling.replaceWith('') if(b.previousSibling == 'Página consultada el:'): b.previousSibling.replaceWith('') This is the html of the part I want to remove: <br/> <br/>Página consultada el: <br/> <strong>27/01/2018 21:42:14</strong> Whole html: <html xmlns="http://www.w3.org/1999/xhtml"> <body><strong></strong> <center><strong></strong> <br/><br/><br/><br/> <center> </center> <table border="1" cellpadding="0" cellspacing="0" style="width:400px"> <tbody> <tr> <td align="CENTER"> <p>Turno: Matutino</p> </td> <td align="CENTER"> Grupo: 401 </td> </tr> <tr> <td align="CENTER" colspan="2"> <p>Profesor tutor: <br/> MONICA OSORNIO PEREZ.</p> </td> </tr> </tbody> </table> <br/><br/> <table border="1" cellpadding="0" cellspacing="0" style="width:1000px"> <tbody> <tr> <td align="CENTER" style="width:70px;"> <p>Hora:</p> </td> <td align="CENTER" style="width:186px;">Lunes </td> <td align="CENTER" style="width:186px;">Martes </td> <td align="CENTER" style="width:186px;">Miércoles </td> <td align="CENTER" style="width:186px;">Jueves </td> <td align="CENTER" style="width:186px;">Viernes </td> </tr> <tr> <td align="CENTER"> <p>7:00<br/>a<br/>7:50</p> </td> <td align="CENTER"> <p> ORI.EDU.IV(A): A204<br/></p> </td> <td align="CENTER"> <p> MATEMAT. IV B108<br/></p> </td> <td align="CENTER"> <p> LENG. ESP. B108<br/></p> </td> <td align="CENTER"> <p> MATEMAT. IV B108<br/></p> </td> <td align="CENTER"> <p> MATEMAT. IV B108<br/></p> </td> </tr> <tr> <td align="CENTER"> <p>7:50<br/>a<br/>8:40</p> </td> <td align="CENTER"> <p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p> </td> <td align="CENTER"> <p> MATEMAT. IV B108<br/></p> </td> <td align="CENTER"> <p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p> </td> <td align="CENTER"> <p> MATEMAT. 
IV B108<br/></p> </td> <td align="CENTER"> <p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p> </td> </tr> <tr> <td align="CENTER"> <p>8:40<br/>a<br/>9:30</p> </td> <td align="CENTER"> <p> LENG. ESP. B108<br/></p> </td> <td align="CENTER"> <p> INFORMATICA CC2 <br/></p> </td> <td align="CENTER"> <p> HISTORIA III B116<br/></p> </td> <td align="CENTER"> <p> ORI.EDU.IV(B): A205<br/></p> </td> <td align="CENTER"> <p> DIBUJO II(A): B-8 <br/>DIBUJO II(B): C101<br/></p> </td> </tr> <tr> <td align="CENTER"> <p>9:30<br/>a<br/>10:20</p> </td> <td align="CENTER"> <p> LENG. ESP. B108<br/></p> </td> <td align="CENTER"> <p> GEOGRAFIA A102<br/></p> </td> <td align="CENTER"> <p> FISICA III A303<br/></p> </td> <td align="CENTER"> <p> GEOGRAFIA A102<br/></p> </td> <td align="CENTER"> <p> DIBUJO II(A): B-8 <br/>DIBUJO II(B): C101<br/></p> </td> </tr> <tr> <td align="CENTER"> <p>10:20<br/>a<br/>11:10</p> </td> <td align="CENTER"> <p> HISTORIA III B108<br/></p> </td> <td align="CENTER"> <p> INFORMATICA B108<br/></p> </td> <td align="CENTER"> <p> FISICA III A303<br/></p> </td> <td align="CENTER"> <p> FISICA III LACE<br/></p> </td> <td align="CENTER"> <p> </p> </td> </tr> <tr> <td align="CENTER"> <p>11:10<br/>a<br/>12:00</p> </td> <td align="CENTER"> <p> LOGICA B108<br/></p> </td> <td align="CENTER"> <p> LENG. ESP. B108<br/></p> </td> <td align="CENTER"> <p> GEOGRAFIA A103<br/></p> </td> <td align="CENTER"> <p> FISICA III LACE<br/></p> </td> <td align="CENTER"> <p> LOGICA B108<br/></p> </td> </tr> <tr> <td align="CENTER"> <p>12:00<br/>a<br/>12:50</p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> LENG. ESP. 
B108<br/></p> </td> <td align="CENTER"> <p> LOGICA B108<br/></p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> HISTORIA III B108<br/></p> </td> </tr> <tr> <td align="CENTER"> <p>12:50<br/>a<br/>13:40</p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> </tr> <tr> <td align="CENTER"> <p>13:40<br/>a<br/>14:30</p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> ED FISICA IV GIM <br/></p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> </tr> <tr> <td align="CENTER"> <p>14:30<br/>a<br/>15:20</p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> <td align="CENTER"> <p> </p> </td> </tr> </tbody> </table><br/> <table border="1" cellpadding="0" cellspacing="0" style="width:1000px"> <tbody> <tr> <td style="width:165px;"> <p>Asignatura:</p> </td> <td style="width:335px;">Nombre del Profesor:</td> <td style="width:165px;">Asignatura:</td> <td style="width:335px;">Nombre del Profesor:</td> </tr> <tr> <td> <p>ORI.EDU.IV(A):</p> </td> <td>BECERRA ALCANTARA IVONNE </td> <td> <p>INGLES IV(B):</p> </td> <td>CARRILLO SANCHEZ JACOBO </td> </tr> <tr> <td> <p>LENG. ESP.</p> </td> <td>ESTRADA GASCA SCARLETT </td> <td> <p>FISICA III</p> </td> <td>FLORES FLORES ANA </td> </tr> <tr> <td> <p>HISTORIA III</p> </td> <td>GONZALEZ GARCIA ANGELICA ARACELI </td> <td> <p>DIBUJO II(A):</p> </td> <td>JIMENEZ GENCHI ERIKA PAOLA </td> </tr> <tr> <td> <p>LOGICA</p> </td> <td>NAVARRO LOZANO JULIANA V. </td> <td> <p>MATEMAT. 
IV</p> </td> <td>OLVERA PE¥A ALEJANDRO </td> </tr> <tr> <td> <p>GEOGRAFIA</p> </td> <td>OSORNIO PEREZ MONICA </td> <td> <p>ORI.EDU.IV(B):</p> </td> <td>PINEDA VALLEJO MARIA GABRIELA </td> </tr> <tr> <td> <p>INGLES IV(A):</p> </td> <td>REYES CRUZ KIMBERLY </td> <td> <p>ED FISICA IV</p> </td> <td>SANCHEZ LUGO EDGARDO JAIME </td> </tr> <tr> <td> <p>INFORMATICA</p> </td> <td>SOTOMAYOR GUERRA JUAN CARLOS </td> <td> <p>DIBUJO II(B):</p> </td> <td>VILLANUEVA VILCHIS MONICA EDITH </td> </tr> <tr> <td> <p></p> </td> <td></td> <td> <p></p> </td> <td></td> </tr> </tbody> </table> <br/><br/>Página consultada el:<br/><strong>27/01/2018 21:42:14</strong> </center> </body> </html>
This might accomplish what you need:
html = re.sub(r'</table>\n<br/><br/>.+<br/>', '</table>\n<br/><br/><br/>', html)
That removes the text "Página consultada el:" from html.
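As an alternative to editing the raw HTML with a regular expression, the stray text can be treated as a node in the parsed tree. The sketch below is one possible approach, not the answerer's method: it uses a shortened copy of the HTML and the `string=` argument of `find_all` (older bs4 releases spell this argument `text=`). The loop in the question probably did nothing because `soup.find('br')` returns a single tag, and iterating over it walks that tag's children, of which there are none.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the tail of the page from the question.
html = (
    "<center><table></table> <br/><br/>Página consultada el:<br/>"
    "<strong>27/01/2018 21:42:14</strong> </center>"
)

soup = BeautifulSoup(html, "html.parser")

# Text outside any inline tag is still a node in the tree (a NavigableString),
# so it can be located by its content and removed.
for node in soup.find_all(string=re.compile(r"Página consultada el:")):
    node.extract()

cleaned = str(soup)
print(cleaned)
```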
How to extract using beautifulsoup python [duplicate]
This question already has answers here: python beautifulsoup extracting text (2 answers) Closed 9 years ago. I am only interested to use beautifulsoup to extract all the value of 3-hr PSI Readings from 12AM to 11.59PM. Such as the latest bold text of 82 at 5pm. Example of website is at http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours. Can anyone teach me how ? Thanks in advance ! <!-- start content --> <h1 class="title" id="top"> PSI Readings over the last 24 Hours</h1> <script type="text/javascript"> var baseUrl = '/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours'; function changetime(ddl) { var strTime = ddl.options[ddl.selectedIndex].value; if (strTime != null) { var npage = baseUrl + "/time/" + strTime + "#psi24"; window.location = npage; } } </script> <h1 id="psi24"> 24-hr PSI Readings on 24 Jun 2013 </h1> <p> View reading for: <select class="default" id="ContentPlaceHolderContent_C001_DDLTime" name="ctl00$ContentPlaceHolderContent$C001$DDLTime" onchange="changetime(this);"> <option value="0000">12AM</option> <option value="0100">1AM</option> <option value="0200">2AM</option> <option value="0300">3AM</option> <option value="0400">4AM</option> <option value="0500">5AM</option> <option value="0600">6AM</option> <option value="0700">7AM</option> <option value="0800">8AM</option> <option value="0900">9AM</option> <option value="1000">10AM</option> <option value="1100">11AM</option> <option value="1200">12PM</option> <option value="1300">1PM</option> <option value="1400">2PM</option> <option value="1500">3PM</option> <option value="1600">4PM</option> <option selected="selected" value="1700">5PM</option> </select> </p> <table border="0" cellpadding="4" cellspacing="1" class="text_psinormal" width="100%"> <thead> <tr> <th width="33%"> <center><strong>Region</strong></center> </th> <th width="33%"> <center><strong>PSI</strong></center> </th> <th width="34%"> 
<center><strong>24-hr PM2.5 Concentration (µg/m<sup>3</sup>)</strong></center> </th> </tr> </thead> <tr> <td align="center">North </td> <td align="center"> 61 </td> <td align="center"> 47 </td> </tr> <tr> <td align="center">South </td> <td align="center"> 62 </td> <td align="center"> 46 </td> </tr> <tr> <td align="center">East </td> <td align="center"> 55 </td> <td align="center"> 39 </td> </tr> <tr> <td align="center">West </td> <td align="center"> 87 </td> <td align="center"> 83 </td> </tr> <tr> <td align="center">Central </td> <td align="center"> 58 </td> <td align="center"> 40 </td> </tr> <tr> <td align="center">Overall Singapore </td> <td align="center"> 55-87 </td> <td align="center"> 39-83 </td> </tr> </table> <div> </div> <div> <h1>3-hr PSI Readings from 12AM to 11.59PM on 24 Jun 2013</h1> <table border="0" cellpadding="4" cellspacing="1" width="100%"> <tr> <td align="center" width="16%"> <strong>Time</strong> </td> <td align="center" width="7%"><strong>12AM</strong> </td> <td align="center" width="7%"><strong>1AM</strong> </td> <td align="center" width="7%"><strong>2AM</strong> </td> <td align="center" width="7%"><strong>3AM</strong> </td> <td align="center" width="7%"><strong>4AM</strong> </td> <td align="center" width="7%"><strong>5AM</strong> </td> <td align="center" width="7%"><strong>6AM</strong> </td> <td align="center" width="7%"><strong>7AM</strong> </td> <td align="center" width="7%"><strong>8AM</strong> </td> <td align="center" width="7%"><strong>9AM</strong> </td> <td align="center" width="7%"><strong>10AM</strong> </td> <td align="center" width="7%"><strong>11AM</strong> </td> </tr> <tr> <td align="center"> <strong>3-hr PSI</strong> </td> <td align="center"> 76 </td> <td align="center"> 70 </td> <td align="center"> 64 </td> <td align="center"> 59 </td> <td align="center"> 54 </td> <td align="center"> 51 </td> <td align="center"> 48 </td> <td align="center"> 47 </td> <td align="center"> 47 </td> <td align="center"> 47 </td> <td align="center"> 
49 </td> <td align="center"> 52 </td> </tr> <tr> <td align="center" width="16%"> <strong>Time</strong> </td> <td align="center" width="7%"><strong>12PM</strong> </td> <td align="center" width="7%"><strong>1PM</strong> </td> <td align="center" width="7%"><strong>2PM</strong> </td> <td align="center" width="7%"><strong>3PM</strong> </td> <td align="center" width="7%"><strong>4PM</strong> </td> <td align="center" width="7%"><strong>5PM</strong> </td> <td align="center" width="7%"><strong>6PM</strong> </td> <td align="center" width="7%"><strong>7PM</strong> </td> <td align="center" width="7%"><strong>8PM</strong> </td> <td align="center" width="7%"><strong>9PM</strong> </td> <td align="center" width="7%"><strong>10PM</strong> </td> <td align="center" width="7%"><strong>11PM</strong> </td> </tr> <tr> <td align="center"> <strong>3-hr PSI</strong> </td> <td align="center"> 54 </td> <td align="center"> 59 </td> <td align="center"> 65 </td> <td align="center"> 72 </td> <td align="center"> 79 </td> <td align="center"> <strong style="font-size:14px;">82</strong> </td> <td align="center"> - </td> <td align="center"> - </td> <td align="center"> - </td> <td align="center"> - </td> <td align="center"> - </td> <td align="center"> - </td> </tr> </table> </div> <div class="sfContentBlock"> <p class="table-caption">Hourly updates of 3-hr PSI readings are provided from 12am to 11:59pm. The 3hr PSI readings are calculated based on PM10 concentrations only</p> </div> <div> </div> <div class="backToTop"> Back to Top </div> </div> </div> <!-- end content -->
You should have shown what you have tried yourself, but here is the code:
from pprint import pprint
import urllib2
from bs4 import BeautifulSoup as soup

url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urllib2.urlopen(url))
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]
table_rows = []
for row in table.find_all('tr'):
    table_rows.append([td.text.strip() for td in row.find_all('td')])
data = {}
for tr_index, tr in enumerate(table_rows):
    if tr_index % 2 == 0:
        for td_index, td in enumerate(tr):
            data[td] = table_rows[tr_index + 1][td_index]
pprint(data)
prints:
{'10AM': '49',
 '10PM': '-',
 '11AM': '52',
 '11PM': '-',
 '12AM': '76',
 '12PM': '54',
 '1AM': '70',
 '1PM': '59',
 '2AM': '64',
 '2PM': '65',
 '3AM': '59',
 '3PM': '72',
 '4AM': '54',
 '4PM': '79',
 '5AM': '51',
 '5PM': '82',
 '6AM': '48',
 '6PM': '79',
 '7AM': '47',
 '7PM': '-',
 '8AM': '47',
 '8PM': '-',
 '9AM': '47',
 '9PM': '-',
 'Time': '3-hr PSI'}
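The pairing step in the answer (each row of time labels is followed by the row of readings for those times) can also be sketched with `zip`, which additionally makes it easy to leave out the 'Time': '3-hr PSI' entry. The `table_rows` sample below is a shortened, hand-typed stand-in for the scraped rows, and `pair_readings` is a hypothetical name.

```python
# Shortened stand-in for the rows scraped from the 3-hr PSI table.
table_rows = [
    ["Time", "12AM", "1AM", "2AM"],
    ["3-hr PSI", "76", "70", "64"],
    ["Time", "12PM", "1PM", "2PM"],
    ["3-hr PSI", "54", "59", "65"],
]


def pair_readings(table_rows):
    """Map each time label to the reading in the row directly below it."""
    readings = {}
    # Even-indexed rows hold the labels, odd-indexed rows hold the values.
    for labels, values in zip(table_rows[0::2], table_rows[1::2]):
        # Skip the first cell of each row ("Time" / "3-hr PSI").
        readings.update(zip(labels[1:], values[1:]))
    return readings


readings = pair_readings(table_rows)
print(readings)
```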
Can't parse a second table with beautifulsoup even if the first one works?
I am trying to parse tables using beautifulsoup. The first one on my page was easy but I cannot parse a similar table on the same page. I do not understand why. Here is the code. Thanks in advance for your help. import urllib2 from bs4 import BeautifulSoup url = urllib2.urlopen("https://dl.dropboxusercontent.com/u/956261/poftext.html") contentHTML = url.read() soup = BeautifulSoup(contentHTML) tableUserDetails = soup.find("table", {"class" : "user-details"}) i = 0 tableUserDetailsList = [] for row in tableUserDetails.findAll('tr'): for col in row.findAll('td'): contentTd = col.contents[0].string.strip() if contentTd: print "TD Number %d : %s" % (i, contentTd) tableUserDetailsList.append(contentTd) i += 1 # This first table is OK print tableUserDetailsList # But now this one tableUserDetails = soup.find("table", {"class" : "secondpart"}) i = 0 tableUserDetailsList = [] for row in tableUserDetails.findAll('tr'): for col in row.findAll('td'): contentTd = col.contents[0].string.strip() if contentTd: print "TD Number %d : %s" % (i, contentTd) tableUserDetailsList.append(contentTd) i += 1 print tableUserDetailsList # The list is empty :( Here is a simplified version of the HTML code that I am trying to parse: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title> French.Kiss Sorties, Sport, Voyages, Nouvelles Expériences</title> </head> <body style='background-color: #fff;' leftMargin='0' topMargin='0' marginwidth='0' marginheight='0' link='#1E55D6' vlink='#1E55D6' TEXT='#6551b0'> <table class="user-details"> <tr> <td class="headline txtBlue size15" style="width:80px"> About </td> <td style="width:10px"> </td> <td class="txtGrey size15"> Fume occasionnellement with Silhouette mince </td> <td width="25px;"> </td> <td class="headline txtBlue size15"> City </td> <td class="txtGrey size15"> Paris Ile-de-France </td> </tr> <tr> <td class="headline 
txtBlue size15"> Details </td> <td style="width:10px"> </td> <td class="txtGrey size15"> 26 year old Un homme, 185cm, Sans religion </td> <td> </td> <td class="headline txtBlue size15"> Ethnicity </td> <td class="txtGrey size15"> Caucasienne Balance with Châtains </td> </tr> <tr> <td class="headline txtBlue size15"> Intent </td> <td style="width:10px"> </td> <td class="txtGrey size15"> French.Kiss Cherche une relation amoureuse. </td> <td> </td> <td class="headline txtBlue size15" style="width:90px"> Education </td> <td class="txtGrey size15"> Diplôme universitaire/Licence </td> </tr> <tr> <td class="headline txtBlue size15"> Personnalité </td> <td style="width:10px"> </td> <td class="txtGrey size15"> </td> <td> </td> <td> <span class="headline txtBlue size15">Profession </span> </td> <td> <span class="txtGrey size15"> Visioconférence</span> </td> </tr> </table> <table width="85%" class="secondpart"> <tr height="25px"> <td width="200px"> <span class="headline txtBlue size14">I am Seeking a</span> </td> <td width="300px"> <span class="txtGrey size14"> Une femme</span> </td> <td width="25px"> </td> <td width="200px"> <span class="headline txtBlue size14">For</span> </td> <td width="200px"> <span class="txtGrey size14"> Sorties</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14"><a href='needs_test.aspx'>Needs Test</a></span> </td> <td> <span class="txtGrey size14"><a href='needs_test.aspx'> <a href="needs_view.aspx?id=38028200">View his relationship needs</a></a></span> </td> <td> </td> <td> <span class="headline txtBlue size14"><a href='poftest.aspx'>Chemistry</a></span> </td> <td> <span class="txtGrey size14"><a href='poftest.aspx'> <a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">View his chemistry results</a></a></span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Do you drink?</span> </td> <td> <span class="txtGrey size14"> Occasionnellement</span> </td> <td> </td> <td> <span 
class="headline txtBlue size14">Do you want children?</span> </td> <td> <span class="txtGrey size14"> Non divulgué</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Marital Status</span> </td> <td> <span class="txtGrey size14"> Célibataire</span> </td> <td> </td> <td> <span class="headline txtBlue size14">Do you do drugs?</span> </td> <td> <span class="txtGrey size14"> Non</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Pets </span> </td> <td> <span class="txtGrey size14"> Aucun</span> </td> <td> </td> <td> <span class="headline txtBlue size14">Eye Color</span> </td> <td> <span class="txtGrey size14"> Bruns</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Do you have a car? </span> </td> <td> <span class="txtGrey size14"> N/A</span> </td> <td> </td> <td> <span class="headline txtBlue size14">Do you have children?</span> </td> <td> <span class="txtGrey size14"> Non</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Longest Relationship</span> </td> <td> <span class="txtGrey size14"> Plus de 2 ans</span> </td> <td> </td> <td> </td> <td> </td> </tr> </table> </body> </html> tableUserDetails.content, tableUserDetails and tableUserDetailsList for both tables: * FIRST TABLE * print tableUserDetails.content = none print tableUserDetails = <table class="user-details"> <tr> <td class="headline txtBlue size15" style="width:80px"> About </td> <td style="width:10px"> </td> <td class="txtGrey size15"> Fume occasionnellement with Silhouette mince </td> <td width="25px;"> </td> <td class="headline txtBlue size15"> City </td> <td class="txtGrey size15"> Paris Ile-de-France </td> </tr> <tr> <td class="headline txtBlue size15"> Details </td> <td style="width:10px"> </td> <td class="txtGrey size15"> 26 year old Un homme, 185cm, Sans religion </td> <td> </td> <td class="headline txtBlue size15"> Ethnicity </td> <td class="txtGrey size15"> Caucasienne Balance 
with Châtains </td> </tr> <tr> <td class="headline txtBlue size15"> Intent </td> <td style="width:10px"> </td> <td class="txtGrey size15"> French.Kiss Cherche une relation amoureuse. </td> <td> </td> <td class="headline txtBlue size15" style="width:90px"> Education </td> <td class="txtGrey size15"> Diplôme universitaire/Licence </td> </tr> <tr> <td class="headline txtBlue size15"> Personnalité </td> <td style="width:10px"> </td> <td class="txtGrey size15"> </td> <td> </td> <td> <span class="headline txtBlue size15">Profession </span> </td> <td> <span class="txtGrey size15"> Visioconférence</span> </td> </tr> </table>
print tableUserDetailsList = [u'About', u'Fume occasionnellement with Silhouette mince', u'City', u'Paris Ile-de-France', u'Details', u'26 year old Un homme, 185cm, Sans religion', u'Ethnicity', u'Caucasienne Balance with Ch\xe2tains', u'Intent', u'French.Kiss Cherche une relation amoureuse.', u'Education', u'Dipl\xf4me universitaire/Licence', u'Personnalit\xe9']
* SECOND TABLE *
print tableUserDetails.content = none
print tableUserDetails = <table width="85%" class="secondpart"> <tr height="25px"> <td width="200px"> <span class="headline txtBlue size14">I am Seeking a</span> </td> <td width="300px"> <span class="txtGrey size14"> Une femme</span> </td> <td width="25px"> </td> <td width="200px"> <span class="headline txtBlue size14">For</span> </td> <td width="200px"> <span class="txtGrey size14"> Sorties</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14"><a href='needs_test.aspx'>Needs Test</a></span> </td> <td> <span class="txtGrey size14"><a href='needs_test.aspx'> <a href="needs_view.aspx?id=38028200">View his relationship needs</a></a></span> </td> <td> </td> <td> <span class="headline txtBlue size14"><a href='poftest.aspx'>Chemistry</a></span> </td> <td> <span class="txtGrey size14"><a href='poftest.aspx'> <a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">View his chemistry results</a></a></span> 
</td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Do you drink?</span> </td> <td> <span class="txtGrey size14"> Occasionnellement</span> </td> <td> </td> <td> <span class="headline txtBlue size14">Do you want children?</span> </td> <td> <span class="txtGrey size14"> Non divulgué</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Marital Status</span> </td> <td> <span class="txtGrey size14"> Célibataire</span> </td> <td> </td> <td> <span class="headline txtBlue size14">Do you do drugs?</span> </td> <td> <span class="txtGrey size14"> Non</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Pets </span> </td> <td> <span class="txtGrey size14"> Aucun</span> </td> <td> </td> <td> <span class="headline txtBlue size14">Eye Color</span> </td> <td> <span class="txtGrey size14"> Bruns</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Do you have a car? </span> </td> <td> <span class="txtGrey size14"> N/A</span> </td> <td> </td> <td> <span class="headline txtBlue size14">Do you have children?</span> </td> <td> <span class="txtGrey size14"> Non</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtBlue size14">Longest Relationship</span> </td> <td> <span class="txtGrey size14"> Plus de 2 ans</span> </td> <td> </td> <td> </td> <td> </td> </tr> </table>
print tableUserDetailsList = []
This works:
tableUserDetailsList = []
i = 0  # running cell counter
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        # stripped_strings skips whitespace-only strings and strips the rest
        contents = list(col.stripped_strings)
        if contents:
            contentTd = contents[0]
            print("TD Number %d : %s" % (i, contentTd))
            tableUserDetailsList.append(contentTd)
        i += 1
The problem was that your second table contains spans. The line break before each span was also interpreted as content and returned as the first item of the col.contents list, so the text inside the span was never reached. The code above also works for the first table. As Anubhav commented, you should really consider iterating over the tables rather than having the same code twice.
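To see the difference between col.contents and col.stripped_strings in isolation, here is a minimal sketch. The two-cell HTML is a made-up reduction of the tables in the question, and html.parser is assumed as the parser; neither comes from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical, reduced sample: one plain cell and one cell whose text sits in a <span>.
html = """
<table>
<tr>
<td class="headline"> About </td>
<td> <span class="headline">Profession </span> </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
plain_td, span_td = soup.find_all("td")

# .contents returns every child node, including whitespace-only strings,
# so the first child of the span cell is just the whitespace before <span>.
first_child_plain = plain_td.contents[0]
first_child_span = span_td.contents[0]

# .stripped_strings walks all descendants, drops whitespace-only strings
# and strips the remaining ones.
plain_text = list(plain_td.stripped_strings)
span_text = list(span_td.stripped_strings)

print(repr(first_child_plain), repr(first_child_span))
print(plain_text, span_text)
```

Because stripped_strings descends into nested tags, the same loop handles cells with and without a wrapping span.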
Instead of using
table = soup.find('table')
use
table = soup.find_all('table')
This returns a list of all the tables in your HTML, and you can then pick the correct one from the list.
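As a rough illustration of picking one table out of several, the sketch below uses a shortened, invented two-table document that reuses the class names user-details and secondpart from the question. html.parser and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page in the question.
html = """
<table class="user-details"><tr><td>About</td></tr></table>
<table class="secondpart"><tr><td>Pets</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

first_table = soup.find("table")   # only the first match
tables = soup.find_all("table")    # every match, as a list-like ResultSet

# Pick a table by position, or ask for it directly by CSS class.
second = tables[1]
by_class = soup.find("table", class_="secondpart")

# Looping over the list avoids repeating the extraction code per table.
cells_per_table = [
    [td.get_text(strip=True) for td in table.find_all("td")]
    for table in tables
]
print(cells_per_table)
```

Selecting by class tends to be more robust than indexing when the page layout can change.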
Beautiful Soup: get children that are Tags (not NavigableStrings) from a Tag
The Beautiful Soup documentation provides the attributes .contents and .children to access the children of a given tag (a list and an iterable, respectively). Both include NavigableStrings as well as Tags. I want only the children of type Tag. I'm currently accomplishing this with a list comprehension:
rows = [x for x in table.tbody.children if type(x) == bs4.element.Tag]
but I'm wondering whether there is a better, more Pythonic, or built-in way to get just the Tag children.
Thanks to J.F. Sebastian, the following works:
rows = table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case I needed the actual rows of the table, so I ended up using the following, which is more precise and, I think, more readable:
rows = table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag. It worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100"> <div style="position:relative;"> President <span class="pct-rpt"> 99% reporting </span> </div> </th>
</tr>
<tr class="header-row">
<th class="photo first"> </th>
<th class="candidate "> Candidate </th>
<th class="party "> Party </th>
<th class="votes "> Votes </th>
<th class="pct "> Pct. </th>
<th class="change "> Change from ‘08 </th>
<th class="evotes last"> Electoral Votes </th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first"> <div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div> </td>
<td class="candidate "> <div class="winner dem"><img alt="Hp-checkmark#2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark#2x.png?1352320690" width="10" />Barack Obama</div> </td>
<td class="party "> Dem. </td>
<td class="votes "> 2,916,811 </td>
<td class="pct "> 57.3% </td>
<td class="change "> -4.6% </td>
<td class="evotes last"> 20 </td>
</tr>
<tr class="">
<td class="photo first"> </td>
<td class="candidate "> <div class="not-winner">Mitt Romney</div> </td>
<td class="party "> Rep. </td>
<td class="votes "> 2,090,116 </td>
<td class="pct "> 41.1% </td>
<td class="change "> +4.3% </td>
<td class="evotes last"> 0 </td>
</tr>
<tr class="">
<td class="photo first"> </td>
<td class="candidate "> <div class="not-winner">Gary Johnson</div> </td>
<td class="party "> Lib. </td>
<td class="votes "> 54,798 </td>
<td class="pct "> 1.1% </td>
<td class="change "> – </td>
<td class="evotes last"> 0 </td>
</tr>
<tr class="last-row">
<td class="photo first"> </td>
<td class="candidate "> <div class="not-winner">Jill Stein</div> </td>
<td class="party "> Green </td>
<td class="votes "> 29,336 </td>
<td class="pct "> 0.6% </td>
<td class="change "> – </td>
<td class="evotes last"> 0 </td>
</tr>
<tr>
<td class="footer" colspan="100"> President Map | President Big Board | Exit Polls </td>
</tr>
</tbody>
</table>
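The approaches discussed here can be compared side by side in a small sketch. The two-row table is a made-up reduction of the input above, html.parser is an assumed parser choice, and isinstance is swapped in for the type(x) == ... check from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, reduced version of the election table.
html = """
<table>
<tbody>
<tr><td>Barack Obama</td><td>Dem.</td></tr>
<tr><td>Mitt Romney</td><td>Rep.</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.table.tbody

# .children yields the newline strings between rows as well as the <tr> tags.
all_children = list(tbody.children)

# Manual filter from the question, written with isinstance.
filtered = [child for child in tbody.children if isinstance(child, Tag)]

# True matches any tag; recursive=False limits the search to direct children.
direct_tags = tbody.find_all(True, recursive=False)

# Asking for the rows by name; this search is recursive by default.
rows = tbody.find_all("tr")

print(len(all_children), len(filtered), len(direct_tags), len(rows))
```

Note that find_all('tr') also descends into nested tables, so passing recursive=False may be worth adding when a cell can itself contain a table.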