Page Number and Total Pages in Header When Printing HTML to PDF - python
Background: I have a large HTML file that contains 8 different pages. Some of the pages can be taller than the 11in container height and the 11in size stipulated in the @page CSS rule, because some of the tables hold a lot of data.
What am I trying to do?: I am trying to send context data (written in Django / Python) into each table, and each table is of unknown length. Once the data has been entered, I will use weasyprint to create the PDF. The page number and the total number of pages should be added dynamically at the top of every page.
The issue: When I print to PDF, the pages with a lot of rows (the ones that are >11in) do get the header, but the header shows the same page number on all of the split pages. In the example below, page 2 is split into two pages, and when you print to PDF the headers on pages 3 and 4 are incorrect.
What have I tried?:
Basically everything I could think of. At first I thought about just using paged media, but I couldn't figure out how to put a header this complex on the pages using that method.
Is it possible with just HTML and CSS to do what I want? If it isn't then I may have to figure out a way to add the headers using JS and then saving the HTML before sending it into weasyprint (since weasyprint doesn't support JS). Any suggestions would be appreciated.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<style>
*{
margin: 0;
padding: 0;
}
@page{
size: 8.5in 11in;
}
table.container{
page-break-after: always;
}
td{
padding: 0;
margin: 0;
}
table tbody tr{
vertical-align: top;
page-break-after: always;
}
.container{
height: 11in;
width: 8.5in;
border: 1px solid black;
padding: 10px;
margin: 10px auto ;
}
.container thead{
height: 225px;
vertical-align:top;
}
.top{
display: flex;
align-items:center;
}
.header{
page-break-before: always;
}
.bottom{
font-family: sans-serif;
font-size: 10px;
}
.main{
margin: 0 30px;
}
.thead{
display: flex;
background-color: rgb(123, 199, 157);
color: white;
font-size: 14px;
border-bottom: 1px solid black;
border: 1px solid black;
}
.thead .col-one{
flex:3;
}
.thead .col-two{
flex: 1;
}
.thead .col-three{
flex: 2;
text-align: center;
}
.thead p{
padding: 10px;
}
.tbody{
display: flex;
color: rgb(23, 184, 109);
font-size: 12px;
font-family: sans-serif;
}
.main-data{
padding: 10px 15px ;
}
.sub-data{
padding: 0 30px 0 ;
}
.test i{
padding-top: 13.3px ;
}
.result i{
padding-top: 10px;
}
.result p{
padding: 1px;
}
.tbody tr td:nth-child(2), tr td:nth-child(3){
padding-left: 230px;
}
.tbody td{
padding: 2px 0;
}
.tbody{
border: 1px solid black;
}
.entry p{
color: rgb(78, 208, 143);
}
.entry{
font-family: sans-serif;
font-size: 13px;
padding: 50px 0 0 ;
}
.sub{
padding: 5px;
}
body{
counter-reset: page pages my-counter 0;
}
.header{
display: table-header-group;
}
.footer{
display: table-footer-group;
}
@media print{
.tbody tr td:nth-child(2), tr td:nth-child(3){
padding-left: 220px;
}
.thead{
display: table-header-group;
color: black;
background-color: rgb(69, 213, 129);
}
.thead tr { page-break-inside: avoid; }
}
@page {
@bottom-right {
content: counter(page) " of " counter(pages);
}
}
.dot::after {
content: " : "counter(page) " of " counter(pages);
counter-increment: page 1;
}
</style>
</head>
<body>
<table class="container">
<thead>
<tr>
<td>
<div class="haeder">
<div class="top">
<div class="address">
<h2>Company title</h2>
<p>contact info</p>
</div>
<div class="customer">
<p>Report ID: </p>
</div>
</div>
<div class="head-line">
<h1>Document Title</h1>
<p>Page No<span class="dot"></span> </p>
</div>
<div class="details">
<div class="lab-info">
<p>Info: <span>Lab Number</span></p>
<p>Info: <span>Sample ID from Sample Received</span></p>
</div>
<div class="date">
<p>Date Received: </p>
<p>Date Reported: </p>
</div>
</div>
</div>
</td>
</tr>
</thead>
<tfoot>
<tr>
<td>
<div class="footer" >
<div class="bottom">
<div class="first-col">
<p>Some Info</p>
<p>Some more info</p>
</div>
</div>
</div>
</td>
</tr>
</tfoot>
<tbody>
<tr>
<td>
<div class="main" >
<div class="thead">
<p class="col-one" >Table Name</p>
<p class="col-two">Result</p>
<p class="col-three">Result 2</p>
</div>
<div class="tbody">
<div class="main">
<table>
<tr>
<td>Test</td>
<td></td>
<td></td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td></td>
<td></td>
</tr>
</table>
</div>
<div class="result">
</div>
<div class="test">
</div>
</div>
</div>
</td>
</tr>
</tbody>
</table>
<table class="container">
<thead>
<tr>
<td>
<div class="haeder">
<div class="top">
<div class="address">
<h2>Company title</h2>
<p>contact info</p>
</div>
<div class="customer">
<p>Report ID: </p>
</div>
</div>
<div class="head-line">
<h1>Document Title</h1>
<p>Page No<span class="dot"></span> </p>
</div>
<div class="details">
<div class="lab-info">
<p>Info: <span>Lab Number</span></p>
<p>Info: <span>Sample ID from Sample Received</span></p>
</div>
<div class="date">
<p>Date Received: </p>
<p>Date Reported: </p>
</div>
</div>
</div>
</td>
</tr>
</thead>
<tfoot>
<tr>
<td>
<div class="footer" >
<div class="bottom">
<div class="first-col">
<p>Some Info</p>
<p>Some more info</p>
</div>
</div>
</div>
</td>
</tr>
</tfoot>
<tbody>
<tr>
<td>
<div class="main" >
<div class="thead">
<p class="col-one" >Table Name</p>
<p class="col-two">Result</p>
<p class="col-three">Result 2</p>
</div>
<div class="tbody">
<div class="main">
<table>
<tr>
<td>Test</td>
<td></td>
<td></td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td> %</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
</tr>
<tr>
<td class="sup">Test</td>
<td>%</td>
<td></td>
</tr>
<tr>
<td class="sup">Test</td>
<td></td>
<td></td>
</tr>
</table>
</div>
<div class="result">
</div>
<div class="test">
</div>
</div>
</div>
</td>
</tr>
</tbody>
</table>
</body>
</html>
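One possible direction, sketched under assumptions rather than tested against the full report above: let WeasyPrint produce the numbers itself in a page-margin box of an @page rule, where counter(page) and counter(pages) are evaluated once per printed page, instead of incrementing a counter inside the table header. The names build_report_html and write_pdf are hypothetical, the margins are guesses, and the table markup is deliberately simplified; in Django the HTML would normally come from a template rather than from string formatting.

```python
from html import escape

# Page-margin boxes are laid out once per printed page, so the counters stay
# correct even when a long table is split across several pages.
PAGE_CSS = """
@page {
    size: 8.5in 11in;
    margin: 1in 0.5in;  /* assumed margins; the top margin holds the header */
    @top-right {
        content: "Page " counter(page) " of " counter(pages);
        font-family: sans-serif;
        font-size: 10px;
    }
}
table.report { width: 100%; }
table.report + table.report { page-break-before: always; }
"""


def build_report_html(tables):
    """Return a full HTML document for a list of (title, rows) pairs.

    Each row is a (name, result) tuple; this layout is a simplified
    stand-in for the report in the question.
    """
    parts = [
        "<!DOCTYPE html>",
        '<html lang="en"><head><meta charset="UTF-8">',
        f"<style>{PAGE_CSS}</style></head><body>",
    ]
    for title, rows in tables:
        parts.append('<table class="report">')
        parts.append(
            f"<thead><tr><th>{escape(title)}</th><th>Result</th></tr></thead>"
        )
        parts.append("<tbody>")
        for name, result in rows:
            parts.append(
                f"<tr><td>{escape(name)}</td><td>{escape(result)}</td></tr>"
            )
        parts.append("</tbody></table>")
    parts.append("</body></html>")
    return "\n".join(parts)


def write_pdf(html, target):
    """Render the HTML to a PDF file; needs the third-party weasyprint package."""
    from weasyprint import HTML  # imported lazily so the builder works without it

    HTML(string=html).write_pdf(target)


# Two sections: a short table and one long enough to span several pages.
report_html = build_report_html([
    ("Short table", [("Test", "5 %")] * 8),
    ("Long table", [("Test", "5 %")] * 80),
])
```

Calling write_pdf(report_html, "report.pdf") would then render the document. The margin box only holds generated text, so a header as rich as the one in the question would still need a separate mechanism.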
Related
Get <td> text using python selenium
<html>
<body>
<table style="border:0">
<tbody>
<tr class=""> <td class="pr10">Mon</td> <td class="pl10">11am – 11pm</td> </tr>
<tr class=""> <td class="pr10">Tue</td> <td class="pl10">11am – 11pm</td> </tr>
<tr class="bold"> <td class="pr10">Wed</td> <td class="pl10">11am – 11pm</td> </tr>
<tr class=""> <td class="pr10">Thu</td> <td class="pl10">11am – 11pm</td> </tr>
<tr class=""> <td class="pr10">Fri</td> <td class="pl10">11am – 11pm</td> </tr>
<tr class=""> <td class="pr10">Sat</td> <td class="pl10">11am – 11pm</td> </tr>
<tr class=""> <td class="pr10">Sun</td> <td class="pl10">11am – 11pm</td> </tr>
</tbody>
</table>
</body>
</html>
Try 1: driver.find_elements_by_xpath("//*[@class='pr10']")
Try 2: driver.find_element_by_xpath("//tr[td='Mon']/td").text
But neither is fetching the text "Mon" "11am - 11pm".
text_area = driver.find_elements_by_xpath("//*[@class='pr10']")
for items2 in text_area:
    print(items2.text)
Try this instead:
text_area = driver.find_elements_by_xpath("""//*[@id="body"]/table/tbody/tr[1]/td[1]""")
print([elm.get_attribute('innerHTML') for elm in text_area])
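The attribute test in the XPath expressions above can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Selenium (an assumption made only to keep the example self-contained; its XPath support is a limited subset) on a trimmed copy of the table:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from the question (two rows only).
TABLE = """<table><tbody>
<tr><td class="pr10">Mon</td><td class="pl10">11am - 11pm</td></tr>
<tr><td class="pr10">Tue</td><td class="pl10">11am - 11pm</td></tr>
</tbody></table>"""

root = ET.fromstring(TABLE)

# [@class='pr10'] keeps only the cells whose class attribute is exactly "pr10".
days = [td.text for td in root.findall(".//td[@class='pr10']")]

# Walking row by row keeps each day together with its opening hours.
rows = [[td.text for td in tr.findall("td")] for tr in root.iter("tr")]

print(days)
print(rows)
```

In Selenium the same selection would go through the driver's XPath lookup, with .text read from each returned element.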
Subtracting Columns in DataFrames based on results of another column
Say I have a Pandas DataFrame like the following table:
Car    Fuel Volume
F-150  25.01
F-150  22.47
F-150  19.56
F-250  9.87
F-250  6.32
F-250  1.32
I want to create another column based on the difference in fuel volume, but only if the fuel volume being subtracted is from the same model of car.
So the resulting DataFrame would look like the following:
Car    Fuel Volume  Difference in Fuel
F-150  25.01        NaN
F-150  22.47        2.54
F-150  19.56        2.91
F-250  9.87         NaN
F-250  6.32         3.55
F-250  1.32         5
I think you need groupby with diff, and then abs:
df['Difference in Fuel'] = df.groupby('Car')['Fuel Volume'].diff().abs()
print (df)
     Car  Fuel Volume  Difference in Fuel
0  F-150        25.01                 NaN
1  F-150        22.47                2.54
2  F-150        19.56                2.91
3  F-250         9.87                 NaN
4  F-250         6.32                3.55
5  F-250         1.32                5.00
How to parse this html structure using BeautifulSoup?
I would like to parse this TABLE line by line and save it to a csv file. What I have done so far returns nothing in the csv file.
Django: data_scrapper makes a request to Yahoo Finance.
def button_clicked(request):
    headers = []
    rows = []
    gen_table = data_scrapper(symbol)
    soup = BeautifulSoup(gen_table)
    table = soup.find_all('table')
    for table in soup.find_all('table'):
        headers.extend([header.text for header in table.find_all('th')])
    for row in soup.find_all('tr'):
        rows.extend([val.text for val in row.find_all('td')])
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename= "{}.csv"'.format(symbol)
    writer = csv.writer(response)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
    return response
html:
<TABLE class="yfnc_tabledata1" width="100%" cellpadding="0" cellspacing="0" border="0"> <TR> <TD> <TABLE width="100%" cellpadding="2" cellspacing="0" border="0"> <TR class="yfnc_modtitle1" style="border-top:none;"> <td colspan="2" style="border-top:2px solid #000;"> <small> <span class="yfi-module-title">Period Ending</span> </small> </td> <th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2014</th> <th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2013</th> <th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2012</th> </TR> <tr> <td colspan="2"> <strong> Total Revenue </strong> </td> <td align="right"> <strong> 4,479,648 </strong> </td> <td align="right"> <strong> 3,777,068 </strong> </td> <td align="right"> <strong> 3,209,782 </strong> </td> </tr> <tr> <td colspan="2">Cost of Revenue</td> <td align="right">3,160,470 </td> <td align="right">2,656,189 </td> <td align="right">2,284,485 </td> </tr> <tr> <td colspan="5" style="height:0;padding:0; border-top:3px solid #333;"> <span style="display:block; width:5px; height:1px;"></span> </td> </tr> <tr> <td colspan="2"> <strong> Gross 
Profit </strong> </td> <td align="right"> <strong> 1,319,178 </strong> </td> <td align="right"> <strong> 1,120,879 </strong> </td> <td align="right"> <strong> 925,297 </strong> </td> </tr> <tr> <td colspan="5" style="height:0;padding:0; "> <span style="display:block; width:5px; height:10px;"></span> </td> </tr> <tr> <td> <spacer type="block" height="1" width="1" /> </td> <td class="yfnc_d" colspan="4">Operating Expenses</td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Research Development</td> <td align="right">148,458 </td> <td align="right">139,193 </td> <td align="right">127,361 </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Selling General and Administrative</td> <td align="right">456,030 </td> <td align="right">403,772 </td> <td align="right">319,511 </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Non Recurring</td> <td align="right"> - </td> <td align="right"> - </td> <td align="right"> - </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Others</td> <td align="right"> - </td> <td align="right"> - </td> <td align="right"> - </td> </tr> <tr> <td> <spacer type="block" height="1" width="1" /> </td> <td colspan="5" style="height:0; padding:0; " class="yfnc_d"> <span style="display:block; width:5px; height:1px;"></span> </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Total Operating Expenses</td> <td align="right"> - </td> <td align="right"> - </td> <td align="right"> - </td> </tr> <tr> <td colspan="5" style="height:0;padding:0; "> <span style="display:block; width:5px; height:10px;"></span> </td> </tr> <tr> <td colspan="5" style="height:0;padding:0; border-top:3px solid #333;"> <span style="display:block; width:5px; height:1px;"></span> </td> </tr> 
<tr> <td colspan="2"> <strong> Operating Income or Loss </strong> </td> <td align="right"> <strong> 714,690 </strong> </td> <td align="right"> <strong> 577,914 </strong> </td> <td align="right"> <strong> 478,425 </strong> </td> </tr> <tr> <td colspan="5" style="height:0;padding:0; "> <span style="display:block; width:5px; height:10px;"></span> </td> </tr> <tr> <td> <spacer type="block" height="1" width="1" /> </td> <td class="yfnc_d" colspan="4">Income from Continuing Operations</td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Total Other Income/Expenses Net</td> <td align="right">(10)</td> <td align="right">5,139 </td> <td align="right">7,529 </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Earnings Before Interest And Taxes</td> <td align="right">710,556 </td> <td align="right">580,639 </td> <td align="right">485,775 </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Interest Expense</td> <td align="right">11,239 </td> <td align="right">6,210 </td> <td align="right">5,932 </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Income Before Tax</td> <td align="right">699,317 </td> <td align="right">574,429 </td> <td align="right">479,843 </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Income Tax Expense</td> <td align="right">245,288 </td> <td align="right">193,360 </td> <td align="right">167,533 </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Minority Interest</td> <td align="right"> - </td> <td align="right"> - </td> <td align="right"> - </td> </tr> <tr> <td> <spacer type="block" height="1" width="1" /> </td> <td colspan="5" style="height:0; padding:0; " class="yfnc_d"> <span 
style="display:block; width:5px; height:1px;"></span> </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Net Income From Continuing Ops</td> <td align="right">454,029 </td> <td align="right">381,069 </td> <td align="right">312,310 </td> </tr> <tr> <td colspan="5" style="height:0;padding:0; "> <span style="display:block; width:5px; height:10px;"></span> </td> </tr> <tr> <td> <spacer type="block" height="1" width="1" /> </td> <td class="yfnc_d" colspan="4">Non-recurring Events</td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Discontinued Operations</td> <td align="right"> - </td> <td align="right">(3,777)</td> <td align="right"> - </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Extraordinary Items</td> <td align="right"> - </td> <td align="right"> - </td> <td align="right"> - </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Effect Of Accounting Changes</td> <td align="right"> - </td> <td align="right"> - </td> <td align="right"> - </td> </tr> <tr> <td width="30" class="yfnc_tabledata1"> <spacer type="block" width="30" height="1" /> </td> <td>Other Items</td> <td align="right"> - </td> <td align="right"> - </td> <td align="right"> - </td> </tr> <tr> <td colspan="5" style="height:0;padding:0; "> <span style="display:block; width:5px; height:10px;"></span> </td> </tr> <tr> <td colspan="5" style="height:0;padding:0; border-top:3px solid #333;"> <span style="display:block; width:5px; height:1px;"></span> </td> </tr> <tr> <td colspan="2"> <strong> Net Income </strong> </td> <td align="right"> <strong> 454,029 </strong> </td> <td align="right"> <strong> 377,292 </strong> </td> <td align="right"> <strong> 312,310 </strong> </td> </tr> <tr> <td colspan="2">Preferred Stock And Other Adjustments</td> <td align="right"> - </td> 
<td align="right"> - </td> <td align="right"> - </td> </tr> <tr> <td colspan="5" style="height:0;padding:0; border-top:3px solid #333;"> <span style="display:block; width:5px; height:1px;"></span> </td> </tr> <tr> <td colspan="2"> <strong> Net Income Applicable To Common Shares </strong> </td> <td align="right"> <strong> 454,029 </strong> </td> <td align="right"> <strong> 377,292 </strong> </td> <td align="right"> <strong> 312,310 </strong> </td> </tr> </TABLE> </TD> </TR> </TABLE>
Here's some code that makes a csv that looks like the table. The csvs I usually work with have each row as a complete record, so all the values in column one would be the csv header. Just something to think about; it might be helpful.
Python 3.4
from bs4 import BeautifulSoup
import re
import csv

def button_clicked(request, filename):
    # request holds the HTML text; the data lives in the inner (nested) table
    soup = BeautifulSoup(request)
    table = soup.find('table').find('table')
    t_rows = table.find_all('tr')
    # newline='' lets the csv module control the line endings
    with open(filename, 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for t_row in t_rows:
            rec_as_str = t_row.getText()
            rec_as_str = rec_as_str.strip()
            rec_as_str = rec_as_str.replace('\xa0', '')
            # collapse the runs of newlines and whitespace between cells into one separator
            rec_as_str = re.sub(r'\n?\s*(\n)+\s*', '|', rec_as_str)
            if len(rec_as_str) > 0:
                a_list = rec_as_str.split("|")
                spamwriter.writerow(a_list)
Creates a file that looks like:
Period Ending,"Dec 31, 2014","Dec 31, 2013","Dec 31, 2012"
Total Revenue,"4,479,648","3,777,068","3,209,782"
Cost of Revenue,"3,160,470","2,656,189","2,284,485"
Gross Profit,"1,319,178","1,120,879","925,297"
Operating Expenses
Research Development,"148,458","139,193","127,361"
Selling General and Administrative,"456,030","403,772","319,511"
Non Recurring,-,-,-
Others,-,-,-
Total Operating Expenses,-,-,-
Operating Income or Loss,"714,690","577,914","478,425"
Income from Continuing Operations
Total Other Income/Expenses Net,(10),"5,139","7,529"
Earnings Before Interest And Taxes,"710,556","580,639","485,775"
Interest Expense,"11,239","6,210","5,932"
Income Before Tax,"699,317","574,429","479,843"
Income Tax Expense,"245,288","193,360","167,533"
Minority Interest,-,-,-
Net Income From Continuing Ops,"454,029","381,069","312,310"
Non-recurring Events
Discontinued Operations,-,"(3,777)",-
Extraordinary Items,-,-,-
Effect Of Accounting Changes,-,-,-
Other Items,-,-,-
Net Income,"454,029","377,292","312,310"
Preferred Stock And Other Adjustments,-,-,-
Net Income Applicable To Common Shares,"454,029","377,292","312,310"
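As an alternative to splitting each row's text with a regex, the cells can be collected one by one and passed to csv.writer as lists. The sketch below swaps BeautifulSoup for the standard library's html.parser so that it runs with no extra packages; the class name RowCollector and the shortened sample table are hypothetical, and rows that contain only empty spacer cells are skipped.

```python
import csv
import io
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Normalise whitespace and drop cells that are empty spacers.
            text = " ".join("".join(self._cell).split())
            if text:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


SAMPLE = """
<table>
  <tr><td colspan="2"><small>Period Ending</small></td>
      <th>Dec 31, 2014</th><th>Dec 31, 2013</th></tr>
  <tr><td colspan="2"><strong> Total Revenue </strong></td>
      <td><strong> 4,479,648 </strong></td><td><strong> 3,777,068 </strong></td></tr>
  <tr><td colspan="4"><span></span></td></tr>
</table>
"""

parser = RowCollector()
parser.feed(SAMPLE)

# Write to an in-memory buffer; a Django view could pass its HttpResponse instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because each cell stays a separate list item, values that contain commas, such as "4,479,648", are quoted by the csv module without any manual separator handling.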
How to extract using beautifulsoup python [duplicate]
I am only interested in using beautifulsoup to extract all the values of the 3-hr PSI readings from 12AM to 11.59PM, such as the latest value, 82 at 5PM, which is shown in bold. An example of the website is at http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours. Can anyone teach me how? Thanks in advance!
<!-- start content --> <h1 class="title" id="top"> PSI Readings over the last 24 Hours</h1> <script type="text/javascript"> var baseUrl = '/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours'; function changetime(ddl) { var strTime = ddl.options[ddl.selectedIndex].value; if (strTime != null) { var npage = baseUrl + "/time/" + strTime + "#psi24"; window.location = npage; } } </script> <h1 id="psi24"> 24-hr PSI Readings on 24 Jun 2013 </h1> <p> View reading for: <select class="default" id="ContentPlaceHolderContent_C001_DDLTime" name="ctl00$ContentPlaceHolderContent$C001$DDLTime" onchange="changetime(this);"> <option value="0000">12AM</option> <option value="0100">1AM</option> <option value="0200">2AM</option> <option value="0300">3AM</option> <option value="0400">4AM</option> <option value="0500">5AM</option> <option value="0600">6AM</option> <option value="0700">7AM</option> <option value="0800">8AM</option> <option value="0900">9AM</option> <option value="1000">10AM</option> <option value="1100">11AM</option> <option value="1200">12PM</option> <option value="1300">1PM</option> <option value="1400">2PM</option> <option value="1500">3PM</option> <option value="1600">4PM</option> <option selected="selected" value="1700">5PM</option> </select> </p> <table border="0" cellpadding="4" cellspacing="1" class="text_psinormal" width="100%"> <thead> <tr> <th width="33%"> <center><strong>Region</strong></center> </th> <th width="33%"> <center><strong>PSI</strong></center> </th> <th width="34%"> 
<center><strong>24-hr PM2.5 Concentration (µg/m<sup>3</sup>)</strong></center> </th> </tr> </thead> <tr> <td align="center">North </td> <td align="center"> 61 </td> <td align="center"> 47 </td> </tr> <tr> <td align="center">South </td> <td align="center"> 62 </td> <td align="center"> 46 </td> </tr> <tr> <td align="center">East </td> <td align="center"> 55 </td> <td align="center"> 39 </td> </tr> <tr> <td align="center">West </td> <td align="center"> 87 </td> <td align="center"> 83 </td> </tr> <tr> <td align="center">Central </td> <td align="center"> 58 </td> <td align="center"> 40 </td> </tr> <tr> <td align="center">Overall Singapore </td> <td align="center"> 55-87 </td> <td align="center"> 39-83 </td> </tr> </table> <div> </div> <div> <h1>3-hr PSI Readings from 12AM to 11.59PM on 24 Jun 2013</h1> <table border="0" cellpadding="4" cellspacing="1" width="100%"> <tr> <td align="center" width="16%"> <strong>Time</strong> </td> <td align="center" width="7%"><strong>12AM</strong> </td> <td align="center" width="7%"><strong>1AM</strong> </td> <td align="center" width="7%"><strong>2AM</strong> </td> <td align="center" width="7%"><strong>3AM</strong> </td> <td align="center" width="7%"><strong>4AM</strong> </td> <td align="center" width="7%"><strong>5AM</strong> </td> <td align="center" width="7%"><strong>6AM</strong> </td> <td align="center" width="7%"><strong>7AM</strong> </td> <td align="center" width="7%"><strong>8AM</strong> </td> <td align="center" width="7%"><strong>9AM</strong> </td> <td align="center" width="7%"><strong>10AM</strong> </td> <td align="center" width="7%"><strong>11AM</strong> </td> </tr> <tr> <td align="center"> <strong>3-hr PSI</strong> </td> <td align="center"> 76 </td> <td align="center"> 70 </td> <td align="center"> 64 </td> <td align="center"> 59 </td> <td align="center"> 54 </td> <td align="center"> 51 </td> <td align="center"> 48 </td> <td align="center"> 47 </td> <td align="center"> 47 </td> <td align="center"> 47 </td> <td align="center"> 
49 </td> <td align="center"> 52 </td> </tr> <tr> <td align="center" width="16%"> <strong>Time</strong> </td> <td align="center" width="7%"><strong>12PM</strong> </td> <td align="center" width="7%"><strong>1PM</strong> </td> <td align="center" width="7%"><strong>2PM</strong> </td> <td align="center" width="7%"><strong>3PM</strong> </td> <td align="center" width="7%"><strong>4PM</strong> </td> <td align="center" width="7%"><strong>5PM</strong> </td> <td align="center" width="7%"><strong>6PM</strong> </td> <td align="center" width="7%"><strong>7PM</strong> </td> <td align="center" width="7%"><strong>8PM</strong> </td> <td align="center" width="7%"><strong>9PM</strong> </td> <td align="center" width="7%"><strong>10PM</strong> </td> <td align="center" width="7%"><strong>11PM</strong> </td> </tr> <tr> <td align="center"> <strong>3-hr PSI</strong> </td> <td align="center"> 54 </td> <td align="center"> 59 </td> <td align="center"> 65 </td> <td align="center"> 72 </td> <td align="center"> 79 </td> <td align="center"> <strong style="font-size:14px;">82</strong> </td> <td align="center"> - </td> <td align="center"> - </td> <td align="center"> - </td> <td align="center"> - </td> <td align="center"> - </td> <td align="center"> - </td> </tr> </table> </div> <div class="sfContentBlock"> <p class="table-caption">Hourly updates of 3-hr PSI readings are provided from 12am to 11:59pm. The 3hr PSI readings are calculated based on PM10 concentrations only</p> </div> <div> </div> <div class="backToTop"> Back to Top </div> </div> </div> <!-- end content -->
You should have shown what you've tried yourself, but here is the code:
from pprint import pprint
from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen
from bs4 import BeautifulSoup as soup

url = "http://app2.nea.gov.sg/anti-pollution-radiation-protection/air-pollution/psi/psi-readings-over-the-last-24-hours"
web_soup = soup(urlopen(url))
table = web_soup.find(name="div", attrs={'class': 'c1'}).find_all(name="div")[2].find_all('table')[0]

# one list of stripped cell texts per table row
table_rows = []
for row in table.find_all('tr'):
    table_rows.append([td.text.strip() for td in row.find_all('td')])

# even rows hold the times, the row below each holds the matching readings
data = {}
for tr_index, tr in enumerate(table_rows):
    if tr_index % 2 == 0:
        for td_index, td in enumerate(tr):
            data[td] = table_rows[tr_index + 1][td_index]

pprint(data)
prints:
{'10AM': '49',
 '10PM': '-',
 '11AM': '52',
 '11PM': '-',
 '12AM': '76',
 '12PM': '54',
 '1AM': '70',
 '1PM': '59',
 '2AM': '64',
 '2PM': '65',
 '3AM': '59',
 '3PM': '72',
 '4AM': '54',
 '4PM': '79',
 '5AM': '51',
 '5PM': '82',
 '6AM': '48',
 '6PM': '79',
 '7AM': '47',
 '7PM': '-',
 '8AM': '47',
 '8PM': '-',
 '9AM': '47',
 '9PM': '-',
 'Time': '3-hr PSI'}
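The row-pairing step can be looked at on its own, without the network request. This is a minimal sketch on a hand-copied excerpt of the table from the question (three columns per row, chosen only for brevity); it pairs each "Time" row with the "3-hr PSI" row below it using zip and skips the label column, so the 'Time': '3-hr PSI' entry does not end up in the result.

```python
# Hand-copied excerpt of the scraped rows: label rows alternate with value rows.
table_rows = [
    ["Time", "12AM", "1AM", "2AM"],
    ["3-hr PSI", "76", "70", "64"],
    ["Time", "12PM", "1PM", "2PM"],
    ["3-hr PSI", "54", "59", "65"],
]

readings = {}
# Even-indexed rows are the time labels, odd-indexed rows the readings.
for labels, values in zip(table_rows[0::2], table_rows[1::2]):
    # Skip the first cell of each row ("Time" / "3-hr PSI").
    readings.update(zip(labels[1:], values[1:]))

print(readings)
```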
beautiful soup get children that are Tags (not Navigable Strings) from a Tag
The Beautiful Soup documentation provides the attributes .contents and .children to access the children of a given tag (a list and an iterable respectively), and both include NavigableStrings as well as Tags. I want only the children of type Tag. I'm currently accomplishing this using a list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better/more pythonic/built-in way to get just the Tag children.
Thanks to J.F.Sebastian, the following will work:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed the actual rows in the table, so I ended up using the following, which is more precise and I think more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag. Worked with the following input:
<table cellspacing="0" cellpadding="0"> <thead> <tr class="title-row"> <th class="title" colspan="100"> <div style="position:relative;"> President <span class="pct-rpt"> 99% reporting </span> </div> </th> </tr> <tr class="header-row"> <th class="photo first"> </th> <th class="candidate "> Candidate </th> <th class="party "> Party </th> <th class="votes "> Votes </th> <th class="pct "> Pct. </th> <th class="change "> Change from ‘08 </th> <th class="evotes last"> Electoral Votes </th> </tr> </thead> <tbody> <tr class=""> <td class="photo first"> <div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div> </td> <td class="candidate "> <div class="winner dem"><img alt="Hp-checkmark@2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark@2x.png?1352320690" width="10" />Barack Obama</div> </td> <td class="party "> Dem. </td> <td class="votes "> 2,916,811 </td> <td class="pct "> 57.3% </td> <td class="change "> -4.6% </td> <td class="evotes last"> 20 </td> </tr> <tr class=""> <td class="photo first"> </td> <td class="candidate "> <div class="not-winner">Mitt Romney</div> </td> <td class="party "> Rep. 
</td> <td class="votes "> 2,090,116 </td> <td class="pct "> 41.1% </td> <td class="change "> +4.3% </td> <td class="evotes last"> 0 </td> </tr> <tr class=""> <td class="photo first"> </td> <td class="candidate "> <div class="not-winner">Gary Johnson</div> </td> <td class="party "> Lib. </td> <td class="votes "> 54,798 </td> <td class="pct "> 1.1% </td> <td class="change "> – </td> <td class="evotes last"> 0 </td> </tr> <tr class="last-row"> <td class="photo first"> </td> <td class="candidate "> <div class="not-winner">Jill Stein</div> </td> <td class="party "> Green </td> <td class="votes "> 29,336 </td> <td class="pct "> 0.6% </td> <td class="change "> – </td> <td class="evotes last"> 0 </td> </tr> <tr> <td class="footer" colspan="100"> President Map | President Big Board | Exit Polls </td> </tr> </tbody> </table>
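To see the difference between the approaches side by side, here is a small sketch on a two-row table (the sample markup is made up for illustration, and "html.parser" is chosen only so that no extra parser package is needed). It needs the third-party bs4 package.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Tiny made-up table; the newlines between rows become NavigableString children.
html = "<table><tbody>\n<tr><td>A</td></tr>\n<tr><td>B</td></tr>\n</tbody></table>"
soup = BeautifulSoup(html, "html.parser")
tbody = soup.table.tbody

# .children yields whitespace strings as well as tags.
all_children = list(tbody.children)

# Filtering by type, as in the question (isinstance also accepts Tag subclasses).
tag_children = [child for child in tbody.children if isinstance(child, Tag)]

# True matches every tag; recursive=False restricts the search to direct children.
direct_tags = tbody.find_all(True, recursive=False)

# Naming the tag is more specific; recursive=False keeps out rows of nested tables.
direct_rows = tbody.find_all("tr", recursive=False)

print(len(all_children), len(tag_children), len(direct_tags), len(direct_rows))
```

Note that find_all('tr') without recursive=False also descends into nested tables, which matters only when a cell contains a table of its own.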