The code below fails to scrape the rows with class="datarow". I have also tried pandas read_html, but the table with class="table_equities" contains nested tables, so read_html does not work for me either. How can I get the table data?
from bs4 import BeautifulSoup
import pandas as pd
import requests
import openpyxl
import scrapy
path = 'C:/Users/pacc_/OneDrive/Desktop/Eric/Investment/Trading record python.xlsx'
page_link = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'
page_response = requests.get(page_link, timeout=2)
page_content = BeautifulSoup(page_response.content, "html.parser")
data = page_content.find(class_='table_equities')
print(data)
My result:
<table class="table_equities">
<tr>
<th class="th code">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th name">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th price">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th turnover selected uppercase">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th mktcap">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th pe">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th div_yield">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th intraday">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
</tr>
</table>
Target page structure (as shown in the browser's inspector):
<table class="table_equities">
<tbody>
<tr>
<th class="th code">
<table>
<thead>
<tr>
<th class="text">Stock Code</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th name">
<table>
<thead>
<tr>
<th class="text">Name</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th price">
<table>
<thead>
<tr>
<th class="text">Nominal Price</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th turnover selected uppercase">
<table>
<thead>
<tr>
<th class="text">Turnover (HK$)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th mktcap">
<table>
<thead>
<tr>
<th class="text">Market Cap (HK$)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th pe">
<table>
<thead>
<tr>
<th class="text">P/E</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th div_yield">
<table>
<thead>
<tr>
<th class="text">Dividend Yield (%)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th intraday">
<table>
<thead>
<tr>
<th class="text">Intraday Movement</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
</tr>
<tr class="datarow">
<td class="code"><a>909</a></td>
<td class="name"><a>MING YUAN CLOUD</a></td>
<td class="price"><bdo>HK$30.700</bdo>
<br>
<div><span>0.000</span> (<span>0.00%</span>)</div>
</td>
<td class="turnover">8.35B</td>
<td class="market">57.44B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0909.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (909)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>700</a></td>
<td class="name"><a>TENCENT</a></td>
<td class="price"><bdo>HK$503.500</bdo>
<br>
<div class="downval"><span>-1.500</span> (<span>-0.30%</span>)</div>
</td>
<td class="turnover">6.69B</td>
<td class="market">4,824.90B</td>
<td class="pe">46.13x</td>
<td class="dividend">0.24%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0700.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (700)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9988</a></td>
<td class="name"><a>BABA-SW</a></td>
<td class="price"><bdo>HK$258.000</bdo>
<br>
<div class="downval"><span>-3.000</span> (<span>-1.15%</span>)</div>
</td>
<td class="turnover">3.97B</td>
<td class="market">5,584.43B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9988.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9988)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>3690</a></td>
<td class="name"><a>MEITUAN-W</a></td>
<td class="price"><bdo>HK$232.000</bdo>
<br>
<div class="downval"><span>-6.600</span> (<span>-2.77%</span>)</div>
</td>
<td class="turnover">3.75B</td>
<td class="market">1,364.39B</td>
<td class="pe">547.17x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=3690.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (3690)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>3333</a></td>
<td class="name"><a>EVERGRANDE</a></td>
<td class="price"><bdo>HK$13.780</bdo>
<br>
<div class="downval"><span>-1.440</span> (<span>-9.46%</span>)</div>
</td>
<td class="turnover">2.80B</td>
<td class="market">180.00B</td>
<td class="pe">9.59x</td>
<td class="dividend">5.15%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=3333.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (3333)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1810</a></td>
<td class="name"><a>XIAOMI-W</a></td>
<td class="price"><bdo>HK$19.720</bdo>
<br>
<div class="downval"><span>-0.120</span> (<span>-0.60%</span>)</div>
</td>
<td class="turnover">2.69B</td>
<td class="market">475.78B</td>
<td class="pe">42.69x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1810.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1810)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>2318</a></td>
<td class="name"><a>PING AN</a></td>
<td class="price"><bdo>HK$80.350</bdo>
<br>
<div class="downval"><span>-0.100</span> (<span>-0.12%</span>)</div>
</td>
<td class="turnover">1.49B</td>
<td class="market">598.41B</td>
<td class="pe">8.61x</td>
<td class="dividend">2.89%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=2318.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (2318)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>388</a></td>
<td class="name"><a>HKEX</a></td>
<td class="price"><bdo>HK$355.800</bdo>
<br>
<div class="downval"><span>-1.800</span> (<span>-0.50%</span>)</div>
</td>
<td class="turnover">1.49B</td>
<td class="market">451.09B</td>
<td class="pe">47.50x</td>
<td class="dividend">1.88%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0388.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (388)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1918</a></td>
<td class="name"><a>SUNAC</a></td>
<td class="price"><bdo>HK$28.950</bdo>
<br>
<div class="downval"><span>-1.600</span> (<span>-5.24%</span>)</div>
</td>
<td class="turnover">1.36B</td>
<td class="market">134.94B</td>
<td class="pe">4.44x</td>
<td class="dividend">4.64%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1918.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1918)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>981</a></td>
<td class="name"><a>SMIC</a></td>
<td class="price"><bdo>HK$18.580</bdo>
<br>
<div class="downval"><span>-0.760</span> (<span>-3.93%</span>)</div>
</td>
<td class="turnover">1.35B</td>
<td class="market">143.03B</td>
<td class="pe">59.55x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0981.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (981)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1299</a></td>
<td class="name"><a>AIA</a></td>
<td class="price"><bdo>HK$77.650</bdo>
<br>
<div class="upval"><span>+1.000</span> (<span>+1.30%</span>)</div>
</td>
<td class="turnover">1.34B</td>
<td class="market">939.05B</td>
<td class="pe">18.03x</td>
<td class="dividend">1.65%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1299.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1299)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>2300</a></td>
<td class="name"><a>AMVIG HOLDINGS</a></td>
<td class="price"><bdo>HK$2.140</bdo>
<br>
<div class="upval"><span>+0.700</span> (<span>+48.61%</span>)</div>
</td>
<td class="turnover">1.31B</td>
<td class="market">1.98B</td>
<td class="pe">6.35x</td>
<td class="dividend">5.33%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=2300.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (2300)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1398</a></td>
<td class="name"><a>ICBC</a></td>
<td class="price"><bdo>HK$3.990</bdo>
<br>
<div class="downval"><span>-0.030</span> (<span>-0.75%</span>)</div>
</td>
<td class="turnover">1.28B</td>
<td class="market">346.30B</td>
<td class="pe">4.32x</td>
<td class="dividend">7.20%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1398.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1398)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>5</a></td>
<td class="name"><a>HSBC HOLDINGS</a></td>
<td class="price"><bdo>HK$28.200</bdo>
<br>
<div class="downval"><span>-0.400</span> (<span>-1.40%</span>)</div>
</td>
<td class="turnover">1.20B</td>
<td class="market">583.52B</td>
<td class="pe">12.21x</td>
<td class="dividend">2.78%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0005.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (5)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>939</a></td>
<td class="name"><a>CCB</a></td>
<td class="price"><bdo>HK$5.020</bdo>
<br>
<div class="downval"><span>-0.040</span> (<span>-0.79%</span>)</div>
</td>
<td class="turnover">1.10B</td>
<td class="market">1,206.89B</td>
<td class="pe">4.36x</td>
<td class="dividend">6.97%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0939.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (939)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9618</a></td>
<td class="name"><a>JD-SW</a></td>
<td class="price"><bdo>HK$282.200</bdo>
<br>
<div class="downval"><span>-6.200</span> (<span>-2.15%</span>)</div>
</td>
<td class="turnover">1.08B</td>
<td class="market">883.22B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9618.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9618)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1658</a></td>
<td class="name"><a>PSBC</a></td>
<td class="price"><bdo>HK$3.160</bdo>
<br>
<div class="upval"><span>+0.060</span> (<span>+1.94%</span>)</div>
</td>
<td class="turnover">1.05B</td>
<td class="market">62.74B</td>
<td class="pe">4.01x</td>
<td class="dividend">7.23%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1658.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1658)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9633</a></td>
<td class="name"><a>NONGFU SPRING</a></td>
<td class="price"><bdo>HK$35.150</bdo>
<br>
<div class="downval"><span>-2.850</span> (<span>-7.50%</span>)</div>
</td>
<td class="turnover">1.01B</td>
<td class="market">174.92B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9633.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9633)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1211</a></td>
<td class="name"><a>BYD COMPANY</a></td>
<td class="price"><bdo>HK$103.100</bdo>
<br>
<div class="downval"><span>-0.400</span> (<span>-0.39%</span>)</div>
</td>
<td class="turnover">844.11M</td>
<td class="market">94.33B</td>
<td class="pe">188.21x</td>
<td class="dividend">0.06%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1211.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1211)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>708</a></td>
<td class="name"><a>EVERG VEHICLE</a></td>
<td class="price"><bdo>HK$16.820</bdo>
<br>
<div class="downval"><span>-2.460</span> (<span>-12.76%</span>)</div>
</td>
<td class="turnover">789.41M</td>
<td class="market">148.29B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0708.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (708)"></td>
</tr>
</tbody>
</table>
The table is loaded dynamically with JavaScript, so the HTML returned by requests contains only the empty header cells. If you open the browser's developer tools and watch the Network tab, you can see the XHR calls that fetch the data. I would inspect those calls and then request the data from that API directly.
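Whichever route supplies the markup (the XHR response or a browser-driven session), once the rendered rows are available they can be selected by class. The following is only a minimal sketch, assuming the rendered HTML has the structure shown in the question. `parse_datarows`, `cell_text` and `rendered_html` are hypothetical names, and the sample is trimmed to two rows copied from the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the rendered markup from the question. Assumption: in
# practice this string comes from the XHR response or a browser session.
rendered_html = """
<table class="table_equities">
<tbody>
<tr><th class="th code">Stock Code</th></tr>
<tr class="datarow">
<td class="code"><a>909</a></td>
<td class="name"><a>MING YUAN CLOUD</a></td>
<td class="price"><bdo>HK$30.700</bdo>
<br>
<div><span>0.000</span> (<span>0.00%</span>)</div>
</td>
<td class="turnover">8.35B</td>
<td class="market">57.44B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
</tr>
<tr class="datarow">
<td class="code"><a>700</a></td>
<td class="name"><a>TENCENT</a></td>
<td class="price"><bdo>HK$503.500</bdo>
<br>
<div class="downval"><span>-1.500</span> (<span>-0.30%</span>)</div>
</td>
<td class="turnover">6.69B</td>
<td class="market">4,824.90B</td>
<td class="pe">46.13x</td>
<td class="dividend">0.24%</td>
</tr>
</tbody>
</table>
"""


def cell_text(tr, css_class):
    """Return the stripped text of the <td> with the given class, or None."""
    td = tr.find("td", class_=css_class)
    return td.get_text(strip=True) if td else None


def parse_datarows(html):
    """Collect one dict per <tr class="datarow"> in the equities table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table_equities")
    records = []
    for tr in table.find_all("tr", class_="datarow"):
        price_td = tr.find("td", class_="price")
        # The nominal price sits in <bdo>; the change sits in the <div>.
        bdo = price_td.find("bdo") if price_td else None
        records.append({
            "code": cell_text(tr, "code"),
            "name": cell_text(tr, "name"),
            "price": bdo.get_text(strip=True) if bdo else None,
            "turnover": cell_text(tr, "turnover"),
            "market_cap": cell_text(tr, "market"),
            "pe": cell_text(tr, "pe"),
            "dividend_yield": cell_text(tr, "dividend"),
        })
    return records


rows = parse_datarows(rendered_html)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to pandas.DataFrame(rows) if a DataFrame is wanted, which sidesteps the nested-table problem with read_html.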
The page source looks like this:
<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>
I want to get the following numbers for every date.
For example, for the date 2019-12-24 I need the numbers 12269.25, 12283.70, 12202.10 and 12214.55, and then the same for the next date.
The difficulty is that I need to select the four cells that follow each date, and their XPaths have little in common, as shown above. The page can contain anywhere from a single date to 100-200 dates.
Can anybody please help with a webdriver code snippet for this?
Thanks a lot.
Does this meet your needs?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>'''
doc = SimplifiedDoc(html)
table = doc.getElement(tag='table', value='tblchart')
trs = table.trs.notContains('<th')  # keep only the data rows (no <th> cells)
for tr in trs:
    tds = tr.tds  # all <td> cells in the row
    data = [td.text for td in tds]
    print(data[0], data[1], data[2], data[3], data[4])
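The same extraction can also be sketched with BeautifulSoup instead of simplified_scrapy. This is only an illustration under assumptions: `extract_ohlc` is a hypothetical helper, the sample is a trimmed copy of the table from the question, and when a webdriver loads the page the HTML string would typically be taken from the driver's `page_source` attribute.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (attributes mostly removed).
html = """<div class="MT12">
<table class="tblchart">
<tr>
<th rowspan="2">Date</th>
<th rowspan="2">Open</th>
<th rowspan="2">High</th>
<th rowspan="2">Low</th>
<th rowspan="2">Close</th>
<th colspan="2">- SPREAD -</th>
</tr>
<tr>
<th>(High-Low)</th>
<th class="last">(Open-Close)</th>
</tr>
<tr>
<td>2019-12-24</td>
<td>12269.25</td>
<td class="b_12vv">12283.70</td>
<td>12202.10</td>
<td>12214.55</td>
<td>81.60</td>
<td class="last">54.70</td>
</tr>
<tr>
<td>2019-12-23</td>
<td>12235.45</td>
<td class="b_12vv">12287.15</td>
<td>12213.25</td>
<td>12262.75</td>
<td>73.90</td>
<td class="last">-27.30</td>
</tr>
</table>
</div>"""


def extract_ohlc(page_html):
    """Return (date, open, high, low, close) tuples, one per data row."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", class_="tblchart")
    result = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < 5:
            continue  # header rows contain only <th> cells
        date, *prices = cells[:5]
        result.append((date, *[float(p) for p in prices]))
    return result


ohlc = extract_ohlc(html)
for record in ohlc:
    print(record)
```

Because each row is handled as a unit, the number of dates on the page does not matter and no per-cell XPath is needed.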
I would like to parse this table row by row and save it to a CSV file.
What I have done so far writes nothing to the CSV file:
Django:
data_scrapper makes a request to Yahoo Finance.
def button_clicked(request):
    headers = []
    rows = []
    gen_table = data_scrapper(symbol)
    soup = BeautifulSoup(gen_table)
    table = soup.find_all('table')
    for table in soup.find_all('table'):
        headers.extend([header.text for header in table.find_all('th')])
    for row in soup.find_all('tr'):
        rows.extend([val.text for val in row.find_all('td')])
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename= "{}.csv"'.format(symbol)
    writer = csv.writer(response)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
    return response
html:
<TABLE class="yfnc_tabledata1" width="100%" cellpadding="0" cellspacing="0" border="0">
<TR>
<TD>
<TABLE width="100%" cellpadding="2" cellspacing="0" border="0">
<TR class="yfnc_modtitle1" style="border-top:none;">
<td colspan="2" style="border-top:2px solid #000;">
<small>
<span class="yfi-module-title">Period Ending</span>
</small>
</td>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2014</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2013</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2012</th>
</TR>
<tr>
<td colspan="2">
<strong>
Total Revenue
</strong>
</td>
<td align="right">
<strong>
4,479,648
</strong>
</td>
<td align="right">
<strong>
3,777,068
</strong>
</td>
<td align="right">
<strong>
3,209,782
</strong>
</td>
</tr>
<tr>
<td colspan="2">Cost of Revenue</td>
<td align="right">3,160,470 </td>
<td align="right">2,656,189 </td>
<td align="right">2,284,485 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Gross Profit
</strong>
</td>
<td align="right">
<strong>
1,319,178
</strong>
</td>
<td align="right">
<strong>
1,120,879
</strong>
</td>
<td align="right">
<strong>
925,297
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Operating Expenses</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Research Development</td>
<td align="right">148,458 </td>
<td align="right">139,193 </td>
<td align="right">127,361 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Selling General and Administrative</td>
<td align="right">456,030 </td>
<td align="right">403,772 </td>
<td align="right">319,511 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Non Recurring</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Others</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Operating Expenses</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Operating Income or Loss
</strong>
</td>
<td align="right">
<strong>
714,690
</strong>
</td>
<td align="right">
<strong>
577,914
</strong>
</td>
<td align="right">
<strong>
478,425
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Income from Continuing Operations</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Other Income/Expenses Net</td>
<td align="right">(10)</td>
<td align="right">5,139 </td>
<td align="right">7,529 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Earnings Before Interest And Taxes</td>
<td align="right">710,556 </td>
<td align="right">580,639 </td>
<td align="right">485,775 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Interest Expense</td>
<td align="right">11,239 </td>
<td align="right">6,210 </td>
<td align="right">5,932 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Before Tax</td>
<td align="right">699,317 </td>
<td align="right">574,429 </td>
<td align="right">479,843 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Tax Expense</td>
<td align="right">245,288 </td>
<td align="right">193,360 </td>
<td align="right">167,533 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Minority Interest</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Net Income From Continuing Ops</td>
<td align="right">454,029 </td>
<td align="right">381,069 </td>
<td align="right">312,310 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Non-recurring Events</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Discontinued Operations</td>
<td align="right">
-
</td>
<td align="right">(3,777)</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Extraordinary Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Effect Of Accounting Changes</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Other Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
<tr>
<td colspan="2">Preferred Stock And Other Adjustments</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income Applicable To Common Shares
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
</TABLE>
</TD>
</TR>
</TABLE>
Here's some code that makes a CSV that looks like the table. The CSVs I usually work with have one complete record per row, so all the values in column one would become the CSV header. Just something to think about; it might be helpful.
Python 3.4
from bs4 import BeautifulSoup
import re
import csv
def button_clicked(request, filename):
    # `request` is assumed to hold the page's HTML markup
    soup = BeautifulSoup(request, 'html.parser')
    # The data lives in the table nested inside the outer layout table
    table = soup.find('table').find('table')
    t_rows = table.find_all('tr')
    # newline='' lets the csv module control the line endings
    with open(filename, 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for t_row in t_rows:
            rec_as_str = t_row.getText()
            rec_as_str = rec_as_str.strip()
            rec_as_str = rec_as_str.replace('\xa0', '')
            # Collapse each run of newlines and whitespace into one '|' separator
            rec_as_str = re.sub(r'\n?\s*(\n)+\s*', '|', rec_as_str)
            if len(rec_as_str) > 0:
                a_list = rec_as_str.split("|")
                spamwriter.writerow(a_list)
Creates a file that looks like:
Period Ending,"Dec 31, 2014","Dec 31, 2013","Dec 31, 2012"
Total Revenue,"4,479,648","3,777,068","3,209,782"
Cost of Revenue,"3,160,470","2,656,189","2,284,485"
Gross Profit,"1,319,178","1,120,879","925,297"
Operating Expenses
Research Development,"148,458","139,193","127,361"
Selling General and Administrative,"456,030","403,772","319,511"
Non Recurring,-,-,-
Others,-,-,-
Total Operating Expenses,-,-,-
Operating Income or Loss,"714,690","577,914","478,425"
Income from Continuing Operations
Total Other Income/Expenses Net,(10),"5,139","7,529"
Earnings Before Interest And Taxes,"710,556","580,639","485,775"
Interest Expense,"11,239","6,210","5,932"
Income Before Tax,"699,317","574,429","479,843"
Income Tax Expense,"245,288","193,360","167,533"
Minority Interest,-,-,-
Net Income From Continuing Ops,"454,029","381,069","312,310"
Non-recurring Events
Discontinued Operations,-,"(3,777)",-
Extraordinary Items,-,-,-
Effect Of Accounting Changes,-,-,-
Other Items,-,-,-
Net Income,"454,029","377,292","312,310"
Preferred Stock And Other Adjustments,-,-,-
Net Income Applicable To Common Shares,"454,029","377,292","312,310"
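An alternative to splitting the row text on newlines is to build each CSV row directly from its cells, which also avoids flattening every cell into a single list as the code in the question does. The following is a sketch under assumptions: `table_to_csv_rows` is a hypothetical helper, the sample is a trimmed copy of the HTML from the question, and an in-memory buffer stands in for the file or HTTP response.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed copy of the nested table from the question.
html = """<TABLE class="yfnc_tabledata1" width="100%">
<TR>
<TD>
<TABLE width="100%">
<TR class="yfnc_modtitle1">
<td colspan="2"><small><span class="yfi-module-title">Period Ending</span></small></td>
<th scope="col">Dec 31, 2014</th>
<th scope="col">Dec 31, 2013</th>
<th scope="col">Dec 31, 2012</th>
</TR>
<tr>
<td colspan="2">
<strong>
Total Revenue
</strong>
</td>
<td align="right"><strong> 4,479,648 </strong></td>
<td align="right"><strong> 3,777,068 </strong></td>
<td align="right"><strong> 3,209,782 </strong></td>
</tr>
<tr>
<td colspan="5"><span style="display:block; width:5px; height:1px;"></span></td>
</tr>
<tr>
<td width="30"><spacer type="block" width="30" height="1" /></td>
<td>Research Development</td>
<td align="right">148,458 </td>
<td align="right">139,193 </td>
<td align="right">127,361 </td>
</tr>
</TABLE>
</TD>
</TR>
</TABLE>"""


def table_to_csv_rows(page_html):
    """Return one list of non-empty cell texts per row of the inner table."""
    soup = BeautifulSoup(page_html, "html.parser")
    inner = soup.find("table").find("table")  # skip the outer layout table
    rows = []
    for tr in inner.find_all("tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
        cells = [c for c in cells if c]  # drop spacer and separator cells
        if cells:
            rows.append(cells)
    return rows


csv_rows = table_to_csv_rows(html)
buffer = io.StringIO()
csv.writer(buffer).writerows(csv_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

In the Django view from the question, the same rows could be written with csv.writer(response).writerows(...) instead of the in-memory buffer.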
The Beautiful Soup documentation describes the attributes .contents and .children for accessing the children of a given tag (a list and an iterable, respectively). Both include NavigableString objects as well as Tags. I want only the children of type Tag.
I'm currently accomplishing this using a list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better/more pythonic/built-in way to get just Tag children.
Thanks to J.F. Sebastian, the following works:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed actual rows in the table, so I ended up using the following, which is more precise and I think more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag.
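To make the difference between these approaches concrete, here is a small sketch on a made-up table (the `html` sample is invented for illustration and is not the input shown in this post). It also shows one caveat: find_all('tr') searches all descendants by default, so rows of a nested table are included unless recursive=False is passed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: an outer table whose second row holds a nested table.
html = """<table><tbody>
<tr><td>outer 1</td></tr>
<tr><td><table><tbody><tr><td>nested</td></tr></tbody></table></td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.table.tbody

# .children yields Tags and the whitespace NavigableStrings between them.
all_children = list(tbody.children)

# Filtering by type, as in the question (isinstance is the idiomatic check).
filtered = [child for child in tbody.children if isinstance(child, Tag)]

# find_all(True, recursive=False): direct children that are Tags.
tag_children = tbody.find_all(True, recursive=False)

# find_all('tr') descends into the nested table as well.
all_rows = tbody.find_all("tr")

# Restricting the search to direct children keeps only the outer rows.
direct_rows = tbody.find_all("tr", recursive=False)

print(len(all_children), len(filtered), len(tag_children),
      len(all_rows), len(direct_rows))
```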
This worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100">
<div style="position:relative;">
President
<span class="pct-rpt">
99% reporting
</span>
</div>
</th>
</tr>
<tr class="header-row">
<th class="photo first">
</th>
<th class="candidate ">
Candidate
</th>
<th class="party ">
Party
</th>
<th class="votes ">
Votes
</th>
<th class="pct ">
Pct.
</th>
<th class="change ">
Change from ‘08
</th>
<th class="evotes last">
Electoral Votes
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first">
<div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div>
</td>
<td class="candidate ">
<div class="winner dem"><img alt="Hp-checkmark#2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark#2x.png?1352320690" width="10" />Barack Obama</div>
</td>
<td class="party ">
Dem.
</td>
<td class="votes ">
2,916,811
</td>
<td class="pct ">
57.3%
</td>
<td class="change ">
-4.6%
</td>
<td class="evotes last">
20
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Mitt Romney</div>
</td>
<td class="party ">
Rep.
</td>
<td class="votes ">
2,090,116
</td>
<td class="pct ">
41.1%
</td>
<td class="change ">
+4.3%
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Gary Johnson</div>
</td>
<td class="party ">
Lib.
</td>
<td class="votes ">
54,798
</td>
<td class="pct ">
1.1%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="last-row">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Jill Stein</div>
</td>
<td class="party ">
Green
</td>
<td class="votes ">
29,336
</td>
<td class="pct ">
0.6%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr>
<td class="footer" colspan="100">
President Map |
President Big Board |
Exit Polls
</td>
</tr>
</tbody>
</table>