Get <td> text using python selenium - python

<html>
<body>
<table style="border:0">
<tbody>
<tr class="">
<td class="pr10">Mon</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Tue</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="bold">
<td class="pr10">Wed</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Thu</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Fri</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Sat</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Sun</td>
<td class="pl10">11am – 11pm</td>
</tr>
</tbody>
</table>
</body>
</html>
Try 1:
driver.find_elements_by_xpath("//*[@class='pr10']")
Try 2:
driver.find_element_by_xpath("//tr[td='Mon']/td").text
But it is not fetching the text "Mon" and "11am – 11pm".
text_area = driver.find_elements_by_xpath("//*[@class='pr10']")
for items2 in text_area:
    print(items2.text)

Try this instead:
text_area = driver.find_elements_by_xpath("""//*[@id="body"]/table/tbody/tr[1]/td[1]""")
print([elm.get_attribute('innerHTML') for elm in text_area])
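The row-by-row pairing the question asks for can be tried offline. This is a minimal sketch, assuming the table markup from the question is well-formed; it uses the standard library's xml.etree.ElementTree (which supports a limited XPath subset) in place of a live browser, and the names TABLE and hours are illustrative only. With Selenium 4 the same idea would be driver.find_elements(By.XPATH, "//tr") followed by row.find_elements(By.TAG_NAME, "td").

```python
import xml.etree.ElementTree as ET

# Two rows copied from the question's table; the others follow the same pattern.
TABLE = """
<table style="border:0">
<tbody>
<tr class=""><td class="pr10">Mon</td><td class="pl10">11am – 11pm</td></tr>
<tr class="bold"><td class="pr10">Wed</td><td class="pl10">11am – 11pm</td></tr>
</tbody>
</table>
"""

root = ET.fromstring(TABLE)

# Walk each row and read the day cell and the hours cell by class name.
hours = {}
for row in root.findall(".//tr"):
    day = row.find("td[@class='pr10']").text
    opening = row.find("td[@class='pl10']").text
    hours[day] = opening

for day, opening in hours.items():
    print(day, opening)
```

Selecting the rows first and then the cells inside each row keeps the day and its opening hours together, which a flat search for one class cannot do.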

Related

Python web scraping problems - result different from source code

The code below is not able to scrape the rows with class_='datarow'. I have tried read_html as well, but there are tables nested inside the table with class="table_equities", so read_html does not work for me. I have no idea how to get the table.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import openpyxl
import scrapy
path = 'C:/Users/pacc_/OneDrive/Desktop/Eric/Investment/Trading record python.xlsx'
page_link = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'
page_response = requests.get(page_link, timeout=2)
page_content = BeautifulSoup(page_response.content, "html.parser")
data = page_content.find(class_='table_equities')
print(data)
My result:
<table class="table_equities">
<tr>
<th class="th code">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th name">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th price">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th turnover selected uppercase">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th mktcap">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th pe">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th div_yield">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th intraday">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
</tr>
</table>
Target url structure:
<table class="table_equities">
<tbody>
<tr>
<th class="th code">
<table>
<thead>
<tr>
<th class="text">Stock Code</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th name">
<table>
<thead>
<tr>
<th class="text">Name</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th price">
<table>
<thead>
<tr>
<th class="text">Nominal Price</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th turnover selected uppercase">
<table>
<thead>
<tr>
<th class="text">Turnover (HK$)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th mktcap">
<table>
<thead>
<tr>
<th class="text">Market Cap (HK$)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th pe">
<table>
<thead>
<tr>
<th class="text">P/E</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th div_yield">
<table>
<thead>
<tr>
<th class="text">Dividend Yield (%)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th intraday">
<table>
<thead>
<tr>
<th class="text">Intraday Movement</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
</tr>
<tr class="datarow">
<td class="code"><a>909</a></td>
<td class="name"><a>MING YUAN CLOUD</a></td>
<td class="price"><bdo>HK$30.700</bdo>
<br>
<div><span>0.000</span> (<span>0.00%</span>)</div>
</td>
<td class="turnover">8.35B</td>
<td class="market">57.44B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0909.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (909)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>700</a></td>
<td class="name"><a>TENCENT</a></td>
<td class="price"><bdo>HK$503.500</bdo>
<br>
<div class="downval"><span>-1.500</span> (<span>-0.30%</span>)</div>
</td>
<td class="turnover">6.69B</td>
<td class="market">4,824.90B</td>
<td class="pe">46.13x</td>
<td class="dividend">0.24%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0700.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (700)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9988</a></td>
<td class="name"><a>BABA-SW</a></td>
<td class="price"><bdo>HK$258.000</bdo>
<br>
<div class="downval"><span>-3.000</span> (<span>-1.15%</span>)</div>
</td>
<td class="turnover">3.97B</td>
<td class="market">5,584.43B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9988.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9988)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>3690</a></td>
<td class="name"><a>MEITUAN-W</a></td>
<td class="price"><bdo>HK$232.000</bdo>
<br>
<div class="downval"><span>-6.600</span> (<span>-2.77%</span>)</div>
</td>
<td class="turnover">3.75B</td>
<td class="market">1,364.39B</td>
<td class="pe">547.17x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=3690.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (3690)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>3333</a></td>
<td class="name"><a>EVERGRANDE</a></td>
<td class="price"><bdo>HK$13.780</bdo>
<br>
<div class="downval"><span>-1.440</span> (<span>-9.46%</span>)</div>
</td>
<td class="turnover">2.80B</td>
<td class="market">180.00B</td>
<td class="pe">9.59x</td>
<td class="dividend">5.15%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=3333.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (3333)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1810</a></td>
<td class="name"><a>XIAOMI-W</a></td>
<td class="price"><bdo>HK$19.720</bdo>
<br>
<div class="downval"><span>-0.120</span> (<span>-0.60%</span>)</div>
</td>
<td class="turnover">2.69B</td>
<td class="market">475.78B</td>
<td class="pe">42.69x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1810.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1810)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>2318</a></td>
<td class="name"><a>PING AN</a></td>
<td class="price"><bdo>HK$80.350</bdo>
<br>
<div class="downval"><span>-0.100</span> (<span>-0.12%</span>)</div>
</td>
<td class="turnover">1.49B</td>
<td class="market">598.41B</td>
<td class="pe">8.61x</td>
<td class="dividend">2.89%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=2318.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (2318)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>388</a></td>
<td class="name"><a>HKEX</a></td>
<td class="price"><bdo>HK$355.800</bdo>
<br>
<div class="downval"><span>-1.800</span> (<span>-0.50%</span>)</div>
</td>
<td class="turnover">1.49B</td>
<td class="market">451.09B</td>
<td class="pe">47.50x</td>
<td class="dividend">1.88%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0388.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (388)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1918</a></td>
<td class="name"><a>SUNAC</a></td>
<td class="price"><bdo>HK$28.950</bdo>
<br>
<div class="downval"><span>-1.600</span> (<span>-5.24%</span>)</div>
</td>
<td class="turnover">1.36B</td>
<td class="market">134.94B</td>
<td class="pe">4.44x</td>
<td class="dividend">4.64%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1918.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1918)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>981</a></td>
<td class="name"><a>SMIC</a></td>
<td class="price"><bdo>HK$18.580</bdo>
<br>
<div class="downval"><span>-0.760</span> (<span>-3.93%</span>)</div>
</td>
<td class="turnover">1.35B</td>
<td class="market">143.03B</td>
<td class="pe">59.55x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0981.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (981)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1299</a></td>
<td class="name"><a>AIA</a></td>
<td class="price"><bdo>HK$77.650</bdo>
<br>
<div class="upval"><span>+1.000</span> (<span>+1.30%</span>)</div>
</td>
<td class="turnover">1.34B</td>
<td class="market">939.05B</td>
<td class="pe">18.03x</td>
<td class="dividend">1.65%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1299.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1299)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>2300</a></td>
<td class="name"><a>AMVIG HOLDINGS</a></td>
<td class="price"><bdo>HK$2.140</bdo>
<br>
<div class="upval"><span>+0.700</span> (<span>+48.61%</span>)</div>
</td>
<td class="turnover">1.31B</td>
<td class="market">1.98B</td>
<td class="pe">6.35x</td>
<td class="dividend">5.33%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=2300.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (2300)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1398</a></td>
<td class="name"><a>ICBC</a></td>
<td class="price"><bdo>HK$3.990</bdo>
<br>
<div class="downval"><span>-0.030</span> (<span>-0.75%</span>)</div>
</td>
<td class="turnover">1.28B</td>
<td class="market">346.30B</td>
<td class="pe">4.32x</td>
<td class="dividend">7.20%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1398.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1398)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>5</a></td>
<td class="name"><a>HSBC HOLDINGS</a></td>
<td class="price"><bdo>HK$28.200</bdo>
<br>
<div class="downval"><span>-0.400</span> (<span>-1.40%</span>)</div>
</td>
<td class="turnover">1.20B</td>
<td class="market">583.52B</td>
<td class="pe">12.21x</td>
<td class="dividend">2.78%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0005.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (5)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>939</a></td>
<td class="name"><a>CCB</a></td>
<td class="price"><bdo>HK$5.020</bdo>
<br>
<div class="downval"><span>-0.040</span> (<span>-0.79%</span>)</div>
</td>
<td class="turnover">1.10B</td>
<td class="market">1,206.89B</td>
<td class="pe">4.36x</td>
<td class="dividend">6.97%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0939.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (939)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9618</a></td>
<td class="name"><a>JD-SW</a></td>
<td class="price"><bdo>HK$282.200</bdo>
<br>
<div class="downval"><span>-6.200</span> (<span>-2.15%</span>)</div>
</td>
<td class="turnover">1.08B</td>
<td class="market">883.22B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9618.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9618)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1658</a></td>
<td class="name"><a>PSBC</a></td>
<td class="price"><bdo>HK$3.160</bdo>
<br>
<div class="upval"><span>+0.060</span> (<span>+1.94%</span>)</div>
</td>
<td class="turnover">1.05B</td>
<td class="market">62.74B</td>
<td class="pe">4.01x</td>
<td class="dividend">7.23%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1658.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1658)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9633</a></td>
<td class="name"><a>NONGFU SPRING</a></td>
<td class="price"><bdo>HK$35.150</bdo>
<br>
<div class="downval"><span>-2.850</span> (<span>-7.50%</span>)</div>
</td>
<td class="turnover">1.01B</td>
<td class="market">174.92B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9633.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9633)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1211</a></td>
<td class="name"><a>BYD COMPANY</a></td>
<td class="price"><bdo>HK$103.100</bdo>
<br>
<div class="downval"><span>-0.400</span> (<span>-0.39%</span>)</div>
</td>
<td class="turnover">844.11M</td>
<td class="market">94.33B</td>
<td class="pe">188.21x</td>
<td class="dividend">0.06%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1211.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1211)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>708</a></td>
<td class="name"><a>EVERG VEHICLE</a></td>
<td class="price"><bdo>HK$16.820</bdo>
<br>
<div class="downval"><span>-2.460</span> (<span>-12.76%</span>)</div>
</td>
<td class="turnover">789.41M</td>
<td class="market">148.29B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0708.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (708)"></td>
</tr>
</tbody>
</table>
The table is loaded dynamically using JavaScript, so requests only receives the empty skeleton shown under "My result". If you inspect the page and open the browser's Network tab, you can see the XHR calls that fetch the data. I would look at these and then scrape from the API directly.
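Once the rendered markup is available (for example from a browser session via Selenium's driver.page_source, or rebuilt from the XHR response), the datarow rows can be read as below. This is only a sketch under assumptions: RENDERED is a trimmed copy of one row from the question, DataRowParser is a hypothetical helper written for this example, and the XHR endpoint itself is not shown because the thread does not give it.

```python
from html.parser import HTMLParser

# A trimmed copy of one rendered row from the question (the <img> cell is omitted).
RENDERED = """
<table class="table_equities"><tbody>
<tr><th class="th code">Stock Code</th></tr>
<tr class="datarow">
<td class="code"><a>700</a></td>
<td class="name"><a>TENCENT</a></td>
<td class="price"><bdo>HK$503.500</bdo>
<br>
<div class="downval"><span>-1.500</span> (<span>-0.30%</span>)</div>
</td>
<td class="turnover">6.69B</td>
</tr>
</tbody></table>
"""


class DataRowParser(HTMLParser):
    """Collect {cell class: text} for every <tr class="datarow">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # dict for the row being read, or None
        self._cell = None  # class name of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and attrs.get("class") == "datarow":
            self._row = {}
        elif tag == "td" and self._row is not None:
            self._cell = attrs.get("class", "")
            self._row[self._cell] = ""

    def handle_data(self, data):
        # Only keep text that sits inside a cell of a datarow.
        if self._row is not None and self._cell is not None:
            self._row[self._cell] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Collapse the whitespace left behind by nested tags.
            self.rows.append({k: " ".join(v.split()) for k, v in self._row.items()})
            self._row = None


parser = DataRowParser()
parser.feed(RENDERED)
for row in parser.rows:
    print(row)
```

The standard-library parser is used here because it tolerates the unclosed <br> tags in the page; BeautifulSoup would work equally well on the same rendered markup.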

How to extract the following lines after pattern match

The web source is like this:
<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>
I want to get the following numbers for every date.
For example, for the date 2019-12-24 I need the numbers 12269.25, 12283.70, 12202.10 and 12214.55, and then the same for the next date.
I am having difficulty because I need to select the 4 cells following each date, and their XPaths are not closely related to the date's, as shown above. The page can contain anywhere from a single date to 100-200 dates.
Can anybody please help with a WebDriver code snippet for this?
Thanks a lot.
Does this meet your needs?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>'''
doc = SimplifiedDoc(html)
table = doc.getElement(tag='table',value='tblchart')
trs = table.trs.notContains('<th') # get tr
for tr in trs:
    tds = tr.tds # get all td
    data = [td.text for td in tds]
    print (data[0],data[1],data[2],data[3],data[4])
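The same extraction can also be sketched without a third-party package. The example below assumes the table markup is well-formed, as in the question, and uses xml.etree.ElementTree from the standard library; the header row is shortened and the names HTML and prices are illustrative. In a WebDriver session, rows could be selected with an XPath such as //table[@class='tblchart']//tr[td] and each row's td elements read in order.

```python
import xml.etree.ElementTree as ET

# Header row shortened; the two data rows are copied from the question.
HTML = """<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
</table>
</div>"""

root = ET.fromstring(HTML)

# Map each date to its [open, high, low, close] values.
prices = {}
for row in root.findall(".//table[@class='tblchart']/tr"):
    cells = [td.text for td in row.findall("td")]
    if not cells:  # header rows contain only <th> cells
        continue
    date, *values = cells
    prices[date] = [float(v) for v in values[:4]]

for date, ohlc in prices.items():
    print(date, ohlc)
```

Working row by row means the four numbers are found by their position after the date cell, so the approach does not depend on how many dates the page lists.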

Remove text outside of tags with bs4

I want to delete the text Página consultada el: but I don't know how because it's outside any tag.
I've tried with this but nothing changes:
for b in soup.find('br'):
    if( b.nextSibling == 'Página consultada el:'):
        b.nextSibling.replaceWith('')
    if(b.previousSibling == 'Página consultada el:'):
        b.previousSibling.replaceWith('')
This is the html of the part I want to remove:
<br/>
<br/>Página consultada el:
<br/>
<strong>27/01/2018 21:42:14</strong>
Whole html:
<html xmlns="http://www.w3.org/1999/xhtml">
<body><strong></strong>
<center><strong></strong>
<br/><br/><br/><br/>
<center>
</center>
<table border="1" cellpadding="0" cellspacing="0" style="width:400px">
<tbody>
<tr>
<td align="CENTER">
<p>Turno: Matutino</p>
</td>
<td align="CENTER"> Grupo: 401 </td>
</tr>
<tr>
<td align="CENTER" colspan="2">
<p>Profesor tutor: <br/> MONICA OSORNIO PEREZ.</p>
</td>
</tr>
</tbody>
</table>
<br/><br/>
<table border="1" cellpadding="0" cellspacing="0" style="width:1000px">
<tbody>
<tr>
<td align="CENTER" style="width:70px;">
<p>Hora:</p>
</td>
<td align="CENTER" style="width:186px;">Lunes </td>
<td align="CENTER" style="width:186px;">Martes </td>
<td align="CENTER" style="width:186px;">Miércoles </td>
<td align="CENTER" style="width:186px;">Jueves </td>
<td align="CENTER" style="width:186px;">Viernes </td>
</tr>
<tr>
<td align="CENTER">
<p>7:00<br/>a<br/>7:50</p>
</td>
<td align="CENTER">
<p> ORI.EDU.IV(A): A204<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>7:50<br/>a<br/>8:40</p>
</td>
<td align="CENTER">
<p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p>
</td>
<td align="CENTER">
<p> MATEMAT. IV B108<br/></p>
</td>
<td align="CENTER">
<p> INGLES IV(B): C303<br/>INGLES IV(A): C304<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>8:40<br/>a<br/>9:30</p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> INFORMATICA CC2 <br/></p>
</td>
<td align="CENTER">
<p> HISTORIA III B116<br/></p>
</td>
<td align="CENTER">
<p> ORI.EDU.IV(B): A205<br/></p>
</td>
<td align="CENTER">
<p> DIBUJO II(A): B-8 <br/>DIBUJO II(B): C101<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>9:30<br/>a<br/>10:20</p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> GEOGRAFIA A102<br/></p>
</td>
<td align="CENTER">
<p> FISICA III A303<br/></p>
</td>
<td align="CENTER">
<p> GEOGRAFIA A102<br/></p>
</td>
<td align="CENTER">
<p> DIBUJO II(A): B-8 <br/>DIBUJO II(B): C101<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>10:20<br/>a<br/>11:10</p>
</td>
<td align="CENTER">
<p> HISTORIA III B108<br/></p>
</td>
<td align="CENTER">
<p> INFORMATICA B108<br/></p>
</td>
<td align="CENTER">
<p> FISICA III A303<br/></p>
</td>
<td align="CENTER">
<p> FISICA III LACE<br/></p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>11:10<br/>a<br/>12:00</p>
</td>
<td align="CENTER">
<p> LOGICA B108<br/></p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> GEOGRAFIA A103<br/></p>
</td>
<td align="CENTER">
<p> FISICA III LACE<br/></p>
</td>
<td align="CENTER">
<p> LOGICA B108<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>12:00<br/>a<br/>12:50</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> LENG. ESP. B108<br/></p>
</td>
<td align="CENTER">
<p> LOGICA B108<br/></p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> HISTORIA III B108<br/></p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>12:50<br/>a<br/>13:40</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>13:40<br/>a<br/>14:30</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> ED FISICA IV GIM <br/></p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
<tr>
<td align="CENTER">
<p>14:30<br/>a<br/>15:20</p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
<td align="CENTER">
<p> </p>
</td>
</tr>
</tbody>
</table><br/>
<table border="1" cellpadding="0" cellspacing="0" style="width:1000px">
<tbody>
<tr>
<td style="width:165px;">
<p>Asignatura:</p>
</td>
<td style="width:335px;">Nombre del Profesor:</td>
<td style="width:165px;">Asignatura:</td>
<td style="width:335px;">Nombre del Profesor:</td>
</tr>
<tr>
<td>
<p>ORI.EDU.IV(A):</p>
</td>
<td>BECERRA ALCANTARA IVONNE </td>
<td>
<p>INGLES IV(B):</p>
</td>
<td>CARRILLO SANCHEZ JACOBO </td>
</tr>
<tr>
<td>
<p>LENG. ESP.</p>
</td>
<td>ESTRADA GASCA SCARLETT </td>
<td>
<p>FISICA III</p>
</td>
<td>FLORES FLORES ANA </td>
</tr>
<tr>
<td>
<p>HISTORIA III</p>
</td>
<td>GONZALEZ GARCIA ANGELICA ARACELI </td>
<td>
<p>DIBUJO II(A):</p>
</td>
<td>JIMENEZ GENCHI ERIKA PAOLA </td>
</tr>
<tr>
<td>
<p>LOGICA</p>
</td>
<td>NAVARRO LOZANO JULIANA V. </td>
<td>
<p>MATEMAT. IV</p>
</td>
<td>OLVERA PEÑA ALEJANDRO </td>
</tr>
<tr>
<td>
<p>GEOGRAFIA</p>
</td>
<td>OSORNIO PEREZ MONICA </td>
<td>
<p>ORI.EDU.IV(B):</p>
</td>
<td>PINEDA VALLEJO MARIA GABRIELA </td>
</tr>
<tr>
<td>
<p>INGLES IV(A):</p>
</td>
<td>REYES CRUZ KIMBERLY </td>
<td>
<p>ED FISICA IV</p>
</td>
<td>SANCHEZ LUGO EDGARDO JAIME </td>
</tr>
<tr>
<td>
<p>INFORMATICA</p>
</td>
<td>SOTOMAYOR GUERRA JUAN CARLOS </td>
<td>
<p>DIBUJO II(B):</p>
</td>
<td>VILLANUEVA VILCHIS MONICA EDITH </td>
</tr>
<tr>
<td>
<p></p>
</td>
<td></td>
<td>
<p></p>
</td>
<td></td>
</tr>
</tbody>
</table>
<br/><br/>Página consultada el:<br/><strong>27/01/2018 21:42:14</strong>
</center>
</body>
</html>
This might accomplish what you need:
html = re.sub(r'</table>\n<br/><br/>.+<br/>', '</table>\n<br/><br/><br/>', html)
That removes the text "Página consultada el:" from html.
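A narrower variant is to match the label itself rather than the markup around it. The sketch below assumes the label text is always exactly "Página consultada el:" and runs on a shortened copy of the tail of the page; the variable cleaned is illustrative.

```python
import re

# Shortened copy of the end of the page from the question.
html = (
    "</table>\n"
    "<br/><br/>Página consultada el:<br/><strong>27/01/2018 21:42:14</strong>\n"
    "</center>"
)

# Remove only the label, leaving the surrounding <br/> tags and the timestamp.
cleaned = re.sub(r"Página consultada el:\s*", "", html)
print(cleaned)
```

Because the pattern does not mention any tags, it keeps working if the number of <br/> elements around the label changes. With BeautifulSoup, the same text node could instead be located with soup.find(string=re.compile("consultada")) and removed with .extract().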

How to parse this html structure using BeautifulSoup?

I would like to parse this TABLE line by line and save it to a CSV file.
What I have done so far returns nothing in the CSV file:
Django:
data_scrapper makes a request to Yahoo Finance.
def button_clicked(request):
    headers = []
    rows = []
    gen_table = data_scrapper(symbol)
    soup = BeautifulSoup(gen_table)
    table = soup.find_all('table')
    for table in soup.find_all('table'):
        headers.extend([header.text for header in table.find_all('th')])
    for row in soup.find_all('tr'):
        rows.extend([val.text for val in row.find_all('td')])
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename= "{}.csv"'.format(symbol)
    writer = csv.writer(response)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
    return response
html:
<TABLE class="yfnc_tabledata1" width="100%" cellpadding="0" cellspacing="0" border="0">
<TR>
<TD>
<TABLE width="100%" cellpadding="2" cellspacing="0" border="0">
<TR class="yfnc_modtitle1" style="border-top:none;">
<td colspan="2" style="border-top:2px solid #000;">
<small>
<span class="yfi-module-title">Period Ending</span>
</small>
</td>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2014</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2013</th>
<th scope="col" style="border-top:2px solid #000;text-align:right; font-weight:bold">Dec 31, 2012</th>
</TR>
<tr>
<td colspan="2">
<strong>
Total Revenue
</strong>
</td>
<td align="right">
<strong>
4,479,648
</strong>
</td>
<td align="right">
<strong>
3,777,068
</strong>
</td>
<td align="right">
<strong>
3,209,782
</strong>
</td>
</tr>
<tr>
<td colspan="2">Cost of Revenue</td>
<td align="right">3,160,470 </td>
<td align="right">2,656,189 </td>
<td align="right">2,284,485 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Gross Profit
</strong>
</td>
<td align="right">
<strong>
1,319,178
</strong>
</td>
<td align="right">
<strong>
1,120,879
</strong>
</td>
<td align="right">
<strong>
925,297
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Operating Expenses</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Research Development</td>
<td align="right">148,458 </td>
<td align="right">139,193 </td>
<td align="right">127,361 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Selling General and Administrative</td>
<td align="right">456,030 </td>
<td align="right">403,772 </td>
<td align="right">319,511 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Non Recurring</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Others</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Operating Expenses</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Operating Income or Loss
</strong>
</td>
<td align="right">
<strong>
714,690
</strong>
</td>
<td align="right">
<strong>
577,914
</strong>
</td>
<td align="right">
<strong>
478,425
</strong>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Income from Continuing Operations</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Total Other Income/Expenses Net</td>
<td align="right">(10)</td>
<td align="right">5,139 </td>
<td align="right">7,529 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Earnings Before Interest And Taxes</td>
<td align="right">710,556 </td>
<td align="right">580,639 </td>
<td align="right">485,775 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Interest Expense</td>
<td align="right">11,239 </td>
<td align="right">6,210 </td>
<td align="right">5,932 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Before Tax</td>
<td align="right">699,317 </td>
<td align="right">574,429 </td>
<td align="right">479,843 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Income Tax Expense</td>
<td align="right">245,288 </td>
<td align="right">193,360 </td>
<td align="right">167,533 </td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Minority Interest</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td colspan="5" style="height:0; padding:0; " class="yfnc_d">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Net Income From Continuing Ops</td>
<td align="right">454,029 </td>
<td align="right">381,069 </td>
<td align="right">312,310 </td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td>
<spacer type="block" height="1" width="1" />
</td>
<td class="yfnc_d" colspan="4">Non-recurring Events</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Discontinued Operations</td>
<td align="right">
-
</td>
<td align="right">(3,777)</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Extraordinary Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Effect Of Accounting Changes</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td width="30" class="yfnc_tabledata1">
<spacer type="block" width="30" height="1" />
</td>
<td>Other Items</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; ">
<span style="display:block; width:5px; height:10px;"></span>
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
<tr>
<td colspan="2">Preferred Stock And Other Adjustments</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
<td align="right">
-
</td>
</tr>
<tr>
<td colspan="5" style="height:0;padding:0; border-top:3px solid #333;">
<span style="display:block; width:5px; height:1px;"></span>
</td>
</tr>
<tr>
<td colspan="2">
<strong>
Net Income Applicable To Common Shares
</strong>
</td>
<td align="right">
<strong>
454,029
</strong>
</td>
<td align="right">
<strong>
377,292
</strong>
</td>
<td align="right">
<strong>
312,310
</strong>
</td>
</tr>
</TABLE>
</TD>
</TR>
</TABLE>
Here's some code that writes a CSV that looks like the table. The CSVs I usually work with have one complete record per row, so all the values in the first column here would become the CSV header. Just something to think about; it might be helpful.
Python 3.4
from bs4 import BeautifulSoup
import re
import csv

def button_clicked(request, filename):
    # `request` is the HTML source of the page, as a string
    soup = BeautifulSoup(request, 'html.parser')
    # the data is in the table nested inside the first table
    table = soup.find('table').find('table')
    t_rows = table.find_all('tr')
    # newline='' stops the csv module from writing blank lines on Windows
    with open(filename, 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for t_row in t_rows:
            rec_as_str = t_row.get_text()
            rec_as_str = rec_as_str.strip()
            rec_as_str = rec_as_str.replace('\xa0', '')
            # collapse each run of newlines and surrounding whitespace into one '|'
            rec_as_str = re.sub(r'\n?\s*(\n)+\s*', '|', rec_as_str)
            if len(rec_as_str) > 0:
                a_list = rec_as_str.split("|")
                spamwriter.writerow(a_list)
Creates a file that looks like:
Period Ending,"Dec 31, 2014","Dec 31, 2013","Dec 31, 2012"
Total Revenue,"4,479,648","3,777,068","3,209,782"
Cost of Revenue,"3,160,470","2,656,189","2,284,485"
Gross Profit,"1,319,178","1,120,879","925,297"
Operating Expenses
Research Development,"148,458","139,193","127,361"
Selling General and Administrative,"456,030","403,772","319,511"
Non Recurring,-,-,-
Others,-,-,-
Total Operating Expenses,-,-,-
Operating Income or Loss,"714,690","577,914","478,425"
Income from Continuing Operations
Total Other Income/Expenses Net,(10),"5,139","7,529"
Earnings Before Interest And Taxes,"710,556","580,639","485,775"
Interest Expense,"11,239","6,210","5,932"
Income Before Tax,"699,317","574,429","479,843"
Income Tax Expense,"245,288","193,360","167,533"
Minority Interest,-,-,-
Net Income From Continuing Ops,"454,029","381,069","312,310"
Non-recurring Events
Discontinued Operations,-,"(3,777)",-
Extraordinary Items,-,-,-
Effect Of Accounting Changes,-,-,-
Other Items,-,-,-
Net Income,"454,029","377,292","312,310"
Preferred Stock And Other Adjustments,-,-,-
Net Income Applicable To Common Shares,"454,029","377,292","312,310"
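The "one complete record per row" layout mentioned above can be sketched by transposing the rows after they are read back. This is only an illustration under a few assumptions: `SAMPLE` is a handful of lines copied from the output above, `transpose_rows` is a made-up helper name, and section-heading rows that carry no values (such as "Operating Expenses") are simply skipped.

```python
import csv
import io

# A few rows copied from the output above (sample data for illustration only).
SAMPLE = '''Period Ending,"Dec 31, 2014","Dec 31, 2013","Dec 31, 2012"
Total Revenue,"4,479,648","3,777,068","3,209,782"
Operating Expenses
Gross Profit,"1,319,178","1,120,879","925,297"
Net Income,"454,029","377,292","312,310"
'''

def transpose_rows(rows):
    """Turn label-per-row data into one record per period.

    Hypothetical helper: rows whose length differs from the first row
    (section headings with no values) are dropped before transposing.
    """
    rows = [r for r in rows if r]
    width = len(rows[0])
    full = [r for r in rows if len(r) == width]
    # zip(*full) pairs up the n-th cell of every row, i.e. swaps rows and columns
    return [list(col) for col in zip(*full)]

records = transpose_rows(list(csv.reader(io.StringIO(SAMPLE))))
for record in records:
    print(record)
```

The first record then holds the column names (the old first column), and each following record is one reporting period, which can be passed to `csv.writer.writerows` as is.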

beautiful soup get children that are Tags (not Navigable Strings) from a Tag

The Beautiful Soup documentation provides the attributes .contents and .children to access the children of a given tag (a list and an iterator respectively); both include NavigableStrings as well as Tags. I want only the children of type Tag.
I'm currently accomplishing this with a list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better/more Pythonic/built-in way to get just the Tag children.
Thanks to J.F. Sebastian, the following will work:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed the actual rows in the table, so I ended up using the following, which is more precise and, I think, more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
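I believe this is a better way than iterating through all the children of a Tag. Note that find_all('tr') searches all descendants, not just direct children, so pass recursive=False as well if the table can contain nested tables.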
Worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100">
<div style="position:relative;">
President
<span class="pct-rpt">
99% reporting
</span>
</div>
</th>
</tr>
<tr class="header-row">
<th class="photo first">
</th>
<th class="candidate ">
Candidate
</th>
<th class="party ">
Party
</th>
<th class="votes ">
Votes
</th>
<th class="pct ">
Pct.
</th>
<th class="change ">
Change from ‘08
</th>
<th class="evotes last">
Electoral Votes
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first">
<div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div>
</td>
<td class="candidate ">
<div class="winner dem"><img alt="Hp-checkmark@2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark@2x.png?1352320690" width="10" />Barack Obama</div>
</td>
<td class="party ">
Dem.
</td>
<td class="votes ">
2,916,811
</td>
<td class="pct ">
57.3%
</td>
<td class="change ">
-4.6%
</td>
<td class="evotes last">
20
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Mitt Romney</div>
</td>
<td class="party ">
Rep.
</td>
<td class="votes ">
2,090,116
</td>
<td class="pct ">
41.1%
</td>
<td class="change ">
+4.3%
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Gary Johnson</div>
</td>
<td class="party ">
Lib.
</td>
<td class="votes ">
54,798
</td>
<td class="pct ">
1.1%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="last-row">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Jill Stein</div>
</td>
<td class="party ">
Green
</td>
<td class="votes ">
29,336
</td>
<td class="pct ">
0.6%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr>
<td class="footer" colspan="100">
President Map |
President Big Board |
Exit Polls
</td>
</tr>
</tbody>
</table>
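To compare the approaches discussed above side by side, here is a small sketch. It assumes bs4 is installed and uses a cut-down version of the table; the nested table in the last row is not in the original input and is added only to show where find_all('tr') and find_all('tr', recursive=False) differ.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Cut-down input; the nested <table> is a hypothetical addition for illustration.
HTML = """
<table>
  <tbody>
    <tr class=""><td class="candidate">Barack Obama</td><td class="party">Dem.</td></tr>
    <tr class=""><td class="candidate">Mitt Romney</td><td class="party">Rep.</td></tr>
    <tr class="last-row">
      <td class="candidate">Jill Stein</td>
      <td class="party"><table><tr><td>nested</td></tr></table></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
tbody = soup.table.tbody

# .children yields NavigableStrings (the whitespace between rows) as well as Tags.
all_children = list(tbody.children)
tag_children = [c for c in all_children if isinstance(c, Tag)]

# find_all(True, recursive=False): any tag, direct children only.
direct = tbody.find_all(True, recursive=False)

# find_all('tr') searches all descendants, so it also picks up the nested row.
every_tr = tbody.find_all("tr")

# Filtering by name and depth keeps only the top-level rows.
top_level_tr = tbody.find_all("tr", recursive=False)

print(len(all_children), len(tag_children), len(direct), len(every_tr), len(top_level_tr))
```

On input without nested tables, such as the one above, all three Tag-only approaches return the same rows; isinstance(c, Tag) is used in place of the type(x)== comparison so that subclasses of Tag are accepted too.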
