parsing part of the table for data using BeautifulSoup

parsing part of the table for data using BeautifulSoup - python

I have a self-project to scrape data online using BeautifulSoup and Python, and I think historical stocks data would be a good one for me to practice. I looked at the source code here to analyze how I can use BeautifulSoup's select() or findall() to parse part of the data from the table. Here is the code I use, but it parsed things other than the table.
soup = bs4.BeautifulSoup(res.text, 'lxml')
table = soup.findAll( 'td', {'class':'yfnc_tabledata1'} )
print table
My Question: How to I parse only the 2 rows showing the 2 days of data from the table?
Here is the table that has 2 days of the historical data:
<table class="yfnc_datamodoutline1" width="100%" cellpadding="0" cellspacing="0" border="0">
<tr valign="top">
<td>
<table border="0" cellpadding="2" cellspacing="1" width="100%">
<tr>
<th scope="col" class="yfnc_tablehead1" align="right" width="16%">Date</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">Open</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">High</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">Low</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">close</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="16%">Volume</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="15%">Adj Close*</th>
</tr>
<tr>
<td class="yfnc_tabledata1" nowrap align="right">12 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.44</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
<td class="yfnc_tabledata1" align="right">18,612,300</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
</tr>
<tr>
<td class="yfnc_tabledata1" nowrap align="right">11 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">108.52</td>
<td class="yfnc_tabledata1" align="right">108.93</td>
<td class="yfnc_tabledata1" align="right">107.85</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
<td class="yfnc_tabledata1" align="right">27,484,500</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
</tr>
<tr>
<td class="yfnc_tabledata1" colspan="7" align="center">
* <small>Close price adjusted for dividends and splits.</small>
</td>
</tr>
</table>
</td>
</tr>
</table>
I only need the specific 2 rows of data from above:
<tr>
<td class="yfnc_tabledata1" nowrap align="right">12 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.44</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
<td class="yfnc_tabledata1" align="right">18,612,300</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
</tr>
<tr>
<td class="yfnc_tabledata1" nowrap align="right">11 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">108.52</td>
<td class="yfnc_tabledata1" align="right">108.93</td>
<td class="yfnc_tabledata1" align="right">107.85</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
<td class="yfnc_tabledata1" align="right">27,484,500</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
</tr>

You can select the all the rows from the nested table inside the yfnc_datamodoutline1 table and index the first two:
soup = BeautifulSoup(html)
table_rows = soup.select("table.yfnc_datamodoutline1 table tr + tr")
row1, row2 = table_rows[0:2]
print(row1)
print(row2)
Which would give you:
<tr>
<td align="right" class="yfnc_tabledata1" nowrap="">12 Aug 2016</td>
<td align="right" class="yfnc_tabledata1">107.78</td>
<td align="right" class="yfnc_tabledata1">108.44</td>
<td align="right" class="yfnc_tabledata1">107.78</td>
<td align="right" class="yfnc_tabledata1">108.18</td>
<td align="right" class="yfnc_tabledata1">18,612,300</td>
<td align="right" class="yfnc_tabledata1">108.18</td>
</tr>
<tr>
<td align="right" class="yfnc_tabledata1" nowrap="">11 Aug 2016</td>
<td align="right" class="yfnc_tabledata1">108.52</td>
<td align="right" class="yfnc_tabledata1">108.93</td>
<td align="right" class="yfnc_tabledata1">107.85</td>
<td align="right" class="yfnc_tabledata1">107.93</td>
<td align="right" class="yfnc_tabledata1">27,484,500</td>
<td align="right" class="yfnc_tabledata1">107.93</td>
</tr>
To get the td data just extract the text from each td:
print([td.text for td in row1.find_all("td")])
print([td.text for td in row2.find_all("td")])
Which would give you:
[u'12 Aug 2016', u'107.78', u'108.44', u'107.78', u'108.18', u'18,612,300', u'108.18']
[u'11 Aug 2016', u'108.52', u'108.93', u'107.85', u'107.93', u'27,484,500', u'107.93']
table.yfnc_datamodoutline1 table tr + tr selects all the rows inside the inner table skipping the first which is the header row.

Related

Python web scraping problems - result different from soruce code

The code below is not able to scrape the class_=datarow. i have tried to use read_html as well, but there are tables inside this table class="table_equities", which makes the read_html not working for me. I have no idea how to get the table.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import openpyxl
import scrapy
path = 'C:/Users/pacc_/OneDrive/Desktop/Eric/Investment/Trading record python.xlsx'
page_link = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'
page_response = requests.get(page_link, timeout=2)
page_content = BeautifulSoup(page_response.content, "html.parser")
data = page_content.find(class_='table_equities')
print(data)
My result：
<table class="table_equities">
<tr>
<th class="th code">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th name">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th price">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th turnover selected uppercase">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th mktcap">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th pe">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th div_yield">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th intraday">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
</tr>
</table>
Target url structure:
<table class="table_equities">
<tbody>
<tr>
<th class="th code">
<table>
<thead>
<tr>
<th class="text">Stock Code</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th name">
<table>
<thead>
<tr>
<th class="text">Name</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th price">
<table>
<thead>
<tr>
<th class="text">Nominal Price</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th turnover selected uppercase">
<table>
<thead>
<tr>
<th class="text">Turnover (HK$)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th mktcap">
<table>
<thead>
<tr>
<th class="text">Market Cap (HK$)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th pe">
<table>
<thead>
<tr>
<th class="text">P/E</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th div_yield">
<table>
<thead>
<tr>
<th class="text">Dividend Yield (%)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th intraday">
<table>
<thead>
<tr>
<th class="text">Intraday Movement</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
</tr>
<tr class="datarow">
<td class="code"><a>909</a></td>
<td class="name"><a>MING YUAN CLOUD</a></td>
<td class="price"><bdo>HK$30.700</bdo>
<br>
<div><span>0.000</span> (<span>0.00%</span>)</div>
</td>
<td class="turnover">8.35B</td>
<td class="market">57.44B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0909.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (909)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>700</a></td>
<td class="name"><a>TENCENT</a></td>
<td class="price"><bdo>HK$503.500</bdo>
<br>
<div class="downval"><span>-1.500</span> (<span>-0.30%</span>)</div>
</td>
<td class="turnover">6.69B</td>
<td class="market">4,824.90B</td>
<td class="pe">46.13x</td>
<td class="dividend">0.24%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0700.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (700)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9988</a></td>
<td class="name"><a>BABA-SW</a></td>
<td class="price"><bdo>HK$258.000</bdo>
<br>
<div class="downval"><span>-3.000</span> (<span>-1.15%</span>)</div>
</td>
<td class="turnover">3.97B</td>
<td class="market">5,584.43B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9988.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9988)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>3690</a></td>
<td class="name"><a>MEITUAN-W</a></td>
<td class="price"><bdo>HK$232.000</bdo>
<br>
<div class="downval"><span>-6.600</span> (<span>-2.77%</span>)</div>
</td>
<td class="turnover">3.75B</td>
<td class="market">1,364.39B</td>
<td class="pe">547.17x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=3690.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (3690)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>3333</a></td>
<td class="name"><a>EVERGRANDE</a></td>
<td class="price"><bdo>HK$13.780</bdo>
<br>
<div class="downval"><span>-1.440</span> (<span>-9.46%</span>)</div>
</td>
<td class="turnover">2.80B</td>
<td class="market">180.00B</td>
<td class="pe">9.59x</td>
<td class="dividend">5.15%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=3333.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (3333)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1810</a></td>
<td class="name"><a>XIAOMI-W</a></td>
<td class="price"><bdo>HK$19.720</bdo>
<br>
<div class="downval"><span>-0.120</span> (<span>-0.60%</span>)</div>
</td>
<td class="turnover">2.69B</td>
<td class="market">475.78B</td>
<td class="pe">42.69x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1810.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1810)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>2318</a></td>
<td class="name"><a>PING AN</a></td>
<td class="price"><bdo>HK$80.350</bdo>
<br>
<div class="downval"><span>-0.100</span> (<span>-0.12%</span>)</div>
</td>
<td class="turnover">1.49B</td>
<td class="market">598.41B</td>
<td class="pe">8.61x</td>
<td class="dividend">2.89%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=2318.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (2318)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>388</a></td>
<td class="name"><a>HKEX</a></td>
<td class="price"><bdo>HK$355.800</bdo>
<br>
<div class="downval"><span>-1.800</span> (<span>-0.50%</span>)</div>
</td>
<td class="turnover">1.49B</td>
<td class="market">451.09B</td>
<td class="pe">47.50x</td>
<td class="dividend">1.88%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0388.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (388)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1918</a></td>
<td class="name"><a>SUNAC</a></td>
<td class="price"><bdo>HK$28.950</bdo>
<br>
<div class="downval"><span>-1.600</span> (<span>-5.24%</span>)</div>
</td>
<td class="turnover">1.36B</td>
<td class="market">134.94B</td>
<td class="pe">4.44x</td>
<td class="dividend">4.64%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1918.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1918)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>981</a></td>
<td class="name"><a>SMIC</a></td>
<td class="price"><bdo>HK$18.580</bdo>
<br>
<div class="downval"><span>-0.760</span> (<span>-3.93%</span>)</div>
</td>
<td class="turnover">1.35B</td>
<td class="market">143.03B</td>
<td class="pe">59.55x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0981.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (981)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1299</a></td>
<td class="name"><a>AIA</a></td>
<td class="price"><bdo>HK$77.650</bdo>
<br>
<div class="upval"><span>+1.000</span> (<span>+1.30%</span>)</div>
</td>
<td class="turnover">1.34B</td>
<td class="market">939.05B</td>
<td class="pe">18.03x</td>
<td class="dividend">1.65%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1299.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1299)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>2300</a></td>
<td class="name"><a>AMVIG HOLDINGS</a></td>
<td class="price"><bdo>HK$2.140</bdo>
<br>
<div class="upval"><span>+0.700</span> (<span>+48.61%</span>)</div>
</td>
<td class="turnover">1.31B</td>
<td class="market">1.98B</td>
<td class="pe">6.35x</td>
<td class="dividend">5.33%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=2300.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (2300)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1398</a></td>
<td class="name"><a>ICBC</a></td>
<td class="price"><bdo>HK$3.990</bdo>
<br>
<div class="downval"><span>-0.030</span> (<span>-0.75%</span>)</div>
</td>
<td class="turnover">1.28B</td>
<td class="market">346.30B</td>
<td class="pe">4.32x</td>
<td class="dividend">7.20%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1398.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1398)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>5</a></td>
<td class="name"><a>HSBC HOLDINGS</a></td>
<td class="price"><bdo>HK$28.200</bdo>
<br>
<div class="downval"><span>-0.400</span> (<span>-1.40%</span>)</div>
</td>
<td class="turnover">1.20B</td>
<td class="market">583.52B</td>
<td class="pe">12.21x</td>
<td class="dividend">2.78%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0005.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (5)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>939</a></td>
<td class="name"><a>CCB</a></td>
<td class="price"><bdo>HK$5.020</bdo>
<br>
<div class="downval"><span>-0.040</span> (<span>-0.79%</span>)</div>
</td>
<td class="turnover">1.10B</td>
<td class="market">1,206.89B</td>
<td class="pe">4.36x</td>
<td class="dividend">6.97%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0939.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (939)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9618</a></td>
<td class="name"><a>JD-SW</a></td>
<td class="price"><bdo>HK$282.200</bdo>
<br>
<div class="downval"><span>-6.200</span> (<span>-2.15%</span>)</div>
</td>
<td class="turnover">1.08B</td>
<td class="market">883.22B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9618.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9618)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1658</a></td>
<td class="name"><a>PSBC</a></td>
<td class="price"><bdo>HK$3.160</bdo>
<br>
<div class="upval"><span>+0.060</span> (<span>+1.94%</span>)</div>
</td>
<td class="turnover">1.05B</td>
<td class="market">62.74B</td>
<td class="pe">4.01x</td>
<td class="dividend">7.23%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1658.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1658)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9633</a></td>
<td class="name"><a>NONGFU SPRING</a></td>
<td class="price"><bdo>HK$35.150</bdo>
<br>
<div class="downval"><span>-2.850</span> (<span>-7.50%</span>)</div>
</td>
<td class="turnover">1.01B</td>
<td class="market">174.92B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9633.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9633)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1211</a></td>
<td class="name"><a>BYD COMPANY</a></td>
<td class="price"><bdo>HK$103.100</bdo>
<br>
<div class="downval"><span>-0.400</span> (<span>-0.39%</span>)</div>
</td>
<td class="turnover">844.11M</td>
<td class="market">94.33B</td>
<td class="pe">188.21x</td>
<td class="dividend">0.06%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1211.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1211)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>708</a></td>
<td class="name"><a>EVERG VEHICLE</a></td>
<td class="price"><bdo>HK$16.820</bdo>
<br>
<div class="downval"><span>-2.460</span> (<span>-12.76%</span>)</div>
</td>
<td class="turnover">789.41M</td>
<td class="market">148.29B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0708.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (708)"></td>
</tr>
</tbody>
</table>

The table is being loaded dynamically using javascript. If you inspect element and use the networking tab you can see the XHR calls being made to get the data. I would try looking at these and then scrape from the api directly.

Web Scraping using bs4 with Python

I have a problem with the HTML text I am trying to work with.
I would like to extract the name of the player with all the statistics associated with him.
Basically I am not sure if I can extract the numbers of the column due to the syntax of the code.
In the HTML I included only 2 players, but I would like to add all the players of this club and then continue to the next team.
<table data-toggle="table-estadisticas-clubes" data-fixed-columns="true" data-fixed-number="2" class="roboto">
<thead>
<tr class="cabecera_general">
<th> </th>
<th> </th>
<th>PAR</th>
<th>MIN</th>
<th> </th>
<th>PT</th>
<th colspan="3">TIROS DE 3</th>
<th colspan="3">TIROS DE 2</th>
<th colspan="3">TIROS LIBRES</th>
<th colspan="3">REBOTES</th>
<th>ASI</th>
<th colspan="2">BALONES</th>
<th colspan="2">TAPONES</th>
<th> </th>
<th colspan="2">FALTAS</th>
<th> </th>
<th class="ultimo">VAL</th>
</tr>
<tr>
<th class="situacion"> </th>
<th class="nombre jugador"> </th>
<th>Jug</th>
<th>Jug</th>
<th>5i</th>
<th> </th>
<th>Con</th>
<th>Int</th>
<th>%</th>
<th>Con</th>
<th>Int</th>
<th>%</th>
<th>Con</th>
<th>Int</th>
<th>%</th>
<th>Def</th>
<th>Ofe</th>
<th>Tot</th>
<th>Efe</th>
<th>Rec</th>
<th>Per</th>
<th>Fav</th>
<th>Con</th>
<th>Mat</th>
<th>Com</th>
<th>Rec</th>
<th>+/-</th>
<th class="ultimo"> </th>
</tr>
</thead>
<tbody>
<tr>
<td class="situacion"></td>
<td class="nombre jugador ellipsis"><span class="nombre_corto">William Magarity</span></td>
<td class="borde_derecho">2</td>
<td class="borde_derecho">23:57</td>
<td class="borde_derecho"></td>
<td class="borde_derecho">11,5</td>
<td class="borde_derecho">3,0</td>
<td class="borde_derecho">4,0</td>
<td class="borde_derecho">75,0%</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">2,5</td>
<td class="borde_derecho">20,0%</td>
<td class="borde_derecho">1,5</td>
<td class="borde_derecho">1,5</td>
<td class="borde_derecho">100,0%</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">0,0</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">1,5</td>
<td class="borde_derecho">1,5</td>
<td class="borde_derecho">1,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">0,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">2,0</td>
<td class="borde_derecho">1,0</td>
<td class="borde_derecho">16,0</td>
</tr>
<tr class="par">
<td class="situacion"></td>
<td class="nombre jugador ellipsis"><span class="nombre_corto">Jaime Echenique</span></td>
<td class="borde_derecho">2</td>
<td class="borde_derecho">23:34</td>
<td class="borde_derecho"></td>
<td class="borde_derecho">14,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">1,0</td>
<td class="borde_derecho">50,0%</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">7,0</td>
<td class="borde_derecho">50,0%</td>
<td class="borde_derecho">5,5</td>
<td class="borde_derecho">6,0</td>
<td class="borde_derecho">91,7%</td>
<td class="borde_derecho">0,0</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">1,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">2,0</td>
<td class="borde_derecho">2,0</td>
<td class="borde_derecho">0,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">3,0</td>
<td class="borde_derecho">4,0</td>
<td class="borde_derecho">-1,5</td>
<td class="borde_derecho">15,5</td>
</tr>
</tbody>
</table>
URL: https://www.acb.com/club/estadisticas/id/14

Easiest way to parse the table is to use pandas:
import pandas as pd
url = 'https://www.acb.com/club/estadisticas/id/14'
df = pd.read_html(url)[0].iloc[:,1:]
df.to_csv('data.csv', index=False)
Will grab the table to dataframe and saves it as data.csv:

How to extract the following lines after pattern match

the web source is like this:
<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>
I want to get the following numbers for every date:
say for example I have to get the numbers 12269.25, 12283.70, 12202.10 and 12214.55 for a particular date (2019-12-24). Then proceed for the next date given.
I am facing difficulty because I need to select next 4 lines(whose xpath is not exatly related much as shown above) following each date in the page. The dates can range from single date to 100-200 dates.
Can anybody please help with webdriver code snippet for the same.
Thanks a lot

Can this meet your needs
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>'''
doc = SimplifiedDoc(html)
table = doc.getElement(tag='table',value='tblchart')
trs = table.trs.notContains('<th') # get tr
for tr in trs:
tds = tr.tds # get all td
data = [td.text for td in tds]
print (data[0],data[1],data[2],data[3],data[4])

How to find a value in a table with no identifiers? (Python, Selenium)

I have a webpage with a table with many rows. A user will give me a number (15308) which can be found in the top line with the first <td> tag, and this is the only information I will have. I want to be able to use this number to find the data between the <th></th> tag (more specifically the 0), but only for the table row. For example, I attached two table rows and I want the <th> data using the number 15308, but not the <th> data from the table row that has the number 15309 in it's first <td>. Any help is appreciated!
Desired Output: 0
<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>

Use Following code :
userValue='15308'
all_td_th_of_row = driver.find_elements_by_xpath("//td[normalize-space()='" + userValue + "']//following-sibling::td|th")
i = 0
while i<len(all_td_th_of_row) :
print(all_td_th_of_row[i].text)
i=i+1

Something I have always found beautiful, using beauitfulsoup:
Using the xpath="1" as an attribute:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find('th', attrs={'xpath': '1'})
print(xpathTh.text.strip())
OUTPUT:
0
EDIT:
To get all the values from the attrib:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 2</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find_all('th', attrs={'xpath': '1'})
for elem in xpathTh:
print(elem.text.strip())
OUTPUT:
0
1
2
EDIT 2:
Considering you only want the xpath value if the anchor tag inside the td (inside tr) has a value of 15308:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>22222</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
val = tr.select('td a')[0].text
if toFind == val:
xpathTh = tr.find_all('th', attrs={'xpath': '1'})
for elem in xpathTh:
print(elem.text.strip())
OUTPUT:
0
EDIT 3:
Continuing from comments:
line = '''<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
val = tr.select('td a')[0].text
if toFind == val:
xpathTh = tr.find_all('td')[7]
print("For the value: {}, The result is {}".format(toFind, xpathTh.find_next('th').text.strip()))
OUTPUT:
For the value: 15308, The result is 0

BeautifulSoup - extract number

I'm trying to learn how to scrape the web, but I have some problem getting my code to work. The number I want to extract is 77.80 from the code below. The problem I have is to find something that is unique enough to find the information(the place). can you help me with the right code. Thanks in advance!
</td>
<td class="small"> </td>
<td align="center" nowrap valign="center" class="small">
<a alt="Utvald" class="small" href="javascript:QT('/se/skandia/funds/chosen.aspx?tab=5&cid=0P0000T35O&lang=SV&curiso=SEK&country=SE&clientattributes=8&lastpage=Sök fond&LastPageURL=/se/skandia/quickrank/index.aspx?tab=RSLTS|lang=SV|univ=SE1|country=SE|curiso=SEK|mec=|cat=-1|search=|sortby=Custom_4|sortorder=ASC|PageNo=1|Firstletter=','0P0000T35O','600')" onmouseout="status=''; return true"><img src="../read/im/sigillsvartsmall_FFFFFF.gif" border="0" alt="Utvald av Skandia" height="12" width="9"/></a>
</td>
<td class="small"> </td>
<td align="right" nowrap valign="top" class="small">
77.80
</td>
<td class="small"> </td>
<td align="right" nowrap valign="top" class="small">
<!--<img src="../read/im/valueSEK.gif" align="texttop" height="10" width="22">-->
SEK
</td>
<td class="small"> </td>
<td align="right" nowrap valign="top" class="small">
1.4
</td>
<td class="small"> </td>
<td align="right" nowrap valign="top" class="small">
0.5
</td>
<td class="small"> </td>
<td align="right" nowrap valign="top" class="small">
2.7
</td>
<td class="small"> </td>
<td align="right" nowrap valign="top" class="small">
6.6
</td>

Here is how to find text you wanted. This just looks for the first td that has class='small' and valign='top'.
soup = BeautifulSoup(s)
tds = soup.find_all('td', attrs={'class': 'small', 'valign': 'top'})
the_td = tds[0].text.strip()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

parsing part of the table for data using BeautifulSoup - python

Related

Python web scraping problems - result different from soruce code

Web Scraping using bs4 with Python

How to extract the following lines after pattern match

How to find a value in a table with no identifiers? (Python, Selenium)

BeautifulSoup - extract number

Categories

Resources