Web Scraping using bs4 with Python

I am having trouble with the HTML I am trying to scrape.
I would like to extract each player's name together with all of the statistics associated with that player.
I am not sure whether I can extract the numbers in each column, given how the markup is structured.
The HTML below includes only 2 players, but I would like to collect all the players of this club and then continue with the next team.
<table data-toggle="table-estadisticas-clubes" data-fixed-columns="true" data-fixed-number="2" class="roboto">
<thead>
<tr class="cabecera_general">
<th> </th>
<th> </th>
<th>PAR</th>
<th>MIN</th>
<th> </th>
<th>PT</th>
<th colspan="3">TIROS DE 3</th>
<th colspan="3">TIROS DE 2</th>
<th colspan="3">TIROS LIBRES</th>
<th colspan="3">REBOTES</th>
<th>ASI</th>
<th colspan="2">BALONES</th>
<th colspan="2">TAPONES</th>
<th> </th>
<th colspan="2">FALTAS</th>
<th> </th>
<th class="ultimo">VAL</th>
</tr>
<tr>
<th class="situacion"> </th>
<th class="nombre jugador"> </th>
<th>Jug</th>
<th>Jug</th>
<th>5i</th>
<th> </th>
<th>Con</th>
<th>Int</th>
<th>%</th>
<th>Con</th>
<th>Int</th>
<th>%</th>
<th>Con</th>
<th>Int</th>
<th>%</th>
<th>Def</th>
<th>Ofe</th>
<th>Tot</th>
<th>Efe</th>
<th>Rec</th>
<th>Per</th>
<th>Fav</th>
<th>Con</th>
<th>Mat</th>
<th>Com</th>
<th>Rec</th>
<th>+/-</th>
<th class="ultimo"> </th>
</tr>
</thead>
<tbody>
<tr>
<td class="situacion"></td>
<td class="nombre jugador ellipsis"><span class="nombre_corto">William Magarity</span></td>
<td class="borde_derecho">2</td>
<td class="borde_derecho">23:57</td>
<td class="borde_derecho"></td>
<td class="borde_derecho">11,5</td>
<td class="borde_derecho">3,0</td>
<td class="borde_derecho">4,0</td>
<td class="borde_derecho">75,0%</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">2,5</td>
<td class="borde_derecho">20,0%</td>
<td class="borde_derecho">1,5</td>
<td class="borde_derecho">1,5</td>
<td class="borde_derecho">100,0%</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">0,0</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">1,5</td>
<td class="borde_derecho">1,5</td>
<td class="borde_derecho">1,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">0,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">2,0</td>
<td class="borde_derecho">1,0</td>
<td class="borde_derecho">16,0</td>
</tr>
<tr class="par">
<td class="situacion"></td>
<td class="nombre jugador ellipsis"><span class="nombre_corto">Jaime Echenique</span></td>
<td class="borde_derecho">2</td>
<td class="borde_derecho">23:34</td>
<td class="borde_derecho"></td>
<td class="borde_derecho">14,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">1,0</td>
<td class="borde_derecho">50,0%</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">7,0</td>
<td class="borde_derecho">50,0%</td>
<td class="borde_derecho">5,5</td>
<td class="borde_derecho">6,0</td>
<td class="borde_derecho">91,7%</td>
<td class="borde_derecho">0,0</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">3,5</td>
<td class="borde_derecho">1,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">2,0</td>
<td class="borde_derecho">2,0</td>
<td class="borde_derecho">0,0</td>
<td class="borde_derecho">0,5</td>
<td class="borde_derecho">3,0</td>
<td class="borde_derecho">4,0</td>
<td class="borde_derecho">-1,5</td>
<td class="borde_derecho">15,5</td>
</tr>
</tbody>
</table>
URL: https://www.acb.com/club/estadisticas/id/14

The easiest way to parse the table is to use pandas:
import pandas as pd
url = 'https://www.acb.com/club/estadisticas/id/14'
# read_html returns a list of DataFrames; take the first table and drop its first column
df = pd.read_html(url)[0].iloc[:,1:]
df.to_csv('data.csv', index=False)
This reads the table into a DataFrame and saves it as data.csv.
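If you want to stay with bs4, as the question asks, the rows can also be walked by hand. The following is only a minimal sketch run against a shortened copy of the table from the question; `to_number` and `parse_players` are hypothetical helper names, and the built-in "html.parser" backend is an assumption chosen to avoid extra dependencies. It pairs each player's name with the list of statistics and converts the comma decimals ("11,5", "75,0%") to floats.

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the question's table (only a few stat columns kept).
html = """
<table class="roboto">
<thead>
<tr><th class="situacion"> </th><th class="nombre jugador"> </th><th>Jug</th><th>Jug</th></tr>
</thead>
<tbody>
<tr>
<td class="situacion"></td>
<td class="nombre jugador ellipsis"><span class="nombre_corto">William Magarity</span></td>
<td class="borde_derecho">2</td>
<td class="borde_derecho">23:57</td>
<td class="borde_derecho"></td>
<td class="borde_derecho">11,5</td>
<td class="borde_derecho">75,0%</td>
</tr>
<tr class="par">
<td class="situacion"></td>
<td class="nombre jugador ellipsis"><span class="nombre_corto">Jaime Echenique</span></td>
<td class="borde_derecho">2</td>
<td class="borde_derecho">23:34</td>
<td class="borde_derecho"></td>
<td class="borde_derecho">14,0</td>
<td class="borde_derecho">50,0%</td>
</tr>
</tbody>
</table>
"""


def to_number(text):
    """Hypothetical helper: turn '11,5' or '75,0%' into a float.

    Text that is not numeric (for example '23:57' or an empty cell) is returned unchanged.
    """
    cleaned = text.rstrip("%").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return text


def parse_players(markup):
    """Return one dict per player row: the name plus the list of stat cells."""
    soup = BeautifulSoup(markup, "html.parser")
    players = []
    for row in soup.find_all("tr"):
        name_tag = row.find("span", class_="nombre_corto")
        if name_tag is None:
            continue  # header rows have no player name
        cells = row.find_all("td", class_="borde_derecho")
        stats = [to_number(td.get_text(strip=True)) for td in cells]
        players.append({"name": name_tag.get_text(strip=True), "stats": stats})
    return players


players = parse_players(html)
for player in players:
    print(player["name"], player["stats"])
```

To cover a whole club you would feed the downloaded page into `parse_players` instead of the sample string, and repeat that for each team's URL.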

Related

Python web scraping problems - result different from source code

The code below fails to scrape the rows with class="datarow". I have also tried read_html, but the table with class="table_equities" contains nested tables, so read_html does not work for me. I have no idea how to get the table.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import openpyxl
import scrapy
path = 'C:/Users/pacc_/OneDrive/Desktop/Eric/Investment/Trading record python.xlsx'
page_link = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'
page_response = requests.get(page_link, timeout=2)
page_content = BeautifulSoup(page_response.content, "html.parser")
data = page_content.find(class_='table_equities')
print(data)
My result:
<table class="table_equities">
<tr>
<th class="th code">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th name">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th price">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th turnover selected uppercase">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th mktcap">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th pe">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th div_yield">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th intraday">
<table>
<thead>
<tr>
<th class="text"></th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
</tr>
</table>
Target url structure:
<table class="table_equities">
<tbody>
<tr>
<th class="th code">
<table>
<thead>
<tr>
<th class="text">Stock Code</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th name">
<table>
<thead>
<tr>
<th class="text">Name</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th price">
<table>
<thead>
<tr>
<th class="text">Nominal Price</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th turnover selected uppercase">
<table>
<thead>
<tr>
<th class="text">Turnover (HK$)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th mktcap">
<table>
<thead>
<tr>
<th class="text">Market Cap (HK$)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th pe">
<table>
<thead>
<tr>
<th class="text">P/E</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th div_yield">
<table>
<thead>
<tr>
<th class="text">Dividend Yield (%)</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
<th class="th intraday">
<table>
<thead>
<tr>
<th class="text">Intraday Movement</th>
<th class="ico"><i></i></th>
</tr>
</thead>
</table>
</th>
</tr>
<tr class="datarow">
<td class="code"><a>909</a></td>
<td class="name"><a>MING YUAN CLOUD</a></td>
<td class="price"><bdo>HK$30.700</bdo>
<br>
<div><span>0.000</span> (<span>0.00%</span>)</div>
</td>
<td class="turnover">8.35B</td>
<td class="market">57.44B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0909.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (909)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>700</a></td>
<td class="name"><a>TENCENT</a></td>
<td class="price"><bdo>HK$503.500</bdo>
<br>
<div class="downval"><span>-1.500</span> (<span>-0.30%</span>)</div>
</td>
<td class="turnover">6.69B</td>
<td class="market">4,824.90B</td>
<td class="pe">46.13x</td>
<td class="dividend">0.24%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0700.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (700)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9988</a></td>
<td class="name"><a>BABA-SW</a></td>
<td class="price"><bdo>HK$258.000</bdo>
<br>
<div class="downval"><span>-3.000</span> (<span>-1.15%</span>)</div>
</td>
<td class="turnover">3.97B</td>
<td class="market">5,584.43B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9988.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9988)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>3690</a></td>
<td class="name"><a>MEITUAN-W</a></td>
<td class="price"><bdo>HK$232.000</bdo>
<br>
<div class="downval"><span>-6.600</span> (<span>-2.77%</span>)</div>
</td>
<td class="turnover">3.75B</td>
<td class="market">1,364.39B</td>
<td class="pe">547.17x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=3690.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (3690)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>3333</a></td>
<td class="name"><a>EVERGRANDE</a></td>
<td class="price"><bdo>HK$13.780</bdo>
<br>
<div class="downval"><span>-1.440</span> (<span>-9.46%</span>)</div>
</td>
<td class="turnover">2.80B</td>
<td class="market">180.00B</td>
<td class="pe">9.59x</td>
<td class="dividend">5.15%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=3333.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (3333)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1810</a></td>
<td class="name"><a>XIAOMI-W</a></td>
<td class="price"><bdo>HK$19.720</bdo>
<br>
<div class="downval"><span>-0.120</span> (<span>-0.60%</span>)</div>
</td>
<td class="turnover">2.69B</td>
<td class="market">475.78B</td>
<td class="pe">42.69x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1810.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1810)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>2318</a></td>
<td class="name"><a>PING AN</a></td>
<td class="price"><bdo>HK$80.350</bdo>
<br>
<div class="downval"><span>-0.100</span> (<span>-0.12%</span>)</div>
</td>
<td class="turnover">1.49B</td>
<td class="market">598.41B</td>
<td class="pe">8.61x</td>
<td class="dividend">2.89%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=2318.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (2318)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>388</a></td>
<td class="name"><a>HKEX</a></td>
<td class="price"><bdo>HK$355.800</bdo>
<br>
<div class="downval"><span>-1.800</span> (<span>-0.50%</span>)</div>
</td>
<td class="turnover">1.49B</td>
<td class="market">451.09B</td>
<td class="pe">47.50x</td>
<td class="dividend">1.88%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0388.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (388)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1918</a></td>
<td class="name"><a>SUNAC</a></td>
<td class="price"><bdo>HK$28.950</bdo>
<br>
<div class="downval"><span>-1.600</span> (<span>-5.24%</span>)</div>
</td>
<td class="turnover">1.36B</td>
<td class="market">134.94B</td>
<td class="pe">4.44x</td>
<td class="dividend">4.64%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1918.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1918)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>981</a></td>
<td class="name"><a>SMIC</a></td>
<td class="price"><bdo>HK$18.580</bdo>
<br>
<div class="downval"><span>-0.760</span> (<span>-3.93%</span>)</div>
</td>
<td class="turnover">1.35B</td>
<td class="market">143.03B</td>
<td class="pe">59.55x</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0981.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (981)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1299</a></td>
<td class="name"><a>AIA</a></td>
<td class="price"><bdo>HK$77.650</bdo>
<br>
<div class="upval"><span>+1.000</span> (<span>+1.30%</span>)</div>
</td>
<td class="turnover">1.34B</td>
<td class="market">939.05B</td>
<td class="pe">18.03x</td>
<td class="dividend">1.65%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1299.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1299)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>2300</a></td>
<td class="name"><a>AMVIG HOLDINGS</a></td>
<td class="price"><bdo>HK$2.140</bdo>
<br>
<div class="upval"><span>+0.700</span> (<span>+48.61%</span>)</div>
</td>
<td class="turnover">1.31B</td>
<td class="market">1.98B</td>
<td class="pe">6.35x</td>
<td class="dividend">5.33%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=2300.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (2300)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1398</a></td>
<td class="name"><a>ICBC</a></td>
<td class="price"><bdo>HK$3.990</bdo>
<br>
<div class="downval"><span>-0.030</span> (<span>-0.75%</span>)</div>
</td>
<td class="turnover">1.28B</td>
<td class="market">346.30B</td>
<td class="pe">4.32x</td>
<td class="dividend">7.20%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1398.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1398)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>5</a></td>
<td class="name"><a>HSBC HOLDINGS</a></td>
<td class="price"><bdo>HK$28.200</bdo>
<br>
<div class="downval"><span>-0.400</span> (<span>-1.40%</span>)</div>
</td>
<td class="turnover">1.20B</td>
<td class="market">583.52B</td>
<td class="pe">12.21x</td>
<td class="dividend">2.78%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0005.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (5)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>939</a></td>
<td class="name"><a>CCB</a></td>
<td class="price"><bdo>HK$5.020</bdo>
<br>
<div class="downval"><span>-0.040</span> (<span>-0.79%</span>)</div>
</td>
<td class="turnover">1.10B</td>
<td class="market">1,206.89B</td>
<td class="pe">4.36x</td>
<td class="dividend">6.97%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0939.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (939)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9618</a></td>
<td class="name"><a>JD-SW</a></td>
<td class="price"><bdo>HK$282.200</bdo>
<br>
<div class="downval"><span>-6.200</span> (<span>-2.15%</span>)</div>
</td>
<td class="turnover">1.08B</td>
<td class="market">883.22B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9618.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9618)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1658</a></td>
<td class="name"><a>PSBC</a></td>
<td class="price"><bdo>HK$3.160</bdo>
<br>
<div class="upval"><span>+0.060</span> (<span>+1.94%</span>)</div>
</td>
<td class="turnover">1.05B</td>
<td class="market">62.74B</td>
<td class="pe">4.01x</td>
<td class="dividend">7.23%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1658.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1658)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>9633</a></td>
<td class="name"><a>NONGFU SPRING</a></td>
<td class="price"><bdo>HK$35.150</bdo>
<br>
<div class="downval"><span>-2.850</span> (<span>-7.50%</span>)</div>
</td>
<td class="turnover">1.01B</td>
<td class="market">174.92B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=9633.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (9633)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>1211</a></td>
<td class="name"><a>BYD COMPANY</a></td>
<td class="price"><bdo>HK$103.100</bdo>
<br>
<div class="downval"><span>-0.400</span> (<span>-0.39%</span>)</div>
</td>
<td class="turnover">844.11M</td>
<td class="market">94.33B</td>
<td class="pe">188.21x</td>
<td class="dividend">0.06%</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=1211.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (1211)"></td>
</tr>
<tr class="datarow">
<td class="code"><a>708</a></td>
<td class="name"><a>EVERG VEHICLE</a></td>
<td class="price"><bdo>HK$16.820</bdo>
<br>
<div class="downval"><span>-2.460</span> (<span>-12.76%</span>)</div>
</td>
<td class="turnover">789.41M</td>
<td class="market">148.29B</td>
<td class="pe">-</td>
<td class="dividend">-</td>
<td class="intraday"><img width="82" height="33" src="https://www1.hkex.com.hk/hkexwidget/chart/genchart?&sym=0708.HK&int=2&per=1&w=82&h=33&bottom=2&mode=8&tm=1601021280000" alt="Stock Chart (708)"></td>
</tr>
</tbody>
</table>

The table is loaded dynamically with JavaScript, so the data rows are not in the HTML that requests receives. If you open the browser's developer tools and watch the Network tab, you can see the XHR calls that fetch the data. I would look at those requests and then scrape from the API directly.
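As a rough illustration of that approach: once you have found the XHR call, its response body is usually structured data that can be parsed without touching the HTML at all. The sketch below is an assumption from start to finish. The JSON layout and the field names (`data`, `stocklist`, `sym`, `nm`, `ls`) are hypothetical stand-ins, and the real endpoint, parameters and fields have to be read from the Network tab.

```python
import json

# Hypothetical response body, standing in for whatever the XHR call returns.
# The real field names must be taken from the browser's Network tab.
sample_body = """
{"data": {"stocklist": [
  {"sym": "909", "nm": "MING YUAN CLOUD", "ls": "30.700"},
  {"sym": "700", "nm": "TENCENT", "ls": "503.500"}
]}}
"""


def rows_from_body(body):
    """Flatten the (assumed) JSON layout into (code, name, price) tuples."""
    payload = json.loads(body)
    return [
        (item["sym"], item["nm"], float(item["ls"]))
        for item in payload["data"]["stocklist"]
    ]


rows = rows_from_body(sample_body)
for row in rows:
    print(row)
```

With a live page, the body would come from something like `requests.get(xhr_url, params=...).text`, where `xhr_url` is the address copied from the Network tab.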

How to extract the following lines after pattern match

The web source looks like this:
<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>
I want to get the following numbers for every date.
For example, for a particular date (2019-12-24) I need the numbers 12269.25, 12283.70, 12202.10 and 12214.55, and then the same for the next date.
I am having difficulty because I need to select the 4 lines following each date on the page, and their XPaths are not closely related, as shown above. The page can contain anywhere from a single date to 100-200 dates.
Can anybody please help with a webdriver code snippet for this?
Thanks a lot.

Does this meet your needs?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>'''
doc = SimplifiedDoc(html)
table = doc.getElement(tag='table',value='tblchart')
trs = table.trs.notContains('<th') # get tr
for tr in trs:
    tds = tr.tds # get all td
    data = [td.text for td in tds]
    print(data[0],data[1],data[2],data[3],data[4])
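The same row-by-row idea can be written without any third-party package. This is a minimal sketch that swaps simplified_scrapy for the standard library's `html.parser`; `RowCollector` is a hypothetical class name, and the sample is a shortened copy of the table above. It groups the `<td>` texts by row and then keys the Open, High, Low and Close values by date.

```python
from html.parser import HTMLParser

# Shortened copy of the table from the question: one header row, two data rows.
html = """
<table class="tblchart">
<tr><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th></tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            if self._row:  # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_td and self._row:
            self._row[-1] += data.strip()


parser = RowCollector()
parser.feed(html)

# Map each date to its Open, High, Low and Close values (cells 1 to 4 of the row).
prices = {row[0]: [float(value) for value in row[1:5]] for row in parser.rows}
for date, ohlc in prices.items():
    print(date, ohlc)
```

With webdriver, the `html` string could be replaced by `driver.page_source`, so the loop works the same whether the page holds one date or a few hundred.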

Table extraction: BeautifulSoup vs. Pandas.read_html

I have an HTML file taken from this link, but I am not able to extract any table from it, either with bs4.BeautifulSoup() or with pandas.read_html. I understand that each row of my desired table starts with <tr class='odd'>. Despite that, neither soup.find({'class': 'odd'}) nor pd.read_html(url, attrs = {'class': 'odd'}) works. Where is the mistake, or what should I do instead?
The table apparently begins at requests.get(url).content[8359:].
<table style="background-color:#FFFEEE; border-width:thin; border-collapse:collapse; border-spacing:0; border-style:outset;" rules="groups" >
<colgroup>
<colgroup>
<colgroup>
<colgroup>
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup>
<tbody>
<tr style="vertical-align:middle; background-color:#177A9C">
<th scope="col" style="text-align:center">Ion</th>
<th scope="col" style="text-align:center"> Observed <br /> Wavelength <br /> Vac (nm) </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>g<sub>k</sub>A<sub>ki</sub></i><br /> (10<sup>8</sup> s<sup>-1</sup>) </th>
<th scope="col"> Acc. </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>i</sub></i> <br /> (eV) </th>
<th> </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>k</sub></i> <br /> (eV) </th>
<th scope="col" style="text-align:center" colspan="3"> Lower Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center" colspan="3"> Upper Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center"> <i>g<sub>i</sub></i> </th>
<th scope="col" style="text-align:center"> <b>-</b> </th>
<th scope="col" style="text-align:center"> <i>g<sub>k</sub></i> </th>
<th scope="col" style="text-align:center"> Type </th>
</tr>
</tbody>
<tbody>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr class='odd'>
<td class="lft1"><b>C I</b> </td>
<td class="fix"> 193.090540 </td>
<td class="lft1">1.02e+01 </td>
<td class="lft1"> A</td>
<td class="fix">1.2637284 </td>
<td class="dsh">- </td>
<td class="fix">7.68476771 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i><sup>2</sup> </td>
<td class="lft1"> <sup>1</sup>D </td>
<td class="lft1"> 2 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i>3<i>s</i> </td>
<td class="lft1"> <sup>1</sup>P° </td>
<td class="lft1"> 1 </td>
<td class="rgt"> 5</td>
<td class="dsh">-</td>
<td class="lft1">3 </td>
<td class="cnt"><sup></sup><sub></sub></td>
</tr>

This code can give you a jump start on this project. However, if you're looking for someone to build the whole project (request the data, scrape, store and manipulate it), I would suggest hiring someone or learning how to do it. See the BeautifulSoup documentation.
Go through the quickstart guide once and you'll know pretty much all there is to know about bs4.
import requests
from bs4 import BeautifulSoup
from time import sleep
url = 'https://physics.nist.gov/'
second_part = 'cgi-bin/ASD/lines1.pl?spectra=C%20I%2C%20Ti%20I&limits_type=0&low_w=190&upp_w=250&unit=1&de=0&format=0&line_out=0&no_spaces=on&remove_js=on&en_unit=1&output=0&bibrefs=0&page_size=15&show_obs_wl=1&unc_out=0&order_out=0&max_low_enrg=&show_av=2&max_upp_enrg=&tsb_value=0&min_str=&A_out=1&A8=1&max_str=&allowed_out=1&forbid_out=1&min_accur=&min_intens=&conf_out=on&term_out=on&enrg_out=on&J_out=on&g_out=on&submit=Retrieve%20Data%27'
page = requests.get(url+second_part)
soup = BeautifulSoup(page.content, "lxml")
whole_table = soup.find('table', rules='groups')
sub_tbody = whole_table.find_all('tbody')
# the two above lines are used to locate the table and the content
# we then continue to iterate through sub-categories i.e. tbody-s > tr-s > td-s
for tag in sub_tbody:
    if tag.find('tr').find('td'):
        table_rows = tag.find_all('tr')
        for tag2 in table_rows:
            if tag2.has_attr('class'):
                td_tags = tag2.find_all('td')
                print(td_tags[0].text, '<- Is the ion')
                print(td_tags[1].text, '<- Wavelength')
                print(td_tags[2].text, '<- Some formula gk Aki')
                # and so on...
                print('--'*40) # unnecessary but does print ----------...
            else:
                pass

You need to search for the tag first and then filter by class. For example, using the lxml parser:
soup = BeautifulSoup(yourdata, 'lxml')
for i in soup.find_all('tr',attrs={'class':"odd"}):
    print(i.text)
From this point you can write the data directly to a file, or build a list of lists (one inner list per row) and load it into pandas.
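The "list of lists" step could look roughly like the sketch below. It runs on a shortened copy of the table from the question, and it uses the built-in "html.parser" backend instead of lxml, which is an assumption made only to keep the example dependency-light. Note also that the `attrs` argument of `pd.read_html` describes the `<table>` element itself, not its rows, which is why `attrs={'class': 'odd'}` finds nothing there.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (three columns only).
html = """
<table rules="groups">
<tbody>
<tr><th scope="col">Ion</th><th scope="col">Observed Wavelength Vac (nm)</th><th scope="col">Acc.</th></tr>
</tbody>
<tbody>
<tr><td> </td><td> </td><td> </td></tr>
<tr class='odd'>
 <td class="lft1"><b>C I</b> </td>
 <td class="fix"> 193.090540 </td>
 <td class="lft1"> A</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One inner list per <tr class="odd">, one stripped string per <td>.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr", class_="odd")
]
print(rows)
```

From here, `pandas.DataFrame(rows, columns=[...])` with your own column names would turn the list into a table.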

Selenium python - looping through the tr and check if the td value is not 15 or 16

I am trying to go through every tr and check the 4th td in each one, to verify that its value is not 15 or 16.
This is what the HTML page looks like
and this is what the HTML code looks like
I am not sure how to approach this.

Please include the actual HTML code, and not an image of the code.
What you are trying to do can be accomplished by finding the rows, then finding the columns, then checking the 4th column in the array. I can't see the <table> element, so you'll have to figure out how to define that yourself.
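from selenium.webdriver.common.by import By

# Placeholder locator: replace it with whatever identifies your table.
table = driver.find_element(By.TAG_NAME, 'table')
rows = table.find_elements(By.TAG_NAME, 'tr')
for row in rows:
    cols = row.find_elements(By.TAG_NAME, 'td')
    # Rows without a 4th <td> (e.g. header rows) are skipped.
    if len(cols) > 3 and cols[3].text.strip() in ('15', '16'):
        pass  # do whatever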
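The check itself can be separated from Selenium, which makes it easy to try out on plain lists first. This is only a sketch: `offending_rows` is a hypothetical helper, and the sample rows are invented to resemble the page. Each row is a list of cell texts, which with Selenium you would build from the `.text` of each td element in the row.

```python
FORBIDDEN = {"15", "16"}


def offending_rows(rows, column=3, forbidden=FORBIDDEN):
    """Return (row_index, value) for every row whose cell at `column` is forbidden."""
    hits = []
    for index, cells in enumerate(rows):
        if len(cells) <= column:
            continue  # header or malformed row without a 4th cell
        value = cells[column].strip()
        if value in forbidden:
            hits.append((index, value))
    return hits


# Invented sample rows; the 4th cell (index 3) is the one being checked.
sample = [
    ["1.", "237229877", "1.", "10-1 ", "Church "],
    ["1.", "237229877", "1.", "15 ", "Church "],
    [],
    ["1.", "237229877", "1.", "16", "Church "],
]

hits = offending_rows(sample)
print(hits)
```

An empty result means no row holds 15 or 16 in its 4th column, which is the condition the question wants to verify.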

I've tried to approximate the content of that page with this. Notice that the rows are the same except that I've made the fourth columns distinct.
<html>
<body>
<table>
<tbody>
<tr>
<td class="Data1">1.</td>
<td class="Data1">237229877</td>
<td class="Data1">1.</td>
<td class="Data1">10-1 </td>
<td class="Data1">Church </td>
<td class="Data1">Corporation </td>
<td class="Data1">BELFAST </td>
<td class="Data1">IRELAND </td>
<td class="Data1">. </td>
<td class="Data1">00000-0000 </td>
<td class="Data1"></td>
<td class="Data1">03 </td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1">98 </td>
</tr>
<tr>
<td class="Data1">1.</td>
<td class="Data1">237229877</td>
<td class="Data1">1.</td>
<td class="Data1">10-2 </td>
<td class="Data1">Church </td>
<td class="Data1">Corporation </td>
<td class="Data1">BELFAST </td>
<td class="Data1">IRELAND </td>
<td class="Data1">. </td>
<td class="Data1">00000-0000 </td>
<td class="Data1"></td>
<td class="Data1">03 </td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1">98 </td>
</tr>
<tr>
<td class="Data1">1.</td>
<td class="Data1">237229877</td>
<td class="Data1">1.</td>
<td class="Data1">10-3 </td>
<td class="Data1">Church </td>
<td class="Data1">Corporation </td>
<td class="Data1">BELFAST </td>
<td class="Data1">IRELAND </td>
<td class="Data1">. </td>
<td class="Data1">00000-0000 </td>
<td class="Data1"></td>
<td class="Data1">03 </td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1">98 </td>
</tr>
<tr>
<td class="Data1">1.</td>
<td class="Data1">237229877</td>
<td class="Data1">1.</td>
<td class="Data1">10-4 </td>
<td class="Data1">Church </td>
<td class="Data1">Corporation </td>
<td class="Data1">BELFAST </td>
<td class="Data1">IRELAND </td>
<td class="Data1">. </td>
<td class="Data1">00000-0000 </td>
<td class="Data1"></td>
<td class="Data1">03 </td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1">98 </td>
</tr>
<tr>
<td class="Data1">1.</td>
<td class="Data1">237229877</td>
<td class="Data1">1.</td>
<td class="Data1">10-5 </td>
<td class="Data1">Church </td>
<td class="Data1">Corporation </td>
<td class="Data1">BELFAST </td>
<td class="Data1">IRELAND </td>
<td class="Data1">. </td>
<td class="Data1">00000-0000 </td>
<td class="Data1"></td>
<td class="Data1">03 </td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1">98 </td>
</tr>
<tr>
<td class="Data1">1.</td>
<td class="Data1">237229877</td>
<td class="Data1">1.</td>
<td class="Data1">10-6 </td>
<td class="Data1">Church </td>
<td class="Data1">Corporation </td>
<td class="Data1">BELFAST </td>
<td class="Data1">IRELAND </td>
<td class="Data1">. </td>
<td class="Data1">00000-0000 </td>
<td class="Data1"></td>
<td class="Data1">03 </td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1">98 </td>
</tr>
<tr>
<td class="Data1">1.</td>
<td class="Data1">237229877</td>
<td class="Data1">1.</td>
<td class="Data1">10-7 </td>
<td class="Data1">Church </td>
<td class="Data1">Corporation </td>
<td class="Data1">BELFAST </td>
<td class="Data1">IRELAND </td>
<td class="Data1">. </td>
<td class="Data1">00000-0000 </td>
<td class="Data1"></td>
<td class="Data1">03 </td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1">98 </td>
</tr>
<tr>
<td class="Data1">1.</td>
<td class="Data1">237229877</td>
<td class="Data1">1.</td>
<td class="Data1">10-8 </td>
<td class="Data1">Church </td>
<td class="Data1">Corporation </td>
<td class="Data1">BELFAST </td>
<td class="Data1">IRELAND </td>
<td class="Data1">. </td>
<td class="Data1">00000-0000 </td>
<td class="Data1"></td>
<td class="Data1">03 </td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1"></td>
<td class="Data1">98 </td>
</tr>
</tbody>
</table>
</body>
</html>
You can recover the fourth column of each row using code like this.
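>>> from selenium import webdriver
>>> from selenium.common.exceptions import NoSuchElementException
>>> from selenium.webdriver.common.by import By
>>> driver = webdriver.Chrome()
>>> driver.get('file://c:/scratch/temp.htm')
>>> row = 1
>>> while True:
...     try:
...         # XPath positions are 1-based: fourth cell of the row-th <tr>
...         td = driver.find_element(By.XPATH, '//tr[%s]/td[4]' % row)
...         td.text
...         row += 1
...     except NoSuchElementException:
...         break  # no such row: we are past the end of the table
...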
'10-1'
'10-2'
'10-3'
'10-4'
'10-5'
'10-6'
'10-7'
'10-8'
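
If a browser is not needed, the same column can probably be read with BeautifulSoup alone. The following is only a minimal sketch: the two-row `html` string is a hypothetical cut-down copy of the table above, and `fourth_column` is an illustrative helper name, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two shortened rows in the same shape as the table above.
html = """
<table>
<tr>
<td class="Data1">1.</td><td class="Data1">237229877</td>
<td class="Data1">1.</td><td class="Data1">10-4 </td><td class="Data1">Church </td>
</tr>
<tr>
<td class="Data1">1.</td><td class="Data1">237229877</td>
<td class="Data1">1.</td><td class="Data1">10-5 </td><td class="Data1">Church </td>
</tr>
</table>
"""


def fourth_column(markup):
    """Return the stripped text of the fourth <td> of every row that has one."""
    soup = BeautifulSoup(markup, "html.parser")
    values = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Header rows built from <th> cells have no <td> and are skipped here.
        if len(cells) >= 4:
            values.append(cells[3].get_text(strip=True))
    return values


print(fourth_column(html))  # ['10-4', '10-5']
```

Python lists are 0-based, so `cells[3]` corresponds to `td[4]` in the XPath version.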

parsing part of the table for data using BeautifulSoup

I have a personal project to scrape data online using BeautifulSoup and Python, and I thought historical stock data would be good practice. I looked at the page source here to work out how I can use BeautifulSoup's select() or findAll() to parse part of the data from the table. Here is the code I use, but it picks up things other than the table data I want.
soup = bs4.BeautifulSoup(res.text, 'lxml')
table = soup.findAll( 'td', {'class':'yfnc_tabledata1'} )
print table
My question: how do I parse only the 2 rows showing the 2 days of data from the table?
Here is the table that has 2 days of the historical data:
<table class="yfnc_datamodoutline1" width="100%" cellpadding="0" cellspacing="0" border="0">
<tr valign="top">
<td>
<table border="0" cellpadding="2" cellspacing="1" width="100%">
<tr>
<th scope="col" class="yfnc_tablehead1" align="right" width="16%">Date</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">Open</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">High</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">Low</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">close</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="16%">Volume</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="15%">Adj Close*</th>
</tr>
<tr>
<td class="yfnc_tabledata1" nowrap align="right">12 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.44</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
<td class="yfnc_tabledata1" align="right">18,612,300</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
</tr>
<tr>
<td class="yfnc_tabledata1" nowrap align="right">11 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">108.52</td>
<td class="yfnc_tabledata1" align="right">108.93</td>
<td class="yfnc_tabledata1" align="right">107.85</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
<td class="yfnc_tabledata1" align="right">27,484,500</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
</tr>
<tr>
<td class="yfnc_tabledata1" colspan="7" align="center">
* <small>Close price adjusted for dividends and splits.</small>
</td>
</tr>
</table>
</td>
</tr>
</table>
I only need the specific 2 rows of data from above:
<tr>
<td class="yfnc_tabledata1" nowrap align="right">12 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.44</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
<td class="yfnc_tabledata1" align="right">18,612,300</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
</tr>
<tr>
<td class="yfnc_tabledata1" nowrap align="right">11 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">108.52</td>
<td class="yfnc_tabledata1" align="right">108.93</td>
<td class="yfnc_tabledata1" align="right">107.85</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
<td class="yfnc_tabledata1" align="right">27,484,500</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
</tr>

You can select all the rows from the nested table inside the yfnc_datamodoutline1 table and take the first two:
from bs4 import BeautifulSoup
# html holds the page source; naming the parser avoids bs4's "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")
table_rows = soup.select("table.yfnc_datamodoutline1 table tr + tr")
row1, row2 = table_rows[0:2]
print(row1)
print(row2)
Which would give you:
<tr>
<td align="right" class="yfnc_tabledata1" nowrap="">12 Aug 2016</td>
<td align="right" class="yfnc_tabledata1">107.78</td>
<td align="right" class="yfnc_tabledata1">108.44</td>
<td align="right" class="yfnc_tabledata1">107.78</td>
<td align="right" class="yfnc_tabledata1">108.18</td>
<td align="right" class="yfnc_tabledata1">18,612,300</td>
<td align="right" class="yfnc_tabledata1">108.18</td>
</tr>
<tr>
<td align="right" class="yfnc_tabledata1" nowrap="">11 Aug 2016</td>
<td align="right" class="yfnc_tabledata1">108.52</td>
<td align="right" class="yfnc_tabledata1">108.93</td>
<td align="right" class="yfnc_tabledata1">107.85</td>
<td align="right" class="yfnc_tabledata1">107.93</td>
<td align="right" class="yfnc_tabledata1">27,484,500</td>
<td align="right" class="yfnc_tabledata1">107.93</td>
</tr>
To get the td data just extract the text from each td:
print([td.text for td in row1.find_all("td")])
print([td.text for td in row2.find_all("td")])
Which would give you:
[u'12 Aug 2016', u'107.78', u'108.44', u'107.78', u'108.18', u'18,612,300', u'108.18']
[u'11 Aug 2016', u'108.52', u'108.93', u'107.85', u'107.93', u'27,484,500', u'107.93']
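The selector table.yfnc_datamodoutline1 table tr + tr matches every row of the inner table that directly follows another row, which skips the first (header) row. It also matches the trailing footnote row, which is why only the first two results are used.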
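
Once the rows are selected, the cell text usually needs converting before it is useful. Here is a rough sketch of one way to do that, assuming every data row has exactly seven cells in the order shown in the question. The `html` string is a trimmed copy of the question's table, and `parse_rows` and the dictionary keys are names made up for this example.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (attributes shortened).
html = """
<table class="yfnc_datamodoutline1"><tr valign="top"><td>
<table>
<tr><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th><th>Volume</th><th>Adj Close*</th></tr>
<tr><td>12 Aug 2016</td><td>107.78</td><td>108.44</td><td>107.78</td><td>108.18</td><td>18,612,300</td><td>108.18</td></tr>
<tr><td>11 Aug 2016</td><td>108.52</td><td>108.93</td><td>107.85</td><td>107.93</td><td>27,484,500</td><td>107.93</td></tr>
<tr><td colspan="7">* <small>Close price adjusted for dividends and splits.</small></td></tr>
</table>
</td></tr></table>
"""

# Hypothetical field names, in the same order as the table columns.
FIELDS = ("date", "open", "high", "low", "close", "volume", "adj_close")


def parse_rows(markup):
    """Return one dict per data row, with dates and numbers converted."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for row in soup.select("table.yfnc_datamodoutline1 table tr + tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != len(FIELDS):
            continue  # skips the single-cell footnote row
        record = dict(zip(FIELDS, cells))
        record["date"] = datetime.strptime(record["date"], "%d %b %Y").date()
        record["volume"] = int(record["volume"].replace(",", ""))
        for key in ("open", "high", "low", "close", "adj_close"):
            record[key] = float(record[key])
        records.append(record)
    return records


for record in parse_rows(html):
    print(record["date"], record["close"], record["volume"])
```

Filtering on the number of cells, rather than slicing the first two results, keeps working if the page shows more than two days of data.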

Web Scraping using bs4 with Python - python

The easiest way to parse the table is to use pandas:
import pandas as pd
url = 'https://www.acb.com/club/estadisticas/id/14'
df = pd.read_html(url)[0].iloc[:,1:]
df.to_csv('data.csv', index=False)
This reads the table into a DataFrame and saves it as data.csv.
