I'm a novice learning how to web scrape with Python. Right now I'm attempting to scrape Euro 2020 stats from this website: https://theanalyst.com/na/2021/06/euro-2020-player-stats. After running my initial code (see below) to fetch the HTML from the page, I can't locate the table tag and its data-table class. I can see the table and its data-table class when I inspect the website, but they don't show up when I print out page_soup.
from urllib.request import urlopen as uReq # Web client
from bs4 import BeautifulSoup as soup # HTML data structure
url_page = 'https://theanalyst.com/na/2021/06/euro-2020-player-stats'
# Open connection & download the html from the url
uClient = uReq(url_page)
# Parses html into a soup data structure
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
print(page_soup)
The table is loaded dynamically, as JSON, via a GET request to:
https://dataviz.theanalyst.com/euro-2020-hub/player_stats_3_2020.json
Since we're dealing with JSON data, it's easier to use the requests library to fetch it.
Here is an example that uses the pandas library to load the table into a DataFrame (you don't have to use pandas):
import pandas as pd
import requests
url = "https://dataviz.theanalyst.com/euro-2020-hub/player_stats_3_2020.json"
response = requests.get(url).json()
print(pd.json_normalize(response["data"]).to_string())
Output (truncated):
player_id team_id team_name player_first_name player_last_name player age position detailed_position mins_played np_shots np_sot np_goals np_xG op_chances_created op_assists op_xA op_passes op_pass_completion_rate tackles_won interceptions recoveries avg_carry_distance avg_carry_progress carry_w_shot carry_w_goal carry_w_chance_created carry_w_assist take_ons take_ons_success_rate goal_ending total_xG shot_ending team_badge
0 103955 114 England Raheem Sterling Raheem Sterling 26 Forward Second Striker 641 14 8 3 3.82 2 1 1.18 193 0.85 5 4 23 12.98 6.73 3 0 3 1 38 52.63 6 7.08 24 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
1 56979 114 England Jordan Henderson Jordan Henderson 31 Midfielder Central Midfielder 150 1 1 1 0.32 0 0 0.06 111 0.88 0 1 11 7.83 0.49 0 0 0 0 3 66.67 0 0.00 0 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
2 78830 114 England Harry Kane Harry Kane 27 Forward Striker 649 15 7 4 3.57 5 0 0.39 159 0.70 0 3 8 10.52 3.06 2 0 2 0 15 53.33 7 6.38 21 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
3 58621 114 England Kyle Walker Kyle Walker 31 Defender Full Back 599 0 0 0 0.00 2 0 0.18 352 0.87 0 8 37 11.66 5.09 0 0 0 0 1 100.00 3 2.54 10 https://omo.akamai.opta.net/image.php?secure=true&h=omo.akamai.opta.net&sport=football&entity=team&description=badges&dimensions=150&id=114
The variable response is now a dictionary (dict) whose keys/values you can access. To view and prettify the data:
from pprint import pprint
print(type(response))
pprint(response)
Output (truncated):
<class 'dict'>
{'data': [{'age': 26,
           'avg_carry_distance': 12.98,
           'avg_carry_progress': 6.73,
           'carry_w_assist': 1,
           'carry_w_chance_created': 3,
           'carry_w_goal': 0,
           'carry_w_shot': 3,
           'detailed_position': 'Second Striker',
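Since response is a plain dict, you can also pick fields out of it without pandas. A minimal sketch, using the "player" and "np_goals" keys visible in the JSON above:
# Print each player's name and non-penalty goal count.
for p in response["data"][:5]:
    print(p["player"], p["np_goals"])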
Related
I'm having trouble pulling prices from the Bid and Ask columns of this website: https://banggia.vps.com.vn/chung-khoan/derivative-VN30. Right now I can only pull the name of the class, which is "price-table-content". How can I improve this code so that I can pull the prices in the Bid and Ask columns? Any help is greatly appreciated :)
from selenium import webdriver
options = webdriver.ChromeOptions()
options.headless = True
path = 'C:/Users/quank/PycharmProjects/pythonProject2/chromedriver.exe'
driver = webdriver.Chrome(executable_path=path, options=options)
url = 'https://banggia.vps.com.vn/chung-khoan/derivative-VN30'
driver.get(url=url)
element = driver.find_elements_by_css_selector(
    '#root > div > div.content.undefined > div.derivative > table.price-table > tbody')
for i in element:
    print(i.get_attribute('outerHTML'))
Here is the result of running this code:
C:\Users\quank\PycharmProjects\Botthudulieu\venv\Scripts\python.exe
C:/Users/quank/PycharmProjects/pythonProject2/Botthudulieu.py
<tbody class="price-table-content"></tbody>
When you check the network activity you'll see that the data is retrieved from an API, so query the API directly rather than trying to scrape the site.
import requests
data = requests.get('https://bgapidatafeed.vps.com.vn/getpsalldatalsnapshot/VN30F2109,VN30F2110,VN30F2112,VN30F2203').json()
Or with pandas:
import pandas as pd
df = pd.read_json('https://bgapidatafeed.vps.com.vn/getpsalldatalsnapshot/VN30F2109,VN30F2110,VN30F2112,VN30F2203')
Resulting dataframe (truncated for width; the g1–g7 columns hold pipe-separated bid/ask depth levels such as "1420.00|37|i"):
     id        sym  mc       c       f       r  lastPrice  lastVolume     lot  avePrice  highPrice  lowPrice  ...
0  2100  VN30F2109   4  1505.1  1308.3  1406.7       1420        6018  351832   1406.49     1422.9      1390  ...
1  2100  VN30F2110   4  1504.3  1307.5  1405.9       1418          14     462   1406.94       1422      1390  ...
2  2100  VN30F2112   4  1505.5  1308.7  1407.1       1420           1      54   1424.31       1420    1390.8  ...
3  2100  VN30F2203   4  1503.9  1307.3  1405.6       1420           1      50   1402.19       1420      1390  ...

[4 rows x 31 columns]
The remaining columns (fBVol, fBValue, fSVolume, fSValue, g1–g7, mkStatus, listing_status, matureDate, closePrice, ptVol, oi, oichange, lv) are elided here for width.
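Back to the original question about the Bid and Ask columns: they come from the g1–g7 fields, which are pipe-separated depth levels. A minimal parsing sketch, assuming g1 is the best bid and g4 the best ask (verify this mapping against the live price board):
import requests

url = ("https://bgapidatafeed.vps.com.vn/getpsalldatalsnapshot/"
       "VN30F2109,VN30F2110,VN30F2112,VN30F2203")
data = requests.get(url).json()

for row in data:
    # Assumed encoding: "price|volume|flag" (e.g. "1420.00|37|i").
    bid_price, bid_volume = row["g1"].split("|")[:2]
    ask_price, ask_volume = row["g4"].split("|")[:2]
    print(row["sym"], "bid:", bid_price, bid_volume, "ask:", ask_price, ask_volume)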
I want to extract the data from the table given in 'https://statisticstimes.com/demographics/india/indian-states-population.php' and put it in a list or a dictionary.
I am a beginner in Python. From what I have learned so far, this is all I could come up with:
import urllib.request , urllib.error , urllib.parse
from bs4 import BeautifulSoup
url = input("Enter url: ")
html = urllib.request.urlopen(url).read()
x = BeautifulSoup(html , 'html.parser')
tags = x('tr')
lst = list()
for tag in tags:
    lst.append(tag.findAll('td'))
print(lst)
You can use requests and pandas.
Here's how:
import pandas as pd
import requests
from tabulate import tabulate
url = "https://statisticstimes.com/demographics/india/indian-states-population.php"
df = pd.read_html(requests.get(url).text, flavor="bs4")[-1]
print(tabulate(df.head(10), showindex=False))
Output:
--- ---------------- -------- -------- ------- ----- ---- -------------------- ---
NCT Delhi 18710922 16787941 1922981 11.45 1.36 Malawi 63
18 Haryana 28204692 25351462 2853230 11.25 2.06 Venezuela 51
14 Kerala 35699443 33406061 2293382 6.87 2.6 Morocco 41
20 Himachal Pradesh 7451955 6864602 587353 8.56 0.54 China, Hong Kong SAR 104
16 Punjab 30141373 27743338 2398035 8.64 2.2 Mozambique 48
12 Telangana 39362732 35004000 4358732 12.45 2.87 Iraq 36
25 Goa 1586250 1458545 127705 8.76 0.12 Bahrain 153
19 Uttarakhand 11250858 10086292 1164566 11.55 0.82 Haiti 84
UT3 Chandigarh 1158473 1055450 103023 9.76 0.08 Eswatini 159
9 Gujarat 63872399 60439692 3432707 5.68 4.66 France 23
--- ---------------- -------- -------- ------- ----- ---- -------------------- ---
With:
df.to_csv("your_table.csv", index=False)
you can dump the table to a .csv file.
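And since the goal was a list or a dictionary, the DataFrame converts directly. A quick sketch using the df from above:
# One dict per table row, keyed by column name.
records = df.to_dict(orient="records")
print(records[0])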
I have a little script for scraping info from fbref (data link: https://fbref.com/en/comps/9/stats/Premier-League-Stats) and it worked well, but now some fields are failing (I've checked that the ones that no longer work are "player", "nationality", "position", "squad", "age" and "birth_year"). I have also checked that the fields still have the same names on the web page as before. Any ideas/help to solve the problem?
Many thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import sys, getopt
import csv
def get_tables(url):
    res = requests.get(url)
    ## The next two lines get around the issue with comments breaking the parsing.
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("", res.text), 'lxml')
    all_tables = soup.findAll("tbody")
    team_table = all_tables[0]
    player_table = all_tables[1]
    return player_table, team_table

def get_frame(features, player_table):
    pre_df_player = dict()
    features_wanted_player = features
    rows_player = player_table.find_all('tr')
    for row in rows_player:
        if row.find('th', {"scope": "row"}) is not None:
            for f in features_wanted_player:
                cell = row.find("td", {"data-stat": f})
                a = cell.text.strip().encode()
                text = a.decode("utf-8")
                if text == '':
                    text = '0'
                if (f != 'player') & (f != 'nationality') & (f != 'position') & (f != 'squad') & (f != 'age') & (f != 'birth_year'):
                    text = float(text.replace(',', ''))
                if f in pre_df_player:
                    pre_df_player[f].append(text)
                else:
                    pre_df_player[f] = [text]
    df_player = pd.DataFrame.from_dict(pre_df_player)
    return df_player
stats = ["player","nationality","position","squad","age","birth_year","games","games_starts","minutes","goals","assists","pens_made","pens_att","cards_yellow","cards_red","goals_per90","assists_per90","goals_assists_per90","goals_pens_per90","goals_assists_pens_per90","xg","npxg","xa","xg_per90","xa_per90","xg_xa_per90","npxg_per90","npxg_xa_per90"]
def frame_for_category(category, top, end, features):
    url = top + category + end
    player_table, team_table = get_tables(url)
    df_player = get_frame(features, player_table)
    return df_player
top='https://fbref.com/en/comps/9/'
end='/Premier-League-Stats'
df1 = frame_for_category('stats',top,end,stats)
df1
I suggest loading the table with pandas' read_html. There is a direct link to this table under Share & Export --> Embed this Table.
import pandas as pd
df = pd.read_html("https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fstats%2FPremier-League-Stats&div=div_stats_standard", header=1)
This outputs a list of DataFrames; the table can be accessed as df[0]. Output of df[0].head():
   Rk               Player   Nation  Pos           Squad     Age  Born  MP  Starts   Min   90s  Gls  Ast  G-PK  PK  PKatt  CrdY  CrdR  Gls.1  Ast.1   G+A  G-PK.1  G+A-PK   xG  npxG   xA  npxG+xA  xG.1  xA.1  xG+xA  npxG.1  npxG+xA.1  Matches
0   1  Patrick van Aanholt   nl NED   DF  Crystal Palace  30-190  1990  16      15  1324  14.7    0    1     0   0      0     1     0      0   0.07  0.07       0    0.07  1.2   1.2  0.8        2  0.08  0.05   0.13    0.08       0.13  Matches
1   2        Tammy Abraham  eng ENG   FW         Chelsea  23-156  1997  20      12  1021  11.3    6    1     6   0      0     0     0   0.53   0.09  0.62    0.53    0.62  5.6   5.6  0.9      6.5  0.49  0.08   0.57    0.49       0.57  Matches
2   3            Che Adams  eng ENG   FW     Southampton  24-237  1996  26      22  1985  22.1    5    4     5   0      0     1     0   0.23   0.18  0.41    0.23    0.41  5.5   5.5  4.3      9.9  0.25   0.2   0.45    0.25       0.45  Matches
3   4     Tosin Adarabioyo  eng ENG   DF          Fulham  23-164  1997  23      23  2070    23    0    0     0   0      0     1     0      0      0     0       0       0    1     1  0.1      1.1  0.04  0.01   0.05    0.04       0.05  Matches
4   5               Adrián   es ESP   GK       Liverpool  34-063  1987   3       3   270     3    0    0     0   0      0     0     0      0      0     0       0       0    0     0    0        0     0     0      0       0          0  Matches
If you're only after the player stats, change player_table = all_tables[1] to player_table = all_tables[2], because right now you are feeding the team table into the get_frame function.
I tried it and it worked fine after that.
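Concretely, get_tables becomes (a sketch that assumes the page keeps this ordering of <tbody> elements; imports as in the question):
def get_tables(url):
    res = requests.get(url)
    # Strip the HTML comments that hide the tables from the parser.
    comm = re.compile("<!--|-->")
    soup = BeautifulSoup(comm.sub("", res.text), 'lxml')
    all_tables = soup.findAll("tbody")
    team_table = all_tables[0]
    player_table = all_tables[2]  # was all_tables[1]; that index no longer points at the player table
    return player_table, team_table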
en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul
In the link above there is un-tabulated data listing Istanbul's neighbourhoods.
I want to fetch these neighbourhoods into a data frame with this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('a',{'class':"new"})
neighborhoods=[]
for item in tocList:
    text = item.get_text()
    neighborhoods.append(text)
df = pd.DataFrame(neighborhoods, columns=['Neighborhood'])
print(df)
and I got this output:
Neighborhood
0 Maden
1 Nizam
2 Anadolu
3 Arnavutköy İmrahor
4 Arnavutköy İslambey
... ...
705 Seyitnizam
706 Sümer
707 Telsiz
708 Veliefendi
709 Yeşiltepe
710 rows × 1 columns
but some data is not fetched. Check the list below and compare it to the output:
Adalar
Burgazada
Heybeliada
Kınalıada
Maden
Nizam
findAll() is not fetching the neighbourhoods that appear as plain list items rather than links with the "new" class, e.g.:
<ol><li>Burgazada</li>
<li>Heybeliada</li>
Also, can I extend the code to produce two columns, one for each 'Neighborhood' and its 'District'?
Are you trying to fetch this list from the Table of Contents?
Please check if this solves your problem:
import pandas as pd
import requests
from bs4 import BeautifulSoup
wikiurl="https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
response=requests.get(wikiurl)
soup = BeautifulSoup(response.text, 'html.parser')
tocList=soup.findAll('span',{'class':"toctext"})
districts=[]
blocked_words = ['Neighbourhoods by districts','Further reading', 'External links']
for item in tocList:
    text = item.get_text()
    if text not in blocked_words:
        districts.append(text)
df = pd.DataFrame(districts, columns=['districts'])
print(df)
Output:
districts
0 Adalar
1 Arnavutköy
2 Ataşehir
3 Avcılar
4 Bağcılar
5 Bahçelievler
6 Bakırköy
7 Başakşehir
8 Bayrampaşa
9 Beşiktaş
10 Beykoz
11 Beylikdüzü
12 Beyoğlu
13 Büyükçekmece
14 Çatalca
15 Çekmeköy
16 Esenler
17 Esenyurt
18 Eyüp
19 Fatih
20 Gaziosmanpaşa
21 Güngören
22 Kadıköy
23 Kağıthane
24 Kartal
25 Küçükçekmece
26 Maltepe
27 Pendik
28 Sancaktepe
29 Sarıyer
30 Silivri
31 Sultanbeyli
32 Sultangazi
33 Şile
34 Şişli
35 Tuzla
36 Ümraniye
37 Üsküdar
38 Zeytinburnu
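As for the second part of the question (a District column next to each Neighborhood): one approach is to walk each district heading and collect the list items that follow it. This is only a sketch and assumes the neighbourhood names sit in <ol>/<ul> lists under per-district headings, so verify it against the page's current markup:
import pandas as pd
import requests
from bs4 import BeautifulSoup

wikiurl = "https://en.wikipedia.org/wiki/List_of_neighbourhoods_of_Istanbul"
soup = BeautifulSoup(requests.get(wikiurl).text, 'html.parser')

rows = []
for heading in soup.find_all(['h2', 'h3']):
    headline = heading.find('span', {'class': 'mw-headline'})
    if headline is None:
        continue
    district = headline.get_text()
    # Collect list items until the next heading starts a new district.
    for sibling in heading.find_next_siblings():
        if sibling.name in ('h2', 'h3'):
            break
        if sibling.name in ('ol', 'ul'):
            for li in sibling.find_all('li'):
                rows.append({'Neighborhood': li.get_text(), 'District': district})

df = pd.DataFrame(rows)
print(df)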
How do I turn a table like this (a batting gamelogs table) into a CSV file using Python and BeautifulSoup?
I want the first header where it says Rk, Gcar, Gtm, etc. and not any of the other headers within the table (the ones for each month of the season).
Here is the code I have so far:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
def stir_the_soup():
    player_links = open('player_links.txt', 'r')
    player_ID_nums = open('player_ID_nums.txt', 'r')
    id_nums = [x.rstrip('\n') for x in player_ID_nums]
    idx = 0
    for url in player_links:
        print url
        soup = BeautifulSoup(urlopen(url), "lxml")
        p_type = ""
        if url[-12] == 'p':
            p_type = "pitching"
        elif url[-12] == 'b':
            p_type = "batting"
        table = soup.find(lambda tag: tag.name == 'table' and tag.has_attr('id') and tag['id'] == (p_type + "_gamelogs"))
        header = [[val.text.encode('utf8') for val in table.find_all('thead')]]
        rows = []
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf8') for val in row.find_all('th')])
            rows.append([val.text.encode('utf8') for val in row.find_all('td')])
        with open("%s.csv" % id_nums[idx], 'wb') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(row for row in rows if row)
        idx += 1
    player_links.close()

if __name__ == "__main__":
    stir_the_soup()
The id_nums list contains all of the ID numbers for each player, used as the names for the separate CSV files.
For each row, the leftmost cell is a <th> tag and the rest of the row is <td> tags. In addition to the header, how do I put all of that into one row?
This code gets you the big table of stats, which is what I think you want.
Make sure you have lxml, beautifulsoup4 and pandas installed.
import pandas as pd

df = pd.read_html(r'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010')
print(df[4])
Here is the output of the first 5 rows. You may need to clean it up slightly, as I don't know what your exact end goal is:
df[4].head(5)
Rk Gcar Gtm Date Tm Unnamed: 5 Opp Rslt Inngs PA ... CS BA OBP SLG OPS BOP aLI WPA RE24 Pos
0 1 66 2 (1) Apr 6 ARI NaN SDP L,3-6 7-8 1 ... 0 1.000 1.000 1.000 2.000 9 .94 0.041 0.51 PH
1 2 67 3 Apr 7 ARI NaN SDP W,5-3 7-8 1 ... 0 .500 .500 .500 1.000 9 1.16 -0.062 -0.79 PH
2 3 68 4 Apr 9 ARI NaN PIT W,9-1 8-GF 1 ... 0 .667 .667 .667 1.333 2 .00 0.000 0.13 PH SS
3 4 69 5 Apr 10 ARI NaN PIT L,3-6 CG 4 ... 0 .500 .429 .500 .929 2 1.30 -0.040 -0.37 SS
4 5 70 7 (1) Apr 13 ARI # LAD L,5-9 6-6 1 ... 0 .429 .375 .429 .804 9 1.52 -0.034 -0.46 PH
To select certain columns within this DataFrame: df[4]['COLUMN_NAME_HERE'].head(5)
Example: df[4]['Gcar']
Also, if typing df[4] gets annoying, you can always assign it to another DataFrame: df2 = df[4]
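Putting those together (df comes from the read_html call above):
# Pull the gamelog table out of the list and select one column.
df2 = df[4]
print(df2['Gcar'].head(5))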
import pandas as pd
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
html=urllib2.urlopen(url)
bs = BeautifulSoup(html,'lxml')
table = str(bs.find('table',{'id':'batting_gamelogs'}))
dfs = pd.read_html(table)
This uses pandas, which is pretty useful for stuff like this. It also puts the data in a reasonable format for doing further operations on.
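From there, writing the CSV the question asked for is a one-liner (continuing from dfs above; the filename is arbitrary):
# Dump the parsed gamelog table to a CSV file.
dfs[0].to_csv('batting_gamelogs.csv', index=False)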
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html