Scraping basketball results and associating the related competition with each match - python

I want to scrape basketball results from this webpage:
http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29
I created the code using bs4 and requests:
import requests
from bs4 import BeautifulSoup

url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
    r = session.get(url, timeout=30)
soup = BeautifulSoup(r.content, 'html.parser')
The issue I face is how to attach the competition name to each row I scrape.
I want to build a table where each row holds one match's results (competition, home team, away team, score, ...).

Selenium
Try this (selenium):
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time
res = []
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(url)
time.sleep(2)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page, 'html.parser')
span = soup.select_one('span#live')
tables = span.select('table')
for table in tables:
    if table.get('class'):
        competition = table.select_one('a b font').text
    else:
        for home, away in zip(table.select('tr.b1')[0::2], table.select('tr.b1')[1::2]):
            res.append([competition,
                        home.select_one('td a').text,
                        away.select_one('td a').text,
                        home.select_one('td.red').text,
                        away.select_one('td.red').text,
                        home.select_one('td.odds1').text,
                        away.select_one('td.odds1').text,
                        f"{home.select('td font')[0].text}/{home.select('td font')[1].text}",
                        f"{away.select('td font')[0].text}/{away.select('td font')[1].text}",
                        home.select('td div a')[-1].get('href')])
df = pd.DataFrame(res, columns=['competition',
                                'home',
                                'away',
                                'home score',
                                'away score',
                                'home odds',
                                'away odds',
                                'home ht',
                                'away ht',
                                'odds'])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score home odds away odds home ht away ht odds
0 National Basketball Association Portland Trail Blazers Oklahoma City Thunder 120 131 2.72 1.45 50/70 63/68 http://data.nowgoal.group/OddsCompBasket/387520.html
1 National Basketball Association Houston Rockets Boston Celtics 137 112 1.49 2.58 77/60 60/52 http://data.nowgoal.group/OddsCompBasket/387521.html
2 National Basketball Association Philadelphia 76ers Dallas Mavericks 115 118 2.04 1.76 39/64 48/55 http://data.nowgoal.group/OddsCompBasket/387522.html
3 Women’s National Basketball Association Connecticut Sun Washington Mystics 89 94 2.28 1.59 52/37 48/46 http://data.nowgoal.group/OddsCompBasket/385886.html
4 Women’s National Basketball Association Chicago Sky Los Angeles Sparks 96 78 2.72 1.43 40/56 36/42 http://data.nowgoal.group/OddsCompBasket/385618.html
5 Women’s National Basketball Association Seattle Storm Minnesota Lynx 90 66 1.21 4.19 41/49 35/31 http://data.nowgoal.group/OddsCompBasket/385884.html
6 Friendly Competition Labas Pasauli LT Balduasenaras 85 78 52/33 31/47 http://data.nowgoal.group/OddsCompBasket/387769.html
7 Friendly Competition BC Vikings Nemuno Banga KK 66 72 29/37 30/42 http://data.nowgoal.group/OddsCompBasket/387771.html
8 Friendly Competition NRG Kiev Hizhaki 51 76 31/20 28/48 http://data.nowgoal.group/OddsCompBasket/387766.html
9 Friendly Competition Finland Estonia 97 76 2.77 1.40 48/49 29/47 http://data.nowgoal.group/OddsCompBasket/387740.html
10 Friendly Competition Synkarb Sk nemenchine 82 79 37/45 38/41 http://data.nowgoal.group/OddsCompBasket/387770.html
and so on....
And saves the same table to Res.csv.
Requests
Try this (requests):
import pandas as pd
from bs4 import BeautifulSoup
import requests
res = []
url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('h')
for item in items:
    values = item.text.split('^')
    res.append([values[1], values[8], values[10], values[11], values[12]])
df = pd.DataFrame(res, columns=['competition', 'home', 'away', 'home score', 'away score'])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score
0 NBA Portland Trail Blazers Oklahoma City Thunder 120 131
1 NBA Houston Rockets Boston Celtics 137 112
2 NBA Philadelphia 76ers Dallas Mavericks 115 118
3 WNBA Connecticut Sun Washington Mystics 89 94
4 WNBA Chicago Sky Los Angeles Sparks 96 78
5 WNBA Seattle Storm Minnesota Lynx 90 66
6 FC Labas Pasauli LT Balduasenaras 85 78
7 FC BC Vikings Nemuno Banga KK 66 72
8 FC NRG Kiev Hizhaki 51 76
And saves the same table to Res.csv.
If you do not want the index column, simply add index=False to the save call: df.to_csv('Res.csv', index=False)
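For illustration, here is the difference on a tiny throwaway frame (written to an in-memory buffer instead of a file, so no Res.csv is touched):

```python
import io
import pandas as pd

df = pd.DataFrame({'home': ['Houston Rockets'], 'away': ['Boston Celtics']})

# Default: the row index is written as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
print(buf.getvalue().splitlines()[0])  # ,home,away

# With index=False the index column is dropped.
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # home,away
```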
Note on Selenium: you need both selenium and geckodriver; in this code geckodriver is loaded from c:/program/geckodriver.exe
The Selenium version is slower, but it saves you from hunting down the XHR URL with DevTools.

This page uses JavaScript to load its data, but requests/BeautifulSoup can't run JavaScript, so you have two options.
First: use Selenium to control a real web browser that can run JavaScript. This works better when a page uses complex JavaScript code to generate its data, but it is slower because it has to start a browser, render the page, and execute the JavaScript.
Second: use DevTools in Firefox/Chrome (Network tab, XHR filter) to find the URL that the JavaScript/AJAX (XHR) code calls to get the data from the server, and request that URL directly. Often you get JSON, which converts straight to a Python list/dictionary, so you don't even need BeautifulSoup. This is faster, but sometimes the page uses JavaScript logic that is hard to reproduce in Python.
I chose the second method.
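When the endpoint does return JSON, the pattern is as simple as this sketch (the response body below is invented, standing in for what an XHR call might return):

```python
import json

# Pretend this string is the body of an XHR response found in the Network tab.
response_text = '{"matches": [{"home": "Houston Rockets", "away": "Boston Celtics", "score": [137, 112]}]}'

# json.loads turns it into plain Python dicts/lists - no HTML parsing needed.
data = json.loads(response_text)
for match in data['matches']:
    print(match['home'], match['score'][0], '-', match['score'][1], match['away'])
```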
I found that it reads the data from
http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000
but that endpoint returns XML, so it still needs BeautifulSoup (or lxml) to parse the data.
import requests
from bs4 import BeautifulSoup as BS
url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
r = requests.get(url)
soup = BS(r.text, 'html.parser')
all_items = soup.find_all('h')
for item in all_items:
    values = item.text.split('^')
    #print(values)
    print(values[8], values[11])
    print(values[10], values[12])
    print('---')
Result:
Portland Trail Blazers 120
Oklahoma City Thunder 131
---
Houston Rockets 137
Boston Celtics 112
---
Philadelphia 76ers 115
Dallas Mavericks 118
---
Connecticut Sun 89
Washington Mystics 94
---
Chicago Sky 96
Los Angeles Sparks 78
---
Seattle Storm 90
Minnesota Lynx 66
---
Labas Pasauli LT 85
Balduasenaras 78
---
BC Vikings 66
Nemuno Banga KK 72
---
NRG Kiev 51
Hizhaki 76
---
Finland 97
Estonia 76
---
Synkarb 82
Sk nemenchine 79
---
CS Sfaxien (w) 51
ES Cap Bon (w) 54
---
Police De La Circulation (w) 43
Etoile Sportive Sahel (w) 39
---
CA Bizertin 63
ES Goulette 71
---
JS Manazeh 77
AS Hammamet 53
---
Southern Huskies 84
Canterbury Rams 98
---
Taranaki Mountainairs 99
Franklin Bulls 90
---
Chaophraya Thunder 67
Thai General Equipment 102
---
Airforce Madgoat Basketball Club 60
HiTech Bangkok City 77
---
Bizoni 82
Leningrad 75
---
chameleon 104
Leningrad 80
---
Bizoni 71
Zubuyu 57
---
Drakony 89
chameleon 79
---
Dragoni 71
Zubuyu 87
---
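The '^'-splitting at the heart of this can be checked on a hard-coded record (the record below is made up; the field indexes mirror the ones used above):

```python
# A made-up record in the same '^'-separated shape as each <h> element.
record = 'id^NBA^f2^f3^f4^f5^f6^f7^Houston Rockets^f9^Boston Celtics^137^112'

values = record.split('^')
print(values[8], values[11])   # Houston Rockets 137
print(values[10], values[12])  # Boston Celtics 112
```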

Related

Unable to do web scraping from URL using Python Alchemy

I have a script where I'm trying to web scrape data into a table, but I'm getting errors:
raise exc.with_traceback(traceback)
ValueError: No tables found
Script :
import pandas as pd
import logging
from sqlalchemy import create_engine
from urllib.parse import quote

db_connection = f'mysql://{username}:{quote(pwd)}@{DB}:{port}'
ds_connection = create_engine(db_connection)
a = pd.read_html("https://www.centralbank.ae/en/forex-eibor/exchange-rates/")
df = pd.DataFrame(a[0])
df_final = df.loc[:, ['Currency', 'Rate']]
df_final.to_sql('rate_table', db_connection, if_exists='append', index=False)
Can anyone suggest a fix?
One easy way to obtain those exchange rates would be to scrape the API accessed to retrieve information in page (check Dev Tools - network tab):
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = {
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.centralbank.ae/en/forex-eibor/exchange-rates/'
}
r = requests.post('https://www.centralbank.ae/umbraco/Surface/Exchange/GetExchangeRateAllCurrency', headers=headers)
dfs = pd.read_html(r.text)
print(dfs[0].loc[:,['Currency','Rates']])
This returns:
    Currency                 Rates
0   US Dollar                3.6725
1   Argentine Peso           0.026993
2   Australian Dollar        2.52753
3   Bangladesh Taka          0.038508
4   Bahrani Dinar            9.74293
5   Brunei Dollar            2.64095
6   Brazilian Real           0.706549
7   Botswana Pula            0.287552
8   Belarus Rouble           1.45526
9   Canadian Dollar          2.82565
10  Swiss Franc              3.83311
11  Chilean Peso             0.003884
12  Chinese Yuan - Offshore  0.536978
13  Chinese Yuan             0.538829
14  Colombian Peso           0.000832
15  Czech Koruna             0.149763
16  Danish Krone             0.496304
17  Algerian Dinar           0.025944
18  Egypt Pound              0.191775
19  Euro                     3.69096
20  GB Pound                 4.34256
21  Hongkong Dollar          0.468079
22  Hungarian Forint         0.009112
23  Indonesia Rupiah         0.000248
24  Indian Rupee             0.045976
25  Iceland Krona            0.026232
26  Jordan Dinar             5.17472
27  Japanese Yen             0.026818
28  Kenya Shilling           0.030681
29  Korean Won               0.002746
30  Kuwaiti Dinar            11.9423
31  Kazakhstan Tenge         0.007704
32  Lebanon Pound            0.002418
33  Sri Lanka Rupee          0.010201
34  Moroccan Dirham          0.353346
35  Macedonia Denar          0.059901
36  Mexican Peso             0.181874
37  Malaysia Ringgit         0.820395
38  Nigerian Naira           0.008737
39  Norwegian Krone          0.37486
40  NewZealand Dollar        2.27287
41  Omani Rial               9.53921
42  Peru Sol                 0.952659
43  Philippine Piso          0.065562
44  Pakistan Rupee           0.017077
45  Polish Zloty             0.777446
46  Qatari Riyal             1.00254
47  Serbian Dinar            0.031445
48  Russia Rouble            0.06178
49  Saudi Riyal              0.977847
50  Sudanese Pound           0.006479
51  Swedish Krona            0.347245
52  Singapore Dollar         2.64038
53  Thai Baht                0.102612
54  Tunisian Dinar           1.1505
55  Turkish Lira             0.20272
56  Trin Tob Dollar          0.541411
57  Taiwan Dollar            0.121961
58  Tanzania Shilling        0.001575
59  Uganda Shilling          0.000959
60  Vietnam Dong             0.000157
61  Yemen Rial               0.01468
62  South Africa Rand        0.216405
63  Zambian Kwacha           0.227752
64  Azerbaijan manat         2.16157
65  Bulgarian lev            1.8873
66  Croatian kuna            0.491344
67  Ethiopian birr           0.069656
68  Iraqi dinar              0.002516
69  Israeli new shekel       1.12309
70  Libyan dinar             0.752115
71  Mauritian rupee          0.079837
72  Romanian leu             0.755612
73  Syrian pound             0.001462
74  Turkmen manat            1.05079
75  Uzbekistani som          0.000336

Web scraping a table through multiple pages with a single link

I am trying to web scrape a table on a webpage as part of an assignment using Python. I want to scrape all 618 records of the table which are scattered across 13 pages in the same URL. However, my program only scrapes the first page of the table and its records. The URL is in my code, which can be found below:
from bs4 import BeautifulSoup as bs
import requests as r
base_URL = 'https://www.nba.com/players'
def scrape_webpage(URL):
    player_names = []
    page = r.get(URL)
    print(f'{page.status_code}')
    soup = bs(page.content, 'html.parser')
    raw_player_names = soup.find_all('div', class_='flex flex-col lg:flex-row')
    for name in raw_player_names:
        player_names.append(name.get_text().strip())
    print(player_names)
scrape_webpage(base_URL)
The player data is embedded in a <script> element in the page. You can decode it with this example:
import re
import json
import requests
import pandas as pd
url = "https://www.nba.com/players"
data = re.search(r'({"props":.*})', requests.get(url).text).group(0)
data = json.loads(data)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data["props"]["pageProps"]["players"])
print(df.head().to_markdown())
Prints:
|    | PERSON_ID | PLAYER_LAST_NAME | PLAYER_FIRST_NAME | PLAYER_SLUG | TEAM_ID | TEAM_SLUG | IS_DEFUNCT | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | JERSEY_NUMBER | POSITION | HEIGHT | WEIGHT | COLLEGE | COUNTRY | DRAFT_YEAR | DRAFT_ROUND | DRAFT_NUMBER | ROSTER_STATUS | FROM_YEAR | TO_YEAR | PTS | REB | AST | STATS_TIMEFRAME | PLAYER_LAST_INITIAL | HISTORIC |
|---:|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1630173 | Achiuwa | Precious | precious-achiuwa | 1610612761 | raptors | 0 | Toronto | Raptors | TOR | 5 | F | 6-8 | 225 | Memphis | Nigeria | 2020 | 1 | 20 | 1 | 2020 | 2021 | 9.1 | 6.5 | 1.1 | Season | A | False |
| 1 | 203500 | Adams | Steven | steven-adams | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 4 | C | 6-11 | 265 | Pittsburgh | New Zealand | 2013 | 1 | 12 | 1 | 2013 | 2021 | 6.9 | 10 | 3.4 | Season | A | False |
| 2 | 1628389 | Adebayo | Bam | bam-adebayo | 1610612748 | heat | 0 | Miami | Heat | MIA | 13 | C-F | 6-9 | 255 | Kentucky | USA | 2017 | 1 | 14 | 1 | 2017 | 2021 | 19.1 | 10.1 | 3.4 | Season | A | False |
| 3 | 1630583 | Aldama | Santi | santi-aldama | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 7 | F-C | 6-11 | 215 | Loyola-Maryland | Spain | 2021 | 1 | 30 | 1 | 2021 | 2021 | 4.1 | 2.7 | 0.7 | Season | A | False |
| 4 | 200746 | Aldridge | LaMarcus | lamarcus-aldridge | 1610612751 | nets | 0 | Brooklyn | Nets | BKN | 21 | C-F | 6-11 | 250 | Texas-Austin | USA | 2006 | 1 | 2 | 1 | 2006 | 2021 | 12.9 | 5.5 | 0.9 | Season | A | False |
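The extraction trick used above (pulling the JSON blob out of a <script> tag with a regex, then json.loads) can be sketched on a toy page source:

```python
import re
import json

# Toy stand-in for a page whose data lives in an embedded <script> JSON blob.
html = '<script>{"props": {"pageProps": {"players": [{"PLAYER_LAST_NAME": "Adams"}]}}}</script>'

# The regex grabs everything from the opening {"props": to the last closing brace.
data = json.loads(re.search(r'({"props":.*})', html).group(0))
print(data["props"]["pageProps"]["players"][0]["PLAYER_LAST_NAME"])  # Adams
```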

Beautifulsoup Python loops

I have this code that returns None for each row, can someone help me?
from bs4 import BeautifulSoup
import requests
import pandas as pd
website = 'https://www.bloodyelbow.com/22198483/comprehensive-list-of-ufc-fighters-who-have-tested-positive-for-covid-19'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find('table',{'class':'p-data-table'}).find('tbody').find_all('tr')
name=[]
reported_date=[]
card=[]
card_date=[]
opponent=[]
resolution=[]
for result in results:
    print(name.append(result.find_all('td')[0].get_text()))
You can use pandas directly and keep all columns except the last two:
import pandas
website = "https://www.bloodyelbow.com/22198483/comprehensive-list-of-ufc-fighters-who-have-tested-positive-for-covid-19"
df = pandas.read_html(website)[0].iloc[:, :-2]
print(df.to_string())
Output (truncated):
Fighter Reported Card Card Date Opponent Resolution
0 Rani Yahya 7/31/2021 UFC Vegas 33 7/31/2021 Kyung Ho Kang Fight scratched
1 Amanda Nunes 7/29/2021 UFC 265 8/7/2021 Julianna Pena Fight scratched
2 Amanda Ribas 5/23/2021 UFC Vegas 28 6/5/2021 Angela Hill Fight scratched
3 Jack Hermansson 5/19/2021 UFC 262 5/17/2021 Edmen Shahbazyan Rescheduled for UFC Vegas 27 - May 22
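The iloc[:, :-2] slice above keeps every row and drops the last two columns; a quick demo on an invented frame:

```python
import pandas as pd

# Column names here are made up purely for the demo.
df = pd.DataFrame([[1, 2, 3, 4]], columns=['Fighter', 'Reported', 'extra1', 'extra2'])

# All rows, all columns except the final two.
trimmed = df.iloc[:, :-2]
print(list(trimmed.columns))  # ['Fighter', 'Reported']
```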

How to exclude certain rows in a table using BeautifulSoup?

The code works fine; however, the table at the URL I'm fetching repeats its header row throughout the body, and I'm not sure how to remove those rows. I'm loading the data into BigQuery, and certain characters aren't allowed.
URL = 'https://www.basketball-reference.com/leagues/NBA_2020_games-august.html'
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(URL)
soup = BeautifulSoup(driver.page_source,'html')
driver.quit()
tables = soup.find_all('table',{"id":["schedule"]})
table = tables[0]
tab_data = [[cell.text for cell in row.find_all(["th", "td"])]
            for row in table.find_all("tr")]
json_string = ''
headers = [col.replace('.', '_').replace('/', '_').replace('%', 'pct').replace('3', '_3').replace('(', '_').replace(')', '_') for col in tab_data[1]]
for row in tab_data[2:]:
    json_string += json.dumps(dict(zip(headers, row))) + '\n'
with open('example.json', 'w') as f:
    f.write(json_string)
print(json_string)
You can restrict the tr selection to rows with no class attribute, so you don't pick up the duplicated headers.
The following code creates a dataframe from the table:
from bs4 import BeautifulSoup
import requests
import pandas as pd
res = requests.get("https://www.basketball-reference.com/leagues/NBA_2020_games-august.html")
soup = BeautifulSoup(res.text, "html.parser")
table = soup.find("div", {"id":"div_schedule"}).find("table")
columns = [i.get_text() for i in table.find("thead").find_all('th')]
data = []
for tr in table.find('tbody').find_all('tr', class_=False):
    temp = [tr.find('th').get_text(strip=True)]
    temp.extend([i.get_text(strip=True) for i in tr.find_all("td")])
    data.append(temp)
df = pd.DataFrame(data, columns = columns)
print(df)
Output:
Date Start (ET) Visitor/Neutral PTS Home/Neutral PTS     Attend. Notes
0 Sat, Aug 1, 2020 1:00p Miami Heat 125 Denver Nuggets 105 Box Score
1 Sat, Aug 1, 2020 3:30p Utah Jazz 94 Oklahoma City Thunder 110 Box Score
2 Sat, Aug 1, 2020 6:00p New Orleans Pelicans 103 Los Angeles Clippers 126 Box Score
3 Sat, Aug 1, 2020 7:00p Philadelphia 76ers 121 Indiana Pacers 127 Box Score
4 Sat, Aug 1, 2020 8:30p Los Angeles Lakers 92 Toronto Raptors 107 Box Score
.. ... ... ... ... ... ... ... .. ... ...
75 Thu, Aug 13, 2020 Portland Trail Blazers Brooklyn Nets
76 Fri, Aug 14, 2020 Philadelphia 76ers Houston Rockets
77 Fri, Aug 14, 2020 Miami Heat Indiana Pacers
78 Fri, Aug 14, 2020 Oklahoma City Thunder Los Angeles Clippers
79 Fri, Aug 14, 2020 Denver Nuggets Toronto Raptors
[80 rows x 10 columns]
To insert into BigQuery, you can load the JSON directly, or push the dataframe with DataFrame.to_gbq: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_gbq.html

Read excel file in spyder but some data missing

I am new to Python and am trying to read my Excel file in Spyder (Anaconda). However, when I run it, some rows are missing, replaced with '...'. I have seven columns and 100 rows in my Excel file. The column arrangement also looks odd.
This is my code:
import pandas as pd
print(" Comparing within 100 Airline \n\n")
def view():
    airlines = pd.ExcelFile('Airline_final.xlsx')
    df1 = pd.read_excel("Airline_final.xlsx", sheet_name=2)
    print("\n\n 1: list of all Airlines \n")
    print(df1)
view()
Here is what I get:
18 #051 Cubana Cuba
19 #003 Aigle Azur France
20 #011 Air Corsica France
21 #012 Air France France
22 #019 Air Mediterranee France
23 #050 Corsair France
24 #072 HOP France
25 #087 Joon France
26 #006 Air Berlin Germany
27 #049 Condor Flugdienst Germany
28 #057 Eurowings Germany
29 #064 Germania Germany
.. ... ... ...
70 #018 Air Mandalay Myanmar
71 #020 Air KBZ Myanmar
72 #067 Golden Myanmar Airlines Myanmar
73 #017 Air Koryo North Korea
74 #080 Jetstar Asia Singapore
75 #036 Binter Canarias Spain
76 #040 Canaryfly Spain
77 #073 Iberia and Iberia Express Spain
To print the whole dataframe use:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df1)
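Note that pd.option_context restores the previous options when the with block exits; a quick check:

```python
import pandas as pd

# Options set via option_context apply only inside the with block.
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(pd.get_option('display.max_rows'))  # None -> print(df1) shows every row
print(pd.get_option('display.max_rows'))      # the previous limit is restored here
```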
