Scrapy: unable to locate table or scrape data in table - python

For a group project, I am trying to scrape the Salaries table on https://www.basketball-reference.com/players/a/allenra02.html.
I have tried multiple CSS and XPath selectors, such as
#all_salaries > tbody > tr:nth-child(1)
#all_salaries > tbody
#all_salaries > tbody > tr:nth-child(1) > td.right
#all_salaries
//*[@id="all_salaries"]/tbody/tr[1]/td[3]
//*[@id="all_salaries"]/tbody
//*[@id="all_salaries"]
The code looks as follows:
def start_requests(self):
    start_urls = ['https://www.basketball-reference.com/players/a/allenra02.html']
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.parse_player)

def parse_player(self, response):
    print(response.css('#all_salaries > tbody'))
I tried printing it out, but it keeps returning an empty list.
Other tables seem fine, except this one.
EDIT:
My final solution looks something like this:
regex = re.compile(r'<!--(.*)-->', re.DOTALL)
salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').get()
if salaries:
    salaries = response.xpath('//*[@id="all_all_salaries"]/comment()').re(regex)[0]
    salaries_sel = scrapy.Selector(text=salaries, type="html")
    all_salaries = salaries_sel.css('#all_salaries > tbody > tr').extract()
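From there the individual cells can be read with ordinary selectors. A small sketch reusing the salaries_sel built above (the exact cell layout of the table is an assumption here):
for row in salaries_sel.css('#all_salaries > tbody > tr'):
    # collect every non-empty text node in the row (season, team, league, salary)
    print([t.strip() for t in row.css('::text').getall() if t.strip()])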

You can use BeautifulSoup to pull out the comments and then parse the table with pandas. I chose to pull out only the salary table, but you can get all the tables in the comments this way (see the sketch after the output below).
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = "https://www.basketball-reference.com/players/a/allenra02.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = []
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each), attrs={'id': 'all_salaries'})[0])
            break
        except ValueError:
            # this comment mentions "table" but has no table with id="all_salaries"
            continue
print(tables[0].to_string())
Output:
Season Team Lg Salary
0 1996-97 Milwaukee Bucks NBA $1,785,000
1 1997-98 Milwaukee Bucks NBA $2,052,360
2 1998-99 Milwaukee Bucks NBA $2,320,000
3 1999-00 Milwaukee Bucks NBA $9,000,000
4 2000-01 Milwaukee Bucks NBA $10,130,000
5 2001-02 Milwaukee Bucks NBA $11,250,000
6 2002-03 Milwaukee Bucks NBA $12,375,000
7 2003-04 Seattle SuperSonics NBA $13,500,000
8 2004-05 Seattle SuperSonics NBA $14,625,000
9 2005-06 Seattle SuperSonics NBA $13,223,140
10 2006-07 Seattle SuperSonics NBA $14,611,570
11 2007-08 Boston Celtics NBA $16,000,000
12 2008-09 Boston Celtics NBA $18,388,430
13 2009-10 Boston Celtics NBA $18,776,860
14 2010-11 Boston Celtics NBA $10,000,000
15 2011-12 Boston Celtics NBA $10,000,000
16 2012-13 Miami Heat NBA $3,090,000
17 2013-14 Miami Heat NBA $3,229,050
18 Career (may be incomplete) NaN $184,356,410
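If you want every table hidden in comments, not just the salaries, the same loop can be widened a little. A sketch along the lines of the code above (pd.read_html returns a list of DataFrames per comment and raises ValueError when it finds none):
all_tables = []
for each in comments:
    if 'table' in str(each):
        try:
            # no id filter: collect every table found inside this comment
            all_tables.extend(pd.read_html(str(each)))
        except ValueError:
            continue
print(len(all_tables))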

That's because the table is commented out in the original page source and only added later via JavaScript. Have a look here for how to get the comment contents: Scrapy: Extract commented (hidden) content

Related

Web scraping a table through multiple pages with a single link

I am trying to web scrape a table on a webpage as part of an assignment using Python. I want to scrape all 618 records of the table which are scattered across 13 pages in the same URL. However, my program only scrapes the first page of the table and its records. The URL is in my code, which can be found below:
from bs4 import BeautifulSoup as bs
import requests as r
base_URL = 'https://www.nba.com/players'
def scrape_webpage(URL):
    player_names = []
    page = r.get(URL)
    print(f'{page.status_code}')
    soup = bs(page.content, 'html.parser')
    raw_player_names = soup.find_all('div', class_='flex flex-col lg:flex-row')
    for name in raw_player_names:
        player_names.append(name.get_text().strip())
    print(player_names)

scrape_webpage(base_URL)
The player data is embedded inside a <script> element in the page. You can decode it with this example:
import re
import json
import requests
import pandas as pd
url = "https://www.nba.com/players"
data = re.search(r'({"props":.*})', requests.get(url).text).group(0)
data = json.loads(data)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data["props"]["pageProps"]["players"])
print(df.head().to_markdown())
Prints:
|    | PERSON_ID | PLAYER_LAST_NAME | PLAYER_FIRST_NAME | PLAYER_SLUG | TEAM_ID | TEAM_SLUG | IS_DEFUNCT | TEAM_CITY | TEAM_NAME | TEAM_ABBREVIATION | JERSEY_NUMBER | POSITION | HEIGHT | WEIGHT | COLLEGE | COUNTRY | DRAFT_YEAR | DRAFT_ROUND | DRAFT_NUMBER | ROSTER_STATUS | FROM_YEAR | TO_YEAR | PTS | REB | AST | STATS_TIMEFRAME | PLAYER_LAST_INITIAL | HISTORIC |
|----|-----------|------------------|-------------------|-------------|---------|-----------|------------|-----------|-----------|-------------------|---------------|----------|--------|--------|---------|---------|------------|-------------|--------------|---------------|-----------|---------|-----|-----|-----|-----------------|---------------------|----------|
| 0 | 1630173 | Achiuwa | Precious | precious-achiuwa | 1610612761 | raptors | 0 | Toronto | Raptors | TOR | 5 | F | 6-8 | 225 | Memphis | Nigeria | 2020 | 1 | 20 | 1 | 2020 | 2021 | 9.1 | 6.5 | 1.1 | Season | A | False |
| 1 | 203500 | Adams | Steven | steven-adams | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 4 | C | 6-11 | 265 | Pittsburgh | New Zealand | 2013 | 1 | 12 | 1 | 2013 | 2021 | 6.9 | 10 | 3.4 | Season | A | False |
| 2 | 1628389 | Adebayo | Bam | bam-adebayo | 1610612748 | heat | 0 | Miami | Heat | MIA | 13 | C-F | 6-9 | 255 | Kentucky | USA | 2017 | 1 | 14 | 1 | 2017 | 2021 | 19.1 | 10.1 | 3.4 | Season | A | False |
| 3 | 1630583 | Aldama | Santi | santi-aldama | 1610612763 | grizzlies | 0 | Memphis | Grizzlies | MEM | 7 | F-C | 6-11 | 215 | Loyola-Maryland | Spain | 2021 | 1 | 30 | 1 | 2021 | 2021 | 4.1 | 2.7 | 0.7 | Season | A | False |
| 4 | 200746 | Aldridge | LaMarcus | lamarcus-aldridge | 1610612751 | nets | 0 | Brooklyn | Nets | BKN | 21 | C-F | 6-11 | 250 | Texas-Austin | USA | 2006 | 1 | 2 | 1 | 2006 | 2021 | 12.9 | 5.5 | 0.9 | Season | A | False |
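The 13-page pagination on nba.com appears to be purely client-side, so the embedded JSON should already hold every record. A quick check and export (the output filename here is just an example, not part of the original answer):
print(len(df))                         # total number of player records in the blob
df.to_csv("players.csv", index=False)  # example filename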

Beautifulsoup Python loops

I have this code that returns None for each row, can someone help me?
from bs4 import BeautifulSoup
import requests
import pandas as pd
website = 'https://www.bloodyelbow.com/22198483/comprehensive-list-of-ufc-fighters-who-have-tested-positive-for-covid-19'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find('table',{'class':'p-data-table'}).find('tbody').find_all('tr')
name=[]
reported_date=[]
card=[]
card_date=[]
opponent=[]
resolution=[]
for result in results:
    print(name.append(result.find_all('td')[0].get_text()))
You can use pandas directly and get all the columns up to the last two:
import pandas
website = "https://www.bloodyelbow.com/22198483/comprehensive-list-of-ufc-fighters-who-have-tested-positive-for-covid-19"
df = pandas.read_html(website)[0].iloc[:, :-2]
print(df.to_string())
Output (truncated):
Fighter Reported Card Card Date Opponent Resolution
0 Rani Yahya 7/31/2021 UFC Vegas 33 7/31/2021 Kyung Ho Kang Fight scratched
1 Amanda Nunes 7/29/2021 UFC 265 8/7/2021 Julianna Pena Fight scratched
2 Amanda Ribas 5/23/2021 UFC Vegas 28 6/5/2021 Angela Hill Fight scratched
3 Jack Hermansson 5/19/2021 UFC 262 5/17/2021 Edmen Shahbazyan Rescheduled for UFC Vegas 27 - May 22
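As for why the original loop prints None: list.append() always returns None, so print(name.append(...)) prints None even when the append itself succeeds. A corrected version of that loop (same variable names as in the question) would be:
for result in results:
    name.append(result.find_all('td')[0].get_text())
print(name)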

Scraping basketball results and associating the related competition with each match

I want to scrape basketball results from this webpage:
http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29
I created the code using bs4 and requests:
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
    r = session.get(url, timeout=30)
    soup = BeautifulSoup(r.content, 'html.parser')
The issue I face is how to attach the competition to each row I scrape.
I want to create a table where each row is a match result (competition, home team, away team, score...).
Selenium
Try this (selenium):
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
import time
res =[]
url = 'http://www.nowgoal.group/nba/Schedule.aspx?f=ft2&date=2020-07-29'
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
driver.get(url)
time.sleep(2)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page, 'html.parser')
span = soup.select_one('span#live')
tables = span.select('table')
for table in tables:
    if table.get('class'):
        competition = table.select_one('a b font').text
    else:
        for home, away in zip(table.select('tr.b1')[0::2], table.select('tr.b1')[1::2]):
            res.append([f"{competition}",
                        f"{home.select_one('td a').text}",
                        f"{away.select_one('td a').text}",
                        f"{home.select_one('td.red').text}",
                        f"{away.select_one('td.red').text}",
                        f"{home.select_one('td.odds1').text}",
                        f"{away.select_one('td.odds1').text}",
                        f"{home.select('td font')[0].text}/{home.select('td font')[1].text}",
                        f"{away.select('td font')[0].text}/{away.select('td font')[1].text}",
                        f"{home.select('td div a')[-1].get('href')}"])
df = pd.DataFrame(res, columns=['competition',
'home',
'away',
'home score',
'away score',
'home odds',
'away odds',
'home ht',
'away ht',
'odds'
])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score home odds away odds home ht away ht odds
0 National Basketball Association Portland Trail Blazers Oklahoma City Thunder 120 131 2.72 1.45 50/70 63/68 http://data.nowgoal.group/OddsCompBasket/387520.html
1 National Basketball Association Houston Rockets Boston Celtics 137 112 1.49 2.58 77/60 60/52 http://data.nowgoal.group/OddsCompBasket/387521.html
2 National Basketball Association Philadelphia 76ers Dallas Mavericks 115 118 2.04 1.76 39/64 48/55 http://data.nowgoal.group/OddsCompBasket/387522.html
3 Women’s National Basketball Association Connecticut Sun Washington Mystics 89 94 2.28 1.59 52/37 48/46 http://data.nowgoal.group/OddsCompBasket/385886.html
4 Women’s National Basketball Association Chicago Sky Los Angeles Sparks 96 78 2.72 1.43 40/56 36/42 http://data.nowgoal.group/OddsCompBasket/385618.html
5 Women’s National Basketball Association Seattle Storm Minnesota Lynx 90 66 1.21 4.19 41/49 35/31 http://data.nowgoal.group/OddsCompBasket/385884.html
6 Friendly Competition Labas Pasauli LT Balduasenaras 85 78 52/33 31/47 http://data.nowgoal.group/OddsCompBasket/387769.html
7 Friendly Competition BC Vikings Nemuno Banga KK 66 72 29/37 30/42 http://data.nowgoal.group/OddsCompBasket/387771.html
8 Friendly Competition NRG Kiev Hizhaki 51 76 31/20 28/48 http://data.nowgoal.group/OddsCompBasket/387766.html
9 Friendly Competition Finland Estonia 97 76 2.77 1.40 48/49 29/47 http://data.nowgoal.group/OddsCompBasket/387740.html
10 Friendly Competition Synkarb Sk nemenchine 82 79 37/45 38/41 http://data.nowgoal.group/OddsCompBasket/387770.html
and so on....
And saves the results to Res.csv.
Requests
Try this (requests):
import pandas as pd
from bs4 import BeautifulSoup
import requests
res = []
url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('h')
for item in items:
    values = item.text.split('^')
    res.append([f'{values[1]}', f'{values[8]}', f'{values[10]}', f'{values[11]}', f'{values[12]}'])
df = pd.DataFrame(res, columns=['competition', 'home', 'away', 'home score', 'away score'])
print(df.to_string())
df.to_csv('Res.csv')
prints:
competition home away home score away score
0 NBA Portland Trail Blazers Oklahoma City Thunder 120 131
1 NBA Houston Rockets Boston Celtics 137 112
2 NBA Philadelphia 76ers Dallas Mavericks 115 118
3 WNBA Connecticut Sun Washington Mystics 89 94
4 WNBA Chicago Sky Los Angeles Sparks 96 78
5 WNBA Seattle Storm Minnesota Lynx 90 66
6 FC Labas Pasauli LT Balduasenaras 85 78
7 FC BC Vikings Nemuno Banga KK 66 72
8 FC NRG Kiev Hizhaki 51 76
And saves the results to Res.csv.
If you do not want the index column, simply add index=False to df.to_csv('Res.csv'), so it becomes df.to_csv('Res.csv', index=False).
Note on Selenium: you need selenium and geckodriver, and in this code geckodriver is loaded from c:/program/geckodriver.exe.
The Selenium version is slower, but there is no need to find the XML endpoint with DevTools.
This page uses JavaScript to load the data, but requests/BeautifulSoup can't run JavaScript.
So you have two options.
First: use Selenium to control a real web browser that can run JavaScript. This is better when the page uses complex JavaScript to generate the data, but it is slower because it has to start a browser, render the page, and run the JavaScript.
Second: use DevTools in Firefox/Chrome (Network tab, XHR filter) to find the URL that the JavaScript/AJAX (XHR) code calls to get the data from the server, and request that URL directly with requests. Often you get JSON, which converts straight to a Python list/dictionary, so you don't even need BeautifulSoup to scrape the data. This is faster, but sometimes the page uses JavaScript that is hard to replicate in Python.
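When such an endpoint returns JSON (not the case here), the whole scrape usually collapses to a couple of lines, roughly like this generic sketch (placeholder URL, not a real endpoint):
import requests

# hypothetical XHR endpoint found in the DevTools Network tab
url = 'http://example.com/api/schedule?date=2020-07-29'
data = requests.get(url).json()  # already a Python list/dict, no HTML parsing needed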
I chose the second method.
I found that the page reads its data from
http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000
but it returns XML, so it still needs BeautifulSoup (or lxml) to scrape the data.
import requests
from bs4 import BeautifulSoup as BS
url = 'http://www.nowgoal.group/GetNbaWithTimeZone.aspx?date=2020-07-29&timezone=2&kind=0&t=1596143185000'
r = requests.get(url)
soup = BS(r.text, 'html.parser')
all_items = soup.find_all('h')
for item in all_items:
    values = item.text.split('^')
    #print(values)
    print(values[8], values[11])
    print(values[10], values[12])
    print('---')
Result:
Portland Trail Blazers 120
Oklahoma City Thunder 131
---
Houston Rockets 137
Boston Celtics 112
---
Philadelphia 76ers 115
Dallas Mavericks 118
---
Connecticut Sun 89
Washington Mystics 94
---
Chicago Sky 96
Los Angeles Sparks 78
---
Seattle Storm 90
Minnesota Lynx 66
---
Labas Pasauli LT 85
Balduasenaras 78
---
BC Vikings 66
Nemuno Banga KK 72
---
NRG Kiev 51
Hizhaki 76
---
Finland 97
Estonia 76
---
Synkarb 82
Sk nemenchine 79
---
CS Sfaxien (w) 51
ES Cap Bon (w) 54
---
Police De La Circulation (w) 43
Etoile Sportive Sahel (w) 39
---
CA Bizertin 63
ES Goulette 71
---
JS Manazeh 77
AS Hammamet 53
---
Southern Huskies 84
Canterbury Rams 98
---
Taranaki Mountainairs 99
Franklin Bulls 90
---
Chaophraya Thunder 67
Thai General Equipment 102
---
Airforce Madgoat Basketball Club 60
HiTech Bangkok City 77
---
Bizoni 82
Leningrad 75
---
chameleon 104
Leningrad 80
---
Bizoni 71
Zubuyu 57
---
Drakony 89
chameleon 79
---
Dragoni 71
Zubuyu 87
---

Missing data not being scraped from Hansard

I'm trying to scrape data from Hansard, the official verbatim record of everything spoken in the UK House of Parliament. This is the precise link I'm trying to scrape; in a nutshell, I want to scrape every "mention" container on this page and the 50 pages that follow.
But I find that when my scraper is "finished," it's only collected data on 990 containers and not the full 1010. Data on 20 containers is missing, as if it's skipping a page. When I only set the page range to (0,1), it fails to collect any values. When I set it to (0,2), it collects only the first page's values. Asking it to collect data on 52 pages does not help. I thought that this was perhaps due to the fact that I wasn't giving the URLs enough time to load, so I added some delays in the scraper's crawl. That didn't solve anything.
Can anyone provide me with any insight into what I may be missing? I'd like to make sure that my scraper is collecting all available data.
pages = np.arange(0, 52)

for page in pages:
    hansard_url = "https://hansard.parliament.uk/search/Contributions?searchTerm=%22civilian%20casualties%22&startDate=01%2F01%2F1988%2000%3A00%3A00&endDate=07%2F14%2F2020%2000%3A00%3A00"
    full_url = hansard_url + "&page=" + str(page) + "&partial=true"
    page = get(full_url)
    html_soup = BeautifulSoup(page.text, 'html.parser')
    mention_containers = html_soup.find_all('div', class_="result contribution")
    time.sleep(randint(2,10))
    for mention in mention_containers:
        topic = mention.div.span.text
        topics.append(topic)
        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            houses.append("House of Lords")
        elif house == "Commons Portcullis":
            houses.append("House of Commons")
        else:
            houses.append("N/A")
        name = mention.find('div', class_="secondaryTitle").text
        names.append(name)
        date = mention.find('div', class_="").text
        dates.append(date)
    time.sleep(randint(2,10))

hansard_dataset = pd.DataFrame(
    {'Date': dates, 'House': houses, 'Speaker': names, 'Topic': topics})

print(hansard_dataset.info())
print(hansard_dataset.isnull().sum())

hansard_dataset.to_csv('hansard.csv', index=False, sep="#")
Any help in helping me solve this problem is appreciated.
The server returns an empty container on page 48, so the total is 1000 results from pages 1 to 51 (inclusive):
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://hansard.parliament.uk/search/Contributions'
params = {
    'searchTerm': 'civilian casualties',
    'startDate': '01/01/1988 00:00:00',
    'endDate': '07/14/2020 00:00:00',
    'partial': 'True',
    'page': 1,
}
all_data = []
for page in range(1, 52):
    params['page'] = page
    print('Page {}...'.format(page))
    soup = BeautifulSoup(requests.get(url, params=params).content, 'html.parser')
    mention_containers = soup.find_all('div', class_="result contribution")
    if not mention_containers:
        print('Empty container!')
    for mention in mention_containers:
        topic = mention.div.span.text
        house = mention.find("img")["alt"]
        if house == "Lords Portcullis":
            house = "House of Lords"
        elif house == "Commons Portcullis":
            house = "House of Commons"
        else:
            house = "N/A"
        name = mention.find('div', class_="secondaryTitle").text
        date = mention.find('div', class_="").get_text(strip=True)
        all_data.append({'Date': date, 'House': house, 'Speaker': name, 'Topic': topic})
df = pd.DataFrame(all_data)
print(df)
Prints:
...
Page 41...
Page 42...
Page 43...
Page 44...
Page 45...
Page 46...
Page 47...
Page 48...
Empty container! # <--- here is the server error
Page 49...
Page 50...
Page 51...
Date House Speaker Topic
0 14 July 2014 House of Lords Baroness Warsi Gaza debate in Lords Chamber
1 3 March 2016 House of Lords Lord Touhig Armed Forces Bill debate in Grand Committee
2 2 December 2015 House of Commons Mr David Cameron ISIL in Syria debate in Commons Chamber
3 3 March 2016 House of Lords Armed Forces Bill debate in Grand Committee
4 27 April 2016 House of Lords Armed Forces Bill debate in Lords Chamber
.. ... ... ... ...
995 18 June 2003 House of Lords Lord Craig of Radley Defence Policy debate in Lords Chamber
996 7 September 2004 House of Lords Lord Rea Iraq debate in Lords Chamber
997 14 February 1994 House of Lords The Parliamentary Under-Secretary of State, Mi... Landmines debate in Lords Chamber
998 12 January 2000 House of Commons The Minister of State, Foreign and Commonwealt... Serbia And Kosovo debate in Westminster Hall
999 26 February 2003 House of Lords Lord Rea Iraq debate in Lords Chamber
[1000 rows x 4 columns]
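If the empty page is only an intermittent server error, you could retry it a few times before moving on. A minimal sketch (assuming the emptiness is transient) of what the fetch inside the page loop might become:
import time

for attempt in range(3):
    soup = BeautifulSoup(requests.get(url, params=params).content, 'html.parser')
    mention_containers = soup.find_all('div', class_="result contribution")
    if mention_containers:
        break
    time.sleep(2)  # brief pause before retrying the same page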

How to pull only certain fields with BeautifulSoup

I'm trying to print all the fields that have England in them. The current code I have prints all the nationalities into a txt file for me, but I want just the England fields to print. The page I'm pulling from is https://www.premierleague.com/players
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.premierleague.com/players")
c=r.content
soup=BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class":"playerCountry"}):
    players.write(playerCountry.text.strip())
    players.write("\n")
Just check whether it is not equal to 'England', and if so, skip to the next item in the list:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.premierleague.com/players")
c=r.content
soup=BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class":"playerCountry"}):
    if playerCountry.text.strip() != 'England':
        continue
    players.write(playerCountry.text.strip())
    players.write("\n")
Or, you could just use pandas.read_html() and a couple of lines of code:
import pandas as pd
df = pd.read_html("https://www.premierleague.com/players")[0]
print(df.loc[df['Nationality'] != 'England'])
Prints:
Player Position Nationality
2 Charlie Adam Midfielder Scotland
3 Adrián Goalkeeper Spain
4 Adrien Silva Midfielder Portugal
5 Ibrahim Afellay Midfielder Netherlands
6 Benik Afobe Forward The Democratic Republic Of Congo
7 Sergio Agüero Forward Argentina
9 Soufyan Ahannach Midfielder Netherlands
10 Ahmed Hegazi Defender Egypt
11 Nathan Aké Defender Netherlands
14 Toby Alderweireld Defender Belgium
15 Aleix García Midfielder Spain
17 Ali Gabr Defender Egypt
18 Allan Nyom Defender Cameroon
19 Allan Souza Midfielder Brazil
20 Joe Allen Midfielder Wales
22 Marcos Alonso Defender Spain
23 Paulo Alves Midfielder Portugal
24 Daniel Amartey Midfielder Ghana
25 Jordi Amat Defender Spain
27 Ethan Ampadu Defender Wales
28 Nordin Amrabat Forward Morocco
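Note that the df['Nationality'] != 'England' filter above keeps everyone except England players; to match the original goal of printing only the England rows, flip the comparison:
print(df.loc[df['Nationality'] == 'England'])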
