I want to extract the data from the table on this webpage: http://stats.nba.com/league/team/#!/advanced/ . Unfortunately, the following code gives me nothing, because the soup (see below) contains no "td" elements, even though there are plenty of "td"s to be found when inspecting the webpage.
On the other hand, running the same code for the website "http://espn.go.com/nba/statistics/team/_/stat/offense-per-game" does give me what I want.
Why does the code work for one site and not the other, and is there anything I can do to get the data I want from the first site?
import requests
from bs4 import BeautifulSoup
url="http://stats.nba.com/league/team/#!/advanced/"
r=requests.get(url)
soupNBAadv=BeautifulSoup(r.content)
tds=soupNBAadv.find_all("td")
for i in tds:
    print i.text
You don't need BeautifulSoup here at all. The table you see in the browser is built from an additional GET request to an endpoint that returns a JSON response; simulate that request instead:
import requests

url = "http://stats.nba.com/league/team/#!/advanced/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}

with requests.Session() as session:
    session.headers = headers
    session.get(url, headers=headers)

    params = {
        'DateFrom': '',
        'DateTo': '',
        'GameScope': '',
        'GameSegment': '',
        'LastNGames': '0',
        'LeagueID': '00',
        'Location': '',
        'MeasureType': 'Advanced',
        'Month': '0',
        'OpponentTeamID': '0',
        'Outcome': '',
        'PaceAdjust': 'N',
        'PerMode': 'Totals',
        'Period': '0',
        'PlayerExperience': '',
        'PlayerPosition': '',
        'PlusMinus': 'N',
        'Rank': 'N',
        'Season': '2014-15',
        'SeasonSegment': '',
        'SeasonType': 'Regular Season',
        'StarterBench': '',
        'VsConference': '',
        'VsDivision': ''
    }

    response = session.get('http://stats.nba.com/stats/leaguedashteamstats', params=params)
    results = response.json()

    headers = results['resultSets'][0]['headers']
    rows = results['resultSets'][0]['rowSet']
    for row in rows:
        print(dict(zip(headers, row)))
Prints:
{u'MIN': 2074.0, u'TEAM_ID': 1610612737, u'TEAM_NAME': u'Atlanta Hawks', u'AST_PCT': 0.687, u'CFPARAMS': u'Atlanta Hawks', u'EFG_PCT': 0.531, u'DEF_RATING': 99.4, u'NET_RATING': 7.5, u'PIE': 0.556, u'AST_TO': 1.81, u'TS_PCT': 0.57, u'GP': 43, u'L': 8, u'OREB_PCT': 0.21, u'REB_PCT': 0.488, u'W': 35, u'W_PCT': 0.814, u'DREB_PCT': 0.743, u'CFID': 10, u'PACE': 96.17, u'TM_TOV_PCT': 0.149, u'AST_RATIO': 19.9, u'OFF_RATING': 106.9}
{u'MIN': 1897.0, u'TEAM_ID': 1610612738, u'TEAM_NAME': u'Boston Celtics', u'AST_PCT': 0.635, u'CFPARAMS': u'Boston Celtics', u'EFG_PCT': 0.494, u'DEF_RATING': 104.0, u'NET_RATING': -2.7, u'PIE': 0.489, u'AST_TO': 1.73, u'TS_PCT': 0.527, u'GP': 39, u'L': 26, u'OREB_PCT': 0.245, u'REB_PCT': 0.496, u'W': 13, u'W_PCT': 0.333, u'DREB_PCT': 0.747, u'CFID': 10, u'PACE': 99.12, u'TM_TOV_PCT': 0.145, u'AST_RATIO': 18.5, u'OFF_RATING': 101.3}
...
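If a table is more convenient than a stream of dicts, the same headers/rowSet pair can be loaded into pandas; a small optional sketch, assuming pandas is installed and reusing the headers and rows variables from the code above:
import pandas as pd

# 'headers' is the column list and 'rows' the rowSet parsed from the JSON above
df = pd.DataFrame(rows, columns=headers)
print(df[['TEAM_NAME', 'W', 'L', 'OFF_RATING', 'DEF_RATING', 'NET_RATING']])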
Selenium-based solution:
from pprint import pprint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('http://stats.nba.com/league/team/#!/advanced/')
wait = WebDriverWait(driver, 5)
# wait for the table to load
table = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table-responsive')))
stats = []
headers = [th.text for th in table.find_elements_by_tag_name('th')]
for tr in table.find_elements_by_xpath('//tr[@data-ng-repeat]'):
    cells = [td.text for td in tr.find_elements_by_tag_name('td')]
    stats.append(dict(zip(headers, cells)))
pprint(stats)
driver.quit()
Prints:
[{u'AST Ratio': u'19.8',
u'AST%': u'68.1',
u'AST/TO': u'1.84',
u'DREB%': u'74.3',
u'DefRtg': u'100.2',
u'GP': u'51',
u'MIN': u'2458',
u'NetRtg': u'7.4',
u'OREB%': u'21.0',
u'OffRtg': u'107.7',
u'PACE': u'96.12',
u'PIE': u'55.3',
u'REB%': u'48.8',
u'TO Ratio': u'14.6',
u'TS%': u'57.2',
u'Team': u'Atlanta Hawks',
u'eFG%': u'53.4'},
...
{u'AST Ratio': u'18.6',
u'AST%': u'62.8',
u'AST/TO': u'1.65',
u'DREB%': u'77.8',
u'DefRtg': u'100.2',
u'GP': u'52',
u'MIN': u'2526',
u'NetRtg': u'3.5',
u'OREB%': u'24.9',
u'OffRtg': u'103.7',
u'PACE': u'95.75',
u'PIE': u'53.4',
u'REB%': u'51.8',
u'TO Ratio': u'15.4',
u'TS%': u'54.4',
u'Team': u'Washington Wizards',
u'eFG%': u'50.9'}]
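As a small follow-up, the stats list of dicts collected by the Selenium version can be written to CSV with the standard library; a minimal sketch (Python 3), assuming headers matches the keys of each row dict:
import csv

# one CSV column per table header, one row per scraped team
with open('nba_advanced.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    writer.writerows(stats)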
The reason you are not getting the data from the first url with requests.get() is that the data is fetched from the server via an ajax call. The ajax call goes to http://stats.nba.com/stats/leaguedashteamstats, and you have to pass some parameters with it.
A plain requests.get() only returns what appears in the page source. In your browser press ctrl+u to view the page source and you can see that there is no data in it.
In Chrome, use the developer tools and look in the Network tab to see what requests the page is making. In Firefox you can use Firebug and look in the Net tab.
In the case of the second url, the page source is already populated with data (view the page source to verify), so you can get it with a plain GET request to that url.
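A quick way to see the difference from Python itself is to count the td tags in each raw response; this is an illustrative check only (the NBA page may also require a User-Agent header, and both URLs may have changed since this was written):
import requests

for url in ('http://stats.nba.com/league/team/#!/advanced/',
            'http://espn.go.com/nba/statistics/team/_/stat/offense-per-game'):
    html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
    # the NBA page source contains no table rows; ESPN's server-rendered page does
    print(url, html.count('<td'))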
alecxe's answer demonstrates how to get the data from the first url.
Related
I'm trying to open the following UK Parliament website from my Colab environment, but I haven't been able to make it work without 403 errors; the site seems very strict about request headers. Following several answers to previous similar questions, I've tried a much more extended set of headers, but it still does not work.
Is there any way to make this work?
from urllib.request import urlopen, Request
url = "https://members.parliament.uk/members/commons"
headers={'User-Agent': 'Mozilla/5.0'}
request= Request(url=url, headers=headers)
response = urlopen(request)
data = response.read()
The longer header is this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}
The website is under Cloudflare protection, as Andrew Ryan has already stated along with a possible solution. I also tried cloudscraper, but it didn't work and I was still getting 403, so I used Playwright with bs4 and now it works like a charm.
Example:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

data = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto('https://members.parliament.uk/members/commons')
    page.wait_for_timeout(5000)
    loc = page.locator('div[class="card-list card-list-2-col"]')
    html = loc.inner_html()
    #print(html)
    soup = BeautifulSoup(html, "lxml")
    #print(soup.prettify())
    for card in soup.select('.card.card-member'):
        d = {
            'Name': card.select_one('.primary-info').get_text(strip=True)
        }
        data.append(d)

print(data)
Output:
[{'Name': 'Ms Diane Abbott'}, {'Name': 'Debbie Abrahams'}, {'Name': 'Nigel Adams'}, {'Name': 'Bim Afolami'}, {'Name': 'Adam Afriyie'}, {'Name': 'Nickie Aiken'}, {'Name': 'Peter Aldous'}, {'Name': 'Rushanara Ali'}, {'Name': 'Tahir Ali'}, {'Name': 'Lucy Allan'}, {'Name': 'Dr Rosena Allin-Khan'}, {'Name': 'Mike Amesbury'}, {'Name': 'Fleur Anderson'}, {'Name': 'Lee Anderson'}, {'Name': 'Stuart Anderson'}, {'Name': 'Stuart Andrew'}, {'Name': 'Caroline Ansell'}, {'Name': 'Tonia Antoniazzi'}, {'Name': 'Edward Argar'}, {'Name': 'Jonathan Ashworth'}]
I want to get the links in the tbody tag in order to download the PDFs, but the tbody does not appear in the HTML source code I retrieve. How can I get those links? Here's the website: https://www.nexeoplastics.com/types/plastics-product-finder?s=TPU
Here is my code:
from bs4 import BeautifulSoup
import requests
r=requests.get("https://www.nexeoplastics.com/types/plastics-product-finder?s=TPU")
soup=BeautifulSoup(r.content,"html.parser")
table = soup.find_all(id="maintable")
print(table)
The result only shows the thead, not the tbody. Can anyone help me get the links?
Content is provided by an ajax call - to check that, open your browser's developer tools and take a closer look at the network section. You could use selenium to render the page like a browser and work on the driver's page_source, or go with the ajax call like this:
import requests
r=requests.get("https://www.nexeoplastics.com/product-ajax-search/?s=TPU&_=1657516141911").json()
for item in r['data']:
    print(item)
Output
{'id': '<label class="checkbox-custom"><input type="checkbox" id="row-checkbox"><span class="checkmark"></span></label>', 'supplier_td': 'Wanhua Chemical Group Co., Ltd.', 'product': 'Wanthane®', 'grade': 'WHT-1190', 'link': '<img src="/assets/tpl/img/dowloaddocument.png" />', 'generic': 'Thermoplastic Polyurethane Elastomer (Polyester)', 'density': '1.19,ASTM D792', 'flowrate': '', 'impact': '', 'tensilestrength': '', 'flexuralstrength': '', 'modulus': '', 'rockwell': '', 'dtul264': '', 'dtul66': '', 'region': 'Europe'}
{'id': '<label class="checkbox-custom"><input type="checkbox" id="row-checkbox"><span class="checkmark"></span></label>', 'supplier_td': 'Wanhua Chemical Group Co., Ltd.', 'product': 'Wanthane®', 'grade': 'WHT-1195', 'link': '<img src="/assets/tpl/img/dowloaddocument.png" />', 'generic': 'Thermoplastic Polyurethane Elastomer (Polyester)', 'density': '1.20,ASTM D792', 'flowrate': '', 'impact': '', 'tensilestrength': '', 'flexuralstrength': '', 'modulus': '', 'rockwell': '', 'dtul264': '', 'dtul66': '', 'region': 'Europe'}
...
To get the download link:
Note: the walrus operator requires Python 3.8 or higher
from bs4 import BeautifulSoup
import requests
r=requests.get("https://www.nexeoplastics.com/product-ajax-search/?s=TPU&_=1657516141911").json()
for item in r['data']:
    if e := BeautifulSoup(item['link']):
        print('https://www.nexeoplastics.com/' + e.a.get('href'))
Python version < 3.8:
...
for item in r['data']:
    if BeautifulSoup(item['link']):
        print('https://www.nexeoplastics.com/' + BeautifulSoup(item['link']).a.get('href'))
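A rough sketch of the Selenium route mentioned above (render the page and read driver.page_source); the selector is a guess based on the question's id="maintable" table, and it assumes selenium plus a matching Chrome driver are installed:
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.nexeoplastics.com/types/plastics-product-finder?s=TPU')
time.sleep(5)  # crude wait for the ajax-driven table; an explicit wait would be cleaner
soup = BeautifulSoup(driver.page_source, 'html.parser')
# selector is assumed: links inside the tbody of the table with id="maintable"
for a in soup.select('#maintable tbody a[href]'):
    print(urljoin('https://www.nexeoplastics.com/', a['href']))
driver.quit()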
I am fairly new to Python. I am trying to scrape NBA Drives data via https://stats.nba.com/players/drives/
I used Chrome Devtools to find the API URL. I then used the requests package to get the JSON string.
Original code:
import requests
headers = {"User-Agent": "Mozilla/5.0..."}
url = " https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Drives&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
r = requests.get(url, headers = headers)
d = r.json()
This no longer works, however. For some reason the request for the URL link below times out on the NBA server. So I need to find a new way to get this information.
https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Drives&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=
I was exploring Chrome DevTools and found that the desired JSON string is shown in the Network tab's XHR Response. Is there any way to pull that into Python? See the image below.
(Image: Chrome DevTools showing the XHR Response JSON string)
I tested the url with the other headers I saw in DevTools for this request, and it seems it needs the Referer header to work correctly.
EDIT 2020.08.15:
I had to add new headers to read it
'x-nba-stats-origin': 'stats',
'x-nba-stats-token': 'true',
import requests
headers = {
    'User-Agent': 'Mozilla/5.0',
    #'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Referer': 'https://stats.nba.com/players/drives/',
    #'Accept': 'application/json, text/plain, */*',
    'x-nba-stats-origin': 'stats',
    'x-nba-stats-token': 'true',
}
url = 'https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Drives&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight='
r = requests.get(url, headers=headers)
data = r.json()
print(data)
BTW: the same, but with params as a dictionary, so it is easier to set different values:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0',
    #'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Referer': 'https://stats.nba.com/players/drives/',
    #'Accept': 'application/json, text/plain, */*',
    'x-nba-stats-origin': 'stats',
    'x-nba-stats-token': 'true',
}
url = 'https://stats.nba.com/stats/leaguedashptstats'
params = {
    'College': '',
    'Conference': '',
    'Country': '',
    'DateFrom': '',
    'DateTo': '',
    'Division': '',
    'DraftPick': '',
    'DraftYear': '',
    'GameScope': '',
    'Height': '',
    'LastNGames': '0',
    'LeagueID': '00',
    'Location': '',
    'Month': '0',
    'OpponentTeamID': '0',
    'Outcome': '',
    'PORound': '0',
    'PerMode': 'PerGame',
    'PlayerExperience': '',
    'PlayerOrTeam': 'Player',
    'PlayerPosition': '',
    'PtMeasureType': 'Drives',
    'Season': '2019-20',
    'SeasonSegment': '',
    'SeasonType': 'Regular Season',
    'StarterBench': '',
    'TeamID': '0',
    'VsConference': '',
    'VsDivision': '',
    'Weight': '',
}
r = requests.get(url, headers=headers, params=params)
#print(r.request.url)
data = r.json()
print(data)
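If the response follows the usual stats.nba.com layout (resultSets with headers and rowSet, as in the team-stats answer earlier), the rows can be flattened into per-player dicts the same way; a sketch based on that assumption:
# assumes 'data' is the parsed JSON from the request above and uses the
# resultSets/headers/rowSet structure seen in the team-stats endpoint
result = data['resultSets'][0]
columns = result['headers']
for row in result['rowSet']:
    print(dict(zip(columns, row)))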
I'm trying to scrape data from Yahoo Finance with Beautiful Soup. One field is a span tag with an attribute of "data-reactid"="42", representing the previous close value of the stock. If I run the following commands, it returns None. Why is that?
code below:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'http://finance.yahoo.com/q/op?s=spy+Options'
page = urlopen(url)
soup = BeautifulSoup(page)
soup.find("span", attrs={"data-reactid":"42"})
Try:
soup.find_all("span", attrs={"data-reactid":"42"})
See the documentation on attrs for more examples.
EDIT:
Since the page is rendered using ReactJS, the data you are trying to access is not available at the time you make the request; this is why you always get None.
I suggest you use something like yfinance.
See this for more information.
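For example, a minimal yfinance sketch (assuming the package is installed; column and field names can vary between yfinance versions):
import yfinance as yf

hist = yf.Ticker('SPY').history(period='5d')
# the close of the second-to-last daily bar is the previous close
print(hist['Close'].iloc[-2])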
When you open your browser and point it to http://finance.yahoo.com/q/op?s=spy+Options, a few XHR calls are issued by the browser. One of those calls returns a data structure with the field 'previousClose'. That may be the field you are looking for. See the code below.
import requests
import pprint
r = requests.get('https://query1.finance.yahoo.com/v7/finance/spark?symbols=%5EGSPC&range=1d&interval=5m&indicators=close&includeTimestamps=false&includePrePost=false&corsDomain=finance.yahoo.com&.tsrc=finance')
if r.status_code == 200:
    pprint.pprint(r.json())
output
{'spark': {'error': None,
'result': [{'response': [{'indicators': {'quote': [{'close': [2982.4,
2981.64,
2982.96,
2982.85,
2978.84,
2977.04,
2974.02,
2974.34,
2973.85,
2974.75,
2975.68,
2978.29,
2977.26,
2978.91,
2980.48,
2983.23,
2982.07,
2984.48,
2984.14,
2984.07,
2984.66,
2981.88,
2983.19,
2983.86,
2983.79,
2967.47,
2968.61,
2971.59,
2970.77,
2975.5,
2971.69,
2972.09,
2973.98,
2968.89,
2969.1,
2970.09,
2968.25,
2969.2,
2966.84,
2963.03,
2962.99,
2958.07,
2959.89,
2963.97,
2962.86,
2960.46,
2958.78,
2961.0,
2959.69,
2959.73,
2961.58,
2958.23,
2959.21,
2960.67,
2958.38,
2955.76,
2956.29,
2955.62,
2954.33,
2954.44,
2952.78,
2951.81,
2951.3,
2948.71,
2946.59,
2948.26,
2950.32,
2948.2,
2948.35,
2953.54,
2955.45,
2952.13,
2955.97,
2956.97,
2957.38,
2958.28,
2961.44,
2962.13]}]},
'meta': {'chartPreviousClose': 2977.62,
'currency': 'USD',
'currentTradingPeriod': {'post': {'end': 1569628800,
'gmtoffset': -14400,
'start': 1569614400,
'timezone': 'EDT'},
'pre': {'end': 1569591000,
'gmtoffset': -14400,
'start': 1569571200,
'timezone': 'EDT'},
'regular': {'end': 1569614400,
'gmtoffset': -14400,
'start': 1569591000,
'timezone': 'EDT'}},
'dataGranularity': '5m',
'exchangeName': 'SNP',
'exchangeTimezoneName': 'America/New_York',
'firstTradeDate': -1325602800,
'gmtoffset': -14400,
'instrumentType': 'INDEX',
'previousClose': 2977.62,
'priceHint': 2,
'range': '1d',
'regularMarketPrice': 2961.79,
'regularMarketTime': 1569618019,
'scale': 3,
'symbol': '^GSPC',
'timezone': 'EDT',
'tradingPeriods': [[{'end': 1569614400,
'gmtoffset': -14400,
'start': 1569591000,
'timezone': 'EDT'}]],
'validRanges': ['1d',
'5d',
'1mo',
'3mo',
'6mo',
'1y',
'2y',
'5y',
'10y',
'ytd',
'max']},
'timestamp': [1569591000,
1569591300,
1569591600,
1569591900,
1569592200,
1569592500,
1569592800,
1569593100,
1569593400,
1569593700,
1569594000,
1569594300,
1569594600,
1569594900,
1569595200,
1569595500,
1569595800,
1569596100,
1569596400,
1569596700,
1569597000,
1569597300,
1569597600,
1569597900,
1569598200,
1569598500,
1569598800,
1569599100,
1569599400,
1569599700,
1569600000,
1569600300,
1569600600,
1569600900,
1569601200,
1569601500,
1569601800,
1569602100,
1569602400,
1569602700,
1569603000,
1569603300,
1569603600,
1569603900,
1569604200,
1569604500,
1569604800,
1569605100,
1569605400,
1569605700,
1569606000,
1569606300,
1569606600,
1569606900,
1569607200,
1569607500,
1569607800,
1569608100,
1569608400,
1569608700,
1569609000,
1569609300,
1569609600,
1569609900,
1569610200,
1569610500,
1569610800,
1569611100,
1569611400,
1569611700,
1569612000,
1569612300,
1569612600,
1569612900,
1569613200,
1569613500,
1569613800,
1569614100]}],
'symbol': '^GSPC'}]}}
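Based on the structure printed above, the previous close can be read straight out of the parsed JSON; a sketch that assumes the response keeps this exact layout:
# navigate spark -> result -> response -> meta as shown in the output above
meta = r.json()['spark']['result'][0]['response'][0]['meta']
print(meta['previousClose'])  # 2977.62 in the sample output above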
I'm doing some scraping and looking at pages like this one (https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897), but I have not been able to fully retrieve the JSON content. I have tried using both of the following sets of code, but each returns an incomplete JSON object:
url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s'%(string_use_user, string_use_workout)
print(url)
response = urlopen(url)
try:
    reader = codecs.getreader("utf-8")
    print(reader(response))
    jsonresponse = json.load(reader(response))
    print(jsonresponse)
and similarly, using the requests library instead of urllib also fails to retrieve the full JSON:
url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s'%(string_use_user, string_use_workout)
print("using this url %s"%url)
r = requests.get(url)
try:
    print(r.json())
    jsonresponse = r.json()  # json.loads(response.read())
In both cases I get about 1/4 of the JSON. For example, in this case:
https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897
I received:
{'feed_id': 281475471235835, 'id': 526622897, 'duration': 4082.0, 'local_start_time': '2015-05-21T09:30:45.000+02:00', 'calories': 1073.0, 'tagged_users': [], 'altitude_max': 69.9523, 'sport': 0, 'distance': 11.115419387817383, 'altitude_min': 14.9908, 'include_in_stats': True, 'hydration': 0.545339, 'start_time': '2015-05-21T07:30:45.000Z', 'ascent': 137.162, 'is_live': False, 'pb_count': 2, 'playlist': [], 'is_peptalk_allowed': False, 'weather': {'wind_speed': 11, 'temperature': 12, 'wind_direction': 13, 'type': 3, 'humidity': 81}, 'speed_max': 24.8596, 'author': {'name': 'gfdgfd', 'id': 20261627, 'last_name': 'gdsgsk', 'gender': 0, 'expand': 'abs', 'picture': {'url': 'https://www.endomondo.com/resources/gfx/picture/18511427/thumbnail.jpg'}, 'first_name': 'gdsgds', 'viewer_friendship': 1, 'is_premium': False}, 'sharing': [{'share_time': '2015-05-21T08:45:19.000Z', 'type': 0, 'share_id': 1635690786663532}], 'show_map': 0, 'pictures': [], 'hashtags': [], 'descent': 150.621, 'speed_avg': 9.80291763746756, 'expand': 'full', 'show_workout': 0, 'points': {'expand': 'ref', 'id': 2199549878449}}
I am not receiving the long arrays within the data. I am also not even recovering all of the non-array data.
I ran the original page through a JSON validator, and it's fine. Similarly, I ran the JSON I do receive through a validator, and it's also fine - it doesn't show any signs of missing things unless I compare with the original.
I would appreciate any advice about how to troubleshoot this. Thanks.
Looks like this API is doing some User-Agent sniffing and only sending the complete content for what it considers to be actual web browsers.
Once you set a User-Agent header with the UA string of a common browser, you get the full response:
>>> UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
>>> url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'
>>> r = requests.get(url, headers={'User-Agent': UA})
>>>
>>> print len(r.content)
96412
See the requests docs for more details on setting custom headers.
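Putting it together for the workout URL from the question; a sketch only, since the endpoint and its response may have changed since this was written:
import requests

UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'

# the browser-like User-Agent is what unlocks the complete response
r = requests.get(url, headers={'User-Agent': UA})
print(len(r.content))  # should be the full ~96 KB rather than the truncated payload
jsonresponse = r.json()
print(sorted(jsonresponse.keys()))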