Python3 requests module or urllib.request module both retrieving incomplete json - python

I'm doing some scraping and looking at pages like this one (https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897), but I have not been able to fully retrieve the JSON content. I have tried both of the following snippets, but each returns an incomplete JSON object:
import codecs
import json
from urllib.request import urlopen

url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s' % (string_use_user, string_use_workout)
print(url)
response = urlopen(url)
try:
    reader = codecs.getreader("utf-8")
    print(reader(response))
    jsonresponse = json.load(reader(response))
    print(jsonresponse)
except ValueError as e:
    print(e)
and similarly, using the requests library instead of urllib also fails to retrieve the full JSON:
import requests

url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s' % (string_use_user, string_use_workout)
print("using this url %s" % url)
r = requests.get(url)
try:
    print(r.json())
    jsonresponse = r.json()  # json.loads(response.read())
except ValueError as e:
    print(e)
In both cases I get about 1/4 of the JSON. For example, in this case:
https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897
I received:
{'feed_id': 281475471235835, 'id': 526622897, 'duration': 4082.0, 'local_start_time': '2015-05-21T09:30:45.000+02:00', 'calories': 1073.0, 'tagged_users': [], 'altitude_max': 69.9523, 'sport': 0, 'distance': 11.115419387817383, 'altitude_min': 14.9908, 'include_in_stats': True, 'hydration': 0.545339, 'start_time': '2015-05-21T07:30:45.000Z', 'ascent': 137.162, 'is_live': False, 'pb_count': 2, 'playlist': [], 'is_peptalk_allowed': False, 'weather': {'wind_speed': 11, 'temperature': 12, 'wind_direction': 13, 'type': 3, 'humidity': 81}, 'speed_max': 24.8596, 'author': {'name': 'gfdgfd', 'id': 20261627, 'last_name': 'gdsgsk', 'gender': 0, 'expand': 'abs', 'picture': {'url': 'https://www.endomondo.com/resources/gfx/picture/18511427/thumbnail.jpg'}, 'first_name': 'gdsgds', 'viewer_friendship': 1, 'is_premium': False}, 'sharing': [{'share_time': '2015-05-21T08:45:19.000Z', 'type': 0, 'share_id': 1635690786663532}], 'show_map': 0, 'pictures': [], 'hashtags': [], 'descent': 150.621, 'speed_avg': 9.80291763746756, 'expand': 'full', 'show_workout': 0, 'points': {'expand': 'ref', 'id': 2199549878449}}
I am not receiving the long arrays within the data. I am also not even recovering all of the non-array data.
I ran the original page through a JSON validator, and it's fine. Similarly, I ran the JSON I do receive through a validator, and it's also fine - it doesn't show any signs of missing things unless I compare with the original.
I would appreciate any advice about how to troubleshoot this. Thanks.

Looks like this API is doing some User-Agent sniffing and only sending the complete content for what it considers to be actual web browsers.
Once you set a User-Agent header with the UA string of a common browser, you get the full response:
>>> UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
>>> url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'
>>> r = requests.get(url, headers={'User-Agent': UA})
>>>
>>> print(len(r.content))
96412
See the requests docs for more details on setting custom headers.
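As an aside, the sniffing works because requests announces itself in its default headers. A minimal sketch (no network access needed) showing the default User-Agent and a requests.Session that applies the browser string to every request made through it:

```python
import requests

# requests identifies itself as "python-requests/<version>" unless overridden,
# which is what User-Agent sniffing picks up on
print(requests.utils.default_headers()['User-Agent'])

# A Session applies the spoofed User-Agent to every request made through it
UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
session = requests.Session()
session.headers.update({'User-Agent': UA})
print(session.headers['User-Agent'])
```

Any session.get(url) call will then carry the browser User-Agent without passing headers= each time.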

Related

403 error in web-scraping a specific website with Python

I'm trying to open the following UK Parliament website from my Colab environment, but I haven't been able to make it work without 403 errors; the header restriction seems too strict. Following several answers to previous similar questions, I've tried much more extended versions of the header, but it still does not work.
Is there any way?
from urllib.request import urlopen, Request

url = "https://members.parliament.uk/members/commons"
headers = {'User-Agent': 'Mozilla/5.0'}
request = Request(url=url, headers=headers)
response = urlopen(request)
data = response.read()
The longer header is this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}
The website is under Cloudflare protection, as Andrew Ryan has already stated in his answer about the possible solution. I also tried cloudscraper, but it didn't work and I was still getting 403, so I used Playwright with bs4 and now it's working like a charm.
Example:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

data = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto('https://members.parliament.uk/members/commons')
    page.wait_for_timeout(5000)
    loc = page.locator('div[class="card-list card-list-2-col"]')
    html = loc.inner_html()
    # print(html)
    soup = BeautifulSoup(html, "lxml")
    # print(soup.prettify())
    for card in soup.select('.card.card-member'):
        d = {
            'Name': card.select_one('.primary-info').get_text(strip=True)
        }
        data.append(d)
print(data)
Output:
[{'Name': 'Ms Diane Abbott'}, {'Name': 'Debbie Abrahams'}, {'Name': 'Nigel Adams'}, {'Name': 'Bim Afolami'}, {'Name': 'Adam Afriyie'}, {'Name': 'Nickie Aiken'}, {'Name': 'Peter Aldous'}, {'Name': 'Rushanara Ali'}, {'Name': 'Tahir Ali'}, {'Name': 'Lucy Allan'}, {'Name': 'Dr Rosena Allin-Khan'}, {'Name': 'Mike Amesbury'}, {'Name': 'Fleur Anderson'}, {'Name': 'Lee Anderson'}, {'Name': 'Stuart Anderson'}, {'Name': 'Stuart Andrew'}, {'Name': 'Caroline Ansell'}, {'Name': 'Tonia Antoniazzi'}, {'Name': 'Edward Argar'}, {'Name': 'Jonathan Ashworth'}]
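The BeautifulSoup step is independent of Playwright, so it can be sketched on its own. This is the same card parsing run on a small hypothetical HTML fragment (class names taken from the code above):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the member cards on the page
html = '''
<div class="card card-member"><div class="primary-info">Ms Diane Abbott</div></div>
<div class="card card-member"><div class="primary-info">Debbie Abrahams</div></div>
'''

soup = BeautifulSoup(html, 'html.parser')
data = [{'Name': card.select_one('.primary-info').get_text(strip=True)}
        for card in soup.select('.card.card-member')]
print(data)  # [{'Name': 'Ms Diane Abbott'}, {'Name': 'Debbie Abrahams'}]
```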

Html request for Biwenger in python

I'm trying to scrape the data from Biwenger with an HTTP request, but the response returns different data than if the URL is opened in Chrome.
Here is my code
import requests
shots_url = "https://biwenger.as.com/user/naranjas-4537694"
response = requests.get(shots_url)
response.raise_for_status() # raise exception if invalid response
print(response.text)
I don't get any error, however the request shows different data than the data at the URL, along with this message:
<!doctype html><meta charset=utf-8><title>Biwenger</title><base href=/ ><meta...<div class=body><p>Looks like the browser you're using is not compatible with Biwenger :(<p>We recommend using <a href=http://www.google.com/chrome/ target=_blank>Google Chrome</a>...</script>
Any idea what code I can use to get the right data?
If you require any more information please let me know. Thank you everyone.
The data is loaded dynamically via JavaScript/JSON. When you open the Firefox/Chrome developer tools (Network tab), you will see where the page is making requests.
This example will get the information about user players:
import re
import json
import requests
from pprint import pprint

user_data_url = 'https://biwenger.as.com/api/v2/user/4537694?fields=*,account(id),players(id,owner),lineups(round,points,count,position),league(id,name,competition,mode,scoreID),market,seasons,offers,lastPositions'
all_data_url = 'https://cf.biwenger.com/api/v2/competitions/la-liga/data?lang=en&score=1&callback=jsonp_xxx'  # <--- check #αԋɱҽԃ αмєяιcαη's answer; it's possible to do it without the callback= parameter

response = requests.get(all_data_url)
data = json.loads(re.findall(r'jsonp_xxx\((.*)\)', response.text)[0])
user_data = requests.get(user_data_url).json()

# pprint(user_data)  # <-- uncomment this to see user data
# pprint(data)       # <-- uncomment this to see data about all players

for p in user_data['data']['players']:
    pprint(data['data']['players'][str(p['id'])])
    print('-' * 80)
Prints:
{'fantasyPrice': 22000000,
'fitness': [10, 2, 2, 2, -2],
'id': 599,
'name': 'Pedro León',
'playedAway': 8,
'playedHome': 8,
'points': 38,
'pointsAway': 16,
'pointsHome': 22,
'pointsLastSeason': 16,
'position': 3,
'price': 1400000,
'priceIncrement': 60000,
'slug': 'pedro-leon',
'status': 'ok',
'teamID': 76}
--------------------------------------------------------------------------------
{'fantasyPrice': 9000000,
'fitness': [None, 'injured', 'doubt', None, 2],
'id': 1093,
'name': 'Javi López',
'playedAway': 4,
'playedHome': 2,
'points': 10,
'pointsAway': 6,
'pointsHome': 4,
'pointsLastSeason': 77,
'position': 2,
'price': 210000,
'priceIncrement': 0,
'slug': 'javier-lopez',
'status': 'ok',
'teamID': 7}
--------------------------------------------------------------------------------
... and so on.
import requests
import csv

r = requests.get(
    "https://cf.biwenger.com/api/v2/competitions/la-liga/data?lang=en&score=1").json()

data = []
for k, v in r['data']['players'].items():
    data.append(v.values())

with open('output.csv', 'w', newline="", encoding="UTF-8") as f:
    writer = csv.writer(f)
    writer.writerow(v.keys())  # header row (all players share the same keys)
    writer.writerows(data)
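The CSV-writing step can be checked without hitting the API. A sketch on a small, hypothetical stand-in for the r['data']['players'] mapping (field names trimmed from the player output printed above):

```python
import csv
import io

# Hypothetical, trimmed stand-in for r['data']['players']
players = {
    '599': {'name': 'Pedro León', 'points': 38, 'price': 1400000},
    '1093': {'name': 'Javi López', 'points': 10, 'price': 210000},
}

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(next(iter(players.values())).keys())    # header row
writer.writerows(p.values() for p in players.values())  # one row per player
print(buf.getvalue())
```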

How to scrape web page content that's returned by ajax?

Here is the page I wanna scrape: https://www.racing.com/form/2018-11-06/flemington/race/7/results
The race results info is not in the source code.
I tried the Chrome DevTools, but didn't find the response data that contains the results.
Here is some code in the source code:
ng-controller="formTabResultsController"
ng-init="meet=5149117;race=7;init();" ajax-loader="result"
I think the results are returned and saved in a "result" structure, because there are many references like "result.PrizeMoney" and "result.Record".
So how can I get the result data with Python? Thanks.
This site uses a GraphQL API on https://graphql.rmdprod.racing.com. An API key needs to be sent through headers and is retrieved from https://www.racing.com/layouts/app.aspx.
An example with curl, sed & jq :
api_key=$(curl -s "https://www.racing.com/layouts/app.aspx" | \
          sed -nE 's/.*headerAPIKey:\s*"(.*)"/\1/p')
query='query GetMeeting($meetCode: ID!) {
  getMeeting(id: $meetCode) {
    id
    trackName
    date
    railPosition
    races {
      id
      raceNumber
      status
      tempo
      formRaceEntries {
        id
        raceEntryNumber
        horseName
        silkUrl
        jockeyName
        trainerName
        scratched
        speedValue
        barrierNumber
        horse {
          name
          fullName
          colour
        }
      }
    }
  }
}'
variables='{ "meetCode": 5149117 }'
curl -G 'https://graphql.rmdprod.racing.com' \
--data-urlencode "query=$query" \
--data-urlencode "variables=$variables" \
-H "X-Api-Key: $api_key" | jq '.'
Using Python with python-requests:
import requests
import re
import json

r = requests.get("https://www.racing.com/layouts/app.aspx")
api_key = re.search(r'.*headerAPIKey:\s*"(.*)"', r.text).group(1)

query = """query GetMeeting($meetCode: ID!) {
  getMeeting(id: $meetCode) {
    id
    trackName
    date
    railPosition
    races {
      id
      raceNumber
      status
      tempo
      formRaceEntries {
        id
        raceEntryNumber
        horseName
        silkUrl
        jockeyName
        trainerName
        scratched
        speedValue
        barrierNumber
        horse {
          name
          fullName
          colour
        }
      }
    }
  }
}"""

payload = {
    "variables": json.dumps({
        "meetCode": 5149117
    }),
    "query": query
}

r = requests.get(
    'https://graphql.rmdprod.racing.com',
    params=payload,
    headers={
        "X-Api-Key": api_key
    })
print(r.json())
Chrome DevTools shows a call to their API:
import re
import requests
import json
resp = requests.get('https://api.racing.com/v1/en-au/race/results/5149117/7/?callback=angular.callbacks._b')
# Returned JSONP so we remove the function call: keep only what is between ()
m = re.search(r'\((.*)\)', resp.text, flags=re.S)
data = json.loads(m.group(1))
print(data.keys())
# dict_keys(['race', 'resultCollection', 'exoticCollection'])
print(data['resultCollection'][0])
# {'position': {'at400m': 12, 'at800m': 20, 'finish': 1, 'positionAbbreviation': '1st', 'positionDescription': '', 'positionType': 'Finished'}, 'scratched': False, 'winningTime': 20117, 'margin': None, 'raceEntryNumber': 23, 'number': 23, 'barrierNumber': 19, 'isDeadHeat': False, 'weight': '51kg', 'rating': {'handicapRating': 109, 'ratingProgression': 0}, 'prizeMoney': 4000000.0, 'horse': {'fullName': 'Cross Counter (GB)', 'code': 5256710, 'urlSegment': 'cross-counter-gb', 'silkUrl': '//s3-ap-southeast-2.amazonaws.com/racevic.silks/bb/12621.png', 'age': 5, 'sex': 'Gelding', 'colour': 'Bay', 'sire': 'Teofilo (IRE)', 'dam': 'Waitress (USA)', 'totalPrizeMoney': '$4,576,227', 'averagePrize': '$508,470'}, 'trainer': {'fullName': None, 'shortName': 'C.Appleby', 'code': 20658431, 'urlSegment': 'charlie-appleby-gb'}, 'jockey': {'fullName': 'K.McEvoy', 'shortName': 'K.McEvoy', 'code': 25602, 'urlSegment': 'kerrin-mcevoy', 'allowedClaim': 0.0, 'apprentice': False}, 'gear': {'hasChanges': True, 'gearCollection': [{'changeDate': '2018-11-02T00:00:00', 'currentChange': True, 'description': 'Bandages (Front): On', 'name': 'Bandages (Front)', 'status': 'On', 'comments': None}, {'changeDate': '2018-08-01T00:00:00', 'currentChange': False, 'description': 'Ear Muffs (Pre-Race Only)', 'name': 'Ear Muffs (Pre-Race Only)', 'status': 'On', 'comments': None}, {'changeDate': '2018-08-01T00:00:00', 'currentChange': False, 'description': 'Lugging Bit', 'name': 'Lugging Bit', 'status': 'On', 'comments': 'Rubber ring bit'}, {'changeDate': '2018-08-01T00:00:00', 'currentChange': False, 'description': 'Cross-over Nose Band', 'name': 'Cross-over Nose Band', 'status': 'On', 'comments': None}], 'currentGearCollection': None}, 'odds': {'priceStart': '$9.00', 'parimutuel': {'returnWin': '12', 'returnPlace': '4.40', 'isFavouriteWin': False}, 'fluctuations': {'priceOpening': '$10.00', 'priceFluc': '$10.00'}}, 'comment': 'Bit Slow Out Settled Down near tail lucky to avoid injured horse was checked though 12l bolting Turn Straightened Up Off Mid-Field 7-8l gets Clear 400 and charged home to score. big win # very good from back', 'extendedApiUrl': '/v1/en-au/form/horsestat/5149117/7/5256710', 'extendedApiUrlMobile': '/v1/en-au/form/horsestatmobile/5149117/7/5256710', 'last5': ['-', '4', '3', '-', '4']}
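The regex used here (and in the Biwenger answer earlier) generalizes to any JSONP wrapper, so it can be factored into a small helper. A sketch on a hypothetical payload shaped like the racing.com response:

```python
import json
import re

def strip_jsonp(text):
    """Drop a JSONP wrapper like callback({...}); and parse the JSON inside."""
    m = re.search(r'\((.*)\)\s*;?\s*$', text, flags=re.S)
    return json.loads(m.group(1))

# Hypothetical JSONP body, shaped like the racing.com response
sample = 'angular.callbacks._b({"race": {"number": 7}, "resultCollection": []});'
print(strip_jsonp(sample))  # {'race': {'number': 7}, 'resultCollection': []}
```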
Another way to do it is to use these parameters (discoverable through the Network tab of your browser's developer tools), without using a regex:
import requests
import json

url = 'https://graphql.rmdprod.racing.com/?query=query%20GetMeeting($meetCode:%20ID!)%20%7BgetMeeting(id:%20$meetCode)%7Bid,trackName,date,railPosition,races%7Bid,raceNumber,status,tempo,formRaceEntries%7Bid,raceEntryNumber,horseName,silkUrl,jockeyName,trainerName,scratched,speedValue,barrierNumber%7D%7D%7D%7D&variables=%7B%20%22meetCode%22:%205149117%20%7D'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Content-Type": "application/json",
    "X-Api-Key": "da2-akkuiub3brhahc7nab2msruddq"
}
resp = requests.get(url, headers=headers)
data = json.loads(resp.text)  # equivalently: data = resp.json()
data

Beautiful Soup Not Getting NBA.com Data

I want to extract the data from the table on this webpage: http://stats.nba.com/league/team/#!/advanced/ . Unfortunately, the following code does not give me anything because the soup (see below) contains no "td"s, even though there are many "td"s to be found when inspecting the webpage.
On the other hand, running the same code for the website "http://espn.go.com/nba/statistics/team/_/stat/offense-per-game" does give me what I want.
Why does the code work for one site and not the other, and is there anything I can do to get the data I want from the first site?
import requests
from bs4 import BeautifulSoup

url = "http://stats.nba.com/league/team/#!/advanced/"
r = requests.get(url)
soupNBAadv = BeautifulSoup(r.content)
tds = soupNBAadv.find_all("td")
for i in tds:
    print i.text
You don't need BeautifulSoup here at all. The table you see in the browser is formed with the help of an additional GET request to an endpoint which returns a JSON response; simulate it:
import requests

url = "http://stats.nba.com/league/team/#!/advanced/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}

with requests.Session() as session:
    session.headers = headers
    session.get(url, headers=headers)

    params = {
        'DateFrom': '',
        'DateTo': '',
        'GameScope': '',
        'GameSegment': '',
        'LastNGames': '0',
        'LeagueID': '00',
        'Location': '',
        'MeasureType': 'Advanced',
        'Month': '0',
        'OpponentTeamID': '0',
        'Outcome': '',
        'PaceAdjust': 'N',
        'PerMode': 'Totals',
        'Period': '0',
        'PlayerExperience': '',
        'PlayerPosition': '',
        'PlusMinus': 'N',
        'Rank': 'N',
        'Season': '2014-15',
        'SeasonSegment': '',
        'SeasonType': 'Regular Season',
        'StarterBench': '',
        'VsConference': '',
        'VsDivision': ''
    }

    response = session.get('http://stats.nba.com/stats/leaguedashteamstats', params=params)
    results = response.json()

    headers = results['resultSets'][0]['headers']
    rows = results['resultSets'][0]['rowSet']
    for row in rows:
        print(dict(zip(headers, row)))
Prints:
{u'MIN': 2074.0, u'TEAM_ID': 1610612737, u'TEAM_NAME': u'Atlanta Hawks', u'AST_PCT': 0.687, u'CFPARAMS': u'Atlanta Hawks', u'EFG_PCT': 0.531, u'DEF_RATING': 99.4, u'NET_RATING': 7.5, u'PIE': 0.556, u'AST_TO': 1.81, u'TS_PCT': 0.57, u'GP': 43, u'L': 8, u'OREB_PCT': 0.21, u'REB_PCT': 0.488, u'W': 35, u'W_PCT': 0.814, u'DREB_PCT': 0.743, u'CFID': 10, u'PACE': 96.17, u'TM_TOV_PCT': 0.149, u'AST_RATIO': 19.9, u'OFF_RATING': 106.9}
{u'MIN': 1897.0, u'TEAM_ID': 1610612738, u'TEAM_NAME': u'Boston Celtics', u'AST_PCT': 0.635, u'CFPARAMS': u'Boston Celtics', u'EFG_PCT': 0.494, u'DEF_RATING': 104.0, u'NET_RATING': -2.7, u'PIE': 0.489, u'AST_TO': 1.73, u'TS_PCT': 0.527, u'GP': 39, u'L': 26, u'OREB_PCT': 0.245, u'REB_PCT': 0.496, u'W': 13, u'W_PCT': 0.333, u'DREB_PCT': 0.747, u'CFID': 10, u'PACE': 99.12, u'TM_TOV_PCT': 0.145, u'AST_RATIO': 18.5, u'OFF_RATING': 101.3}
...
Selenium-based solution:
from pprint import pprint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://stats.nba.com/league/team/#!/advanced/')

wait = WebDriverWait(driver, 5)
# wait for the table to load
table = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'table-responsive')))

stats = []
headers = [th.text for th in table.find_elements_by_tag_name('th')]
for tr in table.find_elements_by_xpath('//tr[@data-ng-repeat]'):
    cells = [td.text for td in tr.find_elements_by_tag_name('td')]
    stats.append(dict(zip(headers, cells)))

pprint(stats)
driver.quit()
Prints:
[{u'AST Ratio': u'19.8',
u'AST%': u'68.1',
u'AST/TO': u'1.84',
u'DREB%': u'74.3',
u'DefRtg': u'100.2',
u'GP': u'51',
u'MIN': u'2458',
u'NetRtg': u'7.4',
u'OREB%': u'21.0',
u'OffRtg': u'107.7',
u'PACE': u'96.12',
u'PIE': u'55.3',
u'REB%': u'48.8',
u'TO Ratio': u'14.6',
u'TS%': u'57.2',
u'Team': u'Atlanta Hawks',
u'eFG%': u'53.4'},
...
{u'AST Ratio': u'18.6',
u'AST%': u'62.8',
u'AST/TO': u'1.65',
u'DREB%': u'77.8',
u'DefRtg': u'100.2',
u'GP': u'52',
u'MIN': u'2526',
u'NetRtg': u'3.5',
u'OREB%': u'24.9',
u'OffRtg': u'103.7',
u'PACE': u'95.75',
u'PIE': u'53.4',
u'REB%': u'51.8',
u'TO Ratio': u'15.4',
u'TS%': u'54.4',
u'Team': u'Washington Wizards',
u'eFG%': u'50.9'}]
The reason behind not getting the data from the first URL using requests.get() is that the data is fetched from the server using an ajax call. The ajax call URL is http://stats.nba.com/stats/leaguedashteamstats, and you have to pass some parameters with it.
When making a requests.get() call you will only get the data that shows up in the page source of your browser. In your browser press ctrl+u to see the page source, and you can see that there is no data in the source.
In Chrome, use the developer tools and see in the Network tab what requests the page is making. In Firefox you can use Firebug and look in the Net tab.
In the case of the second URL, the page source is populated with data (view the page source to examine it). So you can get it by making a GET request to that specific URL.
alecxe's answer demonstrates how to get the data from the first URL.
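The headers/rowSet pairing that alecxe's answer zips together is the shape the leaguedashteamstats endpoint returns, and the zip step works without any network call. A sketch on hypothetical values trimmed from the output printed above:

```python
# Hypothetical, trimmed copy of the JSON the endpoint returns
results = {
    'resultSets': [{
        'headers': ['TEAM_NAME', 'W', 'L'],
        'rowSet': [
            ['Atlanta Hawks', 35, 8],
            ['Boston Celtics', 13, 26],
        ],
    }]
}

headers = results['resultSets'][0]['headers']
rows = [dict(zip(headers, row)) for row in results['resultSets'][0]['rowSet']]
print(rows[0])  # {'TEAM_NAME': 'Atlanta Hawks', 'W': 35, 'L': 8}
```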

Parse this JSON response From App Annie in Python

I am working with the requests module within Python to grab certain fields within the JSON response.
import json
import requests

fn = 'download.json'
response = requests.get('http://api.appannie.com/v1/accounts/1000/apps/mysuperapp/sales?break_down=application+iap&start_date=2013-10-01&end_date=2013-10-02',
                        auth=('username', 'password'))
data = response.json()
print(data)
This works in python, as the response is the following:
{'prev_page': None, 'currency': 'USD', 'next_page': None, 'sales_list': [{'revenue': {'ad': '0.00', 'iap': {'refunds': '0.00', 'sales': '0.00', 'promotions': '0.00'}, 'app': {'refunds': '0.00', 'updates': '0.00', 'downloads': '0.00', 'promotions': '0.00'}}, 'units': {'iap': {'refunds': 0, 'sales': 0, 'promotions': 0}, 'app': {'refunds': 0, 'updates': 0, 'downloads': 2000, 'promotions': 0}}, 'country': 'all', 'date': 'all'}], 'iap_sales': [], 'page_num': 1, 'code': 200, 'page_index': 0}
The question is how do I parse this to get my downloads number within the 'app' block - namely the "2000" value?
After response.json(), data is already a dictionary (otherwise response.json() would have raised an exception). Therefore you can access it just like any other dictionary.
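For the response shown in the question, the downloads number sits at data['sales_list'][0]['units']['app']['downloads']. A minimal sketch using that exact structure (the dict below is a trimmed copy of the response pasted above):

```python
# Trimmed copy of the response from the question
data = {
    'sales_list': [{
        'revenue': {'app': {'refunds': '0.00', 'updates': '0.00', 'downloads': '0.00', 'promotions': '0.00'}},
        'units': {'app': {'refunds': 0, 'updates': 0, 'downloads': 2000, 'promotions': 0}},
    }]
}

downloads = data['sales_list'][0]['units']['app']['downloads']
print(downloads)  # 2000
```

Note that 'revenue' also has an 'app' → 'downloads' entry, but that is the dollar figure ('0.00'); the unit count lives under 'units'.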
You can use the loads() method of json -
import json
import requests

response = requests.get('http://api.appannie.com/v1/accounts/1000/apps/mysuperapp/sales?break_down=application+iap&start_date=2013-10-01&end_date=2013-10-02',
                        auth=('username', 'password'))

data = json.loads(response.text)  # data is a dictionary now
sales_list = data.get('sales_list')
for sales in sales_list:
    print(sales['revenue']['app'])
You can use json.loads:
import json
import requests
response = requests.get(...)
json_data = json.loads(response.text)
This converts a given string into a dictionary which allows you to access your JSON data easily within your code.
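json.loads and requests' response.json() perform the same conversion, which can be seen without any network call. A sketch on a hypothetical body shaped like a trimmed App Annie reply:

```python
import json

# Hypothetical response body (a JSON string, as in response.text)
body = '{"currency": "USD", "code": 200, "sales_list": []}'

json_data = json.loads(body)  # the same dict that response.json() would give you
print(json_data['currency'])  # USD
```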
