403 error in web-scraping a specific website with Python

I'm trying to open the following UK Parliament page from my Colab environment, but I haven't been able to make it work without getting 403 errors; the site's header checks seem too strict. Following several answers to similar questions, I've tried a much more extended version of the headers, but it still doesn't work.
Is there any way around this?
from urllib.request import urlopen, Request
url = "https://members.parliament.uk/members/commons"
headers={'User-Agent': 'Mozilla/5.0'}
request = Request(url=url, headers=headers)
response = urlopen(request)
data = response.read()
The longer header is this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

The website is behind Cloudflare protection, as Andrew Ryan has already stated in his discussion of possible solutions. I also tried cloudscraper, but it didn't work and still returned 403, so I used Playwright with bs4 and now it's working like a charm.
Example:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

data = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto('https://members.parliament.uk/members/commons')
    page.wait_for_timeout(5000)
    loc = page.locator('div[class="card-list card-list-2-col"]')
    html = loc.inner_html()
    #print(html)
    soup = BeautifulSoup(html, "lxml")
    #print(soup.prettify())
    for card in soup.select('.card.card-member'):
        d = {
            'Name': card.select_one('.primary-info').get_text(strip=True)
        }
        data.append(d)

print(data)
Output:
[{'Name': 'Ms Diane Abbott'}, {'Name': 'Debbie Abrahams'}, {'Name': 'Nigel Adams'}, {'Name': 'Bim Afolami'}, {'Name': 'Adam Afriyie'}, {'Name': 'Nickie Aiken'}, {'Name': 'Peter Aldous'}, {'Name': 'Rushanara Ali'}, {'Name': 'Tahir Ali'}, {'Name': 'Lucy Allan'}, {'Name': 'Dr Rosena Allin-Khan'}, {'Name': 'Mike Amesbury'}, {'Name': 'Fleur Anderson'}, {'Name': 'Lee Anderson'}, {'Name': 'Stuart Anderson'}, {'Name': 'Stuart Andrew'}, {'Name': 'Caroline Ansell'}, {'Name': 'Tonia Antoniazzi'}, {'Name': 'Edward Argar'}, {'Name': 'Jonathan Ashworth'}]
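Since the question mentions a Colab environment (which has no display), headless=False won't launch there. Below is a minimal sketch of the same approach for a display-less setup; it assumes playwright, beautifulsoup4 and lxml are installed and that `playwright install chromium` has been run. Note that inside a notebook the sync API can clash with the running event loop, so it may need to go in a separate script (or use the async API), and whether the Cloudflare check still passes in headless mode is not guaranteed.

# Sketch for a display-less environment such as Colab (assumptions noted above).
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

data = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # no display available, so run headless
    page = browser.new_page()
    page.goto('https://members.parliament.uk/members/commons')
    page.wait_for_timeout(5000)  # give the page (and any challenge) time to settle
    html = page.locator('div[class="card-list card-list-2-col"]').inner_html()
    browser.close()

soup = BeautifulSoup(html, 'lxml')
for card in soup.select('.card.card-member'):
    data.append({'Name': card.select_one('.primary-info').get_text(strip=True)})
print(data)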

Related

Parsing site in python using beautifulsoup

I'm having a problem parsing a site for a client. I want to parse the https://okchanger.com/exchangers URL. I have tried POST requests using headers and a payload consisting of the form data the site sends. When I inspected the site in the Network tab, there was a POST request to data-table and a GET for the page itself. I would like to get the names and URLs, but the HTML source doesn't seem to contain them (when I parse the HTML and look for the elements, I get an empty list). Could you please tell me how you would approach this? Thanks in advance.
import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs


class okchangerScraper:
    def __init__(self, URL):
        self.URL = URL
        self.headers = {
            'accept': 'application/json, text/javascript, */*; q=0.01',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'en-US,en;q=0.9',
            'content-length': '1835',
            'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'cookie': '__RequestVerificationToken=N5w7MfY6iyx6ExDA6a7kFlKD6rSeYuYE-ExXkw_hOAIK5TpeSb6YUgSPMWWEypMzYNjVELCxA41W7XE0oTJtlLa4TJNIMmsvya8CTCHRkxM1',
            'origin': 'https://www.okchanger.com',
            'referer': 'https://www.okchanger.com/exchangers',
            'sec-fetch-dest': 'empty',
            'sec-fetch-mode': 'cors',
            'sec-fetch-site': 'same-origin',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest'
        }

    def scrape_this_page(self, page):
        with requests.session() as s:
            s.headers = self.headers
            payload = {
                'draw': '2',
                'columns[0][data]': 'Logo',
                'columns[0][name]': None,
                'columns[0][searchable]': 'true',
                'columns[0][orderable]': 'false',
                'columns[0][search][value]': None,
                'columns[0][search][regex]': 'false',
                'columns[1][data]': 'Name',
                'columns[1][name]': None,
                'columns[1][searchable]': 'true',
                'columns[1][orderable]': 'true',
                'columns[1][search][value]': None,
                'columns[1][search][regex]': 'false',
                'columns[2][data]': 'ReserveUSD',
                'columns[2][name]': None,
                'columns[2][searchable]': 'true',
                'columns[2][orderable]': 'true',
                'columns[2][search][value]': None,
                'columns[2][search][regex]': 'false',
                'columns[3][data]': 'Rates',
                'columns[3][name]': None,
                'columns[3][searchable]': 'true',
                'columns[3][orderable]': 'true',
                'columns[3][search][value]': None,
                'columns[3][search][regex]': 'false',
                'columns[4][data]': 'AlexaRank',
                'columns[4][name]': None,
                'columns[4][searchable]': 'true',
                'columns[4][orderable]': 'false',
                'columns[4][search][value]': None,
                'columns[4][search][regex]': 'false',
                'columns[5][data]': 'Popularity',
                'columns[5][name]': None,
                'columns[5][searchable]': 'true',
                'columns[5][orderable]': 'true',
                'columns[5][search][value]': None,
                'columns[5][search][regex]': 'false',
                'columns[6][data]': 'Status',
                'columns[6][name]': None,
                'columns[6][searchable]': 'true',
                'columns[6][orderable]': 'true',
                'columns[6][search][value]': None,
                'columns[6][search][regex]': 'false',
                'columns[7][data]': 'PositiveReviews',
                'columns[7][name]': None,
                'columns[7][searchable]': 'true',
                'columns[7][orderable]': 'true',
                'columns[7][search][value]': None,
                'columns[7][search][regex]': 'false',
                'order[0][column]': '5',
                'order[0][dir]': 'desc',
                'start': '0',
                'length': '100',
                'search[value]': None,
                'search[regex]': 'false'
            }
            r = requests.post(self.URL + page + '/data-table',
                              payload, headers=s.headers)
            h = r.status_code
            html = r.text
            soup = bs(html, 'html.parser')
            table = soup.find('tbody')
            rows = table.select('tbody > tr:nth-child(1) > td.nowrap')
            print(h)
            print(len(rows))


if __name__ == "__main__":
    scraper = okchangerScraper('https://www.okchanger.com/')
    scraper.scrape_this_page('exchangers')
You are receiving JSON here, not HTML. Try this:
import json
# ...
content = json.loads(r.text)
print(content)
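To go further and pull out the names and URLs the question asks for, one might iterate the parsed payload. This is a sketch that assumes a DataTables-style response with a top-level 'data' list and 'Name'/'URL' fields per row; the real field names should be checked against the response shown in DevTools.

# Continues from the snippet above: `content` is the parsed JSON response.
# 'data', 'Name' and 'URL' are assumed key names -- verify them against the real payload.
rows = content.get('data', [])
exchangers = [{'name': row.get('Name'), 'url': row.get('URL')} for row in rows]
print(len(exchangers))
print(exchangers[:5])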

Python Scrape NBA Tracking Drives Data

I am fairly new to Python. I am trying to scrape NBA Drives data via https://stats.nba.com/players/drives/
I used Chrome Devtools to find the API URL. I then used the requests package to get the JSON string.
Original code:
import requests
headers = {"User-Agent": "Mozilla/5.0..."}
url = " https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Drives&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight="
r = requests.get(url, headers = headers)
d = r.json()
This no longer works, however. For some reason the request for the URL link below times out on the NBA server. So I need to find a new way to get this information.
< https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Drives&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=>
I was exploring Chrome DevTools and found that the desired JSON string is shown in the Network XHR Response tab. Is there any way to scrape that into Python? See the image below.
Chrome Devtools: XHR Response JSON string
I tested the url with other headers (which I saw in DevTools for this request) and it seems it needs the Referer header to work correctly.
EDIT 2020.08.15:
I had to add these new headers to read it:
'x-nba-stats-origin': 'stats',
'x-nba-stats-token': 'true',
import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    #'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Referer': 'https://stats.nba.com/players/drives/',
    #'Accept': 'application/json, text/plain, */*',
    'x-nba-stats-origin': 'stats',
    'x-nba-stats-token': 'true',
}

url = 'https://stats.nba.com/stats/leaguedashptstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&Height=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&PlayerExperience=&PlayerOrTeam=Player&PlayerPosition=&PtMeasureType=Drives&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight='

r = requests.get(url, headers=headers)
data = r.json()
print(data)
BTW: the same, but with the params as a dictionary, so it is easier to set different values:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0',
    #'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0',
    'Referer': 'https://stats.nba.com/players/drives/',
    #'Accept': 'application/json, text/plain, */*',
    'x-nba-stats-origin': 'stats',
    'x-nba-stats-token': 'true',
}

url = 'https://stats.nba.com/stats/leaguedashptstats'

params = {
    'College': '',
    'Conference': '',
    'Country': '',
    'DateFrom': '',
    'DateTo': '',
    'Division': '',
    'DraftPick': '',
    'DraftYear': '',
    'GameScope': '',
    'Height': '',
    'LastNGames': '0',
    'LeagueID': '00',
    'Location': '',
    'Month': '0',
    'OpponentTeamID': '0',
    'Outcome': '',
    'PORound': '0',
    'PerMode': 'PerGame',
    'PlayerExperience': '',
    'PlayerOrTeam': 'Player',
    'PlayerPosition': '',
    'PtMeasureType': 'Drives',
    'Season': '2019-20',
    'SeasonSegment': '',
    'SeasonType': 'Regular Season',
    'StarterBench': '',
    'TeamID': '0',
    'VsConference': '',
    'VsDivision': '',
    'Weight': '',
}

r = requests.get(url, headers=headers, params=params)
#print(r.request.url)
data = r.json()
print(data)
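As a follow-up, the stats.nba.com responses normally keep the table in resultSets, with the column names under headers and the rows under rowSet. Assuming that layout holds for this endpoint (worth confirming against the live JSON), the data object above converts to a pandas DataFrame in a couple of lines:

# Assumes the usual stats.nba.com shape: resultSets[0] carries 'headers' and 'rowSet'.
import pandas as pd

result = data['resultSets'][0]
df = pd.DataFrame(result['rowSet'], columns=result['headers'])
print(df.head())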

Extracting Text from Javascript or Ajax based webpages?

Is there a way to scrape the text from JavaScript-based sites, for example: https://www.ajio.com/ajio-mid-rise-slim-fit-cargo-pants/p/460151939_brown
I need the product specifications from this page. How can I do this?
Product details can easily be extracted using selenium webdriver -
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.ajio.com/ajio-mid-rise-slim-fit-cargo-pants/p/460151939_brown')
list_product = driver.find_elements_by_xpath('//ul[@class="prod-list"]/li')
description_1 = list_product[0].text
Similarly, you can extract all the other values.
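For instance, a minimal way to grab every specification in one pass (continuing the snippet above) might be:

# Collect the text of each <li> under the product list, then close the browser.
specs = [item.text for item in list_product]
print(specs)
driver.quit()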
Without selenium, just regexp.
import re
import json
import requests
from pprint import pprint

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'DNT': '1',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'es-ES,es;q=0.9,en;q=0.8',
}

response = requests.get('https://www.ajio.com/ajio-mid-rise-slim-fit-cargo-pants/p/460151939_brown', headers=headers)
html = response.text

# The product data is embedded in the page as window.__PRELOADED_STATE__.
regex = r"<script>\s+window.__PRELOADED_STATE__ =(.*);\s+<\/script>\s+<script\s+id\s+=\s+\"appJs\""
data = re.findall(regex, html, re.MULTILINE | re.DOTALL)[0]
state = json.loads(data)

details = []
for row in state['product']['productDetails']['featureData']:
    try:
        value = row['featureValues'][0]['value']
    except KeyError:
        value = None
    finally:
        details.append({'name': row['name'], 'value': value})

pprint(details)
Result:
[{'name': 'Highlight', 'value': 'Multiple pockets'},
 {'name': 'Hidden Detail', 'value': 'Belt loops'},
 {'name': 'Additional Informations', 'value': 'Zip fly closure'},
 {'name': 'Waist Rise', 'value': 'Mid Rise'},
 {'name': 'Fabric Composition', 'value': '100% Cotton'},
 {'name': 'Size worn by Model', 'value': '32'},
 {'name': 'Fit Type', 'value': 'Straight Fit'},
 {'name': 'Size Detail', 'value': 'Fits true to standard size on the model'},
 {'name': 'Wash Care', 'value': 'Machine wash'},
 {'name': 'Model Waist Size', 'value': '32"'},
 {'name': 'Model Height', 'value': "6'"},
 {'name': 'Size Format', 'value': None}]

Posting Payment Data

I've been working on this script. It's a script for auto checkout on Shopify-based sites like this one (https://www.cityblueshop.com/products/kixx_asst). My problem is that everything works fine except submitting the payment data. For some reason it won't post the payment, even though I'm correctly extracting the id for cc_verify_id. If you can test it out and let me know what I'm doing wrong (I've been stuck on this step for several days), it would be really appreciated. You can input fake contact and credit card information. P.S. I'm new to programming, so it might look messy. Thanks in advance for your help.
[EDIT] It looks like it's not submitting the data properly from paymentdata, but I still can't pinpoint where the problem is.
import requests, sys, time, re
from datetime import datetime
from bs4 import BeautifulSoup
from urllib.parse import urlparse
s = requests.session()
def UTCtoEST():
    current = datetime.now()
    return str(current) + ' EST'
home = 'cityblueshop'
###Get Session Id###
session = s.get('https://www.'+home+'.com/cart.js').json()
sessionID = session['token']
print('SessionID:', sessionID)
###ATC###
print(UTCtoEST(), 'Adding item....')
atcdata = {
'id': '37431305678',
'quantity': '1'
}
for atcurlRetry in range(1):
    atcURL = s.post('https://www.'+home+'.com/cart/add.js', data=atcdata, allow_redirects=True)
    match = re.findall('"quantity":1', atcURL.text)
    if match:
        print(UTCtoEST(), 'ATC successful....')
        break
    print(UTCtoEST(), 'Trying to ATC....')
    time.sleep(0)
else:
    print(UTCtoEST(), 'Could not ATC after ' + ' retries, therefore exiting the bot.')
    sys.exit(1)
###Going to Checkout Page###
for cartRetry in range(1):
    cartdata = {
        'updates[]': 1,
        'note': '',
        'checkout': 'Check Out'
    }
    atc = s.post('https://www.'+home+'.com/cart', data=cartdata, allow_redirects=True)
###Parsing URL###
parse = urlparse(atc.url)
storeID = parse.path.split('/')[1]
checkoutID = parse.path.split('checkouts/')[1]
print('Checkout Session Id:', checkoutID)
###Get Token###
soup = BeautifulSoup(atc.text, 'lxml')
input = soup.find_all('input')[2]
auth_token = input.get('value')
print('Auth_token:', auth_token)
###Get Contact info###
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.8',
'Host': 'checkout.shopify.com',
'Referer': 'https://checkout.shopify.com/'+storeID+'/checkouts/'+checkoutID+'?step=contact_information',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
}
qs = {
'utf8': '✓',
'_method': 'patch',
'authenticity_token': auth_token,
'previous_step': 'contact_information',
'checkout[email]': 'email',
'checkout[shipping_address][first_name]': 'First',
'checkout[shipping_address][last_name]': 'Last',
'checkout[shipping_address][company]': '',
'checkout[shipping_address][address1]': 'Address 1',
'checkout[shipping_address][address2]': '',
'checkout[shipping_address][city]': 'City',
'checkout[shipping_address][country]': 'United States',
'checkout[shipping_address][province]': 'New York',
'checkout[shipping_address][zip]': 'Zip',
'checkout[shipping_address][phone]': 'Phone',
'checkout[remember_me]': '0',
'checkout[client_details][browser_width]': '979',
'checkout[client_details][browser_height]': '631',
'checkout[client_details][javascript_enabled]': '1',
'step': 'contact_information'
}
GETcontact = s.get(atc.url, data=qs, headers=headers, allow_redirects=True)
###Post Contact Info###
headers1 = {
'Origin': 'https://checkout.shopify.com',
'Content-Type': 'application/x-www-form-urlencoded',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.8',
'Referer': 'https://checkout.shopify.com/'+storeID+'/checkouts/'+checkoutID,
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
}
formData = {
'utf8': '✓',
'_method': 'patch',
'authenticity_token': auth_token,
'button': '',
'checkout[email]': 'Email',
'checkout[shipping_address][first_name]': 'First',
'checkout[shipping_address][last_name]': 'Last',
'checkout[shipping_address][company]': '',
'checkout[shipping_address][address1]': 'Address 1',
'checkout[shipping_address][address2]': '',
'checkout[shipping_address][city]': 'City',
'checkout[shipping_address][country]': 'United States',
'checkout[shipping_address][province]': 'New York',
'checkout[shipping_address][zip]': 'Zip',
'checkout[shipping_address][phone]': 'Phone',
'checkout[remember_me]': '0',
'checkout[client_details][browser_width]': '979',
'checkout[client_details][browser_height]': '631',
'checkout[client_details][javascript_enabled]': '1',
'previous_step': 'contact_information',
'step': 'shipping_method'
}
POSTcontact = s.post(atc.url, data=formData, headers=headers1, allow_redirects=True)
###Parsing Shipping Method###
soup = BeautifulSoup(POSTcontact.text, 'html.parser')
shipping = soup.find(attrs={'class': 'radio-wrapper'})
shipping_method = shipping.get('data-shipping-method')
###Submitting Shipping Data###
headers2 = {
'Origin': 'https://checkout.shopify.com',
'Content-Type': 'application/x-www-form-urlencoded',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.8',
'Referer': 'https://checkout.shopify.com/'+storeID+'/checkouts/'+checkoutID,
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
}
ShipformData = {
'utf8': '✓',
'_method': 'patch',
'authenticity_token': auth_token,
'previous_step': 'shipping_method',
'step': 'payment_method',
'checkout[shipping_rate][id]': shipping_method,
'button': '',
'checkout[client_details][browser_width]': '1280',
'checkout[client_details][browser_height]': '368',
'checkout[client_details][javascript_enabled]': '1'
}
shippingmethod = s.post(atc.url, data=ShipformData, headers=headers2, allow_redirects=True)
###Parsing payment_gateaway###
soup = BeautifulSoup(shippingmethod.text, 'html.parser')
ul = soup.find(attrs={'class': 'radio-wrapper content-box__row '})
payment_gateaway = ul.get('data-select-gateway')
###submitting payment info###
CCheaders = {
'accept': 'application/json',
'Origin': 'https://checkout.shopifycs.com',
'Accept-Language': 'en-US,en;q=0.8',
'Host': 'elb.deposit.shopifycs.com',
'content-type': 'application/json',
'Referer': 'https://checkout.shopifycs.com/number?identifier='+checkoutID+'&location=3A%2F%2Fcheckout.shopify.com%2F'+storeID+'%2Fcheckouts%2F'+checkoutID+'%3Fpreviousstep%3Dshipping_method%26step%3Dpayment_method',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
}
ccinfo = {
'number': "0000 0000 0000 0000",
'name': "First Last",
'month': 5,
'year': 2020,
'verification_value': "000"
}
creditcard = s.post('https://elb.deposit.shopifycs.com/sessions', json=ccinfo, headers=CCheaders, allow_redirects=True)
cc_verify = creditcard.json()
cc_verify_id = cc_verify['id']
###submitting credit card info##
paymentheaders = {
'Origin': 'https://checkout.shopify.com',
'Content-Type': 'application/x-www-form-urlencoded',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.8',
'Referer': 'https://checkout.shopify.com/'+storeID+'/checkouts/'+checkoutID,
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36'
}
paymentdata = {
'_method': 'patch',
'authenticity_token': auth_token,
'checkout[buyer_accepts_marketing]': '1',
'checkout[client_details][browser_height]': '979',
'checkout[client_details][browser_width]': '631',
'checkout[client_details][javascript_enabled]': '1',
'checkout[credit_card][vault]': 'false',
'checkout[different_billing_address]': 'false',
'checkout[payment_gateway]': payment_gateaway,
'checkout[total_price]': '1199',
'complete': '1',
'previous_step': 'payment_method',
's': cc_verify_id,
'step': '',
'utf8': '✓'
}
submitpayment = s.post(atc.url, data=paymentdata, headers=paymentheaders, allow_redirects=True)
print(UTCtoEST(), submitpayment.status_code, submitpayment.url)
Just a guess, but this isn't a proper key if you are trying to post JSON:
'checkout[total_price]': '1199',
You would need to rewrite it as
'checkout': {
    'total_price': '1199',
}
And you need to apply this change to all the other values in that format. For example:
'checkout': {
    'remember_me': '',
    'shipping_address': {
        'first_name': 'First',
        'last_name': 'Last',
    },
},
And I think you can use the Python False value instead of the string 'false', but that depends on the API.
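If the endpoint really does expect a JSON body (the question's own headers declare application/x-www-form-urlencoded, so this is only a hypothesis to test), the nested structure would be sent through requests' json= argument instead of data=, roughly like this:

# Hypothetical reshaping of the flat bracket keys into nested JSON -- only valid
# if the checkout endpoint actually accepts JSON rather than form-encoded data.
payload = {
    'utf8': '✓',
    '_method': 'patch',
    'authenticity_token': auth_token,
    'complete': '1',
    'previous_step': 'payment_method',
    's': cc_verify_id,
    'checkout': {
        'total_price': '1199',
        'payment_gateway': payment_gateaway,
        'credit_card': {'vault': False},
        'different_billing_address': False,
        'buyer_accepts_marketing': True,
    },
}
json_headers = {**paymentheaders, 'Content-Type': 'application/json'}
submitpayment = s.post(atc.url, json=payload, headers=json_headers)
print(submitpayment.status_code)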

Python3 requests module or urllib.request module both retrieving incomplete json

I'm doing some scraping and looking at pages like this one (https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897), but I have not been able to fully retrieve the JSON content. I have tried using both of the following pieces of code, but each returns an incomplete JSON object:
url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s' % (string_use_user, string_use_workout)
print(url)
response = urlopen(url)
try:
    reader = codecs.getreader("utf-8")
    print(reader(response))
    jsonresponse = json.load(reader(response))
    print(jsonresponse)
and similarly, using the requests library instead of urllib also fails to retrieve the full JSON:
url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s' % (string_use_user, string_use_workout)
print("using this url %s" % url)
r = requests.get(url)
try:
    print(r.json())
    jsonresponse = r.json()  # json.loads(response.read())
In both cases I get about 1/4 of the JSON. For example, in this case:
https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897
I received:
{'feed_id': 281475471235835, 'id': 526622897, 'duration': 4082.0, 'local_start_time': '2015-05-21T09:30:45.000+02:00', 'calories': 1073.0, 'tagged_users': [], 'altitude_max': 69.9523, 'sport': 0, 'distance': 11.115419387817383, 'altitude_min': 14.9908, 'include_in_stats': True, 'hydration': 0.545339, 'start_time': '2015-05-21T07:30:45.000Z', 'ascent': 137.162, 'is_live': False, 'pb_count': 2, 'playlist': [], 'is_peptalk_allowed': False, 'weather': {'wind_speed': 11, 'temperature': 12, 'wind_direction': 13, 'type': 3, 'humidity': 81}, 'speed_max': 24.8596, 'author': {'name': 'gfdgfd', 'id': 20261627, 'last_name': 'gdsgsk', 'gender': 0, 'expand': 'abs', 'picture': {'url': 'https://www.endomondo.com/resources/gfx/picture/18511427/thumbnail.jpg'}, 'first_name': 'gdsgds', 'viewer_friendship': 1, 'is_premium': False}, 'sharing': [{'share_time': '2015-05-21T08:45:19.000Z', 'type': 0, 'share_id': 1635690786663532}], 'show_map': 0, 'pictures': [], 'hashtags': [], 'descent': 150.621, 'speed_avg': 9.80291763746756, 'expand': 'full', 'show_workout': 0, 'points': {'expand': 'ref', 'id': 2199549878449}}
I am not receiving the long arrays within the data. I am also not even recovering all of the non-array data.
I ran the original page through a JSON validator, and it's fine. Similarly, I ran the JSON I do receive through a validator, and it's also fine - it doesn't show any signs of missing things unless I compare with the original.
I would appreciate any advice about how to troubleshoot this. Thanks.
Looks like this API is doing some User-Agent sniffing and only sending the complete content for what it considers to be actual web browsers.
Once you set a User-Agent header with the UA string of a common browser, you get the full response:
>>> UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
>>> url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'
>>> r = requests.get(url, headers={'User-Agent': UA})
>>>
>>> print(len(r.content))
96412
See the requests docs for more details on setting custom headers.
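For completeness, here is the same check as a small Python 3 script that also parses the body as JSON, so the previously missing arrays can be confirmed (assuming the endpoint is still reachable):

import requests

UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'
r = requests.get(url, headers={'User-Agent': UA})

print(len(r.content))          # should report the full body rather than the truncated one
workout = r.json()
print(sorted(workout.keys()))  # the long arrays that were missing earlier should now be populated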
