I am trying to read a web page using a get request in python.
The original URL is given here. I found out that the information I am interested in is in a subpage with this URL (I replaced the authenticity token with XXX).
I tried using the second URL in my script but I get a 406 error. Can you suggest what am I doing wrong? Is the authenticity token for preventing scraping? if so, can I work around it?
import urllib.request
url = ...
agent={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = urllib.request.Request(url,headers=agent)
data = urllib.request.urlopen(req)
Thanks!
PS, This is how I get the URL using Chrome:
First I browse to https://www.goodreads.com/book/show/385228.On_Liberty
Then I open Chrome's developer tools: three dots -> more tools -> developer tools. Choose the network tab.
Then I go to the bottom of the page (just after the last review) and click "next".
In the tool window choose the request and in the header I get the Request URL: https://www.goodreads.com/book/reviews/385228?csm_scope=&hide_last_page=true&language_code=en&page=2&authenticity_token=XXX
Can you try to update your headers to include one more item, like:
headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3',
'X-Requested-With': 'XMLHttpRequest',
}
req = urllib.request.Request(url,headers= headers)
I managed to get 200 OK back when adding that header, however, the response you'll get back from this endpoint might not really be what you need in the end, since it is a piece of JavaScript code which in return updates the HTML page. You can still use it in some way, but it's very dirty approach and might complicate things a lot.
What information do you need exactly? There might be a different approach than using that "problematic" response from your second URL.
Related
I'm trying to make a web scraper with python, I made it with selenium but it is really slow.Then i saw that i could speed up the project because of a button that make a post request.
import requests
from bs4 import BeautifulSoup
url = "http://vidtome.host/tnoz00am9j8p"
myobj = {
'op': 'download1',
'code':'tnoz00am9j8p',
'hash': 'the hash',
'imhuman': 'Proceed to video'
}
x = requests.post(url, data = myobj)
print(x.text)
That's the code and it works but only for the first time.
When I started it the first time it doesn't show any error and it printed me out the page with the right changes, but when i started it later it gave me no error but it printed me out the page with no changes like it doesn't do anything.
How can it be possible?
Requests are faster, but you cannot extract dynamically rendered content. However this is probably not the issue.
Problem is that you do not have access to the website.
If it is a basic human checking system, you could try to add user agent to your request
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.68',
}
r = requests.get(url, headers=headers)
If this will not work, I would recommend looking into the data that you are passing. Maybe it is validating through it and it contains expired values or something.
As the title above states I am getting a 403 error. The URLs generated are valid, I can print them and then open them in my browser just fine.
I've got a user agent, it's the exact same one that my browser sends when accessing the page I want to scrape pulled straight from chrome devtools. I've tried using sessions instead of a straight request, I've tried using urllib, and I've tried using a generic request.get.
Here's the code I'm using, that 403s. Same result with request.get etc.
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}
session = requests.Session()
req = session.get(URL, headers=headers)
So yeah, I assume I'm not creating the useragent write so it can tell I am scraping. But I'm not sure what I'm missing, or how to find that out.
I got all headers from DevTools and I started removing headers one by one and I found it needs only Accept-Language and it doesn't need User-Agent and it doesn't need Session.
import requests
url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim¤cy=nzd&cc=NZD'
headers = {
'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}
r = requests.get(url, headers=headers)
data = r.json()
print(data['docs'][0]['name'])
Result:
The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
This question is no duplicate, as adding user-agent to a header isn't fixing anything.
I've been trying to get the response from this URL. It's an XML feed, not an HTML file. It's a live feed updated every second from cashpoint.com's live soccer page. I can just fine get the HTML page from the last-mentioned page, but from the first mentioned URL I can't retrieve the XML data. I can inspect it with google chrome inspector and see the response just fine. But it returns b''. Tried both get and post.
EDIT: Tried to add some more headers but it still doesn't work.
Shouldn't it be possible it retrieve this information if the inspector can see it?
Below is my code and some pictures (if you are too busy to check the links).
import requests
class GetFeed():
def __init__(self):
pass
def live_odds(self):
live_index_page = 'https://www.cashpoint.dk/en/live/index.html'
live_oddsupdate = 'https://www.cashpoint.dk/index.php?r=games/oddsupdate'
r = requests.get(live_oddsupdate)
print(r.text)
feed = GetFeed()
feed.live_odds()
Well, for one, in the Chrome console you can see that it's a POST request, and it appears you're performing a GET request in your Python code.
You need to have some data and some headers in a post request. Try this:
url = 'https://www.cashpoint.dk/index.php?r=games/oddsupdate'
headers = {
"X-Requested-With": "XMLHttpRequest",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
"Content-Type": "application/x-www-form-urlencoded",
"Cookie": "_ga=GA1.2.517291307.1531264976; _gid=GA1.2.1421702183.1531264976; _pk_id.155.9810=7984b0a0e139caba.1531264979.1.1531264979.1531264979.; cookieConsent=1; cpLanguage=en; langid=2; ad_network=DIRECT; PHPSESSID=f4mbbfd8adb3m59gfc1pelo126"
}
data = "parameters%5Baction%5D=odds_update¶meters%5Bgame_category%5D=live¶meters%5Bsport_id%5D=¶meters%5Btimestamp%5D=1531268162¶meters%5Bgameids%5D=%5B905814%2C905813%2C905815%2C905818%2C905792%5D&formToken=c3fed3ea6b46dae171a6f1a6d21db14fcc21474c"
response = requests.post(url, data=data, headers=headers)
print response.content
Just tested this and it works. The point here is that ALL of this information can be found on the exact same xhr network inspection in google chrome. Next time, please read up on xmlhttprequests before you post a question.
I have been trying to do a get request on a YouTube video page in order to read simple information off of the page. I have done this many times before, and generally it is quite easy to reverse engineer a get request with the help of Google Chrome's developer tools.
To demonstrate, here is a screen shot of the request I get when I reload a YouTube video in a fresh incognito window (to prevent cookies from being sent) as seen from the developer menu:
chrome screenshot
Every time I close the window and reload the page I recieve nearly identical HTML (apart from authorization keys and the like), the bottom of which can seen here: another chrome screenshot
First I tried recreating this request using a header-less get with Requests in Python:
import requests
sesh = requests.Session()
print sesh.get("https://www.youtube.com/watch?v=5eA8IVrQWn8").content
this returns a different page which still contains some of the data present on the page I get from chrome but not nearly all of it. Next I tried including all the headers I saw in the chrome request using the following code:
import requests
sesh = requests.Session()
headers = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"accept-encoding": "gzip, deflate, br",
"accept-language":"en-US,en;q=0.8",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}
print sesh.get("https://www.youtube.com/watch?v=5eA8IVrQWn8", headers = headers).content
However this very strangely returns a seemingly random quick paragraph of unicode characters in varying lengths, sometimes around 10 characters long, sometimes closer to 50. I couldn't think of any other ways to make this closer to the request I was seeing from chrome. I tried fiddling with this for a couple of hours doing things like running the request multiple times in the same session and messing with the headers a bit, but to no avail.
FInally out of desperation I tried dropping everything except the user agent, using the following code:
import requests
sesh = requests.Session()
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}
print sesh.get("https://www.youtube.com/watch?v=5eA8IVrQWn8", headers = headers).content
And this got me the page I wanted.
However I am left unsatisfied with the knowledge that somehow replicating the Get I was seeing in chrome didn't work. What am I missing from my second attempt?
I would like to get store info from the web-site(http://www.hilife.com.tw/storeInquiry_street.aspx).
The method I found by chrome is POST.
By using below method, I still cannot access.
Could someone give me a hint?
import requests
from bs4 import BeautifulSoup
head = {
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
}
payload = {
'__EVENTTARGET':'AREA',
'__EVENTARGUMENT':'',
'__LASTFOCUS':'',
'__VIEWSTATE':'/wEPDwULLTE0NjI2MjI3MjMPZBYCAgcPZBYMAgEPZBYCAgEPFgIeBFRleHQFLiQoJyNzdG9yZUlucXVpcnlfc3RyZWV0JykuYXR0cignY2xhc3MnLCdzZWwnKTtkAgMPEA8WBh4NRGF0YVRleHRGaWVsZAUJY2l0eV9uYW1lHg5EYXRhVmFsdWVGaWVsZAUJY2l0eV9uYW1lHgtfIURhdGFCb3VuZGdkEBUSCeWPsOWMl+W4ggnln7rpmobluIIJ5paw5YyX5biCCeWunOiYree4ownmlrDnq7nnuKMJ5qGD5ZyS5biCCeiLl+agl+e4ownlj7DkuK3luIIJ5b2w5YyW57ijCeWNl+aKlee4ownlmInnvqnnuKMJ6Zuy5p6X57ijCeWPsOWNl+W4ggnpq5jpm4TluIIJ5bGP5p2x57ijCemHkemWgOe4ownmlrDnq7nluIIJ5ZiJ576p5biCFRIJ5Y+w5YyX5biCCeWfuumahuW4ggnmlrDljJfluIIJ5a6c6Jit57ijCeaWsOeruee4ownmoYPlnJLluIIJ6IuX5qCX57ijCeWPsOS4reW4ggnlvbDljJbnuKMJ5Y2X5oqV57ijCeWYiee+qee4ownpm7LmnpfnuKMJ5Y+w5Y2X5biCCemrmOmbhOW4ggnlsY/mnbHnuKMJ6YeR6ZaA57ijCeaWsOerueW4ggnlmInnvqnluIIUKwMSZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnFgECB2QCBQ8QDxYGHwEFCXRvd25fbmFtZR8CBQl0b3duX25hbWUfA2dkEBUWBuS4reWNgAbmnbHljYAG5Y2X5Y2ABuilv+WNgAbljJfljYAJ5YyX5bGv5Y2ACeilv+Wxr+WNgAnljZflsa/ljYAJ5aSq5bmz5Y2ACeWkp+mHjOWNgAnpnKfls7DljYAJ54OP5pel5Y2ACeixkOWOn+WNgAnlkI7ph4zljYAJ5r2t5a2Q5Y2ACeWkp+mbheWNgAnnpZ7lsqHljYAJ5aSn6IKa5Y2ACeaymem5v+WNgAnmoqfmo7LljYAJ5riF5rC05Y2ACeWkp+eUsuWNgBUWBuS4reWNgAbmnbHljYAG5Y2X5Y2ABuilv+WNgAbljJfljYAJ5YyX5bGv5Y2ACeilv+Wxr+WNgAnljZflsa/ljYAJ5aSq5bmz5Y2ACeWkp+mHjOWNgAnpnKfls7DljYAJ54OP5pel5Y2ACeixkOWOn+WNgAnlkI7ph4zljYAJ5r2t5a2Q5Y2ACeWkp+mbheWNgAnnpZ7lsqHljYAJ5aSn6IKa5Y2ACeaymem5v+WNgAnmoqfmo7LljYAJ5riF5rC05Y2ACeWkp+eUsuWNgBQrAxZnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnZ2dnFgECBGQCBw8PFgIfAAUJ5Y+w5Lit5biCZGQCCQ8PFgIfAAUG5YyX5Y2AZGQCCw8WAh4LXyFJdGVtQ291bnQCAhYEZg9kFgJmDxUFBEg2NDYP5Y+w5Lit5aSq5bmz5bqXIOWPsOS4reW4guWMl+WNgDQwNOWkquW5s+i3rzcy6JmfCzA0LTIyMjkwOTI4Azg3MGQCAQ9kFgJmDxUFBDQzMTgP5Y+w5Lit5rC45aSq5bqXM+WPsOS4reW4guWMl+WNgDQwNOWkquWOn+i3r+S6jOautTI0MOiZn+S4gOaok+WFqOmDqAswNC0yMzY5MDA1NwM4NzFkZFHxmtQaBu2Yr9cvskfEZMWn57JLRfjPYBFYDy+tHr6X',
'__VIEWSTATEGENERATOR':'B77476FC',
'__EVENTVALIDATION':'/wEdACtWrrgS52/ojbuYEYvRDXHZ2ryV+Ed5kWYedGp5zjHs3Neeeo/9TTvNTdElW+hiVA25mZnLEQUYPOZFLnuVu9jOT+Zq1/xceVgC7GxWRM+A8tOS3xZBjlhgzlx5UN3H3D0UrdtoyeScvRqxFL8L3gGKRyCJu029oItLX7X6c7SW7C7IVzuAeZ6t9kFMeOQus7MtrV7YeOXrlOP8inI96UkaJEU7Ro3FtK29+B+NamR2j4qInKVwJ4+JD3cjWm5buZdnOhT/ISzrljaf+F9GnVjm4dGchVglf1PxMMHl7EEoLjs20TZ856RDCGXvzK/6J+tEFp7zDvFTYGoeHtuHy+YF/IoR/CRFBAaEkys48FIAUCSUKnxACPyW6Ar2guIADjOqYue7v4fhV1jIq65P/lwanoaJpIsboCbjakbTYnqK8BLngMayrRehyT58dmj3SbzY1mOtzSNnakdpUxaC0EpOJ7rhB52A2FKsxy5EbP0PwHHuHNMa9dit0AxPMfYUP1/LWuYPWMX0W8tyEMKxoUcYsCb+qJLF9yXPgM6c8sIQTRxcBokm1PGzFN4M6vnSF8OfFSC+c0frLZ4GH6l497B/5oDIjq7Bz4/cPeGCavvh9NUqPcmzJIr8Abx9vjtMGpZSwBdVY3bR/ARswIDrmWLt1qMD4jcRvGPxBa8nsRR8HNdVINbR+iOSFLwVhBCg+s+mV5NeTdOKvAeggfOsJHmJKL0ApQSCyjY5kEiOvo2JAI07C08ENIFF7HpDTaGCi93i2WnmdDrYoaoLZi96dRTlk4xoWV9tc7rd9X/wE6QoKHxFtADSz9WkgtbUn88lAhY2++OiqWCaQZobh7K26ndH1z34JXVB7C/AiOEV+CCb97oVyooxWullV44iFQ0isVBjYC1XWS3eGf1PwMS++A+EjQTkl9VJhIRDoS6sg2mD7mikimBjQGvZX/lcYtKSrjY=',
'CITY':'台中市',
'AREA':'北區'
}
res = requests.post('http://www.hilife.com.tw/storeInquiry_street.aspx', data=payload, headers = head)
res.encoding = 'utf-8'
print res.text
I see that you are missing this : Content-Type:application/x-www-form-urlencoded, you have to send a header like this as well as send data in x-www-form-urlencoded format . I recommend using POSTMAN for testing before writing the code. Also make sure you have relevant permission before crawling third party website. Happy Crawling.