Python requests fail to scrape a web page by 403 Forbidden

Python requests fail to scrape a web page by 403 Forbidden - python

I want to audit trains timetable. The trains have a GPS and their positions are published in https://trenesendirecto.sofse.gob.ar/mapas/sanmartin/index.php My plan is to scrape the train positions and check the time that they arrive to the stations and publish this info to all users.
In order to obtain train coordinates I write the following script in Python
import requests, random, string
#Function to generate random code for rnd
def RandomGenerator():
x = ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits) for _ in range(16))
return x
# URL requests
url = 'https://trenesendirecto.sofse.gob.ar/mapas/ajax_posiciones.php'
parametros = {
'ramal':'31',
'rnd':RandomGenerator(),
'key':'v%23v%23QTUNWp%23MpWR0wkj%23RhHTqVUM'}
encabezado = {
'Host': 'trenes.sofse.gob.ar',
'Referer': 'https://trenesendirecto.sofse.gob.ar/mapas/sanmartin/index.php',
'X-Requested-With': 'XMLHttpRequest',
'Accept':'application/json, text/javascript, */*',
'UserAgent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/65.0.3325.146 Safari/537.36'
}
res = requests.get(url, params = parametros, headers = encabezado, timeout=1)
# Output
print(res.url)
print(res.headers)
print(res.status_code)
print(res.content)
The output is:
https://trenesendirecto.sofse.gob.ar/mapas/ajax_posiciones.php?ramal=31&key=v%2523v%2523QTUNWp%2523MpWR0wkj%2523RhHTqVUM&rnd=ui8GObHTSpVpPqRo
{'Date': 'Tue, 13 Mar 2018 12:16:03 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Server': 'nginx'}
403
b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'
Using the same url generated by the requests in the browser I obtain the following
output from browser, which is exactly what I want.
Why the script does not work?
Is there any other method to obtain the data?

Have you tried testing the API url on a REST Client such as Postman or Mozilla's RESTClient add-on? This is the first step in web development before you can consume web-services in an application.
Besides, error code 403 means you may not be authorized to access this data or do not have the right permissions set. The latter in most usually the case with 403 errors as it differs from a 401 error.
You must confirm whether the API uses Basic Auth or token-based authentication.
A general GET request on RESTClient for this url gives status: 200OK which means the endpoint responds to HTTP requests but needs authorization if you want to request certain information.

Related

Python Requests login: i have error 403 but the request looks correct

I am trying to login into www.zalando.it using the requests library, but every time I try to post my data I am getting a 403 error. I saw in the network tab from Zalando and the login call and is the same. These are just dummy data, you can test creating a test account.
Here is the code for the login function:
import requests
import data
from bs4 import BeautifulSoup
home_page_link = "https://www.zalando.it/"
login_api_schema = "https://accounts.zalando.com/api/login/schema"
login_api_post = "https://accounts.zalando.com/api/login"
headers = {
'Host': 'www.zalando.it',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'DNT': '1',
'Connection' : 'close',
'Upgrade-Insecure-Requests': '1'
}
with requests.Session() as s:
s.headers.update(headers)
r = s.get(home_page_link)
# fetch these cookies: frsx, Zalando-Client-Id
cookie_dict = s.cookies.get_dict()
# update the headers
# remove this header for the xhr requests
del s.headers['Upgrade-Insecure-Requests']
# these 2 are taken from some response cookies
s.headers['x-xsrf-token'] = cookie_dict['frsx']
s.headers['x-zalando-client-id'] = cookie_dict['Zalando-Client-Id']
# i didn't pay attention to where these came from
# just saw them and manually added them
s.headers['x-zalando-render-page-uri'] = '/'
s.headers['x-zalando-request-uri'] = '/'
# this is sent as a response header and is needed to
# track future requests/responses
s.headers['x-flow-id'] = r.headers['X-Flow-Id']
# only accept json data from xhr requests
s.headers['Accept'] = 'application/json'
# when clicking the login button this request is sent
# i didn't test without this request
r = s.get(login_api_schema)
# add an origin header
s.headers['Origin'] = 'https://www.zalando.it'
# finally log in, this should return a 201 response with a cookie
login_data = {'email:': data.email,
'request': data.request,
'secret': data.secret}
r = s.post(login_api_post, json=login_data)
print(r.status_code)
print(r.headers)
I also used fiddler to try to sniff the data traffic but the HTTPS request is not performed and the latter generates the following exception:
requests.exceptions.ProxyError: HTTPSConnectionPool(host='accounts.zalando.com', port=443) : Max retries exceeded with url: /api/login
(Caused by ProxyError('Your proxy appears to only use HTTP and not HTTPS, try changing your proxy URL to be HTTP. See: https://urllib3.readthedocs.io/en /1.26.x/advanced-usage.html#https-proxy-error-http-proxy',
SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1091)'))))
For the HTTP request instead I get a 301 error

Maybe this answer helps you.
I think youru website might detect, if a request gets sent with javascript.

Requests Python - Cross Origin API access denied even with headers, user-agent, and token. How do I get authroized?

Hello I'm new to dealing with servers and requests so bear with me,
I am trying to send get requests to an API hosted on a different server than the site itself (soldhistory.ca) so that I can get a JSON response of a list of property ids and then manually send individual requests to the API sever to get the details of each property. The later part works fine with my authorization token (and doesn't when I use a random string) so I know the issue isn't my authorization token but getting the initial json of property ids doesn't work.
I keep getting an 'Access Denied' - 403 message even when I send the headers, authorization token, and user-agent. When I send the wrong token, I get a different message that says 'Access Denied: Token not found or Expired'. When I send no token, it says 'Signature Required'. I think I am missing something trivial here. I have also noticed that when I login, the json response of that has an access token and another token that is different from the access token in it called 'token' which I think may have something to do with the problem I am experiencing but I have no idea what do with it. All in all, how do I get authorized to be able to send requests to the API server?
I have included a dummy account I have made with fake credentials in the code below if anyone wants to send requests. If you visit the site ,zoom out of the maps entirely and filter to show any price and only show sold properties, you will see there is data on roughly 450 000 past properties sold in Canada that I would like to get. My end goal is to get this data. If anyone can help me out I would greatly appreciate it.
It is worth noting, I have also tried using selenium to initially go to the homepage and then transfer the cookies to the requests session but that didn't work either. I have also tried using selenium-requests with no luck either but maybe I did not implement those right.
Also, if you look at the XMLHttpRequests of the site you will that there is an initial request called properties that is made and then the subsequent get requests are generated from that json response. I am trying to get the JSON response for properties. It is a SEARCH method.
Code:
import requests
s = requests.Session()
s.get('https://www.soldhistory.ca')
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0',
'Accept': '*/*',
'Accept-Language': 'en-CA,en-US;q=0.7,en;q=0.3',
'Access-Control-Request-Method': 'SEARCH',
'Access-Control-Request-Headers': 'access-token,access_token,content-type',
'Referer': 'https://www.soldhistory.ca/mapsearchapp/search/eyJzZWFyY2giOnsic2VhcmNoVHlwZSI6InJlc2lkZW50aWFsIiwibGlzdGluZ1R5cGUiOlsiU29sZCJdLCJvcGVuSG91c2UiOnsiZnJvbSI6MCwidG8iOjB9LCJiZWQiOjAsIm1hcmtldGRheXMiOjAsImJhdGgiOjAsInNlYXJjaEJ5Ijoic2VhcmNoYWxsIiwic2VhcmNoQnlUZXh0IjoiIiwicHJpY2VSYW5nZSI6eyJtaW4iOjAsIm1heCI6MH0sImZlZXRSYW5nZSI6eyJtaW4iOjAsIm1heCI6MH0sInNob3dPbmx5IjpbXSwicHJpY2VEcm9wIjpmYWxzZSwic29ydGJ5IjoibmV3ZXN0IiwiY29uZG9UeXBlIjoiIiwiY29uZG9PY2N1cGFuY3kiOiIiLCJjb25kb1N0YXR1cyI6IiIsImNvbmRvQnVpbGRlciI6IiIsImtleXdvcmRzIjpbXSwiUG9zdGFsQ29kZSI6ZmFsc2UsIlByb3ZpbmNlIjpmYWxzZSwiQ2l0eSI6ZmFsc2UsImNpdHlOYW1lIjoiTWVuZXNldCJ9LCJsb2NhdGlvbiI6eyJMb25naXR1ZGUiOi04Ni43Njc2MTkyMDA0ODM5MiwiTGF0aXR1ZGUiOjUzLjIzNjIzOTgyNTA1NjUxLCJab29tIjoyLCJtYXBWaWV3VHlwZSI6InJvYWRtYXAiLCJtYXBJbmZvVHlwZSI6W10sInNlbGVjdGVkUGF0aElEIjoiIiwiQm91bmRzIjp7InNvdXRoIjotNC41NDgwMzU0MjY0NTgxNzQsIndlc3QiOi0xODAsIm5vcnRoIjo3OC4zNTI5NDI4MzEyNjQ2MywiZWFzdCI6MTgwfX0sImNvbnRyb2xTcGVjaWFsIjp7fX0=',
'Origin': 'https://www.soldhistory.ca',
'Connection': 'keep-alive',
}
response = s.options('https://api.mapsearch.vps-private.net/properties', headers=headers)
headers2 = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0',
'Accept': 'application/json',
'Accept-Language': 'en-CA,en-US;q=0.7,en;q=0.3',
'Content-Type': 'application/json',
'Origin': 'https://www.soldhistory.ca',
'Connection': 'keep-alive',
'Referer': 'https://www.soldhistory.ca/',
}
data2 = '{"mail":"robbydummy123#gmail.com","pass":"helloworld"}'
response2 = s.post('https://www.soldhistory.ca/mapsearchapp/visitor/login', headers=headers2, data=data2, verify=True)
parsed = response2.json()
print(json.dumps(parsed, indent=1, sort_keys=True))
accessToken = parsed['accessToken']
headers3 = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0',
'Accept': 'application/json',
'Accept-Language': 'en-CA,en-US;q=0.7,en;q=0.3',
'access_token': accessToken,
'Access-Token': accessToken,
'Content-Type': 'application/json',
'Content-Length': '317',
'Origin': 'https://www.soldhistory.ca',
'Connection': 'keep-alive',
'Referer': 'https://www.soldhistory.ca/mapsearchapp/search/eyJzZWFyY2giOnsic2VhcmNoVHlwZSI6InJlc2lkZW50aWFsIiwibGlzdGluZ1R5cGUiOlsiU29sZCJdLCJvcGVuSG91c2UiOnsiZnJvbSI6MCwidG8iOjB9LCJiZWQiOjAsIm1hcmtldGRheXMiOjAsImJhdGgiOjAsInNlYXJjaEJ5Ijoic2VhcmNoYWxsIiwic2VhcmNoQnlUZXh0IjoiIiwicHJpY2VSYW5nZSI6eyJtaW4iOjAsIm1heCI6MH0sImZlZXRSYW5nZSI6eyJtaW4iOjAsIm1heCI6MH0sInNob3dPbmx5IjpbXSwicHJpY2VEcm9wIjpmYWxzZSwic29ydGJ5IjoibmV3ZXN0IiwiY29uZG9UeXBlIjoiIiwiY29uZG9PY2N1cGFuY3kiOiIiLCJjb25kb1N0YXR1cyI6IiIsImNvbmRvQnVpbGRlciI6IiIsImtleXdvcmRzIjpbXSwiUG9zdGFsQ29kZSI6ZmFsc2UsIlByb3ZpbmNlIjpmYWxzZSwiQ2l0eSI6ZmFsc2UsImNpdHlOYW1lIjoiTWVuZXNldCJ9LCJsb2NhdGlvbiI6eyJMb25naXR1ZGUiOi04Ni43Njc2MTkyMDA0ODM5MiwiTGF0aXR1ZGUiOjUzLjIzNjIzOTgyNTA1NjUxLCJab29tIjoyLCJtYXBWaWV3VHlwZSI6InJvYWRtYXAiLCJtYXBJbmZvVHlwZSI6W10sInNlbGVjdGVkUGF0aElEIjoiIiwiQm91bmRzIjp7InNvdXRoIjotNC41NDgwMzU0MjY0NTgxNzQsIndlc3QiOi0xODAsIm5vcnRoIjo3OC4zNTI5NDI4MzEyNjQ2MywiZWFzdCI6MTgwfX0sImNvbnRyb2xTcGVjaWFsIjp7fX0=',
}
data3 = '{"query":{"coordinates":{"$geoWithin":{"$box":[[160.3305414532229,35.087235763335656],[2.6547602032228923,71.87799155489013]]}},"searchType":"residential","listingType":{"$in":["Sale","Sold","Rent"]}},"fields":["Latitude","Longitude","listingType","searchType","Price"],"sort":{"Price":1},"limit":20,"soldData":false}'
response3 = s.post('https://api.mapsearch.vps-private.net/properties', headers=headers3, data=data3)
parsed = response3.json()
print(json.dumps(parsed, indent=1, sort_keys=True))
print(response3.status_code)

Python requests returning succesful POST request, but not redirecting page on Twitter

I've got a script that is meant to at first login to Twitter. I get a 200 response if I check it, but I don't redirect to a logged in Twitter account after succeeding, instead it stays on the same page.
url = 'https://twitter.com/login/error?redirect_after_login=%2F'
r = requests.session()
# Need the headers below as so Twitter doesn't reject the request.
headers = {
'Host': "twitter.com",
'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
'Accept-Language': "en-US,en;q=0.5",
'Accept-Encoding': "gzip, deflate, br",
'Referer': "https://twitter.com/login/error?redirect_after_login=%2F",
'Upgrade-Insecure-Requests': "1",
'Connection': "keep-alive"
}
login_data = {"session[username_or_email]":"Username", "session[password]":"Password"}
response = r.post(url, data=login_data, headers=headers, allow_redirects=True)
How do I go about redirecting to my account upon successful POST request to the logged in state. Am I not using the correct headers or something like that? I've not done a huge amount of web stuff before, so I'm sorry if it's something really obvious.
Note: (I cannot use Twitter API to do this) & The referrer is the error page because that's where I'm logging in from - unless of course I'm wrong in doing that.

Perhaps the GET parameter redirect_after_login will use kind of javascript or html meta refresh redirection instead of HTTP redirection, so if it's the case, the requests python module will not handle it correctly.
So once you retrieve your authentication token from your first request, you could make again the second request to https://twiter.com/ without to forget your specify your security token from your HTTP request fields. You can find more information about REST API of twitter here: https://dev.twitter.com/overview/api
But the joy of python is to have libraries for everything, so I suggest you to take a look here:
https://github.com/bear/python-twitter
It's a library to communicate with the REST API of twitter.

How to get response from post with python

I'm trying to check if a current #hotmail.com address is taken.
However, I'm not getting the response I would have gotten using chrome developer tools.
#!/usr/bin/python
import urllib
import urllib2
import requests
cookies = {
'MC0': '1449950274804',
'mkt': 'en-US',
'MSFPC': 'ID=a9b016cd39838248bbf321ea5ad1ecae&CS=1&LV=201512&V=1',
'wlv': 'A|ekIL-d:s*cAHzDg.2+1+0+3',
'HIC': '7c5d20284ecdbbaa||0|||',
'wlxS': 'wpc=1&WebIM=1',
'RVC': 'm=1&v=17.5.9510.1001&t=12/12/2015 20:37:45',
'amcanary': '0',
'CkTst': 'MX1449957709484',
'LDH': '9',
'wla42': 'KjEsN0M1RDIwMjg0RUNEQkJBQSwsLDAsLTEsLTE=',
'LN': 'u9GMx1450021043143',
}
headers = {
'Origin': 'https://signup.live.com',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.8,ja;q=0.6',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36',
'canary': 'aeIntzIq6OCS9qOE2KKP2G6Q7yCCPLAQVPIw0oy2Vksln3bbwVR9I8DcpfzC9RiCnNiJBw4YxtWsqJfnx0PeR9ovjRG+bF1jKkyPVWUTyuDTO5UkwRNNJFTIdeaClMgHtATSy+gI99ojsAKwuRFBMNbOgCwZIMCRCmky/voftX/63gjTqC9V5Ry/bECc2P66ouDZNC7TA/KN6tfsmszelEoSrmvU7LAKDoZnkhRQjpn6WYGxUzr5S+UYXExa32AY:1:3c',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Accept': 'application/json',
'Referer': 'https://signup.live.com/signup?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https',
'X-Requested-With': 'XMLHttpRequest',
'Connection': 'keep-alive',
}
data = {"signInName":"testfoobar1234#outlook.com","uaid":"f1d115020fc94af6ba17e722277cdcb8","performDisambigCheck":"true","includeSuggestions":"true","uiflvr":"1001","scid":"100118","hpgid":"200407"}
asdf = requests.post('https://signup.live.com/API/CheckAvailableSigninNames?wa=wsignin1.0&rpsnv=12&ct=1450038320&rver=6.4.6456.0&wp=MBI_SSL_SHARED&wreply=https', headers=headers, cookies=cookies, data=data)
print(asdf.json())
This is what chrome gives me when checking testfoobar1234#hotmail.com:
This is what my script is giving me testfoobar1234#hotmail.com:

If you want to connect via python script on your local machine to login.live.com with right credentials but cookies from your Chrome -- it's will not work.
What you want to do: read emails, send email, or just get contacts from address book. Algorithms in script will be different. Example, Mails available via outlook.com system, contacts located in people.live.com (and API as I right remember).
If you want emulate login like Chrome do, you need:
Get and collect all cookies from outlook.com main page, don't forget about all redirects:) - via your python script
Send request with collected cookies and credentials, to login.live.com (outlook will redirect to it).
But, from my experience -- last Outlook version (regular and Outlook Preview systems) in 90% detects wrong attempt of login and send to you page with confirm login question (code or email). That way you will have unstable solution. Do you really want to do it?
If you just want to parse JSON right you need:
import json
data = json.loads(asdf.text)
print(data)
If you want to see, how much actions produced by browser, just install Firebug and disable cleaning "Network" panel, then see how many requests processed before you logged in into your account.
But, for see all traffic suggest to use Firefox + Firebug + Tamper Data.
And also, I think more quicker will be use exists libs like Selenium for browser emulation.

Python httplib2: HTTPS login fails

I am trying to use httplib2 to log in to a web page. I am able to log in to the page by simply opening the following URL in a Chrome incognito window:
https://domain.com/auth?name=USERNAME&pw=PASSWORD
I tried the following code to emulate this login with httplib2:
from httplib2 import Http
h = Http(disable_ssl_certificate_validation=True)
resp, content = h.request('https://domain.com/auth?name=USERNAME&pw=PASSWORD')
Unfortunately, this request does not lead to a successful login.
I tried changing the request headers to match those provided by Chrome:
headers = {
'Host': 'domain.com',
'Connection': 'keep-alive',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8'
}
resp, content = h.request('https://domain.com/auth?name=USERNAME&pw=PASSWORD', 'GET', headers=headers)
This changes the response slightly, but still does not lead to a successful login.
I tried inspecting the actual network traffic with Wireshark but since it's HTTPS and thus encrypted, I can't see the actual traffic.
Does anybody know what the difference in requests between Chrome and httplib2 could be? Maybe httplib2 changes some of my headers?

Following Games Brainiac's comment, I ended up simply using Python Requests instead of httplib2. The following requests code works out of the box:
import requests
session = requests.Session()
response = session.get('https://domain.com/auth?name=USERNAME&pw=PASSWORD')
Further requests with the same username/password can simply be performed on the Session object:
...
next_response = session.get('https://domain.com/someOtherPage')

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python requests fail to scrape a web page by 403 Forbidden - python

Related

Python Requests login: i have error 403 but the request looks correct

Requests Python - Cross Origin API access denied even with headers, user-agent, and token. How do I get authroized?

Python requests returning succesful POST request, but not redirecting page on Twitter

How to get response from post with python

Python httplib2: HTTPS login fails

Categories

Resources