About handling a redirection in Python

I am new to Python and am trying to learn some new modules. Fortunately or unfortunately, I picked up the urllib2 module and started using it with one URL that's causing me problems.
To begin with, I created the Request object and then called read() on the response object. It was failing. It turns out the request is getting redirected, but the status code is still 200. Not sure what's going on. Here is the code --
def get_url_data(url):
    print "Getting URL " + url
    user_agent = "Mozilla/5.0 (Windows NT 6.0; rv:14.0) Gecko/20100101 Firefox/14.0.1"
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        print e.geturl()
        print e.info()
        print e.getcode()
        return False
    else:
        print response
        print response.info()
        print response.getcode()
        print response.geturl()
        return response
I am calling the above function with "http://www.chilis.com".
I was expecting to receive a 301, 302, or 303 but instead I see 200. Here is the response I see --
Getting URL http://www.chilis.com
<addinfourl at 4354349896 whose fp = <socket._fileobject object at 0x1037513d0>>
Cache-Control: private
Server: Microsoft-IIS/7.5
SPRequestGuid: 48bbff39-f8b1-46ee-a70c-bcad16725a4d
X-SharePointHealthScore: 0
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
MicrosoftSharePointTeamServices: 14.0.0.6120
X-MS-InvokeApp: 1; RequireReadOnly
Date: Wed, 13 Feb 2013 11:21:27 GMT
Connection: close
Content-Length: 0
Set-Cookie: BIGipServerpool_http_chilis.com=359791882.20480.0000; path=/
200
http://www.chilis.com/(X(1)S(q24tqizldxqlvy55rjk5va2j))/Pages/ChilisVariationRoot.aspx?AspxAutoDetectCookieSupport=1
Can someone explain what this URL is and how to handle it? I know I can use the "Handling Redirects" section from diveintopython.net, but with the code on that page I also see the same 200 response.
EDIT: Using the code from DiveIntoPython, I see it's a temporary redirection. What I don't understand is why the HTTP error code from the code is 200. Isn't that supposed to be the actual return code?
EDIT2: Now that I see it better, it's not a weird redirection at all. I am editing the title.
EDIT3: If urllib2 follows the redirection automatically, I am not sure why the following code does not get the front page for chilis.com.
docObj = get_url_data(url)
doc = docObj.read()
soup = BeautifulSoup(doc, 'lxml')
print(soup.prettify())
If I use the URL that the browser eventually ends up getting redirected to, it works (http://www.chilis.com/EN/Pages/home.aspx).

urllib2 automatically follows redirects, so the information you're seeing is from the page that was redirected to.
If you don't want it to follow redirects, you'll need to subclass urllib2.HTTPRedirectHandler. Here's a relevant SO posting on how to do that: How do I prevent Python's urllib(2) from following a redirect
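For illustration, a minimal sketch of such a subclass (the class name is mine): returning None from redirect_request makes urllib2 raise an HTTPError carrying the 3xx code instead of following the redirect.
import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops urllib2 from following the redirect;
        # opener.open() then raises an HTTPError with the 3xx status.
        return None

opener = urllib2.build_opener(NoRedirectHandler)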
Regarding EDIT 3: it looks like www.chilis.com requires accepting cookies. This can be implemented using urllib2, but I would suggest installing the requests module (http://pypi.python.org/pypi/requests/).
The following seems to do exactly what you want (without any error handling):
import requests
from bs4 import BeautifulSoup

url = 'http://www.chilis.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
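For completeness, the urllib2 route the answer alludes to is a cookie-aware opener, roughly like this (a sketch, untested against this particular site):
import cookielib
import urllib2

# HTTPCookieProcessor stores Set-Cookie values and replays them across the redirect chain.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
response = opener.open('http://www.chilis.com')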

Related

How to catch redirects with curl or requests

I am new to web scraping and have stumbled upon an unexpected challenge. The goal is to input an incomplete URL string for a website and "catch" the corrected URL output returned by the website's redirect function. The specific website that I am referring to is Marine Traffic.
When searching for a specific vessel profile, a proper query string should contain the parameters shipid, mmsi and imo. For example, this link will return a webpage with the profile for a specific vessel:
https://www.marinetraffic.com/en/ais/details/ships/shipid:368574/mmsi:308248000/imo:9337987/vessel:AL_GHARIYA/_:97e0de64144a0d7abfc154ea3bd1010e
As it turns out, a query string with only the imo parameter will redirect to the exact same url. So, for example, the following query will redirect to the same one as above:
https://www.marinetraffic.com/en/ais/details/ships/imo:9337987
My question is: using cURL in bash, or another tool such as the Python requests library, how could one catch the redirect URL in an automated way? Curling the first URL returns the full HTML, while curling the second URL throws an Access Denied error. Why is this allowed in the browser? What is the workaround for this, if any, and what are some best practices for catching redirect URLs (using either Python or bash)?
curl https://www.marinetraffic.com/en/ais/details/ships/imo:9337987
#returns Access Denied
Note: Adding a user agent (curl --user-agent 'Chrome/79') does not get around the issue: the error is avoided, but nothing is returned.
You can try .url on the response object:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}

url = "https://www.marinetraffic.com/en/ais/details/ships/imo:9337987"
r = requests.get(url, headers=headers)
print(r.url)
Prints:
https://www.marinetraffic.com/en/ais/details/ships/shipid:368574/mmsi:308248000/imo:9337987/vessel:AL_GHARIYA
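If you also want the intermediate hops rather than just the final URL, requests records the redirect chain in r.history (a list of the intermediate Response objects), so continuing from the same r:
# Each entry in r.history is one redirect response; r.url is the final destination.
for hop in r.history:
    print(hop.status_code, hop.url)
print(r.status_code, r.url)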

Python Unshort URL without response.history

I'm using this code to unshorten all URLs. It works correctly, but I can't get it to work on this particular one: "https://www.shareasale-analytics.com/u.cfm?d=654202&m=52031&u=1363577&shrsl_analytics_sscid=41k4%5F9si0z&shrsl_analytics_sstid=41k4%5F9si0z" (the URL contains an affiliate link).
response = requests.get(url, timeout=15)
if response.history:
    url_new = response.url
It simply does not find the final URL. The result should be https://www.gearbest.com/other-novelty-lights/pp_009234504925.html
The redirect for this particular URL is being done by JavaScript. That works out well in browsers, but the Python requests module cannot follow JS redirects.
I used Postman to actually find this out in the first place. These are the steps I performed -
Used a browser (Firefox) to verify that the redirect does work. -> Worked.
Used Postman to see the actual response. This is what Postman received -
<head></head>
<body>
<script LANGUAGE="JavaScript1.2">
window.location.replace('https:\/\/www.gearbest.com\/other-novelty-lights\/pp_009234504925.html?wid=1433363&sscid=41k4_9si0z&utm_source=shareasale&utm_medium=shareasale&utm_campaign=shareasale&sascid=41k4_9si0z&userID=1363577')
</script>
</body>
</html>
I trimmed the whitespaces from the response.
So it is clear that JS is redirecting this further.
To make this work, you will need to perform two steps:
1. Update the user agent in the headers so that the response contains the HTML with the information about the JS redirect.
2. Follow the redirect yourself.
Hope this helps!
The issue you're seeing is that the redirect is performed via JavaScript rather than a regular HTTP redirect; also, in order to receive the JS code you need to change your user agent:
import re
import requests

url = "https://www.shareasale-analytics.com/u.cfm?d=654202&m=52031&u=1363577&shrsl_analytics_sscid=41k4%5F9si0z&shrsl_analytics_sstid=41k4%5F9si0z"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"}

response = requests.get(url, headers=headers)
if response.history:
    url_new = response.url
else:
    # No HTTP redirect happened, so pull the target out of the JS redirect.
    matches = re.findall(r"window.location.replace\('(.*)'\)", response.content.decode(), re.DOTALL)
    if matches:
        match = matches[0]
        url_new = match.strip().replace("\\", "")
After that just retrieve the new URL using a simple regex.

python http client module error / inconsistent

I'm getting the following output:
301 Moved Permanently --- when using http.client
200 --- when using requests
The URL being handled is "http://i.imgur.com/fyxDric.jpg", passed as an arg on the command line.
What I expect is a 200 status OK response.
This is the body
if scheme == 'http':
    print('Running in the http')
    conn = http.client.HTTPConnection("www.i.imgur.com")
    conn.request("GET", urlparse(url).path)
    conn_resp = conn.getresponse()
    body = conn_resp.read()
    print(conn_resp.status, conn_resp.reason, body)
When using requests:
headers = {'User-Agent': 'Mozilla/5.0 Chrome/54.0.2840.71 Safari/537.36'}
response = requests.get(url, headers=headers, allow_redirects=False)
print(response.status_code)
You are trying to hit imgur over http, but imgur redirects all requests to be served over https.
This redirect is what is causing the issue.
The http module doesn't handle redirects for you; you need to handle them yourself, whereas the requests module follows these redirects by itself.
The documentation on the http module includes in its first sentence "It is normally not used directly." Unlike requests, it doesn't act on the 301 response and follow the redirection in the headers. It instead returns the 301, which you would have to process yourself.
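For illustration, here is a minimal sketch of processing the redirect yourself with http.client (the function name is mine, and it assumes the Location header carries an absolute URL):
import http.client
from urllib.parse import urlparse

def get_following_redirects(url, max_hops=5):
    # http.client hands back 3xx responses as-is, so we chase Location headers ourselves.
    for _ in range(max_hops):
        parts = urlparse(url)
        if parts.scheme == 'https':
            conn = http.client.HTTPSConnection(parts.netloc)
        else:
            conn = http.client.HTTPConnection(parts.netloc)
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        if resp.status in (301, 302, 303, 307, 308):
            url = resp.getheader("Location")  # assumed to be an absolute URL
            conn.close()
            continue
        return resp.status, resp.read()
    raise RuntimeError("too many redirects")

status, body = get_following_redirects("http://i.imgur.com/fyxDric.jpg")
print(status)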

python-requests gives me a different response from what I see in the browser. Why?

I want to get data from this site.
When I get data from the main url, I get an HTML file that contains the structure but not the values.
import requests
from bs4 import BeautifulSoup

url = 'http://option.ime.co.ir/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I found out that the site gets its values from
url1 = 'http://option.ime.co.ir/GetTime'
url2 = 'http://option.ime.co.ir/GetMarketData'
When I view the responses from those URLs in the browser, I see a JSON-format response and a time in a specific format,
but when I use requests to get the data, it gives me the same HTML that I get from the main url.
Do you know what the reason is? How should I get the responses that I see in the browser?
I checked the headers for all the URLs and I didn't find anything special that I would need to send with my request.
You have to provide the proper HTTP headers in the request. In my case, I was able to make it work using the following headers. Note that in my testing the HTTP response was a 200 OK rather than a redirect to the root website (as when no HTTP headers were provided in the request).
Raw HTTP Request:
GET http://option.ime.co.ir/GetTime HTTP/1.1
Host: option.ime.co.ir
Referer: "http://option.ime.co.ir/"
Accept: "application/json, text/plain, */*"
User-Agent: "Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0"
This should give you the proper JSON response you need.
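For reference, a requests version of that raw request might look like this (header values copied verbatim from above):
import requests

headers = {
    "Referer": "http://option.ime.co.ir/",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0",
}
r = requests.get("http://option.ime.co.ir/GetTime", headers=headers)
print(r.json())  # should be the JSON the browser sees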
Your first connection using the browser gets a 302 redirect response (to the same URL).
Then it runs some JS, so the second request doesn't redirect anymore and gets the expected JSON.
It is a common technique so that other people don't use their API without permission.
Set the "Preserve log" checkbox in the dev tools so you can see it for yourself.

How to programmatically retrieve access_token from client-side OAuth flow using Python?

This question was posted on StackApps, but the issue may be more a programming issue than an authentication issue, hence it may deserve a better place here.
I am working on a desktop inbox notifier for StackOverflow, using the API with Python.
The script I am working on first logs the user in on StackExchange, and then requests authorisation for the application. Assuming the application has been authorised through web-browser interaction of the user, the application should be able to make requests to the API with authentication, hence it needs the access token specific to the user. This is done with the URL: https://stackexchange.com/oauth/dialog?client_id=54&scope=read_inbox&redirect_uri=https://stackexchange.com/oauth/login_success.
When requesting authorisation via the web-browser the redirect is taking place and an access code is returned after a #. However, when requesting this same URL with Python (urllib2), no hash or key is returned in the response.
Why is my urllib2 request handled differently from the same request made in Firefox or w3m? What should I do to programmatically simulate this request and retrieve the access_token?
Here is my script (it's experimental) and remember: it assumes the user has already authorised the application.
#!/usr/bin/env python

import urllib
import urllib2
import cookielib

from BeautifulSoup import BeautifulSoup
from getpass import getpass

# Define URLs
parameters = ['client_id=54',
              'scope=read_inbox',
              'redirect_uri=https://stackexchange.com/oauth/login_success']

oauth_url = 'https://stackexchange.com/oauth/dialog?' + '&'.join(parameters)
login_url = 'https://openid.stackexchange.com/account/login'
submit_url = 'https://openid.stackexchange.com/account/login/submit'
authentication_url = 'http://stackexchange.com/users/authenticate?openid_identifier='

# Set counter for requests:
counter = 0

# Build opener
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

def authenticate(username='', password=''):
    '''
    Authenticates to StackExchange using user-provided username and password
    '''
    # Build up headers
    user_agent = 'Mozilla/5.0 (Ubuntu; X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0'
    headers = {'User-Agent': user_agent}

    # Set data to None
    data = None

    # 1. Build up URL request with headers and data
    request = urllib2.Request(login_url, data, headers)
    response = opener.open(request)

    # Build up POST data for authentication
    html = response.read()
    fkey = BeautifulSoup(html).findAll(attrs={'name': 'fkey'})[0].get('value').encode()
    values = {'email': username,
              'password': password,
              'fkey': fkey}
    data = urllib.urlencode(values)

    # 2. Build up URL for authentication
    request = urllib2.Request(submit_url, data, headers)
    response = opener.open(request)

    # Check if logged in
    if response.url == 'https://openid.stackexchange.com/user':
        print ' Logged in! :) '
    else:
        print ' Login failed! :( '

    # Find user ID URL
    html = response.read()
    id_url = BeautifulSoup(html).findAll('code')[0].text.split('"')[-2].encode()

    # 3. Build up URL for OpenID authentication
    data = None
    url = authentication_url + urllib.quote_plus(id_url)
    request = urllib2.Request(url, data, headers)
    response = opener.open(request)

    # 4. Build up URL request with headers and data
    request = urllib2.Request(oauth_url, data, headers)
    response = opener.open(request)

    if '#' in response.url:
        print 'Access code provided in URL.'
    else:
        print 'No access code provided in URL.'

if __name__ == '__main__':
    username = raw_input('Enter your username: ')
    password = getpass('Enter your password: ')
    authenticate(username, password)
To respond to comments below:
Tamper Data in Firefox shows the above URL (oauth_url in the code) being requested with the following headers:
Host=stackexchange.com
User-Agent=Mozilla/5.0 (Ubuntu; X11; Linux i686; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Accept=text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language=en-us,en;q=0.5
Accept-Encoding=gzip, deflate
Accept-Charset=ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection=keep-alive
Cookie=m=2; __qca=P0-556807911-1326066608353; __utma=27693923.1085914018.1326066609.1326066609.1326066609.1; __utmb=27693923.3.10.1326066609; __utmc=27693923; __utmz=27693923.1326066609.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); gauthed=1; ASP.NET_SessionId=nt25smfr2x1nwhr1ecmd4ok0; se-usr=t=z0FHKC6Am06B&s=pblSq0x3B0lC
In the urllib2 request, the headers provide only the user-agent value. The cookie is not passed explicitly, but the se-usr value is available in the cookie jar at the time of the request.
The response headers show, first, the redirect:
Status=Found - 302
Server=nginx/0.7.65
Date=Sun, 08 Jan 2012 23:51:12 GMT
Content-Type=text/html; charset=utf-8
Connection=keep-alive
Cache-Control=private
Location=https://stackexchange.com/oauth/login_success#access_token=OYn42gZ6r3WoEX677A3BoA))&expires=86400
Set-Cookie=se-usr=t=kkdavslJe0iq&s=pblSq0x3B0lC; expires=Sun, 08-Jul-2012 23:51:12 GMT; path=/; HttpOnly
Content-Length=218
Then the redirect takes place through another request carrying the fresh se-usr value from that header.
I don't know how to catch the 302 in urllib2; it handles it by itself (which is great). It would be nice, however, to see if the access token provided in the Location header is available.
There's nothing special in the last response headers; both Firefox and urllib2 return something like:
Server: nginx/0.7.65
Date: Sun, 08 Jan 2012 23:48:16 GMT
Content-Type: text/html; charset=utf-8
Connection: close
Cache-Control: private
Content-Length: 5664
I hope I didn't provide confidential info, let me know if I did :D
The token does not appear because of the way urllib2 handles the redirect. I am not familiar with the details so I won't elaborate here.
The solution is to catch the 302 before the urllib2 handles the redirect. This can be done by sub-classing the urllib2.HTTPRedirectHandler to get the redirect with its hashtag and token. Here is a short example of subclassing the handler:
class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        print "Going through 302:\n"
        print headers
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
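A sketch of wiring the handler into an opener (jar and oauth_url are the ones defined in the question's script); passing a subclass to build_opener makes it replace the default redirect handler:
opener = urllib2.build_opener(MyHTTPRedirectHandler, urllib2.HTTPCookieProcessor(jar))
response = opener.open(oauth_url)  # the 302 headers are printed on the way through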
In those headers, the Location attribute provides the redirect URL in full, i.e. including the hashtag and token:
Output extract:
...
Going through 302:
Server: nginx/0.7.65
Date: Mon, 09 Jan 2012 20:20:11 GMT
Content-Type: text/html; charset=utf-8
Connection: close
Cache-Control: private
Location: https://stackexchange.com/oauth/login_success#access_token=K4zKd*HkKw5Opx(a8t12FA))&expires=86400
Content-Length: 218
...
More on catching redirects with urllib2 on StackOverflow (of course).
