HEAD request in Python not working as desired

I am trying to check the status code of any URL in Python using the following code:
import urllib2

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"
When I use it like this:
response = urllib2.urlopen(HeadRequest("http://www.nativeseeds.org/"))
it throws the following exception:
HTTPError: HTTP Error 503: Service Temporarily Unavailable
However, when I open the same URL "http://www.nativeseeds.org/" in Firefox or RESTClient, it returns a 200 status code.
Any help will be highly appreciated.

After some investigating, it turns out the website requires both the Accept and User-Agent request headers to be set; otherwise it returns a 503. This is terribly broken. It also appears to be doing user-agent sniffing; I get a 403 when using curl:
$ curl --head http://www.nativeseeds.org/
HTTP/1.1 403 Forbidden
Date: Wed, 26 Sep 2012 14:54:59 GMT
Server: Apache
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Set-Cookie: f65129b0cd2c5e10c387f919ac90ad66=PjZxNjvNmn6IlVh4Ac-tH0; path=/
Vary: Accept-Encoding
Content-Type: text/html
but it works fine if I set a browser-style user-agent:
$ curl --user-agent "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" --head http://www.nativeseeds.org/
HTTP/1.1 200 OK
Date: Wed, 26 Sep 2012 14:55:57 GMT
Server: Apache
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Expires: Mon, 1 Jan 2001 00:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: f65129b0cd2c5e10c387f919ac90ad66=ykOpGnEE%2CQOMUaVJLnM7W0; path=/
Last-Modified: Wed, 26 Sep 2012 14:56:27 GMT
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
It appears to work using the requests module:
>>> import requests
>>> r = requests.head('http://www.nativeseeds.org/')
>>> import pprint; pprint.pprint(r.headers)
{'cache-control': 'post-check=0, pre-check=0',
'content-encoding': 'gzip',
'content-length': '20',
'content-type': 'text/html; charset=utf-8',
'date': 'Wed, 26 Sep 2012 14:58:05 GMT',
'expires': 'Mon, 1 Jan 2001 00:00:00 GMT',
'last-modified': 'Wed, 26 Sep 2012 14:58:09 GMT',
'p3p': 'CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"',
'pragma': 'no-cache',
'server': 'Apache',
'set-cookie': 'f65129b0cd2c5e10c387f919ac90ad66=2NtRrDBra9jPsszChZXDm2; path=/',
'vary': 'Accept-Encoding'}
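Since you are after the status code, it is available directly on the response object (200 here, matching the curl output above):
>>> r.status_code
200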

The problem you see has nothing to do with Python. The website itself seems to require something more than just a HEAD request. Even a simple telnet session results in the error:
$ telnet www.nativeseeds.org 80
Trying 208.113.230.85...
Connected to www.nativeseeds.org (208.113.230.85).
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.nativeseeds.org
HTTP/1.1 503 Service Temporarily Unavailable
Date: Wed, 26 Sep 2012 14:29:33 GMT
Server: Apache
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=iso-8859-1
Try adding some more headers; the HTTPie command-line client (http) does get a 200 response:
$ http -v head http://www.nativeseeds.org
HEAD / HTTP/1.1
Host: www.nativeseeds.org
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Accept-Encoding: identity, deflate, compress, gzip
Accept: */*
User-Agent: HTTPie/0.2.2
HTTP/1.1 200 OK
Date: Wed, 26 Sep 2012 14:33:21 GMT
Server: Apache
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Expires: Mon, 1 Jan 2001 00:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: f65129b0cd2c5e10c387f919ac90ad66=34hOijDSzeskKYtULx9V83; path=/
Last-Modified: Wed, 26 Sep 2012 14:33:23 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 20
Content-Type: text/html; charset=utf-8
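In urllib2 terms that means passing extra headers to your HeadRequest; a sketch along those lines (the header values below are just examples, the point is that Accept and User-Agent are present at all):
import urllib2

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

# Example headers; any reasonable browser-like values appear to satisfy the server.
headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/15.0',
}

response = urllib2.urlopen(HeadRequest('http://www.nativeseeds.org/', headers=headers))
print response.getcode()  # expect 200 if the server accepts these headers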

Reading the urllib2 docs, the default Request.get_method only returns 'GET' or 'POST'.
You may be interested in this.

Related

Python - How to save output from loop to multiple callable variables

I have the following Python code, where new is a string of joined XML data produced from two website requests/responses:
import xml.etree.ElementTree as ET

items = ET.fromstring(new)
for item in list(items):
    url = item.find("url")
    endpoint = url.text

    resp = item.find("response")
    response = resp.text
    responses = response.split("\n")
    index = responses.index('')
    indexed = responses[:index]
    print(endpoint, *indexed, sep="\n")
which prints:
https://www.youtube.com/sw.js_data
HTTP/2 200 OK
Content-Type: application/json; charset=utf-8
X-Content-Type-Options: nosniff
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Mon, 14 Mar 2022 17:59:34 GMT
Content-Disposition: attachment; filename="response.bin"; filename*=UTF-8''response.bin
Strict-Transport-Security: max-age=31536000
X-Frame-Options: SAMEORIGIN
Cross-Origin-Opener-Policy-Report-Only: same-origin; report-to="ATmXEA_XZXH6CdbrmjUzyTbVgxu22C8KYH7NsxKbRt94"
Permissions-Policy: ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-platform=*, ch-ua-platform-version=*
Accept-Ch: Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Full-Version, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Model, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version
Server: ESF
X-Xss-Protection: 0
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
https://www.google.com/client_204?&atyp=i&biw=1440&bih=849&dpr=1.5&ei=Z4IvYpTtF5LU9AP1nIOICQ
HTTP/2 204 No Content
Content-Type: text/html; charset=UTF-8
Strict-Transport-Security: max-age=31536000
Content-Security-Policy: object-src 'none';base-uri 'self';script-src 'nonce-9KQUw4dRjvKnx/zTrOblTQ==' 'strict-dynamic' 'report-sample' 'unsafe-eval' 'unsafe-inline' https: http:;report-uri https://csp.withgoogle.com/csp/gws/cdt1
Bfcache-Opt-In: unload
Date: Mon, 14 Mar 2022 17:59:10 GMT
Server: gws
Content-Length: 0
X-Xss-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2022-03-14-17; expires=Wed, 13-Apr-2022 17:59:10 GMT; path=/; domain=.google.com; Secure; SameSite=none
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
Basically, I would like to individually evaluate the data produced by the code above so I can check that certain header values are present in each response. In this example, the code would first check the set of headers from the first website (YouTube) and report that all headers look good, then check the set from the second website (Google) and report, for example, that the Strict-Transport-Security header is missing. The goal is for the code to validate these responses no matter how many are loaded into the initial string and tell me if any headers are missing.
Is there an easy way to do this? I imagine that at some point each output (the list of headers) from each website would be saved to a variable that can be referenced or called later. Maybe this is getting messy and will not be easy to do - not sure! I'm also happy to take any advice on making this code a bit cleaner if there is a more efficient way to do what I am trying to do.
Thank you!
Full XML string below:
<?xml version='1.0' encoding='utf8'?>
<items burpVersion="2022.2.3" exportTime="Mon Mar 14 14:28:18 EDT 2022">
<item>
<time>Mon Mar 14 13:59:37 EDT 2022</time>
<url>https://www.youtube.com/sw.js_data</url>
<host ip="142.250.190.142">www.youtube.com</host>
<port>443</port>
<protocol>https</protocol>
<method>GET</method>
<path>/sw.js_data</path>
<extension>null</extension>
<request base64="false">GET /sw.js_data HTTP/2
Host: www.youtube.com
Accept: */*
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: cors
Sec-Fetch-Dest: empty
Referer: https://www.youtube.com/sw.js
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
</request>
<status>200</status>
<responselength>3524</responselength>
<mimetype>JSON</mimetype>
<response base64="false">HTTP/2 200 OK
Content-Type: application/json; charset=utf-8
X-Content-Type-Options: nosniff
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Mon, 14 Mar 2022 17:59:34 GMT
Content-Disposition: attachment; filename="response.bin"; filename*=UTF-8''response.bin
Strict-Transport-Security: max-age=31536000
X-Frame-Options: SAMEORIGIN
Cross-Origin-Opener-Policy-Report-Only: same-origin; report-to="ATmXEA_XZXH6CdbrmjUzyTbVgxu22C8KYH7NsxKbRt94"
Permissions-Policy: ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-platform=*, ch-ua-platform-version=*
Accept-Ch: Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Full-Version, Sec-CH-UA-Full-Version-List, Sec-CH-UA-Model, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version
Server: ESF
X-Xss-Protection: 0
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
)]}'
[["yt.sw.adr",null,[[["en","US","US","75.188.116.252",null,null,1,null,[],null,null,"","",null,null,"","QUFFLUhqbnREclEzblJmc25GVF9XSXQ1dFZQSm9sRGlmQXxBQ3Jtc0tuU3huS1RoOHQyaFlqN0dLdm4wcGMweXp0OURWQU5RbEJKRko1TlhGYjBoZ3N1Nnpla3QxUFRkN19uaWxoQVZTV0FRUGh0cUw2ckRWbmh5bGhxYkRjNFc2cUREbjB4MnFxMEpval9HUXNZeWU5d1Ztaw\u003d\u003d","CgtaVS1FWnl4ZTJEZyiGhb6RBg%3D%3D"],"Vf114d778||"]]</response>
<comment />
</item>
<item>
<time>Mon Mar 14 13:59:14 EDT 2022</time>
<url>https://www.google.com/client_204?&atyp=i&biw=1440&bih=849&dpr=1.5&ei=Z4IvYpTtF5LU9AP1nIOICQ</url>
<host ip="172.217.4.36">www.google.com</host>
<port>443</port>
<protocol>https</protocol>
<method>GET</method>
<path>/client_204?&atyp=i&biw=1440&bih=849&dpr=1.5&ei=Z4IvYpTtF5LU9AP1nIOICQ</path>
<extension>null</extension>
<request base64="false">GET /client_204?&atyp=i&biw=1440&bih=849&dpr=1.5&ei=Z4IvYpTtF5LU9AP1nIOICQ HTTP/2
Host: www.google.com
Sec-Ch-Ua: "(Not(A:Brand";v="8", "Chromium";v="99"
Sec-Ch-Ua-Mobile: ?0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36
Sec-Ch-Ua-Arch: "x86"
Sec-Ch-Ua-Full-Version: "99.0.4844.51"
Sec-Ch-Ua-Platform-Version: "10.0.0"
Sec-Ch-Ua-Bitness: "64"
Sec-Ch-Ua-Model:
Sec-Ch-Ua-Platform: "Windows"
Accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
X-Client-Data: CJDnygE=
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: no-cors
Sec-Fetch-Dest: image
Referer: https://www.google.com/
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
</request>
<status>204</status>
<responselength>781</responselength>
<mimetype />
<response base64="false">HTTP/2 204 No Content
Content-Type: text/html; charset=UTF-8
Strict-Transport-Security: max-age=31536000
Content-Security-Policy: object-src 'none';base-uri 'self';script-src 'nonce-9KQUw4dRjvKnx/zTrOblTQ==' 'strict-dynamic' 'report-sample' 'unsafe-eval' 'unsafe-inline' https: http:;report-uri https://csp.withgoogle.com/csp/gws/cdt1
Bfcache-Opt-In: unload
Date: Mon, 14 Mar 2022 17:59:10 GMT
Server: gws
Content-Length: 0
X-Xss-Protection: 0
X-Frame-Options: SAMEORIGIN
Set-Cookie: 1P_JAR=2022-03-14-17; expires=Wed, 13-Apr-2022 17:59:10 GMT; path=/; domain=.google.com; Secure; SameSite=none
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
</response>
<comment />
</item>
</items>
Update: I have continued messing with the code for the past couple of days, still with no luck. Any and all thoughts welcome!
Simply save the output to a single dictionary variable keyed by URL. Because your text split requires multiple steps, consider wrapping it in a defined method.
# DEFINED METHOD TO SPLIT RESPONSE BY LINE BREAKS
def split_text(resp):
    responses = resp.split("\n")
    index = responses.index('')
    indexed = responses[:index]
    return indexed

# PARSE XML FILE
doc = ET.fromstring(new)

# RETRIEVE ITEM NODES WITH DICTIONARY COMPREHENSION
website_items = {
    item.find("url").text: split_text(item.find("response").text)
    for item in doc.findall(".//item")
}
# REVIEW SAVED DATA WITH URLS AS KEYS
website_items["https://www.youtube.com/sw.js_data"]
website_items["https://www.google.com/client_204?&atyp=i&biw=1440&bih=849&dpr=1.5&ei=Z4IvYpTtF5LU9AP1nIOICQ"]

Open URL using python requests only then proceed to download file

I'm trying to download a file from a website using Python's requests module.
However, the site will only let me download the file if the download link is clicked directly from the download page.
So using requests, I tried hitting the download page's URL first with requests.get() and then downloading the file. Unfortunately this doesn't work: a message asking me to open the download page first simply gets written into file.torrent.
import requests

def download(username, password):
    with requests.Session() as session:
        session.post('https://website.net/forum/login.php', data={'login_username': username, 'login_password': password})
        # Download page URL
        requests.get('https://website.net/forum/viewtopic.php?t=2508126')
        # The download URL itself
        response = requests.get('https://website.net/forum/dl.php?t=2508126')
        with open('file.torrent', 'wb') as f:
            f.write(response.content)

download(username='XXXXX', password='YYYYY')
Response when downloading directly from the download page (works):
General :
Request URL: https://website.net/forum/dl.php?t=2508126
Request Method: GET
Status Code: 200 OK
Remote Address: 185.37.128.136:443
Referrer Policy: no-referrer-when-downgrade
Response Headers :
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Content-Disposition: attachment; filename="[website.net].t2508126.torrent"
Content-Length: 33641
Content-Type: application/x-bittorrent; name="[website.net].t2508126.torrent"
Date: Thu, 14 Feb 2019 07:57:08 GMT
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Last-Modified: Thu, 14 Feb 2019 07:57:09 GMT
Pragma: no-cache
Server: nginx
Set-Cookie: bb_dl=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; path=/forum/; domain=.website.net
Request Headers :
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Cookie: bb_t=a%3A3%3A%7Bi%3A2507902%3Bi%3A1550052944%3Bi%3A2508011%3Bi%3A1550120230%3Bi%3A2508126%3Bi%3A1550125516%3B%7D; bb_data=1-27969311-wXVPJGcedLE1I2mM9H0u-3106784170-1550128652-1550131012-3061288864-1; bb_dl=2508126
Host: website.net
Referer: https://website.net/forum/viewtopic.php?t=2508126
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3701.0 Safari/537.36
Query String Parameters :
t: 2508126
Response when opening the download link on its own (doesn't work):
General :
Request URL: https://website.net/forum/dl.php?t=2508126
Request Method: GET
Status Code: 200 OK
Remote Address: 185.37.128.136:443
Referrer Policy: no-referrer-when-downgrade
Response Headers :
Cache-Control: no-store, no-cache, must-revalidate
Cache-Control: post-check=0, pre-check=0
Content-Type: text/html; charset=windows-1251
Date: Thu, 14 Feb 2019 08:03:29 GMT
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Last-Modified: Thu, 14 Feb 2019 08:03:29 GMT
Pragma: no-cache
Server: nginx
Transfer-Encoding: chunked
Request Headers :
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Cookie: bb_t=a%3A3%3A%7Bi%3A2507902%3Bi%3A1550052944%3Bi%3A2508011%3Bi%3A1550120230%3Bi%3A2508126%3Bi%3A1550125516%3B%7D; bb_data=1-27969311-wXVPJGcedLE1I2mM9H0u-3106784170-1550128652-1550131390-3061288864-1
Host: website.net
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3701.0 Safari/537.36
Query String Parameters :
t: 2508126
This works for me: add the login field to the POST data,
data={'login_username': username, 'login_password': password, 'login': ''}
and use session.get() instead of requests.get(), so the subsequent requests reuse the logged-in session's cookies.
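Put together, a sketch of the corrected function (the extra 'login' field and the switch to session.get() are the two changes; the URLs and field names are the placeholders from the question):
import requests

def download(username, password):
    with requests.Session() as session:
        session.post('https://website.net/forum/login.php',
                     data={'login_username': username,
                           'login_password': password,
                           'login': ''})
        # Visit the download page first so the session collects the cookies
        # it sets (the working capture above shows an extra bb_dl cookie).
        session.get('https://website.net/forum/viewtopic.php?t=2508126')
        # Now download through the same logged-in session
        response = session.get('https://website.net/forum/dl.php?t=2508126')
        with open('file.torrent', 'wb') as f:
            f.write(response.content)

download(username='XXXXX', password='YYYYY')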

Python Scraping, Web page does not exist but the website redirects to another page

I am trying to find a way to know whether a web page exists or not. There are plenty of methods, such as httplib2, urlparse, or using requests, but in my case the website redirects me to the home page if the web page does not exist,
e.g.
https://www.thenews.com.pk/latest/category/sports/2015-09-21
Is there any way to catch that?
The URL you mention returns a redirect status code (307), which you can catch. See here:
$ curl -i https://www.thenews.com.pk/latest/category/sports/2015-09-21
HTTP/1.1 307 Temporary Redirect
Date: Sun, 26 Mar 2017 10:13:39 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=ddcd246615efb68a7c72c73f480ea81971490523219; expires=Mon, 26-Mar-18 10:13:39 GMT; path=/; domain=.thenews.com.pk; HttpOnly
Set-Cookie: bf_session=b02fb5b6cc732dc6c3b60332288d0f1d4f9f7360; expires=Sun, 26-Mar-2017 11:13:39 GMT; Max-Age=3600; path=/; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Location: https://www.thenews.com.pk/
X-Cacheable: YES
X-Varnish: 654909723
Age: 0
Via: 1.1 varnish
X-Age: 0
X-Cache: MISS
Access-Control-Allow-Origin: *
Server: cloudflare-nginx
CF-RAY: 345956a8be8a7289-AMS
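In requests you can surface that 307 yourself by turning off automatic redirect handling; a minimal sketch:
import requests

url = "https://www.thenews.com.pk/latest/category/sports/2015-09-21"
response = requests.get(url, allow_redirects=False)
print(response.status_code)               # 307 for the missing page above
print(response.headers.get("Location"))   # where it would have redirected you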
You can check whether the final URL is the one you were redirected to, as well as whether there is any redirect history.
>>> import requests
>>> target_url = "https://www.thenews.com.pk/latest/category/sports/2015-09-21"
>>> response = requests.get(target_url)
>>> response.history[0].url
u'https://www.thenews.com.pk/latest/category/sports/2015-09-21'
>>> response.url
u'https://www.thenews.com.pk/'
>>> response.history and response.url == 'https://www.thenews.com.pk/' != target_url
True
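Wrapped up as a small helper, under the assumption that being redirected to the home page means the page does not exist:
import requests

def page_exists(url, home="https://www.thenews.com.pk/"):
    # A redirect that ends on the home page is treated as "page not found"
    response = requests.get(url)
    return not (response.history and response.url == home)

print(page_exists("https://www.thenews.com.pk/latest/category/sports/2015-09-21"))  # False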

Python/Feedparser: reading RSS feed fails

I'm using feedparser to fetch RSS feed data. For most RSS feeds that works perfectly fine. However, I have now stumbled upon a website where fetching the feed fails (example feed). The returned result does not contain the expected keys, and the values are HTML fragments.
I tried simply reading the feed URL with urllib2.Request(url). This fails with an HTTP Error 405: Not Allowed. If I add a custom header like
headers = {
    'Content-type': 'text/xml',
    'User-Agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
}
request = urllib2.Request(url, headers=headers)
I don't get the 405 error anymore, but the returned content is an HTML document with some HEAD tags and an essentially empty BODY. In the browser everything looks fine, and the same goes for "View Page Source". feedparser.parse also allows setting agent and request_headers, and I tried various agents. I'm still not able to correctly read the XML, let alone get a parsed feed from feedparser.
What am I missing here?
So, this feed yields a 405 error when the client making the request does not use a User-Agent. Try this:
$ curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -o /dev/null -D- -s
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 21 May 2015 15:48:44 GMT
Content-Type: application/xml; charset=utf-8
Content-Length: 24616
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
While without the UA, you get:
$ curl 'http://www.propertyguru.com.sg/rss' -o /dev/null -D- -s
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Thu, 21 May 2015 15:49:20 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
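So passing a non-empty agent to feedparser should be enough, assuming the missing User-Agent really is the only thing the server objects to; a sketch (the UA string is arbitrary):
import feedparser

feed = feedparser.parse(
    'http://www.propertyguru.com.sg/rss',
    agent='Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
)
print(feed.status)         # expect 200 rather than 405
print(len(feed.entries))   # number of parsed items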

Mechanize for Python and authorization without first getting a 401 error

I'm trying to use Mechanize to automate interactions with a very picky legacy system. In particular, the authorization must be sent with every request after the first login page, otherwise it knocks you out of the system. Unfortunately, Mechanize seems content to send the authorization only after first getting a 401 Unauthorized error. Is there any way to have it send the authorization every time?
Here's some sample code:
br.add_password("http://example.com/securepage", "USERNAME", "PASSWORD", "/MYREALM")
br.follow_link(link_to_secure_page) # where the url is the previous URL
Here's the response I get from debugging Mechanize:
send: 'GET /securepage HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nReferer: http://example.com/home\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 401 Unauthorized\r\n'
header: Server: Tandy1000Web
header: Date: Thu, 08 Dec 2011 03:08:04 GMT
header: Connection: close
header: Expires: Tue, 01 Jan 1980 06:00:00 GMT
header: Content-Type: text/html; charset=US-ASCII
header: Content-Length: 210
header: WWW-Authenticate: Basic realm="/MYREALM"
header: Cache-control: no-cache
send: 'GET /securepage HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nReferer: http://example.com/home\r\nConnection: close\r\nAuthorization: Basic VVNFUk5BTUU6UEFTU1dPUkQ=\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Tandy1000Web
header: Date: Thu, 08 Dec 2011 03:08:07 GMT
header: Connection: close
header: Last-Modified: Thu, 08 Dec 2011 03:08:06 GMT
header: Expires: Tue, 01 Jan 1980 06:00:00 GMT
header: Content-Type: text/html; charset=UTF-8
header: Content-Length: 33333
header: Cache-control: no-cache
The problem is that, contrary to what should happen in a modern web application with a GET request, hitting the 401 error first gives me the wrong page. I've confirmed with curl and urllib2 that if I hit the URL directly, passing the auth header on the first request, I get the correct page.
Any hints on how to tell Mechanize to always send the auth headers and avoid the first 401 error? This needs to be fixed on the client side; I can't modify the server.
from base64 import b64encode
import mechanize

url = 'http://192.168.3.5/table.js'
username = 'admin'
password = 'password'

# I have had to add a carriage return ('%s:%s\n'), but
# you may not have to.
b64login = b64encode('%s:%s' % (username, password))

br = mechanize.Browser()
# I needed to change to Mozilla for mine, but most do not:
# br.addheaders = [('User-agent', 'Mozilla/5.0')]
br.addheaders.append(
    ('Authorization', 'Basic %s' % b64login)
)

br.open(url)
r = br.response()
data = r.read()
print data
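Note that anything appended to br.addheaders is sent with every request the Browser makes, which is exactly what the legacy system needs here; just keep in mind that the same Authorization header will also be sent to any other host you open with that Browser instance.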
