I'm having trouble getting my bot to log in to a MediaWiki install on the intranet. I believe this is due to the HTTP authentication protecting the wiki.
Facts:
The wiki root is: https://local.example.com/mywiki/
When visiting the wiki with a web browser, a popup comes up asking for enterprise credentials (I assume this is basic access authentication)
This is what I have in my user-config.py:
mylang = 'en'
family = 'mywiki'
usernames['mywiki']['en'] = u'Bot'
authenticate['local.example.com'] = ('user', 'pass')
This is what I have in mywiki_family.py:
# -*- coding: utf-8 -*-
import family, config

# The Wikimedia family that is known as mywiki
class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'mywiki'
        self.langs = {'en': 'local.example.com'}

    def scriptpath(self, code):
        return '/mywiki'

    def version(self, code):
        return '1.13.5'

    def isPublic(self):
        return False

    def hostname(self, code):
        return 'local.example.com'

    def protocol(self, code):
        return 'https'

    def path(self, code):
        return '/mywiki/index.php'
When I execute login.py -v -v, I get this:
urllib2.urlopen(urllib2.Request('https://local.example.com/w/index.php?title=Special:Userlogin&useskin=monobook&action=submit', wpSkipCookieCheck=1&wpPassword=XXXX&wpDomain=&wpRemember=1&wpLoginattempt=Aanmelden%20%26%20Inschrijven&wpName=Bot, {'Content-type': 'application/x-www-form-urlencoded', 'User-agent': 'PythonWikipediaBot/1.0'})):
(Redundant traceback info here)
urllib2.HTTPError: HTTP Error 401: Unauthorized
(I'm not sure why it has 'local.example.com/w' instead of '/mywiki'.)
I thought it might be trying to authenticate to local.example.com instead of local.example.com/mywiki, so I changed the authenticate line to:
authenticate['local.example.com/mywiki'] = ('user', 'pass')
But then I get an HTTP 401.2 error back from IIS:
You do not have permission to view this directory or page using the credentials that you supplied because your Web browser is sending a WWW-Authenticate header field that the Web server is not configured to accept.
Any help on how to get this working would be appreciated.
Update: After fixing my family file, it now says:
Getting information for site mywiki:en
('http error', 401, 'Unauthorized', )
WARNING: Could not open 'https://local.example.com/mywiki/index.php?title=Non-existing_page&action=edit&useskin=monobook'. Maybe the server or your connection is down. Retrying in 1 minutes...
I looked at the HTTP headers on a plain urllib2.urlopen call, and it's using WWW-Authenticate: Negotiate and WWW-Authenticate: NTLM. I'm guessing urllib2, and thus pywikipedia, doesn't support this?
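For reference, this is roughly how I inspected the challenge headers (a minimal sketch; urllib2 raises HTTPError on the 401, and the error object carries the server's response headers):
import urllib2

try:
    urllib2.urlopen('https://local.example.com/mywiki/index.php')
except urllib2.HTTPError, e:
    print e.code  # 401
    print e.hdrs  # shows the WWW-Authenticate: Negotiate / NTLM challenge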
Update: Added a tasty bounty for help in getting this to work. I can authenticate using python-ntlm; how do I integrate this into pywikipedia?
Well, the fact that login.py tries accessing '/w' instead of your path shows that there is a family configuration issue.
Your code is indented strangely: is scriptpath a member of the new Family class? As in:
class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'mywiki'
        self.langs = {'en': 'local.example.com'}

    def scriptpath(self, code):
        return '/mywiki'

    def version(self, code):
        return '1.13.5'

    def isPublic(self):
        return False

    def hostname(self, code):
        return 'local.example.com'

    def protocol(self, code):
        return 'https'
?
I believe that something is wrong with your family file. A good way to check is to run this in a Python console:
import wikipedia
site = wikipedia.getSite('en', 'mywiki')
print site.login_address()
As long as the relative address is wrong, showing '/w' instead of '/mywiki', it means that the family file is still not configured correctly, and the bot won't work :)
Update: how to integrate NTLM into pywikipedia?
I just had a look at the basic example here. I would integrate the code before that line in login.py:
response = urllib2.urlopen(urllib2.Request(self.site.protocol() + '://' + self.site.hostname() + address, data, headers))
You want to write something like the following:
from ntlm import HTTPNtlmAuthHandler
user = 'DOMAIN\\User'  # note the escaped backslash
password = "Password"
url = self.site.protocol() + '://' + self.site.hostname()
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
# create the NTLM authentication handler
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)
# create and install the opener
opener = urllib2.build_opener(auth_NTLM)
urllib2.install_opener(opener)
response = urllib2.urlopen(urllib2.Request(self.site.protocol() + '://' + self.site.hostname() + address, data, headers))
I would test this and integrate it directly into the pywikipedia codebase if only I had an available NTLM setup...
Whatever happens, please do not vanish with your solution: we at pywikipedia are interested in it :)
I am guessing the problem is that the server expects basic authentication and you are not handling that in your client. Michael Foord wrote a good article about handling basic authentication in Python.
You did not provide enough information for me to be sure about this, so if that does not work, please provide some additional information, like a network dump of your connection attempt.
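In short, basic authentication with urllib2 looks something like this (a minimal sketch; substitute your wiki URL and real credentials):
import urllib2

url = 'https://local.example.com/mywiki/'
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
# None as the realm means these credentials are tried for any realm at this URL
passman.add_password(None, url, 'user', 'pass')
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(passman))
urllib2.install_opener(opener)
print urllib2.urlopen(url).read()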
I'm using Python on a Jupyter notebook for data analysis, and want access to a third party API (Mendeley) that uses OAuth. There used to be a work-around with a server on Heroku that produced a token manually, but that's been discontinued recently.
This must be an insanely common problem, but I can't find a maintained library that supports it. Most Python OAuth libraries are server-only; there's a well-supported JupyterHub-OAuthenticator, but AFAICS that is using OAuth for a different purpose.
ipyauth looks the business, but it's not been updated much and it's not documented how to extend it for new services. That situation usually means there's something better-supported available.
What is the currently-maintained Jupyter-Python-ThirdPartyAPI library, please?
Well, one answer turns out to be just to use the requests package, and to copy and paste the redirected URL each time:
from requests_oauthlib import OAuth2Session
scope = 'all'
redirect_uri='http://localhost:8888/'
oauth = OAuth2Session('YourApplicationApiIdNumber', redirect_uri=redirect_uri, scope=scope)
authorization_url, state = oauth.authorization_url(
"https://api.mendeley.com/oauth/authorize" )
print( 'Please go to %s to authorize access, and copy the final localhost URL' % authorization_url )
assert(False) # Stop processing until this is done.
... into a variable:
authorization_response = 'http://localhost:8888/tree?TheStuffWereInterestedIn'
... And then take it from there:
token = oauth.fetch_token(
'https://api.mendeley.com/oauth/token',
authorization_response=authorization_response.replace('http', 'https').replace(',',''),
client_secret='YourClientSecret')
r = oauth.get('https://api.mendeley.com/documents?sort=last_modified&order=desc&limit=500', timeout=30)
You have to configure the Mendeley application interface with the callback URL. This is http://localhost:8888/ , because that's something Jupyter can display without losing the additional OAuth2 parameters. But Requests' OAuth implementation doesn't accept non-https links (nor the trailing comma that Jupyter occasionally adds), so we fudge it as shown.
I guess this approach will work with pretty much any OAuth2 API. Certainly Requests lists quite a few.
Not sure if this helps or not, but I had to do OAuth using the PKCE authorization flow, with a client_id, a registered callback URL, and no secret. It also had to run in Jupyter. The user is required to log in on the authorization provider's website as part of the redirect (I think; I get mixed up with the terms, because for me the auth provider and the resource were from the same source). To do this I used Python's HTTPServer and a request handler to intercept the callback URL. The HTTPServer then hands over the relevant authorization code to my OAuth client when fetching the access token.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import parse
import random
import string
import hashlib
import base64
from typing import Any
import webbrowser
from authlib.integrations.requests_client import OAuth2Session
from IPython.display import clear_output  # needed for the clear_output() call below

config = {
    'scopes': ['openid', 'profile'],
    'port': 8888,
    'redirect_url': 'http://localhost:8888/auth/callback',
    'access_token_url': 'https://some_oauth_provider.com/oauth2/access',
    'auth_code_url': 'https://some_oauth_provider.com/oauth2/authz/',
    'client_id': 'providedByOauthprovider'}

def generate_code() -> tuple[str, str]:
    # PKCE: random verifier plus its base64url-encoded SHA-256 challenge
    rand = random.SystemRandom()
    code_verifier = ''.join(rand.choices(string.ascii_letters + string.digits, k=128))
    code_sha_256 = hashlib.sha256(code_verifier.encode('utf-8')).digest()
    b64 = base64.urlsafe_b64encode(code_sha_256)
    code_challenge = b64.decode('utf-8').replace('=', '')
    return (code_verifier, code_challenge)

def login(config: dict[str, Any]) -> str:
    class OAuthHttpServer(HTTPServer):
        def __init__(self, *args, **kwargs):
            HTTPServer.__init__(self, *args, **kwargs)
            self.authorization_code = ""

    class OAuthHttpHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write("Redirecting to the My Auth provider login\n".encode("UTF-8"))
            parsed = parse.urlparse(self.path)
            # the callback URL in self.path carries the authorization code
            client.fetch_token(url=config['access_token_url'],
                               authorization_response=self.path,
                               state=state,
                               code_verifier=code_verifier,
                               grant_type='authorization_code')
            self.wfile.write("""
                <html>
                <body>
                <h2>Authorization request to My Auth provider has been completed.</h2>
                <h3>You may close this tab or window now.</h3>
                </body>
                </html>
                """.encode("UTF-8"))
            # Timeout only works if already logged in, for some reason
            self.wfile.write('<script> setTimeout("window.close()", 2500);</script>'.encode("UTF-8"))

    with OAuthHttpServer(('', config["port"]), OAuthHttpHandler) as httpd:
        client = OAuth2Session(client_id=config['client_id'],
                               scope=config['scopes'],
                               redirect_uri=config['redirect_url'],
                               code_challenge_method='S256')
        code_verifier, code_challenge = generate_code()
        auth_uri, state = client.create_authorization_url(config['auth_code_url'], code_verifier=code_verifier)
        webbrowser.open_new(auth_uri)
        httpd.handle_request()
        clear_output()
        print("Logged in successfully")
        return client.token['access_token']
I borrowed a lot of this from a sample by this guy, but I had to swap out the OAuth client because it didn't work for me, and I had to rearrange the HTTP request handler a little.
https://github.com/CamiloTerevinto/Blog/tree/main/Samples
I can't seem to get SUDS to download a WSDL that requires Basic auth credentials. My code is simple:
wsdl_url = 'https://example.com/ChangeRequest.do?WSDL'
self.client = Client(wsdl_url, username=username, password=password)
I've also tried:
from suds.transport.https import HttpAuthenticated
wsdl_url = 'https://example.com/ChangeRequest.do?WSDL'
credentials = dict(username=username, password=password)
t = HttpAuthenticated(**credentials)
self.client = Client(url=wsdl_url, transport=t)
In both cases, the service returns a 403 Forbidden error. I can go down into the SUDS code in http.py and add this line to the call:
u2request.add_header('Authorization','Basic xxxxxxxxxxxxxxxxxxxx')
This works. What am I doing wrong to get SUDS to pass my credentials when downloading the WSDL?
Note: I tried connecting to the WSDL directly using both Chrome's Postman plugin and SoapUI, and the service works in both, so I know the credentials are correct.
I encountered a similar issue (suds v0.4, WSDL, 403), and found out that it was because the server I was trying to access blocks any request whose User-Agent header looks like Python-urllib* (suds uses urllib2, hence the default header). Explicitly changing the header solved the issue.
Particular to my solution: I overrode the open method of a transport class and set the client options, as in the following code snippet. Note that we need to set the header separately for open and for subsequent requests. Please advise of better ways to circumvent this if you know any; I hope this post can save someone's time in the future.
import urllib2
import suds
from suds.client import Client
from suds.transport.https import HttpAuthenticated
from suds.transport import TransportError

URL = 'https://example.com/ChangeRequest.do?WSDL'

class HttpHeaderModify(HttpAuthenticated):
    def open(self, request):
        try:
            url = request.url
            u2request = urllib2.Request(url, headers={'User-Agent': 'Mozilla'})
            self.proxy = self.options.proxy
            return self.u2open(u2request)
        except urllib2.HTTPError, e:
            raise TransportError(str(e), e.code, e.fp)

transport = HttpHeaderModify()
client = Client(URL, transport=transport, timeout=10)

# Subsequent requests' headers need to be set again here. The overridden transport
# class only handles the opening of the client.
client.set_options(headers={'User-Agent': 'Mozilla'})
P.S. Though my problem may not be the same, searching for "403 suds" pops up this SO question, so I decided to just post my solution here.
The reference post that gave me the right direction: https://bitbucket.org/jurko/suds/issues/27/client-request-for-wsdl-does-not-use-given
I had this issue before and compared suds' request with the headers SoapUI sends.
I found that suds was missing the Host header:
client.set_options(headers={'Host': 'value'})
Adding it fixed the issue.
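For completeness, the same option can also be passed when constructing the client (a sketch; as far as I know suds forwards constructor keyword arguments to set_options, and the Host value is whatever SoapUI sends for your service):
from suds.client import Client

# the headers option applies to every request the client makes
client = Client('https://example.com/ChangeRequest.do?WSDL',
                headers={'Host': 'example.com'})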
I have written code to get URLs from Bing search, and it gives the error mentioned above.
import urllib
import urllib2
accountKey = 'mykey'
username = accountKey
queryBingFor = "'JohnDalton'"
quoted_query = urllib.quote(queryBingFor)
rootURL = "https://api.datamarket.azure.com/Bing/Search/"
searchURL = rootURL + "Image?$format=json&Query=" + quoted_query
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, searchURL,username,accountKey)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
readURL = urllib2.urlopen(searchURL).read()
I have set username = accountKey, as someone told me it has to be the same for both. Anyway, I didn't get a username when I made the Bing webmaster account, or is it just my email? Excuse me if I have made novice mistakes; I've just started Python.
In the absence of any other information, it seems unlikely that what is effectively your username and password would be the same thing if this site actually needs this form of authorisation.
Are you able to make it work by doing a request in your browser like the following?
https://mykey:mykey@api.datamarket.azure.com/Bing/Search/Image?$format=json&Query=blah
If so, then at least it sounds like the credentials are right, and it's the way you are using them in Python that's wrong; but more likely the above will fail with the same error, suggesting the credentials themselves are not valid.
Also see this question, which suggests there may be a problem if the site doesn't do 'standard' auth: urllib2 HTTPPasswordMgr not working - Credentials not sent error
It also suggests that you might need to pass the top-level URL of the site to the password manager rather than the specific search URL.
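That is, something along these lines (a sketch; the key is registered against the API root so it matches any path beneath it):
import urllib2

accountKey = 'mykey'
rootURL = "https://api.datamarket.azure.com/"

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# register the credentials against the site root, not the full search URL
password_mgr.add_password(None, rootURL, accountKey, accountKey)
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
urllib2.install_opener(opener)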
Finally, it might be worth adapting this code:
http://www.voidspace.org.uk/python/articles/authentication.shtml
for your site, to check the auth realm and scheme the site is sending and confirm they're supported.
This question has been asked here before. The accepted answer was probably obvious to both questioner and answerer, but not to me. I have commented on the above question to get more details, but there was no response. I also approached the meta Q&A for help on how to bring back questions from their grave, and got no answer either.
The answer to the here above question was:
From the client's perspective, an OpenID login is very similar to any other web-based login. There isn't a defined protocol for the client; it is an ordinary web session that varies based on your OpenID provider. For this reason, I doubt that any such libraries exist. You will probably have to code it yourself.
I know how to log onto a website with Python already, using the urllib2 module. But that's not enough for me to guess how to authenticate to an OpenID.
I'm actually trying to get my StackOverflow inbox in json format, for which I need to be logged in.
Could someone provide a short intro or a link to a nice tutorial on how to do that?
Well I myself don't know much about OpenID but your post (and the bounty!!) got me interested.
This link tells the exact flow of the OpenID authentication sequence (at least for v1.0; the new version is 2.0). From what I could make out, the steps are something like this:
1. You fetch the login page of StackOverflow, which also provides an option to log in using OpenID (as a form field).
2. You send your OpenID, which is actually a form of URI, NOT a username/email (if it is a Google profile, it is your profile ID).
3. StackOverflow then connects to your ID provider (in this case Google) and sends you a redirect to the Google login page, plus another link to where you should redirect later (let's say a).
4. You log in to the Google-provided page conventionally (using the POST method from Python).
5. Google provides a cryptographic token (not entirely sure about this step) in return for your login request.
6. You send a new request to a with this token.
7. StackOverflow contacts Google with this token. If authenticity is established, it returns a session ID.
8. Later requests to StackOverflow should carry this session ID.
9. No idea about logging out!!
This link tells about the various responses in OpenID and what they mean, so maybe it will come in handy when you code your client.
Links from the wiki page OpenID Explained
EDIT: Using the Tamper Data add-on for Firefox, the following sequence of events can be constructed.
1. The user sends a request to the SO login page. On entering the OpenID in the form field, the resulting page sends a 302 redirecting to a Google page.
2. The redirect URL has a lot of OpenID parameters (which are for the Google server). One of them is return_to=https://stackoverflow.com/users/authenticate/?s=some_value.
3. The user is presented with the Google login page. On login there are a few 302s which redirect the user around in the Google realm.
4. Finally, a 302 is received which redirects the user to the StackOverflow page specified in 'return_to' earlier.
5. During this entire series of operations a lot of cookies are generated, which must be stored correctly.
6. On accessing the SO page (which was 302'd to by Google), the SO server processes your request and in the response header sends a "Set-Cookie" field to set cookies named gauth and usr, along with another 302 to stackoverflow.com. This step completes your login.
7. Your client simply stores the cookie usr.
8. You are logged in as long as you remember to send the usr cookie with any request to SO.
You can now request your inbox; just remember to send the usr cookie with the request.
I suggest you start coding your Python client and study the responses carefully. In most cases it will be a series of 302s with minimal user intervention (except for filling out your Google username and password and allowing the site page).
However, to make it easier, you could just log in to SO using your browser, copy all the cookie values, and make a request using urllib2 with the cookie values set.
Of course, if you log out in the browser, you will have to log in again and change the cookie value in your Python program.
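For example (a sketch; the cookie value is the one copied from your browser session, and the inbox URL is the one used elsewhere in this thread):
import urllib2

# paste the 'usr' cookie value copied from the browser after logging in manually
request = urllib2.Request('http://stackoverflow.com/inbox/genuwine')
request.add_header('Cookie', 'usr=YOUR_COOKIE_VALUE')
print urllib2.urlopen(request).read()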
I know this is close to archaeology, digging up a post that's two years old, but I just wrote a new, enhanced version of the code from the accepted answer, so I thought it might be cool to share it here, as this question and its answers were a great help to me in implementing it.
So, here's what's different:
it uses the new requests library, which is an enhancement over urllib2;
it supports authenticating using Google's and StackExchange's OpenID providers;
it is way shorter and simpler to read, though it has fewer printouts.
here's the code:
#!/usr/bin/env python

import sys
import urllib
import requests
from getpass import getpass
from BeautifulSoup import BeautifulSoup

def get_google_auth_session(username, password):
    session = requests.Session()
    google_accounts_url = 'http://accounts.google.com'
    authentication_url = 'https://accounts.google.com/ServiceLoginAuth'
    stack_overflow_url = 'http://stackoverflow.com/users/authenticate'

    r = session.get(google_accounts_url)
    dsh = BeautifulSoup(r.text).findAll(attrs={'name' : 'dsh'})[0].get('value').encode()
    auto = r.headers['X-Auto-Login']
    follow_up = urllib.unquote(urllib.unquote(auto)).split('continue=')[-1]
    galx = r.cookies['GALX']

    payload = {'continue' : follow_up,
               'followup' : follow_up,
               'dsh' : dsh,
               'GALX' : galx,
               'pstMsg' : 1,
               'dnConn' : 'https://accounts.youtube.com',
               'checkConnection' : '',
               'checkedDomains' : '',
               'timeStmp' : '',
               'secTok' : '',
               'Email' : username,
               'Passwd' : password,
               'signIn' : 'Sign in',
               'PersistentCookie' : 'yes',
               'rmShown' : 1}

    r = session.post(authentication_url, data=payload)

    if r.url != authentication_url: # XXX
        print "Logged in"
    else:
        print "login failed"
        sys.exit(1)

    payload = {'oauth_version' : '',
               'oauth_server' : '',
               'openid_username' : '',
               'openid_identifier' : ''}
    r = session.post(stack_overflow_url, data=payload)

    return session

def get_so_auth_session(email, password):
    session = requests.Session()
    r = session.get('http://stackoverflow.com/users/login')
    fkey = BeautifulSoup(r.text).findAll(attrs={'name' : 'fkey'})[0]['value']

    payload = {'openid_identifier': 'https://openid.stackexchange.com',
               'openid_username': '',
               'oauth_version': '',
               'oauth_server': '',
               'fkey': fkey,
               }
    r = session.post('http://stackoverflow.com/users/authenticate', allow_redirects=True, data=payload)

    fkey = BeautifulSoup(r.text).findAll(attrs={'name' : 'fkey'})[0]['value']
    session_name = BeautifulSoup(r.text).findAll(attrs={'name' : 'session'})[0]['value']

    payload = {'email': email,
               'password': password,
               'fkey': fkey,
               'session': session_name}

    r = session.post('https://openid.stackexchange.com/account/login/submit', data=payload)
    # check if url changed for error detection
    error = BeautifulSoup(r.text).findAll(attrs={'class' : 'error'})
    if len(error) != 0:
        print "ERROR:", error[0].text
        sys.exit(1)

    return session

if __name__ == "__main__":
    prov = raw_input('Choose your openid provider [1 for StackOverflow, 2 for Google]: ')
    name = raw_input('Enter your OpenID address: ')
    pswd = getpass('Enter your password: ')

    if '1' in prov:
        so = get_so_auth_session(name, pswd)
    elif '2' in prov:
        so = get_google_auth_session(name, pswd)
    else:
        print "Error no openid provider given"
        sys.exit(1)

    r = so.get('http://stackoverflow.com/inbox/genuwine')
    print r.json()
The code is also available as a GitHub gist.
HTH
This answer sums up what others have said below, especially RedBaron, and adds a method I used to get to the StackOverflow inbox using Google Accounts.
Using the Tamper Data developer tool for Firefox and logging on to StackOverflow, one can see that OpenID works this way:
1. StackOverflow requests authentication from a given service (here Google), defined in the posted data;
2. Google Accounts takes over and checks for an already existing cookie as proof of authentication;
3. If no cookie is found, Google requests authentication and sets a cookie;
4. Once the cookie is set, StackOverflow acknowledges authentication of the user.
The above sums up the process, which in reality is more complicated, since many redirects and cookie exchanges occur.
Because reproducing the same process programmatically proved somewhat difficult (and that might just be my illiteracy), especially trying to hunt down the URLs to call with all the locale specifics etc., I opted for logging on to Google Accounts first, getting a well-deserved cookie, and then logging on to StackOverflow, which would use the cookie for authentication.
This is done simply using the following Python modules: urllib, urllib2, cookielib and BeautifulSoup.
Here is the (simplified) code. It's not perfect, but it does the trick. The extended version can be found on GitHub.
#!/usr/bin/env python

import urllib
import urllib2
import cookielib
from BeautifulSoup import BeautifulSoup
from getpass import getpass

# Define URLs
google_accounts_url = 'http://accounts.google.com'
authentication_url = 'https://accounts.google.com/ServiceLoginAuth'
stack_overflow_url = 'https://stackoverflow.com/users/authenticate'
genuwine_url = 'https://stackoverflow.com/inbox/genuwine'

# Build opener
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

def request_url(request):
    '''
    Requests given URL.
    '''
    try:
        response = opener.open(request)
    except:
        raise
    return response

def authenticate(username='', password=''):
    '''
    Authenticates to Google Accounts using user-provided username and password,
    then authenticates to StackOverflow.
    '''
    # Build up headers
    user_agent = 'Mozilla/5.0 (Ubuntu; X11; Linux i686; rv:8.0) Gecko/20100101 Firefox/8.0'
    headers = {'User-Agent' : user_agent}

    # Set Data to None
    data = None

    # Build up URL request with headers and data
    request = urllib2.Request(google_accounts_url, data, headers)
    response = request_url(request)

    # Build up POST data for authentication
    html = response.read()
    dsh = BeautifulSoup(html).findAll(attrs={'name' : 'dsh'})[0].get('value').encode()
    auto = response.headers.getheader('X-Auto-Login')
    follow_up = urllib.unquote(urllib.unquote(auto)).split('continue=')[-1]
    galx = jar._cookies['accounts.google.com']['/']['GALX'].value

    values = {'continue' : follow_up,
              'followup' : follow_up,
              'dsh' : dsh,
              'GALX' : galx,
              'pstMsg' : 1,
              'dnConn' : 'https://accounts.youtube.com',
              'checkConnection' : '',
              'checkedDomains' : '',
              'timeStmp' : '',
              'secTok' : '',
              'Email' : username,
              'Passwd' : password,
              'signIn' : 'Sign in',
              'PersistentCookie' : 'yes',
              'rmShown' : 1}
    data = urllib.urlencode(values)

    # Build up URL for authentication
    request = urllib2.Request(authentication_url, data, headers)
    response = request_url(request)

    # Check if logged in
    if response.url != request._Request__original:
        print '\n Logged in :)\n'
    else:
        print '\n Log in failed :(\n'

    # Build OpenID Data
    values = {'oauth_version' : '',
              'oauth_server' : '',
              'openid_username' : '',
              'openid_identifier' : 'https://www.google.com/accounts/o8/id'}
    data = urllib.urlencode(values)

    # Build up URL for OpenID authentication
    request = urllib2.Request(stack_overflow_url, data, headers)
    response = request_url(request)

    # Retrieve Genuwine
    data = None
    request = urllib2.Request(genuwine_url, data, headers)
    response = request_url(request)
    print response.read()

if __name__ == '__main__':
    username = raw_input('Enter your Gmail address: ')
    password = getpass('Enter your password: ')
    authenticate(username, password)
You need to implement cookies on any "login" page; in Python you use cookielib's CookieJar. For example:
jar = cookielib.CookieJar()
myopener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
#myopener now supports cookies.
....
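Fleshed out, a login with a cookie-aware opener looks roughly like this (a sketch; the URL and form field names are placeholders for whatever your site's login form uses):
import urllib
import urllib2
import cookielib

jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# POST the login form; the session cookie lands in the jar
data = urllib.urlencode({'username': 'user', 'password': 'pass'})
opener.open('http://example.com/login', data)

# the same opener now sends the session cookie automatically
print opener.open('http://example.com/private-page').read()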
I made a simple script that logs in to stackoverflow.com using Mozilla Firefox cookies. It's not entirely automated, because you need to log in manually, but it's all I managed to do.
The script works with recent versions of Firefox (I'm using 8.0.1), but you need to get the latest SQLite DLL, because the default one that comes with Python 2.7 can't open the DB. You can get it here: http://www.sqlite.org/sqlite-dll-win32-x86-3070900.zip
import urllib2
import webbrowser
import cookielib
import os
import sqlite3
import re
from time import sleep

# Log in in Firefox. It must be the default browser; otherwise log in manually.
webbrowser.open_new('http://stackoverflow.com/users/login')

# Wait for the user to log in
sleep(60)

# Process profiles.ini to get the path to cookies.sqlite
profile = open(os.path.join(os.environ['APPDATA'],'Mozilla','Firefox','profiles.ini'), 'r').read()
COOKIE_DB = os.path.join(os.environ['APPDATA'],'Mozilla','Firefox','Profiles',re.findall('Profiles/(.*)\n',profile)[0],'cookies.sqlite')
CONTENTS = "host, path, isSecure, expiry, name, value"

# Extract cookies for a specific host
def get_cookies(host):
    cj = cookielib.LWPCookieJar()
    con = sqlite3.connect(COOKIE_DB)
    cur = con.cursor()
    sql = "SELECT {c} FROM moz_cookies WHERE host LIKE '%{h}%'".format(c=CONTENTS, h=host)
    cur.execute(sql)
    for item in cur.fetchall():
        c = cookielib.Cookie(0, item[4], item[5],
                             None, False,
                             item[0], item[0].startswith('.'), item[0].startswith('.'),
                             item[1], False,
                             item[2],
                             item[3], item[3] == "",
                             None, None, {})
        cj.set_cookie(c)
    return cj

host = 'stackoverflow'
cj = get_cookies(host)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open('http://stackoverflow.com').read()

# If the username is in the response, auth was successful
if 'Stanislav Golovanov' in response:
    print 'Auth successful'
I'm working on a django/mod_wsgi/apache2 website that serves sensitive information using https for all requests and responses. All views are written to redirect if the user isn't authenticated. It also has several views that are meant to function like RESTful web services.
I'm now in the process of writing a script that uses urllib/urllib2 to contact several of these services in order to download a series of very large files. I'm running into problems with 403: FORBIDDEN errors when attempting to log in.
The (rough-draft) method I'm using for authentication and login is:
import getpass
import logging
import urllib
import urllib2

log = logging.getLogger(__name__)
# PATH_TO_LOGIN is defined elsewhere in the script

def login(base_address, username=None, password=None):
    # prompt for the username (if needed), password
    if username is None:
        username = raw_input('Username: ')
    if password is None:
        password = getpass.getpass('Password: ')
    log.info('Logging in %s' % username)

    # fetch the login page in order to get the csrf token
    cookieHandler = urllib2.HTTPCookieProcessor()
    opener = urllib2.build_opener(urllib2.HTTPSHandler(), cookieHandler)
    urllib2.install_opener(opener)

    login_url = base_address + PATH_TO_LOGIN
    log.debug("login_url: " + login_url)
    login_page = opener.open(login_url)

    # attempt to get the csrf token from the cookie jar
    csrf_cookie = None
    for cookie in cookieHandler.cookiejar:
        if cookie.name == 'csrftoken':
            csrf_cookie = cookie
            break
    if not csrf_cookie:
        raise IOError("No csrf cookie found")
    log.debug("found csrf cookie: " + str(csrf_cookie))
    log.debug("csrf_token = %s" % csrf_cookie.value)

    # login using the usr, pwd, and csrf token
    login_data = urllib.urlencode(dict(
        username=username, password=password,
        csrfmiddlewaretoken=csrf_cookie.value))
    log.debug("login_data: %s" % login_data)

    req = urllib2.Request(login_url, login_data)
    response = urllib2.urlopen(req)
    # <--- 403: FORBIDDEN here

    log.debug('response url:\n' + str(response.geturl()) + '\n')
    log.debug('response info:\n' + str(response.info()) + '\n')

    # should redirect to the welcome page here; if back at login - refused
    if response.geturl() == login_url:
        raise IOError('Authentication refused')
    log.info('\t%s is logged in' % username)

    # save the cookies/opener for further actions
    return opener
I'm using the HTTPCookieHandler to store Django's authentication cookies on the script-side so I can access the web services and get through my redirects.
I know that the CSRFmiddleware for Django is going to bump me out if I don't pass the csrf token along with the log in information, so I pull that first from the first page/form load's cookiejar. Like I mentioned, this works with the http/development version of the site.
Specifically, I'm getting a 403 when trying to post the credentials to the login page/form over the https connection. This method works when used on the development server which uses an http connection.
There is no Apache directory directive that prevents access to that area (that I can see). The script connects successfully to the login page without post data so I'm thinking that would leave Apache out of the problem (but I could be wrong).
The python installations I'm using are both compiled with SSL.
I've also read that urllib2 doesn't allow https connections via proxy. I'm not very experienced with proxies, so I don't know if using a script from a remote machine is actually a proxy connection and whether that would be the problem. Is this causing the access problem?
From what I can tell, the problem is in the combination of cookies and the post data, but I'm unclear as to where to take it from here.
Any help would be appreciated. Thanks
Please excuse my answering my own question, but, for the record, this seems to have solved it:
It turns out I needed to set the HTTP Referer header to the login page url in the request where I post the login information.
req.add_header( 'Referer', login_url )
The reason is explained in the Django CSRF documentation; specifically, step 4.
Due to our somewhat peculiar server setup, where we use HTTPS on the production side and DEBUG=False, I wasn't seeing the csrf_failure reason for failure (in this case: 'Referer checking failed - no referer') that is normally output in the DEBUG info. I ended up printing that failure reason to the Apache error_log and STFW'd on it. That led me to code.djangoproject/.../csrf.py and the Referer header fix.
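For anyone else hunting the same message: Django lets you point CSRF_FAILURE_VIEW at your own view, whose reason argument carries exactly that text (a sketch; the 'myapp' module name is hypothetical):
# settings.py
CSRF_FAILURE_VIEW = 'myapp.views.csrf_failure'

# myapp/views.py
import logging
from django.http import HttpResponseForbidden

logger = logging.getLogger(__name__)

def csrf_failure(request, reason=""):
    # 'reason' is the message normally only visible with DEBUG=True,
    # e.g. 'Referer checking failed - no Referer.'
    logger.error('CSRF failure: %s', reason)
    return HttpResponseForbidden('CSRF verification failed.')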
This works on my Django setup over HTTPS, which is inspired by yours, so I'm starting to think the problem is outside this code... Is the server saying anything? I might very well look into Apache.
I'm using the following code from my local machine against my server running SSL on nginx, so Apache might be the place to look. I suppose one way to narrow it down is to try your script on my login page :) Shoot me an email!
import urllib
import urllib2
import contextlib

def login(login_url, username, password):
    """
    Login to site
    """
    cookies = urllib2.HTTPCookieProcessor()
    opener = urllib2.build_opener(cookies)
    urllib2.install_opener(opener)
    opener.open(login_url)

    try:
        token = [x.value for x in cookies.cookiejar if x.name == 'csrftoken'][0]
    except IndexError:
        return False, "no csrftoken"

    params = dict(username=username, password=password,
                  this_is_the_login_form=True,
                  csrfmiddlewaretoken=token,
                  )
    encoded_params = urllib.urlencode(params)

    with contextlib.closing(opener.open(login_url, encoded_params)) as f:
        html = f.read()
        print html
        # we're in.