Authenticate Scrapy HTTP Proxy

Authenticate Scrapy HTTP Proxy - python

I can set an http proxy using request.meta['proxy'], but how do I authenticate the proxy?
This does not work to specify user and pass:
request.meta['proxy'] = 'http://user:pass#123.456.2323:2222'
From looking around, I may have to send request.headers['Proxy-Authorization'], but what format do I send it in?

username and password are base64 encoded in the form "username:password"
import base64
# Set the location of the proxy
proxy_string = choice(self._get_proxies_from_file('proxies.txt')) # user:pass#ip:port
proxy_items = proxy_string.split('#')
request.meta['proxy'] = "http://%s" % proxy_items[1]
# setup basic authentication for the proxy
user_pass=base64.encodestring(proxy_items[0])
request.headers['Proxy-Authorization'] = 'Basic ' + user_pass

The w3lib module has a very convenient function for this usecase.
from w3lib.http import basic_auth_header
request.meta["proxy"] = "http://192.168.1.1:8050"
request.headers["Proxy-Authorization"] = basic_auth_header(proxy_user, proxy_pass)
This is also mentioned in a blog article of Zyte (the maintainers of scrapy)

Related

How to make a HTTP Basic Authentication post request where password is generated by TOTP in Python?

The task is to make a HTTP post request to url sending a json string data over, url is protected by HTTP Basic Authentication, I need to provide an Authorization: field in header, and emailAdd is the userid of Basic Authentication, password is generated by TOTP, where digits is 10-digit, time step is 30s, T0 is 0, hash function uses sha512, and shared secret is emailAdd + "ABCD".
My code is like:
import requests, base64, json
from passlib.totp import TOTP
from requests.auth import HTTPBasicAuth
totp = TOTP(key = base64.b32encode(shared_secret.encode()), digits = 10, alg = "sha512")
password = totp.generate().token
r = requests.Session()
#auth = base64.b64encode((''.join(userid) + ':' + ''.join(password)).encode())
r.headers.update({"Content-Type": "application/json"}) #this is task required
#r.headers.update({"Authorization": "Basic " + auth.decode()})
res = r.post(url, json = data, auth = (userid, password))
print(res.status_code)
print(res.content)
But I failed authentication, I think the password should be correct, and there is something wrong with post request, can anyone please tell me what the problem is? Also, I'm in a different time zone from the server, does it make any difference on TOTP generated password? And I'm running python on windows, if that matters.
Thanks

How do I authenticate a urllib2 script in order to access HTTPS web services from a Django site?

everybody.
I'm working on a django/mod_wsgi/apache2 website that serves sensitive information using https for all requests and responses. All views are written to redirect if the user isn't authenticated. It also has several views that are meant to function like RESTful web services.
I'm now in the process of writing a script that uses urllib/urllib2 to contact several of these services in order to download a series of very large files. I'm running into problems with 403: FORBIDDEN errors when attempting to log in.
The (rough-draft) method I'm using for authentication and log in is:
def login( base_address, username=None, password=None ):
# prompt for the username (if needed), password
if username == None:
username = raw_input( 'Username: ' )
if password == None:
password = getpass.getpass( 'Password: ' )
log.info( 'Logging in %s' % username )
# fetch the login page in order to get the csrf token
cookieHandler = urllib2.HTTPCookieProcessor()
opener = urllib2.build_opener( urllib2.HTTPSHandler(), cookieHandler )
urllib2.install_opener( opener )
login_url = base_address + PATH_TO_LOGIN
log.debug( "login_url: " + login_url )
login_page = opener.open( login_url )
# attempt to get the csrf token from the cookie jar
csrf_cookie = None
for cookie in cookieHandler.cookiejar:
if cookie.name == 'csrftoken':
csrf_cookie = cookie
break
if not cookie:
raise IOError( "No csrf cookie found" )
log.debug( "found csrf cookie: " + str( csrf_cookie ) )
log.debug( "csrf_token = %s" % csrf_cookie.value )
# login using the usr, pwd, and csrf token
login_data = urllib.urlencode( dict(
username=username, password=password,
csrfmiddlewaretoken=csrf_cookie.value ) )
log.debug( "login_data: %s" % login_data )
req = urllib2.Request( login_url, login_data )
response = urllib2.urlopen( req )
# <--- 403: FORBIDDEN here
log.debug( 'response url:\n' + str( response.geturl() ) + '\n' )
log.debug( 'response info:\n' + str( response.info() ) + '\n' )
# should redirect to the welcome page here, if back at log in - refused
if response.geturl() == login_url:
raise IOError( 'Authentication refused' )
log.info( '\t%s is logged in' % username )
# save the cookies/opener for further actions
return opener
I'm using the HTTPCookieHandler to store Django's authentication cookies on the script-side so I can access the web services and get through my redirects.
I know that the CSRFmiddleware for Django is going to bump me out if I don't pass the csrf token along with the log in information, so I pull that first from the first page/form load's cookiejar. Like I mentioned, this works with the http/development version of the site.
Specifically, I'm getting a 403 when trying to post the credentials to the login page/form over the https connection. This method works when used on the development server which uses an http connection.
There is no Apache directory directive that prevents access to that area (that I can see). The script connects successfully to the login page without post data so I'm thinking that would leave Apache out of the problem (but I could be wrong).
The python installations I'm using are both compiled with SSL.
I've also read that urllib2 doesn't allow https connections via proxy. I'm not very experienced with proxies, so I don't know if using a script from a remote machine is actually a proxy connection and whether that would be the problem. Is this causing the access problem?
From what I can tell, the problem is in the combination of cookies and the post data, but I'm unclear as to where to take it from here.
Any help would be appreciated. Thanks

Please excuse my answering my own question, but - for the record this seems to have solved it:
It turns out I needed to set the HTTP Referer header to the login page url in the request where I post the login information.
req.add_header( 'Referer', login_url )
The reason is explained on the Django CSRF documentation - specifically, step 4.
Due to our somewhat peculiar server setup where we use HTTPS on the production side and DEBUG=False, I wasn't seeing the csrf_failure reason for failure (in this case: 'Referer checking failed - no referer') that is normally output in the DEBUG info. I ended up printing that failure reason to the Apache error_log and STFW'd on it. That lead me to code.djangoproject/.../csrf.py and the Referer header fix.

This works on my django setup on https which is inspired by yours. I'm starting to think that the problem is outside this code... Is the server saying anything? I might very well be looking into apache.
I'm using the following code from my local machine to my server using ssl on nginx, so apache might be the place to look. I suppose one way to narrow it down is to try your script on my login page :) Shoot me an email!
import urllib
import urllib2
import contextlib
def login(login_url, username, password):
"""
Login to site
"""
cookies = urllib2.HTTPCookieProcessor()
opener = urllib2.build_opener(cookies)
urllib2.install_opener(opener)
opener.open(login_url)
try:
token = [x.value for x in cookies.cookiejar if x.name == 'csrftoken'][0]
except IndexError:
return False, "no csrftoken"
params = dict(username=username, password=password, \
this_is_the_login_form=True,
csrfmiddlewaretoken=token,
)
encoded_params = urllib.urlencode(params)
with contextlib.closing(opener.open(login_url, encoded_params)) as f:
html = f.read()
print html
# we're in.

python http request with token

how and with which python library is it possible to make an httprequest (https) with a user:password or a token?
basically the equivalent to curl -u user:pwd https://www.mysite.com/
thank you

use python requests : Http for Humans
import requests
requests.get("https://www.mysite.com/", auth=('username','pwd'))
you can also use digest auth...

If you need to make thread-safe requests, use pycurl (the python interface to curl):
import pycurl
from StringIO import StringIO
response_buffer = StringIO()
curl = pycurl.Curl()
curl.setopt(curl.URL, "https://www.yoursite.com/")
# Setup the base HTTP Authentication.
curl.setopt(curl.USERPWD, '%s:%s' % ('youruser', 'yourpassword'))
curl.setopt(curl.WRITEFUNCTION, response_buffer.write)
curl.perform()
curl.close()
response_value = response_buffer.getvalue()
Otherwise, use urllib2 (see other responses for more info) as it's builtin and the interface is much cleaner.

class urllib2.HTTPSHandler
A class to handle opening of HTTPS URLs.
21.6.7. HTTPPasswordMgr Objects
These methods are available on HTTPPasswordMgr and HTTPPasswordMgrWithDefaultRealm objects.
HTTPPasswordMgr.add_password(realm, uri, user, passwd)
uri can be either a single URI, or a sequence of URIs. realm, user and passwd must be strings. This causes (user, passwd) to be used as authentication tokens when authentication for realm and a super-URI of any of the given URIs is given.
HTTPPasswordMgr.find_user_password(realm, authuri)
Get user/password for given realm and URI, if any. This method will return (None, None) if there is no matching user/password.
For HTTPPasswordMgrWithDefaultRealm objects, the realm None will be searched if the given realm has no matching user/password.

Check our urllib2. The examples at the bottom will probably be of interest.
http://docs.python.org/library/urllib2.html

Read https url from Python with basic access authentication

How do you open https url in Python?
import urllib2
url = "https://user:password#domain.com/path/
f = urllib2.urlopen(url)
print f.read()
gives:
httplib.InvalidURL: nonnumeric port: 'password#domain.com'

This has never failed me
import urllib2, base64
username = 'foo'
password = 'bar'
auth_encoded = base64.encodestring('%s:%s' % (username, password))[:-1]
req = urllib2.Request('https://somewebsite.com')
req.add_header('Authorization', 'Basic %s' % auth_encoded)
try:
response = urllib2.urlopen(req)
except urllib2.HTTPError, http_e:
# etc...
pass

Please read about the urllib2 password manager and the basic authentication handler as well as the digest authentication handler.
http://docs.python.org/library/urllib2.html#abstractbasicauthhandler-objects
http://docs.python.org/library/urllib2.html#httpdigestauthhandler-objects
Your urllib2 script must actually provide enough information to do HTTP authentication. Usernames, Passwords, Domains, etc.

If you want to pass username and password information to urllib2 you'll need to use an HTTPBasicAuthHandler.
Here's a tutorial showing you how to do it.

You cannot pass credentials to urllib2.open like that. In your case, user is interpreted as the domain name, while password#domain.com is interpreted as the port number.

How to specify an authenticated proxy for a python http connection?

What's the best way to specify a proxy with username and password for an http connection in python?

This works for me:
import urllib2
proxy = urllib2.ProxyHandler({'http': 'http://
username:password#proxyurl:proxyport'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
conn = urllib2.urlopen('http://python.org')
return_str = conn.read()

Use this:
import requests
proxies = {"http":"http://username:password#proxy_ip:proxy_port"}
r = requests.get("http://www.example.com/", proxies=proxies)
print(r.content)
I think it's much simpler than using urllib. I don't understand why people love using urllib so much.

Setting an environment var named http_proxy like this: http://username:password#proxy_url:port

The best way of going through a proxy that requires authentication is using urllib2 to build a custom url opener, then using that to make all the requests you want to go through the proxy. Note in particular, you probably don't want to embed the proxy password in the url or the python source code (unless it's just a quick hack).
import urllib2
def get_proxy_opener(proxyurl, proxyuser, proxypass, proxyscheme="http"):
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, proxyurl, proxyuser, proxypass)
proxy_handler = urllib2.ProxyHandler({proxyscheme: proxyurl})
proxy_auth_handler = urllib2.ProxyBasicAuthHandler(password_mgr)
return urllib2.build_opener(proxy_handler, proxy_auth_handler)
if __name__ == "__main__":
import sys
if len(sys.argv) > 4:
url_opener = get_proxy_opener(*sys.argv[1:4])
for url in sys.argv[4:]:
print url_opener.open(url).headers
else:
print "Usage:", sys.argv[0], "proxy user pass fetchurls..."
In a more complex program, you can seperate these components out as appropriate (for instance, only using one password manager for the lifetime of the application). The python documentation has more examples on how to do complex things with urllib2 that you might also find useful.

Or if you want to install it, so that it is always used with urllib2.urlopen (so you don't need to keep a reference to the opener around):
import urllib2
url = 'www.proxyurl.com'
username = 'user'
password = 'pass'
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# None, with the "WithDefaultRealm" password manager means
# that the user/pass will be used for any realm (where
# there isn't a more specific match).
password_mgr.add_password(None, url, username, password)
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
print urllib2.urlopen("http://www.example.com/folder/page.html").read()

Here is the method use urllib
import urllib.request
# set up authentication info
authinfo = urllib.request.HTTPBasicAuthHandler()
proxy_support = urllib.request.ProxyHandler({"http" : "http://ahad-haam:3128"})
# build a new opener that adds authentication and caching FTP handlers
opener = urllib.request.build_opener(proxy_support, authinfo,
urllib.request.CacheFTPHandler)
# install it
urllib.request.install_opener(opener)
f = urllib.request.urlopen('http://www.python.org/')
"""

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Authenticate Scrapy HTTP Proxy - python

Related

How to make a HTTP Basic Authentication post request where password is generated by TOTP in Python?

How do I authenticate a urllib2 script in order to access HTTPS web services from a Django site?

python http request with token

Read https url from Python with basic access authentication

How to specify an authenticated proxy for a python http connection?

Categories

Resources