How to download data from a password protected website

I'm using requests in Python to try to download this file:
http://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/N55W003.SRTMGL1.hgt.zip
There are about 14000 such files, which is why I need to automate the process. The other techniques I've found online don't seem to work, I assume because the websites they were designed for use a different authentication method. I don't know much about web development, so I can't work out how this authentication works.
Edit: this is the code:
import json
import requests
from requests.auth import HTTPBasicAuth
file = open("srtm30m_bounding_boxes.json", 'r')
strjson = file.read()
x = json.loads(strjson)
filenamelist = []
url = "http://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/N55W003.SRTMGL1.hgt.zip"
for i in range(14295):
    filenamelist.append(x['features'][i]['properties']['dataFile'])
    filenamelist[i] = "http://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/" + filenamelist[i]
jar = requests.cookies.RequestsCookieJar()
jar.set('urs_user_already_logged', 'yes')
jar.set('_urs-gui_session','8b972449036e60e3d83a6a819b93124d')
r = requests.get(url, cookies=jar)
And this is the error I get when I run the code:
ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

The simplest thing is to provide the username and password in the URL before the host, e.g.:
requests.get('http://{username}:{password}@e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/{subpath}'.format(username=username, password=password, subpath=filenamelist[i]))
You can also supply the username/password as the auth parameter to get:
requests.get('http://e4ftl01.cr.usgs.gov/MEASURES/SRTMGL1.003/2000.02.11/{subpath}'.format(subpath=filenamelist[i]), auth=(username, password))
totalhack is right that https is more secure, and it seems to work on this site. This form of authentication sends the username and password merely base64-encoded, which is effectively plaintext, so anyone who can observe the http request can steal your login. With https the entire request is encrypted, including the username and password.
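To see why Basic authentication over plain http exposes the login, here is a minimal sketch of how the Authorization header is built and how easily it is reversed. The function names are made up for illustration; only the standard library is used.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value used by HTTP Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def decode_basic_auth(header_value: str) -> tuple:
    """Recover (username, password) from a Basic Authorization header."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic auth header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first colon separates the user from the password.
    username, _, password = decoded.partition(":")
    return username, password


header = basic_auth_header("user", "pass")
print(header)
print(decode_basic_auth(header))
```

Base64 is an encoding, not encryption, so the only protection for these credentials is the TLS layer that https adds.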

Related

How to make a HTTP Basic Authentication post request where password is generated by TOTP in Python?

The task is to make an HTTP POST request to a URL, sending a JSON string as the body. The URL is protected by HTTP Basic Authentication, so I need to provide an Authorization: field in the header. emailAdd is the user ID for Basic Authentication, and the password is generated by TOTP with 10 digits, a 30 s time step, T0 = 0, SHA-512 as the hash function, and emailAdd + "ABCD" as the shared secret.
My code looks like this:
import requests, base64, json
from passlib.totp import TOTP
from requests.auth import HTTPBasicAuth
totp = TOTP(key = base64.b32encode(shared_secret.encode()), digits = 10, alg = "sha512")
password = totp.generate().token
r = requests.Session()
#auth = base64.b64encode((''.join(userid) + ':' + ''.join(password)).encode())
r.headers.update({"Content-Type": "application/json"}) #this is task required
#r.headers.update({"Authorization": "Basic " + auth.decode()})
res = r.post(url, json = data, auth = (userid, password))
print(res.status_code)
print(res.content)
But authentication fails. I think the password is correct and that something is wrong with the POST request. Can anyone tell me what the problem is? Also, I'm in a different time zone from the server; does that make any difference to the TOTP-generated password? And I'm running Python on Windows, if that matters.
Thanks
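For comparison, the TOTP parameters described in the question can be sketched with the standard library alone, following RFC 6238. This is not the passlib API; the function name `totp` is made up, and feeding the shared secret to HMAC as raw bytes (rather than base32-encoded) is an assumption about what the server expects. The counter is derived from Unix time, which is the same in every time zone, so only clock drift between client and server matters.

```python
import hashlib
import hmac
import struct
import time


def totp(secret: bytes, for_time=None, digits=10, step=30, t0=0,
         digest=hashlib.sha512) -> str:
    """Compute an RFC 6238 TOTP code from raw secret bytes."""
    if for_time is None:
        for_time = time.time()  # seconds since the Unix epoch (UTC)
    counter = int((for_time - t0) // step)
    # HMAC over the 8-byte big-endian counter (RFC 4226).
    mac = hmac.new(secret, struct.pack(">Q", counter), digest).digest()
    # Dynamic truncation: the low nibble of the last byte selects the offset.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Hypothetical secret, built the way the question describes.
shared_secret = ("someone@example.com" + "ABCD").encode("utf-8")
print(totp(shared_secret))
```

If the server treats the secret as raw bytes, base32-encoding it before handing it to a library only works when that library decodes it back to the same bytes, which is worth checking against a code computed this way.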

403 when retrieving a WSDL via Python SUDS

I can't seem to get SUDS to download a WSDL that requires Basic auth credentials. My code is simple:
wsdl_url = 'https://example.com/ChangeRequest.do?WSDL'
self.client = Client(wsdl_url, username=username, password=password)
I've also tried:
from suds.transport.https import HttpAuthenticated
wsdl_url = 'https://example.com/ChangeRequest.do?WSDL'
credentials = dict(username=username, password=password)
t = HttpAuthenticated(**credentials)
self.client = Client(url=wsdl_url, transport=t)
In both cases, the service returns a 403 Forbidden error. I can go down into the SUDS code in http.py and add this line to the call:
u2request.add_header('Authorization','Basic xxxxxxxxxxxxxxxxxxxx')
This works. What am I doing wrong to get SUDS to pass my credentials when downloading the WSDL?
Note: I tried connecting to the WSDL directly using both Chrome's Postman plugin and SoapUI, and the service works there as well. So I know the credentials are correct.

I encountered a similar issue (suds v0.4, WSDL, 403) and found that the server I was trying to access blocks any request whose User-Agent header looks like Python-urllib* (suds uses urllib2, hence the default header). Explicitly changing the header solves the issue.
Particular to my solution: I overrode the open method of a transport class and set the client options, as in the following snippet. Note that the header has to be set separately for the initial open and for subsequent requests. Please advise if you know a better way to work around this. I hope this post saves someone time in the future.
import urllib2
from suds.client import Client
from suds.transport.https import HttpAuthenticated
from suds.transport import TransportError
URL = 'https://example.com/ChangeRequest.do?WSDL'
class HttpHeaderModify(HttpAuthenticated):
    def open(self, request):
        try:
            url = request.url
            # Replace the default Python-urllib User-Agent, which the server rejects.
            u2request = urllib2.Request(url, headers={'User-Agent': 'Mozilla'})
            self.proxy = self.options.proxy
            return self.u2open(u2request)
        except urllib2.HTTPError as e:
            raise TransportError(str(e), e.code, e.fp)
transport = HttpHeaderModify()
client = Client(URL, transport=transport, timeout=10)
# Subsequent requests' header needs to be set again here. The overridden transport
# class only handles opening of the client.
client.set_options(headers={'User-Agent': 'Mozilla'})
P.S. Though my problem may not be the same, searching for "403 suds" brings up this SO question, so I decided to post my solution here.
Reference post that pointed me in the right direction: https://bitbucket.org/jurko/suds/issues/27/client-request-for-wsdl-does-not-use-given
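Both the hand-added Authorization header from the question and the User-Agent override come down to controlling the headers of the outgoing request. As a sketch outside suds, the same two headers can be set on a plain Python 3 `urllib.request.Request`. The helper name `build_request` and the credentials are placeholders, and this does not use any suds API.

```python
import base64
import urllib.request


def build_request(url, username, password, user_agent="Mozilla"):
    """Create a request that sends Basic credentials and a custom User-Agent up front."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    headers = {
        "User-Agent": user_agent,
        # Sent preemptively instead of waiting for a 401 challenge.
        "Authorization": "Basic " + token,
    }
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.com/ChangeRequest.do?WSDL", "user", "pass")
# urllib stores header names capitalized, e.g. "User-agent".
print(req.header_items())
```

Fetching the WSDL this way (with `urllib.request.urlopen(req)`) is a quick check of whether the 403 comes from the missing credentials or from the User-Agent.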

I ran into this issue before and compared the request headers with those sent by SoapUI.
I found that suds was not sending the Host header.
client.set_options(headers={'Host': 'value'})
That fixed the issue.

HTTPError: HTTP Error 401: basic auth failed. Bing Search

I have written some code to get URLs from Bing search. It gives the error mentioned above.
import urllib
import urllib2
accountKey = 'mykey'
username =accountKey
queryBingFor = "'JohnDalton'"
quoted_query = urllib.quote(queryBingFor)
rootURL = "https://api.datamarket.azure.com/Bing/Search/"
searchURL = rootURL + "Image?$format=json&Query=" + quoted_query
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, searchURL,username,accountKey)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)
readURL = urllib2.urlopen(searchURL).read()
I set the username to the account key because someone told me the two have to be the same. In any case, I didn't get a username when I created the Bing webmaster account. Or is it just my email? Excuse me if I have made novice mistakes; I've just started Python.

In the absence of any other information, it seems unlikely that what is effectively your username and password would be the same thing if this site actually needs this form of authorisation.
Are you able to make it work by doing a request in your browser like the following?
https://mykey:mykey@api.datamarket.azure.com/Bing/Search/Image?$format=json&Query=blah
If so, then at least it sounds like the credentials are right and that it's the way you are using them in Python that's wrong. More likely, though, the above will fail with the same error, suggesting the credentials themselves are not valid.
Also see this question, which suggests there may be a problem if the site doesn't do 'standard' auth: urllib2 HTTPPasswordMgr not working - Credentials not sent error
It also suggests that you might need to pass the top-level URL of the site to the password manager rather than the specific search URL.
Finally, it might be worth adapting this code:
http://www.voidspace.org.uk/python/articles/authentication.shtml
for your site to check the auth realm and scheme the site is sending you to check they're supported.
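The suggestion about registering the top-level URL can be sketched as follows in Python 3, where urllib2's classes live in `urllib.request`. The helper name `make_opener` and the key value are placeholders. Credentials registered for the site root are also found for any URL underneath it.

```python
import urllib.request


def make_opener(top_level_url, username, password):
    """Register credentials for a whole site rather than one specific URL."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm at this URL".
    manager.add_password(None, top_level_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler), manager


root_url = "https://api.datamarket.azure.com/"
search_url = root_url + "Bing/Search/Image?$format=json&Query=blah"
opener, manager = make_opener(root_url, "ACCOUNT_KEY", "ACCOUNT_KEY")

# The lookup for the deeper search URL resolves to the root-level entry.
print(manager.find_user_password(None, search_url))
```

Note that this handler only sends the credentials after the server answers with a 401 and a Basic challenge, so a site that does not issue a standard challenge will still fail.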

urllib2/pycurl in Django: Fetch XML, check HTTP status, check HTTPS connection

I need to make an API call (of sorts) in Django as a part of the custom authentication system we require. A username and password is sent to a specific URL over SSL (using GET for those parameters) and the response should be an HTTP 200 "OK" response with the body containing XML with the user's info.
On an unsuccessful auth, it will return an HTTP 401 "Unauthorized" response.
For security reasons, I need to check:
The request was sent over an HTTPS connection
The server certificate's public key matches an expected value (I use 'certificate pinning' to defend against broken CAs)
Is this possible in python/django using pycurl/urllib2 or any other method?

Using M2Crypto:
from M2Crypto import SSL
ctx = SSL.Context('sslv3')
ctx.set_verify(SSL.verify_peer | SSL.verify_fail_if_no_peer_cert, depth=9)
if ctx.load_verify_locations('ca.pem') != 1:
    raise Exception('No CA certs')
c = SSL.Connection(ctx)
c.connect(('www.google.com', 443)) # automatically checks cert matches host
c.send('GET / \n')
c.close()
Using urllib2_ssl (it goes without saying but to be explicit: use it at your own risk):
import urllib2, urllib2_ssl
opener = urllib2.build_opener(urllib2_ssl.HTTPSHandler(ca_certs='ca.pem'))
xml = opener.open('https://example.com/').read()
Related: Making HTTPS Requests secure in Python.
Using pycurl:
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://example.com?param1=val1&param2=val2")
c.setopt(pycurl.HTTPGET, 1)
c.setopt(pycurl.CAINFO, 'ca.pem')
c.setopt(pycurl.SSL_VERIFYPEER, 1)
c.setopt(pycurl.SSL_VERIFYHOST, 2)
c.setopt(pycurl.SSLVERSION, 3)
c.setopt(pycurl.NOBODY, 1)
c.setopt(pycurl.NOSIGNAL, 1)
c.perform()
c.close()
To implement 'certificate pinning', provide a different 'ca.pem' for each domain.
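A stricter variant of pinning compares the server certificate itself with a known value. The sketch below uses Python 3's standard `ssl` module rather than any of the libraries above, and it pins the SHA-256 fingerprint of the whole DER certificate, not just the public key; the function names and the expected fingerprint are placeholders.

```python
import hashlib
import hmac
import socket
import ssl


def cert_fingerprint(der_cert: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate as hex."""
    return hashlib.sha256(der_cert).hexdigest()


def server_matches_pin(host, expected_fingerprint, port=443, timeout=10):
    """Connect over TLS and compare the peer certificate with the pinned value."""
    # The default context still verifies the chain and the host name.
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            der_cert = tls.getpeercert(binary_form=True)
    actual = cert_fingerprint(der_cert)
    return hmac.compare_digest(actual, expected_fingerprint.lower())
```

A call such as `server_matches_pin('example.com', '<pinned hex digest>')` would run before the API request is sent; the pin has to be updated whenever the server's certificate is renewed.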

httplib2 can do https requests with certificate validation:
import httplib2
http = httplib2.Http(ca_certs='/path/to/cert.pem')
try:
    http.request('https://...')
except httplib2.SSLHandshakeError as e:
    # do something
    pass
Just make sure that your httplib2 is up to date. The one shipped with my distribution (Ubuntu 10.04) does not have the ca_certs parameter.
Also, a similar question to yours has an example of certificate validation with pycurl.

pywikipedia bot with https and http authentication

I'm having trouble getting my bot to login to a MediaWiki install on the intranet. I believe it is due to the http authentication protecting the wiki.
Facts:
The wiki root is: https://local.example.com/mywiki/
When visiting the wiki with a web browser, a popup comes up asking for enterprise credentials (I assume this is basic access authentication)
This is what I have in my user-config.py:
mylang = 'en'
family = 'mywiki'
usernames['mywiki']['en'] = u'Bot'
authenticate['local.example.com'] = ('user', 'pass')
This is what I have in mywiki_family.py:
# -*- coding: utf-8 -*-
import family, config
# The Wikimedia family that is known as mywiki
class Family(family.Family):
def __init__(self):
family.Family.__init__(self)
self.name = 'mywiki'
self.langs = { 'en' : 'local.example.com'}
def scriptpath(self, code):
return '/mywiki'
def version(self, code):
return '1.13.5'
def isPublic(self):
return False
def hostname(self, code):
return 'local.example.com'
def protocol(self, code):
return 'https'
def path(self, code):
return '/mywiki/index.php'
When I execute login.py -v -v, I get this:
urllib2.urlopen(urllib2.Request('https://local.example.com/w/index.php?title=Special:Userlogin&useskin=monobook&action=submit', wpSkipCookieCheck=1&wpPassword=XXXX&wpDomain=&wpRemember=1&wpLoginattempt=Aanmelden%20%26%20Inschrijven&wpName=Bot, {'Content-type': 'application/x-www-form-urlencoded', 'User-agent': 'PythonWikipediaBot/1.0'})):
(Redundant traceback info here)
urllib2.HTTPError: HTTP Error 401: Unauthorized
(I'm not sure why it has 'local.example.com/w' instead of '/mywiki'.)
I thought it might be trying to authenticate to example.com instead of example.com/wiki, so I changed the authenticate line to:
authenticate['local.example.com/mywiki'] = ('user', 'pass')
But then I get an HTTP 401.2 error back from IIS:
You do not have permission to view this directory or page using the credentials that you supplied because your Web browser is sending a WWW-Authenticate header field that the Web server is not configured to accept.
Any help on how to get this working would be appreciated.
Update After fixing my family file, it now says:
Getting information for site mywiki:en
('http error', 401, 'Unauthorized', )
WARNING: Could not open 'https://local.example.com/mywiki/index.php?title=Non-existing_page&action=edit&useskin=monobook'. Maybe the server or your connection is down. Retrying in 1 minutes...
I looked at the HTTP headers on a plain urllib2.urlopen call and the server is sending WWW-Authenticate: Negotiate and WWW-Authenticate: NTLM. I'm guessing urllib2, and thus pywikipedia, doesn't support this?
Update Added a tasty bounty for help in getting this to work. I can authenticate using python-ntlm. How do I integrate this into pywikipedia?
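The header check described in the update can be sketched as a small helper that lists the schemes a server offers and reports whether only NTLM/Negotiate are available. The function names are made up, and the parsing assumes one challenge per WWW-Authenticate header value, as in the headers quoted above.

```python
def offered_schemes(www_authenticate_values):
    """Return the lower-cased auth scheme of each WWW-Authenticate header value."""
    schemes = []
    for value in www_authenticate_values:
        parts = value.strip().split(None, 1)
        if parts:
            # The scheme is the first token, e.g. "Basic" in 'Basic realm="x"'.
            schemes.append(parts[0].lower())
    return schemes


def needs_ntlm_handler(www_authenticate_values):
    """True if the server offers NTLM/Negotiate but neither Basic nor Digest."""
    schemes = offered_schemes(www_authenticate_values)
    has_builtin = "basic" in schemes or "digest" in schemes
    has_windows_auth = "ntlm" in schemes or "negotiate" in schemes
    return has_windows_auth and not has_builtin


print(offered_schemes(["Negotiate", "NTLM"]))
print(needs_ntlm_handler(["Negotiate", "NTLM"]))
```

The urllib2 standard handlers cover Basic and Digest only, so a server that answers with just Negotiate and NTLM needs an extra handler such as the one from python-ntlm.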

Well, the fact that login.py tries to access '/w' instead of your path shows that there is a family configuration issue.
Your code is indented strangely: is scriptpath a member of the new Family class? as in:
class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'mywiki'
        self.langs = { 'en' : 'local.example.com'}
    def scriptpath(self, code):
        return '/mywiki'
    def version(self, code):
        return '1.13.5'
    def isPublic(self):
        return False
    def hostname(self, code):
        return 'local.example.com'
    def protocol(self, code):
        return 'https'
?
I believe that something is wrong with your family file. A good way to check is to do in a python console:
import wikipedia
site = wikipedia.getSite('en', 'mywiki')
print site.login_address()
as long as the relative address is wrong, showing '/w' instead of '/mywiki', it means that the family file is still not configured correctly, and that the bot won't work :)
Update: how to integrate ntlm in pywikipedia?
I just had a look at the basic example here. I would integrate the code before that line in login.py:
response = urllib2.urlopen(urllib2.Request(self.site.protocol() + '://' + self.site.hostname() + address, data, headers))
You want to write something of the like:
from ntlm import HTTPNtlmAuthHandler
user = r'DOMAIN\User'
password = "Password"
url = self.site.protocol() + '://' + self.site.hostname()
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
# create the NTLM authentication handler
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)
# create and install the opener
opener = urllib2.build_opener(auth_NTLM)
urllib2.install_opener(opener)
response = urllib2.urlopen(urllib2.Request(self.site.protocol() + '://' + self.site.hostname() + address, data, headers))
I would test this and integrate it directly into the pywikipedia codebase if only I had an NTLM setup available...
Whatever happens, please do not vanish with your solution: we at pywikipedia are interested in it :)

I am guessing the problem you have is that the server expects basic authentication and you are not handling that in your client. Michael Foord wrote a good article about handling basic authentication in Python.
You did not provide enough information for me to be sure about this, so if that does not work, please provide some additional information, such as a network dump of your connection attempt.
