Python urllib2, how to avoid errors - need help

I am using Python urllib2 to download pages from the web. I am not using any kind of user agent, etc. I am getting the sample errors below. Can someone tell me an easy way to avoid them?
http://www.rottentomatoes.com/m/foxy_brown/
The server couldn't fulfill the request.
Error code: 403
http://www.spiritus-temporis.com/marc-platt-dancer-/
The server couldn't fulfill the request.
Error code: 503
http://www.golf-equipment-guide.com/news/Mark-Nichols-(golfer).html!!
The server couldn't fulfill the request.
Error code: 500
http://www.ehx.com/blog/mike-matthews-in-fuzz-documentary!!
We failed to reach a server.
Reason: timed out
IncompleteRead(5621 bytes read)
Traceback (most recent call last):
File "download.py", line 43, in <module>
localFile.write(response.read())
File "/usr/lib/python2.6/socket.py", line 327, in read
data = self._sock.recv(rbufsize)
File "/usr/lib/python2.6/httplib.py", line 517, in read
return self._read_chunked(amt)
File "/usr/lib/python2.6/httplib.py", line 563, in _read_chunked
raise IncompleteRead(value)
IncompleteRead: IncompleteRead(5621 bytes read)
Thank you
Bala

Many web resources require some kind of cookie or other authentication to access; your 403 status codes are most likely the result of this.
503 errors tend to mean you're rapidly accessing resources from a server in a loop and you need to wait briefly before attempting another access.
The 500 example doesn't even appear to exist...
The URL in the timeout example may not need the "!!" at the end; I can only load the resource without it.
I recommend you read up on HTTP status codes.
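As a rough sketch of how to apply that advice with urllib2 (Python 2), the snippet below sends a browser-like User-Agent (the question mentions not sending one, and some servers answer a bare Python client with 403), catches the errors shown above, and pauses between requests so a tight loop doesn't provoke 503s. The User-Agent string and the one-second delay are arbitrary placeholder choices.
import time
import urllib2
from httplib import IncompleteRead

def fetch(url):
    # placeholder User-Agent; any reasonable browser string will do
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        return urllib2.urlopen(req, timeout=30).read()
    except urllib2.HTTPError as e:
        print "The server couldn't fulfill the request. Error code:", e.code
    except urllib2.URLError as e:
        print 'We failed to reach a server. Reason:', e.reason
    except IncompleteRead as e:
        print 'Response was cut off mid-transfer:', e
    finally:
        time.sleep(1)  # crude politeness delay between requests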

For more complicated tasks, you might want to consider using mechanize, twill, or even Selenium or Windmill, which support more complicated scenarios, including cookies and JavaScript.
For a random website, it might be tricky to work around with urllib2 only (signed cookies, anyone?).

Related

Why does match_hostname in python 3.6's SSL module give hostname of local server when issuing remote request?

I'm running into an infrequent bug when issuing a remote request using Python's requests module. The stack trace indicates the exception is being thrown down in the ssl module of our Python 3.6.6 install. For context, this is a RHEL 6.10 server; the OpenSSL version is OpenSSL 1.0.1e-fips 11 Feb 2013.
While I don't think it's relevant, the request is being issued from a web server that is sitting behind a WAF that is handling SSL offloading for inbound requests. Again, this is an outbound request that's failing, but I thought I'd mention the above just in case.
The issue is that sometimes these outbound requests to different domains, such as api.powerbi.com or api.adp.com, throw a CertificateError indicating that the hostname of whatever certificate was resolved does not match the requested hostname. The strange thing is that the reported hostname of the certificate is our own website.
That was a little confusing to word, but basically it's as if outbound requests from our web server are sometimes attempting to load our own SSL certificate instead of the one for the external URL we're trying to reach.
SSLError at /reports/power-bi/shipped-invoice-list/
HTTPSConnectionPool(host='api.powerbi.com', port=443): Max retries exceeded with url: /v1.0/myorg/groups/12387aasd0-xxxx-1234-82d7-e73929720a25/reports/ (Caused by SSLError(CertificateError("hostname 'api.powerbi.com' doesn't match either of '*.ourdomain.com', 'ourdomain.com'",),))
Traceback (most recent call last):
File "/u/home/deploy/.virtualenvs/ourdomain.com/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
chunked=chunked,
File "/u/home/deploy/.virtualenvs/ourdomain.com/lib/python3.6/site-packages/urllib3/connectionpool.py", line 376, in _make_request
self._validate_conn(conn)
File "/u/home/deploy/.virtualenvs/ourdomain.com/lib/python3.6/site-packages/urllib3/connectionpool.py", line 994, in _validate_conn
conn.connect()
File "/u/home/deploy/.virtualenvs/ourdomain.com/lib/python3.6/site-packages/urllib3/connection.py", line 386, in connect
_match_hostname(cert, self.assert_hostname or server_hostname)
File "/u/home/deploy/.virtualenvs/ourdomain.com/lib/python3.6/site-packages/urllib3/connection.py", line 396, in _match_hostname
match_hostname(cert, asserted_hostname)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/ssl.py", line 327, in match_hostname
% (hostname, ', '.join(map(repr, dnsnames))))
I am wondering if anyone has seen anything like this before? Whenever I search for these types of errors, I'm mainly getting results about invalid SSL certificates, but nothing about a local SSL certificate being used to validate a request to an external resource.
We are not doing anything special with the requests module, no sessions, etc. It's simply....
import requests
response = requests.post('http://api.powerbi.com...', etc
I should add that we also issue outbound requests to ourdomain.com from this same application, so I was almost wondering whether something in HTTPSConnectionPool is reusing an existing connection that was previously used against ourdomain.com for the api.powerbi.com request without resetting the SSL session. I am not even sure if that is how connection pooling works, but it was a thought.
Barring that, it would seem like something in our WAF might be occasionally routing outbound requests intended for an external domain, back into our own local hostname somehow.
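One small diagnostic (a sketch, not a known fix): ask the standard library directly which certificate this host receives when it connects to api.powerbi.com, bypassing requests/urllib3 entirely. If this prints your own *.ourdomain.com certificate, the traffic is being intercepted or rerouted before it leaves your network, which would point at the WAF rather than connection pooling.
import ssl

# Fetch the peer certificate in PEM form from a fresh TLS handshake.
pem = ssl.get_server_certificate(('api.powerbi.com', 443))
print(pem)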

API GET request returns 404

I'm currently developing a Django web application which is supposed to add some functionality to online shops based on InSales (a popular Russian web platform). I use the official InSales lib for Python called pyinsales to get objects like orders and products from registered shops.
The InSales API is based on REST requests with XML. I use the code below to get information about orders in the Django shell:
from install.models import Shop
from insales import InSalesApi
shop = Shop.objects.get(shop_url='shop-url.myinsales.ru')
api = InSalesApi(shop.shop_url, 'trackpost', shop.password)
orders = api.get_orders()
Here shop.shop_url is the shop URL ("oh, really?"), trackpost is the name of my app, and shop.password is the password needed to connect. The password is generated with MD5 (that's an InSales rule). And here I get an error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/local/lib/python3.5/site-packages/insales/api.py", line 32, in get_orders
return self._get('/admin/orders.xml', {'per_page': per_page, 'page': page}) or []
File "/usr/local/lib/python3.5/site-packages/insales/api.py", line 291, in _get
return self._req('get', endpoint, qargs)
File "/usr/local/lib/python3.5/site-packages/insales/api.py", line 307, in _req
response = getattr(self.connection, method)(*args, **kwargs)
File "/usr/local/lib/python3.5/site-packages/insales/connection.py", line 85, in get
return self.request('GET', path, qargs=qargs)
File "/usr/local/lib/python3.5/site-packages/insales/connection.py", line 70, in request
(method, path, resp.status, body), resp.status)
insales.connection.ApiError: GET request to /admin/orders.xml?page=1&per_page=25 returned: 404
b'<?xml version="1.0" encoding="UTF-8"?>\n<errors>\n <error></error>\n</errors>\n'
I've already checked everything for mistakes. The password is generated properly (according to the official documentation), the shop URL is correct, and all the methods from the lib are used correctly. InSales tech support doesn't respond, so now I have no idea what is happening.
I don't want you to debug this issue, but I'd like to know what can cause the 404 error (apart from obvious things like an incorrect URL or password). Thanks to everybody who tries to answer.
404 means that the server couldn't find what you requested.
https://en.wikipedia.org/wiki/HTTP_404
So it's not an authentication issue, it seems like there are no orders at that particular endpoint.
Have you tried using the python "requests" package instead of using the "pyinsales" module to request data from the endpoint directly? That way you can customize your own headers, etc.
You might also try testing the endpoints in a program like Postman to ensure that they are valid before you try to hit them programmatically.
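For the direct-requests idea, a rough sketch might look like the following. It assumes InSales' admin XML API accepts HTTP Basic auth with the app name and the generated password, which is an assumption worth checking against their docs; shop.password and the shop URL come from the question.
import requests

# Hypothetical direct call to the same endpoint pyinsales was hitting.
resp = requests.get(
    'http://shop-url.myinsales.ru/admin/orders.xml',
    params={'page': 1, 'per_page': 25},
    auth=('trackpost', shop.password),  # assumed Basic auth credentials
)
print(resp.status_code)
print(resp.text)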
Analysing the pyinsales code on GitHub, I realised that the url argument should be just the subdomain on the myinsales domain, so if the full URL is http://shop.myinsales.ru/ then the first argument to pyinsales should be shop. It's a pity that nobody working on the module pointed that out in the readme, but such things happen.
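If that reading of the code is right, the fix to the question's snippet would be something like the sketch below; extracting the subdomain from shop.shop_url is illustrative and assumes the stored URL looks like shop-url.myinsales.ru.
from insales import InSalesApi

subdomain = shop.shop_url.split('.')[0]   # 'shop-url' out of 'shop-url.myinsales.ru'
api = InSalesApi(subdomain, 'trackpost', shop.password)
orders = api.get_orders()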

Debugging IBM Watson APIs on Bluemix

I am trying to use IBM Watson's conversation API on Bluemix and am getting the following exception. I am not able to find any documentation as to how to debug it.
Error:
"conversation.py", line 14, in <module>
response = conversation.message(workspace_id=workspace_id, message_input={'text': 'Will you be able to convert an html file?'})
File "/usr/local/lib/python2.7/dist-packages/watson_developer_cloud/conversation_v1_experimental.py", line 45, in message
json=data, accept_json=True)
File "/usr/local/lib/python2.7/dist-packages/watson_developer_cloud/watson_developer_cloud_service.py", line 263, in request
raise WatsonException(error_message)
watson_developer_cloud.watson_developer_cloud_service.WatsonException: Error: NLU service error: Processing error, Code: 500
I checked the credentials and workspace ID and they all seem fine.
Any ideas on how to debug this would be highly appreciated.
I know this is an old question but I don't want to leave it unanswered.
As @jonrsharpe said, Error: NLU service error: Processing error, Code: 500 means that there was a server-side error.
When you hit a 5XX error there is not much you can do other than trying the request again.
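In code, "try the request again" can be as simple as the sketch below, reusing the call from the question; the retry count and delay are arbitrary, and WatsonException is the class shown in the traceback above.
import time
from watson_developer_cloud.watson_developer_cloud_service import WatsonException

for attempt in range(3):
    try:
        response = conversation.message(
            workspace_id=workspace_id,
            message_input={'text': 'Will you be able to convert an html file?'})
        break  # success, stop retrying
    except WatsonException as exc:
        print('Attempt %d failed: %s' % (attempt + 1, exc))
        time.sleep(5)  # back off briefly before the next try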

Python Requests SSL issue

It's been days now since I started to look for a solution for this.
I've been trying to use requests to make an https request through a proxy with no luck.
Although this is included in a bigger project of mine, it all boils down to this:
import requests
prox = 'xxx.xxx.xxx.xxx:xxx' # fill in valid proxy
r = requests.get('https://ipdb.at', proxies={'http': prox, 'https': prox})
I tried the kwarg verify=False but I keep getting the same error or a variation of it:
Traceback (most recent call last):
File "/Users/mg/Desktop/proxy_service.py", line 21, in <module>
verify=False)
File "/Library/Python/2.7/site-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 279, in request
resp = self.send(prep, stream=stream, timeout=timeout, verify=verify, cert=cert, proxies=proxies)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 374, in send
r = adapter.send(request, **kwargs)
File "/Library/Python/2.7/site-packages/requests/adapters.py", line 213, in send
raise SSLError(e)
requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
[Finished in 20.4s with exit code 1]
I'm using the latest versions of requests and OpenSSL, and Python 2.7.3. I tested it on Mac OS X Mountain Lion 10.8.2 and also on Windows 7.
I googled this and saw others having similar issues. I found related pull requests on the requests and urllib3 libraries, and also saw information about this being a bad implementation of the HTTP CONNECT verb. I found a fix that uses a custom adapter (I don't recall exactly, but I think that's the right term) to use a custom SSL protocol. I tried them all; none helped.
So, I'm looking for more info. Do you have any idea what is going on, how I can fix it, etc.?
All help is welcome and thank you all!
PS: I tried several proxies, and I'm sure they are not the issue.
I had this same problem but when trying to use oauth2client (Google library for using oauth2 against their APIs). After a day of faffing ("experimentation"), I discovered that I had the env variable set as follows:
https_proxy=https://your-proxy:port
Changing this to
https_proxy=http://your-proxy:port
completely fixed the problem for me. Presumably this means that the initial handshake from the client to the proxy is now unencrypted, but after that the https must flow directly. I don't believe my company proxy was configured to talk https anyway.
Note however that my browsers all behaved fine with the previous setting. Perhaps they silently tried to establish the proxy connection over http after failing to use https.
Googling shows examples of setting https_proxy to both http or https addresses in equal measure. I guess it depends on how the proxy is configured.
The "use for all protocols" button in browser/OS configuration pop-ups makes it easy to accidentally use the setting I had.
However I'm not sure if that is your problem. Your code suggests that you already use exactly the same proxy setting for both http and https.
By the way, this was all on Ubuntu 12.04.1.
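The same idea can be expressed directly in the proxies dict instead of via https_proxy: give both entries an explicit http:// scheme so the connection to the proxy itself is plain HTTP. This is only a sketch built on the question's placeholder proxy address.
import requests

prox = 'xxx.xxx.xxx.xxx:xxx'  # fill in valid proxy, as in the question
proxies = {'http': 'http://' + prox, 'https': 'http://' + prox}
r = requests.get('https://ipdb.at', proxies=proxies)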
The answer here worked for me:
using requests with TLS doesn't give SNI support
I installed the three dependent libraries and included the monkey-patch tip from the second answer.
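The answer above doesn't name the three libraries; assuming it refers to the usual SNI backport for Python 2 (pyOpenSSL, ndg-httpsclient and pyasn1), the recipe would look roughly like this sketch:
# pip install pyopenssl ndg-httpsclient pyasn1
import urllib3.contrib.pyopenssl
urllib3.contrib.pyopenssl.inject_into_urllib3()  # monkey-patch urllib3 to use pyOpenSSL, which supports SNI

import requests
r = requests.get('https://ipdb.at')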

Trying to access the Internet using urllib2 in Python

I'm trying to write a program that will (among other things) get text or source code from a predetermined website. I'm learning Python to do this, and most sources have told me to use urllib2. Just as a test, I tried this code:
import urllib2
response = urllib2.urlopen('http://www.python.org')
html = response.read()
Instead of acting in any expected way, the shell just sits there, as if it's waiting for some input. There isn't even a ">>>" or "...". The only way to exit this state is with [Ctrl]+C. When I do this, I get a whole bunch of error messages, like
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 124, in urlopen
return _opener.open(url, data)
File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 381, in open
response = self._open(req, data)
I'd appreciate any feedback. Is there a different tool than urllib2 to use, or can you give advice on how to fix this? I'm using a networked computer at my work, and I'm not entirely sure how the shell is configured or how that might affect anything.
With 99.999% probability, it's a proxy issue. Python is incredibly bad at detecting the right http proxy to use, and when it cannot find the right one, it just hangs and eventually times out.
So first you have to find out which proxy should be used, check the options of your browser (Tools -> Internet Options -> Connections -> LAN Setup... in IE, etc). If it's using a script to autoconfigure, you'll have to fetch the script (which should be some sort of javascript) and find out where your request is supposed to go. If there is no script specified and the "automatically determine" option is ticked, you might as well just ask some IT guy at your company.
I assume you're using Python 2.x. From the Python docs on urllib:
# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
Note that ProxyHandler figuring out default values from the environment is what already happens when you use urlopen, so that alone is probably not going to work.
If you really want urllib2, you'll have to specify a ProxyHandler, like the example on this page. Authentication might or might not be required (usually it's not).
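For completeness, a minimal sketch of the urllib2 ProxyHandler route, reusing the placeholder proxy address from the urllib example above:
import urllib2

proxy_support = urllib2.ProxyHandler({'http': 'http://www.someproxy.com:3128'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)  # make the opener the default for urlopen
response = urllib2.urlopen('http://www.python.org')
html = response.read()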
This isn't a good answer to "How to do this with urllib2", but let me suggest python-requests. The whole reason it exists is because the author found urllib2 to be an unwieldy mess. And he's probably right.
That is very weird. Have you tried a different URL?
Otherwise there is httplib, although it is more complicated. Here's your example using httplib:
import httplib as h
domain = h.HTTPConnection('www.python.org')
domain.connect()
domain.request('GET', '/fish.html')
response = domain.getresponse()
if response.status == h.OK:
    html = response.read()
I get a 404 error almost immediately (no hanging):
>>> import urllib2
>>> response = urllib2.urlopen('http://www.python.org/fish.html')
Traceback (most recent call last):
...
urllib2.HTTPError: HTTP Error 404: Not Found
If I try and contact an address that doesn't have an HTTP server running, it hangs for quite a while until the timeout happens. You can shorten it by passing the timeout parameter to urlopen:
>>> response = urllib2.urlopen('http://cs.princeton.edu/fish.html', timeout=5)
Traceback (most recent call last):
...
urllib2.URLError: <urlopen error timed out>
