Trying to access the Internet using urllib2 in Python

I'm trying to write a program that will (among other things) get text or source code from a predetermined website. I'm learning Python to do this, and most sources have told me to use urllib2. Just as a test, I tried this code:
import urllib2
response = urllib2.urlopen('http://www.python.org')
html = response.read()
Instead of acting in any expected way, the shell just sits there, as if it's waiting for some input. There isn't even a ">>>" or "...". The only way to exit this state is with [ctrl]+c. When I do, I get a whole bunch of error messages, like
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 124, in urlopen
return _opener.open(url, data)
File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 381, in open
response = self._open(req, data)
I'd appreciate any feedback. Is there a different tool than urllib2 I should use, or can you give advice on how to fix this? I'm using a networked computer at work, and I'm not entirely sure how the shell is configured or how that might affect anything.

With 99.999% probability, it's a proxy issue. Python is incredibly bad at detecting the right http proxy to use, and when it cannot find the right one, it just hangs and eventually times out.
So first you have to find out which proxy should be used, check the options of your browser (Tools -> Internet Options -> Connections -> LAN Setup... in IE, etc). If it's using a script to autoconfigure, you'll have to fetch the script (which should be some sort of javascript) and find out where your request is supposed to go. If there is no script specified and the "automatically determine" option is ticked, you might as well just ask some IT guy at your company.
I assume you're using Python 2.x. From the Python docs on urllib:
import urllib
# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
Note that the part of the docs about ProxyHandler figuring out default values describes what urlopen already does, so relying on that automatic detection probably isn't going to work.
If you really want urllib2, you'll have to specify a ProxyHandler explicitly, like the example on that page. Authentication might or might not be required (usually it's not).
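For reference, a minimal urllib2 sketch along those lines, assuming a hypothetical proxy at proxy.example.com:3128 (substitute whatever your browser settings or IT department report):
import urllib2

# Hypothetical proxy address; replace with your real proxy host and port.
proxy_handler = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128'})
opener = urllib2.build_opener(proxy_handler)
response = opener.open('http://www.python.org')
html = response.read()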

This isn't a good answer to "How to do this with urllib2", but let me suggest python-requests. The whole reason it exists is because the author found urllib2 to be an unwieldy mess. And he's probably right.
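For what it's worth, the same fetch with requests would look roughly like this; the proxy entry is a placeholder and only needed if your network requires one:
import requests

# Hypothetical proxy address; drop the proxies argument if you don't need one.
proxies = {'http': 'http://proxy.example.com:3128'}
response = requests.get('http://www.python.org', proxies=proxies, timeout=10)
html = response.text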

That is very weird; have you tried a different URL?
Otherwise there is httplib, though it is more complicated. Here's your example using httplib:
import httplib as h
domain = h.HTTPConnection('www.python.org')
domain.connect()
domain.request('GET', '/fish.html')
response = domain.getresponse()
if response.status == h.OK:
    html = response.read()

I get a 404 error almost immediately (no hanging):
>>> import urllib2
>>> response = urllib2.urlopen('http://www.python.org/fish.html')
Traceback (most recent call last):
...
urllib2.HTTPError: HTTP Error 404: Not Found
If I try and contact an address that doesn't have an HTTP server running, it hangs for quite a while until the timeout happens. You can shorten it by passing the timeout parameter to urlopen:
>>> response = urllib2.urlopen('http://cs.princeton.edu/fish.html', timeout=5)
Traceback (most recent call last):
...
urllib2.URLError: <urlopen error timed out>

Related

How do I solve HTTP Error 429: Too Many Requests? [duplicate]

I am trying to use Python to log in to a website and gather information from several webpages, and I get the following error:
Traceback (most recent call last):
File "extract_test.py", line 43, in <module>
response=br.open(v)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 429: Unknown Response Code
I used time.sleep() and it works, but it seems unintelligent and unreliable. Is there any other way to dodge this error?
Here's my code:
import mechanize
import cookielib
import re
first=("example.com/page1")
second=("example.com/page2")
third=("example.com/page3")
fourth=("example.com/page4")
## I have seven URL's I want to open
urls_list=[first,second,third,fourth]
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Log in credentials
br.open("example.com")
br.select_form(nr=0)
br["username"] = "username"
br["password"] = "password"
br.submit()
for url in urls_list:
    br.open(url)
    print re.findall("Some String")
Receiving a status 429 is not an error; it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.
You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP; you should simply respect the server's answer by not sending too many requests.
If everything is set up properly, you will also have received a "Retry-After" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and sleep your process for that many seconds.
You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3
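As a rough sketch of that approach (the URL is a placeholder, and the code assumes Retry-After carries a number of seconds rather than an HTTP date):
import time
import requests

response = requests.get('https://api.example.com/resource')  # placeholder URL
if response.status_code == 429:
    # Fall back to 60 seconds if the server did not send the header.
    delay = int(response.headers.get('Retry-After', 60))
    time.sleep(delay)
    response = requests.get('https://api.example.com/resource')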
Writing the request like this fixed my problem:
requests.get(link, headers={'User-agent': 'your bot 0.1'})
This works because sites sometimes return a 429 Too Many Requests error when no user agent is provided. For example, Reddit's API only works when a user agent is supplied.
As MRA said, you shouldn't try to dodge a 429 Too Many Requests but instead handle it accordingly. You have several options depending on your use-case:
1) Sleep your process. The server usually includes a Retry-After header in the response with the number of seconds you are supposed to wait before retrying. Keep in mind that sleeping a process might cause problems, e.g. in a task queue, where you should instead retry the task at a later time to free up the worker for other things.
2) Exponential backoff. If the server does not tell you how long to wait, you can retry your request using increasing pauses in between. The popular task queue Celery has this feature built right in.
3) Token bucket. This technique is useful if you know in advance how many requests you are able to make in a given time. Each time you access the API you first fetch a token from the bucket. The bucket is refilled at a constant rate. If the bucket is empty, you know you'll have to wait before hitting the API again. Token buckets are usually implemented on the other end (the API) but you can also use them as a proxy to avoid ever getting a 429 Too Many Requests. Celery's rate_limit feature uses a token bucket algorithm; a rough sketch of the idea follows this list.
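As an illustration of the client-side variant, a minimal token bucket might look like the sketch below. This is an assumption-laden example for clarity, not Celery's actual rate_limit implementation; the rate and capacity values are placeholders.
import time

class TokenBucket(object):
    """Client-side token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.last = time.time()

    def consume(self):
        # Refill at a constant rate, capped at the bucket capacity.
        now = time.time()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            # Not enough tokens: wait until one has accrued, then spend it.
            time.sleep((1 - self.tokens) / self.rate)
            self.last = time.time()
            self.tokens = 0.0
        else:
            self.tokens -= 1

bucket = TokenBucket(rate=10, capacity=10)  # e.g. 10 requests per second
# Call bucket.consume() before each API request to stay under the limit.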
Here is an example of a Python/Celery app using exponential backoff and rate-limiting/token bucket:
import time
import requests
from requests.exceptions import ConnectTimeout
# `task` below is Celery's task decorator (e.g. app.task on your configured app);
# autoretry_for and retry_backoff need Celery 4.x or newer.

class TooManyRequests(Exception):
    """Too many requests"""

@task(
    rate_limit='10/s',
    autoretry_for=(ConnectTimeout, TooManyRequests,),
    retry_backoff=True)
def api(*args, **kwargs):
    r = requests.get('placeholder-external-api')  # placeholder URL
    if r.status_code == 429:
        raise TooManyRequests()
And if you simply want to honour the Retry-After header on a plain requests response:
if response.status_code == 429:
    time.sleep(int(response.headers["Retry-After"]))
Another workaround would be to spoof your IP using some sort of public VPN or the Tor network. This assumes the rate limiting on the server is applied at the IP level.
There is a brief blog post demonstrating a way to use tor along with urllib2:
http://blog.flip-edesign.com/?p=119
I've found a nice workaround for IP blocking when scraping sites. It lets you run a scraper indefinitely by running it from Google App Engine and redeploying it automatically when you get a 429.
Check out this article
In many cases, continuing to scrape data from a website even when the server is requesting you not to is unethical. However, in the cases where it isn't, you can utilize a list of public proxies in order to scrape a website with many different IP addresses.

PyGoogleVoice not logging in

Google just updated their Google Voice platform, which seems to correlate directly with when my googlevoice login stopped working.
I have tried the following:
allowing captcha as suggested here (pygooglevoice-login-error)
Adapting a 2.7 solution here with no luck Python Google Voice
Logging out of my session, i.e. voice.logout()
Uninstalled pygooglevoice and reinstalled.
Tried a different google voice account.
This code was working perfectly up until the Google Voice website makeover.
Python 3.5.2, Windows Server 2012 R2
from googlevoice import Voice
from googlevoice.util import input

voice = Voice()
voice.login(email='email@gmail.com', passwd='mypassword')

def sendText(phoneNumber, text):
    try:
        voice.send_sms(phoneNumber, text)
    except Exception:
        pass

sendText(phoneNumber=[aaabbbcccc], text="Hello from Google Voice!")
voice.logout()
Error Log:
Traceback (most recent call last):
File voice.py, line 95, in login
assert self.special
AssertionError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
line 7, in <module>
voice.login(email='********', passwd='*******')
File voice.py, line 97, in login
raise LoginError
googlevoice.util.LoginError
I've got the same issue. It looks like the page being sent back is now a drastically different, JavaScript/AJAX-driven page compared with what was sent before.
I've been messing with it a bit and tracked it down to the missing "special" session token that was included before. PyGoogleVoice searches for the string literal "_rnr_se" within the page HTML sent back from Google to scrape the session value. That string is no longer found, which causes it to think the login failed. From what I can tell, PGV needs that token to make the URL/function calls that imitate the web client.
There's a JavaScript function that retrieves that variable now, instead of it being passed back, hardcoded in the HTML page.
gc.net.XhrManager = function(xsrfToken, notification, loadNotification) {
goog.events.EventTarget.call(this);
this.xsrfToken_ = xsrfToken;
this.notification_ = notification;
this.loadNotification_ = loadNotification;
this.logger_ = goog.debug.Logger.getLogger("gc.Xhr");
this.xhrManager_ = new goog.net.XhrManager(0);
this.activeRequests_ = new goog.structs.Map;
this.eventHandler_ = new goog.events.EventHandler(this);
this.eventHandler_.listen(this.xhrManager_, goog.net.EventType.SUCCESS, this.onRequestSuccess_);
this.eventHandler_.listen(this.xhrManager_, goog.net.EventType.ERROR, this.onRequestError_);
};
And then when making calls, it's using the value like so:
gc.net.XhrManager.prototype.sendPost = function(id, url, queryData, opt_successCallback, opt_errorCallback) {
this.sendAnalyticsEvent_(url, queryData);
id = goog.string.buildString(id, this.idGenerator_.getNextUniqueId());
if (goog.isDefAndNotNull(queryData) && !(queryData instanceof goog.Uri.QueryData)) {
throw Error("queryData parameter must be of type goog.Uri.QueryData");
}
var uri = new goog.Uri(url), completeQueryData = queryData || new goog.Uri.QueryData;
completeQueryData.set("_rnr_se", this.xsrfToken_);
this.activeRequests_.set(id, {queryData:completeQueryData, onSuccess:opt_successCallback, onError:opt_errorCallback});
this.xhrManager_.send(id, uri.toString(), "POST", completeQueryData.toString());
};
I figured I'd share my findings so others can help tinker with the new code and figure out how to retrieve and interact with this new version. It may not be too far off, once we can find the new way to capture that xsrfToken or _rnr_se value.
I'm a bit short on time at the current moment, but would love to get this working again. It's probably a matter of messing with firebug, etc. to watch how the session gets started in browser via javascript and have PGV mimic the new URLs, etc.
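For anyone tinkering along, the kind of scrape PGV relied on amounts to something like the snippet below. This is a rough reconstruction for illustration only; the regex is my guess at the old hidden-input markup, not the library's actual code.
import re

def extract_rnr_se(html):
    # Hypothetical pattern for the old page markup; adjust once the new
    # location of the xsrfToken / _rnr_se value is known.
    match = re.search(r"name=['\"]_rnr_se['\"]\s+value=['\"]([^'\"]+)['\"]", html)
    return match.group(1) if match else None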
Per Ward Mundy:
New version of gvoice command line sms text messaging is available, which is fixed to work with Google's new modernized "AngularJS" gvoice web interface. It was a small change to get it working, in case anyone is wondering.
Paste these commands into your shell to upgrade:
cd ~
git clone https://github.com/pettazz/pygooglevoice
cd pygooglevoice
python setup.py install
cp -p bin/gvoice /usr/bin/.
pip install --upgrade BeautifulSoup
https://pbxinaflash.com/community/threads/sms-with-google-voice-is-back-again.19717/page-2#post-129617

How to read JSON from URL in Python?

I am trying to use Python to get a JSON file from the Web. If I open the URL in my browser (Mozilla or Chromium) I do see the JSON. But when I do the following with the Python:
response = urllib2.urlopen(url)
data = json.loads(response.read())
I get an error message that tells me the following (translated into English): Errno 10060: a connection attempt failed because the server did not respond after a certain period of time, or the established connection failed because the host did not respond.
ADDED
It looks like many people have faced the described problem. There are also some answers to similar (or the same) questions. For example, here we can see the following solution:
import requests
r = requests.get("http://www.google.com", proxies={"http": "http://61.233.25.166:80"})
print(r.text)
It is already a step forward for me (I think it is very likely that the proxy is the reason for the problem). However, I still have not got it working, since I do not know the URL of my proxy, and I will probably need a user name and password. How can I find them? And how is it that my browsers have them and I do not?
ADDED 2
I think I am now one step further. I have used this site to find out what my proxy is: http://www.whatismyproxy.com/
Then I have used the following code:
proxies = {'http':'my_proxy.blabla.com/'}
r = requests.get(url, proxies = proxies)
print r
As a result I get
<Response [404]>
It does not look great, but at least I think my proxy is correct, because when I randomly change the proxy address I get another error:
Cannot connect to proxy
So, I can connect to proxy but something is not found.
I think there might be something wrong with how you're trying to get the JSON from the online source (URL). Just to make things clear, here is a small code snippet:
#!/usr/bin/env python
try:
    # For Python 3+
    from urllib.request import urlopen
except ImportError:
    # For Python 2
    from urllib2 import urlopen

import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    data = response.read().decode("utf-8")
    return json.loads(data)
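For example (the URL below is only a placeholder for whatever endpoint you are actually fetching):
data = get_jsonparsed_data("https://api.example.com/data.json")  # hypothetical endpoint
print(data)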
If you still get a connection error, you can try a couple of things:
Try to urlopen() a random site from the interpreter (interactive mode). If you are able to grab the source code, you're good; if not, check your internet connection or try the requests module. Check here
Check whether the JSON at the URL has correct syntax. For sample JSON syntax check here
Try the simplejson module.
Edit 1:
If you want to access websites through a system-wide proxy, you will have to use a proxy handler pointed at the loopback (localhost) address where that proxy listens. A sample is shown below.
import urllib2

proxy = urllib2.ProxyHandler({
    'http': '127.0.0.1',
    'https': '127.0.0.1'
})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
# this way you can send both http and https requests using proxies
urllib2.urlopen('http://www.google.com')
urllib2.urlopen('https://www.google.com')
I have not worked a lot with ProxyHandler; I just know the theory and the code. I am sure there are better ways to access websites through proxies, ones that do not involve installing the opener every time you run the program, but hopefully this will point you in the right direction.
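For what it's worth, a sketch of that idea, reusing the same placeholder loopback proxy, is to keep the opener object and call it directly instead of installing it process-wide:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1', 'https': '127.0.0.1'})
opener = urllib2.build_opener(proxy)
# Call the opener directly; no urllib2.install_opener() needed.
response = opener.open('http://www.google.com')
print response.getcode()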

Python Requests SSL issue

It's been days now since I started to look for a solution for this.
I've been trying to use requests to make an https request through a proxy with no luck.
Although this is part of a bigger project of mine, it all boils down to this:
import requests
prox = 'xxx.xxx.xxx.xxx:xxx' # fill in valid proxy
r = requests.get('https://ipdb.at', proxies={'http': prox,
'https': prox})
I tried the kwarg verify=False but I keep getting the same error or a variation of it:
Traceback (most recent call last):
File "/Users/mg/Desktop/proxy_service.py", line 21, in <module>
verify=False)
File "/Library/Python/2.7/site-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 279, in request
resp = self.send(prep, stream=stream, timeout=timeout, verify=verify, cert=cert, proxies=proxies)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 374, in send
r = adapter.send(request, **kwargs)
File "/Library/Python/2.7/site-packages/requests/adapters.py", line 213, in send
raise SSLError(e)
requests.exceptions.SSLError: [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
[Finished in 20.4s with exit code 1]
I'm using the latest versions of requests and OpenSSL, and Python 2.7.3. I tested it on Mac OS X Mountain Lion 10.8.2 and also on Windows 7.
I googled this and saw others having similar issues. I found related pull requests on the requests and urllib3 libraries, and also saw information about this being caused by a bad implementation of the HTTP CONNECT verb. I found a suggested fix using a custom adapter (I don't recall the exact term, but I think that's right) to force a specific SSL protocol. I tried them all; none helped.
So, I'm looking for more info. Do you have any idea what is going on, how I can fix it, etc.?
All help is welcome and thank you all!
PS: I tried several proxies, and I'm sure they are not the issue.
I had this same problem, but when trying to use oauth2client (Google's library for using OAuth2 against their APIs). After a day of faffing ("experimentation"), I discovered that I had the env variable set as follows:
https_proxy=https://your-proxy:port
Changing this to
https_proxy=http://your-proxy:port
completely fixed the problem for me. Presumably this means that the initial handshake from the client to the proxy is now unencrypted, but after that the https must flow directly. I don't believe my company proxy was configured to talk https anyway.
Note however that my browsers all behaved fine with the previous setting. Perhaps they silently tried to establish the proxy connection over http after failing to use https.
Googling shows examples of setting https_proxy to both http or https addresses in equal measure. I guess it depends on how the proxy is configured.
The "use for all protocols" button in browser/OS configuration pop-ups makes it easy to accidentally use the setting I had.
However I'm not sure if that is your problem. Your code suggests that you already use exactly the same proxy setting for both http and https.
By the way, this was all on Ubuntu 12.04.1.
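Translated into requests terms (the proxy address is a placeholder), that amounts to pointing both proxy entries at an http:// URL:
import requests

proxy = 'http://your-proxy:8080'  # placeholder; note the http:// scheme even for the https entry
r = requests.get('https://ipdb.at', proxies={'http': proxy, 'https': proxy})
print r.status_code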
The answer here worked for me:
using requests with TLS doesn't give SNI support
I installed the three dependent libraries and included the monkey patch tip from the second answer.
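If it helps, I believe (an assumption about that linked answer, not something verified here) the three libraries are pyOpenSSL, ndg-httpsclient and pyasn1, and the monkey patch is urllib3's pyOpenSSL injection:
# Assumed reconstruction of the SNI workaround for older Python 2.7 setups;
# requires pyOpenSSL, ndg-httpsclient and pyasn1 to be installed.
from urllib3.contrib import pyopenssl
pyopenssl.inject_into_urllib3()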

Python urllib2, how to avoid errors - need help

I am using Python urllib2 to download pages from the web. I am not using any kind of user agent, etc. I am getting the sample errors below. Can someone tell me an easy way to avoid them?
http://www.rottentomatoes.com/m/foxy_brown/
The server couldn't fulfill the request.
Error code: 403
http://www.spiritus-temporis.com/marc-platt-dancer-/
The server couldn't fulfill the request.
Error code: 503
http://www.golf-equipment-guide.com/news/Mark-Nichols-(golfer).html!!
The server couldn't fulfill the request.
Error code: 500
http://www.ehx.com/blog/mike-matthews-in-fuzz-documentary!!
We failed to reach a server.
Reason: timed out
IncompleteRead(5621 bytes read)
Traceback (most recent call last):
File "download.py", line 43, in <module>
localFile.write(response.read())
File "/usr/lib/python2.6/socket.py", line 327, in read
data = self._sock.recv(rbufsize)
File "/usr/lib/python2.6/httplib.py", line 517, in read
return self._read_chunked(amt)
File "/usr/lib/python2.6/httplib.py", line 563, in _read_chunked
raise IncompleteRead(value)
IncompleteRead: IncompleteRead(5621 bytes read)
Thank you
Bala
Many web resources require some kind of cookie or other authentication to access; your 403 status codes are most likely the result of this.
503 errors tend to mean you're rapidly accessing resources from a server in a loop and you need to wait briefly before attempting another access.
The 500 example doesn't even appear to exist...
The timeout error may not need the "!!"; I can only load the resource without it.
I recommend you read up on HTTP status codes.
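As a starting point, sending a browser-like User-Agent and handling the failures explicitly might look like this (a sketch; no guarantee any particular server will accept it):
import socket
import urllib2

url = 'http://www.rottentomatoes.com/m/foxy_brown/'
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    response = urllib2.urlopen(request, timeout=10)
    html = response.read()
except urllib2.HTTPError as e:
    print "The server couldn't fulfill the request. Error code:", e.code
except urllib2.URLError as e:
    print "We failed to reach a server. Reason:", e.reason
except socket.timeout:
    print "The read timed out."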
For more complicated tasks, you might want to consider using mechanize, twill, or even Selenium or Windmill, which support more complicated scenarios, including cookies and JavaScript.
For a random website, it might be tricky to work around with urllib2 only (signed cookies, anyone?).
