How do I solve HTTP Error 429: Too Many Requests? [duplicate] - python

I am trying to use Python to login to a website and gather information from several webpages and I get the following error:
Traceback (most recent call last):
  File "extract_test.py", line 43, in <module>
    response=br.open(v)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 429: Unknown Response Code
I used time.sleep() and it works, but it seems unintelligent and unreliable. Is there any other way to dodge this error?
Here's my code:
import mechanize
import cookielib
import re

first = "example.com/page1"
second = "example.com/page2"
third = "example.com/page3"
fourth = "example.com/page4"
# I have seven URLs I want to open (four shown here)
urls_list = [first, second, third, fourth]

br = mechanize.Browser()
# Cookie jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Log in credentials
br.open("example.com")
br.select_form(nr=0)
br["username"] = "username"
br["password"] = "password"
br.submit()

for url in urls_list:
    response = br.open(url)
    # re.findall needs both a pattern and the text to search
    print re.findall("Some String", response.read())

Receiving a status 429 is not an error; it is the other server "kindly" asking you to please stop spamming requests. Your rate of requests has been too high and the server is not willing to accept this.
You should not seek to "dodge" this, or try to circumvent server security settings by spoofing your IP. Simply respect the server's answer by not sending too many requests.
If everything is set up properly, you will also have received a "Retry-After" header along with the 429 response. This header usually specifies the number of seconds you should wait before making another call (it may also be given as an HTTP date). The proper way to deal with this "problem" is to read this header and to sleep your process for that long.
You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3
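A minimal sketch of reading that header, assuming Python 3 and only the standard library. The function name retry_after_seconds and the default of one second are made up for illustration; the header value is whatever string your HTTP client hands you.

```python
import time
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=1.0):
    """Convert a Retry-After header value into a non-negative delay in seconds.

    `default` (an assumption of this sketch) is used when the header is
    missing or cannot be parsed.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Form 1: a plain number of seconds, e.g. "120"
        return float(value)
    try:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    current = time.time() if now is None else now
    return max(0.0, when.timestamp() - current)
```

Typical use would be time.sleep(retry_after_seconds(response.headers.get("Retry-After"))) after a 429 response.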

Writing this piece of code when requesting fixed my problem:
requests.get(link, headers = {'User-agent': 'your bot 0.1'})
This works because sites sometimes return a Too Many Requests (429) error when there isn't a user agent provided. For example, Reddit's API only works when a user agent is applied.

As MRA said, you shouldn't try to dodge a 429 Too Many Requests but instead handle it accordingly. You have several options depending on your use-case:
1) Sleep your process. The server usually includes a Retry-After header in the response with the number of seconds you are supposed to wait before retrying. Keep in mind that sleeping a process might cause problems, e.g. in a task queue, where you should instead retry the task at a later time to free up the worker for other things.
2) Exponential backoff. If the server does not tell you how long to wait, you can retry your request using increasing pauses in between. The popular task queue Celery has this feature built right in.
3) Token bucket. This technique is useful if you know in advance how many requests you are able to make in a given time. Each time you access the API you first fetch a token from the bucket. The bucket is refilled at a constant rate. If the bucket is empty, you know you'll have to wait before hitting the API again. Token buckets are usually implemented on the other end (the API) but you can also use them as a proxy to avoid ever getting a 429 Too Many Requests. Celery's rate_limit feature uses a token bucket algorithm.
Here is an example of a Python/Celery app using exponential backoff and rate-limiting/token bucket:
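import requests
from requests.exceptions import ConnectTimeout

# `task` is assumed to be a Celery task decorator (for example `app.task`)

class TooManyRequests(Exception):
    """Too many requests"""

@task(
    rate_limit='10/s',
    autoretry_for=(ConnectTimeout, TooManyRequests,),
    retry_backoff=True)
def api(*args, **kwargs):
    r = requests.get('placeholder-external-api')
    if r.status_code == 429:
        raise TooManyRequests()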
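If you are not using Celery, the token bucket idea from option 3 can be sketched in plain Python. This is only an illustration: the class name TokenBucket, its methods, and the injectable clock are assumptions of the sketch, not part of any library.

```python
import time


class TokenBucket:
    """Client-side token bucket: `rate` tokens per second, at most `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock  # injectable so the bucket can be tested without waiting
        self.tokens = float(capacity)
        self.updated = clock()

    def _refill(self):
        # Add the tokens earned since the last update, never exceeding capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return True on success."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate
```

Before each API call you would check try_acquire(), and if it returns False, sleep for wait_time() seconds first.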

if response.status_code == 429:
    time.sleep(int(response.headers["Retry-After"]))
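The snippet above assumes the header is always present and numeric. Below is a slightly more defensive sketch that falls back to exponential backoff when it is not. The names get_with_retries and backoff_delay, the attempt limit, and the delay cap are illustrative assumptions; fetch can be any callable returning an object with status_code and headers, such as requests.get.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential delay for a zero-based attempt number, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def get_with_retries(fetch, url, max_attempts=5, sleep=time.sleep, jitter=random.random):
    """Call fetch(url) until the status is not 429 or attempts run out."""
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 429:
            return response
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            # The server told us how long to wait.
            delay = float(retry_after)
        else:
            # No usable header: back off exponentially, plus up to 1 s of jitter.
            delay = backoff_delay(attempt) + jitter()
        sleep(delay)
    return response
```

The last 429 response is returned when all attempts are used up, so the caller can still decide what to do with it.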

Another workaround would be to spoof your IP using some sort of public VPN or the Tor network. This assumes the server rate-limits per IP address.
There is a brief blog post demonstrating a way to use Tor along with urllib2:
http://blog.flip-edesign.com/?p=119

I've found out a nice workaround to IP blocking when scraping sites. It lets you run a Scraper indefinitely by running it from Google App Engine and redeploying it automatically when you get a 429.
Check out this article

In many cases, continuing to scrape data from a website even when the server is requesting you not to is unethical. However, in the cases where it isn't, you can utilize a list of public proxies in order to scrape a website with many different IP addresses.

Related

How to see if user is connected to internet or NOT?

I am building a BlogApp and I am stuck on a problem.
What I am trying to do
I am trying to implement a feature so that if the user is connected to the internet then everything works fine, but if the user is not connected to the internet then a message like "You're not connected to the internet" is shown.
What I have tried
I tried channels, but I think internet connection checks are far away from what Django Channels does.
I also tried this:
url = "http://127.0.0.1:8000/"
timeout = 5
try:
    request = requests.get(url, timeout=timeout)
    print("Connected to the Internet")
except (requests.ConnectionError, requests.Timeout) as exception:
    print("No INTERNET")
But it keeps showing me:
'Response' object has no attribute 'META'
I don't know what to do.
Any help would be Appreciated.
Thank You in Advance
It is not easy to know whether you're connected to the internet. In fact it is not even clear what this means. It depends a lot on the context.
In many practical cases it means that your network is set up such that you can access a DNS server and that you can access at least one machine on the internet.
You could just use one known URL, for example "https://google.com" or "https://stackoverflow.com".
However this means that:
your test will fail if the given service is down for any reason
you create requests to a server that isn't yours.
If you know that the application should access your special web service, then you could use the URL of your special web service:
url = "https://your_special_webservice.yourdomain"
Side note:
If you put the code from your question into a Django view that handles HTTP requests, then you should probably write something like:
response = requests.get(url, timeout=timeout)
instead of
request = requests.get(url, timeout=timeout)
Otherwise you will overwrite the request object of your Django view,
and this is probably what provoked your error message:
'Response' object has no attribute 'META'
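As a rough sketch of the check described above, here is a helper that uses only the Python 3 standard library instead of requests. The function name is_reachable and the injectable opener parameter are assumptions made for this example; the URL you pass in is whichever service you decide represents "the internet" for your application.

```python
import urllib.error
import urllib.request


def is_reachable(url, timeout=5, opener=urllib.request.urlopen):
    """Return True if `url` answers within `timeout` seconds, else False."""
    try:
        # Keep the result in `response`, not `request`, so a Django view's
        # own `request` argument is never shadowed.
        with opener(url, timeout=timeout) as response:
            response.read(1)
        return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status: the network is up.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return False
```

Note that code running in a Django view executes on the server, so this tells you whether the server can reach the URL, not whether the visitor's browser is online.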

HTTP REST Gateway to AMQP Request-Response, Without Web Sockets Or Polling

I've struggled for two days to understand how a REST API gateway should answer GET requests from browsers when the backend service runs on AMQP (without using WebSockets or polling).
I have successfully done RPC between AMQP services (with RabbitMQ's reply_to & correlation_id), but with a Flask HTTP request waiting for the reply I'm still lost.
gateway.py - Response Handler Inside The HTTP Handler, Times out
def products_get():
    def handler(ch=None, method=None, properties=None, body=None):
        if body:
            return body
        return False
    return_queue = 'products.get.return'
    broker.channel.queue_declare(return_queue)
    broker.channel.basic_consume(handler, return_queue)
    broker.publish(exchange='', routing_key='products.get', body='Request data', properties=pika.BasicProperties(reply_to=return_queue))
    now = time.time()  # for timeout. Not having this returns 'no content' immediately
    while time.time() < now + 1:
        if handler():
            return handler()
    return 'Time out'
POST/PUT can simply send the AMQP message, return 200/201/202 immediately and let the service work at its own pace. A separate REST interface just for GET requests seems implausible, but I don't know the other options.
Regards
I think what you're asking is "how to perform asynchronous GET requests", and I reckon that the answer is: you can't, and you should not. It's bad practice and bad design, and it does not scale.
Why are you trying to get your GET response payload from AMQP?
If the payload (the content of the response) can be pulled from some DB, just pull it from there. That's called a synchronous request.
If the payload must be processed in some backend, send it away and don't have the requester wait for a response. You could assign some ID and have the requester ask again later (or collect some callback URL from the requester and have your backend POST the response once it's ready, which is a less common design).
EDIT:
So, given that you have to work with an AMQP-backed backend, I would do something a little more elaborate: spawn a thread or a process in your front end that constantly consumes from AMQP and stores the results locally or in some DB, and serve GET results based on the data that you stored locally. If the data isn't available yet, just return 404. Ideally you'll need to reshape your API: split it into "post" requests (that trigger work at the backend) and "get" requests (that return the results if they're available).
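The consumer-plus-local-store idea from the EDIT can be sketched roughly as follows. To stay self-contained, a queue.Queue stands in for the AMQP reply queue, and ResultStore, consume_forever and handle_get are hypothetical names. In a real gateway the consumer would be a pika callback and handle_get would be a Flask view.

```python
import queue
import threading


class ResultStore:
    """Thread-safe map from request ID to the backend's result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}

    def put(self, request_id, body):
        with self._lock:
            self._results[request_id] = body

    def get(self, request_id):
        with self._lock:
            return self._results.get(request_id)


def consume_forever(replies, store):
    """Drain (request_id, body) pairs from `replies` into `store`.

    `replies` stands in for the AMQP reply queue; a None item stops the loop.
    """
    while True:
        item = replies.get()
        if item is None:
            break
        request_id, body = item
        store.put(request_id, body)


def handle_get(store, request_id):
    """Return (status, body) the way a GET endpoint might."""
    body = store.get(request_id)
    if body is None:
        return 404, "result not available yet"
    return 200, body
```

The HTTP handler never blocks on the broker here: it only looks at what the background consumer has already stored, which is the point of splitting "post" and "get" requests.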

How to read JSON from URL in Python?

I am trying to use Python to get a JSON file from the Web. If I open the URL in my browser (Mozilla or Chromium) I do see the JSON. But when I do the following in Python:
response = urllib2.urlopen(url)
data = json.loads(response.read())
I get an error message that tells me the following (translated into English): Errno 10060, the connection attempt failed because the server did not respond after a certain period of time, or the connection was faulty, or the host did not respond.
ADDED
It looks like many people have faced the described problem. There are also some answers to similar (or the same) questions. For example, here we can see the following solution:
import requests
r = requests.get("http://www.google.com", proxies={"http": "http://61.233.25.166:80"})
print(r.text)
It is already a step forward for me (I think it is very likely that the proxy is the cause of the problem). However, I still have not got it working, since I do not know the URL of my proxy and I will probably need a user name and password. How can I find them? How is it that my browsers have them and I do not?
ADDED 2
I think I am now one step further. I have used this site to find out what my proxy is: http://www.whatismyproxy.com/
Then I have used the following code:
proxies = {'http':'my_proxy.blabla.com/'}
r = requests.get(url, proxies = proxies)
print r
As a result I get
<Response [404]>
That does not look good, but at least I think that my proxy is correct, because when I randomly change the address of the proxy I get another error:
Cannot connect to proxy
So, I can connect to the proxy but something is not found.
I think something might be going wrong when you try to get the JSON from the online source (URL). Just to make things clear, here is a small code snippet:
#!/usr/bin/env python
try:
    # For Python 3+
    from urllib.request import urlopen
except ImportError:
    # For Python 2
    from urllib2 import urlopen
import json

def get_jsonparsed_data(url):
    response = urlopen(url)
    # Decode the bytes; str() on bytes yields "b'...'" in Python 3, which is not valid JSON
    data = response.read().decode("utf-8")
    return json.loads(data)
If you still get a connection error, you can try a couple of steps:
Try to urlopen() a random site from the interpreter (interactive mode). If you are able to grab the source code, you're good. If not, check your internet connection or try the requests module. Check here
Check whether the JSON at the URL has the correct syntax. For sample JSON syntax check here
Try the simplejson module.
Edit 1:
If you want to access websites using a system-wide proxy, you will have to use a proxy handler that uses loopback (localhost) to connect to that proxy. Sample code is shown below.
proxy = urllib2.ProxyHandler({
    'http': '127.0.0.1',
    'https': '127.0.0.1'
})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
# this way you can send both http and https request using proxies
urllib2.urlopen('http://www.google.com')
urllib2.urlopen('https://www.google.com')
I have not worked a lot with ProxyHandler. I just know the theory and the code. I am sure there are better ways to access websites through proxies, ones that do not involve installing the opener every time you run the program. But hopefully it will point you in the right direction.
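For Python 3, where urllib2 became urllib.request, the same approach might look like the sketch below. The helper names build_proxy_opener and fetch_json are illustrative assumptions, and the proxy address and port must be replaced with your own. Passing the opener around explicitly avoids installing it globally.

```python
import json
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS traffic via `proxy_url`."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_json(url, opener, timeout=10):
    """Fetch `url` through `opener` and decode the body as JSON."""
    with opener.open(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would be along the lines of fetch_json(url, build_proxy_opener("http://127.0.0.1:8080")), where the port 8080 is purely a placeholder.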

With Bottle, how could I just peek the head of http request instead of receiving whole http request?

I don't know if it is possible with Bottle.
My website (powered by Bottle) allows users to upload image files, but I limit the size to 100K. I use the following code in the web server to do that.
uploadLimit = 100 # 100k
uploadLimitInByte = uploadLimit* 2**10
print("before call request.headers.get('Content-Length')")
contentLen = request.headers.get('Content-Length')
if contentLen:
    contentLen = int(contentLen)
    if contentLen > uploadLimitInByte:
        return HTTPResponse('upload limit is 100K')
But when I click the upload button in the web browser to upload a file of about 2MB, it seems the server receives the whole 2MB HTTP request.
I expected the above code to receive just the HTTP headers instead of the whole HTTP request. That would avoid wasting time on receiving unnecessary bytes.
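One possible direction, offered only as a sketch: since a Bottle application is a WSGI application, the Content-Length check can be moved into a small WSGI wrapper that answers before the application code touches the body. The name limit_upload_size is made up for this example. Whether the bytes are still transferred depends on the WSGI server and on the client, so this does not guarantee that the upload is cut short.

```python
def limit_upload_size(app, max_bytes):
    """Wrap a WSGI app; reject requests whose declared body exceeds `max_bytes`."""
    def middleware(environ, start_response):
        declared = environ.get("CONTENT_LENGTH") or "0"
        try:
            length = int(declared)
        except ValueError:
            length = 0  # malformed header: let the wrapped app decide
        if length > max_bytes:
            # Answer from the headers alone, without reading wsgi.input.
            body = b"upload limit exceeded"
            start_response("413 Payload Too Large", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware
```

With Bottle this would be used roughly as wrapped = limit_upload_size(bottle.default_app(), 100 * 2**10), with wrapped then handed to the WSGI server.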

Python urllib2 timeout when using Tor as proxy?

I am using Python's urllib2 with Tor as a proxy to access a website. When I open the site's main page it works fine, but when I try to view the login page (not actually log in, just view it) I get the following error...
URLError: <urlopen error (10060, 'Operation timed out')>
To counteract this I did the following:
import socket
socket.setdefaulttimeout(None)
I still get the same timeout error.
Does this mean the website is timing out on the server side? (I don't know much about HTTP processes, so sorry if this is a dumb question.)
Is there any way I can correct it so that Python is able to view the page?
Thanks,
Rob
According to the Python socket documentation, the default is no timeout, so specifying a value of None is redundant.
There are a number of possible reasons that your connection is dropping. One could be that your user-agent is "Python-urllib" which may very well be blocked. To change your user agent:
request = urllib2.Request('site.com/login')
request.add_header('User-Agent','Mozilla/5.0 (X11; U; Linux i686; it-IT; rv:1.9.0.2) Gecko/2008092313 Ubuntu/9.04 (jaunty) Firefox/3.5')
You may also want to try overriding the proxy settings before you try and open the url using something along the lines of:
proxy = urllib2.ProxyHandler({"http":"http://127.0.0.1:8118"})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
I don't know enough about Tor to be sure, but the timeout may not happen on the server side, but on one of the Tor nodes somewhere between you and the server. In that case there is nothing you can do other than to retry the connection.
urllib2.urlopen(url[, data][, timeout])
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS, FTP and FTPS connections.
http://docs.python.org/library/urllib2.html
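Putting the two suggestions together (an explicit User-Agent and a per-call timeout) might look like this in Python 3, where urllib2 is urllib.request. The helper names build_request and open_with_timeout and the 30-second default are assumptions of this sketch.

```python
import urllib.request


def build_request(url, user_agent):
    """Create a Request carrying an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def open_with_timeout(request, timeout=30, opener=urllib.request.urlopen):
    """Open `request`, giving up after `timeout` seconds per blocking operation."""
    with opener(request, timeout=timeout) as response:
        return response.read()
```

If the timeout happens on a Tor node rather than on the server, a longer timeout alone may not help, and retrying the connection is still the fallback.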
