I'm currently using an API that rate limits me to 3000 requests per 10 seconds. I have 10,000 URLs that are fetched with Tornado because of its asynchronous IO.
How do I go about implementing a rate limit that reflects the API limit?
from tornado import ioloop, httpclient

i = 0

def handle_request(response):
    print(response.code)
    global i
    i -= 1
    if i == 0:
        ioloop.IOLoop.instance().stop()

http_client = httpclient.AsyncHTTPClient()
for url in open('urls.txt'):
    i += 1
    http_client.fetch(url.strip(), handle_request, method='HEAD')
ioloop.IOLoop.instance().start()
You can check which 3000-request interval the value of i falls into. For example, if i is between 3000 and 6000, you can set a timeout of 10 seconds on every request until 6000. After 6000, just double the timeout. And so on.
import functools

from tornado import ioloop
from tornado.httpclient import AsyncHTTPClient

# i and handle_request are reused from the question's code above
http_client = AsyncHTTPClient()
timeout = 10
interval = 3000

for url in open('urls.txt'):
    i += 1
    if i <= interval:
        # i is less than or equal to 3000;
        # just fetch the request without any timeout
        http_client.fetch(url.strip(), handle_request, method='GET')
        continue  # skip the rest of the loop
    if i % interval == 1:
        # i is now 3001, or 6001, and so on ...
        timeout += timeout  # double the timeout for the next 3000 calls
    loop = ioloop.IOLoop.current()
    loop.call_later(timeout, callback=functools.partial(http_client.fetch, url.strip(), handle_request, method='GET'))
Note: I only tested this code with a small number of requests. It is possible that the value of i changes underneath you, because handle_request subtracts from it. If that's the case, maintain another variable similar to i and perform the subtraction on that one instead.
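If you are on a newer Tornado (5+), another way to express the same throttle is to batch the fetches and pace the batches with the coroutine API. This is only a sketch (not tested against the real API), assuming 3000 requests per 10-second window and a raised max_clients:

import time
from tornado import gen, ioloop, httpclient

BATCH_SIZE = 3000   # requests allowed per rate-limit window
WINDOW = 10         # window length in seconds

# allow more simultaneous connections than the default 10
httpclient.AsyncHTTPClient.configure(None, max_clients=100)

async def fetch_all(urls):
    client = httpclient.AsyncHTTPClient()
    for start in range(0, len(urls), BATCH_SIZE):
        batch = urls[start:start + BATCH_SIZE]
        began = time.time()
        # fire one batch concurrently and wait for it to finish
        responses = await gen.multi(
            [client.fetch(u, method='HEAD', raise_error=False) for u in batch])
        for r in responses:
            print(r.code)
        # if the batch finished early, sleep out the rest of the window
        elapsed = time.time() - began
        if elapsed < WINDOW and start + BATCH_SIZE < len(urls):
            await gen.sleep(WINDOW - elapsed)

if __name__ == '__main__':
    urls = [line.strip() for line in open('urls.txt')]
    ioloop.IOLoop.current().run_sync(lambda: fetch_all(urls))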
Looking to add a delay between ThreadPoolExecutor workers
checked = 0
session_number = 0.00

def parseCombo(file_name, config_info, threads2):
    try:
        global checked
        global session_number
        ### some non related code here
        if config_info["Parsing"] == "First":
            system("title " + config_info["name"] + " [+] Start Parsing...")
            with concurrent.futures.ThreadPoolExecutor(max_workers=threads2) as executor:
                for number in numbers:
                    number = number.strip()
                    executor.submit(call_master, number, config_info)
                time.sleep(20)
So basically, with those threads I'm making requests to an API that is limited to 2 requests per second.
The maximum number of workers I can set is 2; if I raise it, the API blocks all requests.
Each task takes 60 seconds to complete. What I want to do is set max_workers to 10, have the executor start with 2 tasks ("workers"), wait 3 seconds to start the next 2, and so on, without waiting for the first 2 to finish.
The next thing is to shut down the executor when one of the tasks returns a value.
def beta(url, number, config_info):
    try:
        ## some non related code here
        global checked
        global session_number
        session_number = session_number
        checked += 1
        if '999' in text:
            track = re.findall('\999[0-9]+\.?[0-9]*', text)
            num = float(track[0].replace("999", ""))
            session_number += num
            log(number, track[0], config_info, url, text)
Is there a way to shut down the executor from inside beta?
If possible, I need it to shut down when this line is triggered:
log(number, track[0], config_info, url, text)
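For what it's worth, here is a sketch (not the original code) of how the staggering and the early shutdown could be wired together: submit the numbers in pairs with a 3-second pause between submissions, and use a threading.Event as a stop flag, since an executor cannot easily be shut down from inside one of its own worker functions. do_request below is a hypothetical stand-in for the real request logic in call_master/beta:

import time
import threading
import concurrent.futures

stop_flag = threading.Event()

def call_master(number, config_info):
    if stop_flag.is_set():                   # skip queued work once a hit was found
        return
    hit = do_request(number, config_info)    # hypothetical stand-in for the real request logic
    if hit:                                  # i.e. the place where log(...) fires
        stop_flag.set()

def parse_combo(numbers, config_info):
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        for i in range(0, len(numbers), 2):
            if stop_flag.is_set():
                break                        # stop submitting new work
            for number in numbers[i:i + 2]:
                executor.submit(call_master, number.strip(), config_info)
            time.sleep(3)                    # start 2 tasks every 3 seconds
    # leaving the with-block waits for tasks that already started;
    # on Python 3.9+ executor.shutdown(cancel_futures=True) could also
    # drop queued tasks that have not started yet.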
I'm creating a very simple HTTP server in Python 2 for testing.
I would like to randomly delay the reply to a GET request by a fixed amount, without closing the connection or shutting down the server first.
Here is the current code I would like to modify:
# Only runs under Python 2
import BaseHTTPServer
import time
from datetime import datetime

class SimpleRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):

    def do_GET(self):
        print "incoming request: " + self.path
        self.wfile.write('HTTP/1.0 200 OK\r\n\r\n')
        self.wfile.write(modes(self.path))

def run(server_class=BaseHTTPServer.HTTPServer,
        handler_class=SimpleRequestHandler):
    server_address = ('', 80)
    httpd = server_class(server_address, handler_class)
    httpd.serve_forever()

def modes(argument):
    # returns a value based on the time of day to simulate data
    curr_time = datetime.now()
    seconds = int(curr_time.strftime('%S'))
    if seconds < 20:
        return "reply1"
    elif seconds >= 20 and seconds < 40:
        return "reply2"
    else:
        return "reply3"

run()
The delay would be added every, say, 3rd or 4th time a GET is received, but when the delay times out a properly formed response would still be sent.
Thanks.
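One way this could be done, as a sketch only (untested, reusing the modes() helper above, with a hypothetical fixed delay of 2 seconds applied to every 3rd GET):

import time
import BaseHTTPServer

DELAY_SECONDS = 2      # the fixed delay to apply (assumed value)
DELAY_EVERY = 3        # delay every 3rd GET (assumed value)

class SimpleRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    request_count = 0  # shared by all requests handled through this class

    def do_GET(self):
        SimpleRequestHandler.request_count += 1
        if SimpleRequestHandler.request_count % DELAY_EVERY == 0:
            # the socket stays open; we simply answer late
            time.sleep(DELAY_SECONDS)
        self.wfile.write('HTTP/1.0 200 OK\r\n\r\n')
        self.wfile.write(modes(self.path))   # modes() as defined above

Note that the stock BaseHTTPServer handles one request at a time, so while one reply is sleeping any other client also waits; mixing in SocketServer.ThreadingMixIn would avoid that if it matters.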
I have a checkin list containing about 600,000 checkins, and there is a URL in each checkin that I need to expand back to the original long one. I do so with:
now = time.time()
files_without_url = 0

for i, checkin in enumerate(NYC_checkins):
    try:
        foursquare_url = urllib2.urlopen(re.search("(?P<url>https?://[^\s]+)", checkin[5]).group("url")).url
    except:
        files_without_url += 1
    if i % 1000 == 0:
        print("from %d to %d: %2.5f seconds" % (i - 1000, i, time.time() - now))
        now = time.time()
But this takes far too long: from 0 to 1000 checkins it takes 3241 seconds! Is this normal? What's the most efficient way to expand URLs in Python?
MODIFIED: Some URLs are from Bitly while others are not, and I am not sure where they come from. In this case, I simply want to use the urllib2 module.
For your information, here is an example of checkin[5]:
I'm at The Diner (2453 18th Street NW, Columbia Rd., Washington) w/ 4 others. http...... (this is the short url)
I thought I would expand on my comment regarding the use of multiprocessing to speed up this task.
Let's start with a simple function that will take a url and resolve it as far as possible (following redirects until it gets a 200 response code):
import requests

def resolve_url(url):
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException:
        return (url, None)

    if r.status_code != 200:
        longurl = None
    else:
        longurl = r.url

    return (url, longurl)
This will either return a (shorturl, longurl) tuple, or it will
return (shorturl, None) in the event of a failure.
Now, we create a pool of workers:
import multiprocessing
pool = multiprocessing.Pool(10)
And then ask our pool to resolve a list of urls:
resolved_urls = []
for shorturl, longurl in pool.map(resolve_url, urls):
    resolved_urls.append((shorturl, longurl))
Using the above code...
With a pool of 10 workers, I can resolve 500 URLs in 900 seconds.
If I increase the number of workers to 100, I can resolve 500 URLs in 30 seconds.
If I increase the number of workers to 200, I can resolve 500 URLs in 25 seconds.
This is hopefully enough to get you started.
(NB: you could write a similar solution using the threading module rather than multiprocessing. I usually reach for multiprocessing first, but in this case either would work, and threading might even be slightly more efficient.)
Threads are most appropriate for network I/O. But you could try the following first.
import re
import urllib2
from urllib2 import URLError

pat = re.compile("(?P<url>https?://[^\s]+)")  # always compile it

missing_urls = 0
bad_urls = 0

def check(checkin):
    match = pat.search(checkin[5])
    if not match:
        global missing_urls
        missing_urls += 1
    else:
        url = match.group("url")
        try:
            urllib2.urlopen(url)  # don't look up .url if you don't need it later
        except URLError:  # or just Exception
            global bad_urls
            bad_urls += 1

for i, checkin in enumerate(NYC_checkins):
    check(checkin)

print(bad_urls, missing_urls)
If you get no improvement, then, now that we have a nice check function, create a threadpool and feed it. Speedup is practically guaranteed. Using separate processes for network I/O is pointless.
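For reference, a minimal sketch of that threadpool step, using multiprocessing.dummy (same Pool API as multiprocessing, but backed by threads). The pool size of 50 is a guess to tune, and strictly speaking the bare global counters in check are not thread-safe, so the counts could drift slightly unless you add a lock:

from multiprocessing.dummy import Pool as ThreadPool   # thread-backed Pool, same API

pool = ThreadPool(50)            # number of threads is a guess; tune it
pool.map(check, NYC_checkins)    # check() updates bad_urls / missing_urls
pool.close()
pool.join()
print(bad_urls, missing_urls)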
pub.py
import redis
import datetime
import time
import json
import sys
import threading
import gevent
from gevent import monkey
monkey.patch_all()

def main(chan):
    redis_host = '10.235.13.29'
    r = redis.client.StrictRedis(host=redis_host, port=6379)
    while True:
        def getpkg():
            package = {'time': time.time(),
                       'signature': 'content'
                       }
            return package

        #test 2: complex data
        now = json.dumps(getpkg())

        # send it
        r.publish(chan, now)
        print 'Sending {0}'.format(now)
        print 'data type is %s' % type(now)
        time.sleep(1)

def zerg_rush(n):
    for x in range(n):
        t = threading.Thread(target=main, args=(x,))
        t.setDaemon(True)
        t.start()

if __name__ == '__main__':
    num_of_chan = 10
    zerg_rush(num_of_chan)
    cnt = 0
    stop_cnt = 21
    while True:
        print 'Waiting'
        cnt += 1
        if cnt == stop_cnt:
            sys.exit(0)
        time.sleep(30)
sub.py
import redis
import threading
import time
import json
import gevent
from gevent import monkey
monkey.patch_all()

def callback(ind):
    redis_host = '10.235.13.29'
    r = redis.client.StrictRedis(host=redis_host, port=6379)
    sub = r.pubsub()
    sub.subscribe(str(ind))
    start = False
    avg = 0
    tot = 0
    sum = 0
    while True:
        for m in sub.listen():
            if not start:
                start = True
                continue
            got_time = time.time()
            decoded = json.loads(m['data'])
            sent_time = float(decoded['time'])
            dur = got_time - sent_time
            tot += 1
            sum += dur
            avg = sum / tot

            print decoded  #'Recieved: {0}'.format(m['data'])
            file_name = 'logs/sub_%s' % ind
            f = open(file_name, 'a')
            f.write('processing no. %s' % tot)
            f.write('it took %s' % dur)
            f.write('current avg: %s\n' % avg)
            f.close()

def zerg_rush(n):
    for x in range(n):
        t = threading.Thread(target=callback, args=(x,))
        t.setDaemon(True)
        t.start()

def main():
    num_of_chan = 10
    zerg_rush(num_of_chan)
    while True:
        print 'Waiting'
        time.sleep(30)

if __name__ == '__main__':
    main()
I am testing Redis pub/sub to replace the use of rsh to communicate with remote boxes.
One of the things I have tested is how the number of channels affects the latency of publish and pubsub.listen().
Test: one publisher and one subscriber per channel (the publisher publishes every second). I incremented the number of channels from 10 up to 300 and observed the latency (the duration from the moment the publisher publishes a message to the moment the subscriber gets the message via listen).
num of chan        avg latency in seconds
10                 0.004453
50                 0.005246
100                0.0155
200                0.0221
300                0.0621
Note: tested on a VM with 2 CPUs + 4 GB RAM + 1 NIC, running RHEL 6.4.
What can I do to maintain low latency with a high number of channels?
Redis is single-threaded, so adding more CPUs won't help. Maybe more RAM? If so, how much more?
Is there anything I can do code-wise, or is the bottleneck in Redis itself?
Maybe the limitation comes from the way my test code is written with threading?
EDIT:
Redis Cluster vs ZeroMQ in Pub/Sub, for horizontally scaled distributed systems
Accepted answer says "You want to minimize latency, I guess. The number of channels is irrelevant. The key factors are the number of publishers and number of subscribers, message size, number of messages per second per publisher, number of messages received by each subscriber, roughly. ZeroMQ can do several million small messages per second from one node to another; your bottleneck will be the network long before it's the software. Most high-volume pubsub architectures therefore use something like PGM multicast, which ZeroMQ supports."
From my testing, I don't know whether this is true (the claim that the number of channels is irrelevant).
For example, I ran a test:
1) One channel. 100 publishers publishing to a channel with 1 subscriber listening. Each publisher publishes once per second. Latency was 0.00965 seconds.
2) The same test, except with 1000 publishers. Latency was 0.00808 seconds.
Now compare that with my channel testing:
300 channels with 1 pub and 1 sub each resulted in 0.0621 seconds. That is only 600 connections, which is fewer than in the test above, yet the latency is significantly worse.
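One code-level variation that might be worth measuring (a suggestion, not something tested above): subscribe to all channels from a single connection with psubscribe instead of one thread and one connection per channel, so the subscriber side does not grow with the channel count. A minimal sketch:

import time
import json
import redis

r = redis.StrictRedis(host='10.235.13.29', port=6379)
sub = r.pubsub()
sub.psubscribe('*')                  # one connection covering every channel

for m in sub.listen():
    if m['type'] != 'pmessage':      # skip the subscription confirmation
        continue
    decoded = json.loads(m['data'])
    print '%s: %f seconds' % (m['channel'], time.time() - float(decoded['time']))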
I want to handle the Search API rate limit of 180 requests / 15 minutes. The first solution I came up with was to check the remaining requests in the header and wait 900 seconds. See the following snippet:
results = search_interface.cursor(search_interface.search, q=k, lang=lang, result_type=result_mode)

while True:
    try:
        tweet = next(results)
        if limit_reached(search_interface):
            sleep(900)
        self.writer(tweet)
    # (exception handling omitted in this snippet)

def limit_reached(search_interface):
    remaining_rate = int(search_interface.get_lastfunction_header('X-Rate-Limit-Remaining'))
    return remaining_rate <= 2
But it seems that the header information is not reset to 180 after it reaches the two remaining requests.
The second solution I came up with was to handle the twython exception for the rate limitation and wait the remaining amount of time:
results = search_interface.cursor(search_interface.search, q=k, lang=lang, result_type=result_mode)

while True:
    try:
        tweet = next(results)
        self.writer(tweet)
    except TwythonError as inst:
        logger.error(inst.msg)
        wait_for_reset(search_interface)
        continue
    except StopIteration:
        break

def wait_for_reset(search_interface):
    reset_timestamp = int(search_interface.get_lastfunction_header('X-Rate-Limit-Reset'))
    now_timestamp = datetime.now().timestamp()
    seconds_offset = 10
    t = reset_timestamp - now_timestamp + seconds_offset
    logger.info('Waiting {0} seconds for Twitter rate limit reset.'.format(t))
    sleep(t)
But with this solution I receive the message INFO: Resetting dropped connection: api.twitter.com and the loop does not continue with the last element of the generator. Has somebody faced the same problems?
Regards.
Just rate limiting yourself is my suggestion (assuming you are constantly hitting the limit ...).
import time

QUERY_PER_SEC = 15 * 60 / 180.0  # 180 per 15 minutes,
                                 # i.e. ~5 seconds per query

class TwitterBot:
    last_update = 0

    def doQuery(self, *args, **kwargs):
        tdiff = time.time() - self.last_update
        if tdiff < QUERY_PER_SEC:
            time.sleep(QUERY_PER_SEC - tdiff)
        self.last_update = time.time()
        return search_interface.cursor(*args, **kwargs)
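Hypothetical usage of the wrapper above, where search_interface is the Twython client from the question (one caveat: cursor() returns a generator, so the underlying requests happen later as you iterate and are not individually delayed by doQuery):

bot = TwitterBot()
results = bot.doQuery(search_interface.search, q=k, lang=lang, result_type=result_mode)
for tweet in results:
    writer(tweet)   # hypothetical sink; in the question this is self.writer(tweet)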