Concurrently hit URLs and process results - Python

I have a requirement where I need to hit up to 2000 URLs per minute and save the responses to a database. The URLs need to be hit within 5 seconds of the start of every minute (but the responses can wait). Then, at the next minute, the same happens again, and so on. So it's time-critical.
I've tried using Python multiprocessing and threading to solve the problem. However, some URLs may take up to 30 minutes to respond, which blocks all other URLs from being processed.
I'm also open to using something lower level such as C, but don't know where to start.
Any guidance in the right direction will help, thanks.

You need something lighter than a thread, since if each URL can block for a long time you'll need to send them all simultaneously instead of via a thread pool.
gevent is a coroutine-based Python library built on greenlet and the libevent (now libev) event loop, and it's good at this sort of thing. From its docs:
>>> import gevent
>>> from gevent import socket
>>> urls = ['www.google.com', 'www.example.com', 'www.python.org']
>>> jobs = [gevent.spawn(socket.gethostbyname, url) for url in urls]
>>> gevent.joinall(jobs, timeout=2)
>>> [job.value for job in jobs]
['74.125.79.106', '208.77.188.166', '82.94.164.162']
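
That example only resolves hostnames. A hedged sketch (mine, not from the gevent docs; the URL list is illustrative) of applying the same pattern to actual HTTP fetches: monkey-patch the standard library so urllib2 blocks cooperatively, then spawn one greenlet per URL, so a slow URL only stalls its own greenlet:

from gevent import monkey; monkey.patch_all()  # must run before other imports
import gevent
import urllib2  # cooperatively blocking after patch_all(); Python 2 era

urls = ['http://www.google.com', 'http://www.example.com']

def fetch(url):
    return urllib2.urlopen(url).read()

jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=5)  # stop waiting once the 5 s budget is spent
bodies = [job.value for job in jobs if job.successful()]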

I am not sure I have understood the problem correctly, but if you are using 'n' processes and all 'n' of them get stuck waiting on a response, then changing the language will not solve your issue, since the bottleneck is the server you are requesting, not your local driver code. You can eliminate this dependency by switching to an asynchronous mechanism: do not wait for the response! Let a callback handle it for you!
EDIT: You might want to have a look at https://github.com/kennethreitz/grequests
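A hedged sketch of what that might look like with grequests, assuming its documented map() API (the URL list, timeout, and size values here are illustrative):

import grequests

urls = ['http://www.google.com', 'http://www.example.com']
reqs = (grequests.get(u, timeout=30) for u in urls)
responses = grequests.map(reqs, size=100)  # at most 100 requests in flight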


Put a time limit on a request

I have a program and, in order to verify that the user doesn't download overly large files via input, I need a time limit on how long each request is allowed to take.
Does anyone know a good way to put a time limit (/lifetime) on each Python requests GET request, so that if it takes 10 seconds an exception will be thrown?
Thanks
You can define your own timeout like:
requests.get('https://github.com/', timeout=0.001)
You can pass an additional timeout parameter to every request you make. This is always recommended, as it makes your code more robust against hanging indefinitely in case you don't receive a response from the other end.
requests.get('https://github.com/', timeout=0.001)
Read the official Python requests documentation on timeouts here.
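A short sketch of catching the resulting exception (the 10 s value is illustrative); note that requests' timeout bounds the wait for individual reads on the socket, not the total download time:

import requests

try:
    resp = requests.get('https://github.com/', timeout=10)
except requests.exceptions.Timeout:
    # covers both connect and read timeouts
    print('request took too long')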

Hanging threads using async requests with connectionpool

For a project we need to request data through an API (HTTP/1.1). Depending on what you find, you can then send a request.post with instructions, after which the API sends back a response. I made the program multithreaded so the main program keeps requesting data, and when I want to post an instruction I spawn a thread to do it. (Requesting data takes only 1 second, whereas posting an instruction and getting the reply can take up to 3 seconds.)
However, the problem I am running into is that sometimes one of my threads hangs and only finishes if I call thread.join(). I can tell it hangs because the data I get in the main thread should reflect my previous instructions (sent by the threads); I allow a 5-second period for the server to reflect the instructions I sent earlier, so it is not the case that the server is simply not yet updated. If I then send the same instructions again, I find that both instructions make it to the server (the hanging one and the newly issued one). So somehow sending the new instructions has the side effect that the previous instructions also get sent.
The problem looks related to threading, as my code doesn't hang when executed serially. Looking at posts like this didn't help, as I do not know in advance what my instructions need to be for my asynchronous requests. It is important that I use persistent connections and reuse them, as that saves a lot of time on handshakes etc.
My questions:
What is a proper way of handling a connection pool of persistent connections in a multithreaded way (so it doesn't hang)?
How can I debug/troubleshoot the thread to find out where it hangs?
The requests package is often recommended, but maybe there are others better suited for this kind of application?
Example Code:
import requests
from threading import Thread

req = requests.Session()
adapter = requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10)
req.mount('', adapter)  # empty prefix: use this adapter for every URL
url = 'http://python-requests.org'

def main(url=''):
    thread_list = []
    counter = 0
    while True:
        resp = req.get(url)
        interesting = 1  # placeholder for inspecting resp
        if interesting:
            instructions = {}
            a = Thread(target=send_instructions,
                       kwargs=dict(url=url, instructions=instructions))
            a.start()
            thread_list.append(a)
        # drop finished threads (is_alive() replaces the deprecated isAlive())
        thread_list = [x for x in thread_list if x.is_alive()]
        if counter > 10:
            break
        counter += 1

def send_instructions(url='', instructions=''):
    resp = req.post(url, headers=instructions)
    print(resp)

main(url)
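
On question 1, one pattern that is sometimes suggested (a sketch under my own assumptions; the names thread_local and get_session, the worker count, and the timeout are illustrative) is to stop sharing a single Session across threads and instead give each worker thread its own Session via threading.local:

import threading
import requests
from concurrent.futures import ThreadPoolExecutor

thread_local = threading.local()
url = 'http://python-requests.org'

def get_session():
    # lazily create one Session (with its own connection pool) per thread
    if not hasattr(thread_local, 'session'):
        thread_local.session = requests.Session()
    return thread_local.session

def send_instructions(url, instructions):
    # a timeout makes a hung request raise instead of blocking forever
    resp = get_session().post(url, headers=instructions, timeout=10)
    return resp.status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(send_instructions, url, {}) for _ in range(5)]
    for f in futures:
        print(f.result())

Connections are still reused within each thread, so the handshake savings are kept. On question 2, calling faulthandler.dump_traceback(all_threads=True) from the main thread prints every thread's current stack, which shows exactly where a hung thread is blocked.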

Issue with sending POST requests using the requests library

import requests

while True:
    try:
        posting = requests.post(url, json=data, headers=headers, timeout=3.05)
    # if a connection error occurs, start from the beginning of the loop
    except requests.exceptions.ConnectionError as e:
        continue
    # if a read timeout occurs, start from the beginning of the loop
    except requests.exceptions.ReadTimeout as e:
        continue
A link to more code: Multiple accidental POST requests in Python
This code uses the requests library to perform POST requests indefinitely. I noticed that when the try fails several times and the while loop restarts over and over, then once I finally manage to send the POST request, I find multiple entries on the server side at the same second. I was writing to a txt file at the same time and it showed one entry only. Each entry is 5 readings. Is this an issue with the library itself? Is there a way to fix this? No matter what kind of conditions I put in, it still doesn't work!
You can see the reading at 12:11:13 has 6 parameters per second, while at 12:14:30 (after the delay; it should be every 10 seconds) there are several entries in the same second: 3 entries that make up 18 readings in one second, instead of only 6.
It looks like the server receives your requests and acts upon them, but fails to respond in time (3 s is a pretty low timeout; a load spike/paging operation can easily make the server miss it unless it employs special measures). I'd suggest that you:
Process requests asynchronously (e.g. spawn threads; Asynchronous Requests with Python requests discusses ways to do this with requests) and do not use timeouts (TCP has its own timeouts; let it fail instead).
Reuse the connection(s) (TCP has quite a bit of overhead for establishing/tearing down connections) or use UDP instead.
Include some "hints" (IDs, timestamps, etc.) to prevent the server from adding duplicate records. (I'd call this one a workaround, as the real problem is that you're not making sure your request was processed.) A sketch of this follows the server-side note below.
From the server side, you may want to:
Respond ASAP and act upon the info later. Do not let a pending action prevent you from answering further requests.
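
A minimal sketch of that third "hints" suggestion (the payload field names batch_id and readings are illustrative assumptions, not from the question):

import uuid
import requests

def post_readings(url, readings):
    payload = {
        'batch_id': str(uuid.uuid4()),  # server drops batches it has seen
        'readings': readings,
    }
    while True:
        try:
            return requests.post(url, json=payload, timeout=30)
        except requests.exceptions.RequestException:
            continue  # safe to retry: duplicates share the same batch_id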

urlopen freezes at random, timeout is ignored

I have an API manager that connects to a URL and grabs some JSON. Very simple.
Cut from the method:
import socket
from urllib2 import Request, urlopen  # Python 2; urllib.request on Python 3

req = Request(url)
socket.setdefaulttimeout(timeout)
resp = urlopen(req, None, timeout)
data = resp.read()
resp.close()
It works fine most of the time, but at random intervals it takes 5 s to complete the request, even when the timeout is set to 0.5 or 1.0 or whatever.
I have logged it very closely, so I am 100% sure that the line that takes time is the urlopen call (i.e. resp = urlopen(req, None, timeout)).
I've tried all the solutions I've found on the topic of timeout decorators, timers, etc.
(To list some of them:
Python urllib2.urlopen freezes script infinitely even though timeout is set,
How can I force urllib2 to time out?, Timing out urllib2 urlopen operation in Python 2.4, Timeout function if it takes too long to finish
)
But nothing works. My impression is that the thread freezes while urlopen does something, and when it's done it unfreezes; then all the timers and timeouts return with timeout errors, but the execution time is still more than 5 s.
I've found this old mailing list thread regarding urllib2 and its handling of chunked encoding. So if the problem is still present, the solution might be to write a custom urlopen based on httplib.HTTP instead of httplib.HTTPConnection.
Another possible solution is to try some multithreading magic...
Both solutions seem too aggressive. And it bugs me that the timeout does not work all the way.
It is very important that the execution time of the script does not exceed 0.5 s. Does anyone know why I am experiencing the freezes, or a way to solve this?
Update based on the accepted answer:
I changed my approach and used curl instead. Together with the Unix timeout command it works just as I want. Example code follows:
from subprocess import Popen, PIPE

t_timeout = str(API_TIMEOUT_TIME)
c_timeout = str(CURL_TIMEOUT_TIME)
cmd = ['timeout', t_timeout, 'curl', '--max-time', c_timeout, url]
prc = Popen(cmd, stdout=PIPE, stderr=PIPE)
response = prc.communicate()
Since curl only accepts an integer timeout, I wrapped it in the timeout command, which accepts floats.
Looking through the source code, the timeout value is actually the maximum amount of time that Python will wait between receiving packets from the remote host.
So if you set the timeout to two seconds, and the remote host sends 60 packets at the rate of one packet per second, the timeout will never occur, although the overall process will still take 60 seconds.
Since the urlopen() function doesn't return until the remote host has finished sending all the HTTP headers, if it sends them very slowly there's not much you can do about it.
If you need an overall time limit, you'll probably have to implement your own HTTP client with non-blocking I/O.
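
Short of writing a full non-blocking client, one hedged workaround (my sketch, not from the answer; the pool size and deadline are illustrative) is to run the blocking call in a worker thread and stop waiting once an overall deadline passes. The abandoned worker still finishes in the background, but it no longer blocks the caller:

from concurrent.futures import ThreadPoolExecutor, TimeoutError
from urllib.request import urlopen  # urllib2.urlopen on Python 2

pool = ThreadPoolExecutor(max_workers=10)

def fetch_with_deadline(url, deadline=0.5):
    future = pool.submit(lambda: urlopen(url, timeout=deadline).read())
    try:
        return future.result(timeout=deadline)  # overall wall-clock limit
    except TimeoutError:
        return None  # deadline exceeded; give up on this response

This enforces a wall-clock budget (the 0.5 s requirement above) even when the per-read timeout never fires.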

python - faster downloading of ~500 webpages (loop)

For starters, I'm new to Python, so my code below may not be the cleanest. For a program I need to download about 500 webpages. The URLs are stored in an array which is populated by a previous function. The downloading part goes something like this:
import urllib  # Python 2; urllib.request.urlretrieve on Python 3

def downloadpages(num):
    for i in range(num):
        urllib.urlretrieve(downloadlist[i], 'webpages/' + names[i] + '.htm')
Each file is only around 20 KB, but it takes at least 10 minutes to download all of them. Downloading a single file of the total combined size would only take a minute or two. Is there a way I can speed this up? Thanks
Edit: To anyone who is interested: following the example at http://code.google.com/p/workerpool/wiki/MassDownloader and using 50 threads, the download time has been reduced to about 20 seconds from the original 10 minutes plus. The download time continues to decrease as threads are added, up until around 60 threads, after which it begins to rise again.
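For reference, a hedged sketch of the same idea using only the standard library instead of workerpool (the placeholder downloadlist and names stand in for the question's arrays; the thread count mirrors the edit above):

import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve  # urllib.urlretrieve on Python 2

downloadlist = ['http://www.example.com/']  # placeholder for the real array
names = ['example']

def grab(i):
    urlretrieve(downloadlist[i], 'webpages/' + names[i] + '.htm')

os.makedirs('webpages', exist_ok=True)
with ThreadPoolExecutor(max_workers=50) as pool:
    # schedules one download per index across the 50 worker threads
    list(pool.map(grab, range(len(downloadlist))))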
But you're not downloading a single file here. You're downloading 500 separate pages; each connection involves overhead (for the initial connection), plus whatever else the server is doing (is it serving other people?).
Either way, downloading 500 × 20 KB is not the same as downloading a single file of that size.
You can speed up execution significantly by using threads (be careful, though, not to overload the server).
Intro material/Code samples:
http://docs.python.org/library/threading.html
Python Package For Multi-Threaded Spider w/ Proxy Support?
http://code.google.com/p/workerpool/wiki/MassDownloader
You can use greenlets to do this, e.g. with the eventlet lib:
import eventlet
from eventlet.green import urllib2  # a cooperatively blocking urllib2

urls = [url1, url2, ...]

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
    print "got body", len(body)
All calls in the pool will be pseudo-simultaneous.
Of course you must install eventlet with pip or easy_install first.
There are several greenlet implementations in Python; you could do the same with gevent or another one.
In addition to using concurrency of some sort, make sure whatever method you're using to make the requests uses HTTP/1.1 connection persistence. That will allow each thread to open only a single connection and request all the pages over it, instead of doing a TCP/IP setup/teardown for each request. Not sure if urllib2 does that by default; you might have to roll your own.
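A hedged sketch of connection persistence using requests (my suggestion; the answer itself only discusses urllib2, and the URL list is illustrative). A Session pools and reuses HTTP/1.1 connections per host:

import requests

urls = ['http://www.example.com/page1', 'http://www.example.com/page2']

session = requests.Session()  # keep-alive: reuses pooled connections per host
for url in urls:
    page = session.get(url, timeout=10).content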
