This is a stripped-down version of the script that causes continually increasing memory usage; I've seen it go past 600 MB after 2 minutes:
import requests
import grequests

lines = (grequests.get(l.strip(), timeout=15) for l in open('links.txt') if len(l.strip()))
for r in grequests.imap(lines, size=20):
    if r.ok:
        print r.url
links.txt is a file containing a large number of URLs; the problem happens with several large groups of URLs that I have collected. It seems to me like the response objects may not be being dereferenced?
I updated gevent, requests and grequests today, here are their versions:
In [2]: gevent.version_info
Out[2]: (1, 0, 0, 'beta', 3)
In [5]: requests.__version__
Out[5]: '0.13.5'
grequests doesn't have a version number that I could find.
Thanks in advance for any answers.
This answer is just a cross-reference for people who might need this link.
I use the imap function and requests.Session to reduce the memory usage while making 380k requests in my scripts.
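For reference, here is a minimal, untested sketch of that pattern, adapted to the question's links.txt example: one shared Session so connections are reused, and imap so responses are handled one at a time rather than accumulated in a list.

import requests
import grequests

session = requests.Session()
urls = (l.strip() for l in open('links.txt') if l.strip())
reqs = (grequests.get(u, session=session, timeout=15) for u in urls)
for r in grequests.imap(reqs, size=20):
    # imap can yield None for requests that failed outright
    if r is not None and r.ok:
        print r.url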
From my point of view, this is caused by trying to open all of the links at the same time. Try something like this:
links = set(links)
while links:
    # take at most 200 links per batch instead of all of them at once
    batch = [links.pop() for _ in range(min(200, len(links)))]
    calls = (grequests.get(url) for url in batch)
    for r in grequests.imap(calls, size=20):
        ...rest of your code
This code is not tested and you will probably find a nicer solution, but it should be enough to show that you are simply trying to open too many links at the same time, and that is what consumes your memory.
The project's requests library dependency should be updated.
Older versions of requests, including the one used in the question example, would not pre-fetch any response content by default, leaving it up to you to consume the data. This leaves open references to the underlying socket, so that even if the request session is garbage collected, the socket won't be garbage collected until the response goes out of scope or response.content is called.
In later versions of requests, responses are pre-fetched by default and session connections are closed explicitly if the session was created ad hoc for fulfilling a module-level get/post/etc request such as those made by grequests when a session isn't passed in. This is covered in requests GitHub issue #520.
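As a stopgap while still on the older requests release from the question, a hedged, untested sketch of the usual workaround is to consume each response as soon as you're done with it (reusing the lines generator from the question), so the socket reference can be dropped:

for r in grequests.imap(lines, size=20):
    if r.ok:
        print r.url
    r.content  # force the body to be consumed so the underlying socket can be released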
Related
I'm hosting a server on localhost and I want to fire hundreds of GET requests asynchronously. For this I am using grequests. Everything appears to work fine but I repeatedly get the warning:
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: date.jsontest.com
A search shows how the full pool issue can be avoided when creating a Session() in requests e.g. here. However, a couple of things:
Even if I don't take any steps to avoid the warning, I appear to consistently get the expected results. If I do use the workaround, any requests over the number of the pool_maxsize will give a warning.
The linked workaround will still result in the warning if the number of requests exceeds the pool size. I assumed there would be some kind of throttling to prevent the pool size being exceeded at any one time
I can't seem to find a way to disable the warning. requests.packages.urllib3.disable_warnings() doesn't seem to do anything.
So my questions are:
What does this warning actually mean? My interpretation is that it is simply dropping the requests from firing, but it doesn't seem to be the case.
Is this warning actually relevant for the grequests library, especially when I take steps to limit the pool size? Am I inviting unexpected behaviour and fluking my expected result in my tests?
Is there a way to disable it?
Some code to test:
import grequests
import requests
requests.packages.urllib3.disable_warnings() # Doesn't seem to work?
session = requests.Session()
# Commenting out ("hashing") the adapter setup below will cause 105 warnings instead of 5
adapter = requests.adapters.HTTPAdapter(pool_connections=100,
                                        pool_maxsize=100)
session.mount('http://', adapter)
# Test query
query_list = ['http://date.jsontest.com/' for x in xrange(105)]
rs = [grequests.get(item, session=session) for item in query_list]
responses = grequests.map(rs)
print len([item.json() for item in responses])
1) What does this warning actually mean? My interpretation is that it is simply dropping the requests from firing, but it doesn't seem to be the case.
This is actually still unclear to me. Even firing one request was enough to get the warning but would still give me the expected response.
2) Is this warning actually relevant for the grequests library, especially when I take steps to limit the pool size? Am I inviting unexpected behaviour and fluking my expected result in my tests?
For the last part: yes. The server I was communicating with could handle 10 queries concurrently. With the following code I could send 400 or so requests in a single list comprehension and everything worked out fine (i.e. my server never got swamped, so the requests must have been throttled in some way). After some tipping point in the number of requests, the code would stop firing any requests and simply give a list of None. It's not as though it even tried to get through the list; it didn't even fire the first query, it just blocked up.
sess = requests.Session()
adapter = requests.adapters.HTTPAdapter(pool_connections=10,
                                        pool_maxsize=10)
sess.mount('http://', adapter)
# Launching ~500 or more requests will suddenly cause this to fail
rs = [grequests.get(item[0], session=sess) for item in queries]
responses = grequests.map(rs)
3) Is there a way to disable it?
Yes, if you want to be a doofus like me and comment it out ("hash it out") in the urllib3 source code. I couldn't find any other way to silence it, and it came back to bite me.
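One alternative worth noting (not part of the original answer, and untested here): this particular message is emitted through urllib3's logger rather than through the warnings module, which would also explain why disable_warnings() has no effect. Raising the log level of the vendored connectionpool logger should silence it:

import logging

# Assumes the urllib3 bundled inside requests (the old vendored layout);
# for a standalone urllib3 the logger name is "urllib3.connectionpool".
logging.getLogger("requests.packages.urllib3.connectionpool").setLevel(logging.ERROR)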
SOLUTION
The solution was a painless transition to using requests-futures instead. The following code behaves exactly as expected, gives no warnings and, thus far, scales to any number of queries that I throw at it.
from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=10)
fire_requests = [session.get(url) for url in queries]
responses = [item.result() for item in fire_requests]
import requests

while True:
    try:
        posting = requests.post(url, json=data, headers=headers, timeout=3.05)
    except requests.exceptions.ConnectionError as e:
        continue
    # If a read timeout error occurs, start from the beginning of the loop
    except requests.exceptions.ReadTimeout as e:
        continue
A link to more code: Multiple accidental POST requests in Python
This code uses the requests library to perform POST requests indefinitely. I noticed that when the try fails several times and the while loop starts over several times, then once the POST request finally goes through, I find multiple entries on the server side recorded at the same second. I was writing to a txt file at the same time and it showed one entry only. Each entry is 5 readings. Is this an issue with the library itself? Is there a way to fix this? No matter what kind of conditions I put in, it still doesn't work.
You can notice that the reading at 12:11:13 has 6 parameters in that second, while at 12:14:30 (after the delay; it should be one entry every 10 seconds) there are several entries at the same second: 3 entries that make up 18 readings in one second, instead of only 6.
It looks like the server receives your requests and acts upon them but fails to respond in time (3 s is a pretty low timeout; a load spike or paging operation can easily make the server miss it unless it employs special measures). I'd suggest that you:
process requests asynchronously (e.g. spawn threads; Asynchronous Requests with Python requests discusses ways to do this with requests) and do not use timeouts (TCP has its own timeouts, let it fail instead).
reuse the connection(s) (TCP has quite a bit of overhead for connection establishing/breaking) or use UDP instead.
include some "hints" (IDs, timestamps etc.) so the server can detect and skip duplicate records; a minimal sketch of this idea follows below. (I'd call this one a workaround, as the real problem is that you're not making sure your request was processed.)
From the server side, you may want to:
Respond ASAP and act upon the info later. Do not let pending action prevent answering further requests.
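Here is a minimal, untested sketch of the "hints" idea mentioned above. The field name request_id and the assumption that the server deduplicates on it are illustrative, not part of the original code:

import uuid
import requests

payload = dict(data)                        # `data` and `headers` as in the question's code
payload['request_id'] = str(uuid.uuid4())   # generated once, reused on every retry

while True:
    try:
        requests.post(url, json=payload, headers=headers, timeout=3.05)
        break
    except (requests.exceptions.ConnectionError,
            requests.exceptions.ReadTimeout):
        continue  # safe to retry: the server can drop duplicates by request_id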
I have a requirement where I need to hit up to 2000 URLs per minute and save the response to a database. The URLs need to be hit within 5 seconds of the start of every minute (but the response can wait). Then, at the next minute, the same will happen, and so on. So it's time critical.
I've tried using Python multiprocessing and threading to solve the problem. However, some URLs may take up to 30 minutes to respond, which blocks all other URLs from being processed.
I'm also open to using something lower level such as C, but don't know where to start.
Any guidance in the right direction will help, thanks.
You need something lighter than a thread, since if each URL can block for a long time then you'll need to send them all simultaneously instead of via a thread pool.
gevent is a Python coroutine library built on top of a fast event loop (libevent/libev) that's good at this sort of thing. From their docs:
>>> import gevent
>>> from gevent import socket
>>> urls = ['www.google.com', 'www.example.com', 'www.python.org']
>>> jobs = [gevent.spawn(socket.gethostbyname, url) for url in urls]
>>> gevent.joinall(jobs, timeout=2)
>>> [job.value for job in jobs]
['74.125.79.106', '208.77.188.166', '82.94.164.162']
I am not sure if I have understood the problem correctly, but if you are using 'n' processes and all 'n' of them get stuck waiting on a response, then changing the language will not solve your issue, since the bottleneck is the server you are requesting from, not your local driver code. You can eliminate this dependency by switching to an asynchronous mechanism: do not wait for the response, let a callback handle it for you!
EDIT: You might want to have a look at https://github.com/kennethreitz/grequests
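As a rough, untested sketch of what that could look like with grequests (the urls list, the pool size, the timeout and the save_to_db helper are all placeholders, not part of the original answer):

import grequests

reqs = (grequests.get(u, timeout=1800) for u in urls)  # allow very slow responses
for resp in grequests.imap(reqs, size=200):            # 200 concurrent greenlets
    if resp is not None:
        save_to_db(resp.url, resp.text)                 # hypothetical helper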
For starters I'm new to python so my code below may not be the cleanest. For a program I need to download about 500 webpages. The url's are stored in an array which is populated by a previous function. The downloading part goes something like this:
def downloadpages(num):
    import urllib
    for i in range(0, numPlanets):
        urllib.urlretrieve(downloadlist[i], 'webpages/' + names[i] + '.htm')
Each file is only around 20 KB, but it takes at least 10 minutes to download all of them. Downloading a single file of the total combined size should only take a minute or two. Is there a way I can speed this up? Thanks
Edit: To anyone who is interested, following the example at http://code.google.com/p/workerpool/wiki/MassDownloader and using 50 threads, the download time has been reduced to about 20 seconds from the original 10 minutes plus. The download speed continues to decrease as the threads are increased up until around 60 threads, after which the download time begins to rise again.
But you're not downloading a single file here. You're downloading 500 separate pages; each connection involves overhead (for the initial connection), plus whatever else the server is doing (is it serving other people?).
Either way, downloading 500 x 20 KB is not the same as downloading a single file of that size.
You can speed up execution significantly by using threads (be careful, though, not to overload the server); a minimal threaded sketch follows after the links below.
Intro material/Code samples:
http://docs.python.org/library/threading.html
Python Package For Multi-Threaded Spider w/ Proxy Support?
http://code.google.com/p/workerpool/wiki/MassDownloader
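A minimal, untested Python 2 sketch of the threaded approach (downloadlist, names and numPlanets are the variables from the question; the thread count of 50 matches the edit above):

import threading
import urllib
from Queue import Queue, Empty

q = Queue()
for i in range(numPlanets):
    q.put((downloadlist[i], 'webpages/' + names[i] + '.htm'))

def worker():
    while True:
        try:
            url, filename = q.get_nowait()
        except Empty:
            return                      # queue drained, this worker is done
        urllib.urlretrieve(url, filename)

threads = [threading.Thread(target=worker) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()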
You can use greenlets to do so.
E.g. with the eventlet lib:
import eventlet
from eventlet.green import urllib2

urls = [url1, url2, ...]

def fetch(url):
    return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
    print "got body", len(body)
All calls in the pool will be pseudo-simultaneous.
Of course you must install eventlet with pip or easy_install beforehand.
There are several implementations of greenlets in Python. You could do the same with gevent or another one.
In addition to using concurrency of some sort, make sure whatever method you're using to make the requests uses HTTP 1.1 connection persistence. That will allow each thread to open only a single connection and request all the pages over that, instead of having a TCP/IP setup/teardown for each request. Not sure if urllib2 does that by default; you might have to roll your own.
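For illustration, a hedged Python 2 sketch (host and paths are placeholders, not from the original) of reusing one HTTP/1.1 connection for several requests with httplib, instead of a new TCP setup/teardown per page:

import httplib

conn = httplib.HTTPConnection('example.com')
for path in ['/page1.htm', '/page2.htm', '/page3.htm']:
    conn.request('GET', path)
    resp = conn.getresponse()
    body = resp.read()        # read fully so the connection can be reused
    print path, len(body)
conn.close()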
I'm building a download manager in Python for fun, and sometimes the connection to the server stays open but the server doesn't send me any data, so the read method (of HTTPResponse) blocks forever. This happens, for example, when I download from a server located outside of my country that limits the bandwidth to other countries.
How can I set a timeout for the read method (2 minutes for example)?
Thanks, Nir.
If you're stuck on some Python version < 2.6, one (imperfect but usable) approach is to do
import socket
socket.setdefaulttimeout(10.0) # or whatever
before you start using httplib. The docs are here, and clearly state that setdefaulttimeout is available since Python 2.3 -- every socket made from the time you do this call, to the time you call the same function again, will use that timeout of 10 seconds. You can use getdefaulttimeout before setting a new timeout, if you want to save the previous timeout (including none) so that you can restore it later (with another setdefaulttimeout).
These functions and idioms are quite useful whenever you need to use some older higher-level library which uses Python sockets but doesn't give you a good way to set timeouts (of course it's better to use updated higher-level libraries, e.g. the httplib version that comes with 2.6 or the third-party httplib2 in this case, but that's not always feasible, and playing with the default timeout setting can be a good workaround).
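A small sketch of the save/restore idiom described above (untested; make_request stands in for whatever httplib-based code you're wrapping):

import socket

previous = socket.getdefaulttimeout()     # may be None
socket.setdefaulttimeout(120.0)           # 2 minutes, as in the question
try:
    data = make_request()                 # hypothetical httplib-based call
finally:
    socket.setdefaulttimeout(previous)    # restore whatever was there before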
You have to set it during HTTPConnection initialization.
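On Python 2.6+ (where the module is httplib; it is http.client in Python 3) that looks roughly like this; the host, path and the 2-minute value are placeholders:

import httplib  # http.client in Python 3

conn = httplib.HTTPConnection('example.com', timeout=120)
conn.request('GET', '/big-file')
resp = conn.getresponse()      # socket operations now time out after 120 s
data = resp.read()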
Note: in case you are using an older version of Python, you can install httplib2; by many it is considered a superior alternative to httplib, and it does support timeouts.
I've never used it, though, and I'm just reporting what documentation and blogs are saying.
Setting the default timeout might abort a download early if it's large, as opposed to only aborting if it stops receiving data for the timeout value. httplib2 is probably the way to go.
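A hedged httplib2 sketch (untested; the URL is a placeholder): Http() accepts a timeout in seconds that applies to its socket operations.

import httplib2

h = httplib2.Http(timeout=120)
resp, content = h.request('http://example.com/big-file', 'GET')
print resp.status, len(content)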
5 years later but hopefully this will help someone else...
I was racking my brain trying to figure this out. My problem was a server returning corrupt content and thus giving back less data than it claimed it had.
I came up with a nasty solution that seems to be working properly. Here it goes:
# NOTE: directly disabling blocking is not necessary, but it represents
# an important piece of the problem, so I am leaving it here (commented out).
# http_response.fp._sock.socket.setblocking(0)
http_response.fp._sock.settimeout(read_timeout)
http_response.read(chunk_size)
NOTE: This solution also works for any library that sits on top of normal Python sockets (which should be most of them), including requests. You just have to go a few levels deeper:
# resp.raw._fp.fp._sock.socket.setblocking(0)  # optional, see the note above
resp.raw._fp.fp._sock.settimeout(read_timeout)
resp.raw.read(chunk_size)
As of this writing, I have not tried the following, but in theory it should work:
resp = requests.get(some_url, stream=True)
# resp.raw._fp.fp._sock.socket.setblocking(0)  # optional, see the note above
resp.raw._fp.fp._sock.settimeout(read_timeout)
for chunk in resp.iter_content(chunk_size):
    # do stuff
    pass
Explanation
I stumbled upon this approach when reading this SO question about setting a timeout on socket.recv.
At the end of the day, any HTTP request has a socket. For requests (which wraps httplib via urllib3) that socket is located at resp.raw._fp.fp._sock.socket; for plain httplib it is http_response.fp._sock.socket, as in the first snippet. The resp.raw._fp.fp._sock is a socket._fileobject (which I honestly didn't look far into), and I imagine its settimeout method internally sets it on the socket attribute.