requests + grequests: is the "Connection pool is full, discarding connection:" warning relevant? - python

I'm hosting a server on localhost and I want to fire hundreds of GET requests asynchronously. For this I am using grequests. Everything appears to work fine but I repeatedly get the warning:
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: date.jsontest.com
A search shows how the full-pool issue can be avoided when creating a Session() in requests, e.g. here. However, a couple of things:
Even if I don't take any steps to avoid the warning, I appear to consistently get the expected results.
Even with the linked workaround, any requests beyond pool_maxsize still give the warning. I assumed there would be some kind of throttling to prevent the pool size being exceeded at any one time.
I can't seem to find a way to disable the warning. requests.packages.urllib3.disable_warnings() doesn't seem to do anything.
So my questions are:
What does this warning actually mean? My interpretation is that it is simply dropping the requests from firing, but it doesn't seem to be the case.
Is this warning actually relevant for the grequests library, especially when I take steps to limit the pool size? Am I inviting unexpected behaviour and fluking my expected result in my tests?
Is there a way to disable it?
Some code to test:
import grequests
import requests

requests.packages.urllib3.disable_warnings()  # Doesn't seem to work?

session = requests.Session()

# Commenting out ("hashing") the adapter setup below causes 105 warnings instead of 5
adapter = requests.adapters.HTTPAdapter(pool_connections=100,
                                        pool_maxsize=100)
session.mount('http://', adapter)

# Test query
query_list = ['http://date.jsontest.com/' for x in xrange(105)]
rs = [grequests.get(item, session=session) for item in query_list]
responses = grequests.map(rs)
print len([item.json() for item in responses])

1) What does this warning actually mean? My interpretation is that it
is simply dropping the requests from firing, but it doesn't seem to be
the case.
This is actually still unclear to me. Even firing one request was enough to get the warning but would still give me the expected response.
2) Is this warning actually relevant for the grequests library,
especially when I take steps to limit the pool size? Am I inviting
unexpected behaviour and fluking my expected result in my tests?
For the last part: yes. The server I was communicating with could handle 10 queries concurrently. With the code below I could send 400 or so requests in a single list comprehension and everything worked out fine, i.e. my server never got swamped, so grequests must have been throttling in some way. Beyond some tipping point in the number of requests, though, the code would stop firing any requests at all and simply give back a list of None. It's not as though it even tried to work through the list; it didn't even fire the first query, it just blocked up.
sess = requests.Session()
adapter = requests.adapters.HTTPAdapter(pool_connections=10,
                                        pool_maxsize=10)
sess.mount('http://', adapter)

# Launching ~500 or more requests will suddenly cause this to fail
rs = [grequests.get(item[0], session=sess) for item in queries]
responses = grequests.map(rs)
3) Is there a way to disable it?
Yes, if you want to be a doofus like me and comment ("hash") it out in the urllib3 source code. I couldn't find any other way to silence it, and it came back to bite me.
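A less drastic option than editing the source, assuming the message is emitted through the standard logging module (it is a log record, which is why requests.packages.urllib3.disable_warnings() has no effect on it; that call only covers warnings-module warnings), would be to raise the level of the bundled urllib3 logger:

import logging

# Hide WARNING-level records (including "Connection pool is full, discarding
# connection") emitted by the urllib3 copy bundled inside requests.
logging.getLogger('requests.packages.urllib3').setLevel(logging.ERROR)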
SOLUTION
The solution was a painless transition to using requests-futures instead. The following code behaves exactly as expected, gives no warnings and, thus far, scales to any number of queries that I throw at it.
from requests_futures.sessions import FuturesSession
session = FuturesSession(max_workers = 10)
fire_requests = [session.get(url) for url in queries]
responses = [item.result() for item in fire_requests]
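One caveat worth noting (my addition, not from the original post): result() re-raises any exception the underlying request hit, so if some URLs can fail you may want to catch that per future rather than let one bad URL break the whole comprehension. A minimal sketch:

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=10)
fire_requests = [session.get(url) for url in queries]

responses = []
for future in fire_requests:
    try:
        responses.append(future.result())
    except Exception:              # e.g. requests.exceptions.ConnectionError
        responses.append(None)     # keep the list aligned with `queries`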

Related

Put a time limit on a request

I have a program and, in order to verify that the user doesn't download overly large files via the input, I need a time limit on how long each request is allowed to take.
Does anyone know a good way to put a time limit (lifetime) on each Python requests GET request, so that if it takes, say, 10 seconds an exception will be thrown?
Thanks
You can define your own timeout like:
requests.get('https://github.com/', timeout=0.001)
You can pass an additional timeout parameter to every request you make. This is always recommended, as it keeps your code from hanging indefinitely when you never receive a response from the other end.
requests.get('https://github.com/', timeout=0.001)
Read the official Python requests documentation on timeouts here.
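To get the behaviour asked for in the question (an exception if a request takes more than 10 seconds), a minimal sketch; note that timeout bounds the connect and per-read waits, not the total download time:

import requests

try:
    response = requests.get('https://github.com/', timeout=10)
except requests.exceptions.Timeout:
    # Raised when the server fails to connect or send data within 10 seconds
    print("Request timed out")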

Python 'requests' GET in loop eventually throws [WinError 10048]

Disclaimer: This is similar to some other questions relating to this error, but my program is not using any multi-threading/processing and I'm working with the 'requests' module instead of raw socket commands, so none of the solutions I saw apply to my issue.
I have a basic status-checking program running Python 3.4 on Windows that uses a GET request to pull some data off a status site hosted by a number of servers I have to keep watch over. The core code is set up like this:
import requests
import time

URL_LIST = [some, list, of, the, status, sites]  # https:// sites
session = requests.session()
previous_data = ""

while 1:
    data = ""
    for url in URL_LIST:
        headers = {'X-Auth-Token': Associated_Auth_Token}
        try:
            status = session.get(url, headers=headers).json()['status']
        except ConnectionError:
            status = "SERVER DOWN"
        data += "%s \t%s\n" % (url, status)
    if data != previous_data:
        print(data)
        previous_data = data
    time.sleep(15)
...which typically runs just fine for hours (this script is intended to run 24/7 and has additional logging built in that I left out here for simplicity and relevance), but eventually it crashes and throws the error mentioned in the title:
[WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
The servers I'm requesting from are notoriously slow at times (and sometimes go down entirely, hence the try/except), so my inclination would be that, after looping over and over, eventually a request has not fully finished before the next one comes through and Windows trips over itself. But I don't see how that could happen, since my code iterates through the URLs serially.
Also, if this is a TIME_WAIT issue, as some other related posts ran into, I'd rather not have to wait for it to clear, since I'd like to update every 15 seconds or better. So I considered closing and opening a new requests session every so often, since the script typically works fine for hours before hitting a snag, but based on Lukasa's comment here:
To avoid getting sockets in TIME_WAIT, the best thing to do is to use a single Session object at as high a scope as you can and leave it open for the lifetime of your program. Requests will do its best to re-use the sockets as much as possible, which should prevent them lapsing into TIME_WAIT
...it sounds like that is not a good idea - though when he says 'lifetime of your program' he may not intend the statement to include 24/7 use as in my case.
So instead of blindly trying things and waiting some number of hours for the program to crash again so I can see if the error changes, I wanted to consult the wealth of knowledge here first to see if anyone can see what's going wrong and knows how I should fix it.
Thanks!

Issue with sending POST requests using the library requests

import requests

while True:
    try:
        posting = requests.post(url, json=data, headers=headers, timeout=3.05)
    except requests.exceptions.ConnectionError as e:
        continue
    # If a read_timeout error occurs, start from the beginning of the loop
    except requests.exceptions.ReadTimeout as e:
        continue
A link to more code: Multiple accidental POST requests in Python
This code uses the requests library to perform POST requests indefinitely. I noticed that when the try block fails several times and the while loop restarts over and over, once a POST finally does get through I find multiple entries on the server side from the same second. I was writing to a txt file at the same time and it showed only one entry. Each entry is 5 readings. Is this an issue with the library itself? Is there a way to fix this? No matter what conditions I put in, it still doesn't work.
You can see that the reading at 12:11:13 has 6 parameters for that second, while at 12:14:30 (after the delay; it should be one entry every 10 seconds) there are several entries in the same second: 3 entries making up 18 readings in one second instead of only 6.
It looks like the server receives your requests and acts upon them but fails to respond in time (3 s is a pretty low timeout; a load spike or paging operation can easily make the server miss it unless it employs special measures). I'd suggest that you:
Process requests asynchronously (e.g. spawn threads; Asynchronous Requests with Python requests discusses ways to do this with requests) and do not use timeouts (TCP has its own timeouts, let it fail instead).
Reuse the connection(s) (TCP has quite a bit of overhead for establishing and tearing down connections) or use UDP instead.
Include some "hints" (IDs, timestamps etc.) so the server can detect and discard duplicate records; see the sketch below. I'd call this one a workaround, as the real problem is that you're not making sure your request was processed.
From the server side, you may want to:
Respond ASAP and act upon the info later. Do not let a pending action prevent answering further requests.
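A minimal sketch of the client-side "hint" idea; the request_id field, payload shape and helper name are illustrative, not taken from the original code. The point is that retries resend the same ID, so the server can recognise and drop duplicates instead of inserting the readings again:

import uuid
import requests

def post_readings(url, readings, headers, timeout=10):
    # One ID per logical batch of readings; every retry reuses it.
    payload = {'request_id': str(uuid.uuid4()), 'readings': readings}
    while True:
        try:
            return requests.post(url, json=payload, headers=headers, timeout=timeout)
        except requests.exceptions.RequestException:
            continue  # resend with the same request_id so duplicates stay detectable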

urlopen freezes at random, timout is ignored

I have an API manager that connects to a URL and grabs some JSON. Very simple.
Cut from the method:
import socket
from urllib2 import Request, urlopen

req = Request(url)
socket.setdefaulttimeout(timeout)
resp = urlopen(req, None, timeout)
data = resp.read()
resp.close()
It works fine most of the time, but at random intervals it takes 5 s to complete the request, even when timeout is set to 0.5 or 1.0 or whatever.
I have logged it very closely, so I am 100% sure that the line that takes the time is the urlopen call (i.e. resp = urlopen(req, None, timeout)).
I've tried all the solutions I've found on the topic of timeout decorators, Timers, etc.
(To list some of them:
Python urllib2.urlopen freezes script infinitely even though timeout is set,
How can I force urllib2 to time out?, Timing out urllib2 urlopen operation in Python 2.4, Timeout function if it takes too long to finish
)
But nothing works. My impression is that the thread freezes while urlopen does something; when it's done it unfreezes, and only then do all the timers and timeouts return with timeout errors, but the execution time is still more than 5 s.
I've found this old mailing list thread regarding urllib2 and the handling of chunked encoding. So if the problem is still present, the solution might be to write a custom urlopen based on httplib.HTTP rather than httplib.HTTPConnection.
Another possible solution is to try some multithreading magic...
Both solutions seem too aggressive. And it bugs me that the timeout does not work all the way.
It is very important that the execution time of the script does not exceed 0.5 s. Does anyone know why I am experiencing the freezes, or a way to help me?
Update based on accepted answer:
I changed the approach and use curl instead. Together with the unix timeout command it works just as I want. Example code follows:
from subprocess import Popen, PIPE

t_timeout = str(API_TIMEOUT_TIME)
c_timeout = str(CURL_TIMEOUT_TIME)

cmd = ['timeout', t_timeout, 'curl', '--max-time', c_timeout, url]
prc = Popen(cmd, stdout=PIPE, stderr=PIPE)
response = prc.communicate()
Since curl only accepts an integer --max-time, I added the timeout command as well; timeout accepts floats.
Looking through the source code, the timeout value is actually the maximum amount of time that Python will wait between receiving packets from the remote host.
So if you set the timeout to two seconds, and the remote host sends 60 packets at the rate of one packet per second, the timeout will never occur, although the overall process will still take 60 seconds.
Since the urlopen() function doesn't return until the remote host has finished sending all the HTTP headers, if it sends them very slowly there's not much you can do about it.
If you need an overall time limit, you'll probably have to implement your own HTTP client with non-blocking I/O.
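One way to approximate an overall limit without writing a full non-blocking client is a sketch like the following (my own illustration, not from the answer above): run the blocking call in a worker thread and stop waiting after a deadline. Note that the abandoned thread still runs to completion in the background; it just no longer blocks the caller.

import threading
import urllib2

def fetch_with_deadline(url, deadline=0.5):
    result = {}

    def worker():
        try:
            result['data'] = urllib2.urlopen(url, timeout=deadline).read()
        except Exception as exc:
            result['error'] = exc

    t = threading.Thread(target=worker)
    t.daemon = True        # don't keep the process alive for a stuck fetch
    t.start()
    t.join(deadline)       # wait at most `deadline` seconds overall
    if t.is_alive():
        raise RuntimeError('request exceeded %.1f s overall deadline' % deadline)
    if 'error' in result:
        raise result['error']
    return result['data']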

"Memory Leak" with grequests?

This is a stripped-down version of the script that causes continually increasing memory usage; I've seen it go past 600 MB after 2 minutes:
import requests
import grequests
lines = (grequests.get(l.strip(), timeout=15) for l in open('links.txt') if len(l.strip()))

for r in grequests.imap(lines, size=20):
    if r.ok:
        print r.url
links.txt is a file containing a large number of URLs; the problem happens with several large groups of URLs that I have collected. It seems to me like the response objects may not be being dereferenced?
I updated gevent, requests and grequests today, here are their versions:
In [2]: gevent.version_info
Out[2]: (1, 0, 0, 'beta', 3)
In [5]: requests.__version__
Out[5]: '0.13.5'
grequests doesn't have a version number that I could find.
Thanks in advance for any answers.
This answer is just an alias and link back for people who might need this link.
I use the imap function and requests.Session to reduce the memory usage while making 380k requests in my scripts.
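A minimal sketch of that combination (imap plus a shared Session passed into grequests.get, so connections are pooled and reused rather than opened per request); the file name, timeout and batch size here are illustrative:

import grequests
import requests

session = requests.Session()   # one shared session -> connection reuse

reqs = (grequests.get(url.strip(), session=session, timeout=15)
        for url in open('links.txt') if url.strip())

for r in grequests.imap(reqs, size=20):
    if r.ok:
        print r.url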
From my point of view, it's caused because you try to open all of the links at the same time. Try something like this:
links = set(links)
while links:
    batch = [grequests.get(links.pop()) for _ in range(min(200, len(links)))]
    for r in grequests.imap(batch, size=20):
        # ...rest of your code
This code is not tested and you will find a nicer solution, but it should be proof that you are simply trying to open too many links at the same time, and that is what consumes your memory.
The project's requests library dependency should be updated.
Older versions of requests, including the one used in the question example, would not pre-fetch any response content by default, leaving it up to you to consume the data. This leaves open references to the underlying socket, so that even if the request session is garbage collected, the socket won't be garbage collected until the response goes out of scope or response.content is called.
In later versions of requests, responses are pre-fetched by default and session connections are closed explicitly if the session was created ad hoc for fulfilling a module-level get/post/etc request such as those made by grequests when a session isn't passed in. This is covered in requests GitHub issue #520.
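If upgrading isn't an option, a sketch of the workaround implied above: consume each response inside the loop so the underlying socket can be released before moving on (the None guard is defensive, in case a failed request yields no response object):

import grequests

reqs = (grequests.get(l.strip(), timeout=15)
        for l in open('links.txt') if l.strip())

for r in grequests.imap(reqs, size=20):
    if r is None:
        continue
    r.content          # force the body to be read so the socket can be freed
    if r.ok:
        print r.url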
