Python Requests Not Cleaning up Connections and Causing Port Overflow?

I'm doing something fairly outside of my comfort zone here, so hopefully I'm just doing something stupid.
I have an Amazon EC2 instance which I'm using to run a specialized database, which is controlled through a webapp inside of Tomcat that provides a REST API. On the same server, I'm running a Python script that uses the Requests library to make hundreds of thousands of simple queries to the database (I don't think it's possible to consolidate the queries, though I am going to try that next.)
The problem: after running the script for a bit, I suddenly get a broken pipe error on my SSH terminal. When I try to log back in with SSH, I keep getting "operation timed out" errors. So I can't even log back in to terminate the Python process and instead have to reboot the EC2 instance (which is a huge pain, especially since I'm using ephemeral storage).
My theory is that each time Requests makes a REST call, it opens a pair of ports between Python and Tomcat but never closes them when it's done. So Python keeps grabbing more and more ports and eventually either somehow grabs and locks the SSH port (booting me off), or it simply uses up all the ports and that causes the network stack to crap out somehow (as I said, I'm out of my depth).
I also tried using httplib2, and was getting a similar problem.
Any ideas? If my port theory is correct, is there a way to force requests to surrender the port when it's done? Or otherwise is there at least a way to tell Ubuntu to keep the SSH port off-limits so that I can at least log back in and terminate the process?
Or is there some sort of best practice to using Python to make lots and lots of very simple REST calls?
Edit:
Solved...do:
s = requests.session()
s.config['keep_alive'] = False
before making the request, to force Requests to release the connection when it's done.

My speculation:
https://github.com/kennethreitz/requests/blob/develop/requests/models.py#L539 sets conn to connectionpool.connection_from_url(url)
That leads to https://github.com/kennethreitz/requests/blob/develop/requests/packages/urllib3/connectionpool.py#L562, which leads to https://github.com/kennethreitz/requests/blob/develop/requests/packages/urllib3/connectionpool.py#L167.
This eventually leads to https://github.com/kennethreitz/requests/blob/develop/requests/packages/urllib3/connectionpool.py#L185:
def _new_conn(self):
    """
    Return a fresh :class:`httplib.HTTPConnection`.
    """
    self.num_connections += 1
    log.info("Starting new HTTP connection (%d): %s" %
             (self.num_connections, self.host))
    return HTTPConnection(host=self.host, port=self.port)
I would suggest hooking a handler up to that logger, and listening for lines that match that one. That would let you see how many connections are being created.
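A hedged sketch of that idea, attaching a counting handler to the connection-pool logger (the logger name is the one that appears in Requests' own log lines; the ConnectionCounter class is just an illustrative name, not part of Requests):

import logging

# Count "Starting new HTTP connection" records to see how many sockets are
# actually being opened.
class ConnectionCounter(logging.Handler):
    def __init__(self):
        logging.Handler.__init__(self)
        self.count = 0

    def emit(self, record):
        if "Starting new HTTP connection" in record.getMessage():
            self.count += 1

counter = ConnectionCounter()
pool_logger = logging.getLogger("requests.packages.urllib3.connectionpool")
pool_logger.setLevel(logging.INFO)
pool_logger.addHandler(counter)

# ... make your REST calls here, then inspect counter.count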

Figured it out...Requests has a default Keep-Alive policy on connections, which you have to explicitly override by doing
s = requests.session()
s.config['keep_alive'] = False
before you make a request.
From the doc:
"""
Keep-Alive
Excellent news — thanks to urllib3, keep-alive is 100% automatic within a session! Any requests that you make within a session will automatically reuse the appropriate connection!
Note that connections are only released back to the pool for reuse once all body data has been read; be sure to either set prefetch to True or read the content property of the Response object.
If you’d like to disable keep-alive, you can simply set the keep_alive configuration to False:
s = requests.session()
s.config['keep_alive'] = False
"""
There may be a subtle bug in Requests here, because I WAS reading the .text and .content properties and it was still not releasing the connections. But explicitly setting keep_alive to False fixed the problem.
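(For what it's worth, later Requests releases dropped the session config dict; a hedged sketch of an equivalent there is to send a standard Connection: close header, so the server closes the socket after each response:)

import requests

# Not the original fix above, just a sketch for newer Requests versions;
# the URL is a hypothetical local Tomcat endpoint.
response = requests.get("http://localhost:8080/api/query",
                        headers={"Connection": "close"})
print(response.status_code)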

Related

How does Postgres Server know to keep a database connection open

I wonder how the Postgres server determines when to close a DB connection if I forgot to close it on the Python side.
Does the Postgres server send a ping to the client code? From my understanding, this is not possible.
PostgreSQL indeed does something like that, although it is not a ping.
PostgreSQL uses a TCP feature called keepalive. Once enabled for a socket, the operating system kernel will regularly send keepalive messages to the other party (the peer), and if it doesn't get an answer after a couple of tries, it closes the connection.
The default timeouts for keepalive are pretty long, in the vicinity of two hours. You can configure the settings in PostgreSQL, see the documentation for details.
The default values and possible values vary according to the operating system used.
There is a similar feature available for the client side, but it is less useful and not enabled by default.
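For the client side, here is a hedged sketch using psycopg2, which passes libpq's keepalive options straight through to the connection (the DSN values and timings are made up; tune them for your setup):

import psycopg2

# Ask the kernel to probe an idle connection after 60 s, every 10 s,
# and give up after 3 unanswered probes.
conn = psycopg2.connect(
    dbname="mydb", user="me", host="db.example.com",  # hypothetical DSN
    keepalives=1,
    keepalives_idle=60,
    keepalives_interval=10,
    keepalives_count=3,
)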
When your script quits, your connection will close and the server will clean it up accordingly. Likewise, it's often the case in garbage-collected languages like Python that when you stop using the connection and it falls out of scope, it will be closed and cleaned up.
It is possible to write code that never releases these resources properly and just perpetually creates new handles, which can be problematic if you don't have something server-side that kills them after some period of idle time. Postgres doesn't do this by default, though it can be configured to; MySQL does.
In short, Postgres will keep a database connection open until you kill it either explicitly, such as via a close call, or implicitly, when the handle falls out of scope and is deleted by the garbage collector.
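A hedged sketch of the explicit path, using psycopg2 with contextlib.closing so the connection is closed even if the query raises (the database name and query are placeholders):

import contextlib

import psycopg2

with contextlib.closing(psycopg2.connect(dbname="mydb")) as conn:  # hypothetical DSN
    with conn:                        # commits or rolls back the transaction
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            print(cur.fetchone())
# conn.close() has run here, so the server tears down its side immediately
# rather than waiting for the garbage collector.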

Why do I constantly see "Resetting dropped connection" when uploading data to my database?

I'm uploading hundreds of millions of items via a REST API, from a cloud server on Heroku to a database on AWS EC2. I'm using Python, and I constantly see the following INFO log message in the logs.
[requests.packages.urllib3.connectionpool] [INFO] Resetting dropped connection: <hostname>
This "resetting of the dropped connection" seems to take many seconds (sometimes 30+ sec) before my code continues to execute again.
Firstly, what exactly is happening here, and why?
Secondly, is there a way to stop the connection from dropping so that I can upload data faster?
Thanks for your help.
Andrew.
Requests uses Keep-Alive by default. "Resetting dropped connection", as I understand it, means a connection that should have stayed alive was dropped somehow. Possible reasons are:
The server doesn't support Keep-Alive.
There has been no data transfer on the established connection for a while, so the server drops it.
See https://stackoverflow.com/a/25239947/2142577 for more details.
The problem is really that the server has closed the connection even though the client has requested it be kept alive.
This is not necessarily because the server doesn't support keepalives; it could be that the server is configured to allow only a certain number of requests per connection. This could be done to help spread requests across different servers, but I think this practice is/was common as a practical defence against poorly written server-side code (e.g. PHP) that doesn't clean up after itself after serving a request (perhaps due to an error condition, etc.)
If you think this is the case for you and you'd like to not see these logs (which are logged at INFO level), then you can add the following to quieten that part of the logging:
import logging

# Really don't need to hear about connections being brought up again after the server has closed them
logging.getLogger("requests.packages.urllib3.connectionpool").setLevel(logging.WARNING)
This is common practice for services that expose RESTful APIs to avoid abuse (or DoS).
If you're stressing their API, they'll drop your connection.
Try getting your script to sleep a bit every once in a while to avoid the drop.
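A hedged sketch of that throttling idea (the endpoint, batch size, and delay are made-up values to adjust):

import time

import requests

session = requests.Session()               # one session so keep-alive can work
for i, item in enumerate(items):           # `items` is whatever you are uploading
    session.post("https://api.example.com/upload", json=item)  # hypothetical endpoint
    if (i + 1) % 100 == 0:
        time.sleep(1.0)                    # brief pause every 100 requests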

python s3 boto connection.close causes an error

I have code that writes files to S3. The code was working fine:
import json

from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(BUCKET, validate=False)
k = Key(bucket)
k.key = self.filekey
k.set_metadata('Content-Type', 'text/javascript')
k.set_contents_from_string(json.dumps(self.output))
k.set_acl(FILE_ACL)
This was working just fine. Then I noticed I wasn't closing my connection so I added this line at the end:
conn.close()
Now, the file writes as before but, I'm seeing this error in my logs now
S3Connection instance has no attribute '_cache', unable to write file
Anyone see what I'm doing wrong here or know what's causing this? I noticed that none of the tutorials on boto show people closing connections but I know you should close your connections for IO operations as a general rule...
EDIT
A note about this, when I comment out conn.close() the error disappears
I can't find that error message in the latest boto source code, so unfortunately I can't tell you what caused it. Recently, we had problems when we were NOT calling conn.close(), so there definitely is at least one case where you must close the connection. Here's my understanding of what's going on:
S3Connection (well, its parent class) handles almost all connectivity details transparently, and you shouldn't have to think about closing resources, reconnecting, etc. This is why most tutorials and docs don't mention closing resources. In fact, I only know of one situation where you should close resources explicitly, which I describe at the bottom. Read on!
Under the covers, boto uses httplib. This client library supports HTTP 1.1 Keep-Alive, so it can and should keep the socket open so that it can perform multiple requests over the same connection.
AWS will close your connection (socket) for two reasons:
According to the boto source code, "AWS starts timing things out after three minutes." Presumably "things" means "idle connections."
According to Best Practices for Using Amazon S3, "S3 will accept up to 100 requests before it closes a connection (resulting in 'connection reset')."
Fortunately, boto works around the first case by recycling stale connections well before three minutes are up. Unfortunately, boto doesn't handle the second case quite so transparently:
When AWS closes a connection, your end of the connection goes into CLOSE_WAIT, which means that the socket is waiting for the application to execute close(). S3Connection handles connectivity details so transparently that you cannot actually do this directly! It's best to prevent it from happening in the first place.
So, circling back to the original question of when you need to close explicitly, if your application runs for a long time, keeps a reference to (reuses) a boto connection for a long time, and makes many boto S3 requests over that connection (thus triggering a "connection reset" on the socket by AWS), then you may find that more and more sockets are in CLOSE_WAIT. You can check for this condition on linux by calling netstat | grep CLOSE_WAIT. To prevent this, make an explicit call to boto's connection.close before you've made 100 requests. We make hundreds of thousands of S3 requests in a long running process, and we call connection.close after every, say, 80 requests.
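A hedged sketch of that workaround, recycling the boto connection before AWS's per-connection request limit is reached (the iterable of work items is hypothetical, and 80 is the batch size mentioned above, not a hard API limit):

from boto.s3.connection import S3Connection

REQUESTS_PER_CONNECTION = 80

conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(BUCKET, validate=False)

for i, (key_name, payload) in enumerate(work_items):   # hypothetical work list
    key = bucket.new_key(key_name)
    key.set_contents_from_string(payload)
    if (i + 1) % REQUESTS_PER_CONNECTION == 0:
        conn.close()                                    # drop the old socket cleanly
        conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
        bucket = conn.get_bucket(BUCKET, validate=False)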

cherrypy check connection status

I have a cherrypy api that is intended to run for a long time on the server.
I have an unreliable client that can die or close connection for various reasons that are out of my control.
During the time my server api runs, I want to periodically check the status of the connection, making sure the client is still listening and abort my operation if the client has gone away.
I could not find any good place describing how to poll the connection status while serving a cherrypy request.
One example of such a long run is computing the MD5 of multiple big files (tens of GB each) in small chunks with a limited buffer (limited memory).
I don't need any solutions that shorten the runtime since that is not my goal here. I want to keep this connection open for as long as I can, but abort if it is closed.
Here is a simple sample of my code:
@cherrypy.expose
def foo(self):
    cherrypy.response.headers['Content-Type'] = 'text/plain'

    def run():
        for result in get_results():  # get_results() is the heavy method mentioned above
            yield json.dumps(result)

    return run()
foo._cp_config = {'response.stream': True}
The only reliable way to know that the client has died is to try to write some data to the socket, which for CherryPy can be done with yield. You must yield non-empty strings, so you'd have to be returning a Content-Type that can handle some filler text, like some extra spaces after the opening <head> tag of an HTML document. If the client closes the connection, the CherryPy server will stop requesting additional yielded data from the handler (and call any close method on the generator so you can clean up).
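A hedged sketch of that approach, built on the question's own handler (get_results() is the question's heavy generator; abort_heavy_work() is a hypothetical cleanup hook):

import json

import cherrypy

class Api(object):
    @cherrypy.expose
    def foo(self):
        cherrypy.response.headers['Content-Type'] = 'text/plain'

        def run():
            try:
                for result in get_results():
                    # Each yield writes to the socket; if the client is gone,
                    # CherryPy stops iterating and closes the generator instead.
                    yield json.dumps(result) + "\n"
            except GeneratorExit:
                abort_heavy_work()        # stop the expensive computation
                raise

        return run()

    foo._cp_config = {'response.stream': True}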
As far as I know, CherryPy doesn't provide you with any mechanism to detect that a client died. It will only tell you if a response took too long to complete (and therefore be sent out).
You may refer to this SO thread for more information.

How can I ignore the server response to save bandwidth?

I am using a server to send some piece of information to another server every second. The problem is that the other server's response is a few kilobytes, and this consumes bandwidth on the first server (about 2 GB in an hour). I would like to send the request and ignore the response (not even receive it, to save bandwidth).
I use a small Python script (using urllib) for this task. I don't mind using any other tool or even another language if that will make the request only.
A 5 KB reply is small stuff and is probably below your OS's standard TCP window size. This means that even if you close your network connection just after sending the request and checking only the very first bytes of the reply (to be sure the request has really been received), the server has probably already sent you the whole answer and the packets are already on the wire or on your computer.
If you cannot control (i.e. trim down) what the server's reply to your notification is, the only alternative I can think of is to add another server on the remote machine that waits for a simple command, performs the real request locally, and just sends the result code back to you. This can be done very easily, maybe even just with bash/perl/python using, for example, netcat/wget locally.
By the way there is something strange in your math as Glenn Maynard correctly wrote in a comment.
For HTTP, you can send a HEAD request instead of GET or POST:
import urllib2
request = urllib2.Request('https://stackoverflow.com/q/5049244/')
request.get_method = lambda: 'HEAD' # override get_method
response = urllib2.urlopen(request) # make request
print response.code, response.url
Output
200 https://stackoverflow.com/questions/5049244/how-can-i-ignore-server-response-to-save-bandwidth
See How do you send a HEAD HTTP request in Python?
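For completeness, a hedged equivalent using the Requests library discussed elsewhere on this page:

import requests

# HEAD asks for the headers only, so the response body is never transmitted.
response = requests.head("https://stackoverflow.com/q/5049244/", allow_redirects=True)
print(response.status_code, response.url)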
Sorry, but this does not make much sense and is likely a violation of the HTTP protocol. I consider such an idea weird and broken by design. Either make the remote server shut up, or reconfigure your application (or whatever is running on the remote server) at a different protocol level, using a smarter protocol with less bandwidth usage. Everything else can hardly be considered anything but nonsense.
