Repeated transient-socket connections - memory leak risk? - python

I'm writing a script that opens a text file and loops through each line (pausing a few seconds between each one). For each line, it opens a transient client socket connection and sends the text to a host server. A host response may or may not come; it doesn't matter either way.
I've already bumped into the Python socket limitation where you can't reconnect with an existing socket object (because doing so triggers the exception EBADF, 'Bad file descriptor'). So I'm creating a new socket instance for each transient connection. The trick then of course becomes how to avoid a memory leak.
The way I've approached this is to push the whole portion of creating, using, and closing the socket to a function -- relying on Python's garbage collection to remove each instance after I'm done with it:
import socket,select,time

def transientConnect(host,port,sendData):
    response = ''
    sendSocket = socket.socket()
    sendSocket.connect((host,port))
    sendSocket.send(sendData)
    gotData = select.select([sendSocket],[],[],2)
    if gotData[0]:
        response = sendSocket.recv(65535)
    sendSocket.close()
    return response

scriptLines = open('testScript.txt','r').readlines()

serverHost = '127.0.0.1'
serverPort = 15004

for line in scriptLines:
    response = transientConnect(serverHost,serverPort,line)
    print(response)
    time.sleep(3.0)
My questions: (1) Will this approach work to avert any memory leaks? (2) Is there a more direct way to make sure each instance is eliminated after I'm finished with it?

First off, it is normal to only use a socket for a single exchange. See the socket HOWTO.
One of the nice things about python is that in general you don't have to worry about garbage collection. And you shouldn't unless you have real memory use problems.
From this webpage, keep in mind that:
"Python won’t clean up an object when it goes out of scope. It will clean it up when the last reference to it has gone out of scope."
So if the socket created inside the function isn't referenced elsewhere, it should go out of scope and be deallocated (but not gc-ed). What follows is probably specific to CPython. Read the documentation of gc.set_threshold() for an idea of how garbage collection works in CPython. Especially:
"When the number of allocations minus the number of deallocations exceeds threshold0, collection starts."
The standard values for the thresholds (in CPython) are:
In [2]: gc.get_threshold()
Out[2]: (700, 10, 10)
So there would have to be a fair number of allocations before you get a gc run. You can force garbage collection by running gc.collect().
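As for question (2): rather than relying on garbage collection at all, you can close the socket deterministically. Here is a minimal sketch, reusing the names from the question and wrapping the socket in contextlib.closing so that close() runs even if send() or recv() raises:

import socket
import select
from contextlib import closing

def transientConnect(host, port, sendData, timeout=2.0):
    response = ''
    # closing() calls sendSocket.close() when the with-block exits, even on
    # error, so the descriptor is released without waiting for the collector
    with closing(socket.socket()) as sendSocket:
        sendSocket.connect((host, port))
        sendSocket.send(sendData)   # on Python 3, sendData must be bytes
        gotData = select.select([sendSocket], [], [], timeout)
        if gotData[0]:
            response = sendSocket.recv(65535)
    return response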

Related

How can I do parallel tests in Django with Elasticsearch-dsl?

Has anyone gotten parallel tests to work in Django with Elasticsearch? If so, can you share what configuration changes were required to make it happen?
I've tried just about everything I can think of to make it work, including the solution outlined here. Taking inspiration from how Django itself handles the parallel DBs, I have created a custom ParallelTestSuite that overrides init_worker to iterate through each index/doctype and change the index names, roughly as follows:
_worker_id = 0

def _elastic_search_init_worker(counter):
    global _worker_id

    with counter.get_lock():
        counter.value += 1
        _worker_id = counter.value

    for alias in connections:
        connection = connections[alias]
        settings_dict = connection.creation.get_test_db_clone_settings(_worker_id)
        # connection.settings_dict must be updated in place for changes to be
        # reflected in django.db.connections. If the following line assigned
        # connection.settings_dict = settings_dict, new threads would connect
        # to the default database instead of the appropriate clone.
        connection.settings_dict.update(settings_dict)
        connection.close()

    ### Everything above this is from the Django version of this function ###

    # Update index names in doctypes
    for doc in registry.get_documents():
        doc._doc_type.index += f"_{_worker_id}"

    # Update index names for indexes and create new indexes
    for index in registry.get_indices():
        index._name += f"_{_worker_id}"
        index.delete(ignore=[404])
        index.create()

    print(f"Started thread # {_worker_id}")
This seems to work in general; however, there's some weirdness that happens seemingly at random (i.e. running the test suite again doesn't reliably reproduce the issue, and/or the error messages change). These are the various errors I've gotten, and each test run seems to fail randomly on one of them:
- a 404 raised when trying to create the index in the function above (I've confirmed that it's the 404 coming back from the PUT request, yet the Elasticsearch server logs say the index was created without issue)
- a 500 when trying to create the index, although this one hasn't happened in a while, so I think it was fixed by something else
- query responses that sometimes have no items dictionary value inside the _process_bulk_chunk function from the elasticsearch library
I'm thinking that there's something weird going on at the connection layer (like somehow the connections between Django test runner processes are getting the responses mixed up?), but I'm at a loss as to how that would even be possible, since Django uses multiprocessing to parallelize the tests and thus each runs in its own process. Is it somehow possible that the spun-off processes are still trying to use the connection pool of the original process or something? I'm really at a loss for other things to try from here and would greatly appreciate some hints or even just confirmation that this is in fact possible to do.
"I'm thinking that there's something weird going on at the connection layer (like somehow the connections between Django test runner processes are getting the responses mixed up?), but I'm at a loss as to how that would even be possible, since Django uses multiprocessing to parallelize the tests and thus each runs in its own process. Is it somehow possible that the spun-off processes are still trying to use the connection pool of the original process or something?"
This is exactly what is happening. From the Elasticsearch DSL docs:
"Since we use persistent connections throughout the client it means that the client doesn’t tolerate fork very well. If your application calls for multiple processes make sure you create a fresh client after call to fork. Note that Python’s multiprocessing module uses fork to create new processes on POSIX systems."
What I observed happening is that responses get interleaved very strangely, ending up at a seemingly random client rather than the one that started the request. So a request to index a document might get back the response for creating an index, which has very different attributes on it.
The fix is to ensure that each test worker has its own Elasticsearch client. This can be done by creating worker-specific connection aliases and then overwriting the current connection aliases (via the private attribute _using) with the worker-specific ones. Below is a modified version of the code you posted with that change:
_worker_id = 0

def _elastic_search_init_worker(counter):
    global _worker_id

    with counter.get_lock():
        counter.value += 1
        _worker_id = counter.value

    for alias in connections:
        connection = connections[alias]
        settings_dict = connection.creation.get_test_db_clone_settings(_worker_id)
        # connection.settings_dict must be updated in place for changes to be
        # reflected in django.db.connections. If the following line assigned
        # connection.settings_dict = settings_dict, new threads would connect
        # to the default database instead of the appropriate clone.
        connection.settings_dict.update(settings_dict)
        connection.close()

    ### Everything above this is from the Django version of this function ###

    from elasticsearch_dsl.connections import connections

    # each worker needs its own connection to elasticsearch, the ElasticsearchClient uses
    # global connection objects that do not play nice otherwise
    worker_connection_postfix = f"_worker_{_worker_id}"
    for alias in connections:
        connections.configure(**{alias + worker_connection_postfix: settings.ELASTICSEARCH_DSL["default"]})

    # Update index names in doctypes
    for doc in registry.get_documents():
        doc._doc_type.index += f"_{_worker_id}"
        # Use the worker-specific connection
        doc._doc_type._using = doc._doc_type._using + worker_connection_postfix

    # Update index names for indexes and create new indexes
    for index in registry.get_indices():
        index._name += f"_{_worker_id}"
        index._using = doc._doc_type._using + worker_connection_postfix
        index.delete(ignore=[404])
        index.create()

    print(f"Started thread # {_worker_id}")

Python Twisted: Is it possible to reload certificates on the fly?

I have a simple web server which serves content over HTTPS:
sslContext = ssl.DefaultOpenSSLContextFactory(
    '/home/user/certs/letsencrypt-privkey.pem',
    '/home/user/certs/letsencrypt-fullchain.pem',
)

reactor.listenSSL(
    port=https_server_port,
    factory=website_factory,
    contextFactory=sslContext,
    interface=https_server_interface
)

do_print(bcolors.YELLOW + 'server.py | running https server on ' + https_server_interface + ':' + str(https_server_port) + bcolors.END)
Is it possible to reload the certificates on the fly (for example by calling a path like https://example.com/server/reload-certificates and having it execute some code) or what do I need to do in order to get it done?
I want to avoid restarting the Python process.
It is possible in several ways. Daniel F's answer is pretty good and shows a good, general technique for reconfiguring your server on the fly.
Here are a couple more techniques that are more specific to TLS support in Twisted.
First, you could reload the OpenSSL "context" object from the DefaultOpenSSLContextFactory instance. When it comes time to reload the certificates, run:
sslContext._context = None
sslContext.cacheContext()
The cacheContext call will create a new OpenSSL context, re-reading the certificate files in the process. This does have the downside of relying on a private interface (_context) and its interaction with a not-really-that-public interface (cacheContext).
You could also implement your own version of DefaultOpenSSLContextFactory so that you don't have to rely on these things. DefaultOpenSSLContextFactory doesn't really do much. Here's a copy/paste/edit that removes the caching behavior entirely:
class DefaultOpenSSLContextFactory(ContextFactory):
    """
    L{DefaultOpenSSLContextFactory} is a factory for server-side SSL context
    objects. These objects define certain parameters related to SSL
    handshakes and the subsequent connection.
    """

    _context = None

    def __init__(self, privateKeyFileName, certificateFileName,
                 sslmethod=SSL.SSLv23_METHOD, _contextFactory=SSL.Context):
        """
        @param privateKeyFileName: Name of a file containing a private key
        @param certificateFileName: Name of a file containing a certificate
        @param sslmethod: The SSL method to use
        """
        self.privateKeyFileName = privateKeyFileName
        self.certificateFileName = certificateFileName
        self.sslmethod = sslmethod
        self._contextFactory = _contextFactory

    def getContext(self):
        """
        Return an SSL context.
        """
        ctx = self._contextFactory(self.sslmethod)
        # Disallow SSLv2! It's insecure! SSLv3 has been around since
        # 1996. It's time to move on.
        ctx.set_options(SSL.OP_NO_SSLv2)
        ctx.use_certificate_file(self.certificateFileName)
        ctx.use_privatekey_file(self.privateKeyFileName)
        return ctx
Of course, this reloads the certificate files for every single connection which may be undesirable. You could add your own caching logic back in, with a control interface that fits into your certificate refresh system. This also has the downside that DefaultOpenSSLContextFactory is not really a very good SSL context factory to begin with. It doesn't follow current best practices for TLS configuration.
So you probably really want to use twisted.internet.ssl.CertificateOptions instead. This has a similar _context cache that you could clear out:
sslContext = CertificateOptions(...) # Or PrivateCertificate(...).options(...)
...
sslContext._context = None
It will regenerate the context automatically when it finds that it is None so at least you don't have to call cacheContext this way. But again you're relying on a private interface.
Another technique that's more similar to Daniel F's suggestion is to provide a new factory for the already listening socket. This avoids the brief interruption in service that comes between stopListening and listenSSL. This would be something like:
from socket import AF_INET
from twisted.protocols.tls import TLSMemoryBIOFactory

# DefaultOpenSSLContextFactory or CertificateOptions or whatever
newContextFactory = ...

tlsWebsiteFactory = TLSMemoryBIOFactory(
    newContextFactory,
    isClient=False,
    wrappedFactory=websiteFactory,
)

listeningPortFileno = websiteFactory.sslPort.fileno()
websiteFactory.sslPort.stopReading()
websiteFactory.sslPort = reactor.adoptStreamPort(
    listeningPortFileno,
    AF_INET,
    tlsWebsiteFactory,
)
This basically just has the reactor stop servicing the old sslPort with its outdated configuration and tells it to start servicing events for that port's underlying socket on a new factory. In this approach, you have to drop down to the slightly lower level TLS interface since you can't adopt a "TLS port" since there is no such thing. Instead, you adopt the TCP port and apply the necessary TLS wrapping yourself (this is just what listenSSL is doing for you under the hood).
Note this approach is a little more limited than the others since not all reactors provide the fileno or adoptStreamPort methods. You can test for the interfaces the various objects provide if you want to use this where it's supported and degrade gracefully elsewhere.
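If you want to degrade gracefully at runtime, one hedged sketch (the two helper functions here are placeholders, not Twisted APIs) is to check whether the reactor provides IReactorSocket, the interface that declares adoptStreamPort:

from twisted.internet import reactor
from twisted.internet.interfaces import IReactorSocket

if IReactorSocket.providedBy(reactor):
    # this reactor supports adoptStreamPort, so the socket-adoption
    # approach above is available
    swap_tls_factory_by_adopting_port()    # placeholder for the code above
else:
    # fall back to one of the other techniques, e.g. clearing the
    # context factory's cached _context
    swap_tls_context_in_place()            # placeholder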
Also note that since TLSMemoryBIOFactory is how it always works under the hood anyway, you could also twiddle its private interface, if you have a reference to it:
tlsMemoryBIOFactory._connectionCreator = IOpenSSLServerConnectionCreator(
    newContextFactory,
)
and it will begin using that for new connections. But, again, private...
It is possible.
reactor.listenSSL returns a twisted.internet.tcp.Port instance which you can store somewhere accessible like in the website resource of your server, so that you can later access it:
website_resource = Website()
website_factory = server.Site(website_resource)
website_resource.sslPort = reactor.listenSSL(  # <---
    port=https_server_port,
    factory=website_factory,
    contextFactory=sslContext,
    interface=https_server_interface
)
then later in your http handler (render function) you can execute the following:
if request.path == b'/server/reload-certificates':
    request.setHeader("connection", "close")
    self.sslPort.connectionLost(reason=None)
    self.sslPort.stopListening()
    self.sslListen()
    return b'ok'
where self.sslListen is the initial setup code:
website_resource = Website()
website_factory = server.Site(website_resource)

def sslListen():
    sslContext = ssl.DefaultOpenSSLContextFactory(
        '/home/user/certs/letsencrypt-privkey.pem',
        '/home/user/certs/letsencrypt-fullchain.pem',
    )
    website_resource.sslPort = reactor.listenSSL(
        port=https_server_port,
        factory=website_factory,
        contextFactory=sslContext,
        interface=https_server_interface
    )

website_resource.sslListen = sslListen  # <---

sslListen()  # invoke once initially

# ...

reactor.run()
Notice that request.setHeader("connection", "close") is optional. It tells the browser to close the connection and not reuse it for the next fetch from the server (HTTP/1.1 connections are usually kept open for at least 30 seconds so they can be reused).
If the connection: close header is not sent, everything will still work; the connection will remain active and usable, but it will keep using the old certificate. That should be no problem if you're just reloading the certificates after certbot renewed them. New connections from other browsers will start using the new certificates immediately.

Will Python close an fd if it's out of a local scope?

I found that Python closes my file descriptor automatically. Run the following code and use lsof to find the open file. During the sleep in openAndSleep, I found that the file "fff" was being held by the process. But once execution left the function, "fff" was no longer held.
import time

def openAndSleep():
    f = open("fff", 'w')
    print "opened, sleep 10 sec"
    time.sleep(10)
    print "sleep finish"

openAndSleep()

print "in main...."
time.sleep(10000)
I checked the file class; it has no __del__ method. That seems strange. Does anyone know something about it?
Yes, CPython will.
File objects close automatically when their reference count drops to 0. A local scope being cleaned up means that the refcount drops, and if the local scope was the only reference then the file object refcount drops to 0 and is closed.
However, it is better to use the file object as a context manager in a with statement and have it closed automatically that way; don't count on the specific garbage handling implementation of CPython:
def openAndSleep():
    with open("fff", 'w') as f:
        print "opened, sleep 10 sec"
        time.sleep(10)
        print "sleep finish"
Note that __del__ is a hook for custom Python classes; file objects are implemented in C and fill the tp_dealloc slot instead. The file_dealloc() function closes the file object.
If you want to hold a file object open for longer, make sure there is still a reference to it. Store a reference to it somewhere else too. Return it from the function and store the return value, for example, or make it a global, etc.
In short: Yes.
Python spares the user the need to manage memory by implementing a garbage collection mechanism.
This basically means that each object in Python is automatically freed and removed once nothing uses it, so that memory and resources can be reused later in the program.
File objects are Python objects like any other, and they too are managed by the garbage collector. Once you leave the function scope, the garbage collector sees that nothing uses the file (via the reference count) and disposes of the object, which means closing it as well.
What you can do to avoid this is to open the file without using a Python file object, via os.open, which returns a file descriptor (an int) rather than a Python file object. The file descriptor will then not be discarded by the garbage collector, since it's not a Python object but an operating-system handle, so your code will work.
You should be careful to close the descriptor later with os.close, though, or you will leak resources and sooner or later your program will fail (a process can typically hold only about 1024 open file descriptors by default, after which no more files can be opened)!
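A minimal sketch of that os.open approach, reusing the file name from the question (the os.O_WRONLY | os.O_CREAT flags are an assumption about how you want the file opened):

import os
import time

def openAndSleep():
    # os.open returns a raw OS-level descriptor (an int); the garbage
    # collector will not close it when the function returns
    fd = os.open("fff", os.O_WRONLY | os.O_CREAT)
    print "opened, sleep 10 sec"
    time.sleep(10)
    print "sleep finish"
    return fd

fd = openAndSleep()
# "fff" is still held open here; close it explicitly when done
os.close(fd)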
Additional information:
http://www.digi.com/wiki/developer/index.php/Python_Garbage_Collection

readinto() replacement?

Copying a File using a straight-forward approach in Python is typically like this:
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)
(This code snippet is from shutil.py, by the way).
Unfortunately, this has drawbacks in my special use-case (involving threading and very large buffers) [Italics part added later]. First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host.
To avoid this I'm using the file.readinto() method which, unfortunately, is documented as deprecated and "don't use":
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    buffer = array.array('c')
    buffer.fromstring('-' * length)
    while True:
        count = fsrc.readinto(buffer)
        if count == 0:
            break
        if count != len(buffer):
            fdst.write(buffer.tostring()[:count])
        else:
            buffer.tofile(fdst)
My solution works, but there are two drawbacks as well: First, readinto() is not to be used. It might go away (says the documentation). Second, with readinto() I cannot decide how many bytes I want to read into the buffer and with buffer.tofile() I cannot decide how many I want to write, hence the cumbersome special case for the last block (which also is unnecessarily expensive).
I've looked at array.array.fromfile(), but it cannot be used to read "all there is" (reads, then throws EOFError and doesn't hand out the number of processed items). Also it is no solution for the ending special-case problem.
Is there a proper way to do what I want to do? Maybe I'm just overlooking a simple buffer class or similar which does what I want.
"This code snippet is from shutil.py"
Which is a standard library module. Why not just use it?
"First, it means that with each call of read() a new memory chunk is allocated and when buf is overwritten in the next iteration this memory is freed, only to allocate new memory again for the same purpose. This can slow down the whole process and put unnecessary load on the host."
This is tiny compared to the effort required to actually grab a page of data from disk.
Normal Python code would not need tweaks like this. However, if you really need all that performance tuning to read files from inside Python code (as in, you are rewriting, for performance or memory usage, some server code you wrote that already works), I'd rather call the OS directly using ctypes, thus having the copy performed at as low a level as I want.
It may even be possible that simply calling the "cp" executable as an external process is less of a hurdle in your case (and it would take full advantage of all OS- and filesystem-level optimizations for you).
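For that last suggestion, a minimal sketch (the paths are placeholders, and it assumes a POSIX system with cp on the PATH):

import subprocess

# delegate the copy to the OS tool; raises CalledProcessError on failure
subprocess.check_call(["cp", "/path/to/src", "/path/to/dst"])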

UNIX named PIPE end of file

I'm trying to use a unix named pipe to output statistics of a running service. I intend to provide a similar interface as /proc where one can see live stats by catting a file.
I'm using a code similar to this in my python code:
while True:
    f = open('/tmp/readstatshere', 'w')
    f.write('some interesting stats\n')
    f.close()
/tmp/readstatshere is a named pipe created by mknod.
I then cat it to see the stats:
$ cat /tmp/readstatshere
some interesting stats
It works fine most of the time. However, if I cat the entry several times in quick succession, sometimes I get multiple lines of some interesting stats instead of one. Once or twice it has even gone into an infinite loop, printing that line forever until I killed it. The only fix I've found so far is to add a delay of about 500 ms after f.close() to prevent this issue.
I'd like to know why exactly this happens and if there is a better way of dealing with it.
Thanks in advance
A pipe is simply the wrong solution here. If you want to present a consistent snapshot of the internal state of your process, write that to a temporary file and then rename it to the "public" name. This will prevent all issues that can arise from other processes reading the state while you're updating it. Also, do NOT do that in a busy loop, but ideally in a thread that sleeps for at least one second between updates.
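A minimal sketch of that write-then-rename idea, reusing the path from the question (publish_stats is a made-up helper name); os.rename() within the same filesystem is atomic, so a reader sees either the old snapshot or the new one, never a partial file:

import os
import tempfile

STATS_PATH = '/tmp/readstatshere'

def publish_stats(text):
    # write the snapshot to a temporary file in the same directory,
    # then atomically move it over the public name
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(STATS_PATH))
    with os.fdopen(fd, 'w') as f:
        f.write(text)
    os.rename(tmp_path, STATS_PATH)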
What about a UNIX socket instead of a pipe?
In this case, you can react on each connect by providing fresh data just in time.
The only downside is that you cannot cat the data; you'll have to create a new socket handle and connect() to the socket file.
MYSOCKETFILE = '/tmp/mysocket'

import socket
import os

try:
    os.unlink(MYSOCKETFILE)
except OSError:
    pass

s = socket.socket(socket.AF_UNIX)
s.bind(MYSOCKETFILE)
s.listen(10)

while True:
    s2, peeraddr = s.accept()
    s2.send('These are my actual data')
    s2.close()
Program querying this socket:
MYSOCKETFILE = '/tmp/mysocket'

import socket
import os

s = socket.socket(socket.AF_UNIX)
s.connect(MYSOCKETFILE)
while True:
    d = s.recv(100)
    if not d:
        break
    print d
s.close()
I think you should use FUSE.
It has Python bindings; see http://pypi.python.org/pypi/fuse-python/
This allows you to compose answers to questions formulated as POSIX filesystem system calls.
Don't write to an actual file. That's not what /proc does. Procfs presents a virtual (non-disk-backed) filesystem which produces the information you want on demand. You can do the same thing, but it'll be easier if it's not tied to the filesystem. Instead, just run a web service inside your Python program, and keep your statistics in memory. When a request comes in for the stats, formulate them into a nice string and return them. Most of the time you won't need to waste cycles updating a file which may not even be read before the next update.
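A minimal sketch of that in-memory approach (Python 2, to match the rest of the code here; the port number and stats dict are made up):

import BaseHTTPServer

STATS = {'requests_served': 0}   # updated elsewhere by your service

class StatsHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        # format the current in-memory stats on demand
        body = 'some interesting stats: %r\n' % (STATS,)
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

BaseHTTPServer.HTTPServer(('127.0.0.1', 8321), StatsHandler).serve_forever()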
You need to unlink the pipe after you issue the close. I think this is because there is a race condition where the pipe can be opened for reading again before cat finishes and it thus sees more data and reads it out, leading to multiples of "some interesting stats."
Basically you want something like:
import os

the_pipe = '/tmp/readstatshere'

while True:
    os.mkfifo(the_pipe)
    f = open(the_pipe, 'w')
    f.write('some interesting stats')
    f.close()
    os.unlink(the_pipe)
Update 1: call to mkfifo
Update 2: as noted in the comments, there is a race condition in this code as well with multiple consumers.
