Are generators with context managers an anti-pattern? - python

I'm wondering about code like this:
def all_lines(filename):
with open(filename) as infile:
yield from infile
The point of a context manager is to have explicit control over the lifetime of some form of state, e.g. a file handle. A generator, on the other hand, keeps its state until it is exhausted or deleted.
I do know that both cases work in practice. But I'm worried about whether it is a good idea. Consider for example this:
def all_first_lines(filenames):
return [next(all_lines(filename), None) for filename in filenames]
I never exhaust the generators. Instead, their state is destroyed when the generator object is deleted. This works fine in reference-counted implementations like CPython, but what about garbage-collected implementations? I'm practically relying on the reference counter for managing state, something that context managers were explicitly designed to avoid!
And even in CPython it shouldn't be too hard to construct cases were a generator is part of a reference cycle and needs the garbage collector to be destroyed.
To summarize: Would you consider it prudent to avoid context managers in generators, for example by refactoring the above code into something like this?
def all_lines(filename):
with open(filename) as infile:
return infile.readlines()
def first_line(filename):
with open(filename) as infile:
return next(infile, None)
def all_first_lines(filenames):
return [first_line(filename) for filename in filenames]

While it does indeed extend the lifetime of the object until the generator exits or is destroyed, it also can make the generators clearer to work with.
Consider creating the generators under an outer with and passing the file as an argument instead of them opening it. Now the file is invalid for use after the context manager is exited, even though the generators can still be seen as usable.
If limiting the time for how long the handles are held is important, you can explicitly close the generators using the close method after you are done with them.
This is a similar problem to what trio tries to solve with its nurseries for asynchronous tasks, where the nursery context manager waits for every task spawned from that nursery to exit before proceeding, the tutorial example illustrates this. This blog post by the author can provide some reasoning for the way it's done in trio which can be an interesting read that's somewhat related to the problem.

There are two answers to your question :
the absolutist : indeed, the context managers will not serve their role, the GC will have to clean the mess that should not have happened
the pragmatic : true, but is it actually a problem ? Your file handle will get released a few milliseconds later, what's the bother ? Does it have a measurable impact on production, or is it just bikeshedding ?
I'm not an expert to Python alt implementations' differences (see this page for PyPy's example), but I posit that this lifetime problem will not occur in 99% of cases. If you happen to hit in prod, then yes, you should address it (either with your proposal, or a mix of generator with context manager) otherwise, why bother ? I mean it in a kind way : your point is strictly valid, but irrelevant to most cases.

Related

Proper finalization in Python

I have a bunch of instances, each having a unique tempfile for its use (save data from memory to disk and retrieve them later).
I want to be sure that at the end of the day, all these files are removed. However, I want to leave a room for a fine-grained control of their deletion. That is, some files may be removed earlier, if needed (e.g. they are too big and not important any more).
What is the best / recommended way to achieve this?
May thoughts on that
The try-finalize blocks or with statements are not an option, as we have many files, whose lifetime may overlap each other. Also, it hardly admits the option of finer control.
From what I have read, __del__ is also not a feasible option, as it is not even guaranteed that it will eventually run (although, it is not entirely clear to me, what are the "risky" cases). Also (if it is still the case), the libraries may not be available when __del__ runs.
tempfile library seems promising. However, the file is gone after just closing it, which is definitely a bummer, as I want them to be closed (when they perform no operation) to limit the number of open files.
The library promises that the file "will be destroyed as soon as it is closed (including an implicit close when the object is garbage collected)."
How do they achieve the implicit close? E.g. in C# I would use a (reliable) finalizer, which __del__ is not.
atexit library seems to be the best candidate, which can work as a reliable finalizer instead of __del__ to implement safe disposable pattern. The only problem, compared to object finalizers, is that it runs truly at-exit, which is rather inconvenient (what if the object eligible to be garbage-collected earlier?).
Here, the question still stands. How the library achieves that the methods always run? (Except in a really unexpected cases with which is hard to do anything)
In ideal case, it seems that a combination of __del__ and atexit library may perform best. That is, the clean-up is both at __del__ and the method registered in atexit, while repeated clean-up would be forbidden. If __del__ was called, the registered will be removed.
The only (yet crucial) problem is that __del__ won't run if a method is registered at atexit, because a reference to the object exists forever.
Thus, any suggestion, advice, useful link and so on is welcomed.
I suggest considering weakref built-in module for this task, more specifically weakref.finalize simple example:
import weakref
class MyClass:
pass
def clean_up(*args):
print('clean_up', args)
my_obj = MyClass()
weakref.finalize(my_obj, clean_up, 'arg1', 'arg2', 'arg3')
del my_obj # optional
when run it will output
clean_up ('arg1', 'arg2', 'arg3')
Note that clean_up will be executed even without del-ing of my_obj (you might delete last line of code and behavior will not change). clean_up is called after all strong references to my_obj are gone or at end (like using atexit module).

When/How does an anonymous file object close?

In the comments of this question about a python one-liner, it occurred to me I have no idea how python handles anonymous file objects. From the question:
open(to_file, 'w').write(open(from_file).read())
There are two calls to open without using the with keyword (which is usually how I handle files). I have, in the past, used this kind of unnamed file. IIRC, it seemed there was a leftover OS-level lock on the file that would expire after a minute or two.
So what happens to these file handles? Are they cleaned up by garbage collection? By the OS? What happens to the Python machine and file when close() is called, and will it all happen anyway when the script finishes and some time passes?
Monitoring the file descriptor on Linux (by checking /proc/$$/fds) and the File Handle on Windows (using SysInternals tools) it appears that the file is closed immediately after the statement.
This cannot be guarenteed however, since the garbage collector has to execute. In the testing I have done it does get closed at once every time.
The with statement is recommended to be used with open, however the occasions when it is actually needed are rare. It is difficult to demonstrate a scenario where you must use with, but it is probably a good idea to be safe.
So your one-liner becomes:
with open(to_file, 'w') as tof, open(from_file) as fof:
tof.write(fof.read())
The advantage of with is that the special method (in the io class) called __exit__() is guaranteed* to be called.
* Unless you do something like os._exit().
The files will get closed after the garbage collector collects them, CPython will collect them immediately because it uses reference counting, but this is not a guaranteed behavior.
If you use files without closing them in a loop you might run out of file descriptors, that's why it's recommended to use the with statement (if you're using 2.5 you can use from __future__ import with_statement).

Python Twisted Deferred : clarification needed

I am hoping for some clarification on the best way to deal with handling "first" deferreds , ie not just adding callbacks and errbacks to existing Twisted methods that return a deferred, but the best way of creating those original deferreds.
As a concrete example, here are 2 variations of the same method :
it just counts the number of lines in some rather big text files, and is used as the starting point for a chain of deferreds.
Method 1:
This one does not feel so good, as the deferred is fired directly by the reactor.callLater method.
def get_line_count(self):
deferred = defer.Deferred()
def count_lines(result):
try:
print_file = file(self.print_file_path, "r")
self.line_count = sum(1 for line in print_file)
print_file.close()
return self.line_count
except Exception as inst:
raise InvalidFile()
deferred.addCallback(count_lines)
reactor.callLater(1, deferred.callback, None)
return deferred
Method 2:
slightly better , as the deferred is actually fired when the result is available
def get_line_count(self):
deferred = defer.Deferred()
def count_lines():
try:
print_file = file(self.print_file_path, "r")
self.line_count = sum(1 for line in print_file)
print_file.close()
deferred.callback(self.line_count)
except Exception as inst:
deferred.errback(InvalidFile())
reactor.callLater(1, count_lines)
return deferred
Note: You could also point out that both of these are actually synchronous, and potentially blocking methods, (and I perhaps could use "MaybeDeferred"?).
But well, that that is actually one of the aspects I get confused by.
For Method 2, if the count_lines method is very slow (counting the lines in some huge files etc), will it potentially "block" the whole Twisted app ?
I read quite a lot of documentation on how callbacks and errbacks and the reactor behave together (callbacks need to be executed quickly, or return deferreds themselves etc), but in this case , I just don't see and would really appreciate some pointers/examples etc
Are there some articles/clear explanations that deal with the best approach to creating these "first" deferreds? I have read through these excellent articles , and they have helped a lot with some of the basic understanding, but I still feel like I am missing a piece.
For blocking code, would this be this a typicall case for DeferToThread or reactor.spawnprocess ?
I read through a lot of questions like this one and this article, but I still am not 100% sure on how to deal with potentially blocking code, mostly when dealing with file i/o
Sorry if any of this seems too basic , but I really want to get the hang of using Twisted more thoroughly. (It has been a really powerful tool for all the more network-oriented aspects).
Thank you for your time!
Yes, you've got it right: you need threads or separate processes to avoid blocking the Twisted event loop. Using Deferreds wont magically make your code non-blocking. For your questions:
Yes, you would block the event loop if count_lines is very slow. Deferring it to a thread would solve this.
I used Twisteds documentation to learn how Deferreds work, but I guess you've already been through that. The article on database support was information since it clearly says that this library is built using threads. This is how you bridge the synchronous–asynchronous gap.
If the call is truly blocking, then you need to DeferToThread. Python itself is kind-of single threaded, meaning that only one thread can execute Python byte code at a time. However, if the thread you create will block on I/O anyway, then this model works fine: the thread will release the global interpreter lock and so let other Python threads run, including the main thread with the Twisted event loop.
It can also be the case that you can use non-blocking I/O in your code. This can be done with the select module, for example. In that case, you don't need a separate thread. Twisted uses this technique internally and you don't have to think of this if you do normal network I/O. But if you're doing something exotic, then it's good to know how things are built so that you can do the same.
I hope that makes things a bit clearer!

Python: time a method call and stop it if time is exceeded

I need to dynamically load code (comes as source), run it and get the results. The code that I load always includes a run method, which returns the needed results. Everything looks ridiculously easy, as usual in Python, since I can do
exec(source) #source includes run() definition
result = run(params)
#do stuff with result
The only problem is, the run() method in the dynamically generated code can potentially not terminate, so I need to only run it for up to x seconds. I could spawn a new thread for this, and specify a time for .join() method, but then I cannot easily get the result out of it (or can I). Performance is also an issue to consider, since all of this is happening in a long while loop
Any suggestions on how to proceed?
Edit: to clear things up per dcrosta's request: the loaded code is not untrusted, but generated automatically on the machine. The purpose for this is genetic programming.
The only "really good" solutions -- imposing essentially no overhead -- are going to be based on SIGALRM, either directly or through a nice abstraction layer; but as already remarked Windows does not support this. Threads are no use, not because it's hard to get results out (that would be trivial, with a Queue!), but because forcibly terminating a runaway thread in a nice cross-platform way is unfeasible.
This leaves high-overhead multiprocessing as the only viable cross-platform solution. You'll want a process pool to reduce process-spawning overhead (since presumably the need to kill a runaway function is only occasional, most of the time you'll be able to reuse an existing process by sending it new functions to execute). Again, Queue (the multiprocessing kind) makes getting results back easy (albeit with a modicum more caution than for the threading case, since in the multiprocessing case deadlocks are possible).
If you don't need to strictly serialize the executions of your functions, but rather can arrange your architecture to try two or more of them in parallel, AND are running on a multi-core machine (or multiple machines on a fast LAN), then suddenly multiprocessing becomes a high-performance solution, easily paying back for the spawning and IPC overhead and more, exactly because you can exploit as many processors (or nodes in a cluster) as you can use.
You could use the multiprocessing library to run the code in a separate process, and call .join() on the process to wait for it to finish, with the timeout parameter set to whatever you want. The library provides several ways of getting data back from another process - using a Value object (seen in the Shared Memory example on that page) is probably sufficient. You can use the terminate() call on the process if you really need to, though it's not recommended.
You could also use Stackless Python, as it allows for cooperative scheduling of microthreads. Here you can specify a maximum number of instructions to execute before returning. Setting up the routines and getting the return value out is a little more tricky though.
I could spawn a new thread for this, and specify a time for .join() method, but then I cannot easily get the result out of it
If the timeout expires, that means the method didn't finish, so there's no result to get. If you have incremental results, you can store them somewhere and read them out however you like (keeping threadsafety in mind).
Using SIGALRM-based systems is dicey, because it can deliver async signals at any time, even during an except or finally handler where you're not expecting one. (Other languages deal with this better, unfortunately.) For example:
try:
# code
finally:
cleanup1()
cleanup2()
cleanup3()
A signal passed up via SIGALRM might happen during cleanup2(), which would cause cleanup3() to never be executed. Python simply does not have a way to terminate a running thread in a way that's both uncooperative and safe.
You should just have the code check the timeout on its own.
import threading
from datetime import datetime, timedelta
local = threading.local()
class ExecutionTimeout(Exception): pass
def start(max_duration = timedelta(seconds=1)):
local.start_time = datetime.now()
local.max_duration = max_duration
def check():
if datetime.now() - local.start_time > local.max_duration:
raise ExecutionTimeout()
def do_work():
start()
while True:
check()
# do stuff here
return 10
try:
print do_work()
except ExecutionTimeout:
print "Timed out"
(Of course, this belongs in a module, so the code would actually look like "timeout.start()"; "timeout.check()".)
If you're generating code dynamically, then generate a timeout.check() call at the start of each loop.
Consider using the stopit package that could be useful in some cases you need timeout control. Its doc emphasizes the limitations.
https://pypi.python.org/pypi/stopit
a quick google for "python timeout" reveals a TimeoutFunction class
Executing untrusted code is dangerous, and should usually be avoided unless it's impossible to do so. I think you're right to be worried about the time of the run() method, but the run() method could do other things as well: delete all your files, open sockets and make network connections, begin cracking your password and email the result back to an attacker, etc.
Perhaps if you can give some more detail on what the dynamically loaded code does, the SO community can help suggest alternatives.

python threadsafe object cache

I have implemented a python webserver. Each http request spawns a new thread.
I have a requirement of caching objects in memory and since its a webserver, I want the cache to be thread safe. Is there a standard implementatin of a thread safe object cache in python? I found the following
http://freshmeat.net/projects/lrucache/
This does not look to be thread safe. Can anybody point me to a good implementation of thread safe cache in python?
Thanks!
Well a lot of operations in Python are thread-safe by default, so a standard dictionary should be ok (at least in certain respects). This is mostly due to the GIL, which will help avoid some of the more serious threading issues.
There's a list here: http://coreygoldberg.blogspot.com/2008/09/python-thread-synchronization-and.html that might be useful.
Though atomic nature of those operation just means that you won't have an entirely inconsistent state if you have two threads accessing a dictionary at the same time. So you wouldn't have a corrupted value. However you would (as with most multi-threading programming) not be able to rely on the specific order of those atomic operations.
So to cut a long story short...
If you have fairly simple requirements and aren't to bothered about the ordering of what get written into the cache then you can use a dictionary and know that you'll always get a consistent/not-corrupted value (it just might be out of date).
If you want to ensure that things are a bit more consistent with regard to reading and writing then you might want to look at Django's local memory cache:
http://code.djangoproject.com/browser/django/trunk/django/core/cache/backends/locmem.py
Which uses a read/write lock for locking.
Thread per request is often a bad idea. If your server experiences huge spikes in load it will take the box to its knees. Consider using a thread pool that can grow to a limited size during peak usage and shrink to a smaller size when load is light.
Point 1. GIL does not help you here, an example of a (non-thread-safe) cache for something called "stubs" would be
stubs = {}
def maybe_new_stub(host):
""" returns stub from cache and populates the stubs cache if new is created """
if host not in stubs:
stub = create_new_stub_for_host(host)
stubs[host] = stub
return stubs[host]
What can happen is that Thread 1 calls maybe_new_stub('localhost'), and it discovers we do not have that key in the cache yet. Now we switch to Thread 2, which calls the same maybe_new_stub('localhost'), and it also learns the key is not present. Consequently, both threads call create_new_stub_for_host and put it into the cache.
The map itself is protected by the GIL, so we cannot break it by concurrent access. The logic of the cache, however, is not protected, and so we may end up creating two or more stubs, and dropping all except one on the floor.
Point 2. Depending on the nature of the program, you may not want a global cache. Such shared cache forces synchronization between all your threads. For performance reasons, it is good to make the threads as independent as possible. I believe I do need it, you may actually not.
Point 3. You may use a simple lock. I took inspiration from https://codereview.stackexchange.com/questions/160277/implementing-a-thread-safe-lrucache and came up with the following, which I believe is safe to use for my purposes
import threading
stubs = {}
lock = threading.Lock()
def maybe_new_stub(host):
""" returns stub from cache and populates the stubs cache if new is created """
with lock:
if host not in stubs:
channel = grpc.insecure_channel('%s:6666' % host)
stub = cli_pb2_grpc.BrkStub(channel)
stubs[host] = stub
return stubs[host]
Point 4. It would be best to use existing library. I haven't found any I am prepared to vouch for yet.
You probably want to use memcached instead. It's very fast, very stable, very popular, has good python libraries, and will allow you to grow to a distributed cache should you need to:
http://www.danga.com/memcached/
I'm not sure any of these answers are doing what you want.
I have a similar problem and I'm using a drop-in replacement for lrucache called cachetools which allows you to pass in a lock to make it a bit safer.
For a thread safe object you want threading.local:
from threading import local
safe = local()
safe.cache = {}
You can then put and retrieve objects in safe.cache with thread safety.

Categories