I was wondering about implementing a singleton class following http://code.activestate.com/recipes/52558-the-singleton-pattern-implemented-with-python/ but am worried about possible (b)locking issues. My code is supposed to cache SQL statements and execute all cached statements using cursor.executemany(SQL, list-of-params) when a certain number of cached elements is reached or when the user makes an explicit execute call. Implementing a singleton was supposed to make it possible to cache statements application-wide, but I'm afraid I'll run into (b)locking issues.
Any thoughts?
By avoiding lazy initialization, the blocking problem goes away. In the module where you initialize your database connection, import the module that contains the singleton and then immediately create an instance of the singleton without storing it in a variable:
#Do Database Initialization
import MySingleton
MySingleton()
#Allow threads to be created
Why don't you use the module directly? (As pointed out before, modules are singletons.) If you create a module like:
# mymodule.py
from mydb import Connection
connection = Connection('host', 'port')
you can use the import mechanism and the connection instance will be the same everywhere.
from mymodule import connection
Of course, you can define a much more complex initialization of connection (possibly via writing your own class), but the point is that Python will only initialize the module once, and provide the same objects for every subsequent call.
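For instance, here is a minimal sketch of that "more complex initialization via your own class" idea; the mydb.Connection class, host, and port are the same placeholders used in the snippet above, and LazyConnection is just an illustrative name:

# mymodule.py -- a sketch only; Connection, 'host' and 'port' are placeholders
from mydb import Connection

class LazyConnection(object):
    """Defer creating the real Connection until it is first needed."""
    def __init__(self, host, port):
        self._args = (host, port)
        self._conn = None

    def get(self):
        if self._conn is None:
            self._conn = Connection(*self._args)
        return self._conn

connection = LazyConnection('host', 'port')

Every importer still does from mymodule import connection and shares the same module-level instance; the actual connection is only opened on the first call to connection.get().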
I believe the Singleton (or Borg) patterns have very specific applications in Python, and for the most part you should rely on direct imports until proven otherwise.
There should be no problems unless you plan to use that Singleton instance with several threads.
Recently I ran into an issue caused by a wrongly implemented cache-reloading mechanism: the cache data was first cleared and then refilled. This works fine in a single thread, but produces bugs with multiple threads.
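To make the hazard concrete, here is a small hypothetical sketch of the two reload strategies (the load_items loader is purely illustrative):

cache = {}

def reload_cache_buggy(load_items):
    # Other threads can observe an empty (or half-filled) cache here.
    cache.clear()
    cache.update(load_items())

def reload_cache_safer(load_items):
    # Build the replacement dict first, then swap the reference in one step,
    # so readers see either the old or the new cache, never a partial one.
    global cache
    cache = dict(load_items())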
As long as you use CPython, the Global Interpreter Lock should prevent blocking problems. You could also use the Borg pattern.
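For reference, the classic Borg pattern is only a few lines; in this sketch the StatementCache subclass is an illustrative example (not from the question) showing how instances stay distinct objects but share one state dictionary:

class Borg(object):
    _shared_state = {}

    def __init__(self):
        # Every instance points its __dict__ at the same shared dictionary.
        self.__dict__ = self._shared_state

class StatementCache(Borg):
    def __init__(self):
        Borg.__init__(self)
        if not hasattr(self, 'statements'):
            self.statements = []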
I realize this is a very basic question. Please see the disclaimers and context below if this just seems plain stupid!
I often have an object representing a resource -- for example a Redis queue -- that is used in several places in my Django application. However, the resource is not necessarily invoked in every single HTTP request.
Should I:
Instantiate the object once, and import it into the relevant modules?
Or, instantiate the object locally in each module where it is needed?
Option 1
shared.py
from redis import Redis
from rq import Queue
queue = Queue(connection=Redis())
views1.py
from shared import queue
# ... use the queue
views2.py
from shared import queue
# ... use the queue
Option 2
views1.py
from redis import Redis
from rq import Queue
queue = Queue(connection=Redis())
# ... use the queue
views2.py
from redis import Redis
from rq import Queue
queue = Queue(connection=Redis())
# ... use the queue
Disclaimers & Context
I'm sure this question seems elementary. I realize both methods work -- I'm really asking so that I can gain a better understanding of the fundamentals involved.
What are the implications and consequences of these two approaches? Are there advantages to using Option 1 in certain circumstances, and Option 2 in others?
I've read up a bit on Python's import system and found those concepts somewhat confusing. I also don't entirely understand how the Python process that runs a Django application works. Specifically, (1) when objects are loaded, (2) whether every object is retained in memory during the lifetime of the Python process, and (3) how and whether that Python process persists through multiple HTTP requests.
Thanks in advance.
Option 1 is probably best. This is in effect making it a singleton, which is used wherever you need it.
In terms of your follow-up questions, everything at module level is executed when that module is first imported in a process. When subsequent modules import the first module, it is not executed again; they just get additional references to the same object. So in this case there would be a single instantiation of your queue. Objects remain in memory for as long as there are references to them; since this object is instantiated at module level and assigned to a module-level variable, the instance will persist in memory for the duration of the process.
There are quite a few questions here about how processes work in Django; suffice to say that although this depends to some extent on the server that is running it, almost all ways of running Django consist of multiple processes each of which persist for many requests. Again, in your case each server process would have its own single reference to the queue object.
I am looking for a way to correctly manage module level global variables that use some operating system resource (like a file or a thread).
The problem is that when the module is reloaded, my resource must be properly disposed (e.g. the file closed or the thread terminated) before creating the new one.
So I need a better pattern to manage those singleton objects.
I've been reading the docs on module reload and this is quite interesting:
When a module is reloaded, its dictionary (containing the module’s global variables) is retained. Redefinitions of names will override the old definitions, so this is generally not a problem. If the new version of a module does not define a name that was defined by the old version, the old definition remains. This feature can be used to the module’s advantage if it maintains a global table or cache of objects; with a try statement it can test for the table’s presence and skip its initialization if desired:
try:
    cache
except NameError:
    cache = {}
So I could just check if the objects already exist, and dispose them before creating the new ones.
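As a rough sketch of that approach (the log_file resource below is purely illustrative), the module can dispose of whatever the previous version created before re-creating it:

try:
    log_file            # raises NameError on the very first import
except NameError:
    pass
else:
    log_file.close()    # dispose of the resource left over from the old module version

log_file = open('/tmp/app.log', 'a')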
You would need to monkeypatch or fork Django to hook into the dev server's reloading feature and do the proper thing to close files and so on.
But since you are developing a Django application, if you intend to serve your app from a proper server in the future, you should think about how you manage your global variables, including semaphores and all that jazz.
Before going down this route and implementing all this difficult, error-prone code, consider other solutions such as NoSQL databases (Redis, MongoDB, Neo4j, Hadoop, ...) and background process managers like Celery and Gearman. If none of this fits your use case(s) and you cannot avoid creating and managing files and global variables yourself, then consider the client/server pattern, where the clients are webserver threads, unless you want to mess with NFS.
For example, if one application does from twisted.internet import reactor, and another application does the same, are those reactors the same?
I am asking because Deluge, an application that uses twisted, looks like it uses the reactor to connect their UI (gtk) to the rest of the application being driven by twisted (I am trying to understand the source). For example, when the UI is closed it simply calls reactor.stop().
Is that all there is to it? It just seems kind of magic to me. What if I wanted to run another application that uses twisted?
Yes, every module in Python is always global, or, to put it better, a singleton: when you do from twisted.internet import reactor, Python's import mechanism first checks sys.modules['twisted.internet.reactor'], and, if that exists, returns said value; only if it doesn't exist (i.e., the first time a module is imported) is the module actually loaded for the first time (and stashed into an entry in sys.modules for possible future imports).
There is nothing especially magical in the Singleton design pattern, though it can sometimes prove limiting when you desperately need more than one of those thingies for which the architecture has decreed "there can be only one". Twisted's docs acknowledge that:
New application code should prefer to pass and accept the reactor as a parameter where it is needed, rather than relying on being able to import this module to get a reference. This simplifies unit testing and may make it easier to one day support multiple reactors (as a performance enhancement), though this is not currently possible.
The best way to make it possible, if it's crucial to your app, is to contribute to the Twisted project, either labor (coding the subtle mechanisms needed to support multiple reactors, that is, multiple event loops, within a single app) or funding (money will enable sustaining somebody with a stipend in order to perform this work).
Otherwise, use separate processes (e.g. with the multiprocessing module of the standard library) with no more than one reactor each.
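A rough sketch of that one-reactor-per-process idea, where the five-second timer is just a stand-in for real application setup:

from multiprocessing import Process

def run_app():
    # Each child process imports and runs its own, independent reactor.
    from twisted.internet import reactor
    reactor.callLater(5, reactor.stop)   # placeholder for real application work
    reactor.run()

if __name__ == '__main__':
    procs = [Process(target=run_app) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()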
The reactor is indeed global. It takes care of the event loop, and you register handlers to consume events. If you want to use several applications with the same reactor, you can use the twistd daemon. http://twistedmatrix.com/documents/current/core/howto/application.html
I keep a cache of transactions to flush (to persistent storage) when a watermark is reached or on object finalization. Since __del__ is no longer guaranteed to be called on every object, is the appropriate approach to hook a similar function (or __del__ itself) into atexit.register (during initialization)?
If I'm not mistaken, this will cause the object to which the method is bound to hang around until program termination. This isn't likely to be a problem, but maybe there's a more elegant solution?
Note: I know using __del__ is non-ideal because it can cause uncatchable exceptions, but I can't think of another way to do this short of cascading finalize() calls all the way through my program. TIA!
If you have to handle resources, the preferred way is to have an explicit call to a close() or finalize() method. Have a look at the with statement to abstract that. In your case the weakref module might be an option: the cached objects can be garbage collected by the system and have their __del__() method called, or you can finalize them if they are still alive.
I would say atexit, or try to see if you can refactor the code so it can be expressed using the with statement, which is available via __future__ in 2.5 and enabled by default in 2.6. 2.5 includes a contextlib module to simplify things a bit. I've done something like this when using Canonical's Storm ORM.
from __future__ import with_statement
import contextlib

@contextlib.contextmanager
def start_transaction(db):
    db.start()
    yield db    # hand the db back to the with block as "transaction"
    db.end()

with start_transaction(db) as transaction:
    ...
For a non-db case, you could just register the objects to be flushed with a global and then use something similar. The benefit of this approach is that it keeps things explicit.
If you don't need your object to be alive at the time you perform the flush, you could use weak references.
This is similar to your proposed solution, but rather than using a real reference, store a list of weak references, with a callback function to perform the flush. This way, the references aren't going to keep those objects alive, and you won't run into any circular garbage problems with __del__ methods.
You can run through the list of weak references on termination to manually flush any still alive if this needs to be guaranteed done at a certain point.
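A minimal sketch of that combination (the register() helper and the flush() method are illustrative names, not from the question):

import atexit
import weakref

_refs = []  # weak references to objects whose caches may still need flushing

def register(obj):
    _refs.append(weakref.ref(obj))

def flush_survivors():
    # At termination, walk the weak references and flush whatever is still alive.
    for ref in _refs:
        obj = ref()
        if obj is not None:
            obj.flush()

atexit.register(flush_survivors)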
Put the following in a file called destructor.py
import atexit

objects = []

def _destructor():
    global objects
    for obj in objects:
        obj.destroy()
    del objects

atexit.register(_destructor)
now use it this way:
import destructor

class MyObj(object):
    def __init__(self):
        destructor.objects.append(self)
        # ... other init stuff

    def destroy(self):
        # clean up resources here
        pass
I think atexit is the way to go here.
I have implemented a python webserver. Each http request spawns a new thread.
I have a requirement to cache objects in memory, and since it's a webserver, I want the cache to be thread-safe. Is there a standard implementation of a thread-safe object cache in Python? I found the following:
http://freshmeat.net/projects/lrucache/
This does not look to be thread-safe. Can anybody point me to a good implementation of a thread-safe cache in Python?
Thanks!
Well a lot of operations in Python are thread-safe by default, so a standard dictionary should be ok (at least in certain respects). This is mostly due to the GIL, which will help avoid some of the more serious threading issues.
There's a list here: http://coreygoldberg.blogspot.com/2008/09/python-thread-synchronization-and.html that might be useful.
The atomic nature of those operations just means that you won't end up with an entirely inconsistent state if two threads access a dictionary at the same time, so you won't get a corrupted value. However, you would (as with most multi-threaded programming) not be able to rely on the specific order of those atomic operations.
So to cut a long story short...
If you have fairly simple requirements and aren't too bothered about the ordering of what gets written into the cache, then you can use a dictionary and know that you'll always get a consistent/not-corrupted value (it just might be out of date).
If you want to ensure that things are a bit more consistent with regard to reading and writing then you might want to look at Django's local memory cache:
http://code.djangoproject.com/browser/django/trunk/django/core/cache/backends/locmem.py
Which uses a read/write lock for locking.
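If you just need something small and local, a much-simplified sketch along those lines might look like this (a single mutex rather than a real read/write lock, and LocalCache is a made-up name):

import threading

class LocalCache(object):
    def __init__(self):
        self._data = {}
        self._lock = threading.RLock()

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def set(self, key, value):
        with self._lock:
            self._data[key] = value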
Thread per request is often a bad idea. If your server experiences huge spikes in load, it will bring the box to its knees. Consider using a thread pool that can grow to a limited size during peak usage and shrink to a smaller size when load is light.
Point 1. The GIL does not help you here. An example of a (non-thread-safe) cache for something called "stubs" would be:
stubs = {}

def maybe_new_stub(host):
    """ returns stub from cache and populates the stubs cache if new is created """
    if host not in stubs:
        stub = create_new_stub_for_host(host)
        stubs[host] = stub
    return stubs[host]
What can happen is that Thread 1 calls maybe_new_stub('localhost'), and it discovers we do not have that key in the cache yet. Now we switch to Thread 2, which calls the same maybe_new_stub('localhost'), and it also learns the key is not present. Consequently, both threads call create_new_stub_for_host and put it into the cache.
The map itself is protected by the GIL, so we cannot break it by concurrent access. The logic of the cache, however, is not protected, and so we may end up creating two or more stubs, and dropping all except one on the floor.
Point 2. Depending on the nature of the program, you may not want a global cache. Such a shared cache forces synchronization between all your threads. For performance reasons, it is good to make the threads as independent as possible. I believe I do need it; you may actually not.
Point 3. You may use a simple lock. I took inspiration from https://codereview.stackexchange.com/questions/160277/implementing-a-thread-safe-lrucache and came up with the following, which I believe is safe to use for my purposes
import threading

import grpc
import cli_pb2_grpc  # the answer's own generated gRPC module

stubs = {}
lock = threading.Lock()

def maybe_new_stub(host):
    """ returns stub from cache and populates the stubs cache if new is created """
    with lock:
        if host not in stubs:
            channel = grpc.insecure_channel('%s:6666' % host)
            stub = cli_pb2_grpc.BrkStub(channel)
            stubs[host] = stub
        return stubs[host]
Point 4. It would be best to use existing library. I haven't found any I am prepared to vouch for yet.
You probably want to use memcached instead. It's very fast, very stable, very popular, has good python libraries, and will allow you to grow to a distributed cache should you need to:
http://www.danga.com/memcached/
I'm not sure any of these answers are doing what you want.
I have a similar problem and I'm using a drop-in replacement for lrucache called cachetools which allows you to pass in a lock to make it a bit safer.
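For what it's worth, a minimal sketch of that usage (the cached function and maxsize are illustrative; check the cachetools docs for the exact API of your version):

import threading
from cachetools import LRUCache, cached

cache = LRUCache(maxsize=128)
lock = threading.Lock()

@cached(cache, lock=lock)
def get_thing(key):
    # Placeholder for the real, expensive work.
    return "value-for-%s" % key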
For a thread safe object you want threading.local:
from threading import local
safe = local()
safe.cache = {}
You can then put and retrieve objects in safe.cache with thread safety.