I have a number of threads in my software that all do the same thing, but each thread operates from a different "perspective." I have a "StateModel" object that is used throughout the thread and the objects within the thread, but StateModel needs to be calculated differently for each thread.
I don't like the idea of passing the StateModel object around to all of the functions that need it. Normally, I would create a module variable and all of the objects throughout the program could reference the same data from the module variable. But, is there a way to have this concept of a static module variable that is different and independent for each thread? A kind of static Thread variable?
Thanks.
This is implemented in threading.local.
I tend to dislike mostly-quoting-the-docs answers, but... well, time and place for everything.
A class that represents thread-local data. Thread-local data are data whose values are thread specific. To manage thread-local data, just create an instance of local (or a subclass) and store attributes on it:

mydata = threading.local()
mydata.x = 1

The instance's values will be different for separate threads.

For more details and extensive examples, see the documentation string of the _threading_local module.
Notably, you can just have your class extend threading.local, and your class suddenly has thread-local behavior.
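For example, a minimal sketch (the StateModel and worker names are just illustrative, not from the question): subclass threading.local, and every thread that touches the instance gets its own copy of the attributes set in __init__:

import threading

class StateModel(threading.local):
    def __init__(self):
        # Runs once per thread that touches the instance, so every
        # thread starts from its own fresh set of attributes.
        self.perspective = None
        self.value = 0

state = StateModel()  # one module-level name, per-thread data

def worker(perspective):
    state.perspective = perspective   # invisible to other threads
    state.value = len(perspective)
    print(threading.current_thread().name, state.perspective, state.value)

threads = [threading.Thread(target=worker, args=(p,)) for p in ("north", "south")]
for t in threads:
    t.start()
for t in threads:
    t.join()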
Related
If I instantiate an object in the main thread and then send one of its member methods to a ThreadPoolExecutor, does Python somehow create a copy-by-value of the object and send it to the subthread, so that the object's member method has access to its own copy of self?
Or is it indeed accessing self from the object in the main thread, meaning that every member method running in a subthread is modifying / overwriting the same properties (living in the main thread)?
Threads share a memory space. There is no magic going on behind the scenes, so code in different threads accesses the same objects. Thread switches can occur at any time, although most simple Python operations are atomic. It is up to you to avoid race conditions. Normal Python scoping rules apply.
You might want to read about thread-local variables (threading.local) if you want to find out about workarounds to the default behavior.
Processes are quite different. Each process has its own memory space and its own copy of all the objects it references.
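A rough sketch of the thread case (the Counter class is made up for illustration): the bound method you submit carries a reference to the one object created in the main thread, so every worker thread mutates that same instance:

import concurrent.futures

class Counter:
    def __init__(self):
        self.n = 0

    def bump(self):
        for _ in range(100000):
            self.n += 1            # same self as in the main thread

c = Counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(c.bump)        # each worker gets a reference, not a copy
print(c.n)   # all workers updated the one shared object; without a lock the
             # read-modify-write races, so the total may fall short of 400000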
Complete code here: https://gist.github.com/mnjul/82151862f7c9585dcea616a7e2e82033
Environment is Python 2.7.6 on an up-to-date Ubuntu 14.04 x64.
Prologue:
Well, I got this strange piece of code in my work project, and it's a classic "somebody wrote it and quit the job, and it works but I don't know why" piece, so I decided to write a stripped-down version of it, hoping to get my questions clarified/answered. Please kindly check the referred gist.
Situation:
So, I have a custom class Storage inheriting from Python's thread-local storage, intended to book-keep some thread-local data. There is only one instance of that class, instantiated in the global scope before any threads have been constructed. So I would expect that, since there is only one Storage instance and its __init__() runs only once, those Runner threads would not actually have thread-local storage, and data accesses would clash.
However, this turned out to be wrong, and the code output (see my comment at that gist) indicates that each thread does in fact have its own local storage --- strangely, at each thread's first access to the storage object (i.e. a set()), Storage.__init__() is mysteriously run, thus properly creating the thread-local storage and producing the desired effect.
Questions: Why on earth did Storage.__init__ get invoked when the threads attempted to call a member function of a seemingly already-instantiated object? Is this a CPython (or pthread, if that matters) implementation detail? I feel like there are a lot of things happening between my stack trace's "py_thr_local.py", line 36, in run => storage.set('keykey', value) and "py_thr_local.py", line 14, in __init__, but I can't find any relevant piece of information in (C)Python's source code or on StackOverflow.
Any feedback is welcome. Let me know if I need to clarify things or provide more information.
The first piece of information to consider is: what is a thread-local? Thread-locals are independently initialized instances of a particular type that are tied to a particular thread. With that in mind, I would expect some initialization code to be called multiple times. While in some languages, like Java, the initialization is more explicit, it does not necessarily need to be.
Let's look at the source for the supertype of the storage container you're using: https://github.com/python/cpython/blob/2.7/Lib/_threading_local.py
Line 186 contains the local type that is being used. Taking a look at that class, you can see that __setattr__ and __getattribute__ are among the overridden methods. Remember that in Python these methods are called every time you attempt to assign or access an attribute on an instance. The implementations of these methods acquire a lock and then call the _patch method. This _patch method creates a new dictionary and assigns it to the current instance's __dict__ (going through the object base class to avoid infinite recursion: How is the __getattribute__ method used?).
So when you call storage.set(...), you are actually looking up a proxy dictionary for the current thread. If one doesn't exist, __init__ is called on your type (see line 182). The result of that lookup is substituted into the current instance's __dict__, and then the appropriate method is called on object to retrieve or set that value (lines 193, 206, 219), which uses the newly installed dict.
That's part of the contract (from http://svn.python.org/projects/python/tags/r27a1/Lib/_threading_local.py):
Note that if you define an __init__ method, it will be called each time the local object is used in a separate thread.
It's not too well documented in the official docs, but basically each time you interact with a thread-local in a different thread, a new set of attributes unique to that thread gets allocated, and your __init__ runs again to populate it.
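To make that concrete, here is a stripped-down, hypothetical reconstruction of the behaviour described (the class below is a guess modelled on the gist, not its actual code): the print in __init__ fires once for the main thread at construction, and once more the first time each new thread touches the instance:

import threading

class Storage(threading.local):
    def __init__(self):
        print("__init__ running in " + threading.current_thread().name)
        self.data = {}

    def set(self, key, value):
        self.data[key] = value

storage = Storage()               # __init__ runs for the main thread

def run():
    storage.set('keykey', 42)     # first attribute access in this thread
                                  # re-runs __init__, creating a fresh .data

threads = [threading.Thread(target=run) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()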
I'm having a hard time wrapping my head around Python threading, especially since the documentation explicitly tells you to RTFS at some points, instead of kindly including the relevant info. I'll admit I don't feel qualified to read the threading module. I've seen lots of dirt-simple examples, but they all use global variables, which is offensive and makes me wonder if anyone really knows when or where it's required to use them as opposed to just convenient.
In particular, I'd like to know:
In threading.Thread(target=x), is x shared or private? Does each thread have its own stack, or are all threads using the same context simultaneously?
What is the preferred way to pass mutable variables to threads? Immutable ones obviously go through Thread(args=[], kwargs={}), and that's what all the examples cover. If a global is the answer, I'll have to hold my nose and use it, but it seems like there has to be a better way. I suppose I could wrap everything in a class and just pass the instance in, but it'd be nice to be able to point at regular variables, too.
When do I need threading.local()? In the x above?
Do I have to subclass Thread to update data, as many examples show?
I'm used to Win32 threads and pthreads, where it's explicitly laid out in docs what is and isn't shared with different uses of threads. Those are pretty low-level, and I'd like to avoid _thread if possible to be pythonic.
I'm not sure if it's relevant, but I'm trying to thread OpenMP-style to get the hang of it - make a for loop run concurrently using a queue and some threads. It was easy with globals and locks, but now I'd like to nail down scopes for better lock use.
In threading.Thread(target=x), is x shared or private?
It is private. Each thread has its own private invocation of x.
This is similar to recursion, for example (regardless of multithreading). If x calls itself, each invocation of x gets its own "private" frame, with its own private local variables.
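A quick sketch (the names shared and x are only illustrative): each thread's invocation of the target gets its own locals, while module-level objects remain visible to, and shared by, every thread:

import threading

shared = []          # module-level, visible to every thread

def x(tag):
    private = [tag]              # local to this invocation only
    shared.append(private[0])    # writes to the one shared list

threads = [threading.Thread(target=x, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared)        # e.g. [0, 1, 2], possibly in another order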
What is the preferred way to pass mutable variables to threads? Do I have to subclass Thread to update data?
I view the target argument as a quick shortcut, good for quick experiments, but not much else. Using it where it ought not be used leads to all the limitations you describe in your question (and the hacks you describe in the possible solutions you contemplate).
Most of the time, you'd want to subclass threading.Thread. The code creating/managing the threads would pass all mutable shared objects to your thread class's __init__, which should keep those objects as attributes and access them when running (within its run method).
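Sketching that pattern with made-up names: the shared mutable objects go in through __init__, are kept as attributes, and are used inside run:

import threading

class Summer(threading.Thread):
    def __init__(self, numbers, results, lock):
        super().__init__()
        self.numbers = numbers   # per-thread input
        self.results = results   # shared mutable output
        self.lock = lock         # shared lock guarding it

    def run(self):
        total = sum(self.numbers)
        with self.lock:
            self.results.append(total)

results, lock = [], threading.Lock()
threads = [Summer(range(i, i + 10), results, lock) for i in range(0, 40, 10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)       # four partial sums, one from each thread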
When do I need threading.local()?
You rarely do, so you probably don't.
I'd like to avoid _thread if possible to be pythonic
Without a doubt, avoid it.
I'm happy to accept that this might not be possible, let alone sensible, but is it possible to keep a persistent reference to an object I have created?
For example, in a few of my views I have code that looks a bit like this (simplified for clarity):
api = Webclient()
api.login(GPLAY_USER,GPLAY_PASS)
url = api.get_stream_urls(track.stream_id)[0]
client = mpd.MPDClient()
client.connect("localhost", 6600)
client.clear()
client.add(url)
client.play()
client.disconnect()
It would be really neat if I could just keep one reference to api and client throughout my project, especially to avoid repeated API logins with gmusicapi. Can I declare them in settings.py (I'm guessing this is a terrible idea), or keep a persistent connection to them by some other means?
Ideally I would then have functions like get_api() which would check the existing object was still ok and return it or create a new one as required.
You can't have anything that's instantiated once per application, because you'll almost certainly have more than one server process, and objects aren't easily shared across processes. However, one per process is definitely possible, and worthwhile. To do that, you only need to instantiate it at module level in the relevant file (e.g. views.py). That means it will be automatically instantiated when Django first imports that file (in that process), and you can refer to it as a global variable in that file. It will persist as long as the process does, and when a new process is created, a new global var will be instantiated.
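As a rough sketch of that idea, including the get_api() helper the question asks about (the django.conf.settings attributes and the gmusicapi import are assumptions that mirror the question's view code):

# views.py -- one instance per server process
from django.conf import settings
from gmusicapi import Webclient

_api = None

def get_api():
    # Return this process's Webclient, logging in on first use only.
    global _api
    if _api is None:
        _api = Webclient()
        _api.login(settings.GPLAY_USER, settings.GPLAY_PASS)
    return _api

A view can then call get_api().get_stream_urls(track.stream_id) without repeating the login; the instance lives for the lifetime of that server process.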
You could make them properties of your application object, or of some other application object that is declared at the top level of your project, before anything else needs it.
If you put them into a class that gets instantiated on the first import and is then just used from that point on, it can be imported by several modules and accessed.
Either way, they would live for the length of the execution.
You can't persist the object reference, but you can store something in either Django's in-memory cache or its memcached cache backend.
Django Cache
https://docs.djangoproject.com/en/dev/topics/cache/
See also
Creating a Persistent Data Object In Django
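A hypothetical sketch of the cache approach (the function and key scheme are made up): you cache the derived data, such as the stream URL, rather than the object itself:

from django.core.cache import cache

def get_stream_url(api, stream_id):
    key = "stream_url:%s" % stream_id       # hypothetical key scheme
    url = cache.get(key)                    # None on a cache miss
    if url is None:
        url = api.get_stream_urls(stream_id)[0]
        cache.set(key, url, 300)            # keep it for five minutes
    return url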
GLOBAL_VAR = 1

class Worker:
    class_var = 2
Worker instances are created by multiple processes. Will they have their own copies of the above variables? If not, can I use them to safely lock access to a resource accessed by multiple class instances (creating and accessing them in a thread-safe manner, of course)? I want to do this transparently for the class's user. What is the right way to do it?
Are you talking about multithreading or multiprocessing? These are very different in Python.
Threads can access variables just as you are used to in Python, without restriction.
Processes, on the other hand, can't access variables from another process (apart from a few explicitly shared variable types). A process copies the current state of the variables on creation, but it is only a copy.
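A small sketch of that difference (illustrative, not from the question): a child process mutates only its own copies, and the parent's values are untouched:

import multiprocessing

GLOBAL_VAR = 1

class Worker:
    class_var = 2

def mutate():
    global GLOBAL_VAR
    GLOBAL_VAR = 100          # changes only this process's copy
    Worker.class_var = 200

if __name__ == "__main__":
    p = multiprocessing.Process(target=mutate)
    p.start()
    p.join()
    print(GLOBAL_VAR, Worker.class_var)   # still prints: 1 2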