What are the semantics of data passed to threading.Thread.__init__()? Are they copied over and made local to the thread? Or, do they continue to be shared with the creating thread? The docs say that args is a tuple, so I assume it will be deep copied, but would like to make sure.
Basically, I'd like to dump a buffer periodically to disk, and I plan to pass this buffer as the arg to a saving thread's __init__. Can I continue modifying the buffer in the calling thread without worrying about whether it will be affected in the saving thread?
Data are generally shared in Python unless you explicitly copy. Dumping a buffer from one thread while modifying it in another is not a safe operation unless the buffer itself has a thread-safe design. You need to synchronise access to the buffer somehow.
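One common pattern is a minimal sketch like the following (the names buffer, buffer_lock, save_to_disk and dump_buffer are illustrative, not from the question): take a snapshot of the buffer while holding a lock and hand only the snapshot to the saving thread, so the calling thread can keep modifying the live buffer.

import threading

buffer = []                       # shared buffer, modified by the main thread
buffer_lock = threading.Lock()    # guards every access to buffer

def save_to_disk(snapshot):
    # The saving thread only ever sees its private snapshot, so the
    # main thread is free to keep modifying the live buffer.
    with open('dump.txt', 'a') as f:
        f.write(repr(snapshot) + '\n')

def dump_buffer():
    with buffer_lock:
        snapshot = list(buffer)   # copy taken while holding the lock
        buffer.clear()
    threading.Thread(target=save_to_disk, args=(snapshot,)).start()

with buffer_lock:                 # the main thread also locks before modifying
    buffer.append('some data')
dump_buffer()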
Can I continue modifying the buffer in the calling thread without worrying about whether it will be affected in the saving thread?
You can if you use multiprocessing.Process instead of threading.Thread ... the process gets the data at the time of the Process.start() call via fork -- changes to the data in one process cannot affect the data in the other. Although, to do IPC you will need to use a Queue or a Pipe, or use shared objects from the multiprocessing module (Value, Array, etc.).
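A small sketch of that behaviour (worker, buf and q are made-up names; a Queue carries the result back precisely because the lists themselves are independent copies):

import multiprocessing as mp

def worker(buf, q):
    # buf is this process's own copy, taken when start() was called;
    # later changes in the parent are not visible here, and vice versa.
    buf.append('added in child')
    q.put(len(buf))

if __name__ == '__main__':
    buf = [1, 2, 3]
    q = mp.Queue()
    p = mp.Process(target=worker, args=(buf, q))
    p.start()
    buf.append('added in parent')   # does not affect the child's copy
    print(q.get())                  # prints 4, not 5
    p.join()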
Related
If I instantiate an object in the main thread, and then send one of its member methods to a ThreadPoolExecutor, does Python somehow create a copy-by-value of the object and send it to the subthread, so that the object's member method will have access to its own copy of self?
Or is it indeed accessing self from the object in the main thread, meaning that every method call in a subthread is modifying / overwriting the same properties (living in the main thread)?
Threads share a memory space. There is no magic going on behind the scenes, so code in different threads accesses the same objects. Thread switches can occur at any time, although most simple Python operations are atomic. It is up to you to avoid race conditions. Normal Python scoping rules apply.
You might want to read about thread-local variables (threading.local) if you want a workaround to the default behavior.
Processes are quite different. Each process has its own memory space and its own copy of all the objects it references.
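To make the thread case concrete, here is a minimal sketch (the Counter class and its names are invented for illustration): a bound method submitted to a ThreadPoolExecutor still operates on the single object created in the main thread, so a lock is needed around the shared mutation.

import threading
from concurrent.futures import ThreadPoolExecutor

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Every worker thread sees the same self; the lock prevents a
        # race on the read-modify-write of self.value.
        with self._lock:
            self.value += 1

counter = Counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(1000):
        pool.submit(counter.increment)
print(counter.value)   # 1000: all threads mutated the one shared object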
I've assumed that multiprocessing.Pool uses pickle to pass initargs to child processes.
However, I find the following strange:
value = multiprocessing.Value('i', 1)
multiprocessing.Pool(initializer=worker, initargs=(value,))  # Works
But this does not work:
pickle.dumps(value)
throwing:
RuntimeError: Synchronized objects should only be shared between processes through inheritance
Why is that, and how can multiprocessing's initargs bypass it, given that it uses pickle as well?
As I understand it, multiprocessing.Value uses shared memory behind the scenes; what is the difference between inheritance and passing it via initargs? Specifically on Windows, where the code does not fork, so a new instance of multiprocessing.Value is created.
And if you had instead passed an instance of multiprocessing.Lock(), the error message would have been RuntimeError: Lock objects should only be shared between processes through inheritance. These things can be passed as arguments if you are creating an instance of multiprocessing.Process, which is in fact what is being used when you say initializer=worker, initargs=(value,). The test being made is whether a process is currently being spawned, which is not the case when you already have an existing process pool and are now just submitting some work for it. But why this restriction?
Would it make sense for you to be able to pickle this shared memory to a file and then a week later try to unpickle it and use it? Of course not! Python cannot know that you would not be doing anything so foolish as that, and so it places great restrictions on how shared memory and locks can be pickled/unpickled: it is allowed only for passing them to other processes.
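For completeness, a sketch of the initializer/initargs route the question uses (worker and task are illustrative names): the Value is handed to each pool process exactly once, at process creation, and stashed in a global so that later tasks in that process can reach it.

import multiprocessing

def worker(shared_value):
    # Runs once in each pool process; keep the shared Value in a global
    # so tasks executed later in this process can use it.
    global value
    value = shared_value

def task(_):
    with value.get_lock():
        value.value += 1

if __name__ == '__main__':
    value = multiprocessing.Value('i', 0)
    with multiprocessing.Pool(4, initializer=worker, initargs=(value,)) as pool:
        pool.map(task, range(100))
    print(value.value)   # 100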
When I put an object in a Queue, is it necessary to create a deep copy of the object before putting it in the Queue?
If you can ensure that the object is only processed in one thread, this is not a problem. But if you can't, it is recommended to use a deep copy.
The Queue object doesn't do this automatically when you put the object into it.
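A minimal sketch of that recommendation (the names are illustrative), using copy.deepcopy so the consumer gets an independent object:

import copy
import queue

work_queue = queue.Queue()

item = {'rows': [1, 2, 3]}
work_queue.put(copy.deepcopy(item))   # the consumer thread gets its own copy
item['rows'].append(4)                # safe: the queued copy is unaffected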
See Refs
Multithreading, Python and passed arguments
Python in Practice: Create Better Programs Using Concurrency... p.154
Keep in mind that the object needs to be picklable (Multiprocessing Basics):
It is usually more useful to be able to spawn a process with arguments to tell it what work to do. Unlike with threading, to pass arguments to a multiprocessing Process the arguments must be able to be serialized using pickle. This example passes each worker a number so the output is a little more interesting.
I have a problem understanding how data is exchanged between processes (multiprocessing implementation). It behaves as if parameters are passed by reference (or copied, depending on whether the variable is mutable or immutable). If so, how is that achieved between processes?
The example code below is understandable to me if it runs within one process (if ConsumerProcess were a thread rather than a process, for instance), but how does it work when it runs in 2 separate processes?
tasks = Queue()
consumerProcess = ConsumerProcess(tasks) # it is subprocess for main
tasks.put(aTask) # why does it behave as reference?
The multiprocessing library either lets you use shared memory or, as in the case of your Queue class, a manager service that coordinates communication between processes.
See the Sharing state between processes and Managers sections in the documentation.
Managers use proxy objects to represent state held in a separate process. The Queue class is such a proxy. State is then shared via these picklable proxies:
An important feature of proxy objects is that they are picklable so they can be passed between processes.
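As a hedged sketch of what that looks like in practice (consumer and tasks are illustrative names), a manager-backed queue's proxy can be passed to a child process, and both sides talk to the real queue held by the manager process:

import multiprocessing

def consumer(tasks):
    # tasks is a proxy object; operations on it are forwarded to the
    # manager process that holds the real queue.
    print(tasks.get())

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        tasks = manager.Queue()      # proxy to a queue living in the manager process
        p = multiprocessing.Process(target=consumer, args=(tasks,))
        p.start()
        tasks.put('a task')          # visible to the child through the proxy
        p.join()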
import multiprocessing as mp

def delay_one_second(event):
    print('in SECONDARY process, preparing to wait for 1 second')
    event.wait(1)
    print('in the SECONDARY process, preparing to raise the event')
    event.set()

if __name__ == '__main__':
    evt = mp.Event()
    print('preparing to wait 10 seconds in the PRIMARY process')
    mp.Process(target=delay_one_second, args=(evt,)).start()
    evt.wait(10)
    print('PRIMARY process, waking up')
This code (which runs fine as a script via the "python module.py" command inside cmd.exe) yields a surprising result.
The main process apparently only waits for 1 second before waking up. For this to happen, it means that the secondary process has a reference to an object in the main process.
How can this be? I was expecting to have to use a multiprocessing.Manager(), to share objects between processes, but how is this possible?
I mean, the processes are not threads; they shouldn't use the same memory space. Does anyone have any idea what's going on here?
The short answer is that the shared memory is not managed by a separate process; it's managed by the operating system itself.
You can see how this works if you spend some time browsing through the multiprocessing source. You'll see that an Event object uses a Semaphore and a Condition, both of which rely on the locking behavior provided by the SemLock object. This, in turn, wraps a _multiprocessing.SemLock object, which is implemented in C and depends on either sem_open (POSIX) or CreateSemaphore (Windows).
These are C functions that enable access to shared resources that are managed by the operating system itself -- in this case, named semaphores. They can be shared between threads or processes; the OS takes care of everything. What happens is that when a new semaphore is created, it is given a handle. Then, when a new process that needs access to that semaphore is created, it's given a copy of the handle. It then passes that handle to sem_open or CreateSemaphore, and the operating system gives the new process access to the original semaphore.
So the memory is being shared, but it's being shared as part of the operating system's built-in support for synchronization primitives. In other words, in this case, you don't need to open a new process to manage the shared memory; the operating system takes over that task. But this is only possible because Event doesn't need anything more complex than a semaphore to work.
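The same OS-level sharing can be seen with a bare Semaphore (a small sketch, analogous to the Event example above; worker is an invented name):

import multiprocessing as mp
import time

def worker(sem):
    # sem wraps an OS semaphore whose handle this process received at
    # creation time, so releasing it here wakes the parent.
    time.sleep(1)
    sem.release()

if __name__ == '__main__':
    sem = mp.Semaphore(0)
    mp.Process(target=worker, args=(sem,)).start()
    sem.acquire()        # blocks until the child releases the semaphore
    print('PRIMARY process woken by the child')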
The documentation says that the multiprocessing module follows the threading API. My guess would be that it uses a mechanism similar to fork. If you fork a process, your OS will create a copy of the current process. That means it copies the heap and stack, including all your variables and globals, and that's what you're seeing.
You can see it for yourself if you pass the function below to a new process.
def print_globals():
    print(globals())
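Put together as a complete script, that might look like the following (a sketch; some_global is a made-up name, and this describes the fork start method -- with spawn the child re-imports the module instead of copying the parent's memory):

import multiprocessing as mp

some_global = 'set in the parent before start()'

def print_globals():
    # Runs in the child; with fork it prints the parent's globals,
    # including some_global, even though nothing was passed explicitly.
    print(globals())

if __name__ == '__main__':
    mp.Process(target=print_globals).start()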