I have a problem understanding how data is exchanged between processes in a multiprocessing implementation. It behaves as if parameters are passed by reference (or by copy, depending on whether the variable is mutable or immutable). If so, how is that achieved between processes?
The example code below is understandable to me if it is executed within one process (if ConsumerProcess were a thread rather than a process, for instance), but how does it work when it runs in two separate processes?
from multiprocessing import Queue

tasks = Queue()
consumerProcess = ConsumerProcess(tasks) # a subprocess of the main process
tasks.put(aTask) # why does it behave as if tasks were passed by reference?
The multiprocessing library either lets you use shared memory, or, in the case of your Queue, a manager service that coordinates communication between processes.
See the Sharing state between processes and Managers sections in the documentation.
Managers use proxy objects to represent state in a process; the Queue class is such a proxy. State is then exchanged in pickled form:
An important feature of proxy objects is that they are picklable so they can be passed between processes.
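For a concrete picture, here is a minimal sketch of the manager-based variant (the worker function and the message it sends are made up for illustration): the parent creates a queue inside a manager process and hands the child only a picklable proxy, so every put() and get() is forwarded to the manager over a connection.

import multiprocessing as mp

def worker(shared_queue):
    # The child receives a proxy object; calls such as put() are
    # forwarded over a connection to the manager process, which
    # holds the real queue.
    shared_queue.put('hello from the child')

if __name__ == '__main__':
    with mp.Manager() as manager:
        q = manager.Queue()          # a proxy, not the queue itself
        p = mp.Process(target=worker, args=(q,))
        p.start()
        p.join()
        print(q.get())               # -> 'hello from the child'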
Related
I've assumed that multiprocessing.Pool uses pickle to pass initargs to child processes.
However, I find the following strange:
import multiprocessing

value = multiprocessing.Value('i', 1)
multiprocessing.Pool(initializer=worker, initargs=(value,)) # Works
But this does not work:
pickle.dumps(value)
throwing:
RuntimeError: Synchronized objects should only be shared between processes through inheritance
Why is that, and how can multiprocessing's initargs bypass it, given that it uses pickle as well?
As I understand it, multiprocessing.Value uses shared memory behind the scenes, so what is the difference between inheritance and passing it via initargs? I am asking specifically about Windows, where the code does not fork, so a new instance of multiprocessing.Value is created.
And if you had instead passed an instance of multiprocessing.Lock(), the error message would have been RuntimeError: Lock objects should only be shared between processes through inheritance. These things can be passed as arguments if you are creating an instance of multiprocessing.Process, which is in fact what is being used when you say initializer=worker, initargs=(value,). The test being made is whether a process is currently being spawned, which is not the case when you already have an existing process pool and are now just submitting some work for it. But why this restriction?
Would it make sense for you to be able to pickle this shared memory to a file and then try to unpickle it and use it a week later? Of course not! Python cannot know that you would not do anything so foolish, and so it places great restrictions on how shared memory and locks can be pickled/unpickled: pickling them is allowed only for the purpose of passing them to other processes.
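As a hedged sketch of the two permitted routes (the names worker, init_pool and shared are inventions of this example): a Value may be handed to pool workers through initializer/initargs while they are being spawned, or passed directly in the args of a multiprocessing.Process; what is disallowed is pickling it outside of process creation.

import multiprocessing as mp

def worker(i):
    # 'shared' is the Value installed by the initializer below.
    with shared.get_lock():
        shared.value += i

def init_pool(value):
    # Runs once in each pool worker while it is being spawned,
    # which is exactly when the restricted pickling is permitted.
    global shared
    shared = value

if __name__ == '__main__':
    counter = mp.Value('i', 0)
    with mp.Pool(2, initializer=init_pool, initargs=(counter,)) as pool:
        pool.map(worker, range(10))
    print(counter.value)   # 45

Calling pickle.dumps(counter) directly would raise the same RuntimeError quoted above.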
I am trying to share a lock among processes. I understand that the way to share a lock is to pass it as an argument to the target function. However, I found that even the approach below works. I cannot understand how the processes are sharing this lock. Could anyone please explain?
import multiprocessing as mp
import time

class SampleClass:
    def __init__(self):
        self.lock = mp.Lock()
        self.jobs = []
        self.total_jobs = 10

    def test_run(self):
        for i in range(self.total_jobs):
            p = mp.Process(target=self.run_job, args=(i,))
            p.start()
            self.jobs.append(p)

        for p in self.jobs:
            p.join()

    def run_job(self, i):
        with self.lock:
            print('Sleeping in process {}'.format(i))
            time.sleep(5)

if __name__ == '__main__':
    t = SampleClass()
    t.test_run()
On Windows (which you said you're using), these kinds of things always reduce to details about how multiprocessing plays with pickle, because all Python data crossing process boundaries on Windows is implemented by pickling on the sending end (and unpickling on the receiving end).
My best advice is to avoid doing things that raise such questions to begin with ;-) For example, the code you showed blows up on Windows under Python 2, and also blows up under Python 3 if you use a multiprocessing.Pool method instead of multiprocessing.Process.
It's not just the lock: simply trying to pickle a bound method (like self.run_job) blows up in Python 2. Think about it. You're crossing a process boundary, and there isn't an object corresponding to self on the receiving end. To what object is self.run_job supposed to be bound on the receiving end?
In Python 3, pickling self.run_job also pickles a copy of the self object. So that's the answer: a SampleClass object corresponding to self is created by magic on the receiving end. Clear as mud. The entire state of t is pickled, including t.lock. That's why it "works".
See this for more implementation details:
Why can I pass an instance method to multiprocessing.Process, but not a multiprocessing.Pool?
In the long run, you'll suffer the fewest mysteries if you stick to things that were obviously intended to work: pass module-global callable objects (not, e.g., instance methods or local functions), and explicitly pass multiprocessing data objects (whether an instance of Lock, Queue, manager.list, etc.).
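To make that advice concrete, here is a hedged rewrite of the example above along those lines (run_job is now a module-level function, the Lock is passed explicitly, and the sleep is shortened just for illustration):

import multiprocessing as mp
import time

def run_job(lock, i):
    # A module-global function is picklable by name on every platform.
    with lock:                       # the Lock is passed in explicitly
        print('Sleeping in process {}'.format(i))
        time.sleep(1)

if __name__ == '__main__':
    lock = mp.Lock()
    jobs = [mp.Process(target=run_job, args=(lock, i)) for i in range(4)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()

Nothing here requires a bound method or an implicit self to cross the process boundary, so it behaves the same under both the fork and spawn start methods.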
On Unix operating systems, new processes are created via the fork primitive.
The fork primitive works by cloning the parent process's memory address space and assigning it to the child. The child will have a copy of the parent's memory as well as of its file descriptors and shared objects.
This means that, when you call fork, if the parent has a file open, the child will have it too. The same applies to shared objects such as pipes, sockets, etc.
In Unix+CPython, locks are realized via the sem_open primitive, which is designed to be shared when forking a process.
I usually recommend against mixing concurrency (multiprocessing in particular) and OOP because it frequently leads to these kinds of misunderstandings.
EDIT:
I just saw that you are using Windows. Tim Peters gave the right answer. For the sake of abstraction, Python tries to provide OS-independent behaviour over its API. When calling an instance method, it will pickle the object and send it over a pipe, thus providing behaviour similar to that on Unix.
I'd recommend you to read the programming guidelines for multiprocessing. Your issue is addressed in particular in the first point:
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
It is probably best to stick to using queues or pipes for communication between processes rather than using the lower level synchronization primitives.
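As a small illustration of that guideline (the function name square is made up), the child below hands its results back over a Queue instead of mutating shared state; note that the queue is drained before the join:

import multiprocessing as mp

def square(numbers, results):
    # Communicate by message passing rather than shared state.
    for n in numbers:
        results.put(n * n)

if __name__ == '__main__':
    results = mp.Queue()
    p = mp.Process(target=square, args=([1, 2, 3], results))
    p.start()
    out = [results.get() for _ in range(3)]   # drain before joining
    p.join()
    print(out)   # [1, 4, 9]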
Python's multiprocessing module supplies many ways to communicate between processes: Pipe, Queue, Value, Array and Manager. Which of them are better choices?
Use Pipe and Queue if you want to implement message passing.
Use Value and Array if you want to implement shared memory.
Use Managers if you want to expose an object-oriented interface to multiple processes.
Pipe is good for 1-to-1 communication or for byte-level protocols (a short sketch follows this list):
Pipe can be either bidirectional (“duplex”) or unidirectional
Pipe is not concurrency safe: only one process can use the same end of the Pipe; it is good in 1-to-1 communications, otherwise it requires locks
Pipe may transmit picklable Python objects or raw bytes
The buffer size cannot be specified
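A minimal sketch of a duplex Pipe in a 1-to-1 setting (the child function is an invention of this example):

import multiprocessing as mp

def child(conn):
    msg = conn.recv()            # blocks until the parent sends
    conn.send(msg.upper())       # duplex: the same end can also send
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = mp.Pipe()   # duplex by default
    p = mp.Process(target=child, args=(child_conn,))
    p.start()
    parent_conn.send('ping')
    print(parent_conn.recv())             # -> 'PING'
    p.join()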
Queue is similar to a unidirectional Pipe, but may work in many-to-many scenarios (a sketch follows these notes):
Queue is unidirectional: first in, first out
Queue is concurrency safe: multiple processes may use the same end of the Queue; so it is good when there are multiple producers or multiple consumers
Queue may transmit only picklable Python objects
The maximum size of the Queue may be specified
"Whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined."
Queue is implemented using a pipe and some locks/semaphores.
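And a sketch of a Queue with several producers and a bounded size (the producer function and the None sentinel convention are assumptions of this example):

import multiprocessing as mp

def producer(q, name):
    for i in range(3):
        q.put((name, i))      # only picklable objects may be put
    q.put(None)               # sentinel: this producer is done

if __name__ == '__main__':
    q = mp.Queue(maxsize=10)                 # optional upper bound
    workers = [mp.Process(target=producer, args=(q, n)) for n in 'ab']
    for w in workers:
        w.start()
    done = 0
    while done < len(workers):               # drain before joining
        item = q.get()
        if item is None:
            done += 1
        else:
            print(item)
    for w in workers:
        w.join()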
Value and Array:
provide synchronized access to shared data
work only with C data types (ctypes)
Managers:
may be used on various data types: they return proxy objects which expose the same methods as the underlying (shared) object
can handle remote access
Value and Array are a lightweight approach to shared memory. In my experience, the overhead of using a SyncManager and AutoProxy can be huge. If you can solve your problem using a Value or an Array, use them. A SyncManager may be useful to expose an object-oriented interface to multiple processes, provided it is not called too frequently.
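A hedged sketch of the lightweight route (the function work is made up): a Value and an Array live in shared memory, so every process sees the same bytes, and each object carries its own lock for the read-modify-write steps.

import multiprocessing as mp

def work(counter, samples):
    with counter.get_lock():          # Value carries its own lock
        counter.value += 1
    with samples.get_lock():          # guard the read-modify-write
        for i in range(len(samples)):
            samples[i] *= 2

if __name__ == '__main__':
    counter = mp.Value('i', 0)                 # a shared C int
    samples = mp.Array('d', [1.0, 2.0, 3.0])   # a shared C double array
    procs = [mp.Process(target=work, args=(counter, samples)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value, list(samples))        # 2 [4.0, 8.0, 12.0]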
import multiprocessing as mp

def delay_one_second(event):
    print('in SECONDARY process, preparing to wait for 1 second')
    event.wait(1)
    print('in the SECONDARY process, preparing to raise the event')
    event.set()

if __name__ == '__main__':
    evt = mp.Event()

    print('preparing to wait 10 seconds in the PRIMARY process')
    mp.Process(target=delay_one_second, args=(evt,)).start()
    evt.wait(10)
    print('PRIMARY process, waking up')
This code (which runs nicely as "python module.py" from inside cmd.exe) yields a surprising result.
The main process apparently only waits for 1 second before waking up. For this to happen, it means that the secondary process has a reference to an object in the main process.
How can this be? I was expecting to have to use a multiprocessing.Manager(), to share objects between processes, but how is this possible?
I mean, the processes are not threads; they shouldn't use the same memory space. Does anyone have any ideas what's going on here?
The short answer is that the shared memory is not managed by a separate process; it's managed by the operating system itself.
You can see how this works if you spend some time browsing through the multiprocessing source. You'll see that an Event object uses a Semaphore and a Condition, both of which rely on the locking behavior provided by the SemLock object. This, in turn, wraps a _multiprocessing.SemLock object, which is implemented in C and depends on either sem_open (POSIX) or CreateSemaphore (Windows).
These are C functions that enable access to shared resources that are managed by the operating system itself -- in this case, named semaphores. They can be shared between threads or processes; the OS takes care of everything. What happens is that when a new semaphore is created, it is given a handle. Then, when a new process that needs access to that semaphore is created, it's given a copy of the handle. It then passes that handle to sem_open or CreateSemaphore, and the operating system gives the new process access to the original semaphore.
So the memory is being shared, but it's being shared as part of the operating system's built-in support for synchronization primitives. In other words, in this case, you don't need to open a new process to manage the shared memory; the operating system takes over that task. But this is only possible because Event doesn't need anything more complex than a semaphore to work.
The documentation says that the multiprocessing module follows the threading API. My guess would be that it uses a mechanism similar to 'fork'. If you fork a process, your OS will create a copy of the current process. It means that it copies the heap and stack, including all your variables and globals, and that's what you're seeing.
You can see it for yourself if you pass the function below to a new process.
def print_globals():
    print(globals())
What is the difference between multiprocessing.Event and multiprocessing.managers.SyncManager.Event? When do I use each? Why do two different objects exist?
The same question applies to other similar objects that exist both directly in multiprocessing and in Manager (Lock, etc.).
Unfortunately, the only answer given was not very accurate, and no others were offered.
I looked into it on my own and found that multiprocessing.Event can be used to synchronize between processes; that is completely fine.
Event and the other objects from multiprocessing.Manager exist so that you can synchronize between processes that run on different machines, using sockets under the hood. They can also be used to synchronize on a single machine, but they are less efficient for that than the synchronization objects from multiprocessing.synchronize (Event, Lock, and the others).
multiprocessing.Manager is essentially a specialised process that will create instances of multiprocessing's synchronisation primitives on demand in its own address space, and let you access them through RPC proxies. The primitives behave the same, and they have the added flexibility of being accessible from remote hosts (using TCP in the remote case).
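A small sketch contrasting the two (the waiter function is made up): both variants are set in the parent and observed in a child, but the manager-backed one is a proxy whose every call travels over a connection to the manager process.

import multiprocessing as mp

def waiter(event):
    event.wait()
    print('event was set')

if __name__ == '__main__':
    # Plain multiprocessing.Event: an OS-level primitive, single machine.
    local_evt = mp.Event()
    p = mp.Process(target=waiter, args=(local_evt,))
    p.start()
    local_evt.set()
    p.join()

    # Manager-backed Event: a proxy to an Event living in the manager
    # process; also usable remotely, at the cost of IPC on every call.
    with mp.Manager() as manager:
        proxied_evt = manager.Event()
        p = mp.Process(target=waiter, args=(proxied_evt,))
        p.start()
        proxied_evt.set()
        p.join()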