Python: Why is the multiprocessing lock shared among processes here?

I am trying to share a lock among processes. I understand that the way to share a lock is to pass it as an argument to the target function. However, I found that even the approach below works. I could not understand how the processes are sharing this lock. Could anyone please explain?
import multiprocessing as mp
import time

class SampleClass:
    def __init__(self):
        self.lock = mp.Lock()
        self.jobs = []
        self.total_jobs = 10

    def test_run(self):
        for i in range(self.total_jobs):
            p = mp.Process(target=self.run_job, args=(i,))
            p.start()
            self.jobs.append(p)
        for p in self.jobs:
            p.join()

    def run_job(self, i):
        with self.lock:
            print('Sleeping in process {}'.format(i))
            time.sleep(5)

if __name__ == '__main__':
    t = SampleClass()
    t.test_run()

On Windows (which you said you're using), these kinds of things always reduce to details about how multiprocessing plays with pickle, because all Python data crossing process boundaries on Windows is implemented by pickling on the sending end (and unpickling on the receiving end).
My best advice is to avoid doing things that raise such questions to begin with ;-) For example, the code you showed blows up on Windows under Python 2, and also blows up under Python 3 if you use a multiprocessing.Pool method instead of multiprocessing.Process.
It's not just the lock: simply trying to pickle a bound method (like self.run_job) blows up in Python 2. Think about it. You're crossing a process boundary, and there isn't an object corresponding to self on the receiving end. To what object is self.run_job supposed to be bound on the receiving end?
In Python 3, pickling self.run_job also pickles a copy of the self object. So that's the answer: a SampleClass object corresponding to self is created by magic on the receiving end. Clear as mud. t's entire state is pickled, including t.lock. That's why it "works".
See this for more implementation details:
Why can I pass an instance method to multiprocessing.Process, but not a multiprocessing.Pool?
In the long run, you'll suffer the fewest mysteries if you stick to things that were obviously intended to work: pass module-global callable objects (not, e.g., instance methods or local functions), and explicitly pass multiprocessing data objects (an instance of Lock, Queue, manager.list(), etc.).
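For comparison, a minimal sketch of that recommended shape for the code in the question (module-level target function, lock passed explicitly; this pattern works with both the spawn and fork start methods, and the shorter sleep is just for illustration):

import multiprocessing as mp
import time

def run_job(lock, i):
    # Module-level function: picklable by name on every platform.
    with lock:
        print('Sleeping in process {}'.format(i))
        time.sleep(1)

if __name__ == '__main__':
    lock = mp.Lock()    # created in the parent and passed explicitly to each child
    jobs = [mp.Process(target=run_job, args=(lock, i)) for i in range(10)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()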

On Unix operating systems, new processes are created via the fork primitive.
fork works by cloning the parent process's memory address space and assigning the copy to the child. The child gets a copy of the parent's memory, as well as of its file descriptors and shared objects.
This means that, when you call fork, if the parent has a file open, the child will have it too. The same applies to shared objects such as pipes, sockets, etc.
On Unix, CPython Locks are realized via the sem_open primitive, which is designed to be shared when forking a process.
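For illustration, a minimal Unix-only sketch of that inheritance: the lock is a module-level global that is never passed to the children, yet all of them end up using the same OS-level semaphore (the fork context is requested explicitly here so the example does not depend on the platform default):

import time
import multiprocessing as mp

ctx = mp.get_context('fork')      # Unix-only start method
lock = ctx.Lock()                 # backed by a POSIX semaphore in CPython

def worker(i):
    with lock:                    # not passed as an argument: inherited through fork
        print('process {} holds the inherited lock'.format(i))
        time.sleep(1)

if __name__ == '__main__':
    procs = [ctx.Process(target=worker, args=(i,)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()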
I usually recommend against mixing concurrency (multiprocessing in particular) and OOP because it frequently leads to this kind of misunderstanding.
EDIT:
Saw just now that you are using Windows. Tim Peters gave the right answer. For the sake of abstraction, Python tries to provide OS-independent behaviour over its API: when calling an instance method, it pickles the object and sends it over a pipe, giving behaviour similar to the Unix case.
I'd recommend reading the programming guidelines for multiprocessing. Your issue is addressed in particular by the first point:
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
It is probably best to stick to using queues or pipes for communication between processes rather than using the lower level synchronization primitives.

Related

multiprocessing initargs - how does it work under the hood?

I've assumed that multiprocessing.Pool uses pickle to pass initargs to child processes.
However, I find the following strange:
value = multiprocessing.Value('i', 1)
multiprocessing.Pool(initializer=worker, initargs=(value, ))  # Works
But this does not work:
pickle.dumps(value)
throwing:
RuntimeError: Synchronized objects should only be shared between processes through inheritance
Why is that, and how can multiprocessing initargs bypass it, given that it uses pickle as well?
As I understand it, multiprocessing.Value uses shared memory behind the scenes. What is the difference between inheritance and passing it via initargs? Specifically on Windows, where the code does not fork, so a new instance of multiprocessing.Value is created.
And if you had instead passed an instance of multiprocessing.Lock(), the error message would have been RuntimeError: Lock objects should only be shared between processes through inheritance. These things can be passed as arguments when an instance of multiprocessing.Process is being created, which is in fact what happens when you say initializer=worker, initargs=(value,): the pool's worker processes are being spawned at that point. The test being made is whether a process is currently being spawned, which is not the case when you already have an existing process pool and are just submitting work to it. But why this restriction?
Would it make sense for you to be able to pickle this shared memory to a file and then, a week later, unpickle it and use it? Of course not! Python cannot know that you would not be doing anything so foolish, so it places great restrictions on how shared memory and locks can be pickled/unpickled, which is only for passing to other processes.
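In other words, the supported route with a Pool is to hand the shared object to each worker while it is being created and stash it somewhere the task function can reach it, typically a module global set by the initializer. A rough sketch (worker_init, do_work and the counter value are illustrative names, not from the original question):

import multiprocessing

shared_value = None   # filled in per worker by the initializer

def worker_init(value):
    # Runs in each child while it is being spawned, so receiving the
    # synchronized Value here is permitted.
    global shared_value
    shared_value = value

def do_work(_):
    with shared_value.get_lock():
        shared_value.value += 1

if __name__ == '__main__':
    value = multiprocessing.Value('i', 0)
    with multiprocessing.Pool(4, initializer=worker_init, initargs=(value,)) as pool:
        pool.map(do_work, range(100))
    print(value.value)   # 100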

Python multiprocessing guidelines seems to conflict: share memory or pickle?

I'm playing with Python multiprocessing module to have a (read-only) array shared among multiple processes. My goal is to use multiprocessing.Array to allocate the data and then have my code forked (forkserver) so that each worker can read straight from the array to do their job.
While reading the Programming guidelines I got a bit confused.
It is first said:
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
It is probably best to stick to using queues or pipes for communication between processes rather than using the lower level synchronization primitives.
And then, a couple of lines below:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
As far as I understand, queues and pipes pickle objects. If so, aren't those two guidelines conflicting?
Thanks.
The second guideline is the one relevant to your use case.
The first is reminding you that this isn't threading where you manipulate shared data structures with locks (or atomic operations). If you use Manager.dict() (which is actually SyncManager.dict) for everything, every read and write has to access the manager's process, and you also need the synchronization typical of threaded programs (which itself may come at a higher cost from being cross-process).
The second guideline suggests inheriting shared, read-only objects via fork; in the forkserver case, this means you have to create such objects before the call to set_start_method, since all workers are children of a process created at that time.
The reports on the usability of such sharing are mixed at best, but if you can use a small number of any of the C-like array types (like numpy or the standard array module), you should see good performance, because the majority of the pages will never be written to (reference-count updates touch only the object header, not the data pages). Note that you do not need multiprocessing.Array here (though it may work fine), since you do not need writes in one concurrent process to be visible in another.
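As a simplified, Unix-only sketch of the inheritance approach (shown with the fork context for brevity; the sizes and chunking are arbitrary): the array is built once in the parent and merely read by the pool workers.

import multiprocessing as mp
from array import array

ctx = mp.get_context('fork')            # inheritance via fork; Unix only

big = array('d', range(1_000_000))      # large read-only payload built in the parent

def chunk_sum(start, stop):
    # Reads the inherited array directly; the data is never sent through a queue.
    return sum(big[start:stop])

if __name__ == '__main__':
    step = 250_000
    chunks = [(i, i + step) for i in range(0, len(big), step)]
    with ctx.Pool(4) as pool:
        print(sum(pool.starmap(chunk_sum, chunks)))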

Objects in Multiprocess Shared Memory?

I have a set of object states which is larger than I think it would be reasonable to thread or process on a 1:1 basis. Let's say it looks like this:
class SubState(object):
    def __init__(self):
        self.stat_1 = None
        self.stat_2 = None
        self.list_1 = []

class State(object):
    def __init__(self):
        self.my_sub_states = {'a': SubState(), 'b': SubState(), 'c': SubState()}
What I'd like to do is make each of the sub-states under the self.my_sub_states keys shared, and simply access them by grabbing a single lock for the entire sub-state - i.e. self.locks = {'a': multiprocessing.Lock(), 'b': multiprocessing.Lock(), ...} - and then release it when I'm done. Is there a class I can inherit from to share an entire SubState object with a single Lock?
The actual process workers would be pulling tasks from a queue (I can't pass the sub_states as args into the processes because they don't know which sub_state they need until they get the next task).
Edit: also I'd prefer not to use a manager - managers are atrociously slow (I haven't done the benchmarks, but I'm inclined to think an in-memory database would work faster than a manager if it came down to it).
As the multiprocessing docs state, you've really only got two options for actually sharing state between multiprocessing.Process instances (at least without going to third-party options - e.g. redis):
Use a Manager
Use multiprocessing.sharedctypes
A Manager will allow you to share pure Python objects, but as you pointed out, both read and write access to objects being shared this way is quite slow.
multiprocessing.sharedctypes will use actual shared memory, but you're limited to sharing ctypes objects, so you'd need to convert your SubState object to a ctypes.Structure. Also of note: objects created via multiprocessing.sharedctypes.Value or Array come with a built-in lock (unless you pass lock=False), so you can synchronize access to each object by taking that lock explicitly before operating on it.
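A rough sketch of the sharedctypes route, assuming SubState can be flattened into fixed-size ctypes fields (the SubStateStruct layout and field sizes here are invented for illustration; Python lists have to become fixed-size arrays):

import ctypes
from multiprocessing import Process, Value

class SubStateStruct(ctypes.Structure):
    # ctypes stand-in for SubState: fixed-size fields only, no Python lists.
    _fields_ = [('stat_1', ctypes.c_double),
                ('stat_2', ctypes.c_double),
                ('list_1', ctypes.c_int * 8)]

def worker(shared):
    with shared.get_lock():        # the wrapper's built-in recursive lock
        shared.stat_1 += 1.0
        shared.list_1[0] += 1

if __name__ == '__main__':
    shared = Value(SubStateStruct, lock=True)   # lives in shared memory
    procs = [Process(target=worker, args=(shared,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(shared.stat_1, shared.list_1[0])      # 4.0 4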

Python multiprocessing.Process object behaves like it would hold a reference to an object in another process. Why?

import multiprocessing as mp

def delay_one_second(event):
    print 'in SECONDARY process, preparing to wait for 1 second'
    event.wait(1)
    print 'in the SECONDARY process, preparing to raise the event'
    event.set()

if __name__ == '__main__':
    evt = mp.Event()
    print 'preparing to wait 10 seconds in the PRIMARY process'
    mp.Process(target=delay_one_second, args=(evt,)).start()
    evt.wait(10)
    print 'PRIMARY process, waking up'
This code (run nicely from inside a module with the "python module.py" command inside cmd.exe) yields a surprising result.
The main process apparently only waits for 1 second before waking up. For this to happen, it means that the secondary process has a reference to an object in the main process.
How can this be? I was expecting to have to use a multiprocessing.Manager(), to share objects between processes, but how is this possible?
I mean the Processes are not threads, they shouldn't use the same memory space. Anyone have any ideas what's going on here?
The short answer is that the shared memory is not managed by a separate process; it's managed by the operating system itself.
You can see how this works if you spend some time browsing through the multiprocessing source. You'll see that an Event object uses a Semaphore and a Condition, both of which rely on the locking behavior provided by the SemLock object. This, in turn, wraps a _multiprocessing.SemLock object, which is implemented in C and depends on either sem_open (POSIX) or CreateSemaphore (Windows).
These are C functions that enable access to shared resources that are managed by the operating system itself -- in this case, named semaphores. They can be shared between threads or processes; the OS takes care of everything. What happens is that when a new semaphore is created, it is given a handle. Then, when a new process that needs access to that semaphore is created, it's given a copy of the handle. It then passes that handle to sem_open or CreateSemaphore, and the operating system gives the new process access to the original semaphore.
So the memory is being shared, but it's being shared as part of the operating system's built-in support for synchronization primitives. In other words, in this case, you don't need to open a new process to manage the shared memory; the operating system takes over that task. But this is only possible because Event doesn't need anything more complex than a semaphore to work.
The documentation says that the multiprocessing module follows the threading API. My guess would be that it uses a mechanism similar to 'fork'. If you fork a process, your OS will create a copy of the current process. That means it copies the heap and the stack, including all your variables and globals, and that's what you're seeing.
You can see it for yourself if you pass the function below to a new process.
def print_globals():
    print globals()

Monitor concurrency (sharing object across processes) in Python

I'm new here and I'm Italian (forgive me if my English is not so good).
I am a computer science student and I am working on a concurrent programming project in Python.
We should use monitors: a class with its methods and data (such as condition variables). An instance (object) of this monitor class should be shared across all the processes we have (created by os.fork or by the multiprocessing module), but we don't know how to do it. It is simpler with threads because they already share memory, but we MUST use processes. Is there any way to make this object (monitor) shareable across all processes?
Hoping I'm not talking nonsense... thanks a lot to everyone for your attention.
Waiting for answers.
Lorenzo
As far as "sharing" the instance, I believe the instructor wants you to make your monitor's interface to its local process such that it's as if it were shared (a la CORBA).
Look into the absolutely fantastic documentation on multiprocessing's Queue:
from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print q.get()    # prints "[42, None, 'hello']"
    p.join()
You should be able to imagine how your monitor's attributes might be propagated among the peer processes when changes are made.
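One conventional way to get that CORBA-ish "looks local, lives elsewhere" behaviour is to register your monitor class with a custom BaseManager; every process then talks to the single real instance through a proxy. A rough sketch (the Monitor class and its methods are invented for illustration):

import threading
from multiprocessing import Process
from multiprocessing.managers import BaseManager

class Monitor(object):
    # Ordinary class; the one real instance lives in the manager process.
    def __init__(self):
        self._lock = threading.Lock()   # protects state against concurrent calls
        self.count = 0

    def increment(self):
        with self._lock:
            self.count += 1

    def value(self):
        with self._lock:
            return self.count

class MonitorManager(BaseManager):
    pass

MonitorManager.register('Monitor', Monitor)

def worker(mon):
    # mon is a proxy: every call is forwarded to the manager process.
    for _ in range(100):
        mon.increment()

if __name__ == '__main__':
    manager = MonitorManager()
    manager.start()
    mon = manager.Monitor()
    procs = [Process(target=worker, args=(mon,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(mon.value())   # 400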
Shared memory between processes is usually a poor idea; when calling os.fork(), the operating system marks all of the memory used by the parent and inherited by the child as copy-on-write; if either process attempts to modify a page, it is instead copied to a new location that is not shared between the two processes.
This means that your usual threading primitives (locks, condition variables, et cetera) are not usable for communicating across process boundaries.
There are two ways to resolve this. The preferred way is to use a pipe and serialize communication on both ends. Brian Cain's answer, using multiprocessing.Queue, works in exactly this way. Because pipes do not have any shared state, and use a robust IPC mechanism provided by the kernel, it's unlikely that you will end up with processes in an inconsistent state.
The other option is to allocate some memory in a special way so that the OS will allow you to use shared memory. The most natural way to do that is with mmap. CPython won't use shared memory for native Python objects, though, so you would still need to sort out how you will use this shared region. A reasonable library for this is numpy, which can map the untyped binary memory region into useful arrays of some sort. Shared memory is much harder to work with in terms of managing concurrency, though, since there's no simple way for one process to know how another process is accessing the shared region. The only time this approach makes much sense is when a small number of processes need to share a large volume of data, since shared memory can avoid copying the data through pipes.
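A minimal Unix-only sketch of that mmap-plus-numpy approach (assuming numpy is installed; the region size is arbitrary): an anonymous shared mapping is created before the fork, viewed as a typed array, and writes made by the child are visible to the parent.

import mmap
import os
import numpy as np

# Anonymous shared mapping (flags default to MAP_SHARED), created before forking.
buf = mmap.mmap(-1, 8 * 1024)
shared = np.frombuffer(buf, dtype=np.float64)   # typed view onto the raw bytes

shared[0] = 0.0
pid = os.fork()
if pid == 0:
    shared[0] = 42.0        # child writes into the shared pages...
    os._exit(0)
os.waitpid(pid, 0)
print(shared[0])            # ...and the parent sees 42.0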
