I've assumed that multiprocessing.Pool uses pickle to pass initargs to child processes.
However, I find the following strange:
value = multiprocessing.Value('i', 1)
multiprocessing.Pool(initializer=worker, initargs=(value,))  # Works
But this does not work:
pickle.dumps(value)
throwing:
RuntimeError: Synchronized objects should only be shared between processes through inheritance
Why is that, and how can multiprocessing's initargs bypass it, given that it uses pickle as well?
As I understand it, multiprocessing.Value uses shared memory behind the scenes, so what is the difference between inheritance and passing it via initargs? I'm asking specifically about Windows, where the code does not fork, so a new instance of multiprocessing.Value is created.
And if you had instead passed an instance of multiprocessing.Lock(), the error message would have been RuntimeError: Lock objects should only be shared between processes through inheritance. These things can be passed as arguments if you are creating an instance of multiprocessing.Process, which is in fact what is being used when you say initializer=worker, initargs=(value,). The test being made is whether a process is currently being spawned, which is not the case when you already have an existing process pool and are now just submitting some work for it. But why this restriction?
Would it make sense for you to be able to pickle this shared memory to a file and then, a week later, try to unpickle it and use it? Of course not! Python cannot know that you would not do anything so foolish, so it places tight restrictions on how shared memory and locks can be pickled and unpickled: only while they are being passed to a newly created process.
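The check described above can be sketched as follows. This is only an illustration, not the library's internals: describe_pickle_attempt, init_worker and read_shared are hypothetical helper names. It shows that pickling a Value directly is refused, while handing the same Value to the pool through initargs succeeds, because the pool workers are being created at that moment.

```python
import multiprocessing
import pickle


def describe_pickle_attempt(obj):
    """Try to pickle obj outside of process creation and report the outcome."""
    try:
        pickle.dumps(obj)
    except RuntimeError as exc:
        return "refused: {}".format(exc)
    return "pickled"


def init_worker(shared_value):
    # Runs once in each pool worker; keep the shared value in a module global.
    global _shared
    _shared = shared_value


def read_shared(_):
    # Executed in a worker; reads the Value that arrived through initargs.
    return _shared.value


if __name__ == "__main__":
    value = multiprocessing.Value("i", 1)
    # No process is being spawned here, so pickling is refused.
    print(describe_pickle_attempt(value))
    # The workers are created inside Pool(), so passing the Value is allowed.
    with multiprocessing.Pool(2, initializer=init_worker, initargs=(value,)) as pool:
        print(pool.map(read_shared, range(3)))
```

Passing the same Value as an argument to pool.map instead would be refused, since the workers already exist by then.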
Related
If I pass a reference to an instance method, instead of a module-level function, into multiprocessing.Process, then when I call its start method, is an exact copy of the parent process's instance passed in, or is the constructor called again? What happens with 'deep' instance member objects? Exact copy or default values?
Instances are not passed between processes, because instances are per Python VM process.
Values passed between processes are pickled. Unpickling normally restores instances by means other than calling __init__. As far as I can tell, it directly sets attribute values to resurrect an instance, and resolves references to and from other instances.
Provided that you run identical code in either process (and with multiprocessing, you do), it restores a reference to a correct instance method inside a restored chain of instances.
This means that if an __init__ in the chain of objects does something side-effectful, it will not have been done on the receiving side, that is, in a subprocess. Do such initialization explicitly.
All in all, it's easiest to share (effectively) immutable objects between parallel processes, and reconcile results after join()ing all of them.
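A small sketch of the point about __init__, using pickle directly rather than a real subprocess. Tracker and round_trip are hypothetical names made up for this illustration. In Python 3, pickling a bound method also pickles the instance it is bound to, and the restored instance is rebuilt without __init__ being called.

```python
import pickle


class Tracker:
    """Hypothetical class whose __init__ has a visible side effect."""

    init_calls = 0

    def __init__(self, name):
        Tracker.init_calls += 1  # the side effect we want to observe
        self.name = name

    def greet(self):
        return "hello from {}".format(self.name)


def round_trip(obj):
    # Stand-in for what happens to arguments that cross a process boundary.
    return pickle.loads(pickle.dumps(obj))


original = Tracker("parent")
restored_method = round_trip(original.greet)  # a bound method (Python 3)
restored_self = restored_method.__self__      # a copy of the instance
```

Here restored_self is a distinct object with the same attribute values, and Tracker.init_calls is still 1 afterwards, so any setup that __init__ performs has to be repeated explicitly on the receiving side.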
I am trying to share a lock among processes. I understand that the way to share a lock is to pass it as an argument to the target function. However I found that even the approach below is working. I could not understand the way the processes are sharing this lock. Could anyone please explain?
import multiprocessing as mp
import time

class SampleClass:
    def __init__(self):
        self.lock = mp.Lock()
        self.jobs = []
        self.total_jobs = 10

    def test_run(self):
        for i in range(self.total_jobs):
            p = mp.Process(target=self.run_job, args=(i,))
            p.start()
            self.jobs.append(p)
        for p in self.jobs:
            p.join()

    def run_job(self, i):
        with self.lock:
            print('Sleeping in process {}'.format(i))
            time.sleep(5)

if __name__ == '__main__':
    t = SampleClass()
    t.test_run()
On Windows (which you said you're using), these kinds of things always reduce to details about how multiprocessing plays with pickle, because all Python data crossing process boundaries on Windows is implemented by pickling on the sending end (and unpickling on the receiving end).
My best advice is to avoid doing things that raise such questions to begin with ;-) For example, the code you showed blows up on Windows under Python 2, and also blows up under Python 3 if you use a multiprocessing.Pool method instead of multiprocessing.Process.
It's not just the lock: simply trying to pickle a bound method (like self.run_job) blows up in Python 2. Think about it. You're crossing a process boundary, and there isn't an object corresponding to self on the receiving end. To what object is self.run_job supposed to be bound on the receiving end?
In Python 3, pickling self.run_job also pickles a copy of the self object. So that's the answer: a SampleClass object corresponding to self is created by magic on the receiving end. Clear as mud. t's entire state is pickled, including t.lock. That's why it "works".
See this for more implementation details:
Why can I pass an instance method to multiprocessing.Process, but not a multiprocessing.Pool?
In the long run, you'll suffer the fewest mysteries if you stick to things that were obviously intended to work: pass module-global callable objects (neither, e.g., instance methods nor local functions), and explicitly pass multiprocessing data objects (whether an instance of Lock, Queue, manager.list, etc etc).
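As a rough sketch of that advice, the example from the question could be restructured like this. start_jobs is a hypothetical helper name, and the sleep from the original is left out to keep it short. The target is a module-level function and the lock is passed explicitly in args.

```python
import multiprocessing as mp


def run_job(lock, i):
    """Module-level target: the lock arrives as an explicit argument."""
    with lock:
        message = "job {} held the lock".format(i)
        print(message)
    # Process ignores the target's return value; it is returned here only
    # so the function can also be called directly.
    return message


def start_jobs(total_jobs):
    lock = mp.Lock()
    # The lock is handed over while each Process is being created,
    # which is the case multiprocessing permits.
    jobs = [mp.Process(target=run_job, args=(lock, i)) for i in range(total_jobs)]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
    return [p.exitcode for p in jobs]


if __name__ == "__main__":
    print(start_jobs(3))
```

Nothing here depends on how a bound method or its instance gets pickled, so the behaviour should be the same under fork and spawn.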
On Unix Operating Systems, new processes are created via the fork primitive.
The fork primitive works by cloning the parent process's memory address space and assigning it to the child. The child gets a copy of the parent's memory, as well as of its file descriptors and shared objects.
This means that, when you call fork, if the parent has a file open, the child will have it too. The same applies to shared objects such as pipes, sockets, etc.
In Unix+CPython, Locks are realized via the sem_open primitive which is designed to be shared when forking a process.
I usually recommend against mixing concurrency (multiprocessing in particular) and OOP because it frequently leads to these kinds of misunderstandings.
EDIT:
I just saw that you are using Windows. Tim Peters gave the right answer. For the sake of abstraction, Python tries to provide OS-independent behaviour through its API. When an instance method is used as the target, Python pickles the object and sends it over a pipe, which gives behaviour similar to that on Unix.
I'd recommend reading the programming guidelines for multiprocessing. Your issue is addressed in the first point in particular:
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
It is probably best to stick to using queues or pipes for communication between processes rather than using the lower level synchronization primitives.
When I put an object in a Queue, is it necessary to create a deep copy of the object and then put that in the Queue?
If you can ensure that the object is only processed in one thread, this is not a problem. But if you can't, it is recommended to use a deep copy.
The Queue object doesn't do this automatically when you put the object into it.
See Refs
Multithreading, Python and passed arguments
Python in Practice: Create Better Programs Using Concurrency... p.154
Keep in mind that the object needs to be picklable (Multiprocessing Basics):
It is usually more useful to be able to spawn a process with arguments to tell it what work to do. Unlike with threading, to pass arguments to a multiprocessing Process the argument must be able to be serialized using pickle. This example passes each worker a number so the output is a little more interesting.
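To make the copy-versus-reference distinction above concrete, here is a minimal sketch. through_thread_queue and through_process_queue are hypothetical helper names. A queue.Queue hands back the very same object, so a deep copy is your job if several threads may touch it, while a multiprocessing.Queue pickles on put, so the receiver always gets its own copy.

```python
import multiprocessing
import queue


def through_thread_queue(item):
    """queue.Queue stores a reference; get() returns the same object."""
    q = queue.Queue()
    q.put(item)
    return q.get()


def through_process_queue(item):
    """multiprocessing.Queue pickles the item; get() returns a copy."""
    q = multiprocessing.Queue()
    q.put(item)
    return q.get(timeout=5)


if __name__ == "__main__":
    task = {"id": 1, "tags": ["a"]}
    print(through_thread_queue(task) is task)   # same object
    copy = through_process_queue(task)
    print(copy == task, copy is task)           # equal content, different object
```

One caveat: multiprocessing.Queue pickles in a background feeder thread, so an object mutated immediately after put() may be sent in its mutated state.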
I have a problem understanding how data is exchanged between processes (in the multiprocessing implementation). It behaves as if parameters are passed by reference (or by copy, depending on whether the variable is mutable or immutable). If so, how is this achieved between processes?
The example code below is understandable to me if it is executed within one process (for instance, if ConsumerProcess is a thread rather than a process), but how does it work when it runs in two separate processes?
tasks = Queue()
consumerProcess = ConsumerProcess(tasks) # it is subprocess for main
tasks.put(aTask) # why does it behave as reference?
The multiprocessing library lets you either use shared memory or, as in the case of your Queue class, use a manager service that coordinates communication between processes.
See the Sharing state between processes and Managers sections in the documentation.
Managers use Proxy objects to represent state in a process. The Queue class is such a proxy. State is then shared via pickled states:
An important feature of proxy objects is that they are picklable so they can be passed between processes.
I have a set of object states that is larger than I think it would be reasonable to thread or process on a 1:1 basis. Let's say it looks like this:
class SubState(object):
    def __init__(self):
        self.stat_1 = None
        self.stat_2 = None
        self.list_1 = []

class State(object):
    def __init__(self):
        self.my_sub_states = {'a': SubState(), 'b': SubState(), 'c': SubState()}
What I'd like to do is to make each of the sub_states to the self.my_sub_states keys shared, and simply access them by grabbing a single lock for the entire sub-state - i.e. self.locks={'a': multiprocessing.Lock(), 'b': multiprocessing.Lock() etc. and then release it when I'm done. Is there a class I can inherit to share an entire SubState object with a single Lock?
The actual worker processes would be pulling tasks from a queue (I can't pass the sub_states as args into the process because they don't know which sub_state they need until they get the next task).
Edit: also I'd prefer not to use a manager - managers are atrociously slow (I haven't done the benchmarks but I'm inclined to think an in-memory database would work faster than a manager if it came down to it).
As the multiprocessing docs state, you've really only got two options for actually sharing state between multiprocessing.Process instances (at least without going to third-party options - e.g. redis):
Use a Manager
Use multiprocessing.sharedctypes
A Manager will allow you to share pure Python objects, but as you pointed out, both read and write access to objects being shared this way is quite slow.
multiprocessing.sharedctypes will use actual shared memory, but you're limited to sharing ctypes objects. So you'd need to convert your SubState object to a ctypes.Structure. Also of note is that each multiprocessing.sharedctypes object created with the default lock=True has its own lock built in, so you can synchronize access to each object by taking that lock explicitly (via get_lock()) before operating on it.
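A minimal sketch of the sharedctypes route, under some loud assumptions: SubStateStruct, make_shared_states and bump are hypothetical names, the field types are guesses (the question only shows None and an empty list), and list_1 is replaced by a fixed-size array because a growable Python list has no ctypes equivalent.

```python
import ctypes
from multiprocessing import sharedctypes


class SubStateStruct(ctypes.Structure):
    # Hypothetical fixed-layout stand-in for SubState. The field types and
    # the array length of 4 are assumptions made for this illustration.
    _fields_ = [
        ("stat_1", ctypes.c_int),
        ("stat_2", ctypes.c_double),
        ("list_1", ctypes.c_int * 4),
    ]


def make_shared_states(keys):
    # Value() allocates each struct in shared memory (zero-initialised) and
    # wraps it in a synchronized object that carries its own lock.
    return {key: sharedctypes.Value(SubStateStruct) for key in keys}


def bump(shared_state, amount):
    # Hold the per-object lock across the whole read-modify-write sequence.
    with shared_state.get_lock():
        shared_state.stat_1 += amount
        shared_state.stat_2 = shared_state.stat_1 / 2.0
    return shared_state.stat_1


if __name__ == "__main__":
    states = make_shared_states(["a", "b", "c"])
    print(bump(states["a"], 3), states["a"].stat_2)
```

Like locks, these objects can only be handed to workers at process creation time (for example through Process args or a pool's initargs). A task pulled from the queue would then carry just the key, such as 'a', and the worker would look up the already-shared object.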