Why is "pickle" and "multiprocessing picklability" so different in Python? - python

Using Python's multiprocessing on Windows requires many arguments to be "picklable" when passing them to child processes.
import multiprocessing

class Foobar:
    def __getstate__(self):
        print("I'm being pickled!")

def worker(foobar):
    print(foobar)

if __name__ == "__main__":
    # Uncomment this on Linux
    # multiprocessing.set_start_method("spawn")
    foobar = Foobar()
    process = multiprocessing.Process(target=worker, args=(foobar,))
    process.start()
    process.join()
The documentation mentions this explicitly several times:
Picklability
Ensure that the arguments to the methods of proxies are picklable.
[...]
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
[...]
More picklability
Ensure that all arguments to Process.__init__() are picklable. Also, if you subclass Process then make sure that instances will be picklable when the Process.start method is called.
However, I noticed two main differences between "multiprocessing pickle" and the standard pickle module, and I have trouble making sense of all of this.
multiprocessing.Queue() is not "picklable", yet it can be passed to child processes
import pickle
from multiprocessing import Queue, Process

def worker(queue):
    pass

if __name__ == "__main__":
    queue = Queue()

    # RuntimeError: Queue objects should only be shared between processes through inheritance
    pickle.dumps(queue)

    # Works fine
    process = Process(target=worker, args=(queue,))
    process.start()
    process.join()
Not picklable if defined in "main"
import pickle
from multiprocessing import Process

def worker(foo):
    pass

if __name__ == "__main__":
    class Foo:
        pass

    foo = Foo()

    # Works fine
    pickle.dumps(foo)

    # AttributeError: Can't get attribute 'Foo' on <module '__mp_main__' from 'C:\\Users\\Delgan\\test.py'>
    process = Process(target=worker, args=(foo,))
    process.start()
    process.join()
If multiprocessing does not use pickle internally, then what are the inherent differences between these two ways of serializing objects?
Also, what does "inherit" mean in the context of multiprocessing? How am I supposed to prefer it over pickle?

When a multiprocessing.Queue is passed to a child process, what is actually sent is a file descriptor (or handle) obtained from a pipe, which must have been created by the parent before creating the child. The error from pickle exists to prevent attempts to send a Queue over another Queue (or similar channel), since by then it is too late to transfer it. (Unix systems do actually support sending a pipe over certain kinds of socket, but multiprocessing doesn't use such features.) The multiprocessing types that can be sent to child processes this way would otherwise be useless, so it's presumably considered obvious enough that the documentation never mentions the apparent contradiction.
Since the “spawn” start method can't create the new process with any Python objects already in place, it has to re-import the main script to obtain the relevant function/class definitions. It doesn't set __name__ to "__main__" the way the original run does (otherwise the child would re-run the whole program and spawn children of its own), so anything guarded by that check is never defined in the child. (Here, it is the unpickling in the child that failed, which is why your manual pickle.dumps call works.)
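A small sketch to observe this: under "spawn", the main script is re-imported in the child under the module name __mp_main__, which is exactly the name that shows up in the AttributeError above:

import multiprocessing

def report():
    # In the spawned child, the main script has been re-imported as
    # "__mp_main__", so this prints "__mp_main__" rather than "__main__".
    print("module name in child:", __name__)

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    p = multiprocessing.Process(target=report)
    p.start()
    p.join()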
The fork methods start the children with the parent’s objects (at the time of the fork only) still existing; this is what is meant by inheritance.
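For a concrete picture of "inheritance", here is a minimal sketch (Unix-only, since it forces the "fork" start method): the lock is created before the child exists, so the child simply finds it in its copied memory rather than receiving it as a pickled argument:

import multiprocessing

shared_lock = None  # filled in by the parent before forking

def worker():
    # Not passed as an argument: the child inherited the parent's globals.
    with shared_lock:
        print("child is holding the inherited lock")

if __name__ == "__main__":
    multiprocessing.set_start_method("fork")  # Unix only
    shared_lock = multiprocessing.Lock()
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()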


How are locks differentiated in multiprocessing?

Let's say that you have two lists created with manager.list(), and two locks created with manager.Lock(). How do you assign each lock to each list?
I was doing like
lock1 = manager.Lock()
lock2 = manager.Lock()
list1 = manager.list()
list2 = manager.list()
and when I wanted to write/read from the list
lock1.acquire()
list1.pop(0)
lock1.release()
lock2.acquire()
list2.pop(0)
lock2.release()
Today I realized that there's nothing that associates lock1 to list1.
Am I misunderstanding these functions?
TL;DR yes, and this might be an XY problem!
If you create a multiprocessing.Manager() and use its methods to create container primitives (.list and .dict), they are already synchronized, and you don't need to deal with the synchronization primitives yourself:
from multiprocessing import Manager, Process, freeze_support

def my_function(d, lst):
    lst.append([x**2 for x in d.values()])

def main():
    with Manager() as manager:  # context-managed SyncManager
        normal_dict = {'a': 1, 'b': 2}
        managed_synchronized_dict = manager.dict(normal_dict)
        managed_synchronized_list = manager.list()  # used to store results
        p = Process(
            target=my_function,
            args=(managed_synchronized_dict, managed_synchronized_list),
        )
        p.start()
        p.join()
        print(managed_synchronized_list)

if __name__ == '__main__':
    freeze_support()
    main()
% python3 ./test_so_66603485.py
[[1, 4]]
multiprocessing.Array is also synchronized.
BEWARE: proxy objects are not directly comparable to their Python collection equivalents
Note: The proxy types in multiprocessing do nothing to support comparisons by value. So, for instance, we have:
>>> manager.list([1,2,3]) == [1,2,3]
False
One should just use a copy of the referent instead when making comparisons.
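For example, converting the proxy to a plain list before comparing (a small sketch) gives the expected result:

from multiprocessing import Manager

if __name__ == "__main__":
    with Manager() as manager:
        proxy = manager.list([1, 2, 3])
        print(proxy == [1, 2, 3])        # False: proxies don't compare by value
        print(list(proxy) == [1, 2, 3])  # True: compare a copy of the referent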
Some confusion might come from the section of the multiprocessing docs on Synchronization Primitives, which implies that one should use a Manager to create synchronization primitives, when really the Manager can already do the synchronization for you
Synchronization primitives
Generally synchronization primitives are not as necessary in a multiprocess program as they are in a multithreaded program. See the documentation for threading module.
Note that one can also create synchronization primitives by using a manager object – see Managers.
If you simply use multiprocessing.Manager(), then per the docs, it
Returns a started SyncManager object which can be used for sharing objects between processes. The returned manager object corresponds to a spawned child process and has methods which will create shared objects and return corresponding proxies.
From the SyncManager section
Its methods create and return Proxy Objects for a number of commonly used data types to be synchronized across processes. This notably includes shared lists and dictionaries.
This means that you probably have most of what you want already
manager object with methods for building managed types
synchronization via Proxy Objects
Finally, to sum up the thread from comments specifically about instances of Lock objects:
there's no inherent way to tell that a given lock is meant for anything in particular, other than meta-information such as its name, comments, or documentation; conversely, a lock is free to be used for whatever synchronization needs you may have
some useful class or container can be made to manage both the lock and whatever it should be synchronizing -- a normal multiprocessing.Manager (SyncManager)'s .list and .dict do this, and a variety of other useful constructs exist, such as Pipe and Queue (a minimal sketch of such a wrapper follows the code below)
one lock can be used to synchronize any number of actions, but using more, finer-grained locks can be a worthwhile trade-off, because a single coarse lock may needlessly block access to unrelated resources
a variety of synchronization primitives also exist for different purposes
value = my_queue.get()  # already synchronized

if not my_lock1.acquire(timeout=5):  # False if it cannot acquire
    raise CustomException("failed to acquire my_lock1 after 5 seconds")
try:
    with my_lock2:  # blocks until acquired, releases on exit
        some_shared_mutable = some_container_of_mutables[-1]
        some_shared_mutable = some_object_with_mutables.get()
        if foo(value, some_shared_mutable):
            action1(value, some_shared_mutable)
        action2(value, some_other_shared_mutable)
finally:
    my_lock1.release()
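As promised above, here is a minimal sketch of a wrapper that bundles a managed list with its own lock (the class name and methods are made up for illustration); each proxy call is already synchronized on its own, so the lock is there to make multi-step operations atomic as a unit:

from multiprocessing import Manager

class LockedList:
    """Illustrative wrapper: one managed list paired with its own lock."""
    def __init__(self, manager):
        self._lock = manager.Lock()   # proxy to a lock living in the manager process
        self._items = manager.list()  # proxy to a list living in the manager process

    def append(self, item):
        with self._lock:
            self._items.append(item)

    def pop_first_if(self, predicate):
        # Check-then-pop is two proxy calls, so it needs the lock to be atomic.
        with self._lock:
            if self._items and predicate(self._items[0]):
                return self._items.pop(0)
            return None

if __name__ == "__main__":
    with Manager() as manager:
        records = LockedList(manager)
        records.append(3)
        print(records.pop_first_if(lambda x: x > 0))  # 3

Because both members are manager proxies, an instance of this wrapper should be passable to Process as an argument like any other proxy.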

Multiprocess message queue not receiving messages

I have a script that creates a class and tries to launch an object of that class in a separate process:
import time
import Queue
from multiprocessing import Process

class Task():
    def __init__(self, messageQueue):
        self.messageQueue = messageQueue

    def run(self):
        startTime = time.time()
        while time.time() - startTime < 60:
            try:
                message = self.messageQueue.get_nowait()
                print message
                self.messageQueue.task_done()
            except Queue.Empty:
                print "No messages"
            time.sleep(1)

def test(messageQueue):
    task = Task(messageQueue)
    task.run()

if __name__ == '__main__':
    messageQueue = Queue.Queue()
    p = Process(target=test, args=(messageQueue,))
    p.start()
    time.sleep(5)
    messageQueue.put("hello")
Instead of seeing the message "hello" printed out after 5 seconds, I just get a continuous stream of "No messages". What am I doing wrong?
The problem is that you're using Queue.Queue, which only handles multiple threads within the same process, not multiple processes.
The multiprocessing module comes with its own replacement, multiprocessing.Queue, which provides the same functionality, but works with both threads and processes.
See Pipes and Queues in the multiprocessing doc for more details—but you probably don't need any more details; the multiprocessing.Queue is meant to be as close to a multi-process clone of Queue.Queue as possible.
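As a rough sketch of how the script might look once switched over (Python 3 here; names kept from the question, and task_done() dropped since a plain multiprocessing.Queue has no such method), this version prints "hello" after about five seconds:

import time
from multiprocessing import Process, Queue
from queue import Empty  # multiprocessing.Queue raises queue.Empty on get_nowait()

class Task:
    def __init__(self, message_queue):
        self.message_queue = message_queue

    def run(self):
        start = time.time()
        while time.time() - start < 60:
            try:
                print(self.message_queue.get_nowait())
            except Empty:
                print("No messages")
            time.sleep(1)

def test(message_queue):
    Task(message_queue).run()

if __name__ == "__main__":
    message_queue = Queue()  # multiprocessing.Queue, not Queue.Queue
    p = Process(target=test, args=(message_queue,))
    p.start()
    time.sleep(5)
    message_queue.put("hello")
    p.join()  # the worker loop exits after about 60 seconds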
If you want to understand the under-the-covers difference:
A Queue.Queue is a deque with condition variables wrapped around it. It relies on the fact that code running in the same interpreter can access the same objects to share the deque, and uses the condition variables to protect the deque from races as well as for signaling.
A multiprocessing.Queue is a more complicated thing that pickles objects and passes them over a pipe between the processes. Races aren't a problem, but signaling still is, so it also has the equivalent of condition variables, but obviously not the ones from threading.

Can I safely use global Queues when using multiprocessing in python?

I have a large codebase to parallelise. I can avoid rewriting the method signatures of hundreds of functions by using a single global queue. I know it's messy; please don't tell me that using globals means I'm doing something wrong, because in this case it really is the easiest choice. The code below works, but I don't understand why. I declare a global multiprocessing.Queue(), but I never declare that it should be shared between processes (by passing it as a parameter to the worker). Does Python automatically place this queue in shared memory? Is it safe to do this on a larger scale?
Note: You can tell that the queue is shared between the processes: the worker processes start doing work on an empty queue and are idle for one second before the main process pushes some work onto it.
import multiprocessing
import time

outqueue = None

class WorkerProcess(multiprocessing.Process):
    def __init__(self):
        multiprocessing.Process.__init__(self)
        self.exit = multiprocessing.Event()

    def doWork(self):
        global outqueue
        ob = outqueue.get()
        ob = ob + "!"
        print ob
        time.sleep(1)  # simulate more hard work
        outqueue.put(ob)

    def run(self):
        while not self.exit.is_set():
            self.doWork()

    def shutdown(self):
        self.exit.set()

if __name__ == '__main__':
    global outqueue
    outqueue = multiprocessing.Queue()
    procs = []
    for x in range(10):
        procs.append(WorkerProcess())
        procs[x].start()

    time.sleep(1)

    for x in range(20):
        outqueue.put(str(x))

    time.sleep(10)

    for p in procs:
        p.shutdown()

    for p in procs:
        p.join()

    try:
        while True:
            x = outqueue.get(False)
            print x
    except:
        print "done"
Assuming you're using Linux, the answer is in the way the OS creates a new process.
When a process spawns a new one on Linux, it actually forks itself. The result is a child process with all the properties of the parent: basically a clone.
In your example you are instantiating the Queue first and then creating the new processes. Therefore the child processes get a copy of the same queue and are able to use it.
To see things break, try creating the processes first and the Queue afterwards. You'll see that the children still have the global variable set to None, while the parent has a Queue.
It is safe, yet not recommended, to share a Queue as a global variable on Linux. On Windows, due to the different process creation approach, sharing a queue through a global variable won't work.
As mentioned in the programming guidelines
Explicitly pass resources to child processes
On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource. However, it is better to pass the object as an argument to the constructor for the child process.
Apart from making the code (potentially) compatible with Windows and the other start methods this also ensures that as long as the child process is still alive the object will not be garbage collected in the parent process. This might be important if some resource is freed when the object is garbage collected in the parent process.
For more info about Linux forking you can read its man page.
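A minimal sketch of the recommended shape (attribute and variable names are illustrative): the queue is handed to each worker explicitly, so the code no longer depends on fork-only inheritance of a global and also works under the spawn start method:

import multiprocessing
import time
from queue import Empty

class WorkerProcess(multiprocessing.Process):
    def __init__(self, outqueue):
        super().__init__()
        self.outqueue = outqueue  # passed in, not a module-level global
        self.exit = multiprocessing.Event()

    def run(self):
        while not self.exit.is_set():
            try:
                ob = self.outqueue.get(timeout=1)
            except Empty:
                continue
            print(ob + "!")

    def shutdown(self):
        self.exit.set()

if __name__ == "__main__":
    outqueue = multiprocessing.Queue()
    procs = [WorkerProcess(outqueue) for _ in range(4)]
    for p in procs:
        p.start()
    for x in range(8):
        outqueue.put(str(x))
    time.sleep(2)  # give the workers time to drain the queue
    for p in procs:
        p.shutdown()
    for p in procs:
        p.join()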

What is being pickled when I call multiprocessing.Process?

I know that multiprocessing uses pickling in order to have the processes run on different CPUs, but I think I am a little confused as to what is being pickled. Let's look at this code.
from multiprocessing import Process

def f(I):
    print('hello world!', I)

if __name__ == '__main__':
    for I in range(1, 3):
        Process(target=f, args=(I,)).start()
I assume what is being pickled is the def f(I) and the argument going in. First, is this assumption correct?
Second, let's say f(I) has a function call within it like:
def f(I):
    print('hello world!', I)
    randomfunction()
Does the randomfunction's definition get pickled as well, or is it only the function call?
Furthermore, if that function call were located in another file, would the process be able to call it?
In this particular example, what gets pickled is platform dependent. On systems that support os.fork, like Linux, nothing is pickled here. Both the target function and the args you're passing get inherited by the child process via fork.
On platforms that don't support fork, like Windows, the f function and args tuple will both be pickled and sent to the child process. The child process will re-import your __main__ module, and then unpickle the function and its arguments.
In either case, randomfunction is not actually pickled. When you pickle f, all you're really pickling is a reference the child process can use to re-build the f function object. This is usually little more than a string that tells the child how to re-import f:
>>> def f(I):
... print('hello world!',I)
... randomfunction()
...
>>> pickle.dumps(f)
'c__main__\nf\np0\n.'
The child process will just re-import f, and then call it. randomfunction will be accessible as long as it was properly imported into the original script to begin with.
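As a sketch of that last point (the module name helpers.py is hypothetical): as long as the main script imports the helper normally, the child process, which re-imports the same modules, finds it too:

# helpers.py (hypothetical)
def randomfunction():
    print("randomfunction called")

# main.py
from multiprocessing import Process
from helpers import randomfunction

def f(i):
    print('hello world!', i)
    randomfunction()  # resolved via the child's own import of helpers

if __name__ == '__main__':
    for i in range(1, 3):
        Process(target=f, args=(i,)).start()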
Note that in Python 3.4+, you can get the Windows-style behavior on Linux by using contexts:
ctx = multiprocessing.get_context('spawn')
ctx.Process(target=f,args=(I,)).start() # even on Linux, this will use pickle
The descriptions of the contexts are also probably relevant here, since they apply to Python 2.x as well:
spawn
The parent process starts a fresh python interpreter process.
The child process will only inherit those resources necessary to run
the process object's run() method. In particular, unnecessary file
descriptors and handles from the parent process will not be inherited.
Starting a process using this method is rather slow compared to using
fork or forkserver.
Available on Unix and Windows. The default on Windows.
fork
The parent process uses os.fork() to fork the Python interpreter.
The child process, when it begins, is effectively identical to the
parent process. All resources of the parent are inherited by the child
process. Note that safely forking a multithreaded process is
problematic.
Available on Unix only. The default on Unix.
forkserver
When the program starts and selects the forkserver start
method, a server process is started. From then on, whenever a new
process is needed, the parent process connects to the server and
requests that it fork a new process. The fork server process is single
threaded so it is safe for it to use os.fork(). No unnecessary
resources are inherited.
Available on Unix platforms which support passing file descriptors
over Unix pipes.
Note that forkserver is only available in Python 3.4, there's no way to get that behavior on 2.x, regardless of the platform you're on.
The function is pickled, but possibly not in the way you think of it:
You can look at what's actually in a pickle like this:
import pickle, pickletools
pickletools.dis(pickle.dumps(f))
I get:
    0: c    GLOBAL     '__main__ f'
   12: p    PUT        0
   15: .    STOP
You'll note that there is nothing in there corresponding to the code of the function. Instead, it has a reference to __main__ f, which is the module and name of the function. So when this is unpickled, it will always attempt to look up the f function in the __main__ module and use that. When you use the multiprocessing module, that ends up being a copy of the same function as in your original program.
This does mean that if you somehow change which function is bound to __main__.f, you'll end up unpickling a different function than the one you pickled.
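A quick sketch of that caveat:

import pickle

def f():
    return "original"

blob = pickle.dumps(f)       # stores only the reference __main__ f

def f():                     # rebind the name f in __main__
    return "replacement"

print(pickle.loads(blob)())  # prints "replacement": the lookup happens at load time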
Multiprocessing brings up a complete copy of your program, complete with all the functions you defined in it, so you can just call functions. The entire function isn't copied over, just its name. The pickle module's assumption is that the function will be the same in both copies of your program, so it can simply look up the function by name.
Only the function arguments (I,) and a reference to the function f (its module and name) are pickled. The actual definition of the function f has to be available when the child loads the module.
The easiest way to see this is through the code:
from multiprocessing import Process

if __name__ == '__main__':
    def f(I):
        print('hello world!', I)

    for I in [1, 2, 3]:
        Process(target=f, args=(I,)).start()
That returns:
AttributeError: 'module' object has no attribute 'f'

Understanding os.fork and Queue.Queue

I wanted to implement a simple python program using parallel execution. It's I/O bound, so I figured threads would be appropriate (as opposed to processes). After reading the documentation for Queue and fork, I thought something like the following might work.
import os
import Queue

q = Queue.Queue()
if os.fork():  # child
    while True:
        print q.get()
else:  # parent
    [q.put(x) for x in range(10)]
However, the get() call never returns. I thought it would return once the other thread executes a put() call. Using the threading module, things behave more like I expected:
import threading
import Queue

q = Queue.Queue()

def consume(q):
    while True:
        print q.get()

worker = threading.Thread(target=consume, args=(q,))
worker.start()

[q.put(x) for x in range(10)]
I just don't understand why the fork approach doesn't do the same thing. What am I missing?
The POSIX fork system call creates a new process, rather than a new thread inside the same address space:
The fork() function shall create a new process. The new process (child
process) shall be an exact copy of the calling process (parent
process) except as detailed below: [...]
So the Queue is duplicated in your first example, rather than shared between the parent and child.
You can use multiprocessing.Queue instead or just use threads like in your second example :)
By the way, using list comprehensions just for side effects isn't good practice for several reasons. You should use a for loop instead:
for x in range(10): q.put(x)
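For completeness, the same experiment does work with a queue designed for processes (a sketch, Unix-only because of the raw os.fork()):

import os
from multiprocessing import Queue

q = Queue()       # backed by a pipe, usable across fork
pid = os.fork()
if pid == 0:      # child: consume
    for _ in range(10):
        print(q.get())
    os._exit(0)
else:             # parent: produce, then wait for the child
    for x in range(10):
        q.put(x)
    os.waitpid(pid, 0)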
To share data between unrelated processes, you can use named pipes, via os.mkfifo() and os.open() (http://docs.python.org/2/library/os.html#os.open). You create a FIFO at an agreed path, e.g. named_pipe = 'my_pipe', and a different Python program opens it with os.open(named_pipe, mode), where mode is os.O_WRONLY, os.O_RDONLY, and so on. Don't forget to close the pipe and catch exceptions.
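A small sketch of that idea (the path and the two script names are hypothetical): one script creates the FIFO and writes to it, another opens the same path read-only:

# writer.py (hypothetical)
import os

path = "my_pipe"
if not os.path.exists(path):
    os.mkfifo(path)              # create the named pipe once
fd = os.open(path, os.O_WRONLY)  # blocks until a reader opens the FIFO
os.write(fd, b"hello from another process\n")
os.close(fd)

# reader.py (hypothetical)
import os

fd = os.open("my_pipe", os.O_RDONLY)  # blocks until a writer opens the FIFO
print(os.read(fd, 1024))
os.close(fd)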
Fork creates a new process. The child and parent processes do not share the same Queue: that's why the elements put by the parent process cannot be retrieved by the child.
