I know that multiprocessing uses pickling in order to have the processes run on different CPUs, but I think I am a little confused as to what is being pickled. Let's look at this code.
from multiprocessing import Process

def f(I):
    print('hello world!', I)

if __name__ == '__main__':
    for I in range(1, 3):
        Process(target=f, args=(I,)).start()
I assume what is being pickled is the def f(I) and the argument going in. First, is this assumption correct?
Second, let's say f(I) has a function call within it, like:
def f(I):
    print('hello world!', I)
    randomfunction()
Does the randomfunction's definition get pickled as well, or is it only the function call?
Furthermore, if that function were located in another file, would the process be able to call it?
In this particular example, what gets pickled is platform dependent. On systems that support os.fork, like Linux, nothing is pickled here. Both the target function and the args you're passing get inherited by the child process via fork.
On platforms that don't support fork, like Windows, the f function and args tuple will both be pickled and sent to the child process. The child process will re-import your __main__ module, and then unpickle the function and its arguments.
In either case, randomfunction is not actually pickled. When you pickle f, all you're really pickling is a pointer that lets the child process rebuild the f function object. This is usually little more than a string that tells the child how to re-import f:
>>> import pickle
>>> def f(I):
...     print('hello world!',I)
...     randomfunction()
...
>>> pickle.dumps(f)
'c__main__\nf\np0\n.'
The child process will just re-import f, and then call it. randomfunction will be accessible as long as it was properly imported into the original script to begin with.
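To make the "pickled by reference" point concrete, here is a small sketch. It uses textwrap.dedent from the standard library as a stand-in for f (an arbitrary choice, not part of the original question), so that it behaves the same however the snippet is run:

```python
import pickle
import textwrap

# Pickle a module-level function. textwrap.dedent is only a stand-in for f.
payload = pickle.dumps(textwrap.dedent)

# The pickle names the module and the function; it carries no bytecode.
names_only = b"textwrap" in payload and b"dedent" in payload

# Unpickling looks the name up again and returns the very same object.
restored = pickle.loads(payload)
same_object = restored is textwrap.dedent

print(len(payload), names_only, same_object)
```

The payload is only a few dozen bytes long regardless of how large the function body is, which is why anything the function calls never needs to be pickled.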
Note that in Python 3.4+, you can get the Windows-style behavior on Linux by using contexts:
ctx = multiprocessing.get_context('spawn')
ctx.Process(target=f,args=(I,)).start() # even on Linux, this will use pickle
The descriptions of the contexts are also probably relevant here, since they apply to Python 2.x as well:
spawn
The parent process starts a fresh python interpreter process.
The child process will only inherit those resources necessary to run
the process object's run() method. In particular, unnecessary file
descriptors and handles from the parent process will not be inherited.
Starting a process using this method is rather slow compared to using
fork or forkserver.
Available on Unix and Windows. The default on Windows.
fork
The parent process uses os.fork() to fork the Python interpreter.
The child process, when it begins, is effectively identical to the
parent process. All resources of the parent are inherited by the child
process. Note that safely forking a multithreaded process is
problematic.
Available on Unix only. The default on Unix.
forkserver
When the program starts and selects the forkserver start
method, a server process is started. From then on, whenever a new
process is needed, the parent process connects to the server and
requests that it fork a new process. The fork server process is single
threaded so it is safe for it to use os.fork(). No unnecessary
resources are inherited.
Available on Unix platforms which support passing file descriptors
over Unix pipes.
Note that forkserver is only available in Python 3.4+; there's no way to get that behavior on 2.x, regardless of the platform you're on.
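As a quick way to see which of these start methods your own interpreter offers, something like the following should work on Python 3.4+. It is only a sketch; the list it prints depends on the platform:

```python
import multiprocessing

# Start methods this interpreter supports on this platform.
available = multiprocessing.get_all_start_methods()
print(available)

# A context pins one start method without changing the global default.
ctx = multiprocessing.get_context("spawn")
print(ctx.get_start_method())
```

Using a context object rather than set_start_method() keeps the choice local, which is handy when a library and the main script want different start methods.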
The function is pickled, but possibly not in the way you think of it:
You can look at what's actually in a pickle like this:
pickletools.dis(pickle.dumps(f))
I get:
    0: c    GLOBAL     '__main__ f'
   12: p    PUT        0
   15: .    STOP
You'll note that there is nothing in there corresponding to the code of the function. Instead, it has a reference to __main__ f, which is the module and name of the function. So when this is unpickled, it will always attempt to look up the f function in the __main__ module and use that. When you use the multiprocessing module, that ends up being a copy of the same function as it was in your original program.
This does mean that if you somehow modify which function is located at __main__.f, you'll end up unpickling a different function than the one you pickled.
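That last point can be reproduced without any child process. The sketch below builds a throwaway module (demo_mod and greet are hypothetical names invented for this illustration), pickles a function from it, rebinds the name, and then unpickles:

```python
import pickle
import sys
import types

# Build a throwaway module so the function has an importable home.
mod = types.ModuleType("demo_mod")
sys.modules["demo_mod"] = mod
exec("def greet():\n    return 'original'\n", mod.__dict__)

original = mod.greet
payload = pickle.dumps(original)   # stores only "demo_mod" and "greet"

# Rebind the same name to a different function before unpickling.
exec("def greet():\n    return 'replacement'\n", mod.__dict__)

restored = pickle.loads(payload)   # looks up demo_mod.greet again
print(original(), restored())
```

The unpickled object is whatever currently lives under that name, not the function that was originally pickled.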
Multiprocessing brings up a complete copy of your program, complete with all the functions you defined in it. So you can just call functions. The entire function isn't copied over, just the name of the function. The pickle module's assumption is that the function will be the same in both copies of your program, so it can just look up the function by name.
Only the function arguments (I,) and the return value of the function f are pickled. The actual definition of the function f has to be available when loading the module.
The easiest way to see this is through the code:
from multiprocessing import Process

if __name__ == '__main__':
    def f(I):
        print('hello world!',I)

    for I in [1,2,3]:
        Process(target=f,args=(I,)).start()
Under the spawn start method (the default on Windows), that fails with:
AttributeError: 'module' object has no attribute 'f'
Related
I'm struggling to find answers on what objects and variables are copied to child processes when creating a multiprocessing pool in Python 3.
In other words, say I have a huge list (~230000000 elements) stored in a class that implements a function that uses a pool of four child processes. Will this list then be copied across to all four child processes if...
the child processes do not read from the list?
the child processes read from the list (however, the list is not modified)?
To concretely answer the original question specifically regarding the usage of "spawn" (as OP mentioned they are familiar with "fork"):
When a process object is created, it is constructed in main, and then a new python process is executed with command line args to share a pair of file handles for communication as well as a stub of code to start from.
That "bootstrap" code will try to import the main file, which is both why you need to protect against unintended side-effects on import (if __name__ == "__main__":), and why anything outside of that protection is "available" to the child. This primarily is meant to make sure functions from the main file are defined, but any variables defined at the module level are also defined. This is useful for constants as long as it doesn't matter that you're effectively re-computing the values, and making one copy for each process. For large datasets this is very inefficient.
The bootstrap code will also read one of the file handles, and attempt to unpickle the process object that the parent sent to it. The target of the process is generally a function you have defined, but care must be taken that it's accessible in the "main" namespace on import (no lambdas, no instance methods, etc.). Python does not serialize code objects with pickle; rather, it relays how to properly import the function, which gets dicey with objects that don't have a concrete namespace on import (sidebar: the 3rd party multiprocess library attempts to solve this by using dill instead of pickle, with generally good success). This also comes into play when you subclass the Process class and attach other data to a process instance; it all must be picklable.
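The restriction on what can serve as a target can be checked with plain pickle, without starting a process. This is a minimal sketch: can_pickle and make_nested are helper names invented here, and textwrap.dedent stands in for an ordinary module-level function:

```python
import pickle
import textwrap

def make_nested():
    def nested():  # defined inside another function: no importable name
        return "nested"
    return nested

def can_pickle(obj):
    """Return True if pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

results = {
    "module-level function": can_pickle(textwrap.dedent),
    "lambda": can_pickle(lambda x: x),
    "nested function": can_pickle(make_nested()),
}
print(results)
```

Only the callable that can be found again by module and name survives, which matches the "relays how to import the function" description above.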
Once the process object has been successfully un-pickled by the child process, the run method is called. This is generally the entrypoint of your code. With a Pool, there's a big class that lives on the main process, and launches "worker" processes with a pre-defined function that takes in "jobs" and returns the results until told to exit. Data (task items consisting of a function to execute and args for that func) is sent to and from the workers via Queues which work pretty much the same as sending the original Process object: the thing you put into the queue is pickled, sent via a file handle, and un-pickled in the child.
Note: this answer is partial in the sense that I too couldn't (yet) find written evidence and documentation about this, but the following gives some kind of empirical data, if you will.
The following code is used to demonstrate how data is being passed/copied to child processes using a Pool (the actual list l is not used on purpose in the map to allow clean printings):
from multiprocessing import Pool
import os

def process(x):
    print(os.getpid(), __name__, 'l' in globals())

# A - l = list(range(100000))

if __name__ == "__main__":
    # B - l = list(range(100000))
    with Pool() as pool:
        pool.map(process, [1,2,3,4])
    print(os.getpid(), __name__, 'l' in globals())
On Windows
When uncommenting comment A, a printout similar to:
19604 __mp_main__ True
6392 __mp_main__ True
19604 __mp_main__ True
7048 __mp_main__ True
6568 __main__ True
will be given. This is because the list is defined outside the __name__ guard, and as the processes in Windows basically import the py file, they all define their own version of l.
When uncommenting comment B, a printout similar to:
7248 __mp_main__ False
22644 __mp_main__ False
22676 __mp_main__ False
16520 __mp_main__ False
19736 __main__ True
will be given. That is, because the list is defined inside the __name__ guard, only the __main__ process has it defined, and it passes the arguments through map to the different processes.
On Linux
Uncommenting any of the comments will give a printout similar to:
25261 __main__ True
25262 __main__ True
25263 __main__ True
25264 __main__ True
25260 __main__ True
I am guessing that this is because Linux uses fork to create the child processes, so the processes are "cloned" and the list is defined either way.
import multiprocessing
import numpy as np
import multiprocessing as mp
import ctypes

class Test():
    def __init__(self):
        shared_array_base = multiprocessing.Array(ctypes.c_double, 100, lock=False)
        self.a = shared_array = np.ctypeslib.as_array(shared_array_base)

    def my_fun(self,i):
        self.a[i] = 1

if __name__ == "__main__":
    num_cores = multiprocessing.cpu_count()
    t = Test()

    def my_fun_wrapper(i):
        t.my_fun(i)

    with mp.Pool(num_cores) as p:
        p.map(my_fun_wrapper, np.arange(100))
    print(t.a)
In the code above, I'm trying to write code that modifies an array using multiprocessing. The function my_fun(), executed in each process, should modify the value of the array a[:] at index i, which is passed to my_fun() as a parameter. With regards to the code above, I would like to know what is being copied.
1) Is anything in the code being copied by each process? I think the object might be but ideally nothing is.
2) Is there a way to get around using a wrapper function my_fun() for the object?
Almost everything in your code is getting copied, except the shared memory you allocated with multiprocessing.Array. multiprocessing is full of unintuitive, implicit copies.
When you spawn a new process in multiprocessing, the new process needs its own version of just about everything in the original process. This is handled differently depending on platform and settings, but we can tell you're using "fork" mode, because your code wouldn't work in "spawn" or "forkserver" mode - you'd get an error about the workers not being able to find my_fun_wrapper. (Windows only supports "spawn", so we can tell you're not on Windows.)
In "fork" mode, this initial copy is made by using the fork system call to ask the OS to essentially copy the whole entire process and everything inside. The memory allocated by multiprocessing.Array is sort of "external" and isn't copied, but most other things are. (There's also copy-on-write optimization, but copy-on-write still behaves as if everything was copied, and the optimization doesn't work very well in Python due to refcount updates.)
When you dispatch tasks to worker processes, multiprocessing needs to make even more copies. Any arguments, and the callable for the task itself, are objects in the master process, and objects inherently exist in only one process. The workers can't access any of that. They need their own versions. multiprocessing handles this second round of copies by pickling the callable and arguments, sending the serialized bytes over interprocess communication, and unpickling the pickles in the worker.
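The effect of that pickle round trip can be seen in a single process. This sketch only imitates the serialization step (no worker is started), and task_args is a made-up example argument:

```python
import pickle

task_args = {"values": [1, 2, 3]}

# What the worker receives is rebuilt from bytes, so it is a separate object.
worker_copy = pickle.loads(pickle.dumps(task_args))
worker_copy["values"].append(4)

print(task_args["values"])    # [1, 2, 3]
print(worker_copy["values"])  # [1, 2, 3, 4]
```

Any change a worker makes to its arguments stays in the worker's copy unless the result is explicitly sent back or the data lives in shared memory.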
When the master pickles my_fun_wrapper, the pickle just says "look for the my_fun_wrapper function in the __main__ module", and the workers look up their version of my_fun_wrapper to unpickle it. my_fun_wrapper looks for a global t, and in the workers, that t was produced by the fork, and the fork produced a t with an array backed by the shared memory you allocated with your original multiprocessing.Array call.
On the other hand, if you try to pass t.my_fun to p.map, then multiprocessing has to pickle and unpickle a method object. The resulting pickle doesn't say "look up the t global variable and get its my_fun method". The pickle says to build a new Test instance and get its my_fun method. The pickle doesn't have any instructions in it about using the shared memory you allocated, and the resulting Test instance and its array are independent of the original array you wanted to modify.
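The same behavior can be observed with any bound method. In the sketch below, collections.Counter is used purely as a stand-in for the Test class, and its update method as a stand-in for my_fun:

```python
import pickle
from collections import Counter

original = Counter()        # stand-in for the Test instance t
bound = original.update     # stand-in for t.my_fun

restored = pickle.loads(pickle.dumps(bound))

# The restored method is bound to a new Counter rebuilt from the pickle.
restored("aab")
print(original)             # Counter()
print(restored.__self__)    # Counter({'a': 2, 'b': 1})
```

Calling the restored method changes only the rebuilt instance, which is the same reason passing t.my_fun to p.map would leave the original array untouched.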
I know of no good way to avoid needing some sort of wrapper function.
Using Python's multiprocessing on Windows will require many arguments to be "picklable" while passing them to child processes.
import multiprocessing

class Foobar:
    def __getstate__(self):
        print("I'm being pickled!")

def worker(foobar):
    print(foobar)

if __name__ == "__main__":
    # Uncomment this on Linux
    # multiprocessing.set_start_method("spawn")
    foobar = Foobar()
    process = multiprocessing.Process(target=worker, args=(foobar, ))
    process.start()
    process.join()
The documentation mentions this explicitly several times:
Picklability
Ensure that the arguments to the methods of proxies are picklable.
[...]
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
[...]
More picklability
Ensure that all arguments to Process.__init__() are picklable. Also, if you subclass Process then make sure that instances will be picklable when the Process.start method is called.
However, I noticed two main differences between "multiprocessing pickle" and the standard pickle module, and I have trouble making sense of all of this.
multiprocessing.Queue() objects are not "picklable", yet they can be passed to child processes
import pickle
from multiprocessing import Queue, Process

def worker(queue):
    pass

if __name__ == "__main__":
    queue = Queue()
    # RuntimeError: Queue objects should only be shared between processes through inheritance
    pickle.dumps(queue)
    # Works fine
    process = Process(target=worker, args=(queue, ))
    process.start()
    process.join()
Not picklable if defined in "main"
import pickle
from multiprocessing import Process

def worker(foo):
    pass

if __name__ == "__main__":
    class Foo:
        pass

    foo = Foo()
    # Works fine
    pickle.dumps(foo)
    # AttributeError: Can't get attribute 'Foo' on <module '__mp_main__' from 'C:\\Users\\Delgan\\test.py'>
    process = Process(target=worker, args=(foo, ))
    process.start()
    process.join()
If multiprocessing does not use pickle internally, then what are the inherent differences between these two ways of serializing objects?
Also, what does "inherit" mean in the context of multiprocessing? How am I supposed to prefer it over pickle?
When a multiprocessing.Queue is passed to a child process, what is actually sent is a file descriptor (or handle) obtained from pipe, which must have been created by the parent before creating the child. The error from pickle is to prevent attempts to send a Queue over another Queue (or similar channel), since it’s too late to use it then. (Unix systems do actually support sending a pipe over certain kinds of socket, but multiprocessing doesn’t use such features.) It’s expected to be “obvious” that certain multiprocessing types can be sent to child processes that would otherwise be useless, so no mention is made of the apparent contradiction.
Since the “spawn” start method can’t create the new process with any Python objects already created, it has to re-import the main script to obtain relevant function/class definitions. It doesn’t set __name__ like the original run for obvious reasons, so anything that is dependent on that setting will not be available. (Here, it is unpickling that failed, which is why your manual pickling works.)
The fork methods start the children with the parent’s objects (at the time of the fork only) still existing; this is what is meant by inheritance.
This code executes on Linux but throws an AttributeError: type object 'T' has no attribute 'val' on Windows. Why?
from multiprocessing import Process
import sys

class T():
    @classmethod
    def init(cls, val):
        cls.val = val

def f():
    print(T.val)

if __name__ == '__main__':
    T.init(5)
    f()
    p = Process(target=f, args=())
    p.start()
Windows lacks a fork() system call, which duplicates the current process. This has many implications, including those listed on the Windows multiprocessing documentation page. More specifically:
Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called.
Internally, Python creates a new process on Windows by starting a new process from scratch and telling it to load all modules again. So any change you have made in the current process will not be seen.
In your example, this means that in the child process, your module will be loaded, but the if __name__ == '__main__' section will not be run. So T.init will not be called, and T.val won't exist, thus the error you see.
On the other hand, on POSIX systems (that includes Linux), process creation uses fork, and all global state is left untouched. The child runs with a copy of everything, so it does not have to reload anything and will see its copy of T with its copy of val.
This also means that Process creation is much faster and much lighter on resources on POSIX systems, especially as the “duplication” uses copy-on-write to avoid the overhead of actually copying the data.
There are other quirks when using multiprocessing, all of which are detailed in the python multiprocessing guidelines.
I'm currently using the standard multiprocessing in python to generate a bunch of processes that will run indefinitely. I'm not particularly concerned with performance; each thread is simply watching for a different change on the filesystem, and will take the appropriate action when a file is modified.
Currently, I have a solution that works, for my needs, in Linux. I have a dictionary of functions and arguments that looks like:
job_dict['func1'] = {'target': func1, 'args': (args,)}
For each, I create a process:
import multiprocessing
for k in job_dict.keys():
    jobs[k] = multiprocessing.Process(target=job_dict[k]['target'],
                                      args=job_dict[k]['args'])
With this, I can keep track of each one that is running, and, if necessary, restart a job that crashes for any reason.
This does not work in Windows. Many of the functions I'm using are wrappers, using various functools functions, and I get messages about not being able to serialize the functions (see What can multiprocessing and dill do together?). I have not figured out why I do not get this error in Linux, but do in Windows.
If I import dill before starting my processes in Windows, I do not get the serialization error. However, the processes do not actually do anything. I cannot figure out why.
I then switched to the multiprocessing implementation in pathos, but did not find an analog to the simple Process class within the standard multiprocessing module. I was able to generate threads for each job using pathos.pools.ThreadPool. This is not the intended use for map, I'm sure, but it started all the threads, and they ran in Windows:
import pathos
tp = pathos.pools.ThreadPool()
for k in job_dict.keys():
    tp.uimap(job_dict[k]['target'], job_dict[k]['args'])
However, now I'm not sure how to monitor whether a thread is still active, which I'm looking for so that I can restart threads that crash for some reason or another. Any suggestions?
I'm the pathos and dill author. The Process class is buried deep within pathos at pathos.helpers.mp.process.Process, where mp itself is the actual fork of the multiprocessing library. Everything in multiprocessing should be accessible from there.
Another thing to know about pathos is that it keeps the pool alive for you until you remove it from the held state. This helps reduce overhead in creating "new" pools. To remove a pool, you do:
>>> # create
>>> p = pathos.pools.ProcessPool()
>>> # remove
>>> p.clear()
There's no such mechanism for a Process however.
For multiprocessing, Windows is different from Linux and Macintosh because Windows doesn't have a proper fork like Linux does. With fork, Linux can share objects across processes, while on Windows there is no sharing: a fully independent new process is created, and therefore the serialization has to be better for the object to pass across to the other process, just as if you were sending the object to another computer. On Linux, you'd have to do this to get the same behavior:
import dill

def check(obj, *args, **kwds):
    """check pickling of an object across another process"""
    import subprocess
    fail = True
    try:
        _x = dill.dumps(obj, *args, **kwds)
        fail = False
    finally:
        if fail:
            print("DUMP FAILED")
    # load the pickle in a separate interpreter, as a spawned child would
    msg = "python -c import dill; print(dill.loads(%s))" % repr(_x)
    print("SUCCESS" if not subprocess.call(msg.split(None,2)) else "LOAD FAILED")
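For anyone who wants the same kind of check without installing dill, a rough standard-library variant might look like the following. It is only a sketch of the idea, not how dill or pathos do it; survives_fresh_interpreter is a name made up for this example, and the pickle is passed over stdin instead of being embedded in the command line:

```python
import functools
import pickle
import subprocess
import sys

def survives_fresh_interpreter(obj):
    """True if obj pickles here and unpickles in a brand-new interpreter."""
    try:
        payload = pickle.dumps(obj)
    except Exception:
        return False  # could not even be serialized in this process
    # The child reads the pickle from stdin, much as a spawned worker
    # receives its task from the parent.
    loader = "import pickle, sys; pickle.loads(sys.stdin.buffer.read())"
    result = subprocess.run(
        [sys.executable, "-c", loader],
        input=payload,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

ok = survives_fresh_interpreter(functools.partial(print, "hi"))
bad = survives_fresh_interpreter(functools.partial(lambda x: x, 1))
print(ok, bad)
```

Note that this check is stricter than multiprocessing for functions defined in the main script, because the fresh interpreter here does not re-import that script the way a spawned child does.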