Python multiprocessing linux windows difference

This code runs on Linux but raises "AttributeError: type object 'T' has no attribute 'val'" on Windows. Why?
from multiprocessing import Process
import sys
class T():
    @classmethod
    def init(cls, val):
        cls.val = val
def f():
    print(T.val)
if __name__ == '__main__':
    T.init(5)
    f()
    p = Process(target=f, args=())
    p.start()

Windows lacks the fork() system call, which duplicates the current process. This has many implications, including those listed on the Windows section of the multiprocessing documentation. More specifically:
Bear in mind that if code run in a child process tries to access a global variable, then the value it sees (if any) may not be the same as the value in the parent process at the time that Process.start was called.
Internally, Python creates a new process on Windows by starting a fresh interpreter and telling it to load all modules again. So any change you have made in the current process will not be seen by the child.
In your example, this means that in the child process, your module will be loaded, but the if __name__ == '__main__' section will not be run. So T.init will not be called, and T.val won't exist, thus the error you see.
On the other hand, on POSIX systems (that includes Linux), process creation with the fork start method uses fork, and all global state is left untouched. The child runs with a copy of everything, so it does not have to reload anything and will see its copy of T with its copy of val.
This also means that Process creation is much faster and much lighter on resources on POSIX systems, especially as the “duplication” uses copy-on-write to avoid the overhead of actually copying the data.
There are other quirks when using multiprocessing, all of which are detailed in the python multiprocessing guidelines.
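A minimal sketch of one way to make the example behave the same on both platforms, assuming it is acceptable to pass the value to the child explicitly instead of relying on inherited class state. The helper init_and_read and the Queue used to return the value are illustrative additions, not part of the original code.

```python
from multiprocessing import Process, Queue


class T:
    val = None

    @classmethod
    def init(cls, val):
        cls.val = val


def init_and_read(val):
    # Initialise the class in the current process, then read the value back.
    T.init(val)
    return T.val


def f(val, out):
    # Runs in the child: it sets up its own T.val instead of expecting
    # the state that the parent created under its main guard.
    out.put(init_and_read(val))


if __name__ == "__main__":
    T.init(5)
    out = Queue()
    p = Process(target=f, args=(T.val, out))
    p.start()
    print(out.get())  # value reported by the child process
    p.join()
```

Because the value travels through args, the child no longer depends on whether it was created by fork or by a fresh interpreter.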

Related

Statements before multiprocessing main() executed multiple times (Python)

I am learning Python and its multiprocessing.
I created a project with a main() in main.py and an a_simulation function inside the module simulation.py under the package simulator/.
The symptom is that a test statement print("hello\n") inside main.py, placed before the definition of main(), is executed multiple times when the program is run with python main.py. This indicates that everything before the print, including the creation of the lists, is also executed multiple times.
I do not think I understand the related issues of Python very well. May I know what the reason for the symptom is, and what the best practice in Python is when creating projects like this? I have included the code and the terminal output. Thank you!
Edit: Forgot to mention that I am running it with Anaconda Python on macOS, although I would like my project to work on any platform.
main.py:
from multiprocessing import Pool
from simulator.simulation import a_simulation
import random
num_trials = 10
iter_trials = list(range(num_trials))
arg_list = [random.random() for _ in range(num_trials)]
input = list(zip(iter_trials, arg_list))
print("hello\n")
def main():
    with Pool(processes=4) as pool:
        result = pool.starmap(a_simulation, input)
    print(result)
if __name__ == "__main__":
    main()
simulator/simulation.py:
import os
from time import sleep
def a_simulation(x, seed_):
    print(f"Process {os.getpid()}: trial {x} received {seed_}\n" )
    sleep(1)
    return seed_
Results from the terminal:
hello
hello
hello
hello
hello
Process 71539: trial 0 received 0.4512600158461971
Process 71538: trial 1 received 0.8772526554425158
Process 71541: trial 2 received 0.6893833978242683
Process 71540: trial 3 received 0.29249994820563296
Process 71538: trial 4 received 0.5759647958461107
Process 71541: trial 5 received 0.08799525261308505
Process 71539: trial 6 received 0.3057644321667139
Process 71540: trial 7 received 0.5402091856171599
Process 71538: trial 8 received 0.1373456223147438
Process 71541: trial 9 received 0.24000943476017
[0.4512600158461971, 0.8772526554425158, 0.6893833978242683, 0.29249994820563296, 0.5759647958461107, 0.08799525261308505, 0.3057644321667139, 0.5402091856171599, 0.1373456223147438, 0.24000943476017]

This happens because multiprocessing uses the spawn start method by default on Windows and macOS. With spawn, a child process is initially created without sharing any of the parent's memory. That makes it awkward to run a function in the child from the parent: the child does not know the definition of the function, and the function may depend on a variable defined in the parent process's module. To avoid these problems, multiprocessing automatically imports the parent process's module in the child, which recreates most of the parent's module-level state there. Every module-level statement, including your print, therefore runs once in each child.
This is where if __name__ == "__main__" comes in. The statement means "if the current file is being run directly, then..."; the code under this block does not run when the module is imported. Child processes therefore do not run anything under this block when they are spawned. You can use the block to create, for example, variables that use a lot of memory and are needed only by the parent. In short, anything the child processes do not need can go under it.
Now coming to your comment about imports:
This must be a silly question, but should I leave the import statements as they are, or move them inside if __name__ == "__main__":, or somewhere else? Thanks
Like I said, anything that the child doesn't need can be put under this if block. The reason you don't often see imports under this block is perhaps convention ("imports should be done at the top") and the fact that importing a module again rarely affects performance much. Keep in mind, however, that if a child process requires a particular module to start its work, that module will always be imported again within the child process, even if you imported it under the if __name__... block. This is because when you run a function in child processes, multiprocessing serializes and sends the name of the function and the name of the module that defines it (the actual code is not serialized, only the names), and the child imports them once more.
This applies only when the start method is spawn; the other start methods behave differently.
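The restructuring described above could look roughly like the following sketch. It assumes the random inputs are needed only by the parent, so they are built inside main() rather than at import time. build_inputs is a hypothetical helper, and a_simulation here is a stand-in for the real function in simulator/simulation.py.

```python
import random
from multiprocessing import Pool


def a_simulation(x, seed_):
    # Stand-in for simulator.simulation.a_simulation.
    return seed_


def build_inputs(num_trials):
    # Build (trial index, random argument) pairs on demand, not at import time.
    return [(i, random.random()) for i in range(num_trials)]


def main():
    inputs = build_inputs(10)
    print("hello")  # reached only through the main guard, so only in the parent
    with Pool(processes=4) as pool:
        result = pool.starmap(a_simulation, inputs)
    print(result)


if __name__ == "__main__":
    main()
```

With this layout, a spawned worker that re-imports the file only defines functions; it neither prints nor regenerates the random arguments.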

Does multiprocessing copy the object in this scenario?

import multiprocessing
import numpy as np
import multiprocessing as mp
import ctypes
class Test():
    def __init__(self):
        shared_array_base = multiprocessing.Array(ctypes.c_double, 100, lock=False)
        self.a = shared_array = np.ctypeslib.as_array(shared_array_base)
    def my_fun(self,i):
        self.a[i] = 1
if __name__ == "__main__":
    num_cores = multiprocessing.cpu_count()
    t = Test()
    def my_fun_wrapper(i):
        t.my_fun(i)
    with mp.Pool(num_cores) as p:
        p.map(my_fun_wrapper, np.arange(100))
    print(t.a)
In the code above, I'm trying to modify an array using multiprocessing. The function my_fun(), executed in each process, should modify the array a[:] at the index i that is passed to my_fun() as a parameter. I would like to know what is being copied.
1) Is anything in the code being copied by each process? I think the object might be but ideally nothing is.
2) Is there a way to get around using a wrapper function my_fun() for the object?

Almost everything in your code is getting copied, except the shared memory you allocated with multiprocessing.Array. multiprocessing is full of unintuitive, implicit copies.
When you spawn a new process in multiprocessing, the new process needs its own version of just about everything in the original process. This is handled differently depending on platform and settings, but we can tell you're using "fork" mode, because your code wouldn't work in "spawn" or "forkserver" mode - you'd get an error about the workers not being able to find my_fun_wrapper. (Windows only supports "spawn", so we can tell you're not on Windows.)
In "fork" mode, this initial copy is made by using the fork system call to ask the OS to essentially copy the whole entire process and everything inside. The memory allocated by multiprocessing.Array is sort of "external" and isn't copied, but most other things are. (There's also copy-on-write optimization, but copy-on-write still behaves as if everything was copied, and the optimization doesn't work very well in Python due to refcount updates.)
When you dispatch tasks to worker processes, multiprocessing needs to make even more copies. Any arguments, and the callable for the task itself, are objects in the master process, and objects inherently exist in only one process. The workers can't access any of that. They need their own versions. multiprocessing handles this second round of copies by pickling the callable and arguments, sending the serialized bytes over interprocess communication, and unpickling the pickles in the worker.
When the master pickles my_fun_wrapper, the pickle just says "look for the my_fun_wrapper function in the __main__ module", and the workers look up their version of my_fun_wrapper to unpickle it. my_fun_wrapper looks for a global t, and in the workers, that t was produced by the fork, and the fork produced a t with an array backed by the shared memory you allocated with your original multiprocessing.Array call.
On the other hand, if you try to pass t.my_fun to p.map, then multiprocessing has to pickle and unpickle a method object. The resulting pickle doesn't say "look up the t global variable and get its my_fun method". The pickle says to build a new Test instance and get its my_fun method. The pickle doesn't have any instructions in it about using the shared memory you allocated, and the resulting Test instance and its array are independent of the original array you wanted to modify.
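The effect of pickling a bound method can be seen without starting any worker. The sketch below is a simplified stand-in for the question's class: it uses a plain list instead of a shared array, and roundtrip is a hypothetical helper that mimics what multiprocessing does to a task before a worker runs it.

```python
import pickle


class Test:
    def __init__(self):
        self.a = [0, 0, 0]

    def my_fun(self, i):
        self.a[i] = 1


def roundtrip(obj):
    # Serialise and deserialise, as happens to a task sent to a worker.
    return pickle.loads(pickle.dumps(obj))


t = Test()
method_copy = roundtrip(t.my_fun)
method_copy(0)

print(method_copy.__self__ is t)  # the method is bound to a second Test instance
print(t.a)                        # the original instance's data
print(method_copy.__self__.a)     # the copy's data, the only one that was modified
```

The call through the unpickled method changes only the copy, which is the behaviour described above for t.my_fun passed to p.map.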
I know of no good way to avoid needing some sort of wrapper function.

multiprocessing -> pathos.multiprocessing and windows

I'm currently using the standard multiprocessing in python to generate a bunch of processes that will run indefinitely. I'm not particularly concerned with performance; each thread is simply watching for a different change on the filesystem, and will take the appropriate action when a file is modified.
Currently, I have a solution that works, for my needs, in Linux. I have a dictionary of functions and arguments that looks like:
job_dict['func1'] = {'target': func1, 'args': (args,)}
For each, I create a process:
import multiprocessing
for k in job_dict.keys():
    jobs[k] = multiprocessing.Process(target=job_dict[k]['target'],
                                      args=job_dict[k]['args'])
With this, I can keep track of each one that is running, and, if necessary, restart a job that crashes for any reason.
This does not work in Windows. Many of the functions I'm using are wrappers, using various functools functions, and I get messages about not being able to serialize the functions (see What can multiprocessing and dill do together?). I have not figured out why I do not get this error in Linux, but do in Windows.
If I import dill before starting my processes in Windows, I do not get the serialization error. However, the processes do not actually do anything. I cannot figure out why.
I then switched to the multiprocessing implementation in pathos, but did not find an analog to the simple Process class within the standard multiprocessing module. I was able to generate threads for each job using pathos.pools.ThreadPool. This is not the intended use for map, I'm sure, but it started all the threads, and they ran in Windows:
import pathos
tp = pathos.pools.ThreadPool()
for k in job_dict.keys():
    tp.uimap(job_dict[k]['target'], job_dict[k]['args'])
However, now I'm not sure how to monitor whether a thread is still active, which I'm looking for so that I can restart threads that crash for some reason or another. Any suggestions?

I'm the pathos and dill author. The Process class is buried deep within pathos at pathos.helpers.mp.process.Process, where mp itself is the actual fork of the multiprocessing library. Everything in multiprocessing should be accessible from there.
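For the monitoring part of the question, a minimal sketch using the standard multiprocessing.Process interface is shown below. It assumes that the Process class inside pathos exposes the same is_alive() and start() methods, which has not been verified here. make_process and restart_dead are hypothetical helpers, and time.sleep stands in for a real job.

```python
import multiprocessing as mp
import time


def make_process(spec):
    # spec has the same shape as the job_dict entries in the question.
    return mp.Process(target=spec["target"], args=spec["args"])


def restart_dead(jobs, job_dict):
    """Restart every job whose process is no longer running."""
    restarted = []
    for name, proc in list(jobs.items()):
        if not proc.is_alive():
            jobs[name] = make_process(job_dict[name])
            jobs[name].start()
            restarted.append(name)
    return restarted


if __name__ == "__main__":
    job_dict = {"short": {"target": time.sleep, "args": (0.1,)}}
    jobs = {name: make_process(spec) for name, spec in job_dict.items()}
    for proc in jobs.values():
        proc.start()
    time.sleep(0.5)  # give the short job time to exit
    print(restart_dead(jobs, job_dict))
    for proc in jobs.values():
        proc.join()
```

Calling restart_dead periodically from the parent gives a simple supervisor loop. The exitcode attribute of each process can be inspected as well if crashes need to be told apart from normal exits.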
Another thing to know about pathos is that it keeps the pool alive for you until you remove it from the held state. This helps reduce overhead in creating "new" pools. To remove a pool, you do:
>>> # create
>>> p = pathos.pools.ProcessPool()
>>> # remove
>>> p.clear()
There's no such mechanism for a Process however.
For multiprocessing, Windows is different from Linux and Macintosh because Windows doesn't have a proper fork like Linux does. Linux can share objects across processes, while on Windows there is no sharing: a fully independent new process is created. The serialization therefore has to be good enough to pass the object across to the other process, just as if you were sending the object to another computer. On Linux, you'd have to do this to get the same behavior:
def check(obj, *args, **kwds):
    """check pickling of an object across another process"""
    import subprocess
    import dill
    fail = True
    try:
        _x = dill.dumps(obj, *args, **kwds)
        fail = False
    finally:
        if fail:
            print("DUMP FAILED")
    # run "python -c <code>" so that a fresh interpreter has to load the pickle
    msg = "python -c import dill; print(dill.loads(%s))" % repr(_x)
    print("SUCCESS" if not subprocess.call(msg.split(None, 2)) else "LOAD FAILED")

What is being pickled when I call multiprocessing.Process?

I know that multiprocessing uses pickling in order to have the processes run on different CPUs, but I think I am a little confused as to what is being pickled. Let's look at this code.
from multiprocessing import Process
def f(I):
    print('hello world!',I)
if __name__ == '__main__':
    for I in range(1, 3):
        Process(target=f,args=(I,)).start()
I assume what is being pickled is the def f(I) and the argument going in. First, is this assumption correct?
Second, let's say f(I) has a function call within it, like:
def f(I):
    print('hello world!',I)
    randomfunction()
Does randomfunction's definition get pickled as well, or is it only the function call?
Furthermore, if that function were located in another file, would the process be able to call it?

In this particular example, what gets pickled is platform dependent. On systems that support os.fork, like Linux, nothing is pickled here. Both the target function and the args you're passing get inherited by the child process via fork.
On platforms that don't support fork, like Windows, the f function and args tuple will both be pickled and sent to the child process. The child process will re-import your __main__ module, and then unpickle the function and its arguments.
In either case, randomfunction is not actually pickled. When you pickle f, all you're really pickling is a pointer that lets the child process rebuild the f function object. This is usually little more than a string that tells the child how to re-import f:
>>> def f(I):
... print('hello world!',I)
... randomfunction()
...
>>> pickle.dumps(f)
'c__main__\nf\np0\n.'
The child process will just re-import f, and then call it. randomfunction will be accessible as long as it was properly imported into the original script to begin with.
Note that in Python 3.4+, you can get the Windows-style behavior on Linux by using contexts:
ctx = multiprocessing.get_context('spawn')
ctx.Process(target=f,args=(I,)).start() # even on Linux, this will use pickle
The descriptions of the contexts are also probably relevant here, since they apply to Python 2.x as well:
spawn
    The parent process starts a fresh python interpreter process.
    The child process will only inherit those resources necessary to run
    the process object's run() method. In particular, unnecessary file
    descriptors and handles from the parent process will not be inherited.
    Starting a process using this method is rather slow compared to using
    fork or forkserver.
    Available on Unix and Windows. The default on Windows.
fork
    The parent process uses os.fork() to fork the Python interpreter.
    The child process, when it begins, is effectively identical to the
    parent process. All resources of the parent are inherited by the child
    process. Note that safely forking a multithreaded process is
    problematic.
    Available on Unix only. The default on Unix.
forkserver
    When the program starts and selects the forkserver start
    method, a server process is started. From then on, whenever a new
    process is needed, the parent process connects to the server and
    requests that it fork a new process. The fork server process is single
    threaded so it is safe for it to use os.fork(). No unnecessary
    resources are inherited.
    Available on Unix platforms which support passing file descriptors
    over Unix pipes.
Note that forkserver is only available in Python 3.4 and later; there's no way to get that behavior on 2.x, regardless of the platform you're on.
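To see which of these start methods a given interpreter offers, the multiprocessing module can be queried directly. This short sketch only inspects the configuration and starts no processes; the printed values depend on the platform and Python version, so no output is shown.

```python
import multiprocessing as mp

# Start methods supported by this interpreter on this platform.
print(mp.get_all_start_methods())

# A context pins one start method without changing the global default.
ctx = mp.get_context("spawn")
print(ctx.get_start_method())
```

Objects created through ctx (ctx.Process, ctx.Pool, ctx.Queue) use the start method of that context, which makes it possible to test the Windows-style behaviour on other systems.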

The function is pickled, but possibly not in the way you think of it:
You can look at what's actually in a pickle like this:
pickletools.dis(pickle.dumps(f))
I get:
    0: c    GLOBAL     '__main__ f'
   12: p    PUT        0
   15: .    STOP
You'll note that there is nothing in there corresponding to the code of the function. Instead, it has a reference to __main__ f, which is the module and name of the function. So when this is unpickled, it will always attempt to look up the f function in the __main__ module and use that. When you use the multiprocessing module, that ends up being a copy of the same function as in your original program.
This does mean that if you somehow change which function is located at __main__.f, you'll end up unpickling a different function than the one you pickled.
Multiprocessing brings up a complete copy of your program, with all the functions you defined in it, so you can just call functions. The function itself isn't copied over, just its name. The pickle module's assumption is that the function will be the same in both copies of your program, so it can look the function up by name.
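The lookup-by-name behaviour can be checked with a function from the standard library, which avoids any dependence on how the script itself is run. The sketch also shows, as a related assumption not stated in the answer, that a lambda cannot be pickled by the standard pickle module because it has no importable name.

```python
import json
import pickle
import pickletools

payload = pickle.dumps(json.dumps)
pickletools.dis(payload)          # lists only the names 'json' and 'dumps'

restored = pickle.loads(payload)
print(restored is json.dumps)     # the lookup returns the existing function object

try:
    pickle.dumps(lambda x: x)
    picklable = True
except (pickle.PicklingError, AttributeError):
    picklable = False
print(picklable)
```

Nothing in payload contains the body of json.dumps; unpickling simply imports json and fetches the attribute named dumps.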

Only the function arguments (I,) and the return value of the function f are pickled. The actual definition of the function f has to be available when loading the module.
The easiest way to see this is through the code:
from multiprocessing import Process
if __name__ == '__main__':
    def f(I):
        print('hello world!',I)
    for I in [1,2,3]:
        Process(target=f,args=(I,)).start()
That raises:
AttributeError: 'module' object has no attribute 'f'

python multiprocess caller (as well as callee) invoked multiple times on windows XP [duplicate]

Possible Duplicate:
Multiprocessing launching too many instances of Python VM
I'm trying to use python multiprocess to parallelize web fetching, but I'm finding that the application calling the multiprocessing gets instantiated multiple times, not just the function I want called (which is a problem for me as the caller has some dependencies on a library that is slow to instantiate - losing most of my performance gains from parallelism).
What am I doing wrong or how is this avoided?
my_app.py:
from url_fetcher import url_fetch, parallel_fetch
import my_slow_stuff
if __name__ == '__main__':
    import datetime
    urls = ['http://www.microsoft.com'] * 20
    results = parallel_fetch(urls, fn=url_fetch)
    print([x[:20] for x in results])
my_slow_stuff.py:
class MySlowStuff(object):
    import time
    print('doing slow stuff')
    time.sleep(0)
    print('done slow stuff')
url_fetcher.py:
import multiprocessing
import urllib
def url_fetch(url):
    #return urllib.urlopen(url).read()
    return url
def parallel_fetch(urls, fn):
    PROCESSES = 10
    CHUNK_SIZE = 1
    pool = multiprocessing.Pool(PROCESSES)
    results = pool.imap(fn, urls, CHUNK_SIZE)
    return results
if __name__ == '__main__':
    import datetime
    urls = ['http://www.microsoft.com'] * 20
    results = parallel_fetch(urls, fn=url_fetch)
    print([x[:20] for x in results])
partial output:
$ python my_app.py
doing slow stuff
done slow stuff
doing slow stuff
done slow stuff
doing slow stuff
done slow stuff
doing slow stuff
done slow stuff
doing slow stuff
done slow stuff
...

Python's multiprocessing module behaves slightly differently on Windows because os.fork() is not available on that platform. In particular:
Safe importing of main module
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
Here, the body of the module-level class MySlowStuff is evaluated by every newly started child process on Windows. To fix that, class MySlowStuff should be defined only when __name__ == '__main__'.
See 16.6.3.2. Windows for more details.

The multiprocessing module on Windows doesn't work the same way as on Unix/Linux. On Linux it uses the fork system call, and the whole context is duplicated into the new process as it was at the moment of the fork.
The fork system call does not exist on Windows, so the multiprocessing module has to create a new Python process and load all the modules again. This is why the Python library documentation requires the if __name__ == '__main__' trick when using multiprocessing on Windows.
The solution in this case is to use threads instead. This is an I/O-bound task, so the advantage of multiprocessing, which is avoiding GIL problems, does not apply to you.
More info at http://docs.python.org/library/multiprocessing.html#windows
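A thread-based version of parallel_fetch might look like the sketch below. It assumes concurrent.futures.ThreadPoolExecutor is an acceptable replacement for the process pool, and url_fetch remains the stub from the question, so no network access takes place.

```python
from concurrent.futures import ThreadPoolExecutor


def url_fetch(url):
    # Stub from the question; a real version would download the page.
    return url


def parallel_fetch(urls, fn, workers=10):
    # Threads live in the same interpreter, so no module is re-imported.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, urls))


if __name__ == "__main__":
    urls = ["http://www.microsoft.com"] * 20
    results = parallel_fetch(urls, fn=url_fetch)
    print([x[:20] for x in results])
```

Because no new interpreter is started, a slow module imported by the caller is loaded only once.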
