I use Windows. I don't understand why the following code fails.
a, when defined in the "if main" block, should be a global variable, but when I run the script I get the error "a is not defined". However, if a is defined outside the "if main" block, the code works.
from multiprocessing import Pool
import numpy as np

# a = np.array([1,2,3])

def f(x):
    return a*x

if __name__ == '__main__':
    a = np.array([1,2,3])
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
Only the main process runs with __name__ == '__main__'. The child processes re-import your script in a freshly spawned interpreter, where __name__ is set to '__mp_main__' instead. This is intentional, and required on Windows (where fork() is not available), to provide a mechanism for executing code like initializing the pool only in the parent. See discussion here: Workaround for using __name__=='__main__' in Python multiprocessing
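If you want a to be available in the workers without moving it to module scope, one common pattern (a minimal sketch, not part of the original answer; init_worker is a name chosen here for illustration) is to hand it to each child through the pool's initializer, which runs once per worker process:

from multiprocessing import Pool
import numpy as np

def init_worker(arr):
    # runs once in every worker process; publish the array as a module global there
    global a
    a = arr

def f(x):
    return a * x

if __name__ == '__main__':
    a = np.array([1, 2, 3])
    # initializer/initargs are forwarded to each spawned worker before it takes tasks
    with Pool(5, initializer=init_worker, initargs=(a,)) as p:
        print(p.map(f, [1, 2, 3]))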
I encountered a problem while writing Python code with a multiprocessing map function. The minimal code to reproduce the problem is:
import multiprocessing as mp

if __name__ == '__main__':
    def f(x):
        return x*x
    num_workers = 2
    with mp.Pool(num_workers) as p:
        print(p.map(f, [1,2,3]))
When I run this piece of code, I get the error message
AttributeError: Can't get attribute 'f' on <module '__mp_main__' from 'main.py'>
However, if I move the function f outside the if-main block, i.e.
import multiprocessing as mp

def f(x):
    return x*x

if __name__ == '__main__':
    num_workers = 2
    with mp.Pool(num_workers) as p:
        print(p.map(f, [1,2,3]))
It works this time. I am wondering what the difference between them is and why I get an error in the first version. Thanks in advance.
Depending on your operating system, sub-processes will either be forked or spawned. Windows and macOS, for example, will spawn, whereas Linux will fork by default.
You can enforce forking but you need to fully understand the implications of doing so.
For this specific question a workaround could be implemented thus:
import multiprocessing as mp
from multiprocessing import set_start_method

if __name__ == '__main__':
    def f(x):
        return x*x
    set_start_method('fork')
    num_workers = 2
    with mp.Pool(num_workers) as p:
        print(p.map(f, [1,2,3]))
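Note that the 'fork' start method is unavailable on Windows, and on macOS forking can be unsafe in combination with some system libraries, so this workaround only applies on Unix-like systems.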
This will vary between operating systems, but the basic reason is that this line of code
if __name__ == '__main__':
is telling the Python interpreter to execute anything in this code section only in the main process when the file is run as a script - it won't run in any sub-process, nor when you import the file as a module. So when you do this
import multiprocessing as mp

if __name__ == '__main__':
    def f(x):
        return x*x
    num_workers = 2
    with mp.Pool(num_workers) as p:
        print(p.map(f, [1,2,3]))
any sub-processes created by p.map will not have the definition of the function f.
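You can observe this directly (a minimal diagnostic sketch; spawn is forced here so the behaviour is reproducible on any OS): each worker re-imports the script with __name__ set to '__mp_main__', never '__main__'.

import multiprocessing as mp

# printed once by the parent and once per worker import under spawn
print(f"module loaded with __name__ = {__name__}")

def f(x):
    return x*x

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    with mp.Pool(2) as p:
        print(p.map(f, [1, 2, 3]))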
I have a multiprocessing pool that runs with 1 thread, and it keeps repeating the code before my function. I have tried with different numbers of threads as well. I make things like this quite a bit, so I think I know what is causing the problem, but I don't understand why. Usually I use argparse to parse files from the user, but this time I wanted to use input instead. No errors are thrown, so I honestly have no clue.
from colorama import Fore
import colorama
import os
import ctypes
import multiprocessing
from multiprocessing import Pool
import random

colorama.init(autoreset=False)
print("headerhere")

# as you can see I used input instead of argparse
g = open(input(Fore.RED + " File Path?: " + Fore.RESET))
gg = open(input(Fore.RED + "File Path?: " + Fore.RESET))
# I messed around with this to see if it was the problem, ultimately disabling it until I fixed it; I just use 1 thread
threads = int(input(Fore.RED + "Amount of Threads?: " + Fore.RESET))
arrange = [lines.replace("\n", "") for lines in g]
good = [items.replace("\n", "") for items in gg]

# this is all of the code before the function that Pool calls
def che(line):
    print("f")
    # I would show my code but as I said this isn't the problem, since I've made programs like this before; the only thing I changed is how I take file inputs from the user

def main():
    pool = Pool(1)
    pool.daemon = True
    result = pool.map(che, arrange)

if __name__ == "__main__":
    main()
Here's a minimal, reproducible example of your issue:
from multiprocessing import Pool

print('header')

def func(n):
    print(f'func {n}')

def main():
    pool = Pool(3)
    pool.map(func, [1,2,3])

if __name__ == '__main__':
    main()
On OSes where "spawn" (Windows and macOS) or "forkserver" (some Unix) is the default start method, each sub-process imports your script. Since print('header') is at global scope, it runs the first time the script is imported into a process, so the output is:
header
header
header
header
func 1
func 2
func 3
A multiprocessing script should have everything meant to run once inside function(s), and they should be called once by the main script via if __name__ == '__main__':, so the solution is to move it into your def main():
from multiprocessing import Pool

def func(n):
    print(f'func {n}')

def main():
    print('header')
    pool = Pool(3)
    pool.map(func, [1,2,3])

if __name__ == '__main__':
    main()
Output:
header
func 1
func 2
func 3
If you want the top-level code before the definition of che to be executed only in the master process, place it in a function and call that function in main.
In multiprocessing, top-level statements are interpreted/executed by both the master process and every child process. So, if some code should be executed only by the master and not by the children, it should not be placed at the top level. Instead, place such code in functions and invoke those functions from the main scope, i.e., inside the if block guarded by __name__ == '__main__' (or from the main function in your code snippet), as in the sketch below.
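For example, a minimal sketch of how the script above could be restructured (the prompt and the worker body are placeholders standing in for the original code):

from multiprocessing import Pool

def che(line):
    print("f")  # placeholder for the real per-line work

def main():
    # everything that should run exactly once belongs here, not at top level
    g = open(input("File Path?: "))
    arrange = [line.rstrip("\n") for line in g]
    with Pool(1) as pool:
        pool.map(che, arrange)

if __name__ == "__main__":
    main()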
I am trying to use multiprocessing on a different problem but I can't get it to work. To make sure I'm using the Pool class correctly, I made the following simpler problem but even that won't work. What am I doing wrong here?
from multiprocessing import Pool

def square(x):
    sq = x**2
    return sq

def main():
    x1 = [1,2,3,4]
    pool = Pool()
    result = pool.map(square, x1)
    print(result)

if __name__ == '__main__': main()
The computer just seems to run forever and I need to close and restart the IPython shell before I can do anything.
I figured out what was wrong. I named the script "multiprocessing.py" which is the name of the module that was being imported. This resulted in the script attempting to import itself instead of the actual module.
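A related gotcha worth checking after renaming: if an older Python left a multiprocessing.pyc file next to the script, delete it too, since stale bytecode can keep shadowing the real module.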
What's the difference between ThreadPool and Pool in the multiprocessing module? When I try my code out, this is the main difference I see:
from multiprocessing import Pool
import os, time

print("hi outside of main()")

def hello(x):
    print("inside hello()")
    print("Process id: ", os.getpid())
    time.sleep(3)
    return x*x

if __name__ == "__main__":
    p = Pool(5)
    pool_output = p.map(hello, range(3))
    print(pool_output)
I see the following output:
hi outside of main()
hi outside of main()
hi outside of main()
hi outside of main()
hi outside of main()
hi outside of main()
inside hello()
Process id: 13268
inside hello()
Process id: 11104
inside hello()
Process id: 13064
[0, 1, 4]
With "ThreadPool":
from multiprocessing.pool import ThreadPool
import os, time

print("hi outside of main()")

def hello(x):
    print("inside hello()")
    print("Process id: ", os.getpid())
    time.sleep(3)
    return x*x

if __name__ == "__main__":
    p = ThreadPool(5)
    pool_output = p.map(hello, range(3))
    print(pool_output)
I see the following output:
hi outside of main()
inside hello()
inside hello()
Process id: 15204
Process id: 15204
inside hello()
Process id: 15204
[0, 1, 4]
My questions are:
why is "hi outside of main()" printed each time with Pool?
multiprocessing.pool.ThreadPool doesn't spawn new processes? It just creates new threads?
If so, what's the difference between using multiprocessing.pool.ThreadPool as opposed to just the threading module?
I don't see any official documentation for ThreadPool anywhere, can someone help me out where I can find it?
The multiprocessing.pool.ThreadPool behaves the same as the multiprocessing.Pool, with the only difference that it uses threads instead of processes to run the workers' logic.
The reason you see
hi outside of main()
being printed multiple times with the multiprocessing.Pool is that the pool spawns 5 independent processes. Each process initializes its own Python interpreter and loads the module, resulting in the top-level print being executed again.
Note that this happens only if the spawn process creation method is used (the only method available on Windows). If you use fork (Unix), you will see the message printed only once, as with the threads.
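If you are unsure which method your platform uses, you can check it directly (a small sketch):

import multiprocessing as mp

if __name__ == "__main__":
    # typically 'spawn' on Windows and macOS, 'fork' on most Linux builds
    print(mp.get_start_method())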
The multiprocessing.pool.ThreadPool is not documented as its implementation has never been completed. It lacks tests and documentation. You can see its implementation in the source code.
I believe the next natural question is: when to use a thread based pool and when to use a process based one?
The rule of thumb is:
IO bound jobs -> multiprocessing.pool.ThreadPool
CPU bound jobs -> multiprocessing.Pool
Hybrid jobs -> depends on the workload, I usually prefer the multiprocessing.Pool due to the advantage process isolation brings
On Python 3 you might want to take a look at the concurrent.futures Executor pool implementations.
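A minimal sketch of the equivalent with concurrent.futures (the function name is illustrative); both executors share the same API, so switching between threads and processes is a one-line change:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":
    # process-based pool for CPU-bound work
    with ProcessPoolExecutor(max_workers=5) as ex:
        print(list(ex.map(square, range(3))))  # [0, 1, 4]
    # thread-based pool for IO-bound work; no extra processes are started
    with ThreadPoolExecutor(max_workers=5) as ex:
        print(list(ex.map(square, range(3))))  # [0, 1, 4]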
I want to learn multiprocessing in python. I started reading http://www.doughellmann.com/PyMOTW/multiprocessing/basics.html and I am not able to understand the section on importing target functions.
In particular, what does the following sentence mean?
"Wrapping the main part of the application in a check for __main__ ensures that it is not run recursively in each child as the module is imported."
Can someone explain this in more detail with an example?
http://effbot.org/pyfaq/tutor-what-is-if-name-main-for.htm
http://docs.python.org/tutorial/modules.html#executing-modules-as-scripts
What does if __name__ == "__main__": do?
http://en.wikipedia.org/wiki/Main_function#Python
On Windows, the multiprocessing module imports the __main__ module when spawning a new process. If the code that spawns the new process is not wrapped in an if __name__ == '__main__' block, then importing the main module will again spawn a new process, and so on, ad infinitum.
This issue is also mentioned in the multiprocessing docs in the section entitled "Safe importing of main module". There, you'll find the following simple example:
Running this on Windows:
from multiprocessing import Process

def foo():
    print('hello')

p = Process(target=foo)
p.start()
results in a RuntimeError.
And the fix is to use:
if __name__ == '__main__':
    p = Process(target=foo)
    p.start()
"""This is my module (mymodule.py)"""
def sum(a,b):
""">>> sum(1,1)
2
>>> sum(1,-1)
0
"""
return a+b
# if you run this module using 'python mymodule.py', run a self test
# if you just import this module, you get foo() and other definitions,
# but the self-test isn't run
if __name__=='__main__':
import doctest
doctest.testmod()
This ensures that the script being run is the 'top-level environment' for interactivity.
For example, if you wanted to interact with the user from the launching process, you would want to ensure that only the main process does so.
if __name__ == '__main__':
    do_something()
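For instance (a minimal sketch; do_something is a hypothetical stand-in):

def do_something():
    # only the top-level (launching) process reaches this prompt;
    # spawned workers re-importing the module never will
    name = input("Your name?: ")
    print(f"hello, {name}")

if __name__ == '__main__':
    do_something()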