Windows multiprocessing - python

As I have discovered, Windows is a bit of a pig when it comes to multiprocessing, and I have a question about it.
The pydoc states you should protect the entry point of a Windows application when using multiprocessing.
Does this mean only the code which creates the new process?
For example:
Script 1
import multiprocessing

def somemethod():
    while True:
        print 'do stuff'

# this will need protecting
p = multiprocessing.Process(target=somemethod).start()

# this won't
if __name__ == '__main__':
    p = multiprocessing.Process(target=somemethod).start()
In this script you need to wrap the process creation in the if __name__ == '__main__' guard because that line spawns the process.
But what if you had the following?
Script 2
file1.py
import file2

if __name__ == '__main__':
    p = file2.Aclass().start()

file2.py
import multiprocessing

ITEM = 0

def method1():
    print 'method1'

method1()

class Aclass(multiprocessing.Process):
    def __init__(self):
        print 'Aclass'
        super(Aclass, self).__init__()

    def run(self):
        print 'stuff'
What would need to be protected in this instance?
What would happen if there was an if __name__ == '__main__' block in file2.py? Would the code inside it get executed when a process was being created?
NOTE: I know the code will not compile. It's just an example.

The pydoc states you should protect the entry point of a Windows application when using multiprocessing.
My interpretation differs: the documentation states
the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
So importing your module (import mymodule) should not create new processes. That is, you can avoid starting processes by protecting your process-creating code with an
if __name__ == '__main__':
...
because the code in the ... only runs when your program is run as the main program, that is, when you do
python mymodule.py
or when you run it as an executable, but not when you import the file.
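A quick way to see the difference is a minimal sketch (the file name whoami.py is just for illustration):
# whoami.py
print('__name__ is', __name__)

if __name__ == '__main__':
    print('running as the main program')
Running python whoami.py prints both lines; import whoami prints only the first, with __name__ set to whoami.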
So, to answer your question about file2: no, you do not need protection there, because no process is started during import file2.
Also, if you put an if __name__ == '__main__' block in file2.py, the code inside it would not run, because file2 is imported rather than executed as the main program.
Edit: here is an example of what can happen when you do not protect your process-creating code: it might just loop and create a ton of processes.
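A sketch of such unprotected code (the file name unsafe.py is hypothetical):
# unsafe.py
import multiprocessing

def work():
    print('working')

# Unprotected: under the "spawn" start method every child re-imports this
# module, reaches this line again, and tries to start yet another child.
p = multiprocessing.Process(target=work)
p.start()
p.join()
Modern Python versions detect this during child bootstrapping and raise a RuntimeError rather than spawning endlessly, but the fix is the same: move the last three lines under the if __name__ == '__main__': guard.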

Related

Multiprocessing never executing function keeps repeating code before function

I have a multiprocessing pool that runs with 1 thread, and it keeps repeating the code before my function. I have tried with different thread counts, and I make things like this quite a bit, so I think I know what is causing the problem, but I don't understand why. Usually I use argparse to parse files from the user, but I instead wanted to use input. No errors are thrown, so I honestly have no clue.
from colorama import Fore
import colorama
import os
import ctypes
import multiprocessing
from multiprocessing import Pool
import random

colorama.init(autoreset=False)
print("headerhere")

# as you can see i used input instead of argparse
g = open(input(Fore.RED + " File Path?: " + Fore.RESET))
gg = open(input(Fore.RED + "File Path?: " + Fore.RESET))

# I messed around with this to see if it was the problem, ultimately
# disabling it until i fixed it, i just use 1 thread
threads = int(input(Fore.RED + "Amount of Threads?: " + Fore.RESET))

arrange = [lines.replace("\n", "") for lines in g]
good = [items.replace("\n", "") for items in gg]

# this is all of the code before the function that Pool calls
def che(line):
    print("f")

# i would show my code but as i said this isnt the problem since ive made
# programs like this before, the only thing i changed is how i take file
# inputs from the user

def main():
    pool = Pool(1)
    pool.daemon = True
    result = pool.map(che, arrange)

if __name__ == "__main__":
    main()
Here's a minimal, reproducible example of your issue:
from multiprocessing import Pool

print('header')

def func(n):
    print(f'func {n}')

def main():
    pool = Pool(3)
    pool.map(func, [1, 2, 3])

if __name__ == '__main__':
    main()
On OSes where "spawn" (Windows and macOS) or "forkserver" (some Unixes) is the default start method, the subprocess imports your script. Since print('header') is at global scope, it runs the first time the script is imported into a process, so the output is:
header
header
header
header
func 1
func 2
func 3
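If you are on Linux, where "fork" is usually the default start method and the extra header lines do not appear, you can reproduce the issue by forcing "spawn"; a minimal sketch:
from multiprocessing import Pool, set_start_method

print('header')

def func(n):
    print(f'func {n}')

def main():
    pool = Pool(3)
    pool.map(func, [1, 2, 3])

if __name__ == '__main__':
    set_start_method('spawn')  # use the Windows/macOS default explicitly
    main()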
A multiprocessing script should have everything meant to run once inside functions, and those should be called exactly once by the main script via if __name__ == '__main__':, so the solution is to move the print into your def main():
from multiprocessing import Pool

def func(n):
    print(f'func {n}')

def main():
    print('header')
    pool = Pool(3)
    pool.map(func, [1, 2, 3])

if __name__ == '__main__':
    main()
Output:
header
func 1
func 2
func 3
If you want the top-level code before the definition of che to be executed only in the master process, then place it in a function and call that function in main.
In multiprocessing, top-level statements are interpreted/executed by both the master process and every child process. So, if some code should be executed only by the master and not by the children, it should not be placed at the top level. Instead, put it in functions and invoke those functions from the main scope, i.e., inside the if block controlled by __name__ == '__main__' (or call them in the main function in your code snippet).
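A sketch that makes this visible, tagging the top-level statement with the name of the process executing it:
import multiprocessing
from multiprocessing import Pool

# Top-level statement: runs in the master and, under "spawn", again in every child.
print('top level, executed by:', multiprocessing.current_process().name)

def work(n):
    return n * n

if __name__ == '__main__':
    with Pool(2) as pool:
        print(pool.map(work, [1, 2, 3]))
On Windows the children print names like SpawnPoolWorker-1, confirming that they re-execute the top level.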

Possible to create Python multiprocessing child/worker processes that do not import the __main__ module?

The standard behavior of multiprocessing on Windows is to import the __main__ module into child processes when spawned.
For large projects with many imports, this can significantly slow down the child process startup, not to mention the extra resources consumed. It seems very inefficient for cases where the child process will run a self-contained task that only uses a small subset of those imports.
Is there a way to explicitly specify the imports for the child processes? If not the multiprocessing library, is there an alternative?
While I'm specifically interested in Python 3, answers for Python 2 may be useful for others.
Edit
I've confirmed that the approach suggested by Lie Ryan works, as shown by the following example:
import sys
import types

def imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            yield val.__name__

def worker():
    print('Worker modules:')
    print('\n'.join(imports()))

if __name__ == '__main__':
    import multiprocessing

    print('Main modules:')
    print('\n'.join(imports()))
    print()

    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()
Output:
Main modules:
builtins
sys
types
multiprocessing
Worker modules:
sys
types
However, I don't think I can sell the rest of my team on wrapping the top-level script in if __name__ == '__main__' just to enable a small feature deep in the codebase. Still holding out hope that there's a way to do this without top-level changes.
The docs you linked tell you:
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
...
Instead one should protect the “entry point” of the program by using if __name__ == '__main__': as follows:
...
You can also put import statements inside the if block; those imports will then only be executed when you run the file as a program, not when it is being imported.
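A sketch of that pattern (xml.dom.minidom stands in here for a genuinely heavy dependency):
def worker():
    # The child only needs what this function itself uses.
    print('worker running')

if __name__ == '__main__':
    # These imports execute only in the parent; a child that re-imports
    # this module under "spawn" skips this block entirely.
    import multiprocessing
    import xml.dom.minidom  # stand-in for a heavy import

    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()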
<flame>Either that or switch to use a real OS that supports real fork()-ing</flame>

Python variables not defined after if __name__ == '__main__'

I'm trying to divvy up the task of looking up historical stock price data for a list of symbols by using Pool from the multiprocessing library.
This works great until I try to use the data I get back. I have my hist_price function defined, and it outputs to a list of dicts, pcl. I can print(pcl) and it works flawlessly, but if I try to print(pcl) after the if __name__=='__main__': block, it blows up saying pcl is undefined. I've tried declaring global pcl in a couple of places, but it doesn't make a difference.
from multiprocessing import Pool

syms = ['List', 'of', 'symbols']

def hist_price(sym):
    #... lots of code looking up data, calculations, building dicts...
    stlh = {"Sym": sym, "10D Max": pcmax, "10D Min": pcmin}  # simplified
    return stlh

#global pcl
if __name__ == '__main__':
    pool = Pool(4)
    #global pcl
    pcl = pool.map(hist_price, syms)
    print(pcl)  # this works
    pool.close()
    pool.join()

print(pcl)  # says pcl is undefined
#...rest of my code, dependent on pcl...
I've also tried removing the if __name__=='__main__': block, but it gives me a RuntimeError telling me specifically to put it back. Is there some other way to make variables usable outside of the if block?
I think there are two parts to your issue. The first is "what's wrong with pcl in the current code?", and the second is "why do I need the if __name__ == "__main__" guard block at all?".
Let's address them in order. The problem with the pcl variable is that it is only defined inside the if block, so if the module gets loaded without being run as a script (which is what sets __name__ to "__main__"), pcl will not be defined when the later code runs.
To fix this, you can change how your code is structured. The simplest fix would be to guard the other bits of the code that use pcl within an if __name__ == "__main__" block too (e.g. indent them all under the current block, perhaps). An alternative fix would be to put the code that uses pcl into functions (which can be declared outside the guard block), then call the functions from within an if __name__ == "__main__" block. That would look something like this:
def do_stuff_with_pcl(pcl):
    print(pcl)

if __name__ == "__main__":
    # multiprocessing code, etc
    pcl = ...
    do_stuff_with_pcl(pcl)
As for why the issue came up in the first place, the ultimate cause is using the multiprocessing module on Windows. You can read about the issue in the documentation.
When multiprocessing creates a new process for its Pool, it needs to initialize that process with a copy of the current module's state. Because Windows doesn't have fork (which copies the parent process's memory into a child process automatically), Python needs to set everything up from scratch. In each child process, it loads the module from its file, and if the module's top-level code tried to create a new Pool, you'd have a recursive situation where each of the child processes would start spawning a whole new set of child processes of its own.
The multiprocessing code has some guards against that, I think (so you won't fork bomb yourself out of simple carelessness), but you still need to do some of the work yourself too, by using if __name__ == "__main__" to guard any code that shouldn't be run in the child processes.
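Putting both points together, a sketch of the restructured script (hist_price is stubbed out here because the original lookup code was not shown):
from multiprocessing import Pool

syms = ['List', 'of', 'symbols']

def hist_price(sym):
    # Stub standing in for the real data lookup and calculations.
    return {"Sym": sym, "10D Max": 0, "10D Min": 0}

def do_stuff_with_pcl(pcl):
    print(pcl)

def main():
    with Pool(4) as pool:
        pcl = pool.map(hist_price, syms)
    do_stuff_with_pcl(pcl)

if __name__ == '__main__':
    main()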

Python - problems with shared variable

I'd like to use a shared queue from multiple threads and modules. I have the following Python code:
# moda.py
import queue
import modb

q = queue.Queue()

def myPut(x):
    q.put(x)

def main():
    print('moda:', str(id(q)))
    modb.go()
    q.get()

if __name__ == '__main__':
    main()
and
# modb.py
import moda
import threading

def something():
    print('modb:', str(id(moda.q)))
    moda.myPut('hi')

def go():
    threading.Thread(target=something).start()
something gets called on a separate thread while main runs on the main thread. The ids of q are different in these two functions, which is why the call to get never returns. How can I avoid this? Is it because of the cyclic import or because of multithreading?
The link posted by Austin Phillips in the comments has the answer:
Finally, the executing script runs in a module named __main__; importing the script under its own name will create a new module unrelated to __main__.
So, __main__.q and moda.q (as imported into modb) are two different objects.
One way to make it work is to create a separate main module like this and run it instead of moda:
# modmain.py
import moda

if __name__ == '__main__':
    moda.main()
However, you should still consider putting q and other shared stuff into a new module that you import into both moda and modb to avoid some other pitfalls.
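A sketch of that layout (shared.py is a hypothetical new module):
# shared.py -- owns the queue; imports neither moda nor modb
import queue

q = queue.Queue()
moda.py and modb.py would then both do import shared and use shared.q, so there is exactly one Queue object no matter which module is run as the main program, and the cyclic import between moda and modb disappears.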

importing target functions | multiprocessing

I want to learn multiprocessing in Python. I started reading http://www.doughellmann.com/PyMOTW/multiprocessing/basics.html and I am not able to understand the section on importing target functions.
In particular, what does the following sentence mean?
"Wrapping the main part of the application in a check for __main__ ensures that it is not run recursively in each child as the module is imported."
Can someone explain this in more detail with an example?
http://effbot.org/pyfaq/tutor-what-is-if-name-main-for.htm
http://docs.python.org/tutorial/modules.html#executing-modules-as-scripts
What does if __name__ == "__main__": do?
http://en.wikipedia.org/wiki/Main_function#Python
On Windows, the multiprocessing module imports the __main__ module when spawning a new process. If the code that spawns the new process is not wrapped in a if __name__ == '__main__' block, then importing the main module will again spawn a new process. And so on, ad infinitum.
This issue is also mentioned in the multiprocessing docs in the section entitled "Safe importing of main module". There, you'll find the following simple example:
Running this on Windows:
from multiprocessing import Process

def foo():
    print 'hello'

p = Process(target=foo)
p.start()
results in a RuntimeError.
And the fix is to use:
if __name__ == '__main__':
    p = Process(target=foo)
    p.start()
"""This is my module (mymodule.py)"""
def sum(a,b):
""">>> sum(1,1)
2
>>> sum(1,-1)
0
"""
return a+b
# if you run this module using 'python mymodule.py', run a self test
# if you just import this module, you get foo() and other definitions,
# but the self-test isn't run
if __name__=='__main__':
import doctest
doctest.testmod()
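As a usage note, doctest.testmod() honours the -v flag, so you can watch the self-test run:
python mymodule.py -v
When the module is merely imported, none of that executes.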
This ensures that the script being run is the 'top-level environment', which matters for interactivity.
For example, if you wanted to interact with the user (as the launching process), you would want to ensure that the script is running as main:
if __name__ == '__main__':
    do_something()
