I'm trying to divvy up the task of looking up historical stock price data for a list of symbols by using Pool from the multiprocessing library.
This works great until I try to use the data I get back. I have my hist_price function defined, and its output goes into a list of dicts, pcl. Printing pcl inside the if __name__=='__main__': block works flawlessly, but if I try to print(pcl) after that block, it blows up saying pcl is undefined. I've tried declaring global pcl in a couple of places, but it doesn't make a difference.
from multiprocessing import Pool

syms = ['List', 'of', 'symbols']

def hist_price(sym):
    # ... lots of code looking up data, calculations, building dicts ...
    stlh = {"Sym": sym, "10D Max": pcmax, "10D Min": pcmin}  # simplified
    return stlh

#global pcl
if __name__ == '__main__':
    pool = Pool(4)
    #global pcl
    pcl = pool.map(hist_price, syms)
    print(pcl)  # this works
    pool.close()
    pool.join()

print(pcl)  # says pcl is undefined
# ...rest of my code, dependent on pcl...
I've also tried removing the if __name__=='__main__': block, but that gives me a RuntimeError telling me specifically to put it back. Is there some other way to make variables defined inside the if block available outside of it?
I think there are two parts to your issue. The first is "what's wrong with pcl in the current code?", and the second is "why do I need the if __name__ == "__main__" guard block at all?".
Let's address them in order. The problem with the pcl variable is that it is only defined inside the if block, so if the module gets loaded without being run as a script (which is what sets __name__ == "__main__"), it will not be defined when the later code runs.
To fix this, you can change how your code is structured. The simplest fix would be to guard the other bits of the code that use pcl within an if __name__ == "__main__" block too (e.g. indent them all under the current block, perhaps). An alternative fix would be to put the code that uses pcl into functions (which can be declared outside the guard block), then call the functions from within an if __name__ == "__main__" block. That would look something like this:
def do_stuff_with_pcl(pcl):
    print(pcl)

if __name__ == "__main__":
    # multiprocessing code, etc.
    pcl = ...
    do_stuff_with_pcl(pcl)
As for why the issue came up in the first place, the ultimate cause is using the multiprocessing module on Windows. You can read about the issue in the documentation.
When multiprocessing creates a new process for its Pool, it needs to initialize that process with a copy of the current module's state. Because Windows doesn't have fork (which copies the parent process's memory into a child process automatically), Python needs to set everything up from scratch. In each child process, it loads the module from its file, and if the module's top-level code tried to create a new Pool, you'd have a recursive situation where each child process would start spawning a whole new set of child processes of its own.
The multiprocessing code has some guards against that, I think (so you won't fork bomb yourself out of simple carelessness), but you still need to do some of the work yourself too, by using if __name__ == "__main__" to guard any code that shouldn't be run in the child processes.
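To make that concrete, here is a minimal illustrative sketch (not from the question; the function and values are made up). Under the spawn start method, the unguarded print runs once in the parent and again in every worker that re-imports the module, while the guarded Pool creation only ever runs in the parent:

import os
from multiprocessing import Pool

# Top-level code: runs in the parent AND in every spawned worker.
print(f"module imported in process {os.getpid()}")

def square(x):
    return x * x

if __name__ == '__main__':
    # Guarded code: runs only when the file is executed as a script,
    # so the workers never try to build a Pool of their own.
    with Pool(2) as pool:
        print(pool.map(square, range(5)))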
Related
I am learning Python and its multiprocessing.
I created a project with a main() in main.py and an a_simulation function inside the module simulation.py, under the package simulator/.
The symptom is that a test statement print("hello\n") inside main.py before the definition of main() is executed multiple times when the program is run with python main.py, indicating that everything before that print, including the creation of the lists, is executed multiple times.
I do not think I understand the related Python issues very well. May I know what the reason for the symptom is, and what the best practice is in Python when creating projects like this? I have included the code and the terminal output. Thank you!
Edit: Forgot to mention that I am running it with Anaconda Python on macOS, although I would like the project to work just fine on any platform.
main.py:
from multiprocessing import Pool
from simulator.simulation import a_simulation
import random

num_trials = 10
iter_trials = list(range(num_trials))
arg_list = [random.random() for _ in range(num_trials)]
input = list(zip(iter_trials, arg_list))
print("hello\n")

def main():
    with Pool(processes=4) as pool:
        result = pool.starmap(a_simulation, input)
        print(result)

if __name__ == "__main__":
    main()
simulator/simulation.py:
import os
from time import sleep

def a_simulation(x, seed_):
    print(f"Process {os.getpid()}: trial {x} received {seed_}\n")
    sleep(1)
    return seed_
Results from the terminal:
hello
hello
hello
hello
hello
Process 71539: trial 0 received 0.4512600158461971
Process 71538: trial 1 received 0.8772526554425158
Process 71541: trial 2 received 0.6893833978242683
Process 71540: trial 3 received 0.29249994820563296
Process 71538: trial 4 received 0.5759647958461107
Process 71541: trial 5 received 0.08799525261308505
Process 71539: trial 6 received 0.3057644321667139
Process 71540: trial 7 received 0.5402091856171599
Process 71538: trial 8 received 0.1373456223147438
Process 71541: trial 9 received 0.24000943476017
[0.4512600158461971, 0.8772526554425158, 0.6893833978242683, 0.29249994820563296, 0.5759647958461107, 0.08799525261308505, 0.3057644321667139, 0.5402091856171599, 0.1373456223147438, 0.24000943476017]
(base)
The reason this happens is that multiprocessing uses the spawn start method by default on Windows and macOS to start new processes. What this means is that whenever you want to start a new process, the child process is initially created without sharing any of the parent's memory. However, this makes things messy when you want the child process to run a function from the parent: not only will the child not know the definition of the function itself, you might also run into unexpected obstacles (what if the function depends on a variable defined in the parent process's module?). To stop these sorts of things from happening, multiprocessing automatically imports the parent process's module in the child process, which essentially recreates almost the entire state the parent had when the child process was started.
This is where if __name__ == "__main__" comes in. This statement basically translates to "if the current file is being run directly, then ..."; the code under this block will not run if the module is being imported. Therefore, the child processes will not run anything under this block when they are spawned. You can hence use this block to create, for example, variables which use up a lot of memory and are not required for the child processes to function but are used by the parent. Basically, anything that the child processes won't need can go under here.
Now coming to your comment about imports:
This must be a silly question, but should I leave the import statements as they are, or move them inside if __name__ == "__main__":, or somewhere else? Thanks
Like I said, anything that the child doesn't need can be put under this if block. The reason you don't often see imports under this block is perhaps due to sticking to convention ("imports should be done at the top") and because the modules being imported don't really affect performance much (even after being needlessly imported multiple times). Keep in mind however, that if a child process requires a particular module to start its work, it will always be imported again within the child process, even if you have imported it under the if __name__... block. This is because when you attempt to spawn child processes to start a function in parallel, multiprocessing automatically serializes and sends the names of the function, and the module that defines the function (actual code is not serialized, only the names), to the child processes where they are imported once more (relevant question).
This is specific to when the start method is spawn; you can read more about the differences here.
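As a small illustrative sketch (the multiprocessing calls here are standard library APIs; the function and values are made up), you can inspect the current start method and request spawn explicitly through a context, which keeps the behaviour consistent across platforms:

import multiprocessing as mp

def work(n):
    return n + 1

if __name__ == "__main__":
    # 'spawn' by default on Windows and macOS (3.8+), 'fork' on Linux
    print(mp.get_start_method())
    # Request spawn explicitly instead of relying on the platform default.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(work, range(3)))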
I built a simple multiprocessing program in Python. It worked perfectly in Python 3.7, but since the upgrade to 3.9 I'm struggling to understand how to make it work now that multiprocessing has changed.
The program follows roughly the following pattern:
import multiprocessing

def print_multiprocessing(my_string):
    print(my_prefix, my_string)

if __name__ == "__main__":
    output = []
    for x in range(10):
        output.append(x)
    my_prefix = input()
    pool = multiprocessing.Pool(4)
    pool.map_async(print_multiprocessing, output)
    pool.close()
    pool.join()
The actual program is more complicated obviously, but this will demonstrate my issue.
In Python 3.7, the child processes would automatically inherit the my_prefix variable from the parent, but now in 3.9 that variable isn't available to the child processes.
I could declare the variable outside the if __name__ == "__main__": which would mean it gets declared by each child process, but this means I'd have to call the input() function for every child process.
My current workaround is to have the print_multiprocessing function accept a list as an argument, and pass in every value I need as part of that list, but it feels very messy, especially when dealing with multiple data types.
Is there a simple trick I'm missing here?
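For reference, a minimal sketch of the workaround described above, passing everything the workers need as part of their arguments (shown here with starmap and per-item tuples rather than a single list):

import multiprocessing

def print_multiprocessing(my_prefix, my_string):
    print(my_prefix, my_string)

if __name__ == "__main__":
    my_prefix = input()
    output = list(range(10))
    with multiprocessing.Pool(4) as pool:
        # Each worker call gets the prefix bundled with its own item.
        pool.starmap(print_multiprocessing, [(my_prefix, x) for x in output])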
I'm trying to build a set of lambda calls which work in parallel to do some processing of data. These methods return their bits of data back to the parent process, which then combines it all.
I need to be able to run this locally as well for a testing script which runs several test cases and scores the accuracy of the processing. To do this, I mock the lambda calls and import the lambda as a module and execute the handler directly.
So my top-level lambda uses multiprocessing.Process to call methods which invoke the other lambdas, like so:
# Top Level Lambda
import json
import os
import multiprocessing as mp
from boto3 import client

l = client('lambda')

def process_data(data, conn):
    response = l.invoke(
        FunctionName=os.environ.get('PROCESS_FUNCTION'),
        InvocationType='RequestResponse',
        LogType='Tail',
        Payload=json.dumps({
            'data': data
        })
    )
    conn.send(response['Payload'].read())
    conn.close()

def create(datas):
    p_parent, p_child = mp.Pipe()
    process = mp.Process(target=process_data, args=(datas[0], p_child))
    process.start()
I've cut out a lot of code to give the gist here. I get an error on process.start()
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
I've tried putting freeze_support() calls in each lambda inside an if __name__ == '__main__' block, I've tried putting it in the lambda handler, and I have tried putting it in the test script, but I always get the same error.
What really throws me here is that the new process doesn't call the target function at all; instead it runs the test script from the beginning, and the second attempt to start a new process inside that subprocess is what gives me the error.
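For reference, the idiom the error message asks for has to live in whatever script is actually executed, which in this setup is the local test script, since that is the module spawn re-imports in each child. A hedged sketch with hypothetical module and function names:

import top_level_lambda   # hypothetical module containing the handler code above

def run_test_cases():
    # ... mock the lambda client, load test cases, score results ...
    top_level_lambda.create(["case 1 data", "case 2 data"])

if __name__ == '__main__':
    # Guarding the entry point of the executed script keeps the spawned
    # children from re-running the whole test script at import time.
    run_test_cases()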
The standard behavior of multiprocessing on Windows is to import the __main__ module into child processes when spawned.
For large projects with many imports, this can significantly slow down the child process startup, not to mention the extra resources consumed. It seems very inefficient for cases where the child process will run a self-contained task that only uses a small subset of those imports.
Is there a way to explicitly specify the imports for the child processes? If not the multiprocessing library, is there an alternative?
While I'm specifically interested in Python 3, answers for Python 2 may be useful for others.
Edit
I've confirmed that the approach suggested by Lie Ryan works, as shown by the following example:
import sys
import types

def imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            yield val.__name__

def worker():
    print('Worker modules:')
    print('\n'.join(imports()))

if __name__ == '__main__':
    import multiprocessing
    print('Main modules:')
    print('\n'.join(imports()))
    print()

    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()
Output:
Main modules:
builtins
sys
types
multiprocessing
Worker modules:
sys
types
However, I don't think I can sell the rest of my team on wrapping the top-level script in if __name__ == '__main__' just to enable a small feature deep in the codebase. Still holding out hope that there's a way to do this without top-level changes.
The docs you linked tell you:
Make sure that the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
...
Instead one should protect the “entry point” of the program by using if __name__ == '__main__': as follows:
...
You can also put import statements inside the if-block, then those import statements will only get executed when you run the __main__.py as a program but not when __main__.py is being imported.
<flame>Either that or switch to use a real OS that supports real fork()-ing</flame>
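To illustrate the point about import statements inside the if-block, here is a minimal sketch (pandas stands in for an assumed heavy, parent-only dependency): the guarded imports run only when the script is executed directly, never when a spawned child re-imports the module.

def worker():
    # The child only needs this function, none of the heavy imports below.
    print("hello from the worker")

if __name__ == '__main__':
    import multiprocessing
    import pandas                              # assumed heavy, parent-only dependency

    df = pandas.DataFrame({"x": range(3)})     # parent-only work
    p = multiprocessing.Process(target=worker)
    p.start()
    p.join()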
As I have discovered, Windows is a bit of a pig when it comes to multiprocessing, and I have a question about it.
The Python docs state that you should protect the entry point of a Windows application when using multiprocessing.
Does this mean only the code which creates the new process?
For example
Script 1
import multiprocessing

def somemethod():
    while True:
        print 'do stuff'

# this will need protecting
p = multiprocessing.Process(target=somemethod).start()

# this won't
if __name__ == '__main__':
    p = multiprocessing.Process(target=somemethod).start()
In this script you need to wrap the first call in if __name__ == '__main__' because that line is spawning a process.
But what if you had the following?
Script 2
file1.py
import file2

if __name__ == '__main__':
    p = Aclass().start()
file2.py
import multiprocessing

ITEM = 0

def method1():
    print 'method1'

method1()

class Aclass(multiprocessing.Process):
    def __init__(self):
        print 'Aclass'
        super(Aclass, self).__init__()

    def run(self):
        print 'stuff'
What would need to be protected in this instance?
What would happen if there was an if __name__ == '__main__' block in file2? Would the code inside it get executed if a process was being created?
NOTE: I know the code will not compile. It's just an example.
The Python docs state that you should protect the entry point of a Windows application when using multiprocessing.
My interpretation differs: the documentation states
the main module can be safely imported by a new Python interpreter without causing unintended side effects (such as starting a new process).
So importing your module (import mymodule) should not create new processes. That is, you can avoid starting processes by protecting your process-creating code with an
if __name__ == '__main__':
    ...
because the code in the ... will only run when your program is run as the main program, that is, when you do
python mymodule.py
or when you run it as an executable, but not when you import the file.
So, to answer your question about file2: no, you do not need protection there, because no process is started during import file2.
Also, if you put an if __name__ == '__main__' block in file2.py, the code inside it would not run, because file2 is imported, not executed as the main program.
Edit: here is an example of what can happen when you do not protect your process-creating code: it might just loop and create a ton of processes.
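As an illustrative stand-in for such an example (a sketch, not the original code), the shape of the problem is a start() call sitting at module level, so every spawned child that re-imports the module immediately tries to start another process:

import multiprocessing

def somemethod():
    print('do stuff')

# Unprotected: this runs at import time, so each spawned child that
# re-imports the module would immediately try to start a process of its own.
# Current Python versions abort this with the "bootstrapping phase"
# RuntimeError; the fix is to move these two lines under
# if __name__ == '__main__':
p = multiprocessing.Process(target=somemethod)
p.start()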