What happens to imported modules' variables when a new process is spawned?
i.e., given:
with concurrent.futures.ProcessPoolExecutor(max_workers=settings.MAX_PROCESSES) as executor:
    for stuff in executor.map(foo, paths):
where:
def foo(str):
    x = someOtherModule.fooBar()
where fooBar accesses things declared at the top of someOtherModule:
someOtherModule.py:
myHat='green'
def fooBar():
    return myHat
Specifically, I have a module (called Y) that has a py4j gateway initialized at the top, outside of any function. In module X I'm loading several files at once, and the function that sorts through the data after loading uses a function in Y which in turn uses the gateway.
Is this design pythonic?
Should I be importing my Y module after each new process is spawned? Or is there a better way to do this?
On Linux, fork will be used to spawn the child, so anything in the global scope of the parent will also be available in the child, with copy-on-write semantics.
On Windows, anything you import at the module-level in the __main__ module of the parent process will get re-imported in the child.
This means that if you have a parent module (let's call it someModule) like this:
import someOtherModule
import concurrent.futures
def foo(str):
    x = someOtherModule.fooBar()

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=settings.MAX_PROCESSES) as executor:
        for stuff in executor.map(foo, paths):
            # stuff
And someOtherModule looks like this:
myHat='green'
def fooBar():
    return myHat
In this example, someModule is the __main__ module of the script. So, on Linux, the myHat instance you get in the child will be a copy-on-write version of the one in someModule. On Windows, each child process will re-import someModule as soon as it loads, which will result in someOtherModule being re-imported as well.
I don't know enough about py4j Gateway objects to tell you for sure whether this is the behavior you want or not. If the Gateway object is pickleable, you could explicitly pass it to each child instead, but you'd have to use a multiprocessing.Pool instead of concurrent.futures.ProcessPoolExecutor:
import someOtherModule
import multiprocessing
def foo(str):
    x = someOtherModule.fooBar()

def init(hat):
    someOtherModule.myHat = hat

if __name__ == "__main__":
    hat = someOtherModule.myHat
    pool = multiprocessing.Pool(settings.MAX_PROCESSES,
                                initializer=init, initargs=(hat,))
    for stuff in pool.map(foo, paths):
        # stuff
It doesn't seem like you have a need to do this for your use case, though. You're probably fine using the re-import.
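As a side note, on Python 3.7+ concurrent.futures.ProcessPoolExecutor also accepts initializer and initargs, so the same pattern is possible without switching to multiprocessing.Pool. A minimal sketch, assuming the same foo, paths, settings, and someOtherModule from the question:

import concurrent.futures
import someOtherModule

def init(hat):
    # Runs once in each worker process before it executes any tasks.
    someOtherModule.myHat = hat

def foo(path):
    return someOtherModule.fooBar()

if __name__ == "__main__":
    hat = someOtherModule.myHat
    with concurrent.futures.ProcessPoolExecutor(
            max_workers=settings.MAX_PROCESSES,
            initializer=init, initargs=(hat,)) as executor:
        for stuff in executor.map(foo, paths):
            pass  # process stuff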
When you create a new process, fork() is called, which clones the entire process: its stack, memory space, etc. This is why multiprocessing is considered more expensive than multithreading, since the copying is expensive.
So, to answer your question: all "imported module variables" are cloned. You can modify them as you wish, but your original parent process won't see those changes.
EDIT:
This is for Unix-based systems only. See Dano's answer for a Unix+Windows answer.
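To see the cloning concretely, here is a minimal sketch (not tied to the question's modules) in which a child mutates a module-level variable; the parent's copy is unaffected, under both fork and spawn:

import multiprocessing

counter = 0  # module-level state in the parent

def bump():
    global counter
    counter += 100
    print("child sees counter =", counter)   # 100 in the child

if __name__ == "__main__":
    p = multiprocessing.Process(target=bump)
    p.start()
    p.join()
    print("parent sees counter =", counter)  # still 0: the child's change was not shared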
Related
So I have a problem. I'm trying to make my imports faster, so I started using the multiprocessing module to split a group of imports into two functions and run each on a separate core, thus speeding up the imports. But now the code will not recognize the modules at all. What am I doing wrong?
import multiprocessing
def core1():
    import wikipedia
    import subprocess
    import random
    return wikipedia, subprocess, random

def core2():
    from urllib import request
    import json
    import webbrowser
    return request, json, webbrowser

if __name__ == "__main__":
    start_core_1 = multiprocessing.Process(name='worker 1', target=core1, args = core2())
    start_core_2 = multiprocessing.Process(name='worker 2', target=core2, args = core1())
    start_core_1.start()
    start_core_2.start()

    while True:
        user = input('[!] ')
        with request.urlopen('https://api.wit.ai/message?v=20160511&q=%s&access_token=Z55PIVTSSFOETKSBPWMNPE6YL6HVK4YP' % request.quote(user)) as wit_api:  # call to wit.ai api
            wit_api_html = wit_api.read()
            wit_api_html = wit_api_html.decode()
            wit_api_data = json.loads(wit_api_html)
        intent = wit_api_data['entities']['Intent'][0]['value']
        term = wit_api_data['entities']['search_term'][0]['value']
        if intent == 'info_on':
            with request.urlopen('https://kgsearch.googleapis.com/v1/entities:search?query=%s&key=AIzaSyCvgNV4G7mbnu01xai0f0k9NL2ito8vY6s&limit=1&indent=True' % term.replace(' ', '%20')) as response:
                google_knowledge_base_html = response.read()
                google_knowledge_base_html = google_knowledge_base_html.decode()
                google_knowledge_base_data = json.loads(google_knowledge_base_html)
                print(google_knowledge_base_data['itemListElement'][0]['result']['detailedDescription']['articleBody'])
        else:
            print('Something')
I think you are missing some crucial parts of the picture, i.e. things you need to know about multiprocessing in order to use it.
Here are the crucial points. Once you know them, you will understand why you can't just import modules in a child process to speed things up, and why even returning the loaded modules is not a good answer.
First, when you use multiprocessing.Process, a child process is forked (on Linux) or spawned (on Windows). I'll assume you are using Linux. In that case, every child process inherits every loaded module from the parent (the global state). When a child process changes anything, like a global variable, or imports a new module, that change stays in its own context, so the parent process is not aware of it.
Second, a module can be a set of classes, external library bindings, functions, etc., and some of them quite probably can't be pickled, at least not with pickle (the documentation lists what can be pickled in Python 2.7 and in Python 3.x). There are even libraries that give you 'more pickling power', like dill. However, I'm not sure pickling whole modules is a good idea at all, not to mention that your imports are slow and yet you want to serialize them and send them to the parent process. Even if you manage to do it, it doesn't sound like the best approach.
Some of the ideas on how to change the perspective:
Reconsider which modules you need and why. Maybe other modules can give you similar functionality. Maybe the modules you use are heavyweight, bring too much with them, and cost a lot compared to what you get.
If module loading is slow, try making a script that is always running, so you do not have to launch it multiple times.
If you really need those modules, maybe you can split their use across two processes and let each process do its own thing, e.g. one process parses a page and the other processes it, and so on. That way you speed up the loading, but you have to deal with passing messages between the processes (a sketch of this follows below).
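A minimal sketch of that last idea, with made-up producer/consumer work split across two processes that exchange messages through a multiprocessing.Queue:

import multiprocessing

def producer(queue):
    # Heavy imports live only in this process.
    import json
    for i in range(3):
        queue.put(json.dumps({"item": i}))   # pretend this is parsed page data
    queue.put(None)                          # sentinel: no more work

def consumer(queue):
    import json
    while True:
        msg = queue.get()
        if msg is None:
            break
        print("consumed:", json.loads(msg))

if __name__ == "__main__":
    q = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=producer, args=(q,))
    p2 = multiprocessing.Process(target=consumer, args=(q,))
    p1.start(); p2.start()
    p1.join(); p2.join()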
As we all know we need to protect the main() when running code with multiprocessing in Python using if __name__ == '__main__'.
I understand that this is necessary in some cases to give access to functions defined in the main but I do not understand why this is necessary in this case:
file2.py
import numpy as np
from multiprocessing import Pool
class Something(object):

    def get_image(self):
        return np.random.rand(64,64)

    def mp(self):
        image = self.get_image()
        p = Pool(2)
        res1 = p.apply_async(np.sum, (image,))
        res2 = p.apply_async(np.mean, (image,))
        print(res1.get())
        print(res2.get())
        p.close()
        p.join()
main.py
from file2 import Something
s = Something()
s.mp()
All of the functions and imports necessary for Something to work are part of file2.py. Why does the subprocess need to re-run main.py?
I think the __name__ solution is not very nice, as it prevents me from distributing the code of file2.py: I can't make sure that users protect their main.
Isn't there a workaround for Windows?
How do packages solve this? (I have never encountered any problem from not protecting my main with any package - are they just not using multiprocessing?)
edit:
I know that this is because fork() is not implemented on Windows. I was just asking if there is a hack to let the interpreter start at file2.py instead of main.py, since I can be sure that file2.py is self-sufficient.
When using the "spawn" start method, new processes are Python interpreters that are started from scratch. It's not possible for the new Python interpreters in the subprocesses to figure out what modules need to be imported, so they import the main module again, which in turn will import everything else. This means it must be possible to import the main module without any side effects.
If you are on a different platform than Windows, you can use the "fork" start method instead, and you won't have this problem.
That said, what's wrong with using if __name__ == "__main__":? It has a lot of additional benefits, e.g. documentation tools will be able to process your main module, and unit testing is easier etc, so you should use it in any case.
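If you are on Linux or macOS, a sketch of forcing the "fork" start method explicitly (it is not available on Windows), using the file2.py from the question:

# main.py -- sketch assuming a Unix-like OS and Python 3.4+
import multiprocessing
multiprocessing.set_start_method("fork")   # children will not re-import this module

from file2 import Something

s = Something()
s.mp()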
As others have mentioned, the "spawn" start method used on Windows will re-import the code in each new instance of the interpreter. This import will execute your code again in the child process (and this would make it create its own child, and so on).
A workaround is to pull the multiprocessing script into a separate file and then use subprocess to launch it from the main script.
I pass variables into the script by pickling them in a temporary directory, and I pass the temporary directory into the subprocess with argparse.
I then pickle the results into the temporary directory, where the main script retrieves them.
Here is an example file_hasher() function that I wrote:
main_program.py
import os, pickle, shutil, subprocess, sys, tempfile
def file_hasher(filenames):
    try:
        subprocess_directory = tempfile.mkdtemp()
        input_arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
        with open(input_arguments_file, 'wb') as func_inputs:
            pickle.dump(filenames, func_inputs)
        current_path = os.path.dirname(os.path.realpath(__file__))
        file_hasher = os.path.join(current_path, 'file_hasher.py')
        python_interpreter = sys.executable
        proc = subprocess.call([python_interpreter, file_hasher, subprocess_directory],
                               timeout=60,
                               )
        output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
        with open(output_file, 'rb') as func_outputs:
            hashlist = pickle.load(func_outputs)
    finally:
        shutil.rmtree(subprocess_directory)
    return hashlist
file_hasher.py
#! /usr/bin/env python
import argparse, hashlib, os, pickle
from multiprocessing import Pool
def file_hasher(input_file):
    with open(input_file, 'rb') as f:
        data = f.read()
        md5_hash = hashlib.md5(data)
        hashval = md5_hash.hexdigest()
    return hashval

if __name__=='__main__':
    argument_parser = argparse.ArgumentParser()
    argument_parser.add_argument('subprocess_directory', type=str)
    subprocess_directory = argument_parser.parse_args().subprocess_directory

    arguments_file = os.path.join(subprocess_directory, 'input_arguments.dat')
    with open(arguments_file, 'rb') as func_inputs:
        filenames = pickle.load(func_inputs)

    hashlist = []
    p = Pool()
    for r in p.imap(file_hasher, filenames):
        hashlist.append(r)

    output_file = os.path.join(subprocess_directory, 'function_outputs.dat')
    with open(output_file, 'wb') as func_outputs:
        pickle.dump(hashlist, func_outputs)
There must be a better way...
The main module is imported (but with __name__ != '__main__' because Windows is trying to simulate a forking-like behavior on a system that doesn't have forking). multiprocessing has no way to know that you didn't do anything important in your main module, so the import is done "just in case" to create an environment similar to the one in your main process. If it didn't do this, all sorts of stuff that happens by side-effect in main (e.g. imports, configuration calls with persistent side-effects, etc.) might not be properly performed in the child processes.
As such, if they're not protecting their __main__, the code is not multiprocessing safe (nor is it unittest safe, import safe, etc.). The if __name__ == '__main__': protective wrapper should be part of all correct main modules. Go ahead and distribute it, with a note about requiring multiprocessing-safe main module protection.
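For reference, a sketch of the conventional guard for a main module that uses file2.py; the main() wrapper name is just a convention:

# main.py
from file2 import Something

def main():
    s = Something()
    s.mp()

if __name__ == '__main__':
    # On Windows ("spawn"), child processes re-import this module; the guard
    # ensures the pool is only created in the real parent process.
    main()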
The if __name__ == '__main__' guard is needed on Windows because Windows doesn't have a "fork" option for processes.
On Linux, for example, you can fork the process, so the parent process is copied and the copy becomes the child process (and it has access to the already-imported code you had loaded in the parent process).
Since you can't fork on Windows, Python simply imports, in the child process, all the code that was imported by the parent process. This creates a similar effect, but if you don't do the __name__ trick, this import will execute your code again in the child process (and that child will create its own child, and so on).
So even in your example, main.py will be imported again (since all the files are imported again). Python can't guess which specific script the child process should import.
FYI, there are other limitations you should be aware of, like using globals. You can read about them here: https://docs.python.org/2/library/multiprocessing.html#windows
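A minimal sketch of the globals limitation mentioned in those docs: under "spawn" a child re-imports the module and sees the original value of a global, not a value the parent assigned later:

import multiprocessing

FLAG = "original"

def show_flag():
    # Under "spawn" (the Windows default) this prints "original": the child
    # re-imports the module and never sees the parent's later assignment.
    # Under "fork" it would print "changed".
    print("child sees FLAG =", FLAG)

if __name__ == "__main__":
    FLAG = "changed"           # modify the global in the parent only
    p = multiprocessing.Process(target=show_flag)
    p.start()
    p.join()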
I have a main process which uses execfile and runs a script in a child process. This works fine unless the script is in another directory -- then everything breaks down.
This is in mainprocess.py:
from multiprocessing import Process
m = "subdir\\test.py"
if __name__ == '__main__':
    p = Process(target = execfile, args = (m,))
    p.start()
Then in a subdirectory aptly named subdir, I have test.py
import time
def foo():
    print time.time()

foo()
When I run mainprocess.py, I get the error:
NameError: global name 'time' is not defined
but the issue isn't limited to module names -- sometimes I'll get an error on a function name on other pieces of code.
I've tried importing time in mainprocess.py and also inside the if statement there, but neither has any effect.
One way of avoiding the error (I haven't tried this), is to copy test.py into the parent directory and insert a line in the file to os.chdir back to the original directory. However, this seems rather sloppy.
So what is happening?
The solution is to change your Process initialization:
p = Process(target=execfile, args=(m, {}))
Honestly, I'm not entirely sure why this works. I know it has something to do with which dictionary (locals vs. globals) that the time import is added to. It seems like when your import is made in test.py, it's treated like a local variable, because the following works:
import time # no foo() anymore
print(time.time()) # the call to time.time() is in the same scope as the import
However, the following also works:
import time
def foo():
    global time
    print(time.time())

foo()
This second example shows me that the import is still assigned to some kind of global namespace, I just don't know how or why.
If you call execfile() normally, rather than in a subprocess, everything runs fine, and in fact you can then use the time module anywhere after the execfile() call in your main process, because time has been brought into the same namespace. I think that since you're launching it in a subprocess, there is no module-level namespace for the import to be assigned to (execfile doesn't create a module object when called). I think that when we add the empty dictionary to the call to execfile, we're supplying the globals dictionary argument, thus giving the import mechanism a global namespace to assign the name time to.
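Putting the fix in context, a sketch of mainprocess.py with the extra dictionary argument (Python 2, as in the question):

from multiprocessing import Process

m = "subdir\\test.py"

if __name__ == '__main__':
    # Passing {} as execfile's globals gives the executed file a real global
    # namespace, so "import time" inside test.py is visible to foo() there.
    p = Process(target=execfile, args=(m, {}))
    p.start()
    p.join()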
Some links for background:
1) Tutorial page on namespaces and scope - look here for builtin, global, and local namespace explanations first
2) Python docs on execfile command
3) A very similar question on a non-SO site
If I execfile a module, and remove all (of my) references to that module, its functions continue to work as expected. That's normal.
However, if that execfile'd module imports other modules, and I remove all references to those modules, the functions defined in those modules start to see all their global values as None. This causes things to fail spectacularly, of course, and in a very surprising manner (a TypeError about NoneType on string constants, for example).
I'm surprised that the interpreter makes a special case here; execfile doesn't seem special enough to cause functions to behave differently wrt module references.
My question: Is there any clean way to make the execfile-function behavior recursive (or global for a limited context) with respect to modules imported by an execfile'd module?
To the curious:
The application is reliable configuration reloading under buildbot. The buildbot configuration is executable python, for better or for worse. If the executable configuration is a single file, things work fairly well. If that configuration is split into modules, any imports from the top-level file get stuck to the original version, due to the semantics of __import__ and sys.modules. My strategy is to hold the contents of sys.modules constant before and after configuration, so that each reconfig looks like an initial configuration. This almost works except for the above function-global reference issue.
Here's a repeatable demo of the issue:
import gc
import sys
from textwrap import dedent
class DisableModuleCache(object):
    """Defines a context in which the contents of sys.modules is held constant.

    i.e. Any new entries in the module cache (sys.modules) are cleared when exiting this context.
    """
    modules_before = None

    def __enter__(self):
        self.modules_before = sys.modules.keys()

    def __exit__(self, *args):
        for module in sys.modules.keys():
            if module not in self.modules_before:
                del sys.modules[module]
        gc.collect()  # force collection after removing refs, for demo purposes.

def reload_config(filename):
    """Reload configuration from a file"""
    with DisableModuleCache():
        namespace = {}
        exec open(filename) in namespace
        config = namespace['config']
        del namespace
    config()

def main():
    open('config_module.py', 'w').write(dedent('''
        GLOBAL = 'GLOBAL'
        def config():
            print 'config! (old implementation)'
            print GLOBAL
        '''))

    # if I exec that file itself, its functions maintain a reference to its modules,
    # keeping GLOBAL's refcount above zero
    reload_config('config_module.py')
    ## output:
    #config! (old implementation)
    #GLOBAL

    # If that file is once-removed from the exec, the functions no longer maintain a reference to their module.
    # The GLOBAL's refcount goes to zero, and we get a None value (feels like weakref behavior?).
    open('main.py', 'w').write(dedent('''
        from config_module import *
        '''))
    reload_config('main.py')
    ## output:
    #config! (old implementation)
    #None
    ## *desired* output:
    #config! (old implementation)
    #GLOBAL

    acceptance_test()

def acceptance_test():
    # Have to wait at least one second between edits (on ext3),
    # or else we import the old version from the .pyc file.
    from time import sleep
    sleep(1)

    open('config_module.py', 'w').write(dedent('''
        GLOBAL2 = 'GLOBAL2'
        def config():
            print 'config2! (new implementation)'
            print GLOBAL2
            ## There should be no such thing as GLOBAL. Naive reload() gets this wrong.
            try:
                print GLOBAL
            except NameError:
                print 'got the expected NameError :)'
            else:
                raise AssertionError('expected a NameError!')
        '''))

    reload_config('main.py')
    ## output:
    #config2! (new implementation)
    #None
    #got the expected NameError :)
    ## *desired* output:
    #config2! (new implementation)
    #GLOBAL2
    #got the expected NameError :)

if __name__ == '__main__':
    main()
I don't think you need the 'acceptance_test' part of things here. The issue isn't actually weakrefs, it's modules' behavior on destruction. They clear out their __dict__ on delete. I vaguely remember that this is done to break ref cycles. I suspect that global references in function closures do something fancy to avoid a hash lookup on every invocation, which is why you get None and not a NameError.
Here's a much shorter sscce:
import gc
import sys
import contextlib
from textwrap import dedent
@contextlib.contextmanager
def held_modules():
    modules_before = sys.modules.keys()
    yield
    for module in sys.modules.keys():
        if module not in modules_before:
            del sys.modules[module]
    gc.collect()  # force collection after removing refs, for demo purposes.

def main():
    open('config_module.py', 'w').write(dedent('''
        GLOBAL = 'GLOBAL'
        def config():
            print 'config! (old implementation)'
            print GLOBAL
        '''))

    open('main.py', 'w').write(dedent('''
        from config_module import *
        '''))

    with held_modules():
        namespace = {}
        exec open('main.py') in namespace
        config = namespace['config']

    config()

if __name__ == '__main__':
    main()
Or, to put it another way, don't delete modules and expect their contents to continue functioning.
You should consider importing the configuration instead of execing it.
I use import for a similar purpose, and it works great (specifically, importlib.import_module(mod)). My configs consist mainly of primitives, though, not real functions.
Like you, I also have a "guard" context to restore the original contents of sys.modules after the import. Plus, I use sys.dont_write_bytecode = True (of course, you can add that to your DisableModuleCache -- set to True in __enter__ and to False in __exit__). This would ensure the config actually "runs" each time you import it.
The main difference between the two approaches (other than the fact that you don't have to rely on the state the interpreter is left in after execing, which I consider semi-unclean) is that the config files are identified by their module name/path (as used for importing) rather than by file name.
EDIT: A link to the implementation of this approach, as part of the Figura package.
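A minimal sketch of that import-based approach, reusing the held_modules() context manager from the snippet above; it assumes the config module (here config_module) consists of primitives that can simply be copied out:

import importlib
import sys

def load_config(module_name):
    """Import a config module and return its public names as a plain dict."""
    sys.dont_write_bytecode = True        # make sure the config re-runs every time
    try:
        with held_modules():              # restore sys.modules on exit
            mod = importlib.import_module(module_name)
            # Copy values out before the module object is dropped; this assumes
            # the config holds primitives, not functions that need their globals.
            return dict((k, v) for k, v in vars(mod).items() if not k.startswith('_'))
    finally:
        sys.dont_write_bytecode = False

# settings = load_config('config_module')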
I've found something very strange. See this short code below.
import os
class Logger(object):

    def __init__(self):
        self.pid = os.getpid()
        print "os: %s." %os

    def __del__(self):
        print "os: %s." %os

def temp_test_path():
    return "./[%d].log" %(os.getpid())

logger = Logger()
This is intended for illustrative purposes. It just prints the imported module os on the construction and destruction of a class instance (never mind the name Logger). However, when I run this, the module os seems to "disappear" to None in the class destructor. The following is the output.
os: <module 'os' from 'C:\Python27\lib\os.pyc'>.
os: None.
The line that says os: None. is my problem; it should be identical to the first output line. However, look back at the Python code above, at the function temp_test_path(). If I alter the name of this function slightly, to say temp_test_pat(), and keep all of the rest of the code exactly the same, and run it, I get the expected output (below).
os: <module 'os' from 'C:\Python27\lib\os.pyc'>.
os: <module 'os' from 'C:\Python27\lib\os.pyc'>.
I can't find any explanation for this except that it's a bug. Can you? By the way I'm using Windows 7 64 bit.
If you are relying on interpreter shutdown to call your __del__, it could very well be that the os module has already been deleted before your __del__ gets called. Try explicitly doing a del logger in your code and sleeping for a bit. That should show clearly that the code functions as you expect.
I also want to link you to this note in the official documentation that __del__ is not guaranteed to be called in the CPython implementation.
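A minimal sketch of that suggestion, appended to the end of the script from the question:

import time

del logger      # __del__ runs now, while the os module is still fully loaded
time.sleep(1)   # pause briefly before interpreter shutdown begins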
I've reproduced this. Interesting behavior for sure. One thing that you need to realize is that __del__ isn't guaranteed to even be called when the interpreter exits -- Also there is no specified order for finalizing objects at interpreter exit.
Since you're exiting the interpreter, there is no guarantee that os hasn't been deleted first. In this case, it seems that os is in fact being finalized before your Logger object. These things probably happen depending on the order in the globals dictionary.
If we just print the keys of the globals dictionary right before we exit:
for k in globals().keys():
    print k
you'll see:
temp_test_path
__builtins__
__file__
__package__
__name__
Logger
os
__doc__
logger
or:
logger
__builtins__
__file__
__package__
temp_test_pat
__name__
Logger
os
__doc__
Notice where your logger sits, particularly compared to where os sits in the list. With temp_test_pat, logger actually gets finalized first, so os is still bound to something meaningful. However, it gets finalized last in the case where you use temp_test_path.
If you plan on having an object live until the interpreter is exiting, and you have some cleanup code that you want to run, you could always register a function to be run using atexit.register.
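A minimal sketch of that idea applied to the Logger from the question (the close() method name is made up):

import atexit
import os

class Logger(object):
    def __init__(self):
        self.pid = os.getpid()
        # atexit handlers run before module globals are torn down,
        # so os is still usable inside close().
        atexit.register(self.close)

    def close(self):
        print "os: %s." % os

logger = Logger()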
Others have given you the answer: the order in which global variables (such as os, Logger and logger) are deleted from the module's namespace during shutdown is undefined.
However, if you want a workaround, just import os into the finaliser's local namespace:
def __del__(self):
    import os
    print "os: %s." %os
The os module will still be around at this point, it's just that you've lost your global reference to it.
This is to be expected. From The Python Language Reference:
Also, when __del__() is invoked in response to a module being deleted (e.g., when execution of the program is done), other globals referenced by the __del__() method may already have been deleted or be in the process of being torn down (e.g. the import machinery shutting down).
...in a big red warning box :-)