IPython Parallel Computing Namespace Issues - python

I've been reading and re-reading the IPython documentation/tutorial, and I can't figure out the issue with this particular piece of code. It seems to be that the function dimensionless_run is not visible to the namespace delivered to each of the engines, but I'm confused because the function is defined in __main__, and clearly visible as part of the global namespace.
wrapper.py:
import math, os
def dimensionless_run(inputs):
    output_file = open(inputs['fn'],'w')
    ...
    return output_stats
def parallel_run(inputs):
    import math, os ## Removing this line causes a NameError: global name 'math'
                    ## is not defined.
    folder = inputs['folder']
    zfill_amt = int(math.floor(math.log10(inputs['num_iters'])))
    for i in range(inputs['num_iters']):
        run_num_str = str(i).zfill(zfill_amt)
        if not os.path.exists(folder + '/'):
            os.mkdir(folder)
        dimensionless_run(inputs)
    return
if __name__ == "__main__":
    inputs = [input1,input2,...]
    client = Client()
    lbview = client.load_balanced_view()
    lbview.block = True
    for x in sorted(globals().items()):
        print x
    lbview.map(parallel_run,inputs)
Executing this code after ipcluster start --n=6 yields the sorted global dictionary, including the math and os modules, and the parallel_run and dimensionless_run functions. This is followed by an IPython.parallel.error.CompositeError: one or more exceptions from call to method: parallel_run, which is composed of a large number of [n:apply]: NameError: global name 'dimensionless_run' is not defined, where n runs from 0-5.
There are two things I don't understand, and they're clearly linked.
Why doesn't the code identify dimensionless_run in the global namespace?
Why is import math, os necessary inside the definition of parallel_run?
Edited: This turned out not to be much of a namespace error at all. I was executing ipcluster start --n=6 in a directory that didn't contain the code. To fix it, all I needed to do was execute the start command in my code's directory. I also fixed it by adding the lines:
inputs = input_pairs
os.system("ipcluster start -n 6") #NEW
client = Client()
...
lbview.map(parallel_run,inputs)
os.system("ipcluster stop") #NEW
which start the required cluster in the right place.

This is mostly a duplicate of Python name space issues with IPython.parallel, which has a more detailed answer, but the gist:
When the Client sends parallel_run to the engine, it just sends that function, not the entire namespace in which the function is defined (the __main__ module). So when running the remote parallel_run, lookups to math or os or dimensionless_run will look first in locals() (what has been defined already in the function, i.e. your in-function imports), then in the globals(), which is the __main__ module on the engine.
There are various approaches to making sure names are available on the engines, but perhaps the simplest is to explicitly define/send them to the engines (the interactive namespace is __main__ on the engines, just like it is locally in IPython):
client[:].execute("import os, math")
client[:]['dimensionless_run'] = dimensionless_run
prior to making your run, in which case everything should work as you expect.
This is an issue unique to modules defined interactively / in a script - It does not come up if this file is a module instead of a script, e.g.
from mymod import parallel_run
lbview.map(parallel_run, inputs)
In which case the globals() is the module globals, which are generally the same everywhere.
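The lookup rule can be seen in isolation, without a cluster, in a minimal plain Python 3 sketch. This is not how IPython.parallel actually ships functions: the helper rebind and the dict engine_globals are hypothetical stand-ins for "the function arrives alone and resolves its globals in the engine's __main__".

```python
import builtins
import math
import types


def dimensionless_run(inputs):
    return math.floor(inputs["x"])


def parallel_run(inputs):
    # Both names below are looked up in the function's globals at call time.
    return dimensionless_run(inputs)


def rebind(func, namespace):
    """Hypothetical helper: rebuild func so its global lookups use namespace."""
    return types.FunctionType(func.__code__, namespace, func.__name__)


# Stand-in for the engine's __main__: it starts out without our names.
engine_globals = {"__builtins__": builtins}
remote = rebind(parallel_run, engine_globals)

try:
    remote({"x": 2.5})
    error = None
except NameError as exc:
    error = str(exc)  # mentions dimensionless_run

# Equivalent in spirit to pushing the names to the engines before the run.
engine_globals["math"] = math
engine_globals["dimensionless_run"] = rebind(dimensionless_run, engine_globals)
result = remote({"x": 2.5})
```

After the two names are placed in the stand-in namespace, the same function object runs without error, which mirrors what the execute/push lines in the answer achieve.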

Related

python global operator emulation

in Lutz's book I read how to emulate global operator in a function body.
I created p.py file in documents folder:
var = 0
def func():
    import p #import itself
    p.var = 15
func()
print(var)
output:
15
0
I thought it was supposed to simply print 15, but for some reason it also added 0 to the output. So I'm wondering why this happened.
for example, when i do the same thing in terminal, but in the main module, it works as I want:
var = 0
def func():
    import __main__ #import itself
    __main__.var = 15
func()
print(var)
and output is
15
I have python 3.7.7
Files are not modules: files are used to define modules. If you run p.py as a script that contains import p, there are two modules, __main__ and p, both created from the same file, but each with its own global namespace.
Okay, let's break this into a few steps.
First of all:
self-importing of a module is not a good thing to do - one example why is because of the things you noticed in your question.
import p
var = 0
def func():
    p.var = 15
func()
print(var)
print(p.var)
Running this, you will notice that var and p.var are actually 2 separate variables, even though logically they are the same thing, just in a different namespace.
You've also got func and p.func, which again do the same but are 2 separate things.
Second problem:
Global keyword should be used only when absolutely necessary. When you have a global variable, it is much harder to track when it changes and control the flow of your program - Global Variables Are Bad (not everything applies to Python, but most points still stand).
Finally, why you are actually seeing the behaviour you see:
In Python when the module is imported, all of it is executed just like the main script you are running.
Let's say you have pr.py file containing only this:
print("text")
Importing such a module with import pr would be enough to see text on console, because it gets executed on import.
That's why you see 15 printed (in import p, print(var) happens - and because we are in module p, p.var and var are the same here). Then import p finishes and we get to print(var) - but because we are now in __main__ module, var is not the same as p.var, and it still has the original value of 0.
Why does import __main__ work differently? Python handles it specially, as can be read here. In short, __main__ is initialised at the start of the program and is already registered in sys.modules, so import __main__ does not cause the code to run again, just as importing any module more than once only executes it once.
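The two-module effect described above can be reproduced end to end with a small sketch. It writes a copy of p.py to a temporary directory and runs it as a script; the only change from the question's file is that the final print also shows __name__ (an addition for illustration), so each output line reveals which module produced it.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

source = textwrap.dedent('''
    var = 0
    def func():
        import p          # imports the file again, as a second module named p
        p.var = 15
    func()
    print(__name__, var)  # __name__ added here to label each line
''')

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "p.py"
    script.write_text(source)
    completed = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,
        cwd=tmp,
    )

output_lines = completed.stdout.splitlines()
```

The first line comes from the module p while it is being imported, and the second from __main__, whose own var was never touched.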

make function available without import globally

let's say I wanted to make a core library for a project, with functions like:
def foo(x):
    """common, useful function"""
and I want to make these functions globally available in my project so that when I call them in a file, I don't need to import them. I have a virtualenv, so I feel like I should be able to modify my interpreter to make them globally available, but wasn't sure if there was any established methodologies behind this. I am aware it defies some pythonic principles.
It is possible to create a custom "launcher" that sets up some global variables and executes the code in a python file:
from sys import argv
# we read the code of the file passed as the first CLI argument
with open(argv[1]) as fin:
    code = fin.read()
# just an example: this will be available in the executed python file
def my_function():
    return "World"
global_variables = {
    'MY_CONSTANT': "Hello", # prepare a global variable
    'my_function': my_function # prepare a global function
}
exec(code, global_variables) # run the file with new global variables
Use it like this: python launcher.py my_dsl_file.py.
Example my_dsl_file.py:
# notice: no imports at all
print(MY_CONSTANT)
print(my_function())
Interestingly Python (at least CPython) uses a different way to setup some useful functions like help. It runs a file called site.py that adds some values to the builtins module.
import builtins
def my_function():
    return "World"
builtins.MY_CONSTANT = "Hello"
builtins.my_function = my_function
# run your file like above or simply import it
import <your file>
I wouldn't recommend either of these ways. A simple from <your library> import * is a much better approach.
The downside of the first two variants is that no tool will know anything about your injected globals. E.g. mypy, flake8 and all IDEs I know of will fail.
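For completeness, the launcher idea can also be sketched with the standard library's runpy.run_path, which accepts pre-populated globals through its init_globals parameter. This is a swapped-in alternative to the exec call in the answer, not the answer's own method, and the temporary my_dsl_file.py written here is only a stand-in so the sketch is self-contained.

```python
import runpy
import tempfile
from pathlib import Path


def my_function():
    return "World"


with tempfile.TemporaryDirectory() as tmp:
    dsl_file = Path(tmp) / "my_dsl_file.py"
    # The executed file uses both names without importing anything.
    dsl_file.write_text("greeting = MY_CONSTANT + ', ' + my_function()\n")

    # run_path executes the file and returns its resulting globals dict.
    resulting_globals = runpy.run_path(
        str(dsl_file),
        init_globals={"MY_CONSTANT": "Hello", "my_function": my_function},
    )

greeting = resulting_globals["greeting"]
```

The same caveat applies as for the exec version: linters and IDEs have no way of knowing where MY_CONSTANT and my_function come from.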

Is there a convenient way to translate a "from A import B as C" to an python import using a specific path

I want to import a python module without adding its containing folder to the python path. I would want the import look like
from A import B as C
Due to the specific path that shall be used, the import looks like
import imp
A = imp.load_source('A', 'path')
C = A.B
This is quite unhandy with long paths and module names. Is there an easier way? Is there a way where the module is not added to the local variables (no A)?
If you just don't want A to be visible at a global level, you could stick the import (imp.load_source) inside a function. If you actually don't want a module object at all in the local scope, you can do that too, but I wouldn't recommend it.
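A minimal sketch of the "inside a function" idea, written for Python 3 where imp is deprecated (and removed in 3.12), so importlib.util is used instead of imp.load_source. The helper name load_attr and the temporary source.py are hypothetical and exist only to make the example self-contained.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_attr(path, attr, module_name="A"):
    """Load the file at path as a module and return one attribute from it.

    The module object stays local to this function, so no name A leaks
    into the caller's namespace.
    """
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, attr)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.py"
    source.write_text("def B():\n    return 'hello'\n")
    # Rough equivalent of: from A import B as C
    C = load_attr(str(source), "B")

value = C()
```

Note that the module is not registered in sys.modules here, so a later plain import A elsewhere would not find it.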
If module A is a python source file you could read in the file (or even just the relevant portion that you want) and run an exec on it.
source.py
MY_GLOBAL_VAR = 1
def my_func():
    print 'hello'
Let's say you have some code that wants my_func
path = '/path/to/source.py'
execfile(path)
my_func()
# 'hello'
Be aware that you're also going to get anything else defined in the file (like MY_GLOBAL_VAR). Again, this will work, but I wouldn't recommend it:
Someone looking at your code won't be able to see where my_func came from.
You're essentially doing the same thing as a from A import * import, which is generally frowned upon in Python, because you could be importing all sorts of things into your namespace that you didn't want. And even if it works now, if the source code changes, it could import names that shadow your own global symbols.
It's potentially a security hole, since you could be exec'ing an untrusted source file.
It's way more verbose than a regular Python import.

NameError on global variables when multiprocessing, only in subdirectory

I have a main process which uses execfile and runs a script in a child process. This works fine unless the script is in another directory -- then everything breaks down.
This is in mainprocess.py:
from multiprocessing import Process
m = "subdir\\test.py"
if __name__ == '__main__':
    p = Process(target = execfile, args = (m,))
    p.start()
Then in a subdirectory aptly named subdir, I have test.py
import time
def foo():
    print time.time()
foo()
When I run mainprocess.py, I get the error:
NameError: global name 'time' is not defined
but the issue isn't limited to module names -- sometimes I'll get an error on a function name on other pieces of code.
I've tried importing time in mainprocess.py and also inside the if statement there, but neither has any effect.
One way of avoiding the error (I haven't tried this), is to copy test.py into the parent directory and insert a line in the file to os.chdir back to the original directory. However, this seems rather sloppy.
So what is happening?
The solution is to change your Process initialization:
p = Process(target=execfile, args=(m, {}))
Honestly, I'm not entirely sure why this works. I know it has something to do with which dictionary (locals vs. globals) that the time import is added to. It seems like when your import is made in test.py, it's treated like a local variable, because the following works:
import time # no foo() anymore
print(time.time()) # the call to time.time() is in the same scope as the import
However, the following also works:
import time
def foo():
    global time
    print(time.time())
foo()
This second example shows me that the import is still assigned to some kind of global namespace, I just don't know how or why.
If you call execfile() normally, rather than in a subprocess, everything runs fine, and in fact, you can then use the time module any place after the call to execfile() in your main process because time has been brought into the same namespace. I think that since you're launching it in a subprocess there is no module-level namespace for the import to be assigned to (execfile doesn't create a module object when called). I think that when we add the empty dictionary to the call to execfile, we're supplying the global dictionary argument, thus giving the import mechanism a global namespace to assign the name time to.
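The locals-versus-globals split guessed at above can be reproduced in Python 3 with exec, which takes the same optional globals and locals arguments that execfile did. This sketch assumes the subprocess case behaves like "globals and locals are two different dicts", which matches the symptoms in the question but was not verified against Python 2's multiprocessing.

```python
code = """
import time
def foo():
    return time.time()
result = foo()
"""

# Case 1: separate dicts. Top-level bindings (the import, the def) are
# stored in the locals dict, but foo() resolves names in the globals dict.
separate_globals, separate_locals = {}, {}
try:
    exec(code, separate_globals, separate_locals)
    failure = None
except NameError as exc:
    failure = str(exc)

# Case 2: a single dict serves as both globals and locals, which is what
# passing one empty dict to execfile achieved.
shared = {}
exec(code, shared)
```

In the first case the name time ends up in separate_locals, where the function body never looks; in the second case the import and the function share one namespace and the call succeeds.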
Some links for background:
1) Tutorial page on namespaces and scope
- look here for builtin, global, and local namespace explanations first
2) Python docs on execfile command
3) A very similar question on a non-SO site

python: functions *sometimes* maintain a reference to their module

If I execfile a module, and remove all (of my) references to that module, its functions continue to work as expected. That's normal.
However, if that execfile'd module imports other modules, and I remove all references to those modules, the functions defined in those modules start to see all their global values as None. This causes things to fail spectacularly, of course, and in a very surprising manner (TypeError NoneType on string constants, for example).
I'm surprised that the interpreter makes a special case here; execfile doesn't seem special enough to cause functions to behave differently wrt module references.
My question: Is there any clean way to make the execfile-function behavior recursive (or global for a limited context) with respect to modules imported by an execfile'd module?
To the curious:
The application is reliable configuration reloading under buildbot. The buildbot configuration is executable python, for better or for worse. If the executable configuration is a single file, things work fairly well. If that configuration is split into modules, any imports from the top-level file get stuck to the original version, due to the semantics of __import__ and sys.modules. My strategy is to hold the contents of sys.modules constant before and after configuration, so that each reconfig looks like an initial configuration. This almost works except for the above function-global reference issue.
Here's a repeatable demo of the issue:
import gc
import sys
from textwrap import dedent
class DisableModuleCache(object):
    """Defines a context in which the contents of sys.modules is held constant.
    i.e. Any new entries in the module cache (sys.modules) are cleared when exiting this context.
    """
    modules_before = None
    def __enter__(self):
        self.modules_before = sys.modules.keys()
    def __exit__(self, *args):
        for module in sys.modules.keys():
            if module not in self.modules_before:
                del sys.modules[module]
        gc.collect() # force collection after removing refs, for demo purposes.
def reload_config(filename):
    """Reload configuration from a file"""
    with DisableModuleCache():
        namespace = {}
        exec open(filename) in namespace
        config = namespace['config']
        del namespace
    config()
def main():
    open('config_module.py', 'w').write(dedent('''
        GLOBAL = 'GLOBAL'
        def config():
            print 'config! (old implementation)'
            print GLOBAL
    '''))
    # if I exec that file itself, its functions maintain a reference to its modules,
    # keeping GLOBAL's refcount above zero
    reload_config('config_module.py')
    ## output:
    #config! (old implementation)
    #GLOBAL
    # If that file is once-removed from the exec, the functions no longer maintain a reference to their module.
    # The GLOBAL's refcount goes to zero, and we get a None value (feels like weakref behavior?).
    open('main.py', 'w').write(dedent('''
        from config_module import *
    '''))
    reload_config('main.py')
    ## output:
    #config! (old implementation)
    #None
    ## *desired* output:
    #config! (old implementation)
    #GLOBAL
    acceptance_test()
def acceptance_test():
    # Have to wait at least one second between edits (on ext3),
    # or else we import the old version from the .pyc file.
    from time import sleep
    sleep(1)
    open('config_module.py', 'w').write(dedent('''
        GLOBAL2 = 'GLOBAL2'
        def config():
            print 'config2! (new implementation)'
            print GLOBAL2
            ## There should be no such thing as GLOBAL. Naive reload() gets this wrong.
            try:
                print GLOBAL
            except NameError:
                print 'got the expected NameError :)'
            else:
                raise AssertionError('expected a NameError!')
    '''))
    reload_config('main.py')
    ## output:
    #config2! (new implementation)
    #None
    #got the expected NameError :)
    ## *desired* output:
    #config2! (new implementation)
    #GLOBAL2
    #got the expected NameError :)
if __name__ == '__main__':
    main()
I don't think you need the 'acceptance_test' part of things here. The issue isn't actually weakrefs, it's modules' behavior on destruction. They clear out their __dict__ on delete. I vaguely remember that this is done to break ref cycles. I suspect that global references in function closures do something fancy to avoid a hash lookup on every invocation, which is why you get None and not a NameError.
Here's a much shorter sscce:
import gc
import sys
import contextlib
from textwrap import dedent
@contextlib.contextmanager
def held_modules():
    modules_before = sys.modules.keys()
    yield
    for module in sys.modules.keys():
        if module not in modules_before:
            del sys.modules[module]
    gc.collect() # force collection after removing refs, for demo purposes.
def main():
    open('config_module.py', 'w').write(dedent('''
        GLOBAL = 'GLOBAL'
        def config():
            print 'config! (old implementation)'
            print GLOBAL
    '''))
    open('main.py', 'w').write(dedent('''
        from config_module import *
    '''))
    with held_modules():
        namespace = {}
        exec open('main.py') in namespace
        config = namespace['config']
    config()
if __name__ == '__main__':
    main()
Or, to put it another way, don't delete modules and expect their contents to continue functioning.
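The underlying relationship can be inspected directly: a function holds a reference to its module's globals dict (via __globals__), not to the module object itself. The sketch below is for CPython 3 and uses a hypothetical in-memory module built with types.ModuleType rather than real files. It only shows the references involved; it does not reproduce the Python 2 behaviour of resetting globals to None.

```python
import gc
import types
import weakref

# Hypothetical stand-in for config_module, built in memory.
mod = types.ModuleType("config_module")
exec("GLOBAL = 'GLOBAL'\ndef config():\n    return GLOBAL\n", mod.__dict__)

config = mod.config
module_dict = mod.__dict__

# The function points at the module's dict, not at the module.
same_dict = config.__globals__ is module_dict

# Nothing in the function keeps the module object itself alive.
module_ref = weakref.ref(mod)
del mod
gc.collect()
module_gone = module_ref() is None
```

Because the function never references the module object, dropping the last reference to the module lets it be destroyed while the function still exists, and whatever the interpreter then does to the module's dict is what the function sees.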
You should consider importing the configuration instead of execing it.
I use import for a similar purpose, and it works great (specifically, importlib.import_module(mod)). Though, my configs consist mainly of primitives, not real functions.
Like you, I also have a "guard" context to restore the original contents of sys.modules after the import. Plus, I use sys.dont_write_bytecode = True (of course, you can add that to your DisableModuleCache -- set to True in __enter__ and to False in __exit__). This would ensure the config actually "runs" each time you import it.
The main difference between the two approaches, (other than the fact you don't have to rely on the state the interpreter stays in after execing (which I consider semi-unclean)), is that the config files are identified by their module-name/path (as used for importing) rather than the file name.
EDIT: A link to the implementation of this approach, as part of the Figura package.
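A minimal Python 3 sketch of the import-based approach described above, under stated assumptions: the names held_modules, load_config and demo_config are hypothetical, and this is not the Figura implementation. The guard restores the previous value of sys.dont_write_bytecode on exit rather than forcing it to False.

```python
import contextlib
import importlib
import sys
import tempfile
from pathlib import Path


@contextlib.contextmanager
def held_modules():
    """Drop any modules imported inside the block and skip writing .pyc files."""
    before = set(sys.modules)
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True
    try:
        yield
    finally:
        sys.dont_write_bytecode = old_flag
        for name in set(sys.modules) - before:
            del sys.modules[name]


def load_config(module_name):
    """Import a config by module name; it is re-executed on every call."""
    with held_modules():
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


with tempfile.TemporaryDirectory() as tmp:
    sys.path.insert(0, tmp)
    try:
        config_file = Path(tmp) / "demo_config.py"
        config_file.write_text("VALUE = 1\n")
        first = load_config("demo_config").VALUE
        config_file.write_text("VALUE = 22\n")
        second = load_config("demo_config").VALUE
    finally:
        sys.path.remove(tmp)
```

The returned module object is kept alive by the caller, so its functions and globals remain usable even though its sys.modules entry has been removed.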
