python: functions *sometimes* maintain a reference to their module

If I execfile a module and remove all (of my) references to that module, its functions continue to work as expected. That's normal.
However, if that execfile'd module imports other modules, and I remove all references to those modules, the functions defined in those modules start to see all their global values as None. This causes things to fail spectacularly, of course, and in a very surprising manner (TypeError NoneType on string constants, for example).
I'm surprised that the interpreter makes a special case here; execfile doesn't seem special enough to cause functions to behave differently wrt module references.
My question: Is there any clean way to make the execfile-function behavior recursive (or global for a limited context) with respect to modules imported by an execfile'd module?
To the curious:
The application is reliable configuration reloading under buildbot. The buildbot configuration is executable python, for better or for worse. If the executable configuration is a single file, things work fairly well. If that configuration is split into modules, any imports from the top-level file get stuck to the original version, due to the semantics of __import__ and sys.modules. My strategy is to hold the contents of sys.modules constant before and after configuration, so that each reconfig looks like an initial configuration. This almost works except for the above function-global reference issue.
Here's a repeatable demo of the issue:
import gc
import sys
from textwrap import dedent


class DisableModuleCache(object):
    """Defines a context in which the contents of sys.modules is held constant.

    i.e. Any new entries in the module cache (sys.modules) are cleared when exiting this context.
    """
    modules_before = None

    def __enter__(self):
        self.modules_before = sys.modules.keys()

    def __exit__(self, *args):
        for module in sys.modules.keys():
            if module not in self.modules_before:
                del sys.modules[module]
        gc.collect()  # force collection after removing refs, for demo purposes.


def reload_config(filename):
    """Reload configuration from a file"""
    with DisableModuleCache():
        namespace = {}
        exec open(filename) in namespace
        config = namespace['config']
        del namespace
    config()


def main():
    open('config_module.py', 'w').write(dedent('''
        GLOBAL = 'GLOBAL'
        def config():
            print 'config! (old implementation)'
            print GLOBAL
        '''))

    # if I exec that file itself, its functions maintain a reference to its modules,
    # keeping GLOBAL's refcount above zero
    reload_config('config_module.py')
    ## output:
    #config! (old implementation)
    #GLOBAL

    # If that file is once-removed from the exec, the functions no longer maintain a reference to their module.
    # The GLOBAL's refcount goes to zero, and we get a None value (feels like weakref behavior?).
    open('main.py', 'w').write(dedent('''
        from config_module import *
        '''))
    reload_config('main.py')
    ## output:
    #config! (old implementation)
    #None
    ## *desired* output:
    #config! (old implementation)
    #GLOBAL

    acceptance_test()


def acceptance_test():
    # Have to wait at least one second between edits (on ext3),
    # or else we import the old version from the .pyc file.
    from time import sleep
    sleep(1)

    open('config_module.py', 'w').write(dedent('''
        GLOBAL2 = 'GLOBAL2'
        def config():
            print 'config2! (new implementation)'
            print GLOBAL2
            ## There should be no such thing as GLOBAL. Naive reload() gets this wrong.
            try:
                print GLOBAL
            except NameError:
                print 'got the expected NameError :)'
            else:
                raise AssertionError('expected a NameError!')
        '''))

    reload_config('main.py')
    ## output:
    #config2! (new implementation)
    #None
    #got the expected NameError :)
    ## *desired* output:
    #config2! (new implementation)
    #GLOBAL2
    #got the expected NameError :)


if __name__ == '__main__':
    main()

I don't think you need the 'acceptance_test' part of things here. The issue isn't actually weakrefs, it's modules' behavior on destruction: when a module object is destroyed, CPython sets the values in its __dict__ to None rather than deleting them. I vaguely remember that this is done to break ref cycles. Since a function's globals are a direct reference to that same dict, you get None and not a NameError.
Here's a much shorter sscce:
import gc
import sys
import contextlib
from textwrap import dedent


@contextlib.contextmanager
def held_modules():
    modules_before = sys.modules.keys()
    yield
    for module in sys.modules.keys():
        if module not in modules_before:
            del sys.modules[module]
    gc.collect()  # force collection after removing refs, for demo purposes.


def main():
    open('config_module.py', 'w').write(dedent('''
        GLOBAL = 'GLOBAL'
        def config():
            print 'config! (old implementation)'
            print GLOBAL
        '''))
    open('main.py', 'w').write(dedent('''
        from config_module import *
        '''))

    with held_modules():
        namespace = {}
        exec open('main.py') in namespace
        config = namespace['config']

    config()


if __name__ == '__main__':
    main()
Or, to put it another way, don't delete modules and expect their contents to continue functioning.
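One workaround consistent with that constraint (a sketch, not from the answer above; it assumes you are willing to trade some memory for correctness) is to evict new entries from sys.modules so the next reconfig re-imports them fresh, while keeping a strong reference to the superseded module objects so their globals are never torn down:

import contextlib
import sys

_superseded_modules = []  # hypothetical holder; grows by one generation per reload

@contextlib.contextmanager
def held_modules():
    # Same idea as above, but instead of letting the old modules be destroyed
    # (which sets their globals to None), keep them alive indefinitely.
    modules_before = set(sys.modules)
    try:
        yield
    finally:
        for name in list(sys.modules):
            if name not in modules_before:
                _superseded_modules.append(sys.modules.pop(name))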

You should consider importing the configuration instead of execing it.
I use import for a similar purpose, and it works great (specifically, importlib.import_module(mod)). Though my configs consist mainly of primitives, not real functions.
Like you, I also have a "guard" context to restore the original contents of sys.modules after the import. Plus, I use sys.dont_write_bytecode = True (of course, you can add that to your DisableModuleCache -- set to True in __enter__ and to False in __exit__). This would ensure the config actually "runs" each time you import it.
The main difference between the two approaches (other than not having to rely on the state the interpreter is left in after exec'ing, which I consider semi-unclean) is that the config files are identified by their module name/path (as used for importing) rather than by file name.
EDIT: A link to the implementation of this approach, as part of the Figura package.
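A minimal sketch of this import-based approach (an illustration, not the Figura code; it assumes the config lives in an importable module named config_module and exposes mostly primitive settings):

import importlib
import sys

class ImportGuard(object):
    """Hold sys.modules constant and skip .pyc files while importing a config."""
    def __enter__(self):
        self.modules_before = set(sys.modules)
        self.old_flag = sys.dont_write_bytecode
        sys.dont_write_bytecode = True

    def __exit__(self, *exc_info):
        sys.dont_write_bytecode = self.old_flag
        for name in list(sys.modules):
            if name not in self.modules_before:
                del sys.modules[name]

def load_config(module_name):
    with ImportGuard():
        mod = importlib.import_module(module_name)
        # Copy out the primitive settings before the guard exits; per the note
        # above, holding on to functions from `mod` would hit the
        # globals-go-to-None problem once the module object is dropped.
        return dict((k, v) for k, v in vars(mod).items() if not k.startswith('_'))

settings = load_config('config_module')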


How does import and reload actually work when dealing with packages

I have issues understanding some subtleties of the Python import system. I have condensed my doubts around a minimal example and a number of concrete and related questions detailed below.
I have defined a package in a folder called modules, whose content is an __init__.py and two regular modules, one with general functionality for the package and another with the definitions for the end user. The content is as simple as:
__init__.py
from .base import *
from .implementation import *
base.py
class FactoryClass():
    registry = {}

    @classmethod
    def add_to_registry(cls, newclass):
        cls.registry[newclass.__name__] = newclass

    @classmethod
    def getobject(cls, classname, *args, **kwargs):
        return cls.registry[classname](*args, **kwargs)


class BaseClass():
    def hello(self):
        print(f"Hello from instance of class {type(self).__name__}")
implementation.py
from .base import BaseClass, FactoryClass

class First(BaseClass):
    pass

class Second(BaseClass):
    pass

FactoryClass.add_to_registry(First)
FactoryClass.add_to_registry(Second)
The user of the package will use it as:
import modules
a = modules.FactoryClass.getobject("First")
b = modules.FactoryClass.getobject("Second")
a.hello()
b.hello()
This works. The problem comes because I'm developing this, and my workflow includes adding functionality in implementation.py and then continually testing it by reloading the module. But I can not understand/predict which module I have to reload to have the functions updated. I'm making changes that have no effect and it drives me crazy (until yesterday I was working on a large .py file with all the code lumped together, so I had none of these problems).
Here are some tests I have done, and I'd like to understand what's happening and why.
First, I start by commenting out all mentions of the Second class in implementation.py (to pretend it was not yet developed):
from importlib import reload
import modules
modules.base.FactoryClass is modules.FactoryClass # returns True
modules.FactoryClass.registry # just First class is in registry
a = modules.FactoryClass.getobject("First")
b = modules.FactoryClass.getobject("Second") # raises KeyError as expected
This code and its output are pretty clear. The only thing that puzzles me is why there is a modules.base module at all (I did not import it!). Further, it is redundant, as its classes point to the same objects. Why does importing modules also import modules.base and modules.implementation as separate but essentially identical objects?
Now things become interesting as I uncomment that code, i.e. I finish developing Second, and I'd like to test it without having to restart the Python session. I have tried 3 different reloads:
reload (modules)
This does absolutely nothing. I'd expect some sort of recursivity, but as I have found in many other threads, this is the expected behavior.
Now I try to manually reload one of those "unexpected" modules:
reload (modules.implementation)
modules.base.FactoryClass is modules.FactoryClass # True
modules.FactoryClass.registry # First and Second
a = modules.FactoryClass.getobject("First")
b = modules.FactoryClass.getobject("Second") # Works as expected
This seems to be the right way to go. It updates the module contents as expected and the new functionality is usable. What puzzles me is why modules.FactoryClass has been updated (its registry), despite the fact that I did not reload the modules.base module. I'd expect this class to stay "outdated".
Finally, and starting from the just freshly uncommented version, I have tried
reload (modules.base)
modules.base.FactoryClass is modules.FactoryClass # False
modules.FactoryClass.registry # just First class is in registry
modules.base.FactoryClass.registry # empty
a = modules.base.FactoryClass.getobject("First")
b = modules.base.FactoryClass.getobject("Second") # raises KeyError
This is very odd. modules.FactoryClass is outdated (Second is unknown). modules.base.FactoryClass's registry is empty. Why are modules.FactoryClass and modules.base.FactoryClass now different objects?
Could someone explain why the three different versions of reload a package have so different behaviour?
You are confused about how the Python import system works, so I strongly recommend you read the corresponding documentation: the import system and importlib.reload.
A foreword: code hot-reloading in Python is tricky. I recommend not doing it unless it is required. You have seen it yourself: the bugs are very tricky.
Now to your questions:
Why does importing modules also import modules.base and modules.implementation as separate but essentially identical objects?
As @Kemp answered in a comment (and I upvoted), imports are transitive. When you import a, Python will parse/compile/execute the corresponding library file. If that module does import b, then Python will do it again for the b library file, and again and again. You don't see it, but by the time your program starts, a lot of things have already been imported.
Given this file:
print("nothing else")
When I set my debugger to pause before executing the print line, if I look into sys.modules I already have 338 different libraries imported : builtins (where print came from), sys, itertools, enum, json, ...
Understand that "no visible import statement" does not mean "nothing has been imported".
When you execute import a, Python starts by checking its sys.modules cache to determine whether the library has already been read from disk, parsed, compiled and executed into a module object. If this library was not yet imported during this program, then Python takes the time to do all that. But because that is slow, Python optimizes it with a cache.
The result is a module object, which gets bound to a name in the current namespace so that you can access it.
We can summarize it like this:
def import_library(name: str) -> Module:
    if name not in sys.modules:
        # cache miss
        filepath = locate_library(name)
        bytecode = compile_library(filepath)
        module = execute(bytecode)
        sys.modules[name] = module
    # in any case, at this point, the module is available
    return sys.modules[name]
You are thus confusing module objects with variables.
In any module you can declare variables with whatever name Python's grammar allows, and some of them will reference modules.
Here is an example:
# file: main.py
import lib # create a variable named `lib` referencing the `lib` library
import lib as horse # create a variable named `horse` referencing the `lib` library
print(lib.a.number) # 14
print(horse.a.number) # 14
print(lib is horse) # True
print(lib.a.sublib.__name__) # lib.sublib
import lib.sublib
from lib import sublib
import lib.sublib as lib_sublib
print((lib.sublib is sublib, sublib is lib_sublib, lib.a.zebra is sublib)) # (True, True, True)
import sys
print(sys.modules["lib"] is lib) # True
print(sys.modules["lib.sublib"] is sublib) # True
print(lib.sublib.package_color) # blue
print(lib.sublib.color) # AttributeError: module 'lib.sublib' has no attribute 'color'
# file: lib/__init__.py
from . import a
# file: lib/a.py
from . import sublib
from . import sublib as zebra
number = 14
# file: lib/sublib/__init__.py
from .b import color as package_color
# file: lib/sublib/b.py
color = "blue"
Python offers a lot of flexibility about how to import things, what to expose, how to access. But I admit it is confusing.
Also take a look at the role of __all__ in __init__.py. Given that, you should now understand your question on subpackage naming/visibility.
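For instance (a hedged illustration reusing the hypothetical lib/sublib layout above), __all__ controls which names a star-import copies out of the package:

# file: lib/sublib/__init__.py (illustrative variant)
from .b import color as package_color

__all__ = ["package_color"]  # `from lib.sublib import *` copies only this name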
reload(modules) This does absolutely nothing. I'd expect some sort of recursivity, but as I have found in many other threads, this is the expected behavior.
Given what I explained, can you now understand what it does, and why it is not what you want it to do?
Because what you want is to get modules.implementation hot-reloaded, but you asked for modules.
>>> from importlib import reload
>>> import lib
>>> lib.sublib.package_color
'blue'
>>> # I edit the "b.py" file so that the color is "red"
>>> old_lib = lib
>>> new_lib = reload(lib)
>>> lib is new_lib, lib is old_lib
(True, True)
>>> lib.sublib.package_color
'blue'
>>> lib.sublib.b.color
'red'
>>> import sys
>>> sys.modules["lib.sublib.b"].color
'red'
First, the top-level reload did not work, because all that file does is import sublib, which hits the cache, so nothing really gets done.
You have to reload the actual module for its content to take effect. But it does not work magically: it creates new objects (the module-level definitions) and puts them into the same module object, but it can't update references to the previous module's content that may exist elsewhere. That is why we still see "blue" even after the module has been reloaded: package_color is a reference to the first version's color variable, and it does not get updated when the module is reloaded. This is dangerous: there may be different copies of similar things lying around.
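Continuing the same session, here is a hedged sketch of one way to bring the stale reference up to date: reload the leaf module first, then reload the package whose from-import copied the name out of it, so the copy is made again from the new objects.

>>> reload(lib.sublib.b)
<module 'lib.sublib.b' from '...'>
>>> reload(lib.sublib)   # re-runs `from .b import color as package_color`
<module 'lib.sublib' from '...'>
>>> lib.sublib.package_color
'red'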
why modules.FactoryClass has been updated
You are reloading modules.implementation in this case. What happens is that the whole file is re-executed to repopulate the module object; I have highlighted the effects:
from .base import BaseClass, FactoryClass  # relative library "base" already in the `sys.modules` cache

class First(BaseClass):   # was already defined, redefined
    pass

class Second(BaseClass):  # was not defined yet, created
    pass

FactoryClass.add_to_registry(First)   # overwriting the registry entry for "First" with the redefinition of class `First`
FactoryClass.add_to_registry(Second)  # registering the definition of class `Second` under "Second"
You can see it another way:
>>> import modules
>>> from importlib import reload
>>> First_before = modules.implementation.First
>>> reload(modules.implementation)
<module 'modules.implementation' from 'C:\\PycharmProjects\\stack_overflow\\68069397\\modules\\implementation.py'>
>>> First_after = modules.implementation.First
>>> First_before is First_after
False
When you reload a module, Python re-executes all of its code, in a way that may differ from the previous time(s). Here, each execution calls FactoryClass.add_to_registry, so FactoryClass gets updated with the (re)definitions.
Why are now modules.FactoryClass and modules.base.FactoryClass different objects?
Because you reloaded modules.base, creating a new FactoryClass class object, but the from .base import BaseClass, FactoryClass in modules.implementation does not get re-run, so it is still using the class object from before the reload.
Because you reloaded, you got yourself a copy of everything. The problem is that you still have lingering references to the versions from before the reload.
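A hedged sketch of a reload order that avoids ending up with two different FactoryClass objects (leaf first, then everything that from-imported out of it; note that instances created before the reload still point at the old classes):

from importlib import reload
import modules

reload(modules.base)            # fresh BaseClass / FactoryClass (empty registry)
reload(modules.implementation)  # re-runs `from .base import ...`, re-registering First/Second
reload(modules)                 # re-runs the star-imports in __init__.py
assert modules.FactoryClass is modules.base.FactoryClass  # consistent again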
I hope it answers your questions.
Import is not easy, but reloading is notably tricky.
If you truly desire to reload your code, then you will have to take extra, extra care to correctly re-import everything, in the correct order (if such an order exists, and if there are not too many side effects). You need a proper script to update everything correctly, and when it does not work perfectly you will get horrible, sad and mind-bending bugs.
But if you prefer to keep your sanity, and your workflow does not require you to reload parts of the program, just close it and restart it. There is nothing wrong with that. It is the simple solution. The sane solution.
TL;DR: don't use importlib.reload if you don't know how it works exactly and understand the risks.

Forcing Unload/Deconstruction of Dynamically Imported File from Source

Been a longtime browser of SO, finally asking my own questions!
So, I am writing an automation script/module that looks through a directory recursively for Python modules with a specific name. If I find a module with that name, I load it dynamically, pull what I need from it, and then unload it. I noticed, though, that simply del'ing the module does not remove all references to it; there is another reference lingering somewhere and I do not know where it is. I tried taking a peek at the source code, but couldn't make sense of it too well. Here is a sample of what I am seeing, greatly simplified:
I am using Python 3.5.2 (Anaconda v4.2.0). I am using importlib, and that is what I want to stick with. I also want to be able to do this with vanilla python-3.
I got the import from source from the python docs here (yes I am aware this is the Python 3.6 docs).
My main driver...
# main.py
import importlib.util
import sys

def foo():
    spec = importlib.util.spec_from_file_location('a', 'a.py')
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    print(sys.getrefcount(module))
    del module
    del spec

if __name__ == '__main__':
    foo()
    print('THE END')
print('THE END')
And my sample module...
# a.py
print('hello from a')

class A():
    def __del__(self):
        print('SO LONG A!')

inst = A()
Output:
python main.py
hello from a
2
THE END
SO LONG A!
I expected to see "SO LONG A!" printed before "THE END". So, where is this other hidden reference to my module? I understand that my del's are gratuitous given that it's all wrapped in a function; I just wanted the deletion and scope to be explicit. How do I get a.py to completely unload? I plan on dynamically loading a ton of modules like a.py, and I do not want to hold on to them any longer than I really have to. Is there something I am missing?
There is a circular reference here: the module object references objects that in turn reference the module again.
This means the module is not cleared immediately (as the reference count never goes to 0 by itself). You need to wait for the circle to be broken by the garbage collector.
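A hedged way to see the cycle being described, using the question's a.py (the attribute names here are standard Python 3, nothing invented):

import importlib.util

spec = importlib.util.spec_from_file_location('a', 'a.py')
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# The class in a.py holds its __del__ function, whose __globals__ is the very
# dict the module owns -- a cycle that reference counting alone cannot break.
print(module.A.__del__.__globals__ is module.__dict__)  # True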
You can force this by calling gc.collect():
import gc

# ...

if __name__ == '__main__':
    foo()
    gc.collect()
    print('THE END')
With that in place, the output becomes:
$ python main.py
hello from a
2
SO LONG A!
THE END

NameError on global variables when multiprocessing, only in subdirectory

I have a main process which uses execfile and runs a script in a child process. This works fine unless the script is in another directory -- then everything breaks down.
This is in mainprocess.py:
from multiprocessing import Process

m = "subdir\\test.py"

if __name__ == '__main__':
    p = Process(target=execfile, args=(m,))
    p.start()
Then in a subdirectory aptly named subdir, I have test.py
import time

def foo():
    print time.time()

foo()
When I run mainprocess.py, I get the error:
NameError: global name 'time' is not defined
but the issue isn't limited to module names -- sometimes I'll get an error on a function name on other pieces of code.
I've tried importing time in mainprocess.py and also inside the if statement there, but neither has any effect.
One way of avoiding the error (I haven't tried this) is to copy test.py into the parent directory and insert a line in the file to os.chdir back to the original directory. However, this seems rather sloppy.
So what is happening?
The solution is to change your Process initialization:
p = Process(target=execfile, args=(m, {}))
Honestly, I'm not entirely sure why this works. I know it has something to do with which dictionary (locals vs. globals) the time import is added to. It seems like when your import is made in test.py, it's treated like a local variable, because the following works:
import time # no foo() anymore
print(time.time()) # the call to time.time() is in the same scope as the import
However, the following also works:
import time

def foo():
    global time
    print(time.time())

foo()
This second example shows me that the import is still assigned to some kind of global namespace, I just don't know how or why.
If you call execfile() normally, rather than in a subprocess, everything runs fine, and in fact you can then use the time module any place after the execfile() call in your main process, because time has been brought into the same namespace. I think that since you're launching it in a subprocess, there is no module-level namespace for the import to be assigned to (execfile doesn't create a module object when called). I think that when we add the empty dictionary to the call to execfile, we're supplying the globals dictionary argument, thus giving the import mechanism a global namespace to assign the name time to.
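For completeness, a sketch of the fix in context (Python 2, matching the question; the join() is an addition so the parent waits for the child):

# mainprocess.py
from multiprocessing import Process

m = "subdir\\test.py"

if __name__ == '__main__':
    # Passing an explicit (empty) globals dict gives the exec'd script a real
    # module-level namespace of its own, so `import time` lands somewhere usable.
    p = Process(target=execfile, args=(m, {}))
    p.start()
    p.join()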
Some links for background:
1) Tutorial page on namespaces and scope
- look here for builtin, global, and local namespace explanations first
2) Python docs on execfile command
3) A very similar question on a non-SO site

Python 2.7 possible bug, imported modules "disappears"

I've found something very strange. See this short code below.
import os

class Logger(object):
    def __init__(self):
        self.pid = os.getpid()
        print "os: %s." % os

    def __del__(self):
        print "os: %s." % os

def temp_test_path():
    return "./[%d].log" % (os.getpid())

logger = Logger()
This is intended for illustrative purposes. It just prints the imported module os on the construction and destruction of a class (never mind the name Logger). However, when I run this, the module os seems to "disappear" to None in the class destructor. The following is the output.
os: <module 'os' from 'C:\Python27\lib\os.pyc'>.
os: None.
Where it says os: None. is my problem. It should be identical to the first output line. However, look back at the Python code above, at the function temp_test_path(). If I alter the name of this function slightly, to say temp_test_pat(), keep all of the rest of the code exactly the same, and run it, I get the expected output (below).
os: <module 'os' from 'C:\Python27\lib\os.pyc'>.
os: <module 'os' from 'C:\Python27\lib\os.pyc'>.
I can't find any explanation for this except that it's a bug. Can you? By the way I'm using Windows 7 64 bit.
If you are relying on the interpreter shutdown to call your __del__ it could very well be that the os module has already been deleted before your __del__ gets called. Try explicitly doing a del logger in your code and sleep for a bit. This should show it clearly that the code functions as you expect.
I also want to link you to this note in the official documentation that __del__ is not guaranteed to be called in the CPython implementation.
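A quick way to check this (a sketch of the suggestion above, reusing the question's Logger class):

from time import sleep

logger = Logger()
del logger   # __del__ runs now, long before interpreter shutdown,
             # so it prints the os module rather than None
sleep(1)     # give the program a moment before it exits, per the suggestion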
I've reproduced this. Interesting behavior for sure. One thing that you need to realize is that __del__ isn't guaranteed to even be called when the interpreter exits -- Also there is no specified order for finalizing objects at interpreter exit.
Since you're exiting the interpreter, there is no guarantee that os hasn't been deleted first. In this case, it seems that os is in fact being finalized before your Logger object. These things probably happen depending on the order in the globals dictionary.
If we just print the keys of the globals dictionary right before we exit:
for k in globals().keys():
    print k
you'll see:
temp_test_path
__builtins__
__file__
__package__
__name__
Logger
os
__doc__
logger
or:
logger
__builtins__
__file__
__package__
temp_test_pat
__name__
Logger
os
__doc__
Notice where your logger sits, particularly compared to where os sits in the list. With temp_test_pat, logger actually gets finalized first, so os is still bound to something meaningful. However, it gets finalized last in the case where you use temp_test_path.
If you plan on having an object live until the interpreter is exiting, and you have some cleanup code that you want to run, you could always register a function to be run using atexit.register.
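A minimal sketch of the atexit suggestion (illustrative, Python 2 to match the question; the close() method name is hypothetical):

import atexit
import os

class Logger(object):
    def __init__(self):
        self.pid = os.getpid()
        # atexit handlers run early in shutdown, before module globals are
        # torn down, so `os` is still fully usable inside close().
        atexit.register(self.close)

    def close(self):
        print "os: %s." % os

logger = Logger()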
Others have given you the answer, it is undefined the order in which global variables (such as os, Logger and logger) are deleted from the module's namespace during shutdown.
However, if you want a workaround, just import os into the finaliser's local namespace:
def __del__(self):
    import os
    print "os: %s." % os
The os module will still be around at this point, it's just that you've lost your global reference to it.
This is to be expected. From The Python Language Reference:
"Also, when __del__() is invoked in response to a module being deleted (e.g., when execution of the program is done), other globals referenced by the __del__() method may already have been deleted or in the process of being torn down (e.g. the import machinery shutting down)."
in a big red warning box :-)

IPython Parallel Computing Namespace Issues

I've been reading and re-reading the IPython documentation/tutorial, and I can't figure out the issue with this particular piece of code. It seems that the function dimensionless_run is not visible in the namespace delivered to each of the engines, but I'm confused because the function is defined in __main__ and is clearly visible as part of the global namespace.
wrapper.py:
import math, os

def dimensionless_run(inputs):
    output_file = open(inputs['fn'], 'w')
    ...
    return output_stats

def parallel_run(inputs):
    import math, os  ## Removing this line causes a NameError: global name 'math'
                     ## is not defined.
    folder = inputs['folder']
    zfill_amt = int(math.floor(math.log10(inputs['num_iters'])))
    for i in range(inputs['num_iters']):
        run_num_str = str(i).zfill(zfill_amt)
        if not os.path.exists(folder + '/'):
            os.mkdir(folder)
        dimensionless_run(inputs)
    return

if __name__ == "__main__":
    inputs = [input1, input2, ...]
    client = Client()
    lbview = client.load_balanced_view()
    lbview.block = True
    for x in sorted(globals().items()):
        print x
    lbview.map(parallel_run, inputs)
Executing this code after ipcluster start --n=6 yields the sorted global dictionary, including the math and os modules, and the parallel_run and dimensionless_run functions. This is followed by an IPython.parallel.error.CompositeError: one or more exceptions from call to method: parallel_run, which is composed of a large number of [n:apply]: NameError: global name 'dimensionless_run' is not defined, where n runs from 0-5.
There are two things I don't understand, and they're clearly linked.
Why doesn't the code identify dimensionless_run in the global namespace?
Why is import math, os necessary inside the definition of parallel_run?
Edited: This turned out not to be much of a namespace error at all -- I was executing ipcluster start --n=6 in a directory that didn't contain the code. To fix it, all I needed to do was execute the start command in my code's directory. I also fixed it by adding the lines:
inputs = input_pairs
os.system("ipcluster start -n 6") #NEW
client = Client()
...
lbview.map(parallel_run,inputs)
os.system("ipcluster stop") #NEW
which start the required cluster in the right place.
This is mostly a duplicate of Python name space issues with IPython.parallel, which has a more detailed answer, but the gist:
When the Client sends parallel_run to the engine, it just sends that function, not the entire namespace in which the function is defined (the __main__ module). So when running the remote parallel_run, lookups to math or os or dimensionless_run will look first in locals() (what has been defined already in the function, i.e. your in-function imports), then in the globals(), which is the __main__ module on the engine.
There are various approaches to making sure names are available on the engines, but perhaps the simplest is to explicitly define/send them to the engines (the interactive namespace is __main__ on the engines, just like it is locally in IPython):
client[:].execute("import os, math")
client[:]['dimensionless_run'] = dimensionless_run
prior to making your run, in which case everything should work as you expect.
This is an issue unique to code defined interactively or in a script; it does not come up if this file is a module instead of a script, e.g.
from mymod import parallel_run
lbview.map(parallel_run, inputs)
In which case the globals() is the module globals, which are generally the same everywhere.
