Python 2.7 possible bug, imported modules "disappears" - python

I've found something very strange. See this short code below.
import os

class Logger(object):
    def __init__(self):
        self.pid = os.getpid()
        print "os: %s." % os

    def __del__(self):
        print "os: %s." % os

def temp_test_path():
    return "./[%d].log" % (os.getpid())

logger = Logger()
This is intended for illustrative purposes. It just prints the imported module os on the construction and destruction of a class (never mind the name Logger). However, when I run this, the module os seems to "disappear" to None in the class destructor. The following is the output.
os: <module 'os' from 'C:\Python27\lib\os.pyc'>.
os: None.
The line that says os: None. is my problem. It should be identical to the first output line. However, look back at the Python code above, at the function temp_test_path(). If I alter the name of this function slightly, to say temp_test_pat(), keep all of the rest of the code exactly the same, and run it, I get the expected output (below).
os: <module 'os' from 'C:\Python27\lib\os.pyc'>.
os: <module 'os' from 'C:\Python27\lib\os.pyc'>.
I can't find any explanation for this except that it's a bug. Can you? By the way I'm using Windows 7 64 bit.

If you are relying on interpreter shutdown to call your __del__, it could very well be that the os module has already been deleted before your __del__ gets called. Try explicitly doing a del logger in your code and sleeping for a bit. That should show clearly that the code functions as you expect.
I also want to link you to this note in the official documentation that __del__ is not guaranteed to be called in the CPython implementation.
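For instance, a minimal sketch of that explicit-del suggestion (Python 2 syntax to match the question; the sleep is only there to make the ordering obvious):
import os
import time

class Logger(object):
    def __del__(self):
        print "os: %s." % os   # os is still bound at this point

logger = Logger()
del logger      # finalize explicitly, while the module's globals are intact
time.sleep(1)   # the interpreter is still running; nothing has been torn down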

I've reproduced this. Interesting behavior, for sure. One thing that you need to realize is that __del__ isn't guaranteed to even be called when the interpreter exits -- and there is no specified order for finalizing objects at interpreter exit.
Since you're exiting the interpreter, there is no guarantee that os hasn't been deleted first. In this case, it seems that os is in fact being finalized before your Logger object. These things probably happen depending on the order in the globals dictionary.
If we just print the keys of the globals dictionary right before we exit:
for k in globals().keys():
    print k
you'll see:
temp_test_path
__builtins__
__file__
__package__
__name__
Logger
os
__doc__
logger
or:
logger
__builtins__
__file__
__package__
temp_test_pat
__name__
Logger
os
__doc__
Notice where your logger sits, particularly compared to where os sits in the list. With temp_test_pat, logger actually gets finalized first, so os is still bound to something meaningful. However, logger gets finalized last in the case where you use temp_test_path.
If you plan on having an object live until the interpreter exits, and you have cleanup code that you want to run, you can register a function to be run at shutdown using atexit.register.
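For example, a rough sketch of that approach (Python 2 syntax to match the question; the close method name is just illustrative):
import atexit
import os

class Logger(object):
    def close(self):
        # runs via atexit at interpreter exit, before module globals are torn down
        print "os: %s." % os

logger = Logger()
atexit.register(logger.close)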

Others have given you the answer, it is undefined the order in which global variables (such as os, Logger and logger) are deleted from the module's namespace during shutdown.
However, if you want a workaround, just import os into the finaliser's local namespace:
def __del__(self):
    import os
    print "os: %s." % os
The os module will still be around at this point, it's just that you've lost your global reference to it.

This is to be expected. From The Python Language Reference:
Also, when __del__() is invoked in response to a module being deleted (e.g., when execution of the program is done), other globals referenced by the __del__() method may already have been deleted or in the process of being torn down (e.g. the import machinery shutting down).
in a big red warning box :-)

Related

Forcing Unload/Deconstruction of Dynamically Imported File from Source

Been a longtime browser of SO, finally asking my own questions!
So, I am writing an automation script/module that looks through a directory recursively for python modules with a specific name. If I find a module with that name, I load it dynamically, pull what I need from it, and then unload it. I noticed though that simply del'ing the module does not remove all references to that module, there is another lingering somewhere and I do not know where it is. I tried taking a peek at the source code, but couldn't make sense of it too well. Here is a sample of what I am seeing, greatly simplified:
I am using Python 3.5.2 (Anaconda v4.2.0). I am using importlib, and that is what I want to stick with. I also want to be able to do this with vanilla python-3.
I got the import from source from the python docs here (yes I am aware this is the Python 3.6 docs).
My main driver...
# main.py
import importlib.util
import sys

def foo():
    spec = importlib.util.spec_from_file_location('a', 'a.py')
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    print(sys.getrefcount(module))
    del module
    del spec

if __name__ == '__main__':
    foo()
    print('THE END')
And my sample module...
# a.py
print('hello from a')

class A():
    def __del__(self):
        print('SO LONG A!')

inst = A()
Output:
python main.py
hello from a
2
THE END
SO LONG A!
I expected to see "SO LONG A!" printed before "THE END". So, where is this other hidden reference to my module? I understand that my dels are gratuitous given that I have everything wrapped in a function; I just wanted the deletion and scope to be explicit. How do I get a.py to completely unload? I plan on dynamically loading a ton of modules like a.py, and I do not want to hold on to them any longer than I really have to. Is there something I am missing?
There is a circular reference here: the module object references objects that reference the module again.
This means the module is not cleared immediately (the reference count never drops to 0 by itself). You need to wait for the cycle to be broken by the garbage collector.
You can force this by calling gc.collect():
import gc

# ...

if __name__ == '__main__':
    foo()
    gc.collect()
    print('THE END')
With that in place, the output becomes:
$ python main.py
hello from a
2
SO LONG A!
THE END
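If you want to convince yourself that the cycle exists, a small sketch (assuming the same a.py as above) is to ask the collector who still refers to the module's namespace:
import gc
import importlib.util

spec = importlib.util.spec_from_file_location('a', 'a.py')
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Functions defined in a.py (such as A.__del__) hold a reference to the module's
# globals dict via their __globals__ attribute, and that dict in turn references
# them -- the cycle that keeps the module alive until the collector runs.
for referrer in gc.get_referrers(module.__dict__):
    print(type(referrer))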

What happens to imports when a new process is spawned?

What happens to imported modules' variables when a new process is spawned?
i.e.
with concurrent.futures.ProcessPoolExecutor(max_workers=settings.MAX_PROCESSES) as executor:
    for stuff in executor.map(foo, paths):
where:
def foo(str):
    x = someOtherModule.fooBar()
where fooBar is accessing things declared at the start of someOtherModule:
someOtherModule.py:
myHat = 'green'

def fooBar():
    return myHat
Specifically, I have a module (called Y) that has a py4j gateway initialized at the top, outside of any function. In module X I'm loading several files at once, and the function that sorts through the data after loading uses a function in Y which in turn uses the gateway.
Is this design pythonic?
Should I be importing my Y module after each new process is spawned? Or is there a better way to do this?
On Linux, fork will be used to spawn the child, so anything in the global scope of the parent will also be available in the child, with copy-on-write semantics.
On Windows, anything you import at the module-level in the __main__ module of the parent process will get re-imported in the child.
This means that if you have a parent module (let's call it someModule) like this:
import someOtherModule
import concurrent.futures

def foo(str):
    x = someOtherModule.fooBar()

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(max_workers=settings.MAX_PROCESSES) as executor:
        for stuff in executor.map(foo, paths):
            pass  # stuff
And someOtherModule looks like this:
myHat = 'green'

def fooBar():
    return myHat
In this example, someModule is the __main__ module of the script. So, on Linux, the myHat instance you get in the child will be a copy-on-write version of the one in someModule. On Windows, each child process will re-import someModule as soon as they load, which will result in someOtherModule being re-imported as well.
I don't know enough about py4j Gateway objects to tell you for sure whether this is the behavior you want. If the Gateway object is pickleable, you could explicitly pass it to each child instead, but you'd have to use a multiprocessing.Pool instead of concurrent.futures.ProcessPoolExecutor:
import someOtherModule
import multiprocessing

def foo(str):
    x = someOtherModule.fooBar()

def init(hat):
    someOtherModule.myHat = hat

if __name__ == "__main__":
    hat = someOtherModule.myHat
    pool = multiprocessing.Pool(settings.MAX_PROCESSES,
                                initializer=init, initargs=(hat,))
    for stuff in pool.map(foo, paths):
        pass  # stuff
It doesn't seem like you have a need to do this for your use case, though. You're probably fine using the re-import.
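(As an aside: on Python 3.7 and later, ProcessPoolExecutor also accepts initializer/initargs, so a sketch equivalent to the Pool version above could look like the following; paths and settings are assumed to come from the question's context.)
import concurrent.futures
import someOtherModule

def foo(path):
    return someOtherModule.fooBar()

def init(hat):
    someOtherModule.myHat = hat

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(
            max_workers=settings.MAX_PROCESSES,
            initializer=init,
            initargs=(someOtherModule.myHat,)) as executor:
        for stuff in executor.map(foo, paths):
            pass  # use stuff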
When you create a new process, fork() is called, which clones the entire process: stack, memory space, etc. This is why multiprocessing is considered more expensive than multithreading, since the copying is expensive.
So to answer your question: all "imported module variables" are cloned. You can modify them as you wish, but your original parent process won't see the change.
EDIT:
This is for Unix-based systems only. See dano's answer for a Unix + Windows answer.
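A tiny sketch of that point (Unix fork start method assumed, with someOtherModule as defined above): the child can rebind the module-level name, but the parent's copy is unchanged.
import multiprocessing
import someOtherModule   # defines myHat = 'green'

def child():
    someOtherModule.myHat = 'red'                 # modifies the child's copy only
    print('child sees:', someOtherModule.myHat)   # red

if __name__ == '__main__':
    p = multiprocessing.Process(target=child)
    p.start()
    p.join()
    print('parent still sees:', someOtherModule.myHat)   # green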

NameError on global variables when multiprocessing, only in subdirectory

I have a main process which uses execfile and runs a script in a child process. This works fine unless the script is in another directory -- then everything breaks down.
This is in mainprocess.py:
from multiprocessing import Process

m = "subdir\\test.py"

if __name__ == '__main__':
    p = Process(target=execfile, args=(m,))
    p.start()
Then in a subdirectory aptly named subdir, I have test.py
import time

def foo():
    print time.time()

foo()
When I run mainprocess.py, I get the error:
NameError: global name 'time' is not defined
but the issue isn't limited to module names -- sometimes I'll get an error on a function name in other pieces of code.
I've tried importing time in mainprocess.py and also inside the if statement there, but neither has any effect.
One way of avoiding the error (I haven't tried this) is to copy test.py into the parent directory and insert a line in the file to os.chdir back to the original directory. However, this seems rather sloppy.
So what is happening?
The solution is to change your Process initialization:
p = Process(target=execfile, args=(m, {}))
Honestly, I'm not entirely sure why this works. I know it has something to do with which dictionary (locals vs. globals) that the time import is added to. It seems like when your import is made in test.py, it's treated like a local variable, because the following works:
import time # no foo() anymore
print(time.time()) # the call to time.time() is in the same scope as the import
However, the following also works:
import time

def foo():
    global time
    print(time.time())

foo()
This second example shows me that the import is still assigned to some kind of global namespace, I just don't know how or why.
If you call execfile() normally, rather than in a subprocess, everything runs fine; in fact, you can then use the time module anywhere after the execfile() call in your main process, because time has been brought into the same namespace. I think that since you're launching it in a subprocess, there is no module-level namespace for the import to be assigned to (execfile doesn't create a module object when called). I think that when we add the empty dictionary to the call to execfile, we're supplying the globals dictionary argument, thus giving the import mechanism a global namespace to assign the name time to.
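A quick way to see the same thing outside of multiprocessing (a Python 2 sketch): give execfile an explicit dictionary, and the file's top-level imports are bound into that dictionary, which then serves as the global namespace for functions defined in the file.
# Python 2 only: execfile was removed in Python 3.
namespace = {}
execfile("subdir\\test.py", namespace)   # runs fine; foo() can see 'time'
print namespace.keys()                   # 'time' and 'foo' are both bound here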
Some links for background:
1) Tutorial page on namespaces and scope - look here for builtin, global, and local namespace explanations first
2) Python docs on execfile command
3) A very similar question on a non-SO site

python: functions *sometimes* maintain a reference to their module

If I execfile a module, and remove all (of my) references to that module, its functions continue to work as expected. That's normal.
However, if that execfile'd module imports other modules, and I remove all references to those modules, the functions defined in those modules start to see all their global values as None. This causes things to fail spectacularly, of course, and in a very surprising manner (TypeError NoneType on string constants, for example).
I'm surprised that the interpreter makes a special case here; execfile doesn't seem special enough to cause functions to behave differently wrt module references.
My question: Is there any clean way to make the execfile-function behavior recursive (or global for a limited context) with respect to modules imported by an execfile'd module?
To the curious:
The application is reliable configuration reloading under buildbot. The buildbot configuration is executable python, for better or for worse. If the executable configuration is a single file, things work fairly well. If that configuration is split into modules, any imports from the top-level file get stuck to the original version, due to the semantics of __import__ and sys.modules. My strategy is to hold the contents of sys.modules constant before and after configuration, so that each reconfig looks like an initial configuration. This almost works except for the above function-global reference issue.
Here's a repeatable demo of the issue:
import gc
import sys
from textwrap import dedent

class DisableModuleCache(object):
    """Defines a context in which the contents of sys.modules is held constant.

    i.e. Any new entries in the module cache (sys.modules) are cleared when exiting this context.
    """
    modules_before = None

    def __enter__(self):
        self.modules_before = sys.modules.keys()

    def __exit__(self, *args):
        for module in sys.modules.keys():
            if module not in self.modules_before:
                del sys.modules[module]
        gc.collect()  # force collection after removing refs, for demo purposes.

def reload_config(filename):
    """Reload configuration from a file"""
    with DisableModuleCache():
        namespace = {}
        exec open(filename) in namespace
        config = namespace['config']
        del namespace
    config()

def main():
    open('config_module.py', 'w').write(dedent('''
        GLOBAL = 'GLOBAL'
        def config():
            print 'config! (old implementation)'
            print GLOBAL
        '''))

    # if I exec that file itself, its functions maintain a reference to its modules,
    # keeping GLOBAL's refcount above zero
    reload_config('config_module.py')
    ## output:
    #config! (old implementation)
    #GLOBAL

    # If that file is once-removed from the exec, the functions no longer maintain a reference to their module.
    # The GLOBAL's refcount goes to zero, and we get a None value (feels like weakref behavior?).
    open('main.py', 'w').write(dedent('''
        from config_module import *
        '''))
    reload_config('main.py')
    ## output:
    #config! (old implementation)
    #None
    ## *desired* output:
    #config! (old implementation)
    #GLOBAL

    acceptance_test()

def acceptance_test():
    # Have to wait at least one second between edits (on ext3),
    # or else we import the old version from the .pyc file.
    from time import sleep
    sleep(1)

    open('config_module.py', 'w').write(dedent('''
        GLOBAL2 = 'GLOBAL2'
        def config():
            print 'config2! (new implementation)'
            print GLOBAL2
            ## There should be no such thing as GLOBAL. Naive reload() gets this wrong.
            try:
                print GLOBAL
            except NameError:
                print 'got the expected NameError :)'
            else:
                raise AssertionError('expected a NameError!')
        '''))
    reload_config('main.py')
    ## output:
    #config2! (new implementation)
    #None
    #got the expected NameError :)
    ## *desired* output:
    #config2! (new implementation)
    #GLOBAL2
    #got the expected NameError :)

if __name__ == '__main__':
    main()
I don't think you need the 'acceptance_test' part of things here. The issue isn't actually weakrefs, it's modules' behavior on destruction. They clear out their __dict__ on delete. I vaguely remember that this is done to break ref cycles. I suspect that global references in function closures do something fancy to avoid a hash lookup on every invocation, which is why you get None and not a NameError.
Here's a much shorter sscce:
import gc
import sys
import contextlib
from textwrap import dedent

@contextlib.contextmanager
def held_modules():
    modules_before = sys.modules.keys()
    yield
    for module in sys.modules.keys():
        if module not in modules_before:
            del sys.modules[module]
    gc.collect()  # force collection after removing refs, for demo purposes.

def main():
    open('config_module.py', 'w').write(dedent('''
        GLOBAL = 'GLOBAL'
        def config():
            print 'config! (old implementation)'
            print GLOBAL
        '''))

    open('main.py', 'w').write(dedent('''
        from config_module import *
        '''))

    with held_modules():
        namespace = {}
        exec open('main.py') in namespace
        config = namespace['config']

    config()

if __name__ == '__main__':
    main()
Or, to put it another way, don't delete modules and expect their contents to continue functioning.
You should consider importing the configuration instead of execing it.
I use import for a similar purpose, and it works great (specifically, importlib.import_module(mod)), though my configs consist mainly of primitives, not real functions.
Like you, I also have a "guard" context to restore the original contents of sys.modules after the import. Plus, I use sys.dont_write_bytecode = True (of course, you can add that to your DisableModuleCache -- set it to True in __enter__ and back to False in __exit__). This ensures the config actually "runs" each time you import it.
The main difference between the two approaches (other than the fact that you don't have to rely on the state the interpreter is left in after execing, which I consider semi-unclean) is that the config files are identified by their module name/path (as used for importing) rather than by file name.
EDIT: A link to the implementation of this approach, as part of the Figura package.
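A rough sketch of what an import-based reload_config might look like (module names instead of file names; the sys.modules bookkeeping mirrors the DisableModuleCache idea from the question, and the config attribute name is taken from it):
import importlib
import sys

def reload_config(module_name):
    """Load a config module by name, leaving sys.modules as it was before."""
    modules_before = set(sys.modules)
    sys.dont_write_bytecode = True
    try:
        mod = importlib.import_module(module_name)
        config = mod.config
    finally:
        sys.dont_write_bytecode = False
        # drop anything the import added, so the next call re-imports from scratch
        for name in list(sys.modules):
            if name not in modules_before:
                del sys.modules[name]
    config()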

How can I access the current executing module or class name in Python?

I would like to be able to dynamically retrieve the current executing module or class name from within an imported module. Here is some code:
foo.py:
def f():
    print __name__
bar.py:
from foo import f
def b(): f()
This obviously does not work, as __name__ is the name of the module that contains the function. What I would like to access inside the foo module is the name of the currently executing module that is using foo. So in the case above it would be bar, but if any other module imported foo I would like foo to dynamically have access to that module's name.
Edit: The inspect module looks quite promising but it is not exactly what I was looking for. What I was hoping for was some sort of global or environment-level variable that I could access that would contain the name of the current executing module. Not that I am unwilling to traverse the stack to find that information - I just thought that Python may have exposed that data already.
Edit: Here is how I am trying to use this. I have two different Django applications that both need to log errors to file. Let's say that they are called "AppOne" and "AppTwo". I also have a place to which I would like to log these files: "/home/hare/app_logs". In each application, at any given point, I would like to be able to import my logger module and call the log function, which writes the log string to file. However, what I would like to do is create a directory under app_logs that is the name of the current application ("AppOne" or "AppTwo") so that each application's log files will go in their respective logging directories.
In order to do this I thought that the best way would be for the logger module to have access to some sort of global variable that denotes the current application's name as it is responsible for knowing the location of the parent logging directory and creating the application's logging directory if it does not yet exist.
From the comment -- not the question.
I am simply curious to see if what I am trying to do is possible.
The answer to "is it possible" is always "yes". Always. Unless your question involves time travel, anti-gravity or perpetual motion.
Since the answer is always "yes", your question is ill-formed. The real question is "what's a good way to have my logging module know the name of the client?" or something like that.
The answer is "Accept it as a parameter." Don't mess around with inspecting or looking for mysterious globals or other tricks.
Just follow the design pattern of logging.getLogger() and use explicitly-named loggers. A common idiom is the following
logger = logging.getLogger(__name__)
That handles almost all log naming perfectly.
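If the goal is the per-application log directories from the question, a sketch along these lines (paths and names are illustrative, taken from the question) keeps the logger module free of any caller inspection:
import logging
import os

def get_app_logger(app_name, log_root="/home/hare/app_logs"):
    """Return a logger writing to <log_root>/<app_name>/<app_name>.log."""
    log_dir = os.path.join(log_root, app_name)
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    logger = logging.getLogger(app_name)
    if not logger.handlers:   # don't add a second handler on repeat calls
        logger.addHandler(logging.FileHandler(os.path.join(log_dir, app_name + ".log")))
    return logger

# in AppOne:
logger = get_app_logger("AppOne")
logger.error("something went wrong")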
This should work for referencing the current module:
import sys
sys.modules[__name__]
The "currently executing module" clearly is foo, as that's what contains the function currently running - I think a better description as to what you want is the module of foo's immediate caller (which may itself be foo if you're calling a f() from a function in foo called by a function in bar. How far you want to go up depends on what you want this for.
In any case, assuming you want the immediate caller, you can obtain this by walking up the call stack. This can be accomplished by calling sys._getframe with the appropriate number of levels to walk.
import sys

def f():
    caller = sys._getframe(1)  # Obtain calling frame
    print "Called from module", caller.f_globals['__name__']
[Edit]: Actually, using the inspect module as suggested above is probably a cleaner way of obtaining the stack frame. The equivalent code is:
import inspect

def f():
    caller = inspect.currentframe().f_back
    print "Called from module", caller.f_globals['__name__']
(sys._getframe is documented as being for internal use - the inspect module is a more reliable API)
To obtain a reference to the "__main__" module when in another:
import sys
sys.modules['__main__']
To then obtain the module's file path, which includes its name:
sys.modules['__main__'].__file__ # type: str
If within the "__main__" module, simply use: __file__
To obtain just the file name from the file path:
import os
os.path.basename(file_path)
To separate the file name from its extension:
file_name.split(".")[0]
To obtain the name of a class instance:
instance.__class__.__name__
To obtain the name of a class (class itself is a keyword, so use the class object, e.g. SomeClass):
SomeClass.__name__
__file__ is the path of the module in which the call is made.
I think what you want to use is the inspect module, to inspect the python runtime stack. Check out this tutorial. I think it provides an almost exact example of what you want to do.
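For example, a minimal sketch of looking up the caller's module with inspect (callable from inside foo; the helper name is illustrative):
import inspect

def calling_module_name():
    caller_frame = inspect.stack()[1][0]        # frame of whoever called us
    module = inspect.getmodule(caller_frame)    # may be None for exec'd code
    if module is not None:
        return module.__name__
    return caller_frame.f_globals.get('__name__')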
Using __file__ alone gives you a relative path for the main module and an absolute path for imported modules. Being aware of this, we can get the module file consistently either way with a little help from our os.path tools.
For the file name only, use __file__.split(os.path.sep)[-1].
For the complete path, use os.path.abspath(__file__).
Demo:
/tmp $ cat f.py
from pprint import pprint
import os
import sys
pprint({
    'sys.modules[__name__]': sys.modules[__name__],
    '__file__': __file__,
    '__file__.split(os.path.sep)[-1]': __file__.split(os.path.sep)[-1],
    'os.path.abspath(__file__)': os.path.abspath(__file__),
})
/tmp $ cat i.py
import f
Results:
## on *Nix ##
/tmp $ python3 f.py
{'sys.modules[__name__]': <module '__main__' from 'f.py'>,
'__file__': 'f.py',
'__file__.split(os.path.sep)[-1]': 'f.py',
'os.path.abspath(__file__)': '/tmp/f.py'}
/tmp $ python3 i.py
{'sys.modules[__name__]': <module 'f' from '/tmp/f.pyc'>,
'__file__': '/tmp/f.pyc',
'__file__.split(os.path.sep)[-1]': 'f.pyc',
'os.path.abspath(__file__)': '/tmp/f.pyc'}
## on Windows ##
PS C:\tmp> python3.exe f.py
{'sys.modules[__name__]': <module '__main__' from 'f.py'>,
'__file__': 'f.py',
'__file__.split(os.path.sep)[-1]': 'f.py',
'os.path.abspath(__file__)': 'C:\\tools\\cygwin\\tmp\\f.py'}
PS C:\tmp> python3.exe i.py
{'sys.modules[__name__]': <module 'f' from 'C:\\tools\\cygwin\\tmp\\f.py'>,
'__file__': 'C:\\tools\\cygwin\\tmp\\f.py',
'__file__.split(os.path.sep)[-1]': 'f.py',
'os.path.abspath(__file__)': 'C:\\tools\\cygwin\\tmp\\f.py'}
If you want to strip the '.py' off the end, you can do that easily. (But don't forget that you may run a '.pyc' instead.)
If you want only the name of the file:
file_name = __file__.split("/")[len(__file__.split("/"))-1]
I don't believe that's possible since that's out of foo's scope. foo will only be aware of its internal scope since it may be being called by countless other modules and applications.
It's been a while since I've done python, but I believe that you can get access to the globals and locals of a caller through its traceback.
To get the current file module, containing folder, here is what worked for me:
import os
parts = os.path.splitext(__name__)
module_name = parts[len(parts) - 2]
