How to load compiled python modules from memory?

I need to read all modules (pre-compiled) from a zipfile (built by py2exe, compressed) into memory and then load them all.
I know this can be done by loading direct from the zipfile but I need to load them from memory.
Any ideas? (I'm using python 2.5.2 on windows)
TIA Steve

It depends on what exactly you have as "the module (pre-compiled)". Let's assume it's exactly the contents of a .pyc file, e.g., ciao.pyc as built by:
$ cat>'ciao.py'
def ciao(): return 'Ciao!'
$ python -c'import ciao; print ciao.ciao()'
Ciao!
IOW, having thus built ciao.pyc, say that you now do:
$ python
Python 2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> b = open('ciao.pyc', 'rb').read()
>>> len(b)
200
and your goal is to go from that byte string b to an importable module ciao. Here's how:
>>> import marshal
>>> c = marshal.loads(b[8:])
>>> c
<code object <module> at 0x65188, file "ciao.py", line 1>
this is how you get the code object from the .pyc binary contents. Edit: if you're curious, the first 8 bytes are a "magic number" and a timestamp -- not needed here (unless you want to sanity-check them and raise exceptions if warranted, but that seems outside the scope of the question; marshal.loads will raise anyway if it detects a corrupt string).
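As an aside, a quick sanity check on those 8 header bytes could look like this (a minimal sketch for Python 2.x; imp.get_magic() returns the 4-byte magic number of the running interpreter):
import imp, struct
assert b[:4] == imp.get_magic(), 'pyc built by a different Python version'
timestamp = struct.unpack('<I', b[4:8])[0]  # little-endian, seconds since the epoch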
Then:
>>> import types
>>> m = types.ModuleType('ciao')
>>> import sys
>>> sys.modules['ciao'] = m
>>> exec c in m.__dict__
i.e: make a new module object, install it in sys.modules, populate it by executing the code object in its __dict__. Edit: the order in which you do the sys.modules insertion and exec matters if and only if you may have circular imports -- but, this is the order Python's own import normally uses, so it's better to mimic it (which has no specific downsides).
You can "make a new module object" in several ways (e.g., from functions in standard library modules such as new and imp), but "call the type to get an instance" is the normal Python way these days, and the normal place to obtain the type from (unless it has a built-in name or you otherwise have it already handy) is from the standard library module types, so that's what I recommend.
Now, finally:
>>> import ciao
>>> ciao.ciao()
'Ciao!'
>>>
...you can import the module and use its functions, classes, and so on. Other import (and from) statements will then find the module as sys.modules['ciao'], so you won't need to repeat this sequence of operations (indeed you don't need this last import statement here if all you want is to ensure the module is available for import from elsewhere -- I'm adding it only to show it works;-).
Edit: If you absolutely must import in this way packages and modules therefrom, rather than "plain modules" as I just showed, that's doable, too, but a bit more complicated. As this answer is already pretty long, and I hope you can simplify your life by sticking to plain modules for this purpose, I'm going to shirk that part of the answer;-).
Also note that this may or may not do what you want in cases of "loading the same module from memory multiple times" (this rebuilds the module each time; you might want to check sys.modules and just skip everything if the module's already there) and in particular when such repeated "load from memory" occurs from multiple threads (needing locks -- but, a better architecture is to have a single dedicated thread devoted to performing the task, with other modules communicating with it via a Queue).
Finally, there's no discussion of how to install this functionality as a transparent "import hook" which automagically gets involved in the mechanisms of the import statement internals themselves -- that's feasible, too, but not exactly what you're asking about, so here, too, I hope you can simplify your life by doing things the simple way instead, as this answer outlines.
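Putting it all together for the original zipfile scenario, here is a minimal sketch (untested; it assumes Python 2.x and that the archive members are plain .pyc files, and it handles neither packages nor import ordering -- the function name load_pycs_from_zip is mine, not from the answer above):
import marshal, sys, types, zipfile
def load_pycs_from_zip(zip_path):
    zf = zipfile.ZipFile(zip_path)
    try:
        for info in zf.infolist():
            if not info.filename.endswith('.pyc'):
                continue
            # 'pkg/mod.pyc' -> 'pkg.mod'
            name = info.filename[:-4].replace('/', '.')
            if name in sys.modules:  # already loaded; don't rebuild it
                continue
            data = zf.read(info.filename)
            code = marshal.loads(data[8:])  # skip magic number and timestamp
            mod = types.ModuleType(name)
            sys.modules[name] = mod  # insert before exec, as explained above
            exec code in mod.__dict__
    finally:
        zf.close()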

A compiled Python file consists of:
a magic number (4 bytes) to determine the type and version of Python,
a timestamp (4 bytes) to check whether we have newer source,
a marshaled code object.
To load the module you have to create a module object with imp.new_module(), execute the unmarshaled code in the new module's namespace, and put it in sys.modules. Below is a sample implementation:
import sys, imp, marshal
def load_compiled_from_memory(name, filename, data, ispackage=False):
    if data[:4] != imp.get_magic():
        raise ImportError('Bad magic number in %s' % filename)
    # Ignore timestamp in data[4:8]
    code = marshal.loads(data[8:])
    imp.acquire_lock()  # Required in threaded applications
    try:
        mod = imp.new_module(name)
        sys.modules[name] = mod  # To handle circular and submodule imports
                                 # it should come before exec.
        try:
            mod.__file__ = filename  # Is not so important.
            # For a package you have to set mod.__path__ here.
            # Here I handle simple cases only.
            if ispackage:
                mod.__path__ = [name.replace('.', '/')]
            exec code in mod.__dict__
        except:
            del sys.modules[name]
            raise
    finally:
        imp.release_lock()
    return mod
Update: the code is updated to handle packages properly.
Note that you have to install an import hook to handle imports inside loaded modules. One way to do this is to add your finder to sys.meta_path. See PEP 302 for more information.
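For illustration, a minimal PEP 302 finder/loader serving modules from an in-memory dict might look like this (a sketch for Python 2.x; the names MemoryImporter and PRELOADED are mine, and load_compiled_from_memory is the function above):
import sys
class MemoryImporter(object):
    # modules: maps dotted name -> (filename, pyc_bytes, ispackage)
    def __init__(self, modules):
        self.modules = modules
    def find_module(self, fullname, path=None):
        return self if fullname in self.modules else None
    def load_module(self, fullname):
        filename, data, ispackage = self.modules[fullname]
        return load_compiled_from_memory(fullname, filename, data, ispackage)
PRELOADED = {}  # fill with entries read from your zipfile
sys.meta_path.append(MemoryImporter(PRELOADED))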

Related

How does import and reload actually work when dealing with packages

I have issues understanding some subtleties of the Python import system. I have condensed my doubts around a minimal example and a number of concrete and related questions detailed below.
I have defined a package in a folder called modules, whose content is an __init__.py and two regular modules, one with general functionality for the package and the other with the definitions for the end user. The content is as simple as:
__init__.py
from .base import *
from .implementation import *
base.py
class FactoryClass():
    registry = {}

    @classmethod
    def add_to_registry(cls, newclass):
        cls.registry[newclass.__name__] = newclass

    @classmethod
    def getobject(cls, classname, *args, **kwargs):
        return cls.registry[classname](*args, **kwargs)

class BaseClass():
    def hello(self):
        print(f"Hello from instance of class {type(self).__name__}")
implementation.py
from .base import BaseClass, FactoryClass

class First(BaseClass):
    pass

class Second(BaseClass):
    pass

FactoryClass.add_to_registry(First)
FactoryClass.add_to_registry(Second)
The user of the package will use it as:
import modules
a = modules.FactoryClass.getobject("First")
b = modules.FactoryClass.getobject("Second")
a.hello()
b.hello()
This works. The problem comes because I'm developing this, and my workflow includes adding functionality in implementation.py and then continually testing it by reloading the module. But I can not understand/predict which module I have to reload to have the functions updated. I'm making changes that have no effect and it drives me crazy (until yesterday I was working on a single large .py file with all the code lumped together, so I had none of these problems).
Here are some tests I have done, and I'd like to understand what's happening and why.
First, I start by commenting out all mentions of the Second class in implementation.py (to pretend it was not yet developed):
from importlib import reload
import modules
modules.base.FactoryClass is modules.FactoryClass # returns True
modules.FactoryClass.registry # just First class is in registry
a = modules.FactoryClass.getobject("First")
b = modules.FactoryClass.getobject("Second") # raises KeyError as expected
This code and its output are pretty clear. The only thing that puzzles me is why there is a modules.base module at all (I did not import it!). Further, it is redundant, as its classes point to the same objects. Why does importing modules also import modules.base and modules.implementation as separate but essentially identical objects?
Now things become interesting: I uncomment the code for Second (i.e. I finish developing it), and I'd like to test it without having to restart the Python session. I have tried 3 different reloads:
reload (modules)
This does absolutely nothing. I'd expect some sort of recursivity, but as I have found in many other threads, this is the expected behavior.
Now I try to manually reload one of those "unexpected" modules:
reload (modules.implementation)
modules.base.FactoryClass is modules.FactoryClass # True
modules.FactoryClass.registry # First and Second
a = modules.FactoryClass.getobject("First")
b = modules.FactoryClass.getobject("Second") # Works as expected
This seems to be the right way to go. It updates the module contents as expected and the new functionality is usable. What puzzles me is why modules.FactoryClass has been updated (its registry) despite the fact that I did not reload the modules.base module. I'd expect this function to stay "outdated".
Finally, and starting from the just freshly uncommented version, I have tried
reload (modules.base)
modules.base.FactoryClass is modules.FactoryClass # False
modules.FactoryClass.registry # just First class is in registry
modules.base.FactoryClass.registry # empty
a = modules.base.FactoryClass.getobject("First")
b = modules.base.FactoryClass.getobject("Second") # raises KeyError
This is very odd. modules.FactoryClass is outdated (Second is unknown). modules.base.FactoryClass.registry is empty. Why are modules.FactoryClass and modules.base.FactoryClass now different objects?
Could someone explain why these three different ways of reloading a package behave so differently?
You are confused about how the Python import system works, so I strongly recommend you read the corresponding documentation: the import system and importlib.reload.
A foreword: code hot-reloading in Python is tricky. I recommend not doing it unless it is required. You have seen it yourself: the bugs are very tricky.
Now to your questions:
Why does importing modules also import modules.base and modules.implementation as separate but essentially identical objects?
As @Kemp answered in a comment (and I upvoted), imports are transitive. When you import a, Python will parse/compile/execute the corresponding library file. If that module does import b, then Python will do the same for the b library file, and again and again. You don't see it, but when your program starts, a lot of things have already been imported.
Given this file:
print("nothing else")
When I set my debugger to pause before executing the print line and look into sys.modules, I already have 338 different modules imported: builtins (where print comes from), sys, itertools, enum, json, ...
Understand that "no visible import statement" does not mean "nothing has been imported".
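You can check this for yourself (a tiny sketch; the exact count varies by version and environment):
import sys
print(len(sys.modules))  # typically dozens of entries before any explicit import
print('builtins' in sys.modules)  # True on Python 3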
When you execute import a, Python first checks its sys.modules cache to determine whether the library has already been read from disk, parsed, compiled, and executed into a module object. If the library has not yet been imported during this program run, then Python takes the time to do all of that. But because it is slow, Python optimizes with a cache.
The result is a module object, which gets bound in the current namespace so that you can access it.
We can summarize it like this:
def import_library(name: str) -> Module:
    if name not in sys.modules:
        # cache miss
        filepath = locate_library(name)
        bytecode = compile_library(filepath)
        module = execute(bytecode)
        sys.modules[name] = module
    # in any case, at this point, the module is available
    return sys.modules[name]
You are thus confusing module objects with variables.
In any module you can declare variables with whatever name you like (as long as Python's grammar allows it). And some of them will reference modules.
Here is an example:
# file: main.py
import lib # create a variable named `lib` referencing the `lib` library
import lib as horse # create a variable named `horse` referencing the `lib` library
print(lib.a.number) # 14
print(horse.a.number) # 14
print(lib is horse) # True
print(lib.a.sublib.__name__) # lib.sublib
import lib.sublib
from lib import sublib
import lib.sublib as lib_sublib
print((lib.sublib is sublib, sublib is lib_sublib, lib.a.zebra is sublib)) # (True, True, True)
import sys
print(sys.modules["lib"] is lib) # True
print(sys.modules["lib.sublib"] is sublib) # True
print(lib.sublib.package_color) # blue
print(lib.sublib.color) # AttributeError: module 'lib.sublib' has no attribute 'color'
# file: lib/__init__.py
from . import a
# file: lib/a.py
from . import sublib
from . import sublib as zebra
number = 14
# file: lib/sublib/__init__.py
from .b import color as package_color
# file: lib/sublib/b.py
color = "blue"
Python offers a lot of flexibility about how to import things, what to expose, how to access. But I admit it is confusing.
Also take a look at the role of __all__ in __init__.py. Given all that, you should now be able to answer your own question about subpackage naming/visibility.
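For example, a hypothetical addition to the lib/sublib/__init__.py file above:
# file: lib/sublib/__init__.py
from .b import color as package_color
__all__ = ["package_color"]  # the names that `from lib.sublib import *` will bind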
reload (modules) This does absolutely nothing. I'd expect some sort of recursivity, but as I have found in many other threads, this is the expected behavior.
Given what I explained, can you now understand what it does, and why that is not what you want it to do?
Because what you want is to get modules.implementation hot-reloaded, but you asked for modules.
>>> from importlib import reload
>>> import lib
>>> lib.sublib.package_color
'blue'
>>> # I edit the "b.py" file so that the color is "red"
>>> old_lib = lib
>>> new_lib = reload(lib)
>>> lib is new_lib, lib is old_lib
(True, True)
>>> lib.sublib.package_color
'blue'
>>> lib.sublib.b.color
'red'
>>> import sys
>>> sys.modules["lib.sublib.b"].color
'red'
First, the top-level reload did not work as hoped: all that file does is import sublib, which hits the cache, so nothing really gets re-executed.
You have to reload the actual module for its content to take effect. But it does not work magically: it creates new objects (the module-level definitions) and puts them into the same module object, but it can't update references to the preceding version's content that may exist elsewhere. That is why we still see "blue" even after the module has been reloaded: package_color is a reference to the first version's color variable, and it does not get updated when the module is reloaded. This is dangerous: there may be different copies of similar things lying around.
why modules.FactoryClass has been updated
You are reloading modules.implementation in this case. What happens is that the whole file gets re-executed to repopulate the module object; I have annotated the perceived effects:
from .base import BaseClass, FactoryClass  # relative library "base" already in the `sys.modules` cache

class First(BaseClass):  # was already defined, redefined
    pass

class Second(BaseClass):  # was not defined yet, created
    pass

FactoryClass.add_to_registry(First)  # overwriting the registry entry for "First" with the redefinition of class `First`
FactoryClass.add_to_registry(Second)  # registering the definition of class `Second` under "Second"
You can see it another way:
>>> import modules
>>> from importlib import reload
>>> First_before = modules.implementation.First
>>> reload(modules.implementation)
<module 'modules.implementation' from 'C:\\PycharmProjects\\stack_overflow\\68069397\\modules\\implementation.py'>
>>> First_after = modules.implementation.First
>>> First_before is First_after
False
When you reload a module, Python re-executes all of its code, possibly producing different results than the previous time(s). Here, each execution calls FactoryClass.add_to_registry, so FactoryClass gets updated with the (re)definitions.
Why are now modules.FactoryClass and modules.base.FactoryClass different objects?
Because you reloaded modules.base, creating a new FactoryClass class object, but the from .base import BaseClass, FactoryClass lines in the other modules did not get re-executed, so they still reference the class object from before the reload.
Because you reloaded, you got yourself a second copy of everything. The problem is that you still have lingering references to the versions from before the reload.
I hope it answers your questions.
Import is not easy, but reloading is notably tricky.
If you truly desire to reload your code, then you will have to take extra extra extra care to correctly re-import everything, in the correct order (if such an order exists, and if there are not too many side effects). You need a proper script to update everything correctly, and if it does not work perfectly, you will frequently have horrible, sad and mind-bending bugs; a sketch of such a script for this package follows.
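For this particular package, such a script would reload the leaves first and work upwards, something like this (a minimal sketch, not a general solution; it assumes no other code is holding references into modules):
from importlib import reload
import modules
reload(modules.base)            # new BaseClass / FactoryClass objects
reload(modules.implementation)  # re-imports from the reloaded base, re-registers
reload(modules)                 # re-runs the `from ... import *` lines in __init__.py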
But if you prefer to keep your sanity, and your workflow does not require you to reload parts of the program, just close it and restart it. There is nothing wrong with that. It is the simple solution. The sane solution.
TL;DR: don't use importlib.reload if you don't know how it works exactly and understand the risks.

Set variable to the name of the module that has imported it

Assume I have two modules:
first.package.module
# Assign module to a variable
which_module = ???
print("I originally live in {}".format(__name__)) # Prints first.package.module
print("I was run from {}".format(which_module))
second.package.module
from first.package.module import *
third.package.module
from first.package.module import *
How can I get the second row, "I was run from", to print second.package.module after the import in second.package.module? Or third.package.module when it is run from there.
The reason I want this behaviour is that I'm building an app in Django that is used several times in the same project, i.e. there are several instances of the same app. Each of these instances has its own models, which inherit from an abstract model. To build reusable views and urls, I want to load the models dynamically, and to do that, I need to know which module has imported the app and runs it.
You can print the name of the module triggering the module load with:
import sys
print sys._getframe(1).f_globals['__name__']
In Python 3 with the new importlib stack you'll need to increase the framecount to 6 (at least for 3.5, 3.6 and 3.7):
print(sys._getframe(6).f_globals['__name__'])
A solution that works across CPython versions is to filter out any importlib._bootstrap* frames in the stack:
import sys
def imported_from(depth=0):
    # skip the frames of this function, and the caller
    f = sys._getframe(2 + depth)
    while f and f.f_code.co_filename.startswith("<frozen importlib._bootstrap"):
        f = f.f_back
    return f and f.f_globals['__name__']
print(imported_from())
The above imported_from() function can be defined in a utility module and imported; the sys._getframe(2) call ensures that the starting context is whatever triggered the imported_from() call. Provide a positive integer as the depth argument to increase the number of frames skipped. The function works on any Python version that provides a sys._getframe() function, whether or not it uses the importlib stack.
Note that this:
Relies on a CPython implementation detail (exposing the frame stack is not available on other Python implementations). See the sys._getframe() function documentation:
CPython implementation detail: This function should be used for internal and specialized purposes only. It is not guaranteed to exist in all implementations of Python.
In Python 3, it depends on the exact implementation details of the importlib stack; ongoing development could add or remove calls in that stack. For example, in Python 3.3, which first introduced importlib, the correct stack count to skip is 9, in 3.4 it went down to 7, and 3.5 dropped it to 6 (a number that's since been stable). imported_from() works around this, but introduces a new implementation-detail-dependent assumption: that the stack frames involved with importing can be detected by looking for the <frozen importlib._bootstrap prefix in the filename. In theory it is possible to compile CPython with importlib._bootstrap left un-frozen (not included in the interpreter binary as an array of marshal data).
Only works when a module is imported for the first time, at which point Python executes the module code to produce the module object stored in sys.modules.
I cannot emphasise the last point enough. Python only ever loads a module once. Importing is a two-step process: loading and binding. The load step is only executed if the module object doesn't exist yet, all other imports for the module only bind names.
Binding names is not something you can practically hook into; import modulename is nothing more than a modulename = sys.modules['modulename'] assignment once the module is already in memory.
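To make that last point concrete, here is a minimal sketch (all file names are hypothetical; util.py is assumed to contain the imported_from() function defined above):
# file: tracked.py
from util import imported_from
print("first imported from:", imported_from())  # runs once, at load time

# file: a.py
import tracked  # loads tracked -> prints "first imported from: a"

# file: b.py
import a        # triggers the print above, attributed to a
import tracked  # cache hit: only binds the name, prints nothing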

In Python how can one tell if a module comes from a C extension?

What is the correct or most robust way to tell from Python if an imported module comes from a C extension as opposed to a pure Python module? This is useful, for example, if a Python package has a module with both a pure Python implementation and a C implementation, and you want to be able to tell at runtime which one is being used.
One idea is to examine the file extension of module.__file__, but I'm not sure which file extensions to check for, or whether this approach is necessarily the most reliable.
tl;dr
See the "In Search of Perfection" subsection below for the well-tested answer.
As a pragmatic counterpoint to abarnert's helpful analysis of the subtlety involved in portably identifying C extensions, Stack Overflow Productions™ presents... an actual answer.
The capacity to reliably differentiate C extensions from non-C extensions is incredibly useful, without which the Python community would be impoverished. Real-world use cases include:
Application freezing, converting one cross-platform Python codebase into multiple platform-specific executables. PyInstaller is the standard example here. Identifying C extensions is critical to robust freezing. If a module imported by the codebase being frozen is a C extension, all external shared libraries transitively linked to by that C extension must be frozen with that codebase as well. Shameful confession: I contribute to PyInstaller.
Application optimization, either statically to native machine code (e.g., Cython) or dynamically in a just-in-time manner (e.g., Numba). For self-evident reasons, Python optimizers necessarily differentiate already compiled C extensions from uncompiled pure-Python modules.
Dependency analysis, inspecting external shared libraries on behalf of end users. In our case, we analyze a mandatory dependency (Numpy) to detect local installations of this dependency linking against non-parallelized shared libraries (e.g., the reference BLAS implementation) and inform end users when this is the case. Why? Because we don't want the blame when our application underperforms due to improper installation of dependencies over which we have no control. Bad performance is your fault, hapless user!
Probably other essential low-level stuff. Profiling, maybe?
We can all agree that freezing, optimization, and minimizing end user complaints are useful. Ergo, identifying C extensions is useful.
The Disagreement Deepens
I also disagree with abarnert's penultimate conclusion that:
The best heuristics anyone has come up with for this are the ones implemented in the inspect module, so the best thing to do is to use that.
No. The best heuristics anyone has come up with for this are those given below. All stdlib modules (including but not limited to inspect) are useless for this purpose. Specifically:
The inspect.getsource() and inspect.getsourcefile() functions ambiguously return None for both C extensions (which understandably have no pure-Python source) and other types of modules that also have no pure-Python source (e.g., bytecode-only modules). Useless.
importlib machinery only applies to modules loadable by PEP 302-compliant loaders and hence visible to the default importlib import algorithm. Useful, but hardly generally applicable. The assumption of PEP 302 compliance breaks down when the real world hits your package in the face repeatedly. For example, did you know that the __import__() built-in is actually overriddable? This is how we used to customize Python's import mechanism – back when the Earth was still flat.
abarnert's ultimate conclusion is also contentious:
…there is no perfect answer.
There is a perfect answer. Much like the oft-doubted Triforce of Hyrulean legend, a perfect answer exists for every imperfect question.
Let's find it.
In Search of Perfection
The pure-Python function that follows returns True only if the passed previously imported module object is a C extension. For simplicity, Python 3.x is assumed.
import inspect, os
from importlib.machinery import ExtensionFileLoader, EXTENSION_SUFFIXES
from types import ModuleType

def is_c_extension(module: ModuleType) -> bool:
    '''
    `True` only if the passed module is a C extension implemented as a
    dynamically linked shared library specific to the current platform.

    Parameters
    ----------
    module : ModuleType
        Previously imported module object to be tested.

    Returns
    ----------
    bool
        `True` only if this module is a C extension.
    '''
    assert isinstance(module, ModuleType), '"{}" not a module.'.format(module)

    # If this module was loaded by a PEP 302-compliant CPython-specific loader
    # loading only C extensions, this module is a C extension.
    if isinstance(getattr(module, '__loader__', None), ExtensionFileLoader):
        return True

    # Else, fallback to filetype matching heuristics.
    #
    # Absolute path of the file defining this module.
    module_filename = inspect.getfile(module)

    # "."-prefixed filetype of this path if any or the empty string otherwise.
    module_filetype = os.path.splitext(module_filename)[1]

    # This module is only a C extension if this path's filetype is that of a
    # C extension specific to the current platform.
    return module_filetype in EXTENSION_SUFFIXES
If it looks long, that's because docstrings, comments, and assertions are good. It's actually only six lines. Eat your elderly heart out, Guido.
Proof in the Pudding
Let's unit test this function with four portably importable modules:
The stdlib pure-Python os.__init__ module. Hopefully not a C extension.
The stdlib pure-Python importlib.machinery submodule. Hopefully not a C extension.
The stdlib _elementtree C extension.
The third-party numpy.core.multiarray C extension.
To wit:
>>> import os
>>> import importlib.machinery as im
>>> import _elementtree as et
>>> import numpy.core.multiarray as ma
>>> for module in (os, im, et, ma):
...     print('Is "{}" a C extension? {}'.format(
...         module.__name__, is_c_extension(module)))
Is "os" a C extension? False
Is "importlib.machinery" a C extension? False
Is "_elementtree" a C extension? True
Is "numpy.core.multiarray" a C extension? True
All's well that ends.
How to do this?
The details of our code are quite inconsequential. Very well, where do we begin?
If the passed module was loaded by a PEP 302-compliant loader (the common case), the PEP 302 specification requires the module object created on importation to define a special __loader__ attribute whose value is the loader object that loaded this module. Hence:
If this value for this module is an instance of the CPython-specific importlib.machinery.ExtensionFileLoader class, this module is a C extension.
Else, either (A) the active Python interpreter is not the official CPython implementation (e.g., PyPy) or (B) the active Python interpreter is CPython but this module was not loaded by a PEP 302-compliant loader, typically due to the default __import__() machinery being overridden (e.g., by a low-level bootloader running this Python application as a platform-specific frozen binary). In either case, fallback to testing whether this module's filetype is that of a C extension specific to the current platform.
Eight line functions with twenty page explanations. Thas just how we rolls.
First, I don't think this is at all useful. It's very common for modules to be pure-Python wrappers around a C extension module—or, in some cases, pure-Python wrappers around a C extension module if it's available, or a pure Python implementation if not.
For some popular third-party examples: numpy is pure Python, even though everything important is implemented in C; bintrees is pure Python, even though its classes may all be implemented either in C or in Python depending on how you build it; etc.
And this is true in most of the stdlib from 3.2 on. For example, if you just import pickle, the implementation classes will be built in C (what you used to get from cPickle in 2.7) in CPython, while they'll be pure-Python versions in PyPy, but either way pickle itself is pure Python.
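You can see that wrapper pattern directly (a quick sketch; the exact output depends on the interpreter and build):
import pickle
print(pickle.__file__)            # a pure-Python file, e.g. .../pickle.py
print(pickle.Pickler.__module__)  # '_pickle' on CPython 3.x (the C accelerator);
                                  # 'pickle' where no accelerator is available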
But if you do want to do this, you actually need to distinguish three things:
Built-in modules, like sys.
C extension modules, like 2.x's cPickle.
Pure Python modules, like 2.x's pickle.
And that's assuming you only care about CPython; if your code runs in, say, Jython, or IronPython, the implementation could be JVM or .NET rather than native code.
You can't distinguish perfectly based on __file__, for a number of reasons:
Built-in modules have no __file__ at all. (This is documented in a few places—e.g., the Types and members table in the inspect docs.) Note that if you're using something like py2app or cx_freeze, what counts as "built-in" may be different from a standalone installation.
A pure-Python module may have a .pyc/.pyo file without having a .py file in a distributed app.
A module in a package installed as a single-file egg (which is common with easy_install, less so with pip) will have either a blank or useless __file__.
If you build a binary distribution, there's a good chance your whole library will be packed in a zip file, causing the same problem as single-file eggs.
In 3.1+, the import process has been massively cleaned up, mostly rewritten in Python, and mostly exposed to the Python layer.
So, you can use the importlib module to see the chain of loaders used to load a module, and ultimately you'll get to BuiltinImporter (builtins), ExtensionFileLoader (.so/.pyd/etc.), SourceFileLoader (.py), or SourcelessFileLoader (.pyc/.pyo).
You can also see the suffixes assigned to each of the four, on the current target platform, as constants in importlib.machinery. So, you could check any(pathname.endswith(suffix) for suffix in importlib.machinery.EXTENSION_SUFFIXES), but that won't actually help in, e.g., the egg/zip case unless you've already traveled up the chain anyway.
The best heuristics anyone has come up with for this are the ones implemented in the inspect module, so the best thing to do is to use that.
The best choice will be one or more of getsource, getsourcefile, and getfile; which is best depends on which heuristics you want.
A built-in module will raise a TypeError for any of them.
An extension module ought to return an empty string for getsourcefile. This seems to work in all the 2.5-3.4 versions I have, but I don't have 2.4 around. For getsource, at least in some versions, it returns the actual bytes of the .so file, even though it should be returning an empty string or raising an IOError. (In 3.x, you will almost certainly get a UnicodeError or SyntaxError, but you probably don't want to rely on that…)
Pure Python modules may return an empty string for getsourcefile if in an egg/zip/etc. They should always return a non-empty string for getsource if source is available, even inside an egg/zip/etc., but if they're sourceless bytecode (.pyc/etc.) they will return an empty string or raise an IOError.
The best bet is to experiment with the version you care about on the platform(s) you care about in the distribution/setup(s) you care about.
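Such an experiment might look like this (a rough sketch; as noted above, the exact behavior varies across versions, builds, and platforms):
import inspect, sys
for name in ('sys', '_elementtree', 'os'):
    __import__(name)
    module = sys.modules[name]
    try:
        source = inspect.getsourcefile(module)  # None/empty for extension modules
    except TypeError:
        source = '<built-in>'  # raised for built-in modules
    print(name, source)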
@Cecil Curry's function is excellent. Two minor comments: firstly, the _elementtree example raises a TypeError with my copy of Python 3.5.6. Secondly, as @crld points out, it's also helpful to know if a module contains C extensions, but a more portable version might help. More generic versions (with Python 3.6+ f-string syntax) may therefore be:
from importlib.machinery import ExtensionFileLoader, EXTENSION_SUFFIXES
import inspect
import logging
import os
import os.path
import pkgutil
from types import ModuleType
from typing import List

log = logging.getLogger(__name__)


def is_builtin_module(module: ModuleType) -> bool:
    """
    Is this module a built-in module, like ``os``?
    Method is as per :func:`inspect.getfile`.
    """
    return not hasattr(module, "__file__")


def is_module_a_package(module: ModuleType) -> bool:
    assert inspect.ismodule(module)
    return os.path.basename(inspect.getfile(module)) == "__init__.py"


def is_c_extension(module: ModuleType) -> bool:
    """
    Modified from
    https://stackoverflow.com/questions/20339053/in-python-how-can-one-tell-if-a-module-comes-from-a-c-extension.

    ``True`` only if the passed module is a C extension implemented as a
    dynamically linked shared library specific to the current platform.

    Args:
        module: Previously imported module object to be tested.

    Returns:
        bool: ``True`` only if this module is a C extension.

    Examples:

    .. code-block:: python

        from cardinal_pythonlib.modules import is_c_extension

        import os
        import _elementtree as et
        import numpy
        import numpy.core.multiarray as numpy_multiarray

        is_c_extension(os)  # False
        is_c_extension(numpy)  # False
        is_c_extension(et)  # False on my system (Python 3.5.6). True in the original example.
        is_c_extension(numpy_multiarray)  # True
    """  # noqa
    assert inspect.ismodule(module), f'"{module}" not a module.'

    # If this module was loaded by a PEP 302-compliant CPython-specific loader
    # loading only C extensions, this module is a C extension.
    if isinstance(getattr(module, '__loader__', None), ExtensionFileLoader):
        return True

    # If it's built-in, it's not a C extension.
    if is_builtin_module(module):
        return False

    # Else, fallback to filetype matching heuristics.
    #
    # Absolute path of the file defining this module.
    module_filename = inspect.getfile(module)

    # "."-prefixed filetype of this path if any or the empty string otherwise.
    module_filetype = os.path.splitext(module_filename)[1]

    # This module is only a C extension if this path's filetype is that of a
    # C extension specific to the current platform.
    return module_filetype in EXTENSION_SUFFIXES


def contains_c_extension(module: ModuleType,
                         import_all_submodules: bool = True,
                         include_external_imports: bool = False,
                         seen: List[ModuleType] = None,
                         verbose: bool = False) -> bool:
    """
    Extends :func:`is_c_extension` by asking: is this module, or any of its
    submodules, a C extension?

    Args:
        module: Previously imported module object to be tested.
        import_all_submodules: explicitly import all submodules of this module?
        include_external_imports: check modules in other packages that this
            module imports?
        seen: used internally for recursion (to deal with recursive modules);
            should be ``None`` when called by users
        verbose: show working via log?

    Returns:
        bool: ``True`` only if this module or one of its submodules is a C
        extension.

    Examples:

    .. code-block:: python

        import logging

        import _elementtree as et
        import os

        import arrow
        import alembic
        import django
        import numpy
        import numpy.core.multiarray as numpy_multiarray

        log = logging.getLogger(__name__)
        logging.basicConfig(level=logging.DEBUG)  # be verbose

        contains_c_extension(os)  # False
        contains_c_extension(et)  # False

        contains_c_extension(numpy)  # True -- different from is_c_extension()
        contains_c_extension(numpy_multiarray)  # True

        contains_c_extension(arrow)  # False

        contains_c_extension(alembic)  # False
        contains_c_extension(alembic, include_external_imports=True)  # True
        # ... this example shows that Alembic imports hashlib, which can import
        # _hashlib, which is a C extension; however, that doesn't stop us (for
        # example) installing Alembic on a machine with no C compiler

        contains_c_extension(django)
    """  # noqa
    assert inspect.ismodule(module), f'"{module}" not a module.'

    if seen is None:  # only true for the top-level call
        seen = []  # type: List[ModuleType]
    if module in seen:  # modules can "contain" themselves
        # already inspected; avoid infinite loops
        return False
    seen.append(module)

    # Check the thing we were asked about
    is_c_ext = is_c_extension(module)
    if verbose:
        log.info(f"Is module {module!r} a C extension? {is_c_ext}")
    if is_c_ext:
        return True
    if is_builtin_module(module):
        # built-in, therefore we stop searching it
        return False

    # Now check any children, in a couple of ways
    top_level_module = seen[0]
    top_path = os.path.dirname(top_level_module.__file__)

    # Recurse using dir(). This picks up modules that are automatically
    # imported by our top-level model. But it won't pick up all submodules;
    # try e.g. for django.
    for candidate_name in dir(module):
        candidate = getattr(module, candidate_name)
        # noinspection PyBroadException
        try:
            if not inspect.ismodule(candidate):
                # not a module
                continue
        except Exception:
            # e.g. a Django module that won't import until we configure its
            # settings
            log.error(f"Failed to test ismodule() status of {candidate!r}")
            continue
        if is_builtin_module(candidate):
            # built-in, therefore we stop searching it
            continue

        candidate_fname = getattr(candidate, "__file__")
        if not include_external_imports:
            if os.path.commonpath([top_path, candidate_fname]) != top_path:
                if verbose:
                    log.debug(f"Skipping, not within the top-level module's "
                              f"directory: {candidate!r}")
                continue
        # Recurse:
        if contains_c_extension(
                module=candidate,
                import_all_submodules=False,  # only done at the top level, below  # noqa
                include_external_imports=include_external_imports,
                seen=seen):
            return True

    if import_all_submodules:
        if not is_module_a_package(module):
            if verbose:
                log.debug(f"Top-level module is not a package: {module!r}")
            return False

        # Otherwise, for things like Django, we need to recurse in a different
        # way to scan everything.
        # See https://stackoverflow.com/questions/3365740/how-to-import-all-submodules.  # noqa
        log.debug(f"Walking path: {top_path!r}")
        try:
            for loader, module_name, is_pkg in pkgutil.walk_packages([top_path]):  # noqa
                if not is_pkg:
                    log.debug(f"Skipping, not a package: {module_name!r}")
                    continue
                log.debug(f"Manually importing: {module_name!r}")
                # noinspection PyBroadException
                try:
                    candidate = loader.find_module(module_name)\
                        .load_module(module_name)  # noqa
                except Exception:
                    # e.g. Alembic "autogenerate" gives: "ValueError: attempted
                    # relative import beyond top-level package"; or Django
                    # "django.core.exceptions.ImproperlyConfigured"
                    log.error(f"Package failed to import: {module_name!r}")
                    continue
                if contains_c_extension(
                        module=candidate,
                        import_all_submodules=False,  # only done at the top level  # noqa
                        include_external_imports=include_external_imports,
                        seen=seen):
                    return True
        except Exception:
            log.error("Unable to walk packages further; no C extensions "
                      "detected so far!")
            raise

    return False


# noinspection PyUnresolvedReferences,PyTypeChecker
def test() -> None:
    import _elementtree as et

    import arrow
    import alembic
    import django
    import django.conf
    import numpy
    import numpy.core.multiarray as numpy_multiarray

    log.info(f"contains_c_extension(os): "
             f"{contains_c_extension(os)}")  # False
    log.info(f"contains_c_extension(et): "
             f"{contains_c_extension(et)}")  # False

    log.info(f"is_c_extension(numpy): "
             f"{is_c_extension(numpy)}")  # False
    log.info(f"contains_c_extension(numpy): "
             f"{contains_c_extension(numpy)}")  # True
    log.info(f"contains_c_extension(numpy_multiarray): "
             f"{contains_c_extension(numpy_multiarray)}")  # True  # noqa

    log.info(f"contains_c_extension(arrow): "
             f"{contains_c_extension(arrow)}")  # False

    log.info(f"contains_c_extension(alembic): "
             f"{contains_c_extension(alembic)}")  # False
    log.info(f"contains_c_extension(alembic, include_external_imports=True): "
             f"{contains_c_extension(alembic, include_external_imports=True)}")  # True  # noqa
    # ... this example shows that Alembic imports hashlib, which can import
    # _hashlib, which is a C extension; however, that doesn't stop us (for
    # example) installing Alembic on a machine with no C compiler

    django.conf.settings.configure()
    log.info(f"contains_c_extension(django): "
             f"{contains_c_extension(django)}")  # False


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)  # be verbose
    test()
While Cecil Curry's answer works (and was very informative, as was abarnert's, I might add), it will return False for the "top level" of a module even if it includes sub-modules that use a C extension (e.g. numpy vs. numpy.core.multiarray).
While probably not as robust as it could be, the following works for my current use cases:
def is_c(module):
    # if module is part of the main python library (e.g. os), it won't have
    # a path, and we fall through to the AttributeError branch below
    try:
        is_c = False  # stays False if no .so file is found
        for path, subdirs, files in os.walk(module.__path__[0]):
            for f in files:
                ftype = f.split('.')[-1]
                if ftype == 'so':
                    is_c = True
                    break
        return is_c
    except AttributeError:
        path = inspect.getfile(module)
        suffix = path.split('.')[-1]
        return suffix == 'so'
is_c(os), is_c(im), is_c(et), is_c(ma), is_c(numpy)
# (False, False, True, True, True)
If you, like me, saw @Cecil Curry's great answer and thought, how could I do this for an entire requirements file in a super lazy way without @Rudolf Cardinal's complex child library traversal, look no further!
First, dump all your installed requirements (assuming you did this in a virtual env and don't have other stuff in here) into a file with pip freeze > requirements.txt.
Then run the following script to check each of those requirements.
Note: this is super lazy and WILL NOT work for many libraries whose import names don't match their pip names.
import inspect, os
import importlib
from importlib.machinery import ExtensionFileLoader, EXTENSION_SUFFIXES
from types import ModuleType

# function from Cecil Curry's answer:
def is_c_extension(module: ModuleType) -> bool:
    '''
    `True` only if the passed module is a C extension implemented as a
    dynamically linked shared library specific to the current platform.

    Parameters
    ----------
    module : ModuleType
        Previously imported module object to be tested.

    Returns
    ----------
    bool
        `True` only if this module is a C extension.
    '''
    assert isinstance(module, ModuleType), '"{}" not a module.'.format(module)

    # If this module was loaded by a PEP 302-compliant CPython-specific loader
    # loading only C extensions, this module is a C extension.
    if isinstance(getattr(module, '__loader__', None), ExtensionFileLoader):
        return True

    # Else, fallback to filetype matching heuristics.
    #
    # Absolute path of the file defining this module.
    module_filename = inspect.getfile(module)

    # "."-prefixed filetype of this path if any or the empty string otherwise.
    module_filetype = os.path.splitext(module_filename)[1]

    # This module is only a C extension if this path's filetype is that of a
    # C extension specific to the current platform.
    return module_filetype in EXTENSION_SUFFIXES


with open('requirements.txt') as f:
    lines = f.readlines()

for line in lines:
    # super lazy pip name to library name conversion
    # there is probably a better way to do this.
    lib = line.split("=")[0].replace("python-", "").replace("-", "_").lower()
    try:
        mod = importlib.import_module(lib)
        print(f"is {lib} a c extension? : {is_c_extension(mod)}")
    except:
        print(f"could not check {lib}, perhaps the name for imports is different?")

How does Python manage sys.modules? [duplicate]

Importing the standard "logging" module pollutes sys.modules with a bunch of dummy entries:
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
>>> import sys
>>> import logging
>>> sorted(x for x in sys.modules.keys() if 'log' in x)
['logging', 'logging.atexit', 'logging.cStringIO', 'logging.codecs',
'logging.os', 'logging.string', 'logging.sys', 'logging.thread',
'logging.threading', 'logging.time', 'logging.traceback', 'logging.types']
# and perhaps even more surprising:
>>> import traceback
>>> traceback is sys.modules['logging.traceback']
False
>>> sys.modules['logging.traceback'] is None
True
So importing this package puts extra names into sys.modules, except that they are not modules, just references to None. Other modules (e.g. xml.dom and encodings) have this issue as well. Why?
Edit: Building on bobince's answer, there are pages describing the origin (see section "Dummy Entries in sys.modules") and future of the feature.
None values in sys.modules are cached failures of relative lookups.
So when you're in package foo and you import sys, Python looks first for a foo.sys module, and if that fails, goes to the top-level sys module. To avoid having to check the filesystem for foo/sys.py again on further relative imports, it stores None in sys.modules to flag that the module didn't exist, so a subsequent import won't look there again but will go straight to the loaded sys.
This is a CPython implementation detail you can't usefully rely on, but you will need to know it if you're doing nasty magic import/reload hacking.
It happens to all packages, not just logging. For example, import xml.dom and see xml.dom.xml in the module list as it tries to import xml from inside xml.dom.
As Python moves towards absolute imports, this ugliness will happen less.
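In Python 2.5+ you can already opt into absolute-import semantics per module, which skips the relative lookup (and hence the dummy entry) in the first place; a minimal sketch with a hypothetical package foo:
# file: foo/bar.py
from __future__ import absolute_import
import sys  # goes straight to the top-level sys; no 'foo.sys' lookup,
            # so no None entry is cached in sys.modules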

fake python modules via symlinks: on windows?

I have several compiled python modules; they are put into a single .so (there are cross-module symbol dependencies, and a single library avoids runtime linking issues), but a number of symlinks point to this .so:
foo.so -> liball.so
bar.so -> liball.so
liball.so
This way, I can do import foo (Python will call initfoo() defined in liball.so) or import bar (which calls initbar()).
I am wondering if this approach will work on Windows?
Probably not, but you could achieve your goal with
import sys
import liball
sys.modules['foo'] = liball
sys.modules['bar'] = liball
if you need to import them at several places, or with
import liball as foo, liball as bar, liball
if you need that only at one place.
It might be, however, that the distinction between initfoo() and initbar() cannot be maintained this way, and that the single module must effectively contain everything that both modules would contain.
If foo partially contains the same symbols as bar, but with a different meaning, this approach won't work. But then you can just copy the file. This will occupy more disk space than needed, but that doesn't hurt so much, IMHO.
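If you'd rather not patch every importer, the same sys.modules trick can be packaged into two tiny stub files, which works on Windows without symlinks (a sketch, under the same assumption that liball can serve both names):
# file: foo.py -- stub standing in for the foo.so symlink
import sys, liball
sys.modules['foo'] = liball  # later `import foo` statements get liball

# file: bar.py -- same trick for bar
import sys, liball
sys.modules['bar'] = liball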
