Python `pkgutil.get_data` disrupts future imports - python

Consider the following package structure:
.
├── module
│   ├── __init__.py
│   └── submodule
│   ├── attribute.py
│   ├── data.txt
│   └── __init__.py
└── test.py
and the following piece of code:
import pkgutil
data = pkgutil.get_data('module.submodule', 'data.txt')
import module.submodule.attribute
retval = module.submodule.attribute.hello()
Running this will raise the error:
Traceback (most recent call last):
File "test.py", line 7, in <module>
retval = module.submodule.attribute.hello()
AttributeError: module 'module' has no attribute 'submodule'
However, if you run the following:
import pkgutil
import module.submodule.attribute
data = pkgutil.get_data('module.submodule', 'data.txt')
retval = module.submodule.attribute.hello()
or
import pkgutil
import module.submodule.attribute
retval = module.submodule.attribute.hello()
it works fine.
Why does running pkgutil.get_data disrupt the future import?

First of all, this was a great question and a great opportunity to learn something new about python's import system. So let's dig in!
If we look at the implementation of pkgutil.get_data we see something like this:
def get_data(package, resource):
spec = importlib.util.find_spec(package)
if spec is None:
return None
loader = spec.loader
if loader is None or not hasattr(loader, 'get_data'):
return None
# XXX needs test
mod = (sys.modules.get(package) or
importlib._bootstrap._load(spec))
if mod is None or not hasattr(mod, '__file__'):
return None
# Modify the resource name to be compatible with the loader.get_data
# signature - an os.path format "filename" starting with the dirname of
# the package's __file__
parts = resource.split('/')
parts.insert(0, os.path.dirname(mod.__file__))
resource_name = os.path.join(*parts)
return loader.get_data(resource_name)
And the answer to your question is in this part of the code:
mod = (sys.modules.get(package) or
importlib._bootstrap._load(spec))
It looks at the already loaded packages and if the package we're looking for (module.submodule in this example) exists it uses it and if not, then tries to load the package using importlib._bootstrap._load.
So let's look at the implementation of importlib._bootstrap._load to see what's going on.
def _load(spec):
"""Return a new module object, loaded by the spec's loader.
The module is not added to its parent.
If a module is already in sys.modules, that existing module gets
clobbered.
"""
with _ModuleLockManager(spec.name):
return _load_unlocked(spec)
Well, There's right there! The doc says "The module is not added to its parent."
It means the submodule module is loaded but it's not added to the module module. So when we try to access the submodule via module there's no connection, hence the AtrributeError.
It makes sense for the get_data method to use this function as it just wants some other file in the package and there is no need to import the whole package and add it to its parent and its parents' parent and so on.
to see it yourself I suggest using a debugger and setting some breakpoints. Then you can see what happens step by step along the way.

Related

python issue while importing a module from a file

the below is my main_call.py file
from flask import Flask, jsonify, request
from test_invoke.invoke import end_invoke
from config import config
app = Flask(__name__)
#app.route("/get/posts", methods=["GET"])
def load_data():
res = "True"
# setting a Host url
host_url = config()["url"]
# getting request parameter and validating it
generate_schedule= end_invoke(host_url)
if generate_schedule == 200:
return jsonify({"status_code": 200, "message": "success"})
elif generate_schedule == 400:
return jsonify(
{"error": "Invalid ", "status_code": 400}
)
if __name__ == "__main__":
app.run(debug=True)
invoke.py
import requests
import json
import urllib
from urllib import request, parse
from config import config
from flask import request
def end_invoke(schedule_url):
headers = {
"Content-Type":"application/json",
}
schedule_data = requests.get(schedule_url, headers=headers)
if not schedule_data.status_code // 100 == 2:
error = schedule_data.json()["error"]
print(error)
return 400
else:
success = schedule_data.json()
return 200
tree structure
test_invoke
├── __init__.py
├── __pycache__
│   ├── config.cpython-38.pyc
│   └── invoke.cpython-38.pyc
├── config.py
├── env.yaml
├── invoke.py
└── main_call.py
However when i run, i get the no module found error
python3 main_call.py
Traceback (most recent call last):
File "main_call.py", line 3, in <module>
from test_invoke.invoke import end_invoke
ModuleNotFoundError: No module named 'test_invoke'
Python looks for packages and modules in its Python path. It searches (in that order):
the current directory (which may not be the path of the current Python module...)
the content of the PYTHONPATH environment variable
various (implementation and system dependant) system paths
As test_invoke is indeed a package, nothing is a priori bad in using it at the root for its modules provided it is accessible from the Python path.
But IMHO, it is always a bad idea to directly start a python module that resides inside a package. Better to make the package accessible and then use relative imports inside the package:
rename main_call.py to __main__.py
replace the offending import line with from .invoke import end_invoke
start the package as python -m test_invoke either for the directory containing test_invoke or after adding that directory to the PYTHONPATH environment variable
That way, the import will work even if you start your program from a different current directory.
You are trying to import file available in the current directory.
So, please replace line
from test_invoke.invoke import end_invoke with from invoke import end_invoke

Implicitly use namespace in imported modules in Python

I'm trying to make a library out of a Python project I don't own.
The project has the following directory layout:
.
├── MANIFEST.in
├── pyproject.toml
└── src
   ├── all.py
   ├── the.py
   └── sources.py
In pyproject.toml I have:
[tool.setuptools]
packages = ["mypkg"]
[tool.setuptools.package-dir]
mypkg = "src"
The problem I'm facing is that when I build and install this package I can't use it because the author is importing stuff without mypkg prefix in the various source files.
F.ex. in all.py
from the import SomeThing
Since I don't own the package I can't go modify all the sources but I still want to be able to build a library from it by just adding MANIFEST.in and pyproject.toml.
Is it possible to somehow instruct setuptools to build a package that won't litter site-packages with all the sources while still allowing them to be imported without the mypkg prefix?
It isn't possible without adding a custom import hook with the package. The hook takes the form of a module that is shipped with the package, and it must be imported before usage from your module (e.g. in src/all.py)
src/mypkgimp.py
import sys
import importlib
class MyPkgLoader(importlib.abc.Loader):
def find_spec(self, name, path=None, target=None):
# update the list with modules that should be treated special
if name in ['sources', 'the']:
return importlib.util.spec_from_loader(name, self)
return None
def create_module(self, spec):
# Uncomment if "normal" imports should have precedence
# try:
# sys.meta_path = [x for x in sys.meta_path[:] if x is not self]
# return importlib.import_module(spec.name)
# except ImportError:
# pass
# finally:
# sys.meta_path = [self] + sys.meta_path
# Otherwise, this will unconditionally shadow normal imports
module = importlib.import_module('.' + spec.name, 'mypkg')
# Final step: inject the module to the "shortened" name
sys.modules[spec.name] = module
return module
def exec_module(self, module):
pass
if not hasattr(sys, 'frozen'):
sys.meta_path = [MyPkgLoader()] + sys.meta_path
Yes, the above uses different methods described by the thread I have linked previously, as importlib have deprecated those methods in Python 3.10, refer to documentation for details.
Anyway, for the demo, put some dummy classes in the modules:
src/the.py
class SomeThing: ...
src/sources.py
class Source: ...
Now, modify src/all.py to have the following:
import mypkg.mypkgimp
from the import SomeThing
Example usage:
>>> from sources import Source
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'sources'
>>> from mypkg import all
>>> all.SomeThing
<class 'mypkg.the.SomeThing'>
>>> from sources import Source
>>> Source
<class 'mypkg.sources.Source'>
>>> from sources import Error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'Error' from 'mypkg.sources' (/tmp/mypkg/src/sources.py)
Note how the import initially didn't work, but after mypkg.all got imported, the sources import now works globally. Hence care may be needed to not shadow "real" imports and I have provided the example to import using the "default"[*] import mechanism.
If you want the module names to look different (i.e. without the mypkg. prefix), that will be a separate question, as code typically don't check for their own module name for functionality (and never mind that this actually shows how the namespace is implicitly used - changing the actual name is more akin to a module relocation, yes this can be done, but a bit more complicated and this answer is long enough as it is).
[*] "default" as in not including behaviors introduced by this custom import hook - other import hooks may do their own other weird shenanigans.

How can I get module name automatically from __init__.py or conftest.py?

I am running multiple tests in a tests package, and I want to print each module name in the package, without duplicating code.
So, I wanted to insert some code to __init__.py or conftest.py that will give me the executing module name.
Let's say my test modules are called: checker1, checker2, etc...
My directory structure is like this:
tests_dir/
├── __init__.py
├── conftest.py
├── checker1
├── checker2
└── checker3
So, inside __init__.py I tried inserting:
def module_name():
return os.path.splitext(__file__)[0]
But it still gives me __init__.py from each file when I call it.
I also tried using a fixture inside conftest.py, like:
#pytest.fixture(scope='module')
def module_name(request):
return request.node.name
But it seems as if I still need to define a function inside each module to get module_name as a parameter.
What is the best method of getting this to work?
Edit:
In the end, what I did is explained here:
conftest.py
#pytest.fixture(scope='module', autouse=True)
def module_name(request):
return request.node.name
example for a test file with a test function. The same needs to be added to each file and every function:
checker1.py
from conftest import *
def test_columns(expected_res, actual_res, module_name):
expected_cols = expected_res.columns
actual_cols = actual_res.columns
val = expected_cols.difference(actual_cols) # verify all expected cols are in actual_cols
if not val.empty:
log.error('[{}]: Expected columns are missing: {}'.format(module_name, val.values))
assert val.empty
Notice the module_name fixture I added to the function's parameters.
expected_res and actual_res are pandas Dataframes from excel file.
log is a Logger object from logging package
In each module (checker1, checker2, checker3, conftest.py), in the main function, execute
print(__name__)
When the __init__.py file imports those packages, it should print the module name along with it.
Based on your comment, you can perhaps modify the behaviour in the __init__.py file for local imports.
__init.py__
import sys, os
sys.path.append(os.path.split(__file__)[0])
def my_import(module):
print("Module name is {}".format(module))
exec("import {}".format(module))
testerfn.py
print(__name__)
print("Test")
Directory structure
tests_dir/
├── __init__.py
└── testerfn.py
Command to test
import tests_dir
tests_dir.my_import("testerfn")

How to access symbols in __init__.py from __main__.py?

I have a module - let's call it foo - and I want to make it usable via a python -m foo call. My program look like this:
my_project
├── foo
│   └── __init__.py
└── my_program.py
In __init__.py I have some code which I run when calling python -m foo:
def bar(name):
print(name)
# -- code used to 'run' the module
def main(name):
bar("fritz")
if __name__ == "__main__":
main()
Since I have a fair amount of execution code in __init__.py now (argparse stuff and some logic) I want to separate it into a __main__.py:
my_project
├── foo
│   ├── __init__.py
│   └── __main__.py
└── my_program.py
Despite that looks very simple to me I didn't manage to import stuff located in __init__.py from __main__.py yet.
I know - if foo is located in site-packages or accessible via PYTHONPATH I can just import foo..
But in case I want to execute __main__.py directly (e.g. from some IDE) with foo located anywhere (i.e. not a folder where Python looks for packages) - is there a way to import foo (__init__.py from the same directory)?
I tried import . and import foo - but both approaches fail (because they just mean something else of course)
What I can do - at least to explain my goal - is something like this:
sys.path.append(os.path.join(os.path.dirname(__file__), ".."))
import foo
Works, but is ugly and a bit dangerous since I don't even know if I really import foo from the same directory..
You can manually set the module import state as if __main__.py were executed with -m:
# foo/__main__.py
import os
import sys
if __package__ is None and __name__ == "__main__": # executed without -m
# set special attributes as if part of the package
__file__ = os.path.abspath(__file__)
__package__ = os.path.basename(os.path.dirname(__file__))
# replace import path for __main__ with path for package
main_path = os.path.dirname(__file__)
try:
index = sys.path.index(dir_path)
if index != 0 or index != 1:
raise ValueError('expected script directory after current directory or matching it')
except ValueError:
raise RuntimeError('sys.path does not include script directory as expected')
else:
sys.path[index] = main_path
# import regularly
from . import bar
This exploits that python3 path/to/foo/__main__.py executes __main__ as a standalone script: __package__ is None and the __name__ does not include the package either. The search path in this case is <current directory>, <__main__ directory>, ..., though it gets collapsed if the two are the same: the index is either 0 or 1.
As with all trickery on internals, there is some transient state where invariants are violated. Do not perform any imports before the module is patched!

How to access the current executing module's attributes from other modules?

I have several 'app'-modules (which are being started by a main-application)
and a utility module with some functionality:
my_utility/
├── __init__.py
└── __main__.py
apps/
├── app1/
│ ├── __init__.py
│ └── __main__.py
├── app2/
│ ├── __init__.py
│ └── __main__.py
...
main_app.py
The apps are being started like this (by the main application):
python3 -m <app-name>
I need to provide some meta information (tied to the module) about each app which is readable by the main_app and the apps themselves:
apps/app1/__init__.py:
meta_info = {'min_platform_version': '1.0',
'logger_name': 'mm1'}
... and use it like this:
apps/app1/__main__.py:
from my_utility import handle_meta_info
# does something with meta_info (checking, etc.)
handle_meta_info()
main_app.py:
mod = importlib.import_module('app1')
meta_inf = getattr(mod, 'meta_info')
do_something(meta_inf)
The Problem
I don't know how to access meta_info from within the apps. I know I can
import the module itself and access meta_info:
apps/app1/__main__.py:
import app1
do_something(app1.meta_info)
But this is only possible if I know the name of the module. From inside another module - e.g. my_utility I don't know how to access the module which has been started in the first place (or it's name).
my_utility/__main__.py:
def handle_meta_info():
import MAIN_MODULE <-- don't know, what to import here
do_something(MAIN_MODULE.meta_info)
In other words
I don't know how to access meta_info from within an app's process (being started via python3 -m <name> but from another module which does not know the name of the 'root' module which has been started
Approaches
Always provide the module name when calling meta-info-functions (bad, because it's verbose and redundant)
from my_utility import handle_meta_info
handle_meta_info('app1')
add meta_info to __builtins__ (generally bad to pollute global space)
Parse the command line (ugly)
Analyze the call stack on import my_utility (dangerous, ugly)
The solution I'd like to see
It would be nice to be able to either access the "main" modules global space OR know it's name (to import)
my_utility/__main__.py:
def handle_meta_info():
do_something(__main_module__.meta_info)
OR
def handle_meta_info():
if process_has_been_started_as_module():
mod = importlib.import_module(name_of_main_module())
meta_inf = getattr(mod, 'meta_info')
do_something(meta_inf)
Any ideas?
My current (bloody) solution:
Inside my_utility I use psutil to get the command line the module has been started with (why not sys.argv? Because). There I extract the module name. This way I attach the desired meta information to my_utility (so I have to load it only once).
my_utility/__init__.py:
def __get_executed_modules_meta_info__() -> dict:
def get_executed_module_name()
from psutil import Process
from os import getpid
_cmdline = Process(getpid()).cmdline
try:
# normal case: app has been started via 'python3 -m <app>'
return _cmdline[_cmdline.index('-m') + 1]
except ValueError:
return None
from importlib import import_module
try:
_main_module = import_module(get_module_name())
return import_module(get_executed_module_name()).meta_info
except AttributeError:
return {}
__executed_modules_meta_info__ = __get_executed_modules_meta_info__()

Categories