I have always been confused by, and never felt confident about, relative imports in Python. Here I am trying to pin down exactly what I don't understand, and then seek Stack Overflow's help to fix that.
The reason why PEP 366 exists:
Python relative imports are based on the __name__ attribute. __name__ is parsed to determine the relative position of the module in the package hierarchy. The __name__ attribute for foo.py is always __main__ if it is run like python foo.py from anywhere at all. This means there can never be a relative import statement in foo.py since __main__ has no package information. By package information I mean: __name__ is, for instance, set to a.b.foo. This gets parsed as package a, subpackage b and module foo.
This is the problem that PEP 366 fixes. It introduced a __package__ attribute that is used, together with the -m switch, to determine the relative position of the module in the package hierarchy. One can use -m to run foo like so: python -m foo
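A quick way to see the difference (a minimal sketch; the pkg/foo.py layout and the bar module are hypothetical):

# Layout:  pkg/__init__.py  and  pkg/foo.py, where foo.py contains:
print("__name__    =", __name__)
print("__package__ =", __package__)

# From the directory containing pkg/:
#   python pkg/foo.py   ->  __name__ == '__main__', __package__ is None (or '' on some versions)
#   python -m pkg.foo   ->  __name__ == '__main__', __package__ == 'pkg'
# Only the -m form records the package context needed to resolve `from . import bar`.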
Now, here is PEP 366; my questions are in bold.
The major proposed change is the introduction of a new module level attribute, __package__. When it is present, relative imports will be based on this attribute rather than the module __name__ attribute. **Will it ever be the case that it is not present? I know it can be None, '', or __name__.rpartition('.')[0], but will referencing __package__ ever throw an AttributeError? Is it safe to say that __package__ is actually always present?**
As with the current __name__ attribute, setting __package__ will be the responsibility of the PEP 302 loader used to import a module. Loaders which use imp.new_module() to create the module object will have the new attribute set automatically to None. When the import system encounters an explicit relative import in a module without __package__ set **(just to be clear, by "module" do they mean only import foo or python -m foo? python foo.py means a script, not a module. Am I correct?)** (or with it set to None), it will calculate and store the correct value (__name__.rpartition('.')[0] for normal modules and __name__ for package initialisation modules). **The language suggests that __package__ gets set only if a relative import statement is present. Is that the case? Will __package__ never be set if there are no relative import statements?** If __package__ has already been set **(does this mean the user setting __package__ manually in code?)**, then the import system will use it in preference to recalculating the package name from the __name__ and __path__ attributes.
The runpy module will explicitly set the new attribute, basing it off the name used to locate the module to be executed rather than the name used to set the module's __name__ attribute. This will allow relative imports to work correctly from main modules executed with the -m switch. **What is being said here? I thought there was just one __name__ attribute, yet here they are talking about two different names.**
When the main module is specified by its filename **(what is meant by this statement? Is that not always the case? Can you give an example where it is not?)**, then the __package__ attribute will be set to None. To allow relative imports when the module is executed directly, boilerplate similar to the following would be needed before the first relative import statement:
if __name__ == "__main__" and __package__ is None:
    __package__ = "expected.package.name"
Note that this boilerplate is sufficient only if the top level package is already accessible via sys.path. Additional code that manipulates sys.path would be needed in order for direct execution to work without the top level package already being importable. **(Does this boilerplating actually happen under the hood, or does the user have to take care of it explicitly?)**
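For concreteness, a sketch of what that fuller boilerplate, including the sys.path manipulation, might look like; the directory layout, the parents[3] depth, and the sibling module are illustrative assumptions, not from the PEP:

# Hypothetical module at <root>/expected/package/name/module.py, run directly as a script.
if __name__ == "__main__" and __package__ is None:
    import sys
    import importlib
    from pathlib import Path

    # Make <root> (the directory above the top-level package) importable.
    sys.path.insert(0, str(Path(__file__).resolve().parents[3]))
    __package__ = "expected.package.name"
    importlib.import_module(__package__)  # ensure the parent package itself is loaded

from . import sibling  # this relative import now resolves against "expected.package.name"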
This approach also has the same disadvantage as the use of absolute imports of sibling modules: if the script is moved to a different package or subpackage, the boilerplate will need to be updated manually. It has the advantage that this change need only be made once per file, regardless of the number of relative imports. **Please elucidate this, preferably with an example of the disadvantage and the advantage they are talking about here.**
Note that setting __package__ to the empty string explicitly is permitted, and has the effect of disabling all relative imports from that module (since the import machinery will consider it to be a top level module in that case). This means that tools like runpy do not need to provide special case handling for top level modules when setting __package__.
Related
The Python documentation for the import statement (link) contains the following:
The public names defined by a module are determined by checking the module’s namespace for a variable named __all__; if defined, it must be a sequence of strings which are names defined or imported by that module.
The Python documentation for modules (link) contains what is seemingly a contradictory statement:
if a package’s __init__.py code defines a list named __all__, it is taken to be the list of module names that should be imported when from package import * is encountered.
It then gives an example where an __init__.py file imports nothing, and simply defines __all__ to be some of the names of modules in that package.
I have tested both ways of using __all__, and both seem to work; indeed one can mix and match within the same __all__ value.
For example, consider the directory structure
foopkg/
    __init__.py
    foo.py
Where __init__.py contains
# Note no imports
def bar():
    print("BAR")

__all__ = ["bar", "foo"]
NOTE: I know one shouldn't define functions in an __init__.py file. I'm just doing it to illustrate that the same __all__ can export both names that do exist in the current namespace, and those which do not.
The following code runs, seemingly auto-importing the foo module:
>>> from foopkg import *
>>> dir()
[..., 'bar', 'foo']
Why does the __all__ attribute have this strange double-behaviour?
The docs seem really unclear on how it is supposed to be used, only mentioning one of its two sides in each place I linked. I understand the overall purpose is to explicitly set the names imported by a wildcard import, but am confused by the additional, seemingly auto-importing behaviour. Is this just a magic shortcut that avoids having to write the import out as well?
The documentation is a bit hard to parse because it does not mention that packages generally also have the behavior of modules, including their __all__ attribute. The behavior of packages is necessarily a superset of the behavior of modules, because packages, unlike modules, can have sub-packages and sub-modules. Behaviors not related to that feature are identical between the two as far as the end-user is concerned.
The Python docs can be minimalistic at times. They did not bother to mention that:
Package __init__ performs all the module-like code for a package, including support for star-import for direct attributes via __all__, just like a module does.
Modules support all the features of a package __init__.py, except that they can't have a sub-package or sub-module.
It goes without saying that to make a name refer to a sub-module, the sub-module has to be imported; hence the apparent, but not actual, double standard.
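For contrast, here is what the foopkg example looks like when the sub-module is imported explicitly, so that __all__ only re-exports names that already exist in the package namespace (a sketch):

# foopkg/__init__.py -- explicit version, no reliance on auto-import
from . import foo   # bind the sub-module name in the package namespace

def bar():
    print("BAR")

__all__ = ["bar", "foo"]   # every name listed here is already defined above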
Update: how does from M import * actually work?
The __all__ in the __init__.py of the folder foopkg works the same way as __all__ in an ordinary module.
Why it auto-imports foo is explained here: https://stackoverflow.com/a/54799108/12565014
The most important thing is to look at the CPython implementation: https://github.com/python/cpython/blob/fee552669f21ca294f57fe0df826945edc779090/Python/ceval.c#L5152
It basically loops through __all__ and tries to import each element of __all__.
That's why it auto-imports foo and also achieves whitelisting.
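Roughly, the effective behaviour can be emulated in pure Python like this (a simplified sketch, not the actual implementation; the real work is split between ceval.c and importlib._bootstrap):

import importlib

def star_import(module, target_namespace):
    # Emulate `from module import *` (simplified).
    names = getattr(module, "__all__", None)
    if names is None:
        # No __all__: take every public (non-underscore) attribute.
        names = [n for n in vars(module) if not n.startswith("_")]
    for name in names:
        try:
            value = getattr(module, name)
        except AttributeError:
            # Listed in __all__ but not yet an attribute: treat it as a
            # sub-module and import it -- this is the "auto-import" of foo.
            value = importlib.import_module(module.__name__ + "." + name)
        target_namespace[name] = value

# Usage (mirrors the foopkg example):
#   import foopkg
#   star_import(foopkg, globals())   # binds both bar and the foo sub-module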
As a simple example, consider the following two submodules (a.py and b.py in the same directory). The link to a function in the same submodule, :func:`hook`, works, but the link cross-referencing a different module, i.e. :func:`foo`, does not. I also tried the syntax :func:`.a.foo`, which still does not work. How can I cross-reference a.foo()?
# script a.py
def foo():
    '''foo func'''

# script b.py
def hook():
    '''hook func'''

def spam():
    '''spam func.

    :func:`foo`
    :func:`hook`
    '''
As described in the docs:
Normally, names in these roles are searched first without any further
qualification, then with the current module name prepended, then with
the current module and class name (if any) prepended. If you prefix
the name with a dot, this order is reversed.
In that case, :func:`.a.foo` means an object named a inside the current module b, i.e. it will look for a b.a.foo function.
You should try :func:`..a.foo`, which will point to b..a.foo, or just :func:`a.foo` (I cannot check that locally right now, but I remember using that syntax before).
But note that a.py and b.py should be modules, i.e. importable under their names. If they are just scripts, not located in a package (no __init__.py files up to the root of the project), there is no way to cross-reference them with these roles.
You can try to use the :any: role, as in :any:`foo`, and hope that it finds the object in the general index of documented objects.
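For example, assuming a.py and b.py live in a package (here called mypkg, a placeholder) that Sphinx actually documents, a fully qualified target in b.py's docstring should resolve without ambiguity (a sketch):

# mypkg/b.py
def spam():
    '''spam func.

    :func:`mypkg.a.foo`  (fully qualified, unambiguous)
    :func:`hook`         (same module, found without qualification)
    '''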
I need a function that returns the name of the package of the module from which the function was called. Getting the module's name is easy:
import inspect
module_name = inspect.currentframe().f_back.f_globals['__name__']
And stripping the last part to get the module's package is also easy:
package_name = '.'.join(module_name.split('.')[:-1])
But if the function is called from a package's __init__.py, the last part of the name should not be stripped. E.g. if called from foo/bar/__init__.py, module_name in the above example will be set to 'foo.bar', which is already the name of the package.
How can I check, from the module name or the module object, whether it refers to a package or a module?
The best way I have found is to get the module object's __file__ attribute, if it exists, and check whether it points to a file whose name is __init__ plus an extension. But this seems very brittle to me.
From the module object:
module.__package__
is available for the couple of modules that I looked at, and appears to be correct. It may not give you exactly what you want, though:
>>> import os.path
>>> import httplib2
>>> import xml.etree.ElementTree as etree
>>> os.path.__package__     # blank: nothing comes back for this one
>>> httplib2.__package__
'httplib2'
>>> etree.__package__
'xml.etree'
You can also use
function.__module__
but that gives you the module name, not the actual module. I am not sure if there is an easier way to get the module object itself than importing it again into a local variable.
>>> etree.parse.__module__
'xml.etree.ElementTree'
>>> os.path.split.__module__
'ntpath'
The good thing is this appears to be more correct, following the actual location of things, e.g.:
>>> httplib2.Response.__module__
'httplib2'
>>> httplib2.copy.copy.__module__
'copy'
>>> httplib2.urlparse.ParseResult.__module__
'urlparse'
etc.
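If what you actually need is the module object rather than its name, the cache in sys.modules usually gets you there without importing again (a sketch; module_of is a hypothetical helper, and the stdlib's inspect.getmodule() does much the same job):

import importlib
import sys

def module_of(func):
    # The defining module is normally already cached in sys.modules;
    # fall back to importing it by name if it somehow is not.
    name = func.__module__
    return sys.modules.get(name) or importlib.import_module(name)

# >>> import os.path
# >>> module_of(os.path.split).__name__
# 'posixpath'   (or 'ntpath' on Windows)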
Here is what importlib.__import__() does; it has to re-implement most of Python's built-in import logic, and it has to find a module's package in order to support relative imports:
# __package__ is not guaranteed to be defined or could be set to None
# to represent that its proper value is unknown
package = globals.get('__package__')
if package is None:
    package = globals['__name__']
    if '__path__' not in globals:
        package = package.rpartition('.')[0]
module = _gcd_import(name, package, level)
So it seems that there is no reliable "direct" way to get a module's package.
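Still, the same logic can be wrapped into the helper the question asks for (a sketch; calling_package is a hypothetical name):

import inspect

def calling_package():
    # Package name of the module that called this function.
    caller_globals = inspect.currentframe().f_back.f_globals
    package = caller_globals.get('__package__')
    if package is None:
        package = caller_globals['__name__']
        # Packages have __path__ in their globals; plain modules do not,
        # so only strip the last component for plain modules.
        if '__path__' not in caller_globals:
            package = package.rpartition('.')[0]
    return package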
I'm using the following code to populate __all__ in my package's __init__.py, and I was wondering if there was a more efficient way. Any ideas?
import fnmatch
import os

__all__ = []
for root, dirnames, filenames in os.walk(os.path.dirname(__file__)):
    root = root[len(os.path.dirname(__file__)):]
    for filename in fnmatch.filter(filenames, "*.py"):
        __all__.append(os.path.join(root, filename[:-3]))
You probably shouldn't be doing this: The default behaviour of import is quite flexible. If you don't want a module (or any other variable) to be automatically exported, give it a name that starts with _ and python won't export it. That's the standard python way, and reinventing the wheel is considered unpythonic. Also, don't forget that other things besides modules may need exporting; once you set __all__, you'll need to find and export them as well.
Still, you ask how to best generate a list of your exportable modules. Since you can't export what's not present, I'd just check what modules of your own are known to your main module:
import os
import sys

basedir = os.path.dirname(__file__)
for m in sys.modules:
    if m in locals() and not m.startswith('_'):  # Only export regular names
        mod = locals()[m]
        if '__file__' in mod.__dict__ and mod.__file__.startswith(basedir):
            print(m)
sys.modules includes the names of every module that Python has loaded, including many that have not been imported into your main module, so we check whether they're in locals().
This is faster than scanning your filesystem, and more robust than assuming that every .py file in your directory tree will somehow end up as a top-level submodule. Naturally you should run this code near the end of your __init__.py, when everything has been loaded.
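If the goal really is to fill __all__ rather than just print the candidates, the same idea can be written directly over the package's own namespace (a sketch, assuming the public sub-modules have already been imported above, e.g. with from . import foo):

# At the end of the package's __init__.py:
import os as _os
import types as _types

_basedir = _os.path.dirname(__file__)

__all__ = [
    _name for _name, _value in globals().items()
    if not _name.startswith('_')                       # only regular names
    and isinstance(_value, _types.ModuleType)          # only modules
    and (getattr(_value, '__file__', '') or '').startswith(_basedir)  # only our own sub-modules
]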
I work with a few complex packages that have sub-packages and sub-modules. I like to control this on a module by module basis. I use a simple package called auto-all which makes it easy (full disclosure - I am the author).
https://pypi.org/project/auto-all/
Here's an example:
from auto_all import start_all, end_all
# Define some internal stuff
start_all(globals())
# Define some external stuff
end_all(globals())
The reason I use this approach is mainly because of imports. As mentioned by alexis, you can implicitly make things private by prefixing object names with an underscore, however this can get messy or just impractical for imported objects. Consider the following code:
from pyspark.sql.session import SparkSession
If this appears in your module then you will be implicitly making SparkSession available to be accessed from outside the module. The alternative is to prefix all imported items with underscores, for example:
from pyspark.sql.session import SparkSession as _SparkSession
This also isn't ideal, so manually managing __all__ is the only way (I'm aware of) to manage what you make externally available.
You can easily do this by explicitly setting the contents of the __all__ variable (which is the pythonic way), but this can become tedious when managing a large number of objects, and can also lead to issues if a developer adds a new object and doesn't expose it by adding to the __all__ variable. This type of thing can slip through code reviews. Using simple helper functions to manage the variable contents makes this much easier.
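If pulling in a helper package is not an option, a small test can also catch the "forgot to add it to __all__" case; a sketch (mymodule is a placeholder for the module whose __all__ you maintain):

import mymodule   # placeholder for your own module

def test_all_covers_public_names():
    public = {
        name for name, value in vars(mymodule).items()
        if not name.startswith("_")
        and getattr(value, "__module__", None) == mymodule.__name__   # defined here, not imported
    }
    missing = public - set(mymodule.__all__)
    assert not missing, "public names missing from __all__: %r" % (missing,)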
Let's assume I have a main script, main.py, that imports another Python file with import coolfunctions, and another with import chores.
Now, suppose coolfunctions also uses stuff from chores, hence I declare import chores inside coolfunctions.
Since both main.py and coolfunctions import chores, is this redundant? Is there any other way of doing this? Am I doing it correctly?
I'm confused about how Python projects should be structured in general. I have a conf.py file that I import for a bunch of variables. Is this a module or not? I load this conf file in multiple places as well.
If two modules want to use chores, then each one must import chores (or some equivalent import). Each import creates a name binding only in the namespace of the module that does the import; that is, import's namespace effect is local to a module's namespace.
This is good, because by looking at a module's code you can (barring pathological cases) know where each name is bound to by the import statements that explicitly bind modules or module attributes to names. Imports made in other modules won't affect this module's namespace.
Each module X should import all (and only) the modules Y, Z, T, ... whose functionality it requires, without any worry about what other modules Fee, Fie, Foo ... (if any) may have already done part or all of those imports, or may be going to do so in the future.
It would make a module extremely fragile (indeed, it would be the very opposite of modularity!) if each module had to worry about such subtle, "covert-channel" effects.
What other modules Y, Z, T, ..., each module X chooses to import (if any) is part of X's implementation details, and shouldn't concern anybody except the developers who are coding, testing, or maintaining X.
In order to ensure that this is the case, and that this clearly-best strategy of decoupling can and will fully be followed by sane code, Python "caches" modules as they get imported: a module is "loaded" only once per run of a program, the first time anybody imports it (or anything from inside it) -- all other imports use the same object obtained by that first loading, which Python keeps in a cache (which is specified as being the dict sys.modules, but you need to know that detail only for somewhat-advanced programming techniques... don't worry about it, 98.7% of the time -- just remember that "import is cheap"!-).
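You can see the cache at work with the files from the question (a sketch; it only assumes chores.py exists and has some top-level code, e.g. a print):

import sys

import chores              # first import: chores.py actually runs and is cached
import chores as chores2   # "importing again": served from the cache, chores.py does not re-run

assert chores is chores2                 # one and the same module object
assert sys.modules["chores"] is chores   # and it lives in the cache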
Sure, a conf.py that you use from several other modules via import conf is definitely a module (you may think you're loading it multiple times, but you aren't unless you're using pretty advanced and deliberate techniques indeed for the purpose) -- why shouldn't it be?
No, this isn't redundant - it's fine to import chores in both the main module and coolfunctions.
The exact import mechanics of Python are complex (for example, module imports are only done once, meaning in your case that the actual parsing and loading of the chores module will only happen once, which is a nice optimization) but in general you shouldn't worry about it because it just works.
Each Python file is a module, so your conf.py is also a module.
It is always best practice to import all necessary modules in the file that uses them. Take for example:
A.py contains: import coolfunctions
B.py contains: import A
Main.py contains: import B and uses functions that are defined in A.py (this only works indirectly: importing B causes A to be loaded, and Main.py can reach it as B.A, but A itself is not bound in Main.py's namespace).
If in the future you change B.py so that it no longer needs A.py and you therefore remove its import A, then Main.py will break, because it never imported A itself.
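Concretely, the robust version of Main.py imports A for itself instead of relying on B's import (a sketch; the function name is made up):

# Main.py -- robust version
import A          # Main.py uses A directly, so it imports A itself
import B          # B's own `import A` costs almost nothing: the module is cached

A.some_function_defined_in_A()   # hypothetical function from A.py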