Access data in package subdirectory [duplicate] - python

This question already has answers here:
How to read a (static) file from inside a Python package?
(6 answers)
Closed 2 years ago.
I am writing a python package with modules that need to open data files in a ./data/ subdirectory. Right now I have the paths to the files hardcoded into my classes and functions. I would like to write more robust code that can access the subdirectory regardless of where it is installed on the user's system.
I've tried a variety of methods, but so far I have had no luck. It seems that most of the "current directory" commands return the directory of the system's python interpreter, and not the directory of the module.
This seems like it ought to be a trivial, common problem. Yet I can't seem to figure it out. Part of the problem is that my data files are not .py files, so I can't use import functions and the like.
Any suggestions?
Right now my package directory looks like:
/
    __init__.py
    module1.py
    module2.py
    data/
        data.txt
I am trying to access data.txt from module*.py!

The standard way to do this is with setuptools packages and pkg_resources.
You can lay out your package according to the following hierarchy, and configure the package setup file to point at your data resources, as per this link:
http://docs.python.org/distutils/setupscript.html#installing-package-data
You can then re-find and use those files using pkg_resources, as per this link:
http://peak.telecommunity.com/DevCenter/PkgResources#basic-resource-access
import pkg_resources
DATA_PATH = pkg_resources.resource_filename('<package name>', 'data/')
DB_FILE = pkg_resources.resource_filename('<package name>', 'data/sqlite.db')

There is usually no point in posting an answer built around code that does not work as is, but I believe this to be an exception. Python 3.7 added importlib.resources, which is supposed to replace pkg_resources. It only works for accessing resources whose names contain no slashes, i.e. files sitting directly inside a package:
foo/
    __init__.py
    module1.py
    module2.py
    data/
        data.txt
    data2.txt
i.e. you could access data2.txt inside package foo with, for example:
import importlib.resources
importlib.resources.open_binary('foo', 'data2.txt')
but it would fail with an exception for
>>> importlib.resources.open_binary('foo', 'data/data.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/importlib/resources.py", line 87, in open_binary
    resource = _normalize_path(resource)
  File "/usr/lib/python3.7/importlib/resources.py", line 61, in _normalize_path
    raise ValueError('{!r} must be only a file name'.format(path))
ValueError: 'data/data.txt' must be only a file name
This cannot be fixed except by placing __init__.py in data and then using it as a package:
importlib.resources.open_binary('foo.data', 'data.txt')
The reason for this behaviour is "it is by design"; but the design might change...
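For what it's worth, the design did change: since Python 3.9, importlib.resources.files() returns a Traversable object whose / operator can descend into subdirectories, so no __init__.py is needed inside data/. A minimal sketch (it builds a throwaway foo package on the fly purely so the example runs on its own):

```python
import os
import sys
import tempfile
from importlib.resources import files  # Python 3.9+

# Build a throwaway 'foo' package matching the layout above, purely so
# this sketch is self-contained; in real code the package already exists.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "foo", "data"))
open(os.path.join(root, "foo", "__init__.py"), "w").close()
with open(os.path.join(root, "foo", "data", "data.txt"), "w") as f:
    f.write("hello from data.txt")
sys.path.insert(0, root)

# files() returns a Traversable; '/' descends into subdirectories,
# so data/ needs no __init__.py of its own.
text = (files("foo") / "data" / "data.txt").read_text()
print(text)  # prints "hello from data.txt"
```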

You can use __file__ to get the path to the package, like this:
import os
this_dir, this_filename = os.path.split(__file__)
DATA_PATH = os.path.join(this_dir, "data", "data.txt")
print(open(DATA_PATH).read())
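The same idea reads a bit more cleanly with pathlib (a sketch; data/data.txt is assumed to sit next to the module, as in the question's layout):

```python
from pathlib import Path

# Directory containing this module, independent of the process's CWD.
this_dir = Path(__file__).resolve().parent
DATA_PATH = this_dir / "data" / "data.txt"
# print(DATA_PATH.read_text())  # uncomment once data/data.txt exists
```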

To provide a solution that works today: definitely use this API rather than reinventing all those wheels.
If a true filesystem filename is needed, use resource_filename; zipped eggs will be extracted to a cache directory:
from pkg_resources import resource_filename, Requirement
path_to_vik_logo = resource_filename(Requirement.parse("enb.portals"), "enb/portals/reports/VIK_logo.png")
resource_stream returns a readable file-like object for the specified resource; it may be an actual file, a StringIO, or some similar object. The stream is in “binary mode”, in the sense that whatever bytes are in the resource will be read as-is.
from pkg_resources import resource_stream, Requirement
vik_logo_as_stream = resource_stream(Requirement.parse("enb.portals"), "enb/portals/reports/VIK_logo.png")
Package Discovery and Resource Access using pkg_resources
https://setuptools.readthedocs.io/en/latest/pkg_resources.html#resource-extraction
https://setuptools.readthedocs.io/en/latest/pkg_resources.html#basic-resource-access

You need a name for your whole package; the directory tree you've given doesn't list that detail. For me this worked:
import pkg_resources
print(
pkg_resources.resource_filename(__name__, 'data/data.txt')
)
Notably, setuptools does not appear to resolve files based on a name match with packaged data files, so you're going to have to include the data/ prefix pretty much no matter what. You can use os.path.join('data', 'data.txt') if you need alternate directory separators; generally I find no compatibility problems with hard-coded unix-style directory separators, though.

I think I hunted down an answer.
I make a module data_path.py, which I import into my other modules, containing:
import os

data_path = os.path.join(os.path.dirname(__file__), 'data')
And then I open all my files with
open(os.path.join(data_path,'filename'), <param>)

Skipping __init__.py files with iSort

I am working on a python package, and in order to format imports I run isort on all the python files. I would like isort to skip __init__.py files, as in some (rare) cases the order of the imports is critical and does not line up with isort's ordering scheme.
I tried playing around with different configuration options in my pyproject.toml file and I believe I am looking for the extend_skip_glob configuration option. However, the issue is I am unable to figure out a glob pattern that matches any __init__.py file located in any subdirectory of src (where the code lives) located at any depth (I believe that's part of the issue I am experiencing).
I have tried a few combinations listed below, but none of them appear to be working.
[tool.isort]
# option 1
extend_skip_glob = ["__init__.py"]
# option 2
extend_skip_glob = ["src/**/__init__.py"]
# option 3
extend_skip_glob = ["src/**/*init__.py"]
Has anybody encountered something like this before and figured out a way to solve it?
I think you should use skip instead. In your pyproject.toml file, for example:
[tool.isort]
skip = ["__init__.py"]

File not found Jupyter notebook

I am having trouble loading a file in jupyter notebook.
Here is my project tree:
-- home
---- cdsw
------ my_main.py
------ notebooks
-------- my_notebook.ipynb
------ dns
-------- assets
---------- stopwords.txt
-------- bilans
---------- my_module.py
Know that "/home/cdsw/" is in my PYTHONPATH (the same interpreter in which I launch jupyter).
In my_module.py I have these lines:
import os
from typing import Final

PATH_STOPWORDS: Final = os.path.join("dns", "assets", "stopwords.txt")
STOPWORDS: Final = load_stopwords(PATH_STOPWORDS)
load_stopwords is basically just an open(PATH_STOPWORDS, 'r').
So my problem is that when I import dns.bilans.my_module inside my_main.py it works fine: file is correctly loaded.
Yet, when I import it from my_notebook.ipynb, it does not :
FileNotFoundError: [Errno 2] No such file or directory: 'dns/assets/stopwords.txt'
So my_module is indeed found by the jupyter kernel (because it reads the code lines of the file), but it can't use the relative path the way it does when run from a terminal.
When I use an open(relpath, 'r') inside a module, I don't need to go all through the project tree, right? Indeed it DOES work in my_main.py...
I really don't get it...
The output of os.getcwd() in jupyter is "/home/cdsw/notebooks".
This existing SO question suggests how to find files relative to the position of a Python code file. It isn't exactly the same question, however, and I believe that this technique is so important for every Python programmer to understand, that I'm going to provide a more thorough answer.
Given a piece of Python code, one can compute the path of the directory of the source file containing that code via:
here = os.path.dirname(__file__)
Having the position of the relevant source file, it is easy to compute an absolute path to any data file that has a well known location relative to that source file. In this case, the way to do that is:
stopwords_path = os.path.join(here, '..', 'assets', 'stopwords.txt')
This path can be supplied to open() or used in any other way to refer to the stopwords.txt data file. Here, the way to use this path would be:
load_stopwords(stopwords_path)
I use this technique to not only find files that accompany code in a particular module, but also to find files that are in other locations throughout my source tree. As long as the code and data file exist in the same source repository, or are shipped together in a single Python package, the relative path will not change from installation to installation, and so this technique will work.
In general, you should avoid the use of relative paths. Whenever possible, you should also avoid having to tell your code where to find something. For any situation, ask yourself how you can obtain a reliable absolute path that you can then use to then locate whatever it is you're wanting to access.
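Put together for my_module.py, the technique looks like this (a sketch assuming the tree from the question, where stopwords.txt lives in ../assets/ relative to the module):

```python
import os

# my_module.py sits in dns/bilans/, stopwords.txt in dns/assets/,
# so one '..' from this module's directory reaches dns/.
here = os.path.dirname(os.path.abspath(__file__))
PATH_STOPWORDS = os.path.normpath(
    os.path.join(here, "..", "assets", "stopwords.txt")
)
# STOPWORDS = load_stopwords(PATH_STOPWORDS)  # now CWD-independent
```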

Read Excel file that is located outside the folder containing the module into Pandas DataFrame

I want to read an Excel file into a pandas DataFrame. The module from which I want to read the file is inputs.py, and the Excel file (schoolsData.xlsx) that I want to read is outside the folder containing the module.
I'm doing it like this in my code
def read_data_excel(path):
    df_file = pd.read_excel(path)
    return df_file

school_data = read_data_excel('../schoolsData.xlsx')
Error: No such file or directory: '../schoolsData.xlsx'
The strange thing is that it works fine when I run the function containing this code locally but I get an error when I run the function after installing my published package from PyPi.
What is the right way to do it? Also, is it possible to read the file normally from the installed distributable, which is a compressed folder?
The error could arise because the current working directory is different when you execute locally than when you execute after installing. Take a look at this to generalize the path without hardcoding it.
Your code should work. Try this to check the folder you are at:
import os

print(os.path.dirname(os.path.realpath(__file__)))
You can always do
df_file = pd.read_excel("../schoolsData.xlsx")
".." will go back outside the current folder and this will be a relative reference.
You can always define an absolute path to that folder as well (that starts from C://whatever).
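Following the same idea, a sketch for inputs.py (assuming schoolsData.xlsx sits one directory above the module's folder, as described; the pandas call is left commented so the sketch stands alone):

```python
import os

# Resolve the workbook relative to this module, not the process's CWD.
here = os.path.dirname(os.path.realpath(__file__))
xlsx_path = os.path.normpath(os.path.join(here, "..", "schoolsData.xlsx"))
# import pandas as pd
# school_data = pd.read_excel(xlsx_path)
```

Note, however, that a file kept outside the package directory will not be included in a wheel installed from PyPI; to ship it, move it inside the package and declare it as package data.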

__init__.py reads a file inside a module which fails when top module is zipped

I have code that reads a configuration file in a __init__.py file inside a certain module. Let's say I have a structure like this:
dir/
    setup.py
    src/
        __init__.py
        properties/
            config.yaml
        module/
            __init__.py  ---> this file reads src/properties/config.yaml
The code to read is something like this:
with open(os.path.join(_ROOT, os.path.normpath('src/properties/config.yaml'))) as f:
    config = yaml.load(f)
Where _ROOT is defined in the top src/__init__.py as follows:
_ROOT = os.path.abspath(os.path.dirname(os.path.dirname(__file__)))
Basically, _ROOT=dir.
This works like a charm on every platform, except when src is used as a zipped package, in which case reading returns a FileNotFoundError: ... No such file or directory: ... package.zip\\src\\properties\\config.yaml.
Is there a way to address this problem? Do I need to handle the situation where the package is zipped? Should I avoid loading the file inside the package code...?
I tried an extensive google search for this but I found nothing.
Thank you in advance.
I was able to solve it with pkgutil.get_data!
import pkgutil
import yaml

a = pkgutil.get_data('src', 'properties/config.yaml')
config = yaml.safe_load(a)  # safe_load avoids the deprecated bare yaml.load()
Et voilà!

Force python to ignore pyc files when dynamically loading modules from filepaths

I need to import python modules by filepath (e.g., "/some/path/to/module.py") known only at runtime and ignore any .pyc files should they exist.
This previous question suggests the use of imp.load_module as a solution but this approach also appears to use a .pyc version if it exists.
importme.py
SOME_SETTING = 4
main.py:
import imp

if __name__ == '__main__':
    name = 'importme'
    openfile, pathname, description = imp.find_module(name)
    module = imp.load_module(name, openfile, pathname, description)
    openfile.close()
    print module
Executing twice, the .pyc file is used after first invocation:
$ python main.py
<module 'importme' from '/Users/dbryant/temp/pyc/importme.py'>
$ python main.py
<module 'importme' from '/Users/dbryant/temp/pyc/importme.pyc'>
Unfortunately, imp.load_source has the same behavior (from the docs):
Note that if a properly matching byte-compiled file (with suffix .pyc
or .pyo) exists, it will be used instead of parsing the given source
file.
Making every script-containing directory read-only is the only solution that I know of (prevents generation of .pyc files in the first place) but I would rather avoid it if possible.
(note: using python 2.7)
load_source does the right thing for me, i.e.
import imp
import os

dir, name = os.path.split(path)
mod = imp.load_source(name, path)
uses the .py variant even if a .pyc file is available (name ends in .py; this is under Python 3). The obvious alternative is to delete all .pyc files before loading the file, though the race condition may be a problem if you run more than one instance of your program.
One other possibility: IIRC you can let Python interpret files from memory, i.e. load the file with the normal file API and then compile the in-memory variant. Something like:
path = "some Filepath.py"
with open(path, "r", encoding="utf-8") as file:
data = file.read()
exec(compile(data, "<string>", "exec")) # fair use of exec, that's a first!
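The in-memory approach can be wrapped into a small helper. This is a Python 3 sketch (the question targets 2.7, where imp was still current); because it compiles the source text itself, any cached bytecode is never consulted:

```python
import types

def load_from_source(name, path):
    # Read and compile the source directly, so a stale .pyc is never used.
    with open(path, "r") as f:
        source = f.read()
    module = types.ModuleType(name)
    module.__file__ = path
    exec(compile(source, path, "exec"), module.__dict__)
    return module
```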
How about using zip files containing python sources instead:
import sys
sys.path.insert("package.zip")
You could mark the directories containing the .py files as read-only.
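Relatedly, .pyc generation can be suppressed process-wide (available since Python 2.6), which avoids the read-only workaround; a sketch:

```python
import sys

# Stop this interpreter from writing .pyc files from here on; setting the
# PYTHONDONTWRITEBYTECODE environment variable has the same effect.
sys.dont_write_bytecode = True
```

Note this only prevents new .pyc files from being written; existing ones may still be picked up, so it pairs well with deleting stale ones first.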
