What is package_data in setuptools used for? - python

I tried to understand what it is for and why we should use it by referring to this site:
https://python-packaging.readthedocs.io/en/latest/non-code-files.html
But I still could not understand it. Could anyone show me more examples or ways to use it, please?

The docs you point to are outdated and rather terse. Better read this: https://setuptools.readthedocs.io/en/latest/setuptools.html#including-data-files
The package_data argument is a dictionary that maps from package names to lists of glob patterns.
Example:
from setuptools import setup
setup(
    ...
    package_data={
        # If any package contains *.txt or *.rst files, include them:
        '': ['*.txt', '*.rst'],
        # And include any *.msg files found in the 'hello' package, too:
        'hello': ['*.msg'],
    }
)
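Note that the patterns are resolved relative to each package's directory, so the files must live inside the package. Once installed, the data can be read back at runtime; a minimal sketch, assuming the 'hello' package ships a greeting.msg file matched by the pattern above:
import importlib.resources

# Read a data file that was installed inside the "hello" package
# (greeting.msg is a hypothetical file name used only for illustration).
text = importlib.resources.read_text('hello', 'greeting.msg')
print(text)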

Related

distutils Extension arguments -- include vs depends vs source?

I'm trying to understand the dependency structure of pandas' cython extensions in setup.py.
distutils.extension.Extension has arguments sources, depends, and include_dirs, and I can't figure out the difference between these. In particular, there are a bunch of places in the pandas case where I can delete entries in depends (or pxdfiles) without breaking the build.
What is the distinction between these three arguments?
Update following answer from @phd:
I appreciate the thought, will try to better communicate the source of my confusion.
In the pandas setup.py file linked above, the pandas._libs.tslib extension is passed to distutils.extension.Extension with the args/kwargs:
ext = Extension('pandas._libs.tslib',
                sources=['pandas/_libs/tslib.pyx',
                         'pandas/_libs/src/util.pxd',
                         'pandas/_libs/src/datetime/np_datetime.c',
                         'pandas/_libs/src/datetime/np_datetime_strings.c',
                         'pandas/_libs/src/period_helper.c'],
                depends=['pandas/_libs/src/datetime/np_datetime.h',
                         'pandas/_libs/src/datetime/np_datetime_strings.h',
                         'pandas/_libs/src/period_helper.h',
                         'pandas/_libs/src/datetime.pxd'],
                include_dirs=['pandas/_libs/src/klib', 'pandas/_libs/src'])
Take e.g. util.pxd in the sources entry. Is this not redundant with the presence of pandas/_libs/src in the include_dirs entry? tslib imports directly from datetime.pxd which has "imports" of the form cdef extern from "datetime/np_datetime.h" and cdef extern from "datetime/np_datetime_strings.h". Are those "allowed" because of the presence of the "*.c" files in the sources or the "*.h" files in the depends or both or...
I've tried a whole bunch of permutations of removing subsets of these dependencies and have not seen much of a pattern in which ones break the build.
See the detailed docs and the source code for build_ext command.
sources — a list of source files (*.c) that are compiled and linked into the extension.
depends — a list of additional files the extension depends on; changing any of them triggers a rebuild.
include_dirs — a list of directories the compiler will search for include (header) files (*.h).
pxdfiles are Cython-specific.
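As an illustration only (file names are hypothetical, not from pandas), a minimal Extension showing how the three arguments divide the work:
from distutils.extension import Extension

ext = Extension(
    'mypkg.fast',
    # compiled and linked into the extension module
    sources=['mypkg/src/fast.c', 'mypkg/src/helpers.c'],
    # not compiled themselves, but a change to any of them triggers a rebuild
    depends=['mypkg/src/helpers.h'],
    # searched by the C compiler for #include'd headers
    include_dirs=['mypkg/src'],
)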

How to add package data recursively in Python setup.py?

I have a new library that has to include a lot of subfolders of small datafiles, and I'm trying to add them as package data. Imagine I have my library as so:
library
    - foo.py
    - bar.py
    data
        subfolderA
            subfolderA1
            subfolderA2
        subfolderB
            subfolderB1
        ...
I want to add all of the data in all of the subfolders through setup.py, but it seems like I have to manually go into every single subfolder (there are 100 or so) and add an __init__.py file. Furthermore, will setup.py find these files recursively, or do I need to manually add all of these in setup.py like:
package_data={
    'mypackage.data.folderA': ['*'],
    'mypackage.data.folderA.subfolderA1': ['*'],
    'mypackage.data.folderA.subfolderA2': ['*']
},
I can do this with a script, but seems like a super pain. How can I achieve this in setup.py?
PS, the hierarchy of these folders is important because this is a database of material files and we want the file tree to be preserved when we present them in a GUI to the user, so it would be to our advantage to keep this file structure intact.
The problem with the glob answer is that it only does so much. I.e. it's not fully recursive. The problem with the copy_tree answer is that the files that are copied will be left behind on an uninstall.
The proper solution is a recursive one which will let you set the package_data parameter in the setup call.
I've written this small method to do this:
import os

def package_files(directory):
    paths = []
    for (path, directories, filenames) in os.walk(directory):
        for filename in filenames:
            paths.append(os.path.join('..', path, filename))
    return paths

extra_files = package_files('path_to/extra_files_dir')

setup(
    ...
    packages = ['package_name'],
    package_data={'': extra_files},
    ...
)
You'll notice that when you do a pip uninstall package_name, your additional files are listed (they are tracked with the package).
Use Setuptools instead of distutils.
Use data files instead of package data. These do not require __init__.py.
Generate the lists of files and directories using standard Python code, instead of writing it literally:
import glob

data_files = []
directories = glob.glob('data/subfolder?/subfolder??/')
for directory in directories:
    files = glob.glob(directory + '*')
    data_files.append((directory, files))
# then pass data_files to setup()
To add all the subfolders using package_data in setup.py:
add as many * entries as needed for your subdirectory depth:
package_data={
    'mypackage.data.folderA': ['*', '*/*', '*/*/*'],
}
Use glob to select all subfolders in your setup.py
...
packages=['your_package'],
package_data={'your_package': ['data/**/*']},
...
Update
According to the changelog, setuptools now supports recursive globs, using **, in package_data (as of v62.3.0, released May 2022).
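With a recent enough setuptools the recursive pattern can therefore be used directly; a minimal sketch, assuming a package named my_package with data under my_package/my_data_dir:
from setuptools import setup

setup(
    name='my_package',
    packages=['my_package'],
    # requires setuptools >= 62.3.0 for the recursive ** pattern
    package_data={'my_package': ['my_data_dir/**/*']},
)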
Original answer
@gbonetti's answer, using a recursive glob pattern, i.e. **, would be perfect.
However, as commented by @daniel-himmelstein, that does not work yet in setuptools package_data.
So, for the time being, I like to use the following workaround, based on pathlib's Path.glob():
from pathlib import Path

def glob_fix(package_name, glob):
    # this assumes setup.py lives in the folder that contains the package
    package_path = Path(f'./{package_name}').resolve()
    return [str(path.relative_to(package_path))
            for path in package_path.glob(glob)]
This returns a list of path strings relative to the package path, as required.
Here's one way to use this:
setuptools.setup(
    ...
    package_data={'my_package': [*glob_fix('my_package', 'my_data_dir/**/*'),
                                 'my_other_dir/some.file', ...], ...},
    ...
)
The glob_fix() can be removed as soon as setuptools supports ** in package_data.
If you don't have any problem with getting your setup.py code dirty, use distutils.dir_util.copy_tree.
The whole problem is how to exclude files from it.
Here's the code:
import os.path
from distutils import dir_util
from distutils import sysconfig
from distutils.core import setup
__packagename__ = 'x'
setup(
    name = __packagename__,
    packages = [__packagename__],
)
destination_path = sysconfig.get_python_lib()
package_path = os.path.join(destination_path, __packagename__)
dir_util.copy_tree(__packagename__, package_path, update=1, preserve_mode=0)
Some Notes:
This code recursively copies the source code into the destination path.
You can just use the same setup(...) and then use copy_tree() to copy the extra directory you want into the installation path.
The default distutils installation paths can be found in its API.
More information about distutils' copy_tree() can be found here.
I can suggest a little code to add data_files in setup():
import os

data_files = []

start_point = os.path.join(__pkgname__, 'static')
for root, dirs, files in os.walk(start_point):
    root_files = [os.path.join(root, i) for i in files]
    data_files.append((root, root_files))

start_point = os.path.join(__pkgname__, 'templates')
for root, dirs, files in os.walk(start_point):
    root_files = [os.path.join(root, i) for i in files]
    data_files.append((root, root_files))

setup(
    name = __pkgname__,
    description = __description__,
    version = __version__,
    long_description = README,
    ...
    data_files = data_files,
)
I can do this with a script, but seems like a super pain. How can I achieve this in setup.py?
Here is a reusable, simple way:
Add the following function in your setup.py, and call it as per the Usage instructions. This is essentially the generic version of the accepted answer.
from glob import glob
from itertools import chain

def flatten(iterables):
    # flatten one level of nesting; the snippet below assumes such a helper
    return chain.from_iterable(iterables)

def find_package_data(specs):
    """recursively find package data as per the folders given

    Usage:

        # in setup.py
        setup(...
            include_package_data=True,
            package_data=find_package_data({
                'package': ('resources', 'static')
            }))

    Args:
        specs (dict): package => list of folder names to include files from

    Returns:
        dict of list of file names
    """
    return {
        package: list(''.join(n.split('/', 1)[1:]) for n in
                      flatten(glob('{}/{}/**/*'.format(package, f), recursive=True) for f in folders))
        for package, folders in specs.items()}
I'm going to throw my solution in here in case anyone is looking for a clean way to include their compiled Sphinx docs as data_files.
setup.py
from setuptools import setup
import pathlib
import os

here = pathlib.Path(__file__).parent.resolve()

# Get documentation files from the docs/build/html directory
documentation = [doc.relative_to(here) for doc in here.glob("docs/build/html/**/*") if pathlib.Path.is_file(doc)]

data_docs = {}
for doc in documentation:
    doc_path = os.path.join("your_top_data_dir", "docs")
    path_parts = doc.parts[3:-1]  # remove "docs/build/html", ignore filename
    if path_parts:
        doc_path = os.path.join(doc_path, *path_parts)
    # create all appropriate subfolders and append relative doc path
    data_docs.setdefault(doc_path, []).append(str(doc))

setup(
    ...
    include_package_data=True,
    # <sys.prefix>/your_top_data_dir
    data_files=[("your_top_data_dir", ["data/test-credentials.json"]), *list(data_docs.items())]
)
With the above solution, once you install your package you'll have all the compiled documentation available at os.path.join(sys.prefix, "your_top_data_dir", "docs"). So, if you wanted to serve the now-static docs using nginx you could add the following to your nginx file:
location /docs {
    # handle static files directly, without forwarding to the application
    alias /www/your_app_name/venv/your_top_data_dir/docs;
    expires 30d;
}
Once you've done that, you should be able to visit {your-domain.com}/docs and see your Sphinx documentation.
If you don't want to add custom code to iterate through the directory contents, you can use the pbr library, which extends setuptools. See here for documentation on how to use it to copy an entire directory, preserving the directory structure:
https://docs.openstack.org/pbr/latest/user/using.html#files
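For reference, pbr reads this from the [files] section of setup.cfg; a minimal sketch along the lines of the linked documentation (the directory names are hypothetical):
[files]
packages =
    mypackage
data_files =
    share/mypackage/data = data/*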
You need to write a function that returns all files and their paths; you can use the following:
import os

def sherinfind():
    # Add all folders that contain files or other subdirectories
    pathlist = ['templates/', 'scripts/']
    data = {}
    for path in pathlist:
        for root, d_names, f_names in os.walk(path, topdown=True, onerror=None, followlinks=False):
            data[root] = list()
            for f in f_names:
                data[root].append(os.path.join(root, f))
    fn = [(k, v) for k, v in data.items()]
    return fn
Now set data_files in setup() as follows:
data_files=sherinfind()

How to specify header files in setup.py script for Python extension module?

How do I specify the header files in a setup.py script for a Python extension module? Listing them with the source files as follows does not work, but I cannot figure out where else to list them.
from distutils.core import setup, Extension
from glob import glob
setup(
    name = "Foo",
    version = "0.1.0",
    ext_modules = [Extension('Foo', glob('Foo/*.cpp') + glob('Foo/*.h'))]
)
Add a MANIFEST.in file beside setup.py with the following contents:
graft relative/path/to/directory/of/your/headers/
Try the headers kwarg to setup(). I don't know that it's documented anywhere, but it works.
setup(name='mypkg', ..., headers=['src/includes/header.h'])
I've had so much trouble with setuptools it's not even funny anymore.
Here's how I ended up having to use a workaround in order to produce a working source distribution with header files: I used package_data.
I'm sharing this in order to potentially save someone else the aggravation. If you know a better working solution, let me know.
See here for details:
https://bitbucket.org/blais/beancount/src/ccb3721a7811a042661814a6778cca1c42433d64/setup.py?fileviewer=file-view-default#setup.py-36
# A note about setuptools: It's profoundly BROKEN.
#
# - The header files are needed in order to distribute a working
#   source distribution.
# - Listing the header files under the extension "sources" fails to
#   build; distutils cannot make out the file type.
# - Listing them as "headers" makes them ignored; extra options to
#   Extension() appear to be ignored silently.
# - Listing them under setup()'s "headers" makes it recognize them, but
#   they do not get included.
# - Listing them with "include_dirs" of the Extension fails as well.
#
# The only way I managed to get this working is by working around and
# including them as "packaged data" (see {63fc8d84d30a} below). That
# includes the header files in the sdist, and a source distribution can
# be installed using pip3 (and be built locally). However, the header
# files end up being installed next to the pure Python files in the
# output. This is the sorry situation we're living in, but it works.
There's a corresponding ticket in my OSS project:
https://bitbucket.org/blais/beancount/issues/72
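To make the workaround concrete, here is a minimal sketch of the approach described in that comment, with hypothetical package and file names: the headers are shipped as package data so they end up in the sdist, and include_dirs points the compiler at them.
from setuptools import setup, Extension

setup(
    name='mypkg',
    packages=['mypkg'],
    # ship the C headers inside the package so an sdist can still build
    package_data={'mypkg': ['src/*.h']},
    ext_modules=[
        Extension('mypkg._ext',
                  sources=['mypkg/src/_ext.c'],
                  include_dirs=['mypkg/src']),
    ],
)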
If I remember right, you should only need to specify the source files and it's supposed to find/use the headers.
In the setuptools manual, I see something about this, I believe.
"For example, if your extension requires header files in the include directory under your distribution root, use the include_dirs option"
Extension('foo', ['foo.c'], include_dirs=['include'])
http://docs.python.org/distutils/setupscript.html#preprocessor-options

Having py2exe include my data files (like include_package_data)

I have a Python app which includes non-Python data files in some of its subpackages. I've been using the include_package_data option in my setup.py to include all these files automatically when making distributions. It works well.
Now I'm starting to use py2exe. I expected it to see that I have include_package_data=True and to include all the files. But it doesn't. It puts only my Python files in the library.zip, so my app doesn't work.
How do I make py2exe include my data files?
I ended up solving it by giving py2exe the option skip_archive=True. This caused it to put the Python files not in library.zip but simply as plain files. Then I used data_files to put the data files right inside the Python packages.
include_package_data is a setuptools option, not a distutils one. In classic distutils, you have to specify the location of data files yourself, using the data_files = [] directive. py2exe is the same. If you have many files, you can use glob or os.walk to retrieve them. See for example the additional changes (datafile additions) required to setup.py to make a module like MatPlotLib work with py2exe.
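As a rough sketch (paths and the script name are hypothetical), collecting the files with glob and handing them to py2exe via data_files could look like this:
import glob
import os
from distutils.core import setup
import py2exe  # importing py2exe registers its command with distutils

# (target directory inside the dist folder, list of source files)
data_files = [
    (os.path.join('mypackage', 'data'),
     glob.glob(os.path.join('mypackage', 'data', '*.dat'))),
]

setup(
    windows=['main.py'],
    data_files=data_files,
)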
There is also a mailing list discussion that is relevant.
Here's what I use to get py2exe to bundle all of my files into the .zip. Note that to get at your data files, you need to open the zip file. py2exe won't redirect the calls for you.
setup(windows=[target],
      name="myappname",
      data_files = [('', ['data1.dat', 'data2.dat'])],
      options = {'py2exe': {
          "optimize": 2,
          "bundle_files": 2, # This tells py2exe to bundle everything
      }},
)
The full list of py2exe options is here.
I have been able to do this by overriding one of py2exe's functions and then inserting the files into the zipfile that is created by py2exe.
Here's an example:
import os
import py2exe
import zipfile

myFiles = [
    "C:/Users/Kade/Documents/ExampleFiles/example_1.doc",
    "C:/Users/Kade/Documents/ExampleFiles/example_2.dll",
    "C:/Users/Kade/Documents/ExampleFiles/example_3.obj",
    "C:/Users/Kade/Documents/ExampleFiles/example_4.H",
]

def better_copy_files(self, destdir):
    """Overridden so that things can be included in the library.zip."""
    # Run function as normal
    original_copy_files(self, destdir)
    # Get the zipfile's location
    if self.options.libname is not None:
        libpath = os.path.join(destdir, self.options.libname)
        # Re-open the zip file
        if self.options.compress:
            compression = zipfile.ZIP_DEFLATED
        else:
            compression = zipfile.ZIP_STORED
        arc = zipfile.ZipFile(libpath, "a", compression=compression)
        # Add your items to the zipfile
        for item in myFiles:
            if self.options.verbose:
                print("Copy File %s to %s" % (item, libpath))
            arc.write(item, os.path.basename(item))
        arc.close()

# Connect overrides
original_copy_files = py2exe.runtime.Runtime.copy_files
py2exe.runtime.Runtime.copy_files = better_copy_files
I got the idea from here, but unfortunately py2exe has changed how they do things since then. I hope this helps someone out.

Access data in package subdirectory [duplicate]

This question already has answers here:
How to read a (static) file from inside a Python package?
(6 answers)
Closed 2 years ago.
I am writing a python package with modules that need to open data files in a ./data/ subdirectory. Right now I have the paths to the files hardcoded into my classes and functions. I would like to write more robust code that can access the subdirectory regardless of where it is installed on the user's system.
I've tried a variety of methods, but so far I have had no luck. It seems that most of the "current directory" commands return the directory of the system's python interpreter, and not the directory of the module.
This seems like it ought to be a trivial, common problem. Yet I can't seem to figure it out. Part of the problem is that my data files are not .py files, so I can't use import functions and the like.
Any suggestions?
Right now my package directory looks like:
/
    __init__.py
    module1.py
    module2.py
    data/
        data.txt
I am trying to access data.txt from module*.py!
The standard way to do this is with setuptools packages and pkg_resources.
You can lay out your package according to the following hierarchy, and configure the package setup file to point it your data resources, as per this link:
http://docs.python.org/distutils/setupscript.html#installing-package-data
You can then re-find and use those files using pkg_resources, as per this link:
http://peak.telecommunity.com/DevCenter/PkgResources#basic-resource-access
import pkg_resources
DATA_PATH = pkg_resources.resource_filename('<package name>', 'data/')
DB_FILE = pkg_resources.resource_filename('<package name>', 'data/sqlite.db')
There is often no point in making an answer that details code that does not work as is, but I believe this to be an exception. Python 3.7 added importlib.resources, which is supposed to replace pkg_resources. It works for accessing files within packages only if the resource name does not contain a slash, i.e.
foo/
    __init__.py
    module1.py
    module2.py
    data/
        data.txt
    data2.txt
i.e. you could access data2.txt inside package foo with, for example:
importlib.resources.open_binary('foo', 'data2.txt')
but it would fail with an exception for
>>> importlib.resources.open_binary('foo', 'data/data.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/importlib/resources.py", line 87, in open_binary
    resource = _normalize_path(resource)
  File "/usr/lib/python3.7/importlib/resources.py", line 61, in _normalize_path
    raise ValueError('{!r} must be only a file name'.format(path))
ValueError: 'data/data.txt' must be only a file name
This cannot be fixed except by placing __init__.py in data and then using it as a package:
importlib.resources.open_binary('foo.data', 'data.txt')
The reason for this behaviour is "it is by design"; but the design might change...
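For what it's worth, the design did change later: Python 3.9 added importlib.resources.files(), which can traverse subdirectories without an extra __init__.py. A minimal sketch for the same foo layout:
from importlib.resources import files

# read data/data.txt from inside the installed "foo" package (Python 3.9+)
text = (files('foo') / 'data' / 'data.txt').read_text()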
You can use __file__ to get the path to the package, like this:
import os

this_dir, this_filename = os.path.split(__file__)
DATA_PATH = os.path.join(this_dir, "data", "data.txt")
print(open(DATA_PATH).read())
To provide a solution that works today: definitely use this API so you don't reinvent all those wheels.
A true filesystem filename is needed. Zipped eggs will be extracted to a cache directory:
from pkg_resources import resource_filename, Requirement
path_to_vik_logo = resource_filename(Requirement.parse("enb.portals"), "enb/portals/reports/VIK_logo.png")
resource_stream() returns a readable file-like object for the specified resource; it may be an actual file, a StringIO, or some similar object. The stream is in "binary mode", in the sense that whatever bytes are in the resource will be read as-is.
from pkg_resources import resource_stream, Requirement
vik_logo_as_stream = resource_stream(Requirement.parse("enb.portals"), "enb/portals/reports/VIK_logo.png")
Package Discovery and Resource Access using pkg_resources
https://setuptools.readthedocs.io/en/latest/pkg_resources.html#resource-extraction
https://setuptools.readthedocs.io/en/latest/pkg_resources.html#basic-resource-access
You need a name for your whole package; your given directory tree doesn't list that detail. For me this worked:
import pkg_resources

print(
    pkg_resources.resource_filename(__name__, 'data/data.txt')
)
Notably, setuptools does not appear to resolve files based on a name match with packed data files, so you're going to have to include the data/ prefix pretty much no matter what. You can use os.path.join('data', 'data.txt') if you need alternate directory separators; generally I find no compatibility problems with hard-coded unix-style directory separators, though.
I think I hunted down an answer.
I make a module data_path.py, which I import into my other modules containing:
import os

data_path = os.path.join(os.path.dirname(__file__), 'data')
And then I open all my files with
open(os.path.join(data_path,'filename'), <param>)
