How to add package data recursively in Python setup.py? - python

I have a new library that has to include a lot of subfolders of small datafiles, and I'm trying to add them as package data. Imagine I have my library as so:
library
    - foo.py
    - bar.py
    data
        subfolderA
            subfolderA1
            subfolderA2
        subfolderB
            subfolderB1
        ...
I want to add all of the data in all of the subfolders through setup.py, but it seems like I have to go into every single subfolder manually (there are 100 or so) and add an __init__.py file. Furthermore, will setup.py find these files recursively, or do I need to add all of them in setup.py by hand, like:
package_data={
    'mypackage.data.folderA': ['*'],
    'mypackage.data.folderA.subfolderA1': ['*'],
    'mypackage.data.folderA.subfolderA2': ['*']
},
I can do this with a script, but seems like a super pain. How can I achieve this in setup.py?
PS, the hierarchy of these folders is important because this is a database of material files and we want the file tree to be preserved when we present them in a GUI to the user, so it would be to our advantage to keep this file structure intact.

The problem with the glob answer is that it only does so much. I.e. it's not fully recursive. The problem with the copy_tree answer is that the files that are copied will be left behind on an uninstall.
The proper solution is a recursive one which will let you set the package_data parameter in the setup call.
I've written this small method to do this:
import os
from setuptools import setup

def package_files(directory):
    paths = []
    for (path, directories, filenames) in os.walk(directory):
        for filename in filenames:
            # '..' steps up from the package directory, because package_data
            # paths are resolved relative to it
            paths.append(os.path.join('..', path, filename))
    return paths

extra_files = package_files('path_to/extra_files_dir')

setup(
    ...
    packages = ['package_name'],
    package_data={'': extra_files},
    ....
)
When you run pip uninstall package_name, you'll see your additional files listed (as tracked with the package).
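As a variant of the function above, here is a minimal sketch that returns paths relative to the package directory instead of prefixing each one with '..'. It assumes the data directory lives inside the package; collect_package_data, package_dir and data_subdir are hypothetical names, and the demo builds a throwaway tree in a temporary directory.

```python
import os
import tempfile


def collect_package_data(package_dir, data_subdir):
    """Return every file under package_dir/data_subdir, relative to package_dir."""
    paths = []
    for root, _dirs, filenames in os.walk(os.path.join(package_dir, data_subdir)):
        for filename in filenames:
            full_path = os.path.join(root, filename)
            paths.append(os.path.relpath(full_path, package_dir))
    return sorted(paths)


# Demo on a throwaway tree that mimics the layout from the question.
with tempfile.TemporaryDirectory() as tmp:
    package_dir = os.path.join(tmp, "mypackage")
    leaf_a = os.path.join(package_dir, "data", "subfolderA", "subfolderA1")
    leaf_b = os.path.join(package_dir, "data", "subfolderB", "subfolderB1")
    os.makedirs(leaf_a)
    os.makedirs(leaf_b)
    open(os.path.join(leaf_a, "steel.dat"), "w").close()
    open(os.path.join(leaf_b, "wood.dat"), "w").close()
    found = collect_package_data(package_dir, "data")

print(found)
```

In a real setup.py the result would be passed along the lines of package_data={'mypackage': collect_package_data('mypackage', 'data')}.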

Use Setuptools instead of distutils.
Use data files instead of package data. These do not require __init__.py.
Generate the lists of files and directories using standard Python code, instead of writing it literally:
import glob

data_files = []
directories = glob.glob('data/subfolder?/subfolder??/')
for directory in directories:
    files = glob.glob(directory+'*')
    data_files.append((directory, files))
# then pass data_files to setup()

To add all the subfolders using package_data in setup.py:
Add as many * entries as your subdirectory structure is deep:
package_data={
    'mypackage.data.folderA': ['*','*/*','*/*/*'],
}

Use glob to select all subfolders in your setup.py
...
packages=['your_package'],
package_data={'your_package': ['data/**/*']},
...

Update
According to the change log setuptools now supports recursive globs, using **, in package_data (as of v62.3.0, released May 2022).
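If a setup.py has to work with setuptools releases on both sides of that change, one option is to check the installed version before relying on **. This is a minimal sketch under that assumption: supports_recursive_package_data is a hypothetical helper name, and the version string is parsed naively by taking its first three numeric fields.

```python
import re


def supports_recursive_package_data(version):
    """True if a setuptools version string is at least 62.3.0 (naive parse)."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    numbers += [0] * (3 - len(numbers))  # pad short versions such as "58.1"
    return tuple(numbers) >= (62, 3, 0)


# Hypothetical use in setup.py:
#   import setuptools
#   if supports_recursive_package_data(setuptools.__version__):
#       patterns = ['data/**/*']
#   else:
#       patterns = [...]  # fall back to a workaround
print(supports_recursive_package_data("62.3.0"))
print(supports_recursive_package_data("58.1"))
```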
Original answer
@gbonetti's answer, using a recursive glob pattern, i.e. **, would be perfect.
However, as commented by @daniel-himmelstein, that does not work yet in setuptools package_data.
So, for the time being, I like to use the following workaround, based on pathlib's Path.glob():
from pathlib import Path

def glob_fix(package_name, glob):
    # this assumes setup.py lives in the folder that contains the package
    package_path = Path(f'./{package_name}').resolve()
    return [str(path.relative_to(package_path))
            for path in package_path.glob(glob)]
This returns a list of path strings relative to the package path, as required.
Here's one way to use this:
setuptools.setup(
    ...
    package_data={'my_package': [*glob_fix('my_package', 'my_data_dir/**/*'),
                                 'my_other_dir/some.file', ...], ...},
    ...
)
The glob_fix() can be removed as soon as setuptools supports ** in package_data.

If you don't have any problem with getting your setup.py code dirty use distutils.dir_util.copy_tree.
The whole problem is how to exclude files from it.
Here's some code:
import os.path
from distutils import dir_util
from distutils import sysconfig
from distutils.core import setup
__packagename__ = 'x'
setup(
    name = __packagename__,
    packages = [__packagename__],
)
destination_path = sysconfig.get_python_lib()
package_path = os.path.join(destination_path, __packagename__)
dir_util.copy_tree(__packagename__, package_path, update=1, preserve_mode=0)
Some notes:
This code recursively copies the source code into the destination path.
You can use the same setup(...) call and then use copy_tree() to copy the directory you want into the installation path.
The default installation paths used by distutils can be found in its API.
More information about the copy_tree() function of distutils can be found in the distutils documentation.

I can suggest a little code to add data_files in setup():
import os

data_files = []

start_point = os.path.join(__pkgname__, 'static')
for root, dirs, files in os.walk(start_point):
    root_files = [os.path.join(root, i) for i in files]
    data_files.append((root, root_files))

start_point = os.path.join(__pkgname__, 'templates')
for root, dirs, files in os.walk(start_point):
    root_files = [os.path.join(root, i) for i in files]
    data_files.append((root, root_files))

setup(
    name = __pkgname__,
    description = __description__,
    version = __version__,
    long_description = README,
    ...
    data_files = data_files,
)

I can do this with a script, but seems like a super pain. How can I achieve this in setup.py?
Here is a reusable, simple way:
Add the following function in your setup.py, and call it as per the Usage instructions. This is essentially the generic version of the accepted answer.
from glob import glob
from itertools import chain

# assumed definition; the original snippet leaves flatten() undefined
flatten = chain.from_iterable

def find_package_data(specs):
    """recursively find package data as per the folders given

    Usage:
        # in setup.py
        setup(...
              include_package_data=True,
              package_data=find_package_data({
                  'package': ('resources', 'static')
              }))

    Args:
        specs (dict): package => list of folder names to include files from

    Returns:
        dict of list of file names
    """
    return {
        package: list(''.join(n.split('/', 1)[1:]) for n in
                      flatten(glob('{}/{}/**/*'.format(package, f), recursive=True) for f in folders))
        for package, folders in specs.items()}

I'm going to throw my solution in here in case anyone is looking for a clean way to include their compiled sphinx docs as data_files.
setup.py
from setuptools import setup
import pathlib
import os

here = pathlib.Path(__file__).parent.resolve()

# Get documentation files from the docs/build/html directory
documentation = [doc.relative_to(here) for doc in here.glob("docs/build/html/**/*") if pathlib.Path.is_file(doc)]
data_docs = {}
for doc in documentation:
    doc_path = os.path.join("your_top_data_dir", "docs")
    path_parts = doc.parts[3:-1]  # remove "docs/build/html", ignore filename
    if path_parts:
        doc_path = os.path.join(doc_path, *path_parts)
    # create all appropriate subfolders and append relative doc path
    data_docs.setdefault(doc_path, []).append(str(doc))

setup(
    ...
    include_package_data=True,
    # <sys.prefix>/your_top_data_dir
    data_files=[("your_top_data_dir", ["data/test-credentials.json"]), *list(data_docs.items())]
)
With the above solution, once you install your package you'll have all the compiled documentation available at os.path.join(sys.prefix, "your_top_data_dir", "docs"). So, if you wanted to serve the now-static docs using nginx you could add the following to your nginx file:
location /docs {
    # handle static files directly, without forwarding to the application
    alias /www/your_app_name/venv/your_top_data_dir/docs;
    expires 30d;
}
Once you've done that, you should be able to visit {your-domain.com}/docs and see your Sphinx documentation.

If you don't want to add custom code to iterate through the directory contents, you can use pbr library, which extends setuptools. See here for documentation on how to use it to copy an entire directory, preserving the directory structure:
https://docs.openstack.org/pbr/latest/user/using.html#files

You need to write a function that returns all files and their paths. You can use the following:
import os

def sherinfind():
    # Add every folder that contains files or other subdirectories
    pathlist=['templates/','scripts/']
    data={}
    for path in pathlist:
        for root,d_names,f_names in os.walk(path,topdown=True, onerror=None, followlinks=False):
            data[root]=list()
            for f in f_names:
                data[root].append(os.path.join(root, f))
    fn=[(k,v) for k,v in data.items()]
    return fn
Now change the data_files in setup() as follows,
data_files=sherinfind()

Related

What are the usage of package_data which is in the setuptools?

I tried to understand what is it for and why we should use it by referring to this site:
https://python-packaging.readthedocs.io/en/latest/non-code-files.html
But I still could not understand it. Could anyone show me more examples or ways to use it please.
The docs you point to are outdated and rather terse. Better read this: https://setuptools.readthedocs.io/en/latest/setuptools.html#including-data-files
The package_data argument is a dictionary that maps from package names to lists of glob patterns.
Example:
from setuptools import setup

setup(
    ...
    package_data={
        # If any package contains *.txt or *.rst files, include them:
        '': ['*.txt', '*.rst'],
        # And include any *.msg files found in the 'hello' package, too:
        'hello': ['*.msg'],
    }
)

What is difference between root and base directory?

I am trying to use the make_archive function from shutil.
here: https://docs.python.org/2/library/shutil.html#archiving-operations
but I can't understand the difference between root_dir and base_dir.
Here's a simple code using make_archive:
#!/usr/bin/python
from os import path
from os import curdir
from shutil import make_archive
# Setting current Directory
current = path.realpath(curdir)
# Now Compressing
make_archive("Backup", "gztar", current)
This will create an archive named Backup.tar.gz which contains a . Directory inside it.
I don't want the . directory but the whole thing inside in the archive.
root_dir is the directory that becomes the root of the archive: make_archive works from it, and every path stored in the archive is relative to it.
base_dir is the directory, given relative to root_dir, whose contents you want to pack.
For example, if you have a directory tree like:
/home/apast/git/someproject
And you want to build a package for someproject folder, you can set:
root_dir="/home/apast/git"
base_dir="someproject"
If the contents of your tree is like following, for example:
/home/apast/git/someproject/test.py
/home/apast/git/someproject/model.py
The content of your package will acquire following structure:
someproject/test.py
someproject/model.py
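Your package file is written to the path given by base_name (a relative base_name is resolved against the working directory at call time, not against root_dir). With base_name="/home/apast/git/<packfile-name>", it is stored at:
/home/apast/git/<packfile-name>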
As the docs show, root_dir and base_dir both default to your current working directory (cwd, or curdir), but you can use them in a more flexible way.
Let's consider following dir structure:
/home/apast/git/web/tornado.py
/home/apast/git/web/setup.py
/home/apast/git/core/service.py
/home/apast/git/mobile/gui.py
/home/apast/git/mobile/restfulapi.py
We will try two snippets to clarify examples:
1. Defining base_dir
2. Without defined base_dir
By defining base_dir, we specify which directory to include in our file:
from shutil import make_archive

root_dir = "/home/apast/git/"
make_archive(base_name="/tmp/outputfile",
             format="gztar",
             root_dir=root_dir,
             base_dir="web")
This code will generate a file called /tmp/outputfile.tar.gz with following struct:
web/tornado.py
web/setup.py
Running without base_dir, like the following:
from shutil import make_archive

root_dir = "/home/apast/git/"
make_archive(base_name="/tmp/outputfile",
             format="gztar",
             root_dir=root_dir)
It will produce a file containing:
web/tornado.py
web/setup.py
core/service.py
mobile/gui.py
mobile/restfulapi.py
To include only specific folders, you may need some other technique, such as using the tarfile module directly.
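To see the two parameters in action, here is a minimal sketch that rebuilds part of the tree above in a temporary directory, archives only the web folder, and lists what ended up in the archive. The temporary paths are stand-ins for /home/apast/git and /tmp/outputfile.

```python
import os
import tarfile
import tempfile
from shutil import make_archive

with tempfile.TemporaryDirectory() as tmp:
    # Recreate a small version of the tree from the answer.
    root_dir = os.path.join(tmp, "git")
    for rel in ("web/tornado.py", "web/setup.py", "core/service.py"):
        path = os.path.join(root_dir, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()

    # base_name decides where the archive is written;
    # root_dir and base_dir decide what goes into it.
    archive = make_archive(
        base_name=os.path.join(tmp, "outputfile"),
        format="gztar",
        root_dir=root_dir,
        base_dir="web",
    )
    with tarfile.open(archive) as tar:
        members = tar.getnames()
    archive_name = os.path.basename(archive)

print(archive_name)
print(members)
```

Only entries under web/ appear in the listing, and the archive itself sits next to the git directory rather than inside it, because its location comes from base_name.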
cd into the root_dir, then tar the base_dir.
The docs confused me too. Read the code and it will become clear.
It's a bit confusing if you read the documentation but if you see it visually, it can help quite a bit.
The root_dir is the directory that you will be storing the file in.
If I were storing a file in C:\Users\Elipzer\Desktop\MyFolder\, That would be my root_dir.
The base_dir is the part added onto the root_dir so if I were storing it under my MyFolder in ...\MyFolder\MySubFolder\, I would put that as the base_dir.
In many cases, there is no need to use these since you can just change the default directory to the directory that you want to store the file in and the make_archive function will just use the default directory as the root_dir and base_dir.

Include entire directory in python setup.py data_files

The data_files parameter for setup takes input in the following format:
setup(...
    data_files = [(target_directory, [list of files to be put there])]
    ....)
Is there a way for me to specify an entire directory of data instead, so I don't have to name each file individually and update it as I change implementation in my project?
I attempted to use os.listdir(), but I don't know how to do that with relative paths, I couldn't use os.getcwd() or os.realpath(__file__) since those don't point to my repository root correctly.
karelv has the right idea, but to answer the stated question more directly:
from glob import glob
setup(
    #...
    data_files = [
        ('target_directory_1', glob('source_dir/*')), # files in source_dir only - not recursive
        ('target_directory_2', glob('nested_source_dir/**/*', recursive=True)), # includes sub-folders - recursive
        # etc...
    ],
    #...
)
import glob
for filename in glob.iglob('inner_dir/**/*', recursive=True):
    print(filename)
Doing this, you get directly a list of files relative to current directory.
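Note that a recursive glob returns directories as well as files, which matters when the result is fed to data_files. Here is a minimal sketch that keeps regular files only; recursive_files is a hypothetical helper name and the tree is created in a temporary directory for the demo.

```python
import glob
import os
import tempfile


def recursive_files(root, pattern):
    """Expand a recursive glob below root, dropping directories from the result."""
    matches = glob.glob(os.path.join(root, pattern), recursive=True)
    return sorted(os.path.relpath(m, root) for m in matches if os.path.isfile(m))


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "inner_dir", "sub"))
    for rel in (("inner_dir", "a.txt"), ("inner_dir", "sub", "b.txt")):
        open(os.path.join(tmp, *rel), "w").close()

    pattern = os.path.join("inner_dir", "**", "*")
    raw = glob.glob(os.path.join(tmp, pattern), recursive=True)
    has_directory = any(os.path.isdir(m) for m in raw)  # the unfiltered glob includes inner_dir/sub
    only_files = recursive_files(tmp, pattern)

print(has_directory)
print(only_files)
```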
I don't know how to do that with relative paths
You need to get the path of the directory first, so...
Say you have this directory structure:
cur_directory
|- setup.py
|- inner_dir
|- file2.py
To get the directory of the current file (in this case setup.py), use this:
cur_directory_path = os.path.abspath(os.path.dirname(__file__))
Then, to get a directory path relative to cur_directory, just join some other directories, e.g.:
inner_dir_path = os.path.join(cur_directory_path, 'inner_dir')
If you want to move up a directory, just use "..", for example:
parent_dir_path = os.path.join(cur_directory_path, '..')
Once you have that path, you can do os.listdir
For completeness:
If you want the path of a file, in this case "file2.py" relative to setup.py, you could do:
file2_path = os.path.join(cur_directory_path, 'inner_dir', 'file2.py')
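The same idea can be written with pathlib. This is a minimal sketch, not part of the answer above: list_beside is a hypothetical helper name, and the demo recreates the cur_directory layout in a temporary directory. In a real setup.py you would pass __file__ as script_file.

```python
import tempfile
from pathlib import Path


def list_beside(script_file, subdir):
    """List the entries of subdir, which sits next to script_file."""
    script_dir = Path(script_file).resolve().parent
    return sorted(entry.name for entry in (script_dir / subdir).iterdir())


with tempfile.TemporaryDirectory() as tmp:
    setup_py = Path(tmp) / "setup.py"
    setup_py.touch()
    (Path(tmp) / "inner_dir").mkdir()
    (Path(tmp) / "inner_dir" / "file2.py").touch()
    entries = list_beside(setup_py, "inner_dir")

print(entries)  # ['file2.py']
```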
I ran into the same problem with directories containing nested subdirectories. The glob solutions didn't work, as they would include the directory in the list, which setup would blow up on, and in cases where I excluded matched directories, it still dumped them all in the same directory, which is not what I wanted, either. I ended up just falling back on os.walk():
import os
import sys
from itertools import chain

def generate_data_files():
    data_files = []
    data_dirs = ('data', 'plugins')
    for path, dirs, files in chain.from_iterable(os.walk(data_dir) for data_dir in data_dirs):
        install_dir = os.path.join(sys.prefix, 'share/<my-app>/' + path)
        list_entry = (install_dir, [os.path.join(path, f) for f in files if not f.startswith('.')])
        data_files.append(list_entry)
    return data_files
and then setting data_files=generate_data_files() in the setup() block.
With nested subdirectories, if you want to preserve the original directory structure, you can use os.walk(), as proposed in another answer.
However, an easier solution uses pbr library, which extends setuptools. See here for documentation on how to use it to install an entire directory structure:
https://docs.openstack.org/pbr/latest/user/using.html#files

How to specify header files in setup.py script for Python extension module?

How do I specify the header files in a setup.py script for a Python extension module? Listing them with source files as follows does not work. But I can not figure out where else to list them.
from distutils.core import setup, Extension
from glob import glob
setup(
    name = "Foo",
    version = "0.1.0",
    ext_modules = [Extension('Foo', glob('Foo/*.cpp') + glob('Foo/*.h'))]
)
Add MANIFEST.in file besides setup.py with following contents:
graft relative/path/to/directory/of/your/headers/
Try the headers kwarg to setup(). I don't know that it's documented anywhere, but it works.
setup(name='mypkg', ..., headers=['src/includes/header.h'])
I've had so much trouble with setuptools it's not even funny anymore.
Here's how I ended up having to use a workaround in order to produce a working source distribution with header files: I used package_data.
I'm sharing this in order to potentially save someone else the aggravation. If you know a better working solution, let me know.
See here for details:
https://bitbucket.org/blais/beancount/src/ccb3721a7811a042661814a6778cca1c42433d64/setup.py?fileviewer=file-view-default#setup.py-36
# A note about setuptools: It's profoundly BROKEN.
#
# - The header files are needed in order to distribution a working
# source distribution.
# - Listing the header files under the extension "sources" fails to
# build; distutils cannot make out the file type.
# - Listing them as "headers" makes them ignored; extra options to
# Extension() appear to be ignored silently.
# - Listing them under setup()'s "headers" makes it recognize them, but
# they do not get included.
# - Listing them with "include_dirs" of the Extension fails as well.
#
# The only way I managed to get this working is by working around and
# including them as "packaged data" (see {63fc8d84d30a} below). That
# includes the header files in the sdist, and a source distribution can
# be installed using pip3 (and be built locally). However, the header
# files end up being installed next to the pure Python files in the
# output. This is the sorry situation we're living in, but it works.
There's a corresponding ticket in my OSS project:
https://bitbucket.org/blais/beancount/issues/72
If I remember right you should only need to specify the source files and it's supposed to find/use the headers.
In the setup-tools manual, I see something about this I believe.
"For example, if your extension requires header files in the include directory under your distribution root, use the include_dirs option"
Extension('foo', ['foo.c'], include_dirs=['include'])
http://docs.python.org/distutils/setupscript.html#preprocessor-options

Having py2exe include my data files (like include_package_data)

I have a Python app which includes non-Python data files in some of its subpackages. I've been using the include_package_data option in my setup.py to include all these files automatically when making distributions. It works well.
Now I'm starting to use py2exe. I expected it to see that I have include_package_data=True and to include all the files. But it doesn't. It puts only my Python files in the library.zip, so my app doesn't work.
How do I make py2exe include my data files?
I ended up solving it by giving py2exe the option skip_archive=True. This caused it to put the Python files not in library.zip but simply as plain files. Then I used data_files to put the data files right inside the Python packages.
include_package_data is a setuptools option, not a distutils one. In classic distutils, you have to specify the location of data files yourself, using the data_files = [] directive. py2exe is the same. If you have many files, you can use glob or os.walk to retrieve them. See for example the additional changes (datafile additions) required to setup.py to make a module like MatPlotLib work with py2exe.
There is also a mailing list discussion that is relevant.
Here's what I use to get py2exe to bundle all of my files into the .zip. Note that to get at your data files, you need to open the zip file. py2exe won't redirect the calls for you.
setup(windows=[target],
      name="myappname",
      data_files = [('', ['data1.dat', 'data2.dat'])],
      options = {'py2exe': {
          "optimize": 2,
          "bundle_files": 2, # This tells py2exe to bundle everything
      }},
)
The full list of py2exe options is here.
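Since the data files have to be read out of the zip by your own code, here is a minimal sketch of doing that with the standard zipfile module. It assumes the file ends up as a top-level member of the archive; read_bundled_file is a hypothetical helper name, and the demo writes its own small library.zip in a temporary directory instead of using a real py2exe build.

```python
import os
import tempfile
import zipfile


def read_bundled_file(archive_path, member_name):
    """Return the raw bytes of member_name stored inside the zip at archive_path."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.read(member_name)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the zip that the build would produce.
    archive_path = os.path.join(tmp, "library.zip")
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("data1.dat", "hello")
    payload = read_bundled_file(archive_path, "data1.dat")

print(payload)  # b'hello'
```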
I have been able to do this by overriding one of py2exe's functions, and then just inserting them into the zipfile that is created by py2exe.
Here's an example:
import os
import py2exe
import zipfile

myFiles = [
    "C:/Users/Kade/Documents/ExampleFiles/example_1.doc",
    "C:/Users/Kade/Documents/ExampleFiles/example_2.dll",
    "C:/Users/Kade/Documents/ExampleFiles/example_3.obj",
    "C:/Users/Kade/Documents/ExampleFiles/example_4.H",
]

def better_copy_files(self, destdir):
    """Overridden so that things can be included in the library.zip."""
    #Run function as normal
    original_copy_files(self, destdir)
    #Get the zipfile's location
    if self.options.libname is not None:
        libpath = os.path.join(destdir, self.options.libname)
        #Re-open the zip file
        if self.options.compress:
            compression = zipfile.ZIP_DEFLATED
        else:
            compression = zipfile.ZIP_STORED
        arc = zipfile.ZipFile(libpath, "a", compression = compression)
        #Add your items to the zipfile
        for item in myFiles:
            if self.options.verbose:
                print("Copy File %s to %s" % (item, libpath))
            arc.write(item, os.path.basename(item))
        arc.close()

#Connect overrides
original_copy_files = py2exe.runtime.Runtime.copy_files
py2exe.runtime.Runtime.copy_files = better_copy_files
I got the idea from here, but unfortunately py2exe has changed how they do things since then. I hope this helps someone out.
