Can I read non-code files in a Python zip archive?

I have a Python application in a directory dir. This directory has a __main__.py file and several data files that are read by the application using open(..., 'r'). Without editing the code, is it possible to bundle the code and data files into a single zip file and execute it with something like python app.pyz?
My goal is to share the file and data easily.
Running the application using python dir works fine.
If I make a zip file using python -m zipfile -c app.pyz dir/*, the resulting application will run but cannot read the files. This makes sense.
I can ask the customers to unzip the compressed folder before running, or I could embed the files as strings within the code. That said, I'm curious whether this can be avoided.
Can I bundle code and data into one file?

As of Python 3.9 you can use importlib.resources.files() from the standard library. This module uses Python's import machinery to resolve the paths of data files as though they were modules inside a package.
Create a new package inside dir. Let's call it data. Make sure it has an __init__.py.
Add your data files to data. Let's say you added a text file text.txt and a binary file binary.dat.
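For reference, here is a sketch of the layout this answer assumes (the names data, text.txt and binary.dat are just the ones used in the examples):
dir/
├── __main__.py
└── data/
    ├── __init__.py
    ├── text.txt
    └── binary.dat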
Now from your __main__.py script or any part of your code with access to the module data, you can access files inside that package like so:
To read text.txt into memory as a string:
import importlib.resources

txt_file = importlib.resources.files("data").joinpath("text.txt").read_text(encoding="utf-8")
To read binary.dat into memory as bytes:
bin_file = importlib.resources.files("data").joinpath("binary.dat").read_bytes()
To open any file:
path = importlib.resources.files("data").joinpath("text.txt")
with path.open("rt", encoding="utf-8") as file:
    lines = file.readlines()
# As streams:
textio_stream = importlib.resources.files("data").joinpath("text.txt").open("rt", encoding="utf-8")
bytesio_stream = importlib.resources.files("data").joinpath("binary.dat").open("rb")
If something requires a real file on the filesystem, or you simply want to add zipapp compatibility to existing code (e.g. code using open()) without having to modify it:
# Old, incompatible with zipfiles.
file_path = "data/text.txt"
with open(file_path, "rt", encoding="utf-8") as file:
    lines = file.readlines()

# New, compatible with zipfiles.
file_path = importlib.resources.files("data").joinpath("text.txt")
# If the file is inside a zipfile, as_file() extracts it to a temporary file
# and destroys it once the context manager closes. Otherwise, it reads the
# file normally.
with importlib.resources.as_file(file_path) as path:
    with open(path, "rt", encoding="utf-8") as file:
        lines = file.readlines()
# Since it is a context manager, you can even store it like this:
file_path = importlib.resources.files("data").joinpath("text.txt")
real_path = importlib.resources.as_file(file_path)
with real_path as path:
    with open(path, "rt", encoding="utf-8") as file:
        lines = file.readlines()
The Traversable objects returned from importlib.resources functions can be combined with pathlib.Path objects via as_posix(), since joinpath requires POSIX separators:
import pathlib

file_path = pathlib.Path("subdirectory", "text.txt")
txt_file = importlib.resources.files("data").joinpath(file_path.as_posix()).read_text(encoding="utf-8")
You can use slashes to grow a Traversable, just like pathlib.Path objects:
resources_root = importlib.resources.files("data")
text_path = resources_root / "text.txt"
bin_file = (resources_root / "subdirectory" / "bin.dat").read_bytes()
You can also import the data package like any other package, and use the module object directly. Subpackages are also supported. The only Python files inside the data tree are the __init__.py files of each subpackage:
# __main__.py
import importlib.resources
import data.config
import data.models.b
# Load binary file `file.dat` from `data.models.b`.
# Subpackages are being used as subdirectories.
bin_file = importlib.resources.files(data.models.b).joinpath("file.dat").read_bytes()
...
You technically only need the resource root directory to be a package. For maximum brevity:
# __main__.py
from importlib.resources import files
data = files("data") # Resources root.
# In this example, `models` and `b` are regular directories:
bin_file = (data / "models" / "b" / "file.dat").read_bytes()
...
Note that importlib.resources and zipfiles in general are read-only; you will get an exception if you try to write to any file-like object returned from the functions above. It might technically be possible to support modifying data files inside zips, but that is well out of scope here. If you want to write files, just open a file on the filesystem as normal.
Now your data files have become file-system agnostic and your program should work via zipapp and normal invocation just the same.
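To build the archive itself, the standard-library zipapp module can do the bundling; a minimal sketch, assuming the dir layout from the question:
import zipapp

# Bundle the application directory (which contains __main__.py and the
# data package) into a single runnable archive.
zipapp.create_archive('dir', target='app.pyz')
You can then run the result with python app.pyz, just like in the question.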

Related

Remove auto-generated __MACOSX folder from inside a zip file in Python

I have zip files uploaded by clients through a web server that sometimes contain pesky __MACOSX directories inside that gum things up. How can I remove these?
I thought of using ZipFile, but this answer says that isn't possible and gives this suggestion:
Read out the rest of the archive and write it to a new zip file.
How can I do this with ZipFile? Another Python based alternative like shutil or something similar would also be fine.
The examples below determine whether a __MACOSX entry is contained within a zip file. If this pesky directory exists, a new zip archive is created and all the files that are not __MACOSX files are written to this new archive. The code can be extended to include .DS_Store files as well. Please let me know if you need to delete the old zip file and replace it with the new, clean one.
Hopefully, these answers help you solve your issue.
Example One
from zipfile import ZipFile

original_zip = ZipFile('original.zip', 'r')
new_zip = ZipFile('new_archive.zip', 'w')
for item in original_zip.infolist():
    buffer = original_zip.read(item.filename)
    if not item.filename.startswith('__MACOSX/'):
        new_zip.writestr(item, buffer)
new_zip.close()
original_zip.close()
Example Two
import os
from zipfile import ZipFile

def check_archive_for_bad_filename(file):
    zip_file = ZipFile(file, 'r')
    for filename in zip_file.namelist():
        print(filename)
        if filename.startswith('__MACOSX/'):
            return True

def remove_bad_filename_from_archive(original_file, temporary_file):
    zip_file = ZipFile(original_file, 'r')
    for item in zip_file.namelist():
        buffer = zip_file.read(item)
        if not item.startswith('__MACOSX/'):
            if not os.path.exists(temporary_file):
                new_zip = ZipFile(temporary_file, 'w')
                new_zip.writestr(item, buffer)
                new_zip.close()
            else:
                append_zip = ZipFile(temporary_file, 'a')
                append_zip.writestr(item, buffer)
                append_zip.close()
    zip_file.close()
archive_filename = 'old.zip'
temp_filename = 'new.zip'

results = check_archive_for_bad_filename(archive_filename)
if results:
    print('Removing MACOSX file from archive.')
    remove_bad_filename_from_archive(archive_filename, temp_filename)
else:
    print('No MACOSX file in archive.')
The idea would be to use ZipFile to extract the contents into some defined folder then remove the __MACOSX entry (os.rmdir, os.remove) and then compress it again.
Depending on whether you have the zip command on your OS, you might be able to skip the re-compressing part altogether. You can also drive this command from Python using os.system or the subprocess module, as sketched below.
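A sketch of that shortcut, assuming the Info-ZIP zip command is installed: zip -d deletes matching entries in place, so no re-compression pass is needed.
import subprocess

# Delete all entries under __MACOSX/ from the archive in place.
# Note: zip exits with a non-zero status if nothing matched, which
# check=True turns into a CalledProcessError.
subprocess.run(['zip', '-d', 'old.zip', '__MACOSX/*'], check=True)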

How to construct an in-memory virtual file system and then write this structure to disk

I'm looking for a way to create a virtual file system in Python for creating directories and files, before writing these directories and files to disk.
Using PyFilesystem I can construct a memory filesystem using the following:
>>> import fs
>>> dir = fs.open_fs('mem://')
>>> dir.makedirs('fruit')
SubFS(MemoryFS(), '/fruit')
>>> dir.makedirs('vegetables')
SubFS(MemoryFS(), '/vegetables')
>>> with dir.open('fruit/apple.txt', 'w') as apple: apple.write('braeburn')
...
8
>>> dir.tree()
├── fruit
│ └── apple.txt
└── vegetables
Ideally, I want to be able to do something like:
dir.write_to_disk('<base path>')
To write this structure to disk, where <base path> is the parent directory in which this structure will be created.
As far as I can tell, PyFilesystem has no way of achieving this. Is there anything else I could use instead or would I have to implement this myself?
You can use fs.copy.copy_fs() to copy from one filesystem to another, or fs.move.move_fs() to move the filesystem altogether.
Since PyFilesystem also abstracts over the underlying system filesystem (OSFS; in fact, it's the default protocol), all you need to do is copy your in-memory filesystem (MemoryFS) to it, and in effect you'll have written it to disk:
import fs
import fs.copy

mem_fs = fs.open_fs('mem://')
mem_fs.makedirs('fruit')
mem_fs.makedirs('vegetables')
with mem_fs.open('fruit/apple.txt', 'w') as apple:
    apple.write('braeburn')

# write to the CWD for testing...
with fs.open_fs(".") as os_fs:  # use a custom path if you want, i.e. osfs://<base_path>
    fs.copy.copy_fs(mem_fs, os_fs)
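The equivalent of the hypothetical write_to_disk('<base path>') from the question would be to open the base path as an OSFS and copy into it; a sketch with a made-up target directory:
# create=True makes the target directory first if it does not exist yet.
with fs.open_fs('osfs://output_dir', create=True) as os_fs:
    fs.copy.copy_fs(mem_fs, os_fs)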
If you just want to stage a file system tree in memory, look at the tarfile module.
Creating files and directories is a bit involved:
import io
import tarfile

tarblob = io.BytesIO()
tar = tarfile.TarFile(mode="w", fileobj=tarblob)

dirinfo = tarfile.TarInfo("directory")
dirinfo.mode = 0o755
dirinfo.type = tarfile.DIRTYPE
tar.addfile(dirinfo, None)

filedata = io.BytesIO(b"Hello, world!\n")
fileinfo = tarfile.TarInfo("directory/file")
fileinfo.size = len(filedata.getbuffer())
tar.addfile(fileinfo, filedata)
tar.close()
But then you can create the file system hierarchy using TarFile.extractall:
tarblob.seek(0) # Rewind to the beginning of the buffer.
tar = tarfile.TarFile(mode="r", fileobj=tarblob)
tar.extractall()
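If you want the tree materialized under a specific base directory instead of the current one, extractall accepts a path argument (the directory name here is just an example):
tar.extractall(path='output_dir')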

Python tar.add files but omit parent directories

I am trying to create a tar file from a list of files stored in a text file. I have working code to create the tar, but I wish to start the archive from a certain directory (app and all its subdirectories) and remove the parent directories, because the software only opens the file from a certain directory.
package.list files are as below:
app\myFile
app\myDir\myFile
app\myDir\myFile2
If I omit the path in restore.add, it cannot find the files, because my program runs from elsewhere. How do I tell the tar to start at a particular directory, or to add the files while maintaining the directory structure from the text file, i.e. starting with app rather than all the parent dirs?
My objective is to do this tar cf restore.tar -T package.list but with Python on Windows.
I have tried basename from here: How to compress a tar file in a tar.gz without directory?, but this strips out ALL the directories.
I have also tried using arcname='app' in the .add method; however, this gives some weird results: it breaks the directory structure and renames loads of files to app.
import tarfile

path = foo + '\\' + bar
file = open(path + '\\package.list', 'r')
restore = tarfile.open(path + '\\restore.tar', 'w')
for line in file:
    restore.add(path + '\\' + line.strip())
restore.close()
file.close()
Using Python 2.7
You can use the second argument of TarFile.add (arcname); it specifies the name the file gets inside the archive.
So, assuming every path is sane, something like this would work:
import tarfile

prefix = "some_dir/"
archive_path = "inside_dir/file.txt"
with tarfile.open("test.tar", "w") as tar:
    tar.add(prefix + archive_path, archive_path)
Usage:
> cat some_dir/inside_dir/file.txt
test
> python2 test_tar.py
> tar --list -f ./test.tar
inside_dir/file.txt
In production, I'd advise using an appropriate module (such as os.path) for path handling, to make sure every slash and backslash is in the right place. Applied to the question's loop, it might look like the sketch below.
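A sketch, reusing foo, bar and package.list from the question, with os.path.join handling the separators:
import os
import tarfile

path = os.path.join(foo, bar)  # foo and bar come from the question
with tarfile.open(os.path.join(path, 'restore.tar'), 'w') as restore:
    with open(os.path.join(path, 'package.list')) as listing:
        for line in listing:
            rel = line.strip()
            if rel:
                # Add each file under its listed relative name (app\...),
                # dropping the parent directories from the archive.
                restore.add(os.path.join(path, rel), arcname=rel)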

How do I make a Python script that can run from any directory, so the script file doesn't have to be in the same directory as the .csv files?

Currently, I have created a script that makes graphs from data in .csv files. However, I can only run it if it is present in the folder with the csv files. How can I make the script work without it having to be in the same directory as the .csv files?
Assuming you mean to include a fixed CSV file with your code, store an absolute path based on the script path:
import os

HERE = os.path.dirname(os.path.abspath(__file__))
csv_file = open(os.path.join(HERE, 'somefile.csv'))
__file__ is the filename of the current module or script, and os.path.dirname(__file__) is the directory the module resides in. For scripts, __file__ can be a relative pathname, so we use os.path.abspath() to turn it into an absolute path.
This means you can run your script from anywhere.
If you meant to make your script work with arbitrary CSV input files, use command line options:
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser('CSV importer')
    # FileType('r') opens the named file for reading; 'w' would truncate it.
    # nargs='?' makes the argument optional so the default can kick in.
    parser.add_argument('csvfile', type=argparse.FileType('r'),
                        nargs='?', default='somedefaultfilename.csv')
    options = parser.parse_args()
    import_function(options.csvfile)
where csvfile will be an open file object, so your import_function() can just do:
import csv

def import_function(csvfile):
    with csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            pass  # etc.
If you don't plan on moving around the csv files too much, the best answer is to hard code the absolute path to the csv folder into the script.
import os
csvdir = "/path/to/csv/dir"
csvfpath = os.path.join(csvdir, "myfile.csv")
csvfile = open(csvfpath)
You can also use a command line parser like argparse to let the user easily change the path to the csv files.
Using Martijn Pieters's solution will only work if you are going to be moving the folder containing both the script and the csv files around together. However, in that case you may as well just use relative paths to the csv files.

create zip of complete directory using zipfile python module

zip = zipfile.ZipFile(destination+ff_name,"w")
zip.write(source)
zip.close()
Above is the code that I am using, and here "source" is the path of the directory. But when I run this code it just zips the source folder itself, not the files and folders contained in it. I want it to compress the source folder recursively. Using the tarfile module I can do this without passing any additional information.
The standard os.walk() function will likely be of great use for this.
Alternatively, reading the tarfile module to see how it does its work will certainly be of benefit. Indeed, looking at how pieces of the standard library were written was an invaluable part of my learning Python.
I haven't tested this exactly, but it's something similar to what I use.
import os
import zipfile

zip = zipfile.ZipFile(destination + ff_name, 'w', zipfile.ZIP_DEFLATED)
rootlen = len(source) + 1
for base, dirs, files in os.walk(source):
    for file in files:
        fn = os.path.join(base, file)
        # Strip the leading source path so entries are archive-relative.
        zip.write(fn, fn[rootlen:])
This example is from here:
http://bitbucket.org/jgrigonis/mathfacts/src/ff57afdf07a1/setupmac.py
I'd like to add a "new" Python 2.7 feature to this topic: ZipFile can be used as a context manager, so you can do things like this:
with zipfile.ZipFile(my_file, 'w') as myzip:
    rootlen = len(xxx)  # use the sub-part of the path which you want to keep in your zip file
    for base, dirs, files in os.walk(pfad):
        for ifile in files:
            fn = os.path.join(base, ifile)
            myzip.write(fn, fn[rootlen:])
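As an aside, if you don't need per-file control, shutil.make_archive (also new in Python 2.7) does the recursive walk for you; a minimal sketch with placeholder names:
import shutil

# Creates my_archive.zip containing everything under source_dir.
shutil.make_archive('my_archive', 'zip', root_dir='source_dir')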
