Python: Untar a single folder from a tarball

Given a tarball containing multiple directories, how do I extract just a single, specific directory?
import tarfile
tar = tarfile.open("/path/to/tarfile.tar.gz")
tar.list()
... rootdir/subdir_1/file_1.ext
... rootdir/subdir_1/file_n.ext
... rootdir/subdir_2/file_1.ext
etc.
How would I extract just the files from subdir_2?
NOTE: The entire operation is being done in memory a la...
import tarfile, urllib2, StringIO
data = urllib2.urlopen(url)
tar = tarfile.open(mode = 'r|*', fileobj = StringIO.StringIO(data.read()))
... so it's not feasible to extract all to disk and move the necessary folder.

You seem to be almost there - I think you can just use the contents of getnames() and combine it with extractfile() to process the files in memory, e.g.:
files = (name for name in tar.getnames() if name.startswith('rootdir/subdir_2/'))
for name in files:
    fileobj = tar.extractfile(name)  # file-like object to process in memory
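For reference, a minimal Python 3 sketch of the same in-memory approach (urllib2 and StringIO were replaced by urllib.request and io in Python 3; url, rootdir and subdir_2 are placeholders from the question):
import io
import tarfile
import urllib.request

data = urllib.request.urlopen(url).read()
# BytesIO is seekable, so random-access mode 'r:*' can be used instead of 'r|*'.
tar = tarfile.open(mode='r:*', fileobj=io.BytesIO(data))
for member in tar.getmembers():
    if member.isfile() and member.name.startswith('rootdir/subdir_2/'):
        contents = tar.extractfile(member).read()  # process in memory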


Can I read non-code files in a Python Zip archive?

I have a Python application in a directory dir. This directory has a __main__.py file and several data files that are read by the application using open(..., 'r'). Without editing the code, is it possible to bundle the code and data files into a single zip file and execute it using something like python app.pyz?
My goal is to share the file and data easily.
Running the application using python dir works fine.
If I make a zip file using python -m zipfile -c app.pyz dir/*, the resulting application will run but cannot read the files. This makes sense.
I can ask the customers to unzip the compressed folder before running, or I could embed the files as strings within the code. That said, I'm curious whether this can be avoided.
Can I bundle code and data into one file?
As of Python 3.9 you can use importlib.resources from the standard library. This module uses Python's import machinery to resolve the paths of data files as though they were modules inside a package.
Create a new package inside dir. Let's call it data. Make sure it has an __init__.py.
Add your data files to data. Let's say you added a text file text.txt and a binary file binary.dat.
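With the names used here (text.txt and binary.dat are just examples), the layout would look like this:
dir/
├── __main__.py
└── data/
    ├── __init__.py
    ├── text.txt
    └── binary.dat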
Now from your __main__.py script or any part of your code with access to the module data, you can access files inside that package like so:
To read text.txt to memory as a string:
import importlib.resources

txt_file = importlib.resources.files("data").joinpath("text.txt").read_text(encoding="utf-8")
To read binary.dat to memory as bytes:
bin_file = importlib.resources.files("data").joinpath("binary.dat").read_bytes()
To open any file:
path = importlib.resources.files("data").joinpath("text.txt")
with path.open("rt", encoding="utf-8") as file:
    lines = file.readlines()
# As streams:
textio_stream = importlib.resources.files("data").joinpath("text.txt").open("rt", encoding="utf-8")
bytesio_stream = importlib.resources.files("data").joinpath("binary.dat").open("rb")
If something requires an actual real file on the filesystem, or you simply want to wrap zipapp compatibility over existing code (e.g. with open()) without having to modify it:
# Old, incompatible with zipfiles.
file_path = "data/text.txt"
with open(file_path, "rt", encoding="utf-8") as file:
    lines = file.readlines()

# New, compatible with zipfiles.
file_path = importlib.resources.files("data").joinpath("text.txt")
# If the file is inside a zipfile, this unzips it into a temporary file, then
# destroys it once the context manager closes. Otherwise, it reads the file normally.
with importlib.resources.as_file(file_path) as path:
    with open(path, "rt", encoding="utf-8") as file:
        lines = file.readlines()
# Since it is a context manager, you can even store it like this:
file_path = importlib.resources.files("data").joinpath("text.txt")
real_path = importlib.resources.as_file(file_path)
with real_path as path:
    with open(path, "rt", encoding="utf-8") as file:
        lines = file.readlines()
The Traversable objects returned from importlib.resources functions can be mixed with pathlib.Path objects via as_posix(), since joinpath expects POSIX separators:
file_path = pathlib.Path("subdirectory", "text.txt")
txt_file = importlib.resources.files("data").joinpath(file_path.as_posix()).read_text(encoding="utf-8")
You can use slashes to grow a Traversable, just like pathlib.Path objects:
resources_root = importlib.resources.files("data")
text_path = resources_root / "text.txt"
bin_file = (resources_root / "subdirectory" / "bin.dat").read_bytes()
You can also import the data package like any other package, and use the module object directly. Subpackages are also supported. The only Python files inside the data tree are the __init__.py files of each subpackage:
# __main__.py
import importlib.resources
import data.config
import data.models.b
# Load binary file `file.dat` from `data.models.b`.
# Subpackages are being used as subdirectories.
bin_file = importlib.resources.files(data.models.b).joinpath("file.dat").read_bytes()
...
You technically only need to make your resource root directory be a package. For max brevity:
# __main__.py
from importlib.resources import files
data = files("data") # Resources root.
# In this example, `models` and `b` are regular directories:
bin_file = (data / "models" / "b" / "file.dat").read_bytes()
...
Note that importlib.resources and zip archives in general are read-only: you will get an exception if you try to write to any file-like object returned from the above functions. It might technically be possible to support modifying data files inside zips, but that is well out of scope. If you want to write files, just open a file on the filesystem as normal.
Now your data files have become file-system agnostic and your program should work via zipapp and normal invocation just the same.
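For example, a quick way to build and run the archive is the standard-library zipapp module (equivalent in spirit to the zipfile command from the question):
python -m zipapp dir -o app.pyz
python app.pyz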

Operating on files in current directory and putting them into new directory

I am currently in a working directory where following files are present
abcde_file
gvmdgv_file
qst_file
rl.txt
qp.txt
trs_file
I want to do some operations on all files ending in _file and put them into a new directory called newdir.
My try:
from glob import glob
files = glob("*_file")
with open('newdir/{}'.format(files), 'a') as a:
    with open(files, 'r') as r:
        # required operations
It gives an error saying the file name is too long for with open('newdir/{}'.format(files), 'a').
That is because the variable files is a list containing every matching file name, so formatting it into the path produces one absurdly long string. You should loop over each element of the list and copy the files one by one instead.
Something like this should work:
from glob import glob
import os

files = glob("*_file")
os.makedirs(newdir, exist_ok=True)  # make sure the target directory exists
for file in files:
    with open(file, 'r') as oldf, open(os.path.join(newdir, file), 'w+') as newf:
        newf.write(oldf.read())
If you are copying files other than text files, you might want to open the two file handles with 'rb' and 'wb+' instead. The variable newdir could be a constant holding the new directory path, for example.
You might want to have a look at Python's native high-level file operations library, shutil:
import shutil
shutil.copy("filepath to copy", "path-to-destination-folder")
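Applied to this question, a minimal sketch (assuming the newdir name from the question) could be:
from glob import glob
import os
import shutil

os.makedirs('newdir', exist_ok=True)  # create the target directory if needed
for name in glob('*_file'):
    shutil.copy(name, 'newdir')  # copies file data and permission bits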

Extract Tar File inside Memory Filesystem

I have trouble using MemoryFS from pyfilesystem:
https://docs.pyfilesystem.org/en/latest/reference/memoryfs.html
I'm trying to extract a tar archive inside a MemoryFS, but I can't simply hand mem_fs to tarfile, because it is an object and there is no real filesystem path into the in-memory filesystem...
from fs import open_fs, copy
import fs
import tarfile

mem_fs = open_fs('mem://')
print(mem_fs.isempty('.'))
fs.copy.copy_file('//TEST_FS', 'test.tar', mem_fs, 'test.tar')
print(mem_fs.listdir('/'))
with mem_fs.open('test.tar') as tar_file:
    print(tar_file.read())
    tar = tarfile.open(tar_file)  # I can't create the tar ...
    tar.extractall(mem_fs + 'Extract_Dir')  # Can't extract it either ...
Can someone help me? Is it possible to do that?
The first argument to tarfile.open is a filename. You're (a) passing it an open file object, and (b) even if you were to pass in a filename, tarfile doesn't know anything about your in-memory filesystem and so wouldn't be able to find the file.
Fortunately, tarfile.open has a fileobj argument that accepts an open file object, so you can write:
with mem_fs.open('test.tar', 'rb') as tar_file:
    tar = tarfile.open(fileobj=tar_file)
    tar.list()
Note that you need to open the file in binary mode (rb).
Of course, now you have a second problem: while you can open and read the archive, the tarfile module still doesn't know about your in-memory filesystem, so attempting to extract files will simply extract them to your local filesystem, which is probably not what you want.
To extract into your in-memory filesystem, you're going to need to read the data from the tar archive member and write it yourself. Here's one option for doing that:
import fs
import os
import pathlib
import tarfile

mem_fs = fs.open_fs('mem://')
fs.copy.copy_file('/', '{}/example.tar.gz'.format(os.getcwd()),
                  mem_fs, 'example.tar.gz')

with mem_fs.open('example.tar.gz', 'rb') as fd:
    tar = tarfile.open(fileobj=fd)
    # iterate over the list of members
    for member in tar.getmembers():
        # if the member is a regular file
        if member.isfile():
            # create any necessary directories
            p = pathlib.Path(member.path)
            mem_fs.makedirs(str(p.parent), recreate=True)
            # open the archive member
            with mem_fs.open(member.path, 'wb') as memfd, \
                    tar.extractfile(member.path) as tarfd:
                # and write the data into the memory fs
                memfd.write(tarfd.read())
The tarfile.TarFile.extractfile method returns an open file object to a tar archive member, rather than extracting the file to disk.
Note that the above isn't an optimal solution if you're working with large files (since it reads the entire archive member into memory before writing it out).
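If that matters, one possible refinement (a sketch, not part of the answer above) is to stream each member in chunks with shutil.copyfileobj instead of a single read():
import shutil

with mem_fs.open(member.path, 'wb') as memfd, \
        tar.extractfile(member.path) as tarfd:
    # copy in 64 KiB chunks rather than loading the whole member at once
    shutil.copyfileobj(tarfd, memfd, length=64 * 1024)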

Collecting comment data from multiple Rar files without unzipping

I wanted to collect the comment data from multiple archives (the optional comment you see in the side panel when opening a Zip or a Rar file), but now I realize that they are not Zip but Rar files. What do I need to change in order for it to work on Rar files?
import os
from zipfile import ZipFile

rootFolder = u"C:/Users/user/Desktop/archives/"
zipfiles = [os.path.join(rootFolder, f) for f in os.listdir(rootFolder)]

for zfile in zipfiles:
    print("Opening: {}".format(zfile))
    with ZipFile(zfile, 'r') as testzip:
        print(testzip.comment)  # comment for the entire zip
        l = testzip.infolist()  # list all files in the archive
        for finfo in l:
            # per file/directory comments
            print("{}:{}".format(finfo.filename, finfo.comment))
You need to use the rarfile module. ZipFile.comment can only read the comment from a ZIP archive.
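A minimal sketch using the third-party rarfile package (pip install rarfile; it also needs an unrar-capable backend installed); the archive path is a placeholder:
import rarfile

rf = rarfile.RarFile("C:/Users/user/Desktop/archives/example.rar")
print(rf.comment)  # comment for the entire archive
for finfo in rf.infolist():  # member metadata, mirroring ZipFile.infolist()
    print(finfo.filename)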

How do you concatenate all the HDF5 files in a given directory?

I have many HDF5 files in a directory and I want to concatenate all of them. I tried the following:
from glob import iglob
import shutil
import os
PATH = r'C:\Dropbox\data_files'
destination = open('data.h5','wb')
for filename in iglob(os.path.join(PATH, '*.h5')):
    shutil.copyfileobj(open(filename, 'rb'), destination)
destination.close()
However, this only creates an empty file. Each HDF5 file contains two datasets, but I only care about taking the second one (which is named the same thing in each) and adding it to a new file.
Is there a better way of concatenating HDF files? Is there a way to fix my method?
You can combine IPython with the h5py module and the h5copy tool.
Once h5copy and h5py are installed, just open the IPython console in the folder where all your .h5 files are stored and use this code to merge them into an output.h5 file:
import h5py
import os

d_names = os.listdir(os.getcwd())
d_struct = {}  # Here we will store the database structure
for i in d_names:
    f = h5py.File(i, 'r+')
    d_struct[i] = list(f.keys())  # materialize the keys before closing the file
    f.close()

for i in d_names:
    for j in d_struct[i]:
        !h5copy -i '{i}' -o 'output.h5' -s {j} -d {j}
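As for a "better way": since you only care about the second dataset in each file, a pure-h5py sketch could copy just that dataset into the output. Here 'data2' is a hypothetical name for that dataset, and each copy is renamed after its source file to avoid name collisions:
import os
from glob import glob
import h5py

files = [p for p in sorted(glob('*.h5')) if p != 'data.h5']  # inputs only
with h5py.File('data.h5', 'w') as out:
    for path in files:
        with h5py.File(path, 'r') as src:
            # 'data2' is a placeholder for the dataset's actual name
            src.copy('data2', out, name=os.path.splitext(path)[0])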
