Python: tarfile stream

I would like to read some files from a tarball and save them to a new tarball.
This is the code I wrote.
import tarfile

archive = 'dum/2164/archive.tar'
# Read input data.
input_tar = tarfile.open(archive, 'r|')
tarinfo = input_tar.next()
input_tar.close()
# Write output file.
output_tar = tarfile.open('foo.tar', 'w|')
output_tar.addfile(tarinfo)
output_tar.close()
Unfortunately, the output tarball is no good:
$ tar tf foo.tar
./1QZP_A--2JED_A--not_reformatted.dat.bz2
tar: Truncated input file (needed 1548288 bytes, only 1545728 available)
tar: Error exit delayed from previous errors.
Any clue how to read and write tarballs on the fly with Python?

OK, so this is how I managed to do it:
archive = 'dum/2164/archive.tar'
# Read input data.
input_tar = tarfile.open(archive, 'r|')
tarinfo = input_tar.next()
fileobj = input_tar.extractfile(tarinfo)
# Write output file.
output_tar = tarfile.open('foo.tar', 'w|')
output_tar.addfile(tarinfo, fileobj)
input_tar.close()
output_tar.close()
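If you need every member rather than just the first, the same pattern extends to a loop; a minimal sketch (keeping the streaming 'r|'/'w|' modes, and only calling extractfile for regular files since it returns None for directories and links):
import tarfile

# Stream-copy all members from one tarball to another, on the fly.
with tarfile.open('dum/2164/archive.tar', 'r|') as input_tar, \
        tarfile.open('foo.tar', 'w|') as output_tar:
    for tarinfo in input_tar:
        # extractfile() returns None for non-regular members (dirs, links).
        fileobj = input_tar.extractfile(tarinfo) if tarinfo.isreg() else None
        output_tar.addfile(tarinfo, fileobj)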

Related

Python: Stream gzip files from s3

I have files in S3 stored as gzip chunks, so I have to read the data continuously and can't read random ones. I always have to start with the first file.
For example, let's say I have 3 gzip files in S3: f1.gz, f2.gz, f3.gz. If I download them all locally, I can do cat * | gzip -d. If I do cat f2.gz | gzip -d, it fails with gzip: stdin: not in gzip format.
How can I stream this data from S3 using Python? I saw smart-open, and it has the ability to decompress .gz files with
from smart_open import smart_open, open
with open(path, compression='.gz') as f:
    for line in f:
        print(line.strip())
where path is the path for f1.gz. This works until it hits the end of the file, where it aborts. The same thing happens locally: if I do cat f1.gz | gzip -d, it errors with gzip: stdin: unexpected end of file when it hits the end.
Is there a way to make it stream the files continuously using python?
This one will not abort, and can iterate through f1.gz, f2.gz and f3.gz
with open(path, 'rb', compression='disable') as f:
    for line in f:
        print(line.strip(), end="")
but the output is just bytes. I was thinking it would work by doing python test.py | gzip -d with the above code, but I get an error gzip: stdin: not in gzip format. Is there a way to have Python print output with smart-open that gzip can read?
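For the piping idea specifically, the catch is that print re-encodes the data and mangles newlines; writing the raw chunks to sys.stdout.buffer keeps the gzip stream byte-for-byte intact. A minimal sketch, assuming smart_open's compression='disable' pass-through and placeholder S3 paths:
import sys
from smart_open import open

# Placeholder object keys -- the real bucket/keys come from your own listing.
paths = ["s3://bucket/f1.gz", "s3://bucket/f2.gz", "s3://bucket/f3.gz"]

for path in paths:
    with open(path, 'rb', compression='disable') as f:
        # Copy the raw gzip bytes untouched so `python test.py | gzip -d` works.
        for chunk in iter(lambda: f.read(64 * 1024), b''):
            sys.stdout.buffer.write(chunk)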
For example, let's say I have 3 gzip files in S3: f1.gz, f2.gz, f3.gz. If I download them all locally, I can do cat * | gzip -d.
One idea would be to make a file object to implement this. The file object reads from one filehandle, exhausts it, reads from the next one, exhausts it, etc. This is similar to how cat works internally.
The handy thing about this is that it does the same thing as concatenating all of your files, without the memory use of reading in all of your files at the same time.
Once you have the combined file object wrapper, you can pass it to Python's gzip module to decompress the file.
Examples:
import gzip

class ConcatFileWrapper:
    def __init__(self, files):
        self.files = iter(files)
        self.current_file = next(self.files)

    def read(self, *args):
        ret = self.current_file.read(*args)
        if len(ret) == 0:
            # EOF
            # Optional: close self.current_file here
            # self.current_file.close()
            # Advance to next file and try again
            try:
                self.current_file = next(self.files)
            except StopIteration:
                # Out of files
                # Return an empty string
                return ret
            # Recurse and try again
            return self.read(*args)
        return ret

    def write(self):
        raise NotImplementedError()

filenames = ["xaa", "xab", "xac", "xad"]
filehandles = [open(f, "rb") for f in filenames]
wrapper = ConcatFileWrapper(filehandles)

with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)

# Close all files
[f.close() for f in filehandles]
Here's how I tested this:
I created a file to test this through the following commands.
Create a file with the contents 1 thru 1000.
$ seq 1 1000 > foo
Compress it.
$ gzip foo
Split the file. This produces four files named xaa-xad.
$ split -b 500 foo.gz
Run the above Python file on it, and it should print out 1 - 1000.
Edit: extra remark about lazy-opening the files
If you have a huge number of files, you might want to open only one file at a time. Here's an example:
def open_files(filenames):
    for filename in filenames:
        # Note: this will leak file handles unless you uncomment the code above that closes the file handles again.
        yield open(filename, "rb")
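A minimal usage sketch of the lazy variant (assuming the optional close() call inside read() above is uncommented, so each handle is released as soon as it is exhausted):
import gzip

filenames = ["xaa", "xab", "xac", "xad"]
wrapper = ConcatFileWrapper(open_files(filenames))

with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)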

Is it possible to add raw bytes to a TarFile object in python 3?

I'm creating a Python script that does a backup of various files, and data on my server.
It looks something like this:
#!/usr/bin/env python3
import subprocess
import tarfile
import os
DIRS_TO_BACKUP = []
FILES_TO_BACKUP = []
backup_destination = "/tmp/out.tar.gz"
# Code that adds directories to DIRS_TO_BACKUP
DIRS_TO_BACKUP.append("/opt/PROJECT_DIR/...")
# Code that adds files to FILES_TO_BACKUP
FILES_TO_BACKUP.append("/etc/SOME_FILE")
# Code to backup my database
db_table = subprocess.run(['mysqldump', 'my_database'], stdout=subprocess.PIPE).stdout
with tarfile.open(backup_destination, "w:gz") as tar:
    for DIR in DIRS_TO_BACKUP:
        tar.add(DIR, arcname=os.path.basename(DIR))
    for FILE in FILES_TO_BACKUP:
        tar.add(FILE, arcname=os.path.basename(FILE))
    # Code to save db_table (<class 'bytes'>) to tar somehow
Here, db_table are the raw bytes that represent my database. I want to give this data a filename, and save it in my output tar.gz file as a regular file. Is this possible without first saving db_table to the filesystem?
As you can see in the tarfile docs: https://docs.python.org/3/library/tarfile.html, you can add an in-memory file object to a tar using a TarInfo and addfile (gettarinfo needs a real file with a file descriptor, so it cannot stat an io.BytesIO). Just convert your bytes to a file object using io.BytesIO and set the member's size on the TarInfo.
#!/usr/bin/env python3
import subprocess
import tarfile
import os
import io
DIRS_TO_BACKUP = []
FILES_TO_BACKUP = []
backup_destination = "/tmp/out.tar.gz"
# Code that adds directories to DIRS_TO_BACKUP
DIRS_TO_BACKUP.append("/opt/PROJECT_DIR/...")
# Code that adds files to FILES_TO_BACKUP
FILES_TO_BACKUP.append("/etc/SOME_FILE")
# Code to backup my database
db_table = subprocess.run(['mysqldump', 'my_database'], stdout=subprocess.PIPE).stdout
db_fileobj = io.BytesIO(db_table)
with tarfile.open(backup_destination, "w:gz") as tar:
    for DIR in DIRS_TO_BACKUP:
        tar.add(DIR, arcname=os.path.basename(DIR))
    for FILE in FILES_TO_BACKUP:
        tar.add(FILE, arcname=os.path.basename(FILE))
    # Save db_table (<class 'bytes'>) to the tar as a regular file named "database"
    db_info = tarfile.TarInfo(name="database")
    db_info.size = len(db_table)
    tar.addfile(db_info, fileobj=db_fileobj)
This has worked for me for adding an image that was extracted as a string tensor from a TFRecord (no intermediate file needs to be saved):
import tarfile
from io import BytesIO
import tensorflow as tf
from tqdm import tqdm

# ds is assumed to be a tf.data.Dataset yielding (image, filename, ...) tuples
with tarfile.open('image.tar.gz', 'w:gz') as tar:
    for img, filename, *_ in tqdm(ds.take(5)):  # save 5 for testing
        fname = filename.numpy().decode('utf-8')
        img = tf.io.encode_jpeg(img, quality=100)
        img_fileobj = BytesIO(img.numpy())
        tarinfo = tarfile.TarInfo(name=fname)
        tarinfo.size = img_fileobj.getbuffer().nbytes
        tar.addfile(tarinfo, img_fileobj)

How to extract a specific file from an archive downloaded from the internet using only memory

I'm looking for a way to extract a specific file (knowing its name) from an archive containing multiple ones, without writing any file to the hard drive.
I tried to use both StringIO and zipfile, but I only get the entire archive, or the same error from ZipFile (open requires an argument other than a StringIO object).
Needed behaviour:
archive.zip #containing ex_file1.ext, ex_file2.ext, target.ext
extracted_file #the targeted unzipped file
archive.zip = getFileFromUrl("file_url")
extracted_file = extractFromArchive(archive.zip, target.ext)
What I've tried so far:
import zipfile, requests
data = requests.get("file_url")
zfile = StringIO.StringIO(zipfile.ZipFile(data.content))
needed_file = zfile.open("Needed file name", "r").read()
There is a built-in library, zipfile, made for working with zip archives.
https://docs.python.org/2/library/zipfile.html
You can list the files in an archive:
ZipFile.namelist()
and extract a subset:
ZipFile.extract(member[, path[, pwd]])
EDIT:
This question has info on in-memory zips. TL;DR: ZipFile does work with in-memory file-like objects.
Python in-memory zip library
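A minimal sketch of that approach on Python 3, where io.BytesIO plays the role StringIO plays on Python 2 (the URL and member name below are placeholders):
import io
import zipfile

import requests

resp = requests.get("file_url")                    # placeholder URL
with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
    print(zf.namelist())                           # list all members
    needed_file = zf.read("target.ext")            # read one member into memory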
I finally found out why I didn't succeed, after a few hours of testing:
I was buffering the zipfile object instead of buffering the file itself and then opening it as a ZipFile object, which raised a type error.
Here is the way to do it:
import zipfile, requests, StringIO
data = requests.get(url) # Getting the archive from the url
zfile = zipfile.ZipFile(StringIO.StringIO(data.content)) # Opening it in an emulated file
filenames = zfile.namelist() # Listing all files
for name in filenames:
    if name == "Needed file name": # Verify the file is present
        needed_file = zfile.open(name, "r").read() # Getting the needed file content
        break

Read tarfile in as bytes

I have a setup in AWS where I have a python lambda proxying an s3 bucket containing .tar.gz files. I need to return the .tar.gz file from the python lambda back through the API to the user.
I do not want to untar the file, I want to return the tarfile as is, and it seems the tarfile module does not support reading in as bytes.
I have tried using Python's open (which gives a codec error with utf-8), then codecs.open with errors set to both ignore and replace, which leads to the resulting file not being recognized as .tar.gz.
Implementation (tar binary unpackaging)
try:
    data = client.get_object(Bucket=bucket, Key=key)
    headers['Content-Type'] = data['ContentType']
    if key.endswith('.tar.gz'):
        with open('/tmp/tmpfile', 'wb') as wbf:
            bucketobj.download_fileobj(key, wbf)
        with codecs.open('/tmp/tmpfile', "rb", encoding='utf-8', errors='ignore') as fdata:
            body = fdata.read()
        headers['Content-Disposition'] = 'attachment; filename="{}"'.format(key.split('/')[-1])
Usage (package/aws information redacted for security)
$ wget -v https://<apigfqdn>/release/simple/<package>/<package>-1.0.4.tar.gz
$ tar -xzf <package>-1.0.4.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
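For what it's worth, the usual fix for this kind of corruption is to read the temporary file back with a plain binary open and skip codecs entirely, so the gzip bytes pass through untouched; a minimal sketch against the same /tmp/tmpfile path (the base64 step is only relevant if the proxy in front requires a text-safe body):
import base64

# Read the downloaded archive back as raw bytes -- no text decoding involved.
with open('/tmp/tmpfile', 'rb') as fdata:
    body = fdata.read()

# Only if the API layer cannot pass raw binary through:
encoded_body = base64.b64encode(body).decode('ascii')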

Read .tar.gz file in Python

I have a text file of 25 GB, so I compressed it to tar.gz and it became 450 MB. Now I want to read that file from Python and process the text data. For this I referred to a question, but in my case the code doesn't work. The code is as follows:
import tarfile
import numpy as np
tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    content = f.read()
    Data = np.loadtxt(content)
The error is as follows:
Traceback (most recent call last):
File "dataExtPlot.py", line 21, in <module>
content = f.read()
AttributeError: 'NoneType' object has no attribute 'read'
Also, is there any other method to do this task?
The docs tell us that None is returned by extractfile() if the member is not a regular file or link.
One possible solution is to skip over the None results:
tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f is not None:
        content = f.read()
tarfile.extractfile() can return None if the member is neither a file nor a link. For example your tar archive might contain directories or device files. To fix:
import tarfile
import numpy as np
tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f:
        content = f.read()
        Data = np.loadtxt(content)
You may try this one
t = tarfile.open("filename.gz", "r")
for filename in t.getnames():
    try:
        f = t.extractfile(filename)
        Data = f.read()
        print filename, ':', Data
    except:
        print 'ERROR: Did not find %s in tar archive' % filename
My needs:
Python3.
My tar.gz file consists of multiple utf-8 text files and dir.
Need to read text lines from all files.
Problems:
The file object returned by tar.extractfile() may be None (e.g. for directories).
The content extractfile(fname) returns is a bytes string (e.g. b'Hello\t\xe4\xbd\xa0\xe5\xa5\xbd'), so Unicode characters don't display correctly.
Solutions:
Check the type of the tar member first. I reference the example in the docs of the tarfile lib (search "How to read a gzip compressed tar archive and display some member information").
Decode the bytes string into a normal str (ref: the most-voted answer).
Code:
with tarfile.open("sample.tar.gz", "r:gz") as tar:
for tarinfo in tar:
logger.info(f"{tarinfo.name} is {tarinfo.size} bytes in size and is: ")
if tarinfo.isreg():
logger.info(f"Is regular file: {tarinfo.name}")
f = tar.extractfile(tarinfo.name)
# To get the str instead of bytes str
# Decode with proper coding, e.g. utf-8
content = f.read().decode('utf-8', errors='ignore')
# Split the long str into lines
# Specify your line-sep: e.g. \n
lines = content.split('\n')
for i, line in enumerate(lines):
print(f"[{i}]: {line}\n")
elif tarinfo.isdir():
logger.info(f"Is dir: {tarinfo.name}")
else:
logger.info(f"Is something else: {tarinfo.name}.")
You cannot "read" the content of some special files such as links yet tar supports them and tarfile will extract them alright. When tarfile extracts them, it does not return a file-like object but None. And you get an error because your tarball contains such a special file.
One approach is to determine the type of an entry in the tarball you are processing ahead of extracting it: with this information at hand you can decide whether or not you can "read" the file. You can achieve this by calling TarFile.getmembers(), which returns tarfile.TarInfo objects containing detailed information about the type of each file in the tarball.
The tarfile.TarInfo class has all the attributes and methods you need to determine the type of a tar member, such as isfile(), isdir(), islnk() or issym(), so you can decide accordingly what to do with each member (extract or not, etc).
For instance I use these to test the type of file in this patched tarfile to skip extracting special files and process links in a special way:
for tinfo in tar.getmembers():
    is_special = not (tinfo.isfile() or tinfo.isdir()
                      or tinfo.islnk() or tinfo.issym())
    ...
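A fuller sketch of that filtering idea (the names below are illustrative, not the patched tarfile mentioned above):
import tarfile

with tarfile.open("filename.tar.gz", "r:gz") as tar:
    for tinfo in tar.getmembers():
        if tinfo.isfile():
            content = tar.extractfile(tinfo).read()   # regular file: safe to read
        elif tinfo.islnk() or tinfo.issym():
            print(f"link {tinfo.name} -> {tinfo.linkname}")
        else:
            # directories, devices, FIFOs: nothing to read
            continue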
In a Jupyter notebook you can do it like below:
!wget -c http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -O - | tar -xz
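The same download-and-extract can be done without the shell by streaming the HTTP response straight into tarfile; a minimal sketch using only the standard library:
import tarfile
import urllib.request

url = "http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz"
with urllib.request.urlopen(url) as resp:
    # 'r|gz' treats the response as a forward-only, gzip-compressed tar stream
    with tarfile.open(fileobj=resp, mode="r|gz") as tar:
        tar.extractall()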
