Read tarfile in as bytes - python

I have a setup in AWS where a Python Lambda proxies an S3 bucket containing .tar.gz files. I need to return the .tar.gz file from the Lambda back through the API to the user.
I do not want to untar the file; I want to return the tarball as-is, and the tarfile module does not seem to support reading it in as bytes.
I have tried Python's built-in open() (which raises a UTF-8 codec error), and then codecs.open with errors set to both 'ignore' and 'replace', which leads to the resulting file not being recognized as .tar.gz.
Implementation (tar binary unpackaging)
try:
    data = client.get_object(Bucket=bucket, Key=key)
    headers['Content-Type'] = data['ContentType']
    if key.endswith('.tar.gz'):
        with open('/tmp/tmpfile', 'wb') as wbf:
            bucketobj.download_fileobj(key, wbf)
        with codecs.open('/tmp/tmpfile', "rb", encoding='utf-8', errors='ignore') as fdata:
            body = fdata.read()
        headers['Content-Disposition'] = 'attachment; filename="{}"'.format(key.split('/')[-1])
Usage (package/aws information redacted for security)
$ wget -v https://<apigfqdn>/release/simple/<package>/<package>-1.0.4.tar.gz
$ tar -xzf <package>-1.0.4.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
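The underlying issue is the UTF-8 round trip: gzip data is arbitrary binary, and decoding it with errors='ignore' silently drops bytes, so the re-encoded body is no longer a valid gzip stream. A minimal sketch demonstrating this (pure standard library, no AWS involved):

```python
import gzip

# Any gzip payload starts with the magic bytes 0x1f 0x8b
payload = gzip.compress(b"hello world")

# Decoding as UTF-8 with errors='ignore' drops invalid byte sequences
# (0x8b, among others), so the round trip corrupts the stream
mangled = payload.decode('utf-8', errors='ignore').encode('utf-8')
print(mangled != payload)  # the bytes no longer match

# Keeping the payload as bytes end to end preserves it
print(gzip.decompress(payload))
```

The fix is to read the S3 body as raw bytes (e.g. data['Body'].read()) and return those bytes, base64-encoded if the API gateway requires it, with no codecs involved.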

Related

Python: Stream gzip files from s3

I have files in S3 stored as gzip chunks, so I have to read the data sequentially and can't read random ones; I always have to start with the first file.
For example, let's say I have 3 gzip files in S3: f1.gz, f2.gz, f3.gz. If I download them all locally, I can do cat * | gzip -d. If I do cat f2.gz | gzip -d, it fails with gzip: stdin: not in gzip format.
How can I stream this data from S3 using Python? I saw smart-open, and it can decompress gz files with
from smart_open import smart_open, open
with open(path, compression='.gz') as f:
    for line in f:
        print(line.strip())
Where path is the path to f1.gz. This works until it hits the end of the file, where it aborts. The same thing happens locally: if I do cat f1.gz | gzip -d, it errors with gzip: stdin: unexpected end of file when it hits the end.
Is there a way to make it stream the files continuously using Python?
This one will not abort, and can iterate through f1.gz, f2.gz and f3.gz:
with open(path, 'rb', compression='disable') as f:
    for line in f:
        print(line.strip(), end="")
but the output is just bytes. I thought it would work by doing python test.py | gzip -d with the above code, but I get gzip: stdin: not in gzip format. Is there a way to have Python print output, using smart-open, that gzip can read?
For example, let's say I have 3 gzip files in S3: f1.gz, f2.gz, f3.gz. If I download them all locally, I can do cat * | gzip -d.
One idea would be to make a file object that implements this. The file object reads from one filehandle, exhausts it, reads from the next one, exhausts it, and so on. This is similar to how cat works internally.
The handy thing about this is that it does the same thing as concatenating all of your files, without the memory cost of reading them all in at the same time.
Once you have the combined file-object wrapper, you can pass it to Python's gzip module to decompress the stream.
Examples:
import gzip

class ConcatFileWrapper:
    def __init__(self, files):
        self.files = iter(files)
        self.current_file = next(self.files)

    def read(self, *args):
        ret = self.current_file.read(*args)
        if len(ret) == 0:
            # EOF
            # Optional: close self.current_file here
            # self.current_file.close()
            # Advance to next file and try again
            try:
                self.current_file = next(self.files)
            except StopIteration:
                # Out of files
                # Return the empty bytes string
                return ret
            # Recurse and try again
            return self.read(*args)
        return ret

    def write(self):
        raise NotImplementedError()

filenames = ["xaa", "xab", "xac", "xad"]
filehandles = [open(f, "rb") for f in filenames]
wrapper = ConcatFileWrapper(filehandles)
with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)
# Close all files
for f in filehandles:
    f.close()
Here's how I tested this:
I created a test file through the following commands.
Create a file with the contents 1 through 1000.
$ seq 1 1000 > foo
Compress it.
$ gzip foo
Split the file. This produces four files named xaa-xad.
$ split -b 500 foo.gz
Run the above Python file on it, and it should print out 1 - 1000.
Edit: extra remark about lazy-opening the files
If you have a huge number of files, you might want to open only one file at a time. Here's an example:
def open_files(filenames):
    for filename in filenames:
        # Note: this will leak file handles unless you uncomment the code above that closes the file handles again.
        yield open(filename, "rb")
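If you'd rather verify this without creating the xaa-xad files on disk, here is a self-contained variant of the same test that splits an in-memory gzip stream into 500-byte chunks (the wrapper is trimmed to the read() method, which is all gzip needs):

```python
import gzip
import io

class ConcatFileWrapper:
    """Read-only wrapper presenting several file objects as one stream."""
    def __init__(self, files):
        self.files = iter(files)
        self.current_file = next(self.files)

    def read(self, *args):
        ret = self.current_file.read(*args)
        if len(ret) == 0:
            # Current chunk is exhausted: advance to the next one, if any
            try:
                self.current_file = next(self.files)
            except StopIteration:
                return ret
            return self.read(*args)
        return ret

# Equivalent of: seq 1 1000 | gzip | split -b 500
data = "".join(f"{i}\n" for i in range(1, 1001)).encode()
blob = gzip.compress(data)
chunks = [io.BytesIO(blob[i:i + 500]) for i in range(0, len(blob), 500)]

with gzip.open(ConcatFileWrapper(chunks)) as gf:
    numbers = [int(line) for line in gf]
print(numbers[0], numbers[-1])  # 1 1000
```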

subprocess gunzip throws decompression failed

I am trying to gunzip using subprocess but it returns the error -
('Decompression failed %s', 'gzip: /tmp/tmp9OtVdr is a directory -- ignored\n')
What is wrong?
import subprocess

transform_script_process = subprocess.Popen(
    ['gunzip', f_temp.name, '-kf', temp_dir],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)
(transform_script_stdoutdata,
 transform_script_stderrdata) = transform_script_process.communicate()
self.log.info("Transform script stdout %s",
              transform_script_stdoutdata)
if transform_script_process.returncode > 0:
    shutil.rmtree(temp_dir)
    raise AirflowException("Decompression failed %s",
                          transform_script_stderrdata)
You are calling the gunzip process and passing it the following parameters:
f_temp.name
-kf
temp_dir
I'm assuming f_temp.name is the path to the gzipped file you are trying to unzip. -kf forces decompression and instructs gzip to keep the file after decompressing it.
Now comes the interesting part: temp_dir looks like a variable holding the destination directory you want to extract the files to. However, gunzip does not support this. Have a look at the manual for gzip: it states that you pass in a list of files to decompress, and there is no option to specify a destination directory. That is why gzip reports that temp_dir "is a directory" and ignores it.
Have a look at this post on Superuser for more information on specifying the folder you want to extract to: https://superuser.com/questions/139419/how-do-i-gunzip-to-a-different-destination-directory
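If all you need is gzip decompression from Python, a pure-stdlib alternative sidesteps the subprocess entirely and lets you pick the destination directory yourself (a sketch; the function name and layout are illustrative):

```python
import gzip
import os
import shutil

def gunzip_to(src_path, dest_dir):
    """Decompress src_path (a .gz file) into dest_dir, keeping the original."""
    name = os.path.basename(src_path)
    if name.endswith('.gz'):
        name = name[:-3]  # strip the .gz suffix for the output name
    dest = os.path.join(dest_dir, name)
    with gzip.open(src_path, 'rb') as fin, open(dest, 'wb') as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, low memory use
    return dest
```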

Download and extract a tar file in Python in chunks

I am trying to use pycurl to download a tgz file and extract it using tarfile, without storing the tgz file on disk and without holding the whole tgz file in memory. I would like to download and extract it in chunks, streaming.
I know how to get a pycurl callback which gives me data every time a new chunk is downloaded:
def write(data):
    # Give data to tarfile to extract.
    ...

with contextlib.closing(pycurl.Curl()) as curl:
    curl.setopt(curl.URL, tar_uri)
    curl.setopt(curl.WRITEFUNCTION, write)
    curl.setopt(curl.FOLLOWLOCATION, True)
    curl.perform()
I also know how to open tarfile in streaming mode:
output_tar = tarfile.open(mode='r|gz', fileobj=fileobj)
But I do not know how to connect these two things together, so that every time I get a chunk over the wire, the next chunk of the tar file is extracted.
To be honest, unless you're really looking for a pure-Python solution (which is possible, just rather tedious), I would suggest just shelling out to /usr/bin/tar and feeding it data in chunks.
Something like
import subprocess

p = subprocess.Popen(['/usr/bin/tar', '-xzf', '-', '-C', '/my/output/directory'],
                     stdin=subprocess.PIPE)

def write(data):
    p.stdin.write(data)

with ...:
    curl.perform()

p.stdin.close()  # signal EOF so tar finishes extracting
p.wait()
A Python only solution could look like this:
import contextlib
import tarfile
from http.client import HTTPSConnection

def https_download_tar(host, path, item_visitor, port=443, headers={}, compression='gz'):
    """Download and unpack a tar file on-the-fly and call item_visitor for each entry.

    item_visitor will receive the arguments TarFile (the currently extracted stream)
    and the current TarInfo object.
    """
    with contextlib.closing(HTTPSConnection(host=host, port=port)) as client:
        client.request('GET', path, headers=headers)
        with client.getresponse() as response:
            code = response.getcode()
            if code < 200 or code >= 300:
                raise Exception(f'HTTP error downloading tar: code: {code}')
            try:
                with tarfile.open(fileobj=response, mode=f'r|{compression}') as tar:
                    for tarinfo in tar:
                        item_visitor(tar, tarinfo)
            except Exception as e:
                raise Exception(f'Failed to extract tar stream: {e}')

# Test the download function using some popular archive
def list_entry(tar, tarinfo):
    print(f'{tarinfo.name}\t{"DIR" if tarinfo.isdir() else "FILE"}\t{tarinfo.size}\t{tarinfo.mtime}')

https_download_tar('dl.discordapp.net', '/apps/linux/0.0.15/discord-0.0.15.tar.gz', list_entry)
The HTTPSConnection is used to provide a response stream (file-like) which is then passed to tarfile.open().
One can then iterate over the items in the TAR file and for example extract them using TarFile.extractfile().
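For example, a small visitor that writes each regular file to disk (a hypothetical helper; the same callable could be passed as item_visitor above):

```python
import os
import tarfile

def extract_regular(tar, tarinfo, dest='out'):
    """Write regular-file members under dest, skipping dirs/links/devices."""
    if not tarinfo.isreg():
        return
    os.makedirs(dest, exist_ok=True)
    member = tar.extractfile(tarinfo)
    target = os.path.join(dest, os.path.basename(tarinfo.name))
    with open(target, 'wb') as f:
        f.write(member.read())
```

Note that with a streaming mode ('r|gz') each member must be consumed while it is the current entry, which is exactly the contract the visitor pattern above provides.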

Read .tar.gz file in Python

I have a text file of 25 GB, so I compressed it to tar.gz and it became 450 MB. Now I want to read that file from Python and process the text data. For this I referred to a related question, but in my case the code doesn't work. The code is as follows:
import tarfile
import numpy as np

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    content = f.read()
    Data = np.loadtxt(content)
The error is as follows:
Traceback (most recent call last):
File "dataExtPlot.py", line 21, in <module>
content = f.read()
AttributeError: 'NoneType' object has no attribute 'read'
Also, is there any other method to do this task?
The docs tell us that None is returned by extractfile() if the member is not a regular file or a link.
One possible solution is to skip over the None results:
tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f is not None:
        content = f.read()
tarfile.extractfile() can return None if the member is neither a file nor a link. For example, your tar archive might contain directories or device files. To fix:
import tarfile
import numpy as np

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f:
        Data = np.loadtxt(f)  # loadtxt accepts the file-like object directly
You may try this one:
t = tarfile.open("filename.gz", "r")
for filename in t.getnames():
    try:
        f = t.extractfile(filename)
        Data = f.read()
        print(filename, ':', Data)
    except Exception:
        print('ERROR: Did not find %s in tar archive' % filename)
My needs:
Python 3.
My tar.gz file consists of multiple UTF-8 text files and dirs.
Need to read text lines from all files.
Problems:
The file object returned by tar.extractfile() may be None (e.g. for directory members).
The content extractfile(fname) returns is a bytes string (e.g. b'Hello\t\xe4\xbd\xa0\xe5\xa5\xbd'); Unicode chars don't display correctly.
Solutions:
Check the type of each tar member first. I reference the example in the docs of the tarfile lib. (Search "How to read a gzip compressed tar archive and display some member information".)
Decode from bytes to a normal str. (ref - most voted answer)
Code:
with tarfile.open("sample.tar.gz", "r:gz") as tar:
    for tarinfo in tar:
        logger.info(f"{tarinfo.name} is {tarinfo.size} bytes in size and is: ")
        if tarinfo.isreg():
            logger.info(f"Is regular file: {tarinfo.name}")
            f = tar.extractfile(tarinfo.name)
            # To get a str instead of bytes,
            # decode with the proper codec, e.g. utf-8
            content = f.read().decode('utf-8', errors='ignore')
            # Split the long str into lines
            # Specify your line separator, e.g. \n
            lines = content.split('\n')
            for i, line in enumerate(lines):
                print(f"[{i}]: {line}\n")
        elif tarinfo.isdir():
            logger.info(f"Is dir: {tarinfo.name}")
        else:
            logger.info(f"Is something else: {tarinfo.name}.")
You cannot "read" the content of some special files, such as links, yet tar supports them and tarfile will extract them alright. When tarfile extracts them, it does not return a file-like object but None, and you get an error because your tarball contains such a special file.
One approach is to determine the type of each entry in the tarball ahead of extracting it: with this information at hand you can decide whether or not you can "read" the file. You can achieve this by calling TarFile.getmembers(), which returns TarInfo objects containing detailed information about the type of each file in the tarball.
The TarInfo class has all the attributes and methods you need to determine the type of a tar member, such as isfile(), isdir(), islnk() or issym(), so you can decide accordingly what to do with each member (extract or not, etc.).
For instance, I use these to test the type of file in this patched tarfile to skip extracting special files and process links in a special way:
for tinfo in tar.getmembers():
    is_special = not (tinfo.isfile() or tinfo.isdir()
                      or tinfo.islnk() or tinfo.issym())
    ...
In a Jupyter notebook you can do it like below:
!wget -c http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -O - | tar -xz

Python: tarfile stream

I would like to read some files from a tarball and save it to a new tarball.
This is the code I wrote.
archive = 'dum/2164/archive.tar'
# Read input data.
input_tar = tarfile.open(archive, 'r|')
tarinfo = input_tar.next()
input_tar.close()
# Write output file.
output_tar = tarfile.open('foo.tar', 'w|')
output_tar.addfile(tarinfo)
output_tar.close()
Unfortunately, the output tarball is no good:
$ tar tf foo.tar
./1QZP_A--2JED_A--not_reformatted.dat.bz2
tar: Truncated input file (needed 1548288 bytes, only 1545728 available)
tar: Error exit delayed from previous errors.
Any clue how to read and write tarballs on the fly with Python?
OK so this is how I managed to do it.
archive = 'dum/2164/archive.tar'
# Read input data.
input_tar = tarfile.open(archive, 'r|')
tarinfo = input_tar.next()
fileobj = input_tar.extractfile(tarinfo)
# Write output file.
output_tar = tarfile.open('foo.tar', 'w|')
output_tar.addfile(tarinfo, fileobj)
input_tar.close()
output_tar.close()
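The same pattern can be verified end-to-end in memory (a self-contained sketch with a made-up member name; addfile() needs the data along with the header, otherwise the entry is truncated as in the question):

```python
import io
import tarfile

# Build a small source tarball in memory
payload = b'hello' * 1000
src_buf = io.BytesIO()
with tarfile.open(fileobj=src_buf, mode='w') as t:
    info = tarfile.TarInfo('member.dat')
    info.size = len(payload)
    t.addfile(info, io.BytesIO(payload))
src_buf.seek(0)

# Stream-copy the first member: pass both the TarInfo and its file object
dst_buf = io.BytesIO()
input_tar = tarfile.open(fileobj=src_buf, mode='r|')
output_tar = tarfile.open(fileobj=dst_buf, mode='w|')
tarinfo = input_tar.next()
output_tar.addfile(tarinfo, input_tar.extractfile(tarinfo))
input_tar.close()
output_tar.close()

# The copy now contains the member's bytes, not a truncated entry
dst_buf.seek(0)
with tarfile.open(fileobj=dst_buf) as check:
    print(check.extractfile('member.dat').read() == payload)  # True
```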
