Download and extract a tar file in Python in chunks

I am trying to use pycurl to download a tgz file and extract it using tarfile, but without storing the tgz file on disk and without holding the whole tgz file in memory. I would like to download it and extract it in chunks, as a stream.
I know how to get pycurl callback which gives me data every time a new chunk of data is downloaded:
def write(data):
    # Give data to tarfile to extract.
    ...

with contextlib.closing(pycurl.Curl()) as curl:
    curl.setopt(curl.URL, tar_uri)
    curl.setopt(curl.WRITEFUNCTION, write)
    curl.setopt(curl.FOLLOWLOCATION, True)
    curl.perform()
I also know how to open tarfile in streaming mode:
output_tar = tarfile.open(mode='r|gz', fileobj=fileobj)
But I do not know how to connect these two things together, so that every time I get a chunk over the wire, the next chunk of the tar file is extracted.

To be honest, unless you're really looking for a pure-Python solution (which is possible, just rather tedious), I would suggest just shelling out to /usr/bin/tar and feeding it data in chunks.
Something like
import subprocess

p = subprocess.Popen(
    ['/usr/bin/tar', '-xzf', '-', '-C', '/my/output/directory'],
    stdin=subprocess.PIPE)

def write(data):
    p.stdin.write(data)

with ...:
    curl.perform()

p.stdin.close()
p.wait()
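For reference, the pure-Python route alluded to above could look roughly like this (a sketch, not from the original answer): hand the curl callback the write end of a pipe and let tarfile read the other end from a worker thread.
import contextlib
import os
import tarfile
import threading

import pycurl

def download_and_extract(tar_uri, dest='.'):
    read_fd, write_fd = os.pipe()

    def extract():
        # tarfile streams the archive from the pipe's read end
        with os.fdopen(read_fd, 'rb') as fileobj:
            with tarfile.open(mode='r|gz', fileobj=fileobj) as tar:
                tar.extractall(dest)

    worker = threading.Thread(target=extract)
    worker.start()
    try:
        # closing the write end at the end of the block signals EOF to tarfile
        with os.fdopen(write_fd, 'wb') as sink:
            with contextlib.closing(pycurl.Curl()) as curl:
                curl.setopt(curl.URL, tar_uri)
                curl.setopt(curl.WRITEFUNCTION, sink.write)
                curl.setopt(curl.FOLLOWLOCATION, True)
                curl.perform()
    finally:
        worker.join()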

A Python-only solution could look like this:
import contextlib
import tarfile
from http.client import HTTPSConnection

def https_download_tar(host, path, item_visitor, port=443, headers=dict({}), compression='gz'):
    """Download and unpack tar file on-the-fly and call item_visitor for each entry.

    item_visitor will receive the arguments TarFile (the currently extracted stream)
    and the current TarInfo object.
    """
    with contextlib.closing(HTTPSConnection(host=host, port=port)) as client:
        client.request('GET', path, headers=headers)
        with client.getresponse() as response:
            code = response.getcode()
            if code < 200 or code >= 300:
                raise Exception(f'HTTP error downloading tar: code: {code}')
            try:
                with tarfile.open(fileobj=response, mode=f'r|{compression}') as tar:
                    for tarinfo in tar:
                        item_visitor(tar, tarinfo)
            except Exception as e:
                raise Exception(f'Failed to extract tar stream: {e}')

# Test the download function using some popular archive
def list_entry(tar, tarinfo):
    print(f'{tarinfo.name}\t{"DIR" if tarinfo.isdir() else "FILE"}\t{tarinfo.size}\t{tarinfo.mtime}')

https_download_tar('dl.discordapp.net', '/apps/linux/0.0.15/discord-0.0.15.tar.gz', list_entry)
The HTTPSConnection provides a file-like response stream, which is then passed to tarfile.open().
One can then iterate over the items in the tar archive and, for example, extract them using TarFile.extractfile().
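For instance, a visitor along these lines (a sketch, assuming the file contents should simply be read into memory) could pull each regular file out of the stream:
def read_entry(tar, tarinfo):
    # Only regular files can be extracted from a streamed archive
    if tarinfo.isfile():
        member = tar.extractfile(tarinfo)  # file-like object, or None
        if member is not None:
            data = member.read()
            print(f'{tarinfo.name}: {len(data)} bytes')

https_download_tar('dl.discordapp.net', '/apps/linux/0.0.15/discord-0.0.15.tar.gz', read_entry)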

Related

Python: Stream gzip files from s3

I have files in S3 as gzip chunks, so I have to read the data sequentially and can't read arbitrary ones; I always have to start with the first file.
For example, let's say I have 3 gzip files in S3: f1.gz, f2.gz, f3.gz. If I download them all locally, I can do cat * | gzip -d. If I do cat f2.gz | gzip -d, it will fail with gzip: stdin: not in gzip format.
How can I stream this data from S3 using Python? I saw smart-open and it has the ability to decompress gz files with
from smart_open import smart_open, open

with open(path, compression='.gz') as f:
    for line in f:
        print(line.strip())
Where path is the path for f1.gz. This works until it hits the end of the file, where it aborts. The same thing happens locally: if I do cat f1.gz | gzip -d, it errors with gzip: stdin: unexpected end of file when it hits the end.
Is there a way to make it stream the files continuously using Python?
This one will not abort, and can iterate through f1.gz, f2.gz and f3.gz:
with open(path, 'rb', compression='disable') as f:
    for line in f:
        print(line.strip(), end="")
but the output is just bytes. I was thinking it would work to run python test.py | gzip -d with the above code, but I get the error gzip: stdin: not in gzip format. Is there a way to have Python print, using smart-open, output that gzip can read?
For example, let's say I have 3 gzip files in S3: f1.gz, f2.gz, f3.gz. If I download them all locally, I can do cat * | gzip -d.
One idea would be to make a file object to implement this. The file object reads from one filehandle, exhausts it, reads from the next one, exhausts it, etc. This is similar to how cat works internally.
The handy thing about this is that it does the same thing as concatenating all of your files, without the memory use of reading in all of your files at the same time.
Once you have the combined file object wrapper, you can pass it to Python's gzip module to decompress the file.
Examples:
import gzip

class ConcatFileWrapper:
    def __init__(self, files):
        self.files = iter(files)
        self.current_file = next(self.files)

    def read(self, *args):
        ret = self.current_file.read(*args)
        if len(ret) == 0:
            # EOF
            # Optional: close self.current_file here
            # self.current_file.close()
            # Advance to next file and try again
            try:
                self.current_file = next(self.files)
            except StopIteration:
                # Out of files
                # Return an empty string
                return ret
            # Recurse and try again
            return self.read(*args)
        return ret

    def write(self):
        raise NotImplementedError()

filenames = ["xaa", "xab", "xac", "xad"]
filehandles = [open(f, "rb") for f in filenames]
wrapper = ConcatFileWrapper(filehandles)
with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)
# Close all files
[f.close() for f in filehandles]
Here's how I tested this:
I created a file to test this through the following commands.
Create a file with the contents 1 thru 1000.
$ seq 1 1000 > foo
Compress it.
$ gzip foo
Split the file. This produces four files named xaa-xad.
$ split -b 500 foo.gz
Run the above Python file on it, and it should print out 1 - 1000.
Edit: extra remark about lazy-opening the files
If you have a huge number of files, you might want to open only one file at a time. Here's an example:
def open_files(filenames):
    for filename in filenames:
        # Note: this will leak file handles unless you uncomment the code above that closes the file handles again.
        yield open(filename, "rb")
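Used together with the wrapper above, that could look like this (a minimal usage sketch):
wrapper = ConcatFileWrapper(open_files(["xaa", "xab", "xac", "xad"]))
with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)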

subprocess gunzip throws decompression failed

I am trying to gunzip using subprocess but it returns the error -
('Decompression failed %s', 'gzip: /tmp/tmp9OtVdr is a directory -- ignored\n')
What is wrong?
import subprocess

transform_script_process = subprocess.Popen(
    ['gunzip', f_temp.name, '-kf', temp_dir],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)
(transform_script_stdoutdata,
 transform_script_stderrdata) = transform_script_process.communicate()
self.log.info("Transform script stdout %s",
              transform_script_stdoutdata)
if transform_script_process.returncode > 0:
    shutil.rmtree(temp_dir)
    raise AirflowException("Decompression failed %s",
                           transform_script_stderrdata)
You are calling the gunzip process and passing it the following parameters:
f_temp.name
-kf
temp_dir
I'm assuming f_temp.name is the path to the gzipped file you are trying to unzip. -kf will force decompression and instruct gzip to keep the file after decompressing it.
Now comes the interesting part. temp_dir seems like a variable that would hold the destination directory you want to extract the files to. However, gunzip does not support this. Please have a look at the manual for gzip. It states that you must pass in a list of files to decompress. There is no option to specify the destination directory.
Have a look at this post on Superuser for more information on specifying the folder you want to extract to: https://superuser.com/questions/139419/how-do-i-gunzip-to-a-different-destination-directory
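If the goal is to end up with the decompressed file inside temp_dir, one common workaround (a sketch that reuses the question's f_temp.name and temp_dir variables, not part of the original answer) is to send gunzip's output to stdout and write it into the target directory yourself:
import os
import subprocess

# Assumes f_temp.name is the .gz file and temp_dir is the desired output directory
out_name = os.path.basename(f_temp.name)
if out_name.endswith('.gz'):
    out_name = out_name[:-3]
out_path = os.path.join(temp_dir, out_name)

with open(out_path, 'wb') as out:
    # gunzip -c writes the decompressed data to stdout and keeps the input file
    subprocess.check_call(['gunzip', '-c', f_temp.name], stdout=out)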

Python 3: How to keep dropbox.files_upload() alive, so as to download and upload at the same time?

I want to download a file from the internet and upload it to Dropbox at the same time. I am downloading the file in chunks, and after every completed chunk I want to upload it to Dropbox.
import multiprocessing as m
import requests as rr
import dropbox

url = 'http://www.onirikal.com/videos/mp4/battle_games.mp4'
db = dropbox.Dropbox(Accesstoken)

def d(url):
    r = rr.get(url, stream=True)
    with open('file.mp4', 'wb') as f:
        for a in r.iter_content(chunk_size=1000000):
            if a:
                f.write('')
                f.write(a)

def u():
    try:
        with open('file.mp4', 'rb') as ff:
            db.files_upload(ff.read(), '/file.mp4')
    except FileNotFoundError:
        pass

if __name__ == '__main__':
    p = m.Pool()
    re = p.apply_async(d, [url])
    ree = p.apply_async(u)
    re.get(timeout=10)
    ree.get(timeout=10)
But the uploaded file has a size of 0 bytes.
EDIT
I am using f.write('') to save space on the server, as I am only getting 512 MB of storage.
You should not use multiprocessing for this. Instead, simply download the chunks as you are already doing, and immediately call upload_chunk(a, len(a), offset, upload_id). You do not need the temporary file on disk.
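The upload_chunk call mentioned above comes from the older client API; with the v2 SDK (the dropbox.Dropbox object the question already uses), the same idea can be expressed with upload sessions. A sketch, with an arbitrary chunk size and a placeholder token:
import dropbox
import requests

ACCESS_TOKEN = '...'  # hypothetical placeholder for the real token
url = 'http://www.onirikal.com/videos/mp4/battle_games.mp4'

db = dropbox.Dropbox(ACCESS_TOKEN)
r = requests.get(url, stream=True)

# Stream the download into a Dropbox upload session, chunk by chunk
chunks = r.iter_content(chunk_size=4 * 1024 * 1024)  # chunk size is arbitrary
first = next(chunks)
session = db.files_upload_session_start(first)
cursor = dropbox.files.UploadSessionCursor(
    session_id=session.session_id, offset=len(first))
for chunk in chunks:
    db.files_upload_session_append_v2(chunk, cursor)
    cursor.offset += len(chunk)
db.files_upload_session_finish(
    b'', cursor, dropbox.files.CommitInfo(path='/file.mp4'))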
I think you could do something like this:
r = rr.get(url, stream=True)
db.files_upload(r.content, '/file.mp4')
to upload directly without creating a file. If you want to upload as a stream or in chunks, I'd imagine you would need to do something like the following:
from io import BytesIO

import requests
import dropbox

client = dropbox.client.DropboxClient(access_token)
r = requests.get(url, stream=True)
with BytesIO(r.content) as bytes_stream:
    uploader = client.get_chunked_uploader(bytes_stream, len(r.content))
    while uploader.offset < len(r.content):
        try:
            uploader.upload_chunked()
        except dropbox.rest.ErrorResponse as e:
            pass  # perform error handling and retry logic
    uploader.finish('/file.mp4')
I cannot verify either method as I cannot currently install the dropbox module, so you might need to tweak a few things.

Read tarfile in as bytes

I have a setup in AWS where I have a Python Lambda proxying an S3 bucket containing .tar.gz files. I need to return the .tar.gz file from the Python Lambda back through the API to the user.
I do not want to untar the file; I want to return the tarfile as is, and it seems the tarfile module does not support reading it in as bytes.
I have tried using Python's open method (which returns a codec error in utf-8), then codecs.open with errors set to both ignore and replace, which leads to the resulting file not being recognized as .tar.gz.
Implementation (tar binary unpackaging)
try:
    data = client.get_object(Bucket=bucket, Key=key)
    headers['Content-Type'] = data['ContentType']
    if key.endswith('.tar.gz'):
        with open('/tmp/tmpfile', 'wb') as wbf:
            bucketobj.download_fileobj(key, wbf)
        with codecs.open('/tmp/tmpfile', "rb", encoding='utf-8', errors='ignore') as fdata:
            body = fdata.read()
        headers['Content-Disposition'] = 'attachment; filename="{}"'.format(key.split('/')[-1])
Usage (package/aws information redacted for security)
$ wget -v https://<apigfqdn>/release/simple/<package>/<package>-1.0.4.tar.gz
$ tar -xzf <package>-1.0.4.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
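A sketch of one way this is typically handled (not from the original post, reusing the question's client, bucket, key and headers): read the object as raw bytes instead of decoding it as text, and base64-encode the payload if the API layer requires a string.
import base64

data = client.get_object(Bucket=bucket, Key=key)
body_bytes = data['Body'].read()  # raw bytes; no text decoding
headers['Content-Type'] = data['ContentType']
headers['Content-Disposition'] = 'attachment; filename="{}"'.format(key.split('/')[-1])
# If the response must be a string (e.g. an API Gateway proxy response),
# base64-encode it rather than decoding it as UTF-8
body = base64.b64encode(body_bytes).decode('ascii')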

Download csv file using python 3

I am new to Python. Here is my environment setup:
I have Anaconda 3 (Python 3). I would like to be able to download a CSV file from the website:
https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD
I would like to use the requests library. I would appreciate any help in figuring out how I can use the requests library to download the CSV file to a local directory on my machine.
It is recommended to download the data as a stream and flush it into the target (or an intermediate) local file.
import requests

def download_file(url, output_file, compressed=True):
    """
    compressed: enable response compression support
    """
    # NOTE the stream=True parameter. It enables more optimized, buffered data loading.
    headers = {}
    if compressed:
        headers["Accept-Encoding"] = "gzip"
    r = requests.get(url, headers=headers, stream=True)
    with open(output_file, 'wb') as f:  # open for block writing
        for chunk in r.iter_content(chunk_size=4096):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
        f.flush()  # finally, force a data flush into the output file (optional)
    return output_file
Considering the original post:
remote_csv = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
local_output_file = "test.csv"
download_file(remote_csv, local_output_file)
#Check file content, just for test purposes:
print(open(local_output_file).read())
Base code was extracted from this post: https://stackoverflow.com/a/16696317/176765
Here you can find more detailed information about body-stream usage with the requests lib:
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
