subprocess gunzip throws decompression failed - python

I am trying to gunzip a file using subprocess, but it returns this error:
('Decompression failed %s', 'gzip: /tmp/tmp9OtVdr is a directory -- ignored\n')
What is wrong?
import subprocess

transform_script_process = subprocess.Popen(
    ['gunzip', f_temp.name, '-kf', temp_dir],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)
(transform_script_stdoutdata,
 transform_script_stderrdata) = transform_script_process.communicate()
self.log.info("Transform script stdout %s",
              transform_script_stdoutdata)
if transform_script_process.returncode > 0:
    shutil.rmtree(temp_dir)
    raise AirflowException("Decompression failed %s",
                           transform_script_stderrdata)

You are calling the gunzip process and passing it the following parameters:
f_temp.name
-kf
temp_dir
I'm assuming f_temp.name is the path to the gzipped file you are trying to unzip. -k tells gzip to keep the input file after decompressing it, and -f forces decompression even if an output file already exists.
Now comes the interesting part. temp_dir seems like a variable that holds the destination directory you want to extract the files to. However, gunzip does not support this. Have a look at the manual for gzip: it states that you pass in a list of files to decompress, and there is no option to specify a destination directory. That is why gzip complains that /tmp/tmp9OtVdr "is a directory" and ignores it.
Have a look at this post on Superuser for more information on specifying the folder you want to extract to: https://superuser.com/questions/139419/how-do-i-gunzip-to-a-different-destination-directory
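If you want to stay in Python instead, here is a minimal sketch (assuming f_temp.name is the gzipped file and temp_dir is the destination directory) that sidesteps gunzip entirely by using the standard gzip module:
import gzip
import os
import shutil

# The output filename here is an assumption; choose whatever name you need.
out_path = os.path.join(temp_dir, os.path.basename(f_temp.name) + '.out')
with gzip.open(f_temp.name, 'rb') as src, open(out_path, 'wb') as dst:
    shutil.copyfileobj(src, dst)  # streams in chunks, no full read into memory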

Related

How do you use a python variable in popen()?

I'm trying to record docker stats for every file in the mydata directory. For example, if one of the files is named piano.txt, I would like the output file to be piano_stuff.txt. This is what I have so far:
import subprocess
import signal
import os

for file_name in os.listdir('mydata'):
    data_txt = "./" + file_name.split(".")[0] + "_stuff.txt"
    dockerStats = subprocess.Popen("docker stats --format {{.MemUsage}} >> ${data_txt}", shell=True)
    os.killpg(os.getpgid(dockerStats.pid), signal.SIGTERM)
Don't use shell=True. (As written, the shell never sees your variable anyway: Python does not interpolate ${data_txt} into the string, and the shell would only expand it if it were an environment variable.) Open the file locally, and pass the file object as the stdout argument. You can also use the --no-stream option to have the command exit after producing one line of output, rather than asynchronously trying to kill the process as soon as possible. (You might get multiple lines of output, or you might get none, depending on when the OS schedules the Docker process to run.)
with open(data_txt, "a") as f:
    subprocess.run(["docker", "stats", "--format", "{{.MemUsage}}", "--no-stream"], stdout=f)

Download and extract a tar file in Python in chunks

I am trying to use pycurl to download a tgz file and extract it using tarfile, without storing the tgz file on disk and without holding the whole tgz file in memory. I would like to download and extract it in chunks, streaming.
I know how to get pycurl callback which gives me data every time a new chunk of data is downloaded:
def write(data):
    # Give data to tarfile to extract.
    ...

with contextlib.closing(pycurl.Curl()) as curl:
    curl.setopt(curl.URL, tar_uri)
    curl.setopt(curl.WRITEFUNCTION, write)
    curl.setopt(curl.FOLLOWLOCATION, True)
    curl.perform()
I also know how to open tarfile in streaming mode:
output_tar = tarfile.open(mode='r|gz', fileobj=fileobj)
But I do not know how to connect these two things together, so that every time I get a chunk over the wire, the next chunk of the tar file is extracted.
To be honest, unless you're really looking for a pure-Python solution (which is possible, just rather tedious), I would suggest just shelling out to /usr/bin/tar and feeding it data in chunks.
Something like:
import subprocess

# 'xzf -' makes tar read the archive explicitly from stdin
p = subprocess.Popen(['/usr/bin/tar', 'xzf', '-', '-C', '/my/output/directory'],
                     stdin=subprocess.PIPE)

def write(data):
    p.stdin.write(data)

with ...:
    curl.perform()

p.stdin.close()  # closing stdin is what tells tar the archive is complete
p.wait()
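For completeness, a fuller sketch of the same wiring (tar_uri comes from the question; the output directory is a placeholder), passing p.stdin.write directly as the pycurl write callback:
import contextlib
import subprocess
import pycurl

p = subprocess.Popen(['tar', 'xzf', '-', '-C', '/my/output/directory'],
                     stdin=subprocess.PIPE)
with contextlib.closing(pycurl.Curl()) as curl:
    curl.setopt(curl.URL, tar_uri)
    # pycurl accepts any callable that returns the number of bytes consumed,
    # which is exactly what a file object's write() does.
    curl.setopt(curl.WRITEFUNCTION, p.stdin.write)
    curl.setopt(curl.FOLLOWLOCATION, True)
    curl.perform()
p.stdin.close()  # EOF tells tar the archive is complete
p.wait()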
A Python-only solution could look like this:
import contextlib
import tarfile
from http.client import HTTPSConnection

def https_download_tar(host, path, item_visitor, port=443, headers=dict({}), compression='gz'):
    """Download and unpack tar file on-the-fly and call item_visitor for each entry.

    item_visitor will receive the arguments TarFile (the currently extracted stream)
    and the current TarInfo object
    """
    with contextlib.closing(HTTPSConnection(host=host, port=port)) as client:
        client.request('GET', path, headers=headers)
        with client.getresponse() as response:
            code = response.getcode()
            if code < 200 or code >= 300:
                raise Exception(f'HTTP error downloading tar: code: {code}')
            try:
                with tarfile.open(fileobj=response, mode=f'r|{compression}') as tar:
                    for tarinfo in tar:
                        item_visitor(tar, tarinfo)
            except Exception as e:
                raise Exception(f'Failed to extract tar stream: {e}')

# Test the download function using some popular archive
def list_entry(tar, tarinfo):
    print(f'{tarinfo.name}\t{"DIR" if tarinfo.isdir() else "FILE"}\t{tarinfo.size}\t{tarinfo.mtime}')

https_download_tar('dl.discordapp.net', '/apps/linux/0.0.15/discord-0.0.15.tar.gz', list_entry)
The HTTPSConnection is used to provide a response stream (file-like) which is then passed to tarfile.open().
One can then iterate over the items in the TAR file and for example extract them using TarFile.extractfile().
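For example, a hypothetical item_visitor that writes each regular file beneath a local out/ directory (the directory name is an assumption, and tarinfo.name is used unsanitised, so only do this with trusted archives):
import os
import shutil

def extract_entry(tar, tarinfo):
    if not tarinfo.isfile():
        return
    dest = os.path.join('out', tarinfo.name)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # extractfile() yields a file-like object for the current stream entry
    with tar.extractfile(tarinfo) as src, open(dest, 'wb') as dst:
        shutil.copyfileobj(src, dst)

https_download_tar('dl.discordapp.net', '/apps/linux/0.0.15/discord-0.0.15.tar.gz', extract_entry)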

Read tarfile in as bytes

I have a setup in AWS where I have a Python lambda proxying an S3 bucket containing .tar.gz files. I need to return the .tar.gz file from the Python lambda back through the API to the user.
I do not want to untar the file; I want to return the tarfile as-is, and it seems the tarfile module does not support reading it in as bytes.
I have tried using Python's open function (which raises a codec error in UTF-8), then codecs.open with errors set to both ignore and replace, which leads to the resulting file not being recognized as .tar.gz.
Implementation (tar binary unpackaging)
try:
    data = client.get_object(Bucket=bucket, Key=key)
    headers['Content-Type'] = data['ContentType']
    if key.endswith('.tar.gz'):
        with open('/tmp/tmpfile', 'wb') as wbf:
            bucketobj.download_fileobj(key, wbf)
        with codecs.open('/tmp/tmpfile', "rb", encoding='utf-8', errors='ignore') as fdata:
            body = fdata.read()
        headers['Content-Disposition'] = 'attachment; filename="{}"'.format(key.split('/')[-1])
Usage (package/aws information redacted for security)
$ wget -v https://<apigfqdn>/release/simple/<package>/<package>-1.0.4.tar.gz
$ tar -xzf <package>-1.0.4.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
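The codec round-trip is what corrupts the archive: decoding gzip bytes as UTF-8 with errors='ignore' silently drops bytes. A minimal sketch of the usual fix (names taken from the snippet above), reading the object in binary mode with no text codec involved:
with open('/tmp/tmpfile', 'rb') as fdata:
    body = fdata.read()  # raw bytes; the gzip payload stays intact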

Python checking integrity of gzip archive

Is there a way in Python using gzip or other module to check the integrity of the gzip archive?
Basically is there equivalent in Python to what the following does:
gunzip -t my_archive.gz
Oops, my first answer (now deleted) was the result of misreading the question.
I'd suggest using the gzip module to read the file and just throw away what you read. You have to decode the entire file in order to check its integrity in any case. https://docs.python.org/2/library/gzip.html
Something like (untested code):
import gzip

chunksize = 10000000  # 10 MB
ok = True
with gzip.open('file.txt.gz', 'rb') as f:
    try:
        while f.read(chunksize) != b'':
            pass
    except Exception:
        ok = False
I don't know offhand which exception reading a corrupt gzip file will throw; in Python 3 it is typically OSError (gzip.BadGzipFile on 3.8+) or EOFError for a truncated file, so you might want to verify that and then catch only those particular ones.
You can use the subprocess or os module to execute this command and read the output, something like this:
Using the os module
import os
output = os.popen('gunzip -t my_archive.gz').read()
Using the subprocess module
import subprocess

proc = subprocess.Popen(["gunzip", "-t", "my_archive.gz"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
(out, err) = proc.communicate()
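On Python 3.7+, a slightly more direct sketch with subprocess.run; gunzip -t is silent on success and reports problems on stderr, so the exit code is the real integrity signal:
import subprocess

result = subprocess.run(["gunzip", "-t", "my_archive.gz"], capture_output=True)
ok = result.returncode == 0
if not ok:
    print(result.stderr.decode())  # gunzip's diagnostics go to stderr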

Python scp copy images from image_urls to server

I have written a function which receives a URL and copies the image to every server.
The servers' remote paths are stored in the DB.
def copy_image_to_server(image_url):
    server_list = ServerData.objects.values_list('remote_path', flat=True).filter(active=1)
    file = cStringIO.StringIO(urllib.urlopen(image_url).read())
    image_file = Image.open(file)
    image_file.seek(0)
    for remote_path in server_list:
        os.system("scp -i ~/.ssh/haptik %s %s " % (image_file, remote_path))
I am getting this error at the last line: cannot open PIL.JpegImagePlugin.JpegImageFile: No such file
Please suggest what's wrong with the code; I have checked that the URL is not broken.
The problem is that image_file is not a path (string), it's an object. Your os.system call is building up a string that expects a path.
You need to write the file to disk (perhaps using the tempfile module) before you can pass it to scp in this manner.
In fact, there's no need (at least for what you're doing in the code snippet) to convert it to a PIL Image object at all; you can just write it to disk once you've retrieved it, and then pass the path to scp to move it:
file = cStringIO.StringIO(urllib.urlopen(image_url).read())
diskfile = tempfile.NamedTemporaryFile(delete=False)
diskfile.write(file.getvalue())
path = diskfile.name
diskfile.close()

for remote_path in server_list:
    os.system("scp -i ~/.ssh/haptik %s %s " % (path, remote_path))
You should delete the file after you're done using it.
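For instance, once the loop has finished:
import os

os.remove(path)  # clean up the NamedTemporaryFile once all copies are done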
