I have files in S3 stored as gzip chunks, so I have to read the data sequentially and can't read an arbitrary chunk on its own; I always have to start with the first file.
For example, let's say I have 3 gzip files in S3: f1.gz, f2.gz, f3.gz. If I download them all locally, I can do cat * | gzip -d. If I do cat f2.gz | gzip -d, it fails with gzip: stdin: not in gzip format.
How can I stream these data from s3 using python? I saw smart-open and it has the ability to decompress gz files with
from smart_open import smart_open, open
with open(path, compression='.gz') as f:
    for line in f:
        print(line.strip())
where path is the path to f1.gz. This works until it hits the end of the file, where it aborts. The same thing happens locally: if I do cat f1.gz | gzip -d, it errors with gzip: stdin: unexpected end of file when it hits the end.
Is there a way to make it stream the files continuously using python?
This one will not abort, and can iterate through f1.gz, f2.gz and f3.gz
with open(path, 'rb', compression='disable') as f:
    for line in f:
        print(line.strip(), end="")
but the output is just raw bytes. I thought it would work to run python test.py | gzip -d with the above code, but I get the error gzip: stdin: not in gzip format. Is there a way to have Python print the smart-open output so that gzip can read it?
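One thing worth noting (not from the original post): print() writes the repr of each bytes object plus a newline, which corrupts the stream and is why gzip rejects it. Writing the raw chunks to sys.stdout.buffer should give gzip -d the same byte stream as cat f1.gz f2.gz f3.gz, assuming smart_open behaves as in the snippets above; the bucket paths below are hypothetical.
import sys
from smart_open import open  # same import as above

paths = ["s3://bucket/f1.gz", "s3://bucket/f2.gz", "s3://bucket/f3.gz"]  # hypothetical paths
for path in paths:
    with open(path, 'rb', compression='disable') as f:
        while True:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            sys.stdout.buffer.write(chunk)  # raw bytes, not print()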
One idea would be to make a file object to implement this. The file object reads from one filehandle, exhausts it, reads from the next one, exhausts it, etc. This is similar to how cat works internally.
The handy thing about this is that it does the same thing as concatenating all of your files, without the memory use of reading in all of your files at the same time.
Once you have the combined file object wrapper, you can pass it to Python's gzip module to decompress the file.
Examples:
import gzip
class ConcatFileWrapper:
    def __init__(self, files):
        self.files = iter(files)
        self.current_file = next(self.files)

    def read(self, *args):
        ret = self.current_file.read(*args)
        if len(ret) == 0:
            # EOF
            # Optional: close self.current_file here
            # self.current_file.close()
            # Advance to the next file and try again
            try:
                self.current_file = next(self.files)
            except StopIteration:
                # Out of files
                # Return the empty bytes object to signal EOF
                return ret
            # Recurse and try again
            return self.read(*args)
        return ret

    def write(self):
        raise NotImplementedError()

filenames = ["xaa", "xab", "xac", "xad"]
filehandles = [open(f, "rb") for f in filenames]
wrapper = ConcatFileWrapper(filehandles)

with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)

# Close all files
for f in filehandles:
    f.close()
Here's how I tested this:
I created a file to test this through the following commands.
Create a file with the contents 1 thru 1000.
$ seq 1 1000 > foo
Compress it.
$ gzip foo
Split the file. This produces four files named xaa-xad.
$ split -b 500 foo.gz
Run the above Python file on it, and it should print out 1 - 1000.
Edit: extra remark about lazy-opening the files
If you have a huge number of files, you might want to open only one file at a time. Here's an example:
def open_files(filenames):
    for filename in filenames:
        # Note: this will leak file handles unless you uncomment the code above
        # that closes the file handles again.
        yield open(filename, "rb")
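For instance (a sketch, not in the original answer), the lazy generator can be handed straight to ConcatFileWrapper so only one file is open at a time:
# Only the file currently being read is open (assumes the optional
# self.current_file.close() in read() is uncommented).
wrapper = ConcatFileWrapper(open_files(["xaa", "xab", "xac", "xad"]))
with gzip.open(wrapper) as gf:
    for line in gf:
        print(line)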
Related
I'm having difficulty with the following code (which is simplified from a larger application I'm working on in Python).
from io import StringIO
import gzip

jsonString = 'JSON encoded string here created by a previous process in the application'
out = StringIO()
with gzip.GzipFile(fileobj=out, mode="w") as f:
    f.write(str.encode(jsonString))

# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "a", encoding="utf-8") as f:
    f.write(out.getvalue())
When this runs I get the following error:
File "d:\Development\AWS\TwitterCompetitionsStreaming.py", line 61, in on_status
with gzip.GzipFile(fileobj=out, mode="w") as f:
File "C:\Python38\lib\gzip.py", line 204, in __init__
self._write_gzip_header(compresslevel)
File "C:\Python38\lib\gzip.py", line 232, in _write_gzip_header
self.fileobj.write(b'\037\213') # magic header
TypeError: string argument expected, got 'bytes'
PS ignore the rubbish indenting here...I know it doesn't look right.
What I want to do is create a JSON file and gzip it in memory before saving the gzipped file to the filesystem (Windows). I know I've gone about this the wrong way and could do with a pointer. Many thanks in advance.
You have to use bytes everywhere when working with gzip, not strings and text. First, use BytesIO instead of StringIO. Second, the mode should be 'wb' for bytes instead of 'w' (the latter is for text), and likewise 'ab' instead of 'a' when appending; the 'b' character means "bytes". Full corrected code below:
from io import BytesIO
import gzip

jsonString = 'JSON encoded string here created by a previous process in the application'
out = BytesIO()
with gzip.GzipFile(fileobj=out, mode='wb') as f:
    f.write(str.encode(jsonString))

currenttimestamp = '2021-01-29'

# Write the file once finished rather than streaming it - uncomment the next line to see file locally.
with open("out_" + currenttimestamp + ".json.gz", "wb") as f:
    f.write(out.getvalue())
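As a quick sanity check (not part of the original answer), the in-memory buffer can be decompressed again to confirm the round trip:
# Decompress the buffer and compare against the original string.
assert gzip.decompress(out.getvalue()).decode() == jsonString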
I am using Paramiko to create an SFTP client to create a backup copy of a JSON file, read in the contents of the original, then update (the original). I am able to get this snippet of code to work:
# open sftp connection stuff
# read in json create backup copy - but have to 'open' twice
read_file = sftp_client.open(file_path)
settings = json.load(read_file)
read_file = sftp_client.open(file_path)
sftp_client.putfo(read_file, backup_path)
# json stuff and updating
new_settings = json.dumps(settings, indent=4, sort_keys = True)
# update remote json file
with sftp_client.open(file_path, 'w') as f:
    f.write(new_settings)
However when I try to clean up the code and combine the backup file creation and JSON load:
with sftp_client.open(file_path) as f:
    sftp_client.putfo(f, backup_path)
    settings = json.load(f)
The backup file will be created, but json.load will fail due to not having any content. And if I reverse the order, json.load will read in the values, but the backup copy will be empty.
I'm using Python 2.7 on a Windows machine, creating a remote connection to a QNX (Linux) machine. Appreciate any help.
Thanks in advance.
If you want to read the file a second time, you have to seek the file read pointer back to the beginning of the file:
with sftp_client.open(file_path) as f:
    sftp_client.putfo(f, backup_path)
    f.seek(0, 0)
    settings = json.load(f)
Though that is functionally equivalent to your original code with two opens.
If your aim was to optimize the code to avoid downloading the file twice, you will have to read/cache the file to memory and then upload and load the contents from the cache:
from io import BytesIO

f = BytesIO()
sftp_client.getfo(file_path, f)
f.seek(0, 0)
sftp_client.putfo(f, backup_path)
f.seek(0, 0)
settings = json.load(f)
I'm using Python, and would like to insert a string into a text file without deleting or copying the file. How can I do that?
Unfortunately there is no way to insert into the middle of a file without re-writing it. As previous posters have indicated, you can append to a file or overwrite part of it using seek but if you want to add stuff at the beginning or the middle, you'll have to rewrite it.
This is an operating system thing, not a Python thing. It is the same in all languages.
What I usually do is read from the file, make the modifications and write it out to a new file called myfile.txt.tmp or something like that. This is better than reading the whole file into memory because the file may be too large for that. Once the temporary file is completed, I rename it the same as the original file.
This is a good, safe way to do it because if the file write crashes or aborts for any reason, you still have your untouched original file.
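A minimal sketch of that temp-file-and-rename pattern (not the poster's code; the file name and marker are hypothetical):
import os

def insert_after_marker(path, marker, new_line):
    # Stream the original into a temporary sibling, adding the new line
    # where needed, then rename the temporary file over the original.
    tmp_path = path + ".tmp"
    with open(path, "r") as src, open(tmp_path, "w") as dst:
        for line in src:
            dst.write(line)
            if marker in line:
                dst.write(new_line + "\n")
    os.replace(tmp_path, path)

insert_after_marker("myfile.txt", "HEADER", "inserted line")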
Depends on what you want to do. To append you can open it with "a":
with open("foo.txt", "a") as f:
f.write("new line\n")
If you want to prepend something you have to read from the file first:
with open("foo.txt", "r+") as f:
    old = f.read()  # read everything in the file
    f.seek(0)  # rewind
    f.write("new line\n" + old)  # write the new line before
The fileinput module of the Python standard library will rewrite a file in place if you use the inplace=1 parameter:
import sys
import fileinput

# replace all occurrences of 'sit' with 'SIT' and insert a line after the 5th
for i, line in enumerate(fileinput.input('lorem_ipsum.txt', inplace=1)):
    sys.stdout.write(line.replace('sit', 'SIT'))  # replace 'sit' and write
    if i == 4:
        sys.stdout.write('\n')  # write a blank line after the 5th line
Rewriting a file in place is often done by saving the old copy with a modified name. Unix folks add a ~ to mark the old one. Windows folks do all kinds of things -- add .bak or .old -- or rename the file entirely or put the ~ on the front of the name.
import shutil

shutil.move(afile, afile + "~")

destination = open(afile, "w")
source = open(afile + "~", "r")
for line in source:
    destination.write(line)
    if <some condition>:
        destination.write(<some additional line> + "\n")
source.close()
destination.close()
Instead of shutil, you can use the following.
import os
os.rename(afile, afile + "~")
Python's mmap module will allow you to insert into a file. The following sample shows how it can be done in Unix (Windows mmap may be different). Note that this does not handle all error conditions and you might corrupt or lose the original file. Also, this won't handle unicode strings.
import os
from mmap import mmap

def insert(filename, str, pos):
    if len(str) < 1:
        # nothing to insert
        return

    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(filename))
    origSize = m.size()

    # or this could be an error
    if pos > origSize:
        pos = origSize
    elif pos < 0:
        pos = 0

    m.resize(origSize + len(str))
    m[pos + len(str):] = m[pos:origSize]
    m[pos:pos + len(str)] = str
    m.close()
    f.close()
It is also possible to do this without mmap with files opened in 'r+' mode, but it is less convenient and less efficient as you'd have to read and temporarily store the contents of the file from the insertion position to EOF - which might be huge.
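A rough sketch of that mmap-free variant (not in the original answer, and it assumes the tail of the file fits in memory):
def insert_without_mmap(filename, s, pos):
    # Read everything from the insertion point to EOF, then write the new
    # data followed by the saved tail back over it.
    with open(filename, 'r+') as f:
        f.seek(pos)
        tail = f.read()
        f.seek(pos)
        f.write(s + tail)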
As mentioned by Adam, you have to take your system's limitations into consideration before deciding on an approach: do you have enough memory to read the whole file into memory, replace parts of it, and re-write it?
If you're dealing with a small file or have no memory issues this might help:
Option 1)
Read the entire file into memory, do a regex substitution on the whole line or part of it, and replace it with that line plus the extra line. You will need to make sure that the 'middle line' is unique in the file; if you have timestamps on each line this should be pretty reliable.
import re

# open file with r+ (read and write)
f = open("file.log", 'r+')
# read entire content of file into memory
f_content = f.read()
# basically match middle line and replace it with itself and the extra line
f_content = re.sub(r'(middle line)', r'\1\nnew line', f_content)
# return pointer to top of file so we can re-write the content with replaced string
f.seek(0)
# clear file content
f.truncate()
# re-write the content with the updated content
f.write(f_content)
# close file
f.close()
Option 2)
Figure out middle line, and replace it with that line plus the extra line.
# open file with r+ (read and write)
f = open("file.log", 'r+')
# get list of lines
f_content = f.readlines()
# get index of the middle line
middle_line = len(f_content) // 2
# append the new line after the middle line (each element already ends with '\n')
f_content[middle_line] += "new line\n"
# return pointer to top of file so we can re-write the content with replaced string
f.seek(0)
# clear file content
f.truncate()
# re-write the content with the updated content
f.write(''.join(f_content))
# close file
f.close()
Wrote a small class for doing this cleanly.
import tempfile

class FileModifierError(Exception):
    pass

class FileModifier(object):

    def __init__(self, fname):
        self.__write_dict = {}
        self.__filename = fname
        # text-mode temp file so lines round-trip as strings
        self.__tempfile = tempfile.TemporaryFile(mode='w+')
        with open(fname, 'r') as fp:
            for line in fp:
                self.__tempfile.write(line)
        self.__tempfile.seek(0)

    def write(self, s, line_number='END'):
        if line_number != 'END' and not isinstance(line_number, (int, float)):
            raise FileModifierError("Line number %s is not a valid number" % line_number)
        try:
            self.__write_dict[line_number].append(s)
        except KeyError:
            self.__write_dict[line_number] = [s]

    def writeline(self, s, line_number='END'):
        self.write('%s\n' % s, line_number)

    def writelines(self, s, line_number='END'):
        for ln in s:
            self.writeline(ln, line_number)

    def __popline(self, index, fp):
        try:
            ilines = self.__write_dict.pop(index)
            for line in ilines:
                fp.write(line)
        except KeyError:
            pass

    def close(self):
        self.__exit__(None, None, None)

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        with open(self.__filename, 'w') as fp:
            for index, line in enumerate(self.__tempfile.readlines()):
                self.__popline(index, fp)
                fp.write(line)
            # anything left over (including 'END') goes at the end of the file
            for index in sorted(self.__write_dict, key=lambda k: float('inf') if k == 'END' else k):
                for line in self.__write_dict[index]:
                    fp.write(line)
        self.__tempfile.close()
Then you can use it this way:
with FileModifier(filename) as fp:
    fp.writeline("String 1", 0)
    fp.writeline("String 2", 20)
    fp.writeline("String 3")  # To write at the end of the file
If you know some unix you could try the following:
Notes: $ means the command prompt
Say you have a file my_data.txt with content as such:
$ cat my_data.txt
This is a data file
with all of my data in it.
Then, using the os module, you can run the usual sed commands:
import os

# Identifiers used are:
my_data_file = "my_data.txt"
command = "sed -i 's/all/none/' " + my_data_file

# Execute the command
os.system(command)
If you aren't aware of sed, check it out, it is extremely useful.
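A small aside (not from the original answer): subprocess is usually preferred over os.system for this kind of call, since it avoids shell quoting issues.
import subprocess

# Same sed invocation, without going through a shell.
subprocess.run(["sed", "-i", "s/all/none/", "my_data.txt"], check=True)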
Usually, when I want to transfer a text file from the web server to the client, here is what I do:
import cgi
print "Content-Type: text/plain"
print "Content-Disposition: attachment; filename=TEST.txt"
print
filename = "C:\\TEST.TXT"
f = open(filename, 'r')
for line in f:
    print line
This works fine for an ANSI file. However, say I have a binary file a.exe (this file is in a secret path on the web server, and users shall not have direct access to that directory path). I wish to use a similar method to transfer it. How can I do so?
What content-type should I use?
Using print seems to corrupt the content received on the client side. What is the correct method?
I use the following code.
#!c:/Python27/python.exe -u
import cgi
print "Content-Type: application/octet-stream"
print "Content-Disposition: attachment; filename=jstock.exe"
print
filename = "C:\\jstock.exe"
f = open(filename, 'rb')
for line in f:
    print line
However, when I compare the downloaded file with the original file, there seems to be extra whitespace (or more) after every single line.
Agree with the above posters about 'rb' and Content-Type headers.
Additionally:
for line in f:
    print line
This might be a problem when encountering \n or \r\n bytes in the binary file. It might be better to do something like this:
import sys

while True:
    data = f.read(4096)
    sys.stdout.write(data)
    if not data:
        break
Assuming this is running on Windows in a CGI environment, you will want to start the Python process with the -u argument; this will ensure stdout isn't in text mode.
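If passing -u isn't convenient, a sketch of an alternative (not from the original answer; Windows-specific, Python 2): switch stdout to binary mode from inside the script.
# Put stdout into binary mode so CGI output isn't newline-translated on Windows.
import sys, os, msvcrt
msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)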
When opening a file, you can use open(filename, 'rb') - the 'b' flag marks it as binary. For a general handler, you could use some form of mime magic (I'm not familiar with using it from Python, I've only ever used it from PHP a couple of years ago). For the specific case, .exe is application/octet-stream.
The content type of .exe is typically application/octet-stream.
You might want to read your file using open(filename, 'rb') where b means binary.
To avoid the whitespace problem, you could try with:
sys.stdout.write(open(filename,"rb").read())
sys.stdout.flush()
or even better, depending on the size of your file, use the Knio approach:
fo = open(filename, "rb")
while True:
    buffer = fo.read(4096)
    if buffer:
        sys.stdout.write(buffer)
    else:
        break
fo.close()
For anyone using Windows Server 2008 or 2012 and Python 3, here's an update...
After many hours of experimentation I have found the following to work reliably:
import io
import sys

with io.open(sys.stdout.fileno(), "wb") as fout:
    with open(filename, "rb") as fin:
        while True:
            data = fin.read(4096)
            fout.write(data)
            if not data:
                break
In a web app I am working on, the user can create a zip archive of a folder full of files. Here's the code:
files = torrent[0].files
zipfile = z.ZipFile(zipname, 'w')
output = ""
for f in files:
    zipfile.write(settings.PYRAT_TRANSMISSION_DOWNLOAD_DIR + "/" + f.name, f.name)
downloadurl = settings.PYRAT_DOWNLOAD_BASE_URL + "/" + settings.PYRAT_ARCHIVE_DIR + "/" + filename
output = "<a href=\"" + downloadurl + "\">Download " + torrent_name + "</a>"
return HttpResponse(output)
But this has the nasty side effect of a long wait (10+ seconds) while the zip archive is being downloaded. Is it possible to skip this? Instead of saving the archive to a file, is it possible to send it straight to the user?
I do believe that torrentflux provides this exact feature I am talking about: being able to zip GBs of data and download it within a second.
Check this: Serving dynamically generated ZIP archives in Django
As mandrake says, the constructor of HttpResponse accepts iterable objects.
Luckily, the ZIP format is such that the archive can be created in a single pass; the central directory record is located at the very end of the file.
And luckily, zipfile indeed doesn't do any seeks as long as you only add files.
Here is the code I came up with. Some notes:
I'm using this code for zipping up a bunch of JPEG pictures. There is no point compressing them; I'm using ZIP only as a container.
Memory usage is O(size_of_largest_file), not O(size_of_archive). And this is good enough for me: many relatively small files that add up to a potentially huge archive.
This code doesn't set the Content-Length header, so the user doesn't get a nice progress indication. It should be possible to calculate this in advance if the sizes of all files are known.
Serving the ZIP straight to the user like this means that resuming downloads won't work.
So, here goes:
import zipfile
from django.http import HttpResponse

class ZipBuffer(object):
    """ A file-like object for zipfile.ZipFile to write into. """

    def __init__(self):
        self.data = []
        self.pos = 0

    def write(self, data):
        self.data.append(data)
        self.pos += len(data)

    def tell(self):
        # zipfile calls this so we need it
        return self.pos

    def flush(self):
        # zipfile calls this so we need it
        pass

    def get_and_clear(self):
        result = self.data
        self.data = []
        return result

def generate_zipped_stream():
    sink = ZipBuffer()
    archive = zipfile.ZipFile(sink, "w")
    for filename in ["file1.txt", "file2.txt"]:
        archive.writestr(filename, "contents of file here")
        for chunk in sink.get_and_clear():
            yield chunk

    archive.close()
    # close() generates some more data, so we yield that too
    for chunk in sink.get_and_clear():
        yield chunk

def my_django_view(request):
    response = HttpResponse(generate_zipped_stream(), mimetype="application/zip")
    response['Content-Disposition'] = 'attachment; filename=archive.zip'
    return response
Here's a simple Django view function which zips up (as an example) any readable files in /tmp and returns the zip file.
from django.http import HttpResponse
import zipfile
import os
from cStringIO import StringIO  # caveats for Python 3.0 apply

def somezip(request):
    file = StringIO()
    zf = zipfile.ZipFile(file, mode='w', compression=zipfile.ZIP_DEFLATED)
    for fn in os.listdir("/tmp"):
        path = os.path.join("/tmp", fn)
        if os.path.isfile(path):
            try:
                zf.write(path)
            except IOError:
                pass
    zf.close()
    response = HttpResponse(file.getvalue(), mimetype="application/zip")
    response['Content-Disposition'] = 'attachment; filename=yourfiles.zip'
    return response
Of course this approach will only work if the zip files will conveniently fit into memory - if not, you'll have to use a disk file (which you're trying to avoid). In that case, you just replace the file = StringIO() with file = open('/path/to/yourfiles.zip', 'wb') and replace the file.getvalue() with code to read the contents of the disk file.
Does the zip library you are using allow output to a stream? You could stream directly to the user instead of temporarily writing to a zip file and THEN streaming it to the user.
It is possible to pass an iterator to the constructor of an HttpResponse (see the docs). That would allow you to create a custom iterator that generates data as it is being requested. However, I don't think that will work with a zip (you would have to send the partial zip as it is being created).
The proper way, I think, would be to create the files offline, in a separate process. The user could then monitor the progress and download the file when it's ready (possibly by using the iterator method described above). This would be similar to what sites like YouTube do when you upload a file and wait for it to be processed.
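A rough sketch of that offline approach (not from the original answer; names such as build_zip, start_zip_job and ARCHIVE_DIR are hypothetical): build the archive in a background thread and let a status check report when it is ready.
import os
import threading
import zipfile

ARCHIVE_DIR = "/var/www/archives"  # hypothetical location served as static files

def build_zip(job_id, paths):
    target = os.path.join(ARCHIVE_DIR, job_id + ".zip")
    # Write to a .part file first so the finished archive only appears once complete.
    with zipfile.ZipFile(target + ".part", "w") as zf:
        for p in paths:
            zf.write(p, os.path.basename(p))
    os.rename(target + ".part", target)

def start_zip_job(job_id, paths):
    # Kick off the zipping without blocking the request that triggered it.
    threading.Thread(target=build_zip, args=(job_id, paths), daemon=True).start()

def zip_is_ready(job_id):
    # A view (or an AJAX poll) can check this and hand out the download URL.
    return os.path.exists(os.path.join(ARCHIVE_DIR, job_id + ".zip"))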