Python Gzip - Appending to file on the fly - python

Is it possible to append to a gzipped text file on the fly using Python ?
Basically I am doing this:-
import gzip
content = "Lots of content here"
f = gzip.open('file.txt.gz', 'a', 9)
f.write(content)
f.close()
A line is appended (note "appended") to the file every 6 seconds or so, but the resulting file is just as big as a standard uncompressed file (roughly 1MB when done).
Explicitly specifying the compression level does not seem to make a difference either.
If I gzip an existing uncompressed file afterwards, it's size comes down to roughly 80kb.
Im guessing its not possible to "append" to a gzip file on the fly and have it compress ?
Is this a case of writing to a String.IO buffer and then flushing to a gzip file when done ?

That works in the sense of creating and maintaining a valid gzip file, since the gzip format permits concatenated gzip streams.
However it doesn't work in the sense that you get lousy compression, since you are giving each instance of gzip compression so little data to work with. Compression depends on taking advantage the history of previous data, but here gzip has been given essentially none.
You could either a) accumulate at least a few K of data, many of your lines, before invoking gzip to add another gzip stream to the file, or b) do something much more sophisticated that appends to a single gzip stream, leaving a valid gzip stream each time and permitting efficient compression of the data.
You find an example of b) in C, in gzlog.h and gzlog.c. I do not believe that Python has all of the interfaces to zlib needed to implement gzlog directly in Python, but you could interface to the C code from Python.

Related

Append a folder to gzip in memory using python

I have a tar.gz file downloaded from s3, I load it in memory and I want to add a folder and eventually write it into another s3.
I've been trying different approaches:
from io import BytesIO
import gzip
buffer = BytesIO(zip_obj.get()["Body"].read())
im_memory_tar = tarfile.open(buffer, mode='a')
The above rises the error: ReadError: invalid header .
With the below approach:
im_memory_tar = tarfile.open(fileobj=buffer, mode='a')
im_memory_tar.add(name='code_1', arcname='code')
The content seems to be overwritten.
Do you know a good solution to append a folder into a tar.gz file?
Thanks.
very well explained in question how-to-append-a-file-to-a-tar-file-use-python-tarfile-module
Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to open a certain (compressed) file for reading, ReadError is raised. Use mode 'r' to avoid this. If a compression method is not supported, CompressionError is raised.
First we need to consider how to append to a tar file. Let's set aside the compression for a moment.
A tar file is terminated by two 512-byte blocks of all zeros. To add more entries, you need to remove or overwrite that 1024 bytes at the end. If you then append another tar file there, or start writing a new tar file there, you will have a single tar file with all of the entries of the original two.
Now we return to the tar.gz. You can simply decompress the entire .gz file, do that append as above, and then recompress the whole thing.
Avoiding the decompression and recompression is rather more difficult, since we'd have to somehow remove that last 1024 bytes of zeros from the end of the compressed stream. It is possible, but you would need some knowledge of the internals of a deflate compressed stream.
A deflate stream consists of a series of compressed data "blocks", which are each an arbitrary number of bits long. You would need to decompress, without writing out the result, until you get to the block containing the last 1024 bytes. You would need to save the decompressed result of that and any subsequent blocks, and at what bit in the stream that block started. Then you could recompress that data, sans the last 1024 bytes, starting at that byte.
Complete the compression, and write out the gzip trailer with the 1024 zeros removed from the CRC and length. (There is a way to back out zeros from the CRC.) Now you have a complete gzip stream for the previous .tar.gz file, but with the last 1024 bytes of zeros removed.
Since the concatenation of two gzip streams is itself a valid gzip stream, you can now concatenate the second .tar.gz file directly or start writing a new .tar.gz stream there. You now have a single, valid .tar.gz stream with the entries from the two original sources.

get size of compressed file while compressing

I currently try to create a module that writes a *.gz file up to a specific size. I want to use it for a custom log handler to specify the maximum size of a zipped logfile. I already made my way through the gzip documentation and also the zlib documentation.
I could use zlib right away and measure the length of my compressed bytearray, but then I would have to create and write the gzip file header by myself. The zlib-documentaion itself says: For reading and writing .gz files see the gzip module..
But I do not see any option for getting the size of the compressed file in the gzip module.
the logfile opened via logfile = gzip.open("test.gz", "ab", compresslevel=6) does have a .size parameter, but this is the size of the original file, not the compressed file.
Also os.path.getsize("test.gz") is zero until logfile is closed and is actually written to the disk.
Do you have any idea how I can use the built-in gzip module to close a compressed file once it reached a certain size? Without closing and re-opening it all the time?
Or is this even possible?
Thanks for any help on this!
Update:
It is not true that no data is written to disk until the file is closed, it just takes some time to collect some kilobytes before the filesize changes. This is good enogh for me and my usecase, so this is solved. Thanks for any input!
My test code for this:
import os
import gzip
import time
data = 'Hello world'
limit = 10000
i = 0
logfile = gzip.open("test.gz", "wb", compresslevel=6)
while i < limit:
msg = f"{data} {str(i)} \n"
logfile.write(msg.encode("utf-8"))
print(os.path.getsize("test.gz"))
print(logfile.size)
if i > 1000:
logfile.flush()
break
#time.sleep(0.03)
i += 1
logfile.close()
print(f"final size of *.gz file: {os.path.getsize('test.gz')}")
print(f"final size of logfile object file: {logfile.size}")
gzip does not actually compress the file until after you close it so it does no really make sense to ask to know the size of the compressed file beforehand. One thing you could do is look at the size of compressed files you obtain on real data from your use case and do a linear regression to have some kind of approximation of the compression ratio.

How to obtain random access of a gzip compressed file

According to this FAQ on zlib.net it is possible to:
access data randomly in a compressed stream
I know about the module Bio.bgzf of Biopyton 1.60, which:
supports reading and writing BGZF files (Blocked GNU Zip Format), a variant of GZIP with efficient random access, most commonly used as part of the BAM file format and in tabix. This uses Python’s zlib library internally, and provides a simple interface like Python’s gzip library.
But for my use case I don't want to use that format. Basically I want something, which emulates the code below:
import gzip
large_integer_new_line_start = 10**9
with gzip.open('large_file.gz','rt') as f:
f.seek(large_integer_new_line_start)
but with the efficiency offered by the native zlib.net to provide random access to the compressed stream. How do I leverage that random access capability in Python?
I gave up on doing random access on a gzipped file using Python. Instead I converted my gzipped file to a block gzipped file with a block compression/decompression utility on the command line:
zcat large_file.gz | bgzip > large_file.bgz
Then I used BioPython and tell to get the virtual_offset of line number 1 million of the bgzipped file. And then I was able to rapidly seek the virtual_offset afterwards:
from Bio import bgzf
file='large_file.bgz'
handle = bgzf.BgzfReader(file)
for i in range(10**6):
handle.readline()
virtual_offset = handle.tell()
line1 = handle.readline()
handle.close()
handle = bgzf.BgzfReader(file)
handle.seek(virtual_offset)
line2 = handle.readline()
handle.close()
assert line1==line2
I would like to also point to the SO answer by Mark Adler here on examples/zran.c in the zlib distribution.
You are looking for dictzip.py, part of the serpento package. However, you have to compress the files with dictzip, which is a random seekable backward compatible variant of the gzip compression.
The indexed_gzip program might be what you wanted. It also uses zran.c under the hood.
If you just want to access the file from a random point can't you just do:
from random import randint
with open(filename) as f:
f.seek(0, 2)
size = f.tell()
f.seek(randint(0, size), 2)

Compressing A Series of JSON Objects While Maintaining Serial Reading?

I have a bunch of json objects that I need to compress as it's eating too much disk space, approximately 20 gigs worth for a few million of them.
Ideally what I'd like to do is compress each individually and then when I need to read them, just iteratively load and decompress each one. I tried doing this by creating a text file with each line being a compressed json object via zlib, but this is failing with a
decompress error due to a truncated stream,
which I believe is due to the compressed strings containing new lines.
Anyone know of a good method to do this?
Just use a gzip.GzipFile() object and treat it like a regular file; write JSON objects line by line, and read them line by line.
The object takes care of compression transparently, and will buffer reads, decompressing chucks as needed.
import gzip
import json
# writing
with gzip.GzipFile(jsonfilename, 'w') as outfile:
for obj in objects:
outfile.write(json.dumps(obj) + '\n')
# reading
with gzip.GzipFile(jsonfilename, 'r') as infile:
for line in infile:
obj = json.loads(line)
# process obj
This has the added advantage that the compression algorithm can make use of repetition across objects for compression ratios.
You might want to try an incremental json parser, such as jsaone.
That is, create a single json with all your objects, and parse it like
with gzip.GzipFile(file_path, 'r') as f_in:
for key, val in jsaone.load(f_in):
...
This is quite similar to Martin's answer, wasting slightly more space but maybe slightly more comfortable.
EDIT: oh, by the way, it's probably fair to clarify that I wrote jsaone.

How do the compression codecs work in Python?

I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.
As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.
The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
import bz2
class BZ2StreamEncoder(object):
def __init__(self, filename, mode):
self.log_file = open(filename, mode)
self.encoder = bz2.BZ2Compressor()
def write(self, data):
self.log_file.write(self.encoder.compress(data))
def flush(self):
self.log_file.write(self.encoder.flush())
self.log_file.flush()
def close(self):
self.flush()
self.log_file.close()
log_file = BZ2StreamEncoder(archive_file, 'ab')
A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block.
I would try building a much larger string (or list of strings if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more) as that is the block size that bzip2 uses
The problem is due to your use of append mode, which results in files that contain multiple compressed blocks of data. Look at this example:
>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
>>> f.write("ABCD")
On my system, this produces a file 12 bytes in size. Let's see what it contains:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
>>> f.read()
'ABCD'
Okay, now let's do another write in append mode:
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
>>> f.write("EFGH")
The file is now 24 bytes in size, and its contents are:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
>>> f.read()
'ABCD'
What's happening here is that unzip expects a single zipped stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience they process the first one and ignore the rest of the data. That's what Python does.
I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
I'm not sure how different this is from the codecs way of doing it but if you use GzipFile from the gzip module you can incrementally append to the file but it's not going to compress very well unless you are writing large amounts of data at a time (maybe > 1 KB). This is just the nature of the compression algorithms. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies) then you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data.

Categories