How do the compression codecs work in Python?

I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.

As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.
The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
import bz2

class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        # Compress incrementally; the compressor carries its state across writes
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        # Note: BZ2Compressor.flush() finalizes the stream, so this should only
        # be called once, when you are done writing (close() does it for you)
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()

log_file = BZ2StreamEncoder(archive_file, 'ab')
A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
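If you do end up with several concatenated bz2 streams in one file and need to read them back from Python, here's a rough sketch of a workaround (not part of the original answer; the helper name is mine): drive bz2.BZ2Decompressor by hand and start a new decompressor whenever one stream ends. I believe recent Python 3 releases can read multi-stream files directly with bz2.open, but check your version.
import bz2

def read_multistream_bz2(path):
    # Read the whole file, then peel off one compressed stream at a time.
    # Note: this loads the entire compressed file into memory.
    with open(path, 'rb') as f:
        data = f.read()
    while data:
        decompressor = bz2.BZ2Decompressor()
        yield decompressor.decompress(data)
        # unused_data holds whatever bytes belong to the next stream, if any
        data = decompressor.unused_data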

The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block.
I would try building a much larger string (or list of strings, if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more), as that is the block size that bzip2 uses at its default compression level.
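To make that concrete, here's a sketch of the buffering idea applied to the original code (archive_file and cursor are the names from the question; the 900K threshold and variable names are my choices). Each write still becomes its own compressed chunk, there are just far fewer of them:
import codecs

BLOCK = 900 * 1024   # roughly one bzip2 block

log_file = codecs.open(archive_file, 'w', 'bz2')
pending = []
pending_len = 0
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    pending.append(line)
    pending_len += len(line)
    if pending_len >= BLOCK:
        log_file.write(''.join(pending))   # one write -> one compressed chunk
        pending, pending_len = [], 0
if pending:
    log_file.write(''.join(pending))
log_file.close()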

The problem is due to your use of append mode, which results in files that contain multiple compressed streams of data. Look at this example:
>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("ABCD")
On my system, this produces a file 12 bytes in size. Let's see what it contains:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
Okay, now let's do another write in append mode:
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("EFGH")
The file is now 24 bytes in size, and its contents are:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
What's happening here is that the decompressor expects a single compressed stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience it processes the first one and ignores the rest of the data. That's what Python does.
I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.

I'm not sure how different this is from the codecs way of doing it, but if you use GzipFile from the gzip module you can incrementally append to the file. However, it's not going to compress very well unless you are writing large amounts of data at a time (maybe more than 1 KB); that is just the nature of the compression algorithms. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies), you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data, as sketched below.
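A rough sketch of such a wrapper (the class name and buffer size are mine, not from this answer); note that anything still sitting in the buffer is lost if the process dies before flush() or close():
import gzip

class BufferedGzipWriter(object):
    """Buffer small writes and hand them to GzipFile in larger chunks."""
    def __init__(self, filename, bufsize=64 * 1024):
        self.gz = gzip.GzipFile(filename, 'ab')
        self.bufsize = bufsize
        self.buf = []
        self.buflen = 0

    def write(self, data):
        self.buf.append(data)
        self.buflen += len(data)
        if self.buflen >= self.bufsize:
            self.flush()

    def flush(self):
        if self.buf:
            self.gz.write(''.join(self.buf))
            self.buf, self.buflen = [], 0

    def close(self):
        self.flush()
        self.gz.close()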

Related

Python streaming compression

I have a program constantly writing to a file, which is compressed on the fly:
import lzma
with lzma.open('file.lz', 'wb') as f:
    for ...:
        # do something
        f.write(item)
So, the file is append-only. At the same time, I need to be able to run another program which will read from this file - not streaming/following, just one-shot reading of the current content. Basically, this works:
import lzma
with lzma.open('file.lz', 'rb') as f:
content = f.read()
But writes in the first program don't reach the file immediately; instead they are buffered for some time (I see buffers of sizes 8k to 60k). When small writes happen infrequently, the file content can fall very far behind the current state, and I'd like to flush it or do something similar (every n records or every n minutes). However, f.flush() doesn't seem to do anything. What's the best solution here? Maybe I overlooked something obvious?
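One workaround I can sketch (an assumption on my part, not a confirmed answer from the thread): close the LZMAFile and reopen it in append mode every n records, so each batch is finished as its own xz stream on disk. As far as I know lzma.open accepts 'ab' and the reader side handles concatenated streams transparently, but verify that on your Python version. produce_items() below is just a stand-in for the real loop.
import lzma

BATCH = 1000                        # flush to disk every BATCH records (arbitrary)

f = lzma.open('file.lz', 'ab')      # each reopen starts a new compressed stream
count = 0
for item in produce_items():        # stand-in for the real data source
    f.write(item)
    count += 1
    if count % BATCH == 0:
        f.close()                   # finalizes the current stream so readers see it
        f = lzma.open('file.lz', 'ab')
f.close()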

Using wholeTextFiles in pyspark but getting an out of memory error

I have some files (part-00000.gz, part-00001.gz, part-00002.gz, ...) and each part is rather large. I need to use the filename of each part because it contains timestamp information. As far as I know, it seems that in pyspark only wholeTextFiles can read input as (filename, content). However, I get an out of memory error when using wholeTextFiles. So, my guess is that wholeTextFiles reads a whole part as content in the mapper without any partition operation. I also found this answer (How does the number of partitions affect `wholeTextFiles` and `textFiles`?). If so, how can I get the filename of a rather large part file? Thanks
You get the error because wholeTextFiles tries to read each entire file in as a single record. You're better off reading the file line by line, which you can do simply by writing your own generator and using the flatMap function. Here's an example of doing that to read a gzip file:
import glob
import gzip

def read_fun_generator(filename):
    with gzip.open(filename, 'rb') as f:
        for line in f:
            yield line.strip()

# sc is the SparkContext provided by the pyspark shell
gz_filelist = glob.glob("/path/to/files/*.gz")
rdd_from_bz2 = sc.parallelize(gz_filelist).flatMap(read_fun_generator)

How to make sure a netcdf file is closed in python?

It's probably simple, but I haven't been able to find a solution online...
I'm trying to work with a series of datasets stored as netcdf files. I open each one up, read in some key points, then move on to the next file. I am finding that I constantly hit mmap errors and the script slows down as more files are read in. I believe it may be because the netcdf files are not being properly closed by the .close() command.
I've been testing this:
from scipy.io.netcdf import netcdf_file as ncfile
f=ncfile(netcdf_file,mode='r')
f.close()
then if I try
>>>f
<scipy.io.netcdf.netcdf_file object at 0x24d29e10>
and
>>>f.variables['temperature'][:]
array([ 1234.68034431, 1387.43136567, 1528.35794546, ..., 3393.91061952,
3378.2844357 , 3433.06715226])
So it appears the file is still open? What does close() actually do? How do I know it has worked?
Is there a way to close/clear all open files from python?
Software:
Python 2.7.6, scipy 0.13.2, netcdf 4.0.1
The code for f.close is:
Definition: f.close(self)
Source:
def close(self):
    """Closes the NetCDF file."""
    if not self.fp.closed:
        try:
            self.flush()
        finally:
            self.fp.close()
f.fp is the file object. So
In [451]: f.fp
Out[451]: <open file 'test.cdf', mode 'wb' at 0x939df40>
In [452]: f.close()
In [453]: f.fp
Out[453]: <closed file 'test.cdf', mode 'wb' at 0x939df40>
But I see from playing around with f that I can still create dimensions and variables, although f.flush() returns an error.
It does not look like it uses mmap during data writes, just during reads.
def _read_var_array(self):
    ....
    if self.use_mmap:
        mm = mmap(self.fp.fileno(), begin_+a_size, access=ACCESS_READ)
        data = ndarray.__new__(ndarray, shape, dtype=dtype_,
                               buffer=mm, offset=begin_, order=0)
    else:
        pos = self.fp.tell()
        self.fp.seek(begin_)
        data = fromstring(self.fp.read(a_size), dtype=dtype_)
        data.shape = shape
        self.fp.seek(pos)
I don't have much experience with mmap. It looks like it sets up a mmap object based on a block of bytes in the file, and uses that as the data buffer for the variable. I don't know what happens to that access if the underlying file is closed. I wouldn't be surprised if there is some sort of mmap error.
If the file is opened with mmap=False, then the whole variable is read into memory, and accessed like a regular numpy array.
mmap : None or bool, optional
    Whether to mmap `filename` when reading. Default is True
    when `filename` is a file name, False when `filename` is a
    file-like object.
My guess is that if you open a file without specifying the mmap mode, read a variable from it, and then close the file, it is unsafe to reference that variable and its data later. Any reference that requires loading more data could result in a mmap error.
But if you open the file with mmap=False, you should be able slice the variable even after closing the file.
I don't see how the mmap for one file or variable could interfere with access to other files and variables. But I'd have to read more on mmap to be sure of that.
And from the netcdf docs:
Note that when netcdf_file is used to open a file with mmap=True (default for read-only), arrays returned by it refer to data directly on the disk. The file should not be closed, and cannot be cleanly closed when asked, if such arrays are alive. You may want to copy data arrays obtained from mmapped Netcdf file if they are to be processed after the file is closed, see the example below.
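Putting that together, a sketch of the safer patterns (netcdf_file and the 'temperature' variable are the names from the question):
from scipy.io.netcdf import netcdf_file as ncfile

# Option 1: disable mmap, so the data is read into ordinary in-memory arrays
f = ncfile(netcdf_file, mode='r', mmap=False)
temperature = f.variables['temperature'][:]
f.close()                  # safe: nothing still points at the file's buffer

# Option 2: keep mmap, but copy the data out before closing
f = ncfile(netcdf_file, mode='r')
temperature = f.variables['temperature'][:].copy()
f.close()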

Compressing A Series of JSON Objects While Maintaining Serial Reading?

I have a bunch of json objects that I need to compress as it's eating too much disk space, approximately 20 gigs worth for a few million of them.
Ideally what I'd like to do is compress each one individually and then, when I need to read them, just iteratively load and decompress each one. I tried doing this by creating a text file where each line is a compressed json object via zlib, but this is failing with a decompress error due to a truncated stream, which I believe is due to the compressed strings containing newlines.
Anyone know of a good method to do this?
Just use a gzip.GzipFile() object and treat it like a regular file; write JSON objects line by line, and read them line by line.
The object takes care of compression transparently, and will buffer reads, decompressing chunks as needed.
import gzip
import json

# writing
with gzip.GzipFile(jsonfilename, 'w') as outfile:
    for obj in objects:
        outfile.write(json.dumps(obj) + '\n')

# reading
with gzip.GzipFile(jsonfilename, 'r') as infile:
    for line in infile:
        obj = json.loads(line)
        # process obj
This has the added advantage that the compression algorithm can make use of repetition across objects, which improves the compression ratio.
You might want to try an incremental json parser, such as jsaone.
That is, create a single json with all your objects, and parse it like
with gzip.GzipFile(file_path, 'r') as f_in:
    for key, val in jsaone.load(f_in):
        ...
This is quite similar to Martin's answer, wasting slightly more space but maybe slightly more comfortable.
EDIT: oh, by the way, it's probably fair to clarify that I wrote jsaone.

Python Gzip - Appending to file on the fly

Is it possible to append to a gzipped text file on the fly using Python ?
Basically I am doing this:
import gzip
content = "Lots of content here"
f = gzip.open('file.txt.gz', 'a', 9)
f.write(content)
f.close()
A line is appended (note "appended") to the file every 6 seconds or so, but the resulting file is just as big as a standard uncompressed file (roughly 1 MB when done).
Explicitly specifying the compression level does not seem to make a difference either.
If I gzip an existing uncompressed file afterwards, its size comes down to roughly 80 kB.
I'm guessing it's not possible to "append" to a gzip file on the fly and have it compress?
Is this a case of writing to a StringIO buffer and then flushing to a gzip file when done?
That works in the sense of creating and maintaining a valid gzip file, since the gzip format permits concatenated gzip streams.
However it doesn't work in the sense that you get lousy compression, since you are giving each instance of gzip compression so little data to work with. Compression depends on taking advantage of the history of previous data, but here gzip has been given essentially none.
You could either a) accumulate at least a few K of data, many of your lines, before invoking gzip to add another gzip stream to the file, or b) do something much more sophisticated that appends to a single gzip stream, leaving a valid gzip stream each time and permitting efficient compression of the data.
You can find an example of b) in C, in gzlog.h and gzlog.c. I do not believe that Python has all of the interfaces to zlib needed to implement gzlog directly in Python, but you could interface to the C code from Python.
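For completeness, here is a sketch of option a) in Python (the buffer size and produce_lines() are my stand-ins): collect lines in memory and only append a new gzip stream once a few kilobytes have accumulated.
import gzip

FLUSH_AT = 16 * 1024                    # append a new gzip stream every ~16 KB

pending = []
pending_len = 0
for line in produce_lines():            # stand-in for "a line every 6 seconds"
    pending.append(line)
    pending_len += len(line)
    if pending_len >= FLUSH_AT:
        with gzip.open('file.txt.gz', 'ab', 9) as f:
            f.write(''.join(pending))
        pending, pending_len = [], 0
if pending:                             # write whatever is left at shutdown
    with gzip.open('file.txt.gz', 'ab', 9) as f:
        f.write(''.join(pending))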
