Which compression method in Python has the best compression ratio?
Is the commonly used zlib.compress() the best or are there some better options? I need to get the best compression ratio possible.
I am compressing strings and sending them over UDP. A typical string I compress is about 1,700,000 bytes.
I'm sure there are more obscure formats with better compression, but lzma is the best of those that are well supported. There are some Python bindings here.
EDIT
Don't pick a format without testing, some algorithms do better depending on the data set.
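For instance, a minimal sketch using the standard-library lzma module (Python 3.3+); the payload below is only a stand-in for your real ~1.7 MB strings:

import lzma

payload = b"typical 1.7 MB string goes here"  # stand-in for the real data

# preset=9 trades speed for the best ratio the module offers
compressed = lzma.compress(payload, preset=9)
assert lzma.decompress(compressed) == payload
print(len(payload), "->", len(compressed))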
If you are willing to trade performance for better compression, then the bz2 library usually gives better results than the gz (zlib) library.
There are other compression libraries, such as xz (LZMA2), that might give even better results; the corresponding lzma module has been part of the core Python distribution since Python 3.3.
Python Doc for BZ2 class
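Since the best codec depends on the data, here is a minimal sketch for comparing the standard-library options on your own bytes (the sample file name is just a placeholder):

import bz2
import lzma
import zlib

with open("sample.bin", "rb") as f:   # placeholder: use your own data
    data = f.read()

sizes = {
    "zlib": len(zlib.compress(data, 9)),
    "bz2": len(bz2.compress(data, 9)),
    "lzma": len(lzma.compress(data, preset=9)),
}
print(sizes)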
EDIT: Depending on the type of image, you might not get much additional compression. Many image formats are already compressed, unless the image is raw, BMP, or uncompressed TIFF. Testing between the various compression types is highly recommended.
EDIT2: If you do decide to do image compression, ImageMagick has Python bindings and supports many image conversion types.
Image Magick
Image Formats Supported
The best compression algorithm definitely depends on the kind of data you are dealing with. Unless you are working with a list of random numbers stored as a string (in which case no compression algorithm will work), knowing the kind of data usually allows you to apply much better algorithms than general-purpose ones (see other answers for good, ready-to-use general compression algorithms).
If you are dealing with images, you should definitely choose a lossy (i.e. pixel-aware) compression format in preference to any lossless one. That will give you much better results. Recompressing a lossy format with a lossless one on top is a waste of time.
I would look through PIL to see what you can use. Something like converting the image to JPEG, with a quality setting that matches your requirements, before sending should be very efficient.
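For instance, a minimal sketch with Pillow (the maintained PIL fork); the file names and the quality value are placeholders to tune:

from PIL import Image

# Re-encode as JPEG before sending: lower quality means a smaller payload.
image = Image.open("input.png")                 # placeholder input
image.convert("RGB").save("output.jpg", format="JPEG", quality=85, optimize=True)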
You should also be very cautious about using UDP: it can lose packets, and most compression formats are very sensitive to missing parts of a file. OK, that can be managed at the application level.
I am searching for the best possible solution for storing large dense 2D matrices of floating-point data, generally float32 data.
The goal would be to share scientific data more easily from websites like the Internet Archive and make such data FAIR.
My current approaches (listed below) fall short of the desired goal, so I thought I might ask here, hoping to find something new, such as a better data structure. Even though my examples are in Python, the solution does not need to be in Python. Any good solution will do, even one in COBOL!
CSV-based
One approach I tried is to store the values as compressed CSVs using pandas, but this is excruciatingly slow, and the resulting compression is not exactly optimal (generally about a 50% reduction from the plain CSV on the data I tried it on, which is not bad but not sufficient to make this viable). In this example, I am using gzip. I have also tried LZMA, but it is generally much slower and, at least on the data I tried it on, does not yield a significantly better result.
import pandas as pd
my_data: pd.DataFrame = create_my_data_doing_something()
my_data.to_csv("my_data_saved.csv.gz")
NumPy-based
Another solution is to store the data in a NumPy array and then compress it on disk.
import numpy as np
my_data: np.ndarray = create_my_data_doing_something()
np.save("my_data_saved.npy", my_data)
and afterwards
gzip -k my_data_saved.npy
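If you prefer to do the gzip step from Python instead of the shell, a minimal equivalent sketch:

import gzip
import shutil

import numpy as np

my_data: np.ndarray = create_my_data_doing_something()
np.save("my_data_saved.npy", my_data)

# Equivalent of `gzip -k my_data_saved.npy`, done from Python.
with open("my_data_saved.npy", "rb") as src, gzip.open("my_data_saved.npy.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)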
NumPy with HDF5
Another possible format is HDF5, but, as shown in the benchmark below, it does not do better than plain NumPy.
import h5py

with h5py.File("my_data_saved.h5", "w") as hf:
    hf.create_dataset("my_data_saved", data=my_data)
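Note that h5py stores data uncompressed unless a filter is requested; a minimal sketch enabling the gzip filter (filters require chunking, which h5py picks automatically here):

import h5py

with h5py.File("my_data_saved_compressed.h5", "w") as hf:
    hf.create_dataset(
        "my_data_saved",
        data=my_data,
        compression="gzip",      # DEFLATE filter
        compression_opts=9,      # compression level 0-9
        shuffle=True,            # byte-shuffle filter, often helps on float data
    )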
Issues with using the NumPy format
While this may be a good solution, we limit the usability of the data to people who can use Python. Of course, this is a vast pool of people; still, in my circles, many biologists and mathematicians abhor Python and prefer to stick to Matlab and R (and therefore would not know what to do with a .npy.gz file).
Pickle
Another solution, as correctly pointed out by #lojza, is to store the data in a pickle object, which may also be compressed on disk. In my benchmarks (see below), pickle obtains a compression ratio comparable to what is obtained with NumPy.
import pickle

import compress_pickle
import numpy as np

my_data: np.ndarray = create_my_data_doing_something()

# Not compressed pickle
with open("my_data_saved.pkl", "wb") as f:
    pickle.dump(my_data, f)

# Compressed version (the format is inferred from the .gz extension)
compress_pickle.dump(my_data, "my_data_saved.pkl.gz")
Issues with using the pickle format
The issue with using pickle is two-fold: first, the same Python-dependency issue discussed above; second, a significant security issue, as the pickle format can be used for arbitrary-code-execution exploits. People should be wary of downloading random pickle files from the internet (and here the goal is precisely to get people to share datasets on the internet).
import pickle

# Build the exploit
command = b"""cat flag.txt"""
x = b"c__builtin__\ngetattr\nc__builtin__\n__import__\nS'os'\n\x85RS'system'\n\x86RS'%s'\n\x85R." % command

# Test it (pickle.loads takes bytes; pickle.load expects a file object)
pickle.loads(x)
Benchmarks
I have executed benchmarks to provide a baseline to improve on. Here we can see that, generally, NumPy is the best-performing choice and, to my knowledge, its format should not pose any security risk. It follows that a compressed NumPy array is currently the best contender; please help me find a better one!
Examples
To share some actual use cases of this relatively simple task, here are a couple of graph embeddings of the complete OBO Foundry graph. If there were a way to make these files smaller, sharing them would be significantly easier, allowing for more reproducible experiments and accelerating research in bio-ontologies (for this specific data) and in other fields.
OBO Foundry embedded using CBOW
OBO Foundry embedded using TransE
What other approaches may I try?
Say I have a Torch tensor of integers in a small range 0,...,R (e.g., R=31).
I want to store it to disk in compressed form, in a way that is close to the entropy of the vector.
The compression techniques I know (e.g., Huffman and arithmetic coding) all seem to be serial in nature.
Is there a fast Torch entropy compression implementation?
I'm happy to use an off the shelf implementation, but I can also try to implement myself if someone knows a suitable algorithm.
torch.save will store it using the pickle protocol.
If you want to save space, quantizing these vectors before saving should help.
Also, you can try this zlib module for Torch:
https://github.com/jonathantompson/torchzlib
One alternative is to convert it to a NumPy array and then use some of the compression methods available there.
Refer to:
Compress numpy arrays efficiently
For what you described, you can simply pack five-bit integers into a bit stream. It's easy to compress and decompress with the bitwise shift, or, and and operators (<<, >>, |, &). That would be as good as you can do if your integers are uniformly distributed in 0..31 and there are no repeated patterns.
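A minimal pure-Python sketch of that packing (the helper names are mine; a vectorized NumPy or Torch version would be faster for a large tensor):

import numpy as np

def pack_5bit(values) -> bytes:
    # Append each value's 5 bits to a little-endian bit buffer, emitting full bytes.
    buf, nbits, out = 0, 0, bytearray()
    for v in values:
        buf |= int(v) << nbits
        nbits += 5
        while nbits >= 8:
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
    if nbits:
        out.append(buf & 0xFF)
    return bytes(out)

def unpack_5bit(data: bytes, count: int) -> np.ndarray:
    # Reverse of pack_5bit: pull 5 bits at a time out of the byte stream.
    buf, nbits, out = 0, 0, []
    it = iter(data)
    while len(out) < count:
        while nbits < 5:
            buf |= next(it) << nbits
            nbits += 8
        out.append(buf & 0x1F)
        buf >>= 5
        nbits -= 5
    return np.array(out, dtype=np.uint8)

packed = pack_5bit([3, 31, 0, 17])
assert list(unpack_5bit(packed, 4)) == [3, 31, 0, 17]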
If, on the other hand, the distribution of your integers is significantly skewed or there are repeated patterns, then you should use an existing lossless compressor, such as zlib, zstd, or lzma2 (xz). For any of those, feed them one integer per byte.
To parallelize the computation, you can break up your 2^25 integers into many small subsets, each of which can be compressed independently. You could go down to a few tens of kilobytes each, likely with little overhead or loss of compression. You will need to experiment with your data.
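A minimal sketch of the one-integer-per-byte approach with independently compressed chunks (zlib here; the chunk size is an assumption to tune on your data):

import zlib

import torch

x = torch.randint(0, 32, (1 << 25,))          # example tensor with values in 0..31
raw = x.to(torch.uint8).numpy().tobytes()      # one integer per byte

CHUNK = 64 * 1024                              # assumed chunk size
chunks = [zlib.compress(raw[i:i + CHUNK], 9) for i in range(0, len(raw), CHUNK)]

# Each chunk decompresses independently, so both loops can be parallelized.
restored = b"".join(zlib.decompress(c) for c in chunks)
assert restored == raw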
I'm trying to compress an image using the zlib library in Python (VS Code). I generate an output file, but it weighs the same as the original file.
This is the code:
import zlib

with open("garenap.jpg", "rb") as in_file:
    compressed = zlib.compress(in_file.read(), -1)

with open("arroz", "wb") as out_file:
    out_file.write(compressed)
I don't think the two files weigh exactly the same. If you try the following:
import zlib

with open("garenap.jpg", "rb") as in_file:
    compressed = zlib.compress(in_file.read(), -1)
    print(in_file.tell())

with open("arroz", "wb") as out_file:
    out_file.write(compressed)
    print(out_file.tell())
you should see two slightly different numbers (which are basically the file sizes).
For some jpg of mine I got:
3563384
3448655
so the zlib.compress() is actually reducing the file size a tiny bit.
You should observe something similar yourself too.
Anything that is not the same number is fine.
As #jasonharper already pointed out, the JPEG format is already highly compressed, just not with DEFLATE, which is what zlib (including the implementation available in Python) applies.
DEFLATE is quite different from the lossy compression implemented in JPEG, which is based on an integral transform (the discrete cosine transform). The output of this transform is typically non-redundant, and therefore the Lempel-Ziv 77 algorithm implemented in DEFLATE (or any other implementation, for what it's worth) is of limited efficacy.
In conclusion, zlib is doing its job, but it is unlikely to be effective for jpeg data.
Note on larger compressed files
zlib-compressed files can be larger than their inputs.
This is true for any lossless compression algorithm, and can be easily proved: consider repeated applications of a lossless algorithm; if every application strictly reduced the file size, you would eventually reach a size of 0, i.e. an empty file, which obviously could not be decompressed back. This shows that a lossless compressor cannot always reduce file size.
Looking into LZ77 details from Wikipedia:
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream.
The following is not exactly how LZ77 works but should give you the idea.
Let's replace repeating characters with the character followed by the number of times it is repeated.
This algorithm works well with xxxxxxxx, which is reduced to x8 (x 8 times). If the sequence is non-redundant, e.g. abcdefgh, then this algorithm would produce a1b1c1d1e1f1g1h1, which does not reduce the input size but actually DOUBLES it.
What you are observing is something similar.
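A toy run-length encoder, just to make that expansion concrete (an illustration of the idea above, not what zlib actually does):

from itertools import groupby

def toy_rle(s: str) -> str:
    # Each character is followed by the length of its run.
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(s))

print(toy_rle("xxxxxxxx"))  # 'x8' - shorter than the input
print(toy_rle("abcdefgh"))  # 'a1b1c1d1e1f1g1h1' - twice as long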
I am tabulating a lot of output from some network analysis, listing one edge per line, which results in dozens of gigabytes, stretching the limits of my resources (understatement). As I only deal with numerical values, it occurred to me that I might be smarter than using the Py3k defaults; i.e., some other character encoding might save me quite some space given that I only have digits (plus spaces and the occasional decimal dot). Constrained as I am, I might even save on the line endings (so as not to duplicate the Windows-standard CRLF). What is the best practice on this?
An example line would read like this:
62233 242344 0.42442423
(Where actually the last number is pointlessly precise, I will cut it back to three nonzero digits.)
As I will need to read the text file in with other software (Stata, actually), I cannot keep the data in an arbitrary binary format, though I see no reason why Stata would read only UTF-8 text. Or would you simply say that avoiding UTF-8 barely saves me anything?
I think compression would not work for me, as I write the text line by line and it would be great to limit the output size even during this step. I might easily be mistaken about how compression works, but I thought it could only save me space after the file is generated, whereas my issue is that my code already crashes while I am tabulating the text file (line by line).
Thanks for all the ideas and clarifying questions!
You can use zlib or gzip to compress the data as you generate it. You won't need to change your format at all; the compression will adapt to the characters and sequences you use the most to produce an optimal file size.
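A minimal sketch of streaming the lines straight into a gzip file as they are produced (generate_edges is a hypothetical stand-in for your tabulation loop):

import gzip

with gzip.open("edges.txt.gz", "wt", encoding="ascii", newline="\n") as out:
    for src, dst, weight in generate_edges():   # hypothetical generator
        out.write(f"{src} {dst} {weight:.3g}\n")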
Or avoid character encodings entirely and save your data in a binary format. See Python's struct. ASCII-encoded, a value like 4 billion takes 10 bytes, but it fits in a 4-byte integer. There are a lot of downsides to a custom binary format (it's hard to debug manually or inspect with other tools, etc.).
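A minimal sketch of that idea, packing one edge per fixed-size record (the <IIf layout, two unsigned 32-bit integers plus a 32-bit float, is only an assumption about your value ranges):

import struct

# One 12-byte record per edge, versus 24 bytes for the ASCII line "62233 242344 0.42442423\n".
record = struct.pack("<IIf", 62233, 242344, 0.42442423)

with open("edges.bin", "ab") as out:
    out.write(record)

# Reading it back:
node_a, node_b, weight = struct.unpack("<IIf", record)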
I have done some study on this. Clever encoding does not matter much once you apply compression: even with a binary encoding, the data seems to contain the same entropy and ends up at a similar size after compression.
The Power of Gzip
Yes, there are Python libraries that allow you to stream output and automatically compress it.
Lossy encoding does save space. Cutting down the precision helps.
I don't know the capabilities of data input in Stata, and a quick search reveals that said capabilities are described in the User's Guide, which seems to be available only on dead-tree copies. So I don't know if my suggestion is feasible.
An instant saving of half the size would come from using 4 bits per character. You have an alphabet of 0 to 9, period, (possibly) minus sign, space and newline: 14 characters, which fit comfortably into 2**4 == 16 slots (see the sketch at the end of this answer).
If this can be used in Stata, I can help more with suggestions for quick conversions.
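A minimal sketch of that 4-bit packing (the alphabet ordering and the reserved padding nibble are my own choices):

ALPHABET = "0123456789. -\n"                 # 14 symbols -> codes 0..13
CODE = {ch: i for i, ch in enumerate(ALPHABET)}
PAD = 0xF                                    # padding nibble, never a real symbol

def pack_line(line: str) -> bytes:
    codes = [CODE[ch] for ch in line]
    if len(codes) % 2:
        codes.append(PAD)
    return bytes((hi << 4) | lo for hi, lo in zip(codes[::2], codes[1::2]))

packed = pack_line("62233 242344 0.424\n")   # 10 bytes instead of 19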
When I used NumPy, I stored its data in the native *.npy format. It's very fast and gave me some benefits, like this one:
I could read *.npy files from C code as simple binary data (I mean, *.npy files are binary-compatible with C structures).
Now I'm dealing with HDF5 (PyTables at the moment). As I read in the tutorial, it uses the NumPy serializer to store NumPy data, so can I read these data from C just as I do from simple *.npy files?
Is HDF5's NumPy data binary-compatible with C structures too?
UPD :
I have a Matlab client reading from HDF5, but I don't want to read HDF5 from C++ through the API, because reading binary data from *.npy is many times faster; so I really do need a way of reading HDF5 from C++ (binary compatibility).
So I'm already using two ways of transferring data: *.npy (read from C++ as bytes, from Python natively) and HDF5 (accessed from Matlab).
And, if possible, I want to use only one way, HDF5, but to do this I have to find a way to make HDF5 binary-compatible with C++ structures. Please help: if there is some way to turn off compression in HDF5, or anything else that would make HDF5 binary-compatible with C++ structures, tell me where I can read about it.
The proper way to read HDF5 files from C is to use the HDF5 API - see this tutorial. In principle it is possible to directly read the raw data from the HDF5 file as you would with the .npy file, assuming you have not used advanced storage options such as compression in your HDF5 file. However, this essentially defeats the whole point of using the HDF5 format, and I cannot think of any advantage to doing this instead of using the proper HDF5 API. Also note that the API has a simplified high-level version which should make reading from C relatively painless.
I feel your pain. I've been dealing extensively with massive amounts of data stored in HDF5 formatted files, and I've gleaned a few bits of information you may find useful.
If you are in "control" of the file creation (and of writing the data - even if you use an API), you should be able to largely circumvent the HDF5 libraries.
If the output datasets are not chunked, they will be written contiguously. As long as you aren't specifying any byte-order conversion in your datatype definitions (i.e. you are specifying that the data should be written in native float/double/integer format), you should be able to achieve "binary compatibility", as you put it.
To solve my problem I wrote an HDF5 file parser using the file specification http://www.hdfgroup.org/HDF5/doc/H5.format.html
With a fairly simple parser you should be able to identify the offset to (and size of) any dataset. At that point simply fseek and fread (in C, that is, perhaps there is a higher level approach you can take in C++).
If your datasets are chunked, then more parsing is necessary to traverse the b-trees used to organize the chunks.
The only other issue you should be aware of is handling (or eliminating) any system-dependent structure padding.
HDF5 takes care of binary compatibility of structures for you. You simply have to tell it what your structs consist of (dtype) and you'll have no problems saving/reading record arrays - this is because the type system is basically 1:1 between NumPy and HDF5. If you use h5py, I'm confident the I/O will be fast enough, provided you use all native types and large batched reads/writes - the entire dataset, if feasible. After that it depends on chunking and on what filters you use (shuffle and compression, for example) - it's also worth noting that those can sometimes speed things up by greatly reducing file size, so always look at benchmarks. Note that the type and filter choices are made on the end creating the HDF5 document.
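A minimal h5py sketch of that idea (the dtype and dataset names are placeholders; NumPy compound dtypes are packed by default, so the matching C struct must be packed too):

import h5py
import numpy as np

# Compound dtype of native types; maps 1:1 onto a packed C struct {int32_t node; double weight;}.
dt = np.dtype([("node", np.int32), ("weight", np.float64)])
records = np.zeros(1000, dtype=dt)

with h5py.File("records.h5", "w") as hf:
    # No chunking and no filters: the dataset is laid out contiguously,
    # so the raw bytes on disk are the packed records themselves.
    hf.create_dataset("records", data=records)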
If you're trying to parse HDF5 yourself, you're doing it wrong. Use the C++ and C APIs if you're working in C++/C. There are examples of so-called "compound types" on the HDF5 Group's website.