I'm trying to compress an image using the zlib library in Python (VS Code). I generate an output file, but it weighs the same as the original file.
This is the code:
import zlib

with open("garenap.jpg", "rb") as in_file:
    compressed = zlib.compress(in_file.read(), -1)

with open("arroz", "wb") as out_file:
    out_file.write(compressed)
I don't think the two files weigh exactly the same. If you try the following:
import zlib

with open("garenap.jpg", "rb") as in_file:
    compressed = zlib.compress(in_file.read(), -1)
    print(in_file.tell())

with open("arroz", "wb") as out_file:
    out_file.write(compressed)
    print(out_file.tell())
you should see two slightly different numbers (which are basically the two file sizes).
For a JPEG of mine I got:
3563384
3448655
so zlib.compress() is actually reducing the file size a tiny bit.
You should observe something similar yourself; any two numbers that are not identical are fine.
As #jasonharper already pointed out, the JPEG format is already highly compressed, just not DEFLATE-compressed, which is what zlib (including the implementation available in Python) applies.
JPEG's lossy compression is based on an integral transform (the discrete cosine transform), whose output is typically non-redundant; therefore the Lempel-Ziv 77 algorithm implemented by DEFLATE (or by any other implementation, for what it's worth) has limited efficacy on it.
In conclusion, zlib is doing its job, but it is unlikely to be effective on JPEG data.
Note on larger compressed files
Files compressed with zlib can be larger than their inputs.
This is true for any lossless compression algorithm, and it can easily be proved: consider applying a lossless algorithm repeatedly to its own output. If every application strictly reduced the file size, you would eventually reach a size of 0, i.e. an empty file, and an empty file obviously cannot be decompressed back into all of the distinct originals. This shows that lossless compression cannot always reduce file size.
Looking into LZ77 details from Wikipedia:
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream.
The following is not exactly how LZ77 works but should give you the idea.
Let's replace repeating characters with the character followed by the number of times it is repeated.
This algorithm works well with xxxxxxxx being reduced to x8 (x 8 times). If the sequence is non-redundant, e.g. abcdefgh, then this algorithm would produce a1b1c1d1e1f1g1h1 which does not reduce the input size, but would actually DOUBLE it.
What you are observing is something similar.
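A minimal run-length-encoding sketch (not real LZ77, just the toy scheme described above) makes the expansion easy to see:
def rle(s):
    # naive run-length encoding: each run becomes "<char><run length>"
    out = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        out.append(f"{s[i]}{j - i}")
        i = j
    return "".join(out)

print(rle("xxxxxxxx"))  # x8 -> shorter than the input
print(rle("abcdefgh"))  # a1b1c1d1e1f1g1h1 -> twice as long as the input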
Related
I'm trying to implement a strings(1)-like function in Python.
import re

def strings(f, n=5):
    # TODO: support files larger than available RAM
    return re.finditer(br'[!-~\s]{%i,}' % n, f.read())

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], 'rb') as f:
        for m in strings(f):
            print(m[0].decode().replace('\x0A', '\u240A'))
Setting aside the case of actual matches* that are larger than the available RAM, the above code fails in the case of files that are merely, themselves, larger than the available RAM!
Naively "iterating over f" would happen line-wise, even for binary files; this may be inappropriate because (a) it may return different results than just running the regex on the whole input, and (b) if the machine has 4 gigabytes of RAM and the file contains any match for rb'[^\n]{8589934592,}', then that unasked-for match will cause a memory problem anyway!
Does Python's regex library enable any simple way to stream re.finditer over a binary file?
*I am aware that it is possible to write regular expressions that may require an exponential amount of CPU or RAM relative to their input length. Handling these cases is, obviously, out-of-scope; I'm assuming for the purposes of the question that the machine at least has enough resource to handle the regex, its largest match on the input, the acquisition of this match, and the ignoring-of all nonmatches.
Not a duplicate of Regular expression parsing a binary file? -- that question is actually asking about bytes-like objects; I am asking about binary files per se.
Not a duplicate of Parse a binary file with Regular Expressions? -- for the same reason.
Not a duplicate of Regular expression for binary files -- that question only addressed the special case where offsets of all matches were known beforehand.
Not a duplicate of Regular expression for binary files -- combination of both of these reasons.
Does Python's regex library enable any simple way to stream re.finditer over a binary file?
Well, while typing up the question in such excruciating detail and gathering supporting documentation, I found the solution:
mmap — Memory-mapped file support
Memory-mapped file objects behave like both bytearray and like file objects. You can use mmap objects in most places where bytearray is expected; for example, you can use the re module to search through a memory-mapped file. …
Enacted:
import re, mmap

def strings(f, n=5):
    view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return re.finditer(br'[!-~\s]{%i,}' % n, view)
Caveat: on 32-bit systems, this might not work for files larger than 2GiB, if the underlying standard library is deficient.
However, it looks like it should be fine on both Windows and any well-maintained Linux distribution:
13.8 Memory-mapped I/O
Since mmapped pages can be stored back to their file when physical memory is low, it is possible to mmap files orders of magnitude larger than both the physical memory and swap space. The only limit is address space. The theoretical limit is 4GB on a 32-bit machine - however, the actual limit will be smaller since some areas will be reserved for other purposes. If the LFS interface is used the file size on 32-bit systems is not limited to 2GB … the full 64-bit [8 EiB] are available. …
Creating a File Mapping Using Large Pages
… you must specify the FILE_MAP_LARGE_PAGES flag with the MapViewOfFile function to map large pages. …
I have a huge text file, about 500 MB in size. I tried to archive it with gzip, both from a Python program and from the command line. But in both cases the archived file's size is about 240 MB, whereas archiving with WinRAR on Windows gives around 450 KB. Is there something I am missing here? Why is there so much difference, and what can I do to achieve a similar level of compression?
I have tagged this with Python too, as any Python code regarding this will be very helpful.
Here is first 3 lines of the file:
$ head 100.txt -n 3
31731610:22783120;
22783120:
45476057:39683372;5879272;54702019;58780534;30705698;60087296;98422023;55173626;5607459;843581;11846946;97676518;46819398;60044103;48496022;35228829;6594795;43867901;66416757;81235384;42557439;40435884;60586505;65993069;76377254;82877796;94397118;39141041;2725176;56097923;4290013;26546278;18501064;27470542;60289066;43986553;67745714;16358528;63833235;92738288;77291467;54053846;93392935;10376621;15432256;96550938;25648200;10411060;3053129;54530514;97316324;
It is possible that the file is highly redundant with a repeating pattern that is larger than 32K. gzip's deflate only looks 32K back for matches, whereas the others can capitalize on history much further back.
Update:
I just made a file that is a 64K block of random data, repeated 4096 times (256 MB). gzip (with 32K window) was blind to the redundancy and so unable to compress it. gzip expanded it to 256.04 MB. xz (LZMA with 8 MB window) compressed it to 102 KB.
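If you want to reproduce that kind of experiment in Python (scaled down here so it runs quickly; the exact numbers are only illustrative):
import os, zlib, lzma

block = os.urandom(64 * 1024)   # one 64 KiB block of random data
data = block * 64               # repeated 64 times -> 4 MiB of highly redundant input

print(len(zlib.compress(data)))  # barely shrinks (may even grow): the repeats lie beyond DEFLATE's 32 KiB window
print(len(lzma.compress(data)))  # tiny output: LZMA's much larger dictionary sees the repetition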
WinRAR and Gzip are two very different compression programs. They each use different algorithms to compress data. Here are the descriptions of each type from Wikipedia:
Version 3 of RAR is based on Lempel-Ziv (LZSS) and prediction by partial matching (PPM) compression, specifically the PPMd implementation of PPMII by Dmitry Shkarin.
http://en.wikipedia.org/wiki/RAR#Compression_algorithm
And Gzip:
It is based on the DEFLATE algorithm, which is a combination of Lempel-Ziv (LZ77) and Huffman coding.
en.wikipedia.org/wiki/Gzip
My guess would be some sort of difference between how Prediction by partial matching and Huffman coding work. That file has very interesting properties though... What is the file?
I am tabulating a lot of output from some network analysis, listing an edge per line, which results in dozens of gigabytes, stretching the limits of my resources (understatement). As I only deal with numerical values, it occurred to me that I might be smarter than using the Py3k defaults, i.e. some other character encoding might save me quite some space if I only have digits (plus space and the occasional decimal dot). Constrained as I am, I might even save on the line endings (not using the Windows-standard two-byte CRLF). What is the best practice on this?
An example line would read like this:
62233 242344 0.42442423
(The last number is actually pointlessly precise; I will cut it back to three non-zero digits.)
As I will need to read the text file in with other software (Stata, actually), I cannot keep the data in an arbitrary binary format, though I see no reason why Stata would read only UTF-8 text. Or would you simply say that avoiding UTF-8 barely saves me anything?
I think compression would not work for me, as I write the text line by line and it would be great to limit the output size even during this. I might easily be mistaken about how compression works: I thought it could only save me space after the file is generated, but my issue is that my code crashes already while I am tabulating the text file (line by line).
Thanks for all the ideas and clarifying questions!
You can use zlib or gzip to compress the data as you generate it. You won't need to change your format at all; the compression will adapt to the characters and sequences you use the most to produce a compact file.
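A minimal sketch of streaming the lines straight into a gzip file as they are produced (the filename and the edges iterable are placeholders for your real output path and data source):
import gzip

with gzip.open('edges.txt.gz', 'wt', newline='\n') as out:
    for a, b, w in edges:                  # 'edges' stands in for your real per-edge results
        out.write(f'{a} {b} {w:.3g}\n')    # .3g also applies the precision cut you mentioned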
Avoid the character encodings entirely and save your data in a binary format. See Python's struct. ASCII-encoded, a value like 4 billion takes 10 bytes, but it fits in a 4-byte integer. There are a lot of downsides to a custom binary format (it's hard to debug manually, or to inspect with other tools, etc.)
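For illustration, a fixed 12-byte record per edge (the field layout is just an assumption: two unsigned 32-bit ints plus one 32-bit float):
import struct

record = struct.pack('<IIf', 62233, 242344, 0.424)
print(len(record))   # 12 bytes, versus about 24 for the ASCII line in the question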
I have done some study on this. Clever encoding barely matters once you apply compression: even if you use some binary encoding, the data seems to contain the same entropy and ends up at a similar size after compression.
The Power of Gzip
Yes, there are Python libraries that allow you to stream output and compress it automatically.
Lossy encoding does save space. Cutting down the precision helps.
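For the precision cut, something as small as this already shrinks the third column (the value is just the one from your example line):
x = 0.42442423
print(f'{x:.3g}')   # '0.424' -> 5 characters instead of 10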
I don't know the capabilities of data input in Stata, and a quick search reveals that said capabilities are described in the User's Guide, which seems to be available only on dead-tree copies. So I don't know if my suggestion is feasible.
An instant saving of half the size would come from using 4 bits per character. You have an alphabet of 0 to 9, period, (possibly) minus sign, space and newline, which is 14 characters, fitting comfortably into 2**4 == 16 slots.
If this can be used in Stata, I can help more with suggestions for quick conversions.
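A rough sketch of that packing (the symbol table is made up here, and Stata would need a matching decoder):
ALPHABET = '0123456789.- \n'                    # 14 symbols -> 4 bits each
ENC = {c: i for i, c in enumerate(ALPHABET)}

def pack(text):
    codes = [ENC[c] for c in text]
    if len(codes) % 2:                          # pad to an even number of nibbles
        codes.append(ENC[' '])
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))

print(len(pack('62233 242344 0.424\n')))        # 10 bytes instead of 19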
From measurements I get text files that basically contain a table of float numbers, with the dimensions 1000x1000. Those take up about 15 MB of space which, considering that I get about 1000 result files in a series, is unacceptable to save. So I am trying to compress those as much as possible without loss of data. My idea is to group the numbers into ~1000 steps over the range I expect and save those. That would provide sufficient resolution. However, I still have 1,000,000 points to consider, and thus my resulting file is still about 4 MB. I probably won't be able to compress that any further?
The bigger problem is the calculation time this takes. Right now I'd guesstimate 10-12 secs per file, so about 3 hrs for the 1000 files. WAAAAY too much. This is the algorithm I thought up; do you have any suggestions? There are probably far more efficient algorithms to do that, but I am not much of a programmer...
import numpy

data = numpy.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)
out = numpy.empty((1000, 1000), numpy.int16)
i = 0
min = -0.5
max = 0.5
step = (max - min) / 1000
while i <= 999:
    j = 0
    while j <= 999:
        k = (data[i, j] // step)
        out[i, j] = k
        if data[i, j] > max:
            out[i, j] = 500
        if data[i, j] < min:
            out[i, j] = -500
        j = j + 1
    i = i + 1
numpy.savetxt('converted.txt', out, fmt="%i")
Thanks in advance for any hints you can provide!
Jakob
I see you store the numpy arrays as text files. There is a faster and more space-efficient way: just dump it.
If your floats can be stored as 32-bit floats, then use this:
data = numpy.genfromtxt('sample.txt',autostrip=True, case_sensitive=True)
data.astype(numpy.float32).dump(open('converted.numpy', 'wb'))
then you can read it with
data = numpy.load(open('converted.numpy', 'rb'), allow_pickle=True)  # dump() stores a pickle, so allow_pickle is required on current numpy
The files will be 1000x1000x4 Bytes, about 4MB.
Recent versions of numpy support 16-bit floats. Maybe your floats will fit in their limited range.
numpy.savez_compressed will let you save lots of arrays into a single compressed, binary file.
However, you aren't going to be able to compress it that much -- if you have 15GB of data, you're not magically going to fit it in 200MB by compression algorithms. You have to throw out some of your data, and only you can decide how much you need to keep.
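A minimal sketch of the savez_compressed route mentioned above (filenames are illustrative):
import numpy

data = numpy.genfromtxt('sample.txt', autostrip=True, case_sensitive=True)
numpy.savez_compressed('converted.npz', data=data.astype(numpy.float32))

restored = numpy.load('converted.npz')['data']   # read it back later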
Use the zipfile, bz2 or gzip module to save to a zip, bz2 or gz file from python. Any compression scheme you write yourself in a reasonable amount of time will almost certainly be slower and have worse compression ratio than these generic but optimized and compiled solutions. Also consider taking eumiro's advice.
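As a quick concrete option along those lines: numpy.savetxt compresses transparently whenever the target filename ends in .gz, so the last line of the code in the question could simply become
numpy.savetxt('converted.txt.gz', out, fmt="%i")   # written through gzip automatically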
Which compression method in Python has the best compression ratio?
Is the commonly used zlib.compress() the best or are there some better options? I need to get the best compression ratio possible.
I am compressing strings and sending them over UDP. A typical string I compress is about 1,700,000 bytes.
I'm sure there might be some more obscure formats with better compression, but lzma is the best of those that are well supported. There are Python bindings available, and since Python 3.3 the lzma module is part of the standard library.
EDIT
Don't pick a format without testing; some algorithms do better depending on the data set.
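A quick way to test the candidates on your own data (the file name is a placeholder for one of the strings you actually send):
import bz2, lzma, zlib

payload = open('typical_message.bin', 'rb').read()   # stand-in for one of your ~1.7 MB strings

for name, compress in (('zlib', zlib.compress), ('bz2', bz2.compress), ('lzma', lzma.compress)):
    print(name, len(compress(payload)))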
If you are willing to trade performance for better compression, then the bz2 library usually gives better results than the gz (zlib) library.
There are other compression formats like xz (LZMA2) that might give even better results; on current Python the lzma module ships in the standard library.
Python Doc for BZ2 class
EDIT: Depending on the type of image, you might not get much additional compression. Many image formats are already compressed, unless the source is raw, BMP, or uncompressed TIFF. Testing between various compression types would be highly recommended.
EDIT2: If you do decide to do image compression, ImageMagick supports Python bindings and many image conversion types.
Image Magick
Image Formats Supported
The best compression algorithm definitely depends on the kind of data you are dealing with. Unless you are working with a list of random numbers stored as a string (in which case no compression algorithm will work), knowing the kind of data usually lets you apply much better algorithms than general-purpose ones (see other answers for good, ready-to-use general compression algorithms).
If you are dealing with images, you should definitely choose a lossy, pixel-aware compression format over any lossless one. That will give you much better results; recompressing already-lossy data with a lossless format is a waste of time.
I would search through PIL to see what I can use. Something like converting the image to JPEG, with a quality setting matched to the quality you have found acceptable, before sending should be very efficient.
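A possible sketch with Pillow (quality value and filename are assumptions to tune):
from io import BytesIO
from PIL import Image

img = Image.open('frame.png').convert('RGB')   # JPEG cannot store an alpha channel
buf = BytesIO()
img.save(buf, format='JPEG', quality=60)       # lower quality -> smaller payload
payload = buf.getvalue()                       # bytes ready to send over UDP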
You should also be very cautious when using UDP: it can lose packets, and most compression formats are very sensitive to missing parts of a file. OK, that can be managed at the application level.