Repeatably gzip files in Python

I'm writing a script in Python for deploying static sites to AWS (S3, CloudFront, Route 53). Because I don't want to upload every file on every deploy, I check which files were modified by comparing their MD5 hash with their ETag (which S3 sets to the object's MD5 hash). This works well for all files except those that my build script gzips before uploading. Looking inside the files, it seems gzip isn't really a pure function: there are very slight differences in the output file every time gzip runs, even if the source file hasn't changed.
My question is this: is there any way to get gzip to reliably and repeatably output the exact same file given the exact same input? Or am I better off checking whether the file is gzipped, unzipping it and computing the MD5 hash / manually setting the ETag value for it instead?

The compressed data is the same each time. The only thing that differs is likely the modification time in the header. The fifth argument of GzipFile (if that's what you're using) allows you to specify the modification time in the header. The first argument is the file name, which also goes in the header, so you want to keep that the same. If you provide a fourth argument for the source data, then the first argument is used only to populate the file name portion of the header.
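For illustration, a minimal sketch of that advice: write through gzip.GzipFile with a fixed mtime. The helper name deterministic_gzip and the sample inputs are my own, not from the answer.

import gzip
import hashlib
import io

def deterministic_gzip(data, name):
    buf = io.BytesIO()
    # mtime=0 pins the 4-byte timestamp in the gzip header, so the output
    # no longer changes from run to run. Because fileobj is given, `name`
    # only fills the file-name field of the header.
    with gzip.GzipFile(filename=name, mode='wb', fileobj=buf, mtime=0) as f:
        f.write(data)
    return buf.getvalue()

a = deterministic_gzip(b'hello world', 'index.html')
b = deterministic_gzip(b'hello world', 'index.html')
assert hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest()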

As you correctly figured out, gzip is not stable:
[root@dev1 ~]# touch a b
[root@dev1 ~]# gzip a
[root@dev1 ~]# gzip b
[root@dev1 ~]# md5sum a.gz b.gz
8674e28eab49306b519ec7cd30128a5c  a.gz
4974585cf2e85113f1464dc9ea45c793  b.gz

Related

Reading a single-entry ZIP file incrementally from an unseekable stream in Python

We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS. Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries; however, in our case, we can assume there's exactly one large text file that was zipped, and we could begin extracting and parsing data immediately without needing to wait for the ZIP file to buffer.
If we were using C#, we could use SharpZipLib's ZipInputStream (https://github.com/icsharpcode/SharpZipLib/wiki/Unpack-a-zip-using-ZipInputStream), which handles this pattern elegantly.
However, it seems that the Python standard library's zipfile module doesn't support this type of streaming; it assumes that the input file-like object is seekable, and all tutorials point to iterating first over namelist() which seeks to the central directory data, then open(name) which seeks back to the file entry.
Many other examples on StackOverflow recommend using BytesIO(response.content) which might appear to pipe the content in a streaming way; however, .content in the Requests library consumes the entire stream and buffers the entire thing to memory.
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Yes: https://github.com/uktrade/stream-unzip can do it [full disclosure: essentially written by me].
The example from the README shows how to do this, using stream-unzip and httpx:
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
If you just want the first file, you can break out of the loop after it:

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
    break
Also
Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries
This isn't completely true.
Each file has a "local" header that contains its name, and the point where the compressed data for any member file ends can be worked out (from information in the local header if it's there, or from the compressed data itself). While there is more information in the central directory at the end, if you just need the names and bytes of the files, it is possible to start unzipping a ZIP file that contains multiple files while it's downloading.
I can't claim it's absolutely possible in all cases: technically ZIP allows for many different compression algorithms and I haven't investigated them all. However, for DEFLATE, which is the most commonly used, it is possible.
It's even possible to download one specific file from a .zip without downloading the whole archive. All you need is a server that allows reading byte ranges: fetch the end-of-central-directory record (to learn the size of the central directory), fetch the central directory (to learn where the file starts and ends), and then fetch just those bytes and handle them. A sketch of the first step is below.
Using Onlinezip you can handle a remote file like a local one. Even the API is identical to Python's ZipFile.
[full disclosure: I'm the author of the library]
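For illustration, a minimal sketch of the first step of that range-request approach, assuming the server honours Range headers and the archive has no trailing comment (the URL is made up):

import struct
import httpx

url = 'https://www.example.com/my.zip'  # illustrative URL

# The end-of-central-directory (EOCD) record is the last 22 bytes of the
# archive when there is no trailing comment; fetch just that suffix.
tail = httpx.get(url, headers={'Range': 'bytes=-22'}).content
signature, = struct.unpack('<I', tail[:4])
assert signature == 0x06054b50  # EOCD magic number
cd_size, cd_offset = struct.unpack('<II', tail[12:20])
# A second Range request for bytes cd_offset .. cd_offset + cd_size - 1
# would return the central directory, which locates every member file.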

how to check if a file is compressed with gzexe in Python?

I'm working on a simple virus scanner in Python, and the scanner needs to check whether a file contains a virus signature (a particular string). If a file is compressed, the scanner needs to decompress it first and then check for the signature. Files compressed with gzip have a magic number at the very beginning of the file, so that's easy to check, and the gzip library can decompress them.
But how do I check whether a file is compressed with gzexe? I looked here, but gzexe-compressed files are not listed. I checked the content of a file compressed with gzexe and found that it starts with "#!/bin/sh". I think I can check for this, but is there a better way? Also, is there any library that can decompress gzexe-compressed files?
EDIT
The previous problem I had with zlib was because I didn't realize that you have to pass a second parameter to zlib.decompress or it will give you an error. The Python zlib documentation doesn't point that out very clearly: you need to pass 15 + 32 as the wbits argument, which tells zlib to auto-detect the gzip/zlib header.
zlib.decompress(data, 15 + 32)
Also, gzexe output can be decompressed by zlib; as Mark said, you just need to find where the magic number starts and decompress the file from there.
Just search the file for a gzip signature. It is in there after the shell script.
You can use the zlib library to decompress it.
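A minimal sketch combining both answers; the helper name and error handling are mine, not from the answers:

import zlib

def decompress_gzexe(path):
    with open(path, 'rb') as f:
        data = f.read()
    # The gzip stream starts after the stub shell script; find its magic number.
    start = data.find(b'\x1f\x8b')
    if start == -1:
        raise ValueError('no gzip stream found; not a gzexe file?')
    # wbits = 15 + 32 makes zlib auto-detect the gzip header.
    return zlib.decompress(data[start:], 15 + 32)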

Comparing local file with remote file

I have the following problem: I have a local .zip file and a .zip file located on a server. I need to check whether the .zip file on the server is different from the local one; if it is, I need to pull the new one from the server. My question is: how do I compare them without downloading the file from the server and comparing locally?
I could create an MD5 hash for the zip file on the server when creating the .zip file and then compare it with the MD5 of my local .zip file, but is there a simpler way?
Short answer: You can't.
Long answer: To compare with the zip file on the server, someone has to read that file. Either you can do that locally, which would involve pulling it, or you can ask the server to do it for you. Can you run code on the server?
Edit
If you can run Python on the server, why not hash the file and compare hashes?
import hashlib

with open( <path-to-file>, "rb" ) as theFile:
    m = hashlib.md5()
    for line in theFile:
        m.update(line)

with open( <path-to-hashfile>, "wb" ) as theFile:
    theFile.write(m.digest())
and then compare the contents of hashfile with a locally-generated hash?
Another edit
You asked for a simpler way. Think about this in an abstract way for a moment:
- You don't want to download the entire zip file.
- Hence, you can't process the entire file locally (because that would involve reading all of it from the server, which is equivalent to downloading it!).
- Hence, you need to do some processing on the server. Specifically, you want to come up with some small amount of data that "encodes" the file, so that you can fetch this small amount of data without fetching the whole file.
- But this is a hash!
Therefore, you need to do some sort of hashing. Given that, I think the above is pretty simple.
I would like to know how you intended to compare them locally, had that been possible; you can apply the same logic to compare them remotely.
You can log in using ssh and compute an md5 hash of the remote file and an md5 hash of the current local file. If the md5s match, the files are identical; otherwise, they are different. A sketch of this approach follows.
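For illustration, a hedged sketch of that ssh approach, assuming key-based login and that md5sum is available on the server (the host and paths are made up):

import hashlib
import subprocess

def remote_md5(host, remote_path):
    # Ask the server to hash the file; only the 32-character digest
    # travels over the wire, not the file itself.
    out = subprocess.check_output(['ssh', host, 'md5sum', remote_path])
    return out.split()[0].decode()

def local_md5(path):
    m = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            m.update(chunk)
    return m.hexdigest()

if remote_md5('user@example.com', '/srv/files/site.zip') != local_md5('site.zip'):
    print('Files differ; pull the new one from the server.')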

compressed archive with quick access to individual file

I need to come up with a file format for new application I am writing.
This file will need to hold a bunch other text files which are mostly text but can be other formats as well.
Naturally, a compressed tar file seems to fit the bill.
The problem is that I want to be able to retrieve some data from the file very quickly, and getting just a particular file out of a tar.gz seems to take longer than it should. I am assuming that this is because it has to decompress the entire archive even though I just want one file. When I have a regular uncompressed tar file I can get that data really quickly.
Let's say the file I need quickly is called data.dat.
For example, the command...
tar -zxf myfile.tar.gz data.dat
... is what takes a lot longer than I'd like.
MP3 files have ID3 data and JPEG files have EXIF data that can be read in quickly without opening the entire file.
I would like my data.dat file to be available in a similar way.
I was thinking that I could leave it uncompressed and separate from the rest of the files in myfile.tar.gz.
I could then create a tar file containing data.dat and myfile.tar.gz, and hopefully that data could be retrieved faster because it is at the head of the outer tar file and is uncompressed.
Does this sound right... putting a compressed tar inside of a tar file?
Basically, my need is to have an archive type of file with quick access to one particular file.
Tar does this just fine, but I'd also like to have that data compressed and as soon as I do that, I no longer have quick access.
Are there other archive formats that will give me that quick access I need?
As a side note, this application will be written in Python. If the solution calls for reinventing the wheel with my own binary format, I am familiar with C and would have no problem writing the Python module in C. Ideally I'd just use tar, dd, cat, gzip, etc., though.
Thanks,
~Eric
ZIP seems appropriate for your situation. Files are compressed individually, which means you can access any one of them without decompressing everything stored before it.
In Python, you can use the zipfile module.
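A minimal sketch of that suggestion; the archive name is hypothetical, while data.dat reuses the question's example:

import zipfile

# Each member of a ZIP archive is compressed on its own, so reading one
# file does not require decompressing the others.
with zipfile.ZipFile('myfile.zip') as archive:
    data = archive.read('data.dat')  # jumps straight to this member
print(len(data))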

uncompressing tar.Z file with python?

I need to write a Python script that retrieves .tar.Z files from an FTP server and uncompresses them on a Windows machine. A .tar.Z file, if I understood correctly, is the result of the Unix compress command.
Python doesn't seem to know how to handle these; it's not gz, nor bz2, nor zip. Does anyone know a library that would handle these?
Thanks in advance
If gzip -- the application -- can handle it, you have two choices:
- Try the Python gzip library. It may work.
- Use subprocess.Popen to run gzip for you (a sketch follows below).
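For the second option, a minimal sketch, assuming a gzip executable is on the PATH; the file names are illustrative:

import subprocess
import tarfile

# gzip can decompress .Z (compress) files; this leaves archive.tar behind.
proc = subprocess.Popen(['gzip', '-d', 'archive.tar.Z'])
if proc.wait() != 0:
    raise RuntimeError('gzip failed')

with tarfile.open('archive.tar') as tar:
    tar.extractall('output_dir')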
It may be an InstallShield .Z file. You may want to use InstallShield to unpack it and extract the .TAR file. Again, you may be able to use subprocess.Popen to process the file.
It may also be an "LZW-compressed file". Look at this library; it may help.
http://www.chilkatsoft.com/compression-python.asp
Since you target a specific platform (Windows), the simplest solution may be to run gzip in a system call: http://www.gzip.org/#exe
Are there other requirements in your project that mean the decompression needs to be done in Python?
A plain Python module that uncompresses .Z files doesn't exist, AFAIK, but it's feasible to build one, given some knowledge of:
- the .Z format header specification
- the .Z compression format
Almost all the necessary information can be found in The Unarchiver's CompressAlgorithm page. Additional info is in the Wikipedia article on adaptive LZW, and perhaps the compress man page.
Basically, you read the first three bytes (the first two are magic bytes) to configure your algorithm, and then start reading and decompressing; a sketch of reading the header is below.
There's a lot of bit fiddling (.Z files begin with 9-bit codes, growing up to 16-bit ones, and then reset the symbol table to the initial 256 + 2 values), which you'll probably handle with binary operations (&, <<=, etc.).
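A small sketch of reading that three-byte header, under the assumptions above (the file name is illustrative; decompression itself would follow from here):

# The .Z header: two magic bytes (0x1f 0x9d) followed by a flags byte
# holding the maximum code width and the block-mode bit.
with open('archive.tar.Z', 'rb') as f:
    header = f.read(3)

assert header[:2] == b'\x1f\x9d', 'not a compress(1) .Z file'
max_bits = header[2] & 0x1f          # codes grow from 9 bits up to this many
block_mode = bool(header[2] & 0x80)  # whether code 256 resets the symbol table
print(max_bits, block_mode)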
