Compressing a text string with an existing compression header - Python

I wish to compress a given string with a pre-existing header retrieved from an already compressed file in an archive (a local file header).
I have looked at zlib, and while its compression and decompression work nicely, I cannot find an option to set the compression header.
I want to avoid decompressing a file, adding a string to it, and then re-compressing it. Instead I simply want to "append" a given string to a given compressed file.
I have made attempts using Python's existing zipfile module, trying to modify it to deal with a pre-set header; from this I can only conclude that the zipfile module relies too heavily on the zlib library for this to be possible.
While my attempts have been in Python I am happy using any programming language.

What you want to do is more complicated than you think. However, the code has already been written. Look at gzlog.h and gzlog.c in the examples directory of the zlib distribution.
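If the target is a gzip file rather than a ZIP member and crash-safety is not a concern, one simpler trick is to append a whole new gzip member, since concatenated members decompress as a single stream. A minimal sketch ('log.gz' is a placeholder path):

import gzip

# Appending in 'ab' mode writes a new gzip member after the existing one;
# zcat and Python's gzip module read concatenated members as one stream.
with gzip.open('log.gz', 'ab') as f:
    f.write(b'another line of text\n')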

Related

Reading a single-entry ZIP file incrementally from an unseekable stream in Python

We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS. Generally, the ZIP file format needs to be downloaded in full to be able to see the "central directory" data that identifies file entries; however, in our case, we can assume there's exactly one large text file that was zipped, and we could begin extracting and parsing data immediately without needing to wait for the whole ZIP file to buffer.
If we were using C#, we could use https://github.com/icsharpcode/SharpZipLib/wiki/Unpack-a-zip-using-ZipInputStream (implementation here) which handles this pattern elegantly.
However, it seems that the Python standard library's zipfile module doesn't support this type of streaming; it assumes that the input file-like object is seekable, and all tutorials point to iterating first over namelist() which seeks to the central directory data, then open(name) which seeks back to the file entry.
Many other examples on StackOverflow recommend using BytesIO(response.content) which might appear to pipe the content in a streaming way; however, .content in the Requests library consumes the entire stream and buffers the entire thing to memory.
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Yes: https://github.com/uktrade/stream-unzip can do it [full disclosure: essentially written by me].
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS.
The example from the README shows how to do this, using stream-unzip and httpx:
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
If you just want the first file, you can break out of the outer loop after it:
for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
    break
Also
Generally, the ZIP file format needs to be downloaded in full to be able to see the "central directory" data that identifies file entries
This isn't completely true.
Each file has a "local" header that contains its name, and it can be worked out where the compressed data for any member file ends (via information in the local header if it's there, or from the compressed data itself). While there is more information in the central directory at the end, if you just need the names and bytes of the files, then it is possible to start unzipping a ZIP file that contains multiple files as it's downloading.
I can't claim it's possible in absolutely all cases: technically ZIP allows many different compression algorithms and I haven't investigated them all. However, for DEFLATE, which is the most commonly used, it is possible.
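To make the "local header" point concrete, here is a minimal sketch of parsing one with the standard struct module (the field layout follows the ZIP application note; the function name is illustrative):

import struct

LOCAL_HEADER_SIGNATURE = b'PK\x03\x04'

def parse_local_header(data):
    # First 30 bytes: the signature plus ten fixed-size fields.
    assert data[:4] == LOCAL_HEADER_SIGNATURE
    (version, flags, method, mod_time, mod_date,
     crc32, compressed_size, uncompressed_size,
     name_len, extra_len) = struct.unpack('<HHHHHIIIHH', data[4:30])
    name = data[30:30 + name_len].decode('cp437')
    # If bit 3 of flags is set, the sizes here are zero and the real values
    # follow the compressed data in a data descriptor.
    return name, method, compressed_size, flags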
It's even possible to download one specific file from a .zip without downloading the whole file. All you need is a server that allows reading bytes in ranges: fetch the end-of-central-directory record (to know the size of the central directory), fetch the central directory (to know where the file starts and ends), and then fetch the proper bytes and handle them.
Using Onlinezip you can handle the file like a local file. Even the API is identical to ZipFile in Python.
[full disclosure: I'm the author of the library]
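A hedged sketch of the range-request idea using requests (the URL is a placeholder; the archive is assumed to have no trailing ZIP comment, so the end-of-central-directory record is exactly the last 22 bytes, and the server must support Range requests):

import requests

url = 'https://www.example.com/my.zip'
size = int(requests.head(url).headers['Content-Length'])

# The end-of-central-directory record holds the size and offset of the
# central directory, which in turn locates each member file.
tail = requests.get(url, headers={'Range': f'bytes={size - 22}-{size - 1}'}).content
assert tail[:4] == b'PK\x05\x06'
cd_size = int.from_bytes(tail[12:16], 'little')
cd_offset = int.from_bytes(tail[16:20], 'little')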

Writing data to a zip archive in Python

I've been told in the past that there is simply no easy way to write a string to a zip file. It's okay to READ from a zip archive, but if you want to write to one, the best option is to extract it, make the changes, and then zip it back up again. However, the library I am using (openpyxl) accomplishes the feat of writing to a zip file without any extraction. This package uses the writestr() function in the Python zipfile library to make changes. Can someone explain to me how exactly this is possible? I know it has something to do with writing bytes, but I can't find a good explanation.
I'm aware of the vagueness of this question, but that's a circumstance of my lack of knowledge on the topic.
openpyxl does not modify the files in place because you can't do this with zipfiles. You must extract, modify and archive. We just hide this process in the library.
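For context on how writestr can add data without extracting anything: opened in append mode, ZipFile.writestr writes a new member after the existing data and a fresh central directory at the end of the file; existing members are never rewritten in place. A minimal sketch ('example.zip' and the member name are placeholders):

import zipfile

# Append mode: existing members are untouched; the new member and an
# updated central directory are written at the end of the archive.
with zipfile.ZipFile('example.zip', 'a') as zf:
    zf.writestr('notes/new_file.txt', 'added without extracting the archive')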

how to check if a file is compressed with gzexe in Python?

I'm working on a simple virus scanner in Python. The scanner needs to check whether a file contains a virus signature (a particular string). If a file is compressed, the scanner needs to decompress the file first and then check for the signature. Files that are compressed with gzip have a magic number at the very beginning of the file, so that's easy to check, and the gzip library can be used to decompress them.
But how do I check if a file is compressed with gzexe? I looked here, but gzexe-compressed files are not listed. I checked the content of a file compressed with gzexe and found that it starts with "#!/bin/sh". I think I can check for this, but is there a better way? Also, is there any library that can decompress a gzexe-compressed file?
EDIT
The previous problem I had with zlib was because I didn't realize that you have to pass a second parameter to zlib.decompress, or it will give you an error. The Python zlib documentation didn't point that out very clearly. In Python it seems you need to pass 15 + 32 to the decompress method:
zlib.decompress(data, 15 + 32)
Also, gzexe files can be decompressed by zlib. As Mark said, you just need to find where the magic number starts and decompress the file from there.
Just search the file for a gzip signature. It is in there after the shell script.
You can use the zlib library to decompress it.
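A minimal sketch combining both points, assuming the whole file fits in memory (the path is a placeholder): find the gzip magic bytes after the shell-script stub, then decompress from there with wbits = 15 + 32 so zlib accepts the gzip header.

import zlib

with open('compressed_program', 'rb') as f:   # placeholder path
    data = f.read()

start = data.find(b'\x1f\x8b')                # gzip magic number
if start != -1:
    original = zlib.decompress(data[start:], 15 + 32)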

Open an lzo file in python, without decompressing the file

I'm currently working on a 3rd year project involving data from Twitter. The department has provided me with .lzo files of a month's worth of Twitter data. The smallest is 4.9 GB, and it is 29 GB when decompressed, so I'm trying to open the file and read it as I go. Is this possible, or do I need to decompress it first and work with the data that way?
EDIT: I have attempted to read it line by line and decompress each line as it is read.
UPDATE: Found a solution - reading the STDOUT of lzop -dc works like a charm
How about starting the lzop binary in a subprocess with the -c switch and then reading its STDOUT line by line?
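A hedged sketch of that approach, assuming lzop is on PATH (the filename and the per-line handler are placeholders):

import subprocess

# Stream lzop's decompressed output without writing it to disk.
proc = subprocess.Popen(['lzop', '-dc', 'tweets.lzo'], stdout=subprocess.PIPE)
for line in proc.stdout:
    handle_line(line)     # hypothetical per-line handler
proc.stdout.close()
proc.wait()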
I know of only one library for LZO with Python, https://github.com/jd-boyd/python-lzo, and it requires full decompression (moreover, it decompresses the contents in memory).
So I think you'll need to decompress the files before working with them.
I know this is a very old question and the answer is really good. I encountered a similar problem, and Google brought me here.
I'll just write down my experience with LZO compression and the lzop program, in the hope that it helps someone who runs into the same question. I also wrote a simple Python module to deal with lzo files; you can find it at https://github.com/ir193/python-lzo/
Regarding the question, reading an lzo-compressed file in place (without decompressing the whole file) can be done by reading one block at a time. The lzo file is divided into several blocks, each with a maximum size of a few MB. With my module, you can just use read(4096) or so.
Actually, *.lzo files are created by lzop and have little to do with the python-lzo module suggested in another answer (https://github.com/jd-boyd/python-lzo). That module compresses and decompresses strings; it does not handle the lzop file header and checksums. Don't use it if you want to open an existing lzo file.

uncompressing tar.Z file with python?

I need to write a Python script that retrieves tar.Z files from an FTP server and uncompresses them on a Windows machine. tar.Z, if I understood correctly, is the result of the compress command on Unix.
Python doesn't seem to know how to handle these; they're not gz, bz2 or zip. Does anyone know a library that would handle them?
Thanks in advance
If gzip (the application) can handle it, you have two choices:
Try the Python gzip library. It may work.
Use subprocess Popen to run gzip for you.
It may be an InstallShield .Z file. You may want to use InstallShield to unpack it and extract the .TAR file. Again, you may be able to use subprocess Popen to process the file.
It may also be an LZW-compressed file. Look at this library; it may help.
http://www.chilkatsoft.com/compression-python.asp
Since you target a specific platform (Windows), the simplest solution may be to run gzip in a system call: http://www.gzip.org/#exe
Are there other requirements in your project that the decompression needs to be done in Python?
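A hedged sketch of that system-call approach (the gzip tool can decompress .Z files; the executable is assumed to be on PATH, and the paths are placeholders):

import io
import subprocess
import tarfile

# Decompress the .Z stream with the gzip tool, then read the tar from memory.
raw = subprocess.run(['gzip', '-dc', 'archive.tar.Z'],
                     check=True, stdout=subprocess.PIPE).stdout
with tarfile.open(fileobj=io.BytesIO(raw)) as tar:
    tar.extractall('output_dir')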
A plain Python module that uncompresses .Z files doesn't exist, AFAIK, but it's feasible to build one, given some knowledge of:
the .Z format header specification
the .Z compression format
Almost all the necessary information can be found in The Unarchiver's CompressAlgorithm page. Additional info is in the Wikipedia article on adaptive LZW, and perhaps the compress man page.
Basically, you read the first three bytes (the first two are magic bytes) to configure your algorithm, and then start reading and decompressing.
There's a lot of bit fiddling (.Z files begin with 9-bit codes, growing up to 16-bit ones and then resetting the symbol table to the initial 256+2 values), which you'll probably handle with binary operations (&, <<=, etc.).
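A minimal sketch of reading the header just described (the path is a placeholder; variable names are illustrative):

with open('archive.tar.Z', 'rb') as f:
    assert f.read(2) == b'\x1f\x9d'   # compress(1) magic bytes
    flags = f.read(1)[0]
    max_bits = flags & 0x1f           # maximum code width, typically 16
    block_mode = bool(flags & 0x80)   # a clear code may reset the symbol table
    # ...the LZW code stream follows, starting with 9-bit codes.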
