Open an lzo file in python, without decompressing the file - python

I'm currently working on a 3rd year project involving data from Twitter. The department has provided me with .lzo files of a month's worth of Twitter. The smallest is 4.9 GB, and decompressed it is 29 GB, so I'm trying to open the file and read it as I go. Is this possible, or do I need to decompress the file and work with the data that way?
EDIT: I have attempted to read the file line by line and decompress each line as it is read.
UPDATE: Found a solution: reading the STDOUT of lzop -dc works like a charm.

How about starting the lzop binary in a subprocess with the -dc switch and then reading its STDOUT line by line?
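A minimal sketch of that idea (the helper name is mine, and it assumes the lzop binary is on PATH; any command that decompresses to stdout works the same way):

```python
import subprocess

def stream_decompressed_lines(path, cmd=("lzop", "-dc")):
    """Yield decoded lines from a compressed file without ever
    writing the decompressed data to disk."""
    proc = subprocess.Popen(list(cmd) + [path], stdout=subprocess.PIPE)
    try:
        # Iterating over the pipe reads one line at a time, so only a
        # small buffer of decompressed data is in memory at once.
        for line in proc.stdout:
            yield line.decode("utf-8", errors="replace")
    finally:
        proc.stdout.close()
        proc.wait()
```

Because the 29 GB of decompressed text is consumed lazily, memory use stays flat regardless of file size.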

I know of only one LZO library for Python, https://github.com/jd-boyd/python-lzo, and it requires full decompression (moreover, it decompresses the contents in memory).
So I think you'll need to decompress the files before working with them.

I know this is a very old question and the answer is really good. I encountered a similar problem, and Google brought me here.
I'm just writing down my experience with LZO compression and the lzop program, in the hope that it helps someone who runs into the same question. I also wrote a simple Python module to deal with lzo files; you can find it at https://github.com/ir193/python-lzo/
Regarding the question: reading an lzo compressed file in place (without decompressing the whole file) can be done by reading one block at a time. The lzo file is divided into several blocks, and there is a maximum block size of a few MB. In my module, you can just use read(4096) or so.
Actually, *.lzo files are created by lzop and have little to do with the python-lzo module suggested in another answer (https://github.com/jd-boyd/python-lzo). That module compresses and decompresses strings; it does not handle the lzop file header and checksum. Don't use it if you want to open an existing lzo file.

Related

Reading a single-entry ZIP file incrementally from an unseekable stream in Python

We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS. Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries; however, in our case, we can assume there's exactly one large text file that was zipped, and we could begin extracting and parsing data immediately without needing to wait for the ZIP file to buffer.
If we were using C#, we could use https://github.com/icsharpcode/SharpZipLib/wiki/Unpack-a-zip-using-ZipInputStream (implementation here) which handles this pattern elegantly.
However, it seems that the Python standard library's zipfile module doesn't support this type of streaming; it assumes that the input file-like object is seekable, and all tutorials point to iterating first over namelist() which seeks to the central directory data, then open(name) which seeks back to the file entry.
Many other examples on StackOverflow recommend using BytesIO(response.content) which might appear to pipe the content in a streaming way; however, .content in the Requests library consumes the entire stream and buffers the entire thing to memory.
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Yes: https://github.com/uktrade/stream-unzip can do it [full disclosure: essentially written by me].
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS.
The example from the README shows how to do this, using stream-unzip and httpx:
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
If you just want the first file, you can break out of the loop after it:

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
    break
Also
Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries
This isn't completely true.
Each member file has a "local" header that contains its name, and the point where its compressed data ends can be worked out (from information in the local header if it's there, or from the compressed data itself). While there is more information in the central directory at the end, if you just need the names and bytes of the files, then it is possible to start unzipping a ZIP file that contains multiple files as it's downloading.
I can't claim it's possible in absolutely all cases: technically, ZIP allows for many different compression algorithms and I haven't investigated them all. However, for DEFLATE, which is the one most commonly used, it is possible.
It's even possible to download one specific file from a .zip without downloading the whole archive. All you need is a server that allows reading bytes in ranges: fetch the end of central directory record (to learn the size of the central directory), fetch the central directory (to learn where the file starts and ends), and then fetch the proper bytes and handle them.
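The first of those fetches can be sketched like this (the function name is mine; the offsets follow the ZIP end-of-central-directory record layout; a real client would also handle the ZIP64 variant and archive comments longer than the tail it fetched):

```python
import struct

EOCD_SIG = b"PK\x05\x06"

def parse_eocd(tail):
    """Given the last bytes of a ZIP file (e.g. fetched with an HTTP
    Range request), return (offset, size) of the central directory."""
    pos = tail.rfind(EOCD_SIG)
    if pos == -1:
        raise ValueError("end of central directory record not found")
    # EOCD layout: sig(4) disk(2) cd_disk(2) n_disk(2) n_total(2)
    #              cd_size(4) cd_offset(4) comment_len(2)
    cd_size, cd_offset = struct.unpack_from("<II", tail, pos + 12)
    return cd_offset, cd_size
```

With the offset and size in hand, the second Range request fetches exactly the central directory bytes.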
Using Onlinezip you can handle the file like a local file. Even the API is identical to ZipFile in Python.
[full disclosure: I'm author of library]

Compressing text string with existing compression header

I wish to compress a given string with a pre-existing header retrieved from an already compressed file in an archive (a local file header).
I have attempted to look at zlib, and while its compression/decompression works nicely, I cannot find an option to set the compression header.
I want to avoid decompressing a file, adding a string to it, and then re-compressing it. Instead I simply want to "append" a given string to a given compressed file.
I have made attempts using the existing zipfile module in Python, trying to modify it to deal with a pre-set header; from this I can conclude that the zipfile module relies too heavily on the zlib library for this to be possible.
While my attempts have been in Python I am happy using any programming language.
What you want to do is more complicated than you think. However, the code has already been written: look at gzlog.h and gzlog.c in the examples directory of the zlib distribution.
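gzlog itself is C, but the property it builds on (gzip members can simply be concatenated, and decompressors read them as one stream) is easy to try from Python; this sketch skips the crash-recovery bookkeeping that gzlog adds:

```python
import gzip

def append_member(path, data):
    """Append *data* to *path* as a brand-new gzip member, leaving the
    existing compressed bytes untouched. Readers see one stream."""
    with gzip.open(path, "ab") as f:
        f.write(data)
```

The trade-off: each append starts a fresh DEFLATE stream, so the compression ratio suffers for many tiny appends, which is exactly the problem gzlog works harder to solve.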

how to check if a file is compressed with gzexe in Python?

I'm working on a simple virus scanner in Python, and the scanner needs to check whether a file has a virus signature (a particular string) in it. If a file is compressed, the scanner needs to decompress the file first and then check for the signature. Files that are compressed with gzip have a magic number at the very beginning of the file, so that is easy to check, and the gzip library can then decompress them.
But how do I check whether a file is compressed with gzexe? I looked here, but gzexe is not listed. I checked the contents of a file compressed with gzexe and found that it starts with "#!/bin/sh". I think I can check for this, but is there a better way? Also, is there any library that can decompress a gzexe-compressed file?
EDIT
The previous problem I had with zlib was because I didn't realize that you have to pass a second parameter to zlib.decompress or it will give you an error. The Python zlib documentation doesn't point that out very clearly. In Python, it seems you need to pass 15 + 32 to the decompress method:
zlib.decompress(data, 15 + 32)
Also, gzexe files can be decompressed by zlib. As Mark said, you just need to find where the magic number starts and decompress the file from there.
Just search the file for a gzip signature. It is in there after the shell script.
You can use the zlib library to decompress it.
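Putting both points together, a sketch (the helper name is mine):

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"

def decompress_gzexe(data):
    """Skip the shell-script stub at the start of a gzexe file, find
    the embedded gzip stream, and decompress it. 15 + 32 tells zlib to
    use a 32 KB window and auto-detect the gzip/zlib header."""
    pos = data.find(GZIP_MAGIC)
    if pos == -1:
        raise ValueError("no gzip signature found")
    return zlib.decompress(data[pos:], 15 + 32)
```

The same search-for-magic idea doubles as the detection step: a file that starts with a shell-script stub and contains a gzip signature is a candidate gzexe file.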

compressed archive with quick access to individual file

I need to come up with a file format for new application I am writing.
This file will need to hold a bunch other text files which are mostly text but can be other formats as well.
Naturally, a compressed tar file seems to fit the bill.
The problem is that I want to be able to retrieve some data from the file very quickly, and getting just a particular file from a tar.gz file seems to take longer than it should. I assume this is because it has to decompress the entire archive even though I just want one file. With a regular uncompressed tar file, I can get that data very quickly.
Lets say the file I need quickly is called data.dat
For example the command...
tar -xzf myfile.tar.gz data.dat
... is what takes a lot longer than I'd like.
MP3 files have id3 data and jpeg files have exif data that can be read in quickly without opening the entire file.
I would like my data.dat file to be available in a similar way.
I was thinking that I could leave it uncompressed and separate from the rest of the files in myfile.tar.gz.
I could then create a tar file containing data.dat and myfile.tar.gz, and hopefully that data could be retrieved faster because it is at the head of the outer tar file and is uncompressed.
Does this sound right? Putting a compressed tar inside of a tar file?
Basically, my need is to have an archive type of file with quick access to one particular file.
Tar does this just fine, but I'd also like to have that data compressed and as soon as I do that, I no longer have quick access.
Are there other archive formats that will give me that quick access I need?
As a side note, this application will be written in Python. If the solution calls for a re-invention of the wheel with my own binary format I am familiar with C and would have no problem writing the Python module in C. Idealy I'd just use tar, dd, cat, gzip, etc though.
Thanks,
~Eric
ZIP seems to be appropriate for your situation. Files are compressed individually, which means you can access one without decompressing everything stored before it.
In Python, you can use zipfile.
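For example, pulling one member out without decompressing its neighbours (archive built in memory purely for illustration):

```python
import io
import zipfile

# Build a small archive: one file we want, one large file we don't.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("data.dat", "important bytes")
    z.writestr("bulk.bin", "x" * 100_000)

# Opening data.dat seeks straight to its entry via the central
# directory; bulk.bin is never decompressed.
with zipfile.ZipFile(buf) as z:
    with z.open("data.dat") as f:
        print(f.read())  # b'important bytes'
```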

uncompressing tar.Z file with python?

I need to write a python script that retrieves tar.Z files from an FTP server and uncompresses them on a Windows machine. tar.Z, if I understood correctly, is the result of the compress command on Unix.
Python doesn't seem to know how to handle these; it's not gz, nor bz2, nor zip. Does anyone know a library that would handle them?
Thanks in advance
If GZIP -- the application -- can handle it, you have two choices.
Try the Python gzip library. It may work.
Use subprocess Popen to run gzip for you.
It may be an InstallShield .Z file. You may want to use InstallShield to unpack it and extract the .TAR file. Again, you may be able to use subprocess Popen to process the file.
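A sketch of the subprocess route (assumes a gzip executable is on PATH; gzip -dc reads .Z as well as .gz, so this is equivalent to zcat):

```python
import subprocess

def uncompress_z(path):
    """Pipe a .tar.Z (or .gz) file through the gzip binary and return
    the decompressed bytes, ready to hand to the tarfile module."""
    result = subprocess.run(["gzip", "-dc", path],
                            stdout=subprocess.PIPE, check=True)
    return result.stdout
```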
It may also be an LZW-compressed file. Look at this library; it may help:
http://www.chilkatsoft.com/compression-python.asp
Since you target a specific platform (Windows), the simplest solution may be to run gzip in a system call: http://www.gzip.org/#exe
Are there other requirements in your project that the decompression needs to be done in Python?
A plain Python module that uncompresses .Z files is nonexistent, AFAIK, but it's feasible to build one, given some knowledge of:
the .Z format header specification
the .Z compression format
Almost all the necessary information can be found in The Unarchiver's CompressAlgorithm page. Additional info is in the Wikipedia article on adaptive LZW and perhaps the compress man page.
Basically, you read the first three bytes (the first two are magic bytes) to configure your algorithm, and then start reading and decompressing.
There's a lot of bit fiddling (.Z files start with 9-bit tokens, grow up to 16-bit ones, and then reset the symbol table to the initial 256 + 2 values), which you'll probably deal with using binary operations (&, <<=, etc.).
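Stripped of the bit unpacking and code-width changes, the core LZW decoding loop looks roughly like this (a toy sketch over already-unpacked integer codes; a real .Z decoder must also honour the clear code and the 9- to 16-bit widths mentioned above):

```python
def lzw_decode(codes):
    """Decode a list of LZW codes, rebuilding the string table on the
    fly exactly as the encoder built it."""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 258          # leave 256/257 reserved (the 256 + 2 above)
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:  # the only code that can be unknown
            entry = prev + prev[:1]
        else:
            raise ValueError("corrupt LZW stream")
        out.append(entry)
        table[next_code] = prev + entry[:1]  # new table entry
        next_code += 1
        prev = entry
    return b"".join(out)
```

The bit fiddling then reduces to feeding this loop: accumulate input bytes into an integer, mask off the current code width with &, and shift with >>= until fewer bits than the width remain.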
