how to check if a file is compressed with gzexe in Python? - python

I'm working on a simple virus scanner with Python and the scanner needs to check if a file has a virus signature(a particular string) in it. If a file is compressed, the scanner needs to decompress the file first and then check for the signature. Files that are compressed with gzip has a magic number at the very beginning of the file and that's easy to check and use gzip library to decompress.
But how to check if a file is compressed with gzexe? I looked here but gzexe compressed file is not listed. I checked the content of a file that's compressed with gzexe and find that it starts with "#!bin/sh". I think I can check this, but is there a better way to do this? Also, is there any library that can deal decompress gzexe compressed file?
EDIT
The previous problem I had with zlib is because I didn't realize that you have to pass a second parameter to zlib.decompress or it will give you an error. Python zlib documentation didn't point that out very clearly. In python it seems you need to pass 15+32 to this decompress method.
zlib.decompress(data, 15 + 32)
Also gzexe can be decompressed by zlib, As Mark said, you just need to find where the magic number starts and decompress the file from there.

Just search the file for a gzip signature. It is in there after the shell script.
You can use the zlib library to decompress it.

Related

How to read a specific line from a gzip file with python

I have a big gzip file (11GB) and I want to print as fast as possible the line that I want with Python. I have tried to do it with linecache.getline(), but as the own function open the file, you are not able to open it with gzip.
linecache expects to get a textfile. A file that has been compressed using gzip is not a textfile. To do what you want requires two steps. (1) Unzip the file so that you have a textfile. (2) Use linecache on the textfile. You can do both of those things in Python, but only one after the other.
I understand that you want to get at a specific line without having to decompress then entire zipfile. But that is not how zipfile compression works. There is unlikely to be anything in the compressed data that corresponds to the notion of a line of text.

Compressing text string with existing compression header

I wish to compress a given string with a pre-existing header retrieved from an already compressed file in an archive (a local file header).
I have attempted to look at zlib and while their compression/decompressing works nicely I can not find an option to set the compression header.
I want to avoid decompressing a file, add a string to the file, and then re-compress the file. Instead I simply want to "append" a given string to a given compressed file.
I have made attempts using the existing Zipfile module in Python, here I have tried to modify the Zipfile module to deal with a pre-set header, however from this I can conclude that the Zipfile module relies too heavily on the zlib library for this to be possible.
While my attempts have been in Python I am happy using any programming language.
What you want to do is more complicated than you think. However the code has already been written. Look at gzlog.h and gzlog.c in the examples directory of the zlib distribution.

Open an lzo file in python, without decompressing the file

I'm currently working on a 3rd year project involving data from Twitter. The department have provided me with .lzo's of a months worth of Twitter. The smallest is 4.9gb and when decompressed is 29gb so I'm trying to open the file and read as I'm going. Is this possible or do I need to decompress and work with the data that way?
EDIT: Have attempted to read it line by line and decompress the read line
UPDATE: Found a solution - reading the STDOUT of lzop -dc works like a charm
How about starting an lzop binary in a subprocess with -c switch and then read its STDOUT line by line?
I know only one library for LZO with Python — https://github.com/jd-boyd/python-lzo and it requires full decompression (moreover — it decompress contents in memory).
So I think you'll need to decompress files before work with them.
I know this is a very old question and the answer is really good. I enchountered a samilar problem, google brought me here.
I just write down my experience on lzo compression and lzop program. Hope I can help someone like me encounter the same quesion. And I write a simple python module to deal with lzo file, you can find it on https://github.com/ir193/python-lzo/
Regarding the quesion, reading lzo compressed file in place (without decompress the whole file) can be done by reading one block at one time. The lzo file is divided into serveral blocks and there is a maximum size of the block about serveral MB. In my module, you can just using read(4096) or so.
Actually *.lzo is created by lzop and has little to do with the python-lzo provided by another answer (https://github.com/jd-boyd/python-lzo). This module is used for compress/decompress string, not handle lzop file header and checksum. Don't use it if you want to open some exist lzo file.

compressed archive with quick access to individual file

I need to come up with a file format for new application I am writing.
This file will need to hold a bunch other text files which are mostly text but can be other formats as well.
Naturally, a compressed tar file seems to fit the bill.
The problem is that I want to be able to retrieve some data from the file very quickly and getting just a particular file from a tar.gz file seems to take longer than it should. I am assumeing that this is because it has to decompress the entire file even though I just want one. When I have just a regular uncompressed tar file I can get that data real quick.
Lets say the file I need quickly is called data.dat
For example the command...
tar -x data.dat -zf myfile.tar.gz
... is what takes a lot longer than I'd like.
MP3 files have id3 data and jpeg files have exif data that can be read in quickly without opening the entire file.
I would like my data.dat file to be available in a similar way.
I was thinking that I could leave it uncompressed and seperate from the rest of the files in myfile.tar.gz
I could then create a tar file of data.dat and myfile.tar.gz and then hopefully that data would be able to be retrieved faster because it is at the head of outer tar file and is uncompressed.
Does this sound right?... putting a compressed tar inside of a tar file?
Basically, my need is to have an archive type of file with quick access to one particular file.
Tar does this just fine, but I'd also like to have that data compressed and as soon as I do that, I no longer have quick access.
Are there other archive formats that will give me that quick access I need?
As a side note, this application will be written in Python. If the solution calls for a re-invention of the wheel with my own binary format I am familiar with C and would have no problem writing the Python module in C. Idealy I'd just use tar, dd, cat, gzip, etc though.
Thanks,
~Eric
ZIP seems to be appropriate for your situation. Files are compressed individually, which means you access them without streaming through everything before.
In Python, you can use zipfile.

uncompressing tar.Z file with python?

I need to write a python script that retrieves tar.Z files from an FTP server, and uncompress them on a windows machine. tar.Z, if I understood correctly is the result of a compress command in Unix.
Python doesn't seem to know how to handle these, it's not gz, nor bz2 or zip. Does anyone know a library that would handle these ?
Thanks in advance
If GZIP -- the application -- can handle it, you have two choices.
Try the Python gzip library. It may work.
Use subprocess Popen to run gzip for you.
It may be an InstallShield .Z file. You may want to use InstallShield to unpack it and extract the .TAR file. Again, you may be able to use subprocess Popen to process the file.
It may also be a "LZW compressed file". Look at this library, it may help.
http://www.chilkatsoft.com/compression-python.asp
Since you target a specific platform (Windows), the simplest solution may be to run gzip in a system call: http://www.gzip.org/#exe
Are there other requirements in your project that the decompression needs to be done in Python?
A plain Python module that uncompresses is inexistant, AFAIK, but it's feasible to build one, given some knowledge:
the .Z format header specification
the .Z compression format
Almost all necessary information can be found the unarchiver CompressAlgorithm. Additional info from wikipedia for adaptive LZW and perhaps the compress man page.
Basically, you read the first three bytes (first two are magic bytes) to modify your algorithm, and then start reading and decompressing.
There's a lot of bit fiddling (.Z files begin having 9-bit tokens, up to 16-bit ones and then resetting the symbol table to the initial 256+2 values), which probably you'll deal with doing binary operations (&, <<= etc).

Categories