How to read a specific line from a gzip file with python - python

I have a big gzip file (11GB) and I want to print as fast as possible the line that I want with Python. I have tried to do it with linecache.getline(), but as the own function open the file, you are not able to open it with gzip.

linecache expects to get a textfile. A file that has been compressed using gzip is not a textfile. To do what you want requires two steps. (1) Unzip the file so that you have a textfile. (2) Use linecache on the textfile. You can do both of those things in Python, but only one after the other.
I understand that you want to get at a specific line without having to decompress then entire zipfile. But that is not how zipfile compression works. There is unlikely to be anything in the compressed data that corresponds to the notion of a line of text.

Related

Reading a single-entry ZIP file incrementally from an unseekable stream in Python

We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS. Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries; however, in our case, we can assume there's exactly one large text file that was zipped, and we could begin extracting and parsing data immediately without needing to wait for the ZIP file to buffer.
If we were using C#, we could use https://github.com/icsharpcode/SharpZipLib/wiki/Unpack-a-zip-using-ZipInputStream (implementation here) which handles this pattern elegantly.
However, it seems that the Python standard library's zipfile module doesn't support this type of streaming; it assumes that the input file-like object is seekable, and all tutorials point to iterating first over namelist() which seeks to the central directory data, then open(name) which seeks back to the file entry.
Many other examples on StackOverflow recommend using BytesIO(response.content) which might appear to pipe the content in a streaming way; however, .content in the Requests library consumes the entire stream and buffers the entire thing to memory.
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Yes: https://github.com/uktrade/stream-unzip can do it [full disclosure: essentially written by me].
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS.
The example from the README shows how to to this, using stream-unzip and httpx
from stream_unzip import stream_unzip
import httpx
def zipped_chunks():
# Any iterable that yields a zip file
with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
yield from r.iter_bytes()
for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
for chunk in unzipped_chunks:
print(chunk)
If you do just want the first file, you can use break after the first file:
for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
for chunk in unzipped_chunks:
print(chunk)
break
Also
Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries
This isn't completely true.
Each file has a "local" header that contains its name, and it can be worked out when the compressed data for any member file ends (via information in the local header if it's there or from the compressed data itself). While there is more information in the central file directory at the end, if you just need the name + bytes of the files, then it is possible to start unzipping a ZIP file, that contains multiple files, as it's downloading.
I can't claim its absolutely possible in all cases: technically ZIP allows for many different compression algorithms and I haven't investigated them all. However, for DEFLATE, which is the one most commonly used, it is possible.
It's even possible to download one specific file from .zip without downloading whole file. All you need is server that allows to read bytes in ranges, fetch end recored (to know size of CD), fetch central directory (to know where file starts and ends) and then fetch proper bytes and handle them.
Using Onlinezip you can handle file like local file. Event API is identical as FileZip in python
[full disclosure: I'm author of library]

Using Memoize in python (2.7)

I do not want to extract files on the disk but keep the final .txt in memory and parse the file. I can't find anything using Memoize in python 2.7.
.zip -> .gz -> .txt(data needs to be parsed)
My second choice it unzip and parse the .txt file data. Any thoughts?
You can unzip the file and write it to an io.BytesIO object, which is essentially an in memory file.
https://docs.python.org/2/library/io.html#buffered-streams
You can then use any function that works for a regular file such as read, seek etc.
This case you get a virtual file that works for any format. If you are certain about the txt is the only thing you are going to use. io module also provides other pure text streams.

Getting the physical location of a file inside a ZipFile for quicker deletion via Python

I'm trying to find a way to delete files from inside a zip without de/recompressing the archive.
The ZipFile documentation mentions -
ZipInfo.header_offset
Byte offset to the file header.
ZipInfo.compress_size
Size of the compressed data.
Could this be as simple as reading compress_size bytes starting at header_offset, or is there a catch?
Does this allow for deletion by gluing parts of the compressed file directly, without de/recompression, perhaps using mmap?

how to check if a file is compressed with gzexe in Python?

I'm working on a simple virus scanner with Python and the scanner needs to check if a file has a virus signature(a particular string) in it. If a file is compressed, the scanner needs to decompress the file first and then check for the signature. Files that are compressed with gzip has a magic number at the very beginning of the file and that's easy to check and use gzip library to decompress.
But how to check if a file is compressed with gzexe? I looked here but gzexe compressed file is not listed. I checked the content of a file that's compressed with gzexe and find that it starts with "#!bin/sh". I think I can check this, but is there a better way to do this? Also, is there any library that can deal decompress gzexe compressed file?
EDIT
The previous problem I had with zlib is because I didn't realize that you have to pass a second parameter to zlib.decompress or it will give you an error. Python zlib documentation didn't point that out very clearly. In python it seems you need to pass 15+32 to this decompress method.
zlib.decompress(data, 15 + 32)
Also gzexe can be decompressed by zlib, As Mark said, you just need to find where the magic number starts and decompress the file from there.
Just search the file for a gzip signature. It is in there after the shell script.
You can use the zlib library to decompress it.

compressed archive with quick access to individual file

I need to come up with a file format for new application I am writing.
This file will need to hold a bunch other text files which are mostly text but can be other formats as well.
Naturally, a compressed tar file seems to fit the bill.
The problem is that I want to be able to retrieve some data from the file very quickly and getting just a particular file from a tar.gz file seems to take longer than it should. I am assumeing that this is because it has to decompress the entire file even though I just want one. When I have just a regular uncompressed tar file I can get that data real quick.
Lets say the file I need quickly is called data.dat
For example the command...
tar -x data.dat -zf myfile.tar.gz
... is what takes a lot longer than I'd like.
MP3 files have id3 data and jpeg files have exif data that can be read in quickly without opening the entire file.
I would like my data.dat file to be available in a similar way.
I was thinking that I could leave it uncompressed and seperate from the rest of the files in myfile.tar.gz
I could then create a tar file of data.dat and myfile.tar.gz and then hopefully that data would be able to be retrieved faster because it is at the head of outer tar file and is uncompressed.
Does this sound right?... putting a compressed tar inside of a tar file?
Basically, my need is to have an archive type of file with quick access to one particular file.
Tar does this just fine, but I'd also like to have that data compressed and as soon as I do that, I no longer have quick access.
Are there other archive formats that will give me that quick access I need?
As a side note, this application will be written in Python. If the solution calls for a re-invention of the wheel with my own binary format I am familiar with C and would have no problem writing the Python module in C. Idealy I'd just use tar, dd, cat, gzip, etc though.
Thanks,
~Eric
ZIP seems to be appropriate for your situation. Files are compressed individually, which means you access them without streaming through everything before.
In Python, you can use zipfile.

Categories