I need to come up with a file format for a new application I am writing.
This file will need to hold a bunch of other files, which are mostly text but can be other formats as well.
Naturally, a compressed tar file seems to fit the bill.
The problem is that I want to be able to retrieve some data from the file very quickly, and getting just a particular file from a tar.gz file seems to take longer than it should. I am assuming this is because the entire archive has to be decompressed even though I just want one file. When I have a regular uncompressed tar file I can get that data real quick.
Let's say the file I need quickly is called data.dat
For example the command...
tar -xzf myfile.tar.gz data.dat
... is what takes a lot longer than I'd like.
MP3 files have id3 data and jpeg files have exif data that can be read in quickly without opening the entire file.
I would like my data.dat file to be available in a similar way.
I was thinking that I could leave it uncompressed and separate from the rest of the files in myfile.tar.gz.
I could then create a tar file containing data.dat and myfile.tar.gz, and hopefully that data could be retrieved faster because it sits at the head of the outer tar file and is uncompressed.
Does this sound right?... putting a compressed tar inside of a tar file?
Basically, my need is to have an archive type of file with quick access to one particular file.
Tar does this just fine, but I'd also like to have that data compressed and as soon as I do that, I no longer have quick access.
Are there other archive formats that will give me that quick access I need?
As a side note, this application will be written in Python. If the solution calls for reinventing the wheel with my own binary format, I am familiar with C and would have no problem writing the Python module in C. Ideally I'd just use tar, dd, cat, gzip, etc., though.
Thanks,
~Eric
ZIP seems appropriate for your situation. Files are compressed individually, which means you can access any one of them without streaming through everything that precedes it.
In Python, you can use zipfile.
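For example, a minimal sketch (the archive and member names come from your question and are otherwise hypothetical):

import zipfile

# Only the requested member is decompressed, because ZIP compresses
# each file independently and keeps an index (the central directory).
with zipfile.ZipFile("myfile.zip") as zf:
    data = zf.read("data.dat")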
Related
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS. Generally, a ZIP file needs to be downloaded in full before the "central directory" data that identifies the file entries can be read; in our case, however, we can assume there's exactly one large text file that was zipped, so we could begin extracting and parsing data immediately without waiting for the whole ZIP file to buffer.
If we were using C#, we could use https://github.com/icsharpcode/SharpZipLib/wiki/Unpack-a-zip-using-ZipInputStream (implementation here) which handles this pattern elegantly.
However, it seems that the Python standard library's zipfile module doesn't support this type of streaming; it assumes that the input file-like object is seekable, and all tutorials point to iterating first over namelist(), which seeks to the central directory data, then open(name), which seeks back to the file entry.
Many other examples on Stack Overflow recommend BytesIO(response.content), which might appear to pipe the content in a streaming way; however, .content in the Requests library consumes the entire stream and buffers the whole thing in memory.
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Yes: https://github.com/uktrade/stream-unzip can do it [full disclosure: essentially written by me].
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS.
The example from the README shows how to do this, using stream-unzip and httpx:
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
If you just want the first file, you can break out of the outer loop once its chunks are exhausted:
for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
    break
Also
Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries
This isn't completely true.
Each file has a "local" header that contains its name, and the point where any member's compressed data ends can be worked out, either from information in the local header (if it's there) or from the compressed data itself. While there is more information in the central directory at the end, if you just need the names and bytes of the files, it is possible to start unzipping a ZIP file that contains multiple files while it's still downloading.
I can't claim it's possible in absolutely all cases: technically, ZIP allows many different compression algorithms and I haven't investigated them all. However, for DEFLATE, which is the one most commonly used, it is possible.
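To make the idea concrete, here is a minimal sketch of that technique (not stream-unzip's actual implementation): it reads the first member straight off a byte stream, assuming that member is DEFLATE-compressed and ignoring data-descriptor corner cases.

import struct
import zlib

def first_member_chunks(stream):
    # Local file header: 30 fixed bytes, then the file name and extra field
    header = stream.read(30)
    assert header[:4] == b"PK\x03\x04", "not a ZIP local file header"
    method, = struct.unpack_from("<H", header, 8)
    name_len, extra_len = struct.unpack_from("<HH", header, 26)
    name = stream.read(name_len).decode()
    stream.read(extra_len)  # skip the extra field
    assert method == 8, "this sketch only handles DEFLATE"

    def chunks():
        inflater = zlib.decompressobj(-15)  # -15 = raw DEFLATE, no header
        while not inflater.eof:  # the DEFLATE stream marks its own end
            data = stream.read(65536)
            if not data:
                break
            yield inflater.decompress(data)

    return name, chunks()

# Usage: works on any readable byte stream, seekable or not
with open("my.zip", "rb") as f:
    name, chunks = first_member_chunks(f)
    for chunk in chunks:
        print(name, len(chunk))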
It's even possible to download one specific file from a .zip without downloading the whole archive. All you need is a server that allows reading bytes in ranges: fetch the end-of-central-directory record (to learn the size of the central directory), fetch the central directory (to learn where the file starts and ends), and then fetch the right bytes and handle them.
Using Onlinezip you can handle the file like a local file. Even the API is identical to Python's ZipFile.
[full disclosure: I'm author of library]
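If you want to see the mechanics rather than use a library, here is a minimal sketch of the range-request approach just described. It assumes the server honours Range headers and the archive is not ZIP64, and it reads only the first central-directory entry; the URL is hypothetical and httpx is just one choice of HTTP client.

import struct
import zlib
import httpx

URL = "https://www.example.com/my.zip"

def fetch(start, end):
    # Inclusive byte range; requires a server that supports Range requests
    r = httpx.get(URL, headers={"Range": "bytes=%d-%d" % (start, end)})
    r.raise_for_status()
    return r.content

size = int(httpx.head(URL).headers["Content-Length"])

# 1. The end-of-central-directory record is within the last 65557 bytes
tail = fetch(max(0, size - 65557), size - 1)
eocd = tail.rfind(b"PK\x05\x06")
cd_size, cd_offset = struct.unpack_from("<II", tail, eocd + 12)

# 2. Fetch the central directory and parse its first entry
cd = fetch(cd_offset, cd_offset + cd_size - 1)
method, = struct.unpack_from("<H", cd, 10)
comp_size, = struct.unpack_from("<I", cd, 20)
name_len, = struct.unpack_from("<H", cd, 28)
local_offset, = struct.unpack_from("<I", cd, 42)
name = cd[46:46 + name_len].decode()

# 3. Fetch the local header to find where the data starts, then the data
header = fetch(local_offset, local_offset + 29)
n, m = struct.unpack_from("<HH", header, 26)
data_start = local_offset + 30 + n + m
raw = fetch(data_start, data_start + comp_size - 1)

# Method 8 is DEFLATE; method 0 is stored (no compression)
data = zlib.decompress(raw, -15) if method == 8 else raw
print(name, len(data))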
I'm trying to access a large tarball (tar.gz) in Python. The tarball contains multiple mp3 or wav files. I'd like to read each file individually and do the processing I need on it.
I did look at a few of the suggestions available here: this and this.
Both solutions cover reading the table of contents, but not accessing/reading each file one at a time.
The other solutions I have seen involve extracting the entire tarball, and I don't have enough disk space left to do that.
Any help in this regard will be appreciated.
You can use TarFile.extractfile to get a file-like reader for each member of the archive without extracting anything to disk.
import tarfile

with tarfile.open("test.tar.gz") as archive:
    for member in archive:
        file_obj = archive.extractfile(member)  # None for non-file members
        print(file_obj)
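Building on that, a sketch of processing each audio member entirely in memory (process() is a hypothetical stand-in for whatever you need to do):

import tarfile

def process(name, data):
    print(name, len(data))  # stand-in for your real processing

with tarfile.open("test.tar.gz") as archive:
    for member in archive:
        if member.isfile() and member.name.endswith((".mp3", ".wav")):
            # read() pulls the decompressed bytes straight into memory;
            # nothing is written to disk
            data = archive.extractfile(member).read()
            process(member.name, data)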
I do not want to extract the files to disk; I want to keep the final .txt in memory and parse it. I can't find anything about doing this in memory in Python 2.7.
.zip -> .gz -> .txt (data needs to be parsed)
My second choice is to unzip to disk and then parse the .txt file. Any thoughts?
You can unzip the file and write it to an io.BytesIO object, which is essentially an in-memory file.
https://docs.python.org/2/library/io.html#buffered-streams
You can then use any function that works for a regular file such as read, seek etc.
This way you get a virtual file that works for any format. If you are certain the .txt is the only thing you are going to read, the io module also provides pure text streams.
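A minimal sketch of the whole chain in memory (the file names are hypothetical; this works on Python 2.7 and 3):

import gzip
import io
import zipfile

# .zip -> .gz -> .txt, without touching the disk
with zipfile.ZipFile("outer.zip") as outer:
    gz_bytes = io.BytesIO(outer.read("inner.txt.gz"))  # in-memory "file"
    text = gzip.GzipFile(fileobj=gz_bytes).read()      # the .txt contents
# parse text here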
I've got a program that downloads the parts of a split RAR archive from the internet: part01 first, then part02, and so on.
After some tests, I found out that using, for example, UnRAR2 for Python, I can extract the first part of the file (an .avi) contained in the archive and play its first minutes. When I add another part, it extracts a bit more, and so on. What I wonder is: is it possible to make it extract single files WHILE downloading them?
I'd need it to start extracting part01 without having to wait for it to finish downloading... is that possible?
Thank you very much!
Matteo
You are talking about an .avi file inside the rar archives. Are you sure the archives are actually compressed? Video files released by the warez scene do not use compression:
Ripped movies are still packaged due to the large filesize, but compression is disallowed and the RAR format is used only as a container. Because of this, modern playback software can easily play a release directly from the packaged files, and even stream it as the release is downloaded (if the network is fast enough).
(I'm thinking VLC, BSPlayer, KMPlayer, Dziobas Rar Player, rarfilesource, rarfs,...)
You can check for the compression as follows:
Open the first .rar archive in WinRAR (name.part01.rar, or name.rar for old-style volume names).
Click the info button.
If "Version to extract" indicates 2.0, then the archive uses no compression (unless you have decade-old rars). You can see that "Total size" and "Packed size" will be equal.
is it possible to make it extract single files WHILE downloading them?
Yes. When no compression is used, you can write your own program to extract the files. (I know of someone who wrote a script to directly download a movie from external RAR files, but it's not public and I don't have it.) Because you mentioned Python, I suggest you take a look at rarfile 2.2 by Marko Kreen, like the author of pyarrfs did. The archive is just the file chopped up, with headers (RAR blocks) added. It will be a copy operation that you need to pause until the next archive is downloaded.
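A rough sketch of that copy operation using the rarfile package (the file names are hypothetical, and this assumes the members are stored uncompressed):

import rarfile  # third-party: pip install rarfile

# Copy the stored .avi out of a multi-volume archive chunk by chunk.
# rarfile follows .part02, .part03, ... automatically, so this is where
# you would pause whenever the next volume hasn't finished downloading.
rf = rarfile.RarFile("movie.part01.rar")
with rf.open(rf.namelist()[0]) as member, open("movie.avi", "wb") as out:
    while True:
        chunk = member.read(65536)
        if not chunk:
            break
        out.write(chunk)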
I strongly believe it is also possible for compressed files. Your approach there will be different, because you must use unrar to extract them. I should add that there is also a free RARv3 implementation, in The Unarchiver, that can extract rars.
I think this parameter for (un)rar will make it possible:
-vp Pause before each volume
By default RAR asks for confirmation before creating
or unpacking next volume only for removable disks.
This switch forces RAR to ask such confirmation always.
It can be useful if disk space is limited and you wish
to copy each volume to another media immediately after
creation.
It will give you the possibility to pause the extraction until the next archive is downloaded.
I believe that this won't work if the rar was created with the 'solid' option enabled.
When the solid option is used for rars, all packed files are treated as one big file stream. This should not cause any problems if you always start from the first file even if it doesn't contain the file you want to extract.
I also think it will work with passworded archives.
I highly doubt it. By the nature of compression (from my understanding), every bit is needed to decompress the data. It seems that the source you are downloading from has intentionally broken the .avi into pieces before compression, but once you apply compression, whatever you compressed becomes one atomic unit. So they kindly broke the whole .avi into parts, but each part is still an atomic unit.
But I'm not an expert in compression.
The only test I can currently think of is something like: curl http://example.com/Part01 | unrar.
I don't know if this was asked with a specific language in mind, but it is possible to stream a compressed RAR directly from the internet and have it decompressed on the fly. I can do this with my C# library http://sharpcompress.codeplex.com/
The RAR format is actually kind of nice. It has headers preceding each entry and the compressed data itself does not require random access on the stream of bytes.
To do multi-part files, you'd have to fully extract part 1 first, then continue writing when part 2 is available.
All of this is possible with my RarReader API. Solid archives are also streamable (in fact, they're only streamable: you can't randomly access files in a solid archive; you pretty much have to extract them all at once).
Hi, I have one XML file and some image files, and I am making my own concatenated file (i.e., like a tar file) from all of these (I have my own scripts for tarring and untarring).
Before I describe exactly what I want, you have to look at the current situation.
As of now, I have to untar all the files into a directory before I can read the XML file that is part of the tar file. Then I read the data from the XML file and draw the images mentioned in it (the image names appear in XML attribute values) on the corresponding panels.
Now I want that, when someone clicks on my tar file, I can read the XML file and then read all the other image data and draw it on the corresponding panels, without extracting everything into a directory first.
Any method or any help would really help me a lot.
Thanks in advance.
The tarfile module gives you access to tarballs. It won't be random access, but you can read out any files you need and put them in a temporary directory, or just store them in strings.
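For example, a minimal sketch, under the assumption that your container is (or can be converted to) a standard tar file, with hypothetical member names:

import tarfile

# Read the XML member into memory, then pull out each image it mentions,
# all without extracting anything into a directory.
with tarfile.open("myfile.tar") as archive:
    xml_data = archive.extractfile("layout.xml").read()
    # ... parse xml_data and collect the referenced image names ...
    image_names = ["picture1.png"]  # placeholder for the parsed names
    for name in image_names:
        image_bytes = archive.extractfile(name).read()
        # hand image_bytes to the drawing code for the matching panel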