Extracting zip files from a large binary file

I am dealing with a somewhat large binary file (717M). This binary file contains a set (unknown number!) of complete zip files.
I would like to extract all of those zip files (no need to explicitly decompress them). I am able to find the offset (start point) of each chunk thanks to the magic number ('PK'), but I cannot find a way to compute the length of each chunk (i.e. to carve those zip files out of the large binary file).
Reading some documentation (http://forensicswiki.org/wiki/ZIP) gives me the impression that it is easy to parse a zip file, since it contains the compressed size of each compressed file.
Is there a way for me to do that in C or Python without reinventing the wheel?

A zip entry is permitted to not contain the compressed size in the local header. There is a flag bit to have a descriptor with the compressed size, uncompressed size, and CRC follow the compressed data.
It would be more reliable to search for end-of-central-directory headers, use that to find the central directories, and use that to find the local headers and entries. This will require attention to detail, very carefully reading the PKWare appnote that describes the zip format. You will need to handle the Zip64 format as well, which has additional headers and fields.
It is possible for a zip entry to be stored, i.e. copied verbatim into that location in the zip file, and it is possible for that entry to itself be a zip file. So make sure that you handle the case of embedded zip files, extracting only the outermost zip files.
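The approach described above (locate each end-of-central-directory record, then work backwards to the start of the archive) could be sketched roughly as follows. This is a minimal sketch under several assumptions: the function name carve_zips is hypothetical, Zip64 archives are not handled, and each candidate is only sanity-checked by letting Python's zipfile module re-parse its central directory. For a 717M file the blob could be an mmap object rather than a bytes object, since only find and slicing are used.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
# signature, 2 disk numbers, 2 entry counts, CD size, CD offset, comment length
EOCD_FMT = "<4s4H2LH"
EOCD_LEN = struct.calcsize(EOCD_FMT)  # 22 bytes


def carve_zips(blob):
    """Return (start, end) byte ranges of the outermost zip archives in blob."""
    found = []
    pos = 0
    while True:
        eocd = blob.find(EOCD_SIG, pos)
        if eocd == -1 or eocd + EOCD_LEN > len(blob):
            return found
        fields = struct.unpack_from(EOCD_FMT, blob, eocd)
        cd_size, cd_offset, comment_len = fields[5], fields[6], fields[7]
        # cd_offset is relative to the start of the archive, so the archive
        # begins cd_offset bytes before the central directory.
        start = eocd - cd_size - cd_offset
        end = eocd + EOCD_LEN + comment_len
        pos = eocd + len(EOCD_SIG)
        if start < 0 or end > len(blob):
            continue  # the signature was a false positive
        try:
            # Let zipfile re-parse the candidate as a sanity check.
            zipfile.ZipFile(io.BytesIO(blob[start:end])).close()
        except zipfile.BadZipFile:
            continue
        # A stored inner archive is found before the archive containing it;
        # drop any earlier range that lies inside this one.
        found = [r for r in found if r[0] < start]
        found.append((start, end))
        pos = end
```

Each returned range can then be written out with blob[start:end]. Archives that use Zip64 records, or that are damaged, would need the more careful handling that the appnote describes.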

There are some standard ways to handle zip files in Python, for example, but as far as I know (not that I'm an expert) you first need to supply the actual file somehow. I suggest looking at the zip file format specification.
You should be able to find the other information you need from its position relative to the magic number. Note that the magic number is the local file header signature (0x04034b50, stored as the bytes 'PK\x03\x04'), not the CRC-32. Counting from the start of the signature, the CRC-32 is at offset 14, the compressed size at offset 18, and the file name starts at offset 30:
local file header signature     4 bytes  (0x04034b50)
version needed to extract       2 bytes
general purpose bit flag        2 bytes
compression method              2 bytes
last mod file time              2 bytes
last mod file date              2 bytes
crc-32                          4 bytes
compressed size                 4 bytes
uncompressed size               4 bytes
file name length                2 bytes
extra field length              2 bytes
file name                       (variable size)
extra field                     (variable size)
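As an illustration, the fixed part of that header can be unpacked with the standard struct module. This is a minimal sketch, not a full parser: the name parse_local_header and the keys of the returned dictionary are made up for the example, and Zip64 extra fields are ignored.

```python
import struct

LOCAL_SIG = b"PK\x03\x04"               # 0x04034b50 stored little-endian
LOCAL_FMT = "<4s5H3L2H"                 # the eleven fixed-size fields
LOCAL_LEN = struct.calcsize(LOCAL_FMT)  # 30 bytes


def parse_local_header(buf, offset=0):
    """Decode the local file header that starts at buf[offset]."""
    (sig, version, flags, method, mtime, mdate,
     crc, csize, usize, name_len, extra_len) = struct.unpack_from(
        LOCAL_FMT, buf, offset)
    if sig != LOCAL_SIG:
        raise ValueError("no local file header at offset %d" % offset)
    name_start = offset + LOCAL_LEN
    return {
        "name": bytes(buf[name_start:name_start + name_len]),
        "flags": flags,
        "method": method,
        "crc32": crc,
        "compressed_size": csize,
        "uncompressed_size": usize,
        # Bit 3 set: the CRC and sizes above may be zero, and the real
        # values follow the compressed data in a data descriptor.
        "sizes_in_descriptor": bool(flags & 0x0008),
        # The compressed data begins after the name and the extra field.
        "data_offset": name_start + name_len + extra_len,
    }
```

When sizes_in_descriptor is true, the compressed size in this header cannot be relied on to skip to the next entry.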
Hope that helps a little bit at least :)

Related

ZIP file in the wild with a 12 byte data descriptor, without a signature

According to the ZIP specification at https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
4.3.9 Data descriptor:
crc-32                          4 bytes
compressed size                 4 bytes
uncompressed size               4 bytes
and
4.3.9.3 Although not originally assigned a signature, the value
0x08074b50 has commonly been adopted as a signature value
for the data descriptor record. Implementers SHOULD be
aware that ZIP files MAY be encountered with or without this
signature marking data descriptors and SHOULD account for
either case when reading ZIP files to ensure compatibility.
How can I find (or make with a tool) a ZIP file with a data descriptor that does not have the 0x08074b50 signature?
I know I can just... write it myself byte-by-byte, but I'm making a (streaming) unZIPper, and want to test with real-world ZIP files.
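On the reading side, the two layouts described in 4.3.9.3 can be told apart once the entry has been decompressed. The following is a minimal sketch with a hypothetical helper name, assuming the streaming reader has already computed the CRC-32 and both sizes itself (for example because the deflate stream marks its own end) and that the entry is not Zip64, where the sizes are 8 bytes each.

```python
import struct

DESCRIPTOR_SIG = 0x08074B50


def descriptor_length(buf, offset, crc, csize, usize):
    """Return 12 or 16, the length of the data descriptor at buf[offset],
    or None if neither layout matches the values computed by the reader."""
    expected = struct.pack("<3L", crc, csize, usize)
    # Layout without a signature: crc-32, compressed size, uncompressed size.
    if buf[offset:offset + 12] == expected:
        return 12
    # Layout with the commonly adopted signature in front.
    signed = struct.pack("<L", DESCRIPTOR_SIG) + expected
    if buf[offset:offset + 16] == signed:
        return 16
    return None
```

Comparing against the values the reader computed avoids guessing from the first four bytes alone, which could in principle be a CRC-32 that happens to equal the signature.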

Append a folder to gzip in memory using python

I have a tar.gz file downloaded from S3. I load it in memory, and I want to add a folder to it and eventually write it to another S3 location.
I've been trying different approaches:
from io import BytesIO
import gzip
import tarfile
buffer = BytesIO(zip_obj.get()["Body"].read())
im_memory_tar = tarfile.open(buffer, mode='a')
The above raises the error: ReadError: invalid header.
With the below approach:
im_memory_tar = tarfile.open(fileobj=buffer, mode='a')
im_memory_tar.add(name='code_1', arcname='code')
The content seems to be overwritten.
Do you know a good solution to append a folder into a tar.gz file?
Thanks.

This is explained very well in the question how-to-append-a-file-to-a-tar-file-use-python-tarfile-module.
Note that 'a:gz' or 'a:bz2' is not possible. If mode is not suitable to open a certain (compressed) file for reading, ReadError is raised. Use mode 'r' to avoid this. If a compression method is not supported, CompressionError is raised.

First we need to consider how to append to a tar file. Let's set aside the compression for a moment.
A tar file is terminated by two 512-byte blocks of all zeros. To add more entries, you need to remove or overwrite that 1024 bytes at the end. If you then append another tar file there, or start writing a new tar file there, you will have a single tar file with all of the entries of the original two.
Now we return to the tar.gz. You can simply decompress the entire .gz file, do that append as above, and then recompress the whole thing.
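In Python, that simple route (decompress everything, append to the plain tar, recompress everything) might look like the sketch below. The function name append_to_tar_gz and the mapping of archive names to bytes are assumptions made for the example; it relies on the fact that tarfile's mode 'a' works on an uncompressed archive held in a BytesIO object.

```python
import gzip
import io
import tarfile


def append_to_tar_gz(tar_gz_bytes, new_files):
    """Return new .tar.gz bytes with new_files (arcname -> bytes) appended."""
    # 1. Decompress the whole gzip stream to get the plain tar archive.
    tar_buffer = io.BytesIO(gzip.decompress(tar_gz_bytes))

    # 2. Append to the uncompressed tar. Mode "a" positions the writer on
    #    top of the terminating zero blocks, so they get overwritten.
    with tarfile.open(fileobj=tar_buffer, mode="a") as tar:
        for arcname, data in new_files.items():
            info = tarfile.TarInfo(name=arcname)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

    # 3. Recompress the whole archive.
    return gzip.compress(tar_buffer.getvalue())
```

To add a directory that exists on disk instead of in-memory data, tar.add(name, arcname=...) can be called inside the same with block.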
Avoiding the decompression and recompression is rather more difficult, since we'd have to somehow remove that last 1024 bytes of zeros from the end of the compressed stream. It is possible, but you would need some knowledge of the internals of a deflate compressed stream.
A deflate stream consists of a series of compressed data "blocks", which are each an arbitrary number of bits long. You would need to decompress, without writing out the result, until you get to the block containing the last 1024 bytes. You would need to save the decompressed result of that and any subsequent blocks, and at what bit in the stream that block started. Then you could recompress that data, sans the last 1024 bytes, starting at that bit.
Complete the compression, and write out the gzip trailer with the 1024 zeros removed from the CRC and length. (There is a way to back out zeros from the CRC.) Now you have a complete gzip stream for the previous .tar.gz file, but with the last 1024 bytes of zeros removed.
Since the concatenation of two gzip streams is itself a valid gzip stream, you can now concatenate the second .tar.gz file directly or start writing a new .tar.gz stream there. You now have a single, valid .tar.gz stream with the entries from the two original sources.
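The gzip property used in that last step can be checked in a couple of lines. This small demonstration only shows that concatenated gzip members decompress to the concatenated payloads; it does not remove the tar end-of-archive blocks, which is the hard part described above.

```python
import gzip

first = gzip.compress(b"first member, ")
second = gzip.compress(b"second member")

# Two gzip streams back to back still form one valid gzip stream.
combined = gzip.decompress(first + second)
print(combined)
```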

Test a ZIP file if data has been added at the end of the file?

I am searching for a way to test ZIP files in more detail than Python's ZipFile.testzip() does.
Specifically, I am searching for a way to identify ZIP files that have been modified by appending additional data after the end of the ZIP file, or to be precise, after the end of the End of central directory record (EOCD).
Common zip testing tools (Python ZipFile.testzip(), unzip, 7zip, WinRAR, ...) only test the file up to the EOCD and ignore additional data afterwards. However I need to know if there is additional data present or not after the end of the EOCD.
Is there a simple way to do so in Python? The simplest way would be if I could read the real "ZIP file size" (the offset of the last byte of the EOCD + 1). But how can this be done in Python?
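One possible way to compute that offset by hand is to locate the last EOCD signature and add the record length plus the archive comment length. This is a minimal sketch with a hypothetical function name; it assumes that neither the appended data nor the archive comment contains the bytes 'PK\x05\x06', otherwise the wrong record would be picked.

```python
import struct

EOCD_SIG = b"PK\x05\x06"
EOCD_LEN = 22  # fixed part of the end of central directory record


def bytes_after_eocd(data):
    """Return how many bytes follow the last EOCD record,
    or None if no complete EOCD record is found."""
    eocd = data.rfind(EOCD_SIG)
    if eocd == -1 or eocd + EOCD_LEN > len(data):
        return None
    # The last fixed field is the 2-byte length of the archive comment.
    (comment_len,) = struct.unpack_from("<H", data, eocd + 20)
    return len(data) - (eocd + EOCD_LEN + comment_len)
```

A result of 0 means the file ends exactly where the EOCD (including its comment) ends, and a positive result is the number of extra bytes.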

How to convert a byte object to a list of tuples in python 3?

I have a list of tuples [(x,y,z),...,] and I want to store this list of tuples in a file. For this I chose a .txt file. I write to the file in the mode "wb" and then I close it. Later, I want to open the file in mode "rb" and convert this byte object back to a list of tuples. How would I go about this without regular expression nonsense? Is there a file type that would allow me to store this data and read it easily that I've overlooked?

The .txt extension is typically not used for binary data, as you seem to intend.
Since your data structure is not known on a byte level, it's not that simple.
If you do know your data (types and length), you could "encode" it as a binary structure with https://docs.python.org/3.4/library/struct.html and write that to a (binary) file.
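For instance, if every tuple is known to hold exactly three floats (an assumption made only for this sketch, as are the function names), the struct route could look like this:

```python
import struct

RECORD = struct.Struct("<3d")  # assumption: each tuple is three floats


def dump_tuples(path, rows):
    """Write each (x, y, z) tuple as 24 bytes of little-endian doubles."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(RECORD.pack(*row))


def load_tuples(path):
    """Read the file back into a list of (x, y, z) tuples."""
    with open(path, "rb") as f:
        return [tuple(rec) for rec in RECORD.iter_unpack(f.read())]
```

The format string has to match the real data; tuples of mixed or varying types would need a different layout or one of the formats listed below.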
Otherwise, there are many solutions to the problem of writing (structured) data to and reading data from files (that's why there are so many file formats):
Standard library:
https://docs.python.org/3/library/fileformats.html
https://docs.python.org/3/library/persistence.html
https://docs.python.org/3/library/xml.html
https://docs.python.org/3/library/json.html
3rd party:
https://pypi.python.org/pypi/PyYAML
and other modules on https://pypi.python.org/
Related Q&A on Stack Overflow:
How to save data with Python?
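As one example from the standard library options above, a round trip through json might look like the following sketch (function names are hypothetical). JSON has no tuple type, so the tuples come back as lists and are converted again on load; this assumes the values themselves are JSON-compatible (numbers, strings, booleans, None).

```python
import json


def save_rows(path, rows):
    """Store a list of tuples as a JSON array of arrays (text mode)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f)


def load_rows(path):
    """Load the JSON file and turn each inner list back into a tuple."""
    with open(path, encoding="utf-8") as f:
        return [tuple(row) for row in json.load(f)]
```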

extract a file that contains lzh compression strings

I have a data file (it has no extension or file type) that I think is a combination of two kinds of strings: plain text, and strings that have been compressed with the LZH (LHA) algorithm. When I open the file with Notepad++ it looks like this:
As the picture shows, some of the data is readable but the rest is compressed. Is there any software, or source code in Python, C++, PHP, or any other language, that can read this file chunk by chunk and decompress the compressed parts?
I googled and found plenty of source code and software that decompresses LZH, but all of it expects a file that is entirely LZH-compressed: it checks the file header first and reports an error if the file is not an LZH file.
