Python-generated lzma file with missing uncompressed size

According to https://svn.python.org/projects/external/xz-5.0.3/doc/lzma-file-format.txt, the lzma header should look something like this:
1.1. Header
+------------+----+----+----+----+--+--+--+--+--+--+--+--+
| Properties |  Dictionary Size  |   Uncompressed Size   |
+------------+----+----+----+----+--+--+--+--+--+--+--+--+
I tried to generate an lzma file from a 16 KB *.bin file in two ways:
1.) with the lzma.exe shipped in the 7-Zip LZMA SDK (using the -d23 argument, i.e. a 2^23-byte dictionary), and then
2.) in Python, using the following code:
import lzma

fileName = "file_split0_test.bin"
testFileName = "file_split0_test.lzma"

lzma_machine = lzma.LZMACompressor(format=lzma.FORMAT_ALONE)
with open(fileName, "rb") as fileRead:
    byteRead = fileRead.read()

# flush() is needed to emit the buffered tail of the stream
data_out = lzma_machine.compress(byteRead) + lzma_machine.flush()
#print(data_out.hex())

with open(testFileName, 'wb') as fs:
    fs.write(data_out)
However, the two results differ even though I'm using the same "Properties" byte 0x5D and dictionary size 0x8000. I can see that the Python-generated lzma file has all 0xFF bytes in the "Uncompressed Size" field, unlike the file generated by lzma.exe.
I hope an expert can point out my mistake here.
(Screenshots: hex dumps of the lzma.exe-generated file and the Python-generated file.)

I was experiencing the same problem as you, and now I can say that you are probably not making any mistake. It looks like modern LZMA implementations don't write the uncompressed size into the header; they use the "unknown size" marker, -1 (all 0xFF bytes), which is sufficient for modern LZMA decompressors. However, if you need the value of the uncompressed size in the header, simply patch those bytes:
# Bytes 5..12 of the .lzma header hold the uncompressed size, little-endian.
uncompressed_size = len(byteRead)
data_out = data_out[:5] + uncompressed_size.to_bytes(8, 'little') + data_out[13:]
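Putting the question's compressor and this patch together, a minimal end-to-end sketch (file names as in the question):

import lzma

fileName = "file_split0_test.bin"
testFileName = "file_split0_test.lzma"

with open(fileName, "rb") as f:
    byteRead = f.read()

compressor = lzma.LZMACompressor(format=lzma.FORMAT_ALONE)
data_out = compressor.compress(byteRead) + compressor.flush()

# Overwrite the 8-byte "unknown size" marker with the real length.
data_out = data_out[:5] + len(byteRead).to_bytes(8, 'little') + data_out[13:]

with open(testFileName, "wb") as f:
    f.write(data_out)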


MATLAB vs. Python Binary File Read

I have a MATLAB application that reads a .bin file and parses through the data. I am trying to convert this script from MATLAB to Python but am seeing discrepancies in the values being read.
The read function utilized in the MATLAB script is:
fname = 'file.bin';
f=fopen(fname);
data = fread(f, 100);
fclose(f);
The Python conversion I attempted is:
fname = 'file.bin'
with open(fname, mode='rb') as f:
    data = list(f.read(100))
I would then print a side-by-side comparison of the read bytes with their index and found discrepancies between the two. I have confirmed that the values read in Python are correct by executing $ hexdump -n 100 -C file.bin and by viewing the file's contents on the application HexEdit.
I would appreciate any insight into the source of discrepancies between the two programs and how I may be able to resolve it.
Note: I am trying to only utilize built-in Python libraries to resolve this issue.
Solution: I was using an incorrect file path/structure between the two languages. Implementing @juanpa.arrivillaga's suggestion cleanly reproduced the MATLAB results.
An exact translation of the MATLAB code, using NumPy (inside the same with open(...) block as above), would be:
import numpy as np
data = np.frombuffer(f.read(100), dtype=np.uint8).astype(np.float64)
(MATLAB's fread returns the values as double by default, hence the .astype(np.float64).)
Python automatically transforms single bytes into unsigned integers when you iterate over a bytes object, just as MATLAB does, so you just need to do the following:
fname = 'file.bin'
with open(fname, mode='rb') as f:
    bytes_arr = f.read(100)

# Conversion for visual comparison purposes
data = [x for x in bytes_arr]
print(data)
Also, welcome to Python: bytes is a built-in type, so please don't shadow the built-in bytes name, or you'll run into unexpected problems.
Edit: as pointed out by @juanpa.arrivillaga, you could use the faster:
fname = 'file.bin'
with open(fname, mode='rb') as f:
    bytes_arr = f.read(100)

# Conversion for visual comparison purposes
data = list(bytes_arr)

The result of zlib has a different tail between Python and PHP

The PHP code is:
$input_file = "a.txt";
$source = file_get_contents($input_file);
$source = gzcompress($source);
file_put_contents("php.txt", $source);
The Python code is:
import zlib

testFile = "a.txt"
with open(testFile, "rb") as f:
    content = f.read()
outContent = zlib.compress(content)
with open("py.txt", "wb") as f:
    f.write(outContent)
The Python version is 3.6.9 and the PHP version is 7.2.17.
I need both to produce the same result so that the files have the same md5.
The problem is not in PHP or Python, but rather in your "need". You cannot expect to get the same result, unless the two environments happen to be using the same version of the same compression code with the same settings. Since you do not have control of the version of code being used, your "need" can never be guaranteed to be met.
You should instead be doing your md5 on the decompressed data, not the compressed data.
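For example, a minimal sketch that compares md5 digests of the decompressed payloads, using the file names from the question:

import hashlib, zlib

def md5_of_decompressed(path):
    # Hash what the data decompresses to; this is stable across
    # zlib versions and compression settings.
    with open(path, "rb") as f:
        return hashlib.md5(zlib.decompress(f.read())).hexdigest()

print(md5_of_decompressed("py.txt") == md5_of_decompressed("php.txt"))  # True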
I found the solution. The code is:
import zlib

compress = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, 15, 9)
outContent = compress.compress(content)
outContent += compress.flush()
Python's zlib provides the interface zlib.compressobj, which returns a compression object; its parameters (level, method, wbits, memLevel, strategy) determine the exact byte output.
You can adjust the parameters until Python's result is the same as PHP's.
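For reference, a complete version of the adjusted script under that approach (same file names as the question):

import zlib

with open("a.txt", "rb") as f:
    content = f.read()

# Positional arguments: level, method, wbits, memLevel.
compress = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, 15, 9)
outContent = compress.compress(content) + compress.flush()

with open("py.txt", "wb") as f:
    f.write(outContent)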

Encrypt a big file that does not fit in RAM with AES-GCM

This code works for a file myfile which fits in RAM:
import io
import Crypto.Random, Crypto.Cipher.AES  # pip install pycryptodome

nonce = Crypto.Random.new().read(16)
key = Crypto.Random.new().read(16)  # in reality, use a key derivation function, etc.; off-topic here
cipher = Crypto.Cipher.AES.new(key, Crypto.Cipher.AES.MODE_GCM, nonce=nonce)
out = io.BytesIO()
with open('myfile', 'rb') as g:
    s = g.read()
    ciphertext, tag = cipher.encrypt_and_digest(s)
    out.write(nonce)
    out.write(ciphertext)
    out.write(tag)
But how to encrypt a 64 GB file using this technique?
Obviously, the g.read(...) should use a smaller buffer-size, e.g. 128 MB.
But then, how does it work for the crypto part? Should we keep a (ciphertext, tag) for each 128-MB chunk?
Or is it possible to have only one tag for the whole file?
As mentioned in @PresidentJamesK.Polk's comment, this seems to be the solution:
out.write(nonce)
while True:
    block = g.read(65536)
    if not block:
        break
    out.write(cipher.encrypt(block))
out.write(cipher.digest())  # 16-byte tag at the end of the file
The only problem is that, when reading this file back for decryption, stopping 16 bytes before the end is a bit annoying.
Or maybe one should do this:
out.write(nonce)
out.seek(16, 1)  # skip forward 16 bytes: placeholder for the tag
while True:
    block = g.read(65536)
    if not block:
        break
    out.write(cipher.encrypt(block))
out.seek(16)  # back to offset 16, just after the nonce
out.write(cipher.digest())  # write the 16-byte tag
?
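For completeness, a sketch of the matching chunked decryption for the second layout (nonce, then tag, then ciphertext); decrypt_file is a hypothetical helper name:

import Crypto.Cipher.AES

def decrypt_file(path_in, path_out, key, chunk_size=65536):
    with open(path_in, 'rb') as f, open(path_out, 'wb') as out:
        nonce = f.read(16)
        tag = f.read(16)  # layout: nonce | tag | ciphertext
        cipher = Crypto.Cipher.AES.new(key, Crypto.Cipher.AES.MODE_GCM, nonce=nonce)
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            out.write(cipher.decrypt(block))
        # Raises ValueError on tampering; note that the (unverified)
        # plaintext has already been written out by this point.
        cipher.verify(tag)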

Decompressing with PyLZMA

I compress my files using this script
import pylzma, struct
i = open(path+fileName,'rb')
o = open(path+zipName+'.zip','wb')
data = i.read()
c = pylzma.compressfile(data, eos=1)
result = c.read(5)
result += struct.pack('<Q', len(data))
o.write(result + c.read())
i.close()
o.close()
I use this method, as shown in the PyLZMA documentation, because it makes my files readable by 7-Zip and lzma.exe. Decompression using 7-Zip works fine, but it does not work when I use PyLZMA. I use this:
i = open(path+name+'.zip', 'rb')
o = open(path+name, 'wb')
data = i.read()
u = pylzma.decompress(data)
o.write(u)
It stops on pylzma.decompress and I receive the following error:
TypeError: Error while decompressing: 1
If I'm reading the documentation correctly (I'm having trouble installing PyLZMA so I am unable to verify), compress() outputs a string that decompress() can handle.
However, in order to make the compressed string compatible with other utilities, it is necessary to insert the 8-byte length in between the first 5 bytes and the rest of the compressed data.
Thus, if you want to decompress the data using PyLZMA, I suspect you will need to manually remove that 8-byte length field (quickest way would probably be to open the input file, read 5 bytes, skip 8, then read the remainder of the file).
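A sketch of that workaround, assuming the file was produced by the compression script above (5-byte header, then the 8-byte length, then the LZMA stream):

import pylzma

with open(path + name + '.zip', 'rb') as i:
    header = i.read(5)  # properties + dictionary size
    i.seek(8, 1)        # skip the 8-byte uncompressed-size field
    data = header + i.read()

u = pylzma.decompress(data)
with open(path + name, 'wb') as o:
    o.write(u)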

How can I work with Gzip files which contain extra data?

I'm writing a script which will work with data coming from instrumentation as gzip streams. In about 90% of cases, the gzip module works perfectly, but some of the streams cause it to produce IOError: Not a gzipped file. If the gzip header is removed and the deflate stream fed directly to zlib, I instead get Error -3 while decompressing data: incorrect header check. After about half a day of banging my head against the wall, I discovered that the streams which are having problems contain a seemingly-random number of extra bytes (which are not part of the gzip data) appended to the end.
It strikes me as odd that Python cannot work with these files for two reasons:
Both Gzip and 7zip are able to open these "padded" files without issue. (Gzip produces the message decompression OK, trailing garbage ignored, 7zip succeeds silently.)
Both the Gzip and Python docs seem to indicate that this should work: (emphasis mine)
Gzip's format.txt:
It must be possible to
detect the end of the compressed data with any compression method,
regardless of the actual size of the compressed data. In particular,
the decompressor must be able to detect and skip extra data appended
to a valid compressed file on a record-oriented file system, or when
the compressed data can only be read from a device in multiples of a
certain block size.
Python's gzip.GzipFile:
Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.
Python's zlib.Decompress.unused_data:
A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
Here are the four approaches I've tried. (These examples are Python 3.1, but I've tested 2.5 and 2.7 and had the same problem.)
# approach 1 - gzip.open
with gzip.open(filename) as datafile:
    data = datafile.read()

# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()

# approach 3 - zlib.decompress
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:])

# approach 4 - zlib.decompressobj
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(gzipfile.read()[10:])
Am I doing something wrong?
UPDATE
Okay, while the problem with gzip seems to be a bug in the module, my zlib problems are self-inflicted. ;-)
While digging into gzip.py I realized what I was doing wrong: by default, zlib.decompress et al. expect zlib-wrapped streams, not bare deflate streams. By passing in a negative value for wbits, you can tell zlib to skip the zlib header and decompress the raw deflate stream. Both of these work:
# approach 5 - zlib.decompress with negative wbits
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)

# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
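Building on approach 6, a sketch that avoids hand-stripping the 10-byte gzip header entirely: passing 16 + zlib.MAX_WBITS as wbits tells zlib to parse the gzip wrapper itself, and the unused_data attribute quoted above exposes whatever trailing garbage followed the member:

# approach 7 - let zlib parse the gzip wrapper, then inspect unused_data
import zlib

with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read())
    trailing = decompressor.unused_data  # the appended junk, if any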
This is a bug. The quality of the gzip module in Python falls far short of the quality that should be required in the Python standard library.
The problem here is that the gzip module assumes that the file is a stream of gzip-format files. At the end of the compressed data, it starts from scratch, expecting a new gzip header; if it doesn't find one, it raises an exception. This is wrong.
Of course, it is valid to concatenate two gzip files, e.g.:
echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing
The gzip module's error is that it should not raise an exception if there's no gzip header the second time around; it should simply end the file. It should only raise an exception if there's no header the first time.
There's no clean workaround without modifying the gzip module directly; if you want to do that, look at the bottom of the _read method. It should set another flag, e.g. reading_second_block, to tell _read_gzip_header to raise EOFError instead of IOError.
There are other bugs in this module. For example, it seeks unnecessarily, causing it to fail on nonseekable streams, such as network sockets. This gives me very little confidence in this module: a developer who doesn't know that gzip needs to function without seeking is badly unqualified to implement it for the Python standard library.
I had a similar problem in the past. I wrote a new module that works better with streams. You can try that out and see if it works for you.
I had exactly this problem, but none of these answers resolved my issue. So here is what I did to solve the problem:
#for gzip files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|16)
#for zlib files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS)
#automatic header detection (zlib or gzip):
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|32)
Depending on your case, it might be necessary to decode your data, like:
unzipped = unzipped.decode()
https://docs.python.org/3/library/zlib.html
I couldn't make it work with the above-mentioned techniques, so I made a workaround using the zipfile package:
import zipfile
from io import BytesIO

mock_file = BytesIO(data)  # data is the compressed string
z = zipfile.ZipFile(file=mock_file)
neat_data = z.read(z.namelist()[0])
Works perfectly.
