I compress my files using this script:
import io
import pylzma, struct

i = open(path + fileName, 'rb')
o = open(path + zipName + '.zip', 'wb')
data = i.read()
# compressfile() expects a file-like object, not raw bytes
c = pylzma.compressfile(io.BytesIO(data), eos=1)
result = c.read(5)                      # 5-byte LZMA properties header
result += struct.pack('<Q', len(data))  # 8-byte uncompressed size, for 7zip compatibility
o.write(result + c.read())
i.close()
o.close()
I use this method, as shown in the PyLZMA documentation, because it makes my files readable by 7zip or lzma.exe. Decompression with 7zip works fine, but it fails when I use PyLZMA. I use this:
i = open(path+name+'.zip', 'rb')
o = open(path+name, 'wb')
data = i.read()
u = pylzma.decompress(data)
o.write(u)
It stops on pylzma.decompress and I receive the following error:
TypeError: Error while decompressing: 1
If I'm reading the documentation correctly (I'm having trouble installing PyLZMA so I am unable to verify), compress() outputs a string that decompress() can handle.
However, in order to make the compressed string compatible with other utilities, it is necessary to insert the 8-byte length in between the first 5 bytes and the rest of the compressed data.
Thus, if you want to decompress the data using PyLZMA, I suspect you will need to manually remove that 8-byte length field (quickest way would probably be to open the input file, read 5 bytes, skip 8, then read the remainder of the file).
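For what it's worth, a minimal sketch of that workaround (untested, since I can't install PyLZMA here to verify):

import pylzma

i = open(path + name + '.zip', 'rb')
header = i.read(5)        # keep the 5-byte LZMA properties header
i.read(8)                 # skip the 8-byte uncompressed-size field
data = header + i.read()  # remainder is the compressed stream
i.close()
o = open(path + name, 'wb')
o.write(pylzma.decompress(data))
o.close()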
According to https://svn.python.org/projects/external/xz-5.0.3/doc/lzma-file-format.txt, the LZMA header should look something like this:
1.1. Header
+------------+----+----+----+----+--+--+--+--+--+--+--+--+
| Properties |  Dictionary Size  |   Uncompressed Size   |
+------------+----+----+----+----+--+--+--+--+--+--+--+--+
I tried to generate an .lzma file from a 16 KB *.bin file in two ways:
1.) with the lzma.exe provided by the 7z standard SDK (using the -d23 argument, i.e. a 2^23-byte dictionary), and then
2.) in Python, using the following code:
import lzma

fileName = "file_split0_test.bin"
testFileName = "file_split0_test.lzma"

lzma_machine = lzma.LZMACompressor(format=lzma.FORMAT_ALONE)
with open(fileName, "rb") as fileRead:
    byteRead = fileRead.read()
    # flush() must be called to emit the final compressed block
    data_out = lzma_machine.compress(byteRead) + lzma_machine.flush()
    #print(data_out.hex())

with open(testFileName, 'wb') as fs:
    fs.write(data_out)
However, the two results are different even though I'm using the same "Properties" byte (0x5d) and dictionary size (0x800000). I can see that the Python-generated lzma file has all 0xFF bytes in the "Uncompressed Size" field, unlike the one generated by lzma.exe.
Hopefully an expert can point out my mistake here?
[screenshot: hex dump of the lzma.exe generated file]
[screenshot: hex dump of the Python-generated lzma file]
I was experiencing the same problem as you, and now I can say that you are probably not making any mistake. It looks like modern LZMA implementations don't put the uncompressed size in the header; they use the "unknown size" value of -1 (all 0xFF bytes), which is sufficient for modern LZMA decompressors. However, if you need to have the uncompressed size in the header, simply replace those bytes:
uncompressed_size = len(byteRead)
# splice the real size into bytes 5..12 of the 13-byte header
data_out = data_out[:5] + uncompressed_size.to_bytes(8, 'little') + data_out[13:]
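For a quick sanity check, the 13-byte header can be unpacked with struct (illustrative; the printed values assume the 16 KB input and the default 8 MiB dictionary):

import struct

# 1 byte of properties, then dictionary size (4 bytes) and
# uncompressed size (8 bytes), both little-endian
props = data_out[0]
dict_size, uncomp_size = struct.unpack('<IQ', data_out[1:13])
print(hex(props), hex(dict_size), uncomp_size)  # e.g. 0x5d 0x800000 16384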
I need to export some data (integers, floats, etc.) to a binary file with Python. Afterwards, I have to read the file with C# again, but that doesn't work for me.
I tried several ways of writing a binary file with Python, and they work as long as I read the file with Python as well:
a = 3
b = 5

with open('test.tcd', 'wb') as file:
    file.write(bytes(a))
    file.write(bytes(b))
or writing it like this:
import pickle as p

with open('test.tcd', 'wb') as file:
    p.dump([a, b], file)
Currently I am reading the file in C# like this:
static void LoadFile(String path)
{
    BinaryReader br = new BinaryReader(new FileStream(path, FileMode.Open));
    int a = br.ReadInt32();
    int b = br.ReadInt32();
    System.Diagnostics.Debug.WriteLine(a);
    System.Diagnostics.Debug.WriteLine(b);
    br.Close();
}
Unfortunately the output isn't 3 and 5; instead, my output is just zeros. How do I read or write the binary file properly?
In Python, bytes(n) with an integer argument produces n zero bytes rather than the value itself, which is why you only read zeros. You have to write your integers with 4 bytes each. Read more here: struct.pack
import struct

a = 3
b = 5

with open('test.tcd', 'wb') as file:
    file.write(struct.pack("<i", a))  # 4-byte little-endian int
    file.write(struct.pack("<i", b))
Your C# code should work now.
It's possible Python is not writing the data in the format that C# expects. You may need to swap byte endianness or do something else. You could read the raw bytes instead and use BitConverter to see if that fixes it.
Another option is to specify the endianness explicitly in Python; note that C#'s BinaryReader always reads little-endian, so the Python side should write little-endian:
an_int = 5
a_bytes_big = an_int.to_bytes(2, 'big')
print(a_bytes_big)
Output
b'\x00\x05'
a_bytes_little = an_int.to_bytes(2, 'little')
print(a_bytes_little)
Output
b'\x05\x00'
I am using Google Protocol Buffers and Python to decode some large data files, 200 MB each. I have some code below that shows how to decode a delimited stream, and it works just fine. However, it uses the read() command, which loads the whole file into memory and then iterates over it.
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read()  # PROBLEM: loads the entire file into memory
    n = 0
    while n < len(buf):
        msg_len, new_pos = _DecodeVarint32(buf, n)
        n = new_pos
        msg_buf = buf[n:n+msg_len]
        n += msg_len
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(msg_buf)
        # do something with read_row
        print(read_row)
Note that this code comes from another SO post, but I don't remember the exact url. I was wondering if there is a readlines() equivalent for protocol buffers that lets me read one delimited message at a time and decode it. I basically want a pipeline that is not limited by the amount of RAM available for loading the file.
Seems like there was a pystream-protobuf package that supported some of this functionality, but it has not been updated in a year or two. There is also a post from 7 years ago that asked a similar question. But I was wondering if there was any new information since then.
python example for reading multiple protobuf messages from a stream
If it is ok to load one full message at a time, this is quite simple to implement by modifying the code you posted:
import feed_pb2 as sfeed
import sys
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = f.read(10)  # maximum length of a varint length prefix
    while buf:
        msg_len, new_pos = _DecodeVarint32(buf, 0)
        buf = buf[new_pos:]
        # read the rest of the message if it extends past the buffer
        if msg_len > len(buf):
            buf += f.read(msg_len - len(buf))
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(buf[:msg_len])
        buf = buf[msg_len:]
        # do something with read_row
        print(read_row)
        # read the length prefix for the next message
        buf += f.read(10 - len(buf))
This reads 10 bytes, which is enough to parse the length prefix, and then reads the rest of the message once its length is known.
Byte string concatenation and slicing are not very efficient in Python (they make a lot of copies of the data), so using a bytearray can improve performance if your individual messages are also large.
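For illustration, a hypothetical bytearray variant of the same loop (reusing the feed.pb path and standard_feed message type from above):

import feed_pb2 as sfeed
from google.protobuf.internal.decoder import _DecodeVarint32

with open('/home/working/data/feed.pb', 'rb') as f:
    buf = bytearray(f.read(10))
    while buf:
        msg_len, new_pos = _DecodeVarint32(bytes(buf), 0)
        del buf[:new_pos]  # drop the length prefix in place
        if msg_len > len(buf):
            buf += f.read(msg_len - len(buf))  # read the rest of the message
        read_row = sfeed.standard_feed()
        read_row.ParseFromString(bytes(buf[:msg_len]))
        del buf[:msg_len]  # drop the consumed message in place
        print(read_row)
        buf += f.read(10 - len(buf))  # refill up to the next length prefix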
https://github.com/cartoonist/pystream-protobuf/ was updated 6 months ago. I haven't tested it much so far, but it seems to work fine without needing any changes. It provides optional gzip and async support.
I have a series of strings in a list named 'lines' and I compress them as follows:
import bz2

compressor = bz2.BZ2Compressor(compressionLevel)
for l in lines:
    compressor.compress(l)
compressedData = compressor.flush()
decompressedData = bz2.decompress(compressedData)
When compressionLevel is set to 8 or 9, this works fine. When it's any number between 1 and 7 (inclusive), the final line fails with an IOError: invalid data stream. The same occurs if I use the sequential decompressor. However, if I join the strings into one long string and use the one-shot compressor function, it works fine:
import bz2
compressedData = bz2.compress("\n".join(lines))
decompressedData = bz2.decompress(compressedData)
# Works perfectly
Do you know why this would be and how to make it work at lower compression levels?
You are throwing away the compressed data returned by compressor.compress(l) ... docs say "Returns a chunk of compressed data if possible, or an empty byte string otherwise." You need to do something like this:
# setup code goes here
for l in lines:
    chunk = compressor.compress(l)
    if chunk:
        do_something_with(chunk)
chunk = compressor.flush()
if chunk:
    do_something_with(chunk)
# teardown code goes here
Also note that your oneshot code uses "\n".join() ... to check this against the chunked result, use "".join()
Also beware of bytes/str issues e.g. the above should be b"whatever".join().
What version of Python are you using?
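Putting it together, a minimal sketch of the corrected version (assuming lines is a list of bytes objects and compressionLevel is defined):

import bz2

compressor = bz2.BZ2Compressor(compressionLevel)
chunks = []
for l in lines:
    chunk = compressor.compress(l)
    if chunk:
        chunks.append(chunk)  # keep every chunk the compressor emits
chunks.append(compressor.flush())  # flush() returns the final block
compressedData = b"".join(chunks)
decompressedData = bz2.decompress(compressedData)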
I'm writing a script which will work with data coming from instrumentation as gzip streams. In about 90% of cases, the gzip module works perfectly, but some of the streams cause it to produce IOError: Not a gzipped file. If the gzip header is removed and the deflate stream fed directly to zlib, I instead get Error -3 while decompressing data: incorrect header check. After about half a day of banging my head against the wall, I discovered that the streams which are having problems contain a seemingly-random number of extra bytes (which are not part of the gzip data) appended to the end.
It strikes me as odd that Python cannot work with these files for two reasons:
Both Gzip and 7zip are able to open these "padded" files without issue. (Gzip produces the message decompression OK, trailing garbage ignored, 7zip succeeds silently.)
Both the Gzip and Python docs seem to indicate that this should work: (emphasis mine)
Gzip's format.txt:
It must be possible to
detect the end of the compressed data with any compression method,
regardless of the actual size of the compressed data. In particular,
the decompressor must be able to detect and skip extra data appended
to a valid compressed file on a record-oriented file system, or when
the compressed data can only be read from a device in multiples of a
certain block size.
Python's gzip.GzipFile:
Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.
Python's zlib.Decompress.unused_data:
A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
Here are the four approaches I've tried. (These examples are Python 3.1, but I've tested 2.5 and 2.7 and had the same problem.)
# approach 1 - gzip.open
with gzip.open(filename) as datafile:
    data = datafile.read()

# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()

# approach 3 - zlib.decompress
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:])

# approach 4 - zlib.decompressobj
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(gzipfile.read()[10:])
Am I doing something wrong?
UPDATE
Okay, while the problem with gzip seems to be a bug in the module, my zlib problems are self-inflicted. ;-)
While digging into gzip.py I realized what I was doing wrong: by default, zlib.decompress et al. expect zlib-wrapped streams, not bare deflate streams. By passing a negative value for wbits, you can tell zlib to skip the zlib header and decompress the raw stream. Both of these work:
# approach 5 - zlib.decompress with negative wbits
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)

# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
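A small variation on approach 6 also exposes any trailing garbage through the unused_data attribute quoted earlier (an untested sketch):

# approach 7 - zlib.decompressobj, keeping the trailing bytes
import zlib

with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
    trailing_garbage = decompressor.unused_data  # bytes past the end of the deflate stream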
This is a bug. The quality of the gzip module in Python falls far short of the quality that should be required in the Python standard library.
The problem here is that the gzip module assumes that the file is a stream of gzip-format files. At the end of the compressed data, it starts from scratch, expecting a new gzip header; if it doesn't find one, it raises an exception. This is wrong.
Of course, it is valid to concatenate two gzip files, e.g.:
echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing
The gzip module's error is that it should not raise an exception if there's no gzip header the second time around; it should simply end the file. It should only raise an exception if there's no header the first time.
There's no clean workaround without modifying the gzip module directly; if you want to do that, look at the bottom of the _read method. It should set another flag, e.g. reading_second_block, to tell _read_gzip_header to raise EOFError instead of IOError.
There are other bugs in this module. For example, it seeks unnecessarily, causing it to fail on nonseekable streams, such as network sockets. This gives me very little confidence in this module: a developer who doesn't know that gzip needs to function without seeking is badly unqualified to implement it for the Python standard library.
I had a similar problem in the past. I wrote a new module that works better with streams. You can try that out and see if it works for you.
I had exactly this problem, but none of these answers resolved my issue. So, here is what I did to solve the problem:
#for gzip files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|16)
#for zlib files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS)
#automatic header detection (zlib or gzip):
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|32)
Depending on your case, it might be necessary to decode your data, like:
unzipped = unzipped.decode()
https://docs.python.org/3/library/zlib.html
I couldn't make it work with the techniques mentioned above, so I made a workaround using the zipfile package:
import zipfile
from io import BytesIO

mock_file = BytesIO(data)  # data is the compressed string
z = zipfile.ZipFile(file=mock_file)
neat_data = z.read(z.namelist()[0])
Works perfectly.