Decompressing bz2 files on Windows - python

I am trying to decompress a bz2 file with the code snippet below, which is provided in various places:
import bz2
bz2_data = bz2.BZ2File(DATA_FILE+".bz2").read()
open(DATA_FILE, 'wb').write(bz2_data)
However, I am getting a much smaller file than I expect.
When I extract the file with the 7z GUI, I get a file of 248 MB. With the code above, however, the file I get is only 879 KB.
When I read the extracted XML file, I can see that the rest of the file is missing, as expected.
I am running Anaconda on a Windows machine, and as far as I understand, bz2 reaches an EOF before the file actually ends.
By the way, I have already run into this and this; both did no good.

If this is a multi-stream file, then Python's bz2 module (before 3.3) doesn't support it:
Note This class does not support input files containing multiple streams (such as those produced by the pbzip2 tool). When reading such an input file, only the first stream will be accessible. If you require support for multi-stream files, consider using the third-party bz2file module (available from PyPI). This module provides a backport of Python 3.3’s BZ2File class, which does support multi-stream files.
An alternative: the third-party bz2file module is a drop-in replacement and should work.
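If upgrading Python isn't an option, a minimal sketch using that backport (pip install bz2file; DATA_FILE as in the question):
import bz2file
# bz2file's BZ2File reads all streams of a multi-stream archive
with bz2file.BZ2File(DATA_FILE + ".bz2") as f, open(DATA_FILE, 'wb') as out:
    for chunk in iter(lambda: f.read(1024 * 1024), b""):
        out.write(chunk)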

If it is a multi-stream file, you have to set the mode to "r" or it will silently fail (e.g. output the compressed data as-is).
This should do what you want:
from bz2 import BZ2File
with open(out_file_path, "wb") as out_file, BZ2File(bz2_file_path, "r") as bz2_file:
    for data in iter(lambda: bz2_file.read(100 * 1024), b""):
        out_file.write(data)
From the documentation:
If mode is 'r', the input file may be the concatenation of multiple compressed streams.
https://docs.python.org/3/library/bz2.html#bz2.BZ2File

Related

Decompress remote .gz file in Python

I have an issue with Python.
My case: I have a gzipped file from a partner platform (i.e. h..p//....namesite.../xxx).
If I click the link from my browser, it downloads a file like namefile.xml.gz.
So... if I read this file with Python, I can decompress and read it.
Code:
import gzip
content = gzip.open('namefile.xml.gz', 'rb')
print content.read()
But I can't if I try to read the file from the remote source.
From the remote file I can read only the encoded string, but I cannot decode it.
Code:
import urllib2
response = urllib2.urlopen(url)
encoded = response.read()
print encoded
With this code I can read the encoded string... but I can't decode it with gzip or lzip.
Any advice?
Thanks a lot
Unfortunately the method @Aya suggests does not work, since GzipFile makes extensive use of the seek method of the file object (not supported by response).
So you basically have two options:
Read the contents of the remote file into io.BytesIO and pass that object into gzip.GzipFile (if the file is small)
Download the file into a temporary file on disk and use gzip.open
There is another option (which requires some coding): implement your own reader using the zlib module. It is rather easy, but you will need to know about a magic constant (How can I decompress a gzip stream with zlib?); a sketch follows below.
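As a rough sketch of that zlib route (url as in the question; 16 + zlib.MAX_WBITS is the magic constant that tells zlib to parse the gzip envelope itself):
import urllib2
import zlib
response = urllib2.urlopen(url)
# a decompressobj never needs to seek on the response
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
content = decompressor.decompress(response.read())
content += decompressor.flush()
print content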
If you use Python 3.2 or later, the bug in GzipFile (requiring tell support) is fixed, but apparently the fix isn't going to be backported to Python 2.x.
For Python v3.2 or later, you can use the gzip.GzipFile class to wrap the file object returned by urllib.request.urlopen(), with something like this...
import urllib.request
import gzip
response = urllib.request.urlopen(url)
gunzip_response = gzip.GzipFile(fileobj=response)
content = gunzip_response.read()
print(content)
...which will transparently decompress the response stream as you read it.

Writing append only gzipped log files in Python

I am building a service where I log plain-text logs from several sources (one file per source). I do not intend to rotate these logs, as they must be around forever.
To keep these kept-forever files smaller, I hope I can gzip them on the fly. As they are log data, the files compress very well.
What is a good approach in Python to writing append-only gzipped text files, so that writing can later be resumed when the service goes on and off? I am not that worried about losing a few lines, but if the gzip container itself breaks down and the file becomes unreadable, that's a no-no.
Also, if it's a no-go, I can simply write them as plain text without gzipping, if it's not worth the hassle.
Note: on Unix systems you should seriously consider using an external program written for this exact task:
logrotate (rotates, compresses, and mails system logs)
You can set the number of rotations so high that the first file would be deleted in 100 years or so; a minimal example follows.
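For illustration only, a hypothetical logrotate entry along those lines (the path is made up; weekly rotation with rotate 5200 keeps roughly 100 years of files):
/var/log/myservice/*.log {
    weekly
    rotate 5200
    compress
    delaycompress
    missingok
}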
In Python 2, logging.FileHandler takes a keyword argument encoding that can be set to bz2 or zlib.
This is because logging uses the codecs module, which in turn treats bz2 (or zlib) as an encoding:
>>> import codecs
>>> with codecs.open("on-the-fly-compressed.txt.bz2", "w", "bz2") as fh:
... fh.write("Hello World\n")
$ bzcat on-the-fly-compressed.txt.bz2
Hello World
Python 3 version (although the docs mention bz2 as an alias, you'll actually have to use bz2_codec, at least with 3.2.3):
>>> import codecs
>>> with codecs.open("on-the-fly-compressed.txt.bz2", "w", "bz2_codec") as fh:
... fh.write(b"Hello World\n")
$ bzcat on-the-fly-compressed.txt.bz2
Hello World
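For the append-and-resume requirement itself, a minimal sketch (file name hypothetical): reopening a gzip file in append mode starts a new compressed member, and tools like zcat treat concatenated members as a single stream, so members that were already closed survive a crash intact:
import gzip
def append_log_line(path, line):
    # 'ab' appends a fresh gzip member; earlier members are never touched
    with gzip.open(path, 'ab') as fh:
        fh.write(line)
append_log_line('service.log.gz', b'service restarted\n')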

pickle.load Not Working

I got a file that contains a data structure with test results from a Windows user. He created this file using the pickle.dump command. On Ubuntu, I tried to load the test results with the following program:
import pickle
import my_module
f = open('results', 'r')
print pickle.load(f)
f.close()
But I get an error inside the pickle module that there is no module named "my_module".
Could the problem be due to corruption in the file, or maybe moving from Windows to Linux is the cause?
The problem lies in pickle's way of handling newline characters. Some of the line-feed characters cripple module names in dumped/loaded data.
Storing and loading files in binary mode may help, but I was having trouble with that too. After a long time reading docs and searching, I found that pickle handles several different "protocols" for storing data, and due to backward compatibility it uses the oldest one: protocol 0, the original ASCII protocol.
You can select a more modern protocol by specifying the protocol keyword while dumping data, something like this:
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=2)
or by choosing the highest protocol available (currently 2):
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=pickle.HIGHEST_PROTOCOL)
The protocol version is stored in the dump file, so the load() function handles it automatically.
Regards
You should open the pickled file in binary mode, especially if you are using pickle on different platforms. See this and this question for an explanation.
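Putting both answers together, a minimal sketch (someObj as in the snippets above):
import pickle
# writer (e.g. on Windows): binary mode plus a newer protocol
with open('results', 'wb') as f:
    pickle.dump(someObj, f, protocol=pickle.HIGHEST_PROTOCOL)
# reader (e.g. on Ubuntu): must also use binary mode
with open('results', 'rb') as f:
    print pickle.load(f)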

AppDailySales: Works, but the downloaded gzip file is corrupted

I am trying to use the appdailysales.py module to download daily our iPhone apps. I am a .NET developer, so I tried running this using IronPython in a C# solution using the following code:
using IronPython.Hosting;
var ipy = Python.CreateRuntime();
dynamic appSales = ipy.UseFile("appdailysales.py");
appSales.main();
Because I didn't have gzip, I took out the references to that module. I was going to use the GZipStream C# class to decompress the file (Apple provides their downloads as .gz files). So, I commented out lines 75 and 429-435.
I have tried executing appdailysales.py in my C# solution, directly from IronPython and using Python 2.7 (installed ActivePython last night); all with the same results: When I try to open the .gz file using 7zip, I get the following error:
CRC Failed ... file is broken
When I try using the GZipStream class I get:
The CRC in GZip footer does not match the CRC calculated from the decompressed data
If I download the .gz file manually, I can decompress the file just fine using 7Zip or GZipStream.
I am fluent in C#, but new to Python. Any help you can provide would be much appreciated.
Thanks for your time.
Looks like line 444 is the problem. Here are lines 444-446:
downloadFile = open(filename, 'w')
downloadFile.write(filebuffer)
downloadFile.close()
At this stage, IF you have deleted lines 429-435 OR selected not to unzip, then filebuffer refers to the raw gzipped stream that you got from the web. The output file is opened in TEXT mode, and you are on Windows, so every \n in the BINARY gzipped stream will be converted to \r\n -- CORRUPTION, as the error message said.
So: for the module to be used portably on both Windows and other platforms, the open mode must be "wb" (b for binary). If the gunzipped result file is also a binary file, "wb" can be hardcoded in the open call. However, if the gunzipped file is a text file (meant to be opened in a text editor), you need just "w" for that purpose; in that case, set a variable mode to "wb" or "w" as appropriate and use mode in the open call, as sketched below.
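A sketch of the binary-mode fix applied to the quoted lines (assuming the result should stay gzipped):
downloadFile = open(filename, 'wb')  # 'b' stops Windows turning \n into \r\n
downloadFile.write(filebuffer)
downloadFile.close()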
Big question: I understand why you removed the gzip references for IronPython usage. Did you remove those lines for Python 2.7? Or did you run it under Python 2.7 with those lines still in, but set options.unzipFile to False?

Unzipping part of a .gz file using python

So here's the problem. I have a sample.gz file which is roughly 60 KB in size. I want to decompress the first 2000 bytes of this file. I am running into a "CRC check failed" error, I guess because the gzip CRC field appears at the end of the file and requires the entire gzipped file to decompress. Is there a way to get around this? I don't care about the CRC check. Even if I fail to decompress because of a bad CRC, that is OK. Is there a way to get around this and unzip partial .gz files?
The code I have so far is
import gzip
import StringIO
file = open('sample.gz', 'rb')
mybuf = StringIO.StringIO(file.read(2000))
f = gzip.GzipFile(fileobj=mybuf)
data = f.read()
print data
The error encountered is
File "gunzip.py", line 27, in ?
data = f.read()
File "/usr/local/lib/python2.4/gzip.py", line 218, in read
self._read(readsize)
File "/usr/local/lib/python2.4/gzip.py", line 273, in _read
self._read_eof()
File "/usr/local/lib/python2.4/gzip.py", line 309, in _read_eof
raise IOError, "CRC check failed"
IOError: CRC check failed
Also, is there any way to use the zlib module to do this and ignore the gzip headers?
The issue with the gzip module is not that it can't decompress the partial file, the error occurs only at the end when it tries to verify the checksum of the decompressed content. (The original checksum is stored at the end of the compressed file so the verification will never, ever work with a partial file.)
The key is to trick gzip into skipping the verification. The answer by caesar0301 does this by modifying the gzip source code, but it's not necessary to go that far, simple monkey patching will do. I wrote this context manager to temporarily replace gzip.GzipFile._read_eof while I decompress the partial file:
import contextlib
import gzip
@contextlib.contextmanager
def patch_gzip_for_partial():
    """
    Context manager that replaces gzip.GzipFile._read_eof with a no-op.
    This is useful when decompressing partial files, something that won't
    work if GzipFile does its checksum comparison.
    """
    _read_eof = gzip.GzipFile._read_eof
    gzip.GzipFile._read_eof = lambda *args, **kwargs: None
    try:
        yield
    finally:
        gzip.GzipFile._read_eof = _read_eof
An example usage:
from cStringIO import StringIO
with patch_gzip_for_partial():
    decompressed = gzip.GzipFile(fileobj=StringIO(compressed)).read()
It seems that you need to look into the Python zlib library instead.
The GZIP format relies on zlib but introduces a file-level compression concept along with CRC checking, and this appears to be what you do not want/need at the moment.
See for example these code snippets from Doug Hellmann.
Edit: the code on Doug Hellmann's site only shows how to compress or decompress with zlib. As indicated above, GZIP is "zlib with an envelope", and you'll need to decode the envelope before getting to the zlib-compressed data per se. Here's more info on how to go about it; it's really not that complicated:
see RFC 1952 for details about the GZIP format
This format starts with a 10-byte header, followed by optional, non-compressed elements such as the file name or a comment, followed by the zlib-compressed data, itself followed by a CRC-32 checksum and the length of the original uncompressed data.
By using Python's struct module, parsing the header should be relatively simple.
The zlib sequence (or its first few thousand bytes, since that is what you want to do) can then be decompressed with Python's zlib module, as shown in the examples above; see also the sketch below.
Possible problems to handle: if there is more than one file in the GZip archive, and if the second file starts within the block of a few thousand bytes we wish to decompress.
Sorry to provide neither a simple procedure nor a ready-to-go snippet; however, decoding the file with the indications above should be relatively quick and simple.
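As a rough, hedged sketch of those indications (it assumes the optional FEXTRA/FNAME/FCOMMENT header fields are absent; real .gz files often carry a file name, so check the flags byte first):
import struct
import zlib
with open('sample.gz', 'rb') as f:
    raw = f.read(2000)
# RFC 1952 fixed 10-byte header: magic, method, flags, mtime, xfl, os
magic, method, flags, mtime, xfl, os_byte = struct.unpack('<HBBIBB', raw[:10])
assert magic == 0x8b1f and method == 8  # gzip magic; deflate
assert flags == 0  # no optional header fields, per the assumption above
# negative wbits = raw deflate stream: no envelope parsing, no CRC check
data = zlib.decompressobj(-zlib.MAX_WBITS).decompress(raw[10:])
print data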
I can't see any possible reason why you would want to decompress the first 2000 compressed bytes. Depending on the data, this may uncompress to any number of output bytes.
Surely you want to uncompress the file, and stop when you have uncompressed as much of the file as you need, something like:
f = gzip.GzipFile(fileobj=open('postcode-code.tar.gz', 'rb'))
data = f.read(4000)
print data
AFAIK, this won't cause the whole file to be read. It will only read as much as is necessary to get the first 4000 bytes.
I also encountered this problem when I used my Python script to read compressed files generated by the gzip tool under Linux, and the original files were lost.
By reading the implementation of gzip.py in Python, I found that gzip.GzipFile has methods similar to those of the File class and uses Python's zlib module to de/compress the data. At the same time, the _read_eof() method is there to check the CRC of each file.
But in some situations, like processing a stream or a .gz file without a correct CRC (my problem), an IOError("CRC check failed") will be raised by _read_eof(). Therefore, I tried modifying the gzip module to disable the CRC check, and the problem disappeared.
def _read_eof(self):
    pass
https://github.com/caesar0301/PcapEx/blob/master/live-scripts/gzip_mod.py
I know it's a brute-force solution, but it saves you from rewriting some low-level methods yourself using the zlib module, like reading data chunk by chunk from the zipped files and extracting the data line by line, most of which is already present in the gzip module.
Jamin
