pickle.load Not Working - python

I got a file from a Windows user that contains a data structure with test results. He created this file using the pickle.dump command. On Ubuntu, I tried to load these test results with the following program:
import pickle
import my_module
f = open('results', 'r')
print pickle.load(f)
f.close()
But I get an error from inside the pickle module saying there is no module named "my_module".
Could the problem be due to corruption in the file, or is moving from Windows to Linux the cause?

The problem lies in pickle's way of handling newline characters. Some of the line feed characters cripple module names in dumped / loaded data.
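To illustrate how newline handling can turn into a "no module named" error, here is a minimal sketch. It assumes Python 3 and imitates the Windows text-mode translation by hand (replacing \n with \r\n); collections.OrderedDict is used only as a convenient stand-in for a class from my_module.

```python
import pickle
from collections import OrderedDict

# Protocol 0 is ASCII-based: module and attribute names are stored as
# newline-terminated lines inside the pickle.
payload = pickle.dumps(OrderedDict, protocol=0)

# Imitate what text-mode writing on Windows does to the byte stream.
mangled = payload.replace(b"\n", b"\r\n")

try:
    pickle.loads(mangled)
    error = None
except ImportError as exc:
    # The module name now carries a trailing "\r", so the import fails.
    error = exc

print(repr(payload))
print(type(error).__name__, error)
```

The untouched payload still loads fine; only the copy with translated line endings fails.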
Storing and loading files in binary mode may help, but I was having trouble with them too. After a long time reading docs and searching I found that pickle handles several different "protocols" for storing data and due to backward compatibility it uses the oldest one: protocol 0 - the original ASCII protocol.
You can select a more modern protocol by specifying the protocol keyword when writing the dump file, something like this:
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=2)
or by choosing the highest protocol available (2 on Python 2; Python 3 releases define higher ones):
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=pickle.HIGHEST_PROTOCOL)
The protocol version is stored in the dump file, so pickle.load() handles it automatically.
Regards
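A minimal round-trip sketch of the advice above, assuming Python 3. The dictionary and the temporary file name are made up for illustration; the point is only that both open() calls use binary mode and that the protocol is passed on the writing side.

```python
import os
import pickle
import tempfile

# Hypothetical test results, standing in for the real data structure.
results = {"passed": 12, "failed": 1, "names": ["test_a", "test_b"]}

path = os.path.join(tempfile.mkdtemp(), "results.pkl")

# Write in binary mode with an explicit protocol.
with open(path, "wb") as f:
    pickle.dump(results, f, protocol=pickle.HIGHEST_PROTOCOL)

# Read in binary mode; the protocol is detected from the file itself.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == results)
```

If the pickled objects are instances of classes defined in my_module, that module still has to be importable on the machine that loads the file.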

You should open the pickled file in binary mode, especially if you are using pickle on different platforms. See this and this questions for an explanation.

Related

How to unpickle pickle extension file

I have downloaded a pickle file:
foo.pickle.gz.pickle
The page I downloaded this file from describes decompressing it to .pickle. I searched for Python pickle and found many pages that describe how to use it in Python, but not system-wide. How can I decompress or unzip it? I am using Ubuntu 16.04.
Thanks in advance!
Pickle is the name of Python's object serialisation module, so you have to 'unpickle' the file with a Python script. The basic syntax is:
import pickle
with open('filename', 'rb') as pickled_one:
    data = pickle.load(pickled_one)
More details are available here, on official Python documentation.
I do have to warn you about this, from that same page:
The pickle module is not secure against erroneous or maliciously
constructed data. Never unpickle data received from an untrusted or
unauthenticated source.
Pickle object can only be deserialized in python. You can't use non-python environments to deserialize the object. Please see the official page
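The ".gz" in the file name suggests the pickle may be gzip-compressed, although that is an assumption about this particular download. If so, a sketch like the following reads it without a separate unzip step. It first writes a small gzipped pickle to a temporary file so that it runs on its own; the file name and contents are illustrative only.

```python
import gzip
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "foo.pickle.gz")
original = {"answer": 42}  # placeholder object

# Create a gzip-compressed pickle to read back.
with gzip.open(path, "wb") as f:
    pickle.dump(original, f)

# gzip.open returns a binary file object that pickle.load can read directly.
with gzip.open(path, "rb") as f:
    data = pickle.load(f)

print(data)
```

The warning above still applies: only do this with files from a source you trust.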
If the file contains multiple pickled objects, note that the answers above only unpickle the first one. To read them all, use:
import pickle
pickle_list = []
pickle_file = open(file_name, 'rb')
while True:
    try:
        pickle_list.append(pickle.load(pickle_file))
    except EOFError:
        break
pickle_file.close()
The try and except are inside the while loop.

Decompressing bz2 files on Windows

I am trying to decompress a bz2 file with below code snippet which is provided in various places:
bz2_data = bz2.BZ2File(DATA_FILE+".bz2").read()
open(DATA_FILE, 'wb').write(bz2_data)
However, I am getting a much smaller file than I expect.
When I extract the file with 7z GUI I am receiving a file with a size of 248MB. However, with above code the file I get is 879kb.
When I read the extracted XML file, I can see that rest of the file is missing as I expect.
I am running Anaconda on a Windows machine, and as far as I understand, bz2 reaches an EOF before the file actually ends.
By the way, I have already tried this and this; neither helped.
If this is a multi-stream file, then Python's bz2 module (before 3.3) doesn't support it:
Note This class does not support input files containing multiple streams (such as those produced by the pbzip2 tool). When reading such an input file, only the first stream will be accessible. If you require support for multi-stream files, consider using the third-party bz2file module (available from PyPI). This module provides a backport of Python 3.3’s BZ2File class, which does support multi-stream files.
An alternative, drop-in replacement: bz2file should work though.
If it is a multistream file, you have to set mode to "r" or it will silently fail (e.g. output the compressed data as is).
This should do what you want:
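from bz2 import BZ2File  # multi-stream reading needs Python 3.3+
with open(out_file_path, "wb") as out_file, BZ2File(bz2_file_path, "r") as bz2_file:
    # copy in 100 KiB chunks until read() returns b""
    for data in iter(lambda: bz2_file.read(100 * 1024), b""):
        out_file.write(data)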
From the documentation:
If mode is 'r', the input file may be the concatenation of multiple compressed streams.
https://docs.python.org/3/library/bz2.html#bz2.BZ2File
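To see the single-stream versus multi-stream difference in isolation, here is a small in-memory sketch, assuming Python 3.3 or later. Two independently compressed chunks are concatenated to imitate what a tool like pbzip2 produces; the sample strings are made up.

```python
import bz2

first = bz2.compress(b"first stream\n")
second = bz2.compress(b"second stream\n")
multi = first + second  # a two-stream file, held in memory

# A single BZ2Decompressor stops at the end of the first stream;
# whatever follows ends up in unused_data.
single = bz2.BZ2Decompressor()
head = single.decompress(multi)
leftover = single.unused_data

# bz2.decompress() continues through every stream.
everything = bz2.decompress(multi)

print(len(head), len(everything))
```

A truncated result like the one in the question is what you would expect from a reader that behaves like the single decompressor here.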

read() from a ExFileObject always cause StreamError exception

I am trying to read only one file from a tar.gz file. All operations on the tarfile object work fine, but whenever I read from a specific member, a StreamError is raised. Check this code:
import tarfile
fd = tarfile.open('file.tar.gz', 'r|gz')
for member in fd.getmembers():
    if not member.isfile():
        continue
    cfile = fd.extractfile(member)
    print cfile.read()
    cfile.close()
fd.close()
cfile.read() always causes "tarfile.StreamError: seeking backwards is not allowed"
I need to read contents to mem, not dumping to file (extractall works fine)
Thank you!
The problem is this line:
fd = tarfile.open('file.tar.gz', 'r|gz')
You don't want 'r|gz', you want 'r:gz'.
If I run your code on a trivial tarball, I can even print out the member and see test/foo, and then I get the same error on read that you get.
If I fix it to use 'r:gz', it works.
From the docs:
mode has to be a string of the form 'filemode[:compression]'
...
For special purposes, there is a second format for mode: 'filemode|[compression]'. tarfile.open() will return a TarFile object that processes its data as a stream of blocks. No random seeking will be done on the file… Use this variant in combination with e.g. sys.stdin, a socket file object or a tape device. However, such a TarFile object is limited in that it does not allow to be accessed randomly, see Examples.
'r|gz' is meant for when you have a non-seekable stream, and it only provides a subset of the operations. Unfortunately, it doesn't seem to document exactly which operations are allowed—and the link to Examples doesn't help, because none of the examples use this feature. So, you have to either read the source, or figure it out through trial and error.
But, since you have a normal, seekable file, you don't have to worry about that; just use 'r:gz'.
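A self-contained sketch of the 'r:gz' fix, assuming Python 3. It builds a tiny archive in a temporary directory first (the member name test/foo and its content are placeholders), then reads every regular file into memory.

```python
import io
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "file.tar.gz")

# Build a tiny archive to read back.
content = b"hello from the tarball\n"
with tarfile.open(archive, "w:gz") as tar:
    info = tarfile.TarInfo(name="test/foo")
    info.size = len(content)
    tar.addfile(info, io.BytesIO(content))

# 'r:gz' allows random access, so getmembers() followed by read() is fine.
contents = {}
with tarfile.open(archive, "r:gz") as tar:
    for member in tar.getmembers():
        if not member.isfile():
            continue
        contents[member.name] = tar.extractfile(member).read()

print(contents)
```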
In addition to the file mode, I attempted to seek on a network stream.
I had the same error when trying to requests.get the file, so I extracted all to a tmp directory:
# stream == requests.get
inputs = [tarfile.open(fileobj=LZMAFile(stream), mode='r|')]
t = "/tmp"
for tarfileobj in inputs:
    tarfileobj.extractall(path=t, members=None)
for fn in os.listdir(t):
    with open(os.path.join(t, fn)) as payload:
        print(payload.read())
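For a genuinely non-seekable source, an alternative to extracting to a temporary directory is to keep the 'r|' stream mode but read each member as soon as it is reached, instead of calling getmembers() first. The sketch below assumes Python 3 and uses an in-memory gzip-compressed tar as a stand-in for the network response; member names and contents are made up.

```python
import io
import tarfile

# Build an in-memory .tar.gz standing in for a downloaded stream.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w:gz") as tar:
    for name, body in [("a.txt", b"alpha"), ("b.txt", b"beta")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(body)
        tar.addfile(info, io.BytesIO(body))
raw.seek(0)

# In 'r|gz' mode, iterate over the archive and read each member before
# moving on, so the stream is only ever consumed in the forward direction.
payloads = {}
with tarfile.open(fileobj=raw, mode="r|gz") as tar:
    for member in tar:
        if member.isfile():
            payloads[member.name] = tar.extractfile(member).read()

print(payloads)
```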

How can I work with Gzip files which contain extra data?

I'm writing a script which will work with data coming from instrumentation as gzip streams. In about 90% of cases, the gzip module works perfectly, but some of the streams cause it to produce IOError: Not a gzipped file. If the gzip header is removed and the deflate stream fed directly to zlib, I instead get Error -3 while decompressing data: incorrect header check. After about half a day of banging my head against the wall, I discovered that the streams which are having problems contain a seemingly-random number of extra bytes (which are not part of the gzip data) appended to the end.
It strikes me as odd that Python cannot work with these files for two reasons:
Both Gzip and 7zip are able to open these "padded" files without issue. (Gzip produces the message decompression OK, trailing garbage ignored, 7zip succeeds silently.)
Both the Gzip and Python docs seem to indicate that this should work: (emphasis mine)
Gzip's format.txt:
It must be possible to
detect the end of the compressed data with any compression method,
regardless of the actual size of the compressed data. In particular,
the decompressor must be able to detect and skip extra data appended
to a valid compressed file on a record-oriented file system, or when
the compressed data can only be read from a device in multiples of a
certain block size.
Python's gzip.GzipFile:
Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.
Python's zlib.Decompress.unused_data:
A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
Here are the four approaches I've tried. (These examples are Python 3.1, but I've tested 2.5 and 2.7 and had the same problem.)
# approach 1 - gzip.open
with gzip.open(filename) as datafile:
    data = datafile.read()
# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()
# approach 3 - zlib.decompress
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:])
# approach 4 - zlib.decompressobj
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(gzipfile.read()[10:])
Am I doing something wrong?
UPDATE
Okay, while the problem with gzip seems to be a bug in the module, my zlib problems are self-inflicted. ;-)
While digging into gzip.py I realized what I was doing wrong: by default, zlib.decompress et al. expect zlib-wrapped streams, not bare deflate streams. By passing in a negative value for wbits, you can tell zlib to skip the zlib header and decompress the raw stream. Both of these work:
# approach 5 - zlib.decompress with negative wbits
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)
# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
This is a bug. The quality of the gzip module in Python falls far short of the quality that should be required in the Python standard library.
The problem here is that the gzip module assumes that the file is a stream of gzip-format files. At the end of the compressed data, it starts from scratch, expecting a new gzip header; if it doesn't find one, it raises an exception. This is wrong.
Of course, it is valid to concatenate two gzip files, eg:
echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing
The gzip module's error is that it should not raise an exception if there's no gzip header the second time around; it should simply end the file. It should only raise an exception if there's no header the first time.
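The behaviour described here (fail only if the first header is missing, otherwise stop quietly) can be sketched outside the gzip module using zlib decompression objects. This is an illustration under assumptions, not a patch to gzip.py: it requires Python 3, holds the whole input in memory, and the function name gunzip_lenient is made up.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"

def gunzip_lenient(blob):
    """Decompress concatenated gzip members, ignoring trailing non-gzip bytes."""
    output = []
    remaining = blob
    first = True
    while remaining:
        if not remaining.startswith(GZIP_MAGIC):
            if first:
                raise OSError("Not a gzipped file")
            break  # extra bytes after a valid member: stop quietly
        decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect a gzip wrapper
        output.append(decoder.decompress(remaining))
        remaining = decoder.unused_data
        first = False
    return b"".join(output)

member = gzip.compress(b"testing\n")
padded = member + member + b"\x00\x00\x00 extra bytes"
result = gunzip_lenient(padded)
print(result)
```

If the trailing bytes happen to begin with the gzip magic number, this sketch will try to decode them and zlib will raise an error, so real instrumentation data may need a stricter check.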
There's no clean workaround without modifying the gzip module directly; if you want to do that, look at the bottom of the _read method. It should set another flag, eg. reading_second_block, to tell _read_gzip_header to raise EOFError instead of IOError.
There are other bugs in this module. For example, it seeks unnecessarily, causing it to fail on nonseekable streams, such as network sockets. This gives me very little confidence in this module: a developer who doesn't know that gzip needs to function without seeking is badly unqualified to implement it for the Python standard library.
I had a similar problem in the past. I wrote a new module that works better with streams. You can try that out and see if it works for you.
I had exactly this problem, but none of these answers resolved my issue. So, here is what I did to solve the problem:
#for gzip files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|16)
#for zlib files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS)
#automatic header detection (zlib or gzip):
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS|32)
Depending on your case, it might be necessary to decode your data, like:
unzipped = unzipped.decode()
https://docs.python.org/3/library/zlib.html
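A short sketch of the automatic header detection, assuming Python 3. It uses a decompression object rather than zlib.decompress so that any bytes following the compressed stream are visible in unused_data; the helper name inflate_any and the sample payload are illustrative.

```python
import gzip
import zlib

def inflate_any(blob):
    """Decode a zlib- or gzip-wrapped stream; return (data, trailing bytes)."""
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 32)  # auto-detect the wrapper
    data = decoder.decompress(blob)
    return data, decoder.unused_data

payload = b"instrument reading 42\n"

from_gzip = inflate_any(gzip.compress(payload) + b"PADDING")
from_zlib = inflate_any(zlib.compress(payload))

print(from_gzip)
print(from_zlib)
```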
I couldn't get it to work with the techniques mentioned above, so I made a workaround using the zipfile package:
import zipfile
from io import BytesIO
mock_file = BytesIO(data) #data is the compressed string
z = zipfile.ZipFile(file = mock_file)
neat_data = z.read(z.namelist()[0])
Works perfectly.

Unzipping part of a .gz file using python

So here's the problem. I have sample.gz file which is roughly 60KB in size. I want to decompress the first 2000 bytes of this file. I am running into CRC check failed error, I guess because the gzip CRC field appears at the end of file, and it requires the entire gzipped file to decompress. Is there a way to get around this? I don't care about the CRC check. Even if I fail to decompress because of bad CRC, that is OK. Is there a way to get around this and unzip partial .gz files?
The code I have so far is
import gzip
import time
import StringIO
file = open('sample.gz', 'rb')
mybuf = MyBuffer(file)
mybuf = StringIO.StringIO(file.read(2000))
f = gzip.GzipFile(fileobj=mybuf)
data = f.read()
print data
The error encountered is
File "gunzip.py", line 27, in ?
data = f.read()
File "/usr/local/lib/python2.4/gzip.py", line 218, in read
self._read(readsize)
File "/usr/local/lib/python2.4/gzip.py", line 273, in _read
self._read_eof()
File "/usr/local/lib/python2.4/gzip.py", line 309, in _read_eof
raise IOError, "CRC check failed"
IOError: CRC check failed
Also is there any way to use zlib module to do this and ignore the gzip headers?
The issue with the gzip module is not that it can't decompress the partial file, the error occurs only at the end when it tries to verify the checksum of the decompressed content. (The original checksum is stored at the end of the compressed file so the verification will never, ever work with a partial file.)
The key is to trick gzip into skipping the verification. The answer by caesar0301 does this by modifying the gzip source code, but it's not necessary to go that far, simple monkey patching will do. I wrote this context manager to temporarily replace gzip.GzipFile._read_eof while I decompress the partial file:
import contextlib
import gzip
@contextlib.contextmanager
def patch_gzip_for_partial():
    """
    Context manager that replaces gzip.GzipFile._read_eof with a no-op.
    This is useful when decompressing partial files, something that won't
    work if GzipFile does its checksum comparison.
    """
    _read_eof = gzip.GzipFile._read_eof
    gzip.GzipFile._read_eof = lambda *args, **kwargs: None
    try:
        yield
    finally:
        # restore the original method even if decompression raises
        gzip.GzipFile._read_eof = _read_eof
An example usage:
from cStringIO import StringIO
with patch_gzip_for_partial():
    # pass the buffer as fileobj; the first positional argument is a filename
    decompressed = gzip.GzipFile(fileobj=StringIO(compressed)).read()
It seems that you need to look into Python's zlib library instead.
The GZIP format relies on zlib, but introduces a file-level compression concept along with CRC checking, and this appears to be what you do not want/need at the moment.
See for example these code snippets from Doug Hellmann
Edit: the code on Doug Hellmann's site only shows how to compress or decompress with zlib. As indicated above, GZIP is "zlib with an envelope", and you'll need to decode the envelope before getting to the compressed data itself. Here is more information on how to go about it; it's really not that complicated:
see RFC 1952 for details about the GZIP format
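This format starts with a 10-byte header, followed by optional, non-compressed elements such as the file name or a comment, followed by the deflate-compressed data, itself followed by a CRC-32 of the uncompressed data and its size (ISIZE). Note that this trailer is a true CRC-32, not the Adler-32 checksum used by the zlib wrapper format.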
By using Python's struct module, parsing the header should be relatively simple
The compressed sequence (or its first few thousand bytes, since that is what you want to do) can then be decompressed with Python's zlib module, as shown in the examples above. Because it is a raw deflate stream without a zlib header, pass a negative wbits value.
Possible problems to handle: if there are more than one file in the GZip archive, and if the second file starts within the block of a few thousand bytes we wish to decompress.
Sorry to provide neither a simple procedure nor a ready-to-go snippet; however, decoding the file with the indications above should be relatively quick and simple.
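The steps listed above can be sketched roughly as follows, assuming Python 3 and a gzip file whose header has no optional fields (FLG equal to 0). The input is generated in memory as a stand-in for sample.gz, and a file with a stored name or comment would need extra header parsing that is not shown here.

```python
import gzip
import struct
import zlib

# Stand-in for sample.gz: compress some mildly varied data in memory.
original = b"".join(b"row %d\n" % i for i in range(20000))
compressed = gzip.compress(original)

chunk = compressed[:2000]  # only the first 2000 compressed bytes

# Fixed 10-byte header from RFC 1952: magic, method, flags, mtime, xfl, os.
magic, method, flags, mtime, xfl, os_code = struct.unpack("<2sBBIBB", chunk[:10])
if magic != b"\x1f\x8b" or method != 8:
    raise ValueError("not a gzip/deflate stream")
if flags != 0:
    raise NotImplementedError("optional header fields are not handled in this sketch")

# Negative wbits selects a raw deflate stream: no header, no checksum.
inflater = zlib.decompressobj(-zlib.MAX_WBITS)
partial = inflater.decompress(chunk[10:])

print(len(partial), inflater.eof)
```

Because the trailer is never reached, no CRC comparison takes place, and the result is simply however much of the original data the first 2000 compressed bytes encode.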
I can't see any possible reason why you would want to decompress the first 2000 compressed bytes. Depending on the data, this may uncompress to any number of output bytes.
Surely you want to uncompress the file, and stop when you have uncompressed as much of the file as you need, something like:
f = gzip.GzipFile(fileobj=open('postcode-code.tar.gz', 'rb'))
data = f.read(4000)
print data
AFAIK, this won't cause the whole file to be read. It will only read as much as is necessary to get the first 4000 bytes.
I also ran into this problem when using a Python script to read compressed files that had been generated by the gzip tool under Linux, after the original files were lost.
Reading the implementation of Python's gzip.py, I found that gzip.GzipFile offers methods similar to those of the File class and uses Python's zlib module to do the actual compression and decompression. It also has a _read_eof() method that checks the CRC of each file.
But in some situations, such as processing a stream or a .gz file without a correct CRC (my problem), _read_eof() raises IOError("CRC check failed"). So I modified the gzip module to disable the CRC check, and the problem finally disappeared.
def _read_eof(self):
    pass
https://github.com/caesar0301/PcapEx/blob/master/live-scripts/gzip_mod.py
I know it's a brute-force solution, but it saves a lot of time compared with rewriting low-level methods yourself on top of the zlib module, such as reading data chunk by chunk from the compressed files and extracting it line by line, most of which is already present in the gzip module.
Jamin
