Preferred (or most common) file extension for a Python pickle

At times, I've seen .pickle, .pck, .pcl, and .db used for files that contain Python pickles, but I am unsure which is most common or considered best practice. I know that the latter three extensions are also used for other things.
The related question is: What MIME type is preferred for sending pickles between systems using a REST API?

Python 2
The Python 2 documentation uses the following when serializing (i.e. writing to a pickle file):
output = open('data.pkl', 'wb')
I would choose .pkl as the extension when using Python 2.
Python 3
The example in the Python 3 documentation now uses .pickle as the file extension for serialization:
with open('data.pickle', 'wb') as f:
    pickle.dump(...)
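For completeness, a minimal sketch of reading the object back; nothing beyond the stdlib pickle module is assumed:
import pickle

# Deserialize (unpickle) the object written above.
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)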
The preferred MIME type for sending pickles, from martineau's comment below, is:
application/octet-stream
See What is the HTTP "content-type" to use for a blob of bytes?
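As an illustration, a hedged sketch of POSTing a pickle with that MIME type using the third-party requests library; the endpoint URL is a placeholder, not a real API:
import pickle
import requests

# Serialize with the newest protocol the sender supports.
payload = pickle.dumps({'example': 42}, protocol=pickle.HIGHEST_PROTOCOL)
# application/octet-stream signals an opaque blob of bytes.
response = requests.post('https://example.com/api/results',  # hypothetical endpoint
                         data=payload,
                         headers={'Content-Type': 'application/octet-stream'})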

Related

How to unpickle pickle extension file

I have downloaded a pickle file:
foo.pickle.gz.pickle
The page I downloaded this file from describes decompressing it to .pickle. I have searched around; there are many pages describing how to use pickle within Python, but not how to handle such a file system-wide. How can I decompress or unzip it? I am using Ubuntu 16.04.
Thanks in advance!
Pickle is the name of Python's object serialisation module, so you have to 'unpickle' the file with a Python script. The basic syntax is:
import pickle
with open('filename', 'rb') as pickled_one:
    data = pickle.load(pickled_one)
More details are available here, in the official Python documentation.
I do have to warn you about this, from that same page:
The pickle module is not secure against erroneous or maliciously
constructed data. Never unpickle data received from an untrusted or
unauthenticated source.
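Since the file in the question is gzip-compressed, you can also decompress and unpickle in one step. A minimal sketch, assuming the trailing .pickle in the name is just a naming quirk and the contents are an ordinary gzipped pickle:
import gzip
import pickle

# gzip.open decompresses on the fly; pickle then reads the plain pickle stream.
with gzip.open('foo.pickle.gz.pickle', 'rb') as f:
    data = pickle.load(f)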
A pickled object can only be deserialized in Python; you can't deserialize it in non-Python environments. Please see the official page.
If the file contains multiple pickled objects, note that the answers above only unpickle one object each. To load them all:
pickle_list = []
pickle_file = open(file_name, 'rb')
while True:
    try:
        pickle_list.append(pickle.load(pickle_file))
    except EOFError:
        break
pickle_file.close()
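For reference, a sketch of how such a multi-object file is typically produced in the first place, via repeated dump calls on one handle; objects here is a hypothetical iterable of picklable values:
import pickle

with open(file_name, 'wb') as f:
    for obj in objects:  # `objects` is a hypothetical iterable of picklable values
        pickle.dump(obj, f)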

Decompressing bz2 files on Windows

I am trying to decompress a bz2 file with the below code snippet, which is provided in various places:
import bz2

bz2_data = bz2.BZ2File(DATA_FILE + ".bz2").read()
open(DATA_FILE, 'wb').write(bz2_data)
However, I am getting a much smaller file than I expect.
When I extract the file with the 7z GUI, I get a file of 248 MB. However, with the above code the file I get is only 879 KB.
When I read the extracted XML file, I can see that the rest of the file is missing, as I expected.
I am running Anaconda on a Windows machine, and as far as I understand, bz2 reaches an EOF before the file actually ends.
By the way, I have already run into this and this; both did no good.
If this is a multi-stream file, then Python's bz2 module (before 3.3) doesn't support it:
Note This class does not support input files containing multiple streams (such as those produced by the pbzip2 tool). When reading such an input file, only the first stream will be accessible. If you require support for multi-stream files, consider using the third-party bz2file module (available from PyPI). This module provides a backport of Python 3.3’s BZ2File class, which does support multi-stream files.
The drop-in replacement bz2file should work, though.
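A minimal sketch using that bz2file backport (installed with pip install bz2file); its BZ2File class mirrors Python 3.3's, including multi-stream support:
import bz2file

# Unlike the Python 2 stdlib class, this reads all streams, not just the first.
with bz2file.BZ2File(DATA_FILE + ".bz2", "rb") as f:
    bz2_data = f.read()
with open(DATA_FILE, "wb") as out:
    out.write(bz2_data)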
If it is a multi-stream file, you have to set mode to "r" or it will silently fail (e.g. output the compressed data as-is).
This should do what you want:
from bz2 import BZ2File

with open(out_file_path, "wb") as out_file, BZ2File(bz2_file_path, "r") as bz2_file:
    for data in iter(lambda: bz2_file.read(100 * 1024), b""):
        out_file.write(data)
From the documentation:
If mode is 'r', the input file may be the concatenation of multiple compressed streams.
https://docs.python.org/3/library/bz2.html#bz2.BZ2File

Python cannot read "warc.gz" file completely

For my work, I scrape websites and write them to gzipped web archives (with the extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library.
I noticed that for the majority of files I cannot read them completely with the warc library. For example, if a warc.gz file has 517 records, I can read only about 200 of them.
After some research I found out that this problem happens only with the gzipped files. Files with the extension "warc" do not have this problem.
I have found that some people have this problem as well (https://github.com/internetarchive/warc/issues/21), but no solution for it has been found.
I guess that there might be a bug in "gzip" in Python 2.7.11. Does anyone have experience with this, and know what can be done about this problem?
Thanks in advance!
Example:
I create new warc.gz files like this:
import warc
warc_path = r"\\some_path\file_name.warc.gz"  # raw string so "\f" is not treated as an escape
warc_file = warc.open(warc_path, "wb")
To write records I use:
record = warc.WARCRecord(payload=value, headers=headers)
warc_file.write_record(record)
This creates perfect "warc.gz" files. There are no problems with them. Everything, including the "\r\n" delimiters, is correct. But the problem starts when I read these files.
To read files I use:
warc_file = warc.open(warc_path, "rb")
To loop through records I use:
for record in warc_file:
    ...
The problem is that not all records are found during this looping for "warc.gz" file, while they all are found for "warc" files. Working with both types of files is addressed in the warc-library itself.
It seems that the custom gzip handling in warc.gzip2.GzipFile, file splitting with warc.utils.FilePart and reading in warc.warc.WARCReader is broken as a whole (tested with python 2.7.9, 2.7.10 and 2.7.11). It stops short when it receives no data instead of a new header.
It would seem that the basic stdlib gzip handles the concatenated streams just fine, so this should work as well:
import gzip
import warc
with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()

Decompress remote .gz file in Python

I have an issue with Python.
My case: I have a gzipped file from a partner platform (i.e. h..p//....namesite.../xxx).
If I click the link in my browser, it downloads a file (i.e. namefile.xml.gz).
So... if I read this file locally with Python, I can decompress and read it.
Code:
import gzip

content = gzip.open('namefile.xml.gz', 'rb')
print content.read()
But I can't if I try to read the file from the remote source.
From the remote file I can only read the compressed bytes, not decode them.
Code:
import urllib2

response = urllib2.urlopen(url)
encoded = response.read()
print encoded
With this code I can read the compressed bytes... but I can't decode them with gzip or zlib.
Any advice?
Thanks a lot
Unfortunately the method @Aya suggests below does not work, since GzipFile extensively uses the seek method of the file object (which the response object does not support).
So you basically have two options:
1. Read the contents of the remote file into io.BytesIO and pass that object to gzip.GzipFile (if the file is small); see the sketch below.
2. Download the file into a temporary file on disk and use gzip.open.
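A sketch of the first option, written in the Python 2 style of the question; the whole payload is buffered in memory, so this only suits small files:
import io
import gzip
import urllib2

response = urllib2.urlopen(url)  # `url` as in the question
buf = io.BytesIO(response.read())  # seekable in-memory copy of the payload
content = gzip.GzipFile(fileobj=buf).read()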
There is another option (which requires some coding): implement your own streaming reader using the zlib module. It is rather easy, but you will need to know about a magic constant (How can I decompress a gzip stream with zlib?); see the second sketch below.
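A sketch of that zlib approach, again in Python 2 style; the wbits value of 16 + zlib.MAX_WBITS is the magic constant in question:
import urllib2
import zlib

response = urllib2.urlopen(url)  # `url` as in the question
# 16 + zlib.MAX_WBITS tells zlib to expect a gzip header and trailer
# instead of a raw zlib stream.
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
content = b""
while True:
    chunk = response.read(8192)
    if not chunk:
        break
    content += decompressor.decompress(chunk)
content += decompressor.flush()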
If you use Python 3.2 or later, the bug in GzipFile (requiring tell support) is fixed, but they apparently aren't going to backport the fix to Python 2.x.
For Python v3.2 or later (where GzipFile no longer requires a seekable file object), you can use the gzip.GzipFile class to wrap the file object returned by urllib2.urlopen(), with something like this (written in Python 2 syntax to match the question)...
import urllib2
import gzip
response = urllib2.urlopen(url)
gunzip_response = gzip.GzipFile(fileobj=response)
content = gunzip_response.read()
print content
...which will transparently decompress the response stream as you read it.

pickle.load Not Working

I got a file that contains a data structure with test results from a Windows user. He created this file using the pickle.dump command. On Ubuntu, I tried to load these test results with the following program:
import pickle
import my_module
f = open('results', 'r')
print pickle.load(f)
f.close()
But I get an error inside the pickle module: no module named "my_module".
Could the problem be due to corruption in the file, or maybe moving from Windows to Linux is the cause?
The problem lies in pickle's way of handling newline characters. Some of the line-feed characters corrupt module names in the dumped / loaded data.
Storing and loading files in binary mode may help, but I was having trouble with that too. After a long time reading docs and searching, I found that pickle supports several different "protocols" for storing data and, due to backward compatibility, it uses the oldest one: protocol 0, the original ASCII protocol.
You can select a modern protocol by specifying the protocol keyword when dumping the data, something like this:
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=2)
or by choosing the highest protocol available (currently 2 in Python 2):
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=pickle.HIGHEST_PROTOCOL)
The protocol version is stored in the dump file, so pickle.load() handles it automatically.
Regards
You should open the pickled file in binary mode, especially if you are using pickle on different platforms. See this and this for an explanation.
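Concretely, a minimal corrected version of the question's snippet (same names as above; only the mode changes):
import pickle
import my_module  # the module defining the pickled classes must still be importable

with open('results', 'rb') as f:  # 'rb' instead of 'r'
    print pickle.load(f)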
