How to unpickle pickle extension file - python

I have downloaded a pickle file:
foo.pickle.gz.pickle
The page from which I downloaded this file says to decompress it to .pickle. I searched for information about Python's pickle; there are many pages describing how to use it inside Python, but not how to open such a file outside of Python. How can I decompress or unzip it? I am using Ubuntu 16.04.
Thanks in advance!

Pickle is the name of Python's object-serialisation module, so you have to 'unpickle' the file with a Python script. The basic syntax is:
import pickle

with open('filename', 'rb') as pickled_one:
    data = pickle.load(pickled_one)
More details are available here, on official Python documentation.
I do have to warn you about this, from that same page:
The pickle module is not secure against erroneous or maliciously
constructed data. Never unpickle data received from an untrusted or
unauthenticated source.
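Since the downloaded file has a .gz layer, gzip.open can be combined with pickle.load directly. A minimal sketch, assuming the file really is a gzip-compressed pickle (the first write step just creates a small demo file so the example runs standalone; with the real download you would only do the read step):

```python
import gzip
import pickle

# For illustration only: create a small gzip-compressed pickle that mimics
# the downloaded file. With the real file, skip this step.
with gzip.open('foo.pickle.gz.pickle', 'wb') as f:
    pickle.dump({'answer': 42}, f)

# gzip.open returns a file-like object that transparently decompresses,
# so pickle.load can read it directly -- no separate unzip step needed.
with gzip.open('foo.pickle.gz.pickle', 'rb') as f:
    data = pickle.load(f)

print(data)  # -> {'answer': 42}
```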

A pickle object can only be deserialized in Python; you can't use non-Python environments to deserialize it. Please see the official page.

If the file contains multiple pickled objects, note that the answers above only unpickle one object. Use:
import pickle

pickle_list = []
pickle_file = open(file_name, 'rb')
while True:
    try:
        pickle_list.append(pickle.load(pickle_file))
    except EOFError:
        break
pickle_file.close()

Related

Preferred (or most common) file extension for a Python pickle

At times, I've seen .pickle, .pck, .pcl, and .db for files that contain Python pickles, but I am unsure which is the most common or best practice. I know that the latter three extensions are also used for other things.
The related question is: What MIME type is preferred for sending pickles between systems using a REST API?
Python 2
From the Python 2 documentation, while serializing (i.e. writing to a pickle file), use:
output = open('data.pkl', 'wb')
I would choose .pkl as the extension when using Python 2.
Python 3
The example in the Python 3 documentation now uses .pickle as the file extension for serialization:
with open('data.pickle', 'wb') as f:
    pickle.dump(...)
The preferred MIME type for sending pickles, from martineau's comment below, is:
application/octet-stream
See What is the HTTP "content-type" to use for a blob of bytes?
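To make the REST part concrete, here is a minimal sketch; the payload and endpoint are hypothetical, and the POST line is only a comment so nothing is actually sent:

```python
import pickle

# Serialize with an explicit protocol; these bytes are what goes on the wire.
payload = pickle.dumps({'result': [1, 2, 3]}, protocol=pickle.HIGHEST_PROTOCOL)

# A pickle is an opaque blob of bytes, hence application/octet-stream:
headers = {'Content-Type': 'application/octet-stream'}
# e.g. with requests:
#   requests.post('https://example.com/api/upload', data=payload, headers=headers)

# The receiving side round-trips the bytes back into an object -- but only
# do this with trusted senders, given pickle's security caveats.
obj = pickle.loads(payload)
```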

Python cannot read "warc.gz" file completely

For my work, I scrape web-sites and write them to gzipped web-archives (with extension "warc.gz"). I use Python 2.7.11 and the warc 0.2.1 library.
I noticed that for the majority of files I cannot read them completely with the warc library. For example, if the warc.gz file has 517 records, I can read only about 200 of them.
After some research I found out that this problem happens only with the gzipped files. The files with extension "warc" do not have this problem.
I found out that some other people have this problem as well (https://github.com/internetarchive/warc/issues/21), but no solution has been found.
I guess that there might be a bug in "gzip" in Python 2.7.11. Does maybe someone have experience with this, and know what can be done about this problem?
Thanks in advance!
Example:
I create new warc.gz files like this:
import warc
warc_path = "\\some_path\file_name.warc.gz"
warc_file = warc.open(warc_path, "wb")
To write records I use:
record = warc.WARCRecord(payload=value, headers=headers)
warc_file.write_record(record)
This creates perfect "warc.gz" files. There are no problems with them; everything, including the "\r\n" separators, is correct. But the problem starts when I read these files.
To read files I use:
warc_file = warc.open(warc_path, "rb")
To loop through records I use:
for record in warc_file:
    ...
The problem is that not all records are found during this looping for "warc.gz" file, while they all are found for "warc" files. Working with both types of files is addressed in the warc-library itself.
It seems that the custom gzip handling in warc.gzip2.GzipFile, file splitting with warc.utils.FilePart and reading in warc.warc.WARCReader is broken as a whole (tested with python 2.7.9, 2.7.10 and 2.7.11). It stops short when it receives no data instead of a new header.
It would seem that the basic stdlib gzip module handles the concatenated files just fine, so this should work as well:
import gzip
import warc
with gzip.open('my_test_file.warc.gz', mode='rb') as gzf:
    for record in warc.WARCFile(fileobj=gzf):
        print record.payload.read()

Is there a way to download the source from pypi using a script?

Following the links (Eg: https://pypi.python.org/packages/source/p/py-web-search/py-web-search-0.2.1.tar.gz#md5=4e2f7363acdc1e7c08629edfa7996a5a ) provided on pypi from a browser will allow us to download the source code.
Is there a way to do this from a script?
So far I have this:
import requests
s = requests.get('https://pypi.python.org/packages/source/p/py-web-search/py-web-search-0.2.1.tar.gz#md5=4e2f7363acdc1e7c08629edfa7996a5a')
with open('pws.tar.gz', 'w') as fp:
    fp.write(s.text)
Note: Opening the file in binary mode causes this error TypeError: 'str' does not support the buffer interface
When I open the tar file using the archive manager it tells that an error occurred while loading the archive.
I tried printing s.text and then redirecting the output to pws.tar.gz but it makes no difference.
The stream=True argument is optional (turn it on if you want to download a very large file):
import requests
s = requests.get('https://pypi.python.org/packages/source/p/py-web-search/py-web-search-0.2.1.tar.gz#md5=4e2f7363acdc1e7c08629edfa7996a5a', stream=True)
with open('pws.tar.gz', 'wb') as fp:
    for chunk in s.iter_content():
        if chunk:
            fp.write(chunk)
            fp.flush()
This post suggests that opening the file in binary mode and writing bytes(s.text, 'UTF-8') would work.
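The underlying fix is to treat the archive as bytes throughout: with requests that means writing s.content (raw bytes), never s.text (a decoded str). A sketch that also verifies the md5 fragment embedded in the download URL; the payload here is simulated so the example runs without a network:

```python
import hashlib

# Stand-in for s.content -- the raw bytes of the downloaded archive.
content = b'\x1f\x8b\x08\x00 fake tar.gz payload for illustration'

# Binary mode ('wb') matches binary content and avoids the TypeError
# mentioned in the question.
with open('pws.tar.gz', 'wb') as fp:
    fp.write(content)

# The URL ends in '#md5=...'; that fragment lets you check the download:
digest = hashlib.md5(content).hexdigest()

with open('pws.tar.gz', 'rb') as fp:
    assert hashlib.md5(fp.read()).hexdigest() == digest
```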

Decompress remote .gz file in Python

I have an issue with Python.
My case: I have a gzipped file from a partner platform (i.e. h..p//....namesite.../xxx)
If I click the link from my browser, it downloads a file like namefile.xml.gz.
So... if I read this file with Python I can decompress and read it.
Code:
content = gzip.open('namefile.xml.gz', 'rb')
print content.read()
But I can't if I try to read the file from a remote source.
From the remote file I can read only the encoded string, but not decode it.
Code:
response = urllib2.urlopen(url)
encoded = response.read()
print encoded
With this code I can read the encoded string... but I can't decode it with gzip or zlib.
Any advice?
Thanks a lot
Unfortunately the method @Aya suggests does not work, since GzipFile extensively uses the seek method of the file object (which is not supported by the response).
So you have basically two options:
Read the contents of the remote file into io.BytesIO (GzipFile needs a binary, seekable buffer), and pass that object into gzip.GzipFile (if the file is small)
download the file into a temporary file on disk, and use gzip.open
There is another option (which requires some coding) - to implement your own reader using zlib module. It is rather easy, but you will need to know about a magic constant (How can I decompress a gzip stream with zlib?).
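That zlib-based option can be sketched as follows. The key is the wbits value 16 + zlib.MAX_WBITS, the "magic constant" mentioned above, which tells zlib to expect (or emit) gzip framing rather than a raw zlib stream; the sample gzip data is generated locally so the example is self-contained:

```python
import zlib

# Produce gzip-framed bytes locally (this stands in for response.read();
# wbits = 16 + MAX_WBITS makes compressobj write a gzip header and trailer).
co = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gzipped = co.compress(b'payload from the remote server') + co.flush()

# The same constant on the decompress side accepts the gzip framing.
# decompressobj works incrementally and never calls seek(), so it can
# consume a stream read straight from urllib2.urlopen().
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
data = d.decompress(gzipped) + d.flush()
```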
If you use Python 3.2 or later the bug in GzipFile (requiring tell support) is fixed, but they apparently aren't going to backport the fix to Python 2.x
For Python v3.2 or later, you can use the gzip.GzipFile class to wrap the file object returned by urllib2.urlopen(), with something like this...
import urllib2
import gzip
response = urllib2.urlopen(url)
gunzip_response = gzip.GzipFile(fileobj=response)
content = gunzip_response.read()
print content
...which will transparently decompress the response stream as you read it.

pickle.load Not Working

I got a file that contains a data structure with test results from a Windows user. He created this file using the pickle.dump command. On Ubuntu, I tried to load this test results with the following program:
import pickle
import my_module
f = open('results', 'r')
print pickle.load(f)
f.close()
But I get an error inside the pickle module saying there is no module named "my_module".
Could the problem be due to corruption in the file, or is moving from Windows to Linux the cause?
The problem lies in pickle's handling of newline characters. Some of the line-feed characters cripple module names in dumped/loaded data.
Storing and loading files in binary mode may help, but I was having trouble with that too. After a long time reading docs and searching I found that pickle supports several different "protocols" for storing data, and for backward compatibility it defaults to the oldest one: protocol 0, the original ASCII protocol.
You can select a more modern protocol by specifying the protocol keyword while dumping data, something like this:
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=2)
or by choosing the highest protocol available (currently 2):
pickle.dump(someObj, open("dumpFile.dmp", 'wb'), protocol=pickle.HIGHEST_PROTOCOL)
The protocol version is stored in the dump file, so pickle.load() handles it automatically.
Regards
You should open the pickled file in binary mode, especially if you are using pickle on different platforms. See this question and this one for an explanation.
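A minimal round trip that works across platforms; the file name and data are illustrative:

```python
import pickle

# Always open pickle files in binary mode: 'wb' to write, 'rb' to read.
# This avoids the text-mode newline translation on Windows that corrupts
# pickles, and is required for the binary protocols anyway.
with open('results', 'wb') as f:
    pickle.dump({'passed': 10, 'failed': 1}, f, protocol=2)

with open('results', 'rb') as f:
    results = pickle.load(f)
```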
