Python : How to read pickle dump? - python

I have a pickle dump which I got from a friend and he asked me to read it like :
f = open('file.pickle')
import pickle
l = pickle.loads(f.read())
But I get an ImportError saying no module named sql.models
Can someone help me understand what is happening ?

You are missing the code required to reconstruct the pickled objects.
Pickles store the location where the class can be imported from, together with the instance attributes. The original module is still required to recreate the instance. From the documentation:
Note that functions (built-in and user-defined) are pickled by “fully qualified” name reference, not by value. This means that only the function name is pickled, along with the name of the module the function is defined in. Neither the function’s code, nor any of its function attributes are pickled. Thus the defining module must be importable in the unpickling environment, and the module must contain the named object, otherwise an exception will be raised.
Similarly, classes are pickled by named reference, so the same restrictions in the unpickling environment apply. Note that none of the class’s code or data is pickled, so in the following example the class attribute attr is not restored in the unpickling environment:
class Foo:
    attr = 'a class attr'
picklestring = pickle.dumps(Foo)
These restrictions are why picklable functions and classes must be defined in the top level of a module.
In other words, the original data used to create the pickle includes at least one instance of a custom class that originates in a module named sql.models.
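To see this by-reference behaviour without the friend's file, here is a minimal sketch. The module name fake_models and the class Record are hypothetical stand-ins for sql.models and whatever class it defines; the module is built in memory only to keep the example self-contained.

```python
import pickle
import sys
import types

# Hypothetical stand-in for sql.models: a module built in memory.
CLASS_SOURCE = (
    "class Record:\n"
    "    def __init__(self, value):\n"
    "        self.value = value\n"
)
fake_models = types.ModuleType("fake_models")
exec(CLASS_SOURCE, fake_models.__dict__)
sys.modules["fake_models"] = fake_models

data = pickle.dumps(fake_models.Record(42))

# The pickle holds the module and class names, not the class body.
names_stored = b"fake_models" in data and b"Record" in data

# Simulate the reader's machine, where the module does not exist.
del sys.modules["fake_models"]
try:
    pickle.loads(data)
    load_error = None
except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
    load_error = exc

# Once the module is importable again, the same bytes load fine.
sys.modules["fake_models"] = fake_models
restored = pickle.loads(data)
```

Here load_error plays the role of the ImportError from the question: the bytes are intact, but the code they point to is missing.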
Do be careful reading arbitrary pickles, even from friends. A pickle is just a stack language that recreates arbitrary Python structures. With enough determination and skill, someone can construct a pickle that spawns a secret back-door server on your computer. The pickle documentation warns you explicitly:
Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
This has been a problem in the past, even for experienced developers.
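A harmless sketch of why that warning matters: an object can tell pickle which callable to invoke when it is rebuilt. The class name Innocent is made up for illustration, and the callable here is just sorted; a hostile pickle could name any importable function instead.

```python
import pickle

class Innocent:
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call sorted([3, 1, 2])".
        return (sorted, ([3, 1, 2],))

payload = pickle.dumps(Innocent())

# Loading the bytes calls sorted(); no Innocent instance is ever created.
result = pickle.loads(payload)
```

The loaded value is whatever the named callable returned, which is why loading a pickle amounts to running code chosen by its author.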

Related

How does pickle work with a DataFrame inside?

I wonder how the pickle module saves and loads objects. I saved a DataFrame object to a file on disk:
import pandas as pd
import pickle
df = pd.read_excel(r".\test.xlsx")
with open("o.pkl", "wb") as file:
    pickle.dump(df, file)
Then I uninstalled pandas and tried to load the DataFrame from the file, but I get the error "ModuleNotFoundError: No module named 'pandas'":
import pickle
with open("o.pkl", "rb") as file:
    e = pickle.load(file)
My question is: does the pickle module somehow use pandas when loading a DataFrame? If so, how is it done?
Pickle by default will go and import the class.
In this case, if you do not have pandas installed when you run the second snippet, it won't work by default (see below for more info on that default behaviour).
Quick primer on pickling
Essentially, everything in Python is an instance of a class, in some shape or form.
When you make a DataFrame, such as when you use pandas.read_excel, you create an instance of the DataFrame class. To create that instance you need:
the class definition (containing information about methods and attributes)
something that creates the instance from some input data
You can create instances of a class normally by directly instantiating the class, or by using another method/function. Example:
# This makes a string, '12345' by directly invoking the str constructor
s = str(12345)
# This makes a list by using the split method of the string
l = s.split('3')
Pickle works just the same. When you unpickle, you need the class definition as well as the function which transforms some input data (your .pkl file) into the instance.
The class definition is not stored in the pickled data. The pickle only records where to import the class from (its module and name) together with the instance data, so none of the class's code, or the supporting imports and code around it, is captured.
This means that even if you override the default behaviour, you cannot get a working DataFrame without pandas. A DataFrame's methods live in the pandas package, spread across many of its modules, and none of that code is in the pickle, so there is nothing to call once the import is skipped.
Can I override the default behaviour for unpickling?
Yes, you can do this -- you can override the import behaviour by using a custom unpickler. That's described here in the Python doc: restricting globals (Python official doc).
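A minimal sketch of such a custom unpickler, modelled on the "restricting globals" pattern in the pickle documentation. The allow-list below is an assumption chosen for the example, not a recommendation.

```python
import builtins
import io
import pickle

# Assumption: only these builtins are considered safe for this example.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global (class or function) the pickle references.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data):
    """Like pickle.loads(), but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# A range is on the allow-list, so this loads normally.
allowed = restricted_loads(pickle.dumps([1, 2, range(15)]))

# eval is a builtin too, but it is not on the allow-list.
try:
    restricted_loads(pickle.dumps(eval))
    refused = False
except pickle.UnpicklingError:
    refused = True
```

Overriding find_class controls which names may be imported; it does not make a missing package such as pandas appear.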
I've run into a similar thing before where it needed a specific pandas version, but I didn't investigate. Running across your post here, I read some of the documentation and came across this line:
When a class instance is unpickled, its __init__() method is usually not invoked. The default behaviour first creates an uninitialized instance and then restores the saved attributes.
https://docs.python.org/3.8/library/pickle.html#pickle-inst
So to unpickle an arbitrary class instance, pickle has to be able to import the class itself, both to create the uninitialized instance and to restore its attributes. If the class isn't present, it can't do that.
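The documented behaviour around __init__() can be checked with a small sketch. The Tracker class and its init_calls counter are hypothetical names used only for this illustration.

```python
import pickle

class Tracker:
    init_calls = 0  # counts how often __init__ runs

    def __init__(self, value):
        Tracker.init_calls += 1
        self.value = value

original = Tracker(7)

# Unpickling creates the instance without calling __init__,
# then restores the saved attributes.
clone = pickle.loads(pickle.dumps(original))
calls_after_roundtrip = Tracker.init_calls
```

The counter stays at one after the round trip, yet clone.value is restored, which matches the quoted description.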
That same page also says:
Similarly, when class instances are pickled, their class’s code and data are not pickled along with them. Only the instance data are pickled.
If I make a pandas DataFrame, I can access df.__class__, which will return pandas.core.frame.DataFrame.
Putting this all together on that page, here's what I think happens:
Pickling df saves the instance data, along with a reference to its class (the module and class name, here pandas.core.frame.DataFrame)
Unpickling goes and looks for this class to access its __setstate__ method
If the module containing this class definition can't be found: error!
Short answer: it saves that information.
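The saved class reference can be listed with the standard pickletools module. In this sketch fractions.Fraction stands in for DataFrame so that pandas is not required, and protocol 2 is an assumption made for readability: it writes each reference as a single GLOBAL opcode, while newer protocols store the same two names with a different opcode.

```python
import pickle
import pickletools
from fractions import Fraction

def referenced_globals(data):
    """Return the 'module name' pairs recorded by GLOBAL opcodes."""
    return [arg for opcode, arg, _pos in pickletools.genops(data)
            if opcode.name == "GLOBAL"]

# Protocol 2 is chosen here only because its GLOBAL opcode is easy to read.
data = pickle.dumps(Fraction(1, 3), protocol=2)
refs = referenced_globals(data)
```

Each entry in refs is a module name and a class name separated by a space; this is the information the unpickler uses to decide what to import.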

ImportError for top-level package when trying to use dill to pickle entire package source code alongside instance

I have the following project structure:
Package1
|--__init__.py
|--__main__.py
|--Module1.py
|--Module2.py
where Module1.py contains something like:
import dill as pickle
import Package1.Module2
# from https://stackoverflow.com/questions/52402783/pickle-class-definition-in-module-with-dill
def mainify(obj):
    import __main__
    import inspect
    import ast
    s = inspect.getsource(obj)
    m = ast.parse(s)
    co = compile(m, "<string>", "exec")
    exec(co, __main__.__dict__)
def Module1():
    """I hope the details of this class are not necessary for this example. I can add detail if necessary
    """
obj_to_pickle = Module1()
def write_session():
    mainify(Module1)
    mainify(Module2)
    with FileHandler.open_file(...) as f:
        pickle.dump(obj_to_pickle, f)
I run the code as a module via python -m Package1 ..., thus __main__.py is the entry point to package execution, though I hope these details aren't relevant (I can improve my example if necessary).
Now, when I try to load the pickled object, I get ModuleNotFoundError: No module named Package1.
How can I tell dill in this situation that Package1 is the package? The mainify function seems to be getting the modules' source code into the pickle, but I believe the import statement in Module1.py, import Package1.Module2, is causing the ImportError. How can I tell dill to understand the reference to Package1?
NOTE: this reference can be fixed by adding the directory that Package1 is in via sys.path.append. But the whole point of pickling the package source alongside the instance is to make the pickled instance unpicklable without needing to do this.
Relevant posts:
Pickle class definition in module with dill
Why dill dumps external classes by reference, no matter what?
@courtyardz, I'm a contributor to dill, and your question is similar to others that have been asked in the past.
First, let me explain that generally dill assumes that all the modules necessary to deserialize an object are importable in the "unpickling" environment. Therefore modules are almost always saved by reference, with the current exception of modules that are not properly installed, like local modules (e.g. located in the working directory) or modules at non-canonical paths added to sys.path. There's also a function that's able to save the complete state of a module, which can be restored afterwards, but not the module itself.
That said, what exactly do you need? Is it to serialize an object alongside its class (including any objects in the module's namespace that it refers to), or is it really the whole module?
If you need to transfer the complete module to an interpreter session where it's not available, like in a different machine, this problem is under active discussion here: https://github.com/uqfoundation/dill/issues/123. There's no complete solution for this currently, but one possibility is to ship the module as a ZIP archive, and load it using the zipimport module (indirectly, by saving the zip file to disk, maybe in a temporary location, and adding its path to sys.path as described in Python's documentation).
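A minimal sketch of the ZIP route using only the standard library. The archive name shipped_code.zip, the module name shipped_mod and the Point class are hypothetical, and plain pickle is used in place of dill to keep the example dependency-free. Python's import system handles ZIP files on sys.path through zipimport.

```python
import importlib
import os
import pickle
import sys
import tempfile
import zipfile

# Hypothetical module source that would normally be shipped from elsewhere.
MODULE_SOURCE = (
    "class Point:\n"
    "    def __init__(self, x, y):\n"
    "        self.x, self.y = x, y\n"
)

# Save the archive to a temporary location.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "shipped_code.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("shipped_mod.py", MODULE_SOURCE)

# Adding the archive to sys.path makes its modules importable.
sys.path.insert(0, zip_path)
importlib.invalidate_caches()
shipped_mod = importlib.import_module("shipped_mod")

# By-reference pickling now works, because shipped_mod can be imported.
point = pickle.loads(pickle.dumps(shipped_mod.Point(1, 2)))
```

The receiving side would do the same sys.path step before unpickling, so the archive has to travel with the pickle.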
If you just need to serialize an object with its class, note that doing so has a limitation: objects of that class pickled by separate calls to dill.dump() or dill.dumps() will end up having different (although identical) classes when unpickled. This may or may not be a problem. There's also an open discussion about forcing the serialization of a class by value: https://github.com/uqfoundation/dill/issues/424.
The workaround you are trying to use should work because dill pickles classes defined in the __main__ module by value, as well as "orphaned" classes, i.e. classes that can't be found in the module where they were defined. However, for this to work the object must be created by the __main__.Module1 class (I suppose this is a class, even though you used def instead of class in your code example), not the Package1.Module1.Module1 class. If the class references global objects in Module1 in its methods, you may need to use the option recurse=True with dill.dump(s).
A simpler workaround, that may not work for your specific case as it involves multiple modules, is to temporarily change the __module__ attribute of the class. For example, at a module's body:
import dill
class X:
    pass
obj = X()
X.__module__ = None  # temporarily orphan the class
with open('/path/to/file.pkl', 'wb') as file:
    dill.dump(obj, file)  # X will be pickled by value because __module__ is None
X.__module__ = __name__  # de-orphan the class
Going back to your example, if you can't create the object with the "mainified" class, you may change the object's class temporarily too:
obj_to_pickle = Module1()
def write_session():
    import __main__  # needed to reach the "mainified" class below
    mainify(Module1)
    mainify(Module2)
    obj_to_pickle.__class__ = __main__.Module1
    with FileHandler.open_file(...) as f:
        pickle.dump(obj_to_pickle, f)
    obj_to_pickle.__class__ = Module1
However, this won't work if the object has instance attributes whose types are defined in Package1.

Django: cache a dictionary containing a lambda function

I am trying to save a dictionary that contains a lambda function in django.core.cache. The example below fails silently.
from django.core.cache import cache
cache.set("lambda", {"name": "lambda function", "function":lambda x: x+1})
cache.get("lambda")
#None
I am looking for an explanation for this behaviour. Also, I would like to know if there is a workaround without using def.
"The example below fails silently."
No, it doesn't. The cache.set() call should give you an error like:
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Why? Internally, Django is using Python's pickle library to serialize the value you are attempting to store in cache. When you want to pull it out of cache again with your cache.get() call, Django needs to know exactly how to reconstruct the cached value. And due to this desire not to lose information or incorrectly/improperly reconstruct a cached value, there are several restrictions on what kinds of objects can be pickled. You'll note that only these types of functions may be pickled:
functions defined at the top level of a module
built-in functions defined at the top level of a module
And there is this further explanation about how pickling functions works:
Note that functions (built-in and user-defined) are pickled by “fully qualified” name reference, not by value. This means that only the function name is pickled, along with the name of the module the function is defined in. Neither the function’s code, nor any of its function attributes are pickled. Thus the defining module must be importable in the unpickling environment, and the module must contain the named object, otherwise an exception will be raised.
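A sketch of both halves of this, using plain pickle rather than Django's cache (an assumption made so the example runs without Django). The workaround shown, functools.partial over a named function from the operator module, is not from the answer above; it is one way to avoid def because every piece of it can be pickled by reference.

```python
import functools
import operator
import pickle

# A lambda has no importable name, so pickling the dictionary fails.
entry = {"name": "lambda function", "function": lambda x: x + 1}
try:
    pickle.dumps(entry)
    lambda_error = None
except (pickle.PicklingError, AttributeError) as exc:
    lambda_error = exc

# functools.partial over operator.add needs no def and pickles by reference.
entry = {"name": "add one", "function": functools.partial(operator.add, 1)}
restored = pickle.loads(pickle.dumps(entry))
result = restored["function"](41)
```

This only covers functions that can be expressed as a partial application of something importable; arbitrary lambda bodies still need a named function.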

Serialize a python function with dependencies

I have tried multiple approaches to pickle a python function with dependencies, following many recommendations on StackOverflow, (such as dill, cloudpickle, etc.) but all seem to run into a fundamental issue that I cannot figure out.
I have a main module that tries to pickle a function from an imported module, sends it over ssh to be unpickled and executed at a remote machine.
So main has:
import dill  # for example
import modulea
serial = dill.dumps(modulea.func)
send(serial)  # pseudocode: transmit the bytes over ssh
On the remote machine:
import dill
serial = receive()  # pseudocode: read the bytes sent by main
funcremote = dill.loads(serial)
funcremote()
If the functions being pickled and sent are top level functions defined in main itself, everything works. When they are in an imported module, the loads function fails with messages of the type "module modulea not found".
It appears that the module name is pickled along with the function name. I do not see any way to "fix up" the pickle to remove the dependency, or alternately, to create a dummy module in the receiver to become the recipient of the unpickling.
Any pointers will be much appreciated.
--prasanna
I'm the dill author. I do this exact thing over ssh, but with success. Currently, dill and any of the other serializers pickle modules by reference… so to successfully pass a function defined in a file, you have to ensure that the relevant module is also installed on the other machine. I do not believe there is any object serializer that serializes modules directly (i.e. not by reference).
Having said that, dill does have some options to serialize object dependencies. For example, for class instances, the default in dill is to not serialize class instances by reference… so the class definition can also be serialized and sent with the instance. In dill, you can also (use a very new feature to) serialize file handles by serializing the file, instead of doing so by reference. But again, if you have the case of a function defined in a module, you are out of luck, as modules are serialized by reference pretty darn universally.
You might be able to use dill to do so, however, just not by pickling the object, but by extracting the source and sending the source code. In pathos.pp and pyina, dill is used to extract the source and the dependencies of any object (including functions), and pass them to another computer/process/etc. However, since this is not an easy thing to do, dill can also use the failover of trying to extract a relevant import and send that instead of the source code.
You can understand, hopefully, this is a messy messy thing to do (as noted in one of the dependencies of the function I am extracting below). However, what you are asking is successfully done in the pathos package to pass code and dependencies to different machines across ssh-tunneled ports.
>>> import dill
>>>
>>> print dill.source.importable(dill.source.importable)
from dill.source import importable
>>> print dill.source.importable(dill.source.importable, source=True)
def _closuredsource(func, alias=''):
    """get source code for closured objects; return a dict of 'name'
    and 'code blocks'"""
    #FIXME: this entire function is a messy messy HACK
    #      - pollutes global namespace
    #      - fails if name of freevars are reused
    #      - can unnecessarily duplicate function code
    from dill.detect import freevars
    free_vars = freevars(func)
    func_vars = {}
    # split into 'funcs' and 'non-funcs'
    for name,obj in list(free_vars.items()):
        if not isfunction(obj):
            # get source for 'non-funcs'
            free_vars[name] = getsource(obj, force=True, alias=name)
            continue
        # get source for 'funcs'
#…snip… …snip… …snip… …snip… …snip…
            # get source code of objects referred to by obj in global scope
            from dill.detect import globalvars
            obj = globalvars(obj) #XXX: don't worry about alias?
            obj = list(getsource(_obj,name,force=True) for (name,_obj) in obj.items())
            obj = '\n'.join(obj) if obj else ''
            # combine all referred-to source (global then enclosing)
            if not obj: return src
            if not src: return obj
            return obj + src
        except:
            if tried_import: raise
            tried_source = True
            source = not source
    # should never get here
    return
I imagine something could also be built around the dill.detect.parents method, which provides a list of pointers to all parent objects for any given object… and one could reconstruct all of any function's dependencies as objects… but this is not implemented.
BTW: to establish an ssh tunnel, just do this:
>>> t = pathos.Tunnel.Tunnel()
>>> t.connect('login.university.edu')
39322
>>> t
Tunnel('-q -N -L39322:login.university.edu:45075 login.university.edu')
Then you can work across the local port with ZMQ, or ssh, or whatever. If you want to do so with ssh, pathos also has that built in.
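The "send the source code" idea can be sketched with the standard library alone. This is not how dill or pathos implement it; it is a simplified illustration in which the source of a hypothetical modulea travels as text, the receiver builds a module from it, and an ordinary by-reference pickle then loads. The helper install_module is a made-up name.

```python
import pickle
import sys
import types

# Hypothetical contents of modulea, shipped as plain text.
MODULEA_SOURCE = (
    "def func():\n"
    "    return 'hello from modulea'\n"
)

def install_module(name, source):
    """Build a module from source text and register it under the given name."""
    module = types.ModuleType(name)
    exec(compile(source, f"<{name}>", "exec"), module.__dict__)
    sys.modules[name] = module
    return module

# Sender: modulea exists here, so func pickles by reference.
modulea = install_module("modulea", MODULEA_SOURCE)
serial = pickle.dumps(modulea.func)

# Simulate a receiver that does not have modulea yet.
del sys.modules["modulea"]

# Receiver: rebuild the module from the shipped source, then unpickle.
install_module("modulea", MODULEA_SOURCE)
funcremote = pickle.loads(serial)
result = funcremote()
```

Running received source text is as trusting as unpickling received bytes, so this only makes sense between machines you control.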

cPickle.load throwing ImportError in Python

I have Python 2.7.3 installed on my Windows 7 computer. When I run the following code
import nltk, json, cPickle, itertools
import numpy as np
from nltk.tokenize import word_tokenize
from pprint import pprint
t_given_a = json.load(open('conditional_probability.json','rb'))
a_unconditional = json.load(open('age.json','rb'))
t_unconditional = cPickle.load(open('freqdist.pkl','rb'))['distribution']
The command prompt gives me the error
"ImportError: No Module named Multiarray."
I'm fairly new to Python and I'm not exactly sure why this error happened. I searched other threads and many suggested to use 'rb' instead of 'r'. I have rb to begin with and it's still throwing me that error. Any suggestion?
When you pickle an object in Python, it saves the object's class as a string: package name plus class name. On unpickling, Python tries to import that module and look up that class to recreate the object. If that module isn't importable, you get an ImportError.
Here the missing module, multiarray, is part of NumPy (numpy.core.multiarray), so make sure NumPy is installed and importable in the environment that loads the pickle. If that doesn't help, ask whoever gave you the pickle file how it was created.
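One way to check up front whether the packages a pickle depends on are available is importlib.util.find_spec. This sketch assumes Python 3 (the question uses Python 2.7, where this function does not exist), and the helper missing_modules and the package name hypothetical_missing_pkg are made up for illustration.

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level packages from module_names that cannot be found."""
    top_level = {name.split(".")[0] for name in module_names}
    return sorted(name for name in top_level
                  if importlib.util.find_spec(name) is None)

# json ships with Python; the second name is a deliberately absent package.
needed = ["json.decoder", "hypothetical_missing_pkg.core.multiarray"]
missing = missing_modules(needed)
```

Only the top-level package is checked, because looking up a dotted name imports its parent and would raise if the parent is absent.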
From the docs:
Note that functions (built-in and user-defined) are pickled by “fully qualified” name reference, not by value. This means that only the function name is pickled, along with the name of the module the function is defined in. Neither the function’s code, nor any of its function attributes are pickled. Thus the defining module must be importable in the unpickling environment, and the module must contain the named object, otherwise an exception will be raised.
Similarly, classes are pickled by named reference, so the same restrictions in the unpickling environment apply. Note that none of the class’s code or data is pickled [...] These restrictions are why picklable functions and classes must be defined in the top level of a module
