I use pickle and dill for the following lambda function and it works fine:
import dill
import pickle
f = lambda x,y: x+y
s = pickle.dumps(f)
It even works when the lambda is used inside a class, for example:
File foo.py:
class Foo(object):
    def __init__(self):
        self.f = lambda x, y: x+y
File test.py:
import dill
import pickle
from foo import Foo
f = Foo()
s = pickle.dumps(f) # or s = dill.dumps(f)
But when I build the same file as a .pyx (foo.pyx) with Cython, I can't serialize it with dill, pickle, or cPickle; I get this error:
Traceback (most recent call last):
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2878, in run_cod
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
a = pickle.dumps(c)
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/pickle.py", line 1380, in dumps
Pickler(file, protocol).dump(obj)
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/pickle.py", line 425, in save_reduce
save(state)
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/site-packages/dill/_dill.py", line 912, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/pickle.py", line 669, in _batch_setitems
save(v)
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/pickle.py", line 317, in save
self.save_global(obj, rv)
File "/home/amin/anaconda2/envs/rllab2/lib/python2.7/pickle.py", line 754, in save_global
(obj, module, name))
PicklingError: Can't pickle <cyfunction <lambda> at 0x7f9ab1ff07d0>: it's not found as foo.<lambda>
The setup.py file used to build with Cython:
from distutils.core import setup
from Cython.Build import cythonize
setup(ext_modules=cythonize("foo.pyx"))
Then run in the terminal:
python setup.py build_ext --inplace
Is there a way to make this work?
I'm the dill author. Expanding on what @DavidW says in the comments: I believe there are (currently) no known serializers that can pickle Cython lambdas, or the vast majority of Cython code. Indeed, it is much more difficult for Python serializers to pickle objects built as C extensions unless the authors of the extension code specifically add serialization instructions (as numpy and pandas did). In that vein, instead of a lambda you could build a class with a __call__ method, so it acts like a function, and then add one or more of the pickle methods (__reduce__, __getstate__, __setstate__, or something similar). Then you should be able to pickle instances of your class. It's a bit of work, but since this path has been used to pickle classes written in C++, I believe you should be able to get it to work for Cython-built classes.
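A minimal sketch of that idea (the Adder class is illustrative, not from the question; a plain class in a .pyx compiles to a regular Python class, so pickle can import it by name):

# foo.pyx -- replace the lambda with a small callable class (illustrative sketch)
class Adder(object):
    """Behaves like `lambda x, y: x + y`, but can tell pickle how to rebuild it."""
    def __call__(self, x, y):
        return x + y

    def __reduce__(self):
        # Rebuild by importing Adder from this module and calling it with no arguments.
        return (Adder, ())

class Foo(object):
    def __init__(self):
        self.f = Adder()

After rebuilding the extension, pickle.dumps(Foo()) reduces to "import foo.Adder and call it", so no lambda ever needs to be serialized.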
In this code
import dill
import pickle
f = lambda x,y: x+y
s = pickle.dumps(f)
f is a function.
But in the other code
import dill
import pickle
from foo import Foo
f = Foo()
s = pickle.dumps(f)
# or
s = dill.dumps(f)
f is an instance of the class Foo, not a function.
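You can see the difference by inspecting the objects (illustrative only, using the same foo.py as above):

from foo import Foo

f1 = lambda x, y: x + y
f2 = Foo()

print(type(f1))   # a function object
print(type(f2))   # an instance of foo.Foo -- pickling it also pickles the lambda stored in f2.f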
Currently the conda installation comes with pyspark 2.4.0. Installing with pip allows a later version of pyspark, 3.1.2, but with this version the dill library conflicts with the pickle library.
I use this for unit tests for pyspark. If I import the dill library in a test script, or in any other test that imports dill and is run alongside the pyspark test using pytest, it breaks.
It gives the error below.
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/pyspark/serializers.py", line 437, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 101, in dumps
cp.dump(obj)
File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
return Pickler.dump(self, obj)
File "/opt/conda/lib/python3.6/pickle.py", line 409, in dump
self.save(obj)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
save(element)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 722, in save_function
*self._dynamic_function_reduce(obj), obj=obj
File "/opt/conda/lib/python3.6/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 659, in _save_reduce_pickle5
dictitems=dictitems, obj=obj
File "/opt/conda/lib/python3.6/pickle.py", line 610, in save_reduce
save(args)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/pickle.py", line 751, in save_tuple
save(element)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/pickle.py", line 736, in save_tuple
save(element)
File "/opt/conda/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/conda/lib/python3.6/site-packages/dill/_dill.py", line 1146, in save_cell
f = obj.cell_contents
ValueError: Cell is empty
This happens in the save function in /opt/conda/lib/python3.6/pickle.py. After the persistent-id and memo checks it gets the type of the obj, and if that is the 'cell' class it looks up the handler on the next line with self.dispatch.get. With pyspark 2.4.0 this returns None and it works fine, but with pyspark 3.1.2 it returns an object, which forces the object through save_reduce. That fails because the cell is empty, e.g. <cell at 0x7f0729a2as66: empty>.
If I force the return value to be None for the pyspark 3.1.2 installation, it works, but that needs to happen gracefully rather than by hardcoding.
Has anyone had this issue? Any suggestions on which versions of dill, pickle, and pyspark to use together?
Here is the code that is being used:

import pytest
from pyspark.sql import SparkSession
import dill  # if this line is added, the test does not work with pyspark-3.1.2

simpleData = [
    ("James", "Sales", "NY", 90000, 34, 10000),
]
schema = ["A", "B", "C", "D", "E", "F"]

@pytest.fixture(scope="session")
def start_session(request):
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("Python Spark unit test")
        .getOrCreate()
    )
    yield spark
    spark.stop()

def test_simple_rdd(start_session):
    rdd = start_session.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7])
    assert rdd.stdev() == 2.0
This works with pyspark 2.4.0 but fails with pyspark 3.1.2, giving the error above.
dill version - 0.3.1.1
pickle version - 4.0
python - 3.6
Apparently you aren't using dill except to import it. I assume you will be using it later...? As I mentioned in my comment, cloudpickle and dill do have some mild conflicts, and this appears to be what you are experiencing. Both serializers add logic to the pickle registry to tell Python how to serialize different kinds of objects. So, if you use both dill and cloudpickle, there can be conflicts: the pickle registry is a dict, so the order of imports (and so on) matters.
The issue is similar to the one noted here:
https://github.com/tensorflow/tfx/issues/2090
There are a few things you can try:
(1) Some codebases allow you to replace the serializer. So, if you are able to swap dill in for cloudpickle, that may resolve the conflicts. I'm not sure this can be done with pyspark, but there is a pyspark module on serializers, so that is promising...
Set PySpark Serializer in PySpark Builder
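For instance, pyspark lets you pick the serializer used for RDD data when the context is created (a sketch only; whether the closure serializer used for functions can be swapped the same way is a separate question):

from pyspark import SparkContext
from pyspark.serializers import PickleSerializer

# serializer= controls how RDD elements are serialized; functions/closures still
# appear to go through pyspark's bundled cloudpickle.
sc = SparkContext("local[1]", "serializer-demo", serializer=PickleSerializer())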
(2) dill provides a mechanism to help mitigate some of the conflicts in the pickle registry. If you use dill.extend(False) before using cloudpickle, then dill.extend(True) before using dill, it may clear up the issue you are seeing.
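A minimal sketch of suggestion (2), assuming the pyspark/cloudpickle work happens between the two calls:

import dill

dill.extend(False)   # remove dill's additions to the pickle registry before cloudpickle runs
# ... run the pyspark / cloudpickle code here ...
dill.extend(True)    # restore dill's additions before using dill yourself
s = dill.dumps(lambda x: x + 1)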
I'm trying to generate predictions from a pickled model with pyspark. I get the model with the following command:
model = deserialize_python_object(filename)
with deserialize_python_object(filename) defined as:
import pickle

def deserialize_python_object(filename):
    try:
        with open(filename, 'rb') as f:
            obj = pickle.load(f)
    except:
        obj = None
    return obj
The error log looks like:

File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 189, in wrapper
return self(*args)
File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 167, in __call__
judf = self._judf
File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 151, in _judf
self._judf_placeholder = self._create_judf()
File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 160, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 35, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/rdd.py", line 2420, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/serializers.py", line 600, in dumps
raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: TypeError: can't pickle _abc_data objects
It seems that you are having the same problem as in this issue:
https://github.com/cloudpipe/cloudpickle/issues/180
What is happening is that pyspark's bundled cloudpickle library is outdated for Python 3.7; you can work around the problem with the patch below until pyspark gets that module updated.
Try using this workaround:
Install cloudpickle: pip install cloudpickle
Add this to your code:
import cloudpickle
import pyspark.serializers
pyspark.serializers.cloudpickle = cloudpickle
Monkeypatch credit: https://github.com/cloudpipe/cloudpickle/issues/305
I have the following code, which decorates the class:
import dill
from collections import namedtuple
from multiprocessing import Process

def proxified(to_be_proxied):
    b = namedtuple('d', [])

    class Proxy(to_be_proxied, b):
        pass

    Proxy.__name__ = to_be_proxied.__name__
    return Proxy

@proxified
class MyClass:
    def f(self):
        print('hello!')

pickled_cls = dill.dumps(MyClass)

def new_process(clazz):
    dill.loads(clazz)().f()

p = Process(target=new_process, args=(pickled_cls,))
p.start()
p.join()
When I try to pickle the decorated class I get the following error:
Traceback (most recent call last):
File "/usr/lib/python3.5/pickle.py", line 907, in save_global
obj2, parent = _getattribute(module, name)
File "/usr/lib/python3.5/pickle.py", line 265, in _getattribute
.format(name, obj))
AttributeError: Can't get local attribute 'proxified.<locals>.Proxy' on <function proxified at 0x7fbf7de4b8c8>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/carbolymer/example.py", line 108, in <module>
pickled_cls = dill.dumps(MyClass)
File "/usr/lib/python3.5/site-packages/dill/dill.py", line 243, in dumps
dump(obj, file, protocol, byref, fmode, recurse)#, strictio)
File "/usr/lib/python3.5/site-packages/dill/dill.py", line 236, in dump
pik.dump(obj)
File "/usr/lib/python3.5/pickle.py", line 408, in dump
self.save(obj)
File "/usr/lib/python3.5/pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.5/site-packages/dill/dill.py", line 1189, in save_type
StockPickler.save_global(pickler, obj)
File "/usr/lib/python3.5/pickle.py", line 911, in save_global
(obj, module_name, name))
_pickle.PicklingError: Can't pickle <class '__main__.proxified.<locals>.Proxy'>: it's not found as __main__.proxified.<locals>.Proxy
How can I pickle the decorated class using dill? I would like to pass this class to a separate process as an argument. Or maybe there is a simpler way to do it?
A good explanation of "Why pickling decorated functions is painful", provided by Gaël Varoquaux, can be found here.
Basically, rewriting the class using functools.wraps can avoid these issues :)
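For the plain function-decorator case, the pattern looks like this (a sketch of the general idea, not a drop-in fix for the class decorator above):

import functools
import pickle

def logged(func):
    @functools.wraps(func)              # copies __name__, __qualname__, __module__, ...
    def wrapper(*args, **kwargs):
        print('calling', func.__name__)
        return func(*args, **kwargs)
    return wrapper

@logged
def add(x, y):
    return x + y

# Works: the wrapper is still found under its original importable name.
s = pickle.dumps(add)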
I have a problem loading a pickled object from a file. Dumping and loading are done in dill_read_write.py:
dill_read_write.py
import os
import dill
from contact_geometry import ContactGeometry

def write_pickle(obj, filename):
    os.chdir(os.path.abspath(os.path.join(os.path.dirname(__file__))))
    filename = os.path.join(os.getcwd(), filename)
    with open(filename, 'wb') as output_:
        dill.dump(obj, output_)

def read_pickle(filename):
    with open(filename, 'rb') as input_:
        return dill.load(input_)

if __name__ == "__main__":
    read_pickle("ground_.pkl")
Saving the ContactGeometry object's data to a pickle file is done while the PyQt application (project) is running. The function write_pickle() is called in moduleC.py:
moduleC.py
from contact_geometry import ContactGeometry
from moduleA.moduleB import dill_read_write

class Foo(FooItem):
    def __init__(self, ...):
        ...

    def createGeometry(self):
        contact_geometry_ = ContactGeometry()
        # save object to pickle file
        dill_read_write.write_pickle(contact_geometry_, "object_data.pkl")
The object is saved and the pickle file is created. But when I run only the file dill_read_write.py to read (load) the object data from the pickle file, I get the following error:
Traceback (most recent call last):
File "C:\projectName\moduleA\moduleB\dill_read_write.py", line 29, in <module>
read("ground_.pkl")
File "C:\projectName\moduleA\moduleB\dill_read_write.py", line 24, in read
return dill.load(input_)
File "C:\Python27\lib\site-packages\dill-0.2.2-py2.7.egg\dill\dill.py", line 199, in load
obj = pik.load()
File "C:\Python27\Lib\pickle.py", line 858, in load
dispatch[key](self)
File "C:\Python27\Lib\pickle.py", line 1090, in load_global
klass = self.find_class(module, name)
File "C:\Python27\lib\site-packages\dill-0.2.2-py2.7.egg\dill\dill.py", line 278, in find_class
return StockUnpickler.find_class(self, module, name)
File "C:\Python27\Lib\pickle.py", line 1124, in find_class
__import__(module)
ImportError: No module named moduleA.moduleB.contact_geometry
I searched a bit and found that dill can handle classes better than pickle, but I am having problems implementing it. I've also found that I may have to implement __reduce__() in the class ContactGeometry in the file contact_geometry.py.
contact_geometry.py
import os

class ContactGeometry(object):
    def __init__(self, ...):
        ...

    def __reduce__(self):
        return (self.__class__, (os.path.realpath(__file__),))
But I am not sure what this method should return. How can I successfully load the pickle file from the current situation?
Below is the project structure, if it is any help.
I'm the dill author. It's hard to tell how you are running the code, but it looks like the issue is the way you are running the code and the resulting module name, as in @Antti Haapala's answer. His suggestions are also good ones to follow.
I'll add this… You need to make sure that (1) moduleA.moduleB.contact_geometry is on the PYTHONPATH, and (2) you are not dumping the module as __main__.moduleB.contact_geometry and trying to load it as moduleA.moduleB.contact_geometry -- dill treats __main__ as if it were a module (for the most part).
You shouldn't need to add __reduce__ methods to your classes, however.
You cannot run a python file from within a package like that; it wouldn't find the toplevel package names. I'd propose any of the following:
Write a start script at the top level (where main.py is) that imports and runs dill_read_write from moduleA.moduleB (see the sketch after this list).
Alternatively, from the top-level directory where main.py is, you can run that module with python -m moduleA.moduleB.dill_read_write.
Or, my preferred alternative, write a setup.py for your project and write a script for that utility.
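The first option can be as small as this (the file name run_read.py is just an example; it sits next to main.py at the project root, following the question's layout):

# run_read.py -- top-level launcher so contact_geometry is importable
# under its full package name moduleA.moduleB.contact_geometry
from moduleA.moduleB import dill_read_write

if __name__ == "__main__":
    print(dill_read_write.read_pickle("ground_.pkl"))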
I'm not able to pickle a pyzabbix.ZabbixAPI class instance with the following code:
from pyzabbix import ZabbixAPI
from pickle import dumps
api = ZabbixAPI('http://platform.autuitive.com/monitoring/')
print dumps(api)
It results in the following error:
Traceback (most recent call last):
File "zabbix_test.py", line 8, in <module>
print dumps(api)
File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
Pickler(file, protocol).dump(obj)
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/usr/lib/python2.7/copy_reg.py", line 84, in _reduce_ex
dict = getstate()
TypeError: 'ZabbixAPIObjectClass' object is not callable
Well, in the documentation it says:
instances of such classes whose __dict__ or the result of calling
__getstate__() is picklable (see section The pickle protocol for details).
And it would appear this class isn't one. If you really need to do this, then consider writing your own pickling routines.
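If you do go that route, a minimal sketch could look like this (the subclass, the _server attribute, and the assumption that the URL is the only state worth keeping are all mine, not pyzabbix's; a real ZabbixAPI may also need credentials or a login call after reconnecting):

from pickle import dumps
from pyzabbix import ZabbixAPI

class PicklableZabbixAPI(ZabbixAPI):
    def __init__(self, server):
        super(PicklableZabbixAPI, self).__init__(server)
        self._server = server                 # remember how we were built

    def __getstate__(self):
        # Only the constructor argument goes into the pickle.
        return {'server': self._server}

    def __setstate__(self, state):
        # Rebuild the live connection instead of restoring unpicklable internals.
        self.__init__(state['server'])

api = PicklableZabbixAPI('http://platform.autuitive.com/monitoring/')
s = dumps(api)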