Why does this not pickle functions as redefined through code? [duplicate]

I'm trying to transfer a function across a network connection (using asyncore). Is there an easy way to serialize a python function (one that, in this case at least, will have no side effects) for transfer like this?
I would ideally like to have a pair of functions similar to these:
def transmit(func):
    obj = pickle.dumps(func)
    [send obj across the network]

def receive():
    [receive obj from the network]
    func = pickle.loads(s)
    func()

You could serialise the function bytecode and then reconstruct it on the calling side. The marshal module can be used to serialise code objects, which can then be reassembled into a function, e.g.:
import marshal
def foo(x): return x*x
code_string = marshal.dumps(foo.__code__)
Then in the remote process (after transferring code_string):
import marshal, types
code = marshal.loads(code_string)
func = types.FunctionType(code, globals(), "some_func_name")
func(10) # gives 100
A few caveats:
marshal's format (any Python bytecode, for that matter) may not be compatible between major Python versions.
It will only work on the CPython implementation.
If the function references globals (including imported modules, other functions etc) that you need to pick up, you'll need to serialise these too, or recreate them on the remote side. My example just gives it the remote process's global namespace.
You'll probably need to do a bit more to support more complex cases, like closures or generator functions.
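Putting those pieces together, here's a minimal round-trip sketch under those caveats (the helper names make_payload and rebuild are just illustrative, and the function's default arguments are carried along since marshal can handle them when they are themselves marshallable):
import marshal, types

def make_payload(func):
    # Bundle the code object with the function's name and default arguments.
    return marshal.dumps((func.__code__, func.__name__, func.__defaults__))

def rebuild(payload, env=None):
    code, name, defaults = marshal.loads(payload)
    # env supplies whatever globals the function expects; empty here.
    return types.FunctionType(code, env or {}, name, defaults)

def scale(x, factor=3):
    return x * factor

payload = make_payload(scale)   # bytes you could send over the socket
print(rebuild(payload)(5))      # prints 15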

Check out Dill, which extends Python's pickle library to support a greater variety of types, including functions:
>>> import dill as pickle
>>> def f(x): return x + 1
...
>>> g = pickle.dumps(f)
>>> f(1)
2
>>> pickle.loads(g)(1)
2
It also supports references to objects in the function's closure:
>>> def plusTwo(x): return f(f(x))
...
>>> pickle.loads(pickle.dumps(plusTwo))(1)
3

Pyro is able to do this for you.

The simplest way is probably inspect.getsource(object) (see the inspect module), which returns a string with the source code of a function or a method.
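A minimal sketch of that idea (assuming the function lives in a real .py file, since getsource reads the source from disk, and using exec() on the receiving side):
import inspect

def foo(x):
    return x * x

source = inspect.getsource(foo)   # a plain string, easy to send over a socket

# receiving side
namespace = {}
exec(source, namespace)
print(namespace["foo"](4))        # prints 16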

It all depends on whether you generate the function at runtime or not:
If you do, inspect.getsource(object) won't work for dynamically generated functions, because it reads the object's source from its .py file; only functions defined before execution can be retrieved as source.
And if your functions are placed in files anyway, why not give the receiver access to those files and only pass around module and function names?
The only solution for dynamically created functions that I can think of is to construct the function as a string before transmission, transmit the source, and then exec() it on the receiver side (eval() only handles expressions).
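A tiny sketch of that string-based approach (the function name and body here are purely illustrative):
# sender: build the source at runtime
source = "def triple(x):\n    return x * 3\n"

# receiver: execute the transmitted string into its own namespace
remote_namespace = {}
exec(source, remote_namespace)
print(remote_namespace["triple"](7))   # prints 21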
Edit: the marshal solution also looks pretty smart; I didn't know you could serialize anything other than built-ins.

In modern Python you can pickle functions, and many variants of them. Consider this:
import pickle, time
def foobar(a, b):
    print("%r %r" % (a, b))
you can pickle it
p = pickle.dumps(foobar)
q = pickle.loads(p)
q(2,3)
you can pickle closures
import functools
foobar_closed = functools.partial(foobar,'locked')
p = pickle.dumps(foobar_closed)
q = pickle.loads(p)
q(2)
even if the closure uses a local variable
def closer():
    z = time.time()
    return functools.partial(foobar, z)
p = pickle.dumps(closer())
q = pickle.loads(p)
q(2)
but if you close it using an internal function, it will fail
def builder():
    z = 'internal'
    def mypartial(b):
        return foobar(z, b)
    return mypartial
p = pickle.dumps(builder())
q = pickle.loads(p)
q(2)
with error
pickle.PicklingError: Can't pickle <function mypartial at 0x7f3b6c885a50>: it's not found as __main__.mypartial
Tested with Python 2.7 and 3.6
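If you do need that last case, the dill package mentioned in another answer handles nested functions and their closure cells; a minimal sketch, assuming dill is installed and that foobar is also defined in the process doing the loading:
import dill

def builder():
    z = 'internal'
    def mypartial(b):
        return foobar(z, b)
    return mypartial

p = dill.dumps(builder())   # works where plain pickle raised PicklingError
q = dill.loads(p)
q(2)                        # prints 'internal' 2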

The cloud package (pip install cloud) can pickle arbitrary code, including dependencies. See https://stackoverflow.com/a/16891169/1264797.

import pickle

code_string = '''
def foo(x):
    return x * 2

def bar(x):
    return x ** 2
'''

obj = pickle.dumps(code_string)
Now
exec(pickle.loads(obj))
foo(1)
> 2
bar(3)
> 9
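If you'd rather not have foo and bar dumped straight into your own globals, a small variation (just a sketch) passes an explicit namespace dict to exec:
namespace = {}
exec(pickle.loads(obj), namespace)
print(namespace["foo"](1))   # 2
print(namespace["bar"](3))   # 9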

Cloudpickle is probably what you are looking for.
Cloudpickle is described as follows:
cloudpickle is especially useful for cluster computing where Python
code is shipped over the network to execute on remote hosts, possibly
close to the data.
Usage example:
import cloudpickle
import pickle

def add_one(n):
    return n + 1

pickled_function = cloudpickle.dumps(add_one)
pickle.loads(pickled_function)(42)

You can do this:
def fn_generator():
    def fn(x, y):
        return x + y
    return fn
Now, transmit(fn_generator()) will send the actual definition of fn(x, y) instead of a reference to the module name.
You can use the same trick to send classes across the network.

The basic functions of this module cover your query, plus you get good compression over the wire; see the instructive source code:
y_serial.py module :: warehouse Python objects with SQLite
"Serialization + persistence :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."
http://yserial.sourceforge.net

Here is a helper class you can use to wrap functions in order to make them picklable. The caveats already mentioned for marshal will apply, but an effort is made to use pickle whenever possible. No effort is made to preserve globals or closures across serialization.
import marshal
import pickle
import types

class PicklableFunction:
    def __init__(self, fun):
        self._fun = fun

    def __call__(self, *args, **kwargs):
        return self._fun(*args, **kwargs)

    def __getstate__(self):
        try:
            return pickle.dumps(self._fun)
        except Exception:
            return marshal.dumps((self._fun.__code__, self._fun.__name__))

    def __setstate__(self, state):
        try:
            self._fun = pickle.loads(state)
        except Exception:
            code, name = marshal.loads(state)
            self._fun = types.FunctionType(code, {}, name)
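A hypothetical round trip with the wrapper (the function name square is only for illustration):
import pickle

def square(x):
    return x * x

wrapped = PicklableFunction(square)
data = pickle.dumps(wrapped)     # __getstate__ tries plain pickle first, then marshal
restored = pickle.loads(data)
print(restored(6))               # prints 36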

Related

Using numba to compile dynamic functions

I'm writing a program which dynamically detects and imports Python functions and determines which input parameters and outputs they will expect/generate.
Like so:
def importFunctions(self, filename):
    moduleImport = __import__(filename)
    members = getmembers(moduleImport, isfunction)
    functions = []
    for m in members:
        function = getattr(moduleImport, m[0])
        number_of_inputs = function.__code__.co_argcount
        inputs = function.__code__.co_varnames
        if number_of_inputs > 1:
            inputs = inputs[0:number_of_inputs-1]
        elif number_of_inputs == 1:
            inputOne = inputs[0]
            inputs = []
            inputs.append(inputOne)
        outputs = function.__annotations__["return"]
        functions.append([function, inputs, outputs])
    return functions
This works only when I properly annotate the function; an example function could look something like this:
from numba import jit

@jit
def subtraction(a, b) -> ["difference"]:
    a = float(a)
    b = float(b)
    difference = a - b
    return (difference,)
This works perfectly fine without the decorator, but when I want to add the numba "jit" decorator to a function, I get an error saying that the imported function is missing the "return" annotation.
UPDATE
Having tried to access the original function by using "func.py_func" as suggested by @Rutger Kassies, my suspicion is that either getmembers or getattr is not properly importing the numba to-be-compiled function.
It seems that getmembers finds "jit" as a separate member and doesn't correctly associate it with the original function. The way it's written above, the 'function' named "jit" is of type function, as it should be. However, calling it returns a "<function _jit..wrapper". This has me scratching my head quite a bit, but I suppose getattr is somehow behind this.
My guess is that I will have to find another approach to dynamically importing functions that doesn't rely on "getattr".
If you're dealing with the numba.jit or numba.njit decorators, you can access the original function, in all its annotated glory, through the .py_func attribute. A simple example:
import numpy as np
import numba
from typing import get_type_hints, Annotated, Any

custom_output_type = Annotated[Any, "something"]

@numba.njit
def func(x: float) -> custom_output_type:
    return x**2

# trigger compilation, not required
func(1.2)

get_type_hints(func.py_func, include_extras=True)
Which returns what you would expect from a regular Python function:
{'x': float, 'return': typing.Annotated[typing.Any, 'something']}
It would be similar when using the inspect module.
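For instance, continuing the example above, a quick inspect-based sketch (the exact repr of the annotations may vary by Python version):
import inspect

sig = inspect.signature(func.py_func)
print(sig.parameters["x"].annotation)   # <class 'float'>
print(sig.return_annotation)            # typing.Annotated[typing.Any, 'something']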
It gets more complicated when you use the other decorators like vectorize & guvectorize, unfortunately. See for example:
https://numba.discourse.group/t/using-annotations-with-numba-gu-vectorize-functions/1008
It's probably best to rely as much as possible on the inspect & typing modules over accessing the private attributes of a function.

Alternatives to using nested functions in PySpark mapPartitions when using Cython?

I have a row-wise operation I wish to perform on my dataframe which takes in some fixed variables as parameters. The only way I know how to do this is with the use of nested functions. I'm trying to use Cython to compile a portion of my code, then call the Cython function from within mapPartitions, but it raised the error PicklingError: Can't pickle <cyfunction outer_function.<locals>._nested_function at 0xfffffff>.
When using pure Python, I do
def outer_function(fixed_var_1, fixed_var_2):
    def _nested_function(partition):
        for row in partition:
            yield dosomething(row, fixed_var_1, fixed_var_2)
    return _nested_function

output_df = input_df.repartition(some_col).rdd \
    .mapPartitions(outer_function(a, b))
Right now I have outer_function defined in a separate file, like this
# outer_func.pyx
def outer_function(fixed_var_1, fixed_var_2):
    def _nested_function(partition):
        for row in partition:
            yield dosomething(row, fixed_var_1, fixed_var_2)
    return _nested_function
and this
# runner.py
from outer_func import outer_function

output_df = input_df.repartition(some_col).rdd \
    .mapPartitions(outer_function(a, b))
And this throws the pickling error above.
I've looked at https://docs.databricks.com/user-guide/faq/cython.html and tried to get outer_function to work that way. Still, the same error occurs. The problem is that the nested function does not appear in the global namespace of the module, so it cannot be found and serialized.
I've also tried doing this
def outer_function(fixed_var_1, fixed_var_2):
    global _nested_function
    def _nested_function(partition):
        for row in partition:
            yield dosomething(row, fixed_var_1, fixed_var_2)
    return _nested_function
This throws a different error AttributeError: 'module' object has no attribute '_nested_function'.
Is there any way of not using nested function in this case? Or is there another way I can make the nested function "serializable"?
Thanks!
EDIT: I also tried doing
# outer_func.pyx
class PartitionFuncs:
def __init__(self, fixed_var_1, fixed_var_2):
self.fixed_var_1 = fixed_var_1
self.fixed_var_2 = fixed_var_2
def nested_func(self, partition):
for row in partition:
yield dosomething(row, self.fixed_var_1, self.fixed_var_2)
# main.py
from outer_func import PartitionFuncs
p_funcs = PartitionFuncs(a, b)
output_df = input_df.repartition(some_col).rdd \
.mapPartitions(p_funcs.nested_func)
And still I get PicklingError: Can't pickle <cyfunction PartitionFuncs.nested_func at 0xfffffff>. Oh well, the idea didn't work.
This is a sort-of-half answer, because when I tried your PartitionFuncs class, the bound method p_funcs.nested_func pickled/unpickled fine for me (I didn't try combining it with PySpark, though), so whether the solution below is necessary may depend on your Python version/platform etc. Pickle should support bound methods from Python 3.4; however, it looks like PySpark forces the pickle protocol to 3, which will stop that working. There might be ways to change this, but I don't know them.
Nested functions are known not to be pickleable, so that approach definitely won't work. The class approach is the right one.
My suggestion in the comments was to just try pickling the class, not the bound function. For this to work an instance of the class needs to be callable, so you rename your function to __call__:
class PartitionFuncs:
    def __init__(self, fixed_var_1, fixed_var_2):
        self.fixed_var_1 = fixed_var_1
        self.fixed_var_2 = fixed_var_2

    def __call__(self, partition):
        for row in partition:
            yield dosomething(row, self.fixed_var_1, self.fixed_var_2)
This does depend on both the fixed_var variables being pickleable by default. If they're not you can write custom saving and loading methods, as described in the pickle documentation.
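For example, here is a minimal sketch of that custom-state approach, where the hypothetical unpicklable piece of state is an open file handle that gets dropped on pickling and rebuilt from its path after unpickling:
class PartitionFuncs:
    def __init__(self, fixed_var_1, data_path):
        self.fixed_var_1 = fixed_var_1
        self.data_path = data_path
        self._fh = open(data_path)        # not picklable

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_fh"]                  # drop the unpicklable handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._fh = open(self.data_path)   # recreate it on the worker

    def __call__(self, partition):
        for row in partition:
            yield dosomething(row, self.fixed_var_1, self._fh)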
As you point out in your comment, this does mean you need a separate class for each function you define. Options here involve inheritance, or having a separate PickleableData class that each of the Func classes can hold a reference to.

Load python source code into function object

I need to create a unit-test-like framework for lightweight auto-grading of single-function Python assignments. I'd love to do something like the following:
test = """
def foo(x):
return x * x
"""
compiled_module = magic_compile(test)
result = compiled_module.foo(3)
assert result == 9
Is there such a magic_compile function? The best I have so far is
exec(test)
result = getattr(sys.modules[__name__], 'foo')(3)
But this seems dangerous and unstable. This won't be run in production, so I'm not super-concerned about sandboxing and safety. Just wanted to make sure there wasn't a better solution I'm missing.
In Python 3.x:
module = {}
exec(test, module)
assert module['foo'](3) == 9
module isn't a real Python module; it's just a namespace that holds the code's globals, so they don't pollute your own module's namespace.
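If you want the attribute-style access shown in the question, a small sketch of a magic_compile built on types.ModuleType (reusing the test string from the question; the module name "submission" is arbitrary):
import types

def magic_compile(source, name="submission"):
    # Execute the code inside a fresh module object so its functions
    # don't leak into our own globals.
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module

compiled_module = magic_compile(test)
assert compiled_module.foo(3) == 9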

Serializing a Python class

I have a sample Python class
class bean:
    def __init__(self, puid, pogid, bucketId, dt, at):
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() - datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt = dt
        self.at = at
Now I know that in Java, to make a class serializable, we just have to implement Serializable and override a few methods, and life is simple. Python is supposed to be just as simple, yet I can't find a way to serialize the objects of this class.
This class should be serializable over the network, because objects of this class go to Apache Spark, which distributes the objects over the network.
What is the best way to do that?
I also found this but don't know if it is the best way to do it.
I also read
Classes, functions, and methods cannot be pickled -- if you pickle an object, the object's class is not pickled, just a string that identifies what class it belongs to.
So does that mean such classes can't be serialized?
PS: There would be millions of objects of this class, as the data is huge. So please provide two solutions: the easiest and the most efficient way of doing so.
EDIT:
For clarification, I have to use it something like this:
def myfun():
    **Some Logic**
    t1 = bean(<params>)
    t2 = bean(<params2>)
    temp = list()
    temp.append(t1)
    temp.append(t2)
    return temp
Now this is how it is finally called:
PairRDD.map(myfun).collect()
which throws the exception
<function __init__ at 0x7f3549853c80> is not JSON serializable
First, for your example, pickle will work great. pickle doesn't serialize "functions", it only serializes "data"; so if you have the types you are trying to serialize available in the remote script, i.e. if you have the type "bean" imported on the receiving end, you can use pickle or cPickle and everything will work. The disclaimer you mentioned states that it doesn't keep the code of the class, meaning that if you don't have it imported on the receiving end, pickle won't work for you.
All cross-language serialization solutions (e.g. JSON, XML) will never provide the ability to transfer class "code", because there's no reasonable way to represent it. If you're using the same language on both ends (like here) there are ways to get this to work: you could, for example, marshal the object, pickle the result, send it over, receive it on the receiving end, unpickle and unmarshal it, and you have an object with its functions; this is in fact sending the code and eval()-ing it on the receiving end.
Here's a quick example, based on your class, of pickling an object:
test.py
import datetime
import pickle

class bean:
    def __init__(self, puid, pogid, bucketId, dt, at):
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() - datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt = dt
        self.at = at

    def whoami(self):
        return "%d %d" % (self.puid, self.pogid)

def myfun():
    t1 = bean(1, 2, 3, "2015-12-31 11:50:25", 4)
    t2 = bean(5, 6, 7, "2015-12-31 12:50:25", 8)
    tmp = list()
    tmp.append(t1)
    tmp.append(t2)
    return tmp

if __name__ == "__main__":
    with open("test.data", "w") as f:
        pickle.dump(myfun(), f)
    with open("test.data", "r") as f2:
        obj = pickle.load(f2)
    print "|".join([bean.whoami() for bean in obj])
running it:
ben@ben-lnx:~$ python test.py
1 2|5 6
So you can see pickle works as long as you have the type imported.
As long as all the arguments you pass to __init__ (puid, pogid, bucketId, dt, at) can be serialized, there should be no need for any additional steps. If you experience any problems, it most likely means you didn't properly distribute your modules over the cluster.
While PySpark automatically distributes variables and functions referenced inside closures, distributing modules, libraries and classes is your responsibility. In the case of simple classes, creating a separate module and passing it via SparkContext.addPyFile should be enough:
# https://www.python.org/dev/peps/pep-0008/#class-names
from some_module import Bean
sc.addPyFile("some_module.py")
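For illustration, some_module.py could be little more than the class from the question, renamed to Bean per the PEP 8 link above (a sketch, not a prescribed layout):
# some_module.py
import datetime

class Bean(object):
    def __init__(self, puid, pogid, bucketId, dt, at):
        self.puid = puid
        self.pogid = pogid
        self.bucketId = bucketId
        self.dt = (datetime.datetime.today() -
                   datetime.datetime.strptime(dt, "%Y-%m-%d %H:%M:%S")).days
        self.absdt = dt
        self.at = at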

cPickle - Ignore stuff it can't serialize instead of raising an exception

I'm using cPickle to serialize data that's used for logging.
I'd like to be able to throw whatever I want into an object, then serialize it. Usually this is fine with cPickle, but just ran into a problem where one of the objects I wanted to serialize contained a function. This caused cPickle to raise an exception.
I would rather cPickle just skipped over stuff it can't deal with instead of causing the whole process to implode.
What is a good way to make this happen?
I'm assuming that you're looking for a best-effort solution and you're okay if the unpickled results don't function properly.
For your particular use case, you may want to register a pickle handler for function objects. Just make it a dummy handler that's good enough for your best-effort purposes. Making a handler for functions that really works is possible, but it's rather tricky. To avoid affecting other code that pickles, you'll probably want to deregister the handler when exiting your logging code.
Here's an example (without any deregistration):
import cPickle
import copy_reg
from types import FunctionType

# data to pickle: note that o['x'] is a lambda and they
# aren't natively picklable (at this time)
o = {'x': lambda x: x, 'y': 1}

# shows that o is not natively picklable (because of
# o['x'])
try:
    cPickle.dumps(o)
except TypeError:
    print "not natively picklable"
else:
    print "was pickled natively"

# create a mechanism to turn unpicklable functions into
# stub objects (the string "STUB" in this case)
def stub_pickler(obj):
    return stub_unpickler, ()

def stub_unpickler():
    return "STUB"

copy_reg.pickle(
    FunctionType,
    stub_pickler, stub_unpickler)

# shows that o is now picklable but o['x'] is restored
# to the stub object instead of its original lambda
print cPickle.loads(cPickle.dumps(o))
It prints:
not natively picklable
{'y': 1, 'x': 'STUB'}
Alternatively, try cloudpickle:
>>> import cloudpickle
>>> squared = lambda x: x ** 2
>>> pickled_lambda = cloudpickle.dumps(squared)
>>> import pickle
>>> new_squared = pickle.loads(pickled_lambda)
>>> new_squared(2)
4
pip install cloudpickle and live your dreams. The same dreams lived by dask, IPython parallel, and PySpark.
