Is it possible to pickle a Python object by reference (by name)? - python

I have a situation where there's a complex object that can be referenced by a unique name like package.subpackage.MYOBJECT. While it's possible to pickle this object using the standard pickle algorithm, the resulting data string will be very big.
I'm looking for a way to get the same pickling semantics for an object that pickle already provides for classes and functions: Python's pickle just dumps their fully qualified names, not their code. This way only a string like package.subpackage.MYOBJECT would be dumped, and upon unpickling the object would be imported, just as happens for functions or classes.
It seems that this task boils down to making the object aware of the variable name it's bound to, but I have no clue how to do that.
Here's a short example to explain myself clearly (obvious imports are skipped).
File bigpackage/bigclasses/models.py:
class SomeInterface(object):
    __metaclass__ = ABCMeta

    @abstractmethod
    def operation(self):
        pass

class ImplementationA(SomeInterface):
    def operation(self):
        print "ImplementationA"

class ImplementationB(SomeInterface):
    def operation(self):
        print "ImplementationB"

IMPL_A = ImplementationA()
IMPL_B = ImplementationB()
File bigpackage/bigclasses/tasks.py:
@celery.task
def background_task(impl, somearg):
    assert isinstance(impl, SomeInterface)
    impl.operation()
    print somearg
File bigpackage/bigclasses/work.py:
from bigpackage.bigclasses.models import IMPL_A, IMPL_B
from bigpackage.bigclasses.tasks import background_task
background_task.submit(IMPL_A, "arg1")
background_task.submit(IMPL_B, "arg2")
Here I have a trivial background Celery task that accepts one of two available implementations of SomeInterface as an argument. The task's arguments are pickled by Celery, passed to a queue and executed on some worker server that runs exactly the same code base. My idea is to avoid deep pickling of IMPL_A and IMPL_B and instead pass them as bigpackage.bigclasses.models.IMPL_A and bigpackage.bigclasses.models.IMPL_B respectively. That would help with performance and total traffic to the queue server, and would also provide some safety against changes in IMPL_A and IMPL_B that would make them non-pickleable (for example, a lambda anywhere in the object attribute hierarchy).
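For illustration, here is one rough sketch of the behavior being asked for: store the dotted name on the instance and have __reduce__ emit only that name, so unpickling just re-imports the module and looks the attribute up. The _load_by_name helper and the _pickle_name attribute are assumptions made up for this example, not part of any library:

def _load_by_name(dotted_path):
    # "bigpackage.bigclasses.models.IMPL_A" -> re-import the module, return the attribute
    module_path, _, attr = dotted_path.rpartition(".")
    module = __import__(module_path, fromlist=[attr])
    return getattr(module, attr)

class ImplementationA(SomeInterface):
    def operation(self):
        print "ImplementationA"

    def __reduce__(self):
        # serialize only the dotted name; the worker re-imports it on unpickling
        return _load_by_name, (self._pickle_name,)

IMPL_A = ImplementationA()
IMPL_A._pickle_name = "bigpackage.bigclasses.models.IMPL_A"

With this, the pickled payload stays a short string no matter how heavy the instance's internal state is.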

Related

Python objects in other classes or separate?

I have an application I'm working on in Python 2.7 which has several classes that need to interact with each other before returning everything back to the main program for output.
A brief example of the code would be:
class foo_network():
    """Handles all network functionality"""
    def net_connect(self, network):
        """Connects to the network destination"""
        pass

class foo_fileparsing():
    """Handles all sanitation, formatting, and parsing on received file"""
    def file_check(self, file):
        """checks file for potential problems"""
        pass
Currently I have a main file/function which instantiates all the classes and then handles passing data back and forth, as necessary, between them and their methods. However this seems a bit clumsy.
As such I'm wondering two things:
What would be the most 'Pythonic' way to handle this?
What is the best way to handle this for performance and memory usage?
I'm wondering if I should just instantiate objects of one class inside another (from the example, say, creating a foo_fileparsing object within the foo_network class if that is the only class which will be calling it), rather than my current approach of returning everything to the main function and passing it between objects that way.
Unfortunately I can't find a PEP or other resource that seems to address this type of situation.
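For reference, a minimal sketch of the composition approach described in the question (the __init__ wiring and the placeholder data are assumptions, not code from the original post):

class foo_fileparsing():
    """Handles all sanitation, formatting, and parsing on received file"""
    def file_check(self, file):
        pass

class foo_network():
    """Handles all network functionality"""
    def __init__(self):
        # compose: foo_network owns its parser, since it is the only caller
        self.parser = foo_fileparsing()

    def net_connect(self, network):
        received = ""  # placeholder for whatever the connection returns
        self.parser.file_check(received)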
You can use modules, and have every class in its own module. Then you can use import to bring in only the particular piece you need from that module.
All you need to do for that is create a directory with the same name as your class and put an __init__.py file in that directory, which tells Python to treat that directory as a package.
Then, for example, the foo_network folder contains a file named foo_network.py and a file __init__.py, and in foo_network.py the code is
class foo_network():
    """Handles all network functionality"""
    def net_connect(self, network):
        """Connects to the network destination"""
        pass
and in any other file you can simply use
from foo_network import net_connect
It will only import that particular name. This way your code will not look messy and you will be importing only what is required.
You can also do
from foo_network import *
to import all methods at once.
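As a concrete sketch of the layout this answer describes (the nested foo_network.foo_network import path below assumes the class lives in foo_network/foo_network.py, as stated above):

# foo_network/__init__.py    -- can be empty; it marks the directory as a package
# foo_network/foo_network.py -- contains the foo_network class shown earlier

from foo_network.foo_network import foo_network

net = foo_network()
net.net_connect("some-destination")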

Extended celery.schedules.schedule objects aren't passed extra arguments

The code is here.
I wrote an extension of the celery.schedules.schedule interface, and I can't figure out why it's getting instantiated with nothing set in the extra values I created.
When I instantiate them before passing them to app.conf.CELERYBEAT_SCHEDULE, they're correct. But all the ones that celery beat instantiates are incorrect.
I asked in the #celery IRC channel and the only response I got was about lazy mode, but that's for celery.beat.Scheduler, not celery.schedules.schedule, so if it's relevant, I don't understand how. Do I have to extend that too, just so that it instantiates the schedules correctly?
I've tried digging into the celery code with the debugger to figure out where these schedules are getting instantiated, and I can't find it. I can see that when they come back from the Unpickler they are wrong, but I can't find where they get created or where they get pickled.
celery.schedules.schedule has a __reduce__ method that defines how to serialize and reconstruct the object using pickle:
https://github.com/celery/celery/blob/master/celery/schedules.py#L150-L151
When pickle serializes the object it will call:
fun, args = obj.__reduce__()
and when it reconstructs the object it will do:
obj = fun(*args)
So if you've added new state to your custom schedule subclass, passed as arguments to __init__, then you will also have to define a __reduce__ method that takes these new arguments into account:
class myschedule(schedule):

    def __init__(self, run_every=None, relative=False, nowfun=None,
                 odds=None, max_run_every=None, **kwargs):
        super(myschedule, self).__init__(
            run_every, relative, nowfun, **kwargs)
        self.odds = odds
        self.max_run_every = max_run_every

    def __reduce__(self):
        return self.__class__, (
            self.run_every, self.relative, self.nowfun,
            self.odds, self.max_run_every)
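A quick, illustrative round-trip check (assuming the myschedule class above lives in an importable module, which pickle requires for the class reference itself):

import pickle
from datetime import timedelta

s = myschedule(run_every=timedelta(minutes=5), odds=3, max_run_every=60)
s2 = pickle.loads(pickle.dumps(s))
assert s2.odds == 3 and s2.max_run_every == 60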
After a lot of time in the Python debugger, I narrowed the problem down to celery.beat.PersistentScheduler.sync() and/or shelve.sync() (which is called by the former).
When the shelve is synced, the values are getting lost. I don't know why, but I'm pretty sure it's a bug in either Celery or Shelve.
In any case, I wrote a workaround.

How to work with interactively-defined classes in IPython.parallel?

Context
In an interactive prototyping session in the notebook connected to a cluster, I would like to define a class that is both available in the client __main__ session and interactively updated on the cluster engine nodes, so that I can move instances of that class around by passing them as arguments to a LoadBalanced view. The following demonstrates a typical user session:
First setup the parallel clustering environment:
>>> from IPython.parallel import Client
>>> rc = Client()
>>> lview = rc.load_balanced_view()
>>> rc[:]
<DirectView [0, 1, 2]>
In a notebook cell let's define the code snippet of the component we are interactively editing:
>>> class MyClass(object):
...     def __init__(self, parameter):
...         self.parameter = parameter
...
...     def update_something(self, some_data):
...         # do something smart here with some_data & internal state
...
...     def compute_something(self, other_data):
...         # do something smart here with other data & internal state
...         return something
...
In the next cell, let's create a script that builds instances of this class and then use the load balanced view of the cluster environment to evaluate our component on a wide range of input parameters:
>>> def process(obj, some_data, other_data):
...     obj.update_something(some_data)
...     return obj.compute_something(other_data)
...
>>> tasks = []
>>> some_instances = [MyClass(i) for i in range(10)]
>>> for obj in some_instances:
...     for some_data in data_source_1:
...         for other_data in data_source_2:
...             ar = lview.apply_async(process, obj, some_data, other_data)
...             tasks.append(ar)
...
>>> # wait for computation to end
>>> results = [ar.get() for ar in tasks]
Problem
That will obviously not work, as the engines of the load balanced view will be unable to unpickle the instances passed as the first argument to the process function. The process function definition itself is passed successfully, as I assume that apply_async does bytecode introspection to pickle it (by accessing the function's code attribute) and then just does a simple pickle of the remaining arguments.
Possible solutions (that don't work for me)
One alternative solution would be to use the %%px cell magic on the cell holding the definition of the class MyClass. However, that would prevent me from building the class instances in the client script that also does the scheduling. I would need to copy and paste the cell content into another cell without the %%px magic (or execute the cell twice, once with the magic and once without), but this is tedious when I am still editing the methods of the class in an iterative development & evaluation setting.
An alternative solution would be to embed the class definition inside the process function, but I find this impractical as I would like to reuse that class definition in other functions later in my notebook.
Alternatively, I could just stop using a class and only work with functions, which can be shipped over to the engines by passing them as the first argument to apply_async. However, I don't like that either, as I would like to prototype my code in an object-oriented way for later extraction from the notebook and inclusion of the resulting class in an object-oriented library, with the notebook session serving as a collaborative prototyping tool for exchanging ideas between developers via the http://nbviewer.ipython.org publisher.
The final alternative would be to write my class in a Python module in a file on the filesystem and ship that file to the engines' PYTHONPATH, using NFS for instance. That works, but it prevents me from working only in the notebook environment, which defeats the whole purpose of interactive prototyping in the notebook.
So basically, is there a way to define a class interactively and then ship its definition around to the engines?
It should be possible to pickle a class definition by using inspect.getsource in the client, sending the source to the engines, and executing it there with exec, but unfortunately source inspection does not work for classes defined inside the DummyMod built-in module:
TypeError: <IPython.core.interactiveshell.DummyMod object at 0x10c2c4e50> is a built-in class
Is there a way to inspect the bytecode of a class definition instead?
Or is it possible to use the %%px magic so as to both execute the content of the cell locally on the client and on each engine?
Thanks for the detailed question (and pinging me on Twitter).
First, maybe it should be considered a bug that you can't just push classes,
because the simple solution should be
rc[:]['MyClass'] = MyClass
but pickling interactively defined classes results in only a reference ('\x80\x02c__main__\nMyClass\nq\x01.'), which gives you the DummyMod AttributeError.
This can probably be fixed internally in IPython's serialization.
On to an actual working solution, though.
Adding local execution to %%px is super easy, just:
def pxlocal(line, cell):
    ip = get_ipython()
    ip.run_cell_magic("px", line, cell)
    ip.run_cell(cell)
get_ipython().register_magic_function(pxlocal, "cell")
And now you have a %%pxlocal magic that runs %%px in addition to running the cell locally.
Then all you have to do is:
%%pxlocal
class MyClass(object):
    # etc
to define your class everywhere.
I will add a --local flag to %%px, so this extra step isn't necessary.
A complete, working example notebook.
I think you could use "dill" to pickle the interactively defined class, and not have to worry about the %%pxlocal magic, DummyMod, and faking namespaces.
To pickle a class interactively, just do "import dill" and then build your class as you first did. You should then be able to send it across any sane map or apply_async function.
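A minimal local round-trip sketch (assuming dill is installed; unlike plain pickle, it serializes classes defined in __main__ by value rather than by reference):

import dill

class MyClass(object):
    def __init__(self, parameter):
        self.parameter = parameter

payload = dill.dumps(MyClass(42))  # works even though MyClass exists only interactively
obj = dill.loads(payload)
print(obj.parameter)               # 42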

We need to pickle any sort of callable

Recently a question was posed regarding some Python code attempting to facilitate distributed computing through the use of pickled processes. Apparently, that functionality has historically been possible, but for security reasons it is now disabled. On the second attempt at transmitting a function object through a socket, only the reference was transmitted. Correct me if I am wrong, but I do not believe this issue is related to Python's late binding. Given the presumption that process and thread objects cannot be pickled, is there any way to transmit a callable object? We would like to avoid transmitting compressed source code for each job, as that would probably make the entire attempt pointless. Only the Python core library can be used, for portability reasons.
You could marshal the bytecode and pickle the other function things:
import marshal
import pickle
marshaled_bytecode = marshal.dumps(your_function.func_code)
# In this process, the other function attributes are lost, so they have to be sent separately.
pickled_name = pickle.dumps(your_function.func_name)
pickled_arguments = pickle.dumps(your_function.func_defaults)
pickled_closure = pickle.dumps(your_function.func_closure)
# Send the marshaled bytecode and the other function things through a socket (they are byte strings).
send_through_a_socket((marshaled_bytecode, pickled_name, pickled_arguments, pickled_closure))
In another python program:
import marshal
import pickle
import types
# Receive the marshaled bytecode and the other function things.
marshaled_bytecode, pickled_name, pickled_arguments, pickled_closure = receive_from_a_socket()
your_function = types.FunctionType(marshal.loads(marshaled_bytecode), globals(), pickle.loads(pickled_name), pickle.loads(pickled_arguments), pickle.loads(pickled_closure))
And any references to globals inside the function would have to be recreated in the script that receives the function.
In Python 3, the function attributes used are __code__, __name__, __defaults__ and __closure__.
Please do note that send_through_a_socket and receive_from_a_socket do not actually exist, and you should replace them by actual code that transmits data through sockets.
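For reference, a hedged sketch of the same round trip using the Python 3 attribute names, with the socket transport replaced by a local round trip:

import marshal
import pickle
import types

def some_function(x, y=2):
    return x + y

# serialize the code object with marshal and the other attributes with pickle
payload = (marshal.dumps(some_function.__code__),
           pickle.dumps(some_function.__name__),
           pickle.dumps(some_function.__defaults__),
           pickle.dumps(some_function.__closure__))

code, name, defaults, closure = payload
rebuilt = types.FunctionType(marshal.loads(code), globals(),
                             pickle.loads(name), pickle.loads(defaults),
                             pickle.loads(closure))
print(rebuilt(1))  # prints 3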

Gracefully-degrading pickling in Python

(You may read this question for some background)
I would like to have a gracefully-degrading way to pickle objects in Python.
When pickling an object, let's call it the main object, sometimes the Pickler raises an exception because it can't pickle a certain sub-object of the main object. For example, an error I've been getting a lot is "can’t pickle module objects." That is because I am referencing a module from the main object.
I know I can write up a little something to replace that module with a facade that would contain the module's attributes, but that would have its own issues (1).
So what I would like is a pickling function that automatically replaces modules (and any other hard-to-pickle objects) with facades that contain their attributes. That may not produce a perfect pickling, but in many cases it would be sufficient.
Is there anything like this? Does anyone have an idea how to approach this?
(1) One issue would be that the module may be referencing other modules from within it.
You can decide and implement how any previously-unpicklable type gets pickled and unpickled: see the standard library module copy_reg (renamed to copyreg in Python 3.x).
Essentially, you need to provide a function which, given an instance of the type, reduces it to a tuple -- with the same protocol as the __reduce__ special method (except that __reduce__ takes no arguments, since it's called directly on the object, while the function you provide takes the object as its only argument).
Typically, the tuple you return has 2 items: a callable, and a tuple of arguments to pass to it. The callable must be registered as a "safe constructor" or, equivalently, have an attribute __safe_for_unpickling__ with a true value. Those items will be pickled, and at unpickling time the callable will be called with the given arguments and must return the unpickled object.
For example, suppose that you want to just pickle modules by name, so that unpickling them just means re-importing them (i.e. suppose for simplicity that you don't care about dynamically modified modules, nested packages, etc, just plain top-level modules). Then:
>>> import sys, pickle, copy_reg
>>> def savemodule(module):
...     return __import__, (module.__name__,)
...
>>> copy_reg.pickle(type(sys), savemodule)
>>> s = pickle.dumps(sys)
>>> s
"c__builtin__\n__import__\np0\n(S'sys'\np1\ntp2\nRp3\n."
>>> z = pickle.loads(s)
>>> z
<module 'sys' (built-in)>
I'm using the old-fashioned ASCII form of pickle so that s, the string containing the pickle, is easy to examine: it instructs unpickling to call the built-in __import__ function with the string sys as its sole argument. And z shows that this does indeed give us back the built-in sys module as the result of the unpickling, as desired.
Now, you'll have to make things a bit more complex than just __import__ (you'll have to deal with saving and restoring dynamic changes, navigating a nested namespace, etc.), and thus you'll also have to call copy_reg.constructor (passing as its argument your own function that performs this work) before you copy_reg the module-saving function that returns your other function (and, if in a separate run, also before you unpickle the pickles you made using said function). But I hope this simple case helps to show that there's really nothing much to it that's at all "intrinsically" complicated!-)
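As a small hedged sketch of the copy_reg.constructor registration mentioned above (still limited to plain top-level modules; rebuild_module and save_module are illustrative names, not library functions):

import sys, copy_reg

def rebuild_module(name):
    # re-import by name at unpickling time
    __import__(name)
    return sys.modules[name]

def save_module(module):
    return rebuild_module, (module.__name__,)

copy_reg.constructor(rebuild_module)  # declare the rebuilder safe for unpickling
copy_reg.pickle(type(sys), save_module)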
How about the following: a wrapper you can use to wrap some modules (maybe any module) in something that's pickleable. You could then subclass the Pickler object to check whether the target object is a module and, if so, wrap it. Does this accomplish what you desire?
import pickle

class PickleableModuleWrapper(object):
    def __init__(self, module):
        # make a copy of the module's namespace in this instance
        self.__dict__ = dict(module.__dict__)
        # remove anything that's going to give us trouble during pickling
        self.remove_unpickleable_attributes()

    def remove_unpickleable_attributes(self):
        for name, value in self.__dict__.items():
            try:
                pickle.dumps(value)
            except Exception:
                del self.__dict__[name]

p = pickle.dumps(PickleableModuleWrapper(pickle))
wrapped_mod = pickle.loads(p)
Hmmm, something like this?
import sys

for attrib_name in dir(someobject):
    attrib = getattr(someobject, attrib_name)  # dir() gives names, so look up the value
    if type(attrib) == type(sys):  # is a module
        # put in a facade: either recursively walk the module and do the same thing,
        # or just put in something like str('modulename_module')
        pass
    else:
        pass  # proceed with normal pickle
Obviously, this would go into an extension of the pickle class with a reimplemented dump method...
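In the same spirit, a hedged sketch using the pure-Python Pickler's persistent_id/persistent_load hooks rather than a reimplemented dump method (it only re-imports top-level modules by name, and the class names are made up for the example):

import pickle
import sys
import types
from StringIO import StringIO  # io.BytesIO on Python 3

class ModuleByNamePickler(pickle.Pickler):
    def persistent_id(self, obj):
        # replace any module with a name-based marker; everything else pickles normally
        if isinstance(obj, types.ModuleType):
            return "module:" + obj.__name__
        return None

class ModuleByNameUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        return __import__(pid.split(":", 1)[1])

buf = StringIO()
ModuleByNamePickler(buf).dump({"mod": sys, "n": 3})
print ModuleByNameUnpickler(StringIO(buf.getvalue())).load()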
