Context
In an interactive prototyping development on the notebook connected to a cluster, I would like to define a class that is both available in the client __main__ session and interactively update on the cluster engine nodes to be able to move instances of that class around by passing such instances a argument to a LoadBalanced view. The following demonstrates the typical user session:
First setup the parallel clustering environment:
>>> from IPython.parallel import Client
>>> rc = Client()
>>> lview = rc.load_balanced_view()
>>> rc[:]
<DirectView [0, 1, 2]>
In a notebook cell let's define the code snippet of the component we are interactively editing:
>>> class MyClass(object):
... def __init__(self, parameter):
... self.parameter = parameter
...
... def update_something(self, some_data):
... # do something smart here with some_data & internal state
...
... def compute_something(self, other_data):
... # do something smart here with other data & internal state
... return something
...
In the next cell, let's create a script that builds instances of this class and then use the load balanced view of the cluster environment to evaluate our component on a wide range of input parameters:
>>> def process(obj, some_data, other_data):
... obj.update_something(some_data)
... return obj.compute_something(other_data)
...
>>> tasks = []
>>> some_instances = [MyClass(i) for i in range(10)]
>>> for obj in some_instances:
... for some_data in data_source_1:
... for other_data in data_source_2:
... ar = lview.apply_async(process, obj, some_data, other_data)
... tasks.append(ar)
...
>>> # wait for computation to end
>>> results = [ar.get() for ar in tasks]
Problem
That will obviously not work as the engines of the load balanced view will be unable to unpickle the instances passed as first argument to the process function. The process function definition itself is passed successfully as I assume that apply_async does bytecode instrospection to pickle it (by accessing the .code attribute of the function) and then just does a simple pickle for the remaining arguments.
Possible solutions (that don't work for me)
One alternative solution would be to use the %%px cell magic on the cell holding the definition of the class MyClass. However that would prevent me to build the class instances in the client script that also do the scheduling. I would need to copy and paste the cell content in an other cell without the %%px magic (or execute the cell twice once with magic and another time without the magic) but this is tedious when I am still editing the methods of the class in an iterative development & evaluation setting.
An alternative solution would be to embed the class definition inside the process function but I find this not practical as I would like to reuse that class definition in other functions later in my notebook.
Alternatively I could just stop using a class and only work with functions that can be shipped over to the engines by passing then as first argument to the apply_async. However I don't like that either as I would like to prototype my code in an object oriented way for later extraction from the notebook and including the resulting class in an object oriented library. The notebook session serving as a collaborative prototyping tool using for exchanging ideas between developers using the http://nbviewer.ipython.org publisher.
The final alternative would be to write my class in a python module on a file on the filesystem and ship that file to the engines PYTHONPATH using NFS for instance. That works but prevent me to work only in the notebook environment which defeats the whole purpose of interactive prototyping in the notebook.
So basically, is there a way to define a class interactively and then ship its definition around to the engines?
It should be possible to pickle a class definition using the inspect.getsource in the client then send the source to the engines and use the eval builtin but unfortunately source inspection does not work for classes defined inside the DummyMod built-in module:
TypeError: <IPython.core.interactiveshell.DummyMod object at 0x10c2c4e50> is a built-in class
Is there a way to inspect the bytecode of a class definition instead?
Or is it possible to use the %%px magic so as to both execute the content of the cell locally on the client and on each engine?
Thanks for the detailed question (and pinging me on Twitter).
First, maybe it should be considered a bug that you can't just push classes,
because the simple solution should be
rc[:]['MyClass'] = MyClass
but pickling interactively defined classes results only in a reference ('\x80\x02c__main__\nMyClass\nq\x01.'), giving your DummyMod AttributeError.
This can probably be fixed internally in IPython's serialization.
On to an actual working solution, though.
Adding local execution to %%px is super easy, just:
def pxlocal(line, cell):
ip = get_ipython()
ip.run_cell_magic("px", line, cell)
ip.run_cell(cell)
get_ipython().register_magic_function(pxlocal, "cell")
And now you have a %%pxlocal magic that runs %%px in addition to running the cell locally.
Then all you have to do is:
%%pxlocal
class MyClass(object):
# etc
to define your class everywhere.
I will add a --local flag to %%px, so this extra step isn't necessary.
A complete, working example notebook.
I think you could use "dill" to pickle the interactively defined class, and not have to worry about %%pxlocal magic, using DummyMod, and faking of namespaces.
To pickle a class interactively, just do "import dill" and then build your class as you first did. You should be able to then send it across any sane map or apply_async function.
Related
What is the best way to handle objects with robot framework? I am starting to write a python class to handle API interactions, which I can therefore use as keywords in robot framework (RF). My question is how does one pass data from one method to another? Do I have to pass the object back to every function to get the data?
In the example below, I call the class and it initializes, but can I reference an instance of the class if I wanted? Or am I supposed to write every method to handle the entire object I get back from another method? Hopefully this makes sense, I basically want to use python like I normally would but inside of RF.
More specifically, is it feasible to distinguish between several instances if I call them all at once?
Test python foo.py:
class foo:
def intialize(self, api):
self.api_item = api
def get_api():
return self.api_item
def do_something_with_api
# doing something with an API, then return results
def do_something_else_with_api
# doing something with an API, then return results
Test Robot file:
*** Settings ***
Library /path/foo.py
*** Variables ***
${api_url} "https://apiurl.com/"
*** Tasks ***
Setup Initialize Settings
${session}= MgsRestApiHandler.intialize ${api_url}
In RF when a class is loaded as a Library there's always an instance object that's created for it. Thus if you have state variables within it, they'll be present for all class methods ("keywords") in your RF source.
In other words, in your example all methods will have access to self.api_item (after initialize() is called); by the way, why don't you add a normal constructor __init()__ and define the var there, even with None value, so it's cleaner?
is it feasible to distinguish between several instances
You can instantiate several instances of the same class ("Library") by importing them multiple times and using the WITH NAME Robot Framework syntax:
*** Settings ***
Library /path/foo.py WITH NAME o1
Library /path/foo.py WITH NAME o2
The "drawback" is you now have to prefix the method call with the instance name - otherwise the framework doesn't know for which object you want to call it:
*** Tasks ***
Setup Initialize Settings
${session1}= o1.intialize ${api_url}
${session2}= o2.intialize ${api_url2}
And if I understand one of your questions correctly (or if not - take this as a general trivia :) - in RF whatever a method/keyword returns - from primitives to complex objects (e.g.nclass instances) is assigned to that variable that in front of the call.
So if you have a method/keyword down the line that expects a complex object, you can pass that returned value - the framework will not mangle it in any way, it'll be passing around a normal pyhton object.
I have a situation where there's a complex object that can be referenced by unique name like package.subpackage.MYOBJECT. While it's possible to pickle this object using standard pickle algorithm, resulting data string will be very big.
I'm looking for some way to get same pickling semantic for an object that is already here for classes and functions: Python's pickle just dumps their fully qualified names, not code. This way just string like package.subpackage.MYOBJECT will be dumped and upon unpickling object will be imported, just like it happens for functions or classes.
It seems that this task boils down to making object aware of variable name it's bound to, but I have no clues how to do it.
Here's short example to explain myself clearly (obvious imports are skipped).
File bigpackage/bigclasses/models.py:
class SomeInterface():
__meta__ = ABCMeta
#abstractmethod
def operation():
pass
class ImplementationA(SomeInterface):
def operation():
print "ImplementationA"
class ImplementationB(SomeInterface):
def operation():
print "ImplementationB"
IMPL_A = ImplementationA()
IMPL_B = ImplementationB()
File bigpackage/bigclasses/tasks.py:
#celery.task
def background_task(impl, somearg):
assert isinstance(impl, SomeInterface)
impl.operation()
print somearg
File bigpackage/bigclasses/work.py:
from bigpackage.bigclasses.models import IMPL_A, IMPL_B
from bigpackage.bigclasses.tasks import background_task
background_task.submit(IMPL_A, "arg1")
background_task.submit(IMPL_B, "arg2")
Here I have trivial background Celery task that accept one of two available implementations of SomeInterface as an argument. Task's arguments are pickled by Celery, passed to a queue and executed on some worker server, that runs exactly the same code base. My idea is to avoid deep pickling of IMPL_A and IMPL_B and instead pass them as bigpackage.bigclasses.models.IMPL_A and bigpackage.bigclasses.models.IMPL_B correspondingly. That will help with performance and total traffic for queue server and also provide some safety against changes in IMPL_A and IMPL_B that will make them non-pickleable (for example lambda anywhere in object attributes hierarchy).
I have tried multiple approaches to pickle a python function with dependencies, following many recommendations on StackOverflow, (such as dill, cloudpickle, etc.) but all seem to run into a fundamental issue that I cannot figure out.
I have a main module that tries to pickle a function from an imported module, sends it over ssh to be unpickled and executed at a remote machine.
So main has:
import dill (for example)
import modulea
serial=dill.dumps( modulea.func )
send (serial)
On the remote machine:
import dill
receive serial
funcremote = dill.loads( serial )
funcremote()
If the functions being pickled and sent are top level functions defined in main itself, everything works. When they are in an imported module, the loads function fails with messages of the type "module modulea not found".
It appears that the module name is pickled along with the function name. I do not see any way to "fix up" the pickle to remove the dependency, or alternately, to create a dummy module in the receiver to become the recipient of the unpickling.
Any pointers will be much appreciated.
--prasanna
I'm the dill author. I do this exact thing over ssh, but with success. Currently, dill and any of the other serializers pickle modules by reference… so to successfully pass a function defined in a file, you have to ensure that the relevant module is also installed on the other machine. I do not believe there is any object serializer that serializes modules directly (i.e. not by reference).
Having said that, dill does have some options to serialize object dependencies. For example, for class instances, the default in dill is to not serialize class instances by reference… so the class definition can also be serialized and send with the instance. In dill, you can also (use a very new feature to) serialize file handles by serializing the file, instead of the doing so by reference. But again, if you have the case of a function defined in a module, you are out-of-luck, as modules are serialized by reference pretty darn universally.
You might be able to use dill to do so, however, just not with pickling the object, but with extracting the source and sending the source code. In pathos.pp and pyina, dill us used to extract the source and the dependencies of any object (including functions), and pass them to another computer/process/etc. However, since this is not an easy thing to do, dill can also use the failover of trying to extract a relevant import and send that instead of the source code.
You can understand, hopefully, this is a messy messy thing to do (as noted in one of the dependencies of the function I am extracting below). However, what you are asking is successfully done in the pathos package to pass code and dependencies to different machines across ssh-tunneled ports.
>>> import dill
>>>
>>> print dill.source.importable(dill.source.importable)
from dill.source import importable
>>> print dill.source.importable(dill.source.importable, source=True)
def _closuredsource(func, alias=''):
"""get source code for closured objects; return a dict of 'name'
and 'code blocks'"""
#FIXME: this entire function is a messy messy HACK
# - pollutes global namespace
# - fails if name of freevars are reused
# - can unnecessarily duplicate function code
from dill.detect import freevars
free_vars = freevars(func)
func_vars = {}
# split into 'funcs' and 'non-funcs'
for name,obj in list(free_vars.items()):
if not isfunction(obj):
# get source for 'non-funcs'
free_vars[name] = getsource(obj, force=True, alias=name)
continue
# get source for 'funcs'
#…snip… …snip… …snip… …snip… …snip…
# get source code of objects referred to by obj in global scope
from dill.detect import globalvars
obj = globalvars(obj) #XXX: don't worry about alias?
obj = list(getsource(_obj,name,force=True) for (name,_obj) in obj.items())
obj = '\n'.join(obj) if obj else ''
# combine all referred-to source (global then enclosing)
if not obj: return src
if not src: return obj
return obj + src
except:
if tried_import: raise
tried_source = True
source = not source
# should never get here
return
I imagine something could also be built around the dill.detect.parents method, which provides a list of pointers to all parent object for any given object… and one could reconstruct all of any function's dependencies as objects… but this is not implemented.
BTW: to establish a ssh tunnel, just do this:
>>> t = pathos.Tunnel.Tunnel()
>>> t.connect('login.university.edu')
39322
>>> t
Tunnel('-q -N -L39322:login.university.edu:45075 login.university.edu')
Then you can work across the local port with ZMQ, or ssh, or whatever. If you want to do so with ssh, pathos also has that built in.
I am writing a Python wrapper for a C library using the cffi.
The C library has to be initialized and shut down. Also, the cffi needs some place to save the state returned from ffi.dlopen().
I can see two paths here:
Either I wrap this whole stateful business in a class like this
class wrapper(object):
def __init__(self):
self.c = ffi.dlopen("mylibrary")
self.c.initialize()
def __del__(self):
self.c.terminate()
Or I provide two global functions that hide the state in a global variable
def initialize():
global __library
__library = ffi.dlopen("mylibrary")
__library.initialize()
def terminate():
__library.terminate()
del __library
The first path is somewhat cumbersome in that it requires the user to always create an object that really serves no other purpose other than managing the library state. On the other hand, it makes sure that terminate() is actually called every time.
The second path seems to result in a somewhat easier API. However, it exposes some hidden global state, which might be a bad thing. Also, if the user forgets to call terminate(), the C library is not unloaded correctly (which is not a big problem on the C side).
Which one of these paths would be more pythonic?
Exposing a wrapper object only makes sense in python if the library actually supports something like multiple instances in one application. If it doesn't support that or it's not really relevant go for kindall's suggestion and just initialize the library when imported and add an atexit handler for cleanup.
Adding wrappers around a stateless api or even an api without support for keeping different sets of state is not really pythonic and would raise expectations that different instances have some kind of isolation.
Example code:
import atexit
# Normal library initialization
__library = ffi.dlopen("mylibrary")
__library.initialize()
# Private library cleanup function
def __terminate():
__library.terminate()
# register function to be called on clean interpreter termination
atexit.register(__terminate)
For more details about atexit this question has some more details, as has the python documentation of course.
I manage a fairly large python-based quantum chemistry suite, PyQuante. I'm currently struggling with how to set various defaults so that users can choose among different options at runtime.
For example, I have three different methods for computing electron repulsion integrals. Let's call them a,b,c. I used to simply pick the one I liked best (say, c), and have that hard-wired into the module that computes these integrals.
I have now modified this to use a module, Defaults.py, that contains all such hard-wires. But this is set at compile/install time. I would now like users to be able to override these options at runtime, say, using a .pyquanterc.py file.
In my integral routines, I currently have something like
from Defaults import integral_method
I know about dictionaries, and the .update() method. But I don't know how I would use this in real life. My defaults module looks like
integral_method = c
should I modify the end of Defaults.py to look for a .pythonrc.py file and override these values? E.g.
if os.path.exists('$HOME/.pythonrc.py'): do_something
If so, what should do_something look like?
With your current setup, the user can change the default functions in his scripts quite easily:
import Defaults
Defaults.integral_method = somefunc
If the user adds this to his script, all your modules that use integral_method from Defaults will use somefunc to calculate integrals.
I might do this via a factory class.
class IntegralSolver:
"""
Factory class containing methods for solving integrals.
>>> solver = IntegralSolver("method1")
>>> solver(x)
# solution via method1
Can also be used directly:
>>> IntegralSolver.method2(x)
# solution via method2
"""
def __init__(self, method):
self.__call__ = getattr(self, method)
#staticmethod
def method1(x):
return method1_solution
#staticmethod
def method2(x):
return method2_solution
It really depends on how your user runs the toolset. If they twiddle the python code each time, just setting a block at the top labeled OPTIONS should be good. If they run it off the command line, use the argparse library to allow them to switch options on the command line. Perhaps have it read the options out of a file with configParser to read a default file with your options, and if the user sets it, an additional file with their options.