Recently a question was posed regarding some Python code attempting to facilitate distributed computing through the use of pickled processes. Apparently, that functionality was possible in the past, but it has since been disabled for security reasons. On the second attempt at transmitting a function object through a socket, only the reference was transmitted. Correct me if I am wrong, but I do not believe this issue is related to Python's late binding. Given the presumption that process and thread objects cannot be pickled, is there any way to transmit a callable object? We would like to avoid transmitting compressed source code for each job, as that would probably make the entire attempt pointless. Only the Python core library can be used, for portability reasons.
You could marshal the bytecode and pickle the other function things:
import marshal
import pickle
marshaled_bytecode = marshal.dumps(your_function.func_code)
# In this process, other function things are lost, so they have to be sent separately.
pickled_name = pickle.dumps(your_function.func_name)
pickled_arguments = pickle.dumps(your_function.func_defaults)
pickled_closure = pickle.dumps(your_function.func_closure)
# Send the marshaled bytecode and the other function things through a socket (they are byte strings).
send_through_a_socket((marshaled_bytecode, pickled_name, pickled_arguments, pickled_closure))
In another python program:
import marshal
import pickle
import types
# Receive the marshaled bytecode and the other function things.
marshaled_bytecode, pickled_name, pickled_arguments, pickled_closure = receive_from_a_socket()
your_function = types.FunctionType(marshal.loads(marshaled_bytecode), globals(), pickle.loads(pickled_name), pickle.loads(pickled_arguments), pickle.loads(pickled_closure))
And any references to globals inside the function would have to be recreated in the script that receives the function.
In Python 3, the function attributes used are __code__, __name__, __defaults__ and __closure__.
Please do note that send_through_a_socket and receive_from_a_socket do not actually exist, and you should replace them by actual code that transmits data through sockets.
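For reference, here is a minimal Python 3 sketch of the same idea as a local round trip, with no socket involved. The helper names pack_function and unpack_function are made up for illustration. The sketch assumes the function has no closure (cell objects cannot be pickled by the standard pickle module) and that both ends run the same Python version, since the marshal format is not guaranteed to be compatible across versions.

```python
import marshal
import pickle
import types


def pack_function(func):
    """Serialize a closure-free function into a single bytes payload."""
    if func.__closure__ is not None:
        # Closure cells are not handled in this sketch.
        raise ValueError("functions with closures are not supported here")
    parts = (marshal.dumps(func.__code__), func.__name__, func.__defaults__)
    return pickle.dumps(parts)


def unpack_function(payload, namespace=None):
    """Rebuild the function; 'namespace' supplies the globals it will see."""
    code_bytes, name, defaults = pickle.loads(payload)
    code = marshal.loads(code_bytes)
    globs = globals() if namespace is None else namespace
    return types.FunctionType(code, globs, name, defaults)


def scale(x, factor=3):
    return x * factor


payload = pack_function(scale)      # bytes, ready to be written to a socket
rebuilt = unpack_function(payload)  # what the receiving side would do
```

Any global names the function refers to must exist in the namespace passed to unpack_function on the receiving side, which is the same restriction mentioned above.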
I am trying to parallelize a function that takes in an object in Python:
In using Pathos, the map function automatically dills the object before distributing it among the processors.
However, it takes ~1 min to dill the object each time, and I need to run this function up to 100 times. All in all, it is taking nearly 2 hours just to serialize the object before even running it.
Is there a way to just serialize it once, and use it multiple times?
Thanks very much
The easiest thing to do is to do this manually.
Without an example of your code, I have to make a lot of assumptions and write something pretty vague, so let's take the simplest case.
Assume you're using dill manually, so your existing code looks like this:
obj = function_that_creates_giant_object()
for i in range(zillions):
    results.append(pool.apply(func, (dill.dumps(obj),)))
All you have to do is move the dumps out of the loop:
obj = function_that_creates_giant_object()
objpickle = dill.dumps(obj)
for i in range(zillions):
    results.append(pool.apply(func, (objpickle,)))
But depending on your actual use, it may be better to just stick a cache in front of dill:
cachedpickle = functools.lru_cache(maxsize=10)(dill.dumps)
obj = function_that_creates_giant_object()
for i in range(zillions):
    results.append(pool.apply(wrapped_func, (cachedpickle(obj),)))
Of course, if you're monkeypatching multiprocessing to use dill in place of pickle, you can just as easily patch it to use this cachedpickle function.
If you're using multiprocess, which is a forked version of multiprocessing that pre-substitutes dill for pickle, it's less obvious how to patch that; you'll need to go through the source and see where it's using dill and get it to use your wrapper. But IIRC, it just does an import dill as pickle somewhere and then uses the same code as (a slightly out-of-date version of) multiprocessing, so it isn't all that different.
In fact, you can even write a module that exposes the same interface as pickle and dill:
import functools
import dill
def loads(s):
    return dill.loads(s)
@functools.lru_cache(maxsize=10)
def dumps(o):
    return dill.dumps(o)
… and just replace the import dill as pickle with import mycachingmodule as pickle.
… or even monkeypatch it after loading with multiprocess.helpers.pickle = mycachingmodule (or whatever the appropriate name is—you're still going to have to find where that relevant import happens in the source of whatever you're using).
And that's about as complicated as it's likely to get.
I'm writing some code for an ESP8266 microcontroller using MicroPython, and it has some different classes as well as some additional methods in the standard built-in classes. To allow me to debug on my desktop I've built some helper classes so that the code will run. However, I've run into a snag with MicroPython's time module, which has a time.sleep_ms method since the standard time.sleep method on MicroPython does not accept floats. I tried using the following code to extend the built-in time class but it fails to import properly. Any thoughts?
class time(time):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
    def sleep_ms(self, ms):
        super().sleep(ms/1000)
This code exists in a file time.py. Secondly I know I'll have issues with having to import time.time that I would like to fix. I also realize I could call this something else and put traps for it in my micro controller code however I would like to avoid any special functions in what's loaded into the controller to save space and cycles.
You're not trying to override a class, you're trying to monkey-patch a module.
First off, if your module is named time.py, it will never be loaded in preference to the built-in time module. Truly built-in (as in compiled into the interpreter core, not just C extension modules that ship with CPython) modules are special, they are always loaded without checking sys.path, so you can't even attempt to shadow the time module, even if you wanted to (you generally don't, and doing so is incredibly ugly). In this case, the built-in time module shadows you; you can't import your module under the plain name time at all, because the built-in will be found without even looking at sys.path.
Secondly, assuming you use a different name and import it for the sole purpose of monkey-patching time (or do something terrible like adding the monkey patch to a custom sitecustomize module), it's not trivial to make the function truly native to the monkey-patched module (defining it in any normal way gives it the scope of the module where it was defined, not the same scope as other functions from the time module). If you don't need it to be "truly" defined as part of time, the simplest approach is just:
import time
def sleep_ms(ms):
    return time.sleep(ms / 1000)
time.sleep_ms = sleep_ms
Of course, as mentioned, sleep_ms is still part of your module, and carries your module's scope around with it (that's why you do time.sleep, not just sleep; you could do from time import sleep to avoid qualifying it, but it's still a local alias that might not match time.sleep if someone else monkey-patches time.sleep later).
If you want to make it behave like it's part of the time module, so you can reference arbitrary things in time's namespace without qualification and always see the current function in time, you need to use eval to compile your code in time's scope:
import time
# Compile a string of the function's source to a code object that's not
# attached to any scope at all
# The filename argument is garbage, it's just for exception traceback
# reporting and the like
code = compile('def sleep_ms(ms): sleep(ms / 1000)', 'time.py', 'exec')
# eval the compiled code with a scope of the globals of the time module
# which both binds it to the time module's scope, and inserts the newly
# defined function directly into the time module's globals without
# defining it in your own module at all
eval(code, vars(time))
del code, time # May as well leave your monkey-patch module completely empty
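To see the difference between the two approaches, you can compare each function's __globals__ with the time module's namespace. This is a small desktop-CPython sketch; the name sleep_ms_plain is made up for the comparison, and exec is used here in place of the compile-plus-eval pair as an equivalent shortcut.

```python
import time


def sleep_ms_plain(ms):
    # Defined here, so it carries this module's globals around with it.
    return time.sleep(ms / 1000)


time.sleep_ms_plain = sleep_ms_plain

# Define sleep_ms directly inside the time module's namespace, so the bare
# name 'sleep' is looked up in time's globals at call time.
exec("def sleep_ms(ms):\n    sleep(ms / 1000)\n", vars(time))

plain_is_native = time.sleep_ms_plain.__globals__ is vars(time)
injected_is_native = time.sleep_ms.__globals__ is vars(time)
```

Here plain_is_native comes out False and injected_is_native comes out True. As a side effect, exec adds a __builtins__ entry to the time module's namespace, which is harmless for this purpose.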
I have a situation where there's a complex object that can be referenced by a unique name like package.subpackage.MYOBJECT. While it's possible to pickle this object using the standard pickle algorithm, the resulting data string will be very big.
I'm looking for some way to get the same pickling semantics for an object that already exist for classes and functions: Python's pickle just dumps their fully qualified names, not their code. This way, just a string like package.subpackage.MYOBJECT would be dumped, and upon unpickling the object would be imported, just like it happens for functions or classes.
It seems that this task boils down to making the object aware of the variable name it's bound to, but I have no clue how to do it.
Here's short example to explain myself clearly (obvious imports are skipped).
File bigpackage/bigclasses/models.py:
class SomeInterface(object):
    __metaclass__ = ABCMeta
    @abstractmethod
    def operation(self):
        pass
class ImplementationA(SomeInterface):
    def operation(self):
        print "ImplementationA"
class ImplementationB(SomeInterface):
    def operation(self):
        print "ImplementationB"
IMPL_A = ImplementationA()
IMPL_B = ImplementationB()
File bigpackage/bigclasses/tasks.py:
@celery.task
def background_task(impl, somearg):
    assert isinstance(impl, SomeInterface)
    impl.operation()
    print somearg
File bigpackage/bigclasses/work.py:
from bigpackage.bigclasses.models import IMPL_A, IMPL_B
from bigpackage.bigclasses.tasks import background_task
background_task.submit(IMPL_A, "arg1")
background_task.submit(IMPL_B, "arg2")
Here I have a trivial background Celery task that accepts one of two available implementations of SomeInterface as an argument. The task's arguments are pickled by Celery, passed to a queue and executed on some worker server that runs exactly the same code base. My idea is to avoid deep pickling of IMPL_A and IMPL_B and instead pass them as bigpackage.bigclasses.models.IMPL_A and bigpackage.bigclasses.models.IMPL_B respectively. That will help with performance and total traffic for the queue server, and also provide some safety against changes in IMPL_A and IMPL_B that would make them non-pickleable (for example, a lambda anywhere in the object's attribute hierarchy).
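One standard-library mechanism that gives this behavior is a __reduce__ method that returns a string: pickle then treats the string as the name of a global in the object's module and stores only that reference. Below is a minimal Python 3 sketch under stated assumptions. The module demo_models and the class Implementation are hypothetical stand-ins for bigpackage.bigclasses.models and the real implementations, and the fake module is registered in sys.modules only to keep the example self-contained. In a real package the class would simply live in the module that defines IMPL_A.

```python
import pickle
import sys
import types


class Implementation:
    """Hypothetical heavyweight object published as a module-level constant."""

    def __init__(self, global_name):
        self._global_name = global_name
        self.payload = list(range(100_000))  # would make a by-value pickle large

    def __reduce__(self):
        # A string return value tells pickle to save a reference to the
        # global with this name in the object's module, not the contents.
        return self._global_name


# Scaffolding: a stand-in module so the reference can be resolved by import.
models = types.ModuleType("demo_models")
sys.modules["demo_models"] = models
Implementation.__module__ = "demo_models"
models.Implementation = Implementation
models.IMPL_A = Implementation("IMPL_A")

data = pickle.dumps(models.IMPL_A)  # a few dozen bytes: module name + global name
restored = pickle.loads(data)       # resolved via import, not reconstructed
```

Unpickling returns the very same object that the module exposes, so this only works when the receiving process can import the module and the name is bound to the instance there, which matches the "exactly the same code base" condition described above.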
I have tried multiple approaches to pickle a python function with dependencies, following many recommendations on StackOverflow, (such as dill, cloudpickle, etc.) but all seem to run into a fundamental issue that I cannot figure out.
I have a main module that tries to pickle a function from an imported module, sends it over ssh to be unpickled and executed at a remote machine.
So main has:
import dill  # for example
import modulea
serial = dill.dumps(modulea.func)
send(serial)  # placeholder for the ssh transport
On the remote machine:
import dill
serial = receive()  # placeholder for the ssh transport
funcremote = dill.loads(serial)
funcremote()
If the functions being pickled and sent are top level functions defined in main itself, everything works. When they are in an imported module, the loads function fails with messages of the type "module modulea not found".
It appears that the module name is pickled along with the function name. I do not see any way to "fix up" the pickle to remove the dependency, or alternately, to create a dummy module in the receiver to become the recipient of the unpickling.
Any pointers will be much appreciated.
--prasanna
I'm the dill author. I do this exact thing over ssh, but with success. Currently, dill and any of the other serializers pickle modules by reference… so to successfully pass a function defined in a file, you have to ensure that the relevant module is also installed on the other machine. I do not believe there is any object serializer that serializes modules directly (i.e. not by reference).
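To make "by reference" concrete, here is a small sketch using only the standard library's pickle, which treats module-level functions the same way. json.dumps is used purely as a convenient function that lives in an importable module.

```python
import json
import pickle

# Pickling a function defined in an importable module stores its module name
# and its qualified name, not its code.
payload = pickle.dumps(json.dumps)

has_module_name = b"json" in payload
has_function_name = b"dumps" in payload

# Unpickling imports the module and looks the name up again, which is why
# the module has to be installed on the receiving side.
restored = pickle.loads(payload)
```

The restored object is the original json.dumps function itself. On a machine where the module cannot be imported, the load step fails with an error much like the one in the question.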
Having said that, dill does have some options to serialize object dependencies. For example, for class instances, the default in dill is to not serialize class instances by reference… so the class definition can also be serialized and sent with the instance. In dill, you can also (use a very new feature to) serialize file handles by serializing the file, instead of doing so by reference. But again, if you have the case of a function defined in a module, you are out of luck, as modules are serialized by reference pretty darn universally.
You might be able to use dill to do so, however, just not by pickling the object, but by extracting the source and sending the source code. In pathos.pp and pyina, dill is used to extract the source and the dependencies of any object (including functions), and pass them to another computer/process/etc. However, since this is not an easy thing to do, dill can also fall back to trying to extract a relevant import and sending that instead of the source code.
You can understand, hopefully, this is a messy messy thing to do (as noted in one of the dependencies of the function I am extracting below). However, what you are asking is successfully done in the pathos package to pass code and dependencies to different machines across ssh-tunneled ports.
>>> import dill
>>>
>>> print dill.source.importable(dill.source.importable)
from dill.source import importable
>>> print dill.source.importable(dill.source.importable, source=True)
def _closuredsource(func, alias=''):
    """get source code for closured objects; return a dict of 'name'
    and 'code blocks'"""
    #FIXME: this entire function is a messy messy HACK
    # - pollutes global namespace
    # - fails if name of freevars are reused
    # - can unnecessarily duplicate function code
    from dill.detect import freevars
    free_vars = freevars(func)
    func_vars = {}
    # split into 'funcs' and 'non-funcs'
    for name,obj in list(free_vars.items()):
        if not isfunction(obj):
            # get source for 'non-funcs'
            free_vars[name] = getsource(obj, force=True, alias=name)
            continue
        # get source for 'funcs'
#…snip… …snip… …snip… …snip… …snip…
            # get source code of objects referred to by obj in global scope
            from dill.detect import globalvars
            obj = globalvars(obj) #XXX: don't worry about alias?
            obj = list(getsource(_obj,name,force=True) for (name,_obj) in obj.items())
            obj = '\n'.join(obj) if obj else ''
            # combine all referred-to source (global then enclosing)
            if not obj: return src
            if not src: return obj
            return obj + src
        except:
            if tried_import: raise
            tried_source = True
            source = not source
    # should never get here
    return
I imagine something could also be built around the dill.detect.parents method, which provides a list of pointers to all parent objects for any given object… and one could reconstruct all of any function's dependencies as objects… but this is not implemented.
BTW: to establish an ssh tunnel, just do this:
>>> t = pathos.Tunnel.Tunnel()
>>> t.connect('login.university.edu')
39322
>>> t
Tunnel('-q -N -L39322:login.university.edu:45075 login.university.edu')
Then you can work across the local port with ZMQ, or ssh, or whatever. If you want to do so with ssh, pathos also has that built in.
I have a PL/Python function which does some json magic. For this it obviously imports the json library.
Is the import called on every call to the function? Are there any performance implications I have to be aware of?
The import is executed on every function call. This is the same behavior you would get if you wrote a normal Python module with the import statement inside a function body as opposed to at the module level.
Yes, this will affect performance.
You can work around this by caching your imports like this:
CREATE FUNCTION test() RETURNS text
LANGUAGE plpythonu
AS $$
if 'json' in SD:
    json = SD['json']
else:
    import json
    SD['json'] = json
return json.dumps(...)
$$;
This is admittedly not very pretty, and better ways to do this are being discussed, but they won't happen before PostgreSQL 9.4.
The declaration in the body of a PL/Python function will eventually become an ordinary Python function and will thus behave as such. When a Python function imports a module for the first time the module is cached in the sys.modules dictionary (https://docs.python.org/3/reference/import.html#the-module-cache). Subsequent imports of the same module will simply bind the import name to the module object found in the dictionary. In a sense, what I'm saying may cast some doubt on the usefulness of the tip given in the accepted answer, since it makes it somewhat redundant, as Python already does a similar caching for you.
To sum things up, I'd say that if you import in the standard way of simply using the import or from [...] import constructs, then you need not worry about repeated imports, in functions or otherwise, Python has got you covered.
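A quick way to convince yourself of this in plain Python (outside PostgreSQL) is the sketch below. The function name encode is made up for illustration.

```python
import sys


def encode(value):
    # The import statement runs on every call, but only the first execution
    # in the process loads the module; later ones find it in sys.modules
    # and just bind the local name.
    import json
    return json.dumps(value), sys.modules["json"] is json


first_text, first_cached = encode({"a": 1})
second_text, second_cached = encode({"a": 1})
```

Both calls see the same module object from sys.modules, so what a repeated import costs is a cache lookup and a name binding, not a reload of the module.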
On the other hand, Python allows you to bypass its native import semantics and to implement your own (with the __import__() function and importlib module). If this is what you're doing, maybe you should review what's available in the toolbox (https://docs.python.org/3/reference/import.html).