Does joblib duplicate class instances passed as arguments to a function?

I am using joblib to run some simulations in parallel (for an MWE, see my answer below). The simulator is a class instance which gets called at every step with an external action. If I am running 4 simulations, I want 4 different, independent simulators. However, my call to joblib looks something like this:
simulator = Simulator(...)
results = Parallel(n_jobs=4)([delayed(run_simulation)(simulator) for i in range(4)])
The same Simulator instance is passed as an argument in every call. Is this instance duplicated before being handed to the workers (which is what I want), or do they all share the same instance (which I do not want)?
I have answered my own question, although I am not sure of the answer. Any further insights would be appreciated.

This MWE of the problem described in the question appears to suggest that the instance is indeed duplicated:
from joblib import Parallel, delayed

class A:
    def __init__(self):
        self.c = 1

    def call(self, a):
        self.c += a

    def get(self):
        return self.c

a = A()

def call_a(inst):
    inst.call(1)
    return inst.get()

result = Parallel(n_jobs=4)([delayed(call_a)(a) for i in range(10)])
When you run it you obtain
In [2]: result
Out[2]: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
If the instance a were shared, you would have obtained the numbers from 2 to 11.
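The reason, as far as I understand it, is that joblib's process-based backends (loky/multiprocessing) pickle each argument when a task is dispatched, so every worker deserialises its own copy of the instance. A minimal sketch of that round trip, reusing the class A above (this mimics the copy semantics, not joblib's exact internals):

import pickle

a = A()
a_copy = pickle.loads(pickle.dumps(a))  # roughly what a process-based backend does per task
print(a_copy is a)                      # False: the worker sees a distinct object
a_copy.call(1)
print(a.get(), a_copy.get())            # 1 2 - mutating the copy leaves the original untouched

Note that with the threading backend no copy is made, and all tasks would then share the same instance.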

Related

How to implement a function with variable arguments in a project assignment

In my first week of learning CS, while working on my first project assignment (in Python 3), I have hit a roadblock. There is a built-in function in the project whose implementation I'm unable to access.
The function basically does this:
"A test dice is deterministic: it always cycles through a fixed sequence of values that are passed as arguments. Test dice are generated by the make_test_dice function."
>>> test_dice = make_test_dice(4, 1, 2)
>>> test_dice()
4
1 on the 2nd call, 2 on the 3rd call, and 4 on the 4th call.
Now, after researching for quite a while, I came across *args and yield for doing this job, but I've been unable to implement it due to errors.
Any help on implementing this defined function would be appreciated!
You could do the following with cycle from the standard library itertools:
from itertools import cycle

def make_test_dice(*args):
    next_dice = cycle(args)
    def dice():
        return next(next_dice)
    return dice
With
test_dice = make_test_dice(4, 1, 2)
test = [test_dice() for _ in range(5)]
you'll get
[4, 1, 2, 4, 1]
But: Is this really your first Python assignment?
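Since the question mentions *args and yield, here is a sketch of a generator-based version that behaves the same way (no itertools needed; the helper names are just illustrative):

def make_test_dice(*outcomes):
    def repeat():
        # generator that yields the given outcomes forever, in order
        while True:
            for outcome in outcomes:
                yield outcome
    gen = repeat()
    return lambda: next(gen)

test_dice = make_test_dice(4, 1, 2)
print([test_dice() for _ in range(5)])  # [4, 1, 2, 4, 1]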

Storing objects on workers and executing methods

I have an application with a set of objects that require a lot of setting up (30 seconds to 1 minute per object). Once they have been set up, I want to pass a parameter vector (small, <50 floats) and get a couple of small arrays back. Computations per object are "very fast" compared to setting up the object. I need to run this fast method many, many times, and I have a cluster which I can use.
My idea is that this should be "pool" of workers where each worker gets an object initialised with its own particular configuration (which will be different), and stays in memory in the worker. Subsequently, a method of this object on each worker gets a vector and returns some arrays. These are then combined by the main program (order is not important).
Some MWE that works is as follows (serial version first, dask version second):
import datetime as dt
import itertools
import numpy as np
from dask.distributed import Client, LocalCluster

# Set up demo dask on local machine
cluster = LocalCluster()
client = Client(cluster)

class Model(object):
    def __init__(self, power):
        self.power = power
    def powpow(self, x):
        return np.power(x, self.power)
    def neg(self, x):
        return -x
    def compute_me(self, x):
        return sum(self.neg(self.powpow(x)))

# Set up the objects locally
bag = [Model(power)
       for power in [1, 2, 3, 4, 5]
       ]
x = [13, 23, 37]
result = [obj.compute_me(x) for obj in bag]

# Using dask
# Wrapper function to pass the local object
# and parameter
def wrap(obj, x):
    return obj.compute_me(x)

res = []
for obj, xx in itertools.product(bag, [x,]):
    res.append(client.submit(wrap, obj, xx))

result_dask = [r.result() for r in res]

np.allclose(result, result_dask)
In my real-world case, the Model class does a lot of initialisation, pre-calculations, etc, and it takes possibly 10-50 times longer to initialise than to run the compute_me method. Basically, in my case, it'd be beneficial to have each worker have a pre-defined instance of Model locally, and have dask deliver the input to compute_me.
This post (2nd answer) shows how to initialise an object and store it in the worker's namespace, but the example doesn't show how to pass different initialisation arguments to each worker. Or is there some other way of doing this?
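One way this could look (a sketch only, not tested against the real workload): scatter the pre-built Model objects so each one is serialised and shipped to the cluster once, then submit many cheap compute_me calls against the resulting futures. dask resolves a future argument on a worker that holds its data, so the expensive initialisation is not repeated per call.

# Reuses the Model class and wrap() from the MWE above.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)

bag = [Model(power) for power in [1, 2, 3, 4, 5]]  # expensive set-up happens once, locally
remote_models = client.scatter(bag)                # one future per pre-built object

x = [13, 23, 37]
futures = [client.submit(wrap, m, x) for m in remote_models]
result_dask = client.gather(futures)

For mutable per-worker state, dask Actors (client.submit(Model, power, actor=True)) are another option.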

What is the difference between a higher-order function and a class?

I was going through the basics of functional programming, and eventually came across the concept of higher-order functions. I saw an example in this video by Corey Schafer (starts at 11:00), which shows a Python function that can wrap messages in arbitrary HTML tags:
def html_tag(tag):
    def wrap_text(msg):
        print('<{0}>{1}</{0}>'.format(tag, msg))
    return wrap_text

print_h1 = html_tag('h1')
print_h1('Test Headline!')
print_h1('Another Headline!')

print_p = html_tag('p')
print_p('Test Paragraph!')
Output:
<h1>Test Headline!</h1>
<h1>Another Headline!</h1>
<p>Test Paragraph!</p>
I get that it gives you the flexibility of re-using the same function for different purposes (different tags, in this example). But you could achieve the same result using Python classes, too:
class HTML_tag:
    def __init__(self, tag):
        self.tag = tag

    def wrap_text(self, msg):
        print('<{0}>{1}</{0}>'.format(self.tag, msg))

print_h1 = HTML_tag('h1')
print_h1.wrap_text('Test Headline!')
print_h1.wrap_text('Another Headline!')

print_p = HTML_tag('p')
print_p.wrap_text('Test Paragraph!')
Output:
<h1>Test Headline!</h1>
<h1>Another Headline!</h1>
<p>Test Paragraph!</p>
The higher-order function approach definitely looks cleaner, but apart from the aesthetics, are there any other reasons I might want to prefer a higher-order function over a class? E.g., regarding aspects like
Performance
Memory
...
Higher order functions take and/or return functions. Let's look at both cases.
Taking a function as a parameter
Here, a HOF is definitely the way to go. The class version amounts to a HOF with extra steps: caller and callee have to agree in advance on the name of the method to invoke, so the class is really just a useless wrapper around the meat of what you're trying to accomplish.
HOF
def my_map(fn, arr):
    result = []
    for a in arr:
        result.append(fn(a))
    return result

my_map(lambda a: a + 1, [1, 2, 3])  # [2, 3, 4]
Class version
def my_map(inst, arr):
    result = []
    for a in arr:
        result.append(inst.fn(a))
    return result

class my_mapper:
    def fn(self, a):
        return a + 1

my_map(my_mapper(), [1, 2, 3])  # [2, 3, 4]
Returning a function
In both versions here, what we're doing is creating an encapsulation of some value a, and a function that works over it.
I think that a class is generally useful if you want more than one function to be defined over some data, when the encapsulated data can change (you're encoding a state machine), or when you expect operations to be specific to your class (ie. users of your class need to know the operations defined over the class).
I would use a function that returns a function when what I'm doing amounts to partial application (I have a function that takes multiple parameters, and I want to pre-apply some, as with 'add'). functools.partial is another way to do this.
def adder(a):
    return lambda b: a + b

class adder_class:
    def __init__(self, a):
        self.a = a

    def add(self, b):
        return self.a + b  # self.a, not the bare name a
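For the partial-application case, a quick sketch with functools.partial (operator.add stands in for any two-argument function; adder refers to the closure version above):

from functools import partial
from operator import add

add_five = partial(add, 5)  # pre-apply the first argument
print(add_five(3))          # 8
print(adder(5)(3))          # 8 - the equivalent closure-based version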
Ultimately, whether it's best to use a HOF or a class will become clear from context.

Overriding inherited method without name mangling [duplicate]

This question already has answers here: How do I call a parent class's method from a child class in Python?
Note: similar question here, but I don't believe it's an exact duplicate given the specifications.
Below, I have two classes, one inheriting from the other. Please note these are just illustrative in nature.
In _Pandas.array(), I want to simply wrap a pandas DataFrame around the NumPy array returned from _Numpy.array(). I'm aware of what is wrong with my current code (_Pandas.array() gets redefined, attempts to call itself, and undergoes infinite recursion), but not how to fix it without name mangling or quasi-private methods on the parent class.
import numpy as np
import pandas as pd

class _Numpy(object):
    def __init__(self, x):
        self.x = x

    def array(self):
        return np.array(self.x)

class _Pandas(_Numpy):
    def __init__(self, x):
        super(_Pandas, self).__init__(x)

    def array(self):
        return pd.DataFrame(self.array())

a = [[1, 2], [3, 4]]
_Pandas(a).array()  # Intended result - pd.DataFrame(np.array(a))
                    # Infinite recursion as method shuffles back & forth
I'm aware that I could do something like
class _Numpy(object):
    def __init__(self, x):
        self.x = x

    def _array(self):  # Changed to leading underscore
        return np.array(self.x)

class _Pandas(_Numpy):
    def __init__(self, x):
        super().__init__(x)

    def array(self):
        return pd.DataFrame(self._array())
But this seems very suboptimal. In reality, I'm using _Numpy frequently--it's not just a generic parent class--and I'd prefer not to preface all its methods with a single underscore. How else can I go about this?
Um... I just want to check: why don't you call super directly in the _Pandas class?
class _Pandas(_Numpy):
    def __init__(self, x):
        super(_Pandas, self).__init__(x)

    def array(self):
        return pd.DataFrame(super(_Pandas, self).array())
I tried that and got the result below; I don't know if it's what you wanted or whether I have missed anything.
a = [[1, 2], [3, 4]]
_Pandas(a).array()
   0  1
0  1  2
1  3  4
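As a side note, on Python 3 the zero-argument form of super() does the same thing with less typing, and the __init__ override can be dropped entirely since it only forwards to the parent (a sketch, using the same classes as above):

class _Pandas(_Numpy):
    def array(self):
        return pd.DataFrame(super().array())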

How to get around the pickling error of python multiprocessing without being in the top-level?

I've researched this question multiple times, but haven't found a workaround that either works in my case, or one that I understand, so please bear with me.
Basically, I have a hierarchical organization of functions, and that is preventing me from multiprocessing at the top level. Unfortunately, I don't believe I can change the layout of the program, because I need all the variables that I create after the initial inputs.
For example, say I have this:
import multiprocessing

def calculate(x):
    # here is where I would take this input x (and maybe a couple more inputs)
    # and build a larger library of variables that I use further down the line
    def domath(y):
        return x * y
    pool = multiprocessing.Pool(3)
    final = pool.map(domath, range(3))

calculate(2)
This yields the following error:
Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I was thinking of globals, but I'm afraid that I'd have to define too many and that may slow my program down quite a bit.
Is there any workaround without having to restructure the whole program?
You could use pathos.multiprocessing, which is a fork of multiprocessing that uses the dill serializer instead of pickle. dill can serialize pretty much anything in Python, so there's no need to edit your code.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>>
>>> def calculate(x):
...   def domath(y):
...     return x*y
...   return Pool().map(domath, range(3))
...
>>> calculate(2)
[0, 2, 4]
You can even go nuts with it… as most things can be pickled. There's no need for the odd non-pythonic solutions you have to cook up with pure multiprocessing.
>>> class Foo(object):
...   def __init__(self, x):
...     self.x = x
...   def doit(self, y):
...     return ProcessingPool().map(self.squared, calculate(y+self.x))
...   def squared(self, z):
...     return z*z
...
>>> def thing(obj, y):
...   return getattr(obj, 'doit')(y)
...
>>> ProcessingPool().map(thing, ProcessingPool().map(Foo, range(3)), range(3))
[[0, 0, 0], [0, 4, 16], [0, 16, 64]]
Get pathos here: https://github.com/uqfoundation
The problem you encountered is actually by design: pickle serializes functions by reference (module plus qualified name) rather than by value, so a nested function cannot be pickled. The workaround below serializes the pieces of a function by hand; please keep in mind the security implications of unpickling data you do not trust.
First off we have some imports.
import marshal
import pickle
import types
Here we have a function which takes in a function as an argument, pickles the parts of the object, then returns a tuple containing all the parts:
def pack(fn):
    code = marshal.dumps(fn.__code__)
    name = pickle.dumps(fn.__name__)
    defs = pickle.dumps(fn.__defaults__)
    clos = pickle.dumps(fn.__closure__)
    return (code, name, defs, clos)
Next we have a function which takes the four parts of our converted function. It deserializes those four parts, then creates and returns a function built from them. Note that globals are re-introduced here, because our pickling process does not handle those:
def unpack(code, name, defs, clos):
    code = marshal.loads(code)
    glob = globals()
    name = pickle.loads(name)
    defs = pickle.loads(defs)
    clos = pickle.loads(clos)
    return types.FunctionType(code, glob, name, defs, clos)
Here we have a test function. Notice I put an import within the scope of the function. Globals are not handled through our pickling process:
def test_function(a, b):
    from random import randint
    return randint(a, b)
Then we pack our test function and print the result to make sure everything is working:
packed = pack(test_function)
print(packed)
Lastly, we unpack our function, assign it to a variable, call it, and print its output:
unpacked = unpack(*packed)
print(unpacked(2, 20))
Comment if you have any questions.
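To tie this back to the original question, here is one way the packed tuple could be shipped through a multiprocessing pool. run_packed is a hypothetical top-level helper, and keeping it at module level is what makes it picklable:

import multiprocessing

def run_packed(task):
    # task is (packed_function, args); rebuild the function inside the worker
    packed, args = task
    fn = unpack(*packed)
    return fn(*args)

if __name__ == '__main__':
    packed = pack(test_function)
    pool = multiprocessing.Pool(3)
    print(pool.map(run_packed, [(packed, (2, 20))] * 3))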
How about just taking the embedded function out?
This seems to me the clearest solution (since you didn't give your expected output, I had to guess):
$ cat /tmp/tmp.py
import multiprocessing

def calculate(x):
    # here is where I would take this input x (and maybe a couple more inputs)
    # and build a larger library of variables that I use further down the line
    pool = multiprocessing.Pool(3)
    _lst = [(x, y) for x in (x,) for y in range(3)]
    final = pool.map(domath, _lst)
    print(final)

def domath(l):
    return l[0] * l[1]

calculate(2)
$ python /tmp/tmp.py
[0, 2, 4]
$
