Terminating a multiprocessing pool when one of the workers finds a proper solution - python

I've created a program which can be summed up as something like this:
from itertools import combinations

class Test(object):
    def __init__(self, t2):
        self.another_class_object = t2

    def function_1(self, n):
        a = 2
        while(a <= n):
            all_combs = combinations(range(n), a)
            for comb in all_combs:
                if(self.another_class_object.function_2(comb)):
                    return 1
            a += 1
        return -1
The combinations function is imported from itertools. function_2 returns True or False depending on the input and is a method of another class object, e.g.:
class Test_2(object):
    def __init__(self, list):
        self.comb_list = list

    def function_2(self, c):
        return c in self.comb_list
Everything is working just fine. But now I want to change it a little bit and implement multiprocessing. I found this topic, which shows an example of how to exit the script when one of the worker processes determines that no more work needs to be done. So I made the following changes:
added a definition of pool into __init__ method: self.pool = Pool(processes=8)
created a callback function:
all_results = []

def callback_function(self, result):
    self.all_results.append(result)
    if(result):
        self.pool.terminate()
changed function_1:
def function_1(self, n):
    a = 2
    while(a <= n):
        all_combs = combinations(range(n), a)
        for comb in all_combs:
            self.pool.apply_async(self.another_class_object.function_2, args=comb, callback=self.callback_function)
        #self.pool.close()
        #self.pool.join()
        if(True in self.all_results):
            return 1
        a += 1
    return -1
Unfortunately, it does not work as I expected. Why? After debugging, it looks like the callback function is never reached. I thought it would be reached by every worker. Am I wrong? What could be the problem?

I did not try your code as such, but I tried your structure. Are you sure the problem is in the callback function and not in the worker function? I did not manage to get apply_async to launch even a single instance of the worker function when the function was a class method. It just did not do anything: apply_async completed without error, but it never invoked the worker.
As soon as I moved the worker function (in your case another_class_object.function_2) out of the classes as a standalone global function, it started working as expected and the callback was triggered normally. The callback function, in contrast, seems to work fine as a class method.
There seems to be discussion about this for example here: Why can I pass an instance method to multiprocessing.Process, but not a multiprocessing.Pool?
Is this in any way useful?
Hannu
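For illustration, here is a minimal sketch of the restructuring described above, with the worker as a module-level function. The names (check_comb) and the Pool wiring are assumptions for the sketch, not code from the question or the answer:

from itertools import combinations
from multiprocessing import Pool

# Hypothetical module-level worker: being at module level makes it picklable,
# which is what Pool.apply_async needs to dispatch it to worker processes.
def check_comb(comb, comb_list):
    return comb in comb_list

class Test(object):
    def __init__(self, comb_list):
        self.comb_list = comb_list
        self.pool = Pool(processes=8)
        self.all_results = []

    def callback_function(self, result):
        self.all_results.append(result)
        if result:
            self.pool.terminate()      # stop remaining workers once a solution is found

    def function_1(self, n):
        stop = False
        for a in range(2, n + 1):
            if stop:
                break
            for comb in combinations(range(n), a):
                try:
                    self.pool.apply_async(check_comb, args=(comb, self.comb_list),
                                          callback=self.callback_function)
                except ValueError:     # the pool was already terminated by the callback
                    stop = True
                    break
        self.pool.close()
        self.pool.join()
        return 1 if True in self.all_results else -1

if __name__ == '__main__':
    t = Test(comb_list=[(0, 1)])
    print(t.function_1(4))             # expected to print 1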

Question: ... not work as I expected. ... What can be the problem?
It's always necessary to call get() on the results from pool.apply_async(...) in order to see the errors raised in the pool processes.
Change to the following:
pp = []
for comb in all_combs:
    pp.append(pool.apply_async(func=self.another_class_object.function_2, args=comb, callback=self.callback_function))
pool.close()
for ar in pp:
    print('ar=%s' % ar.get())
And you will see this error:
TypeError: function_2() takes 2 positional arguments but 3 were given
To fix this error, change args=comb to args=(comb,):
pp.append(pool.apply_async(func=self.another_class_object.function_2, args=(comb,), callback=self.callback_function))
Tested with Python: 3.4.2
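To see why the tuple matters: apply_async unpacks args into separate positional arguments, so args=comb spreads the elements of comb over several parameters (plus self for a bound method), while args=(comb,) delivers the whole tuple as a single argument. A small, self-contained sketch; the plain function_2 here is a stand-in for the class method in the question:

from multiprocessing import Pool

def function_2(c):                 # expects exactly one argument: a tuple
    return len(c) == 3

if __name__ == '__main__':
    with Pool(2) as pool:
        comb = (0, 1, 2)
        # args=comb    would call function_2(0, 1, 2)    -> TypeError, too many arguments
        # args=(comb,) calls      function_2((0, 1, 2))  -> the whole tuple arrives as c
        result = pool.apply_async(function_2, args=(comb,))
        print(result.get())        # prints True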

Related

Luigi: how to pass different arguments to leaf tasks?

This is my second attempt at understanding how to pass arguments to dependencies in Luigi. The first one was here.
The idea is: I have TaskC which depends on TaskB, which depends on TaskA, which depends on Task0. I want this whole sequence to be exactly the same always, except I want to be able to control which file Task0 reads from; let's call it path. Luigi's philosophy is normally that each task should only know about the tasks it depends on, and their parameters. The problem with this is that TaskC, TaskB, and TaskA would all have to accept the variable path for the sole purpose of then passing it to Task0.
So, the solution that Luigi provides for this is called Configuration Classes.
Here's some example code:
from pathlib import Path

import luigi
from luigi import Task, TaskParameter, IntParameter, LocalTarget, Parameter

class config(luigi.Config):
    path = Parameter(default="defaultpath.txt")

class Task0(Task):
    path = Parameter(default=config.path)
    arg = IntParameter(default=0)
    def run(self):
        print(f"READING FROM {self.path}")
        Path(self.output().path).touch()
    def output(self): return LocalTarget(f"task0{self.arg}.txt")

class TaskA(Task):
    arg = IntParameter(default=0)
    def requires(self): return Task0(arg=self.arg)
    def run(self): Path(self.output().path).touch()
    def output(self): return LocalTarget(f"taskA{self.arg}.txt")

class TaskB(Task):
    arg = IntParameter(default=0)
    def requires(self): return TaskA(arg=self.arg)
    def run(self): Path(self.output().path).touch()
    def output(self): return LocalTarget(f"taskB{self.arg}.txt")

class TaskC(Task):
    arg = IntParameter(default=0)
    def requires(self): return TaskB(arg=self.arg)
    def run(self): Path(self.output().path).touch()
    def output(self): return LocalTarget(f"taskC{self.arg}.txt")
(Ignore all the output and run stuff. They're just there so the example runs successfully.)
The point of the above example is controlling the line print(f"READING FROM {self.path}") without having tasks A, B, C depend on path.
Indeed, with Configuration Classes I can control the Task0 argument. If Task0 is not passed a path parameter, it takes its default value, which is config().path.
My problem now is that this appears to me to work only at "build time", when the interpreter first loads the code, but not at run time (the details aren't clear to me).
So neither of these work:
A)
if __name__ == "__main__":
    for i in range(3):
        config.path = f"newpath_{i}"
        luigi.build([TaskC(arg=i)], log_level="INFO")
===== Luigi Execution Summary =====
Scheduled 4 tasks of which:
* 4 ran successfully:
- 1 Task0(path=defaultpath.txt, arg=2)
- 1 TaskA(arg=2)
- 1 TaskB(arg=2)
- 1 TaskC(arg=2)
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
I'm not sure why this doesn't work.
B)
if __name__ == "__main__":
    for i in range(3):
        luigi.build([TaskC(arg=i), config(path=f"newpath_{i}")], log_level="INFO")
===== Luigi Execution Summary =====
Scheduled 5 tasks of which:
* 5 ran successfully:
- 1 Task0(path=defaultpath.txt, arg=2)
- 1 TaskA(arg=2)
- 1 TaskB(arg=2)
- 1 TaskC(arg=2)
- 1 config(path=newpath_2)
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
This actually makes sense: there are two config objects, and I only managed to change the path of one of them.
Help?
EDIT: Of course, having path reference a global variable works, but then it's not a Parameter in the usual Luigi sense.
EDIT2: I tried point 1) of the answer below:
config has the same definition
class config(luigi.Config):
    path = Parameter(default="defaultpath.txt")
I fixed the mistake pointed out, i.e. Task0 is now:
class Task0(Task):
    path = Parameter(default=config().path)
    arg = IntParameter(default=0)
    def run(self):
        print(f"READING FROM {self.path}")
        Path(self.output().path).touch()
    def output(self): return LocalTarget(f"task0{self.arg}.txt")
and finally I did:
if __name__ == "__main__":
    for i in range(3):
        config.path = Parameter(f"file_{i}")
        luigi.build([TaskC(arg=i)], log_level="WARNING")
This doesn't work, Task0 still gets path="defaultpath.txt".
So what you're trying to do is create tasks with params without passing these params to the parent class. That is completely understandable, and I have been annoyed at times in trying to handle this.
Firstly, you are using the config class incorrectly. When using a Config class, as noted in https://luigi.readthedocs.io/en/stable/configuration.html#configuration-classes, you need to instantiate the object. So, instead of:
class Task0(Task):
    path = Parameter(default=config.path)
    ...
you would use:
class Task0(Task):
    path = Parameter(default=config().path)
    ...
While this now ensures you are using a value and not a Parameter object, it still does not solve your problem. When the class Task0 is created, config().path is evaluated, so path is not assigned a reference to config().path but rather its value at that moment (which will always be defaultpath.txt). When the class is used in the correct manner, Luigi constructs a Task object with only the luigi.Parameter attributes as attribute names on the new instance, as seen here: https://github.com/spotify/luigi/blob/master/luigi/task.py#L436
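A tiny illustration of that evaluation order, in plain Python without Luigi (the names are made up for the example):

class config:
    path = "defaultpath.txt"

class Task0:
    # The default is captured once, when this class body is executed:
    path_default = config.path

config.path = "newpath"
print(Task0.path_default)   # still prints "defaultpath.txt"

Reassigning config.path afterwards cannot affect a value that was already copied at class-creation time, which is exactly what happens with Parameter(default=config().path).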
So, I see a few possible paths forward.
1.) The first is to set the config path at runtime like you had, except set it to be a Parameter object, like this:
config.path = luigi.Parameter(f"newpath_{i}")
However, this would take a lot of work to get your tasks using config.path working, as they would now need to take in their parameters differently (the defaults can't be evaluated when the class is created).
2.) The much easier way is to simply specify the arguments for your classes in the config file. If you look at https://github.com/spotify/luigi/blob/master/luigi/task.py#L825, you'll see that the Config class in Luigi is actually just a Task class, so you can do anything with it that you could do with a Task, and vice versa. Therefore, you could just have this in your config file:
[Task0]
path = newpath_1
...
3.) But, since you seem to want to run multiple tasks with different arguments for each, I would just recommend passing the args in through the parents, as Luigi encourages you to do. Then you could run everything with:
luigi.build([TaskC(arg=i) for i in range(3)])
4.) Finally, if you really need to get rid of passing dependencies, you can create a ParameterizedTaskParameter that extends luigi.ObjectParameter and uses the pickle of a task instance as the object.
Of the above solutions, I highly suggest either 2 or 3. 1 would be difficult to program around, and 4 would create some very ugly parameters and is a bit more advanced.
Edit: Solutions 1 and 2 are more hacks than anything; it is simply recommended that you bundle parameters in a DictParameter.
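For reference, a minimal sketch of that DictParameter approach; the conf parameter name, its contents, and the shortened two-task chain are assumptions for the sketch, not code from the answer:

import luigi
from luigi import Task, DictParameter, LocalTarget

class Task0(Task):
    # All settings travel together in one dict, so parents only forward "conf".
    conf = DictParameter(default={"path": "defaultpath.txt", "arg": 0})

    def run(self):
        print(f"READING FROM {self.conf['path']}")
        with self.output().open("w") as f:
            f.write("done")

    def output(self):
        return LocalTarget(f"task0{self.conf['arg']}.txt")

class TaskA(Task):
    conf = DictParameter(default={"path": "defaultpath.txt", "arg": 0})

    def requires(self):
        return Task0(conf=self.conf)   # forward the whole bundle unchanged

    def run(self):
        with self.output().open("w") as f:
            f.write("done")

    def output(self):
        return LocalTarget(f"taskA{self.conf['arg']}.txt")

if __name__ == "__main__":
    luigi.build([TaskA(conf={"path": "newpath_1", "arg": 1})], local_scheduler=True)

The parents only forward the single conf bundle, so adding a new setting for Task0 no longer changes any parent signatures.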

Prevent calling a function more than once if the parameters have been used before

I would like a way to limit the calling of a function to once per unique set of parameter values.
For example
def unique_func(x):
    return x

>>> unique_func([1])
[1]
>>> unique_func([1])
*** won't return anything ***
>>> unique_func([2])
[2]
Any suggestions? I've looked into using memoization but not established a solution just yet.
This is not solved by the suggested Prevent a function from being called twice in a row, since that only covers the case where the immediately preceding call had the same parameters.
Memoization uses a mapping of arguments to return values. Here, you just want a mapping of arguments to None, which can be handled with a simple set.
def idempotize(f):
    cache = set()
    def _(x):
        if x in cache:
            return
        cache.add(x)
        return f(x)
    return _
@idempotize
def unique_fun(x):
    ...
With some care, this can be generalized to handle functions with multiple arguments, as long as they are hashable.
def idempotize(f):
    cache = set()
    def _(*args, **kwargs):
        k = (args, frozenset(kwargs.items()))
        if k in cache:
            return
        cache.add(k)
        return f(*args, **kwargs)
    return _
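A quick usage sketch of the generalized decorator (using the corrected version above; greet is just an illustrative function):

@idempotize
def greet(name, punctuation="!"):
    return f"Hello, {name}{punctuation}"

print(greet("Alice"))                    # Hello, Alice!
print(greet("Alice"))                    # None -- same arguments, call suppressed
print(greet("Alice", punctuation="?"))   # Hello, Alice? -- different kwargs, runs again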
Consider using the built-in functools.lru_cache() instead of rolling your own.
It won't return nothing on the second function call with the same arguments (it will return the same thing as the first function call), but maybe you can live with that. It seems a negligible price to pay compared to the advantage of using something that is maintained as part of the standard library.
It requires your argument x to be hashable, so it won't work with lists. Strings are fine.
from functools import lru_cache

@lru_cache()
def unique_fun(x):
    ...
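Since lists are not hashable, one workaround (a sketch, not part of the original answer) is to convert them to tuples before calling the cached function:

from functools import lru_cache

@lru_cache()
def unique_fun(x):
    return list(x)          # work with the tuple internally; convert back if a list is needed

unique_fun(tuple([1]))      # computed and cached
unique_fun(tuple([1]))      # served from the cache, same return value as before
unique_fun(tuple([2]))      # different arguments, computed and cached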
I've built a function decorator to handle this scenario; it limits calls to the same function within a given timeframe.
You can install it from PyPI with pip install ofunctions.threading or check out the GitHub sources.
Example: I want to limit calls to the same function with the same parameters to one call per 10 seconds:
from ofunctions.threading import no_flood

@no_flood(10)
def my_function():
    print("It's me, the function")

for _ in range(0, 5):
    my_function()
# Will print the text only once.
If the function is called again after 10 seconds, we'll allow a new execution, but will prevent any other execution for the next 10 seconds.
By default @no_flood will limit function calls with the same parameters, so calling func(1) and func(2) is still allowed concurrently.
The @no_flood decorator can also limit all calls to a given function regardless of its parameters:
from ofunctions.threading import no_flood

@no_flood(10, False)
def my_function(var):
    print("It's me, function number {}".format(var))

for i in range(0, 5):
    my_function(i)
# Will only print the function text once

Can I use multiprocessing with several different methods at the same time in Python 2/3?

Well, I want my two different methods to run at the same time. I wish to accomplish this with multiprocessing, but all of the examples I found run the same method on multiple processes, not different methods on multiple processes.
My code is as follows; it runs in sequence, not concurrently:
#-*-coding:utf-8-*-
import multiprocessing
import time

#def thread_test(num1,num2):
#    x=5

def haha(num1):
    for i in range(num1):
        time.sleep(1)
        print('a')

def hehe(num2):
    for i in range(num2):
        time.sleep(1)
        print('b')

if __name__=='__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.apply_async(haha(5))
    pool.apply_async(hehe(5))
    pool.close()
    pool.join()
    print("done")
The output is as follows:
a
a
a
a
a
b
b
b
b
b
done
I think this has been answered before, but I cannot find the duplicate, and I need a bit more than a comment to describe what is happening, so here it goes:
The problem with your code is that you are in fact not executing your functions in different processes. Instead you are trying to execute the return values of your functions haha and hehe in new processes. And since you have not defined any return value, they return None.
.apply_async (and similar functions) needs to be called with the bare function name as the first parameter and then the arguments as a second parameter (wrapped as a tuple). This is needed due to the evaluation order defined by Python (and virtually all other programming languages), where a function's arguments are evaluated before the function call itself. Thus, when you call a function with an argument being another function call, the inner function call is evaluated first.
The solution is therefore to call the outer function not with an inner function call, but with the bare function name (which then works as a reference to the inner function), and to pass the arguments for the inner function as a separate argument to the outer function. This way there is nothing to evaluate before the outer function starts to execute. For the situation at hand, the fix is simple; just change your code as follows:
#-*-coding:utf-8-*-
import multiprocessing
import time

def haha(num1):
    for i in range(num1):
        time.sleep(1)
        print('a')

def hehe(num2):
    for i in range(num2):
        time.sleep(1)
        print('b')

if __name__=='__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.apply_async(haha, (5,)) # instead of pool.apply_async(haha(5))
    pool.apply_async(hehe, (5,)) # instead of pool.apply_async(hehe(5))
    pool.close()
    pool.join()
    print("done")
I hope this explanation makes sense to you, and helps you to watch out for these situations in the future.
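If you also need the workers' return values rather than just their printed output, the same pattern extends with AsyncResult.get(); the return statements below are an addition for illustration, not part of the original code:

import multiprocessing
import time

def haha(num1):
    time.sleep(1)
    return 'a' * num1          # hypothetical return value

def hehe(num2):
    time.sleep(1)
    return 'b' * num2          # hypothetical return value

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    res_a = pool.apply_async(haha, (5,))
    res_b = pool.apply_async(hehe, (5,))
    pool.close()
    # get() blocks until the worker finishes and re-raises any worker exception here.
    print(res_a.get(), res_b.get())
    pool.join()
    print("done")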

How could I resolve this Python parallel issue?

I wrote some code which is something like this:
import time
import pp

class gener(object):
    def __init__(self, n):
        self.n = n
    def __call__(self):
        return self
    def __iter__(self):
        n = self.n
        for i in range(n):
            time.sleep(2)
            yield i

def gen():
    return gener(3)

job_server = pp.Server(4)

job = job_server.submit(
    gen,
    args=(),
    depfuncs=("gener",),
    modules=("time",),
)

print job()
Following @zilupe's useful comment, I got the following output:
<__main__.gener object at 0x7f862dc18a90>
How can I make the iteration inside class gener run in parallel?
I'd like to run it in parallel with other functions.
I need this generator-like class because I want to replace other code in a module of a rather complicated program package, and it would be hard to refactor that code.
I have tried a lot of things without any success so far.
I made a rather thorough search on this site, but I couldn't find the right answer for my case.
Anybody, any help?
Thank you.
ADDITION:
According to @zilupe's useful comments, here is some additional information:
The main purpose is to parallelize the iteration inside of class gener:
def __iter__(self):
    n = self.n
    for i in range(n):
        time.sleep(2)
        yield i
I only created the gen() function because I wasn't able to figure out how to use it directly in submit().
This doesn't answer your question but you get the error because depfuncs has to be a tuple of functions, not strings -- the line should be: depfuncs=(gener,),
Do you realise that function gen is only creating an instance of gener and is not actually calling it?
What are you trying to parallelise here -- the creation of the generator or the iteration over it? If it's the iteration, you should probably create the generator first, pass it to gen in args and then iterate over it inside gen:
def gen(g):
    for _ in g:
        pass

job_server.submit(gen, args=(gener(3),), ...)

python multiprocessing : setting class attribute value

I have a class called Experiment and another called Case. One Experiment is made of many individual Cases. See the class definitions below:
from multiprocessing import Process

class Experiment(object):
    def __init__(self, name):
        self.name = name
        self.cases = []
        self.cases.append(Case('a'))
        self.cases.append(Case('b'))
        self.cases.append(Case('c'))

    def sr_execute(self):
        for c in self.cases:
            c.setVars(6)

class Case(object):
    def __init__(self, name):
        self.name = name
    def setVars(self, var):
        self.var = var
In my Experiment class, I have a function called sr_execute. This function shows the desired behavior: I am interested in going through all cases and setting an attribute for each of them. When I run the following code,
if __name__ == '__main__':
    #multiprocessing.freeze_support()
    e = Experiment('exp')
    e.sr_execute()
    for c in e.cases: print c.name, c.var
I get,
a 6
b 6
c 6
This is the desired behavior.
However, I would like to do this in parallel using multiprocessing. To do this, I add an mp_execute() function to the Experiment class:
def mp_execute(self):
    processes = []
    for c in self.cases:
        processes.append(Process(target=c.setVars, args=(6,)))
    [p.start() for p in processes]
    [p.join() for p in processes]
However, this does not work. When I execute the following,
if __name__ == '__main__':
    #multiprocessing.freeze_support()
    e = Experiment('exp')
    e.mp_execute()
    for c in e.cases: print c.name, c.var
I get an error,
AttributeError: 'Case' object has no attribute 'var'
Apparently, I am unable to set a class attribute using multiprocessing.
Any clues as to what is going on?
When you call:
def mp_execute(self):
    processes = []
    for c in self.cases:
        processes.append(Process(target=c.setVars, args=(6,)))
    [p.start() for p in processes]
    [p.join() for p in processes]
each Process created there works on a copy of your object, and the modifications to that copy are not passed back to the main program, because different processes have different address spaces. It would work if you used threads, since in that case no copy is created.
Also note that your code will probably fail on Windows, because you are passing a method as target and Windows requires the target to be picklable (and instance methods are not picklable).
The target should be a function defined at the top level of a module in order to work on all OSes.
If you want to communicate the changes back to the main process, you could:
Use a Queue to pass the result
Use a Manager to build a shared object
Either way you must handle the communication "explicitly", either by setting up a "channel" (like a Queue) or by setting up shared state.
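As a rough sketch of the Manager route, adapted to the Experiment/Case example above (the set_var helper and the shared dict layout are assumptions, not the asker's code):

from multiprocessing import Process, Manager

def set_var(shared, name, var):
    # Runs in the child process; writes into the manager-backed dict,
    # which the parent process can read afterwards.
    shared[name] = var

if __name__ == '__main__':
    manager = Manager()
    shared = manager.dict()
    names = ['a', 'b', 'c']

    processes = [Process(target=set_var, args=(shared, n, 6)) for n in names]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    for n in names:
        print('%s %s' % (n, shared[n]))   # each name now maps to 6 in the parent as well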
Style note: Do not use list comprehensions in this way:
[p.join() for p in processes]
It's simply wrong: you are only wasting space creating a list of Nones, and it is also probably slower compared to the right way:
for p in processes:
    p.join()
since the comprehension has to append each element to the list.
Some say that list comprehensions are slightly faster than for loops; however:
The difference in performance is so small that it generally doesn't matter
They are faster if and only if you consider this kind of loop:
a = []
for element in something:
    a.append(element)
If the loop, like in this case, does not create a list, then the for loop will be faster.
By the way: some use map in the same way to perform side effects. This again is wrong, because you won't gain much in speed for the same reason as before, and it fails completely in Python 3, where map returns an iterator and hence will not execute the functions at all, thus making the code less portable.
@Bakuriu's answer offers good styling and efficiency suggestions. And it is true that each process gets a copy of the master process's address space; hence the changes made by forked processes will not be reflected in the address space of the master process unless you utilize some form of IPC (e.g. a Queue, Pipe or Manager).
But the particular AttributeError: 'Case' object has no attribute 'var' error that you are getting has an additional reason, namely that your Case objects do not yet have the var attribute at the time you launch your processes. Instead, the var attribute is created in the setVars() method.
Your forked processes do indeed create the variable when they call setVars() (and actually even set it to 6), but alas, this change is only in the copies of Case objects, i.e. not reflected in the master process's memory space (where the variable still does not exist).
To see what I mean, change your Case class to this:
class Case(object):
    def __init__(self, name):
        self.name = name
        self.var = 7 # Create var in the constructor.
    def setVars(self, var):
        self.var = var
By adding the var member variable in the constructor, your master process will have access to it. Of course, the changes in the forked processes will still not be reflected in the master process, but at least you don't get the error:
a 7
b 7
c 7
Hope this sheds light on what's going on. =)
SOLUTION:
The least intrusive (to the original code) thing to do is to use a ctypes object from shared memory:
from multiprocessing import Value

class Case(object):
    def __init__(self, name):
        self.name = name
        self.var = Value('i', 7) # Use a ctypes "int" from shared memory.
    def setVars(self, var):
        self.var.value = var # Set the variable's "value" attribute.
and change your main() to print c.var.value:
for c in e.cases: print c.name, c.var.value # Print the "value" attribute.
Now you have the desired output:
a 6
b 6
c 6
