So, I have the following code, which I'm using to run tasks in multiple functions at the same time:
from multiprocessing import Pool

if __name__ == '__main__':
    po = Pool(processes=10)
    resultslist = []
    i = 1
    while i <= 2:
        arg = [i]
        result = po.apply_async(getAllTimes, arg)
        resultslist.append(result)
        i += 1

    feedback = []
    for res in resultslist:
        multipresults = res.get()
        feedback.append(multipresults)

    matchesBegin, matchesEnd = feedback[0][0], feedback[0][1]
    TheTimes = feedback[1]
This works well for me. I'm currently using it to run two jobs at the same time.
But the problem is, I don't always need both of the simultaneously running jobs to complete before I move on to the next phases of the script. Sometimes, if the first job completes successfully and I'm able to confirm it by verifying what's in matchesBegin and matchesEnd, I want to be able to just move on and kill off the other job.
My issue is, I don't know how to do that.
Job 1 usually completes much faster than Job 2. So, what I'm trying to do here is: IF Job 1 completes before Job 2, AND the content of the variables from Job 1 (matchesBegin, matchesEnd) is True, then I want Job 2 to be blown away because I don't need it anymore. If I don't blow it away, it will only prolong the completion of the script. Job 2 should only be allowed to continue running if the results of the variables from Job 1 aren't True.
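Roughly, the flow I'm after would look something like the sketch below (with a dummy stand-in for getAllTimes just to illustrate; the "blow Job 2 away" part is what I don't know how to do):
from multiprocessing import Pool
import time

def getAllTimes(job_id):
    # dummy stand-in for the real function
    time.sleep(job_id)
    return ('begin', 'end') if job_id == 1 else ['t1', 't2', 't3']

if __name__ == '__main__':
    po = Pool(processes=10)
    job1 = po.apply_async(getAllTimes, [1])
    job2 = po.apply_async(getAllTimes, [2])

    matchesBegin, matchesEnd = job1.get()  # Job 1 usually finishes first
    if matchesBegin and matchesEnd:
        pass  # <-- somehow blow Job 2 away here and move on
    else:
        TheTimes = job2.get()  # only then wait for Job 2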
I do not know all the details of your use case, but I hope this provides you some direction. Essentially, the apply_async() you've started with could do that job, but you would also need to use its callback argument and evaluate each incoming result to see if it fulfills your criteria, then take a corresponding action if it does. I've hacked around your code a bit and got this:
from multiprocessing import Pool

class ParallelCall:
    def __init__(self, jobs=None, check_done=lambda res: None):
        self.pool = Pool(processes=jobs)
        self.pending_results = []
        self.return_results = []
        self.check_done = check_done

    def _callback(self, incoming_result):
        # runs in the pool's result-handler thread for every finished job
        self.return_results.append(incoming_result)
        if self.check_done(incoming_result):
            # we already have an acceptable final result: stop the workers
            self.pool.terminate()
        return incoming_result

    def run_fce(self, fce, *args, **kwargs):
        self.pending_results.append(self.pool.apply_async(fce,
                                                          *args, **kwargs,
                                                          callback=self._callback))

    def collect(self):
        self.pool.close()
        self.pool.join()
        return self.return_results
Which you could use like this:
def final_result(result_to_check):
    return result_to_check[0] == result_to_check[1]

if __name__ == '__main__':
    runner = ParallelCall(jobs=2, check_done=final_result)
    for i in range(1, 3):
        arg = [i]
        runner.run_fce(getAllTimes, arg)
    feedback = runner.collect()
    TheTimes = feedback[-1]  # last completed getAllTimes call
What does it do? runner is an instance of ParallelCall (note: I've used only two workers, as you seem to run only two jobs) which uses the final_result() function to evaluate whether a result is a suitable candidate for a valid final result; in this case, whether its first and second items are equal.
We use that to start getAllTimes two times, like in your example above. It uses apply_async() just as you did, but we now also have a callback registered, through which each result passes when it becomes available. Each result is also passed through the function registered with check_done to see whether we got an acceptable final result, and if so (its return value evaluates to True) we just stop all the worker processes.
Disclaimer: this is not exactly what your example does, because the returned list is not in the order in which the function calls took place, but in the order in which the results became available.
Then we collect() the available results into feedback. This method closes the pool so that it accepts no further tasks (close()) and then waits for the workers to finish (join()); they could have been stopped early if one of the incoming results matched the registered criterion. Then we return all the results (either up to the matching result or until all work has been done).
I've put this into a ParallelCall class so that I can conveniently keep track of the pending and finished results as well as know which pool they belong to. The default check_done is basically a (callable) no-op.
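If your real criterion is "matchesBegin and matchesEnd are both truthy", the check function might instead look something like this (I'm guessing at the shape of Job 1's result here):
def final_result(result_to_check):
    # assumes Job 1 returns (matchesBegin, matchesEnd); stop early once both are set
    return len(result_to_check) == 2 and all(result_to_check)
Since check_done sees every incoming result, you may also want to tag results with which job produced them, so Job 2's result can never accidentally match.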
I have the following sample function:
def set_value(thing, set_quantity):
    set_function(thing, set_quantity)  # sets "thing" to the desired quantity
    read_value = read_function(thing)  # reads quantity of "thing"
    if read_value != set_quantity:
        raise Exception('Values are not equal')
I then have some test cases (they're Robot Framework test cases but I don't think that really matters for the question I have) that look something like this:
# Test case 1
set_value('Bag_A', 5)

# Test case 2
set_value('Bag_B', 3)

# Test case 3
set_value('Bag_A', 2)
set_value('Bag_B', 8)
set_value('Purse_A', 4)
This is the desired behavior I want:
For Test Case 1, set_value() is called once and is executed right after the function call.
Test Case 2 is similar to #1 and its behavior should be the same.
For Test Case 3, we have 3 items: Bag_A, Bag_B, and Purse_A. Bag_A and Bag_B each call set_value(), but I want them to run concurrently. This is because they're both "Bags" (regardless of the _A or _B assignation), so I want the function to recognize that they are both "Bags" and should be run concurrently. Once those are executed and finished, then I want the Purse_A function call to set_value() to be executed (since "Purse" is not in the same category as "Bag").
The object type ("bag", "purse", etc) can be anything, so I'm not expecting a particular value from a small amount of pre-defined possibilities.
How would I go about doing this? By the way, I cannot change anything in the way the Test Cases are written.
I tried looking into the asyncio module, since you're able to run things concurrently using tasks, but the thing is I don't understand how to get set_function() to know how many tasks there will be, since it all depends on future function calls, which depend on each test case.
Sometimes, if you want your program to do some thing, you just have to write your own code to do the thing:
import concurrent.futures

thread_pool = concurrent.futures.ThreadPoolExecutor(...)

def do_one(task):
    ...

def do_something_with(result):
    ...

def do_concurrently(tasks):
    global thread_pool
    futures = []
    for task in tasks:
        futures.append(thread_pool.submit(do_one, task))
    for future in futures:
        result = future.result()
        do_something_with(result)

def do_sequentially(tasks):
    for task in tasks:
        result = do_one(task)
        do_something_with(result)

def task_performer(...):
    tasks = ...construct list of tasks...
    if ...tasks should be performed concurrently...:
        do_concurrently(tasks)
    else:
        do_sequentially(tasks)
Very procedural. I can't help it, I'm an old-fart. Maybe there's some more modern, stylish way to do it, but you get the idea, right?
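To make that concrete for your bag/purse example: one hypothetical way to decide the concurrency is to group buffered set_value calls by the part of the name before the underscore, run each group through the thread pool, and move to the next group only when the previous one has finished. The grouping rule and the buffering of calls into a list are my assumptions, not something your test cases dictate:
import concurrent.futures

thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def set_value(thing, set_quantity):
    # stand-in for your real set_function / read_function logic
    print('setting %s to %s' % (thing, set_quantity))

def category(thing):
    # hypothetical rule: 'Bag_A' and 'Bag_B' both belong to the 'Bag' category
    return thing.split('_')[0]

def run_calls(calls):
    # calls is a buffered list of (thing, quantity) pairs, e.g. from test case 3
    by_category = {}
    for thing, qty in calls:
        by_category.setdefault(category(thing), []).append((thing, qty))
    for group in by_category.values():
        # run everything within one category concurrently...
        futures = [thread_pool.submit(set_value, thing, qty) for thing, qty in group]
        # ...and wait for the whole category before starting the next one
        for future in futures:
            future.result()

run_calls([('Bag_A', 2), ('Bag_B', 8), ('Purse_A', 4)])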
In a Django project, as part of a rather heavy import pipeline, I'm having to call three tasks that are part of a different codebase/application. Both applications point to the same AMQP broker, and calling a task from the other app actually works great.
As part of our workflow, we want to chain those three tasks and wait for the whole chain to return before proceeding (each task depends on the previous one, but we only care about the results of the last one). We have to call this chain for each item we're inserting in the DB. In order to try and speed up the process, I'm basically populating a dictionary with the AsyncResult objects returned by calling my_chain.apply_async() as we launch the tasks, and then iterating over this result dict to retrieve the data we need and further process it.
In pseudocode, this looks like this:
def launch_chains():
    # assuming I need to preserve the order in which the chains are
    # called for later processing
    result_dict = OrderedDict()
    for item in all_items:
        # Need to use signature objects because the tasks are on a
        # different app / codebase
        task_1 = remote_celery_app.signature(
            'path.to.my.task1',
            args=[whatever],
            queue="my_queue"
        )
        task_2 = remote_celery_app.signature(
            # ...
        )
        item_chain = celery.chain(
            task_1,
            task_2,
            app=remote_celery_app
        )
        result_dict[item.pk] = item_chain.apply_async()
    return result_dict

def process_items():
    remote_results = launch_chains()
    for item_id, r in remote_results.items():
        data = r.get()
        process_data(...)
On a rather small batch of items this seemed to work fine, but when testing it with a heavier set (about 1600 objects, which is closer to what we'll need in production), the processing loop ends up hanging when calling get().
What happens seems pretty random: I'll usually get the very first result, and the second get() call will hang indefinitely. Sometimes it won't even get the first result, and sometimes I'll manage to get about 30 responses (out of the expected 1600); in those cases, the loop will actually terminate.
All tasks are launched and return data (I can see the logs on the remote app as the dictionary is being populated); it's just when trying to grab the results that something goes wrong.
One way to fix this is to move the apply_async call to the processing loop, like this:
def launch_chains():
    # for loop, chain creation, etc...
    result_dict[item.pk] = item_chain  # removing the apply_async call...
    return result_dict

def process_items():
    remote_results = launch_chains()
    for item_id, r in remote_results.items():
        r = r.apply_async()  # ... and adding it back here
        data = r.get()
        process_data(...)
This works perfectly, and I get to process all the items, but it obviously takes much longer, because I have to wait for each call to finish, while the previous version should allow me to start iterating over results that have already been returned while still populating my result dict.
When playing around in a Python shell, what usually happens is that the first chain I try to process (no matter if it's the first one that got launched or another one picked randomly) will be in a SUCCESS state, and then all the others will permanently remain PENDING.
I've been able to reproduce the behaviour by creating a dummy task on the remote app, and calling it with the following management command:
# remote_app.tasks.py  (registered as a task on the remote app)
import random
import time

def add(*args):
    # simulating some processing delay.
    # Removing this does not fix the problem
    time.sleep(random.random())
    return sum(args)

# test_command.py
import celery
from collections import OrderedDict
from django.core.management.base import BaseCommand

# remote_app is the Celery app instance pointing at the remote broker
# (defined elsewhere in the project)

l1 = [
    [2, 2],
    [3, 3],
]
l2 = [
    [4, 4],
    [6, 6],
]

def get_remote_chain(i):
    """
    Call the tasks remotely. This seems to work fine with a small number
    of calls, but ends up hanging when I start to call it a bunch.
    Behaviour is basically the same as described above.
    """
    t1 = remote_app.signature(
        'remote.path.to.add',
        args=l1[i],
        queue='somequeue'
    )
    t2 = remote_app.signature(
        'remote.path.to.add',
        args=l2[i],
        queue='somequeue'
    )
    return celery.chain(
        t1, t2, app=remote_app
    )

def get_local_chain(i):
    """
    I copy-pasted the task locally, and tried to call this version of the
    chain instead: everything works fine, no matter how many times we call it.
    """
    return celery.chain(
        add.s(*l1[i]),
        add.s(*l2[i]),
    )

def launch_chains():
    results = OrderedDict()
    for i in range(100):
        for j in range(2):
            # task_chain = get_local_chain(j)
            task_chain = get_remote_chain(j)
            results['%d-%d' % (j, i)] = task_chain.apply_async()
    return results

class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        results = launch_chains()
        print('dict ok')
        for rid, r in results.items():
            print(rid, r.get())
        # Trying an alternate way to process the results, trying to
        # ensure we wait until each task is ready.
        # Again, this works fine with local tasks, but ends up hanging,
        # apparently randomly, after a few remote ones have been "processed".
        #
        # rlen = len(results)
        # while results:
        #     for rid, r in results.items():
        #         if r.ready():
        #             print(rid, r.status)
        #             print(r.get())
        #             results.pop(rid)
        #     if len(results) != rlen:
        #         print('RETRYING ALL, %d remaining' % len(results))
        #         rlen = len(results)
I'll admit I don't know Celery and AMQP that well, so I have no idea whether this is a configuration problem or if I'm "doing it wrong". However, since it works fine locally, I'm pretty sure the problem comes from the communication between the two apps.
I'm using celery 3.1.16 on the Django app, and the remote app is on 4.0.2... For a whole bunch of reasons I can't upgrade on the Django side, and I'm not sure I'll be able to downgrade the remote one (a quick try breaks because the code uses various new things from 4.0).
Any ideas? Do not hesitate to ask for more config info if needed; I can't think of anything more right now.
Let's say I add 100 push tasks (as group 1) to my task queue. Then I add another 200 tasks (as group 2) to the same queue. How can I tell when all tasks of group 1 are finished?
It looks like QueueStatistics will not help here, and tag works only with pull queues.
And I cannot have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in the datastore, like @mgilson said, and decorate my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for, if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decrement function to complement the increment one.
import logging
import random
import time

from google.appengine.ext import deferred

# increment(), decrement() and get_count() come from the sharded counter
# code linked above.

def done_work():
    logging.info('work done!')

def worker(callback=None):
    def fst(f):
        def snd(*args, **kwargs):
            key = kwargs['shard_key']
            del kwargs['shard_key']
            retval = f(*args, **kwargs)
            decrement(key)
            if get_count(key) == 0:
                callback()
            return retval
        return snd
    return fst

def func(n):
    # do some work
    time.sleep(random.randint(1, 10) / 10.0)
    logging.info('task #{:d}'.format(n))

def make_some_tasks():
    wrapped = worker(callback=done_work)(func)
    key = random.randint(0, 1000)
    for n in xrange(0, 100):
        increment(key)
        deferred.defer(wrapped, n, shard_key=key)
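The article only ships increment() and get_count(), so here is a rough sketch of what the missing decrement() could look like, mirroring the article's increment(). The model and template names (GeneralCounterShard, GeneralCounterShardConfig, SHARD_KEY_TEMPLATE) are assumed to match the article's code; adjust them to your copy:
import random

from google.appengine.api import memcache
from google.appengine.ext import ndb

def decrement(name):
    """Decrement the value of a given sharded counter (sketch)."""
    name = str(name)  # ndb key names must be strings; the snippet above passes an int
    config = GeneralCounterShardConfig.get_or_insert(name)
    _decrement(name, config.num_shards)

@ndb.transactional
def _decrement(name, num_shards):
    # pick a random shard, just like the article's _increment() does
    index = random.randint(0, num_shards - 1)
    shard_key_string = SHARD_KEY_TEMPLATE.format(name, index)
    counter = GeneralCounterShard.get_by_id(shard_key_string)
    if counter is None:
        counter = GeneralCounterShard(id=shard_key_string)
    counter.count -= 1
    counter.put()
    memcache.decr(name)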
Tasks are not guaranteed to run only once, occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this, using a counter incremented at task enqueueing and decremented at task completion wouldn't work: it would be decremented twice in such a duplicate-execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing) - they are unique for a given queue. Task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question; I didn't get to actually test it.
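A rough sketch of that idea (all model and function names here are mine, invented for illustration): record each task's name under its group when enqueueing, have each task remove its own name once it finishes, and consider the group done when nothing is left. Removing a name is idempotent, so a duplicate execution doesn't throw the bookkeeping off:
import uuid

from google.appengine.ext import deferred, ndb

class TaskGroup(ndb.Model):
    # key id = group name; holds the names of tasks still outstanding
    pending_task_names = ndb.StringProperty(repeated=True)

@ndb.transactional
def add_task_name(group_id, task_name):
    group = TaskGroup.get_by_id(group_id) or TaskGroup(id=group_id)
    if task_name not in group.pending_task_names:
        group.pending_task_names.append(task_name)
    group.put()

@ndb.transactional
def mark_task_done(group_id, task_name):
    group = TaskGroup.get_by_id(group_id)
    if group and task_name in group.pending_task_names:
        group.pending_task_names.remove(task_name)
        group.put()

def group_is_done(group_id):
    group = TaskGroup.get_by_id(group_id)
    return group is not None and not group.pending_task_names

def do_work(group_id, task_name, payload):
    # ... the actual work for one task goes here ...
    mark_task_done(group_id, task_name)

def enqueue_group(group_id, payloads):
    for payload in payloads:
        task_name = '%s-%s' % (group_id, uuid.uuid4().hex)  # specified task name
        add_task_name(group_id, task_name)
        deferred.defer(do_work, group_id, task_name, payload, _name=task_name)
For large groups a single TaskGroup entity would become a write-contention hotspot, so in practice you would likely shard it the same way the counter article shards its counts.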
I have a python script where at the top of the file I have:
result_queue = Queue.Queue()
key_list = *a large list of small items* #(actually from bucket.list() via boto)
I have learned that Queues are process safe data structures. I have a method:
def enqueue_tasks(keys):
    for key in keys:
        try:
            result = perform_scan.delay(key)
            result_queue.put(result)
        except:
            print "failed"
The perform_scan.delay() function here actually calls a celery worker, but I don't think that is relevant (it is an asynchronous process call).
I also have:
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
Lastly I have a main() function:
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(enqueue_tasks, group) for group in grouper(key_list, 40)]
    concurrent.futures.wait(futures)
    print result_queue.qsize()
The result from the print statement is a 0. Yet if I include a print statement of the size of result_queue in enqueue_tasks, while the program is running, I can see that the size is increasing and things are being added to the queue.
Ideas of what is happening?
It looks like there's a simpler solution to this problem.
You're building a list of futures. The whole point of futures is that they're future results. In particular, whatever each function returns, that's the (eventual) value of the future. So, don't do the whole "push results onto a queue" thing at all, just return them from the task function, and pick them up from the futures.
The simplest way to do this is to break that loop up so that each key is a separate task, with a separate future. I don't know whether that's appropriate for your real code, but if it is:
def do_task(key):
    try:
        return perform_scan.delay(key)
    except:
        print "failed"

def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_task, key) for key in key_list]
    # If you want to do anything with these results, you probably want
    # a loop around concurrent.futures.as_completed or similar here,
    # rather than waiting for them all to finish, ignoring the results,
    # and printing the number of them.
    concurrent.futures.wait(futures)
    print len(futures)
Of course that doesn't do the grouping. But do you need it?
The most likely reason for the grouping to be necessary is that the tasks are so tiny that the overhead in scheduling them (and pickling the inputs and outputs) swamps the actual work. If that's true, then you can almost certainly wait until a whole batch is done to return any results. Especially given that you're not even looking at the results until they're all done anyway. (This model of "split into groups, process each group, merge back together" is pretty common in cases like numerical work, where each element may be tiny, or elements may not be independent of each other, but there are groups that are big enough or independent from the rest of the work.)
At any rate, that's almost as simple:
def do_tasks(keys):
    results = []
    for key in keys:
        try:
            result = perform_scan.delay(key)
            results.append(result)
        except:
            print "failed"
    return results

def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_tasks, group) for group in grouper(key_list, 40)]
    print sum(len(future.result()) for future in concurrent.futures.as_completed(futures))
Or, if you prefer to first wait and then calculate:
def main():
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(do_tasks, group) for group in grouper(key_list, 40)]
    concurrent.futures.wait(futures)
    print sum(len(future.result()) for future in futures)
But again, I doubt you need even this.
You need to use a multiprocessing.Queue, not a Queue.Queue. Queue.Queue is thread-safe, not process-safe, so the changes you make to it in one process are not reflected in any others.
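A minimal sketch of that change, using a Manager().Queue() proxy so the queue can be passed to the worker functions explicitly (a plain multiprocessing.Queue cannot be pickled into ProcessPoolExecutor workers as an argument); the doubling stand-in just replaces perform_scan.delay():
import concurrent.futures
import multiprocessing

def enqueue_tasks(keys, result_queue):
    # runs in a worker process; the manager proxy is shared across processes
    for key in keys:
        result_queue.put(key * 2)  # stand-in for perform_scan.delay(key)

def main():
    manager = multiprocessing.Manager()
    result_queue = manager.Queue()
    groups = [range(i, i + 40) for i in range(0, 400, 40)]
    with concurrent.futures.ProcessPoolExecutor(10) as executor:
        futures = [executor.submit(enqueue_tasks, group, result_queue)
                   for group in groups]
        concurrent.futures.wait(futures)
    print(result_queue.qsize())  # now reflects everything the workers put on it

if __name__ == '__main__':
    main()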
I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an AngularJS app.
The server-side code looks like this (using multiprocessing.dummy):
from multiprocessing.dummy import Pool

worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]

pool.close()  # Close pool
pool.join()   # Wait for all tasks to finish

errors = not all(obj.successful() for obj in result_objs)
# extract the result only from successful tasks
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see I'm using apply_async because I want to later inspect each task and extract from them the result only if the task didn't raise any exception.
I understood that in order to show a progress bar on the client side, I need to publish the number of completed tasks somewhere, so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I have seen similar questions on SO, like this one, but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put, because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this enables you to raise the resolution of your progress report, and report more to the user.
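As a sketch of what those two points could look like together (the message fields and the worker are made up for illustration):
import collections
from multiprocessing import Queue
from multiprocessing.dummy import Pool

# a structured progress message instead of a bare string
Progress = collections.namedtuple('Progress', ['query', 'step', 'total_steps'])

progress_queue = Queue()

def do_work(query):
    total_steps = 3
    for step in range(1, total_steps + 1):
        # ... perform one step of the HTTP work for this query here ...
        progress_queue.put_nowait(Progress(query, step, total_steps))
    return 'result for %s' % query

def count_completed(queue):
    # drained from the main process, e.g. by whatever updates session['progress'];
    # empty() is approximate, which is fine for a progress bar
    completed = 0
    while not queue.empty():
        message = queue.get_nowait()
        if message.step == message.total_steps:
            completed += 1
    return completed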
In general the approach you are taking is okay; I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time, then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep the client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes = multiprocessing.Value('i', 10)
lock = multiprocessing.Lock()
And then, when you call worker.do_work, pass a lock object and the value to it:
worker.do_work(lock, processes)
In your worker.do_work code, decrease "processes" by one when the work is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before accessing processes.value, and release the lock afterwards.
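Putting those pieces together for the thread-based pool from the question (multiprocessing.dummy uses threads, so the shared Value and Lock can simply be passed to the workers; do_work is a stand-in here):
import multiprocessing
from multiprocessing.dummy import Pool

def do_work(query, remaining, lock):
    result = 'result for %s' % query  # stand-in for the real HTTP call
    lock.acquire()
    remaining.value -= 1              # one fewer task left to finish
    lock.release()
    return result

if __name__ == '__main__':
    queries = ['a', 'b', 'c', 'd']
    remaining = multiprocessing.Value('i', len(queries))
    lock = multiprocessing.Lock()

    pool = Pool(processes=10)
    result_objs = [pool.apply_async(do_work, (q, remaining, lock)) for q in queries]
    pool.close()
    # remaining.value can be read here (e.g. from the /progress view)
    # while the workers count it down
    pool.join()
    print(remaining.value)  # 0 once every task has finished
Note that a multiprocessing.Value created this way already carries its own lock, so "with remaining.get_lock():" would also work in place of a separate Lock object.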