How to monitor a group of tasks in celery? - python

I have a situation where a periodic monthly big_task reads a file and enqueue one chained-task per row in this file, where the chained tasks are small_task_1 and small_task_2:
class BigTask(PeriodicTask):
run_every = crontab(hour=00, minute=00, day_of_month=1)
def run(self):
task_list = []
with open("the_file.csv" as f:
for row in f:
t = chain(
small_task_1.s(row),
small_task_2.s(),
)
task_list.append(t)
gr = group(*task_list)
r = gr.apply_async()
I would like to get statistics about the number of enqueued, failed tasks (and detail about the exception) for each small_task, as soon as all of them are finished (whatever the status is) to send a summary email to the project admins.
I first thought of using chord, but callback is not executed if any of the headers task fails, which will surely happen in my case.
I could also use r.get() in the BigTask, very convenient, but not recommended to wait for a task result into another task (even if here, I guess the risk of worker deadlock is poor since task will be executed only once a month).
Important note: input file contains ~700k rows.
How would you recommend to proceed?

I'm not sure if it can help you to monitor, but about the chord and the callback issue you could use link_error callback (for catching exceptions). In your case for example you can use it like:
small_task_1.s(row).set(link_error=error_task))
and implement celery error_task that send you notification or whatever.
In celery 4, you can set it once for the all canvas (but it didn't work for me in 3.1):
r = gr.apply_async(link_error=error_task)
For the monitoring part, you can use flower of course.
Hope that help
EDIT: An alternative (without using additional persistency) would be to catch the exception and add some logic to the result and the callback. For example:
def small_task_1():
try:
// do stuff
return 'success', result
except:
return 'fail', result
and then in your callback task iterate over the results tuples and check for fails because doing the actual logic.

I found the best solution to be iterate over the group results, after the group is ready.
When you issue a Group, you have a ResultSet object. You can .save() this object, to get it later and check if .is_ready, or you can call .join() and wait for the results.
When it ends, you can access .results and you have a list of AsyncResult objects. These objects all have a .state property that you can access and check if the task was successul or not.
However, you can only check the results after the group ends. During the process, you can get the value of .completed_count() and have an idea of group progress.
https://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.ResultSet

The solution we use for a partly similar problem where celery builtin stuff (tasks states etc) doesn't really cut it is to manually store desired informations in Redis and retrieve them when needed.

Related

Is there a way to guarantee that coroutines have been started before doing something that will allow them to finish?

I have a method that fetches some data from an API using IDs that are passed to it. Before this happens, the IDs are placed in a queue so that multiple (up to 100) can be used per API call. API calls are carried out by a method called flush_queue, which must be called at the end of my program to guarantee that all of the IDs that were added to the queue will have been used. I wrote an async function that takes in an ID, creates a future that flush_queue will eventually set the result of, and then returns the data obtained for that ID when it's available:
async def get_data(self, id):
self.queued_ids.append(id)
self.results[id] = asyncio.get_running_loop().create_future()
if len(self.queued_ids) == 100:
await self.flush_queue()
return await self.results[id]
(The future created and placed in the results dict will have the data obtained from the API that corresponds to that ID set as its result by flush_queue so that the data can be returned by the above method.) This works well for the case where the queue is flushed because there are enough IDs to make an API request; the problem is, I need to ensure that each user id is added to the queue before I call flush_queue at the completion of my program. Therefore, I need each call to get_data to have started before I do that; however, I can't wait for each coroutine object returned by each call to finish, because they won't finish until after flush_queue completes. Is there any way to ensure that a series of coroutines has started (but not necessarily finished) before doing something like calling flush_queue?
As discussed in the comments, just use a countdown latch/counting semaphore. Increment the counter when you enter get_data, and wait for it become zero.

How to make list of twisted deferred operations truly async

I am new to using Twisted library, I want to make a list of operations async. Take example of the following pseudo code:
#defer.inlineCallbacks
def getDataAsync(host):
data = yield AsyncHttpAPI(host) # some asyc api which returns deferred
return data
#defer.inlineCallbacks
def funcPrintData():
hosts = []; # some list of hosts, say 1000 in number
for host in hosts:
data = yield getDataAsync(host)
# why doesn't the following line get printed as soon as first result is available
# it waits for all getDataAsync to be queued before calling the callback and so print data
print(data)
Please comment if the question is not clear. Is there a better way of doing this? Should I instead be using the DeferredList ?
The line:
data = yield getDataAsync(host)
means "stop running this function until the getDataAsync(host) operation has completed. If the function stops running, the for loop can't get to any subsequent iterations so those operations can't even begin until after the first getDataAsync(host) has completed. If you want to run everything concurrently then you need to not stop running the function until all of the operations have started. For example:
ops = []
for host in hosts:
ops.append(getDataAsync(host))
After this runs, all of the operations will have started regardless of whether or not any have finished.
What you do with ops depends on whether you want results in the same order as hosts or if you want them all at once when they're all ready or if you want them one at a time in the order the operations succeed.
DeferredList is for getting them all at once when they're all ready as a list in the same order as the input list (ops):
datas = yield DeferredList(ops)
If you want to process each result as it becomes available, it's easier to use addCallback:
ops = []
for host in hosts:
ops.append(getDataAsync(host).addCallback(print))
This still doesn't yield so the whole group of operations are started. However, the callback on each operation runs as soon as that operation has a result. You're still left with a list of Deferred instances in ops which you can still use to wait for all of the results to finish if you want or attach overall error handling to (at least one of those is a good idea otherwise you have dangling operations that you can't easily account for in callers of funcPrintDat).

Is there any way to understand when all tasks are finished?

Let's say I add 100 push tasks (as group 1) to my tasks-queue. Then I add another 200 tasks (as group 2) to the same queue. How can I understand if all tasks of group 1 are finished?
Looks like QueueStatistics will not help here. tag works only with pull queues.
And I can not have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in datastore like #mgilson said and decorate my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decriment function to complement the increment one.
import random
import time
from google.appengine.ext import deferred
def done_work():
logging.info('work done!')
def worker(callback=None):
def fst(f):
def snd(*args, **kwargs):
key = kwargs['shard_key']
del kwargs['shard_key']
retval = f(*args, **kwargs)
decriment(key)
if get_count(key) == 0:
callback()
return retval
return snd
return fst
def func(n):
# do some work
time.sleep(random.randint(1, 10) / 10.0)
logging.info('task #{:d}'.format(n))
def make_some_tasks():
func = worker(callback=done_work)(func)
key = random.randint(0, 1000)
for n in xrange(0, 100):
increment(key)
deferred.defer(func, n, shard_key=key)
Tasks are not guaranteed to run only once, occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this using a counter incremented at task enqueueing and decremented at task completion wouldn't work - it would be decremented twice in such a duplicate execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing) - they are unique for a given queue. Task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question, I didn't get to actually test it.

Show a progress bar for my multithreaded process

I have a simple Flask web app that make many HTTP requests to an external service when a user push a button. On the client side I have an angularjs app.
The server side of the code look like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
for q in queries]
pool.close() # Close pool
pool.join() # Wait for all task to finish
errors = not all(obj.successful() for obj in result_objs)
# extract result only from successful task
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see I'm using apply_async because I want to later inspect each task and extract from them the result only if the task didn't raise any exception.
I understood that in order to show a progress bar on client side, I need to publish somewhere the number of completed tasks so I made a simple view like this:
#app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? I working in the right direction?
I'have seen similar questions on SO like this one but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessiong.Queue and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue, I'd use collections.namedtuples for more structured messages. On tasks with many steps, this enables you to raise the resolution of you progress report, and report more to the user.
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
done = 0
errors = 0
for r in result_objs:
if r.ready():
done += 1
if not r.successful():
errors += 1
return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need figure out a strategy to keep client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None
#app.route('/')
def index():
global results
results = [pool.apply_async(do_work) for n in range(20)]
return render_template('index.html')
#app.route('/api/v1.0/progress')
def progress():
global results
total = len(results)
done, errored = get_progress(results)
return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes=multiprocessing.Value('i', 10)
lock=multiprocessing.Lock()
And then, when you call worker.dowork, pass a lock object and the value to it:
worker.dowork(lock, processes)
In your worker.dowork code, decrease "processes" by one when the code is finished:
lock.acquire()
processes.value-=1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards

Task fanout - how to bulk add Tasks to the Queue - more than 5

I am using a task (queueing-task) to queue multiple others tasks — fanout. When I try to use Queue.add with task argument being a list of Task instances with more than 5 element's and in transaction… I get this error.
JointException: taskqueue.DatastoreError caused by:
<class 'google.appengine.api.datastore_errors.BadRequestError'>
Too many messages, maximum allowed 5
Is there another way to queue more than 5 tasks in a transaction?
Or...
Maybe I don't need a transaction, cause:
I don't care if any of those tasks get queued twice anyway, and
if queueing will fail for any of them, then the whole queueing-task will be re run.
So tell me how do I queue more than 5 tasks in a transaction or tell me to not use transaction cause I don't really need one.
One solution close to solving your problem is to add one transactional task that fans-out the remaining tasks. Just add the one fan-out task in your existing transaction.
Unless there is a business logic reason to do so, do not re-run a task that has already run. Preventing tasks from being re-inserted (i.e. duplicated) is straightforward and saves resources. Your fan-out task will basically look like:
class FanOutTask(webapp.RequestHandler):
def get(self):
name = self.request.get('name')
params = deserialize(self.request.get('params'))
try:
task_params = params.get('stuff')
taskqueue.add(url='/worker/1', name=name + '-1', params=task_params)
except TaskAlreadyExistsError:
pass
try:
task_params = params.get('more')
taskqueue.add(url='/worker/2', name=name + '-2', params=task_params)
except TaskAlreadyExistsError:
pass
Adding the fan-out task transactionally ensures it is enqueued. Errors resulting from the task already being run get caught and ignored, other errors cause the fan-out task to re-run. With this pattern you can insert many sub-tasks pretty easily.

Categories