Celery - Remote chain randomly hangs - python

In a Django project, as part of a rather heavy import pipeline, I have to call three tasks that belong to a different codebase/application. Both applications point to the same AMQP broker, and calling a task from the other app works fine.
As part of our workflow, we want to chain those three tasks and wait for the whole chain to return before proceeding (each task depends on the previous one, but we only care about the result of the last one). We have to call this chain for each item we insert in the DB. To speed up the process, I populate a dictionary with the AsyncResult objects returned by my_chain.apply_async as the chains are launched, and then iterate over this result dict to retrieve the data we need and process it further.
In pseudocode, this looks like:
def launch_chains():
    # assuming i need to preserve the order in which the chains are
    # called for later processing
    result_dict = OrderedDict()
    for item in all_items:
        # Need to use signature objects because the tasks are on a
        # different app / codebase
        task_1 = remote_celery_app.signature(
            'path.to.my.task1',
            args=[whatever],
            queue="my_queue"
        )
        task_2 = remote_celery_app.signature(
            # ...
        )
        item_chain = celery.chain(
            task_1,
            task_2,
            app=remote_celery_app
        )
        result_dict[item.pk] = item_chain.apply_async()
    return result_dict

def process_items():
    remote_results = launch_chains()
    for item_id, r in remote_results.items():
        data = r.get()
        process_data(...)
On a rather small batch of items this seemed to work fine. However, when testing with a heavier set (about 1600 objects, which is closer to what we'll need in production), the processing loop ends up hanging when calling get.
What happens seems pretty random: I'll usually get the very first result, and the second get call will hang indefinitely. Sometimes it won't even get the first result, and sometimes I'll manage to get about 30 responses (out of the expected 1600); in those cases, the loop actually terminates.
All tasks are launched and return data (I see the log on the remote app scroll while populating the dictionary); it's only when trying to grab the results that something goes wrong.
One way to fix this is to move the apply_async call to the processing loop, like this:
def launch_chains():
    # for loop, chain creation, etc...
    result_dict[item.pk] = item_chain  # removing the apply_async call...
    return result_dict

def process_items():
    remote_results = launch_chains()
    for item_id, r in remote_results.items():
        r = r.apply_async()  # ... and adding it back here
        data = r.get()
        process_data(...)
This works perfectly, and I get to process all items, but it obviously takes much longer, because I have to wait for each call to finish, while the previous version should allow me to start iterating over results that were returned while I am still populating my result dict.
When playing around in a Python shell, what usually happens is that the first chain I try to process (no matter whether it's the first one that got launched or another one picked randomly) will be in a SUCCESS state, and then all others will permanently remain PENDING.
I've been able to reproduce the behaviour by creating a dummy task on the remote app and calling it with the following management command:
# remote_app.tasks.py
def add(*args):
    # simulating some processing delay.
    # Removing this does not fix the problem
    time.sleep(random.random())
    return sum(args)
# test_command.py
l1 = [
    [2, 2],
    [3, 3],
]
l2 = [
    [4, 4],
    [6, 6],
]

def get_remote_chain(i):
    """
    Call the tasks remotely. This seems to work fine with a small number
    of calls, but ends up hanging when i start to call it a bunch.
    Behaviour is basically the same as described above.
    """
    t1 = remote_app.signature(
        'remote.path.to.add',
        args=l1[i],
        queue='somequeue'
    )
    t2 = remote_app.signature(
        'remote.path.to.add',
        args=l2[i],
        queue='somequeue'
    )
    return celery.chain(
        t1, t2, app=remote_app
    )
def get_local_chain(i):
    """
    I copy pasted the task locally, and tried to call this version of the
    chain instead:
    Everything works fine, no matter how many times we call it
    """
    return celery.chain(
        add.s(*l1[i]),
        add.s(*l1[i]),
    )
def launch_chains():
    results = OrderedDict()
    for i in range(100):
        for j in range(2):
            # task_chain = get_local_chain(j)
            task_chain = get_remote_chain(j)
            results['%d-%d' % (j, i)] = task_chain.apply_async()
    return results

class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        results = launch_chains()
        print('dict ok')
        for rid, r in results.items():
            print(rid, r.get())
        # Trying an alternate way to process the results, trying to
        # ensure we wait until each task is ready.
        # Again, this works fine with local tasks, but ends up hanging,
        # apparently randomly, after a few remote ones have been "processed".
        #
        # rlen = len(results)
        # while results:
        #     for rid, r in results.items():
        #         if r.ready():
        #             print(rid, r.status)
        #             print(r.get())
        #             results.pop(rid)
        #     if len(results) != rlen:
        #         print('RETRYING ALL, %d remaining' % len(results))
        #         rlen = len(results)
I'll admit I don't know Celery and AMQP that well, so I have no idea whether this is a configuration problem or whether I'm "doing it wrong". However, since it works fine locally, I'm pretty sure the problem comes from the communication between the two apps.
I'm using Celery 3.1.16 on the Django app, and the remote app is on 4.0.2. For a whole bunch of reasons I can't upgrade on the Django side, and I'm not sure I'll be able to downgrade the remote one (a quick try breaks because the code uses various new things from 4.0).
Any ideas? Do not hesitate to ask for more config info if needed; I can't think of anything more right now.
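The commented-out polling loop in the test command pops entries from the dictionary while iterating over it, which raises a RuntimeError on Python 3. A minimal sketch of the same idea with an overall timeout follows. It relies only on the ready() and get() methods that AsyncResult exposes; FakeResult, collect_as_ready, and the timing values are hypothetical stand-ins for illustration. It does not address the underlying hang, but it turns an indefinite block into a visible list of chains that never left PENDING.

```python
import time
from collections import OrderedDict


class FakeResult:
    """Hypothetical stand-in for an AsyncResult: ready after `delay` seconds."""

    def __init__(self, value, delay):
        self._value = value
        # delay=None models a result that never becomes ready
        self._ready_at = None if delay is None else time.monotonic() + delay

    def ready(self):
        return self._ready_at is not None and time.monotonic() >= self._ready_at

    def get(self):
        return self._value


def collect_as_ready(results, timeout=5.0, poll_interval=0.01):
    """Poll until every result is ready or the timeout expires.

    Returns (done, still_pending): the collected values keyed like the
    input, and the keys that were not ready when polling stopped.
    """
    pending = OrderedDict(results)
    done = OrderedDict()
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        # Iterate over a snapshot so entries can be removed safely.
        for key, result in list(pending.items()):
            if result.ready():
                done[key] = result.get()
                del pending[key]
        if pending:
            time.sleep(poll_interval)
    return done, list(pending)


results = OrderedDict([
    ("a", FakeResult(4, 0.05)),
    ("b", FakeResult(8, 0.0)),
    ("stuck", FakeResult(None, None)),
])
done, stuck = collect_as_ready(results, timeout=0.5)
print(dict(done), stuck)
```

With real AsyncResult objects, the keys left in the second return value are the chains worth inspecting on the broker and result backend side.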

Related

How to make list of twisted deferred operations truly async

I am new to using Twisted library, I want to make a list of operations async. Take example of the following pseudo code:
@defer.inlineCallbacks
def getDataAsync(host):
    data = yield AsyncHttpAPI(host)  # some async api which returns deferred
    return data

@defer.inlineCallbacks
def funcPrintData():
    hosts = []  # some list of hosts, say 1000 in number
    for host in hosts:
        data = yield getDataAsync(host)
        # why doesn't the following line get printed as soon as first result is available
        # it waits for all getDataAsync to be queued before calling the callback and so print data
        print(data)
Please comment if the question is not clear. Is there a better way of doing this? Should I instead be using the DeferredList ?
The line:
data = yield getDataAsync(host)
means "stop running this function until the getDataAsync(host) operation has completed". If the function stops running, the for loop can't get to any subsequent iterations, so those operations can't even begin until after the first getDataAsync(host) has completed. If you want to run everything concurrently then you need to not stop running the function until all of the operations have started. For example:
ops = []
for host in hosts:
    ops.append(getDataAsync(host))
After this runs, all of the operations will have started regardless of whether or not any have finished.
What you do with ops depends on whether you want results in the same order as hosts or if you want them all at once when they're all ready or if you want them one at a time in the order the operations succeed.
DeferredList is for getting them all at once when they're all ready as a list in the same order as the input list (ops):
datas = yield DeferredList(ops)
If you want to process each result as it becomes available, it's easier to use addCallback:
ops = []
for host in hosts:
    ops.append(getDataAsync(host).addCallback(print))
This still doesn't yield, so the whole group of operations is started. However, the callback on each operation runs as soon as that operation has a result. You're still left with a list of Deferred instances in ops, which you can use to wait for all of the results to finish or to attach overall error handling (at least one of those is a good idea, otherwise you have dangling operations that you can't easily account for in callers of funcPrintData).

end python multithreading if one of the threads end first

So, I have the following code, which I'm using to run the tasks in multiple functions at the same time:
if __name__ == '__main__':
    po = Pool(processes = 10)
    resultslist = []
    i = 1
    while i <= 2:
        arg = [i]
        result = po.apply_async(getAllTimes, arg)
        resultslist.append(result)
        i += 1
    feedback = []
    for res in resultslist:
        multipresults = res.get()
        feedback.append(multipresults)
    matchesBegin, matchesEnd = feedback[0][0], feedback[0][1]
    TheTimes = feedback[1]
This works well for me. I'm currently using it to run two jobs at the same time.
But the problem is, I don't always need both simultaneously running jobs to complete before I move on to the next phases of the script. Sometimes, if the first job completes successfully and I'm able to confirm it by verifying what's in matchesBegin, matchesEnd, I want to be able to just move on and kill off the other job.
My issue is, I don't know how to do that.
Job 1 usually completes much faster than Job 2. So, what I'm trying to do here is: IF Job 1 completes before Job 2, AND the content of the variables from Job 1 (matchesBegin, matchesEnd) is True, then I want Job 2 to be blown away because I don't need it anymore. If I don't blow it away, it will only prolong the completion of the script. Job 2 should only be allowed to continue to run if the variables from Job 1 aren't True.
I do not know all the details of your use case, but I would hope this provides you some direction. Essentially, what you've started with apply_async() could do that job, but you would also need to use its callback argument and evaluate incoming result to see if it fulfills your criteria and take a corresponding action if it does. I've hacked around your code a bit and got this:
class ParallelCall:
    def __init__(self, jobs=None, check_done=lambda res: None):
        self.pool = Pool(processes=jobs)
        self.pending_results = []
        self.return_results = []
        self.check_done = check_done

    def _callback(self, incoming_result):
        self.return_results.append(incoming_result)
        if self.check_done(incoming_result):
            self.pool.terminate()
        return incoming_result

    def run_fce(self, fce, *args, **kwargs):
        self.pending_results.append(self.pool.apply_async(fce,
                                                          *args, **kwargs,
                                                          callback=self._callback))

    def collect(self):
        self.pool.close()
        self.pool.join()
        return self.return_results
Which you could use like this:
def final_result(result_to_check):
    return result_to_check[0] == result_to_check[1]

if __name__ == '__main__':
    runner = ParallelCall(jobs=2, check_done=final_result)
    for i in range(1,3):
        arg = [i]
        runner.run_fce(getAllTimes, arg)
    feedback = runner.collect()
    TheTimes = feedback[-1]  # last completed getAllTimes call
What does it do? runner is an instance of ParallelCall (note: I've used only two workers as you seem to only run two jobs) which uses the final_result() function to evaluate whether a result is a suitable candidate for a valid final result. In this case, the check is that its first and second items are equal.
We use that to start getAllTimes two times like in your example above. It uses apply_async() just as you did, but we now also have a callback registered through which we pass the result when it becomes available. We also pass it through the function registered with check_done to see if we got an acceptable final result and if so (return value evaluates to True) we just stop all the worker processes.
Disclaimer: this is not exactly what your example does, because the returned list is not in the order in which the function calls took place, but in the order in which the results became available.
Then we collect() available results into feedback. This method closes the pool so that it does not accept any further tasks (close()) and then waits for the workers to finish (join()) (they could be stopped if one of the incoming results matched the registered criterion). Then we return all the results (either up to the matching result or until all work has been done).
I've put this into ParallelCall class so that I can conveniently keep track of the pending and finished results as well know what my pool is. Default check_done is basically a (callable) nop.

Chain a celery task's results into a distributed group

Like in this other question, I want to create a celery group from a list that's returned by a celery task. The idea is that the first task will return a list, and the second task will explode that list into concurrent tasks for every item in the list.
The plan is to use this while downloading content. The first task gets links from a website, and the second task is a chain that downloads the page, processes it, and then uploads it to s3. Finally, once all the subpages are done, the website is marked as done in our DB. Something like:
chain(
    get_links_from_website.si('https://www.google.com'),
    dmap.s(  # <-- Distributed map
        download_sub_page.s() |
        process_sub_page.s() |
        upload_sub_page_to_s3.s()
    ),
    mark_website_done.s()
)
The solution I've seen so far seems to do an adequate job of this, but fails when the second task is a chain, due to issues with clone not doing a deepcopy (see the comments on this answer for details):
@task
def dmap(it, callback):
    # Map a callback over an iterator and return as a group
    callback = subtask(callback)
    return group(callback.clone([arg,]) for arg in it)()
It also has the problem that if the iterable is 10,000 items long, it will create a group with 10,000 items. That is blowing up our memory usage, as you can imagine.
So, what I'm looking for is a way to do dmap that:
Doesn't blow up RAM by creating monstrous groups (maybe there's a way to chunk through the iterable?)
Works on celery chains without issues with deepcopy.
celery canvas provides chunks to split a task into chunks. Unfortunately, this won't work with primitives like chain, group.
You can use celery signals to prevent issues with dmap/clone.
ch = chain(
    download_sub_page.s(),
    process_sub_page.s(),
    upload_sub_page.s(),
)

@task_success.connect(sender='get_links_from_website')
def task_success_handler(sender=None, headers=None, body=None, **kwargs):
    result = kwargs['result']
    header = [ch(i) for i in result]
    callback = mark_website_done.si()
    chord(header)(callback)
Create a chain for processing pages and hook the last task to it using a chord. This function gets executed whenever get_links_from_website runs successfully.
Depending on the time taken by chain, you can also save results of get_links_from_website somewhere. Then iterate over a batch of them to queue up chains and with the last batch, you can hook a callback to last task.
This is a bit hacky but we're using deepcopy to clone the callback, this fixes the bug with Signature's shallow copy
def dmap(it, callback, final=None):
    # Map a callback over an iterator and return as a group
    callback = subtask(callback)
    run_in_parallel = group(subtask(copy.deepcopy(dict(callback))).clone([arg, ]) for arg in it)
    if len(run_in_parallel.tasks) == 0:
        return []
    if final:
        return chord(run_in_parallel)(final)
    return run_in_parallel.delay()
Note that this will only work for one nesting level (i.e. callback is a chain/group/chord) but will not work for deeply nested callbacks
For deeply nested callback graphs we use this hack which is a bit slower but works flawlessly
# Hack to completely clone a signature with possibly complex subtasks (chains, chords, etc...)
run_in_parallel = group(pickle.loads(pickle.dumps(callback)).clone([arg, ]) for arg in it)
And for the size of the groups you can always split the iterator to chunks
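Splitting the iterator into chunks, as suggested above, can be sketched with the standard library alone. The helper name chunked, the batch size, and the sample links are assumptions for illustration; how each batch is then turned into a group (for example, one dmap call per batch) is left open here.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice consumes up to `size` items without materialising the rest
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


links = ["page-%d" % n for n in range(7)]
batches = list(chunked(links, 3))
print(batches)
```

Because the helper is a generator, only one batch is held in memory at a time, which is the point when the iterable has 10,000 items.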
If anyone runs into this, Jether's answer helped a lot, but it wasn't perfect. For us, there were three issues:
1. If the callback is itself a chain, the answer doesn't pass arguments onto the chain. https://stackoverflow.com/a/59023231/19882725 helps provide a solution to this, via clone_signature. This seems to work for reasonably nested chains using RabbitMQ as a broker, but we didn't try anything extreme (and thus didn't need to adapt it to use pickle).
2. After adding (1), passing final broke - we adopted the solution from https://github.com/celery/celery/issues/5265 to convert final from a dict to a Signature.
3. Finally, we found that final wouldn't actually execute in many cases because chord was receiving a Group rather than a list of tasks.
For anyone curious, here's our final solution:
import copy
from celery import Signature, chord, group, shared_task, subtask

def clone_signature(sig, args=(), kwargs=(), **opts):
    """
    Turns out that a chain clone() does not copy the arguments properly - this
    clone does.
    From: https://stackoverflow.com/a/53442344/3189
    """
    if sig.subtask_type and sig.subtask_type != "chain":
        raise NotImplementedError(
            "Cloning only supported for Tasks and chains, not {}".format(
                sig.subtask_type
            )
        )
    clone = sig.clone()
    if hasattr(clone, "tasks"):
        task_to_apply_args_to = clone.tasks[0]
    else:
        task_to_apply_args_to = clone
    args, kwargs, opts = task_to_apply_args_to._merge(
        args=args, kwargs=kwargs, options=opts
    )
    task_to_apply_args_to.update(args=args, kwargs=kwargs, options=copy.deepcopy(opts))
    return clone

@shared_task
def dmap(it, callback, final=None):
    if not len(it):
        return []
    callback = subtask(callback)
    run_in_parallel = [
        clone_signature(callback, args if type(args) is list else [args]) for args in it
    ]
    if not final:
        return group(*run_in_parallel).delay()
    # see https://github.com/celery/celery/issues/5265
    if not isinstance(final, Signature):
        final["immutable"] = True
        final = Signature.from_dict(final)
    return chord(run_in_parallel)(final)
This allowed us to successfully execute nested dmaps like the following:
chain(
    taskA.s(),
    dmap.s(
        chain(
            taskB.s(),
            taskC.s(),
            dmap.s(
                taskD.s(),
                final=chain(
                    taskE.s(),
                    taskF.s(),
                ),
            ),
        ),
    ),
).delay()

Is there any way to understand when all tasks are finished?

Let's say I add 100 push tasks (as group 1) to my tasks-queue. Then I add another 200 tasks (as group 2) to the same queue. How can I understand if all tasks of group 1 are finished?
Looks like QueueStatistics will not help here. tag works only with pull queues.
And I can not have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in datastore like @mgilson said and decorate my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decrement function to complement the increment one.
import logging
import random
import time
from google.appengine.ext import deferred

def done_work():
    logging.info('work done!')

def worker(callback=None):
    def fst(f):
        def snd(*args, **kwargs):
            key = kwargs['shard_key']
            del kwargs['shard_key']
            retval = f(*args, **kwargs)
            decrement(key)
            if get_count(key) == 0:
                callback()
            return retval
        return snd
    return fst

def func(n):
    # do some work
    time.sleep(random.randint(1, 10) / 10.0)
    logging.info('task #{:d}'.format(n))

def make_some_tasks():
    # Bind the wrapped function to a new name: reassigning func here would
    # make it a local variable and raise UnboundLocalError.
    wrapped = worker(callback=done_work)(func)
    key = random.randint(0, 1000)
    for n in xrange(0, 100):
        increment(key)
        deferred.defer(wrapped, n, shard_key=key)
Tasks are not guaranteed to run only once; occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this, using a counter incremented at task enqueueing and decremented at task completion wouldn't work: it would be decremented twice in such a duplicate execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing) - they are unique for a given queue. Task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question, I didn't get to actually test it.
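The difference between counting and tracking by name can be sketched in plain Python. TaskTracker and the task names below are hypothetical, and the in-memory set is only a stand-in for a task list persisted in the datastore; the sketch shows why a duplicate execution cannot throw off the result.

```python
class TaskTracker:
    """In-memory stand-in for a per-group task list kept in the datastore."""

    def __init__(self):
        self._pending = set()
        self._enqueued_any = False

    def enqueued(self, task_name):
        # Record the (unique) task name when the task is enqueued.
        self._pending.add(task_name)
        self._enqueued_any = True

    def completed(self, task_name):
        # discard() is a no-op for names already removed, so a repeated
        # execution of the same task does not change the outcome.
        self._pending.discard(task_name)

    def all_done(self):
        return self._enqueued_any and not self._pending


tracker = TaskTracker()
for n in range(3):
    tracker.enqueued("group1-task-%d" % n)

tracker.completed("group1-task-0")
tracker.completed("group1-task-0")  # duplicate execution of the same task
tracker.completed("group1-task-1")
print(tracker.all_done())

tracker.completed("group1-task-2")
print(tracker.all_done())
```

A plain counter would have reached zero after the third call above even though one task was still outstanding.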

Show a progress bar for my multithreaded process

I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an AngularJS app.
The server side of the code looks like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]
pool.close() # Close pool
pool.join() # Wait for all task to finish
errors = not all(obj.successful() for obj in result_objs)
# extract result only from successful task
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see I'm using apply_async because I want to later inspect each task and extract from them the result only if the task didn't raise any exception.
I understood that in order to show a progress bar on client side, I need to publish somewhere the number of completed tasks so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I've seen similar questions on SO like this one, but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuples for more structured messages. On tasks with many steps, this enables you to raise the resolution of your progress report, and report more to the user.
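The two adjustments above can be sketched as follows. The Progress fields, do_work, and drain are hypothetical names chosen for illustration. Because the question uses the thread-based multiprocessing.dummy pool, a plain queue.Queue is enough in this sketch; with real worker processes a multiprocessing queue (for example one created by a Manager) would be needed instead.

```python
from collections import namedtuple
from multiprocessing.dummy import Pool  # thread-based pool, as in the question
from queue import Full, Queue

# Hypothetical message shape; the field names are illustrative only.
Progress = namedtuple("Progress", ["task_id", "step", "total_steps"])


def do_work(task_id, progress_queue, total_steps=3):
    for step in range(1, total_steps + 1):
        # ... one unit of real work would happen here ...
        try:
            progress_queue.put_nowait(Progress(task_id, step, total_steps))
        except Full:
            pass  # reporting is best effort; never block the work itself
    return task_id


def drain(progress_queue, finished):
    """Fold queued messages into the set of task ids that reached their last step."""
    while not progress_queue.empty():
        msg = progress_queue.get_nowait()
        if msg.step == msg.total_steps:
            finished.add(msg.task_id)
    return finished


progress_queue = Queue()
pool = Pool(processes=4)
results = [pool.apply_async(do_work, (n, progress_queue)) for n in range(10)]
pool.close()
pool.join()

finished = drain(progress_queue, set())
print("%d of %d tasks finished" % (len(finished), len(results)))
```

In a web app the drain step would run periodically in the main process and update whatever the progress view reads, rather than once after join().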
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
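A small self-contained run of this helper, using the same thread-based pool as the question, might look like the sketch below. The do_work function, which fails on every third task, is a made-up worker for illustration only.

```python
from multiprocessing.dummy import Pool  # thread-based, as in the question


def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            # successful() may only be called once the result is ready
            if not r.successful():
                errors += 1
    return (done, errors)


def do_work(n):
    # Hypothetical worker: every third task raises.
    if n % 3 == 0:
        raise ValueError("task %d failed" % n)
    return n * 2


pool = Pool(processes=4)
result_objs = [pool.apply_async(do_work, (n,)) for n in range(9)]
pool.close()
pool.join()

done, errors = get_progress(result_objs)
# extract results only from the tasks that did not raise
items = [r.get() for r in result_objs if r.successful()]
print(done, errors, items)
```

While the pool is still running, the same call simply returns a smaller done count, which is what a progress endpoint would report.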
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes=multiprocessing.Value('i', 10)
lock=multiprocessing.Lock()
And then, when you call worker.do_work, pass the lock object and the value to it:
worker.do_work(lock, processes)
In your worker.do_work code, decrease "processes" by one when the code is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now, "processes.value" should be accessible from your main code and equal to the number of remaining processes. Make sure you acquire the lock before accessing processes.value, and release it afterwards.
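A compact variant of this idea is sketched below. It differs from the snippet above in two ways, both assumptions made for illustration: the counter starts at the number of tasks rather than at the pool size, and it uses the lock that multiprocessing.Value already carries (via get_lock()) instead of a separate Lock object. The do_work body is a placeholder.

```python
import multiprocessing
from multiprocessing.dummy import Pool  # thread-based, as in the question

TOTAL_TASKS = 10


def do_work(n, remaining):
    result = n * n  # placeholder for the real HTTP request
    # get_lock() returns the lock that guards the shared value;
    # the with-block releases it even if an exception is raised.
    with remaining.get_lock():
        remaining.value -= 1
    return result


remaining = multiprocessing.Value("i", TOTAL_TASKS)
pool = Pool(processes=4)
results = [pool.apply_async(do_work, (n, remaining)) for n in range(TOTAL_TASKS)]
pool.close()
pool.join()

with remaining.get_lock():
    print("remaining:", remaining.value)
```

The progress view would read remaining.value under the same lock and report TOTAL_TASKS minus that number as completed.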
