Chain a celery task's results into a distributed group - python

Like in this other question, I want to create a celery group from a list that's returned by a celery task. The idea is that the first task will return a list, and the second task will explode that list into concurrent tasks for every item in the list.
The plan is to use this while downloading content. The first task gets links from a website, and the second task is a chain that downloads the page, processes it, and then uploads it to s3. Finally, once all the subpages are done, the website is marked as done in our DB. Something like:
chain(
    get_links_from_website.si('https://www.google.com'),
    dmap.s(  # <-- Distributed map
        download_sub_page.s() |
        process_sub_page.s() |
        upload_sub_page_to_s3.s()
    ),
    mark_website_done.s()
)
The solution I've seen so far seems to do an adequate job of this, but fails when the second task is a chain, due to issues with clone not doing a deepcopy (see the comments on this answer for details):
@task
def dmap(it, callback):
    # Map a callback over an iterator and return as a group
    callback = subtask(callback)
    return group(callback.clone([arg,]) for arg in it)()
It also has the problem that if the iterable is 10,000 items long, it will create a group with 10,000 items. That is blowing up our memory usage, as you can imagine.
So, what I'm looking for is a way to do dmap that:
Doesn't blow up RAM by creating monstrous groups (maybe there's a way to chunk through the iterable?)
Works on celery chains without issues with deepcopy.

Celery canvas provides chunks to split a task's work into batches. Unfortunately, this won't work with primitives like chain or group.
You can use celery signals to prevent issues with dmap/clone.
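def page_chain(link):
    # Build a fresh chain per link; calling an existing chain as ch(link)
    # would launch it immediately instead of returning a signature.
    return chain(
        download_sub_page.s(link),
        process_sub_page.s(),
        upload_sub_page.s(),
    )
@task_success.connect(sender='get_links_from_website')
def task_success_handler(sender=None, headers=None, body=None, **kwargs):
    result = kwargs['result']
    header = [page_chain(link) for link in result]
    callback = mark_website_done.si()
    chord(header)(callback)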
Create a chain for processing each page and hook the final task to them using a chord. This handler gets executed whenever get_links_from_website runs successfully.
Depending on how long each chain takes, you can also save the results of get_links_from_website somewhere, then iterate over them in batches to queue up chains, hooking the callback onto the last batch.

This is a bit hacky, but we're using deepcopy to clone the callback, which fixes the bug with Signature's shallow copy:
def dmap(it, callback, final=None):
    # Map a callback over an iterator and return as a group
    callback = subtask(callback)
    run_in_parallel = group(subtask(copy.deepcopy(dict(callback))).clone([arg, ]) for arg in it)
    if len(run_in_parallel.tasks) == 0:
        return []
    if final:
        return chord(run_in_parallel)(final)
    return run_in_parallel.delay()
Note that this only works for one nesting level (i.e. the callback is a chain/group/chord); it will not work for deeply nested callbacks.
For deeply nested callback graphs we use this hack, which is a bit slower but works flawlessly:
# Hack to completely clone a signature with possibly complex subtasks (chains, chords, etc...)
run_in_parallel = group(pickle.loads(pickle.dumps(callback)).clone([arg, ]) for arg in it)
As for the size of the groups, you can always split the iterator into chunks.
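A minimal sketch of that chunking idea, in plain Python. The helper name chunked and the batch size are assumptions for illustration, not part of Celery; each batch could then be handed to its own dmap-style call so that no single group grows beyond the chosen size.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical list of links, as the first task might return them.
links = ["page-%d" % n for n in range(7)]
batches = list(chunked(links, 3))
```

Because the helper consumes the iterable lazily, only one batch is materialized at a time, which is the point when the full list has thousands of entries.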

If anyone runs into this, Jether's answer helped a lot, but it wasn't perfect. For us, there were three issues:
1. If the callback is itself a chain, the answer doesn't pass arguments onto the chain. https://stackoverflow.com/a/59023231/19882725 helps provide a solution to this, via clone_signature. This seems to work for reasonably nested chains using RabbitMQ as a broker, but we didn't try anything extreme (and thus didn't need to adapt it to use pickle).
2. After adding (1), passing final broke - we adopted the solution from https://github.com/celery/celery/issues/5265 to convert final from a dict to a Signature.
3. Finally, we found that final wouldn't actually execute in many cases because chord was receiving a Group rather than a list of tasks.
For anyone curious, here's our final solution:
import copy
from celery import Signature, chord, group, shared_task, subtask
def clone_signature(sig, args=(), kwargs=(), **opts):
    """
    Turns out that a chain clone() does not copy the arguments properly - this
    clone does.
    From: https://stackoverflow.com/a/53442344/3189
    """
    if sig.subtask_type and sig.subtask_type != "chain":
        raise NotImplementedError(
            "Cloning only supported for Tasks and chains, not {}".format(
                sig.subtask_type
            )
        )
    clone = sig.clone()
    if hasattr(clone, "tasks"):
        task_to_apply_args_to = clone.tasks[0]
    else:
        task_to_apply_args_to = clone
    args, kwargs, opts = task_to_apply_args_to._merge(
        args=args, kwargs=kwargs, options=opts
    )
    task_to_apply_args_to.update(args=args, kwargs=kwargs, options=copy.deepcopy(opts))
    return clone
@shared_task
def dmap(it, callback, final=None):
    if not len(it):
        return []
    callback = subtask(callback)
    run_in_parallel = [
        clone_signature(callback, args if type(args) is list else [args]) for args in it
    ]
    if not final:
        return group(*run_in_parallel).delay()
    # see https://github.com/celery/celery/issues/5265
    if not isinstance(final, Signature):
        final["immutable"] = True
        final = Signature.from_dict(final)
    return chord(run_in_parallel)(final)
This allowed us to successfully execute nested dmaps like the following:
chain(
    taskA.s(),
    dmap.s(
        chain(
            taskB.s(),
            taskC.s(),
            dmap.s(
                taskD.s(),
                final=chain(
                    taskE.s(),
                    taskF.s(),
                ),
            ),
        ),
    ),
).delay()

Related

end python multithreading if one of the threads end first

So, I have the following code, which I'm using to run tasks in multiple processes at the same time:
if __name__ == '__main__':
    po = Pool(processes = 10)
    resultslist = []
    i = 1
    while i <= 2:
        arg = [i]
        result = po.apply_async(getAllTimes, arg)
        resultslist.append(result)
        i += 1
    feedback = []
    for res in resultslist:
        multipresults = res.get()
        feedback.append(multipresults)
    matchesBegin, matchesEnd = feedback[0][0], feedback[0][1]
    TheTimes = feedback[1]
This works well for me. I'm currently using it to run two jobs at the same time.
But the problem is, I don't always need both simultaneously running jobs to complete before I move on to the next phases of the script. Sometimes, if the first job completes successfully and I'm able to confirm it by verifying what's in matchesBegin, matchesEnd, I want to be able to just move on and kill off the other job.
My issue is, I don't know how to do that.
Job 1 usually completes much faster than Job 2. So, what I'm trying to do here is: IF Job 1 completes before Job 2, AND the content of the variables from Job 1 (matchesBegin, matchesEnd) is True, then I want Job 2 to be killed because I don't need it anymore. If I don't kill it, it will only prolong the completion of the script. Job 2 should only be allowed to continue to run if the variables from Job 1 aren't True.
I do not know all the details of your use case, but I would hope this provides you some direction. Essentially, what you've started with apply_async() could do that job, but you would also need to use its callback argument and evaluate incoming result to see if it fulfills your criteria and take a corresponding action if it does. I've hacked around your code a bit and got this:
from multiprocessing import Pool
class ParallelCall:
    def __init__(self, jobs=None, check_done=lambda res: None):
        self.pool = Pool(processes=jobs)
        self.pending_results = []
        self.return_results = []
        self.check_done = check_done
    def _callback(self, incoming_result):
        # Runs in the parent process each time a job finishes
        self.return_results.append(incoming_result)
        if self.check_done(incoming_result):
            self.pool.terminate()
        return incoming_result
    def run_fce(self, fce, *args, **kwargs):
        self.pending_results.append(self.pool.apply_async(fce,
                                                          *args, **kwargs,
                                                          callback=self._callback))
    def collect(self):
        self.pool.close()
        self.pool.join()
        return self.return_results
Which you could use like this:
def final_result(result_to_check):
    return result_to_check[0] == result_to_check[1]
if __name__ == '__main__':
    runner = ParallelCall(jobs=2, check_done=final_result)
    for i in range(1,3):
        arg = [i]
        runner.run_fce(getAllTimes, arg)
    feedback = runner.collect()
    TheTimes = feedback[-1] # last completed getAllTimes call
What does it do? runner is an instance of ParallelCall (note: I've used only two workers as you seem to only run two jobs) which uses the final_result() function to decide whether a result is a suitable candidate for a valid final result. In this case, that means its first and second items are equal.
We use that to start getAllTimes two times like in your example above. It uses apply_async() just as you did, but we now also have a callback registered, through which each result is passed when it becomes available. The callback also passes it through the function registered with check_done to see if we got an acceptable final result, and if so (the return value evaluates to True) we just stop all the worker processes.
Disclaimer: this is not exactly what your example does, because the returned list is not in the order in which the function calls were made, but in the order in which the results became available.
Then we collect() available results into feedback. This method closes the pool to not accept any further tasks (close()) and then waits for the workers to finish (join()) (they could be stopped if one of the incoming results matched the registered criterion). Then we return all the results (either up to the matching result or until all work has been done).
I've put this into the ParallelCall class so that I can conveniently keep track of the pending and finished results as well as know what my pool is. The default check_done is basically a (callable) nop.
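For comparison, here is a minimal sketch of the same "stop at the first acceptable result" idea using threads from concurrent.futures instead of a process pool. This swaps in a different technique: unlike Pool.terminate(), it cannot kill a job that is already running, it only stops waiting for it and cancels jobs that have not started yet. The names first_acceptable, quick_job and slow_job are hypothetical stand-ins for the jobs in the question.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_acceptable(jobs, is_acceptable, max_workers=2):
    """Run zero-argument callables concurrently.

    Returns the first completed result that satisfies is_acceptable,
    or None if no result does.
    """
    executor = ThreadPoolExecutor(max_workers=max_workers)
    futures = [executor.submit(job) for job in jobs]
    try:
        # as_completed yields futures in completion order, not submit order
        for future in as_completed(futures):
            result = future.result()
            if is_acceptable(result):
                return result
        return None
    finally:
        # Cancel jobs that have not started; running ones are left to finish
        for future in futures:
            future.cancel()
        executor.shutdown(wait=False)


def quick_job():
    time.sleep(0.01)
    return ("a", "a")


def slow_job():
    time.sleep(0.3)
    return ("b", "c")


result = first_acceptable([quick_job, slow_job], lambda r: r[0] == r[1])
```

If the second job must really be stopped to free CPU, the process-based approach above remains the better fit, since only processes can be terminated from outside.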

Celery - Remote chain randomly hangs

In a Django project, as part of a rather heavy import pipeline, I have to call three tasks that are part of a different codebase/application. Both applications point to the same AMQP broker, and actually calling a task from the other app works great.
As part of our workflow, we want to chain those three tasks and wait for the whole chain to return before proceeding (each task depends on the previous one, but we only care about the results of the last one). We have to call this chain for each item we're inserting in the DB. To try and speed up the process, I'm populating a dictionary with the AsyncResult objects returned by calling my_chain.apply_async as we launch the tasks, and then iterating over this result dict to retrieve the data we need and further process it.
In pseudo code, this looks like this:
def launch_chains():
    # assuming i need to preserve the order in which the chains are
    # called for later processing
    result_dict = OrderedDict()
    for item in all_items:
        # Need to use signature objects because the tasks are on a
        # different app / codebase
        task_1 = remote_celery_app.signature(
            'path.to.my.task1',
            args=[whatever],
            queue="my_queue"
        )
        task_2 = remote_celery_app.signature(
            # ...
        )
        item_chain = celery.chain(
            task_1,
            task_2,
            app=remote_celery_app
        )
        result_dict[item.pk] = item_chain.apply_async()
    return result_dict
def process_items():
    remote_results = launch_chains()
    for item_id, r in remote_results.items():
        data = r.get()
        process_data(...)
On a rather small batch of items, this seemed to work fine. However, when testing with a heavier set (about 1600 objects, which is closer to what we'll need in production), the processing loop ends up hanging when calling get.
What happens seems pretty random: I'll usually get the very first result, and the second get call will hang indefinitely. Sometimes it won't even get the first result, and sometimes I'll manage to get about 30 responses (out of the expected 1600) (in those cases, the loop will actually terminate).
All tasks are launched and return data (I see the log on the remote app go when populating the dictionary); it's just when trying to grab the results that something goes wrong.
One way to fix this is to move the apply_async call to the processing loop, like this:
def launch_chains():
    # for loop, chain creation, etc...
        result_dict[item.pk] = item_chain  # removing the apply_async call...
    return result_dict
def process_items():
    remote_results = launch_chains()
    for item_id, r in remote_results.items():
        r = r.apply_async()  # ... and adding it back here
        data = r.get()
        process_data(...)
This works perfectly, and I get to process all items, but it obviously takes much longer, because I have to wait for each call to finish, while the previous version should allow me to start iterating over results that were returned while still populating my result dict.
When playing around in a Python shell, what usually happens is that the first chain I try to process (no matter whether it's the first one that got launched or another one picked randomly) will be in a SUCCESS state, and then all the others will permanently remain PENDING.
I've been able to reproduce the behaviour by creating a dummy task on the remote app, and calling it with the following management command:
# remote_app.tasks.py
def add(*args):
    # simulating some processing delay.
    # Removing this does not fix the problem
    time.sleep(random.random())
    return sum(args)
# test_command.py
l1 = [
    [2, 2],
    [3, 3],
]
l2 = [
    [4, 4],
    [6, 6],
]
def get_remote_chain(i):
    """
    Call the tasks remotely. This seems to work fine with a small number
    of calls, but ends up hanging when i start to call it a bunch.
    Behaviour is basically the same as described above.
    """
    t1 = remote_app.signature(
        'remote.path.to.add',
        args=l1[i],
        queue='somequeue'
    )
    t2 = remote_app.signature(
        'remote.path.to.add',
        args=l2[i],
        queue='somequeue'
    )
    return celery.chain(
        t1, t2, app=simi_app
    )
def get_local_chain(i):
    """
    I copy pasted the task locally, and tried to call this version of the
    chain instead:
    Everything works fine, no matter how many times we call it
    """
    return celery.chain(
        add.s(*l1[i]),
        add.s(*l1[i]),
    )
def launch_chains():
    results = OrderedDict()
    for i in range(100):
        for j in range(2):
            # task_chain = get_local_chain(j)
            task_chain = get_remote_chain(j)
            results['%d-%d' % (j, i)] = task_chain.apply_async()
    return results
class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        results = launch_chains()
        print('dict ok')
        for rid, r in results.items():
            print(rid, r.get())
        # Trying an alternate way to process the results, trying to
        # ensure we wait until each task is ready.
        # Again, this works fine with local tasks, but ends up hanging,
        # apparently randomly, after a few remote ones have been "processed".
        #
        # rlen = len(results)
        # while results:
        #     for rid, r in results.items():
        #         if r.ready():
        #             print(rid, r.status)
        #             print(r.get())
        #             results.pop(rid)
        #     if len(results) != rlen:
        #         print('RETRYING ALL, %d remaining' % len(results))
        #         rlen = len(results)
I'll admit I don't know Celery and AMQP that well, so I have no idea if this is a configuration problem or if I'm "doing it wrong". However, since it works fine locally, I'm pretty sure the problem comes from the communication between the two apps.
I'm using celery 3.1.16 on the django app, and the remote app is 4.0.2... For a whole bunch of reasons I can't upgrade on the django side, and I'm not sure I'll be able to downgrade the remote one (a quick try breaks because the code uses various new things from 4.0).
Any ideas? Do not hesitate to ask for more config info if needed; I can't think of anything more right now.

How to monitor a group of tasks in celery?

I have a situation where a periodic monthly big_task reads a file and enqueues one chained task per row in this file, where the chained tasks are small_task_1 and small_task_2:
class BigTask(PeriodicTask):
    run_every = crontab(hour=0, minute=0, day_of_month=1)
    def run(self):
        task_list = []
        with open("the_file.csv") as f:
            for row in f:
                t = chain(
                    small_task_1.s(row),
                    small_task_2.s(),
                )
                task_list.append(t)
        gr = group(*task_list)
        r = gr.apply_async()
I would like to get statistics about the number of enqueued and failed tasks (and details about the exceptions) for each small_task, as soon as all of them are finished (whatever their status), to send a summary email to the project admins.
I first thought of using chord, but the callback is not executed if any of the header tasks fails, which will surely happen in my case.
I could also use r.get() in the BigTask, which is very convenient, but waiting for a task result inside another task is not recommended (even if here, I guess the risk of worker deadlock is low since the task will be executed only once a month).
Important note: input file contains ~700k rows.
How would you recommend to proceed?
I'm not sure if it can help you with monitoring, but regarding the chord and callback issue, you could use a link_error callback (for catching exceptions). In your case, for example, you can use it like:
small_task_1.s(row).set(link_error=error_task.s())
and implement a celery error_task that sends you a notification or whatever.
In celery 4, you can set it once for the whole canvas (but it didn't work for me in 3.1):
r = gr.apply_async(link_error=error_task.s())
For the monitoring part, you can use flower of course.
Hope that helps
EDIT: An alternative (without using additional persistency) would be to catch the exception and add some logic to the result and the callback. For example:
def small_task_1():
    try:
        result = do_stuff()  # placeholder for the actual work
        return 'success', result
    except Exception as exc:
        return 'fail', str(exc)
and then in your callback task iterate over the result tuples and check for fails before doing the actual logic.
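The callback side of that idea can be sketched in plain Python. This assumes every task returns a (status, detail) pair as in the snippet above; the function name summarize and the report keys are made up for illustration.

```python
from collections import Counter


def summarize(results):
    """Count statuses and collect failure details from (status, detail) pairs."""
    counts = Counter(status for status, _ in results)
    failures = [detail for status, detail in results if status == "fail"]
    return {
        "total": len(results),
        "succeeded": counts["success"],
        "failed": counts["fail"],
        "failures": failures,
    }


# Example input shaped like what a chord callback might receive.
report = summarize([
    ("success", 10),
    ("fail", "ValueError: bad row"),
    ("success", 12),
])
```

If results travel through a JSON serializer, the pairs will likely arrive as two-element lists rather than tuples; the unpacking above works for both.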
I found the best solution to be iterating over the group results after the group is ready.
When you issue a group, you get back a GroupResult object (a ResultSet subclass). You can .save() this object to retrieve it later and check .ready(), or you can call .join() and wait for the results.
When it ends, you can access .results and you have a list of AsyncResult objects. These objects all have a .state property that you can check to see whether the task was successful or not.
However, you can only check the results after the group ends. During the process, you can get the value of .completed_count() and have an idea of the group's progress.
https://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.ResultSet
The solution we use for a partly similar problem, where celery's builtin features (task states etc.) don't really cut it, is to manually store the desired information in Redis and retrieve it when needed.

Is there any way to understand when all tasks are finished?

Let's say I add 100 push tasks (as group 1) to my tasks-queue. Then I add another 200 tasks (as group 2) to the same queue. How can I understand if all tasks of group 1 are finished?
Looks like QueueStatistics will not help here. tag works only with pull queues.
And I can not have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in the datastore, like @mgilson said, and decorate my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for, if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decrement function (decriment in the code below) to complement the increment one.
import logging
import random
import time
from google.appengine.ext import deferred
def done_work():
    logging.info('work done!')
def worker(callback=None):
    def fst(f):
        def snd(*args, **kwargs):
            key = kwargs['shard_key']
            del kwargs['shard_key']
            retval = f(*args, **kwargs)
            decriment(key)
            if get_count(key) == 0:
                callback()
            return retval
        return snd
    return fst
def func(n):
    # do some work
    time.sleep(random.randint(1, 10) / 10.0)
    logging.info('task #{:d}'.format(n))
def make_some_tasks():
    # use a new name: rebinding func here would make it a local variable
    tracked_func = worker(callback=done_work)(func)
    key = random.randint(0, 1000)
    for n in xrange(0, 100):
        increment(key)
        deferred.defer(tracked_func, n, shard_key=key)
Tasks are not guaranteed to run only once; occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this, using a counter incremented at task enqueueing and decremented at task completion wouldn't work: it would be decremented twice in such a duplicate execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing) - they are unique for a given queue. Task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question, I didn't get to actually test it.
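To illustrate why per-task tracking tolerates repeated executions, here is a minimal in-memory sketch. The class TaskGroupTracker and its method names are hypothetical, and a plain set stands in for the persisted task list; a real version would store the names in the datastore and update them transactionally.

```python
class TaskGroupTracker:
    """Track completion of a group of uniquely named tasks (in-memory stand-in)."""

    def __init__(self):
        self._pending = set()

    def enqueued(self, task_name):
        """Record a task name right after it was successfully enqueued."""
        self._pending.add(task_name)

    def completed(self, task_name):
        """Mark a task done; return True only when this call empties the group."""
        if task_name not in self._pending:
            return False  # duplicate execution or unknown task: ignore it
        self._pending.remove(task_name)
        return not self._pending


tracker = TaskGroupTracker()
for name in ("task-1", "task-2", "task-3"):
    tracker.enqueued(name)

events = [
    tracker.completed("task-1"),
    tracker.completed("task-1"),  # repeated run of an already finished task
    tracker.completed("task-2"),
    tracker.completed("task-3"),
    tracker.completed("task-3"),  # another duplicate, after the group finished
]
```

Removing a name is idempotent, so a duplicate run changes nothing, whereas a shared counter would be decremented a second time.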

How should I structure a batch processing pipeline using Twisted?

I'm slowly but surely getting the hang of twisted, but I'm not sure how I should be approaching this particular project.
I'm trying to create a class for batch processing of web pages. There are multiple web pages that I would like to process independently, so it makes sense to have a pipeline of sorts for each url. Additionally, I would like to call a one-time preprocessing function before any of the urls are processed, and when all urls have been processed, I would like to call a post-processing function. Importantly, I'd like to be able to subclass this processing class and override certain methods based on the contents I'm trying to process -- not all web-pages will require the same processing steps.
If this were synchronous code, I would probably do this with a context manager. Consider the following example code:
class Pipeline(object):
    def __init__(self, urls):
        self.urls = urls  # iterable
        self.should_continue = False  # "continue" is a reserved word
    def __enter__(self):
        self.should_continue = self.preprocess()
        return self  # so that "with ... as p" binds the pipeline
    def __exit__(self, type, value, traceback):
        if self.should_continue:  # if we decided to run the batch pipeline...
            self.postprocess()
    def preprocess(self):
        # does some stuff and returns a bool
        return True
    def postprocess(self):
        # do some stuff
        pass
    def pipeline(self):
        for url in self.urls:
            try:
                # download url, do some stuff
                pass
            except Exception:
                # recover so that other urls are not interrupted
                pass
After which, I would use it as follows:
with Pipeline(list_of_urls) as p:
    p.pipeline()
This works well with synchronous network operations, but not with Twisted, since the pipeline function will return before the end of the processing pipeline, thus calling __exit__.
Additionally, I would like each URL's processing to take place completely separately because there may be conditional branching based on the result of my queries. For this reason, using Twisted's DeferredList is not desirable.
In a nutshell, I need the following:
Preprocess must run before all else
Postprocess must run when the following is true:
At least one url began processing (preprocess returns True)
All urls have either completed or thrown an exception
What's the sanest way to set something like this up with Twisted? The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should).
Any advice?
In light of the comments for the original question, I would suggest using DeferredList in combination with maybeDeferred. I propose the latter because of this:
"The issue I'm having is that some of the code involves asynchronous IO and some is just straight synchronous logic (i.e. processing the results in memory), so I'm not sure how to make the whole thing work with deferreds (or whether I even should)."
Using a maybeDeferred allows you to treat all of your function calls as though they were asynchronous, whether they actually are or not.
The following structure helped me in a similar situation:
class Processor:
    def __init__(self):
        self.runned = 0  # number of urls still being processed
        self.hasSuccess = False  # set once at least one url succeeds
        self.preprocess()
    def launch(self, urls):
        for url in urls:
            dfrd = url.process()
            dfrd.addCallback(self.succ)
            dfrd.addErrback(self.fail)
            self.runned += 1
    def fail(self, reason):
        self.runned -= 1
        if self.runned == 0 and self.hasSuccess:
            self.postprocess()
    def succ(self, arg):
        self.runned -= 1
        self.hasSuccess = True
        if not self.runned:
            self.postprocess()
