I am new to using the Twisted library, and I want to make a list of operations async. Take the following pseudo code as an example:
@defer.inlineCallbacks
def getDataAsync(host):
    data = yield AsyncHttpAPI(host)  # some async API which returns a Deferred
    return data
@defer.inlineCallbacks
def funcPrintData():
    hosts = []  # some list of hosts, say 1000 in number
    for host in hosts:
        data = yield getDataAsync(host)
        # why doesn't the following line get printed as soon as the first result is available?
        # it seems to wait for all getDataAsync calls before calling the callback and printing data
        print(data)
Please comment if the question is not clear. Is there a better way of doing this? Should I instead be using a DeferredList?
The line:
data = yield getDataAsync(host)
means "stop running this function until the getDataAsync(host) operation has completed. If the function stops running, the for loop can't get to any subsequent iterations so those operations can't even begin until after the first getDataAsync(host) has completed. If you want to run everything concurrently then you need to not stop running the function until all of the operations have started. For example:
ops = []
for host in hosts:
    ops.append(getDataAsync(host))
After this runs, all of the operations will have started regardless of whether or not any have finished.
What you do with ops depends on whether you want results in the same order as hosts or if you want them all at once when they're all ready or if you want them one at a time in the order the operations succeed.
DeferredList is for getting them all at once when they're all ready as a list in the same order as the input list (ops):
datas = yield DeferredList(ops)
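Each element of that datas list is a (success, result) two-tuple, so a minimal (untested) way to unpack it, assuming the code above:

for success, value in datas:
    if success:
        print(value)            # the result of the corresponding getDataAsync call
    else:
        value.printTraceback()  # on failure, value is a twisted Failure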
If you want to process each result as it becomes available, it's easier to use addCallback:
ops = []
for host in hosts:
    ops.append(getDataAsync(host).addCallback(print))
This still doesn't yield, so the whole group of operations gets started. However, the callback on each operation runs as soon as that operation has a result. You're still left with a list of Deferred instances in ops, which you can use to wait for all of the results to finish, or to attach overall error handling to (at least one of those is a good idea, otherwise you have dangling operations that you can't easily account for in callers of funcPrintData).
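Putting those pieces together, a rough, untested sketch that prints each result as it arrives while still waiting for (and accounting for) every operation, assuming the hosts list and getDataAsync from the question:

from twisted.internet import defer

@defer.inlineCallbacks
def funcPrintData():
    ops = [getDataAsync(host).addCallback(print) for host in hosts]
    # consumeErrors=True keeps individual failures from also being logged
    # as "Unhandled error in Deferred"
    results = yield defer.DeferredList(ops, consumeErrors=True)
    for success, value in results:
        if not success:
            value.printTraceback()  # value is a Failure here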
Related
In a Django project, as part of a rather heavy import pipeline, I'm having to call three tasks that are part of a different codebase/application. Both applications point to the same AMQP broker, and calling a task from the other app actually works great.
As part of our workflow, we want to chain those three tasks and wait for the whole chain to return before proceeding (each task depends on the previous one, but we only care about the results of the last one). We have to call this chain for each item we're inserting in the DB. In order to try and speed up the process, I'm basically populating a dictionary with the AsyncResult objects returned by calling my_chain.apply_async as we go to launch the tasks, and then iterating over this result dict to retrieve the data we need and further process it.
In pseudo code, this looks like this:
def launch_chains():
    # assuming I need to preserve the order in which the chains are
    # called for later processing
    result_dict = OrderedDict()
    for item in all_items:
        # Need to use signature objects because the tasks are on a
        # different app / codebase
        task_1 = remote_celery_app.signature(
            'path.to.my.task1',
            args=[whatever],
            queue="my_queue"
        )
        task_2 = remote_celery_app.signature(
            # ...
        )
        item_chain = celery.chain(
            task_1,
            task_2,
            app=remote_celery_app
        )
        result_dict[item.pk] = item_chain.apply_async()
    return result_dict

def process_items():
    remote_results = launch_chains()
    for item_id, r in remote_results.items():
        data = r.get()
        process_data(...)
On a rather small batch of items this seemed to work fine, but when testing with a heavier set (about 1600 objects, which is closer to what we'll need in production), the processing loop ends up hanging when calling get.
What happens seems pretty random: I'll usually get the very first result, and the second get call will hang indefinitely. Sometimes it won't even get the first result, and sometimes I'll manage to get about 30 responses (out of the expected 1600); in those cases the loop will actually terminate.
All tasks are launched and return data (I see the logs on the remote app when populating the dictionary); it's just when trying to grab the results that something goes wrong.
One way to fix this is to move the apply_async call to the processing loop, like this:
def launch_chains():
    # for loop, chain creation, etc...
    result_dict[item.pk] = item_chain  # removing the apply_async call...
    return result_dict

def process_items():
    remote_results = launch_chains()
    for item_id, r in remote_results.items():
        r = r.apply_async()  # ... and adding it back here
        data = r.get()
        process_data(...)
This works perfectly, and I get to process all items, but it's obviously taking much longer, because I have to wait for each call to finish, while the previous version should allow me to start iterating over results that were returned while still populating my result dict.
When playing around in a Python shell, what usually happens is that the first chain I try to process (no matter whether it's the first one that got launched or another one picked randomly) will be in a SUCCESS state, and then all the others will permanently remain PENDING.
I've been able to reproduce the behaviour by creating a dummy task on the remote app, and calling it with the following management command:
# remote_app.tasks.py
import random
import time

@app.task  # 'app' being the remote project's Celery instance
def add(*args):
    # simulating some processing delay.
    # Removing this does not fix the problem
    time.sleep(random.random())
    return sum(args)
# test_command.py
l1 = [
    [2, 2],
    [3, 3],
]
l2 = [
    [4, 4],
    [6, 6],
]

def get_remote_chain(i):
    """
    Call the tasks remotely. This seems to work fine with a small number
    of calls, but ends up hanging when I start to call it a bunch.
    Behaviour is basically the same as described above.
    """
    t1 = remote_app.signature(
        'remote.path.to.add',
        args=l1[i],
        queue='somequeue'
    )
    t2 = remote_app.signature(
        'remote.path.to.add',
        args=l2[i],
        queue='somequeue'
    )
    return celery.chain(
        t1, t2, app=remote_app
    )

def get_local_chain(i):
    """
    I copy-pasted the task locally and tried to call this version of the
    chain instead: everything works fine, no matter how many times we call it.
    """
    return celery.chain(
        add.s(*l1[i]),
        add.s(*l2[i]),
    )

def launch_chains():
    results = OrderedDict()
    for i in range(100):
        for j in range(2):
            # task_chain = get_local_chain(j)
            task_chain = get_remote_chain(j)
            results['%d-%d' % (j, i)] = task_chain.apply_async()
    return results

class Command(BaseCommand):
    def handle(self, *args, **kwargs):
        results = launch_chains()
        print('dict ok')
        for rid, r in results.items():
            print(rid, r.get())
        # Trying an alternate way to process the results, trying to
        # ensure we wait until each task is ready.
        # Again, this works fine with local tasks, but ends up hanging,
        # apparently randomly, after a few remote ones have been "processed".
        #
        # rlen = len(results)
        # while results:
        #     for rid, r in results.items():
        #         if r.ready():
        #             print(rid, r.status)
        #             print(r.get())
        #             results.pop(rid)
        #     if len(results) != rlen:
        #         print('RETRYING ALL, %d remaining' % len(results))
        #         rlen = len(results)
I'll admit I don't know Celery and AMQP that much, so I have no idea if this is a configuration problem or if I'm "doing it wrong". However, since it works fine locally, I'm pretty sure the problem comes from the communication between the two apps.
I'm using Celery 3.1.16 on the Django app, and the remote app is on 4.0.2... For a whole bunch of reasons I can't upgrade on the Django side, and I'm not sure I'll be able to downgrade the remote one (a quick try breaks because the code uses various new things from 4.0).
Any ideas? Do not hesitate to ask for more config info if needed; I can't think of anything more right now.
I have a situation where a periodic monthly big_task reads a file and enqueues one chained task per row in this file, where the chained tasks are small_task_1 and small_task_2:
class BigTask(PeriodicTask):
    run_every = crontab(hour=0, minute=0, day_of_month=1)

    def run(self):
        task_list = []
        with open("the_file.csv") as f:
            for row in f:
                t = chain(
                    small_task_1.s(row),
                    small_task_2.s(),
                )
                task_list.append(t)

        gr = group(*task_list)
        r = gr.apply_async()
I would like to get statistics about the number of enqueued and failed tasks (and details about the exceptions) for each small_task, as soon as all of them are finished (whatever their status is), in order to send a summary email to the project admins.
I first thought of using a chord, but its callback is not executed if any of the header tasks fails, which will surely happen in my case.
I could also use r.get() in the BigTask, which is very convenient, but it's not recommended to wait for a task result inside another task (even though here, I guess, the risk of a worker deadlock is low since the task will be executed only once a month).
Important note: input file contains ~700k rows.
How would you recommend to proceed?
I'm not sure if it can help you to monitor, but regarding the chord and the callback issue, you could use a link_error callback (for catching exceptions). In your case, for example, you can use it like:
small_task_1.s(row).set(link_error=error_task.s())
and implement a celery error_task that sends you a notification or whatever.
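For reference, a link_error handler is called with the id of the failed task, so a rough error_task (modelled on the error-handler example in the Celery docs; app is assumed to be your Celery instance and send_failure_notification is a hypothetical helper) could look like:

@app.task
def error_task(uuid):
    result = app.AsyncResult(uuid)
    exc = result.get(propagate=False)   # the exception raised by the failed task
    # notify whoever needs to know, however you like
    send_failure_notification('Task %s failed: %r\n%s' % (uuid, exc, result.traceback))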
In Celery 4, you can set it once for the whole canvas (but it didn't work for me in 3.1):
r = gr.apply_async(link_error=error_task)
For the monitoring part, you can of course use Flower.
Hope that helps.
EDIT: An alternative (without using additional persistence) would be to catch the exception and add some logic to the result and the callback. For example:
def small_task_1():
    try:
        # do stuff
        return 'success', result
    except Exception:
        return 'fail', result
and then in your callback task iterate over the result tuples and check for fails before doing the actual logic.
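For example, a rough sketch of such a chord callback (the task name and the email helper are made up):

@app.task
def summary_callback(results):
    # 'results' is the list of ('success'/'fail', result) tuples returned
    # by the header tasks above
    succeeded = [r for status, r in results if status == 'success']
    failed = [r for status, r in results if status == 'fail']
    send_summary_email(len(succeeded), len(failed))  # hypothetical helper
    # ... continue the actual logic with 'succeeded' only ...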
I found the best solution to be iterating over the group results after the group is ready.
When you issue a group, you get a ResultSet object. You can .save() this object to fetch it later and check .ready(), or you can call .join() and wait for the results.
When it ends, you can access .results and you have a list of AsyncResult objects. These objects all have a .state property that you can check to see whether the task was successful or not.
However, you can only check the results after the group ends. During the process, you can get the value of .completed_count() to get an idea of the group's progress.
https://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.ResultSet
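A small sketch of that flow, assuming the gr group from the question above (the polling interval and the summary shape are arbitrary):

import time

r = gr.apply_async()
r.save()   # optional: persist the group result so it can be restored later

while not r.ready():
    print('%d/%d tasks done' % (r.completed_count(), len(r.results)))
    time.sleep(30)

summary = {'success': 0, 'failed': 0, 'errors': []}
for res in r.results:               # AsyncResult objects, in header order
    if res.state == 'SUCCESS':
        summary['success'] += 1
    else:
        summary['failed'] += 1
        summary['errors'].append(repr(res.result))  # the exception instance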
The solution we use for a partly similar problem, where Celery's builtin stuff (task states etc.) doesn't really cut it, is to manually store the desired information in Redis and retrieve it when needed.
I have a function that dispatches calls to several Redis shards and stores the result in a pre-allocated Python list.
Basically, the code goes as follow:
def myfunc(calls):
    results = [None] * len(calls)
    for index, (connection, call) in enumerate(calls.iteritems()):
        results[index] = call(connection)
    return results
Obviously, as of now this calls the various Redis shards in sequence. I intend to use a thread pool and to make those calls happen in parallel as they can take quite some time.
My question is: given that the results list is preallocated and that each call has a dedicated slot, do I need a lock to put the results from each thread, or is there any guarantee that it will work without one?
I will obviously profile the result in the end, but I wouldn't want to lock if I don't really need to.
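For what it's worth, a sketch of the thread-pool version described above, using the stdlib's thread-backed multiprocessing.dummy.Pool (the pool size is arbitrary), with each worker writing only to its own pre-allocated slot:

from multiprocessing.dummy import Pool  # thread-based Pool from the stdlib

def myfunc(calls, pool_size=10):
    results = [None] * len(calls)

    def worker(args):
        index, (connection, call) = args
        # every worker writes to a distinct, pre-allocated index,
        # so no two threads ever touch the same slot
        results[index] = call(connection)

    pool = Pool(pool_size)
    pool.map(worker, enumerate(calls.iteritems()))
    pool.close()
    pool.join()
    return results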
I have a concurrent.futures.ThreadPoolExecutor and a list. And with the following code I add futures to the ThreadPoolExecutor:
for id in id_list:
    future = self._thread_pool.submit(self.myfunc, id)
    self._futures.append(future)
And then I wait upon the list:
concurrent.futures.wait(self._futures)
However, self.myfunc does some network I/O and thus there will be some network exceptions. When errors occur, self.myfunc submits a new self.myfunc with the same id to the same thread pool and adds a new future to the same list, just as above:
try:
    do_stuff(id)
except:
    future = self._thread_pool.submit(self.myfunc, id)
    self._futures.append(future)
    return None
Here comes the problem: I got an error on the line of concurrent.futures.wait(self._futures):
File "/usr/lib/python3.4/concurrent/futures/_base.py", line 277, in wait
f._waiters.remove(waiter)
ValueError: list.remove(x): x not in list
How should I properly add new Futures to a list while already waiting upon it?
Looking at the implementation of wait(), it certainly doesn't expect that anything outside concurrent.futures will ever mutate the list passed to it. So I don't think you'll ever get that "to work". It's not just that it doesn't expect the list to mutate, it's also that significant processing is done on list entries, and the implementation has no way to know that you've added more entries.
Untested, I'd suggest trying this instead: skip all that, and just keep a running count of threads still active. A straightforward way is to use a Condition guarding a count.
Initialization:
self._count_cond = threading.Condition()
self._thread_count = 0
When my_func is entered (i.e., when a new thread starts):
with self._count_cond:
    self._thread_count += 1
When my_func is done (i.e., when a thread ends), for whatever reason (exceptional or not):
with self._count_cond:
    self._thread_count -= 1
    self._count_cond.notify()  # wake up the waiting logic
And finally the main waiting logic:
with self._count_cond:
    while self._thread_count:
        self._count_cond.wait()
POSSIBLE RACE
It seems possible that the thread count could reach 0 while work for a new thread has been submitted, but before its my_func invocation starts running (and so before _thread_count is incremented to account for the new thread).
So the:
with self._count_cond:
    self._thread_count += 1
part should really be done instead right before each occurrence of
self._thread_pool.submit(self.myfunc, id)
Or write a new method to encapsulate that pattern; e.g., like so:
def start_new_thread(self, id):
    with self._count_cond:
        self._thread_count += 1
    self._thread_pool.submit(self.myfunc, id)
A DIFFERENT APPROACH
Offhand, I expect this could work too (but, again, haven't tested it): keep all your code the same except change how you're waiting:
while self._futures:
    self._futures.pop().result()
So this simply waits for one thread at a time, until none remain.
Note that .pop() and .append() on lists are atomic in CPython, so no need for your own lock. And because your my_func() code appends before the thread it's running in ends, the list won't become empty before all threads really are done.
AND YET ANOTHER APPROACH
Keep the original waiting code, but rework the rest not to create new threads in case of exception. Like rewrite my_func to return True if it quits due to an exception, return False otherwise, and start threads running a wrapper instead:
def my_func_wrapper(self, id):
    keep_going = True
    while keep_going:
        keep_going = self.my_func(id)
This may be especially attractive if you someday decide to use multiple processes instead of multiple threads (creating new processes can be a lot more expensive on some platforms).
AND A WAY USING cf.wait()
Another way is to change just the waiting code:
while self._futures:
    fs = self._futures[:]
    for f in fs:
        self._futures.remove(f)
    concurrent.futures.wait(fs)
Clear? This makes a copy of the list to pass to .wait(), and the copy is never mutated. New threads show up in the original list, and the whole process is repeated until no new threads show up.
Which of these ways makes most sense seems to me to depend mostly on pragmatics, but there's not enough info about all you're doing for me to make a guess about that.
I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an AngularJS app.
The server side of the code looks like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]

pool.close()  # Close pool
pool.join()   # Wait for all tasks to finish
errors = not all(obj.successful() for obj in result_objs)
# extract results only from successful tasks
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see, I'm using apply_async because I want to inspect each task later and extract its result only if the task didn't raise any exception.
I understood that in order to show a progress bar on the client side, I need to publish the number of completed tasks somewhere, so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I've seen similar questions on SO, like this one, but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put, because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this lets you raise the resolution of your progress report and report more to the user.
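A small sketch of what such messages could look like (the field names, step counts, and the reader function are illustrative; the queue lives in the main process and is shared with the workers):

import collections
from multiprocessing import Queue

# structured progress message instead of a bare string
Progress = collections.namedtuple('Progress', ['query', 'step', 'total_steps'])

progress_queue = Queue()

def do_work(q):
    # ... first part of the work ...
    progress_queue.put_nowait(Progress(query=q, step=1, total_steps=2))
    # ... second part of the work ...
    progress_queue.put_nowait(Progress(query=q, step=2, total_steps=2))

def drain_progress(state):
    # called from the main process; 'state' is whatever view_progress reads.
    # Queue.empty() is approximate, which is fine for a progress bar.
    while not progress_queue.empty():
        msg = progress_queue.get_nowait()
        state[msg.query] = (msg.step, msg.total_steps)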
In general the approach you are taking is okay; I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes = multiprocessing.Value('i', 10)
lock = multiprocessing.Lock()
And then, when you call worker.dowork, pass a lock object and the value to it:
worker.dowork(lock, processes)
In your worker.dowork code, decrease "processes" by one when the code is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards