Process Celery Tasks results in arrival order - python

I am new to Celery. I have a "celery server", like the code below, which returns results after some time depending on the calculation; I have emulated this behaviour in the simple program below with the sleep function. What I want is to process the early-returning results before the "heavy" results. I have written a small client, see the snippet below, which intentionally creates the "heavy load" task with the first call.
Note that the subsequent calls create "lighter" tasks, so the Celery worker returns them earlier. I therefore want to process the returned results in the order they arrive at the client. Right now (see the client code) it waits until the heavy task has returned.
But with the examples from the Celery docs I am apparently supposed to wait for results by checking their id, or to poll for them (which seems pointless, because the Celery client has to figure out the id of the "first" arrived result somehow anyway).
How can I process the results of Celery in the order they arrive at the client? I don't want to poll result.ready() in an endless loop, as that IMHO defeats the purpose of asynchronous processing.
I found no solution in the docs. What I want to do is "get the first arrived result and its id", compare it to my result.id (did I send this task?) and then process it accordingly.
#
# Name this code "tasks.py" and run it with:
# celery worker -A tasks --loglevel=info
#
from celery import Celery
import time
app = Celery('tasks', backend='amqp', broker='amqp://guest:guest@127.0.0.1:5672/%2F')

@app.task()
def add(x, y):
    print("x=%s y=%s" % (x, y))
    time.sleep(x)
    return x + y
Second program, the client: this works like the Celery docs, however Celery has already completed tasks 0, 1 and 2 (and therefore the client should already be working on them).
#!/usr/bin/python3
from tasks import add
results = []
max = 4
for i in range(0, max):
    print(max - (i + 1))
    result = add.delay(max - (i + 1), 0)
    results.append(result)
print("")
for i in range(0, max):
    result = results[i].get(timeout=10)
    print(result)
Result (the last 4 numbers should appear in arrival order, which would be 0, 1, 2, 3):
3
2
1
0
3
2
1
0

You should implement a callback rather than looping through the results in the order that they were sent to the queue:
http://celery.readthedocs.org/en/latest/userguide/calling.html#linking-callbacks-errbacks
In tasks.py:
@app.task()
def process_add(result):
    print(result)
In client.py:
from tasks import add, process_add

results = []
max = 4
for i in range(0, max):
    print(max - (i + 1))
    add.apply_async((max - (i + 1), 0), link=process_add.s())
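Note that a link callback runs as a task on a worker, not in the client process. If the results must be handled in the client in arrival order and occasional lightweight polling is acceptable, a sketch along these lines could work (my own suggestion, not part of the original answer; the dispatch loop and sleep interval are illustrative, and it uses celery.result.ResultSet):
from celery.result import ResultSet
import time
from tasks import add

# sketch: dispatch the tasks, then handle each result as soon as it arrives
pending = ResultSet([add.delay(n, 0) for n in (3, 2, 1, 0)])
while pending.results:
    for r in list(pending.results):
        if r.ready():                # this result has arrived
            print(r.id, r.get())     # process it immediately
            pending.remove(r)
    time.sleep(0.1)                  # short pause instead of busy-waiting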

Related

Celery execute tasks in the order they get called (during runtime)

I have a task consisting of subtasks in a chain. How can I ensure a second call of this task does not start before the first one has finished?
@shared_task
def task(user):
    res = chain(subtask_1.s(),  # each subtask takes ~1 hour
                subtask_2.s(),
                subtask_3.s())
    return res.apply_async()
A Django view might now trigger this task:
# user A visits page that triggers task
task.delay(userA)
# 10 seconds later, while task() is still executing, user B visits page
task.delay(userB)
This leads to the tasks racing each other instead of being executed in sequential order. E.g. once a worker has finished with subtask_1() of the first task, it begins working on subtask_1() of the second task, instead of subtask_2() and subtask_3() of the first one.
Is there a way to elegantly avoid this? I guess the problem is the order in which the subtasks get added to the queue.
I have already set worker --concurrency=1, however that still doesn't change the order in which it consumes from the queue.
The official docs (task cookbook) seem to offer a solution which I don't understand and which unfortunately doesn't work for me.
Perhaps include a blocking mechanism within the task, after the chain, with a while not res.ready(): sleep(1) kind of hack?
You can wait for the first task to finish and then execute the second one, like this.
res = task.delay(userA)
res.get() # will block until finished
task.delay(userB)
But it will block the calling thread until the first one has finished. You can chain tasks to avoid blocking, but for that you have to modify the task signature slightly to accept a task result as an argument.
@shared_task
def task(_, user):  # the signature takes one extra argument
    # skipped
and
from celery.canvas import chain
chain(task.s(None, userA), task.s(userB))()
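An alternative that avoids touching the task signature at all (my own suggestion, not from the original answer) is to chain immutable signatures, which do not receive the previous task's return value:
from celery import chain

# sketch: .si() creates an immutable signature, so task() keeps its single argument
chain(task.si(userA), task.si(userB)).apply_async()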

Python celery - how to wait for all subtasks in chord

I am unit testing celery tasks.
I have chained tasks that also contain groups, so the result is a chord.
The test should look like:
run the celery task (delay)
wait for the task and all subtasks
assert
I tried the following:
from celery.result import GroupResult

def wait_for_result(result):
    result.get()
    for child in result.children or list():
        if isinstance(child, GroupResult):
            # tried looping over the task results in the group
            # until the tasks are ready, but without success
            pass
        wait_for_result(child)
This creates a deadlock, with chord_unlock being retried forever.
I am not interested in task results.
How can I wait for all the subtasks to finish?
Although this is an old question, I just wanted to share how I got rid of the deadlock issue, just in case it helps somebody.
As the Celery logs say, never use get() inside a task. This will indeed create a deadlock.
I have a similar set of celery tasks, which includes a chain of group tasks, hence making it a chord. I'm calling these tasks using Tornado, by making an HTTP request. So what I did was something like this:
@task
def someFunction():
    ...

@task
def someTask():
    ...

@task
def celeryTask():
    groupTask = group([someFunction.s(i) for i in range(10)])
    job = (groupTask | someTask.s()).apply_async()
    return job
When celeryTask() is called by Tornado, the chain will start executing, and the UUID of someTask() will be held in job. It will look something like
AsyncResult: 765b29a8-7873-4b28-b05c-7e19c33e950c
This UUID is returned and celeryTask() exits before the chain has even started executing (ideally), hence leaving room for another process to run.
I then used the tornado layer to check the status of the task. Details on the tornado layer can be found in this stackoverflow question.
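For reference, checking a task's state by its UUID outside of any worker can be done with AsyncResult; a minimal sketch (the helper name and the app import path are my own illustrations, not from the original answer):
from celery.result import AsyncResult
from myapp.celery import app  # assumption: your Celery app instance

def task_status(task_id):
    # Look up the task by UUID and report its current state
    res = AsyncResult(task_id, app=app)
    return {'id': task_id, 'state': res.state}  # e.g. PENDING, STARTED, SUCCESS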
Have you tried chord + callback ?
http://docs.celeryproject.org/en/latest/userguide/canvas.html#chords
>>> callback = tsum.s()
>>> header = [add.s(i, i) for i in range(100)]
>>> result = chord(header)(callback)
>>> result.get()
9900

Celery task PENDING

I basically want to be able to create a chord consisting of a group of chains. Beyond the fact that I can't seem to make that work is the fact that all the sub-chains have to complete before the chord callback is fired.
So my thought was to create a while loop like:
data = [foo.delay(i) for i in bar]
complete = {}
L = len(data)
cnt = 0
while cnt != L:
    for i in data:
        ID = i.task_id
        try:
            complete[ID]
        except KeyError:
            if i.status == 'SUCCESS':
                complete[ID] = run_hourly.delay(i.result)
                cnt += 1
    if cnt >= L:
        return complete.values()
So that when a task was ready it could be acted on without having to wait on other tasks to be complete.
The problem I'm having is that the status of some tasks never get past the 'PENDING' state.
All tasks will reach the 'SUCCESS' state if I add a time.sleep(x) line to the for loop, but with a large number of subtasks in data that solution becomes grossly inefficient.
I'm using memcached as my results backend and rabbitmq as the broker. My guess is that the speed of the for loop that iterates over data and calls attributes of its tasks creates a race condition that breaks the connection to celery's messaging, which leaves these zombie tasks stuck in the 'PENDING' state. But then again I could be completely wrong, and it certainly wouldn't be the first time.
My questions
Why is time.sleep(foo) needed to avoid a perpetually PENDING task when iterating over a list of just launched tasks?
When a celery task is performing a loop, is it blocking? When I try to shut down the worker that gets stuck in an infinite loop, I am unable to do so and have to manually find the python process running the worker and kill it. If I leave the worker running, the python process will eventually start to consume several gigs of memory, growing seemingly without bound.
Any insight on this matter would be appreciated. I'm also open to suggestions on ways to avoid the while loop entirely.
I appreciate your time. Thank you.
My chord consisting of a group of chains was being constructed and executed from within a celery task, which will create problems if you need to access the results of those tasks. Below is a summary of what I was trying to do, what I ended up doing, and what I think I learned in the process, so maybe it can help someone else.
--common_tasks.py--
from cel_test.celery import app
@app.task(ignore_result=False)
def qry(sql, writing=False):
    import psycopg2
    conn = psycopg2.connect(dbname='space_test', user='foo', host='localhost')
    cur = conn.cursor()
    cur.execute(sql)
    if writing == True:
        cur.close()
        conn.commit()
        conn.close()
        return
    a = cur.fetchall()
    cur.close()
    conn.close()
    return a
--foo_tasks.py --
from __future__ import absolute_import
from celery import chain, chord, group
import celery
from cel_test.celery import app
from weather.parse_weather import parse_weather
from cel_test.common_tasks import qry
import cPickle as PKL
import csv
import requests
@app.task(ignore_result=False)
def write_the_csv(data, file_name, sql):
    with open(file_name, 'wb') as fp:
        a = csv.writer(fp, delimiter=',')
        for i in data:
            a.writerows(i)
    qry.delay(sql, True)
    return True

@app.task(ignore_result=False)
def idx_point_qry(DIR='/bar/foo/'):
    with open(''.join([DIR, 'thing.pkl']), 'rb') as f:
        a = PKL.load(f)
    return [(i, a[i]) for i in a]

@app.task(ignore_result=False)
def load_page(url):
    page = requests.get(url, **{'timeout': 30})
    if page.status_code == 200:
        return page.json()

@app.task(ignore_result=False)
def run_hourly(page, info):
    return parse_weather(page, info).hourly()

@app.task(ignore_result=False)
def pool_loop(info):
    data = []
    for iz in info:
        a = chain(load_page.s(url, iz), run_hourly.s())()
        data.append(a)
    return data

@app.task(ignore_result=False)
def run_update(file_name, sql, writing=True):
    chain(idx_point_qry.s(), pool_loop.s(file_name, sql), write_the_csv.s(writing, sql))()
    return True
--- separate.py file ---
from foo_tasks import *
def update_hourly_weather():
    fn = '/weather/csv_data/hourly_weather.csv'
    run_update.delay(fn, "SELECT * from hourly_weather();")
    return True
update_hourly_weather()
I tried 30 or so combinations of .py files listed above along with several combinations of the code existing in them. I tried chords, groups, chains, launching tasks from different tasks, combining tasks.
A few combinations did end up working, but I had to call .get() directly on data in the write_the_csv task, and celery was throwing a warning that in 4.0 calling get() within a task would raise an error, so I thought I ought not to be doing that.
In essence my problem was (and still is) poor design of the tasks and the flow between them, which was leading me to the problem of having to synchronize tasks from within other tasks.
The while loop I proposed in my question was an attempt to asynchronously launch tasks when another task's status had become COMPLETE, and then launch another task when that one had completed, and so on, as opposed to synchronously letting celery do so through a call to chord or chain. What I seemed to find (and I'm not sure I'm right about this) is that from within a celery task you don't have access to the scope needed to discern such things. The following is a quote from the state section of the celery documentation on tasks: “asserting the world is the responsibility of the task”.
I was launching tasks that the task had no idea existed and as such it was unable to know anything about them.
My solution was then to load and iterate through the .pkl file and launch a group of chains with a callback, synchronously. I basically replaced the pool_loop task with the code below and launched the tasks synchronously instead of asynchronously.
--the_end.py--
from foo_tasks import *
from common_tasks import *
def update_hourly_weather():
    fn = '/weather/csv_data/hourly_weather.csv'
    dd = []
    idx = idx_point_qry()
    for i in idx:
        dd.append(chain(load_page.s(i), run_hourly.s(i)))
    vv = chord(dd)(write_the_csv.s(fn, "SELECT * from hourly_weather();"))
    return
update_hourly_weather()

Show a progress bar for my multithreaded process

I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an angularjs app.
The server side of the code look like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]

pool.close()  # Close pool
pool.join()   # Wait for all tasks to finish

errors = not all(obj.successful() for obj in result_objs)
# extract results only from successful tasks
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see I'm using apply_async because I want to inspect each task later and extract the result from it only if the task didn't raise any exception.
I understood that in order to show a progress bar on client side, I need to publish somewhere the number of completed tasks so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I have seen similar questions on SO like this one, but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this enables you to raise the resolution of your progress report and report more to the user.
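A minimal sketch of that idea (names and message fields are my own, not from the original answer): each worker pushes a namedtuple onto a shared Queue with put_nowait, and the parent process drains the queue to update whatever view_progress reads.
from collections import namedtuple
from multiprocessing import Process, Queue
import queue  # only needed for the Empty/Full exceptions

Progress = namedtuple('Progress', ['worker_id', 'step', 'total_steps'])

def do_work(worker_id, progress_q):
    total = 3
    for step in range(1, total + 1):
        # ... do one unit of real work here ...
        try:
            progress_q.put_nowait(Progress(worker_id, step, total))
        except queue.Full:
            pass  # reporting is best-effort; never block the worker

if __name__ == '__main__':
    q = Queue()
    workers = [Process(target=do_work, args=(i, q)) for i in range(4)]
    for w in workers:
        w.start()
    done = 0
    while any(w.is_alive() for w in workers) or not q.empty():
        try:
            msg = q.get(timeout=0.1)
            done += 1  # this is the counter a progress view could expose
        except queue.Empty:
            pass
    for w in workers:
        w.join()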
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep client session data in the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes = multiprocessing.Value('i', 10)
lock = multiprocessing.Lock()
And then, when you call worker.do_work, pass the lock object and the value to it:
worker.do_work(lock, processes)
In your worker.do_work code, decrease "processes" by one when the work is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now, "processes.value" should be accessible from your main code and equal to the number of remaining processes. Make sure you acquire the lock before accessing processes.value, and release the lock afterwards.

django celery: how to set task to run at specific interval programmatically

I found that I can set a task to run at a specific interval or at specific times from here, but that was only done at task declaration. How do I set a task to run periodically, dynamically?
The schedule is derived from a setting, and thus seems to be immutable at runtime.
You can probably accomplish what you're looking for using Task ETAs. This guarantees that your task won't run before the desired time, but doesn't promise to run the task at the designated time—if the workers are overloaded at the designated ETA, the task may run later.
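For reference, an ETA (or the relative countdown) is simply passed to apply_async; a minimal hedged example:
from datetime import datetime, timedelta

# run mytask roughly five minutes from now (no earlier, possibly later)
mytask.apply_async(eta=datetime.utcnow() + timedelta(minutes=5))
# equivalently, countdown is a relative offset in seconds
mytask.apply_async(countdown=300)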
If that restriction isn't an issue, you could write a task which would first run itself like:
@task
def mytask():
    keep_running = ...  # Boolean, should the task keep running?
    if keep_running:
        run_again = ...  # calculate when to run again
        mytask.apply_async(eta=run_again)
    # ... do the stuff you came here to do ...
The major downside of this approach is that you are relying on the taskstore to remember the tasks in flight. If one of them fails before firing off the next one, then the task will never run again. If your broker isn't persisted to disk and it dies (taking all in-flight tasks with it), then none of those tasks will run again.
You could solve these issues with some kind of transaction logging and a periodic "nanny" task whose job it is to find such repeating tasks that died an untimely death and revive them.
If I had to implement what you've described, I think this is how I would approach it.
celery.task.base.PeriodicTask defines is_due which determines when the next run should be. You could override this function to contain your custom dynamic running logic. See the docs here: http://docs.celeryproject.org/en/latest/reference/celery.task.base.html?highlight=is_due#celery.task.base.PeriodicTask.is_due
An example:
import random
from celery.task import PeriodicTask

class MyTask(PeriodicTask):
    run_every = 60  # still required by PeriodicTask, even though is_due decides the timing

    def run(self, **kwargs):
        logger = self.get_logger(**kwargs)
        logger.info("Running my task")

    def is_due(self, last_run_at):
        # Add your logic for when to run. Mine is random.
        if random.random() < 0.5:
            # Run now and check again in 60 seconds
            return (True, 60)
        else:
            # Don't run now, but check again in 10 seconds
            return (False, 10)
see here http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
I don't think you can do it dynamically ... the best way is to create a task from within a task :D
For example, if you want to run something X seconds later, you create a new task with an X second delay, and inside that task you create another task with an N*X second delay, and so on ...
This should help you some... http://celery.readthedocs.org/en/latest/faq.html#can-i-change-the-interval-of-a-periodic-task-at-runtime
Once you've defined a custom schedule, assign it to your task as asksol has suggested above.
CELERYBEAT_SCHEDULE = {
    "my_name": {
        "task": "myapp.tasks.task",
        "schedule": myschedule(),
    }
}
You might also want to modify CELERYBEAT_MAX_LOOP_INTERVAL if you want your schedule to update more often than every five minutes.
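For example (the value is illustrative), in your celery configuration:
# check the beat schedule at most every 30 seconds instead of the
# default scheduler's maximum of 5 minutes (value is in seconds)
CELERYBEAT_MAX_LOOP_INTERVAL = 30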
