Query task state - Celery & redis - python

Okay, so I have a relatively simple problem, I think, but I'm hitting a brick wall with it. I have a Flask app and a webpage that lets you run a number of scripts on the server side using Celery with Redis as the broker.
All I want to do is give a task a name/id when I start it (each task will be represented as a button on the client side), i.e.:
@app.route('/start_upgrade/<task_name>')
def start_upgrade(task_name):
    example_task.delay(1, 2, task_name=task_name)
Then, after the task has kicked off, I want to check whether the task is running/waiting/finished in a separate request, preferably like:
@app.route('/check_upgrade_status/<task_name>')
def get_task_status(task_name):
    task = celery.get_task_by_name(task_name)  # pseudocode
    task_state = task.state
    return task_state
But I can't find anything like that in the docs. I am very new to Celery, just FYI, so assume I know nothing. Also, just to be extra obvious, I need to be able to query the task state from Python; no CLI commands, please.
Any alternative methods of achieving my goal of querying the queue are also welcome.

I ended up figuring out a solution to my question from Arthur's post.
In conjunction with Redis I created these functions:
import redis
from celery.result import AsyncResult

redis_cache = redis.StrictRedis(host='localhost', port=6379, db=0)

def check_task_status(task_name):
    # look up the Celery task id previously stored under this name
    task_id = redis_cache.get(task_name)
    return AsyncResult(task_id).status

def start_task(task, task_name, *args, **kwargs):
    response = task.delay(*args, **kwargs)
    # map the human-readable name to the Celery-generated task id
    redis_cache.set(task_name, response.id)
This allows me to assign specific names to tasks. Note that I haven't actually tested this yet, but it makes sense to me.
Example usage:
start_task(example_task, "example_name", 1, 2)
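To tie this back to the original routes, here is a hedged, untested sketch of how these helpers could be wired into Flask; the route names and example_task mirror the question and are purely illustrative:
@app.route('/start_upgrade/<task_name>')
def start_upgrade(task_name):
    # kick off the task and record its id under the chosen name
    start_task(example_task, task_name, 1, 2)
    return "started"

@app.route('/check_upgrade_status/<task_name>')
def get_task_status(task_name):
    # returns PENDING / STARTED / SUCCESS / FAILURE, etc.
    return check_task_status(task_name)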

When you start a task with delay or apply_async, an AsyncResult object is created which contains the id of the task. To keep it, you just have to store it in a variable.
For example:
@app.route('/start_upgrade/<task_name>')
def start_upgrade(task_name):
    res = example_task.delay(1, 2, task_name=task_name)
    print(res.id)
You can store this id and maybe associate it with something else in a database (or just print it like I did in the example).
Then you can check the status of your task in a Python console with:
from celery.result import AsyncResult
AsyncResult(your_task_id).status
Take a look at the result documentation; you should find what you need there: http://docs.celeryproject.org/en/latest/reference/celery.result.html
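If you call AsyncResult outside of the worker code (for example inside a Flask request), you may need to bind it to your Celery app explicitly so it knows which result backend to query. A hedged one-liner, assuming your Celery instance is named celery:
from celery.result import AsyncResult

# 'celery' is assumed to be your configured Celery app instance
status = AsyncResult(your_task_id, app=celery).status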

Related

How to retrieve a queued task with task_name

I am building a Python application that works with taskqueue in the following flow:
1. Add a pull task to the taskqueue by calling taskqueue.add()
2. Lease the same task by calling taskqueue.lease_tasks()
3. After some time, we may want to shorten the lease time by calling taskqueue.modify_task_lease()
The problem is, those 3 steps happen in different web sessions. At step 3, the modify_task_lease() function needs a task instance as an argument, while I only have the task_name in hand, which is passed from step 2 via web hooks.
So is there any way to retrieve a task with its name?
In the documentation, I found delete_tasks_by_name(), but there is no modify_task_lease_by_name(), which is exactly what I want.
delete_tasks_by_name() is just a wrapper around delete_tasks_by_name_async(), which is implemented as:
if isinstance(task_name, str):
    return self.delete_tasks_async(Task(name=task_name), rpc)
else:
    tasks = [Task(name=name) for name in task_name]
    return self.delete_tasks_async(tasks, rpc)
So I guess you could similarly use the Task() constructor to obtain the task instance needed by modify_task_lease():
modify_task_lease(Task(name=your_task_name), lease_seconds)
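Putting that together, here is a hedged, untested sketch of what step 3 might look like, assuming the standard App Engine taskqueue API and assuming modify_task_lease() accepts a Task rebuilt from just its name (mirroring the guess above); 'pull-queue' is a placeholder queue name:
from google.appengine.api import taskqueue

def shorten_lease(task_name, lease_seconds):
    q = taskqueue.Queue('pull-queue')            # placeholder pull queue name
    # rebuild a Task handle from its name, mirroring delete_tasks_by_name_async()
    task = taskqueue.Task(name=task_name)
    q.modify_task_lease(task, lease_seconds)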

How to monitor a group of tasks in celery?

I have a situation where a periodic monthly big_task reads a file and enqueues one chained task per row in this file, where the chained tasks are small_task_1 and small_task_2:
class BigTask(PeriodicTask):
    run_every = crontab(hour=0, minute=0, day_of_month=1)

    def run(self):
        task_list = []
        with open("the_file.csv") as f:
            for row in f:
                t = chain(
                    small_task_1.s(row),
                    small_task_2.s(),
                )
                task_list.append(t)
        gr = group(*task_list)
        r = gr.apply_async()
I would like to get statistics about the number of enqueued and failed tasks (and details about the exceptions) for each small_task, as soon as all of them are finished (whatever their status), in order to send a summary email to the project admins.
I first thought of using a chord, but its callback is not executed if any of the header tasks fails, which will surely happen in my case.
I could also use r.get() in the BigTask, which would be very convenient, but it is not recommended to wait for a task result inside another task (even if, here, I guess the risk of worker deadlock is low since the task is executed only once a month).
Important note: input file contains ~700k rows.
How would you recommend proceeding?
I'm not sure this helps with monitoring, but regarding the chord and callback issue, you could use a link_error callback (to catch exceptions). In your case, for example, you can use it like:
small_task_1.s(row).set(link_error=error_task.s())
and implement a Celery error_task that sends you a notification or whatever (a sketch of such a handler follows below).
In Celery 4, you can set it once for the whole canvas (but it didn't work for me in 3.1):
r = gr.apply_async(link_error=error_task)
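One possible shape for that error handler, loosely following the errback example in the Celery docs (the name error_task and the notify_admins helper are illustrative, not from the original answer):
from celery.result import AsyncResult

@celery.task
def error_task(task_id):
    # link_error callbacks receive the id of the failed task
    result = AsyncResult(task_id)
    exc = result.get(propagate=False)   # retrieve the exception without re-raising it
    notify_admins('Task %s raised %r\n%s' % (task_id, exc, result.traceback))  # hypothetical helper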
For the monitoring part, you can use Flower, of course.
Hope that helps.
EDIT: An alternative (without using additional persistence) would be to catch the exception and add some logic to the result and the callback. For example:
def small_task_1(row):
    try:
        result = do_stuff(row)   # 'do_stuff' stands in for the real work
        return 'success', result
    except Exception as exc:
        return 'fail', exc
and then, in your callback task, iterate over the result tuples and check for failures before doing the actual logic.
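A hedged sketch of such a callback (summary_callback and send_summary_email are illustrative names, not part of the original answer):
@celery.task
def summary_callback(results):
    # 'results' is the list of ('success'|'fail', payload) tuples returned by the small tasks
    failures = [payload for status, payload in results if status == 'fail']
    send_summary_email(                      # hypothetical mailer
        total=len(results),
        failed=len(failures),
        details=[repr(f) for f in failures],
    )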
I found the best solution to be iterating over the group results after the group is ready.
When you issue a group, you get a ResultSet object. You can .save() this object to retrieve it later and check whether it is .ready(), or you can call .join() and wait for the results.
When it ends, you can access .results and you have a list of AsyncResult objects. These objects all have a .state property that you can access to check whether the task was successful or not.
However, you can only check the results after the group ends. While it is running, you can get the value of .completed_count() to get an idea of the group's progress.
https://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.ResultSet
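A minimal sketch of that approach, assuming a result backend is configured and reusing gr from the question; save/restore lets another process pick the group result up later:
from celery.result import GroupResult

r = gr.apply_async()
r.save()                       # persist the set of task ids in the result backend
group_id = r.id

# later, e.g. from a monitoring view or another task:
restored = GroupResult.restore(group_id)
print('%d of %d done' % (restored.completed_count(), len(restored.results)))
if restored.ready():
    states = [res.state for res in restored.results]   # SUCCESS, FAILURE, ...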
The solution we use for a partly similar problem, where Celery's built-in features (task states etc.) don't really cut it, is to manually store the desired information in Redis and retrieve it when needed.

Implement delayed Slack slash response

I want to implement a Slack slash command that has to process a function pipeline which takes roughly 30 seconds to run. Since Slack slash commands only allow 3 seconds to respond, how do I go about implementing this? I referred to this but don't know how to implement it.
Please bear with me; I am doing this for the first time.
This is what I have tried. I know how to respond with an ok status within 3 seconds, but I don't understand how to call the pipeline again afterwards:
import requests
import json
from bottle import route, run, request
from S3_download import s3_download
from index import main_func

@route('/action')
def action():
    pipeline()
    return "ok"

def pipeline():
    s3_download()
    p = main_func()
    print(p)

if __name__ == "__main__":
    run(host='0.0.0.0', port=8082, debug=True)
I came across this article. Is using AWS Lambda the only solution?
Can't we do this completely in Python?
Something like this:
from boto import sqs

@route('/action', method='POST')
def action():
    # retrieve the required request parameters, e.g. Slack's response_url
    params = request.forms.get('response_url')
    sqs_queue = get_sqs_connection(queue_name)   # helper that returns a connected queue
    message_object = sqs.message.Message()
    message_object.set_body(params)
    sqs_queue.write(message_object)
    return "request under process"
and you can have another process which consumes the queue and calls the long-running function:
sqs_queue = get_sqs_connection(queue_name)
for sqs_msg in sqs_queue.get_messages(10, wait_time_seconds=5):
    processed_msg = json.loads(sqs_msg.get_body())
    response = pipeline(processed_msg)
    if response:
        sqs_queue.delete_message(sqs_msg)
You can run this second process in a separate standalone Python file, as a daemon process, or via cron.
I've used an Amazon SQS queue here, but there are other options available.
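One piece the snippets above leave out is the delayed reply itself: once the pipeline finishes, the worker has to POST the result back to the response_url that Slack sent with the original slash command (and which was stored in the queue message). A hedged sketch of that final step, using requests:
import requests

def send_delayed_reply(response_url, text):
    # Slack accepts a handful of delayed responses via response_url within 30 minutes
    payload = {"response_type": "in_channel", "text": text}
    requests.post(response_url, json=payload, timeout=10)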
You have an option or two for doing this in a single process, but it's fraught with peril. If you spin up a new Thread to handle the long process, you might end up deploying or crashing in the middle and losing it.
If durability is important to you, look into background-task workers like SQS, Lambda, or even a Celery task queue backed with Redis. A separate task has some interesting failure modes, and these tools will help you deal with them better than just spawning a thread.

Show a progress bar for my multithreaded process

I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an AngularJS app.
The server-side code looks like this (using multiprocessing.dummy):
from multiprocessing.dummy import Pool

worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]

pool.close()  # close the pool
pool.join()   # wait for all tasks to finish

errors = not all(obj.successful() for obj in result_objs)
# extract results only from successful tasks
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see, I'm using apply_async because I want to later inspect each task and extract the result only if the task didn't raise an exception.
I understood that in order to show a progress bar on the client side, I need to publish somewhere the number of completed tasks, so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I've seen similar questions on SO, like this one, but I'm not able to adapt the answers to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this lets you raise the resolution of your progress report and report more to the user.
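A hedged sketch of that idea using threads (which is what multiprocessing.dummy provides) and a plain queue; do_work, Progress and drain_progress are illustrative names:
import collections
import queue

Progress = collections.namedtuple('Progress', ['query', 'status'])
progress_queue = queue.Queue()
completed = []

def do_work(q):
    # ... perform the HTTP request for query q ...
    progress_queue.put_nowait(Progress(query=q, status='done'))

def drain_progress():
    # called from the /progress view to count finished tasks so far
    while True:
        try:
            completed.append(progress_queue.get_nowait())
        except queue.Empty:
            break
    return len(completed)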
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized into a session, so that option is out. If your application supports a single set of async tasks at a time, then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep client session data on the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes = multiprocessing.Value('i', 10)
lock = multiprocessing.Lock()
And then, when you call worker.do_work, pass the lock object and the value to it:
worker.do_work(lock, processes)
In your worker.do_work code, decrease "processes" by one when the work is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards

Scheduling thousands of one-off (non-recurring) tasks for near-simultaneous execution via Django-celery

Some context: I'm building a Django app that allows a user to pre-save an action and schedule the exact date/time in the future they want said action to execute. E.g., scheduling a post to be programmatically pushed to one's Facebook wall next week at 5:30am.
I'm looking for a task scheduling system that could handle a thousand instances of a one-off task, all set to execute near-simultaneously (error margin plus or minus a minute).
I'm considering Django-celery/RabbitMQ for this, but I noticed the Celery docs do not address tasks meant for one-time use. Is Django-celery the right choice here (perhaps by subclassing CrontabSchedule), or is my energy better spent researching some other approach? Perhaps hacking together something with the sched module and cron.
Edit 2:
For some reason, my head was originally stuck in the realm of recurring tasks. Here is a simpler solution.
All you really need is to define one task for each user action. You can skip storing tasks to be executed in your database--that's what celery is here for!
Reusing your Facebook post example, and again assuming you have a function post_to_facebook somewhere which takes a user and some text, does some magic, and posts the text to that user's Facebook, you can just define it as a task like this:
# Task to send one update.
@celery.task(ignore_result=True)
def post_to_facebook(user, text):
    # perform magic
    return whatever_you_want
When a user is ready to enqueue such a post, you just tell celery when to run the task:
post_to_facebook.apply_async(
    (user, text),  # args
    eta=datetime.datetime(2012, 9, 15, 11, 45, 4, 126440)  # pass execution options as kwargs
)
This is all detailed here, among a whole bunch of available call options: http://docs.celeryproject.org/en/latest/userguide/calling.html#eta-and-countdown
If you need the result of the call, you can skip the ignore_result param in the task definition and get an AsyncResult object back, and then check it for the results of the call. More here: http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#keeping-results
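If you do keep the result, a hedged sketch of holding on to that handle might look like this (when is a placeholder datetime; where you persist the id is up to you):
from celery.result import AsyncResult

res = post_to_facebook.apply_async((user, text), eta=when)
task_id = res.id                    # persist this, e.g. in your database

# later, from any process with access to the result backend:
print(AsyncResult(task_id).state)   # PENDING, SUCCESS, FAILURE, ...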
Some of the answer below is still relevant. You still want a task for each user action, you still want to think about task design, etc., but this is a much simpler road to doing what you asked about.
Original answer using recurring tasks follows:
Dannyroa has the right idea. I'll build upon that a bit here.
Edit / TLDR:
The answer is Yes, celery is suited to your needs. You just may need to rethink your task definition.
I assume you aren't allowing your users to write arbitrary Python code to define their tasks. Short of that, you will have to predefine some actions users can schedule, and then allow them to schedule those actions as they like. Then, you can just run one scheduled task for each user action, checking for entries and performing the action for each entry.
One user action:
Using your Facebook example, you would store users' updates in a table:
class ScheduledPost(Model):
    user = ForeignKey('auth.User')
    text = TextField()
    time = DateTimeField()
    sent = BooleanField(default=False)
Then you would run a task every minute, checking for entries in that table scheduled to be posted in the last minute (based on the error margin you mentioned). If it is very important that you hit your one minute window, you might schedule the task more often, say, every 30 seconds. The task might look like this (in myapp/tasks.py):
@celery.task
def post_scheduled_updates():
    from celery import current_task
    scheduled_posts = ScheduledPost.objects.filter(
        sent=False,
        time__gt=current_task.last_run_at,  # with the 'sent' flag, you may or may not want this
        time__lte=timezone.now()
    )
    for post in scheduled_posts:
        if post_to_facebook(post.text):
            post.sent = True
            post.save()
The config might look like this:
CELERYBEAT_SCHEDULE = {
    'fb-every-30-seconds': {
        'task': 'tasks.post_scheduled_updates',
        'schedule': timedelta(seconds=30),
    },
}
Additional user actions:
For each user action in addition to posting to Facebook, you can define a new table and a new task:
class EmailToMom(Model):
    user = ForeignKey('auth.User')
    text = TextField()
    subject = CharField(max_length=255)
    sent = BooleanField(default=False)
    time = DateTimeField()

@celery.task
def send_emails_to_mom():
    scheduled_emails = EmailToMom.objects.filter(
        sent=False,
        time__lt=timezone.now()
    )
    for email in scheduled_emails:
        sent = send_mail(
            email.subject,
            email.text,
            email.user.email,
            [email.user.mom.email],
        )
        if sent:
            email.sent = True
            email.save()
CELERYBEAT_SCHEDULE = {
    'fb-every-30-seconds': {
        'task': 'tasks.post_scheduled_updates',
        'schedule': timedelta(seconds=30),
    },
    'mom-every-30-seconds': {
        'task': 'tasks.send_emails_to_mom',
        'schedule': timedelta(seconds=30),
    },
}
Speed and optimization:
To get more throughput, instead of iterating over the updates to post and sending them serially during a post_scheduled_updates call, you could spawn a bunch of subtasks and run them in parallel (given enough workers). Then the call to post_scheduled_updates runs very quickly and schedules a whole bunch of tasks--one for each Facebook update--to run as soon as possible. That would look something like this:
# Task to send one update. This will be called by post_scheduled_updates.
@celery.task
def post_one_update(update_id):
    try:
        update = ScheduledPost.objects.get(id=update_id)
    except ScheduledPost.DoesNotExist:
        raise
    else:
        sent = post_to_facebook(update.text)
        if sent:
            update.sent = True
            update.save()
        return sent

@celery.task
def post_scheduled_updates():
    from celery import current_task
    scheduled_posts = ScheduledPost.objects.filter(
        sent=False,
        time__gt=current_task.last_run_at,  # with the 'sent' flag, you may or may not want this
        time__lte=timezone.now()
    )
    for post in scheduled_posts:
        post_one_update.delay(post.id)
The code I have posted is not tested and certainly not optimized, but it should get you on the right track. In your question you implied some concern about throughput, so you'll want to look closely at places to optimize. One obvious one is bulk updates instead of iteratively calling post.sent=True;post.save().
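For example, a hedged sketch of that bulk update using Django's queryset update(), collecting the ids that were actually posted and flipping their sent flag in a single query:
# assumes the same scheduled_posts queryset and post_to_facebook helper as above
sent_ids = [
    post.id for post in scheduled_posts
    if post_to_facebook(post.text)
]
ScheduledPost.objects.filter(id__in=sent_ids).update(sent=True)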
More info:
More info on periodic tasks: http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html.
A section on task design strategies: http://docs.celeryproject.org/en/latest/userguide/tasks.html#performance-and-strategies
There is a whole page about optimizing celery here: http://docs.celeryproject.org/en/latest/userguide/optimizing.html.
This page about subtasks may also be interesting: http://docs.celeryproject.org/en/latest/userguide/canvas.html.
In fact, I recommend reading all the celery docs.
What I'd do is create a model called ScheduledPost.
I'd have a PeriodicTask that runs every 5 minutes or so.
The task would check the ScheduledPost table for any post that needs to be pushed to Facebook.
