How to retrieve a queued task with task_name - python

I am building a Python application that works with taskqueue in the following flow:
1. Add a pull task to the task queue by calling taskqueue.add()
2. Lease the same task by calling taskqueue.lease_tasks()
3. After some time, we may want to shorten the lease time by calling taskqueue.modify_task_lease()
The problem is that those 3 steps happen in different web sessions. At step 3, the modify_task_lease() function needs a task instance as an argument, while I only have the task_name in hand, which is passed from step 2 via web hooks.
So is there any way to retrieve a task by its name?
In the documentation I found delete_tasks_by_name(), but there is no modify_task_lease_by_name(), which is exactly what I want to do.

delete_tasks_by_name() is just a wrapper around delete_tasks_by_name_async(), which is implemented as:
if isinstance(task_name, str):
    return self.delete_tasks_async(Task(name=task_name), rpc)
else:
    tasks = [Task(name=name) for name in task_name]
    return self.delete_tasks_async(tasks, rpc)
So I guess you could similarly use the Task() constructor to obtain the task instance needed by modify_task_lease():
modify_task_lease(Task(name=your_task_name), lease_seconds)
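Putting that together, a minimal, untested sketch might look like the following; the queue name 'my-pull-queue' and the 60-second lease are placeholders, and modify_task_lease() is called as a method of the Queue class:
# Sketch only, not tested: 'my-pull-queue' and lease_seconds=60 are placeholders.
from google.appengine.api import taskqueue

def shorten_lease(task_name, lease_seconds=60):
    queue = taskqueue.Queue('my-pull-queue')
    # Rebuild a Task object from the name passed along by the web hook.
    task = taskqueue.Task(name=task_name)
    # If the service rejects this because the rebuilt task lacks the lease
    # metadata (eta) of the leased task, pass that along from step 2 as well.
    queue.modify_task_lease(task, lease_seconds)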

Celery execute tasks in the order they get called (during runtime)

I have a task consisting of subtasks in a chain. How can I ensure a second call of this task does not start before the first one has finished?
@shared_task
def task(user):
    res = chain(subtask_1.s(),  # each subtask takes ~1 hour
                subtask_2.s(),
                subtask_3.s())
    return res.apply_async()
A Django view might then trigger this task:
# user A visits page that triggers task
task.delay(userA)
# 10 seconds later, while task() is still executing, user B visits page
task.delay(userB)
This leads to the tasks racing each other instead of being executed in sequential order. E.g. once a worker has finished with subtask_1() of the first task, it begins working on subtask_1() of the second task, instead of subtask_2() and subtask_3() of the first one.
Is there a way to elegantly avoid this? I guess the problem is the order in which the subtasks get added to the queue.
I have already set worker --concurrency=1, but that still doesn't change the order in which it consumes from the queue.
The official docs (task cookbook) seem to offer a solution, but I don't understand it and it unfortunately doesn't work for me.
Perhaps include a blocking mechanism within the task, after the chain, with a while not res.ready(): sleep(1) kind of hack?
You can wait for the first task to finish and then execute the second one, like this:
res = task.delay(userA)
res.get() # will block until finished
task.delay(userB)
But this blocks the calling thread until the first task has finished. You can chain the tasks to avoid blocking, but for that you have to modify the task signature a little to accept the previous task's result as an argument.
@shared_task
def task(_, user):  # the signature takes one extra argument
    ...  # body skipped
and
from celery.canvas import chain
chain(task.s(None, userA), task.s(userB))()

How can I multi-thread 3 different functions that return the same value and pick the fastest

I am using the threading module and have 3 different functions that return the same value but use different methods to obtain it.
I want an ID, which I will call my_id.
For example:
Function #1: Scrape website using a mobile endpoint and parsing the json for my_id
Function #2: Scrape website using desktop endpoint and parse json for my_id
Function #3: Scrape desktop website HTML and find my_id
What I would like to do is run each function at the same time and whichever one returns my_id the fastest, I take it and continue with my code.
What is the best way to go about this?
You can make use of concurrent.futures.
Create three threads and launch them using concurrent.futures.Executor.submit().
This returns a Future object for each thread.
Then you can call
concurrent.futures.wait(fs, timeout=None, return_when=concurrent.futures.FIRST_COMPLETED)
which will block the main thread until one of the 3 child threads completes.
Then you can go ahead and use your result.
concurrent.futures.wait() returns a named 2-tuple of sets: the first set, named done, contains the futures that completed, and the second, named not_done, contains those that did not.
You can get your result from the completed Future object using its result() method, and you can safely shut down the executor using Executor.shutdown().
You can add your tasks to a list and start them like:
futures = []
for task in task_list:
    futures.append(executor.submit(task.run))
concurrent.futures.wait(futures, timeout=None, return_when=concurrent.futures.FIRST_COMPLETED)
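A self-contained sketch of the whole pattern; the three fetch_* functions below are stand-ins for the mobile-endpoint, desktop-endpoint and HTML scrapers described in the question:
# Sketch only: the fetch_* functions are placeholders for the three scrapers.
import concurrent.futures
import random
import time

def fetch_mobile_json():
    time.sleep(random.random())     # pretend to hit the mobile endpoint
    return 'my_id from mobile JSON'

def fetch_desktop_json():
    time.sleep(random.random())     # pretend to hit the desktop endpoint
    return 'my_id from desktop JSON'

def fetch_desktop_html():
    time.sleep(random.random())     # pretend to scrape the desktop HTML
    return 'my_id from desktop HTML'

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(f) for f in
               (fetch_mobile_json, fetch_desktop_json, fetch_desktop_html)]
    done, not_done = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    my_id = next(iter(done)).result()   # the fastest function's value
    for f in not_done:
        f.cancel()   # best effort; threads that already started keep running
    print(my_id)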

How to monitor a group of tasks in celery?

I have a situation where a periodic monthly big_task reads a file and enqueues one chained task per row in this file, where the chained tasks are small_task_1 and small_task_2:
class BigTask(PeriodicTask):
    run_every = crontab(hour=0, minute=0, day_of_month=1)

    def run(self):
        task_list = []
        with open("the_file.csv") as f:
            for row in f:
                t = chain(
                    small_task_1.s(row),
                    small_task_2.s(),
                )
                task_list.append(t)
        gr = group(*task_list)
        r = gr.apply_async()
I would like to get statistics about the number of enqueued and failed tasks (and details about the exceptions) for each small_task, as soon as all of them are finished (whatever their status is), in order to send a summary email to the project admins.
I first thought of using a chord, but its callback is not executed if any of the header tasks fails, which will surely happen in my case.
I could also use r.get() in the BigTask, which is very convenient, but waiting for a task result inside another task is not recommended (even if, here, I guess the risk of a worker deadlock is low since the task will be executed only once a month).
Important note: the input file contains ~700k rows.
How would you recommend proceeding?
I'm not sure it helps with the monitoring, but regarding the chord and callback issue you could use a link_error callback (for catching exceptions). In your case, for example, you can use it like:
small_task_1.s(row).set(link_error=error_task)
and implement a Celery error_task that sends you a notification or whatever.
In Celery 4 you can set it once for the whole canvas (but it didn't work for me in 3.1):
r = gr.apply_async(link_error=error_task)
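For completeness, a rough sketch of what error_task could look like; the body is illustrative, and depending on the Celery version you may need to pass a signature (error_task.s()) rather than the task itself:
# Sketch only: errback signatures differ a bit between Celery versions.
from celery import shared_task
from celery.result import AsyncResult

@shared_task
def error_task(uuid):
    # link_error handlers receive the id of the task that failed.
    result = AsyncResult(uuid)
    exc = result.get(propagate=False)
    print('Task %s raised: %r' % (uuid, exc))
    # notify the project admins here (email, logging, ...)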
For the monitoring part, you can use flower of course.
Hope that helps.
EDIT: An alternative (without using additional persistence) would be to catch the exception and add some logic to the result and the callback. For example:
def small_task_1():
    try:
        # do stuff
        return 'success', result
    except Exception as e:
        return 'fail', e
and then, in your callback task, iterate over the result tuples and check for failures before doing the actual logic.
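A possible shape for that callback, wired up as a chord so it runs once every row has been processed; the names (do_stuff, summarize_results) and the email step are illustrative only:
# Sketch only: do_stuff and summarize_results are illustrative names;
# small_task_1 swallows exceptions so the chord callback always runs.
from celery import shared_task, chord

@shared_task
def small_task_1(row):
    try:
        result = do_stuff(row)           # placeholder for the real work
        return 'success', result
    except Exception as e:
        return 'fail', str(e)

@shared_task
def summarize_results(results):
    # results is the list of (status, payload) tuples from the header tasks.
    failures = [payload for status, payload in results if status == 'fail']
    print('%d tasks finished, %d failed' % (len(results), len(failures)))
    # ...build and send the summary email to the project admins here...

def launch(rows):
    chord(small_task_1.s(row) for row in rows)(summarize_results.s())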
I found the best solution to be iterating over the group results after the group is ready.
When you issue a group, you get a ResultSet object. You can .save() this object to fetch it later and check whether it is ready, or you can call .join() and wait for the results.
When it ends, you can access .results and you have a list of AsyncResult objects. Each of these has a .state property that you can check to see whether the task was successful or not.
However, you can only check the results after the group ends. During the process, you can read .completed_count() to get an idea of the group's progress.
https://docs.celeryproject.org/en/latest/reference/celery.result.html#celery.result.ResultSet
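A rough sketch of that flow; it assumes a result backend (e.g. Redis) that supports GroupResult.save()/restore(), and gr is the group built in BigTask.run() above:
# Sketch only: requires a result backend that can persist group results.
from celery.result import GroupResult

def launch_and_remember(gr):
    r = gr.apply_async()
    r.save()        # persist the group result under its id
    return r.id     # hand this id to whatever does the monitoring

def report(group_id):
    r = GroupResult.restore(group_id)
    if not r.ready():
        print('%d of %d rows processed' % (r.completed_count(), len(r.results)))
    else:
        failed = [res for res in r.results if res.state == 'FAILURE']
        print('done: %d tasks, %d failed' % (len(r.results), len(failed)))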
The solution we use for a partly similar problem, where Celery's built-in machinery (task states etc.) doesn't really cut it, is to manually store the desired information in Redis and retrieve it when needed.

Query task state - Celery & redis

Okay, so I have a relatively simple problem, I think, but it's like I'm hitting a brick wall with it. I have a Flask app and a webpage that allows you to run a number of scripts on the server side using Celery and Redis (as the broker).
All I want to do is give a task a name/id when I start it (the task will be represented as a button on the client side), i.e.
@app.route('/start_upgrade/<task_name>')
def start_upgrade(task_name):
    example_task.delay(1, 2, task_name=task_name)
Then, after the task has kicked off, I want to check whether the task is running/waiting/finished in a separate request, preferably like:
@app.route('/check_upgrade_status/<task_name>')
def get_task_status(task_name):
    task = celery.get_task_by_name(task_name)
    task_state = task.state
    return task_state  # pseudocode
But I can't find anything like that in the docs. I am very new to Celery, just FYI, so assume I know nothing. Also, just to be extra obvious, I need to be able to query the task state from Python; no CLI commands, please.
Any alternative methods of achieving my goal of querying the queue are also welcome.
I ended up figuring out a solution for my question from arthur's post.
In conjunction with Redis I created these functions:
import redis
from celery.result import AsyncResult

redis_cache = redis.StrictRedis(host='localhost', port=6379, db=0)

def check_task_status(task_name):
    task_id = redis_cache.get(task_name)
    return AsyncResult(task_id).status

def start_task(task, task_name, *args, **kwargs):
    response = task.delay(*args, **kwargs)
    redis_cache.set(task_name, response.id)
This allows me to assign specific names to tasks. Note that I haven't actually tested this yet, but it makes sense to me.
Example usage:
start_task(example_task, "example_name", 1, 2)
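Wired into the Flask views from the question, that might look roughly like this; example_task, app and the route names come from the question, and the helpers are the ones defined above:
# Sketch only: reuses start_task/check_task_status from above and the
# example_task and app objects from the question.
from flask import jsonify

@app.route('/start_upgrade/<task_name>')
def start_upgrade(task_name):
    start_task(example_task, task_name, 1, 2)
    return jsonify({'started': task_name})

@app.route('/check_upgrade_status/<task_name>')
def check_upgrade_status(task_name):
    return jsonify({'state': check_task_status(task_name)})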
When you start a task with delay or apply_async, an AsyncResult object is created, which contains the id of the task. To get it you just have to store it in a variable.
For example
@app.route('/start_upgrade/<task_name>')
def start_upgrade(task_name):
    res = example_task.delay(1, 2, task_name=task_name)
    print(res.id)
You can store this id and maybe associate it with something else in a database (or just print it like I did in the example).
Then you can check the status of your task in a Python console with:
from celery.result import AsyncResult
AsyncResult(your_task_id).status
Take a look at the result documentation; you should find what you need there: http://docs.celeryproject.org/en/latest/reference/celery.result.html

Show a progress bar for my multithreaded process

I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an AngularJS app.
The server side of the code looks like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]

pool.close()  # Close pool
pool.join()   # Wait for all tasks to finish
errors = not all(obj.successful() for obj in result_objs)
# extract results only from successful tasks
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see, I'm using apply_async because I want to inspect each task later and extract its result only if the task didn't raise any exception.
I understood that in order to show a progress bar on the client side, I need to publish the number of completed tasks somewhere, so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I've seen similar questions on SO, like this one, but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information onto it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put, because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this lets you raise the resolution of your progress report and report more to the user.
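A rough sketch of that idea; the Progress namedtuple, do_work and the fixed step counts are made up for illustration:
# Sketch only: Progress, do_work and the step counts are illustrative.
import collections
import multiprocessing

Progress = collections.namedtuple('Progress', 'worker_id step total')

def do_work(worker_id, progress_queue):
    total = 5
    for step in range(1, total + 1):
        # ... the real work for this step goes here ...
        try:
            progress_queue.put_nowait(Progress(worker_id, step, total))
        except Exception:
            pass  # never let progress reporting block or break the work

if __name__ == '__main__':
    q = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=do_work, args=(i, q))
               for i in range(3)]
    for w in workers:
        w.start()
    # The main process drains the queue and updates whatever
    # /api/v1.0/progress reads (a dict, a session value, Redis, ...).
    received = 0
    while received < 3 * 5:
        msg = q.get()
        received += 1
        print('worker %d: step %d/%d' % (msg.worker_id, msg.step, msg.total))
    for w in workers:
        w.join()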
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized into a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep client session data on the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes = multiprocessing.Value('i', 10)
lock = multiprocessing.Lock()
And then, when you call worker.do_work, pass the lock object and the value to it:
worker.do_work(lock, processes)
In your worker.do_work code, decrease "processes" by one when the work is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards
