I found that I can set a task to run at a specific interval or at specific times from here, but that is only done at task declaration. How do I set a task to run periodically, dynamically?
The schedule is derived from a setting, and thus seems to be immutable at runtime.
You can probably accomplish what you're looking for using Task ETAs. This guarantees that your task won't run before the desired time, but doesn't promise to run the task at the designated time: if the workers are overloaded at the designated ETA, the task may run later.
If that restriction isn't an issue, you could write a task that first schedules its own next run, like:
@task
def mytask():
    keep_running = ...  # Boolean: should the task keep running?
    if keep_running:
        run_again = ...  # calculate when to run again (a datetime)
        mytask.apply_async(eta=run_again)
    # ... do the stuff you came here to do ...
The major downside of this approach is that you are relying on the task store to remember the tasks in flight. If one of them fails before firing off the next one, then the task will never run again. If your broker isn't persisted to disk and it dies (taking all in-flight tasks with it), then none of those tasks will run again.
You could solve these issues with some kind of transaction logging and a periodic "nanny" task whose job it is to find such repeating tasks that died an untimely death and revive them.
If I had to implement what you've described, I think this is how I would approach it.
celery.task.base.PeriodicTask defines is_due, which decides whether the task should run now and when the scheduler should check again. You can override this method to contain your custom dynamic scheduling logic. See the docs here: http://docs.celeryproject.org/en/latest/reference/celery.task.base.html?highlight=is_due#celery.task.base.PeriodicTask.is_due
An example:
import random

from celery.task import PeriodicTask


class MyTask(PeriodicTask):
    # PeriodicTask requires a run_every; is_due below effectively overrides it.
    run_every = 60

    def run(self, **kwargs):
        logger = self.get_logger(**kwargs)
        logger.info("Running my task")

    def is_due(self, last_run_at):
        # Add your logic for when to run. Mine is random.
        if random.random() < 0.5:
            # Run now and check again in a minute
            return (True, 60)
        else:
            # Don't run now, but check again in 10 seconds
            return (False, 10)
See also the periodic tasks guide: http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
I don't think you can change the schedule dynamically; the best approach is to create a task from within a task (see the sketch below).
For example, if you want to run something X seconds later, create a new task with an X-second delay, and inside that task create another task with an N*X-second delay, and so on.
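A minimal, untested sketch of that idea, assuming the old-style @task decorator used elsewhere in this thread (the task name and the delay_seconds parameter are just placeholders):

from celery import task


@task
def repeat_myself(delay_seconds):
    # ... do the actual work here ...
    # then schedule the next run of this same task
    repeat_myself.apply_async(args=(delay_seconds,), countdown=delay_seconds)

Kick it off once with repeat_myself.apply_async(args=(30,), countdown=30) and it keeps re-enqueueing itself until you add a stop condition.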
This should help you some... http://celery.readthedocs.org/en/latest/faq.html#can-i-change-the-interval-of-a-periodic-task-at-runtime
Once you've defined a custom schedule, assign it to your task as asksol has suggested above.
CELERYBEAT_SCHEDULE = {
    "my_name": {
        "task": "myapp.tasks.task",
        "schedule": myschedule(),
    }
}
You might also want to modify CELERYBEAT_MAX_LOOP_INTERVAL if you want your schedule to update more often than every five minutes.
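For completeness, a custom schedule like the myschedule() above is typically a subclass of celery.schedules.schedule that overrides is_due. A rough, untested sketch (the decision logic is a placeholder you would replace with your own):

from celery.schedules import schedule


class myschedule(schedule):
    """Decides at runtime whether the task is due."""

    def is_due(self, last_run_at):
        # Return (is_due, seconds_until_next_check).
        if self.should_run_now(last_run_at):
            return (True, 60)   # run now, check again in a minute
        return (False, 10)      # don't run yet, check again in 10 seconds

    def should_run_now(self, last_run_at):
        # Placeholder: consult a database row, a cache key, etc.
        return False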
I have a task that is taking 300+ seconds to complete, and I believe (though haven't confirmed) that Celery doesn't want you to create long-running tasks. So rather than increasing our Celery task time limit, I'm trying to refactor the task using chain to break it up into shorter-running steps:
Current code
@task
def long_task():
    # Each step can take over 2 minutes
    x = step1()
    y = step2(x)
    z = step3(y)

def main():
    # this fails with timeout frequently
    long_task.delay()
Refactored code
@task
def step1():
    ...

@task
def step2():
    ...

@task
def step3():
    ...

@task
def chained_task():
    c = chain(step1.s(), step2.s(), step3.s())
    c.delay()

def main_refactored():
    # will this still timeout?
    chained_task.delay()
My question is: will this design actually prevent the timeout? I don't think so, since the chain inside chained_task() still has to complete within the time limit, and just because we are taskifying the subparts and chaining them, the whole chain still has to complete serially, which seems no different from before.
I want to truly break up the parent task into pieces, execute the pieces in serial order like a chain, and do that with 3 tasks taking 100s each rather than a single task taking 300s. I believe chain is what I need, but I'm not sure how to use it here. Thanks for your guidance!
I have a task consisting of subtasks in a chain. How can I ensure a second call of this task does not start before the first one has finished?
@shared_task
def task(user):
    res = chain(subtask_1.s(),  # each subtask takes ~1 hour
                subtask_2.s(),
                subtask_3.s())
    return res.apply_async()
A django view might now trigger to call this task:
# user A visits page that triggers task
task.delay(userA)
# 10 seconds later, while task() is still executing, user B visits page
task.delay(userB)
This leads to the tasks racing each other instead of being executed in sequential order. E.g. once a worker has finished with subtask_1() of the first task, it begins working on subtask_1() of the second task, instead of subtask_2() and subtask_3() of the first one.
Is there a way to elegantly avoid this? I guess the problem is the order in which the subtasks get added to the queue.
I have already set worker --concurrency=1, but that still doesn't change the order in which tasks are consumed from the queue.
The official docs (task cookbook) seem to offer a solution, but I don't understand it and it unfortunately doesn't work for me.
Perhaps include a blocking mechanism within the task, after the chain, with a while not res.ready(): sleep(1) kind of hack?
You can wait for the first task to finish and then execute the second one, like this:
res = task.delay(userA)
res.get() # will block until finished
task.delay(userB)
But this blocks the calling thread until the first task has finished. You can chain the tasks to avoid blocking, but for that you have to modify the task signature slightly to accept the previous task's result as an argument:
@shared_task
def task(_, user):  # signature takes one extra argument for the previous task's result
    # skipped
and
from celery.canvas import chain
chain(task.s(None, userA), task.s(userB))()
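A side note, not part of the original answer: if you'd rather leave the one-argument signature untouched, Celery's immutable signatures (.si()) should give the same ordering without passing results along. A sketch:

from celery import chain

# .si() creates an immutable signature, so the first task's result is not
# passed into the second one, but the second still waits for the first.
chain(task.si(userA), task.si(userB))()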
Let's say I add 100 push tasks (as group 1) to my task queue. Then I add another 200 tasks (as group 2) to the same queue. How can I tell whether all tasks of group 1 have finished?
Looks like QueueStatistics will not help here. tag works only with pull queues.
And I can not have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in the datastore, like @mgilson said, and decorate my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for, if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decrement function to complement the increment one.
import logging
import random
import time

from google.appengine.ext import deferred

# increment(), decrement() and get_count() come from the sharded-counter
# code in the article linked above.


def done_work():
    logging.info('work done!')


def worker(callback=None):
    def fst(f):
        def snd(*args, **kwargs):
            key = kwargs['shard_key']
            del kwargs['shard_key']
            retval = f(*args, **kwargs)
            decrement(key)
            if get_count(key) == 0:
                callback()
            return retval
        return snd
    return fst


def func(n):
    # do some work
    time.sleep(random.randint(1, 10) / 10.0)
    logging.info('task #{:d}'.format(n))


def make_some_tasks():
    # Wrap func rather than rebinding the name, to avoid shadowing it locally.
    wrapped_func = worker(callback=done_work)(func)
    key = random.randint(0, 1000)
    for n in xrange(0, 100):
        increment(key)
        deferred.defer(wrapped_func, n, shard_key=key)
Tasks are not guaranteed to run only once; occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this, using a counter incremented at task enqueueing and decremented at task completion wouldn't work: it would be decremented twice in such a duplicate-execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing) - they are unique for a given queue. Task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question, I didn't get to actually test it.
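To make the idea a bit more concrete, here is an equally untested sketch of per-task tracking; the TaskGroup model, the helper names, and the handler wiring are all illustrative assumptions, not an established recipe:

from google.appengine.api import taskqueue
from google.appengine.ext import ndb


class TaskGroup(ndb.Model):
    # Names of tasks that were enqueued but have not reported completion yet.
    pending_task_names = ndb.StringProperty(repeated=True)


def enqueue_group_task(group_key, url, params):
    # Let the queue auto-assign a unique name, then record it in the group.
    # (Real code would guard this read-modify-write against races.)
    task = taskqueue.add(url=url, params=params)
    group = group_key.get()
    group.pending_task_names.append(task.name)
    group.put()


@ndb.transactional
def mark_task_done(group_key, task_name):
    # Called by the task handler when it finishes; the handler can read its own
    # name from the X-AppEngine-TaskName request header.
    # Idempotent: removing a name twice (duplicate execution) is harmless.
    group = group_key.get()
    if task_name in group.pending_task_names:
        group.pending_task_names.remove(task_name)
        group.put()
    return not group.pending_task_names  # True when the whole group is done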
I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an AngularJS app.
The server side of the code looks like this (using multiprocessing.dummy):
from multiprocessing.dummy import Pool

worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
               for q in queries]

pool.close()  # Close pool
pool.join()   # Wait for all tasks to finish

errors = not all(obj.successful() for obj in result_objs)
# extract results only from successful tasks
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see, I'm using apply_async because I want to inspect each task later and extract the result only from tasks that didn't raise an exception.
I understood that in order to show a progress bar on the client side, I need to publish the number of completed tasks somewhere, so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I have seen similar questions on SO, like this one, but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue: your workers can put_nowait tuples with progress information on it while doing their work, and your main process can update whatever view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this lets you raise the resolution of your progress report and tell the user more.
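A minimal, untested sketch of that setup; the Progress namedtuple, do_work, and the counting loop are illustrative, not the asker's actual code:

import multiprocessing
from collections import namedtuple
from multiprocessing.dummy import Pool

Progress = namedtuple('Progress', ['task_id', 'step', 'total_steps'])
progress_queue = multiprocessing.Queue()

def do_work(task_id):
    total_steps = 3
    for step in range(1, total_steps + 1):
        # ... do one step of the real work ...
        progress_queue.put_nowait(Progress(task_id, step, total_steps))

pool = Pool(processes=10)
results = [pool.apply_async(do_work, (i,)) for i in range(20)]

completed = 0
while completed < len(results):
    update = progress_queue.get()  # blocks until a worker reports something
    if update.step == update.total_steps:
        completed += 1
    # store `completed` wherever view_progress can read it (session, cache, ...)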
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately, AsyncResult objects cannot be serialized into a session, so that option is out. If your application supports a single set of async tasks at a time, then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep client session data on the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes = multiprocessing.Value('i', 10)
lock = multiprocessing.Lock()
And then, when you call worker.do_work, pass the lock object and the value to it:
worker.do_work(lock, processes)
In your worker.do_work code, decrease processes by one when the work is finished:
lock.acquire()
processes.value -= 1
lock.release()
Now processes.value is accessible from your main code and equals the number of remaining processes. Make sure you acquire the lock before accessing processes.value, and release the lock afterwards.
Some context: I'm building a Django app that allows a user to pre-save an action and schedule the exact date/time in the future at which said action should execute. E.g., scheduling a post to be programmatically pushed to one's Facebook wall next week at 5:30am.
I'm looking for a task scheduling system that could handle a thousand instances of a one-off task, all set to execute near-simultaneously (error margin plus or minus a minute).
I'm considering django-celery/RabbitMQ for this, but I noticed the Celery docs do not address tasks meant for one-time use. Is django-celery the right choice here (perhaps by subclassing CrontabSchedule), or is my energy better spent researching some other approach? Perhaps hacking something together with the sched module and cron.
Edit 2:
For some reason, my head was originally stuck in the realm of recurring tasks. Here is a simpler solution.
All you really need is to define one task for each user action. You can skip storing tasks to be executed in your database--that's what celery is here for!
Reusing your Facebook post example, and again assuming you have a function post_to_facebook somewhere that takes a user and some text, does some magic, and posts the text to that user's Facebook wall, you can just define it to be a task like this:
# Task to send one update.
@celery.task(ignore_result=True)
def post_to_facebook(user, text):
    # perform magic
    return whatever_you_want
When a user is ready to enqueue such a post, you just tell celery when to run the task:
post_to_facebook.apply_async(
    (user, text),  # args
    eta=datetime.datetime(2012, 9, 15, 11, 45, 4, 126440)  # pass execution options as kwargs
)
This is all detailed here, among a whole bunch of available call options: http://docs.celeryproject.org/en/latest/userguide/calling.html#eta-and-countdown
If you need the result of the call, you can skip the ignore_result param in the task definition and get an AsyncResult object back, and then check it for the results of the call. More here: http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#keeping-results
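For example, a sketch of checking the result later (some_datetime is just a placeholder for the ETA you want):

result = post_to_facebook.apply_async((user, text), eta=some_datetime)

# later, e.g. from another request or a shell:
if result.ready():
    print(result.get())  # whatever post_to_facebook returned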
Some of the answer below is still relevant. You still want a task for each user action, you still want to think about task design, etc., but this is a much simpler road to doing what you asked about.
Original answer using recurring tasks follows:
Dannyroa has the right idea. I'll build upon that a bit here.
Edit / TLDR:
The answer is Yes, celery is suited to your needs. You just may need to rethink your task definition.
I assume you aren't allowing your users to write arbitrary Python code to define their tasks. Short of that, you will have to predefine some actions users can schedule, and then allow them to schedule those actions as they like. Then, you can just run one scheduled task for each user action, checking for entries and performing the action for each entry.
One user action:
Using your Facebook example, you would store users' updates in a table:
class ScheduledPost(Model):
    user = ForeignKey('auth.User')
    text = TextField()
    time = DateTimeField()
    sent = BooleanField(default=False)
Then you would run a task every minute, checking for entries in that table scheduled to be posted in the last minute (based on the error margin you mentioned). If it is very important that you hit your one minute window, you might schedule the task more often, say, every 30 seconds. The task might look like this (in myapp/tasks.py):
@celery.task
def post_scheduled_updates():
    from celery import current_task
    scheduled_posts = ScheduledPost.objects.filter(
        sent=False,
        time__gt=current_task.last_run_at,  # with the 'sent' flag, you may or may not want this
        time__lte=timezone.now()
    )
    for post in scheduled_posts:
        if post_to_facebook(post.text):
            post.sent = True
            post.save()
The config might look like this:
CELERYBEAT_SCHEDULE = {
    'fb-every-30-seconds': {
        'task': 'tasks.post_scheduled_updates',
        'schedule': timedelta(seconds=30),
    },
}
Additional user actions:
For each user action in addition to posting to Facebook, you can define a new table and a new task:
class EmailToMom(Model):
    user = ForeignKey('auth.User')
    text = TextField()
    subject = CharField(max_length=255)
    sent = BooleanField(default=False)
    time = DateTimeField()
@celery.task
def send_emails_to_mom():
    scheduled_emails = EmailToMom.objects.filter(
        sent=False,
        time__lt=timezone.now()
    )
    for email in scheduled_emails:
        sent = send_mail(
            email.subject,
            email.text,
            email.user.email,
            [email.user.mom.email],
        )
        if sent:
            email.sent = True
            email.save()
CELERYBEAT_SCHEDULE = {
    'fb-every-30-seconds': {
        'task': 'tasks.post_scheduled_updates',
        'schedule': timedelta(seconds=30),
    },
    'mom-every-30-seconds': {
        'task': 'tasks.send_emails_to_mom',
        'schedule': timedelta(seconds=30),
    },
}
Speed and optimization:
To get more throughput, instead of iterating over the updates to post and sending them serially during a post_scheduled_updates call, you could spawn up a bunch of subtasks and do them in parallel (given enough workers). Then the call to post_scheduled_updates runs very quickly and schedules a whole bunch of tasks--one for each fb update--to run asap. That would look something like this:
# Task to send one update. This will be called by post_scheduled_updates.
@celery.task
def post_one_update(update_id):
    try:
        update = ScheduledPost.objects.get(id=update_id)
    except ScheduledPost.DoesNotExist:
        raise
    else:
        sent = post_to_facebook(update.text)
        if sent:
            update.sent = True
            update.save()
        return sent

@celery.task
def post_scheduled_updates():
    from celery import current_task
    scheduled_posts = ScheduledPost.objects.filter(
        sent=False,
        time__gt=current_task.last_run_at,  # with the 'sent' flag, you may or may not want this
        time__lte=timezone.now()
    )
    for post in scheduled_posts:
        post_one_update.delay(post.id)
The code I have posted is not tested and certainly not optimized, but it should get you on the right track. In your question you implied some concern about throughput, so you'll want to look closely at places to optimize. One obvious one is bulk updates instead of iteratively calling post.sent = True; post.save(); a sketch follows below.
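For instance, applied to the serial version of post_scheduled_updates above, the bulk update might look like this (an untested sketch):

@celery.task
def post_scheduled_updates():
    scheduled_posts = ScheduledPost.objects.filter(
        sent=False,
        time__lte=timezone.now()
    )
    sent_ids = [post.id for post in scheduled_posts if post_to_facebook(post.text)]
    # One UPDATE statement instead of one save() per post.
    ScheduledPost.objects.filter(id__in=sent_ids).update(sent=True)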
More info:
More info on periodic tasks: http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html.
A section on task design strategies: http://docs.celeryproject.org/en/latest/userguide/tasks.html#performance-and-strategies
There is a whole page about optimizing celery here: http://docs.celeryproject.org/en/latest/userguide/optimizing.html.
This page about subtasks may also be interesting: http://docs.celeryproject.org/en/latest/userguide/canvas.html.
In fact, I recommend reading all the celery docs.
What I'll do is create a model called ScheduledPost.
I'll have a PeriodicTask that runs every 5 minutes or so.
The task will check the ScheduledPost table for any posts that need to be pushed to Facebook.