Not talking about the delay method.
I want to be able to get a task, given its task_id, and change its ETA on the fly, before it is executed.
For now I have to cancel it and re-schedule a new one, which is troublesome if the scheduled process involves a lot of work.
You should store some 'pause' value outside of the Celery task queue. I do this with a mailer that uses Celery: I can pause parts of the system by setting values in either memcache or MySQL. Each task queries that outside resource before doing its work. If it is meant to be paused, it calls task.retry(), which sends it back through the retry delay before it is attempted again.
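A minimal sketch of that check, using only the standard library so it runs on its own. The `PAUSE_FLAGS` dict stands in for memcache or a MySQL table, and `RetryLater` is a hypothetical stand-in for what `task.retry()` raises; neither name comes from Celery.

```python
# Stand-in for memcache or a MySQL table; in a real system this
# lives outside the task queue so it can be changed at any time.
PAUSE_FLAGS = {"mailer": False}


class RetryLater(Exception):
    """Hypothetical stand-in for the exception raised by task.retry()."""

    def __init__(self, countdown):
        super().__init__(f"retry in {countdown}s")
        self.countdown = countdown


def send_mail(recipient, flags=PAUSE_FLAGS, retry_delay=60):
    """Send a message unless the 'mailer' subsystem is paused."""
    # Check the external flag first, before doing any real work.
    if flags.get("mailer"):
        # In a bound Celery task this would be roughly:
        #     raise self.retry(countdown=retry_delay)
        raise RetryLater(countdown=retry_delay)
    return f"sent to {recipient}"
```

Because the flag is read at execution time, flipping it affects tasks that are already queued, without revoking anything.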
Assuming you are using django-celery and PeriodicTask with DatabaseScheduler, you need to modify your PeriodicTask interval or crontab and save it. If your task is defined by an interval, modify the last_run_at property.
You run celerybeat with the database scheduler with:
python manage.py celerybeat -S djcelery.schedulers.DatabaseScheduler
I'm adding a job to an APScheduler scheduler from a script. Unfortunately, the job is not properly scheduled, because the script never starts the scheduler.
scheduler = self.getscheduler()  # initializes and returns scheduler
scheduler.add_job(trigger=trigger, func=function, jobstore='mongo')  # sample code; note that scheduler.start() is never called
I'm seeing a message: apscheduler.scheduler - INFO - Adding job tentatively -- it will be properly scheduled when the scheduler starts
The script is only supposed to add jobs to the scheduler (not to run the scheduler at that point), and some other information has to be recorded whenever a job is added to the database. Is it possible to add a job and force the scheduler to write it to the jobstore without actually running the scheduler?
I know that it is possible to start and shut down the scheduler after adding each job, so that the scheduler saves the job information into the jobstore. Is that really a good approach?
Edit: My original intention was to isolate initialization process of my software. I just wanted to add some jobs to a scheduler, which is not yet started. The real issue is that I've given permission for the user to start and stop scheduler. I cannot assure that there is a running instance of scheduler in the system. I've temporarily fixed the problem by starting the scheduler and shutting it down after addition of jobs. It works.
You would have to have some way to notify the scheduler that a job has been added, so that it could wake up and adjust the delay to its next wakeup. It's better to do this via some sort of RPC mechanism. What kind of mechanism is appropriate for your particular use case, I don't know. But RPyC and Execnet are good candidates. Use one of them or something else to remotely control the scheduler process to add said jobs, and you'll be fine.
I have a standalone script that scrapes a page, opens a connection to a database, and writes the scraped data to it. I need it to run periodically, every x hours. I can do this with a bash script, with the pseudocode:
while true
do
    python scraper.py
    sleep $((60 * 60 * x))
done
From what I have read about message brokers, they are used for sending "signals" from one running program to another, like HTTP in principle. For example, one piece of code accepts an email id from the user and sends a signal with that email id to another piece of code, which sends the email.
I need Celery to run a periodic task on Heroku. I already have a MongoDB on a separate server. Why do I need to run another server for RabbitMQ or Redis just for this? Can I use Celery without the broker?
Celery's architecture is designed to scale and distribute tasks across several servers. For sites like yours it might be overkill. A queue service is generally needed to maintain the task list and signal the status of finished tasks.
You might want to take a look at Huey instead. Huey is a small-scale Celery "clone" that needs only Redis as an external dependency, not RabbitMQ. It still uses a Redis queue to line up the tasks.
There is also Advanced Python Scheduler (APScheduler), which does not even need Redis and can hold the state of the queue in memory, in-process.
Alternatively, if you have only a small number of periodic tasks and no delayed tasks, I would just use cron and plain Python scripts to run them.
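Where cron is not available, the shell loop from the question can also be kept in plain Python. This is only a sketch: `run_every` is a hypothetical helper, and the `sleep` parameter exists so the loop can be exercised without actually waiting.

```python
import time


def run_every(job, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call job() repeatedly, sleeping interval_seconds between runs.

    max_runs=None loops forever, like the shell `while true` loop.
    Returns the number of completed runs.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        job()
        runs += 1
        # Skip the final sleep once the run limit has been reached.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


if __name__ == "__main__":
    hours = 6  # hypothetical value for x
    # max_runs=1 keeps this demo short; drop it to loop forever.
    run_every(lambda: print("scraping..."), hours * 60 * 60, max_runs=1)
```

Note that the interval is measured from the end of one run to the start of the next, so a slow job pushes later runs back rather than overlapping them.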
As the Celery documentation explains:
Celery communicates via messages, usually using a broker to mediate between clients and workers. To initiate a task, a client adds a message to the queue, which the broker then delivers to a worker.
You can use your existing MongoDB database as the broker. See "Using MongoDB".
For an application like this, it's better to use Django Background Tasks.
Installation
Install from PyPI:
pip install django-background-tasks
Add to INSTALLED_APPS:
INSTALLED_APPS = (
    # ...
    'background_task',
    # ...
)
Migrate your database:
python manage.py makemigrations background_task
python manage.py migrate
Creating and registering tasks
To register a task use the background decorator:
from background_task import background
from django.contrib.auth.models import User

@background(schedule=60)
def notify_user(user_id):
    # lookup user by id and send them a message
    user = User.objects.get(pk=user_id)
    user.email_user('Here is a notification', 'You have been notified')
This will convert notify_user into a background task function. When you call it from regular code it will actually create a Task object and store it in the database. The database then contains serialised information about which function actually needs running later on. This does place limits on the parameters that can be passed when calling the function: they must all be serializable as JSON. That is why, in the example above, a user_id is passed rather than a User object.
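The JSON restriction can be seen with the standard library alone. This is a sketch under assumptions: the `User` class is a hypothetical stand-in for a Django model, and `serialize_call` only illustrates the idea, not the exact format the library stores.

```python
import json


class User:
    """Hypothetical stand-in for a Django model instance."""

    def __init__(self, pk):
        self.pk = pk


def serialize_call(*args, **kwargs):
    """Encode call parameters as JSON, as a database-backed queue must."""
    return json.dumps([args, kwargs])


user = User(pk=42)

# A primary key is a plain integer, so it serializes cleanly.
stored = serialize_call(user.pk)

# The model instance itself is rejected by the JSON encoder.
try:
    serialize_call(user)
    object_ok = True
except TypeError:
    object_ok = False
```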
Calling notify_user as normal will schedule the original function to be run 60 seconds from now:
notify_user(user.id)
This is the default schedule time (as set in the decorator), but it can be overridden:
from datetime import timedelta
from django.utils import timezone

notify_user(user.id, schedule=90) # 90 seconds from now
notify_user(user.id, schedule=timedelta(minutes=20)) # 20 minutes from now
notify_user(user.id, schedule=timezone.now()) # at a specific time
You can also run the original function immediately, in synchronous mode:
notify_user.now(user.id) # launch a notify_user function and wait for it
notify_user = notify_user.now # revert task function back to normal function.
Useful for testing.
You can specify a verbose name and a creator when scheduling a task:
notify_user(user.id, verbose_name="Notify user", creator=user)
I have a reminder-type app that schedules tasks in Celery using the "eta" argument. If the parameters of the reminder object change (e.g. the time of the reminder), I revoke the task previously sent and queue a new task.
I was wondering if there's any good way of keeping track of revoked tasks across celeryd restarts. I'd like to have the ability to scale celeryd processes up/down on the fly, and it seems that any celeryd processes started after the revoke command was sent will still execute that task.
One way of doing it is to keep a list of revoked task ids, but this method will result in the list growing arbitrarily. Pruning this list requires guarantees that the task is no longer in the RabbitMQ queue, which doesn't seem to be possible.
I've also tried using a shared --statedb file for each of the celeryd workers, but it seems that the statedb file is only updated on termination of the workers and thus not suitable for what I would like to accomplish.
Thanks in advance!
Interesting problem. I think it should be easy to solve using broadcast commands:
when a new worker starts up, it asks all the other workers to send their revoked
tasks to it. That takes two new remote control commands, and
you can easily add new commands by using @Panel.register.
Module control.py:
from celery.worker import state
from celery.worker.control import Panel

@Panel.register
def bulk_revoke(panel, ids):
    # Merge the received ids into this worker's set of revoked tasks.
    state.revoked.update(ids)

@Panel.register
def broadcast_revokes(panel, destination):
    # Send this worker's revoked ids to the worker(s) named in destination.
    panel.app.control.broadcast("bulk_revoke", arguments={
        "ids": list(state.revoked)},
        destination=destination)
Add it to CELERY_IMPORTS:
CELERY_IMPORTS = ("control", )
The only missing piece now is to connect it so that the new worker
triggers broadcast_revokes at startup. I guess you could use the worker_ready
signal for this:
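from celery import current_app as celery
from celery.signals import worker_ready

@worker_ready.connect
def request_revokes_at_startup(sender=None, **kwargs):
    # Ask every running worker to send its revoked ids to this new worker.
    # The hostname goes in a list because broadcast destinations must be a list.
    celery.control.broadcast("broadcast_revokes",
                             arguments={"destination": [sender.hostname]})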
I had to do something similar in my project and used celerycam with django-admin-monitor. The monitor periodically takes a snapshot of tasks and saves them in the database, and there is a nice user interface to browse and check the status of all tasks. You can use it even if your project is not Django based.
I implemented something similar to this some time ago, and the solution I came up with was very similar to yours.
The way I solved this problem was to have the worker fetch the Task object from the database when the job ran (by passing it the primary key, as the documentation recommends). In your case, before the reminder is sent the worker should perform a check to ensure that the task is "ready" to be run. If not, it should simply return without doing any work (assuming that the ETA has changed and another worker will pick up the new job).
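A minimal sketch of that readiness check, with a dict standing in for the database table. The field names (`remind_at`, `sent`) and the idea of passing the scheduled time as an ISO string alongside the primary key are assumptions made for illustration.

```python
from datetime import datetime

# Stand-in for the reminders table, keyed by primary key.
REMINDERS = {1: {"remind_at": datetime(2024, 1, 1, 9, 0), "sent": False}}


def send_reminder(reminder_id, scheduled_for, db=REMINDERS):
    """Send the reminder only if this task still matches the stored schedule.

    scheduled_for is the ISO-8601 time the task was queued for.
    """
    reminder = db.get(reminder_id)
    if reminder is None or reminder["sent"]:
        return "skipped"  # deleted, or already handled by another task
    if reminder["remind_at"].isoformat() != scheduled_for:
        # The reminder was rescheduled after this task was queued;
        # the task queued for the new time will do the work.
        return "stale"
    reminder["sent"] = True
    return "sent"
```

With this check in place, an outdated task simply does nothing when it fires, so there is no list of revoked ids to maintain across worker restarts.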
A normal approach to cron jobs with a django site would be to use cron to run custom management commands periodically.
But I found this http://code.google.com/p/django-cron/
How does it work, without needing cron? What invokes it to poll?
If it just sets up an address for an http request to hit periodically, what if the job takes a long time, won't the server time out?
It continually fires off a Timer thread, whose whole purpose is to wait a defined amount of time (the polling frequency you set in settings.py) and then run the execute on the django-cron queue again.
It depends on Django being a long-lived process, which if configured correctly it is. It runs a thread to check every 5 minutes (by default) to see if there are any jobs that need to be run, and if so runs them.
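The self-re-arming timer can be sketched with `threading.Timer`. This is not django-cron's actual code; the `Poller` class and its method names are hypothetical and only illustrate the mechanism described above.

```python
import threading


class Poller:
    """Re-arms a threading.Timer after every poll."""

    def __init__(self, interval_seconds, run_due_jobs):
        self.interval = interval_seconds
        self.run_due_jobs = run_due_jobs
        self._timer = None
        self._stopped = threading.Event()

    def _tick(self):
        if self._stopped.is_set():
            return
        self.run_due_jobs()  # runs in the timer thread, not in a request
        self.start()         # schedule the next poll

    def start(self):
        if self._stopped.is_set():
            return
        self._timer = threading.Timer(self.interval, self._tick)
        self._timer.daemon = True  # do not keep the process alive
        self._timer.start()

    def stop(self):
        self._stopped.set()
        if self._timer is not None:
            self._timer.cancel()
```

Since the jobs run in the timer thread rather than inside an HTTP request, a long job delays the next poll but does not run into a request timeout.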
I have a task which I execute once a minute using celerybeat. It works fine. Sometimes, though, the task takes a few seconds more than a minute to run, so two instances of the task run at the same time. This leads to race conditions that mess things up.
I can (and probably should) fix my task to work properly but I wanted to know if celery has any builtin ways to ensure this. My cursory Google searches and RTFMs yielded no results.
You could add a lock, using something like memcached or just your db.
If you run periodic tasks on a cron schedule or a time interval, you will still have this problem. You can always use a lock backed by a database, a cache, or even the filesystem. Alternatively, you can schedule the next task from the previous one, though that may not be the best approach.
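As one example of the filesystem option, a lock file created with `O_CREAT | O_EXCL` is atomic on a local filesystem. The sketch below uses only the standard library; `file_lock`, `periodic_task` and the lock path are hypothetical names.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def file_lock(path):
    """Yield True if this process acquired the lock, False if another holds it."""
    try:
        # O_CREAT | O_EXCL fails atomically if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield True
    finally:
        os.close(fd)
        os.remove(path)


LOCK_PATH = os.path.join(tempfile.gettempdir(), "my_periodic_task.lock")


def periodic_task():
    with file_lock(LOCK_PATH) as acquired:
        if not acquired:
            return "skipped"  # the previous run is still in progress
        return "ran"
```

One caveat: if the process is killed while holding the lock, the file is left behind and has to be cleaned up by hand or by a timeout check.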
This question can probably help you:
django celery: how to set task to run at specific interval programmatically
You can try adding a class field to the object that holds the function you are running, and use that field as a flag for whether another instance is already working.
The lock is a good way with either beat or a cron.
But, be aware that beat jobs run at worker start time, not at beat run time.
This was causing me to get a race condition even with a lock. Let's say the worker is off and beat throws 10 jobs into the queue. When Celery starts up with 4 processes, all 4 of them grab a task, and in my case 1 or 2 would get and set the lock at the same time.
Solution one is to use a cron with a lock, as a cron will execute at that time, not at worker start time.
Solution two is to use a slightly more advanced locking mechanism that handles race conditions. For Redis, look into SETNX or the newer Redlock.
This blog post is really good, and includes a decorator pattern that uses redis-py's locking mechanism: http://loose-bits.com/2010/10/distributed-task-locking-in-celery.html.
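The property that matters is that checking and setting the lock happen as one atomic step. The sketch below emulates that in memory so it runs without Redis; `AtomicLockStore` and `run_workers` are hypothetical names, and a real deployment would rely on the store's own atomic set-if-absent instead.

```python
import threading
import time


class AtomicLockStore:
    """In-memory stand-in for a store with an atomic set-if-absent."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}  # key -> expiry timestamp

    def acquire(self, key, ttl_seconds, now=None):
        """Set the key only if it is absent or expired; return True on success."""
        now = time.monotonic() if now is None else now
        with self._guard:  # check and set happen as one step
            expires = self._locks.get(key)
            if expires is not None and expires > now:
                return False
            self._locks[key] = now + ttl_seconds
            return True

    def release(self, key):
        with self._guard:
            self._locks.pop(key, None)


def run_workers(store, count=4):
    """Start `count` threads that race for the same lock; return the winners."""
    winners = []
    barrier = threading.Barrier(count)

    def worker(name):
        barrier.wait()  # release all workers at once, like the startup burst
        if store.acquire("minutely-task", ttl_seconds=60):
            winners.append(name)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

The expiry means a crashed worker cannot hold the lock forever, at the cost of having to pick a TTL longer than the slowest expected run.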