I'm using Celery with Redis as broker and I can see that the queue is actually a redis list with the serialized task as the items.
My question is, if I have an AsyncResult object as a result of calling <task>.delay(), is there a way to determine the item's position in the queue?
UPDATE:
I'm finally able to get the position using:
from celery.task.control import inspect
i = inspect()
i.reserved()
but its a bit slow since it needs to communicate with all the workers.
The inspect.reserved()/scheduled() you mention may work, but not
always accurate since it only take into account the tasks
that the workers have prefetched.
Celery does not allow out of band operations on the queue, like removing messages
from the queue, or reordering them, because it will not scale in a distributed system.
The messages may not have reached the queue yet, which can result
in race conditions and in practice it is not a sequential queue with transactional
operations, but a stream of messages originating from several locations.
That is, the Celery API is based around strict message passing semantics.
It is possible to access the queue directly on some of the brokers
Celery supports (like Redis or Database), but that is not part of the public API,
and you are discouraged from doing so, but of course if you are not planning on
supporting operations at scale you should do whatever is the most convenient for you
and discard my advice.
If this is just to give the user some idea when his job will be completed, then
I'm sure you could come up with an algorithm to predict when the task will be executed,
if you just had the length of the queue and the time at which each task was inserted.
The first is just a redis.len("celery"), and the latter you could
add yourself by listening to the task_sent signal:
from celery.signals import task_sent
#task_sent.connect
def record_insertion_time(id, **kwargs):
redis.zadd("celery.insertion_times", id)
Using a sorted set here: http://redis.io/commands/zadd
For a pure message passing solution you could use a dedicated monitor
that consumes the Celery event stream and predicts when tasks will finish.
http://docs.celeryproject.org/en/latest/userguide/monitoring.html#event-reference
(just noticed that the task-sent is missing the timestamp field in
the documentation, but a timestamp is sent with that event so I will fix it).
The events also contain a "clock" field which is a logical clock
(see http://en.wikipedia.org/wiki/Lamport_timestamps),
this can be used to detect the order of events in a distributed
system without depending on the system time on each machine
to be in sync (which is ~impossible to achieve).
Related
Pardon my ignorance as I am learning how I can use celery for my purposes.
Suppose I have two tasks: create_ticket and add_message_to_ticket. Usually create_ticket task is created and completed before add_message_to_ticket tasks are created multiple times.
#app.task
def create_ticket(ticket_id):
time.sleep(random.uniform(1.0, 4.0)) # replace with code that processes ticket creation
return f"Successfully processed ticket creation: {ticket_id}"
#app.task
def add_message_to_ticket(ticket_id, who, when, message_contents):
# TODO add code that checks to see if create_ticket task for ticket_id has already been completed
time.sleep(random.uniform(1.0, 4.0)) # replace with code that handles added message
return f"Successfully processed message for ticket {ticket_id} by {who} at {when}"
Now suppose that these tasks are created out of order due to Python's server receiving the events from an external web service out of order. For example, one add_message_to_ticket.delay(82, "auroranil", 1599039427, "This issue also occurs on Microsoft Edge on Windows 10.") gets called few seconds before create_ticket.delay(82) gets called. How would I solve the following problems?
How would I fetch results of celery task create_ticket by specifying ticket_id within task add_message_to_ticket? All I can think of is to maintain a database that stores tickets state, and checks to see if a particular ticket has been created, but I want to know if I am able to use celery's result backend somehow.
If I receive an add_message_to_ticket task with a ticket id where I find out that corresponding ticket does not have create_ticket task completed, do I reject that task, and put that back in the queue?
Do I need to ensure that the tasks are idempotent? I know that is good practice, but is it a requirement for this to work?
Is there a better approach at solving this problem? I am aware of Celery Canvas workflow with primitives such as chain, but I am not sure how I can ensure that these events are processed in order, or be able to put tasks on pending state while it waits for tasks it depends on to be completed based on arguments I want celery to check, which in this case is ticket_id.
I am not particularly worried if I receive multiple user messages for a particular ticket with timestamps out of order, as it is not as important as knowing that a ticket has been created before messages are added to that ticket. The point I am making is that I am coding up several tasks where some events crucially depend on others, whereas the ordering of other events do not matter as much for the Python's server to function.
Edit:
Partial solutions:
Use task_id to identify Celery tasks, with a formatted string containing argument values which identifies that task. For example, task_id="create_ticket(\"TICKET000001\")"
Retry tasks that do not meet dependency requirements. Blocking for subtasks to be completed is bad, as subtask may never complete, and will hog a process in one of the worker machines.
Store arguments as part of result of a completed task, so that you can use that information not available in later tasks.
Relevant links:
Where do you set the task_id of a celery task?
Retrieve result from 'task_id' in Celery from unknown task
Find out whether celery task exists
More questions:
How do I ensure that I send task once per task_id? For instance, I want create_ticket task to be applied asynchronous only once. This is an alternative to making all tasks idempotent.
How do I use AsyncResult in add_message_to_ticket to check for status of create_ticket task? Is it possible to specify a chain somehow even though the first task may have already been completed?
How do I fetch all results of tasks given task name derived from the name of the function definition?
Most importantly, should I use Celery results backend to abstract stored data away from dealing with a database? Or should I scratch this idea and just go ahead with designing a database schema instead?
I'm wondering if there's a way to set up RabbitMQ or Redis to work with Celery so that when I send a task to the queue, it doesn't go into a list of tasks, but rather into a Set of tasks keyed based on the payload of my task, in order to avoid duplicates.
Here's my setup for more context:
Python + Celery. I've tried RabbitMQ as a backend, now I'm using Redis as a backend because I don't need the 100% reliability, easier to use, small memory footprint, etc.
I have roughly 1000 ids that need work done repeatedly. Stage 1 of my data pipeline is triggered by a scheduler and it outputs tasks for stage 2. The tasks contain just the id for which work needs to be done and the actual data is stored in the database. I can run any combination or sequence of stage 1 and stage 2 tasks without harm.
If stage 2 doesn't have enough processing power to deal with the volume of tasks output by stage 1, my task queue grows and grows. This wouldn't have to be the case if the task queue used sets as the underlying data structure instead of lists.
Is there an off-the-shelf solution for switching from lists to sets as distributed task queues? Is Celery capable of this? I recently saw that Redis has just released an alpha version of a queue system, so that's not ready for production use just yet.
Should I architect my pipeline differently?
You can use an external data structure to store and monitor the current state of your celery queue.
1. Lets take a redis key-value for example. Whenever you push a task into celery, you mark a key with your 'id' field as true in redis.
Before trying to push a new task with any 'id', you would check if the key with 'id' is true in redis or not, if yes, you skip pushing the task.
To clear the keys at proper time, you can use after_return handler of celery, which runs when the task has returned. This handler will unset the key 'id' in redis , hence clearing the lock for next task push .
This method ensures you only have ONE instance per id of task running in celery queue. You can also enhance it to allow only N tasks per id by using INCR and DECR commands on the redis key, when the task is pushed and after_return of the task.
Can your tasks in stage 2 check whether the work has already been done and, if it has, then not do the work again? That way, even though your task list will grow, the amount of work you need to do won't.
I haven't come across a solution re the sets / lists, and I'd think there were lots of other ways of getting around this issue.
Use a SortedSet within Redis for your jobs queue. It is indeed a Set so if you put the exact same data inside it won't add a new value in it (it absolutely needs to be the exact same data, you can't override the hash function used in SortedSet in Redis).
You will need a score to use with SortedSet, you can use a timestamp (value as a double, using unixtime for instance) that will allow you to get the most recent items / oldest items if you want. ZRANGEBYSCORE is probably the command you will be looking for.
http://redis.io/commands/zrangebyscore
Moreover, if you need additional behaviours, you can wrap everything inside a Lua Script for atomistic behaviour and custom eviction strategy if needed. For instance calling a "get" script that gets the job and remove it from the queue atomically or evicts data if there is too much back pressure etc.
I'm running Django, Celery and RabbitMQ. What I'm trying to achieve is to ensure, that tasks related to one user are executed in order (specifically, one at the time, I don't want task concurrency per user)
whenever new task is added for user, it should depend on the most recently added task. Additional functionality might include not adding task to queue, if task of this type is queued for this user and has not yet started.
I've done some research and:
I couldn't find a way to link newly created task with already queued one in Celery itself, chains seem to be only able to link new tasks.
I think that both functionalities are possible to implement with custom RabbitMQ message handler, though it might be hard to code after all.
I've also read about celery-tasktree and this might be an easiest way to ensure execution order, but how do I link new task with already "applied_async" task_tree or queue? Is there any way that I could implement that additional no-duplicate functionality using this package?
Edit: There is this also this "lock" example in celery cookbook and as the concept is fine, I can't see a possible way to make it work as intended in my case - simply if I can't acquire lock for user, task would have to be retried, but this means pushing it to the end of queue.
What would be the best course of action here?
If you configure the celery workers so that they can only execute one task at a time (see worker_concurrency setting), then you could enforce the concurrency that you need on a per user basis. Using a method like
NUMBER_OF_CELERY_WORKERS = 10
def get_task_queue_for_user(user):
return "user_queue_{}".format(user.id % NUMBER_OF_CELERY_WORKERS)
to get the task queue based on the user id, every task will be assigned to the same queue for each user. The workers would need to be configured to only consume tasks from a single task queue.
It would play out like this:
User 49 triggers a task
The task is sent to user_queue_9
When the one and only celery worker that is listening to user_queue_9 is ready to consume a new task, the task is executed
This is a hacky answer though, because
requiring just a single celery worker for each queue is a brittle system -- if the celery worker stops, the whole queue stops
the workers are running inefficiently
I only created the last 2 queue names that show in Rabbitmq management Webui in the below table:
The rest of the table has hash-like queues, which I don't know:
1- Who created them? (I know it is celery, but which process, task,etc.)
2- Why they are created, and what they are created for?.
I can notice that when the number of pushed messages increase, the number of those hash-like messages increase.
When using celery, Rabbitmq is used as a default result backend, and also to store errors of failing
tasks(that raised exceptions).
Every new task creates a new queue on the server, with thousands of tasks the
broker may be overloaded with queues and this will affect performance
in negative ways.
Each queue in Rabbit will be a separate Erlang process, so if you’re planning to
keep many results simultaneously you may have to increase the Erlang
process limit, and the maximum number of file descriptors your OS
allows.
Old results will not be cleaned automatically, so we have to tell
rabbit to do so.
The below conf. line dictates the time to live of the temp
queues. The default is 1 day
CELERY_AMQP_TASK_RESULT_EXPIRES = Number of seconds
OR, We can change the backend store totally, and not make it in Rabbit.
CELERY_BACKEND = "amqp"
We may also ignore it:
CELERY_IGNORE_RESULT = True.
Also, when ignoring the result, we can also keep the errors stored for later usage,
which means one more queue for the failing tasks.
CELERY_STORE_ERRORS_EVEN_IF_IGNORED = True.
I will not mark this question as answered, waiting for a better answer.
Rererences:
This SO link
Celery documentation
Rabbitmq documentation
I'm working on a web application that will receive a request from a user and have to hit a number of external APIs to compose the answer to that request. This could be done directly from the main web thread using something like gevent to fan out the request.
Alternatively, I was thinking, I could put incoming requests into a queue and use workers to distribute the load. The idea would be to try to keep it real time, while splitting up the requests amongst several workers. Each of these workers would be querying only one of the many external APIs. The response they receive would then go through a series transformations, be saved into a DB, be transformed to a common schema and saved in a common DB to finally be composed into one big response that would be returned through the web request. The web request is most likely going to be blocking all this time, with a user waiting, so keeping
the queueing and dequeueing as fast as possible is important.
The external API calls can easily be turned into individual tasks. I think the linking
from one api task to a transformation to a DB saving task could be done using a chain, etc, and the final result combining all results returned to the web thread using a chord.
Some questions:
Can this (and should this) be done using celery?
I'm using django. Should I try to use django-celery over plain celery?
Each one of those tasks might spawn off other tasks - such as logging what just
happened or other types of branching off. Is this possible?
Could tasks be returning the data they get - i.e. potentially Kb of data through celery (redis as underlying in this case) or should they write to the DB, and just pass pointers to that data around?
Each task is mostly I/O bound, and was initially just going to use gevent from the web thread to fan out the requests and skip the whole queuing design, but it turns out that it would be reused for a different component. Trying to keep the whole round trip through the Qs real time will probably require many workers making sure the queueus are mostly empty. Or is it? Would running the gevent worker pool help with this?
Do I have to write gevent specific tasks or will using the gevent pool deal with network IO automagically?
Is it possible to assign priority to certain tasks?
What about keeping them in order?
Should I skip celery and just use kombu?
It seems like celery is geared more towards "tasks" that can be deferred and are
not time sensitive. Am I nuts for trying to keep this real time?
What other technologies should I look at?
Update: Trying to hash this out a bit more. I did some reading on Kombu and it seems to be able to do what I'm thinking of, although at a much lower level than celery. Here is a diagram of what I had in mind.
What seems to be possible with raw queues as accessible with Kombu is the ability for a number of workers to subscribe to a broadcast message. The type and number does not need to be known by the publisher if using a queue. Can something similar be achieved using Celery? It seems like if you want to make a chord, you need to know at runtime what tasks are going to be involved in the chord, whereas in this scenario you can simply add listeners to the broadcast, and simply make sure they announce they are in the running to add responses to the final queue.
Update 2: I see there is the ability to broadcast Can you combine this with a chord? In general, can you combine celery with raw kombu? This is starting to sound like a question about smoothies.
I will try to answer as many of the questions as possible.
Can this (and should this) be done using celery?
Yes you can
I'm using django. Should I try to use django-celery over plain celery?
Django has a good support for celery and would make the life much easier during development
Each one of those tasks might spawn off other tasks - such as logging
what just happened or other types of branching off. Is this possible?
You can start subtasks from withing a task with ignore_result = true for only side effects
Could tasks be returning the data they get - i.e. potentially Kb of
data through celery (redis as underlying in this case) or should they
write to the DB, and just pass pointers to that data around?
I would suggest putting the results in db and then passing id around would make your broker and workers happy. Less data transfer/pickling etc.
Each task is mostly I/O bound, and was initially just going to use
gevent from the web thread to fan out the requests and skip the whole
queuing design, but it turns out that it would be reused for a
different component. Trying to keep the whole round trip through the
Qs real time will probably require many workers making sure the
queueus are mostly empty. Or is it? Would running the gevent worker
pool help with this?
Since the process is io bound then gevent will definitely help here. However, how much the concurrency should be for gevent pool'd worker, is something that I'm looking for answer too.
Do I have to write gevent specific tasks or will using the gevent pool
deal with network IO automagically?
Gevent does the monkey patching automatically when you use it in pool. But the libraries that you use should play well with gevent. Otherwise, if your parsing some data with simplejson (which is written in c) then that would block other gevent greenlets.
Is it possible to assign priority to certain tasks?
You cannot assign specific priorities to certain tasks, but route them to different queue and then have those queues being listened to by varying number of workers. The more the workers for a particular queue, the higher would be the priority of that tasks on that queue.
What about keeping them in order?
Chain is one way to maintain order. Chord is a good way to summarize. Celery takes care of it, so you dont have to worry about it. Even when using gevent pool, it would at the end be possible to reason about the order of the tasks execution.
Should I skip celery and just use kombu?
You can, if your use case will not change to something more complex over time and also if you are willing to manage your processes through celeryd + supervisord by yourself. Also, if you don't care about the task monitoring that comes with tools such as celerymon, flower, etc.
It seems like celery is geared more towards "tasks" that can be
deferred and are not time sensitive.
Celery supports scheduled tasks as well. If that is what you meant by that statement.
Am I nuts for trying to keep this real time?
I don't think so. As long as your consumers are fast enough, it will be as good as real time.
What other technologies should I look at?
Pertaining to celery, you should choose result store wisely. My suggestion would be to use cassandra. It is good for realtime data (both write and query wise). You can also use redis or mongodb. They come with their own set of problems as result store. But then a little tweaking in configuration can go a long way.
If you mean something completely different from celery, then you can look into asyncio (python3.5) and zeromq for achieving the same. I can't comment more on that though.