Architecture
Consider a system with DB records. Each record can be in a live or expired status; live records should be processed periodically using an external software module.
I have solved this using a classic producer-consumer architecture with Kombu and RabbitMQ. The producer fetches the records from the DB every few seconds, and the consumer handles them.
The problem
The number of live events varies greatly, and at peak hours the consumer can't handle the load, so the queue gets clogged with thousands of items.
I would like to make the system adaptive, so that the producer will only send new events to the consumer when the queue is empty.
What have I tried
Searching the Kombu documentation / API
Inspecting the Queue object
Using the RabbitMQ REST API: http://<host>:<port>/api/queues/<vhost>/<queue_name>. It works (a sketch is shown below), but it's yet another mechanism to maintain, and I would prefer an elegant solution within Kombu.
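For reference, here is a minimal sketch of that REST call (it assumes the management plugin on its default port 15672, the default "/" vhost URL-encoded as %2F, and guest credentials; the queue name is a placeholder):

import requests

resp = requests.get(
    "http://localhost:15672/api/queues/%2F/my_queue",  # placeholder host and queue
    auth=("guest", "guest"),
)
resp.raise_for_status()
messages_ready = resp.json()["messages_ready"]
print("queue is empty" if messages_ready == 0 else "%d messages waiting" % messages_ready)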
How do I check whether a RabbitMQ queue is empty using Python's Kombu?
You can call queue_declare() on the kombu Queue object.
According to the docs the function returns:
Returns a tuple containing 3 items:
the name of the queue (essential for automatically-named queues)
message count
consumer count
Therefore you can do:
name, msg_count, consumer_count = queue.queue_declare()
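Put together, a minimal sketch (assuming a broker on localhost and a queue named "tasks"; passive=True asks the broker for the counts without creating or modifying the queue):

from kombu import Connection, Queue

with Connection("amqp://guest:guest@localhost//") as conn:
    channel = conn.channel()
    queue = Queue("tasks", channel=channel)
    name, msg_count, consumer_count = queue.queue_declare(passive=True)
    if msg_count == 0:
        print("queue is empty - safe to publish the next batch")

The producer can run a check like this on each cycle and skip publishing while msg_count is still above zero.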
Related
I need a producer-consumer kind of architecture, where the producer puts data in a queue over and over, and then a consumer reads from that queue as fast as it can process the data.
For a producer and consumer running in separate processes on the same machine, we already have multiprocessing, whose Queue gives you put and get. So even if the producer runs at 2-3 times the speed of the consumer, all the data sits in the queue (assume memory use is not a problem) and the consumer just calls q.get whenever it needs to.
But I need the producer and consumer to be connected over a network, so probably through a socket (though I am open to other methods). The big problem with sockets is that they do not separate objects automatically the way queues do.
With a multiprocessing.Queue, if I call q.get I get the next object: the queue takes care of how many bytes to read and recreates the object for me; q.get just returns the object. With a socket I have to pickle.dumps the object to send it, then be careful how many bytes to read from the socket (in case more than one object is waiting in it), and then pickle.loads the result. The main problem is keeping track of object sizes.
If I put 10 objects of different sizes that add up to 1000 bytes into a Queue, the queue takes care of how many bytes to read for each object when I call q.get. With a socket, if I pickle the 10 objects and send them, the socket has no idea how to split the resulting 1000-byte string, and building a mechanism for this means adding a lot of new code.
Is there some kind of... socket-based Queue or similar?
This is usually solved with external software that acts as a broker between the producer and consumer over the network. There are a few open source projects you can look into:
RabbitMQ
Kafka
Redis
Celery
They are all different in their own way, but they all have Python libraries you can easily pip install to begin using them. All of them require that a third process be running to serve as the message broker.
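As an illustration, here is a rough sketch of the pattern with Redis as the broker (it assumes a Redis server on localhost and the redis-py package; the queue name is arbitrary). Each pickled object is stored as one list element, so object boundaries are preserved and the framing problem goes away:

import pickle
import redis

r = redis.Redis(host="localhost", port=6379)

def produce(obj, queue_name="jobs"):
    # one list element per pickled object - the broker keeps the boundaries
    r.lpush(queue_name, pickle.dumps(obj))

def consume(queue_name="jobs"):
    # BRPOP blocks until an element is available, much like Queue.get()
    _key, payload = r.brpop(queue_name)
    return pickle.loads(payload)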
Similarly, there are paid products for this as well - typically hosted in one of the big cloud providers - like AWS SQS.
This is not to say that it is not possible to create a custom socket or server implementation to do this... but, a lot of the time in programming, it's best not to try to reinvent the wheel.
I have an SQS queue that is constantly being populated by a data consumer and I am now trying to create the service that will pull this data from SQS using Python's boto.
The way I designed it is that I will have 10-20 threads all trying to read messages from the SQS queue and then doing what they have to do on the data (business logic), before going back to the queue to get the next batch of data once they're done. If there's no data they will just wait until some data is available.
I have two areas I'm not sure about with this design
Is it a matter of calling receive_message() with a long time_out value and if nothing is returned in the 20 seconds (maximum allowed) then just retry? Or is there a blocking method that returns only once data is available?
I noticed that once I receive a message, it is not deleted from the queue. Do I have to receive a message and then send another request after receiving it to delete it from the queue? That seems like a bit of overkill.
Thanks
The long-polling capability of the receive_message() method is the most efficient way to poll SQS. If that returns without any messages, I would recommend a short delay before retrying, especially if you have multiple readers. You may want to even do an incremental delay so that each subsequent empty read waits a bit longer, just so you don't end up getting throttled by AWS.
And yes, you do have to delete the message after you have read it, or it will reappear in the queue. This can actually be very useful in the case of a worker reading a message and then failing before it can fully process it; in that case it will be re-queued and read by another worker. You also want to make sure the visibility timeout of the messages is set long enough that a worker has time to process a message before it automatically reappears on the queue. If necessary, your workers can extend the timeout while they are processing if a message is taking longer than expected.
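A rough sketch of one worker's loop using boto3 (the question used the older boto library; the queue URL, batch size, and process() function are placeholders):

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

while True:
    # Long poll: the call blocks for up to 20 seconds if the queue is empty.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])  # your business logic
        # Delete only after successful processing, so a crashed worker's
        # message becomes visible again and is picked up by another worker.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])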
If you want a simple way to set up a listener that includes automatic deletion of messages when they're finished being processed, and automatic pushing of exceptions to a specified queue, you can use the pySqsListener package.
You can set up a listener like this:
from sqs_listener import SqsListener

class MyListener(SqsListener):
    def handle_message(self, body, attributes, messages_attributes):
        run_my_function(body['param1'], body['param2'])

listener = MyListener('my-message-queue', 'my-error-queue')
listener.listen()
There is a flag to switch from short polling to long polling - it's all documented in the README file.
Disclaimer: I am the author of said package.
Another option is to set up a worker application using AWS Elastic Beanstalk, as described in this blog post.
Instead of long polling using boto3, your Flask application receives the message as a JSON object in an HTTP POST. The HTTP path and the type of messages being sent are configurable in the AWS Elastic Beanstalk Configuration tab.
AWS Elastic Beanstalk has the added benefit of being able to dynamically scale the number of workers as a function of the size of your SQS queue, along with its deployment management benefits.
This is an example application that I found useful as a template.
I'm using Celery with Redis as the broker, and I can see that the queue is actually a Redis list with the serialized tasks as its items.
My question is, if I have an AsyncResult object as a result of calling <task>.delay(), is there a way to determine the item's position in the queue?
UPDATE:
I'm finally able to get the position using:
from celery.task.control import inspect
i = inspect()
i.reserved()
but it's a bit slow since it needs to communicate with all the workers.
The inspect.reserved()/scheduled() calls you mention may work, but they are not always accurate, since they only take into account the tasks that the workers have already prefetched.
Celery does not allow out of band operations on the queue, like removing messages
from the queue, or reordering them, because it will not scale in a distributed system.
The messages may not have reached the queue yet, which can result
in race conditions and in practice it is not a sequential queue with transactional
operations, but a stream of messages originating from several locations.
That is, the Celery API is based around strict message passing semantics.
It is possible to access the queue directly on some of the brokers Celery supports (like Redis or a database), but that is not part of the public API and you are discouraged from doing so. Of course, if you are not planning on supporting operations at scale, you should do whatever is most convenient for you and disregard my advice.
If this is just to give the user some idea of when their job will be completed, then I'm sure you could come up with an algorithm to predict when the task will be executed, if you just had the length of the queue and the time at which each task was inserted. The first is just a redis.llen("celery"), and the latter you could record yourself by listening for the task_sent signal:
from time import time
from celery.signals import task_sent

@task_sent.connect
def record_insertion_time(sender=None, task_id=None, **kwargs):
    # record the send time as the task's score (redis-py 3.x mapping form)
    redis.zadd("celery.insertion_times", {task_id: time()})
Using a sorted set here: http://redis.io/commands/zadd
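A rough sketch of the estimate itself (it assumes the sorted set above and the default "celery" list, and only approximates the position, for the reasons given earlier):

def estimate_position(task_id):
    queue_length = redis.llen("celery")                    # messages still waiting
    rank = redis.zrank("celery.insertion_times", task_id)  # insertion order by send time
    if rank is None:
        return None  # task never recorded (or already cleaned up)
    # the rank among recorded tasks approximates the task's place in the queue
    return min(rank, queue_length)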
For a pure message passing solution you could use a dedicated monitor
that consumes the Celery event stream and predicts when tasks will finish.
http://docs.celeryproject.org/en/latest/userguide/monitoring.html#event-reference
(I just noticed that the task-sent event is missing the timestamp field in the documentation, but a timestamp is sent with that event, so I will fix it).
The events also contain a "clock" field, which is a logical clock (see http://en.wikipedia.org/wiki/Lamport_timestamps); this can be used to detect the order of events in a distributed system without depending on the system time on each machine being in sync (which is nearly impossible to achieve).
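A rough sketch of such a monitor using the event Receiver (the broker URL is an assumption; the handlers here just print, but they could feed a finish-time estimate):

from celery import Celery

app = Celery(broker="redis://localhost:6379/0")

def on_task_sent(event):
    print("sent", event["uuid"], event["timestamp"])

def on_task_succeeded(event):
    print("done", event["uuid"], event["timestamp"])

with app.connection() as connection:
    recv = app.events.Receiver(connection, handlers={
        "task-sent": on_task_sent,
        "task-succeeded": on_task_succeeded,
    })
    recv.capture(limit=None, timeout=None)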
I have a "queue" of about a million entities on google app engine. I have to "pop" items off of the queue by using a query.
There are a bunch of client processes running all over the place that are constantly making requests to the stack. My problem is that when one of the clients requests an item, I want to make sure that I am removing that item from the front of the queue, sending it to that client process, and no other processes.
Currently, I am querying for the item, modifying its properties so that a query to the queue no longer includes that item, and then saving the item. Using this method, it is very common for one item to be sent to more than one client process at the same time. I suspect this is because there is a delay between when I make the writes and when they are reflected to other processes.
Perhaps I need to be using transactions in some way, but when I looked into that, there were a couple of "gotchas". What is a good way to approach this problem?
Is there any reason not to implement the "queue" using App Engine's TaskQueue API? If the size of the queue is the problem, TaskQueue can hold up to 200 million tasks for a paid app, so a million entities would be easily handled.
If you want to be able to simulate queries for a certain task in the queue, you could use task tags, and have your client process pull tasks with a certain tag to be processed. Note that pulling tasks is supported through pull queues rather than push queues.
Other than that, if you want to keep your "queue-as-entities" implementation, you could use the Memcache API to signal to the client processes which entity needs to be processed. Memcache provides stronger consistency when you need to share data between instances of your app, compared to the eventual consistency of the HRD datastore, with the caveat that data in Memcache can be lost at any point in time.
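As a rough sketch of that signalling idea, Memcache's atomic add() can act as a claim: it only succeeds for the first caller, so two clients cannot grab the same entity (the key naming and expiry below are assumptions):

from google.appengine.api import memcache

def try_claim(entity_id, worker_id):
    # add() is atomic and fails if the key already exists, so only one
    # worker wins the claim; it expires after 5 minutes as a safety net.
    return memcache.add("claim:%s" % entity_id, worker_id, time=300)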
I see two ways to tackle this:
What you are doing is OK; you just need to use transactions. If your processes take longer than 30s, you can offload them to the task queue, which can be part of the transaction.
You could use pull queues, where you fill up a queue and then client processes pull tasks from it in an atomic fashion (a lease-delete cycle, sketched below). With pull queues you can be sure that a task is leased only once. A task must also be deleted manually from the queue after it's done, meaning that if your process dies, the task will be put back in the queue after the lease expires.
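Here is a minimal sketch of that lease-delete cycle (it assumes a pull queue named "work-queue" defined in queue.yaml and a process() function for the business logic):

from google.appengine.api import taskqueue

queue = taskqueue.Queue("work-queue")

# Lease up to 10 tasks for 60 seconds; while a lease is active,
# each task is handed to exactly one caller.
tasks = queue.lease_tasks(lease_seconds=60, max_tasks=10)
for task in tasks:
    process(task.payload)  # your business logic

# Delete only after successful processing; otherwise the tasks
# reappear in the queue when the lease expires.
queue.delete_tasks(tasks)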
The producer module of my application is run by users who want to submit work to be done on a small cluster. It sends the submissions in JSON form through the RabbitMQ message broker.
I have tried several strategies, and the best so far is the following, which is still not fully working:
Each cluster machine runs a consumer module, which subscribes itself to the AMQP queue and issues a prefetch_count to tell the broker how many tasks it can run at once.
I was able to make it work using SelectConnection from the Pika AMQP library. Both consumer and producer start two channels, one connected to each queue. The producer sends requests on channel [A] and waits for responses on channel [B], and the consumer waits for requests on channel [A] and sends responses on channel [B]. It seems, however, that when the consumer runs the callback that calculates the response, it blocks, so only one task is executed on each consumer at a time.
What I need in the end:
the producer submits its tasks (around 5k each time) to the cluster
the broker dispatches N messages/requests to each consumer, where N is the number of concurrent tasks it can handle
when a single task is finished, the consumer replies to the broker/producer with the result
the producer receives the replies, updates the computation status and, in the end, prints some reports
Restrictions:
If another user submits work, all of their tasks will be queued after the previous user's (I guess this follows automatically from the queue system, but I haven't thought about the implications in a threaded environment)
Tasks have an order to be submitted, but the order they are replied is not important
UPDATE
I have studied a bit further, and my actual problem seems to be that I use a simple function as the callback to pika's SelectConnection.channel.basic_consume() function. My latest (unimplemented) idea is to pass a threaded function instead of a regular one, so the callback would not block and the consumer could keep listening.
As you have noticed, your process blocks when it runs a callback. There are several ways to deal with this depending on what your callback does.
If your callback is IO-bound (doing lots of networking or disk IO) you can use either threads or a greenlet-based solution, such as gevent, eventlet, or greenhouse. Keep in mind, though, that Python is limited by the GIL (Global Interpreter Lock), which means that only one piece of python code is ever running in a single python process. This means that if you are doing lots of computation with python code, these solutions will likely not be much faster than what you already have.
Another option would be to implement your consumer as multiple processes using multiprocessing. I have found multiprocessing to be very useful when doing parallel work. You could implement this by either using a Queue, having the parent process being the consumer and farming out work to its children, or by simply starting up multiple processes which each consume on their own. I would suggest, unless your application is highly concurrent (1000s of workers), to simply start multiple workers, each of which consumes from their own connection. This way, you can use the acknowledgement feature of AMQP, so if a consumer dies while still processing a task, the message is sent back to the queue automatically and will be picked up by another worker, rather than simply losing the request.
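Here is a rough sketch of the "multiple workers, each with its own connection" option (pika 1.x BlockingConnection; the queue name, worker count, and handle_task() are assumptions). Because a message is only acknowledged after processing, a worker that dies mid-task causes the message to be redelivered to another worker:

import multiprocessing
import pika

def worker():
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tasks", durable=True)
    channel.basic_qos(prefetch_count=1)  # at most one unacknowledged message per worker

    def on_message(ch, method, properties, body):
        handle_task(body)  # your business logic
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="tasks", on_message_callback=on_message)
    channel.start_consuming()

if __name__ == "__main__":
    for _ in range(4):
        multiprocessing.Process(target=worker).start()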
A last option, if you control the producer and it is also written in Python, is to use a task library like celery to abstract the task/queue workings for you. I have used celery for several large projects and have found it to be very well written. It will also handle the multiple consumer issues for you with the appropriate configuration.
Your setup sounds good to me. And you are right: you can simply have the callback start a thread and chain that to a separate callback that, when the thread finishes, queues the response back over channel B.
Basically, your consumers should have a queue of their own (of size N, the amount of parallelism they support). When a request comes in via channel A, it should be stored in the queue shared between Pika's main thread and the worker threads in the thread pool. As soon as it is queued, Pika should respond with an ACK, and a worker thread will wake up and start processing.
Once the worker is done with its work, it would queue the result on a separate result queue and issue a callback to the main thread to send it back over channel B.
You should take care and make sure that the worker threads are not interfering with each other if they are using any shared resources, but that's a separate topic.
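One way to implement this pattern is with pika's add_callback_threadsafe() (a sketch using BlockingConnection rather than the SelectConnection from the question; queue names, the prefetch count, and do_work() are assumptions). The worker thread never touches the channel directly; it schedules the reply and the ack back onto Pika's I/O loop:

import functools
import threading
import pika

def on_request(channel, method, properties, body, connection):
    def work():
        result = do_work(body)  # long-running business logic, off the I/O thread

        def publish_and_ack():
            channel.basic_publish(exchange="", routing_key="responses", body=result)
            channel.basic_ack(delivery_tag=method.delivery_tag)

        # hand channel operations back to the connection's I/O loop
        connection.add_callback_threadsafe(publish_and_ack)

    threading.Thread(target=work, daemon=True).start()

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="requests")
channel.queue_declare(queue="responses")
channel.basic_qos(prefetch_count=4)  # N = concurrent tasks per consumer
channel.basic_consume(
    queue="requests",
    on_message_callback=functools.partial(on_request, connection=connection),
)
channel.start_consuming()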
Being inexperienced with threading, I would instead run multiple consumer processes (the number basically being your prefetch count). Each would connect to the two queues and process jobs happily, unaware of each other's existence.