How do I scale Kafka consumers in Python?

This probably contains multiple questions, so bear with me. I am still figuring out the right way to use the Kafka architecture. I know that the partitions of a topic are divided between the consumers.
What exactly are consumers? Right now, I am thinking of writing a daemon Python process that acts as a consumer. When the consumer consumes a message from Kafka, there is a task that I have to complete. This is a huge task, so I am creating sub-tasks that run concurrently. Can I have multiple consumers (Python scripts) on the same machine?
I am working on multiple microservices, so should each microservice have its own consumer?
When the load increases, I have to scale the consumers. I thought of spawning a new machine that acts as another consumer, but I feel that I am doing something wrong here and that there has to be a better way.
Can you tell me how you scaled your consumers based on the load? Do I have to increase my partitions in topics if I need to increase my consumers? How do I do it dynamically? Can I decrease the partitions when there are fewer messages produced? How many partitions are ideal initially?
And please suggest some good practices to follow.
This is the consumer script that I am using:
while True:
    message = client.poll(timeout=10)  # client is the KafkaConsumer object
    if message is not None:
        if message.error():
            raise KafkaException(message.error())
        else:
            logger.info('received topic {topic} partition {partition} offset {offset} key {key} - {value}'.format(
                topic=message.topic(),
                partition=message.partition(),
                offset=message.offset(),
                key=message.key(),
                value=message.value()
            ))
            # run task

Can I have multiple consumers(python scripts) on the same machine?
Yes. You can also have Python threads, though.
If you're not consuming multiple topics, then there is no need for multiple consumers.
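For what it's worth, the usual way to run several consumer processes against the same topic is to start each one with the same consumer group id, so the broker divides the partitions between them. A minimal sketch, assuming the confluent-kafka client; the broker address, group id and topic name are placeholders:

from confluent_kafka import Consumer

client = Consumer({
    'bootstrap.servers': 'localhost:9092',  # assumed broker address
    'group.id': 'my-task-group',            # same group id in every process/script
    'auto.offset.reset': 'earliest',
})
client.subscribe(['my-topic'])
# ...then poll() in a loop, as in the script above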
What exactly are consumers?
Feel free to read over the Apache Kafka site...
each microservice has its own consumer?
Is each service running similar code? Then yes.
I thought of spawning a new machine
Spawn new instances of your app on the same machine first. Monitor CPU, memory, and network load. Don't get new machines until at least one of those stays above, say, 70% under normal processing.
Do I have to increase my partitions in topics if I need to increase my consumers?
In general, yes. The number of consumers in a consumer group is limited by the number of partitions in the subscribed topics.
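If you do need to raise the partition count later, it can be done programmatically. A hedged sketch, assuming the confluent-kafka admin client; the broker address, topic name and target count are placeholders:

from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({'bootstrap.servers': 'localhost:9092'})
# partition counts can only be increased, never decreased
futures = admin.create_partitions([NewPartitions('my-topic', new_total_count=12)])
for topic, future in futures.items():
    future.result()  # raises if the broker rejected the change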
Can I decrease the partitions when there are fewer messages produced?
No. Partitions cannot be decreased.
When the load increases I have to scale the consumers
Not necessarily. Is the increased load constantly rising, or does it come in waves? If it is variable, you can let Kafka buffer the messages, and the consumer will keep polling and processing as fast as it can.
You need to define your SLAs for how long a message will take to process after reaching a topic from a producer.
How many partitions are ideal initially?
There are multiple articles on this, and it depends specifically on your own hardware and application requirements. If you are simply logging each message, you could have thousands of partitions...
When the consumer consumes a message from Kafka, there is a task that I have to complete
Sounds like you might want to look at Celery, not necessarily just Kafka. You could also look at Faust for Kafka processing.
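For reference, a Faust worker for this kind of per-message task could look roughly like the sketch below; the app name, broker URL and topic name are placeholders, not from the original question:

import faust

app = faust.App('task-runner', broker='kafka://localhost:9092')
topic = app.topic('my-topic')

@app.agent(topic)
async def process(stream):
    async for message in stream:
        # run the task (or kick off sub-tasks) for this message here
        ...

# started with:  faust -A this_module worker -l info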

Related

schedule tasks based purely on idleness, instead of data communication

Using distributed to schedule lots of interdependent tasks, running on Google Compute Engine. When I start an extra instance with workers halfway through, no tasks get scheduled to it (though it registers fine with the scheduler). I presume this is because of the following (from http://distributed.readthedocs.io/en/latest/scheduling-state.html#distributed.scheduler.decide_worker):
"If the task requires data communication, then we choose to minimize the number of bytes sent between workers. This takes precedence over worker occupancy."
Once I'm halfway through running the task tree, all remaining tasks depend on the results of tasks which have already run. So, if I interpret the above quote correctly, nothing will ever be scheduled on the new workers, no matter how idle they are, as the dependent data is never already there but always on an 'old' worker.
However, I do make sure the amount of data to transfer is fairly minimal, usually just a small string. So in this case it would make much more sense to let idleness prevail over data communication. Would it be possible to allow this (e.g. by setting a 'scheduler policy')? Or maybe even have a data-vs-idleness tradeoff coefficient which could be tuned?
Update after comment #1:
Complicating factor: every task uses the resources framework to make sure it either runs on the set of workers for CPU-bound tasks ("CPU=1") or on the set of workers for network-bound tasks ("NET=1"). This separation was made to avoid overloading the up/download servers and to restrict up/download tasks to a certain maximum, while still being able to scale the other tasks. However, according to http://distributed.readthedocs.io/en/latest/work-stealing.html, task stealing will not happen in these cases? Is there a way to allow task stealing while keeping the resource restrictions?
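For context, the resource-restricted setup described above would typically look something like the sketch below; the scheduler address and the task functions are placeholders, only the 'CPU'/'NET' labels mirror the question:

from dask.distributed import Client

# workers would be started with, e.g.:
#   dask-worker scheduler:8786 --resources "CPU=1"
#   dask-worker scheduler:8786 --resources "NET=1"
client = Client('scheduler:8786')  # placeholder scheduler address

def crunch_numbers(x):   # stand-in for a CPU-bound task
    return x * x

def download_file(url):  # stand-in for a network-bound task
    return url

cpu_future = client.submit(crunch_numbers, 21, resources={'CPU': 1})
net_future = client.submit(download_file, 'http://example.com/f', resources={'NET': 1})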
Update 2: I see there is an open issue for that: https://github.com/dask/distributed/issues/1389. Are there plans to implement this?
While Dask prefers to schedule work to reduce communication, it also acknowledges that this isn't always best. Generally, Dask runs a task on the machine where it will finish first, taking into account both communication costs and existing task backlogs on overloaded workers.
For more information on load balancing you might consider reading this documentation page:
http://distributed.readthedocs.io/en/latest/work-stealing.html

In Celery are there significant performance implications of using many queues

Are there substantial performance implications that I should keep in mind when Celery workers are pulling from multiple (or perhaps many) queues? For example, would there be a significant performance penalty if my system were designed so that workers pulled from 10 to 15 queues rather than just 1 or 2? As a follow-up, what if some of those queues are sometimes empty?
The short answer to your question on queue limits is:
Don't worry: having multiple queues will not be worse or better; brokers are designed to handle huge numbers of them. Of course, in a lot of use cases you don't need so many, except in really advanced ones. Empty queues don't create any problem; they just take a tiny amount of memory on the broker.
Don't forget that you also have other things like exchanges and bindings. There are no real limits there either, but it is better to understand the performance implications of each of them before using them (a topic exchange will use more CPU than a direct one, for example).
To give you a more complete answer, let's look at the performance topic from a more generic point of view.
When looking at a distributed system based on message passing, like Celery, there are two main topics to analyze from the point of view of performance:
The number of workers and the concurrency factor.
As you probably already know, each Celery worker has a concurrency parameter that sets how many tasks can be executed at the same time. This should be set in relation to the server capacity (CPU, RAM, I/O) and, of course, also based on the type of tasks that the specific consumer will execute (which depends on the queue it consumes).
Of course, depending on the total number of tasks you need to execute in a certain time window, you will need to decide how many workers/servers to have up and running.
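For reference, the concurrency setting mentioned above is typically applied like this (a minimal sketch, assuming a standard Celery app module; the broker URL and numbers are placeholders to be tuned against server capacity):

from celery import Celery

app = Celery('proj', broker='amqp://localhost//')
app.conf.worker_concurrency = 8           # tasks this worker may run at the same time
app.conf.worker_prefetch_multiplier = 1   # don't reserve more messages than it can chew
# or per worker from the command line:
#   celery -A proj worker --concurrency=8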
The broker, the single point of failure in this architecture style.
The broker, especially RabbitMQ, is designed to manage millions of messages without any problem; however, the more messages it needs to store, the more memory it will use, and the more messages it has to route, the more CPU it will use.
This machine should be well tuned too and, if possible, set up for high availability.
Of course, the main thing to avoid is messages being consumed at a lower rate than they are produced; otherwise your queue will keep growing and your RabbitMQ will explode. Here you can find some hints.
There are cases where you may also need to increase the number of tasks executed in a certain time frame, but only in response to peaks of requests. The nice thing about this architecture is that you can monitor the size of the queues and, when you see a queue growing too fast, create new machines on the fly with a Celery worker already configured, then turn them off when they are no longer needed. This is quite a cost-saving and efficient approach.
One hint: remember not to store Celery task results in RabbitMQ.
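A small sketch of that hint, assuming the same kind of Celery app; the Redis URL is only an example of a dedicated result store, not something from the original answer:

from celery import Celery

app = Celery('proj', broker='amqp://localhost//')
app.conf.task_ignore_result = True        # simplest option: keep no results at all
# or, if results are actually needed, use a dedicated backend instead of RabbitMQ:
# app.conf.result_backend = 'redis://localhost:6379/0'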

Pika with RabbitMQ: Message distribution accross multiple consumer Applications from a single Queue

Environment:
Python, pika, RabbitMQ.
I have a queue that already holds some 100 messages.
When 2 consumer applications are started one after the other, all the pre-existing messages are processed by the first consumer and are not distributed between the two consumers that are up and waiting for messages.
However, any new messages put onto the queue are distributed between both consumers.
The problem is, if the consumer takes a long time to process, the load stays on one consumer until the initial pre-existing messages have been consumed from the queue.
But if Consumer1 is killed, the messages get delivered to Consumer2 (which is expected).
I am using SelectConnection,
prefetch_count = (tried both 0 and 1),
prefetch_size = 0,
no_ack = False.
Is there a way to configure it in such a way that the pre-existing messages on the queue will be shared across multiple consumers, even if the consumers are started at different times (e.g. adding more consumers as the load increases)?
Any help is appreciated.
Thank you.
I was able to fix it by moving the basic_qos call that sets the prefetch count to 1 into the on_channel_create callback method.
For some reason, setting the prefetch value to 1 just before the basic_consume call is not good enough; it must be something to do with pika's I/O loop.
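A rough sketch of that fix, assuming pika's SelectConnection; the callback names, queue name and message handler here are illustrative, not taken from the original code:

import pika

def handle_message(channel, method, properties, body):
    # placeholder for the real processing; ack once the work is done
    channel.basic_ack(delivery_tag=method.delivery_tag)

def on_channel_open(channel):
    # set the prefetch limit as soon as the channel opens, before consuming,
    # so pre-existing messages are dealt out one at a time across consumers
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(queue='task_queue', on_message_callback=handle_message)

def on_connection_open(connection):
    connection.channel(on_open_callback=on_channel_open)

connection = pika.SelectConnection(pika.ConnectionParameters('localhost'),
                                   on_open_callback=on_connection_open)
connection.ioloop.start()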

celery tasks, workers and queues organization

I have some independent tasks which I am currently putting into different/independent workers.
To make this easy to understand, I will walk you through an example. Let's say I have three independent tasks, namely sleep, eat, and smile. A task may need to run under different Celery configurations, so I think it is better to separate each of these tasks into different directories with different workers. Some tasks may be required to run on different servers.
I am planning to add some more tasks in the future, and each of them will be implemented by a different developer.
Given these conditions, there is more than one worker associated with each individual task.
Now, here is the problem and my question.
When I start three smile tasks, one of them will be fetched by smile's worker and carried out, but the next task will be fetched by eat's worker and never carried out.
So, what is the accepted, most common pattern? Should I send each task to a different queue, with each worker listening on its own queue?
The answer depends on a couple of things that should be taken into consideration:
Should the order of commands be preserved?
If so, the best approach is using some sort of command pattern as a serialized message, so that each fetched/consumed message can be executed in its order in a single place in your application.
If it's not an issue for you, you can play with a topic exchange, publishing different message types into a single exchange and having different workers receive the messages by a predefined pattern. This, by the way, will let you easily add another task, let's say "drink", without changing a line in the already existing transport topology or the already existing workers.
Are you planning to scale queues across different machines to increase throughput?
In case you have very intense task traffic (in terms of frequency), it may be worth creating a different queue for each task type, so that later, when you grow, you can place each one on a different node in the Rabbit cluster.
In a similar setup, I decided to go with specific queues for different tasks, and then I can decide which worker listens on which queue (which can also be changed dynamically!).
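A minimal sketch of that per-task-queue setup, assuming tasks named after the example above live in a tasks module; the project name and broker URL are placeholders:

from celery import Celery

app = Celery('proj', broker='amqp://localhost//')
app.conf.task_routes = {
    'tasks.sleep': {'queue': 'sleep'},
    'tasks.eat':   {'queue': 'eat'},
    'tasks.smile': {'queue': 'smile'},
}
# each worker then listens only on its own queue, e.g.:
#   celery -A proj worker -Q smile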

What is the optimal way to organize infinitely looped work queue?

I have about 1000-10000 jobs which I need to run on a constant basis, each minute or so. Sometimes a new job comes in or another needs to be cancelled, but that's a rare event. Jobs are tagged and must be distributed among workers, each of which processes only jobs of a specific kind.
For now I want to use cron and load the whole database of jobs into some broker -- RabbitMQ or beanstalkd (I haven't decided which one to use yet).
But this approach seems ugly to me (using a timer to simulate infinity, loading the whole database, etc.) and has a disadvantage: for example, if some kind of job is processed more slowly than it is added to the queue, the broker may be overwhelmed and will eat all the RAM and swap, and then just halt.
Are there any other possibilities? Am I not using the right pattern for the job? (Maybe I don't need a queue at all..?)
P.S. I'm using Python, if that is important.
You create your initial batch of jobs and add them to the queue.
You have n consumers of the queue, each running the jobs. Adding consumers to the queue simply round-robins the distribution of jobs to each listening consumer, giving you arbitrary horizontal scalability.
Each job can, upon completion, be responsible for resubmitting itself back to the queue (a sketch of this pattern follows after this answer). This means that your job queue won't grow beyond the length it was when you initialised it.
The master job can, if need be, spawn sub-jobs and add them to the queue.
For different types of jobs it is probably a good idea to use different queues. That way you can balance the load more effectively by having different quantities/horsepower of workers running the jobs from the different queues.
The fact that you are running Python isn't important here, it's the pattern, not the language that you need to nail first.
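A minimal sketch of the "job resubmits itself" pattern from this answer, assuming a RabbitMQ broker on localhost and pika; the queue name and run_job are placeholders:

import json
import pika

def run_job(job):
    pass  # placeholder for the actual work a job does

def handle_job(channel, method, properties, body):
    job = json.loads(body)
    run_job(job)
    # re-submit the same job, so the queue never grows beyond its initial size
    channel.basic_publish(exchange='', routing_key='jobs', body=json.dumps(job))
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='jobs', durable=True)
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue='jobs', on_message_callback=handle_job)
channel.start_consuming()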
You can use an asynchronous framework, e.g. Twisted.
I also don't think it's a good idea to run a script from the cron daemon each minute (you mentioned the reasons yourself), so I suggest Twisted. It doesn't give you any benefit with scheduling, but you get flexibility in process management and memory sharing.
