Kafka Consumer poll messages with python

I have problems polling messages from Kafka in a consumer group.
My consumer object gets a given partition with
self.ps = TopicPartition(topic, partition)
and after that the consumer assigns to that partition:
self.consumer.assign([self.ps])
After that I am able to count the messages inside the partition with
self.consumer.seek_to_beginning(self.ps)
pos = self.consumer.position(self.ps)
and self.consumer.seek_to_end(self.ps)
.....
My topic contains over 30000 messages.
The problem is that I only get exactly one message.
The consumer is configured with:
max_poll_records = 200
auto_offset_reset = 'earliest'
And here is the function with which I am trying to get the messages:
def poll_messages(self):
    data = []
    messages = self.consumer.poll(timeout_ms=6000)
    for partition, msgs in six.iteritems(messages):
        for msg in msgs:
            data.append(msg)
    return data
Even if I seek to the first available offset before I start polling,
self.consumer.seek(self.ps, self.get_first_offset())
I get only one message.
I hope someone can explain what I am doing wrong.
Thanks in advance.
Best wishes
Jörn

I believe that you are misunderstanding max_poll_records - this doesn't mean you will get 200 per poll, just a limit on the most you might get. You will need to call poll multiple times. I'd refer you to the docs for simple examples: http://kafka-python.readthedocs.io/en/master/usage.html
I believe a more standard implementation is:
for message in self.consumer:
    # do stuff like:
    print(message)
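If you do want to drain the partition with poll() rather than the iterator, a sketch along these lines should work (based on the method in the question; the empty-result stop condition and the timeout are my own assumptions):
def poll_messages(self):
    data = []
    while True:
        # poll() returns {TopicPartition: [ConsumerRecord, ...]} and may return far
        # fewer than max_poll_records per call, so keep polling until it comes back empty.
        messages = self.consumer.poll(timeout_ms=1000)
        if not messages:
            break
        for partition, msgs in six.iteritems(messages):
            data.extend(msgs)
    return data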

Related

How to process messages in Kafka only once, so that a service doesn't reprocess all messages when it is restarted

This is my first time using Kafka. I am learning Kafka with a microservice architecture and I am running into the following issue.
Every time I restart my service, it processes all the messages in the topics again. Is there a way I could process those messages only once, flag them as read or something?
This is my snippet in Python 3:
class EmailStreamConsumer:
    def __init__(self, bootstrap_servers='localhost:9092'):
        self.__bootstrap_servers = bootstrap_servers
        self.__new_emails_consumer = KafkaConsumer('NewEmails', bootstrap_servers=bootstrap_servers,
                                                   auto_offset_reset='earliest')
        self.__sent_emails_consumer = KafkaConsumer('SentEmails', bootstrap_servers=bootstrap_servers,
                                                    auto_offset_reset='earliest')

    def start(self):
        for message in self.__new_emails_consumer:
            value = message.value.decode('utf-8')
            email = json.loads(value)
            self.send_email(email['content'], email['to_email'], email['title'], email['from_email'])
            print("%s:%d:%d: key=%s value=%s" % (
                message.topic, message.partition, message.offset, message.key, message.value))
I want the service to send the emails only once, even when it is restarted.
I think your problem is that you don't have a group id for your Kafka consumer.
Just add (shown here with the Java client's config):
String groupId = "kafka-new-emails";
properties.setProperty(ConsumerConfig.GROUP_ID_CONFIG, groupId);
Your application will then start reading from the latest email, because the consumer group tracks where your last commit was. Also, if you have more than one consumer and one of them goes down, the consumer group rebalances so that a consumer which is still online reads from the partition that was assigned to the one that went down.
If the consumer acknowledges to Kafka that it has read a message, then you will not have this problem.
This can be done in two ways.
Approach 1: enable auto commit, so offsets are committed automatically once we get the messages.
For this approach we need to set the property enable.auto.commit to true.
Approach 2: if we need programmatic control, we can use commitSync() and commitAsync().
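Since the question is in Python, a minimal kafka-python sketch of the same two approaches could look like this (the group id and the handle() function are placeholders, not from the original code):
from kafka import KafkaConsumer

consumer = KafkaConsumer('NewEmails',
                         bootstrap_servers='localhost:9092',
                         group_id='kafka-new-emails',    # committed offsets are stored per group
                         auto_offset_reset='earliest',
                         enable_auto_commit=False)       # set to True for Approach 1 (auto commit)

for message in consumer:
    handle(message)       # placeholder for the email-sending logic
    consumer.commit()     # Approach 2: synchronous commit; consumer.commit_async() also exists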

kafka-python read from last produced message after a consumer restart

I am using kafka-python to consume messages from a Kafka queue (Kafka version 0.10.2.0). In particular I am using the KafkaConsumer type.
If the consumer stops and is restarted after a while, I would like to restart from the latest produced message, that is, drop all the messages produced while the consumer was down.
How can I achieve this?
Thanks
Thanks,
it works!
This is a simplified version of my code:
consumer = KafkaConsumer('mytopic', bootstrap_servers=[server], group_id=group_id, enable_auto_commit=True)
#dummy poll
consumer.poll()
#go to end of the stream
consumer.seek_to_end()
#start iterate
for message in consumer:
    print(message)
consumer.close()
The documentation states that the poll() method is incompatible with the iterator interface, which I guess is the one I use in the loop at the end of my script. However, from initial testing, this code seems to work correctly.
Is it safe to use it? Or did I misunderstand the documentation?
Thanks
You will want to seekToEnd() to the end of the log.
Keep in mind that you first need to subscribe to a topic before you can seek. Also, subscribing is lazy, so you will need to add a "dummy poll" before you can seek, too.
consumer.subscribe(...)
consumer.poll() // dummy poll
consumer.seekToEnd()
// now enter your regular poll-loop
In response to your question in your answer:
It is my understanding that when you execute consumer.poll() a dictionary is returned. So, when I wanted to poll for information I used a loop to walk through the dictionary.
consumer = KafkaConsumer('mytopic', bootstrap_servers=[server], group_id=group_id, enable_auto_commit=True)
messages = consumer.poll()
data = []
for msg in messages:
    for value in messages[msg]:
        # Add just the values to the list
        data.append(value[6])
I believe what you are doing is getting the iterator with consumer = KafkaConsumer('mytopic', bootstrap_servers=[server], group_id=group_id, enable_auto_commit=True) and then walking the iterator with
#start iterate
for message in consumer:
    print(message)
It doesn't look like you are actually getting just the 500 results from the poll. You can confirm this by adding max_poll_records=5 to your KafkaConsumer configuration. Then when you run the code, if more than 5 messages print out you can tell that you aren't using the poll functionality.
Hope that helps!
Here is a convenient way to have all messages returned by a poll in a list:
while True:
    messages = []  # Store all decoded message values
    crs = []       # Store all consumer records
    tpd = consumer.poll(timeout_ms=60000, max_records=1)
    [crs.extend(records) for records in tpd.values()]        # flatten the per-partition record lists
    [messages.extend([json.loads(cr.value)]) for cr in crs]  # decode each record's value
    print(messages)

Tracking topic size and consumer lag with kafka 0.8.2.0

Since kafka 0.8.2.0 it seems to have gotten awfully hard to track consumer lag and topic sizes
How do you track offsets (topic size) and lag in kafka? Do you increment a counter somewhere when your producer inserts a message and increment another counter when your consumer acknowledges a message?
I'm using airbnb's kafka-statsd-metrics2 - but all the metrics regarding topic size are always 0 for some reason; it may be a bug to report to them, but how do you do it?
Our consumers and producers are written in Python using kafka-python. They state they don't support the ConsumerCoordinator offset APIs, so I've put together a solution that queries ZooKeeper and sends those metrics to a statsd instance (which seems very awkward), but I'm still missing the topic size metric.
We're using collectd to gather system metrics. I have no experience with JMX, and configuring it in collectd seems pretty convoluted; I've tried a few times, so I've found a few ways not to do it.
If you have any input I'd love to hear it, even if it is: "This belongs on x stackexchange-site"
If I understood you correctly, you could use HighwaterMarkOffset from the FetchResponse. This way you will know the offset at the end of the partition and will be able to compare it with your current acked offset, or with the offset of the last message in this FetchResponse, for example.
Details here
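In kafka-python terms, a rough sketch of the same idea might look like this (assuming a reasonably recent kafka-python that exposes end_offsets() and committed(); the topic, group and server names are placeholders):
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092',
                         group_id='my-group',
                         enable_auto_commit=False)
tp = TopicPartition('mytopic', 0)

end_offset = consumer.end_offsets([tp])[tp]   # high watermark, i.e. the "topic size" for this partition
committed = consumer.committed(tp)            # last committed offset for this group, or None
lag = end_offset - committed if committed is not None else end_offset
print('end offset: %d, lag: %d' % (end_offset, lag))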
Have you tried using https://github.com/quantifind/KafkaOffsetMonitor for monitoring the consumer lag? It works well with 0.8.2.0.
Here is a code snippet; make sure to run this against the active controller.
BOOTSTRAP_SERVERS is the active controller's IP.
client = KafkaAdminClient(bootstrap_servers=BOOTSTRAP_SERVERS, request_timeout_ms=300)
list_groups_request = client.list_consumer_groups()
for group in list_groups_request:
    if group[1] == 'consumer':
        list_members_in_groups = client.describe_consumer_groups([group[0]])
        (error_code, group_id, state, protocol_type, protocol, members) = list_members_in_groups[0]
        if len(members) != 0:
            for member in members:
                (member_id, client_id, client_host, member_metadata, member_assignment) = member
                member_topics_assignment = []
                for (topic, partitions) in MemberAssignment.decode(member_assignment).assignment:
                    member_topics_assignment.append(topic)
                for topic in member_topics_assignment:
                    consumer = KafkaConsumer(
                        bootstrap_servers=BOOTSTRAP_SERVERS,
                        group_id=group[0],
                        enable_auto_commit=False
                    )
                    consumer.topics()
                    for p in consumer.partitions_for_topic(topic):
                        tp = TopicPartition(topic, p)
                        consumer.assign([tp])
                        committed = consumer.committed(tp)
                        consumer.seek_to_end(tp)
                        last_offset = consumer.position(tp)
                        if last_offset is not None and committed is not None:
                            lag = last_offset - committed
                            print("group: {} topic: {} partition: {} lag: {}".format(group[0], topic, p, lag))

Pika python asynchronous publisher: how to send data from user via console?

I am using the standard asynchronous publisher example, and I noticed that the publisher will keep publishing the same message in a loop forever.
So I commented out the schedule_next_message call in publish_message to stop that loop.
But what I really want is for the publisher to start and publish only when a user gives it a "message_body" and "key",
basically for the publisher to publish the user's input.
I was not able to find any examples or hints of how to make the publisher take input from the user in real time.
I am new to RabbitMQ, pika, Python etc.
Here is the snippet of code I am talking about:
def publish_message(self):
    """If the class is not stopping, publish a message to RabbitMQ,
    appending a list of deliveries with the message number that was sent.
    This list will be used to check for delivery confirmations in the
    on_delivery_confirmations method.

    Once the message has been sent, schedule another message to be sent.
    The main reason I put scheduling in was just so you can get a good idea
    of how the process is flowing by slowing down and speeding up the
    delivery intervals by changing the PUBLISH_INTERVAL constant in the
    class.
    """
    if self._stopping:
        return
    message = {"service": "sendgrid", "sender": "nutshi#gmail.com", "receiver": "nutshi#gmail.com",
               "subject": "test notification", "text": "sample email"}
    routing_key = "email"
    properties = pika.BasicProperties(app_id='example-publisher',
                                      content_type='application/json',
                                      headers=message)
    self._channel.basic_publish(self.EXCHANGE, routing_key,
                                json.dumps(message, ensure_ascii=False),
                                properties)
    self._message_number += 1
    self._deliveries.append(self._message_number)
    LOGGER.info('Published message # %i', self._message_number)
    #self.schedule_next_message()
    #self.stop()

def schedule_next_message(self):
    """If we are not closing our connection to RabbitMQ, schedule another
    message to be delivered in PUBLISH_INTERVAL seconds.
    """
    if self._stopping:
        return
    LOGGER.info('Scheduling next message for %0.1f seconds',
                self.PUBLISH_INTERVAL)
    self._connection.add_timeout(self.PUBLISH_INTERVAL,
                                 self.publish_message)

def start_publishing(self):
    """This method will enable delivery confirmations and schedule the
    first message to be sent to RabbitMQ
    """
    LOGGER.info('Issuing consumer related RPC commands')
    self.enable_delivery_confirmations()
    self.schedule_next_message()
The site does not let me add the solution... I was able to solve my issue using raw_input().
Thanks
I know I'm a bit late to answer the question but have you looked at this one?
It seems a bit more related to what you need than using a full async publisher. Normally you would use one of those with a Python Queue to pass messages between threads.
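If it helps, here is a rough sketch of that simpler, synchronous route with a BlockingConnection, publishing whatever the user types (the default exchange and the prompts are placeholders):
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

while True:
    body = input('message body (empty to quit): ')   # use raw_input() on Python 2
    if not body:
        break
    routing_key = input('routing key: ')
    # Publish the user's input; exchange='' is the default exchange, which routes by queue name.
    channel.basic_publish(exchange='', routing_key=routing_key, body=body)

connection.close()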

How to get messages receive count in Amazon SQS using boto library in Python?

I am using the boto library in Python to get Amazon SQS messages. In exceptional cases I don't delete messages from the queue, in order to give them a couple more chances to recover from temporary failures. But I don't want to keep receiving failed messages constantly. What I would like to do is either delete messages after receiving them more than 3 times, or not receive a message if its receive count is more than 3.
What is the most elegant way of doing it?
There are at least a couple of ways of doing this.
When you read a message in boto, you receive a Message object or some subclass thereof. The Message object has an "attributes" field that is a dict containing all message attributes known by SQS. One of the things SQS tracks is the approximate # of times the message has been read. So, you could use this value to determine whether the message should be deleted or not but you would have to be comfortable with the "approximate" nature of the value.
Alternatively, you could record message ID's in some sort of database and increment a count field in the database each time you read the message. This could be done in a simple Python dict if the messages are always being read within a single process or it could be done in something like SimpleDB if you need to record readings across processes.
Hope that helps.
Here's some example code:
>>> import boto.sqs
>>> c = boto.sqs.connect_to_region()
>>> q = c.lookup('myqueue')
>>> messages = c.receive_message(q, num_messages=1, attributes='All')
>>> messages[0].attributes
{u'ApproximateFirstReceiveTimestamp': u'1365474374620',
u'ApproximateReceiveCount': u'2',
u'SenderId': u'419278470775',
u'SentTimestamp': u'1365474360357'}
>>>
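For the alternative approach mentioned above (recording message IDs and counting reads yourself), a minimal in-process sketch with a plain dict could look like this (the threshold of 3 is just an example; boto's Message exposes the SQS MessageId as message.id):
receive_counts = {}

def should_process(message, max_attempts=3):
    # Count how many times this process has seen the message id.
    count = receive_counts.get(message.id, 0) + 1
    receive_counts[message.id] = count
    return count <= max_attempts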
Another way could be to put an extra identifier at the end of the message in your SQS queue. This identifier can keep the count of the number of times the message has been read.
Also, if you don't want your service to poll these messages again and again, you can create one more queue, say a "dead message queue", and transfer any message that has crossed the threshold to this queue.
AWS has built-in support for this; just follow the steps below:
1. Create a dead letter queue.
2. Enable a redrive policy for the source queue by checking "Use Redrive Policy".
3. Select the dead letter queue you created in step 1 for "Dead Letter Queue".
4. Set "Maximum Receives" to "3" or any value between 1 and 1000.
How it works is: whenever a message is received by a worker, the receive count increments. Once it reaches the "Maximum Receives" count, the message is pushed to the dead letter queue. Note that even if you access the message via the AWS console, the receive count increments.
Source: Using Amazon SQS Dead Letter Queues
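If you would rather set this up from boto instead of the console, a sketch along these lines should work (queue names and region are placeholders; the JSON format follows the SQS RedrivePolicy attribute):
import json
import boto.sqs

conn = boto.sqs.connect_to_region('us-east-1')
source_q = conn.get_queue('myqueue')
dead_q = conn.create_queue('myqueue-dead-letter')

redrive_policy = json.dumps({
    'deadLetterTargetArn': dead_q.arn,   # ARN of the dead letter queue
    'maxReceiveCount': '3'               # same as "Maximum Receives" in the console
})
conn.set_queue_attribute(source_q, 'RedrivePolicy', redrive_policy)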
Get the ApproximateReceiveCount attribute from the message you read.
Move it to another queue (so you can manage error messages) or just delete it.
foreach (var message in response.Messages){
    try{
        var notifyMessage = JsonConvert.DeserializeObject<NotificationMessage>(message.Body);
        Global.Sqs.DeleteMessageFromQ(message.ReceiptHandle);
    }
    catch (Exception ex){
        var receiveMessageCount = int.Parse(message.Attributes["ApproximateReceiveCount"]);
        if (receiveMessageCount > 3)
            Global.Sqs.DeleteMessageFromQ(message.ReceiptHandle);
    }
}
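The same idea in Python with boto might look roughly like this (handle() is a placeholder for your own processing; the threshold of 3 mirrors the snippet above):
messages = queue.get_messages(num_messages=10, attributes='ApproximateReceiveCount')
for message in messages:
    try:
        handle(message.get_body())          # placeholder processing
        queue.delete_message(message)
    except Exception:
        receive_count = int(message.attributes['ApproximateReceiveCount'])
        if receive_count > 3:
            queue.delete_message(message)   # give up after three attempts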
It can be done in a few steps.
Create the SQS connection:
sqsconnrec = SQSConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
Create the queue object:
request_q = sqsconnrec.create_queue("queue_Name")
Load the queue messages:
messages = request_q.get_messages()
Now you have a list of message objects, and to find the total number of messages:
just do len(messages)
Should work like a charm.
