I have a consumer that consumes from two topics with two different avro schemas. The number of topics can grow in the future, so unioning schemas to read from multiple topics is not scalable.
from confluent_kafka import Consumer

# assume config for group.id and bootstrap.servers is set
consumer = Consumer(config)
consumer.subscribe(["randomtopic1", "randomtopic2"])

msg = consumer.poll()
msg_value = msg.value()  # should be bytes or str
How do I identify/extract which topic the message is from, without deserializing it with the Avro schema? I'm wondering if the topic name is encoded in any Avro headers. I came across this link where they read the value as raw bytes and somehow extract the topic name, and only then deserialize the message value with the schema.
There are no "Avro headers" in the consumed records. Kafka record headers don't have the topic name (usually).
You can simply use message.topic(), then use an if statement against it to process the topics differently.
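For example, a minimal sketch building on the snippet above (the per-topic deserializer functions are hypothetical placeholders):

from confluent_kafka import Consumer

consumer = Consumer(config)  # same config as above
consumer.subscribe(["randomtopic1", "randomtopic2"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # msg.topic() returns the name of the topic the record was consumed from
    if msg.topic() == "randomtopic1":
        record = deserialize_with_schema_1(msg.value())  # hypothetical deserializer
    elif msg.topic() == "randomtopic2":
        record = deserialize_with_schema_2(msg.value())  # hypothetical deserializer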
I am not too sure about Python, but as per the Confluent recommendation, if you are using a single consumer that needs to subscribe to multiple topics and handle multiple data types at the same time, it is advised to use GenericRecord instead of a specific code-binding type.
You can find sample .NET code here: https://github.com/confluentinc/confluent-kafka-dotnet/blob/master/examples/AvroGeneric/Program.cs
I have a Java Kafka Streams application and a Python application. The Java application produces data and the Python consumer consumes it. When processing.guarantee is set to exactly_once, the Python consumer is not able to deserialize the data; deserialization fails.
I tried a Java consumer and it reads the data successfully. Then I turned processing.guarantee back to at_least_once in the Java application, and now the Python application can read without any issue.
I checked the payload from a console consumer, and in both the exactly_once and at_least_once cases the payload looks the same. Even the binary payload read by the Python consumer before deserialization looks the same in both cases. What could be the problem in this scenario?
Note: in my case Kafka doesn't have the at least 3 brokers suggested in the documentation for exactly_once to work; there is only one broker in my setup.
Can anyone throw some light on why the Java consumer works but the Python consumer doesn't?
Update: Looking deeper at the Python logs, it seems two records are being processed by the Python consumer:
The original record, which is processed perfectly fine.
An empty record; the log shows key = b'\x00\x00\x00\x01' and value = b'\x00\x00\x00\x00\x00\x00'. But now I am wondering how this additional record is sent when exactly_once is set.
Below is the Python code used.
from kafka import KafkaConsumer

params = {
    "bootstrap_servers": "localhost:29092",
    "auto_offset_reset": "latest",
    "group_id": "test",
}

def set_consumer(self):
    try:
        consumer = KafkaConsumer(*self.topics, **self.consumer_params)
        return consumer
    except Exception as e:
        print(e)

for msg in self.consumer:
    try:
        event = self.decode_msg(msg)
        self.logger.info("Json result : %s", str(event))
    except Exception as e:
        self.logger.error("Failed to decode message: %s", e)
"wondering how this additional record is sent when exactly_once is set"
It is a transaction marker. The Java consumer is able to detect these and filter them out, but in Python, your deserializer will need to handle them separately. There is a GitHub issue thread that suggests the consumer should already be able to filter the transaction records; maybe check the librdkafka docs in case you are missing a configuration for this.
I see there is an EOS example in the confluent-kafka-python repo, but it doesn't consume after the producer sends the transaction records.
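As a starting point, here is a minimal sketch using confluent-kafka-python (librdkafka), assuming switching clients is an option; the relevant librdkafka setting is isolation.level, and the topic name below is hypothetical:

from confluent_kafka import Consumer

# With "read_committed", only records from committed transactions are returned,
# and transaction control markers are handled by the client rather than
# surfacing as empty records in your application.
consumer = Consumer({
    "bootstrap.servers": "localhost:29092",
    "group.id": "test",
    "auto.offset.reset": "latest",
    "isolation.level": "read_committed",
})
consumer.subscribe(["my-topic"])  # hypothetical topic name

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(msg.key(), msg.value())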
While playing with GCP Pub/Sub, I need to keep an eye on my topics and retrieve the number of undelivered messages. That works pretty well with this snippet of Google Query Monitoring: Link.
But I need to group my messages by attributes. Each message has a body with params like {'target': 'A'}, and I really need to get something like this:
msg.target | undelivered messages
-----------+---------------------
A          | 34
B          | 42
C          | 42
I haven't managed to access this without consuming the messages.
This is my first try :
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

project_id = "xxxx"
subscription_id = "xxxx"
subscription_path = subscriber.subscription_path(project_id, subscription_id)

response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 9999}
)

ack_ids = [r.ack_id for r in response.received_messages]
subscriber.modify_ack_deadline(
    request={
        "subscription": subscription_path,
        "ack_ids": ack_ids,
        "ack_deadline_seconds": 0,  ## The message will be immediately 'pullable' again?
    }
)

messages = [json.loads(r.message.data.decode()) for r in response.received_messages]
for m in messages:
    ## Parse all messages to get my needed counts
    ...
But it's not working very well: I get a random number of messages each time, so it's impossible to be sure of what I'm looking at.
So here is where I am with my experiments.
I see 3 possible ways:
Maybe it's possible to access message body attributes directly from Google Query Monitoring?
Maybe my method to consume / parse / release all the messages is not correctly written, and that's why it's not working well?
Maybe I'm going about this all wrong and it would be more efficient to create many topics instead of keeping attributes in the message body, OR there is another way to "tag" messages so they can be grouped later in Monitoring?
Do you have any idea how to do this?
Thanks a lot in advance for your help!
The first thing to note is that the number of undelivered messages is a property of a subscription, not a topic. If there are multiple subscriptions to the same topic, then the number of undelivered messages could be different. There is no way in the Google Query Monitoring system to break down messages by attributes; it does not have any introspection into the contents of the backlog of messages, only to the metadata that is the number of messages.
The code as you have it has several things that make it problematic for trying to determine the number of messages remaining:
Synchronous pull can only return up to 1000 messages, so setting max_messages to 9999 messages will never give you that many messages.
Even with max_messages set to 1000, there is no guarantee that 1000 messages will be returned, even if there are 1000 messages that have not yet been delivered. You would need to issue multiple pull requests in order to fetch all of the messages. Of course, since you nack the messages (by doing a modify_ack_deadline with 0), messages could be redelivered and therefore double counted.
Even though you do the modify_ack_deadline request to nack the messages, while the messages are outstanding to this monitor, they are not available for delivery to your actual subscriber, which delays processing. Furthermore, consider the situation where your monitor crashes for some reason before it gets to perform the modify_ack_deadline. In this situation, those messages would not be delivered to your actual subscriber until the ack deadline you configured in the subscription had passed. If your application is latency-sensitive in any way, this could be a problem.
A different approach to consider would be to create a second subscription and have a monitoring application that receives all messages. For each message, it looks at the attribute and counts it as a received message for that attribute, then acknowledges the message. You could report this count per attributed breakdown via a custom metric. In your actual subscriber application, you would also create a custom metric that counts the number of messages received and processed per attribute. To compute the number of messages remaining to process per attribute, you would take the difference of these two numbers.
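A rough sketch of such a monitoring subscriber, assuming a second subscription on the same topic (its name is hypothetical) and that the target lives in the JSON body as in your example:

import json
from collections import Counter
from google.cloud import pubsub_v1

project_id = "xxxx"
monitoring_subscription_id = "xxxx-monitoring"  # hypothetical second subscription

# Running counts of messages received per target; report these as a custom metric.
received_per_target = Counter()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, monitoring_subscription_id)

def callback(message):
    try:
        body = json.loads(message.data.decode("utf-8"))
        received_per_target[body.get("target", "unknown")] += 1
    finally:
        message.ack()  # always ack: this subscription exists only for counting

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull_future.result()  # blocks; export received_per_target periodically
except KeyboardInterrupt:
    streaming_pull_future.cancel()

Your actual subscriber would keep its own per-target counter of processed messages, and the difference between the two metrics approximates the per-target backlog.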
Alternatively, you could consider separating messages per attribute into different topics. However, there are a few things to consider:
Is the set of attributes fixed and known in advance? If not, how will your subscriber know which subscriptions to subscribe to?
How big is the set of attributes to be checked? There is a limit of 10,000 topics per project and so if you have more attributes than that, this approach will not work.
Are you using flow control to limit how many messages are being processed by your subscriber simultaneously? If so, is the number of messages per attribute uniform? If not, you may have to consider how to divide up the flow control across the subscribers on the different subscriptions.
Generic explanation: my application consumes messages from a topic and then splits them into separate topics according to their ID, so the topics are named like topic_name_id. My goal is to connect those new topics to a certain sink (S3 or Snowflake, haven't decided) so that the messages published to those topics end up there. However, I've only found ways to do this using a configuration file, where you connect the sink to a topic that already exists and whose name you know. But here the goal would be to connect the sink to a topic created during the process. Is there a way this can be achieved?
If the above is not possible, is there a way to connect to the common topic with all the messages, but create different tables (in Snowflake) or S3 directories according to the message ID? Adding to that, in the case of S3, the messages are added as individual JSON files, right? Is there no way to combine them into one file?
Thanks
The outgoing IDs are known, right?
Kafka Connect exposes a REST API: you generate a JSON HTTP body using those IDs and the finalized topic names, then use requests, for example, to publish it and start connectors for those topics. You can do that from the process itself right before starting the producer, or you can send a request with the ID/topic name to a Lambda job instead, which communicates with the Connect API.
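For example, a rough sketch of creating an S3 sink connector per topic over the Connect REST API; the worker URL, bucket, region, and flush size are hypothetical, and the exact config keys depend on your connector version:

import requests

connect_url = "http://localhost:8083/connectors"  # hypothetical Connect worker URL
topic_name = "topic_name_123"  # e.g. topic_name_<id>, built from the known ID

connector = {
    "name": f"s3-sink-{topic_name}",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": topic_name,
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",
    },
}

response = requests.post(connect_url, json=connector)
response.raise_for_status()  # 201 Created means the connector was accepted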
When using different topics with the S3 Sink connector, there will be separate S3 paths and separate files, based on the number of partitions in the topic and the other partitioner settings defined in your connector properties. Most S3 processes are able to read full S3 prefixes, though, so I don't imagine that being an issue.
I don't have experience with the Snowflake connector to know how it handles different topic names.
I'm new to using Kafka and Elasticsearch, and I've run into some problems. I've put together a docker-compose file with all the images needed to build the environment. Using Kafka, I produce data into a specific topic, and then I need to take the data from a Kafka consumer into a pub/sub flow that sends it for ingestion into Elasticsearch.
I implemented all of this using Python. Elasticsearch responds fine on its localhost port, but for Kibana the page shows the following message:
kibana server is not ready yet
The Python consumer with which I take data from a topic looks something like this:
from kafka import KafkaConsumer
# Import sys module
import sys
# Import json module to deserialize data
import json

# Initialize consumer variable and set property for JSON decode
consumer = KafkaConsumer('JSONtopic',
                         bootstrap_servers=['localhost:9092'],
                         value_deserializer=lambda m: json.loads(m.decode('utf-8')))

for message in consumer:
    print("Consumer records:\n")
    print(message)
    print("\nReading from JSON data\n")
    print("Name:", message[6]['name'])   # index 6 of a ConsumerRecord is the value
    print("Email:", message[6]['email'])
    # Terminate the script
    sys.exit()
The goal is to use Elasticsearch for analysis, so I need to use it as a backend and visualize the data in Kibana. A tutorial to follow for understanding how to link all of this together would also be really appreciated.
(P.S. the data flows without problems from one topic to another; the problem is taking this information, inserting it into Elasticsearch, and being able to visualize it.)
If you're pushing data from Kafka to Elasticsearch, then doing it with the Consumer API is typically not a good idea, since there are tools that do it much better and handle more functionality.
For example:
Kafka Connect (e.g. 🎥 https://rmoff.dev/kafka-elasticsearch-video)
Logstash
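To give a flavour of the Kafka Connect route, a minimal Elasticsearch sink connector could be registered over the Connect REST API roughly like this; the worker URL and connection settings are assumptions, and the video above walks through the details:

import requests

# Hypothetical Connect worker URL and Elasticsearch address; adjust to your setup.
connector = {
    "name": "elasticsearch-sink-jsontopic",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "JSONtopic",
        "connection.url": "http://localhost:9200",
        "key.ignore": "true",     # document id derived from topic+partition+offset
        "schema.ignore": "true",  # plain JSON values without a declared schema
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()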
From the docs [1]: https://github.com/mumrah/kafka-python
# To send messages asynchronously
producer = SimpleProducer(kafka, async=True)
producer.send_messages("my-topic", "async message")
# To consume messages
consumer = SimpleConsumer(kafka, "my-group", "my-topic")
for message in consumer:
    print(message)
Where did the "my-group" come from? How do I set "my-group"?
According to the docstring of SimpleConsumer, group is "a name for this consumer, used for offset storage and must be unique".
So it's just a unique name you give to a consumer.
Groups are used by Kafka to coordinate multiple consumers reading from the same topic. If multiple consumers use the same group name, it tells Kafka that they want to share the consumer load, and each message will only be delivered to one of the group members.
If you don't already know that you want groups, you probably don't. Just make sure the name is unique so that Kafka sends your consumer all available messages for the topic.
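To illustrate with the current kafka-python API (KafkaConsumer rather than the long-deprecated SimpleConsumer), the group is just a string you pass as group_id; the topic and broker address below are placeholders:

from kafka import KafkaConsumer

# Run this script twice with the same group_id and the two processes will split
# the topic's partitions between them; give each a unique group_id and each one
# will receive every message independently.
consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="localhost:9092",
    group_id="my-group",
)

for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)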
As described above, every consumer group receives the full stream, but within a group each message is delivered to only one consumer. For more information, check http://kafka.apache.org/documentation.html#introduction.
As of the latest version of kafka-python (0.9.3 at the time of writing), consumer groups don't work properly with SimpleConsumer.