From the docs [1]: https://github.com/mumrah/kafka-python
# To send messages asynchronously
producer = SimpleProducer(kafka, async=True)
producer.send_messages("my-topic", "async message")
# To consume messages
consumer = SimpleConsumer(kafka, "my-group", "my-topic")
for message in consumer:
    print(message)
Where did the "my-group" come from? How do I set "my-group"?
According to the docstring of SimpleConsumer, group is "a name for this consumer, used for offset storage and must be unique".
So it's just a unique name you give to a consumer.
Groups are used by Kafka to coordinate multiple consumers reading from the same topic. If multiple consumers use the same group name, it tells Kafka that they want to share the consumer load, and each message will be delivered to only one member of the group.
If you don't already know that you want groups, you probably don't. Just make sure the name is unique so that Kafka sends your consumer all available messages for the topic.
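If you do want load sharing, here is a minimal sketch with the newer KafkaConsumer API (which, unlike SimpleConsumer, supports groups properly); the broker address, topic, and group name are placeholders. Run the script twice with the same group_id and the two processes split the partitions between them; give each a different group_id and each receives every message:
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="my-group",                 # consumers sharing this name share the load
)
for message in consumer:
    print(message.value)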
As described before, every message should be delivered only once to every consumer group. For more information, check http://kafka.apache.org/documentation.html#introduction.
As of the latest version of kafka-python (0.9.3 at the time of writing), consumer groups do not work properly with SimpleConsumer.
We tried to use one Telegram client to continuously stream messages from a list of channels and then produce the messages to Kafka. We then have a second Telegram client that consumes the messages and downloads the associated media (photos/videos) using client.download_media(). Our issue is that this only works when clients 1 and 2 are the same account, not when they are different accounts. We are not sure if this has to do with the session files, the access hash, or something else.
Is support for our use case possible? The main thing we are trying to address is that the async media download could result in a large backlog, and that backlog may be lost if our server dies. That's why we wanted to put the messages into Kafka for short-term storage in the first place. We would also appreciate any better suggestions.
This is the producer side:
async with client:
    messages = client.iter_messages(channel_id, limit=10)
    async for message in messages:
        print(message)
        if message.media is not None:
            # orig_media = message.media
            # converted_media = BinaryReader(bytes(orig_media)).tgread_object()
            # print('orig media', orig_media)
            # print('converted media', converted_media)
            message_bytes = bytes(message)  # convert the message to bytes
            producer.produce(topic, message_bytes)
This is the consumer side, with a different client:
with self._client:
    # telethon.errors.rpcerrorlist.FileReferenceExpiredError: The file reference has expired and is no longer
    # valid or it belongs to self-destructing media and cannot be resent (caused by GetFileRequest)
    try:
        self._client.loop.run_until_complete(self._client.download_media(orig_media, in_memory))
    except Exception as e:
        print(e)
Media files (among many other things in Telegram) contain an access_hash. While Account-A and Account-B will both see media with ID 1234, Account-A may have a hash of 5678 and Account-B may have a hash of 8765.
This is a roundabout way of saying that every account sees an access_hash that is only valid within that account. If a different account attempts to use that hash, the request will fail, because that account needs its own hash.
There is no way to bypass this, other than giving the other account actual access to the right media files (or whatever it is) so that it can obtain its own hash.
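For the Kafka pipeline above, one hedged workaround (assuming Account-B actually has access to the channels, e.g. it has joined them) is to ship only a lightweight reference through Kafka and let the consumer account re-fetch the message with its own client, so it obtains its own valid access_hash. The record format and helper below are made up for illustration:
import json

async def handle_record(client, record_value: bytes):
    ref = json.loads(record_value)          # e.g. {"channel": "some_channel_username", "message_id": 1234}
    # Resolving by username (or an ID already known to Account-B's session) is the safest option.
    msg = await client.get_messages(ref["channel"], ids=ref["message_id"])
    if msg is not None and msg.media is not None:
        await client.download_media(msg)    # uses Account-B's own access_hash, so no hash mismatch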
I have a basic Web API written in Node.js that writes an object as an HSET to a Redis cache. Both are running in docker containers.
I have a Python script running on the same VM which needs to watch the Redis cache and then run some code when there is a new HSET or a change to a field in the HSET.
I came across Redis Pub/Sub but I'm not sure if this is really the proper way to use it.
To test, I created two Python scripts. The first subscribes to the messaging system:
import redis
import json

print("Redis Subscriber")

redis_conn = redis.Redis(
    host='localhost',
    port=6379,
    password='xxx',
    charset="utf-8",
    decode_responses=True)

def sub():
    pubsub = redis_conn.pubsub()
    pubsub.subscribe("broadcast")
    for message in pubsub.listen():
        if message.get("type") == "message":
            data = json.loads(message.get("data"))
            print(data)

if __name__ == "__main__":
    sub()
The second publishes to the messaging system:
import redis
import json

print("Redis Publisher")

redis_conn = redis.Redis(
    host='localhost',
    port=6379,
    password='xxx',
    charset="utf-8",
    decode_responses=True)

def pub():
    data = {
        "message": "id:3"
    }
    redis_conn.publish("broadcast", json.dumps(data))

if __name__ == "__main__":
    pub()
I will rewrite the publisher in Node.js, and it will simply publish the HSET key, like id:3. The subscriber will run in Python, and when it receives a new message, it will use that HSET key "id:3" to look up the actual HSET and do stuff.
This doesn't seem like the right way to do this, but Redis WATCH doesn't support HSET. Is there a better way to accomplish this?
This doesn't seem like the right way to do this but Redis watch doesn't support HSET.
Redis WATCH does support hash keys - it just does not support individual hash fields.
Is there a better way to accomplish this?
While I believe your approach may be acceptable for certain scenarios, pub/sub messages are fire-and-forget: your subscriber may disconnect for whatever reason right after the publisher has published a message but before it has had the chance to read it - and the notification of your object write will thus be lost forever, even if the subscriber automatically reconnects afterwards.
You may opt instead for Redis streams, which allow you to add entries to a given stream (resembling the publishing step of your code) and consume them (akin to your subscriber script) through a process which preserves the messages.
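For illustration, a minimal redis-py sketch of that streams approach (the stream, group, and consumer names are made up) could look like this; entries are kept until they are acknowledged, so a restarted subscriber can pick up where it left off:
import redis

r = redis.Redis(host='localhost', port=6379, password='xxx', decode_responses=True)

STREAM, GROUP, CONSUMER = "hset-changes", "workers", "worker-1"  # placeholder names

# Producer side: append the changed HSET key to the stream (instead of PUBLISH).
r.xadd(STREAM, {"key": "id:3"})

# Consumer side: create the group once, then read and acknowledge entries.
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

while True:
    entries = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=10, block=5000)
    for _, records in entries or []:
        for entry_id, fields in records:
            print("changed hash:", fields["key"])   # look up the HSET and do stuff here
            r.xack(STREAM, GROUP, entry_id)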
As an alternative, perhaps simpler, approach, you may just split your hashes into multiple keys, one per field, so that you can WATCH them.
You might want to take a look at key-space notifications. Key-space notifications can automatically publish messages via Pub/Sub when a key is changed, added, deleted, etc.
You can choose to consume events, e.g. HSET was called, and be provided the key name it was called upon. Or you can choose to consume keys, e.g. my:awesome:key, and be notified of what event happened. Or both.
You'll need to turn key-space notifications on in order to use them:
redis.cloud:6379> CONFIG SET notify-keyspace-events KEA
You can subscribe to all events and keys like this:
redis.cloud:6379> PSUBSCRIBE '__key*__:*'
"pmessage","__key*__:*","__keyspace#0__:foo","set"
"pmessage","__key*__:*","__keyevent#0__:set","foo"
Hope that helps!
I have a Java Kafka Streams processing application and a Python application. The Java application produces data and the Python consumer consumes it. When processing.guarantee is set to exactly_once, the Python consumer is not able to deserialize the data; deserialization fails.
I tried a Java consumer, and the Java consumer reads the data successfully. Then I switched processing.guarantee back to at_least_once in the Java application; now the Python application can read without any issue.
I checked the payload from a console consumer, and in both the exactly_once and at_least_once cases the payload looks the same. Even the binary payload read at the Python consumer before deserialization looks the same in both cases. What could be the problem in this scenario?
Note: in my case Kafka does not have the at least 3 brokers that the documentation suggests for exactly_once to work; there is only one broker in my setup.
Can anyone shed some light on why the Java consumer works but the Python consumer does not?
Update: looking deeper at the Python logs, it looks like two records are being processed by the Python consumer:
The original record, which is processed perfectly fine.
An empty record; the log shows key = b'\x00\x00\x00\x01' and value = b'\x00\x00\x00\x00\x00\x00'. Now I am wondering how this additional record is sent when exactly_once is set.
Below is the Python code used:
from kafka import KafkaConsumer

params = {
    "bootstrap_servers": "localhost:29092",
    "auto_offset_reset": "latest",
    "group_id": "test",
}

def set_consumer(self):
    try:
        consumer = KafkaConsumer(*self.topics, **self.consumer_params)
        return consumer
    except Exception as e:
        print(e)

for msg in self.consumer:
    try:
        event = self.decode_msg(msg)
        self.logger.info("Json result : %s", str(event))
    except Exception as e:
        self.logger.error("Failed to decode message: %s", e)
wondering how this additional record is sent when exactly_once is set
It is a transaction marker. The Java consumer is able to detect these markers and filter them out, but in Python your deserializer will need to handle them separately. There is a GitHub issue thread that suggests the consumer should already be able to filter out the transaction records; check the librdkafka docs to see if you are missing any configuration for this.
I see there is an EOS example in the confluent-kafka-python repo, but it doesn't consume after the producer sends the transaction records.
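If your client does not filter these records for you, a hedged sketch of a defensive guard in a loop like the one above could look like this. The skip condition mirrors the key bytes from your logs and is only a heuristic, not a general-purpose control-record filter; if your client version exposes a read_committed isolation level, prefer that instead:
for msg in self.consumer:
    # Skip records that look like the extra marker seen in the logs (key = b'\x00\x00\x00\x01').
    if msg.value is None or msg.key == b'\x00\x00\x00\x01':
        self.logger.debug("Skipping non-application record: key=%r value=%r", msg.key, msg.value)
        continue
    try:
        event = self.decode_msg(msg)
        self.logger.info("Json result : %s", str(event))
    except Exception:
        self.logger.exception("Failed to decode message")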
I have a consumer that consumes from two topics with two different avro schemas. The number of topics can grow in the future, so unioning schemas to read from multiple topics is not scalable.
from confluent_kafka import Consumer

# assume config for group.id and bootstrap servers are set
consumer = Consumer(config)
consumer.subscribe(["randomtopic1", "randomtopic2"])

msg = consumer.poll()
msg_value = msg.value()  # should be bytes or str
How do I identify/extract which topic the message is from, without deserializing it with the Avro schema? Wondering if the topic name is encoded into any Avro headers, I came across this link where they deserialize the message as bytes and somehow extract the topic name, and then deserialize the message value with the schema.
There are no "Avro headers" in the consumed records. Kafka record headers don't have the topic name (usually).
You can simply use message.topic(), then use an if statement against it to process the topics differently.
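A minimal sketch of that dispatch, where the per-topic deserializer helpers are placeholders for whatever Avro deserialization you already use:
from confluent_kafka import Consumer

consumer = Consumer(config)                       # same config as above
consumer.subscribe(["randomtopic1", "randomtopic2"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    if msg.topic() == "randomtopic1":
        record = deserialize_with_schema1(msg.value())   # hypothetical per-topic deserializer
    else:
        record = deserialize_with_schema2(msg.value())   # hypothetical per-topic deserializer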
I am not too sure about Python, but as per the Confluent recommendation, if you are using a single consumer that needs to subscribe to multiple topics and handle multiple data types at the same time, it is advised to use GenericRecord instead of a specific code-binding type.
You can find a sample .NET code here https://github.com/confluentinc/confluent-kafka-dotnet/blob/master/examples/AvroGeneric/Program.cs
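For a rough Python equivalent of that generic approach, recent confluent-kafka-python versions let you build an AvroDeserializer without a reader schema so records come back as plain dicts. The registry URL, group id, and topics below are placeholders, and the AvroDeserializer argument order has changed between versions, so check your version's docs:
from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})   # assumed registry URL
avro_deserializer = AvroDeserializer(sr_client)   # no schema_str: values are returned as plain dicts

consumer = DeserializingConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "multi-topic-consumer",
    "value.deserializer": avro_deserializer,
})
consumer.subscribe(["randomtopic1", "randomtopic2"])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    print(msg.topic(), msg.value())               # msg.value() is a dict keyed by field name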
While playing with GCP Pub/Sub, I need to keep an eye on my topics and retrieve the number of undelivered messages. It's working pretty well with this snippet of Google Query Monitoring: Link.
But I need to group my messages by attributes. Each message gets a body with params like {'target':'A'}, and I really need to get something like this:
msg.target | undelivered messages
A          | 34
B          | 42
C          | 42
I haven't managed to access this without consuming the messages.
This is my first try:
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

project_id = "xxxx"
subscription_id = "xxxx"
subscription_path = subscriber.subscription_path(project_id, subscription_id)

response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 9999}
)

ack_ids = [r.ack_id for r in response.received_messages]

subscriber.modify_ack_deadline(
    request={
        "subscription": subscription_path,
        "ack_ids": ack_ids,
        "ack_deadline_seconds": 0,  # The message will be immediately 'pullable' again?
    }
)

messages = [json.loads(r.message.data.decode()) for r in response.received_messages]
for m in messages:
    pass  # Parse all messages to get my needed counts
But it's not working very well: I get a random number of messages each time, so it's impossible to be sure of what I'm looking at.
So here I am in my experimentations.
I see 3 ways:
Maybe it's possible to access message body attributes directly from Google Query Monitoring?
Maybe my method to consume / parse / release all messages is not correctly written, and that's why it's not working well?
Maybe I'm all wrong and it would be more efficient to create many topics instead of keeping attributes in the message body, OR there is another way to "tag" messages so they can be grouped later in Monitoring?
Do you have an idea of how to do this?
Thanks a lot in advance for your help!
The first thing to note is that the number of undelivered messages is a property of a subscription, not a topic. If there are multiple subscriptions to the same topic, then the number of undelivered messages could be different. There is no way in the Google Query Monitoring system to break down messages by attributes; it does not have any introspection into the contents of the backlog of messages, only to the metadata that is the number of messages.
The code as you have it has several things that make it problematic for trying to determine the number of messages remaining:
Synchronous pull can only return up to 1000 messages, so setting max_messages to 9999 messages will never give you that many messages.
Even with max_messages set to 1000, there is no guarantee that 1000 messages will be returned, even if there are 1000 messages that have not yet been delivered. You would need to issue multiple pull requests in order to fetch all of the messages. Of course, since you nack the messages (by doing a modify_ack_deadline with 0), messages could be redelivered and therefore double counted.
Even though you do the modify_ack_deadline request to nack the messages, while the messages are outstanding to this monitor, they are not available for delivery to your actual subscriber, which delays processing. Furthermore, consider the situation where your monitor crashes for some reason before it gets to perform the modify_ack_deadline. In this situation, those messages would not be delivered to your actual subscriber until the ack deadline you configured in the subscription had passed. If your application is latency-sensitive in any way, this could be a problem.
A different approach to consider would be to create a second subscription and have a monitoring application that receives all messages. For each message, it looks at the attribute and counts it as a received message for that attribute, then acknowledges the message. You could report this per-attribute breakdown via a custom metric. In your actual subscriber application, you would also create a custom metric that counts the number of messages received and processed per attribute. To compute the number of messages remaining to process per attribute, you would take the difference of these two numbers.
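A rough sketch of such a monitoring subscriber (the monitoring subscription name is an example; it also assumes target is carried as a message attribute, so if it only lives in the JSON body you would parse message.data instead):
from collections import Counter
from google.cloud import pubsub_v1

project_id = "xxxx"
monitor_subscription_id = "my-monitor-sub"        # second subscription on the same topic (example name)

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, monitor_subscription_id)

received_per_target = Counter()

def callback(message):
    # Count by message attribute, then ack; report received_per_target to a custom metric here.
    target = message.attributes.get("target", "unknown")
    received_per_target[target] += 1
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result()
except KeyboardInterrupt:
    streaming_pull.cancel()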
Alternatively, you could consider separating messages per attribute into different topics. However, there are a few things to consider:
Is the set of attributes fixed and known in advance? If not, how will your subscriber know which subscriptions to subscribe to?
How big is the set of attributes to be checked? There is a limit of 10,000 topics per project and so if you have more attributes than that, this approach will not work.
Are you using flow control to limit how many messages are being processed by your subscriber simultaneously? If so, is the number of messages per attribute uniform? If not, you may have to consider how to divide up the flow control across the subscribers on the different subscriptions.