How to delete a Kafka message after reading - Python

I am using the code below to read messages from a topic. How do I delete a message after it is read?
from kafka import KafkaConsumer

consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['localhost:9092'])
for message in consumer:
    # message value and key are raw bytes -- decode if necessary!
    # e.g., for unicode: `message.value.decode('utf-8')`
    print("%s:%d:%d: key=%s value=%s" % (message.topic, message.partition,
                                         message.offset, message.key,
                                         message.value))

There is no way to delete a specific message from Kafka - Kafka simply is not designed for that. The only way messages get deleted is through retention: set log.retention.hours in Kafka's config/server.properties to a value of your liking. The default is 168, meaning messages are discarded after 168 hours.
If what you actually want is to avoid re-reading messages - i.e. not start from the beginning every time - look at the KafkaConsumer API: http://kafka-python.readthedocs.org/en/master/apidoc/KafkaConsumer.html
commit() - commits the offsets already read back to Kafka
seek_to_end() - fast-forwards so that only newly arriving messages are consumed
seek() - moves to a given offset (presumably stored somewhere other than Kafka)
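For example, here is a minimal sketch of reading with manual commits in kafka-python (auto-commit is disabled so commit() decides what counts as read; committing per message is only for illustration - in practice you would usually commit in batches):
from kafka import KafkaConsumer

consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['localhost:9092'],
                         enable_auto_commit=False,
                         auto_offset_reset='earliest')

for message in consumer:
    print(message.offset, message.value)
    # record the offset as read; on the next start this group resumes
    # after the last committed offset instead of re-reading old messages
    consumer.commit()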

Related

How can I read all messages from a topic?

When I create the consumer
consumer = pulsar.Client(
    PULSAR_URL,
    authentication=AuthenticationOauth2(params)
).subscribe(
    topic=PULSAR_TOPIC,
    subscription_name=PULSAR_SUBSCRIPTION_NAME
)
I cannot read all messages from the beginning, or all unread messages; I can only read messages created after the consumer is created.
The question is: how can I configure the consumer so that it reads all previously unread messages?
Thanks
You can pass initial_position to the subscribe method to set the initial position of the consumer when it first subscribes to the topic. It can be either InitialPosition.Earliest or InitialPosition.Latest; the default is Latest.
So in your case, if you wanted to start at the oldest available message then you would want something like:
from pulsar import InitialPosition

consumer = pulsar.Client(
    PULSAR_URL,
    authentication=AuthenticationOauth2(params)
).subscribe(
    topic=PULSAR_TOPIC,
    subscription_name=PULSAR_SUBSCRIPTION_NAME,
    initial_position=InitialPosition.Earliest
)
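For completeness, a minimal receive loop to go with it - a sketch assuming the pulsar-client package and the same placeholder names as above. Messages that are not acknowledged will be redelivered on the subscription:
import pulsar
from pulsar import AuthenticationOauth2, InitialPosition

client = pulsar.Client(PULSAR_URL, authentication=AuthenticationOauth2(params))
consumer = client.subscribe(
    topic=PULSAR_TOPIC,
    subscription_name=PULSAR_SUBSCRIPTION_NAME,
    initial_position=InitialPosition.Earliest
)

while True:
    msg = consumer.receive()       # blocks until a message arrives
    print(msg.message_id(), msg.data())
    consumer.acknowledge(msg)      # unacknowledged messages get redelivered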
Hope this helps!

Does Kafka's producer.flush() verify that all the data is sent to topic?

I'm sending about 12 million rows of data to a Kafka topic daily, via the Python KafkaProducer.
The data needs to be JSON formatted, with each row sent as one record, and I call producer.flush() every 1 million rows.
The code is provided below:
import json

count = 0
for key in json_df:
    x = json.dumps(key, default=str).encode('utf-8')
    producer.send('SOME_TOPIC_NAME', value=x)
    count += 1
    if count % 1000000 == 0:
        producer.flush()
producer.flush()
But today I ran into a problem where ~250 rows were missing from the topic, and my question is: does calling producer.flush() more often guarantee that no data goes missing?
I also have linger_ms=500 in the KafkaProducer object; does lowering linger_ms make sure that all the data is sent to the topic?
I don't think the flush is the problem. There is nothing in your code that handles the program itself dying with fewer than 1M records batched. When you restart the producer, the count restarts at zero, and you lose all intermediate batched records.
To answer the question - yes, flush is a blocking call that empties whatever data is available in the batch... Keep in mind that Kafka does have a batch.size upper limit (in bytes).
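If you want per-record confirmation rather than relying on flush() alone, the future returned by send() can carry an error callback, and acks='all' makes the broker wait for all in-sync replicas. A rough sketch (broker address and settings are illustrative):
import json
from kafka import KafkaProducer

failed = []

def on_send_error(exc):
    # called when a record could not be delivered after retries
    failed.append(exc)

producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                         acks='all',     # wait for all in-sync replicas
                         retries=5,
                         linger_ms=500)

for key in json_df:                      # json_df as in the question
    x = json.dumps(key, default=str).encode('utf-8')
    producer.send('SOME_TOPIC_NAME', value=x).add_errback(on_send_error)

producer.flush()                         # block until every batch is sent
print('failed sends:', len(failed))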
If you already had a JSON-lines formatted file, you could just do
kafka-console-producer ... < file.jsonl

Bittorrent and sockets: how to handle multiple messages?

I'm writing a bittorrent client in python, and have been using a loop to continually read messages from the peer sockets using recv().
When I run my program I look in wireshark to see what bittorrent messages I'm getting. It's pretty easy to tell what kind of message you got from the first 5 bytes of the message, since the length and message ID are specified there.
I'm running into some problems when dealing with receiving data containing multiple messages.
I've tried tackling it by writing a method like this:
def handleMultiple(self, message, peer):
    total_length = len(message)
    parsed = 0
    while parsed < total_length:
        m_len, m_id = struct.unpack(">IB", message[parsed:parsed + 5])
        m_total = m_len + 4
        print(m_len, total_length, parsed, m_id, peer.made_handshake, peer.ip)
        self.handleMessage(message[parsed:m_total + parsed], peer)
        parsed += m_total
The function just breaks the received bytes down into their constituent messages and hands them off to the message handler that knows how to deal with individual messages.
The problem is that when I print the length prefix and message ID from a message received with recv(), sometimes they look like garbage numbers.
This is really my first time experimenting with sockets, so I lack the intuition to know what I'm really getting when calling recv(). Should I just call receive on the first 5 bytes of data I get, then do some checking to make sure that the length and ID are valid, then call recv() on the rest of the message?
How should I go about handling multiple messages incoming at a time?
Edit:
I wanted to provide some images of the results I'm seeing to see if anyone can help identify the issue I'm having.
Here's a picture of the BitTorrent messages I'm receiving (screenshot omitted), and the corresponding logging output (screenshot omitted).
The columns are supposed to be: message length + 4, total message length, message ID, and the sender's IP.
As you can see, the length prefixes for the first messages (the ones where multiple messages were sent to me at once) are far too large. The fifth message I got from 95.211.212.26 is a well-formed bitfield message.
Another thing I noticed is that the supposed message ID in each of the multi-message reads is 255. Also, given that the total length of a bitfield message for this torrent is 126, the total lengths (303, 328, 325) are not inconceivable for a bitfield followed by several have messages.
Alright, so I've managed to figure out where I was going wrong. I was reading from the socket assuming that my message would be there in full. In reality, I was reading the initial snippet of the message, and at a later time reading the middle of the message. The 255 values I was seeing weren't message IDs but actually the middle of the peer's bitfield (0xff).
I changed my approach to store the bytes read from the socket in the peer's message buffer. Once the message buffer was at least as long as the expected payload, I read the message off the front and trimmed the buffer to exclude what I had just read. Now all of my messages' IDs look as I expect.
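Something along these lines is a sketch of that buffering approach (the Peer class and feed() helper are just illustrative names; the 4-byte length prefix does not count itself, hence the +4):
import struct

class Peer:
    def __init__(self):
        self.buffer = b''

def feed(peer, data, handle_message):
    # append freshly received bytes, then emit every complete message
    peer.buffer += data
    while len(peer.buffer) >= 4:
        m_len = struct.unpack(">I", peer.buffer[:4])[0]
        m_total = m_len + 4                # the prefix itself is 4 bytes
        if len(peer.buffer) < m_total:
            break                          # wait for the rest of this message
        handle_message(peer.buffer[:m_total], peer)
        peer.buffer = peer.buffer[m_total:]

# usage: whenever recv() returns data
# feed(peer, sock.recv(4096), self.handleMessage)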

How to specify ">" in redis-py

I'm looking at this in the redis stream documentation, which says:
It is time to try reading something using the consumer group:
> XREADGROUP GROUP mygroup Alice COUNT 1 STREAMS mystream >
1) 1) "mystream"
2) 1) 1) 1526569495631-0
2) 1) "message"
2) "apple"
XREADGROUP replies are just like XREAD replies. Note however the GROUP <group-name> <consumer-name> provided above: it states that I want to read from the stream using the consumer group mygroup and I'm the consumer Alice. Every time a consumer performs an operation with a consumer group, it must specify its name, uniquely identifying this consumer inside the group.
There is another very important detail in the command line above: after the mandatory STREAMS option, the ID requested for the key mystream is the special ID >. This special ID is only valid in the context of consumer groups, and it means: messages never delivered to other consumers so far.
I am trying to specify the ">" parameter in redis-py.
When I look at the documentation here, I don't see any parameter in streams that seems to let me do this. Specifically, I'm trying:
>>> r.xreadgroup(mygroupname, myconsumer, {mystream: ">"}, 1)
[]  # oh no, empty. WHY?!

# even though
>>> r.xread({mystream: '1561950326849-0'}, count=1)
[[b'stuff-returned-successfully']]
What am I missing? Why can't I specify a ">" to indicate unseen messages?
Your mistaken assumption in this question is that you have unseen messages. That command should work, but it will not return anything if you have already seen all the messages once.
Try
# make sure you have not seen anything in your stream by resetting last seen to 0
>>> r.xgroup_setid(mystream, mygroupname, 0)  # RESET ALL
Now
>>> r.xreadgroup(mygroupname, myconsumer, {mystream: ">"}, 1)
works fine.
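For reference, here is a small end-to-end sketch with redis-py (stream, group and consumer names are placeholders): create the group at the start of the stream, add an entry, read it with the special '>' ID, then acknowledge it:
import redis

r = redis.Redis()
stream, group, consumer = 'mystream', 'mygroup', 'Alice'

# create the consumer group at the beginning of the stream (id='0');
# mkstream creates the stream if it does not exist yet
try:
    r.xgroup_create(stream, group, id='0', mkstream=True)
except redis.ResponseError:
    pass  # group already exists

r.xadd(stream, {'message': 'apple'})

# '>' = entries never delivered to any consumer of this group
for _, messages in r.xreadgroup(group, consumer, {stream: '>'}, count=1):
    for msg_id, fields in messages:
        print(msg_id, fields)
        r.xack(stream, group, msg_id)   # acknowledge once processed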

Modify HL7 messages in place using Python

I need to be able to either modify some text within an existing HL7 message (mostly PID and OBX segments), or create a copy of the HL7 message from the existing one but alter some of the fields based on some criteria (e.g. drop PHI strings).
The OBX segment is used to transmit a single observation or
observation fragment. It represents the smallest indivisible unit of a
report. Its mission is to carry information about observations in
report messages.
HL7 messages should not be modified or "copied" once received or sent. Each HL7 message represents one movement of a transaction in which several actors interact. HL7 messages should be generated from an episode that must be notified to other systems; in that case a message is either generated and sent, or received and processed.
That said, you can look at the python-hl7 library, which is useful for parsing.
Use hl7apy (docs):
from hl7apy.parser import parse_message

hl7 = "your hl7 message"
message = parse_message(hl7)
# you can modify whatever you want
message.MSH.MSH_3.value = "your value"
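A slightly fuller sketch of the same idea, assuming hl7apy (the PID fields and replacement values here are only examples), including turning the modified message back into an ER7 string:
from hl7apy.parser import parse_message

hl7 = "your hl7 message"            # raw ER7 string, segments separated by \r
message = parse_message(hl7)

# e.g. overwrite identifying fields - PID-5 (patient name) and PID-3 (patient ID)
message.PID.PID_5.value = "DOE^JOHN"
message.PID.PID_3.value = "000000"

# serialize the modified message back to an ER7 string
print(message.to_er7())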
