I've written some data into my one-partition topic with a KafkaProducer. I'm trying to read this data back with a KafkaConsumer, either by iterating over the consumer or by calling poll():
import time
from datetime import datetime, timedelta
from kafka import KafkaProducer, KafkaConsumer, TopicPartition
consumer = KafkaConsumer(bootstrap_servers='localhost:9092',group_id='my-group',enable_auto_commit=False)
tp = TopicPartition('my-topic', 0)
consumer.assign([tp])
consumer.seek_to_end(tp)
last_offset = consumer.position(tp)
producer = KafkaProducer(bootstrap_servers='localhost:9092')
stopWriting = datetime.now() + timedelta(seconds=10)
while datetime.now() < stopWriting:
    producer.send(topic='my-topic',value=str(datetime.now()).encode('utf-8'))
    time.sleep(1)
producer.close()
consumer.seek(tp, last_offset)
# looping through the consumer
for msg in consumer:
    print(msg)
# or looping through the polled messages
for msg in consumer.poll():
    print(msg)
Neither approach works properly. The consumer loop does print the messages, but it always ends up hanging in an infinite loop inside kafka/consumer/group.py(886)_message_generator. The poll loop doesn't print anything at all. Is there something I'm missing that would let me read all of the newly produced messages without hanging the program? I'm using Python 3.6.1 and kafka-python version 1.3.4.
I found a way with poll(). First, you need to pass it a timeout, because none of the messages are in the buffer yet when it is first called. Second, poll() returns a dict of {TopicPartition: [ConsumerRecord]}, and iterating over that dict directly only yields its keys, so you need to index it by the topic partition you want to read messages from.
import sys
records = consumer.poll(timeout_ms=sys.maxsize)
for record in records[tp]:
    print(record)
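Blocking with sys.maxsize works, but a finite timeout lets the program exit once the topic goes quiet. Below is a minimal sketch of that idea, assuming the consumer has already been assigned and positioned as in the question. `read_available` is a hypothetical helper name, and the only call it relies on is `poll(timeout_ms=...)` returning a dict of record lists. Note that an empty poll only means nothing arrived within the timeout, not that the end of the partition was reached.

```python
def read_available(consumer, timeout_ms=1000):
    """Poll repeatedly until one poll returns nothing, then return all records.

    `consumer` is assumed to be an already-assigned kafka-python
    KafkaConsumer (any object with a compatible poll() works).
    Hypothetical helper, not part of kafka-python.
    """
    records = []
    while True:
        # poll() returns {TopicPartition: [ConsumerRecord, ...]}
        batch = consumer.poll(timeout_ms=timeout_ms)
        if not batch:
            # Nothing arrived within timeout_ms: treat the topic as drained.
            break
        for partition_records in batch.values():
            records.extend(partition_records)
    return records

# Usage (assumes the consumer and last_offset from the question):
#   consumer.seek(tp, last_offset)
#   for record in read_available(consumer):
#       print(record)
```

The timeout is a trade-off: too short and slow brokers look empty, too long and the program waits needlessly after the last record.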
I have a Docker container that fetches messages from a standard SQS queue. Most of the time, though, the code reports that it received zero messages and exits, while the SQS console shows the messages under "Messages in flight", which means some consumer did receive them.
This is my docker entry point
ENV PYTHONPATH="$PYTHONPATH:/app"
ENTRYPOINT [ "python3" ]
CMD ["multi.py"]
This is multi.py code
import multiprocessing as mp
import subprocess
def s():
    subprocess.call(['python3', 'script.py'])
n_process = min(mp.cpu_count(), 8)
process = []
for i in range(n_process):
    p = mp.Process(target=s)
    process.append(p)
    p.start()
for p in process:
    p.join()
This is script.py part of the code which calls receive_messages
sqs = boto3.resource('sqs', region_name=REGION, aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
queue = sqs.get_queue_by_name(QueueName=QUEUE_NAME)
def main():
    while True:
        m = queue.receive_messages()
        for message in m:
            process_message(message)
            message.delete()
The container works about 60% of the time, but I'm trying to figure out why it fails the rest of the time.
PS: Solved
This is from the boto3 docs
Short poll is the default behavior where a weighted random set of machines is sampled on a ReceiveMessage call. Thus, only the messages on the sampled machines are returned. If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response. If this happens, repeat the request.
m = queue.receive_messages(WaitTimeSeconds=5)
Setting WaitTimeSeconds enables long polling, which resolves the issue: when the queue holds very few messages, a short poll is quite likely to come back empty.
You can read about short polling in the boto3 docs here:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sqs.html#SQS.Queue.receive_messages
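As a rough sketch of how long polling could replace the `while True` loop in script.py: the helper below keeps polling with `WaitTimeSeconds` set and only gives up after several consecutive empty responses. `drain_queue`, `handle` and `max_empty_polls` are hypothetical names introduced for illustration. `WaitTimeSeconds` and `MaxNumberOfMessages` are parameters of `receive_messages`.

```python
def drain_queue(queue, handle, wait_seconds=5, max_empty_polls=3):
    """Long-poll `queue`, pass each message to `handle`, then delete it.

    `queue` is assumed to be a boto3 SQS Queue resource and `handle` a
    caller-supplied function. Stops after `max_empty_polls` consecutive
    empty responses and returns the number of messages processed.
    """
    processed = 0
    empty_polls = 0
    while empty_polls < max_empty_polls:
        messages = queue.receive_messages(
            WaitTimeSeconds=wait_seconds,  # long polling instead of short polling
            MaxNumberOfMessages=10,        # SQS returns at most 10 per call
        )
        if not messages:
            empty_polls += 1
            continue
        empty_polls = 0  # reset the counter whenever something arrives
        for message in messages:
            handle(message)
            message.delete()
            processed += 1
    return processed

# Usage (assumes `queue` and `process_message` from script.py):
#   drain_queue(queue, process_message)
```

Requiring several empty responses before exiting guards against a single empty poll being mistaken for an empty queue.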
I'm testing a script that runs binwalk on a file and then sends a Kafka message to tell the sender whether it completed or failed. It looks like this:
if __name__ == "__main__":
    # finds the path of this file
    scriptpath = os.path.dirname(inspect.getfile(inspect.currentframe()))
    print(scriptpath)
    # sets up kafka consumer on the binwalk topic and kafka producer for the bwsignature topic
    consumer = KafkaConsumer('binwalk', bootstrap_servers=['localhost:9092'])
    producer = KafkaProducer(bootstrap_servers = ['localhost:9092'])
    # watches the binwalk kafka topic
    for msg in consumer:
        # load the json
        job = json.loads(msg.value)
        # get the filepath of the .bin
        filepath = job["src"]
        print(0)
        try:
            # runs the script
            binwalkthedog(filepath, scriptpath)
            # send a receipt
            producer.send('bwsignature', b'accepted')
        except:
            producer.send('bwsignature', b'failed')
            pass
    producer.close()
    consumer.close()
If I send in a file that doesn't cause any errors in the 'binwalkthedog' function, it works fine. If I give it a file that doesn't exist, it prints a general error message and moves on to the next input, as it should. For some reason, though, producer.send('bwsignature', b'failed') doesn't actually send unless something creates a delay after the binwalkthedog call fails, such as time.sleep(1) or a for loop that counts to a million.
Obviously I could keep that in place, but it's really gross and I'm sure there's a better way to do this.
This is the temporary script I'm using to send and receive a signal from the binwalkthedog module:
job = {
    'src' : '/home/nick/Documents/summer-2021-intern-project/BinwalkModule/bo.bin',
    'id' : 1
}
chomp = json.dumps(job).encode('ascii')
receipt = KafkaConsumer('bwsignature', bootstrap_servers=['localhost:9092'])
producer = KafkaProducer(bootstrap_servers = ['localhost:9092'])
future = producer.send('binwalk', chomp)
try:
    record_metadata = future.get(timeout=10)
except KafkaError:
    print("sucks")
    pass
print(record_metadata.topic)
print(record_metadata.partition)
print(record_metadata.offset)
producer.close()
for msg in receipt:
    print(msg.value)
    break
Kafka producers batch records together to reduce the number of requests made to the server. To force pending records to be sent, call producer.flush() rather than introducing a blocking sleep or calling get() on the future.
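A minimal sketch of what that could look like in the receipt path, assuming a kafka-python `KafkaProducer`. `send_and_flush` is a hypothetical wrapper name, and the 10-second timeout is an arbitrary choice for illustration.

```python
def send_and_flush(producer, topic, payload, timeout=10):
    """Queue `payload` on `topic`, then block until buffered records are sent.

    `producer` is assumed to be a kafka-python KafkaProducer.
    Hypothetical helper, not part of kafka-python.
    """
    future = producer.send(topic, payload)  # only queues the record in the batch buffer
    producer.flush(timeout=timeout)         # forces the buffered records out
    return future

# Usage inside the except branch from the question:
#   send_and_flush(producer, 'bwsignature', b'failed')
```

Flushing after every record gives up most of the benefit of batching, so this fits low-volume signals such as these receipts better than high-throughput streams.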
I use the kafka_python==2.0.0 library.
With the code below, if no message arrives for 1 hour, the next messages pushed to the Kafka topic are not processed by the consumer, even though the loop does not stop.
I would like my listener to run 24/7 without losing the connection.
consumer = KafkaConsumer(
    os.environ.get('MY_TOPIC'),
    bootstrap_servers=broker,
    api_version=my_version,
    security_protocol='SASL_PLAINTEXT',
    sasl_mechanism='GSSAPI',
    sasl_kerberos_service_name=service_name,
    group_id='MY_GRP_ID',
    max_poll_records=1
)
try:
    for msg in consumer:
        ##PROCESS function ...
        consumer.commit()
finally:
    consumer.close()
In the end I switched to the poll method:
from kafka import KafkaConsumer
# To consume latest messages and auto-commit offsets
consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['localhost:9092'])
while True:
    # Response format is {TopicPartition('topic1', 1): [msg1, msg2]}
    msg_pack = consumer.poll(timeout_ms=500)
    for tp, messages in msg_pack.items():
        for message in messages:
            # message value and key are raw bytes -- decode if necessary!
            # e.g., for unicode: `message.value.decode('utf-8')`
            print("%s:%d:%d: key=%s value=%s" % (tp.topic, tp.partition,
                                                 message.offset, message.key,
                                                 message.value))
The advantage of this syntax, in my view, is better visibility into how messages are retrieved. It is not shown in my example, but I could also manage stopping the program more cleanly by looking for a SIGTERM signal.
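The SIGTERM idea could be sketched roughly as follows. This is one possible arrangement, not something taken from the answer: `install_stop_handler`, `consume_until_stopped` and `handle` are hypothetical names, and the consumer is assumed to be a kafka-python `KafkaConsumer` built as above.

```python
import signal
import threading

def install_stop_handler(stop_event, signals=(signal.SIGTERM, signal.SIGINT)):
    """Set `stop_event` when one of `signals` arrives (call from the main thread)."""
    def _handler(signum, frame):
        stop_event.set()
    for sig in signals:
        signal.signal(sig, _handler)

def consume_until_stopped(consumer, handle, stop_event, timeout_ms=500):
    """Poll `consumer` until `stop_event` is set, then close it.

    `handle(tp, message)` is a caller-supplied function. The short poll
    timeout bounds how long a stop request can go unnoticed.
    """
    try:
        while not stop_event.is_set():
            msg_pack = consumer.poll(timeout_ms=timeout_ms)
            for tp, messages in msg_pack.items():
                for message in messages:
                    handle(tp, message)
    finally:
        consumer.close()

# Usage (assumes `consumer` from the example above):
#   stop = threading.Event()
#   install_stop_handler(stop)
#   consume_until_stopped(consumer, lambda tp, m: print(tp, m.value), stop)
```

Because the loop checks the event between polls, the consumer is closed cleanly instead of being killed mid-iteration.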
import json
import unittest
from kafka import KafkaProducer, errors, admin, KafkaConsumer
SERVERS = ['localhost:9092']
TEST_TOPIC = 'test-topic'
DATA = [{'A':'A'}, {'A':'A'}, {'A':'A'}]
class TestKafkaConsumer(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls._producer = KafkaProducer(bootstrap_servers=SERVERS, value_serializer=lambda x: json.dumps(x).encode('utf-8'))
    def _send_data(self):
        for data in DATA:
            print(self._producer.send(TEST_TOPIC, value=data))
    def test_basic_processing(self):
        self._send_data()
        received = []
        consumer = KafkaConsumer(TEST_TOPIC, bootstrap_servers=SERVERS)
        for msg in consumer:
            message = json.loads(msg.value.decode('utf-8'))
            received.append(message)
            if (len(received) >= len(DATA)):
                self.assertEqual(received, DATA)
                # stop iterating once everything has arrived
                break
This should succeed pretty quickly, as it just sends the data to the Kafka broker in a straightforward manner. However, it times out; the consumer never reads a single message. If I move the consumer portion to a different file and run it in a different terminal window, the messages are "consumed" almost instantly. Why is the consumer not working inside this unittest?
You're producing records with your producer and only then starting to read; this might be your problem.
By the time your consumer starts, the records have already been produced, so from the consumer's point of view there are no new messages.
You should run your consumer in a different thread, and start it before your producer begins producing.
Yannick
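As an alternative to the threading approach above (a swapped-in technique, not what the answer proposes), the consumer can be told to start from the earliest available offset and to stop iterating after a quiet period. The sketch below assumes kafka-python's `auto_offset_reset` and `consumer_timeout_ms` constructor options. `collect_messages` is a hypothetical helper that also makes sure the loop ends.

```python
def collect_messages(consumer, expected_count):
    """Iterate `consumer` and return up to `expected_count` message values.

    `consumer` is assumed to be a kafka-python KafkaConsumer, or any
    iterable of objects with a `.value` attribute. Hypothetical helper.
    """
    received = []
    for msg in consumer:
        received.append(msg.value)
        if len(received) >= expected_count:
            break  # without this the loop would wait for more messages
    return received

# Usage in the test (assumes TEST_TOPIC, SERVERS and DATA from the question):
#   consumer = KafkaConsumer(
#       TEST_TOPIC,
#       bootstrap_servers=SERVERS,
#       auto_offset_reset='earliest',   # read records produced before startup
#       consumer_timeout_ms=5000,       # end iteration after 5 s of silence
#       value_deserializer=lambda b: json.loads(b.decode('utf-8')),
#   )
#   self.assertEqual(collect_messages(consumer, len(DATA)), DATA)
```

One caveat: reading from the earliest offset also returns records left over from previous test runs, so a fresh topic per run may be needed for the equality check to hold.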
I am not able to send a persistent message to an AMQ queue using stompest and Python. I don't know which header to use.
Below is the source code:
from stompest.config import StompConfig
from stompest.sync import Stomp
import os
CONFIG = StompConfig('tcp://localhost:61613')
QUEUE = '/queue/myQueue'
if __name__ == '__main__':
    try:
        client = Stomp(CONFIG)
        client.connect({'login':'#####','passcode':'#####'})
        for i in range(10):
            msg="Test Message" +str(i)
            client.send(QUEUE,msg)
        client.disconnect()
    except Exception as e:
        print(e)
If you go persistent, you may also want to send your messages in a transaction.
with client.transaction(receipt='important') as transaction:
    client.send(QUEUE, 'test',{'persistent':'true', StompSpec.TRANSACTION_HEADER: transaction})
This way, you can ensure all or none of a set of messages ends up on a queue. If there is an error raised within the transaction block, the message(s) won't be committed to the queue. The same goes for reading messages.
You have to change the send line to this:
client.send(QUEUE,msg, headers={'persistent' :'true'})
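Put together with the loop from the question, the change might look like the sketch below. `send_persistent` is a hypothetical helper name, and `client` is assumed to be an already-connected `stompest.sync.Stomp` instance. Encoding the body to bytes is an assumption made here for Python 3 and may be unnecessary on Python 2.

```python
def send_persistent(client, destination, messages):
    """Send each text in `messages` to `destination` with the persistent header.

    `client` is assumed to be a connected stompest.sync.Stomp client.
    Hypothetical helper, not part of stompest.
    """
    headers = {'persistent': 'true'}  # asks the broker to persist the message
    for text in messages:
        # Assumption: bodies are sent as bytes (relevant on Python 3).
        client.send(destination, text.encode('utf-8'), headers=headers)

# Usage (assumes `client` and QUEUE from the question):
#   send_persistent(client, QUEUE, ["Test Message%d" % i for i in range(10)])
```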