Sending data with kafka-python only working when briefly delaying code - python

I'm sending some data to a Kafka topic using kafka-python. For a while I couldn't get any data through to my topic, until I found out that the script works if I delay it briefly at the end.
from kafka import KafkaProducer
from time import sleep

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("topic", b"foo")  # value must be bytes unless a value_serializer is configured
sleep(.1)
This code does not work for me without using sleep(.1). It's like sending data needs time to settle for it to work properly. Is there anything in the kafka-python client that deals with this? Or a better solution?

A year later, but for anyone seeing this, a solution is below. The issue here is a race between the end of the script and the asynchronous send() call, which is why the sleep() works: without it, the process exits before the buffered message is actually transmitted.
The kafka-python module should handle the Python exit better, or at a minimum write something to stdout/stderr, so this behavior isn't silent.
From the kafka-python github:
# Block until a single message is sent (or timeout)
future = producer.send('foobar', b'another_message')
result = future.get(timeout=60)
Now you can guarantee that your script will block until a message has been confirmed published.
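Another option, if you don't need per-message confirmation (a minimal sketch, not part of the original answer), is to flush the producer before the script exits, so the process can't die with records still sitting in the send buffer:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("topic", b"foo")  # asynchronous: only queues the record
producer.flush()                # block until all queued records have been sent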

Related

Kafka-python script does not execute properly

I have run the below code in the Python Shell:
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='localhost:9092')
future = producer.send('hello-topic', b'Hello, World!')
This works perfectly in that the Kafka consumer picks up the messages.
BUT...
Running it via a script does nothing.
Am I missing something obvious?
The only way to get it working as a script is to add this line...
future.get(timeout=10)
Any help would be appreciated.
From the kafka-python send() documentation: send() is asynchronous. When called it adds the record to a buffer of pending record sends and immediately returns. This allows the producer to batch together individual records for efficiency.
You can use the flush() method to send buffered messages immediately.
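For example, a minimal sketch applying flush() to the script from the question:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
future = producer.send('hello-topic', b'Hello, World!')
producer.flush()  # block until all buffered records have been delivered to the broker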

Redis-py: run_in_thread event handler stops getting called after a few hours

I'm trying to implement a basic pubsub using redis-py client.
The idea is, the publisher is actually a callback that gets called periodically and will publish some information on channel1 in the callback function.
The subscriber will listen on that channel for this message and do some processing accordingly.
The subscriber is a basic bare-bones webserver deployed on k8s, and it should simply print the messages it receives via the event_handler function.
subscriber.py
from redis import Redis

class Sub(object):
    def __init__(self):
        redis = Redis(host=...,
                      port=...,
                      password=...,
                      db=0)
        ps = redis.pubsub(ignore_subscribe_messages=True)
        ps.subscribe(**{'channel1': Sub.event_handler})
        ps.run_in_thread(sleep_time=0.01, daemon=True)

    @staticmethod
    def event_handler(msg):
        print("Hello from event handler")
        if msg and msg.get('type') == 'message':  # interested only in messages, not subscribe/unsubscribe/pmessages
            # process the message
            ...
publisher.py
from redis import Redis

redis = Redis(host=...,
              port=...,
              password=...,
              db=0)

def call_back(msg):
    global redis
    redis.publish('channel1', msg)
At the beginning, the messages are published and the subscriber's event handler prints and processes them correctly.
The problem is that after a few hours the subscriber stops showing those messages. I've checked the publisher logs and the messages definitely get sent out, but I can't figure out why event_handler is no longer being called.
The print statement in it stops appearing, which is why I say the handler stops firing after a few hours.
Initially I suspected the thread must have died, but after exec'ing into the pod I can see it in the list of threads.
I've read through a lot of blogs and documentation but haven't found much help.
All I can deduce is that the event handler stops getting called after some time.
Can anyone help me understand what's going on, and the best way to reliably consume pubsub messages in a non-blocking way?
Really appreciate any insights you guys have! :(
Could you post the whole publisher.py, please? It could be that call_back(msg) isn't being called anymore.
To check whether a client is still subscribed, you can use the command PUBSUB CHANNELS in redis-cli.
Regards, Martin
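If the root cause turns out to be a silently dropped connection, one way to make the worker less fragile (a minimal sketch, assuming a recent redis-py that supports health_check_interval and the exception_handler argument of run_in_thread) is to enable periodic connection health checks and surface worker exceptions instead of letting them vanish:

from redis import Redis

def on_worker_error(exc, pubsub, thread):
    # make failures visible instead of silent, then stop the worker so it can be restarted
    print(f"pubsub worker died: {exc!r}")
    thread.stop()

redis_client = Redis(host=..., port=..., password=..., db=0,
                     health_check_interval=30)  # periodically ping the connection
ps = redis_client.pubsub(ignore_subscribe_messages=True)
ps.subscribe(**{'channel1': Sub.event_handler})
worker = ps.run_in_thread(sleep_time=0.01, daemon=True,
                          exception_handler=on_worker_error)

The main application can then watch worker.is_alive() and re-subscribe if the thread dies.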

What is the behaviour of kafka when a commit is made without reading the message?

I have code that looks like this:
def message_reader(consumer):
    consumed_message = consumer.consume_batch()
    if consumed_message:
        # do something
        ...

def run_reader():
    process_consumer = get_consumer()  # gets a SimpleConsumer()
    message_reader(process_consumer)
    process_consumer.commit()
    process_consumer.close()
So my question is: suppose there is no message in the topic and nothing is consumed - does the commit() advance the offset?
And also, does the producer check for the latest offset before producing a message?
Not an expert on the Python client, but the Java one would just re-commit the same position if it hasn't actually consumed anything between commit calls.
I'm certain, however, that all clients do the same (commit the same position), as doing otherwise would cause you to skip records. There are also entire Kafka monitoring systems that rely on this behavior - for example, Burrow.

Python Producer can send via shell, but not .py

I have a running and tested Kafka cluster, and am trying to use a Python script to send messages to the brokers. This works when I use the Python 3 shell and call the producer method; however, when I put these same commands into a Python file and execute it, the script seems to hang.
I am using the kafka-python library for the consumer and producer. When I use the Python 3 shell I can see the messages appear in the topic using the Kafka GUI tool (2.0.4).
I've tried various loops and statements in the Python code, but nothing seems to make it 'run' to completion.
>>> from kafka import KafkaProducer
>>> producer = KafkaProducer(bootstrap_servers='BOOTSTRAP_SRV:9092')
>>> producer.send('MyTopic', b'Has this worked?')
<kafka.producer.future.FutureRecordMetadata object at 0x7f7af9ece048>
And this works: the bytes appear in the broker's topic data.
When I put the same code as above in a Python .py file and execute it with Python 3, it completes, but no data is sent to the Kafka broker.
No error is shown either.
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='BOOTSTRAP_SRV:9092')
producer.send('MyTopic', b'Some Data to Check')
As you can see, it returns a future.
Kafka clients batch records; they don't immediately send one record at a time. To get the message out, you need to wait on the future or flush the producer buffer so the record is sent before the app exits. In other words, the interactive terminal keeps the producer alive in the background long enough for the batch to be sent, whereas the script exits and discards the buffered data.
As the docs show:
from kafka.errors import KafkaError

future = producer.send(...)
try:
    record_metadata = future.get(timeout=10)
except KafkaError:
    # Decide what to do if produce request failed...
    log.exception()
    pass
Or just put producer.flush(), if you don't care about the metadata or grabbing the future.
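For the script in the question, the flush() variant would look like this (a minimal sketch using the question's broker and topic names):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='BOOTSTRAP_SRV:9092')
producer.send('MyTopic', b'Some Data to Check')
producer.flush()  # block until the buffered record is actually delivered before exiting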

What is the most efficient way to run independent processes from the same application in Python

I have a script that ultimately executes two functions. It runs as a daemon and polls for data on a time interval (the data is retrieved from a shell command run on the local system); once it receives this data it will: 1) function 1 - write the data to a log file, and 2) function 2 - examine the data and send an email IF the data meets certain criteria.
The logging happens every time, but the alert may not. The issue is that when an alert needs to be sent, if the email connection stalls or takes a long time to connect to the server, it delays the next poll (for an unpredictable amount of time, depending on the server), and in my case it is very important that the polling interval remains consistent (for analytics purposes).
What is the most efficient way, if any, to keep the email process working independently of the logging process while still operating within the same application and depending on the same data? I was considering creating a separate thread for the mailer, but that kind of seems like overkill in this case.
I'd rather not set a short timeout on the email connection, because I want to give the process some chance to connect to the server, while still allowing the logging to be written consistently on the given interval. Some code:
def send(self, msg_):
    """
    Send the alert message
    :param str msg_: the message to send
    """
    self.msg_ = msg_
    ar = alert.Alert()
    ar.send_message(msg_)

def monitor(self):
    """
    Post to the log file and
    send the alert message when
    applicable
    """
    read = r.SensorReading()
    msg_ = read.get_message()  # the data
    if msg_:  # if there is data in general...
        x = read.get_failed()  # store bad data
        msg_ += self.write_avg(read)
        msg_ += "==============================================="
        self.ctlog.update_templog(msg_)  # write general data to log
        if x:
            self.send(x)  # if bad data, send...
This is exactly the kind of case you want to use threading/subprocesses for. Fork off a thread for the email, which times out after a while, and keep your daemon running normally.
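A minimal sketch of that approach, reusing the alert.Alert class from the question (the thread arrangement here is just one possibility, not the poster's actual code): fire the alert from a short-lived daemon thread so a slow SMTP connection can't delay the next poll.

import threading

def send_async(self, msg_):
    """Send the alert without blocking the polling loop."""
    def _send():
        ar = alert.Alert()
        ar.send_message(msg_)  # may block on a slow server; only this worker thread waits

    threading.Thread(target=_send, daemon=True).start()

In monitor(), calling self.send_async(x) instead of self.send(x) keeps the logging and polling on schedule.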
Possible approaches that come to mind:
Multiprocessing
Multithreading
Parallel Python
My personal choice would be multiprocessing as you clearly mentioned independent processes; you wouldn't want a crashing thread to interrupt the other function.
You may also refer to this before making your design choice: Multiprocessing vs Threading Python
Thanks everyone for the responses, they helped very much. I went with threading, but also updated the code to make sure it handled failing threads. I ran some regressions and found that the subsequent processes were no longer being interrupted by stalled connections and the log was being updated on a consistent schedule. Thanks again!!
