I have a pipeline where a message is produced to topic A, which is processed by a stream processor that sends enriched data to topic B.
Topic B is consumed by 3 other stream processors, each of which independently performs a small part of the calculation (to reduce the load on a single processor) and forwards its enriched data onto a new topic. The final processor reads from all 3 new topics and sends this data on to web clients via web sockets.
It all works well, but if the system sits idle for 30 minutes or so with no new messages, it can sometimes take up to 10 seconds for a message to get to the end of the pipeline. When operating normally this time has been on the order of 10-20 ms.
Every stream processor uses tables to refer to previous data and determine how to enrich going forward, so I'm not sure whether access to these tables slows down if they haven't been touched for a while?
If so, it seems a silly workaround, but it might be possible to use a timer to send a dummy dataset through the pipeline to keep each worker alive and alert.
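For reference, that keep-alive idea could look roughly like the sketch below, assuming Faust; the app name, broker address, and topic name are placeholders, and downstream agents would need to recognise and skip the heartbeat records:
import faust

# A minimal sketch of the "dummy message" keep-alive idea, assuming Faust.
# The app name, broker address, and topic name are placeholders.
app = faust.App("keepalive-demo", broker="kafka://localhost:9092")
topic_a = app.topic("topic-A")

@app.timer(interval=1.0)
async def send_heartbeat():
    # Push a dummy record every second so every stage keeps producing/consuming.
    await topic_a.send(value={"heartbeat": True})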
Below is the printed output of the time difference from message initiation to arrival at the end of the pipeline:
[2022-05-23 08:52:46,445] [10340] [WARNING] 0:00:00.017999
[2022-05-23 08:53:03,469] [10340] [WARNING] 0:00:00.025995
[2022-05-23 09:09:46,774] [10340] [WARNING] 0:00:06.179146
I wonder whether any of the broker or agent settings noted on this page would be of use here? If anyone knows, please let me know.
UPDATE
So I ran tests where I use the @app.timer option to send a dummy/test message through the entire pipeline every second, and I never had an instance of slow send times. I also updated the setup to talk to the app directly using the @app.page() decorator, rather than having a FastAPI endpoint try to send to the topic, and with that I never saw a delay greater than 2 seconds. But the same thing still happened: if it sat idle for a while and then received a new message, it took almost exactly 2 seconds (plus change) to do its thing. This really starts to look like an agent throttling its poll, or Kafka throttling an agent's connection, if the throughput is low.
It appears that the issue stems from a setting on Kafka for both consumers and producers which basically closes the connection if they haven't sent/consumed messages within the designated time frame.
In Faust, you access this and set it up when you define the app like so:
app.conf.producer_connections_max_idle_ms
app.conf.consumer_connections_max_idle_ms
and set them to something appropriate. I understand that this setting is probably kept low (9 minutes by default) so that large, dynamic clusters can release resources or memory (or something), but in our use case, with a small cluster that will remain static in terms of architecture and design, it's not an issue (I think) to increase this from 9 minutes to 12 or 24 hours.
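In code, that looks roughly like the sketch below; the app name, broker address, and values are placeholders, and I'm assuming the settings can be passed as keyword arguments when the app is created, which is how Faust settings are normally supplied:
import faust

# A sketch of raising the idle-connection timeouts; names and values are placeholders.
app = faust.App(
    "my-pipeline",
    broker="kafka://localhost:9092",
    # 12 hours instead of the ~9 minute default, so idle connections stay open.
    producer_connections_max_idle_ms=12 * 60 * 60 * 1000,
    consumer_connections_max_idle_ms=12 * 60 * 60 * 1000,
)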
Related
I am developing an automation tool that is supposed to upgrade IP network devices.
I developed two totally separate scripts for the sake of simplicity - I am not an expert developer - one for the core and aggregation nodes, and one for the access nodes.
The tool executes a software upgrade on the routers and verifies the result by executing a set of post-check commands. The device role implies the "size" of the router: bigger routers take much longer to finish the upgrade. Although the smaller ones come back up much earlier than the bigger ones, the post checks cannot be started until the bigger ones finish the upgrade, because they are connected to each other.
I want to implement reliable signaling between the 2 scripts. That is, the slower script (core devices) flips a switch when the core devices are up, while the other script keeps checking this value and starts the checks for the access devices.
Both scripts run 200+ concurrent sessions; moreover, each and every access device (session) needs individual signaling, so all the sessions keep checking the same value in the DB.
First I used the keyring library, but noticed that the keys sometimes disappear. Now I am using a txt file to manipulate the signal values. It looks pretty amateurish, so I would like to use MongoDB.
Would it cause any performance issues or unexpected exceptions?
The script will be running for 90+ minutes. Is it OK to connect to the DB once at the beginning of the script, set the signal to False, and then, 20-30 minutes later, keep checking it for an additional 20 minutes? Or is it advisable to establish a new connection for each read in each and every parallel session?
The server runs on the same VM as the script. What exceptions should I expect?
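For what it's worth, the signaling itself can stay very small. A sketch with pymongo, where the database, collection, field names, and intervals are made up for illustration:
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # server on the same VM
flags = client["upgrade_tool"]["signals"]

def set_core_done(value):
    # Called once by the core/aggregation script when its devices are back up.
    flags.update_one({"_id": "core_done"}, {"$set": {"value": value}}, upsert=True)

def wait_for_core_done(poll_seconds=30, timeout_seconds=20 * 60):
    # Called by each access-node session; polls the shared flag until it flips.
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        doc = flags.find_one({"_id": "core_done"})
        if doc and doc.get("value"):
            return True
        time.sleep(poll_seconds)
    return False
pymongo's MongoClient keeps its own connection pool and is documented as thread-safe, so sharing a single client across the parallel sessions (rather than opening a new connection per session) is the usual approach.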
Thank you!
This is probably a heisenbug, so I'm not expecting hard answers but rather hints on what to look for to be able to replicate the bug.
I have an event-driven, Kafka-based system composed of several services. For now, they are organized in linear pipelines. One topic, one event type. Every service can be thought of as a transformation from one event type to one or more event types.
Each transformation is executed as a Python process, with its own consumer and its own producer. They all share the same code and configuration because this is all abstracted away from the service implementation.
Now, what's the problem? On our staging environment, sometimes (let's say one in every fifty messages), there's a message on Kafka but the consumer is not processing it at all. Even if you wait hours, it just hangs. This doesn't happen in local environments and we haven't been able to reproduce it anywhere else.
Some more relevant information:
These services get restarted often for debugging purposes, but the hanging doesn't seem related to the restarting.
When the message is hanging and you restart the service, the service will process the message.
The services are completely stateless, so there's no caching or other weird stuff going on (I hope).
When this happens, I have evidence that they are not still processing older messages (I log when they produce an output, and this happens right before the end of the consumer loop).
With the current deployment there's just a single consumer in the consumer group, so no parallel processing inside the same service and no horizontal scaling of the service.
How I consume:
I use pykafka and this is the consumer loop:
def create_consumer(self):
    consumer = self.client.topics[bytes(self.input_topic, "UTF-8")].get_simple_consumer(
        consumer_group=bytes(self.consumer_group, "UTF-8"),
        auto_commit_enable=True,
        offsets_commit_max_retries=self.service_config.kafka_offsets_commit_max_retries,
    )
    return consumer

def run(self):
    consumer = self.create_consumer()
    while not self.stop_event.wait(1):
        message = consumer.consume()
        results = self._process_message(message)
        self.output_results(results)
My assumption is that there's either some problem with the way I consume the messages or there's some inconsistent state of the consumer group offsets, but I cannot really wrap my mind around the problem.
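One thing that might at least make the hang observable is pykafka's consumer_timeout_ms argument to get_simple_consumer, which makes consume() return None after the timeout instead of blocking forever. A sketch adapted from the loop above (the 30-second value is a placeholder, and the held_offsets property is my assumption about pykafka's SimpleConsumer API):
import logging

def create_consumer(self):
    consumer = self.client.topics[bytes(self.input_topic, "UTF-8")].get_simple_consumer(
        consumer_group=bytes(self.consumer_group, "UTF-8"),
        auto_commit_enable=True,
        offsets_commit_max_retries=self.service_config.kafka_offsets_commit_max_retries,
        consumer_timeout_ms=30000,   # consume() returns None after 30 s with no message
    )
    return consumer

def run(self):
    consumer = self.create_consumer()
    while not self.stop_event.wait(1):
        message = consumer.consume()
        if message is None:
            # No message within 30 s: log the held offsets so they can be compared
            # against the broker's latest offsets to spot a stalled partition.
            logging.warning("no message consumed; held offsets: %s", consumer.held_offsets)
            continue
        results = self._process_message(message)
        self.output_results(results)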
I'm also considering moving to Faust to solve the problem. Given my codebase and my architectural decisions, the transition shouldn't be too hard, but before starting such work I would like to be sure that I'm going in the right direction. Right now it would just be a blind shot, hoping that whatever detail is creating the problem will go away.
I'm working with Django 1.8 and Python 2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 milliseconds) between each piece of data that I send:
while True:
    send(data)
    sleep(0.01)
So my question is: is it considered bad practice to simply use sleep() to create that pause? Is there maybe a more efficient approach?
UPDATED:
The reason why I need to create that pause is that the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesn't return anything after having received, let alone processed, the data. Leaving that brief pause ensures that each chunk of data I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
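A minimal sketch of that pattern, assuming Celery is already configured for the Django project, that send(data) is the existing socket helper from the question, and that the task, view, and list_of_chunks names are made up:
# tasks.py
import time
from celery import shared_task

@shared_task
def push_chunks(chunks):
    # Send each chunk to the external service, pausing 10 ms between sends,
    # so the pause happens in a Celery worker rather than a web worker.
    for chunk in chunks:
        send(chunk)
        time.sleep(0.01)

# views.py
from django.http import JsonResponse

def start_push(request):
    result = push_chunks.delay(list_of_chunks)   # list_of_chunks is hypothetical
    return JsonResponse({"task_id": result.id})  # the client can poll using this id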
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+ for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at using something like celery/rabbitmq so Django can dump the time delayed part onto something else and then quickly respond with a "task started" message.
If this is production, but you're the only person hitting the webserver, it still isn't great design, but if it works, it's going to be hard to justify the extra complexity of the task queue approach mentioned above.
I am very new to AWS SQS queues and I am currently playing around with boto. I noticed that when I try to read a queue filled with messages in a while loop, after 10-25 messages are read the queue stops returning messages (even though it holds more than 1000). It starts returning another set of 10-25 messages after a few seconds, or on stopping and restarting the program.
while True:
    read_queue()  # connection is already established with the desired queue
Any thoughts on this behaviour, or can someone point me in the right direction? Just reiterating, I am only a couple of days into SQS!
Thanks
That's the way that SQS queues work by default (short polling). If you haven't changed any settings after setting up your queue, the default is to get messages from a weighted random sampling of machines. If you're using more than one machine and want all the messages you can consume at that moment (across all machines), you need to use long polling. See the Amazon documentation here. I don't think boto supports that directly ATM.
Long polling is more efficient because it allows you to leave the HTTP connection open for a period of time while you wait for more results. However, you can still do your own polling in boto by just setting up a loop and waiting for some period of time between reading the queue. You can still get good overall throughput with this polling strategy.
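A rough sketch of that client-side polling with boto 2.x; the region, queue name, backoff interval, and process() handler are placeholders:
import time
import boto.sqs

conn = boto.sqs.connect_to_region("us-east-1")
queue = conn.get_queue("my-queue")

while True:
    messages = queue.get_messages(num_messages=10)   # up to 10 messages per request
    if not messages:
        time.sleep(5)   # back off briefly when a sample comes back empty
        continue
    for message in messages:
        process(message.get_body())    # process() is your own handler
        queue.delete_message(message)  # delete only after successful processing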
A rather confusing sequence of events happened, according to my log file, and I am about to put a lot of the blame on the Python logger, which is a bold claim. I thought I should get some second opinions about whether what I am saying could be true.
I am trying to explain why there are several large gaps in my log file (around two minutes at a time) during stressful periods for my application, when it is missing deadlines.
I am using Python's logging module on a remote server, and have set up, with a configuration file, for all logs of severity ERROR or higher to be emailed to me. Typically, only one error will be sent at a time, but during periods of sustained problems, I might get a dozen in a minute - annoying, but nothing that should stress SMTP.
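For context, that setup looks roughly like the sketch below when written in code rather than a config file; all mail settings, addresses, and file names are placeholders:
import logging
import logging.config

logging.config.dictConfig({
    "version": 1,
    "handlers": {
        "email": {
            "class": "logging.handlers.SMTPHandler",
            "level": "ERROR",                 # only ERROR and above get emailed
            "mailhost": "localhost",
            "fromaddr": "app@example.com",
            "toaddrs": ["me@example.com"],
            "subject": "Application error",
        },
        "file": {
            "class": "logging.FileHandler",
            "level": "DEBUG",
            "filename": "app.log",
        },
    },
    "root": {"level": "DEBUG", "handlers": ["email", "file"]},
})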
I believe that, after a short spurt of such messages, the Python logging system (or perhaps the SMTP system it is sitting on) is encountering errors or congestion. The call to Python's log is then BLOCKING for two minutes, causing my thread to miss its deadlines. (I was smart enough to move the logging until after the critical path of the application - so I don't care if logging takes me a few seconds, but two minutes is far too long.)
This seems like a rather awkward architecture (for both a logging system that can freeze up, and for an SMTP system (Ubuntu, sendmail) that cannot handle dozens of emails in a minute**), so this surprises me, but it exactly fits the symptoms.
Has anyone had any experience with this? Can anyone describe how to stop it from blocking?
** EDIT # 2 : I actually counted. 170 emails in two hours. Forget the previous edit. I counted wrong. It's late here...
Stress-testing was revealing:
My logging configuration sent critical messages to SMTPHandler, and debug messages to a local log file.
For testing I created a moderately large number of threads (e.g. 50) that waited for a trigger and then simultaneously tried to log either a critical message or a debug message, depending on the test.
Test #1: All threads send critical messages. This revealed that the first critical message took about 0.9 seconds to send, the second around 1.9 seconds, the third longer still, quickly adding up. It seems that the messages that go to email block waiting for each other to complete the send.
Test #2: All threads send debug messages. These ran fairly quickly, from hundreds to thousands of microseconds.
Test #3: A mix of both. It was clear from the results that debug messages were also being blocked waiting for the critical messages' emails to go out.
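Roughly the shape of the test harness, as a reconstruction rather than the original code; it assumes a logger already configured as described above (SMTPHandler for critical, local file for debug):
import logging
import threading
import time

logger = logging.getLogger("stress_test")
trigger = threading.Event()

def worker(use_critical):
    trigger.wait()   # all threads start logging at the same moment
    start = time.time()
    if use_critical:
        logger.critical("stress test: critical message")
    else:
        logger.debug("stress test: debug message")
    print("log call took %.3f s" % (time.time() - start))

threads = [threading.Thread(target=worker, args=(i % 2 == 0,)) for i in range(50)]
for t in threads:
    t.start()
trigger.set()   # release all 50 threads simultaneously
for t in threads:
    t.join()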
So, it wasn't that the 2 minutes meant there was a timeout. It was that the two minutes represented a large number of threads blocked waiting in the queue.
Why were there so many critical messages being sent at once? That's the irony. There was a logging.debug() call inside a method that included a network call. I had some code monitoring the speed of the method (to see if the network call was taking too long). If so, it (of course) logged a critical error that sent an email. The next thread then blocked on the logging.debug() call, meaning it missed its deadline, triggering another email, triggering another thread to run slowly.
The 2-minute delay in one thread wasn't a network timeout. It was one thread waiting for another thread that was blocked for 1 minute 57 - because it was waiting for another thread blocked for 1 minute 55, etc. etc. etc.
This isn't very pretty behaviour from SMTPHandler.
A two-minute pause sounds like a timeout - most probably in the networking stack.
Try adding:
* - nofile 64000
to your /etc/security/limits.conf file on all of the machines involved and then rebooting all of the machines to ensure it is applied to all running services.