Why does my Apache Kafka consumer randomly ignore queued messages? - Python

This is probably a heisenbug, so I'm not expecting hard answers so much as hints on what to look for in order to reproduce it.
I have an event-driven, Kafka-based system composed of several services. For now, they are organized in linear pipelines: one topic, one event type. Every service can be thought of as a transformation from one event type to one or more event types.
Each transformation is executed as a Python process, with its own consumer and its own producer. They all share the same code and configuration, because this is all abstracted away from the service implementation.
Now, the problem: on our staging environment, sometimes (say one in every fifty messages), there's a message on Kafka but the consumer is not processing it at all. Even if you wait hours, it just hangs. This doesn't happen in local environments and we haven't been able to reproduce it anywhere else.
Some more relevant information:
These services get restarted often for debugging purposes, but the hanging doesn't seem related to the restarts.
When the message is hanging and you restart the service, the service will process the message.
The services are completely stateless so there's no caching or other weird stuff going on (I hope)
When this happens I have evidence that they are not still processing older messages (I log when they produce an output and this happens right before the end of the consumer loop)
With the current deployment there's just a single consumer in the consumer group, so there's no parallel processing inside the same service and no horizontal scaling of the service.
How I consume:
I use pykafka and this is the consumer loop:
def create_consumer(self):
    consumer = self.client.topics[bytes(self.input_topic, "UTF-8")].get_simple_consumer(
        consumer_group=bytes(self.consumer_group, "UTF-8"),
        auto_commit_enable=True,
        offsets_commit_max_retries=self.service_config.kafka_offsets_commit_max_retries,
    )
    return consumer

def run(self):
    consumer = self.create_consumer()
    while not self.stop_event.wait(1):
        message = consumer.consume()  # blocks until the next message (pykafka's default)
        results = self._process_message(message)
        self.output_results(results)
My assumption is that there's either some problem with the way I consume the messages or some inconsistent state in the consumer group offsets, but I cannot really wrap my head around it.
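One variant that would at least rule out the committed offsets running ahead of processing is to disable auto-commit and commit explicitly only after a message has been fully processed. A rough sketch, assuming pykafka's SimpleConsumer and its commit_offsets() call:

def run(self):
    consumer = self.client.topics[bytes(self.input_topic, "UTF-8")].get_simple_consumer(
        consumer_group=bytes(self.consumer_group, "UTF-8"),
        auto_commit_enable=False,  # commit manually instead of relying on auto-commit
    )
    while not self.stop_event.wait(1):
        message = consumer.consume()  # blocks until a message arrives
        if message is None:
            continue
        results = self._process_message(message)
        self.output_results(results)
        consumer.commit_offsets()  # mark progress only after successful processing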
I'm also considering moving to Faust to solve the problem. Given my codebase and my architectural decisions, the transition shouldn't be too hard, but before starting such work I would like to be sure I'm going in the right direction. Right now it would just be a blind shot, hoping that whatever detail is causing the problem goes away.

Related

Can NATS keep messages in memory if no workers are online at the moment (producer-consumers pattern)?

I would like to use NATS to distribute tasks among several worker processes. Everything works as expected if I have at least one worker online, but if there are no worker processes, messages are just thrown away: when I later start a worker, it gets none of the messages that were created while it was offline.
I know how to do it with RabbitMQ, but is it possible to do it with NATS?
The project is in Python: the producer process uses aiohttp, and the worker processes are also in Python and do some CPU-heavy tasks.
Are you familiar with JetStream? JetStream retains messages so they can be replayed. You can configure your stream to only discard the message once it's been acknowledged.
Not sure what the state of the Python client is with regard to JetStream; I know it is being worked on. https://github.com/nats-io/nats.py
As of today, the official Python connector does not support JetStream (issue-209)
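For future readers: once a client with JetStream support is available, the ack-based retention described above is only a few lines. A sketch, assuming a newer nats.py release with JetStream bindings (the stream and subject names are made up):

import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    # The stream retains published messages even while no worker is online.
    await js.add_stream(name="TASKS", subjects=["tasks.cpu"])
    await js.publish("tasks.cpu", b"heavy job payload")

    # A durable pull consumer picks up messages published while it was offline.
    sub = await js.pull_subscribe("tasks.cpu", durable="workers")
    for msg in await sub.fetch(1):
        # ... do the CPU-heavy work here ...
        await msg.ack()  # only now is the message considered handled

    await nc.close()

asyncio.run(main())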

Long-running cloud task on GAE flexible terminates early without error. How to debug? What am I missing?

I am running an application on GAE flexible with Python and Flask. I periodically dispatch Cloud Tasks with a cron job. These basically loop through all users and perform some cluster analysis. The tasks terminate without throwing any kind of error but don't perform all the work (meaning not all users were looped through). It doesn't happen at a consistent time (276.5 s to 323.3 s), nor does it ever stop at the same user. Has anybody experienced anything similar?
My guess is that I am breaching some type of resource limit or timeout somewhere. Things I have thought about or tried:
Cloud tasks should be allowed to run for up to an hour (as per this: https://cloud.google.com/tasks/docs/creating-appengine-handlers)
I increased the timeout of gunicorn workers to be 3600 to reflect this.
I have several workers running.
I tried to find if there are memory spikes or cpu overload but didn't see anything suspicious.
Sorry if I am too vague or am completely missing the point, I am quite confused with this problem. Thank you for any pointers.
Thank you for all the suggestions. I played around with them and found the root cause, although by accident, while reading the Firestore documentation. I had no indication that this had anything to do with Firestore.
From here: https://googleapis.dev/python/firestore/latest/collection.html
I found out that Query.stream() (or Query.get()) has a timeout on the individual documents like so:
Note: The underlying stream of responses will time out after the max_rpc_timeout_millis value set in the GAPIC client configuration for the RunQuery API. Snapshots not consumed from the iterator before that point will be lost.
So what eventually timed out was the query over all users. I came across this by chance; none of the errors I caught pointed me back towards the query. Hope this helps someone in the future!
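For anyone hitting the same thing: one way around it is to page through the collection instead of holding a single long-lived stream open, consuming each page before the underlying RPC can time out. A sketch, assuming the google-cloud-firestore client and a users collection (names are illustrative):

from google.cloud import firestore

db = firestore.Client()

def iter_users(page_size=500):
    # Yield user documents in pages so no single RunQuery stream stays open for long.
    query = db.collection("users").order_by("__name__").limit(page_size)
    last_doc = None
    while True:
        page = query.start_after(last_doc) if last_doc else query
        docs = list(page.stream())  # consume the stream promptly
        if not docs:
            break
        for doc in docs:
            yield doc
        last_doc = docs[-1]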
Other than using Cloud Scheduler, you can inspect the logs to make sure the tasks ran properly and that there are no deadline issues. Application logs are grouped and only sent to Stackdriver after the task itself has executed, so when a task is forcibly terminated, no log may be output. Try catching the deadline exception so that some log is output; you may then see some helpful info to start troubleshooting.

Is it a bad practice to use sleep() in a web server in production?

I'm working with Django 1.8 and Python 2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 milliseconds) between each piece of data that I send:
while True:
    send(data)
    sleep(0.01)
So my question is: is it considered bad practice to simply use sleep() to create that pause? Is there maybe a more efficient approach?
UPDATED:
The reason I need to create that pause is that the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesn't return anything after having received, let alone processed, the data. Leaving that brief pause ensures that each chunk of data I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
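To make that concrete, here is a minimal sketch of the hand-off, assuming a configured Celery app; send() and build_chunks() are hypothetical stand-ins for the question's socket code:

# tasks.py
import time
from celery import shared_task

@shared_task
def push_chunks(chunks):
    # The pause now ties up a Celery worker process, not a web worker.
    for chunk in chunks:
        send(chunk)       # hypothetical: the question's socket send
        time.sleep(0.01)

# views.py
from django.http import JsonResponse

def start_push(request):
    task = push_chunks.delay(build_chunks(request))  # returns immediately
    return JsonResponse({"task_id": task.id})        # client can poll using this id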
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+ for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at using something like celery/rabbitmq so Django can dump the time delayed part onto something else and then quickly respond with a "task started" message.
If this is production, but you're the only person hitting the webserver, it still isn't great design, but if it works, it's going to be hard to justify the extra complexity of the task queue approach mentioned above.

Running an infinite Python script while connected to database

I'm working on a project to learn Python, SQL, JavaScript, running servers -- basically getting a grip on full-stack development. Right now my basic goal is this:
I want to run a Python script indefinitely, constantly making API calls to different services, which have different rate limits (e.g. 200/hr, 1000/hr, etc.), and storing the results (ints) in a database (PostgreSQL). I want to store these results over a period of time and then begin working with that data to display fun stuff on the front end. I need this to run 24/7. I'm trying to understand the general architecture here, and searching around has proven surprisingly difficult. My basic idea in rough pseudocode is this:
database.connect()

def function1(serviceA):
    while True:
        result = makeAPIcallA()
        INSERT INTO tableA result;
        if hitRateLimitA:
            sleep(limitTimeA)

def function2(serviceB):
    # same thing, different limits, etc.
And I would ssh into my server, run python myScript.py &, shut my laptop down, and wait for the data to roll in. Here are my questions:
Does this approach make sense, or should I be doing something completely different?
Is it considered "bad" or dangerous to open a database connection indefinitely like this? If so, how else do I manage the DB?
I considered using a scheduler like cron, but the rate limits are variable. I can't just run the script every hour when the limit might be hit, say, 5 minutes after start time and then require a 60-minute wait. Even running it at minute intervals seems messy: I need to sleep for persistent rate-limit wait times which keep varying. Am I correct in assuming a scheduler is not the way to go here?
How do I gracefully handle any unexpected potentially fatal errors (namely, logging and restarting)? What about manually killing the script, or editing it?
I'm interested in learning different approaches and best practices here -- any and all advice would be much appreciated!
I actually do exactly what you do for one of my personal applications and I can explain how I do it.
I use Celery instead of cron because it allows for finer adjustments in scheduling, and it's Python rather than bash, so it's easier to use. I have different tasks (basically a group of API calls and DB updates) hitting different sites, running at different intervals to account for the various rate limits.
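For example, per-service intervals can be expressed in a Celery beat schedule, which accepts plain seconds (a sketch; the task names, broker URL and rates are made up):

from celery import Celery

app = Celery("poller", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "poll-service-a": {
        "task": "tasks.poll_service_a",   # hypothetical: API calls + DB insert for service A
        "schedule": 3600.0 / 200,         # stay under ~200 calls per hour
    },
    "poll-service-b": {
        "task": "tasks.poll_service_b",
        "schedule": 3600.0 / 1000,        # stay under ~1000 calls per hour
    },
}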
I have the Celery app run as a service so that even if the system restarts it's trivial to restart the app.
I use the logging library extensively in my application, because it is difficult to debug something when all you have is one hard-to-read stack trace. I have INFO-level and DEBUG-level logs spread throughout my application, and any log at WARNING level or above gets printed to the console AND sent to my email.
For exception handling, the majority of what I prepare for are rate-limit issues and random connectivity issues. Make sure to surround whatever HTTP requests you send to your API endpoints with try-except statements, and consider implementing a retry mechanism.
As far as the DB connection goes, it shouldn't matter how long the connection stays open, but you need to surround your main application loop with a try-except statement and make sure it fails gracefully by closing the connection in the case of an exception. Otherwise you might end up with a lot of ghost connections, and your application won't be able to reconnect until those connections are gone.
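A minimal sketch of those last two points, assuming requests and psycopg2 (the URL, DSN and table are placeholders):

import logging
import time

import psycopg2
import requests

log = logging.getLogger(__name__)

def call_with_retry(url, retries=3, backoff=5):
    # Retry a flaky endpoint a few times before giving up; tune for your rate limits.
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            log.warning("attempt %d for %s failed: %s", attempt + 1, url, exc)
            time.sleep(backoff * (attempt + 1))
    raise RuntimeError("gave up on %s" % url)

def main():
    conn = psycopg2.connect("dbname=metrics")
    try:
        result = call_with_retry("https://api.example.com/stat")
        with conn, conn.cursor() as cur:
            cur.execute("INSERT INTO table_a (value) VALUES (%s)", (result["value"],))
    finally:
        conn.close()  # avoid leaking ghost connections when something blows up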

Python "Task Server"

My question is: which python framework should I use to build my server?
Notes:
This server talks HTTP with its clients: GET and POST (via pyAMF)
Clients "submit" "tasks" for processing and, then, sometime later, retrieve the associated "task_result"
submit and retrieve might be separated by days - different HTTP connections
The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
When a server gets a "task", it queues it for processing
The server manages this queue and, when tasks get to the top, arranges for them to be processed.
The processing is performed by a long-running (15 mins?) external program (via subprocess), which is fed the task XML and which produces a "task_result" lump of XML that the server picks up and stores (for later client retrieval).
it serves a couple of basic HTML pages showing the Queue and processing status (admin purposes only)
I've experimented with twisted.web, using SQLite as the database and threads to handle the long running processes.
But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?
I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.
Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)
By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
Some popular python implementations of message queues:
http://code.google.com/p/stomper/
http://code.google.com/p/pyactivemq/
http://xph.us/software/beanstalkd/
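For instance, with beanstalkd via the beanstalkc client, the producer/consumer split is only a few lines (a sketch; host, port, tube name and solve() are assumptions):

import beanstalkc

# Producer side (in the webserver): enqueue the task XML.
producer = beanstalkc.Connection(host="localhost", port=11300)
producer.use("tasks")
producer.put(task_xml)          # task_xml comes from the HTTP POST

# Consumer side (a separate long-running process): reserve, process, delete.
worker = beanstalkc.Connection(host="localhost", port=11300)
worker.watch("tasks")
job = worker.reserve()          # blocks until a task is available
result_xml = solve(job.body)    # solve() stands in for the external program
job.delete()                    # remove only after the task_result is stored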
I'd suggest the following. (Since it's what we're doing.)
A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue. No further queueing needed. You get a request, you spawn the subprocess as a child and wait for it to finish. A simple list of children is about all you need.
I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
A simple SQLite database can track request status. Even this may be overkill, because your XML inputs and results can just lie around in the file system.
That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone, child process.
If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only requests that fit within some "max number of children" limit are taken out.
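A sketch of the child-tracking idea (the solver path, result layout and mark_done() are made up):

import subprocess

children = []  # (task_id, Popen) pairs that the serve-forever loop polls periodically

def start_task(task_id, xml_path):
    # Launch the external solver; its stdout becomes the task_result XML.
    out = open("results/%s.xml" % task_id, "wb")
    proc = subprocess.Popen(["/usr/local/bin/solver", xml_path], stdout=out)
    children.append((task_id, proc))

def reap_finished():
    for task_id, proc in list(children):
        if proc.poll() is not None:               # None means still running
            children.remove((task_id, proc))
            mark_done(task_id, proc.returncode)   # hypothetical status update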
My reaction is to suggest Twisted, but you've already looked at that. Still, I stick by my answer. Without knowing your personal pain points, I can at least share some things that helped me reduce almost all of the deferred-madness that arises when you have several dependent, blocking actions you need to perform for a client.
Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable/implementable command-line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not all that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances, perhaps) via ssh, assuming you have the keys set up already.
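As a small illustration of both points together, here is an inlineCallbacks-style wrapper around an external solver (the solver path is hypothetical; getProcessOutput is the simplest of Twisted's process helpers):

from twisted.internet import defer, utils

@defer.inlineCallbacks
def run_task(task_xml_path, result_path):
    # Run the external solver in a child process; the Deferred fires with its stdout.
    output = yield utils.getProcessOutput("/usr/local/bin/solver", [task_xml_path])
    with open(result_path, "wb") as f:
        f.write(output)
    defer.returnValue(result_path)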
You can have a look at Celery.
It seems any python web framework will suit your needs. I work with a similar system on a daily basis and I can tell you, your solution with threads and SQLite for queue storage is about as simple as you're going to get.
Assuming order doesn't matter in your queue, then threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two of the same job type running simultaneously. If this is the case, I'd suggest a single threaded application to do the items in the queue one by one.
