Gunicorn workers restart on connection reset - python

It seems to me that my gunicorn workers are restarted each time a connection is reset by the browser (e.g. by reloading a page while a request is still in progress, or as a result of connectivity problems).
This doesn't seem to be a sensible behaviour. Effectively I can bring down all the workers just by refreshing a page in a browser a few times.
Questions:
What are the possible causes for a gunicorn worker restart?
What would be the right way to debug this behaviour?
I'm using Pyramid 1.4 and Gunicorn (I tried eventlet, gevent and sync workers - all demonstrate the same behaviour). The server runs behind nginx.

I mis-diagnosed the problem. It seems that both Firefox and Chrome perform certain optimizations when loading the same page address multiple times. I thought the server was becoming unresponsive, but in fact no requests were being generated for it to serve.
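
For anyone who hits similar symptoms and wants to confirm whether workers really are being restarted, gunicorn's server hooks can log worker lifecycle events. A minimal sketch of a gunicorn.conf.py (run with gunicorn -c gunicorn.conf.py myapp:app, where myapp is a placeholder module name):

# gunicorn.conf.py -- log worker lifecycle so restarts show up clearly
loglevel = "debug"

def post_fork(server, worker):
    # Runs right after a worker is forked; every new PID gets logged.
    server.log.info("worker spawned (pid: %s)", worker.pid)

def worker_int(worker):
    # Worker received SIGINT/SIGQUIT.
    worker.log.warning("worker interrupted (pid: %s)", worker.pid)

def worker_abort(worker):
    # Worker received SIGABRT, e.g. after exceeding the request timeout.
    worker.log.warning("worker aborted (pid: %s)", worker.pid)

def worker_exit(server, worker):
    # Runs just after a worker has exited.
    server.log.info("worker exited (pid: %s)", worker.pid)

If connection resets really were killing workers, the exit/spawn lines would line up with each browser refresh; here they would have shown that no requests (and no restarts) were happening at all.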

Related

Multiple gunicorn workers prevents flask app from making https calls

A simple flask app accepts requests and then makes calls to https endpoints. Using gunicorn with multiple worker processes leads to ssl failures.
Using flask run works perfectly, albeit slowly.
Using gunicorn --preload --workers 1 also works perfectly, albeit slowly.
Changing to gunicorn --preload --workers 10 very frequently fails with [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC], which leads me to think that some per-connection state is being corrupted. But gunicorn is supposed to fork before it begins serving requests.
Ideas?
I was using --preload to avoid having each worker retrieve the initial OAuth context used in some of the HTTPS web API calls. The rule of thumb is that when fork() is involved (as it is in gunicorn), you really need to understand what is happening to the SSL state.
The solution was to disable --preload and do the OAuth setup individually in each worker.
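
A rough sketch of that approach, with hypothetical names (fetch_oauth_token stands in for whatever the app actually does to obtain a token): each worker lazily builds its own requests.Session and OAuth context after the fork, so no TLS or token state is inherited from the pre-fork master.

import os
import requests  # assumption: the https calls are made with requests

_session = None
_owner_pid = None

def fetch_oauth_token(session):
    # Placeholder for the app's real OAuth handshake.
    return "dummy-token"

def get_session():
    # Return a requests.Session that was created in this worker process.
    global _session, _owner_pid
    if _session is None or _owner_pid != os.getpid():
        _session = requests.Session()
        _session.headers["Authorization"] = "Bearer " + fetch_oauth_token(_session)
        _owner_pid = os.getpid()
    return _session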

flask socket-io, sometimes client calls freeze the server

I occasionally have a problem with flask socket-io freezing, and I have no clue how to fix it.
My client connects to my Socket.IO server and runs some chat sessions. It works nicely, but for some reason there is sometimes a call from the client side that blocks the whole server (the server gets stuck in that call, and all other requests freeze). What is strange is that the server stays blocked as long as the client-side app is not shut down completely. This is an iOS app / web page, and I must totally close the app or the Safari page. Closing the socket itself, and even deallocating it, doesn't resolve the problem. When the app is in the background, the sockets are closed and deallocated, but the problem persists.
This is a small server, and it deals with both HTML pages and the socket server, so I have no idea whether it is the socket itself or the HTML that blocks the process. But each time the server froze, the log showed some socket calls.
Here is how I configured my server:
socketio = SocketIO(app, ping_timeout=5)
socketio.run(app, host='0.0.0.0', port=5001, debug=True, ssl_context=context)
So my question is:
What can freeze the server? (This seems to happen when I leave the app or website open for a long time while doing nothing; if I use the services normally, the server never freezes.) And how can I prevent it from happening? (Even if I don't know what is causing this, is there a way to blindly stop my server from getting stuck on a call?)
Thank you for the answers.
According to your comment above, you are using the Flask development web server, without the help of an asynchronous framework such as eventlet or gevent. Besides this option being highly inefficient, you should know that this web server is not battle tested; it is meant for short-lived tests during development. I'm not sure it is able to run for very long, especially under the unusual conditions Flask-SocketIO puts it through, which regular Flask apps do not exercise. I think it is quite possible that you are hitting some obscure bug in Werkzeug that causes it to hang.
My recommendation is that you install eventlet and try again. All you need to do is pip install eventlet, and assuming you are not passing an explicit async_mode argument, then just by installing this package Flask-SocketIO should configure itself to use it.
I would also remove the explicit timeout setting. In almost all cases, the defaults are sufficient to maintain a healthy connection.
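
Putting those suggestions together, the setup might look something like this (a sketch only; the certificate paths are placeholders for the original ssl_context):

# Assumes: pip install eventlet
from flask import Flask
from flask_socketio import SocketIO

app = Flask(__name__)

# No explicit async_mode and no ping_timeout override: with eventlet
# installed, Flask-SocketIO should select it automatically.
socketio = SocketIO(app)

if __name__ == "__main__":
    print(socketio.async_mode)  # expected to report "eventlet"
    socketio.run(app, host="0.0.0.0", port=5001,
                 certfile="cert.pem", keyfile="key.pem")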

Apache process idling and eating memory

I'm stuck trying to debug an Apache process that keeps growing in memory. I'm running Apache 2.4.6 with MPM Prefork on a virtual Ubuntu host with 4 GB of RAM, serving a Django app with mod_wsgi. The app is heavy with AJAX calls and Apache is getting between 300-1000 requests per minute. Here's what I'm seeing:
As soon as I restart Apache, the first child process (with the lowest PID) keeps growing its memory usage, reaching over a gigabyte in 6 or 7 minutes. All the other Apache processes keep their memory usage between 10 MB and 50 MB per process.
CPU usage for the troublesome process will fluctuate, sometimes dipping down very low, other times hovering at 20% or sometimes spiking higher.
The troublesome process will run indefinitely until I restart Apache.
I can see in my Django logs that the troublesome process is serving some requests to multiple remote IPs (I'm seeing reports of caught exceptions for URLs my app doesn't like, primarily).
Apache error logs will often (but not always) show "IOError: failed to write data" for the PID, sometimes across multiple IPs.
Apache access logs do not show any completed requests associated with this PID.
Running strace on the PID gets no results other than 'restart_syscall(<... resuming interrupted call ...>' even when I can see that PID mentioned in my app logs at a time when strace was running.
I've tried setting low values of MaxRequestsPerChild and MaxMemFree and neither has seemed to have any effect.
What could this be, or how could I debug further? The fact that I see no strace output makes me think my application is stuck in an infinite loop. If that were the case, how could I go about tracing the PID back to the code path it executed or the request that started the trouble?
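
One way to answer that last question (a suggestion, not something from the original post) is to have the WSGI application register faulthandler, so that sending a signal makes the stuck process dump the Python stack of every thread into Apache's error log:

# Near the top of the WSGI entry point (the file name wsgi.py is a placeholder).
import faulthandler
import signal

# Afterwards, `kill -USR1 <pid>` on the runaway process writes the current
# traceback of every thread to stderr, showing which view or code path it
# is spinning in.
faulthandler.register(signal.SIGUSR1, all_threads=True)
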
Instead of restarting Apache, stop and start it. There is a known, unfixed memory-leak issue with Apache.
Also, consider using nginx and gunicorn: this setup is a lighter, faster, and often recommended alternative for serving your Django app and static files.
References:
Performance
Memory Usage
Apache/Nginx Comparison

Is there any way to restart gunicorn workers on error

Is there any way to capture a python error and restart the gunicorn worker that it came from?
I get intermittent InterfaceError: connection already closed errors that essentially stop a worker from processing further requests that require the database. Restarting the worker manually (or via newrelic http hooks) gets rid of the issue.
Stack is heroku + newrelic.
Obviously there's an underlying issue with the code somewhere, but whilst we try to find it, it'd be good to know that the workers are restarting reliably.
Thanks
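
There is no built-in gunicorn option for restarting a worker on an arbitrary application error, but as a stopgap you can lean on the fact that the master re-spawns any worker that exits. A rough sketch of a WSGI middleware that does this, assuming the database driver is psycopg2 (wrap your WSGI app with it in the usual way):

import os
import signal
from psycopg2 import InterfaceError  # assumption: PostgreSQL via psycopg2

class RestartWorkerOnDBError(object):
    # If the wrapped app raises InterfaceError, let the request fail but
    # also terminate this worker; the gunicorn master forks a replacement.
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        try:
            return self.app(environ, start_response)
        except InterfaceError:
            os.kill(os.getpid(), signal.SIGTERM)  # graceful worker shutdown
            raise

This only papers over the symptom (the in-flight request still fails), but it stops one broken worker from poisoning every later database request.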

Error R14 (Memory quota exceeded) Not visible in New Relic

I keep getting an Error R14 (Memory quota exceeded) on Heroku.
Profiling the memory of the Django app locally, I don't see any issues. We've installed New Relic, and things seem fine there, except for one oddity:
http://screencast.com/t/Uv1W3bjd
Memory use hovers around 15 MB per dyno, but for some reason the 'dynos running' figure quickly scales up to 10+. Not sure how that makes any sense since we are currently only running one web dyno.
We are also running Celery, and things look normal there as well (around 15 MB), although it is suspect because I believe we started having the error around the time this launched.
Some of our requests do take a while, as they make a SOAP request to EchoSign, which can sometimes take 6-10 seconds to respond. Is this somehow blocking and causing new dynos to spin up?
Here is my proc file:
web: python manage.py collectstatic --noinput; python manage.py compress; newrelic-admin run-program python manage.py run_gunicorn -b "0.0.0.0:$PORT" -w 9 -k gevent --max-requests 250
celeryd: newrelic-admin run-program python manage.py celeryd -E -B --loglevel=INFO
The main problem is the memory error though.
I BELIEVE I may have found the issue.
Based on posts like these I thought that I should have somewhere in the area of 9-10 gunicorn workers. I believe this is incorrect (or at least, it is for the work my app is doing).
I had been running 9 gunicorn workers, and finally realized that was the only real difference between heroku and local (as far as configuration).
According to the gunicorn design document the advice for workers goes something like this:
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.

Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
And while the information out there about Heroku dyno CPU capability is vague, I've now read that each dyno runs on something like 1/4 of a core. Not super powerful, but powerful enough, I guess.
Dialing my workers down to 3 (which is even high according to their rough formula) appears to have stopped my memory issues, at least for now. When I think about it, the interesting thing about the memory warning I would get is that it never climbed further. It got to around 103% and then just stayed there, whereas if it were actually a leak it should have kept rising until being shut down. So my theory is that my workers were eventually consuming just enough memory to go above 512 MB.
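
For reference, the rule of thumb is easy to spell out; rounding the dyno up to one effective core gives 3, which matches what ended up working here (WEB_CONCURRENCY is just a conventional environment-variable name, not something from the original Procfile):

import multiprocessing
import os

def suggested_workers(cores=None):
    # The gunicorn design-doc rule of thumb: (2 x cores) + 1.
    cores = cores or multiprocessing.cpu_count()
    return 2 * cores + 1

# On Heroku, cpu_count() reports the host's cores rather than the dyno's
# roughly quarter-core share, so an explicit override is the safer choice.
workers = int(os.environ.get("WEB_CONCURRENCY", suggested_workers(cores=1)))
print(workers)  # with one effective core: (2 * 1) + 1 = 3
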
HEROKU SHOULD ADD THIS INFORMATION SOMEWHERE!! And at the very least I should be able to top into my running dyno to see what's going on. Would have saved me hours and days.
