I'm stuck trying to debug an Apache process that keeps growing in memory size. I'm running Apache 2.4.6 with MPM Prefork on a virtual Ubuntu host with 4GB of RAM, serving a Django app with mod_wsgi. The app is heavy with AJAX calls and Apache is getting between 300 and 1000 requests per minute. Here's what I'm seeing:
As soon as I restart Apache, the first child process (the one with the lowest PID) keeps growing its memory usage, reaching over a gig in 6 or 7 minutes. All the other Apache processes keep their memory usage between 10MB and 50MB per process.
CPU usage for the troublesome process will fluctuate, sometimes dipping down very low, other times hovering at 20% or sometimes spiking higher.
The troublesome process will run indefinitely until I restart Apache.
I can see in my Django logs that the troublesome process is serving some requests to multiple remote IPs (I'm seeing reports of caught exceptions for URLs my app doesn't like, primarily).
Apache error logs will often (but not always) show "IOError: failed to write data" for the PID, sometimes across multiple IPs.
Apache access logs do not show any completed requests associated with this PID.
Running strace on the PID gets no results other than 'restart_syscall(<... resuming interrupted call ...>' even when I can see that PID mentioned in my app logs at a time when strace was running.
I've tried setting low values of MaxRequestsPerChild and MaxMemFree and neither seems to have any effect.
What could this be, or how could I debug further? The fact that I see no strace output makes me suspect that my application is stuck in an infinite loop. If that were the case, how could I go about tracing the PID back to the code path it executed or the request that started the trouble?
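One way to attack the "trace the PID back to the code path" part is to have the stuck interpreter dump its own Python stacks on demand. A minimal sketch, assuming Python 3 (or the faulthandler backport on Python 2) and that a couple of lines can be added to the WSGI entry point; the placement is illustrative, and note that mod_wsgi restricts some signal handling by default (see WSGIRestrictSignal), so this is an assumption worth verifying:
# In the WSGI script that mod_wsgi loads, before the application is created.
import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes that process write the traceback of
# every thread to stderr (which Apache normally routes to its error log)
# without terminating it, which is handy for catching an infinite loop in the act.
faulthandler.register(signal.SIGUSR1, all_threads=True)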
Instead of restarting Apache, stop and then start Apache; there is a known memory leak issue in Apache that will not be fixed and that a restart does not clear.
Also, consider using nginx and gunicorn: this setup is a lighter-weight, faster, and often recommended alternative for serving your Django app and its static files.
When I use copy-on-write to start up a bunch of WSGI workers, they freak out and use a ton of CPU and memory whenever I need to restart the main process. This causes OOM errors and I'd like to avoid it. I've tested with uwsgi and gunicorn and I see the same behavior.
Problem
I have a webapp with somewhat memory-intensive, process-based workers, so I use --preload in gunicorn (and the default behavior in uwsgi) so that the application is fully loaded before forking. This enables copy-on-write between the processes and saves on memory usage.
When I shut down the main process (e.g. via SIGINT or SIGTERM), all of the worker processes spike in CPU usage and my machine (Ubuntu 16.04, but also tested on Debian 10) loses a huge chunk of available memory, which often causes an OOM error. I don't see any rise in RES for the individual workers, but the drop in available memory roughly corresponds to what I would expect if all of the shared memory were fully copied into each worker and then immediately de-allocated during its shutdown. I'd like to avoid this sudden spike in memory usage and fully enjoy the benefits of copy-on-write.
Test environment
I have a really simple Flask app that you can use to test this:
from flask import Flask

application = Flask(__name__)

my_data = {"data{0}".format(i): "value{0}".format(i) for i in range(2000000)}

@application.route("/")
def index():
    return "I have {0} data items totalling {1} characters".format(
        len(my_data), sum(len(k) + len(v) for k, v in my_data.items()))
You can start the app with either of the following commands:
$ gunicorn --workers=16 --preload app:application
$ uwsgi --http :8080 --processes=16 --wsgi-file app.py
When I do ^C on the main process in my terminal and track the "free" KiB Mem reported by top, that's when I see the huge drop in available memory and the spike in CPU usage. Note that there is no change in memory usage reported for each worker. Is there a way to safely restart uwsgi/gunicorn so that this memory and CPU spike doesn't happen?
Steps to reproduce:
Set up app.py as described above
Run either gunicorn or uwsgi with the arguments provided above.
Observe free memory and CPU usage (using top)
5.7GB free on my machine before startup
5.3GB free on my machine after startup
Ctrl-C on the main gunicorn/uwsgi process
1.3GB free while processes are shutting down (and CPU usage spikes)
5.7GB free after all processes actually shut down (2-5 seconds later)
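If the copy-before-deallocation hypothesis above is right (normal interpreter teardown in each worker touches every shared object and so dirties the copy-on-write pages all at once), one mitigation idea is to skip the teardown in the workers. A minimal sketch for the gunicorn case, assuming its worker_int server hook and a hypothetical gunicorn_conf.py passed with -c; I have not verified that this removes the spike, and gc.freeze() after preload (Python 3.7+) is another commonly suggested knob for protecting shared pages:
# gunicorn_conf.py (hypothetical): skip interpreter teardown in workers.
import os

def worker_int(worker):
    # Called in the worker process on SIGINT/SIGQUIT. os._exit() bypasses the
    # normal interpreter shutdown, so the worker never walks (and therefore
    # never copies) the pages it shares copy-on-write with the master.
    worker.log.info("worker %s exiting without interpreter teardown", worker.pid)
    os._exit(0)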
It seems to me that my gunicorn workers are restarted each time a connection is reset by a browser (e.g. by reloading a page while a request is still in progress, or as a result of connectivity problems).
This doesn't seem to be a sensible behaviour. Effectively I can bring down all the workers just by refreshing a page in a browser a few times.
Questions:
What are the possible causes for a gunicorn worker restart?
What would be the right way to debug this behaviour?
I'm using Pyramid 1.4, Gunicorn (tried eventlet, gevent and sync workers - all demonstrate the same behaviour). The server runs behind nginx.
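On the second question, one way to make worker restarts visible is to turn up gunicorn's logging and hook the worker lifecycle. A minimal sketch, assuming gunicorn's standard server hooks; gunicorn_debug.py is a hypothetical file name you would pass with -c:
# gunicorn_debug.py (hypothetical): log every worker exit and abort.
loglevel = "debug"

def worker_exit(server, worker):
    # Called in the master after a worker has exited, for any reason.
    server.log.warning("worker %s exited", worker.pid)

def worker_abort(worker):
    # Called in a worker on SIGABRT, which usually means it hit the timeout.
    worker.log.warning("worker %s aborted (likely a timeout)", worker.pid)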
I misdiagnosed the problem. It seems that both Firefox and Chrome perform certain optimizations when loading the same page address multiple times. I thought the server was becoming unresponsive, but in fact there were simply no requests being generated for it to serve.
I am currently trying to run a long running python script on Ubuntu 12.04. The machine is running on a Digital Ocean droplet. It has no visible memory leaks (top shows constant memory). After running without incident (there are no uncaught exceptions and the used memory does not increase) for about 12 hours, the script gets killed.
The only messages present in syslog relating to the script are
Sep 11 06:35:06 localhost kernel: [13729692.901711] select 19116 (python), adj 0, size 62408, to kill
Sep 11 06:35:06 localhost kernel: [13729692.901713] send sigkill to 19116 (python), adj 0, size 62408
I've encountered similar problems before (with other scripts) in Ubuntu 12.04 but the logs then contained the additional information that the scripts were killed by oom-killer.
Those scripts, as well as this one, occupy a maximum of 30% of available memory.
Since I can't find any problems with the actual code, could this be an OS problem? If so, how do I go about fixing it?
Your process was indeed killed by the oom-killer. The log message "select … to kill" hints at that.
Your script probably didn't do anything wrong; it was simply selected to be killed because it was using the most memory.
You have to provide more free memory, by adding more (virtual) RAM if you can, by moving other services from this machine to a different one, or by trying to optimize memory usage in your script.
See e.g. Debug out-of-memory with /var/log/messages for debugging hints. You could also try to spare your script from being killed: How to set OOM killer adjustments for daemons permanently? Note, though, that exempting one process just makes the OOM killer pick some other, more or less arbitrary process, which can leave the whole machine in an unstable state. In the end you will have to sort out the memory requirements and then make sure enough memory is available for peak loads.
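If you do decide to adjust the script's standing with the OOM killer, here is a minimal sketch of doing it from inside the script, assuming a Linux kernel that exposes /proc/<pid>/oom_score_adj (values run from -1000 to 1000, and writing negative values normally requires root or CAP_SYS_RESOURCE):
# Lower this process's attractiveness to the OOM killer.
def set_oom_score_adj(value, pid="self"):
    # -1000 means "never kill", 0 is the default, 1000 means "kill first".
    with open("/proc/{0}/oom_score_adj".format(pid), "w") as f:
        f.write(str(value))

set_oom_score_adj(-500)  # make this script a much less likely victim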
I keep getting an Error R14 (Memory quota exceeded) on Heroku.
Profiling memory on the Django app locally, I don't see any issues. We've installed New Relic, and things seem to be fine there, except for one oddity:
http://screencast.com/t/Uv1W3bjd
Memory use hovers around 15MB per dyno, but for some reason the 'dynos running' figure quickly scales up to 10+. Not sure how that makes any sense, since we are currently only running one web dyno.
We are also running celery, and things seem to look normal there as well (around 15MB). Although it is suspect, because I believe we started having the error around the time this launched.
Some of our requests do take a while, as they make a SOAP request to echosign, which can sometimes take 6-10 seconds to respond. Is this somehow blocking and causing new dynos to spin up?
Here is my proc file:
web: python manage.py collectstatic --noinput; python manage.py compress; newrelic-admin run-program python manage.py run_gunicorn -b "0.0.0.0:$PORT" -w 9 -k gevent --max-requests 250
celeryd: newrelic-admin run-program python manage.py celeryd -E -B --loglevel=INFO
The main problem is the memory error though.
I BELIEVE I may have found the issue.
Based on posts like these I thought that I should have somewhere in the area of 9-10 gunicorn workers. I believe this is incorrect (or at least, it is for the work my app is doing).
I had been running 9 gunicorn workers, and finally realized that was the only real difference between heroku and local (as far as configuration).
According to the gunicorn design document the advice for workers goes something like this:
DO NOT scale the number of workers to the number of clients you expect
to have. Gunicorn should only need 4-12 worker processes to handle
hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load
balancing when handling requests. Generally we recommend (2 x
$num_cores) + 1 as the number of workers to start off with. While not
overly scientific, the formula is based on the assumption that for a
given core, one worker will be reading or writing from the socket
while the other worker is processing a request.
And while there isn't much solid information out there about Heroku dyno CPU capability, I've now read that each dyno runs on something like 1/4 of a core. Not super powerful, but powerful enough, I guess.
Dialing my workers down to 3 (which is still high according to their rough formula) appears to have stopped my memory issues, at least for now. When I think about it, the interesting thing about the memory warning I would get is that it never went up. It got to around 103% and then just stayed there, whereas if it were actually a leak, it should have kept rising until being shut down. So my theory is that my workers were eventually consuming just enough memory to go above 512MB.
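For concreteness, here's the rule of thumb from the design doc as a quick calculation; note that multiprocessing.cpu_count() reports the underlying host's cores, not the dyno's roughly 1/4-core share, so on Heroku it will overestimate:
# (2 x $num_cores) + 1, per the gunicorn design docs.
import multiprocessing

workers = 2 * multiprocessing.cpu_count() + 1
print(workers)  # 9 on a 4-core box; on a ~1/4-core dyno, 2-3 is a saner starting point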
HEROKU SHOULD ADD THIS INFORMATION SOMEWHERE!! And at the very least I should be able to top into my running dyno to see what's going on. Would have saved me hours and days.
Is there a canonical code deployment strategy for Tornado-based web applications? Our current configuration is 4 Tornado processes running behind Nginx. (Our specific use case is on EC2.)
We've currently got a solution that works well enough, whereby we launch the four Tornado processes and save the PIDs to a file in /tmp/. Upon deploying new code, we run the following sequence via fabric (sketched after the list):
Do a git pull from the prod branch.
Remove the machine from the load balancer.
Wait for all in flight connections to finish with a sleep.
Kill all the tornadoes in the pid file and remove all *.pyc files.
Restart the tornadoes.
Attach the machine back to the load balancer.
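For reference, a rough sketch of that sequence as a Fabric 1.x task; the paths, PID-file name, and load-balancer attach/detach scripts are hypothetical placeholders, not our actual setup:
import time
from fabric.api import run

def deploy():
    run("cd /opt/app && git pull origin prod")       # 1. pull the prod branch
    run("/opt/app/bin/lb_detach.sh")                 # 2. hypothetical: detach this machine from the LB
    time.sleep(30)                                   # 3. let in-flight connections finish
    run("kill $(cat /tmp/tornado.pid)")              # 4. stop the tornadoes...
    run("find /opt/app -name '*.pyc' -delete")       #    ...and remove stale *.pyc files
    run("/opt/app/bin/start_tornadoes.sh")           # 5. hypothetical: restart the tornadoes
    run("/opt/app/bin/lb_attach.sh")                 # 6. hypothetical: re-attach to the LB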
We've taken some inspiration from this: http://agiletesting.blogspot.com/2009/12/deploying-tornado-in-production.html
Are there any other complete solutions out there?
We run Tornado+Nginx with supervisord as the supervisor.
Sample configuration (names changed)
[program:server]
process_name = server-%(process_num)s
command=/opt/current/vrun.sh /opt/current/app.py --port=%(process_num)s
stdout_logfile=/var/log/server/server.log
stderr_logfile=/var/log/server/server.err
numprocs = 6
numprocs_start = 7000
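For restarts, here is a minimal sketch of a rolling restart driven through supervisord's XML-RPC API, assuming an [inet_http_server] listening on localhost:9001 (not shown in the config above):
import time
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("http://localhost:9001/RPC2")

for info in proxy.supervisor.getAllProcessInfo():
    if info["group"] != "server":
        continue
    name = "{0}:{1}".format(info["group"], info["name"])
    proxy.supervisor.stopProcess(name)   # take one backend down...
    proxy.supervisor.startProcess(name)  # ...bring it back up...
    time.sleep(2)                        # ...and let it warm up before touching the next one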
I've yet to find the "best" way to restart things. What I'll probably end up doing is having Nginx serve an "active" file that gets updated to let HAProxy know we're messing with the configuration, then waiting a bit, swapping things around, and re-enabling everything.
We're using Capistrano (we've got a backlog task to move to Fabric), but instead of dealing with removing *.pyc files we symlink /opt/current to the release identifier.
I haven't deployed Tornado in production, but I've been playing with Gevent + Nginx and have been using Supervisord for process management (start/stop/restart, logging, monitoring); supervisorctl is very handy for this. Like I said, not a deployment solution, but maybe a tool worth using.