I keep getting an Error R14 (Memory quota exceeded) on Heroku.
Profiling the memory of the Django app locally, I don't see any issues. We've installed New Relic, and things seem fine there, except for one oddity:
http://screencast.com/t/Uv1W3bjd
Memory use hovers around 15 MB per dyno, but for some reason the 'dynos running' figure quickly scales up to 10+. I'm not sure how that makes any sense, since we are currently only running one web dyno.
We are also running Celery, and things look normal there as well (around 15 MB), although it is suspect because I believe we started having the error around the time this launched.
Some of our requests do take a while, as they make a SOAP request to EchoSign, which can sometimes take 6-10 seconds to respond. Is this somehow blocking and causing new dynos to spin up?
Here is my Procfile:
web: python manage.py collectstatic --noinput; python manage.py compress; newrelic-admin run-program python manage.py run_gunicorn -b "0.0.0.0:$PORT" -w 9 -k gevent --max-requests 250
celeryd: newrelic-admin run-program python manage.py celeryd -E -B --loglevel=INFO
The main problem is the memory error though.
I BELIEVE I may have found the issue.
Based on posts like these I thought that I should have somewhere in the area of 9-10 gunicorn workers. I believe this is incorrect (or at least, it is for the work my app is doing).
I had been running 9 gunicorn workers, and finally realized that was the only real difference between heroku and local (as far as configuration).
According to the gunicorn design document the advice for workers goes something like this:
DO NOT scale the number of workers to the number of clients you expect
to have. Gunicorn should only need 4-12 worker processes to handle
hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load
balancing when handling requests. Generally we recommend (2 x
$num_cores) + 1 as the number of workers to start off with. While not
overly scientific, the formula is based on the assumption that for a
given core, one worker will be reading or writing from the socket
while the other worker is processing a request.
And while information about Heroku dyno CPU capabilities is thin on the ground, I've now read that each dyno runs on something around 1/4 of a core. Not super powerful, but powerful enough, I guess.
Dialing my workers down to 3 (which is still high according to their rough formula) appears to have stopped my memory issues, at least for now. When I think about it, the interesting thing about the memory warning is that it would never go up: it got to around 103% and then just stayed there, whereas if it were actually a leak, it should have kept rising until the dyno was shut down. So my theory is that my workers were eventually consuming just enough memory to go above 512 MB.
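For reference, here is a sketch of the adjusted web line, identical to the Procfile above except for the worker count:
web: python manage.py collectstatic --noinput; python manage.py compress; newrelic-admin run-program python manage.py run_gunicorn -b "0.0.0.0:$PORT" -w 3 -k gevent --max-requests 250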
HEROKU SHOULD ADD THIS INFORMATION SOMEWHERE!! And at the very least I should be able to run top in my running dyno to see what's going on. It would have saved me hours and days.
Related
I have a very long-running script that ingests a PDF, does a lot of processing, then returns a result. It runs perfectly when served on port 8000 through either
python manage.py runserver 0.0.0.0:8000
or
gunicorn --bind 0.0.0.0:8000 myproject.wsgi
However, when I run it via port 80 in "production", the script stops running at a certain point with no errors and seemingly no holes in the logic. What's really causing confusion is that it stops in different places depending on the length/complexity of the processed document: short/simple ones complete with no issue, but a longer one will stop in the middle.
I tried adding a very detailed log file to debug the issue. If I process one document, it stops running in the same loop but at different places within the loop (seemingly random), indicating that this isn't a logical flaw (note I'm writing and flushing). Furthermore, if I use a longer/more complex document it mysteriously stops earlier in the process.
I'm deploying this with Django via gunicorn/nginx on DigitalOcean.
Is there some sort of built in protection that stops processes after a certain number of CPU cycles or time as protection against infinite loops in any of the above? That's the only thing that I can think of because I'm otherwise out of ideas.
I'd really appreciate any help!
Figured it out. Gunicorn has a built-in timer that kills workers after a set amount of time. The default (30 seconds, per gunicorn's documentation) was too short for my process. To fix it, add the --timeout option to the ExecStart line in the gunicorn service file; standard setup on Ubuntu 20.04:
sudo nano /etc/systemd/system/gunicorn.service
then add the --timeout option to the ExecStart line (I used 120 seconds in this example):
ExecStart=/home/sammy/myprojectdir/myprojectenv/bin/gunicorn \
--access-logfile - \
--workers 3 \
--timeout 120 \
--bind unix:/run/gunicorn.sock \
myproject.wsgi:application
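After editing the unit file, systemd needs to reload it before the new timeout takes effect; assuming the standard setup above, something like:
sudo systemctl daemon-reload
sudo systemctl restart gunicorn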
I determined this by looking at journalctl, which records stdout. To view the most recent 50 lines of the stream, enter the following into your terminal:
journalctl | tail -50
In my case, I noticed an entry containing "[CRITICAL] WORKER TIMEOUT (pid:xxxxxx)"
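With the unit set up as above, a slightly more targeted variant is to filter the journal to just that service:
journalctl -u gunicorn.service -n 50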
I'm using uwsgi version 2.0.13.1 with the following config:
bin/uwsgi -M -p 5 -C -A 4 -m -b 8192 -s :3031 --wsgi-file bin/django.wsgi --pidfile var/run/uwsgi.pid --touch-reload=var/run/reload-uwsgi.touch --max-requests=1000 --reload-on-rss=450 --py-tracebacker var/run/pytrace --auto-procname --stats 127.0.0.1:3040 --threads 40 --reload-mercy 600 --listen 200
(absolute path names cut)
When I run uwsgitop, all 5 workers appear as busy. When I try to get the stack trace for each worker / thread, using the py-tracebacker, I get no answer. The processes just hang.
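(For reference, the per-worker tracebacker sockets are created with the prefix given to --py-tracebacker in the config above, and are normally read by connecting to them, e.g.:
bin/uwsgi --connect-and-read var/run/pytrace1
This is how I'm attempting to get the stack traces that hang.)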
How can I investigate what exactly makes the uwsgi processes hang?
How could I prevent this situation?
I know about the harakiri parameter, but I'm not sure whether the process is killed if it still has other active threads.
PS: reload-mercy is set to a very high value to avoid killing workers that still have active threads (this seems to be a bug). We have some web requests which still take a very long time (they are on their way to being converted to background jobs).
Thanks in advance.
Although I already added a comment, here is a longer description.
Warning: the problem only arose when using more than one worker process AND more than one thread (-p --threads).
Short version: in Python 2.7.x some modules are not 100% thread-safe during import (logging, implicit imports of codecs). Try to import all such problematic modules in the wsgi file given to uwsgi (i.e., before uwsgi forks).
Long version: In https://github.com/unbit/uwsgi/issues/1599 I analysed the problem and found that it could be related to a Python bug involving the logging module. The problem can be solved by importing and initializing any critical modules BEFORE uwsgi forks, which happens after the wsgi script given to uwsgi is executed.
I finally resolved my problem by importing the Django settings.ROOT_URLCONF, directly or indirectly, from the wsgi file. This also has the benefit of reducing memory consumption (because much more of the code base is shared between workers) and per-worker initialization time.
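A minimal sketch of what that wsgi file can look like; the project name myproject and the specific modules pre-imported here are illustrative placeholders, not the exact ones from my setup:
# wsgi.py, loaded by uwsgi via --wsgi-file; everything imported here runs
# in the master process, before uwsgi forks its workers
import os
import logging          # pre-import modules that are not import-safe under threads
import encodings.idna   # example of a codecs submodule that is otherwise imported lazily

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()

# Pull in the URLconf (and, through it, most of the code base) pre-fork,
# so workers share that memory and start faster.
from importlib import import_module
from django.conf import settings
import_module(settings.ROOT_URLCONF)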
We have a Python application with some Celery workers.
We use the following command to start a Celery worker:
python celery -A proj worker --queue=myqueue -P prefork --maxtasksperchild=500
We have two issues with our celery workers.
1. We have a memory leak.
2. We have a pretty big load and need a lot of workers to process everything fast.
We're still looking into the memory leak, but since it's legacy code it's pretty hard to find the cause, and it will take some time to resolve. To contain the leak we're using --maxtasksperchild, so each worker restarts itself after processing 500 tasks. And it works OK; memory only grows to a certain level.
The second issue is a bit harder. To process all events from our Celery queue we have to start more workers. But with prefork each process eats a lot of memory (about 110 MB in our case), so we either need a lot of servers to start the right number of workers, or we have to switch from prefork to eventlet:
python celery -A proj worker --queue=myqueue -P eventlet --concurrency=10
In this case we'll use the same amount of memory (about 110 MB per process), but each process will run 10 concurrent workers, which is much more memory-efficient. The problem is that we still have issue #1 (the memory leak), and we can't use --maxtasksperchild because it doesn't work with eventlet.
Any thoughts on how to use something like --maxtasksperchild with eventlet?
Upgrade Celery. I've just quickly scanned the master code; they promise max-memory-per-child. Hopefully it works with all concurrency models. I haven't tried it yet.
Set up process monitoring and send a graceful terminate signal to workers above a memory threshold (see the sketch below). Works for me.
Run Celery in a control group (cgroup) with limited memory. Works for me.
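A rough sketch of the monitoring approach from the second suggestion, assuming psutil is installed; the 300 MB threshold is illustrative, not a recommendation:
import signal
import psutil

RSS_LIMIT = 300 * 1024 * 1024  # illustrative threshold in bytes

for proc in psutil.process_iter(attrs=["cmdline", "memory_info"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    mem = proc.info["memory_info"]
    if mem and "celery" in cmdline and "worker" in cmdline and mem.rss > RSS_LIMIT:
        # SIGTERM asks the worker for a warm shutdown (finish current tasks, then exit);
        # whatever supervises the workers should then start a fresh one.
        proc.send_signal(signal.SIGTERM)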
I'm stuck trying to debug an Apache process that keeps growing in memory size. I'm running Apache 2.4.6 with MPM Prefork on virtual Ubuntu host with 4GB of RAM, serving a Django app with mod_wsgi. The app is heavy with AJAX calls and Apache is getting between 300-1000 requests per minute. Here's what I'm seeing:
As soon as I restart Apache, the first child process (with the lowest PID) will keep growing its memory usage, reaching over a gig in 6 or 7 minutes. All the other Apache processes keep memory usage between 10 MB-50 MB per process.
CPU usage for the troublesome process will fluctuate, sometimes dipping down very low, other times hovering at 20% or sometimes spiking higher.
The troublesome process will run indefinitely until I restart Apache.
I can see in my Django logs that the troublesome process is serving some requests to multiple remote IPs (I'm seeing reports of caught exceptions for URLs my app doesn't like, primarily).
Apache error logs will often (but not always) show "IOError: failed to write data" for the PID, sometimes across multiple IPs.
Apache access logs do not show any completed requests associated with this PID.
Running strace on the PID gets no results other than 'restart_syscall(<... resuming interrupted call ...>' even when I can see that PID mentioned in my app logs at a time when strace was running.
I've tried setting low values of MaxRequestsPerChild and MaxMemFree and neither has seemed to have any effect.
What could this be, and how can I debug further? The fact that I see no output from strace makes me suspect my application has an infinite loop. If that were the case, how could I go about tracing the PID back to the code path it executed, or the request that started the trouble?
Instead of restarting Apache, stop it and then start it again. There is a known, unfixed memory-leak issue with Apache.
Also, consider using nginx and gunicorn: this setup is a lighter-weight, faster, and often-recommended alternative for serving your Django app and static files.
References:
Performance
Memory Usage
Apache/Nginx Comparison
Is there a canonical code deployment strategy for Tornado-based web applications? Our current configuration is 4 Tornado processes running behind Nginx. (Our specific use case is on EC2.)
We currently have a solution that works well enough, whereby we launch the four Tornado processes and save the PIDs to a file in /tmp/. Upon deploying new code, we run the following sequence via Fabric (a rough sketch follows the list):
Do a git pull from the prod branch.
Remove the machine from the load balancer.
Wait for all in-flight connections to finish with a sleep.
Kill all the tornadoes in the pid file and remove all *.pyc files.
Restart the tornadoes.
Attach the machine back to the load balancer.
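Here is a rough Fabric 1.x sketch of that sequence; the paths, the PID file name, and the load-balancer and start helper scripts are hypothetical placeholders, not our actual tooling:
from fabric.api import cd, run, settings

def deploy():
    with cd("/srv/myapp"):                        # hypothetical checkout path
        run("git pull origin prod")               # 1. pull the prod branch
    run("/srv/myapp/bin/remove_from_lb.sh")       # 2. hypothetical LB detach helper
    run("sleep 30")                               # 3. let in-flight connections drain
    with settings(warn_only=True):
        run("kill $(cat /tmp/tornado.pids)")      # 4. stop the tornadoes...
    run("find /srv/myapp -name '*.pyc' -delete")  #    ...and remove *.pyc files
    run("/srv/myapp/bin/start_tornadoes.sh")      # 5. hypothetical restart script
    run("/srv/myapp/bin/add_to_lb.sh")            # 6. hypothetical LB attach helper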
We've taken some inspiration from this: http://agiletesting.blogspot.com/2009/12/deploying-tornado-in-production.html
Are there any other complete solutions out there?
We run Tornado+Nginx with supervisord as the supervisor.
Sample configuration (names changed)
[program:server]
process_name = server-%(process_num)s
command=/opt/current/vrun.sh /opt/current/app.py --port=%(process_num)s
stdout_logfile=/var/log/server/server.log
stderr_logfile=/var/log/server/server.err
numprocs = 6
numprocs_start = 7000
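With that configuration, the six processes form a group that can be managed together from supervisorctl, for example:
supervisorctl status
supervisorctl restart server:*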
I've yet to find the "best" way to restart things. What I'll probably end up doing is having Nginx serve an "active" file that is updated to let HAProxy know we're messing with the configuration, then wait a bit, swap things around, and re-enable everything.
We're using Capistrano (we've got a backlog task to move to Fabric), but instead of dealing with removing *.pyc files we symlink /opt/current to the release identifier.
I haven't deployed Tornado in production, but I've been playing with Gevent + Nginx and have been using Supervisord for process management (start/stop/restart, logging, monitoring); supervisorctl is very handy for this. Like I said, it's not a deployment solution, but maybe a tool worth using.