I have Apache + mod_wsgi + Django app. mod_wsgi runs in daemon mode.
I have one view that fetches a significant queryset from the DB, additionally allocates an array by computing results from that queryset, and then returns this array. I'm not using thread-local storage, global variables, or anything like that.
The problem is that my app eats memory in proportion to the number of threads I set for mod_wsgi.
I ran a small experiment: I set various numbers of threads in mod_wsgi, hit my view with curl, and checked how high the wsgi process climbs in memory.
It goes like this:
1 thread - 256 MB
2 threads - 400 MB
3 threads - 535 MB
4 threads - 650 MB
So each thread adds about 120-140 MB to the peak memory usage.
It seems like the memory allocated for the first request is never freed. In the single-thread scenario it is reused when a second request to the same view arrives; I can live with that.
But when I use multiple threads, each time a request is processed by a thread that has never run this view before, that thread "saves" another ~140 MB somewhere locally.
How can I fix this?
Probably Django saves some data in thread-local storage. If that is the case, how can I disable it?
Alternatively, as a workaround, is it possible to bind request execution to a certain thread in mod_wsgi?
Thanks.
PS. DEBUG is set to False in settings.py
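For what it's worth, the peak allocation each thread reaches in a view like this can often be reduced by streaming the queryset instead of materializing it all at once. A minimal sketch, assuming a hypothetical Item model with a value field (neither is taken from the question):

# views.py - hedged sketch; Item and its value field are illustrative.
from django.http import JsonResponse

from myapp.models import Item  # hypothetical model

def heavy_view(request):
    # .iterator() streams rows from the database cursor instead of
    # caching the whole queryset in memory, so the peak allocation
    # each mod_wsgi thread ever reaches (and keeps) stays lower.
    results = []
    for item in Item.objects.all().iterator(chunk_size=2000):
        results.append(item.value * 2)  # placeholder computation
    return JsonResponse({'results': results})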
In this sort of situation, what you should do is vertically partition your web application so that it runs across multiple mod_wsgi daemon process groups. That way you can tailor the configuration of each daemon process group to the requirements of the subset of URLs you delegate to it. Since the admin interface URLs of a Django application often have high transient memory requirements, yet aren't used very often, a recommended setup is:
WSGIScriptAlias / /my/path/site/wsgi.py
WSGIApplicationGroup %{GLOBAL}
WSGIDaemonProcess main processes=3 threads=5
WSGIProcessGroup main
WSGIDaemonProcess admin threads=2 inactivity-timeout=60
<Location /admin>
WSGIProcessGroup admin
</Location>
So what this does is create two daemon process groups. By default, URLs will be handled in the main daemon process group, where the processes are persistent.
The URLs for the admin interface, however, will be directed to the admin daemon process group, which can be set up as a single process with a reduced number of threads, plus an inactivity timeout so that the process is restarted automatically if the admin interface isn't used for 60 seconds, thereby reclaiming any excessive transient memory usage.
This means that a request to the admin interface may be slightly slower if the process has been recycled since the last request, as everything has to be loaded again; but since it is the admin interface and not a public URL, this is generally acceptable.
Related
This is most likely an issue with the server configuration rather than Flask itself. I have a function that uses a Python library called streaming-form-data that allows me to upload big files (roughly 70 MB each) without waiting an absurd amount of time per upload. Here is the core of it (the part that matters for this question, anyway), which receives the data in chunks until there are no more:
while True:
    chunk = request.stream.read(8192)
    if not chunk:
        break
    parser.data_received(chunk)
When I run this on my local machine, the files are stored almost instantly, within seconds. However, on an EC2 server with apache2 and mod_wsgi it takes well over a minute, which is not ideal. By refreshing FileZilla I can see the files being saved chunk by chunk, but it's far slower than when run on my local machine. Here is the part of the config that could reasonably affect it:
WSGIDaemonProcess metadata user=www-data group=www-data processes=5 threads=25 python-path=/var/www/html/APPNAME:/var/www/html/APPNAME/virtualenv/lib/python3.6/site-packages
WSGIScriptAlias / /var/www/html/APPNAME/app.wsgi
I've already messed around with processes and threads with no real visible effect. I'm not much of a server guy; which part of the config or code could actually improve performance? The EC2 metrics show that CPU and memory usage don't really spike to dangerous levels either, so I'm guessing there's a way to change the config around.
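For context, a self-contained version of such a chunked upload handler looks roughly like this. This is only a sketch: the form field name 'file' and the save path are assumptions, not taken from the question.

# Hedged sketch of a chunked upload view using streaming-form-data.
from flask import Flask, request
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload():
    parser = StreamingFormDataParser(headers=request.headers)
    # Stream the 'file' form field straight to disk as chunks arrive.
    parser.register('file', FileTarget('/tmp/upload.dat'))
    while True:
        chunk = request.stream.read(8192)
        if not chunk:
            break
        parser.data_received(chunk)
    return 'OK'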
Exposing a database through a Flask-based API, I use locks in view functions to avoid issues with non-atomic database operations.
E.g. when dealing with a PUT, if the database driver does not provide an atomic upsert feature, I just grab the lock, read, update, write, then release the lock, roughly like the sketch below.
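A minimal sketch of that pattern; the db_read/db_write helpers are hypothetical placeholders:

import threading

from flask import Flask, request

app = Flask(__name__)
db_lock = threading.Lock()

@app.route('/items/<item_id>', methods=['PUT'])
def put_item(item_id):
    # Read-modify-write guarded by a lock that lives in this process.
    with db_lock:
        record = db_read(item_id)          # hypothetical helper
        record.update(request.get_json())
        db_write(item_id, record)          # hypothetical helper
    return '', 204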
AFAIU, this works in a multi-threaded environment, but since the lock belongs to the Flask app's process, it fails if multiple processes are used. Is this correct?
If so, how do people deal with locks when using multiple processes? Do they use an external store such as Redis to hold the locks?
A subsidiary question: my Apache config is
WSGIDaemonProcess app_name threads=5
Can I conclude that I'm on the safe side as long as I don't throw some processes=N in there?
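For what it's worth, the usual answer to the cross-process case is indeed an external lock service. A minimal sketch using redis-py's built-in lock; the lock name, timeout, and db_read/db_write helpers are assumptions:

import redis
from flask import Flask, request

app = Flask(__name__)
r = redis.Redis()

@app.route('/items/<item_id>', methods=['PUT'])
def put_item(item_id):
    # Unlike threading.Lock, this lock is visible to every process
    # (and every host) that talks to the same Redis server.
    with r.lock('item-lock:%s' % item_id, timeout=10):
        record = db_read(item_id)          # hypothetical helper
        record.update(request.get_json())
        db_write(item_id, record)          # hypothetical helper
    return '', 204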
I used Flask to build a mini server on Heroku. The server side code looks something like this:
import json

from flask import Flask
from flask_cors import CORS, cross_origin

app = Flask(__name__)
schedule = {'Basketball': 'old value'}

@app.route("/")
@cross_origin()
def get_all_schedule():
    return json.dumps(schedule)

@app.route("/update", methods=['POST'])
def update_basketball_schedule():
    global schedule
    schedule['Basketball'] = 'new value'
    return 'updated'

if __name__ == "__main__":
    app.run(host='0.0.0.0')
I have one global dictionary, schedule, to store the schedule data. I use the POST /update URL to update the schedule and the / URL to get the data; it seems pretty straightforward.
I am testing this application in my Chrome browser. I called the POST URL once. Then, when I call /, sometimes it returns the dictionary with "new value" and sometimes it returns the dictionary with "old value". What is the reason for this behavior?
I am using a free dyno on Heroku.
My Procfile contains:
web: gunicorn server:app
Heroku dynos occasionally reset, die, or are otherwise disabled. When that happens, the values of all variables stored in memory are lost. To combat this, you can use Redis or another key/value store to hold your data.
I have one global dictionary schedule to store the schedule data
You can't rely on variables to maintain state like this.
For starters, Gunicorn will run with multiple processes by default:
Gunicorn forks multiple system processes within each dyno to allow a Python app to support multiple concurrent requests without requiring them to be thread-safe. In Gunicorn terminology, these are referred to as worker processes (not to be confused with Heroku worker processes, which run in their own dynos).
Each forked system process consumes additional memory. This limits how many processes you can run in a single dyno. With a typical Django application memory footprint, you can expect to run 2–4 Gunicorn worker processes on a free, hobby or standard-1x dyno. Your application may allow for a variation of this, depending on your application’s specific memory requirements.
We recommend setting a configuration variable for this setting. Gunicorn automatically honors the WEB_CONCURRENCY environment variable, if set.
heroku config:set WEB_CONCURRENCY=3
The WEB_CONCURRENCY environment variable is automatically set by Heroku, based on the processes’ Dyno size. This feature is intended to be a sane starting point for your application. We recommend knowing the memory requirements of your processes and setting this configuration variable accordingly.
Each request you make could be handled by any of the Gunicorn workers, and setting WEB_CONCURRENCY to 1 isn't the right solution for a variety of reasons. For example, as Jake says, Heroku dynos restart frequently (at least once per day), and your state will be lost then as well.
Fortunately, Heroku offers a number of data store addons, including in-memory stores like Redis that might be a good fit here. This would let you share state across all of your Gunicorn workers and across dyno restarts. It would even work across dynos in case you ever need to scale your application that way.
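As an illustration, the schedule could live in Redis instead of a module-level dict. A minimal sketch; the 'schedule' key name is an assumption:

import json
import os

import redis
from flask import Flask

app = Flask(__name__)
# Heroku's Redis add-ons set REDIS_URL automatically.
r = redis.from_url(os.environ.get('REDIS_URL', 'redis://localhost:6379'))

@app.route("/")
def get_all_schedule():
    raw = r.get('schedule')
    return raw.decode() if raw else json.dumps({})

@app.route("/update", methods=['POST'])
def update_basketball_schedule():
    # Every Gunicorn worker (and every dyno) sees this write.
    r.set('schedule', json.dumps({'Basketball': 'new value'}))
    return 'updated'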
Let's say that we have a rather typical Django web application:
- there is an Nginx in front of the app doing proxy stuff and serving static content
- there is gunicorn starting workers to handle Django requests
- there is a Django-based web app doing all kinds of fun stuff
- there is a Redis server for sessions/cache
- there is a MySQL database serving queries from Django
Some URLs serve basically just a rendered Django template with almost no queries; some pages incorporate some info from Redis. But there are a few pages that do rather involved database queries, which can (after all possible optimizations) take several seconds to execute on the MySQL side.
And here is my problem: each time a gunicorn worker gets a request for such a heavy URL, it no longer serves other requests for a while; it just sits there idle, waiting for the database to reply. If there are enough such queries, then eventually all the workers sit idle waiting on the heavy URLs, leaving none to serve the other, faster pages.
Is there a way to either let a worker do other work while it is waiting on a database reply, or to somehow scale up the worker pool in such a situation (preferably without also scaling RAM usage and database connection count :))? At the very least, is there a way to get statistics on how many workers in a gunicorn pool are busy and how long each of them has been processing a request?
A simple way that might work in your case would be to increase the number of workers. The recommended number of workers is 2-4 x {NUM CPUS}. Depending on load and the type of requests to the site this might be enough.
If increasing the number of workers isn't enough, the next step is to look into using async workers (docs about it here). More detailed configuration options are described here. Note that depending on which type of async worker you choose, you will have to install either eventlet, gevent, or tornado; a sketch of a gevent setup follows.
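For example, a gevent setup can be expressed in a gunicorn config file roughly like this; the worker counts are illustrative, not tuned:

# gunicorn_conf.py - hedged sketch; run with:
#   gunicorn -c gunicorn_conf.py myproject.wsgi
# With gevent workers, a greenlet blocked on a (monkey-patched) socket
# yields, so one slow query no longer freezes the whole worker.
worker_class = 'gevent'    # requires: pip install gevent
workers = 3                # a few processes for CPU-bound work
worker_connections = 100   # concurrent greenlets per worker
timeout = 60               # recycle workers stuck longer than this

One caveat: a C database driver that gevent cannot monkey-patch will still block the entire worker, so this is usually paired with a pure-Python driver such as PyMySQL.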
I am a bit confused about the multiprocessing feature of mod_wsgi and about the general design of WSGI applications that will be executed on WSGI servers with multiprocessing ability.
Consider the following directive:
WSGIDaemonProcess example processes=5 threads=1
If I understand correctly, mod_wsgi will spawn 5 Python (e.g. CPython) processes and any of these processes can receive a request from a user.
The documentation says that:
Where shared data needs to be visible to all application instances, regardless of which child process they execute in, and changes made to the data by one application are immediately available to another, including any executing in another child process, an external data store such as a database or shared memory must be used. Global variables in normal Python modules cannot be used for this purpose.
But in that case it gets really heavyweight when one wants to be sure that an app runs under any WSGI conditions (including multiprocessing ones).
For example, a simple variable holding the current number of connected users - should it be read/written in a process-safe way from/to memcached, a DB, or (if such out-of-the-standard-library mechanisms are available) shared memory?
And will code like

counter = 0

@app.route('/login')
def login():
    global counter
    ...
    counter += 1
    ...

@app.route('/logout')
def logout():
    global counter
    ...
    counter -= 1
    ...

@app.route('/show_users_count')
def show_users_count():
    return str(counter)

behave unpredictably in a multiprocessing environment?
Thank you!
There are several aspects to consider in your question.
First, the interaction between Apache MPMs and mod_wsgi applications. If you run the mod_wsgi application in embedded mode (no WSGIDaemonProcess needed, WSGIProcessGroup %{GLOBAL}) you inherit multiprocessing/multithreading from the Apache MPMs. This should be the fastest option, and you end up with multiple processes and multiple threads per process, depending on your MPM configuration. If, on the contrary, you run mod_wsgi in daemon mode, with WSGIDaemonProcess <name> [options] and WSGIProcessGroup <name>, you get fine control over multiprocessing/multithreading at the cost of a small overhead.
Within a single apache2 server you may define zero, one, or more named WSGIDaemonProcess groups, and each application can be run in one of these process groups (WSGIProcessGroup <name>) or run in embedded mode with WSGIProcessGroup %{GLOBAL}.
You can check multiprocessing/multithreading by inspecting the wsgi.multithread and wsgi.multiprocess variables.
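For instance, a trivial WSGI app can report which execution model it ended up with (just a sketch):

# Hedged sketch: report the execution model mod_wsgi chose for this app.
def application(environ, start_response):
    body = 'multithread={}, multiprocess={}'.format(
        environ['wsgi.multithread'], environ['wsgi.multiprocess']).encode()
    start_response('200 OK', [('Content-Type', 'text/plain'),
                              ('Content-Length', str(len(body)))])
    return [body]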
With your configuration, WSGIDaemonProcess example processes=5 threads=1, you have 5 independent processes, each with a single thread of execution: no global data and no shared memory, since you are not in control of spawning the subprocesses; mod_wsgi is doing it for you. To share global state you have already listed some possible options: a DB to which your processes interface, some sort of filesystem-based persistence, or a daemon process (started outside Apache) with socket-based IPC.
As pointed out by Roland Smith, the latter could be implemented using the high-level API provided by multiprocessing.managers: outside Apache you create and start a BaseManager server process
import multiprocessing.managers

m = multiprocessing.managers.BaseManager(address=('', 12345), authkey=b'secret')
m.get_server().serve_forever()
and inside your apps you connect:
m = multiprocessing.managers.BaseManager(address=('', 12345), authkey=b'secret')
m.connect()
The example above is a dummy, since m has no useful methods registered, but here (python docs) you will find how to create and proxy an object (like the counter in your example) among your processes; a fuller sketch follows.
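Roughly along these lines; a hedged sketch, where the Counter class, port, and authkey are all illustrative choices:

# counter_manager.py - run this outside Apache, e.g. as a system service.
import threading
from multiprocessing.managers import BaseManager

class Counter:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # manager serves requests in threads

    def increment(self):
        with self._lock:
            self._value += 1

    def decrement(self):
        with self._lock:
            self._value -= 1

    def value(self):
        with self._lock:
            return self._value

counter = Counter()

class CounterManager(BaseManager):
    pass

# Always hand out the same Counter instance, so every connecting
# WSGI process ends up proxying one shared object.
CounterManager.register('get_counter', callable=lambda: counter)

if __name__ == '__main__':
    manager = CounterManager(address=('', 12345), authkey=b'secret')
    manager.get_server().serve_forever()

And inside each WSGI process:

from multiprocessing.managers import BaseManager

class CounterManager(BaseManager):
    pass

CounterManager.register('get_counter')

m = CounterManager(address=('', 12345), authkey=b'secret')
m.connect()
counter = m.get_counter()  # a proxy to the shared Counter
counter.increment()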
A final comment on your example, with processes=5 threads=1. I understand that this is just an example, but in real-world applications I suspect that performance will be comparable to processes=1 threads=5: you should go into the intricacies of sharing data across processes only if the expected performance boost over the 'single process, many threads' model is significant.
From the docs on processes and threading for wsgi:
When Apache is run in a mode whereby there are multiple child processes, each child process will contain sub interpreters for each WSGI application.
This means that in your configuration of 5 processes with 1 thread each, there will be 5 interpreters and no shared data. Your counter object will be unique to each interpreter. You would need to either build some custom solution to count sessions (one common process you can communicate with, some kind of persistence-based solution, etc.) OR, and this is definitely my recommendation, use a prebuilt solution (Google Analytics and Chartbeat are fantastic options).
I tend to think of using globals to share data like this as a big form of global abuse. It's a source of bugs and a portability issue in most of the environments where I've done parallel processing. What if your application suddenly had to run on multiple virtual machines? That would break your code no matter what sharing model of threads and processes you used.
If you are using multiprocessing, there are multiple ways to share data between processes. Value and Array objects only work if the processes have a parent/child relationship (they are shared by inheritance). If that is not the case, use a Manager and Proxy objects.
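A minimal sketch of the parent/child case with a shared Value (the worker logic is illustrative):

# Sharing a counter between a parent and its forked children via
# multiprocessing.Value; this only works with a parent/child relation.
from multiprocessing import Process, Value

def worker(counter):
    # get_lock() guards the increment against concurrent updates.
    with counter.get_lock():
        counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)  # shared integer, initial value 0
    procs = [Process(target=worker, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)  # prints 4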