Exposing a database through a Flask-based API, I use locks in view functions to avoid issues with non-atomic database operations.
E.g. dealing with a PUT, if the database driver does not provide an atomic upsert feature, I just grab the lock, read, update, write, then release the lock.
AFAIU, this works in a multi-threaded environment, but since the lock belongs to the Flask app, it fails if multiple processes are used. Is this correct?
If so, how do people deal with locks when using multiple processes? Do they use an external store such as Redis to hold the locks?
Subsidiary question. My apache config is
WSGIDaemonProcess app_name threads=5
Can I conclude that I'm on the safe side as long as I don't add some processes=N in there?
Related
This seems like a pretty obvious question, but a lot of the documentation around this is very confusing, and warns me not to keep global state instead of telling me how to.
For example, if I need to have a database connection pool (I'm not using SQLAlchemy), or a pool of object instances (both of which need to be global pools, centrally managed), how do I do that?
If I use flask.g, that's not shared between threads, and if I use a python global, that's not shared between multiple processes of the same application (which, as I understand, can be spawned in the case of large production flask servers). Do I use flask.current_app? Do I make the pool a separate process itself? Something else?
The warnings about "not keeping a per-process global state" in a web backend app (you'll have the very same issue with Django or any WSGI app) apply only to state that you expect to be shared between requests AND processes.
If it's OK for you to have per-process state (for example, a db connection is typically per-process state) then it's not an issue. Regarding connection pooling, you could (or not) decide that having distinct pools per server process is OK.
For anything else (any state that needs to be shared among processes), this is usually handled by using some external database or cache process, so if you want to have one single connection pool for all your Flask processes you will indeed have to use a distinct server process for maintaining the pool.
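As an illustration of the per-process option, here is a minimal sketch of a module-level pool. `ConnectionPool` and `pool` are hypothetical names, and the standard library's `sqlite3` stands in for whatever driver is actually used:

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Tiny fixed-size pool; every server process that imports it gets its own."""

    def __init__(self, database, size=3):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be handed to
            # whichever thread borrows it; only one thread holds it at a time.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # blocks while the pool is empty
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


# Module-level global: shared by the threads of this process only.
# Note that with ":memory:" each connection is its own private database.
pool = ConnectionPool(":memory:")
```

Each worker process ends up with its own `pool` object. That is fine as long as nothing expects the pool itself to be shared across processes.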
Also note that:
multiple processes of the same application (which, as I understand, can be spawned in the case of large production flask servers)
Actually this has nothing to do with being "large". With a traditional "blocking" server, you can only handle concurrent requests by using either multithreading or multiprocessing. The Unix philosophy traditionally favors multiprocessing (the "prefork" model) for various reasons, and Python's multithreading is bordering on useless anyway (at least in this context), so you don't have much choice if you hope to serve more than one request at a time.
To make a long story short, consider that just about any production setup for a WSGI app will run multiple processes in the background, period.
I am developing a web app using SQLAlchemy's expression language, not its ORM. I want to use multiple threads in my app, but I'm not sure about thread safety. I am using this section of the documentation to make a connection. I think this is thread safe because I reference a specific connection in each request. Is this thread safe?
The docs for connections and sessions say that neither is thread safe or intended to be shared between threads.
The Connection object is not thread-safe. While a Connection can be shared among threads using properly synchronized access, it is still possible that the underlying DBAPI connection may not support shared access between threads. Check the DBAPI documentation for details.
The Session is very much intended to be used in a non-concurrent fashion, which usually means in only one thread at a time.
The Session should be used in such a way that one instance exists for a single series of operations within a single transaction.
The bigger point is that you should not want to use the session with multiple concurrent threads.
There is no guarantee when using the same connection (and transaction context) in more than one thread that the behavior will be correct or consistent.
You should use one connection or session per thread. If you need guarantees about the data, you should set the isolation level for the engine or session. For web applications, SQLAlchemy suggests using one connection per request cycle.
This simple correspondence of web request and thread means that to associate a Session with a thread implies it is also associated with the web request running within that thread, and vice versa, provided that the Session is created only after the web request begins and torn down just before the web request ends.
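The "one connection per thread, created when the request starts and torn down when it ends" idea can be sketched with `threading.local`. This is not SQLAlchemy's own mechanism: `sqlite3` is used here only as a stand-in, and `get_connection` / `close_connection` are hypothetical helper names:

```python
import sqlite3
import threading

# Each thread sees its own independent attributes on this object.
_local = threading.local()


def get_connection(database=":memory:"):
    """Return the calling thread's own connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(database)
        _local.conn = conn
    return conn


def close_connection():
    """Call at the end of the request to tear this thread's connection down."""
    conn = getattr(_local, "conn", None)
    if conn is not None:
        conn.close()
        _local.conn = None
```

Because the connection is looked up through thread-local storage, two threads can never end up holding the same connection object.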
I think you are confusing atomicity with isolation.
Atomicity is usually handled through transactions and concerns integrity.
Isolation is about concurrent read/write to a database table (thus thread safety). For example: if you want to increment an int field of a table's record, you will have to select the record's field, increment the value and update it. If multiple threads are doing this concurrently the result will depend on the order of the reads/writes.
http://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=isolation#engine-creation-api
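To make the increment example concrete, here is a small sketch that replays the problematic interleaving by hand on one `sqlite3` connection (the table name `hits` is an assumption for illustration). It then shows the variant where the database does the arithmetic in a single statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (id INTEGER PRIMARY KEY, n INTEGER NOT NULL)")
conn.execute("INSERT INTO hits (id, n) VALUES (1, 0)")


def read_n():
    return conn.execute("SELECT n FROM hits WHERE id = 1").fetchone()[0]


# Lost update: two clients both read the old value, then both write old + 1.
seen_by_a = read_n()
seen_by_b = read_n()
conn.execute("UPDATE hits SET n = ? WHERE id = 1", (seen_by_a + 1,))
conn.execute("UPDATE hits SET n = ? WHERE id = 1", (seen_by_b + 1,))
lost_update_result = read_n()  # two increments were attempted, one was lost

# Letting the database compute n + 1 makes each increment one statement,
# so there is no gap between the read and the write.
conn.execute("UPDATE hits SET n = 0 WHERE id = 1")
conn.execute("UPDATE hits SET n = n + 1 WHERE id = 1")
conn.execute("UPDATE hits SET n = n + 1 WHERE id = 1")
atomic_result = read_n()
```

When the update cannot be expressed as a single statement, the select-then-update sequence has to run inside a transaction with a suitable isolation level (or an explicit row lock).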
I am a bit confused about the multiprocessing feature of mod_wsgi and about the general design of WSGI applications that would be executed on WSGI servers with multiprocessing ability.
Consider the following directive:
WSGIDaemonProcess example processes=5 threads=1
If I understand correctly, mod_wsgi will spawn 5 Python (e.g. CPython) processes and any of these processes can receive a request from a user.
The documentation says that:
Where shared data needs to be visible to all application instances, regardless of which child process they execute in, and changes made to the data by one application are immediately available to another, including any executing in another child process, an external data store such as a database or shared memory must be used. Global variables in normal Python modules cannot be used for this purpose.
But in that case it gets really heavy when one wants to be sure that an app runs in any WSGI conditions (including multiprocessing ones).
For example, a simple variable which contains the current amount of connected users - should it be process-safe read/written from/to memcached, or a DB or (if such out-of-the-standard-library mechanisms are available) shared memory?
And will the code like
counter = 0
@app.route('/login')
def login():
    global counter  # needed to rebind the module-level name
    ...
    counter += 1
    ...
@app.route('/logout')
def logout():
    global counter
    ...
    counter -= 1
    ...
@app.route('/show_users_count')
def show_users_count():
    return str(counter)
behave unpredictably in multiprocessing environment?
Thank you!
There are several aspects to consider in your question.
First, the interaction between apache MPM's and mod_wsgi applications. If you run the mod_wsgi application in embedded mode (no WSGIDaemonProcess needed, WSGIProcessGroup %{GLOBAL}) you inherit multiprocessing/multithreading from the apache MPM's. This should be the fastest option, and you end up having multiple processes and multiple threads per process, depending on your MPM configuration. On the contrary if you run mod_wsgi in daemon mode, with WSGIDaemonProcess <name> [options] and WSGIProcessGroup <name>, you have fine control on multiprocessing/multithreading at the cost of a small overhead.
Within a single apache2 server you may define zero, one, or more named WSGIDaemonProcesses, and each application can be run in one of these processes (WSGIProcessGroup <name>) or run in embedded mode with WSGIProcessGroup %{GLOBAL}.
You can check multiprocessing/multithreading by inspecting the wsgi.multithread and wsgi.multiprocess variables.
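A minimal sketch of such a check: a bare WSGI callable (the name `application` is the one mod_wsgi looks for by default) that reports the two environ keys defined by the WSGI specification:

```python
def application(environ, start_response):
    """Minimal WSGI app that reports the server's concurrency model."""
    body = "multithread={} multiprocess={}\n".format(
        environ.get("wsgi.multithread"),   # True if threads may run the app concurrently
        environ.get("wsgi.multiprocess"),  # True if other processes may run it too
    ).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Requesting this app under a given configuration shows which of the two kinds of concurrency it has to be prepared for.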
With your configuration WSGIDaemonProcess example processes=5 threads=1 you have 5 independent processes, each with a single thread of execution: no global data, no shared memory, since you are not in control of spawning subprocesses, but mod_wsgi is doing it for you. To share a global state you already listed some possible options: a DB to which your processes interface, some sort of file system based persistence, a daemon process (started outside apache) and socket based IPC.
As pointed out by Roland Smith, the latter could be implemented using a high level API by multiprocessing.managers: outside apache you create and start a BaseManager server process
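import multiprocessing.managers
m = multiprocessing.managers.BaseManager(address=('', 12345), authkey=b'secret')  # authkey must be bytes on Python 3
m.get_server().serve_forever()
and inside your apps you connect:
import multiprocessing.managers
m = multiprocessing.managers.BaseManager(address=('localhost', 12345), authkey=b'secret')  # assumes the server runs on the same host
m.connect()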
The example above is a dummy, since m has no useful method registered, but in the Python docs for multiprocessing.managers you will find how to create and proxy an object (like the counter in your example) among your processes.
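A sketch of what registering and proxying a counter might look like. `Counter`, `CounterServer`, `CounterClient`, `make_server` and `connect` are hypothetical names chosen for illustration; the port and authkey are the placeholder values from the example above:

```python
import threading
from multiprocessing.managers import BaseManager


class Counter:
    """The real object; it lives only in the standalone manager process."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # the manager serves clients in threads

    def add(self, delta):
        with self._lock:
            self._value += delta
            return self._value

    def get(self):
        with self._lock:
            return self._value


_counter = Counter()


def _get_counter():
    return _counter


class CounterServer(BaseManager):
    """Used by the process started outside Apache."""


class CounterClient(BaseManager):
    """Used inside each WSGI process."""


CounterServer.register("get_counter", callable=_get_counter)
CounterClient.register("get_counter")  # the client only needs the name


def make_server(address=("127.0.0.1", 12345), authkey=b"secret"):
    # Call .serve_forever() on the result in the standalone process.
    return CounterServer(address=address, authkey=authkey).get_server()


def connect(address=("127.0.0.1", 12345), authkey=b"secret"):
    manager = CounterClient(address=address, authkey=authkey)
    manager.connect()
    return manager.get_counter()  # a proxy; method calls go over the socket
```

In the apps, `login()` would then call `connect().add(1)` and `logout()` would call `connect().add(-1)`, so every process updates the same object.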
A final comment on your example, with processes=5 threads=1. I understand that this is just an example, but in real world applications I suspect that performance will be comparable with respect to processes=1 threads=5: you should go into the intricacies of sharing data in multiprocessing only if the expected performance boost over the 'single process many threads' model is significant.
From the docs on processes and threading for wsgi:
When Apache is run in a mode whereby there are multiple child processes, each child process will contain sub interpreters for each WSGI application.
This means that in your configuration, 5 processes with 1 thread each, there will be 5 interpreters and no shared data. Your counter object will be unique to each interpreter. You would need to either build some custom solution to count sessions (one common process you can communicate with, some kind of persistence based solution, etc.) OR, and this is definitely my recommendation, use a prebuilt solution (Google Analytics and Chartbeat are fantastic options).
I tend to think of using globals to share data as a big form of global abuse. It's a source of bugs as well as a portability issue in most of the environments I've done parallel processing in. What if suddenly your application was to be run on multiple virtual machines? This would break your code no matter what the sharing model of threads and processes.
If you are using multiprocessing, there are multiple ways to share data between processes. Values and Arrays only work if processes have a parent/child relation (they are shared by inheriting). If that is not the case, use a Manager and Proxy objects.
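For the parent/child case, a minimal sketch with `multiprocessing.Value` might look like this (`add_many` and `run_workers` are illustrative names). It does not apply to mod_wsgi daemon processes, since your code does not spawn them:

```python
import multiprocessing


def add_many(shared, times):
    """Increment a shared multiprocessing.Value safely, `times` times."""
    for _ in range(times):
        with shared.get_lock():  # += on .value is not atomic by itself
            shared.value += 1


def run_workers(n_workers=4, times=1000):
    """Start children that all update one Value created by the parent."""
    shared = multiprocessing.Value("i", 0)  # 'i' = signed int in shared memory
    workers = [
        # The Value is handed to each child when the child is created.
        multiprocessing.Process(target=add_many, args=(shared, times))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return shared.value
```

When running this as a script, call `run_workers()` under an `if __name__ == "__main__":` guard so that start methods which re-import the main module do not start workers recursively.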
Why do people need a distributed lock?
When the shared resource is protected by its local machine, does this mean that we do not need a distributed lock?
I mean when the shared resource is exposed to others by using some kind of API or service, and this API or service is protected using its local locks; then we do not need this kind of distributed lock; am I right?
After asking people on Quora, I think I got the answer.
Let's say N worker servers access a database server. There are two parts here:
The database has its own locking methods to protect the data from being corrupted by concurrent access from other clients (the N worker servers). This is where a local lock in the database comes in.
The N worker servers may need some coordination to make sure that they are doing the right thing, and this is application specific. This is where a distributed lock comes in. Say two worker servers run the same process, which drops a table of the database and then adds some records to it. The database server can guarantee that its internal data is consistent, but these two processes need a distributed lock to coordinate with each other; otherwise, one process will drop the other process's table.
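The coordination part can be sketched with a lock that lives outside any single process. The version below uses an exclusively created lock file, which only coordinates processes that share a filesystem; across machines, an external service would play the same role. `exclusive_lock` and `DEFAULT_LOCK_PATH` are hypothetical names, and the sketch ignores stale lock files left behind by a crashed holder:

```python
import contextlib
import os
import tempfile
import time

# Hypothetical location that every cooperating process agrees on.
DEFAULT_LOCK_PATH = os.path.join(tempfile.gettempdir(), "example-app.lock")


@contextlib.contextmanager
def exclusive_lock(path=DEFAULT_LOCK_PATH, timeout=5.0, poll=0.05):
    """Hold the lock for the duration of the with-block, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file exists, so only one
            # process at a time can create it.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError("could not acquire lock: " + path)
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(path)  # release: the next waiter can now create the file
```

Each worker would wrap the "drop the table, then add records" sequence in `with exclusive_lock():`, so the two steps of one worker cannot interleave with those of another.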
Yes and no. If you expose the resource through a local API that is guarded by a lock for mutual exclusion, then, depending on how that lock is set up, your implementation may already do exactly what a distributed lock is meant to accomplish. If you did not develop the API yourself, you will have to dig into its source to find out whether it uses a local or a distributed locking scheme. Honestly, a lock is a lock is a lock: it attempts to do the same thing no matter what. The benefit of a distributed lock over a local one is that it already accounts for queueing, which prevents clients from over-accessing expensive cache points.
My django app saves django models to a remote database. Sometimes the saves are bursty. In order to free the main thread (*thread_A*) of the application from the time toll of saving multiple objects to the database, I thought of transferring the model objects to a separate thread (*thread_B*) using collections.deque and have *thread_B* save them sequentially.
Yet I'm unsure regarding this scheme. save() returns the id of the new database entry, so it "ends" only after the database responds, which is at the end of the transaction.
Does django.db.models.Model.save() really block GIL-wise and release other python threads during the transaction?
Django's save() does nothing special to the GIL. In fact, there is hardly anything you can do with the GIL in Python code -- when it is executed, the thread must hold the GIL.
There are only two ways the GIL could get released in save():
Python decides to switch threads (after sys.getcheckinterval() bytecode instructions in Python 2; since Python 3.2 the switch is time-based, see sys.getswitchinterval())
Django calls a database interface routine that is implemented to release the GIL
The second point could be what you are looking for -- a SQL COMMIT is executed and during that execution, the SQL backend releases the GIL. However, this depends on the SQL interface, and I'm not sure if the popular ones actually release the GIL (but see the update below).
Moreover, save() does a lot more than just running a few UPDATE/INSERT statements and a COMMIT; it does a lot of bookkeeping in Python, where it has to hold the GIL. In summary, I'm not sure that you will gain anything from moving save() to a different thread.
UPDATE: From looking at the sources, I learned that both the sqlite module and psycopg do release the GIL when they are calling database routines, and I guess that other interfaces do the same.
Generally you should never have to worry about threads in a Django application. If you're serving your application with Apache, gunicorn or nearly any other server other than the development server, the server will spawn multiple processes and evade the GIL entirely. The exception is if you're using gunicorn with gevent, in which case there will be multiple processes but also microthreads inside those processes -- in that case concurrency helps a bit, but you don't have to manage the threads yourself to take advantage of that. The only case where you need to worry about the GIL is if you're trying to spawn multiple threads to handle a single request, which is not usually a good idea.
The Django save() method does not release the GIL itself, but the database backend will (in most cases the bulk of the time spent in save() will be doing database I/O). However, it's almost impossible to properly take advantage of this in a well-designed web application. Responses from your view should be fast even when done synchronously -- if they are doing too much work to be fast, then use a delayed job with Celery or another task queue to finish up the extra work. If you try to thread in your view, you'll have to finish up that thread before sending a response to the client, which in most cases won't help anything and will just add extra overhead.
I think Python doesn't lock anything by itself, but the database does.