I am developing a web app using SQLAlchemy's expression language, not its ORM. I want to use multiple threads in my app, but I'm not sure about thread safety. I am using this section of the documentation to make a connection. I think this is thread safe because I reference a specific connection in each request. Is it?
The docs for connections and sessions say that neither is thread safe or intended to be shared between threads.
The Connection object is not thread-safe. While a Connection can be shared among threads using properly synchronized access, it is still possible that the underlying DBAPI connection may not support shared access between threads. Check the DBAPI documentation for details.
The Session is very much intended to be used in a non-concurrent fashion, which usually means in only one thread at a time.
The Session should be used in such a way that one instance exists for a single series of operations within a single transaction.
The bigger point is that you should not want to use the session with multiple concurrent threads.
There is no guarantee when using the same connection (and transaction context) in more than one thread that the behavior will be correct or consistent.
You should use one connection or session per thread. If you need guarantees about the data, you should set the isolation level for the engine or session. For web applications, SQLAlchemy suggests using one connection per request cycle.
This simple correspondence of web request and thread means that to associate a Session with a thread implies it is also associated with the web request running within that thread, and vice versa, provided that the Session is created only after the web request begins and torn down just before the web request ends.
I think you are confusing atomicity with isolation.
Atomicity is usually handled through transactions and concerns integrity.
Isolation is about concurrent read/write to a database table (thus thread safety). For example: if you want to increment an int field of a table's record, you will have to select the record's field, increment the value and update it. If multiple threads are doing this concurrently the result will depend on the order of the reads/writes.
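The increment example above can be sketched with the standard library's sqlite3 module, used here as a stand-in for whatever database sits behind SQLAlchemy. This is a minimal sketch, not SQLAlchemy's API: the table name, the scratch file, and the helper names are all hypothetical. Each thread opens its own connection and wraps the read-modify-write in a transaction that takes the write lock first, so no other writer can slip in between the SELECT and the UPDATE.

```python
import os
import sqlite3
import tempfile
import threading


def increment(path, times):
    # One connection per thread. isolation_level=None lets us issue BEGIN ourselves.
    conn = sqlite3.connect(path, timeout=30, isolation_level=None)
    for _ in range(times):
        # BEGIN IMMEDIATE takes the write lock before the read happens.
        conn.execute("BEGIN IMMEDIATE")
        (value,) = conn.execute("SELECT value FROM counter WHERE id = 1").fetchone()
        conn.execute("UPDATE counter SET value = ? WHERE id = 1", (value + 1,))
        conn.execute("COMMIT")
    conn.close()


def run(threads=4, times=25):
    # Hypothetical scratch database holding a single counter row.
    path = os.path.join(tempfile.mkdtemp(), "counter.db")
    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE counter (id INTEGER PRIMARY KEY, value INTEGER)")
    setup.execute("INSERT INTO counter (id, value) VALUES (1, 0)")
    setup.commit()
    setup.close()

    workers = [threading.Thread(target=increment, args=(path, times))
               for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    check = sqlite3.connect(path)
    (total,) = check.execute("SELECT value FROM counter WHERE id = 1").fetchone()
    check.close()
    return total


total = run()
print(total)
```

Without the surrounding transaction, two threads could read the same value and both write value + 1, losing one increment. With other databases the equivalent guarantee usually comes from the isolation level or from a single atomic UPDATE statement.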
http://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=isolation#engine-creation-api
I'm planning to use SQLite and Peewee (ORM) for a light duty internal web service (<20 requests per second). The web service can handle multiple simultaneous requests on multiple threads. During each request the database will be both read from and written to. This means I will need to have the ability for both concurrent reads AND writes. It doesn't matter to this application if the data changes between reads and writes.
The SQLite FAQ says that concurrent reads are permitted but concurrent writes from multiple threads require acquiring the file lock. My question is: Does Peewee take care of this locking for me or is there something I need to do in my code to make this possible?
The Peewee database object is shared between threads. I assume this means that the database connection is shared too.
I can't find a Peewee specific answer to this so I'm asking here.
SQLite is the one doing the locking, although I can see how you might be confused -- the FAQ wording is a bit ambiguous:
When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business. Other embedded SQL database engines typically only allow a single process to connect to the database at once.
So if you have two threads, each with their own connection, and one acquires the write lock, the other thread will have to wait for the lock to be released before it can start writing.
Looking at pysqlite, the default busy timeout looks to be 5 seconds, so the second thread should wait up to 5 seconds before raising an OperationalError.
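The waiting behaviour described above can be observed with the standard library's sqlite3 module alone (no Peewee involved). In this sketch the scratch file and table name are hypothetical, and the timeout is shortened to 0.2 seconds so the example finishes quickly; the first connection holds the write lock while the second one tries to write.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "busy.db")  # hypothetical scratch file

# First connection: take the write lock and keep the transaction open.
writer = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE log (msg TEXT)")
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO log VALUES ('first')")

# Second connection: wait at most 0.2 s for the lock instead of the default 5 s.
waiter = sqlite3.connect(path, timeout=0.2)
try:
    waiter.execute("INSERT INTO log VALUES ('second')")
    blocked = False
except sqlite3.OperationalError:
    blocked = True
    waiter.rollback()  # discard the transaction the failed INSERT had opened

# Once the first writer commits, the second write goes through.
writer.execute("COMMIT")
waiter.execute("INSERT INTO log VALUES ('second')")
waiter.commit()

rows = [row[0] for row in waiter.execute("SELECT msg FROM log ORDER BY rowid")]
print(blocked, rows)

writer.close()
waiter.close()
```

A longer timeout simply means the second writer waits longer before giving up, which is usually what a low-traffic service wants.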
Also, I'd suggest instantiating your SqliteDatabase with threadlocals=True. That will store a connection-per-thread.
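What "a connection per thread" means can be illustrated with threading.local from the standard library. This is only a sketch of the idea, not Peewee's actual implementation; get_connection and demo are hypothetical helper names, and an in-memory database is used for brevity.

```python
import sqlite3
import threading

_local = threading.local()  # each thread sees its own attributes on this object


def get_connection(path=":memory:"):
    """Return the calling thread's connection, opening it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(path)
        _local.conn = conn
    return conn


def demo(n_threads=3):
    seen = []  # (connection, reused) pairs, one per thread
    lock = threading.Lock()

    def worker():
        first = get_connection()
        second = get_connection()  # same thread, so the same object comes back
        first.execute("SELECT 1").fetchone()
        with lock:
            seen.append((first, first is second))
        first.close()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    distinct = len({id(conn) for conn, _ in seen})
    reused = all(flag for _, flag in seen)
    return distinct, reused


distinct, reused = demo()
print(distinct, reused)
```

Every thread gets a different connection object, while repeated calls within one thread return the same one.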
Consider running all write operations within a single async process. This approach is what made JavaScript server programming so popular (although the idea has been known for far longer). You just need to be a bit familiar with the asynchronous programming concept of callbacks:
For SQLITE:
Async concept directly in Sqlite: https://www.sqlite.org/asyncvfs.html
APSW (Another SQLite Wrapper), which offers better support for SQLite extensions in Peewee: http://peewee.readthedocs.org/en/latest/peewee/playhouse.html#apsw
For ANY DB.
Consider writing your own thin async handler in Python,
as solved here, for example:
SQLAlchemy + Requests Asynchronous Pattern
I would recommend the last approach, as it gives you more code portability, control, independence from the backend database engine, and scalability.
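One possible shape for such a thin handler is a single writer thread that owns the only write connection and drains a queue that other threads fill. The sketch below uses only the standard library; the events table, the sentinel object, and the function names are hypothetical, and an in-memory SQLite database stands in for the real backend.

```python
import queue
import sqlite3
import threading

_STOP = object()  # sentinel telling the writer to shut down


def writer_loop(jobs, results):
    # The connection is created and used only inside the writer thread.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (source TEXT, n INTEGER)")
    while True:
        job = jobs.get()
        if job is _STOP:
            break
        sql, params = job
        conn.execute(sql, params)
        conn.commit()
    (count,) = conn.execute("SELECT COUNT(*) FROM events").fetchone()
    results.put(count)
    conn.close()


def run(producers=3, per_producer=10):
    jobs, results = queue.Queue(), queue.Queue()
    writer = threading.Thread(target=writer_loop, args=(jobs, results))
    writer.start()

    def produce(name):
        # Producers never touch the database; they only enqueue work.
        for n in range(per_producer):
            jobs.put(("INSERT INTO events VALUES (?, ?)", (name, n)))

    threads = [threading.Thread(target=produce, args=("p%d" % i,))
               for i in range(producers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    jobs.put(_STOP)
    writer.join()
    return results.get()


written = run()
print(written)
```

Because all writes pass through one thread, they are serialized without any explicit locking around the database.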
I have a multi threaded app in python, wherein I create multiple producer threads and they extract the data from DB. Data is extracted in chunks. So the part where a thread creates sql statement with limit values is kept within lock. And to let threads execute queries simultaneously, query() function is kept outside the lock. Then the result fetching part is again kept under the lock. Below is the code snippet:
    with UserAgent.lock:
        sqlGeoTarget = "call sp_ax_ari_select_user_agent_list('0'," + str(self.chunkStart) + "," + str(self.chunkSize) + ",1);"
        self.chunkStart += self.chunkSize
    self.dbObj.query(sqlGeoTarget)
    print "query executed. Processing data now..."+sqlGeoTarget
    with UserAgent.lock:
        result = self.dbObj.fetchAll()
        self.dbObj.dbCursor.close()
But this code fails with a fatal segmentation fault (core dumped). If I put all the code under the lock, it executes fine. I explicitly close the cursor after fetching the data; it is reopened when the query() function is called again.
This code is inside a class named UserAgent, which is a shared resource for a class named Producer, so the database object is shared. The problem is almost certainly that, because the db object is shared, running queries simultaneously and then closing the cursor corrupts the result set. How can I solve this and still execute database queries concurrently?
Do not reuse connections across threads. Create a new connection for each thread instead.
From the MySQLdb User Guide:
The MySQL protocol can not handle multiple threads using the same connection at once. Some earlier versions of MySQLdb utilized locking to achieve a threadsafety of 2. While this is not terribly hard to accomplish using the standard Cursor class (which uses mysql_store_result()), it is complicated by SSCursor (which uses mysql_use_result(); with the latter you must ensure all the rows have been read before another query can be executed. It is further complicated by the addition of transactions, since transactions start when a cursor execute a query, but end when COMMIT or ROLLBACK is executed by the Connection object. Two threads simply cannot share a connection while a transaction is in progress, in addition to not being able to share it during query execution. This excessively complicated the code to the point where it just isn't worth it.
The general upshot of this is: Don't share connections between threads. It's really not worth your effort or mine, and in the end, will probably hurt performance, since the MySQL server runs a separate thread for each connection. You can certainly do things like cache connections in a pool, and give those connections to one thread at a time. If you let two threads use a connection simultaneously, the MySQL client library will probably upchuck and die. You have been warned.
Emphasis mine.
Use thread local storage or a dedicated connection pooling library instead.
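A rough sketch of that advice applied to the chunked-producer scenario: the lock protects only the shared chunk offset, while each thread opens and uses its own connection. The standard library's sqlite3 module is swapped in for MySQLdb here so the example is self-contained; the user_agent table, ChunkAllocator, and the other names are hypothetical.

```python
import os
import sqlite3
import tempfile
import threading


class ChunkAllocator:
    """Hands out non-overlapping offsets; this is the only shared state."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self._start = 0
        self._lock = threading.Lock()

    def next_offset(self):
        with self._lock:
            start = self._start
            self._start += self.chunk_size
            return start


def producer(path, allocator, out, out_lock):
    conn = sqlite3.connect(path)  # private connection, never shared
    while True:
        offset = allocator.next_offset()
        rows = conn.execute(
            "SELECT id FROM user_agent ORDER BY id LIMIT ? OFFSET ?",
            (allocator.chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        with out_lock:
            out.extend(row[0] for row in rows)
    conn.close()


def run(total_rows=100, chunk_size=7, n_threads=4):
    path = os.path.join(tempfile.mkdtemp(), "agents.db")  # hypothetical scratch file
    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE user_agent (id INTEGER PRIMARY KEY)")
    setup.executemany("INSERT INTO user_agent VALUES (?)",
                      [(i,) for i in range(total_rows)])
    setup.commit()
    setup.close()

    allocator = ChunkAllocator(chunk_size)
    out, out_lock = [], threading.Lock()
    threads = [threading.Thread(target=producer,
                                args=(path, allocator, out, out_lock))
               for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(out)


fetched = run()
print(len(fetched))
```

The queries run concurrently because no connection or cursor is ever touched by more than one thread, and passing the limits as parameters avoids building the SQL by string concatenation.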
Why do people need a distributed lock?
When the shared resource is protected by its local machine, does this mean that we do not need a distributed lock?
I mean, when the shared resource is exposed to others through some kind of API or service, and this API or service is protected by its local locks, then we do not need a distributed lock; am I right?
After asking people on Quora, I think I got the answer.
Let's say N worker server access a database server. There are two parts here:
The database has its own locking methods to protect the data from being corrupted by concurrent access from other clients (the N worker servers). This is where a local lock in the database comes in.
The N worker servers may need some coordination to make sure that they are doing the right thing, and this is application specific. This is where a distributed lock comes in. Say two worker servers run the same process, which drops a table in the database and then adds some records to it. The database server can guarantee that its internal data is consistent, but the two processes need a distributed lock to coordinate with each other; otherwise, one process will drop the other process's table.
Yes and no. If you expose your data through a local API that uses a lock to provide mutual exclusion, then depending on how the lock is set up, your implementation might already do exactly what a distributed lock is meant to accomplish. But if you didn't develop the API yourself, you'll have to dig into its source to find out whether it uses a localized or a distributed locking system. Honestly, a lock is a lock is a lock; it attempts to do the same thing no matter what. The benefit of a distributed lock over a localized one is that it already accounts for queueing, which prevents clients from over-accessing expensive cache points.
My django app saves django models to a remote database. Sometimes the saves are bursty. In order to free the main thread (*thread_A*) of the application from the time toll of saving multiple objects to the database, I thought of transferring the model objects to a separate thread (*thread_B*) using collections.deque and have *thread_B* save them sequentially.
Yet I'm unsure regarding this scheme. save() returns the id of the new database entry, so it "ends" only after the database responds, which is at the end of the transaction.
Does django.db.models.Model.save() really block GIL-wise and release other python threads during the transaction?
Django's save() does nothing special to the GIL. In fact, there is hardly anything you can do with the GIL in Python code -- when it is executed, the thread must hold the GIL.
There are only two ways the GIL could get released in save():
Python decides to switch threads (after sys.getcheckinterval() instructions)
Django calls a database interface routine that is implemented to release the GIL
The second point could be what you are looking for -- a SQL COMMIT is executed and during that execution, the SQL backend releases the GIL. However, this depends on the SQL interface, and I'm not sure if the popular ones actually release the GIL*.
Moreover, save() does a lot more than just running a few UPDATE/INSERT statements and a COMMIT; it does a lot of bookkeeping in Python, where it has to hold the GIL. In summary, I'm not sure that you will gain anything from moving save() to a different thread.
UPDATE: From looking at the sources, I learned that both the sqlite module and psycopg do release the GIL when they are calling database routines, and I guess that other interfaces do the same.
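The effect of a blocking call that releases the GIL can be seen without any database. In this small sketch time.sleep is an assumed stand-in for a driver call waiting on the database (like such calls, it releases the GIL while it waits); fake_save and timed_parallel are hypothetical names.

```python
import threading
import time


def fake_save(delay):
    # Stand-in for a database round trip; the GIL is released while sleeping.
    time.sleep(delay)


def timed_parallel(n_threads=4, delay=0.3):
    threads = [threading.Thread(target=fake_save, args=(delay,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


elapsed = timed_parallel()
print("4 blocking calls of 0.3 s took %.2f s in total" % elapsed)
```

If the waits were serialized, four calls would take about 1.2 seconds; because the waiting threads do not hold the GIL, the total stays close to a single delay. The Python-side bookkeeping in save() gets no such benefit.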
Generally you should never have to worry about threads in a Django application. If you're serving your application with Apache, gunicorn or nearly any other server other than the development server, the server will spawn multiple processes and evade the GIL entirely. The exception is if you're using gunicorn with gevent, in which case there will be multiple processes but also microthreads inside those processes -- in that case concurrency helps a bit, but you don't have to manage the threads yourself to take advantage of that. The only case where you need to worry about the GIL is if you're trying to spawn multiple threads to handle a single request, which is not usually a good idea.
The Django save() method does not release the GIL itself, but the database backend will (in most cases the bulk of the time spent in save() will be doing database I/O). However, it's almost impossible to properly take advantage of this in a well-designed web application. Responses from your view should be fast even when done synchronously -- if they are doing too much work to be fast, then use a delayed job with Celery or another taskmaster to finish up the extra work. If you try to thread in your view, you'll have to finish up that thread before sending a response to the client, which in most cases won't help anything and will just add extra overhead.
I think Python doesn't lock anything by itself, but the database does.
Hi, I have a multi-threaded program in which all threads operate on an Oracle DB. Can SQLAlchemy support parallel operations on Oracle?
Thanks!
OCI (oracle client interface) has a parameter OCI_THREADED which has the effect of connections being mutexed, such that concurrent access via multiple threads is safe. This is likely the setting the document you saw was referring to.
cx_oracle, which is essentially a Python->OCI bridge, provides access to this setting in its connection function using the keyword argument "threaded", described at http://cx-oracle.sourceforge.net/html/module.html#cx_Oracle.connect . The docs state that it is False by default due to its resulting in a "10-15% performance penalty", though no source is given for this information (and performance stats should always be viewed suspiciously as a rule).
As far as SQLAlchemy, the cx_oracle dialect provided with SQLAlchemy sets this value to True by default, with the option to set it back to False when setting up the engine via create_engine() - so at that level there's no issue.
But beyond that, SQLAlchemy's recommended usage patterns (i.e. one Session per thread, keeping connections local to a pool where they are checked out by a function as needed) prevent concurrent access to a connection in any case. So you can likely turn off the "threaded" setting on create_engine() and enjoy the possibly-tangible performance increases provided regular usage patterns are followed.
As long as each concurrent thread has its own session you should be fine. Trying to use one shared session is where you'll get into trouble.