I am using TurboGears2 with a MySQL database. With the same code, the single-threaded case can update/write to the tables. But the multi-threaded case raises no error, and yet no write succeeds.
Outside TurboGears2, multiple threads can write to the tables with no problems.
There are no errors or complaints with multiple threads under TG2; just no successful writes to the table.
I would be very grateful if anyone using TG2 could advise.
With the default configuration settings, in a regular request/response cycle, TurboGears2 enables a transaction manager that automatically commits changes to the database when a controller has finished processing a request.
This is introduced in the Wiki in 20 Minutes tutorial:
[...] you would usually need to flush the SQLAlchemy Unit of Work and commit the currently running transaction, those are operations that TurboGears2 transaction management will automatically do for us.
You don’t have to do anything to use this transaction management system, it should just work.
However, for everything outside a regular request/response cycle (for example a stream, or a different thread such as a scheduler), manually flushing the session and committing the transaction is required. This is performed with DBSession.flush() and transaction.commit().
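For example, a scheduler job might do the following (a minimal sketch; the myapp package and the Page model are assumptions based on the standard quickstart layout):

    import transaction
    from myapp.model import DBSession, Page  # hypothetical quickstart-style model

    def scheduler_job(pagename, new_data):
        # Outside a request, TG2's transaction manager never fires,
        # so flush and commit explicitly.
        page = DBSession.query(Page).filter_by(pagename=pagename).first()
        page.data = new_data
        DBSession.flush()
        transaction.commit()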
TL;DR:
How do I prevent DB access issues when calling a PostgreSQL database from multiple threads in Python using SQLAlchemy?
The details:
I am developing a Python application that uses multithreading (a concurrent.futures ThreadPoolExecutor), but I am by no means an expert in anything.
I also use SQLAlchemy to communicate with a PostgreSQL database (using pg8000).
Because I wanted to keep all my database code separate from everything else, all the SQLAlchemy code sits in a Python module that I called db_manager.py. In there you will find the declarative base and the create_engine() call, but also loads of methods to get stuff from or store stuff in the database. Each method here ends with:
session.commit()
(unless I just query the database).
Each thread then would call the db_manager module to interact with the database, e.g.:
db_manager.getSomethingFromDB(...)
I created a little drawing to illustrate the architecture.
The problem:
Now the problem I run into is that these database calls seem to clash sometimes.
What is the best way of dealing with multithreading, SQLAlchemy, and PostgreSQL?
Some ideas:
Currently, my db_manager accesses PostgreSQL as a specific user (pg8000 appears to require this). Is that a problem? Should each thread be its own user? Or can this not be causing problems? If each thread needs to be its own database user, I would probably no longer be able to have all my database stuff in one single module.
I failed to define rollbacks for each commit. I just noticed this is causing problems, since any error prevents any further database access. Something like the sketch below is probably what I should be doing.
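That would be the session_scope recipe from the SQLAlchemy docs, roughly (a sketch; it assumes a sessionmaker called Session already exists in db_manager):

    from contextlib import contextmanager

    # Session = sessionmaker(bind=engine) is assumed to exist in db_manager.

    @contextmanager
    def session_scope():
        # Commit on success, roll back on error, always close, so one
        # failed call cannot poison later database access.
        session = Session()
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()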
I am developing a web app using SQLAlchemy's expression language, not its orm. I want to use multiple threads in my app, but I'm not sure about thread safety. I am using this section of the documentation to make a connection. I think this is thread safe because I reference a specific connection in each request. Is this thread safe?
The docs for connections and sessions say that neither is thread safe or intended to be shared between threads.
The Connection object is not thread-safe. While a Connection can be shared among threads using properly synchronized access, it is still possible that the underlying DBAPI connection may not support shared access between threads. Check the DBAPI documentation for details.
The Session is very much intended to be used in a non-concurrent fashion, which usually means in only one thread at a time.
The Session should be used in such a way that one instance exists for a single series of operations within a single transaction.
The bigger point is that you should not want to use the session with multiple concurrent threads.
There is no guarantee when using the same connection (and transaction context) in more than one thread that the behavior will be correct or consistent.
You should use one connection or session per thread. If you need guarantees about the data, you should set the isolation level for the engine or session. For web applications, SQLAlchemy suggests using one connection per request cycle.
This simple correspondence of web request and thread means that to associate a Session with a thread implies it is also associated with the web request running within that thread, and vice versa, provided that the Session is created only after the web request begins and torn down just before the web request ends.
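In practice, one common way to get a session per thread with SQLAlchemy's ORM is scoped_session, which keeps a thread-local registry of sessions (a minimal sketch; the connection URL is a placeholder):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import scoped_session, sessionmaker

    engine = create_engine("postgresql+pg8000://user:pw@localhost/mydb")
    Session = scoped_session(sessionmaker(bind=engine))

    def worker():
        session = Session()   # returns this thread's own session
        try:
            # ... read and write through `session` here ...
            session.commit()
        finally:
            Session.remove()  # close and discard the thread-local session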
I think you are confusing atomicity with isolation.
Atomicity is usually handled through transactions and concerns integrity.
Isolation is about concurrent read/write to a database table (thus thread safety). For example: if you want to increment an int field of a table's record, you will have to select the record's field, increment the value and update it. If multiple threads are doing this concurrently the result will depend on the order of the reads/writes.
http://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=isolation#engine-creation-api
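The isolation level can be set when creating the engine (a sketch; the URL is a placeholder):

    from sqlalchemy import create_engine

    # SERIALIZABLE makes the read-increment-write sequence above safe,
    # at the cost of transactions failing and needing a retry on conflict.
    engine = create_engine(
        "postgresql+pg8000://user:pw@localhost/mydb",
        isolation_level="SERIALIZABLE",
    )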
I'm planning to use SQLite and Peewee (ORM) for a light duty internal web service (<20 requests per second). The web service can handle multiple simultaneous requests on multiple threads. During each request the database will be both read from and written to. This means I will need to have the ability for both concurrent reads AND writes. It doesn't matter to this application if the data changes between reads and writes.
The SQLite FAQ says that concurrent reads are permitted but concurrent writes from multiple threads require acquiring the file lock. My question is: Does Peewee take care of this locking for me or is there something I need to do in my code to make this possible?
The Peewee database object is shared between threads. I assume this means that the database connection is shared too.
I can't find a Peewee specific answer to this so I'm asking here.
SQLite is the one doing the locking, although I can see how you might be confused -- the FAQ wording is a bit ambiguous:
When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business. Other embedded SQL database engines typically only allow a single process to connect to the database at once.
So if you have two threads, each with their own connection, and one acquires the write lock, the other thread will have to wait for the lock to be released before it can start writing.
Looking at pysqlite, the default busy timeout looks to be 5 seconds, so the second thread should wait up to 5 seconds before raising an OperationalError.
Also, I'd suggest instantiating your SqliteDatabase with threadlocals=True. That will store a connection-per-thread.
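That looks like this (a sketch; note that threadlocals=True applies to older Peewee releases, while newer versions store connection state per-thread by default):

    from peewee import SqliteDatabase

    # One connection per thread instead of a single shared connection.
    db = SqliteDatabase('app.db', threadlocals=True)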
Consider running all writing operations within one async process. This is the idea that made JavaScript server programming so famous nowadays (although the idea has been known far longer). It only requires that you are a bit familiar with the asynchronous programming concept of callbacks:
For SQLite:
The async concept directly in SQLite: https://www.sqlite.org/asyncvfs.html
APSW (Another SQLite Wrapper), which better supports SQLite extensions in Peewee: http://peewee.readthedocs.org/en/latest/peewee/playhouse.html#apsw
For ANY DB:
Consider writing your own thin async handler in Python, as solved, for example, in SQLAlchemy + Requests Asynchronous Pattern.
I would recommend the last approach, as it gives you more code portability, control, independence from the backend database engine, and scalability.
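A thin handler along those lines can be as simple as one writer thread draining a queue, so that only one connection ever writes (a sketch; the model in the usage comment is hypothetical):

    import queue
    import threading

    write_q = queue.Queue()

    def writer_loop():
        # The only thread that ever touches the database for writes.
        while True:
            job = write_q.get()
            if job is None:       # sentinel to shut the writer down
                break
            job()                 # a closure performing one write
            write_q.task_done()

    threading.Thread(target=writer_loop, daemon=True).start()

    # Any thread can enqueue a write instead of contending for the file lock:
    # write_q.put(lambda: MyModel.create(field="value"))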
My django app saves django models to a remote database. Sometimes the saves are bursty. In order to free the main thread (*thread_A*) of the application from the time toll of saving multiple objects to the database, I thought of transferring the model objects to a separate thread (*thread_B*) using collections.deque and have *thread_B* save them sequentially.
Yet I'm unsure regarding this scheme. save() returns the id of the new database entry, so it "ends" only after the database responds, which is at the end of the transaction.
Does django.db.models.Model.save() really block GIL-wise and release other python threads during the transaction?
Django's save() does nothing special to the GIL. In fact, there is hardly anything you can do with the GIL in Python code -- when it is executed, the thread must hold the GIL.
There are only two ways the GIL could get released in save():
Python decides to switch threads (after sys.getcheckinterval() instructions)
Django calls a database interface routine that is implemented to release the GIL
The second point could be what you are looking for: a SQL COMMIT is executed, and during that execution the SQL backend releases the GIL. However, this depends on the SQL interface, and I'm not sure whether the popular ones actually release the GIL.
Moreover, save() does a lot more than just running a few UPDATE/INSERT statements and a COMMIT; it does a lot of bookkeeping in Python, where it has to hold the GIL. In summary, I'm not sure that you will gain anything from moving save() to a different thread.
UPDATE: From looking at the sources, I learned that both the sqlite module and psycopg do release the GIL when they are calling database routines, and I guess that other interfaces do the same.
Generally you should never have to worry about threads in a Django application. If you're serving your application with Apache, gunicorn or nearly any other server other than the development server, the server will spawn multiple processes and evade the GIL entirely. The exception is if you're using gunicorn with gevent, in which case there will be multiple processes but also microthreads inside those processes -- in that case concurrency helps a bit, but you don't have to manage the threads yourself to take advantage of that. The only case where you need to worry about the GIL is if you're trying to spawn multiple threads to handle a single request, which is not usually a good idea.
The Django save() method does not release the GIL itself, but the database backend will (in most cases the bulk of the time spent in save() will be doing database I/O). However, it's almost impossible to properly take advantage of this in a well-designed web application. Responses from your view should be fast even when done synchronously -- if they are doing too much work to be fast, then use a delayed job with Celery or another taskmaster to finish up the extra work. If you try to thread in your view, you'll have to finish up that thread before sending a response to the client, which in most cases won't help anything and will just add extra overhead.
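For reference, the Celery route mentioned above looks roughly like this (a sketch; the task body and the model import are assumptions):

    from celery import shared_task

    from myapp.models import Report  # hypothetical model

    @shared_task
    def save_reports(rows):
        # Runs in a worker process, outside the request/response cycle,
        # so the view can respond without waiting on the database.
        for row in rows:
            Report.objects.create(**row)

    # In the view: fire and forget, then return the response immediately.
    # save_reports.delay(rows)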
I think Python doesn't lock anything by itself, but the database does.
Yesterday I was working with some SQLAlchemy stuff that needed a "select ... for update" concept to avoid a race condition. Adding .with_lockmode('update') to the query works a treat on InnoDB and Postgres, but for sqlite I end up having to sneak in a
if session.bind.name == 'sqlite':
    session.execute('begin immediate transaction')
before doing the select.
This seems to work for now, but it feels like cheating. Is there a better way to do this?
SELECT ... FOR UPDATE OF ... is not supported. This is understandable considering the mechanics of SQLite, in that row locking is redundant as the entire database is locked when updating any bit of it. However, it would be good if a future version of SQLite supported it, for SQL interchangeability reasons if nothing else. The only functionality required is to ensure a "RESERVED" lock is placed on the database if not already there.
excerpt from
https://www2.sqlite.org/cvstrac/wiki?p=UnsupportedSql
[EDIT] Also see https://sqlite.org/isolation.html, thanks @michauwilliam.
I think you have to synchronize access to the whole database. Normal synchronization mechanisms should also apply here: file locks, process synchronization, etc.
I think a SELECT FOR UPDATE is relevant for SQLite. There is no way to lock the database BEFORE I start to write. By then it's too late. Here is the scenario:
I have two servers and one database queue table. Each server looks for work, and when it picks up a job it updates the queue table with an "I got it" marker so the other server doesn't also pick up the same work. I need to leave the record in the queue in case of recovery.
Server 1 reads the first unclaimed item and has it in memory. Server 2 reads the same record and now has it in memory too. Server 1 then locks the database, updates the record, and unlocks. Server 2 then locks the database, updates, and unlocks. The result is that both servers are now working on the same job: the table shows Server 2 has it, and Server 1's update is lost.
I solved this by creating a lock table in the database. Server 1 begins a transaction and writes to the lock table, which locks the database for writing. Server 2 now tries to begin a transaction and write to the lock table, but is prevented. Server 1 then reads the first queue record, updates it with its "I got it" code, deletes the record it just wrote to the lock table, commits, and releases the lock. Now Server 2 is able to begin its transaction, write to the lock table, read the second queue record, update it with its "I got it" code, delete its lock record, and commit; the database is then available for the next server looking for work.
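In raw SQL terms the sequence looks something like this (a sketch using Python's sqlite3 module; the table and column names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("queue.db", timeout=5.0)
    conn.isolation_level = None          # manage transactions manually

    def claim_next_job(server_id):
        cur = conn.cursor()
        cur.execute("BEGIN")
        # The first write acquires the RESERVED lock; a second server
        # blocks on this INSERT (up to the busy timeout) until commit.
        cur.execute("INSERT INTO lock_table (owner) VALUES (?)", (server_id,))
        cur.execute("SELECT id FROM job_queue WHERE claimed_by IS NULL "
                    "ORDER BY id LIMIT 1")
        row = cur.fetchone()
        if row:
            cur.execute("UPDATE job_queue SET claimed_by = ? WHERE id = ?",
                        (server_id, row[0]))
        cur.execute("DELETE FROM lock_table WHERE owner = ?", (server_id,))
        cur.execute("COMMIT")            # releases the lock
        return row[0] if row else None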