I am trying to read an sqlite3 db into memory for further processing, using Python like so:
import sqlite3

src_con = sqlite3.connect('filename.db')  # the db dumped by the external process
con = sqlite3.connect(':memory:')
print('Reading sqlite file into memory... ', end='', flush=True)
src_con.backup(con)  # <---- This line seems to hang, SOMETIMES
print('Completed')
src_con.close()
cur = con.cursor()
While the code works fine most of the time, occasionally I observe a hang on the line src_con.backup(con).
This db is dumped by a process that I don't own, onto a shared network disk.
Here are my observations based on the advice I have found elsewhere on the internet:
fuser filename.db does not show any processes from my user account
sqlite3 filename.db "pragma integrity_check;" returns Error: database is locked
the md5sums of filename.db (the db that hangs) and of its copy filename2.db (which doesn't hang) are identical. So is the OS locking the db, given that the lock info is not in the DB file itself?
This locked DB appears to occur when the process creating it did not exit cleanly.
Copying the db out (into, say, filename2.db) and reading that copy is a workaround, but I want to understand what's going on and why, and whether there is a way to read the locked DB in read-only mode.
The stance of the SQLite developers about using a database stored in a networked file system is, essentially, "Don't. Use a client-server database instead."
This simple, "remote database" approach is usually not the best way to use a single SQLite database from multiple systems, (even if it appears to "work"), as it often leads to various kinds of trouble and grief. Because these problems are inevitable with some usages, but not frequent or repeatable, it behooves application developers to not rely on early testing success to decide that their remote database use will work as desired.
Sounds like you're running into one of those problems. Given that
"This locked DB appears to occur when the process creating it did not exit cleanly."
what I suspect is happening is that the networked file system you're using isn't detecting that the creating process's lock on the database is gone (because that process no longer exists), so the backup is just waiting for the lock to be released so that it can acquire it.
Even if you figure out how to broadcast that the lock is available from the creating process's computer to others with that file mounted, there's bound to be other subtle and not-so-subtle problems that pop up now and then. So your best bet is to follow the official advice:
If your data lives on a different machine from your application, then you should consider a client/server database.
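That said, if you only need to read the file and are confident it is no longer being written, one possible workaround (my own suggestion here, not something the SQLite docs recommend for network shares) is the URI filename syntax: mode=ro opens the database read-only, and immutable=1 tells SQLite the file will not change, so it skips locking entirely. A minimal sketch:
import sqlite3

# Sketch: open the networked db without taking any locks. immutable=1 is
# only safe if the file really is no longer being written to.
src_con = sqlite3.connect('file:filename.db?mode=ro&immutable=1', uri=True)
con = sqlite3.connect(':memory:')
src_con.backup(con)   # copy into memory as before
src_con.close()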
You have to close the database before you back up or restore. By the way, do you have the right permissions on the source (src)? This related link might help:
How to back up a SQLite db?
I am using TurboGears2 with a MySQL db. With the same code, the single-threaded case can update/write to the tables. But the multi-threaded case reports no error; however, no write is successful.
Outside TurboGears2, multiple threads can write to the tables with no problems.
No errors or complaints with multiple threads under TG2; just no successful writes to the table.
I will be very grateful if anyone using tg2 can advise.
With default configuration settings, in a regular request/response cycle, TurboGears2 enables a transaction manager to automatically commit changes to the database when a controller has finished processing a request.
This is introduced in the Wiki in 20 Minutes tutorial:
[...] you would usually need to flush the SQLAlchemy Unit of Work and commit the currently running transaction, those are operations that TurboGears2 transaction management will automatically do for us.
You don’t have to do anything to use this transaction management system, it should just work.
However, for everything that is outside a regular request/response cycle, for example a stream, or a different thread like a scheduler, manually flushing the session and committing the transaction is required. This is performed with DBSession.flush() and transaction.commit().
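A minimal sketch of that, assuming the usual DBSession exported by a quickstarted project's model package (the module and model names here are hypothetical):
import transaction
from myapp.model import DBSession, Record  # hypothetical project names

def scheduled_task():
    # runs in a scheduler thread, outside any request/response cycle
    DBSession.add(Record(value='example'))
    DBSession.flush()      # push pending SQL to the database
    transaction.commit()   # commit manually; no transaction manager here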
I recently switched some of the background functions of a piece of software I am working on from the Python threading module to the multiprocessing module, in order to take advantage of more CPU cores. Most of the transfer went without a hitch, but the database integration has been giving me significant issues.
Originally, I was using a single SimpleConnectionPool object from Python psycopg2 which sat as a global variable in a module called db that also handles some boilerplate database operations. From my understanding, creating a second Python process merely copies the current memory stack to the new process location. Because this causes issues with the database connections, I added a function called init() to my db module which simply re-initializes the SimpleConnectionPool and sets it to the global variable. My thinking was that if I called this init function from within a second process, it would create a new set of connections for only the pool on the secondary process. The main process would, therefore, maintain its own set of connections, separate from the second process.
However, using this method I was frequently getting the following exception:
OperationalError: SSL error: decryption failed or bad record mac
This originated directly from "state = conn.poll()" in psycopg2_patcher.py. I did a little digging, and from what I can tell the error is only thrown when both the main process and the secondary process attempt to execute a query at the same time. I was thinking of just reverting to one connection pool in the main process and using Queues to communicate queries from the secondary process to the main process for execution. That comes with a lot of headache, though, which I would rather avoid.
I also tried moving away from connection pools on the secondary process and used a single connection that is only established when a query needs to be executed, and then closed directly after. The same error occurred when the main process was trying to execute a query around the same time.
What do I need to do to the PGSQL server or my implementation to allow different processes to post queries simultaneously with the same credentials? I get the feeling I am going about the database connections between processes in a wholly unnecessary and convoluted way.
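What seems to matter most here is that each process creates its own connections after the fork, never inheriting them from the parent. A minimal sketch of that idea, with a per-process pool built in a multiprocessing initializer (connection parameters are placeholders):
import multiprocessing
import psycopg2.pool

pool = None  # per-process global, re-created inside each worker

def init_worker():
    # runs once in each worker process, after the fork, so these
    # connections are never shared with the parent process
    global pool
    pool = psycopg2.pool.SimpleConnectionPool(1, 4, "dbname=test user=me")

def run_query(sql):
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()
    finally:
        pool.putconn(conn)

if __name__ == '__main__':
    with multiprocessing.Pool(processes=2, initializer=init_worker) as p:
        print(p.map(run_query, ["SELECT 1", "SELECT 2"]))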
I have an elegant answer for this problem here, where you can just specify the connection count and all of the async connections are handled for you. It is a modified version of the ThreadedConnectionPool:
Python Postgres psycopg2 ThreadedConnectionPool exhausted
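For orientation, the stock ThreadedConnectionPool that the linked answer modifies is used roughly like this (DSN and pool sizes below are placeholders):
from psycopg2.pool import ThreadedConnectionPool

# Sketch of the unmodified pool; the linked answer wraps this class so that
# an exhausted pool blocks instead of raising. Placeholder DSN below.
pool = ThreadedConnectionPool(1, 8, "dbname=test user=me")

conn = pool.getconn()           # check a connection out for this thread
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
finally:
    pool.putconn(conn)          # always return it to the pool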
I need to test Django application behavior under concurrent requests, and to check whether the database data is correct after that. As a conclusion, I need to test the transaction mechanism too. So let's use TransactionTestCase for that.
I spawned requests to the database using threading and got a 'DatabaseError: no such table: app_modelname' exception in the threads, because Django automatically switches SQLite to an in-memory database when running tests.
Of course, I can specify 'TEST_NAME' for the 'default' key in settings.DATABASES, and the test will pass as expected. But this is then used for all the other tests too, so the test run takes much more time.
I thought about a custom database router, but it seems a very hackish way to do that, since I would need to patch/mock a lot of attributes in the models to determine whether the code is being executed in a TransactionTestCase or not.
There was also the idea of using override_settings, but unfortunately it does not work (see the issue for details).
How can I specify (or create) a non-in-memory database (on the hard drive) for just a few test cases (TransactionTestCase) and leave the others running with the in-memory database?
Any thoughts, ideas or code samples will be appreciated.
Thanks!
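(For reference, the TEST_NAME approach mentioned above looks roughly like this in settings.py; newer Django versions spell it TEST['NAME']. The paths below are illustrative.)
# settings.py sketch: force the SQLite test database onto disk instead of
# the default in-memory database.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'db.sqlite3',
        'TEST': {'NAME': '/tmp/test_db.sqlite3'},  # TEST_NAME in older Django
    }
}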
I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web and stores it on my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be with the following line:
conn = sqlite3.connect("mydatabase.db")
If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.
Previously I was storing all my results in CSV files, and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?
Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.
This can be enabled via optional keyword argument check_same_thread:
sqlite3.connect(":memory:", check_same_thread=False)
You can use the producer-consumer pattern. For example, you can create a queue that is shared between threads. The first thread, which fetches data from the web, enqueues this data in the shared queue. Another thread, which owns the database connection, dequeues data from the queue and passes it to the database.
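A minimal sketch of that pattern, with one writer thread owning the connection (the table name is illustrative):
import queue, sqlite3, threading

q = queue.Queue()

def writer():
    # the connection is created and used in this thread only
    conn = sqlite3.connect("mydatabase.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (data TEXT)")
    while True:
        item = q.get()
        if item is None:          # sentinel: shut down the writer
            break
        conn.execute("INSERT INTO results VALUES (?)", (item,))
        conn.commit()
    conn.close()

t = threading.Thread(target=writer)
t.start()
# any number of fetcher threads can safely call q.put(...)
q.put("fetched data")
q.put(None)   # signal the writer to finish
t.join()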
The following was found on mail.python.org.pipermail.1239789:
I have found the solution. I don't know why the Python documentation has not a single word about this option. So we have to add a new keyword argument to the connect function, and we will be able to create cursors out of it in different threads. So use:
sqlite3.connect(":memory:", check_same_thread=False)
This works out perfectly for me. Of course, from now on I need to take care of safe multithreaded access to the db. Anyway, thanks all for trying to help.
Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as the Python threading module's.
Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features, just to quote some of them:
SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
The Unit Of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maximum efficiency and transaction safety, and minimizes chances of deadlocks.
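As a hedged sketch of what the original once-per-second fetch-and-store might look like with SQLAlchemy (the table name and engine URL are placeholders):
from sqlalchemy import create_engine, text

# check_same_thread=False lets pooled connections move between threads;
# SQLAlchemy's connection pool then hands each thread its own connection.
engine = create_engine(
    "sqlite:///mydatabase.db",
    connect_args={"check_same_thread": False},
)

with engine.begin() as conn:   # commits automatically on exit
    conn.execute(text("CREATE TABLE IF NOT EXISTS results (data TEXT)"))

def store(item):               # safe to call from any thread
    with engine.begin() as conn:
        conn.execute(text("INSERT INTO results VALUES (:d)"), {"d": item})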
You shouldn't be using threads at all for this. This is a trivial task for Twisted, and it would likely take you significantly further anyway.
Use only one thread, and have the completion of the request trigger an event to do the write.
Twisted will take care of the scheduling, callbacks, etc. for you. It'll hand you the entire result as a string, or you can run it through a stream processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).
Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.
I have a very simple application that does something close to what you're wanting on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.
Or, if you are lazy like me, you can use SQLAlchemy. It will handle the threading for you (using thread-local storage and some connection pooling), and the way it does it is even configurable.
For an added bonus, if/when you realise/decide that using SQLite for any concurrent application is going to be a disaster, you won't have to change your code to use MySQL, Postgres, or anything else. You can just switch over.
You need to call session.close() after every transaction to the database, so that the same cursor is only ever used within the same thread; using the same cursor across multiple threads is what causes this error.
Use threading.Lock()
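A minimal sketch of that idea, assuming a single shared connection guarded by one lock (the table name is illustrative):
import sqlite3, threading

db_lock = threading.Lock()
# check_same_thread=False because the connection is shared across threads;
# the lock ensures only one thread uses it at a time
conn = sqlite3.connect("mydatabase.db", check_same_thread=False)

def insert(value):
    with db_lock:
        conn.execute("INSERT INTO results VALUES (?)", (value,))
        conn.commit()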
I could not find any benchmarks in any of the above answers so I wrote a test to benchmark everything.
I tried 3 approaches
Reading and writing sequentially from the SQLite database
Using a ThreadPoolExecutor to read/write
Using a ProcessPoolExecutor to read/write
The results and takeaways from the benchmark are as follows
Sequential reads/sequential writes work the best
If you must process in parallel, use the ProcessPoolExecutor to read in parallel
Do not perform any writes with either the ThreadPoolExecutor or the ProcessPoolExecutor, as you will run into database-locked errors and will have to retry inserting the chunk.
You can find the code and the complete solution for the benchmarks in my SO answer HERE. Hope that helps!
Scrapy seems like a potential answer to my question. Its home page describes my exact task. (Though I'm not sure how stable the code is yet.)
I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net. It handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy, one can easily set up the class Farm of many databases to diffuse the load over stochastic time.
Hope this helps your project... it should be simple enough to implement in 10 minutes.
I like Evgeny's answer - Queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:
Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a No-No, due to performance overhead.
Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable as fetch and store operations could take >1sec, and you lose the benefit of multiplexed resources you have with a multi-threaded approach.
You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them, see the FAQ (also the following question).
Please consider checking the value of THREADSAFE for the pragma_compile_options of your SQLite installation. For instance, with
SELECT * FROM pragma_compile_options;
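A hedged way to run the same check from Python (the pragma table-valued function used here needs a reasonably recent SQLite build):
import sqlite3

# Sketch: print the THREADSAFE compile-time option of the linked SQLite.
con = sqlite3.connect(":memory:")
for (opt,) in con.execute("SELECT * FROM pragma_compile_options"):
    if opt.startswith("THREADSAFE"):
        print(opt)   # e.g. THREADSAFE=1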
If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you have to do to avoid the threading exception is to create the Python connection with check_same_thread set to False. In your case, it means
conn = sqlite3.connect("mydatabase.db", check_same_thread=False)
That's explained in some detail in Python, SQLite, and thread safety
The most likely reason you get errors with locked databases is that you must issue
conn.commit()
after finishing a database operation. If you do not, your database will be write-locked and stay that way. The other threads that are waiting to write will time out after a while (the default timeout is 5 seconds; see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details).
An example of a correct and concurrent insertion would be this:
import threading, sqlite3

class InsertionThread(threading.Thread):

    def __init__(self, number):
        super(InsertionThread, self).__init__()
        self.number = number

    def run(self):
        conn = sqlite3.connect('yourdb.db', timeout=5)
        conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
        conn.commit()

        for i in range(1000):
            conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
            conn.commit()

# create as many of these as you wish
# but be careful to set the timeout value appropriately: thread switching in
# python takes some time
for i in range(2):
    t = InsertionThread(i)
    t.start()
If you like SQLite, or have other tools that work with SQLite databases, or want to replace CSV files with SQLite db files, or must do something rare like inter-platform IPC, then SQLite is a great tool and very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!