I am developing a high-throughput multiprocessing/clustered natural language preprocessing system, and I would like to send an UPDATE statement to my database (this is a very large update request, it is updating 100s of rows with data), and move on to the next set of pre-processing without waiting for the UPDATE to complete. The database will handle the update, but I want to move on to the next set of NLP work.
I think it might be possible to do it with some sort of asynchronous request, but one which isn't blocking.
How can I do this? I am using psycopg2 to talk to my database.
You specified in the comments that you aren't relying on the results of the UPDATE. With that in mind, I think the most efficient but straightforward solution is to use the COPY FROM helper in conjunction with something like a CSV writer to bulk insert the data into a temporary table. From this temporary table you can run the update after a certain number of rows have been written to it. Temporary tables are only visible per connection, so if you are using a connection pool the temporary table will be removed when the connection is closed.
The downside of this approach is that you will lose some data if the application crashes before the bulk update operation completes. If you aren't waiting for each UPDATE to succeed before moving on to the next operation, that can happen regardless. The upside is that it will dramatically improve speed and avoid excessive network round trips, without the complexity of trying to use psycopg2 asynchronously.
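A minimal sketch of the staging-table approach described above. It assumes `conn` is an open psycopg2 connection; the table and column names (`documents`, `staging`, `id`, `tokens`) are hypothetical and not from the question.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize (id, tokens) tuples into an in-memory CSV file for COPY."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf


def bulk_update(conn, rows):
    """Stage rows in a temporary table, then apply them with one UPDATE.

    `conn` is assumed to be a psycopg2 connection; `documents` and
    `staging` are hypothetical table names used only for illustration.
    """
    with conn.cursor() as cur:
        cur.execute(
            "CREATE TEMP TABLE IF NOT EXISTS staging "
            "(id integer, tokens text) ON COMMIT DELETE ROWS"
        )
        # One round trip for all rows instead of one statement per row.
        cur.copy_expert(
            "COPY staging (id, tokens) FROM STDIN WITH (FORMAT csv)",
            rows_to_csv_buffer(rows),
        )
        cur.execute(
            "UPDATE documents AS d SET tokens = s.tokens "
            "FROM staging AS s WHERE d.id = s.id"
        )
    conn.commit()
```

Note that `bulk_update` still blocks until the UPDATE finishes; the gain comes from sending the batch in a single COPY rather than many separate statements.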
I'd like to be able to tell someone (django, pgBouncer, or whoever can provide me with this service) to always hand me the same connection to the database (PostgreSQL in this case) on a per client/session basis, instead of getting a random one each time (or creating a new one for that matter).
To my knowledge:
Django's CONN_MAX_AGE can control the lifetime of connections, so far so good. This will also have a positive impact on performance (no connection setup penalties).
Some pooling package (pgBouncer for example) can hold the connections and hand them to me as I need them. We're almost there.
The only bit I'm missing is the possibility to ask pgBouncer (or any other similar tool for that matter) to give me a specific db connection, instead of "one from the pool". This is important because I want to have control over the lifetime of the transaction. I want to be able to open a transaction, then send a series of commands, and then manually commit all the work, or roll everything back should something fail.
Many years ago, I implemented something very similar to what I'm looking for now. It was a simple connection pool written in C which would hold as many connections to Oracle as clients needed on one hand, while on the other it would give these clients the chance to recover these exact connections based on some ID, for example a PHP session ID. That way users could acquire a lock on some database object/row, and the lock would persist even after the Apache process died. From that point on the session owner was in total control of that row until they decided it was time to commit, or until the backend decided to let the transaction go because of idleness.
I have a MySQLdb installation for Python 2.7.6. I have created a MySQLdb cursor once and would like to reuse the cursor for every incoming request. If 100 users are simultaneously active and doing a db query, does the cursor serve each request one by one and block others?
If that is the case, is there a way to avoid that? Will a connection pool do the job in a thread-safe manner, or should I look at Gevent/monkey patching?
Your responses are welcome.
You will want to use a connection pool.
The MySQL driver in Python is not thread-safe, meaning multiple requests/threads cannot use the same connection at the same time. See more here:
Here is a link on how to implement a connection-pool:
It essentially works by keeping a number of connections (a pool) ready, and gives one out to each thread. When the thread is done, it returns the connection to the pool and another request/thread can use it.
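The mechanism described above can be sketched in a few lines. This is a minimal illustration, not a production pool: the `ConnectionPool` class is hypothetical, and sqlite3 stands in for MySQLdb so the example runs without a server.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Keeps a fixed number of open connections and lends them out one at a time."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks while every connection is in use by another thread.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Hand the connection back so the next thread can use it.
            self._idle.put(conn)


# Assumption: sqlite3 replaces the real MySQL driver for illustration only.
pool = ConnectionPool(
    lambda: sqlite3.connect(":memory:", check_same_thread=False), size=2
)

with pool.connection() as conn:
    answer = conn.execute("SELECT 1 + 1").fetchone()[0]
```

Each request handler would wrap its queries in `with pool.connection() as conn:`, so no two threads ever hold the same connection at once.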
For this purpose you can use a Persistent Connection or a Connection Pool.
Persistent Connection - very very very bad idea. Don't use it! Just don't! Especially when you are talking about web programming.
Connection Pool - Better than a Persistent Connection, but without a deep understanding of how it works, you will end up with the same problems as with a Persistent Connection.
Don't optimize unless you really have performance problems. On the web, it's common to open/close a connection per page request. It works really fast. You'd do better to think about optimizing SQL queries, indexes, and caches.
I've got a lot of scripts running: scrapers, checkers, cleaners, etc. They have some things in common:
they are forever running;
they have no time constraint to finish their job;
they all access the same MySQL DB, writing and reading.
As they accumulate, they are starting to slow down the website, which runs on the same system and depends on these scripts.
I can use queues with Kombu to serialize all the writes.
But do you know a way to do the same with reads?
E.g.: if one script needs to read from the DB, its request is sent to a blocking queue, and it resumes when it gets the answer? This way everybody makes requests to one process, and that process is the only one talking to the DB, making one request at a time.
I have no idea how to do this.
Of course, in the end I may have to add more servers to the mix, but before that, is there something I can do at the software level ?
You could use a connection pooler and make the connections from the scripts go through it. It would limit the number of real connections hitting your DB while being transparent to your scripts (their connections would be held in a "wait" state until a real connection is freed).
I don't know what DB you use, but for Postgres I'm using PgBouncer for similar reasons, see http://pgfoundry.org/projects/pgbouncer/
You say that your dataset is <1GB, the problem is CPU bound.
Now start analyzing what is eating CPU cycles:
Which queries are really slow and executed often. MySQL can log those queries.
What about the slow queries? Can they be accelerated by using an index?
Are there unused indices? Drop them!
Nothing helps? Can you solve it by denormalizing/precomputing stuff?
You could create a function that each process must call in order to talk to the DB. You could re-write the scripts so that they must call that function rather than talk directly to the DB. Within that function, you could have a scope-based lock so that only one process would be talking to the DB at a time.
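One way to sketch that function, assuming the scripts are independent processes on the same Unix machine: an advisory file lock (`fcntl.flock`) serializes access across processes. The lock path and the names `exclusive_db_access` and `run_query` are hypothetical, and sqlite3 stands in for the real MySQL driver so the example is runnable.

```python
import fcntl
import os
import sqlite3
import tempfile
from contextlib import contextmanager

# Hypothetical lock file that every script agrees to use.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "db_access.lock")


@contextmanager
def exclusive_db_access(lock_path=LOCK_PATH):
    """Block until this process holds the lock; release it on exit."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # waits if another process holds it
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def run_query(connect, sql, params=()):
    """The single entry point every script calls instead of talking to the DB."""
    with exclusive_db_access():
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            rows = cur.fetchall()
            conn.commit()
            return rows
        finally:
            conn.close()


# Assumption: sqlite3 replaces the real MySQL driver for illustration only.
rows = run_query(lambda: sqlite3.connect(":memory:"), "SELECT 41 + 1")
```

Because the lock is scoped to the `with` block, it is released even if the query raises, so one failing script cannot starve the others. `fcntl` is only available on Unix-like systems.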
I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web, and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be about the following line:
conn = sqlite3.connect("mydatabase.db")
If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.
Previously I was storing all my results in CSV files, and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?
Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.
This can be enabled via optional keyword argument check_same_thread:
sqlite3.connect(":memory:", check_same_thread=False)
You can use the producer-consumer pattern. For example, you can create a queue that is shared between threads. The threads that fetch data from the web enqueue this data in the shared queue. Another thread, which owns the database connection, dequeues data from the queue and passes it to the database.
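A minimal sketch of that pattern, under the assumption that the web fetch can be replaced by a stand-in for illustration. The names `writer`, `fetcher`, `run` and the `results` table are hypothetical.

```python
import os
import queue
import sqlite3
import tempfile
import threading

STOP = object()  # sentinel telling the writer thread to finish


def writer(db_path, jobs):
    """Sole owner of the connection: drains the queue and writes each item."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, body TEXT)")
    while True:
        item = jobs.get()
        if item is STOP:
            break
        conn.execute("INSERT INTO results VALUES (?, ?)", item)
        conn.commit()
    conn.close()


def fetcher(url, jobs):
    """Stand-in for the real web fetch; it only enqueues, never touches the DB."""
    jobs.put((url, "payload for " + url))


def run(db_path, urls):
    jobs = queue.Queue()
    writer_thread = threading.Thread(target=writer, args=(db_path, jobs))
    writer_thread.start()
    fetchers = [threading.Thread(target=fetcher, args=(u, jobs)) for u in urls]
    for t in fetchers:
        t.start()
    for t in fetchers:
        t.join()
    jobs.put(STOP)  # all producers are done, so let the writer exit
    writer_thread.join()


db_path = os.path.join(tempfile.mkdtemp(), "mydatabase.db")
run(db_path, ["http://example.com/a", "http://example.com/b"])
```

Since the connection is created and used inside a single thread, neither the "database is locked" error nor the same-thread ProgrammingError from the question can occur.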
The following was found on mail.python.org.pipermail.1239789
I have found the solution. I don't know why python documentation has not a single word about this option. So we have to add a new keyword argument to connection function
and we will be able to create cursors out of it in different thread. So use:
sqlite.connect(":memory:", check_same_thread = False)
works out perfectly for me. Of course from now on I need to take care
of safe multithreading access to the db. Anyway thx all for trying to help.
Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as using python threading module.
Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features, just to quote some of them:
SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
The Unit Of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maximum efficiency and transaction safety, and minimizes chances of deadlocks.
You shouldn't be using threads at all for this. This is a trivial task for twisted and that would likely take you significantly further anyway.
Use only one thread, and have the completion of the request trigger an event to do the write.
twisted will take care of the scheduling, callbacks, etc... for you. It'll hand you the entire result as a string, or you can run it through a stream-processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).
Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.
I have a very simple application that does something close to what you're wanting on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.
Or if you are lazy, like me, you can use SQLAlchemy. It will handle the threading for you, (using thread local, and some connection pooling) and the way it does it is even configurable.
For added bonus, if/when you realise/decide that using Sqlite for any concurrent application is going to be a disaster, you won't have to change your code to use MySQL, or Postgres, or anything else. You can just switch over.
You need to call session.close() after every database transaction so that each cursor is used only within the thread that created it. Using the same cursor from multiple threads is what causes this error.
Use threading.Lock()
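A short sketch of what that could look like, assuming one connection opened with check_same_thread=False and shared by all threads. The `results` table and the `store` helper are hypothetical.

```python
import sqlite3
import threading

# One shared connection; check_same_thread=False lets other threads use it.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE results (thread_id INTEGER, value INTEGER)")
db_lock = threading.Lock()


def store(thread_id, value):
    # Only one thread at a time may touch the shared connection.
    with db_lock:
        conn.execute("INSERT INTO results VALUES (?, ?)", (thread_id, value))
        conn.commit()


threads = [threading.Thread(target=store, args=(i, i * 10)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock takes over the job of the same-thread check that was switched off: it guarantees that statements from different threads never interleave on the shared connection.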
I could not find any benchmarks in any of the above answers so I wrote a test to benchmark everything.
I tried 3 approaches
Reading and writing sequentially from the SQLite database
Using a ThreadPoolExecutor to read/write
Using a ProcessPoolExecutor to read/write
The results and takeaways from the benchmark are as follows
Sequential reads/sequential writes work the best
If you must process in parallel, use the ProcessPoolExecutor to read in parallel
Do not perform any writes either using the ThreadPoolExecutor or using the ProcessPoolExecutor as you will run into database locked errors and you will have to retry inserting the chunk again
You can find the code and complete solution for the benchmarks in my SO answer. Hope that helps!
Scrapy seems like a potential answer to my question. Its home page describes my exact task. (Though I'm not sure how stable the code is yet.)
I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net
which handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy one can easily set up the class Farm of many databases to diffuse the load over stochastic time.
Hope this helps your project... it should be simple enough to implement in 10 minutes.
I like Evgeny's answer - Queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:
Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a No-No, due to performance overhead.
Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable as fetch and store operations could take >1sec, and you lose the benefit of multiplexed resources you have with a multi-threaded approach.
You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them, see the FAQ (also the following question).
Please consider checking the value of THREADSAFE for the pragma_compile_options of your SQLite installation. For instance, with
SELECT * FROM pragma_compile_options;
If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you gotta do to avoid the threading exception is to create the Python connection with check_same_thread equal to False. In your case, it means
conn = sqlite3.connect("mydatabase.db", check_same_thread=False)
That's explained in some detail in Python, SQLite, and thread safety
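The check can also be done from Python. This is a small sketch; the helper name `sqlite_threadsafe_mode` is hypothetical, and it relies only on the `compile_options` pragma mentioned above.

```python
import sqlite3


def sqlite_threadsafe_mode():
    """Return the THREADSAFE compile option as an int, or None if not reported."""
    conn = sqlite3.connect(":memory:")
    try:
        options = [row[0] for row in conn.execute("PRAGMA compile_options")]
    finally:
        conn.close()
    for option in options:
        if option.startswith("THREADSAFE="):
            return int(option.split("=", 1)[1])
    return None


mode = sqlite_threadsafe_mode()
# Only share a connection across threads when the library reports THREADSAFE=1.
share_across_threads = (mode == 1)
```

When `share_across_threads` is true, passing check_same_thread=False to sqlite3.connect should be safe as far as the SQLite library is concerned; coordinating writes between threads is still up to the application.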
The most likely reason you get errors with locked databases is that you must issue
conn.commit()
after finishing a database operation. If you do not, your database will be write-locked and stay that way. The other threads that are waiting to write will time-out after a time (default is set to 5 seconds, see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details on that).
An example of a correct and concurrent insertion would be this:
import threading, sqlite3

class InsertionThread(threading.Thread):

    def __init__(self, number):
        super(InsertionThread, self).__init__()
        self.number = number

    def run(self):
        conn = sqlite3.connect('yourdb.db', timeout=5)
        conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
        conn.commit()

        for i in range(1000):
            conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
            conn.commit()

# create as many of these as you wish
# but be careful to set the timeout value appropriately: thread switching in
# python takes some time
for i in range(2):
    t = InsertionThread(i)
    t.start()
If you like SQLite, or have other tools that work with SQLite databases, or want to replace CSV files with SQLite db files, or must do something rare like inter-platform IPC, then SQLite is a great tool and very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!