Concurrent writes with SQLite and Peewee - python

I'm planning to use SQLite and Peewee (ORM) for a light duty internal web service (<20 requests per second). The web service can handle multiple simultaneous requests on multiple threads. During each request the database will be both read from and written to. This means I will need to have the ability for both concurrent reads AND writes. It doesn't matter to this application if the data changes between reads and writes.
The SQLite FAQ says that concurrent reads are permitted but concurrent writes from multiple threads require acquiring the file lock. My question is: Does Peewee take care of this locking for me or is there something I need to do in my code to make this possible?
The Peewee database object is shared between threads. I assume this means that the database connection is shared too.
I can't find a Peewee specific answer to this so I'm asking here.

Sqlite is the one doing the locking, although I can see how you might be confused -- the FAQ wording is a bit ambiguous:
When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business. Other embedded SQL database engines typically only allow a single process to connect to the database at once.
So if you have two threads, each with their own connection, and one acquires the write lock, the other thread will have to wait for the lock to be released before it can start writing.
Looking at pysqlite, the default busy timeout looks to be 5 seconds, so the second thread should wait up to 5 seconds before raising an OperationalError.
Also, I'd suggest instantiating your SqliteDatabase with threadlocals=True. That will store a connection-per-thread.

Consider to run all writing operations within 1 async process. This made the Javascript server programming nowadays so famous (although this idea is know far longer). It just needs that you a bit familiar with asynchronous programming concept of callbacks:
For SQLITE:
Async concept directly in Sqlite: https://www.sqlite.org/asyncvfs.html
APSW (Another Sqlite Wrapper) which better supports SQlite extentions in Peewee: http://peewee.readthedocs.org/en/latest/peewee/playhouse.html#apsw
For ANY DB.
Consider to write your own thin async handler in python,
as solved here e.g.
SQLAlchemy + Requests Asynchronous Pattern
I would recommend you the last approach, as this allows you more code portability, control, independance from the backend database engine and scalability.

Related

How to make MySQL Database calls in Tornado for an application with highly scalable infrastructure which makes a high number of database queries?

So which will be a better method to make database calls in Tornado for a high performance application with highly scalable infrastructure which makes a high number of database queries?
Method 1 : So I have come across async database drivers/clients like TorMySQL, Tornado-Mysql, asynctorndb,etc which can perform async database calls.
Method 2 : Use the general and common MySQLdb or mysqlclient (by PyMySQL) drivers and make the database calls to separate backend services and load-balance the calls with nginx.
Similar to what the original Friendfeed guys did which they mentioned in one of the groups,
Bret Taylor, one of the original authors, writes :
groups.google.com/group/python-tornado/browse_thread/thread/9a08f5f08cdab108
We experimented with different async DB approaches, but settled on
synchronous at FriendFeed because generally if our DB queries were
backlogging our requests, our backends couldn't scale to the load
anyway. Things that were slow enough were abstracted to separate
backend services which we fetched asynchronously via the async HTTP
module.
Other supporting links for Method 2 for better understanding what I am trying to say are:
StackOverFlow Answer
Method 3 : By making use of threads and IOLoop and using the general sync libraries.
Supporting example links to explain what I mean,
What the Tornado wiki says, https://github.com/tornadoweb/tornado/wiki/Threading-and-concurrency
Do it synchronously and block the IOLoop. This is most appropriate for
things like memcache and database queries that are under your control
and should always be fast. If it's not fast, make it fast by adding
the appropriate indexes to the database, etc.
Another sample Method :
For sync database(mysqldb), we can
executor = ThreadPoolExecutor(4)
result = yield executor.submit(mysqldb_operation)
Method 4 : Just use sync drivers like MySQLdb and start enough Tornado Instances and load balance using nginx that the application remains asynchronous on a broader level with some calls being blocked but other requests benefit the asynchronous nature via the large number of tornado instances.
'Explanation' :
For details follow this link - www.jjinux.com/2009/12/python-asynchronous-networking-apis-and.html, which says :
They allow MySQL queries to block the entire process. However, they
compensate in two ways. They lean heavily on their asynchronous web
client wherever possible. They also make use of multiple Python
processes. Hence, if you're handling 500 simultaneous request, you
might use nginx to split them among 10 different Tornado Web
processes. Each process is handling 50 simultaneous requests. If one
of those requests needs to make a call to the database, only 50
(instead of 500) requests are blocked.
FriendFeed used what you call "method 4": there was no separate backend service; the process was just blocked during the database call.
Using a completely asynchronous driver ("method 1") is generally best. However, this means that you can't use synchronous libraries that wrap the database operations like SQLAlchemy. In this case you may have to use threads ("method 3"), which is almost as good (and easier than "method 2").
The advantage of "method 4" is that it is easy. In the past this was enough to recommend it because introducing callbacks everywhere was tedious; with the advent of coroutines method 3 is almost as easy so it is usually better to use threads than to block the process.

Are transactions in SQLAlchemy thread safe?

I am developing a web app using SQLAlchemy's expression language, not its orm. I want to use multiple threads in my app, but I'm not sure about thread safety. I am using this section of the documentation to make a connection. I think this is thread safe because I reference a specific connection in each request. Is this thread safe?
The docs for connections and sessions say that neither is thread safe or intended to be shared between threads.
The Connection object is not thread-safe. While a Connection can be shared among threads using properly synchronized access, it is still possible that the underlying DBAPI connection may not support shared access between threads. Check the DBAPI documentation for details.
The Session is very much intended to be used in a non-concurrent fashion, which usually means in only one thread at a time.
The Session should be used in such a way that one instance exists for a single series of operations within a single transaction.
The bigger point is that you should not want to use the session with multiple concurrent threads.
There is no guarantee when using the same connection (and transaction context) in more than one thread that the behavior will be correct or consistent.
You should use one connection or session per thread. If you need guarantees about the data, you should set the isolation level for the engine or session. For web applications, SQLAlchemy suggests using one connection per request cycle.
This simple correspondence of web request and thread means that to associate a Session with a thread implies it is also associated with the web request running within that thread, and vice versa, provided that the Session is created only after the web request begins and torn down just before the web request ends.
I think you are confusing atomicity with isolation.
Atomicity is usually handled through transactions and concerns integrity.
Isolation is about concurrent read/write to a database table (thus thread safety). For example: if you want to increment an int field of a table's record, you will have to select the record's field, increment the value and update it. If multiple threads are doing this concurrently the result will depend on the order of the reads/writes.
http://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=isolation#engine-creation-api

What are the Tornado and Mongodb blocking and asynchronous considerations?

I am running the Tornado web server in conjunction with Mongodb (using the pymongo driver). I am trying to make architectural decisions to maximize performance.
I have several subquestions regarding the blocking/non-blocking and asynchronous aspects of the resulting application when using Tornado and pymongo together:
Question 1: Connection Pools
It appears that the pymongo.mongo_client.MongoClient object automatically implements a pool of connections. Is the intended purpose of a "connection pool" so that I can access mongodb simultaneously from different threads? Is it true that if run with a single MongoClient instance from a single thread that there is really no "pool" since there would only be one connection open at any time?
Question 2: Multi-threaded Mongo Calls
The following FAQ:
http://api.mongodb.org/python/current/faq.html#does-pymongo-support-asynchronous-frameworks-like-gevent-tornado-or-twisted
states:
Currently there is no great way to use PyMongo in conjunction with
Tornado or Twisted. PyMongo provides built-in connection pooling, so
some of the benefits of those frameworks can be achieved just by
writing multi-threaded code that shares a MongoClient.
So I assume that I just pass a single MongoClient reference to each thread? Or is there more to it than that? What is the best way to trigger a callback when each thread produces a result? Should I have one thread running who's job it is to watch a queue (python's Queue.Queue) to handle each result and then calling finish() on the left open RequestHandler object in Tornado? (of course using the tornado.web.asynchronous decorator would be needed)
Question 3: Multiple Instances
Finally, is it possible that I am just creating work? Should I just shortcut things by running a single threaded instance of Tornado and then start 3-4 instances per core? (The above FAQ reference seems to suggest this)
After all doesn't the GIL in python result in effectively different processes anyway? Or are there additional performance considerations (plus or minus) by the "non-blocking" aspects of Tornado? (I know that this is non-blocking in terms of I/O as pointed out here: Is Tornado really non-blocking?)
(Additional Note: I am aware of asyncmongo at: https://github.com/bitly/asyncmongo but want to use pymongo directly and not introduce this additional dependency.)
As i understand, there is two concepts of webservers:
Thread Based (apache)
Event Driven (tornado)
And you've the GIL with python, GIL is not good with threads, and event driven is a model that uses only one thread, so go with event driven.
Pymongo will block tornado, so here is suggestions:
Using Pymongo: use it, and make your database calls faster, by making indexes, but be aware; indexes dont work with operation that will scan lot of values for example: gte
Using AsyncMongo, it seems that has been updated, but still not all mongodb features.
Using Mongotor, this one is a like an update for Asynchmongo, and it has ODM (Object Document Mapper), has all what you need from MongoDB (aggregation, replica set..) and the only feature that you really miss is GridFS.
Using Motor, this is one, is the complete solution to use with Tornado, it has GridFS support, and it is the officialy Mongodb asynchronous driver for Tornado, it uses a hack using Greenlet, so the only downside is not to use with PyPy.
And now, if you decide other solution than Tornado, if you use Gevent, then you can use Pymongo, because it is said:
The only async framework that PyMongo fully supports is Gevent.
NB: sorry if going out of topic, but the sentence:
Currently there is no great way to use PyMongo in conjunction with Tornado
should be dropped from the documentation, Mongotor and Motor works in a perfect manner (Motor in particular).
While the question is old, I felt the answers given don't completely address all the queries asked by the user.
Is it true that if run with a single MongoClient instance from a single thread that there is really no "pool" since there would only be one connection open at any time?
This is correct if your script does not use threading. However if your script is multi-threaded then there would be multiple connections open at a given time
Finally, is it possible that I am just creating work? Should I just shortcut things by running a single threaded instance of Tornado and then start 3-4 instances per core?
No you are not! creating multiple threads is less resource intensive than multiple forks.
After all doesn't the GIL in python result in effectively different processes anyway?
The GIL only prevents multiple threads from accessing the interpreter at the same time. Does not prevent multiple threads from carrying out I/O simultaneously. In fact that's exactly how motor with asyncio achieves asynchronicity.
It uses a thread pool executor to spawn a new thread for each query, returns the result when the thread completes.
are you also aware of motor ? : http://emptysquare.net/blog/introducing-motor-an-asynchronous-mongodb-driver-for-python-and-tornado/
it is written by Jesse Davis who is coauthor of pymongo

Segmentation fault error in a multi threaded app in python

I have a multi threaded app in python, wherein I create multiple producer threads and they extract the data from DB. Data is extracted in chunks. So the part where a thread creates sql statement with limit values is kept within lock. And to let threads execute queries simultaneously, query() function is kept outside the lock. Then the result fetching part is again kept under the lock. Below is the code snippet:
with UserAgent.lock:
sqlGeoTarget = "call sp_ax_ari_select_user_agent_list('0'," + str(self.chunkStart) + "," + str(self.chunkSize) + ",1);"
self.chunkStart += self.chunkSize
self.dbObj.query(sqlGeoTarget)
print "query executed. Processing data now..."+sqlGeoTarget
with UserAgent.lock:
result = self.dbObj.fetchAll()
self.dbObj.dbCursor.close()
But this code generates fatal error segmentation fault (core dumped). Because if I put all the code under lock, it executes fine. I explicitly close the cursor after fetching the data, it is reopened when query() function fired again.
This code is inside a class named UserAgent and it's a shared resource for a class named Producer. Thus, database object is shared. So the problem area 99% must be that as the db object is shared hitting query simultaneously and closing cursor then must be messing up with result set. But then how to solve this problem and achieve concurrent db query execution?
Do not reuse connections across threads. Create a new connection for each thread instead.
From the MySQLdb User Guide:
The MySQL protocol can not handle multiple threads using the same connection at once. Some earlier versions of MySQLdb utilized locking to achieve a threadsafety of 2. While this is not terribly hard to accomplish using the standard Cursor class (which uses mysql_store_result()), it is complicated by SSCursor (which uses mysql_use_result(); with the latter you must ensure all the rows have been read before another query can be executed. It is further complicated by the addition of transactions, since transactions start when a cursor execute a query, but end when COMMIT or ROLLBACK is executed by the Connection object. Two threads simply cannot share a connection while a transaction is in progress, in addition to not being able to share it during query execution. This excessively complicated the code to the point where it just isn't worth it.
The general upshot of this is: Don't share connections between threads. It's really not worth your effort or mine, and in the end, will probably hurt performance, since the MySQL server runs a separate thread for each connection. You can certainly do things like cache connections in a pool, and give those connections to one thread at a time. If you let two threads use a connection simultaneously, the MySQL client library will probably upchuck and die. You have been warned.
Emphasis mine.
Use thread local storage or a dedicated connection pooling library instead.

Python sqlite3 and concurrency

I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web, and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be about the following line:
conn = sqlite3.connect("mydatabase.db")
If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.
Previously I was storing all my results in CSV files, and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?
Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.
This can be enabled via optional keyword argument check_same_thread:
sqlite.connect(":memory:", check_same_thread=False)
You can use consumer-producer pattern. For example you can create queue that is shared between threads. First thread that fetches data from the web enqueues this data in the shared queue. Another thread that owns database connection dequeues data from the queue and passes it to the database.
The following found on mail.python.org.pipermail.1239789
I have found the solution. I don't know why python documentation has not a single word about this option. So we have to add a new keyword argument to connection function
and we will be able to create cursors out of it in different thread. So use:
sqlite.connect(":memory:", check_same_thread = False)
works out perfectly for me. Of course from now on I need to take care
of safe multithreading access to the db. Anyway thx all for trying to help.
Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as using python threading module.
Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features, just to quote some of them:
SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
The Unit Of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maxiumum efficiency and transaction safety, and minimizes chances of deadlocks.
You shouldn't be using threads at all for this. This is a trivial task for twisted and that would likely take you significantly further anyway.
Use only one thread, and have the completion of the request trigger an event to do the write.
twisted will take care of the scheduling, callbacks, etc... for you. It'll hand you the entire result as a string, or you can run it through a stream-processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).
Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.
I have a very simple application that does something close to what you're wanting on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.
Or if you are lazy, like me, you can use SQLAlchemy. It will handle the threading for you, (using thread local, and some connection pooling) and the way it does it is even configurable.
For added bonus, if/when you realise/decide that using Sqlite for any concurrent application is going to be a disaster, you won't have to change your code to use MySQL, or Postgres, or anything else. You can just switch over.
You need to use session.close() after every transaction to the database in order to use the same cursor in the same thread not using the same cursor in multi-threads which cause this error.
Use threading.Lock()
I could not find any benchmarks in any of the above answers so I wrote a test to benchmark everything.
I tried 3 approaches
Reading and writing sequentially from the SQLite database
Using a ThreadPoolExecutor to read/write
Using a ProcessPoolExecutor to read/write
The results and takeaways from the benchmark are as follows
Sequential reads/sequential writes work the best
If you must process in parallel, use the ProcessPoolExecutor to read in parallel
Do not perform any writes either using the ThreadPoolExecutor or using the ProcessPoolExecutor as you will run into database locked errors and you will have to retry inserting the chunk again
You can find the code and complete solution for the benchmarks in my SO answer HERE Hope that helps!
Scrapy seems like a potential answer to my question. Its home page describes my exact task. (Though I'm not sure how stable the code is yet.)
I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net
which handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy one can easily set up the class Farm of many databases to diffuse the load over stochastic time.
Hope this helps your project... it should be simple enough to implement in 10 minutes.
I like Evgeny's answer - Queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:
Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a No-No, due to performance overhead.
Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable as fetch and store operations could take >1sec, and you lose the benefit of multiplexed resources you have with a multi-threaded approach.
You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them, see the FAQ (also the following question).
Please consider checking the value of THREADSAFE for the pragma_compile_options of your SQLite installation. For instance, with
SELECT * FROM pragma_compile_options;
If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you gotta do to avoid the threading exception is to create the Python connection with checksamethread equal to False. In your case, it means
conn = sqlite3.connect("mydatabase.db", checksamethread=False)
That's explained in some detail in Python, SQLite, and thread safety
The most likely reason you get errors with locked databases is that you must issue
conn.commit()
after finishing a database operation. If you do not, your database will be write-locked and stay that way. The other threads that are waiting to write will time-out after a time (default is set to 5 seconds, see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details on that).
An example of a correct and concurrent insertion would be this:
import threading, sqlite3
class InsertionThread(threading.Thread):
def __init__(self, number):
super(InsertionThread, self).__init__()
self.number = number
def run(self):
conn = sqlite3.connect('yourdb.db', timeout=5)
conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
conn.commit()
for i in range(1000):
conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
conn.commit()
# create as many of these as you wish
# but be careful to set the timeout value appropriately: thread switching in
# python takes some time
for i in range(2):
t = InsertionThread(i)
t.start()
If you like SQLite, or have other tools that work with SQLite databases, or want to replace CSV files with SQLite db files, or must do something rare like inter-platform IPC, then SQLite is a great tool and very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!

Categories