Python, sqlalchemy: how to improve performance of encrypted sqlite database?

I have a simple service application: Python, the Tornado web server, and an SQLite database. The database is encrypted.
The problem is that processing even a very simple HTTP request takes about 300 msec.
From the logs I can see that most of that time is spent processing the very first SQL request, no matter how simple that first request is. Subsequent SQL requests are processed much faster. But then the server starts processing the next HTTP request, and again the first SQL request is very slow.
If I turn off database encryption the problem is gone: the processing time of SQL requests no longer depends on whether a request is the first one or not, and my server's response time decreases by a factor of 10 to 15.
I do not quite understand what is going on. It looks like SQLAlchemy reads and decrypts the database file each time it starts a new session. Is there any way to work around this problem?

Due to how pysqlite (the sqlite3 module) works, SQLAlchemy defaults to using a NullPool with file-based databases. This explains why your database is decrypted on each request: a NullPool discards connections as they are closed. The reason this is done is that pysqlite's default behaviour is to disallow using a connection in more than one thread, and without encryption creating new connections is very fast.
Pysqlite does have an undocumented flag, check_same_thread, that can be used to disable the check, but sharing connections between threads should be handled with care, and the SQLAlchemy documentation makes a passing mention that the NullPool works well with SQLite's file locking.
Depending on your web server you could use a SingletonThreadPool, which means that all connections in a thread are the same connection:
from sqlalchemy.pool import SingletonThreadPool

engine = create_engine('sqlite:///my.db',
                       poolclass=SingletonThreadPool)
If you feel adventurous, and your web server does not share connections/sessions between threads while they are in use (for example by using a scoped session), then you could try a different pooling strategy paired with check_same_thread=False:
from sqlalchemy.pool import QueuePool

engine = create_engine('sqlite:///my.db',
                       poolclass=QueuePool,
                       connect_args={'check_same_thread': False})
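The scoped session mentioned above can be set up in a couple of lines; a minimal sketch, assuming the engine from the previous snippet:
from sqlalchemy.orm import scoped_session, sessionmaker

# each thread gets its own session, all drawing connections from the shared QueuePool
Session = scoped_session(sessionmaker(bind=engine))

session = Session()   # this thread's session
# ... run queries ...
Session.remove()      # dispose of the thread-local session when the request is done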

To encrypt the database, SQLCipher derives a key from the passphrase I provided. This operation is resource-consuming by design.
But it is possible to use a 256-bit raw key instead of a passphrase. In that case SQLCipher does not have to derive the encryption key.
Originally my code was:
session.execute('PRAGMA KEY = "MY_PASSPHRASE";')
To use raw key I changed this line to:
session.execute('''PRAGMA KEY = "x'<the key>'";''')
where <the key> is a 64-character string of hexadecimal digits.
The result is a 20+ times speed-up on small requests.
Just for reference: to convert a database to the new encryption key, the following commands should be executed:
PRAGMA KEY = "MY_PASSPHRASE";
PRAGMA REKEY = "x'<the key>'";
Related question: python, sqlite, sqlcipher: very poor performance processing first request
Some info about sqlcipher commands and difference between keys and raw keys: https://www.zetetic.net/sqlcipher/sqlcipher-api/
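For reference, a minimal sketch of one way to apply the raw key automatically on every new connection, using SQLAlchemy's connect event. This is an assumption about your setup rather than the only option: the key value is a placeholder, and the engine must already be using your SQLCipher-enabled driver.
from sqlalchemy import create_engine, event

engine = create_engine('sqlite:///my.db')
RAW_KEY_HEX = '<the key>'  # placeholder: 64 hex characters, i.e. a 256-bit raw key

@event.listens_for(engine, 'connect')
def set_sqlcipher_key(dbapi_connection, connection_record):
    # run the key PRAGMA before anything else touches this connection
    cursor = dbapi_connection.cursor()
    cursor.execute('PRAGMA key = "x\'%s\'";' % RAW_KEY_HEX)
    cursor.close()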

Related

How to be informed that some database information has been changed in Python

I'm working on code written in Python 2.7 that connects to a MariaDB database to read data.
This database receives data from different external resources. My code only reads it.
My service reads the data once at the beginning and keeps everything in memory to avoid I/O.
I would like to know if there is some way to create a 'function callback' in my code to receive some kind of alert on new updates/inserts, so I can reload my in-memory data from the database every time an external resource changes or saves new data.
I have thought of creating an SQL trigger on a new table to insert some "flag" there, and having my service check that new table periodically for the flag.
If it is present, reload the data and delete the flag.
But it sounds like a wrong workaround...
I'm using:
Python 2.7
MariaDB Ver 15.1 Distrib 10.3.24-MariaDB
lib mysql-connector 2.1.6
A better solution for MariaDB is streaming with the CDC API: https://mariadb.com/resources/blog/how-to-stream-change-data-through-mariadb-maxscale-using-cdc-api/
The plan you have now, with using a flag table, means your client has to poll the flag table for presence of the flag. You have to run a query against that table at intervals, and keep doing it 24/7. Depending on how quickly your client needs to be notified of a change, you might need to run this type of polling query very frequently, which puts a burden on the MariaDB server just to respond to the polling queries, even when there is no change to report.
The CDC solution is better because the client can just request to be notified the next time a change occurs, then the client waits. It does not put extra load on the MariaDB server, any more than if you had simply added a replica server.
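For comparison, this is roughly what the flag-table polling described in the question would look like; the change_flags table, the credentials and the reload_data_into_memory routine are made-up placeholders. The CDC approach removes the need for this loop entirely.
import time
import mysql.connector

# autocommit so each poll sees the latest committed data
conn = mysql.connector.connect(host='localhost', user='reader', password='...',
                               database='mydb', autocommit=True)

while True:
    cursor = conn.cursor()
    # hypothetical flag table populated by triggers on the watched tables
    cursor.execute("SELECT id FROM change_flags LIMIT 1")
    if cursor.fetchone() is not None:
        reload_data_into_memory()        # stand-in for your existing reload routine
        cursor.execute("DELETE FROM change_flags")
    cursor.close()
    time.sleep(5)                        # polling interval: load vs. latency trade-off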

Python async MySQL lib for caching duplicate read queries in transaction

Suppose I have a web request handler in Python which processes some complex logic using MySQL queries. I wrap the queries in readable methods, for example:
START TRANSACTION
get_some_users_in_range("SELECT * FROM users WHERE id > 1 AND id < 24")
get_user("SELECT * FROM users WHERE id = 10")
get_user("SELECT * FROM users WHERE id = 10")
get_user("SELECT * FROM users WHERE id = 12")
END TRANSACTION
All I want is some smart caching application layer which understands that, in the context of this transaction, there is no need to hit the DB again after the first query (because all the needed rows were already fetched by it). Are there solutions for such a problem in modern Python (async preferred)?
P.S. A raw SQL lib is preferred (not an ORM).
You can wrap your function with functools.lru_cache: https://docs.python.org/3/library/functools.html#functools.lru_cache
Here is the async version of the same functionality: https://github.com/aio-libs/async-lru
An excellent library that has more caching strategies (check their readme, they also have async support via separate lib): https://github.com/tkem/cachetools
But keep in mind that such caching only works within the scope of a single process. If you want the cache to be shared between processes/instances, consider an external cache service such as Redis.
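A minimal, runnable sketch of the async variant using async_lru's alru_cache; fetch_user here is a stand-in for your own raw-SQL coroutine (e.g. one built on aiomysql), not a real library function:
import asyncio
from async_lru import alru_cache

async def fetch_user(user_id):
    # stand-in for a real raw-SQL coroutine hitting MySQL
    await asyncio.sleep(0.1)
    return {"id": user_id}

@alru_cache(maxsize=256)
async def get_user(user_id):
    return await fetch_user(user_id)

async def main():
    await get_user(10)   # goes to the "database"
    await get_user(10)   # served from the cache, no second round-trip

asyncio.run(main())
Note that this cache is process-wide rather than transaction-scoped, so for data that can change you would want to bound or clear it between transactions.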

Flask-SQLAlchemy "MySQL server has gone away" when using HAproxy

I've built a small python REST service using Flask, with Flask-SQLAlchemy used for talking to the MySQL DB.
If I connect directly to the MySQL server everything is good, no problems at all. If I use HAproxy (handles HA/failover, though in this dev environment there is only one DB server) then I constantly get MySQL server has gone away errors if the application doesn't talk to the DB frequently enough.
My HAproxy client timeout is set to 50 seconds, so what I think is happening is it cuts the stream, but the application isn't aware and tries to make use of an invalid connection.
Is there a setting I should be using when using services like HAproxy?
Also it doesn't seem to reconnect automatically, but if I issue a request manually I get Can't reconnect until invalid transaction is rolled back, which is odd since it is just a select() call I'm making, so I don't think it is a commit() I'm missing - or should I be calling commit() after every ORM based query?
Just to tidy up this question with an answer I'll post what I (think I) did to solve the issues.
Problem 1: HAproxy
Either increase the HAproxy client timeout value (globally, or in the frontend definition) to a value longer than what MySQL is set to reset on (see this interesting and related SF question)
Or set SQLALCHEMY_POOL_RECYCLE = 30 (30 in my case being less than the HAproxy client timeout) in Flask's app.config, so that when the DB is initialised it pulls in that setting and recycles connections before HAproxy cuts them itself; a config sketch follows below. Similar to this issue on SO.
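A minimal sketch of that configuration, with illustrative values (the URI and hostname are placeholders; db is the shared SQLAlchemy() instance described under Problem 2):
from flask import Flask
from common.database import db

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'mysql://user:password@haproxy-host/mydb'
app.config['SQLALCHEMY_POOL_RECYCLE'] = 30   # shorter than HAproxy's 50 s client timeout
db.init_app(app)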
Problem 2: Can't reconnect until invalid transaction is rolled back
I believe I fixed this by tweaking the way the DB is initialised and imported across various modules. I basically now have a module that simply has:
# flask.ext.* has been removed from modern Flask; import the package directly
from flask_sqlalchemy import SQLAlchemy
db = SQLAlchemy()
Then in my main application factory I simply:
from common.database import db
db.init_app(app)
Also, since I wanted to load table structures automatically, I initialised the metadata bind within the app context, and I think it was this that cleanly handled the commit() issue/error I was getting, as I believe the database sessions are now being correctly terminated after each request.
with app.app_context():
    # Setup DB binding
    db.metadata.bind = db.engine

SQLite over a network

I am creating a Python application that uses embedded SQLite databases. The programme creates the db files and they are on a shared network drive. At this point there will be no more than 5 computers on the network running the programme.
My initial thought was to ask the user on startup whether they are the server or a client. If they are the server, they create the database. If they are a client, they must find a server instance on the network. One approach, I suppose, is to send all db commands from the clients to the server and have the server apply them to the database. Will that solve the shared db issue?
Alternatively, is there some way to create a SQLite "server". I presume this would be the quicker option if available?
Note: I can't use a server engine such as MySQL or PostgreSQL at this point but I am implementing using ORM and so when this becomes viable, it should be easy to change over.
Here's a "SQLite Server", http://sqliteserver.xhost.ro/, but it looks like it has not been maintained for years.
SQLite supports concurrency itself: multiple processes can read data at the same time, but only one can write to it. Also, when a process is writing it locks the whole database file for a few seconds and the others have to wait in the meantime, according to the official documentation.
I guess this is sufficient for 5 processes in your scenario; you just need to write code to handle the waiting.
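A rough sketch of that waiting logic with the standard sqlite3 module (the timeout and retry values are arbitrary illustrations, not recommendations):
import sqlite3
import time

def execute_with_retry(db_path, sql, params=(), retries=5):
    # the timeout makes sqlite itself wait up to 30 s for the file lock;
    # the loop retries if the database is still locked after that
    for attempt in range(retries):
        try:
            conn = sqlite3.connect(db_path, timeout=30)
            with conn:                      # commits on success, rolls back on error
                conn.execute(sql, params)
            conn.close()
            return
        except sqlite3.OperationalError as exc:
            if 'locked' not in str(exc):
                raise
            time.sleep(1 + attempt)
    raise RuntimeError('database stayed locked after %d attempts' % retries)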

How to manage db connections using webpy with SQLObject?

Web.py has its own database API, web.db. It's possible to use SQLObject instead, but I haven't been able to find documentation describing how to do this properly. I'm especially interested in managing database connections. It would be best to establish a connection at the wsgi entry point, and reuse it. Webpy cookbook contains an example how to do this with SQLAlchemy. I'd be interested to see how to properly do a similar thing using SQLObject.
This is how I currently do it:
class MyPage(object):
    def GET(self):
        ConnectToDatabase()
        ....
        return render.MyPage(...)
This is obviously inefficient, because it establishes a new database connection on each query. I'm sure there's a better way.
As far as I understand the SQLAlchemy example given, a processor is used, that is, a session is created for each connection and committed when the handler is complete (or rolled back if an error has occurred).
I don't see any simple way to do what you propose, i.e. open a connection at the WSGI entry point. You will probably need a connection pool to serve multiple clients at the same time. (I have no idea what are the requirements for efficiency, code simplicity and so on, though. Please comment.)
Inserting ConnectToDatabase calls into each handler is of course ugly. I suggest that you adapt the cookbook example replacing the SQLAlchemy session with a SQLObject connection.
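A minimal sketch of that adaptation using SQLObject's sqlhub, assuming a process-wide connection is acceptable (the URI and URL mapping are placeholders):
import web
from sqlobject import sqlhub, connectionForURI

# establish the connection once, at the WSGI entry point, and reuse it
sqlhub.processConnection = connectionForURI('sqlite:/path/to/my.db')

urls = ('/mypage', 'MyPage')

class MyPage(object):
    def GET(self):
        # SQLObject model classes pick up sqlhub.processConnection implicitly,
        # so no ConnectToDatabase() call is needed in each handler
        return 'hello'

app = web.application(urls, globals())

if __name__ == '__main__':
    app.run()
If you need a connection per request instead, web.py's loadhook/unloadhook processors can set sqlhub.threadConnection for each request, which mirrors the cookbook's SQLAlchemy processor.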
