I wish to run some benchmarks on different databases that I have. I repeat every query n times so that I can report average query times. I am aware that SQLite caches statements, as per the documentation:
The sqlite3 module internally uses a statement cache to avoid SQL
parsing overhead. If you want to explicitly set the number of
statements that are cached for the connection, you can set the
cached_statements parameter. The currently implemented default is to
cache 100 statements.
However, it is unclear to me whether this cache persists. So in short: does the SQLite cache persist (1) within a Python session even after closing the connection, and (2) across Python sessions (i.e. is the cache written to disk)?
My code looks something like this:
import sqlite3
import time
from datetime import timedelta

def benchmark(dbpath, n_repeat):
    times = []
    for i in range(n_repeat):
        start = time.perf_counter()
        conn = sqlite3.connect(dbpath)
        # do query
        conn.commit()
        conn.close()
        times.append(time.perf_counter() - start)
    return timedelta(seconds=sum(times) / n_repeat)
My assumption was that whenever I close the connection, any and all caching would be discarded and garbage-collected immediately. I find little variance in the n runs (no difference between 1st and nth iteration), so I would think that my assumption is correct. But I'd rather be sure so I am asking here.
tl;dr: does SQLite cache queries even after a connection has closed? And does it cache queries across Python sessions?
After some digging I found that, apart from the caching within Python/SQLite (as Shawn mentions in the comments), another hugely impactful factor was OS-level caching. I am not sure what exactly is cached (the database file, the indices, the query itself), but if I drop the system caches after every iteration (so no caching should occur), then I find that consecutive calls are much slower (on the order of 100x or more!). See the sketch below for how I cleared them.
So, yes, SQLite caches statements through Python, but these are released when the connection is closed. In addition, the system caches play a huge role.
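For reference, a minimal sketch of how the OS page cache can be dropped between iterations on Linux (this requires root and is Linux-specific; the helper name is mine, not part of any library):

import subprocess

def drop_os_caches():
    # Flush dirty pages, then drop the Linux page cache, dentries and inodes.
    # Linux-only and needs root; adjust or skip on other platforms.
    subprocess.run(["sync"], check=True)
    subprocess.run(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)

Calling drop_os_caches() before each timed iteration forces every run to hit the disk, which is where the ~100x difference shows up.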
Related
I have an application which was running very quickly. Let's say it took 10 seconds to run. All it does is read a csv, parse it, and store some of the info in sqlalchemy objects which are written to the database. (We never attempt to read the database, only to write).
After adding a many to many relationship to the entity we are building and relating it to an address entity which we now build, the time to process the file has increased by an order of magnitude. We are doing very little additional work: just instantiating an address and storing it in the relationship collection on our entity using append.
Most of the time appears to be lost in _load_for_state, as can be seen in the attached profiling screenshot.
I'm pretty sure this is unnecessary lost time, because it looks like it is trying to do some loading even though we never make any queries of the database (we always instantiate new objects and save them in this app).
Anyone have an idea how to optimize sqlalchemy here?
update
I tried setting SQLALCHEMY_ECHO = True just to see if it is doing a bunch of database reads, or maybe some extra writes. Bizarrely, it only accesses the database itself at the same times it did before (following a db.session.commit()). I'm pretty sure all this extra time is not being spent due to database access.
I am using python and sqlalchemy to manage a sqlite database (in the future I plan to replace sqlite with postgres).
The operations I do are INSERT, SELECT and DELETE and all these operations are part of a python script that runs every hour.
Each one of these operations can take a considerable amount of time due to the large amount of data.
Now in certain circumstances the python script may be killed by an external process. How can I make sure that my database is not corrupted if the script is killed while reading / writing from the DB?
Well, you use a database.
Databases implement ACID properties (see here). To the extent possible, these guarantee the integrity of the data, even when transactions are not complete.
The issue that you are focusing on is dropped connections. I think dropped connections usually result in a transaction being rolled back (I'm not sure if there are exceptions). That is, the database ignores everything since the last commit.
So, the database protects you against internal corruption. Your data model might still become invalid if the sequence of operations is stopped at an arbitrary place. The solution to this is to wrap such operations in a transaction, so that an interrupted sequence is rolled back as a whole.
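A minimal sketch of what that looks like with Python's sqlite3 module (the file name, table and statements are placeholders):

import sqlite3

conn = sqlite3.connect("mydata.db")  # hypothetical database file
try:
    # The connection as a context manager wraps the block in a transaction:
    # commit on success, rollback if an exception escapes the block.
    with conn:
        conn.execute("INSERT INTO items (name) VALUES (?)", ("foo",))
        conn.execute("DELETE FROM items WHERE expired = 1")
finally:
    conn.close()

If the process is killed outright mid-write, SQLite's journal ensures the incomplete transaction is rolled back the next time the database is opened.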
There is a (small) danger of databases getting corrupted when the hardware or software they are running on suddenly "disappears". This is rare and there are safeguards. And, this is not the problem that you are concerned with (unless your SQLite instance is part of your python process).
I've been struggling with "sqlite3.OperationalError: database is locked" all day...
Searching around for answers to what seems to be a well-known problem, I've found that it is most often explained by the fact that SQLite does not handle multithreading very well: a thread can time out after waiting more than 5 seconds (the default timeout) to write to the db because another thread holds the db lock.
So, having several threads that work with the db, one of them using transactions and writing frequently, I began measuring how long transactions take to complete. I found that no transaction takes more than 300 ms, which makes the above explanation implausible, unless the thread using transactions manages roughly 17 (5000 ms / 300 ms) consecutive transactions while any other thread wanting to write gets ignored the whole time.
So what other hypothesis could explain this behavior?
I have had a lot of these problems with SQLite before. Basically, don't have multiple threads that could potentially write to the db. If this is not acceptable, you should switch to Postgres or something else that is better at concurrency.
Sqlite has a very simple implementation that relies on the file system for locking. Most file systems are not built for low-latency operations like this. This is especially true for network-mounted filesystems and the virtual filesystems used by some VPS solutions (that last one got me BTW).
Additionally, you also have the Django layer on top of all this, adding complexity. You don't know when Django releases connections (although I am pretty sure someone here can give that answer in detail :) ). But again, if you have multiple concurrent writers, you need a database layer that can do concurrency. Period.
I solved this issue by switching to postgres. Django makes this very simple for you, even migrating the data is a no-brainer with very little downtime.
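For what it's worth, a rough sketch of what the switch looks like on the Django side (names and credentials are placeholders; data can be moved with manage.py dumpdata / loaddata):

# settings.py, after installing a Postgres driver (e.g. psycopg2)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "myapp",       # hypothetical database name
        "USER": "myapp",       # hypothetical credentials
        "PASSWORD": "secret",
        "HOST": "localhost",
        "PORT": "5432",
    }
}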
In case anyone else might find this question via Google, here's my take on this.
SQLite is a database engine that implements the "serializable" isolation level (see here). By default, it implements this isolation level with a locking strategy (although it seems to be possible to change this to a more MVCC-like strategy by enabling the WAL mode described in that link).
But even with its fairly coarse-grained locking, the fact that SQLite has separate read and write locks, and uses deferred transactions (meaning it doesn't take the locks until necessary), means that deadlocks might still occur. It seems SQLite can detect such deadlocks and fail the transaction almost immediately.
Since SQLite does not support "select for update", the best way to grab the write lock early, and therefore avoid deadlocks, would be to start transactions with "BEGIN IMMEDIATE" or "BEGIN EXCLUSIVE" instead of just "BEGIN", but Django currently only uses "BEGIN" (when told to use transactions) and has no mechanism for telling it to use anything else. Therefore, locking failures become almost unavoidable with the combination of Django, SQLite, transactions, and concurrency (unless you issue the "BEGIN IMMEDIATE" manually, but that's pretty ugly and SQLite-specific).
But anyone familiar with databases knows that when you use the "serializable" isolation level with many common database systems, transactions can typically fail with a serialization error anyway. That happens in exactly the kind of situation this deadlock represents, and when a serialization error occurs, the failing transaction must simply be retried. And, in fact, that works fine for me.
(Of course, in the end, you should probably use a less "lite" kind of database engine anyway if you need a lot of concurrency.)
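For illustration, a rough sketch of those two workarounds (taking the write lock early and retrying on a lock error) using the raw sqlite3 module; the table name and retry counts are made up, and with Django you would have to apply the same idea around the ORM call:

import sqlite3
import time

def increment_with_retry(dbpath, attempts=5):
    # isolation_level="IMMEDIATE" makes the sqlite3 module open each
    # implicit transaction with BEGIN IMMEDIATE, taking the write lock
    # up front instead of on the first write.
    conn = sqlite3.connect(dbpath, timeout=5.0, isolation_level="IMMEDIATE")
    try:
        for attempt in range(attempts):
            try:
                with conn:  # commit on success, rollback on error
                    conn.execute("UPDATE counters SET value = value + 1 WHERE id = 1")
                return
            except sqlite3.OperationalError as exc:
                if "locked" not in str(exc) or attempt == attempts - 1:
                    raise
                time.sleep(0.1 * (attempt + 1))  # back off before retrying
    finally:
        conn.close()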
I have a question about SQLAlchemy and object refreshing.
I am in a situation where I have two sessions, and the same object has been queried in both. For particular reasons I cannot close one of the sessions.
I have modified the object and committed the changes in session A, but in session B the attributes are still the initial ones, without the modifications.
Should I implement a notification system to communicate changes, or is there a built-in way to do this in SQLAlchemy?
Sessions are designed to work like this. The attributes of the object in Session B WILL keep the values they had when the object was first queried in Session B. Additionally, SQLAlchemy will not attempt to automatically refresh objects in other sessions when they change, nor do I think it would be wise to try to create something like this.
You should be actively thinking of the lifespan of each session as a single transaction in the database. How and when sessions need to deal with the fact that their objects might be stale is not a technical problem that can be solved by an algorithm built into SQLAlchemy (or any extension for SQLAlchemy): it is a "business" problem whose solution you must determine and code yourself. The "correct" response might be to say that this isn't a problem: the logic that occurs with Session B could be valid if it used the data at the time that Session B started. Your "problem" might not actually be a problem. The docs actually have an entire section on when to use sessions, but it gives a pretty grim response if you are hoping for a one-size-fits-all solution...
A Session is typically constructed at the beginning of a logical
operation where database access is potentially anticipated.
The Session, whenever it is used to talk to the database, begins a
database transaction as soon as it starts communicating. Assuming the
autocommit flag is left at its recommended default of False, this
transaction remains in progress until the Session is rolled back,
committed, or closed. The Session will begin a new transaction if it
is used again, subsequent to the previous transaction ending; from
this it follows that the Session is capable of having a lifespan
across many transactions, though only one at a time. We refer to these
two concepts as transaction scope and session scope.
The implication here is that the SQLAlchemy ORM is encouraging the
developer to establish these two scopes in his or her application,
including not only when the scopes begin and end, but also the expanse
of those scopes, for example should a single Session instance be local
to the execution flow within a function or method, should it be a
global object used by the entire application, or somewhere in between
these two.
The burden placed on the developer to determine this scope is one area
where the SQLAlchemy ORM necessarily has a strong opinion about how
the database should be used. The unit of work pattern is specifically
one of accumulating changes over time and flushing them periodically,
keeping in-memory state in sync with what’s known to be present in a
local transaction. This pattern is only effective when meaningful
transaction scopes are in place.
That said, there are a few things you can do to change how the situation works:
First, you can reduce how long your session stays open. Session B queries the object, and then later you do something with that object (in the same session) for which you want the attributes to be up to date. One solution is to perform this second operation in a separate session, as sketched below.
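A minimal sketch of that idea, assuming SQLAlchemy 1.4+ style; the engine URL and the Widget model are placeholders for your own setup:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Widget(Base):
    __tablename__ = "widgets"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite:///demo.db")
Base.metadata.create_all(engine)
SessionFactory = sessionmaker(bind=engine)

# Session B has already queried the object...
session_b = SessionFactory()
widget = session_b.get(Widget, 1)

# ...but the later operation is done in a fresh, short-lived session,
# which sees whatever session A has committed in the meantime.
with SessionFactory() as later_session:
    fresh_widget = later_session.get(Widget, 1)
    # ... work with the up-to-date attributes here ...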
Another is to use the expire/refresh methods, as the docs show...
# immediately re-load attributes on obj1, obj2
session.refresh(obj1)
session.refresh(obj2)
# expire objects obj1, obj2, attributes will be reloaded
# on the next access:
session.expire(obj1)
session.expire(obj2)
You can use session.refresh() to immediately get an up-to-date version of the object, even if the session already queried the object earlier.
Run this to force the session to load the latest values from your database the next time the objects are accessed:
session.expire_all()
Excellent DOC about default behavior and lifespan of session
I just had this issue and the existing solutions didn't work for me for some reason. What did work was to call session.commit(). After calling that, the object had the updated values from the database. (With the default expire_on_commit=True, commit() expires every object in the session, so the attributes are reloaded from the database on the next access.)
TL;DR Rather than working on Session synchronization, see if your task can be reasonably easily coded with SQLAlchemy Core syntax, directly on the Engine, without the use of (multiple) Sessions
For someone coming from SQL and JDBC experience, one critical thing to learn about SQLAlchemy, which unfortunately didn't come across clearly for me after months of reading through the various documents, is that SQLAlchemy consists of two fundamentally different parts: the Core and the ORM. Since the ORM documentation is listed first on the website and most examples use ORM-like syntax, you get thrown into working with it and set yourself up for errors and confusion if you think about the ORM in SQL/JDBC terms. The ORM uses its own abstraction layer that takes complete control over how and when actual SQL statements are executed. The rule of thumb is that a Session is cheap to create and discard, and it should never be reused for anything in the program's flow and logic that may cause re-querying, synchronization or multi-threading. The Core, on the other hand, is direct, no-frills SQL, very much like a JDBC driver. There is one place in the docs I found that "suggests" using Core over the ORM:
it is encouraged that simple SQL operations take place here, directly on the Connection, such as incrementing counters or inserting extra rows within log
tables. When dealing with the Connection, it is expected that Core-level SQL
operations will be used; e.g. those described in SQL Expression Language Tutorial.
It does appear, though, that using a Connection causes the same side effect as using a Session: re-querying a specific record returns the same result as the first query, even if the record's content in the DB has changed. So Connections are apparently as "unreliable" as Sessions for reading DB content in "real time", but a direct Engine execution seems to work fine, since it picks a Connection object from the pool (assuming that the retrieved Connection is never in the same "reuse" state, relative to the query, as the specific open Connection). The Result object should be closed explicitly, as per the SA docs.
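A minimal sketch of that Core-style, Engine-level usage (the URL, table and column names are made up; assumes SQLAlchemy 1.4+, where the Connection is used as a context manager):

from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///app.db")  # hypothetical database URL

# Each call checks a Connection out of the pool, runs the statement,
# and returns the Connection to the pool when the block exits.
with engine.connect() as conn:
    result = conn.execute(
        text("SELECT id, name FROM widgets WHERE active = :active"),
        {"active": 1},
    )
    rows = result.fetchall()
    result.close()  # explicit close, as noted above

For writes, depending on the version, you may need "with engine.begin() as conn:" instead, so the statements are committed when the block exits.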
What is your isolation level set to?
SHOW GLOBAL VARIABLES LIKE 'transaction_isolation';
By default, MySQL InnoDB's transaction_isolation is set to REPEATABLE-READ.
+-----------------------+-----------------+
| Variable_name | Value |
+-----------------------+-----------------+
| transaction_isolation | REPEATABLE-READ |
+-----------------------+-----------------+
Consider setting it to READ-COMMITTED.
You can set this for your sqlalchemy engine only via:
create_engine("mysql://<connection_string>", isolation_level="READ COMMITTED")
I think another option is:
engine = create_engine("mysql://<connection_string>")
engine.execution_options(isolation_level="READ COMMITTED")
Or set it globally in the DB via:
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
and
https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#setting-transaction-isolation-levels-dbapi-autocommit
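Per that last link, you can also apply the option only to the sessions that need it by binding a sessionmaker to a copy of the engine; a brief sketch (names are placeholders):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

eng = create_engine("mysql://<connection_string>")

# A copy of the engine carrying the isolation level, used only where
# READ COMMITTED is required; the original engine is left unchanged.
read_committed_eng = eng.execution_options(isolation_level="READ COMMITTED")
ReadCommittedSession = sessionmaker(bind=read_committed_eng)

session = ReadCommittedSession()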
If you have added an incorrect model to the session, you can do:
db.session.rollback()
I've noticed this behavior in sqlite. When I re-use the cursor object, the working-set memory in the task manager keeps increasing until my program throws an out-of-memory exception.
I refactored the code so that each time I query, I open a connection to the sqlite file, query what I want, and then close the connection.
The latter approach seems far less memory-hungry: usage doesn't grow beyond a certain point.
All I do with my sqlite db is run a simple SELECT (containing two aggregations) against a table.
Is this a behavior we can somehow control? I'd want to reuse my cursor object, but not want memory to be eaten up...
See SQLite: PRAGMA cache_size
By default the cache size is fairly big (~2MB) and that is per-connection. You can set it smaller with the SQL statement:
PRAGMA cache_size=-kibibytes
Use a negative value to set the cache size in kibibytes; a positive value sets the number of pages to use instead.
Also, if using multiple connections you might want to employ a shared cache to save memory:
SQLITE: Shared Cache
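A small sketch of both ideas from Python (the 500 KiB figure, file name and query are arbitrary; the shared cache is enabled through the URI form of the filename):

import sqlite3

# Open with a shared cache so multiple connections in this process
# can share one page cache (URI filename, cache=shared).
conn = sqlite3.connect("file:data.db?cache=shared", uri=True)

# Shrink this connection's page cache to ~500 KiB
# (negative value = size in KiB, positive value = number of pages).
conn.execute("PRAGMA cache_size=-500")

rows = conn.execute("SELECT COUNT(*), SUM(value) FROM samples").fetchall()
conn.close()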
Yes, the sqlite3 module for Python uses a statement cache.
You can read about the cached_statements parameter here.
More info on this issue.
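If you want to take the statement cache out of the picture for a benchmark, the connect() call accepts the parameter directly; a minimal sketch (the file name is a placeholder, and cached_statements=0 should disable caching for that connection):

import sqlite3

# With cached_statements=0 the connection keeps no prepared-statement
# cache, so every execution re-parses the SQL.
conn = sqlite3.connect("benchmark.db", cached_statements=0)
conn.execute("SELECT 1").fetchone()
conn.close()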