As a pet project, I have been writing my own ORM to help me better understand the decisions made by production-grade ORMs like Peewee or the more complex SQLAlchemy.
In line with my title's question: is it better to spawn one cursor and reuse it for multiple SQL executions, or to spawn a new cursor for each transaction?
I've already guessed that it helps avoid state issues (transactions with no commit), but is there another reason why it would be better to have one cursor for each operation (insert, update, select, delete, or create)?
Have you profiled and found that the creation of cursors is a significant source of overhead?
Cursors are a DB-API 2.0 artifact, not necessarily an actual "thing" that exists. They are designed to provide a common interface for executing queries and handling results/iteration. How they are implemented under the hood is up to the database driver. If you're aiming to support DB-API 2.0 compatible drivers, I suggest just using the cursor() method to create a cursor for every query execution. I would recommend NEVER having a singleton or shared cursor.
In SQLite, for example, a cursor is essentially a wrapper around a sqlite3_stmt object, as there's no such thing as a "sqlite3_cursor". The stdlib sqlite3 driver maintains an internal cache of sqlite3_stmt objects to avoid the cost of compiling queries that are frequently used.
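A minimal sketch of that cursor-per-execution pattern with the stdlib sqlite3 driver (the table and data are purely illustrative):

import sqlite3

def run_query(connection, sql, params=()):
    # A fresh DB-API cursor per execution; what a "cursor" actually is
    # under the hood is left to the driver.
    cursor = connection.cursor()
    try:
        cursor.execute(sql, params)
        return cursor.fetchall()
    finally:
        cursor.close()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
print(run_query(conn, "SELECT id, name FROM users"))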
In the case of SQLite, it is not clear whether we can easily commit right after each dataframe insert (assuming that auto-commit is off by default, following the Python database wrapper convention).
Using the simplest SQLAlchemy API flow ―
db_engine = db.create_engine("sqlite:///some.db")  # db = sqlalchemy; placeholder URL
for ...:
    # slowly compute some_df, takes a lot of time
    some_df.to_sql("some_table", con=db_engine)     # "some_table" is a placeholder name
How can we make sure that every .to_sql is committed?
For motivation, imagine the particular use case being that each write reflects the result of a potentially very long computation, and we do not want to lose a huge batch of such computations, nor any single one of them, in case a machine goes down or the Python SQLAlchemy engine object is garbage collected before all its writes have actually drained to the database.
I believe auto-commit is off by default, and for SQLite there is no way of changing that in the create_engine command. What might be the simplest, safest way of adding auto-commit behavior ― or explicitly committing after every dataframe write ― when using the simple .to_sql API?
Or must the code be refactored to use a different api flow to accomplish that?
You can set the connection to autocommit by:
db_engine = db_engine.execution_options(autocommit=True)
From https://docs.sqlalchemy.org/en/13/core/connections.html#understanding-autocommit:
The “autocommit” feature is only in effect when no Transaction has otherwise been declared. This means the feature is not generally used with the ORM, as the Session object by default always maintains an ongoing Transaction.
In your code you have not presented any explicit transactions, and so the engine used as the con is in autocommit mode (as implemented by SQLA).
Note that SQLAlchemy implements its own autocommit that is independent from the DB-API driver's possible autocommit / non-transactional features.
Hence the "the simplest, safest way for adding auto-commit behavior ― or explicitly committing after every dataframe write" is what you already had, unless to_sql() emits some funky statements that SQLA does not recognize as data changing operations, which it has not, at least of late.
It might be that the SQLA autocommit feature is on the way out in the next major release, but we'll have to wait and see.
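If you would rather not rely on SQLA's connection-level autocommit at all, here is a rough sketch of explicitly committing each dataframe write by handing to_sql() a transaction-bound connection (the engine URL, table name, and the chunks / expensive_computation names are placeholders):

import sqlalchemy as db

db_engine = db.create_engine("sqlite:///results.db")

for chunk in chunks:                        # placeholder iterable of work items
    some_df = expensive_computation(chunk)  # the slow step you don't want to lose
    # Engine.begin() opens a connection, starts a transaction, and commits it
    # when the block exits cleanly (or rolls back on an exception).
    with db_engine.begin() as conn:
        some_df.to_sql("results", con=conn, if_exists="append", index=False)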
TL;DR:
How do I prevent DB access issues when calling a PostgreSQL database from multiple threads in Python using SQLAlchemy?
The details:
I am developing a piece of Python software that uses multithreading (a concurrent.futures ThreadPoolExecutor) - but I am by no means an expert in anything.
I also use SQLAlchemy to communicate with a PostgreSQL database (using pg8000).
Because I wanted to keep all my database stuff separate from the rest, all the SQLAlchemy code sits in a Python module that I called db_manager.py. In here you will find the declarative base and the create_engine() call, but also loads of methods to get stuff from or store stuff in the database. Each method here ends with:
session.commit()
(unless I just query the database).
Each thread then would call the db_manager module to interact with the database, e.g.:
db_manager.getSomethingFromDB(...)
I created a little drawing to illustrate the architecture.
The problem:
Now the problem I run into is that these database calls seem to clash sometimes.
What is the best way of dealing with multithreading, SQLAlchemy, and PostgreSQL?
Some ideas:
Currently, my db_manager accesses PostgreSQL as a specific user (pg8000 appears to require this). Is that a problem? Should each thread be its own user, or can this not be the cause of the problems? If each thread needs to be its own database user, I would probably no longer be able to keep all my database stuff in one single module.
I failed to define rollbacks for each commit. I just noticed this is causing problems: any error prevents any further database access.
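For reference, the kind of per-call session handling I failed to set up might look roughly like this (the engine URL and the Something model are placeholders):

from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql+pg8000://user:password@localhost/mydb")
# expire_on_commit=False so returned objects stay usable after the session closes
Session = sessionmaker(bind=engine, expire_on_commit=False)

@contextmanager
def session_scope():
    # Each call (and therefore each thread) gets its own short-lived session.
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()  # keep the connection usable for the next call
        raise
    finally:
        session.close()

def getSomethingFromDB(some_id):
    with session_scope() as session:
        return session.query(Something).get(some_id)  # Something is a stand-in model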
I have a doubt about SQLAlchemy and object refreshing.
I am in a situation where I have two sessions, and the same object has been queried in both sessions. For particular reasons I cannot close one of the sessions.
I have modified the object and committed the changes in session A, but in session B the attributes are still the initial ones, without the modifications.
Should I implement a notification system to communicate changes, or is there a built-in way to do this in SQLAlchemy?
Sessions are designed to work like this. The object in Session B WILL keep the attribute values it had when it was first queried in Session B. Additionally, SQLAlchemy will not attempt to automatically refresh objects in other sessions when they change, nor do I think it would be wise to try to create something like this.
You should be actively thinking of the lifespan of each session as a single transaction in the database. How and when sessions need to deal with the fact that their objects might be stale is not a technical problem that can be solved by an algorithm built into SQLAlchemy (or any extension for SQLAlchemy): it is a "business" problem whose solution you must determine and code yourself. The "correct" response might be to say that this isn't a problem: the logic that occurs with Session B could be valid if it used the data at the time that Session B started. Your "problem" might not actually be a problem. The docs actually have an entire section on when to use sessions, but it gives a pretty grim response if you are hoping for a one-size-fits-all solution...
A Session is typically constructed at the beginning of a logical
operation where database access is potentially anticipated.
The Session, whenever it is used to talk to the database, begins a
database transaction as soon as it starts communicating. Assuming the
autocommit flag is left at its recommended default of False, this
transaction remains in progress until the Session is rolled back,
committed, or closed. The Session will begin a new transaction if it
is used again, subsequent to the previous transaction ending; from
this it follows that the Session is capable of having a lifespan
across many transactions, though only one at a time. We refer to these
two concepts as transaction scope and session scope.
The implication here is that the SQLAlchemy ORM is encouraging the
developer to establish these two scopes in his or her application,
including not only when the scopes begin and end, but also the expanse
of those scopes, for example should a single Session instance be local
to the execution flow within a function or method, should it be a
global object used by the entire application, or somewhere in between
these two.
The burden placed on the developer to determine this scope is one area
where the SQLAlchemy ORM necessarily has a strong opinion about how
the database should be used. The unit of work pattern is specifically
one of accumulating changes over time and flushing them periodically,
keeping in-memory state in sync with what’s known to be present in a
local transaction. This pattern is only effective when meaningful
transaction scopes are in place.
That said, there are a few things you can do to change how the situation works:
First, you can reduce how long your session stays open. Session B queries the object, and then later you do something with that object (in the same session) for which you want the attributes to be up to date. One solution is to perform this second operation in a separate session.
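Roughly, assuming a sessionmaker bound to your engine, and with MyModel / obj_id as placeholder names:

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)  # engine is assumed to exist already

# When the up-to-date state is needed, read it in a fresh, short-lived session
# so the row is loaded from the database rather than from Session B's identity map.
fresh_session = Session()
try:
    current = fresh_session.query(MyModel).get(obj_id)
    # ... work with the current attribute values here ...
finally:
    fresh_session.close()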
Another is to use the expire/refresh methods, as the docs show...
# immediately re-load attributes on obj1, obj2
session.refresh(obj1)
session.refresh(obj2)
# expire objects obj1, obj2, attributes will be reloaded
# on the next access:
session.expire(obj1)
session.expire(obj2)
You can use session.refresh() to immediately get an up-to-date version of the object, even if the session already queried the object earlier.
Run this to force the session to fetch the latest values from your database of choice:
session.expire_all()
Excellent docs about the default behavior and lifespan of a session.
I just had this issue and the existing solutions didn't work for me for some reason. What did work was to call session.commit(). After calling that, the object had the updated values from the database.
TL;DR Rather than working on Session synchronization, see if your task can be reasonably easily coded with SQLAlchemy Core syntax, directly on the Engine, without the use of (multiple) Sessions
For someone coming from SQL and JDBC experience, one critical thing to learn about SQLAlchemy, which, unfortunately, I didn't clearly pick up on while reading through the various documents for months, is that SQLAlchemy consists of two fundamentally different parts: the Core and the ORM. As the ORM documentation is listed first on the website and most examples use ORM-like syntax, one gets thrown into working with it and sets themselves up for errors and confusion - if thinking about the ORM in terms of SQL/JDBC. The ORM uses its own abstraction layer that takes complete control over how and when actual SQL statements are executed. The rule of thumb is that a Session is cheap to create and kill, and it should never be re-used for anything in the program's flow and logic that may cause re-querying, synchronization or multi-threading. On the other hand, the Core is the direct no-frills SQL, very much like a JDBC driver. There is one place in the docs I found that "suggests" using Core over ORM:
it is encouraged that simple SQL operations take place here, directly on the Connection, such as incrementing counters or inserting extra rows within log
tables. When dealing with the Connection, it is expected that Core-level SQL
operations will be used; e.g. those described in SQL Expression Language Tutorial.
Although, it appears that using a Connection causes the same side effect as using a Session: re-querying a specific record returns the same result as the first query, even if the record's content in the DB was changed. So, apparently Connections are as "unreliable" as Sessions for reading DB content in "real time", but a direct Engine execution seems to work fine, as it picks a Connection object from the pool (assuming that the retrieved Connection would never be in the same "reuse" state relative to the query as the specific open Connection). The Result object should be closed explicitly, as per the SA docs.
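For illustration, a minimal Core-style sketch (the connection URL, table, and column names are made up); whether it avoids the staleness described above will still depend on the driver's transaction behavior:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+pg8000://user:password@localhost/mydb")

# Core usage: check a connection out of the pool, run plain SQL, return it.
# No ORM identity map caches the previously loaded state of the row.
with engine.connect() as conn:
    result = conn.execute(text("SELECT id, status FROM jobs WHERE id = :id"), {"id": 42})
    row = result.fetchone()
    result.close()  # close the Result explicitly, as per the SA docs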
What is your isolation level set to?
SHOW GLOBAL VARIABLES LIKE 'transaction_isolation';
By default, MySQL InnoDB's transaction_isolation is set to REPEATABLE-READ.
+-----------------------+-----------------+
| Variable_name | Value |
+-----------------------+-----------------+
| transaction_isolation | REPEATABLE-READ |
+-----------------------+-----------------+
Consider setting it to READ-COMMITTED.
You can set this for your sqlalchemy engine only via:
create_engine("mysql://<connection_string>", isolation_level="READ COMMITTED")
I think another option is:
engine = create_engine("mysql://<connection_string>")
engine = engine.execution_options(isolation_level="READ COMMITTED")  # execution_options() returns a copy of the engine
Or set it globally in the DB via:
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
and
https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#setting-transaction-isolation-levels-dbapi-autocommit
If you had added the incorrect model to the session, you can do:
db.session.rollback()
Is opening/closing a DB cursor a costly operation? What is the best practice: to use a different cursor or to reuse the same cursor between different SQL executions? Does it matter whether a transaction consists of executions performed on the same or different cursors belonging to the same connection?
Thanks.
This will depend a lot on your database as well as your chosen Python implementation - have you tried profiling a few short test operations?
I have been using Python with RDBMSs (MySQL and PostgreSQL), and I have noticed that I really do not understand how to use a cursor.
Usually, one has their script connect to the DB via a client DB-API module (like psycopg2 or MySQLdb):
connection = psycopg2.connect(host='otherhost', etc)
And then one creates a cursor:
cursor = connection.cursor()
And then one can issue queries and commands:
cursor.execute("SELECT * FROM etc")
Now where is the result of the query, I wonder? Is it on the server? Or a little on my client and a little on my server? And then, if we need to access some results, we fetch them:
row = cursor.fetchone()
or
rows = cursor.fetchmany()
Now let's say I do not retrieve all the rows and decide to execute another query. What will happen to the previous results? Is there an overhead?
Also, should I create a cursor for every form of command and continuously reuse it for those same commands somehow? I heard psycopg2 can somehow optimize commands that are executed many times but with different values; how does that work, and is it worth it?
Thx
Yeah, I know it's months old :P
The DB-API cursor appears to be closely modeled after SQL cursors. As far as resource (row) management is concerned, DB-API does not specify whether the client must retrieve all the rows or DECLARE an actual SQL cursor. As long as the fetchXXX interfaces do what they're supposed to, DB-API is happy.
As far as psycopg2 cursors are concerned (as you may well know), "unnamed DB-API cursors" will fetch the entire result set, AFAIK buffered in memory by libpq. "Named DB-API cursors" (a psycopg2 concept that may not be portable) will request the rows on demand (via the fetchXXX methods).
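A rough sketch of both flavors (connection parameters and the table name are placeholders):

import psycopg2

conn = psycopg2.connect(host="otherhost", dbname="mydb", user="me")

# Unnamed cursor: the whole result set is fetched and buffered client-side.
cur = conn.cursor()
cur.execute("SELECT * FROM big_table")
first_batch = cur.fetchmany(100)   # rows are already in client memory
cur.close()

# Named cursor: a server-side cursor; rows travel over the wire on demand.
named = conn.cursor(name="my_server_side_cursor")
named.itersize = 1000              # rows pulled per round trip while iterating
named.execute("SELECT * FROM big_table")
for row in named:
    pass                           # process row by row without loading everything
named.close()
conn.close()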
As cited by "unbeknown", executemany can be used to optimize multiple runs of the same command. However, it doesn't accommodate the need for prepared statements; when repeat executions of a statement with different parameter sets are not directly sequential, executemany() will perform just as well as execute(). DB-API does "provide" driver authors with the ability to cache executed statements, but its implementation (what's the scope/lifetime of the statement?) is undefined, so it's impossible to set expectations across DB-API implementations.
If you are loading lots of data into PostgreSQL, I would strongly recommend trying to find a way to use COPY.
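For example, a hedged sketch using psycopg2's copy_expert() (connection parameters, file, and table are placeholders):

import psycopg2

conn = psycopg2.connect(host="otherhost", dbname="mydb", user="me")
cur = conn.cursor()

# Stream a CSV file straight into the table with COPY, which is typically
# much faster for bulk loads than row-by-row INSERTs.
with open("data.csv") as f:
    cur.copy_expert("COPY mytable (col1, col2) FROM STDIN WITH CSV", f)

conn.commit()
cur.close()
conn.close()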
Assuming you're using PostgreSQL, the cursors probably are just implemented using the database's native cursor API. You may want to look at the source code for pg8000, a pure Python PostgreSQL DB-API module, to see how it handles cursors. You might also like to look at the PostgreSQL documentation for cursors.
When you look here at the mysqldb documentation you can see that they implemented different strategies for cursors. So the general answer is: it depends.
Edit: Here is the mysqldb API documentation. There is some info on how each cursor type behaves. The standard cursor stores the result set in the client, so I assume there is an overhead if you don't retrieve all result rows, because even the rows you don't fetch have to be transferred to the client (potentially over the network). My guess is that it is not that different from PostgreSQL.
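For instance, a small sketch contrasting the default client-buffered cursor with the server-side SSCursor (connection parameters and the table name are placeholders):

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")

# Default Cursor: mysql_store_result(), the full result set is copied to the client.
buffered = conn.cursor()
buffered.execute("SELECT * FROM big_table")
rows = buffered.fetchall()
buffered.close()

# SSCursor: mysql_use_result(), rows are streamed from the server as you fetch them.
streaming = conn.cursor(MySQLdb.cursors.SSCursor)
streaming.execute("SELECT * FROM big_table")
for row in streaming:
    pass  # process one row at a time; consume or close fully before reusing the connection
streaming.close()
conn.close()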
When you want to optimize SQL statements that you call repeatedly with many values, you should look at cursor.executemany(). It prepares a SQL statement so that it doesn't need to be parsed every time you call it:
cur.executemany('INSERT INTO mytable (col1, col2) VALUES (%s, %s)',
                [('val1', 1), ('val2', 2)])