Is opening/closing a db cursor a costly operation? What is the best practice: use a different cursor or reuse the same cursor across different SQL executions? Does it matter whether a transaction consists of executions performed on the same cursor or on different cursors belonging to the same connection?
Thanks.
This will depend a lot on your database as well as your chosen Python implementation - have you tried profiling a few short test operations?
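For example, a rough sketch of such a profiling run, using the stdlib sqlite3 driver and timeit (an illustration only; swap in whichever driver and connection string you actually use, and note that the table t is just a placeholder):

import sqlite3
import timeit

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

def fresh_cursor():
    # open a new cursor, run a trivial query, close the cursor
    cur = conn.cursor()
    cur.execute("SELECT x FROM t")
    cur.close()

shared = conn.cursor()

def reused_cursor():
    # run the same trivial query on a long-lived cursor
    shared.execute("SELECT x FROM t")

print("new cursor per query:", timeit.timeit(fresh_cursor, number=10000))
print("reused cursor:       ", timeit.timeit(reused_cursor, number=10000))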
In the case of sqlite, it is not clear whether we can easily commit right after each dataframe insert (assuming that auto-commit is off by default, following the Python database wrapping convention).
Using the simplest sqlalchemy api flow ―
db_engine = db.create_engine(...)
for ...:
    # slowly compute some_df, takes a lot of time
    some_df.to_sql(..., con=db_engine)
How can we make sure that every .to_sql is committed?
For motivation, imagine that each write reflects the result of a potentially very long computation: we do not want to lose a huge batch of such computations, nor any single one of them, in case a machine goes down or the Python sqlalchemy engine object is garbage collected before all its writes have actually drained to the database.
I believe auto-commit is off by default, and for sqlite, there is no way of changing that in the create_engine command. What might be the simplest, safest way for adding auto-commit behavior ― or explicitly committing after every dataframe write ― when using the simplistic .to_sql api?
Or must the code be refactored to use a different api flow to accomplish that?
You can set the connection to autocommit by:
db_engine = db_engine.execution_options(autocommit=True)
From https://docs.sqlalchemy.org/en/13/core/connections.html#understanding-autocommit:
The “autocommit” feature is only in effect when no Transaction has otherwise been declared. This means the feature is not generally used with the ORM, as the Session object by default always maintains an ongoing Transaction.
In your code you have not presented any explicit transactions, and so the engine used as the con is in autocommit mode (as implemented by SQLA).
Note that SQLAlchemy implements its own autocommit that is independent from the DB-API driver's possible autocommit / non-transactional features.
Hence the "the simplest, safest way for adding auto-commit behavior ― or explicitly committing after every dataframe write" is what you already had, unless to_sql() emits some funky statements that SQLA does not recognize as data-changing operations, which it does not, at least of late.
It might be that the SQLA autocommit feature is on the way out in the next major release, but we'll have to wait and see.
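If you would rather not rely on SQLA's autocommit detection at all, you can also commit explicitly by wrapping each write in its own short transaction. A minimal sketch, assuming a sqlite URL and a table name of your choosing (both are placeholders here, as is the produce_dataframes generator):

import sqlalchemy as db

db_engine = db.create_engine("sqlite:///results.db")   # placeholder URL

for some_df in produce_dataframes():        # hypothetical generator of results
    # engine.begin() opens a transaction and commits it when the block exits,
    # so each DataFrame is durable as soon as its to_sql() call returns
    with db_engine.begin() as conn:
        some_df.to_sql("results", con=conn, if_exists="append", index=False)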
Using Postgres and sqlalchemy.
I have a job that scans a large table and, for each row, does some calculation and updates some related tables. I am told that I should issue periodic commits inside the loop in order not to keep a large amount of in-memory data. I wonder whether such commits have a performance penalty, e.g. restarting a transaction, taking a db snapshot, etc.
Would using a flush() be better in this case?
An open transaction won't keep a lot of data in memory.
The advice you got was probably from somebody who is used to Oracle, where large transactions cause problems with UNDO.
The question is how you scan the large table:
If you snarf the large table to the client and then update the related tables, it won't matter much if you commit in between or not.
If you use a cursor to scan the large table (which is normally better), you'd have to create a WITH HOLD cursor if you want the cursor to work across transactions. Such a cursor is materialized on the database server side and so will use more resources on the database.
The alternative would be to use queries for the large table that fetch only part of the table and chunk the operation that way, as sketched below.
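A rough sketch of that batched approach with psycopg2 and keyset pagination (the DSN, the table and column names, and the batch size are all placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")   # placeholder DSN
last_id = 0
batch_size = 1000

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload FROM big_table"
            " WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size),
        )
        rows = cur.fetchall()
    if not rows:
        break
    for row_id, payload in rows:
        # ... do the per-row calculation and update the related tables ...
        last_id = row_id
    conn.commit()   # each batch is its own short transaction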
That said, there are reasons why one big transaction might be better or worse than many smaller ones:
Reasons speaking for a big transaction:
You can use a normal cursor to scan the big table and don't have to bother with WITH HOLD cursors or the alternative as indicated above.
You'd have transactional guarantees for the whole operation. For example, you can simply restart the operation after an error and rollback.
Reasons speaking for operation in batches:
Shorter transactions reduce the risk of deadlocks.
Shorter transactions allow autovacuum to clean up the effects of previous batches while later batches are being processed. This is a notable advantage if there is a lot of data churn due to the updates, as it will help keep table bloat small.
The best choice depends on the actual situation.
As a pet project I have been writing my own ORM to help me better understand the decisions made by production grade ORMs like Peewee or the more complex sqlalchemy.
In line with my title's question, is it better to spawn one cursor and reuse it for multiple SQL executions, or to spawn a new cursor for each transaction?
I've already guessed that it avoids state issues (transactions with no commit), but is there another reason why it would be better to have one cursor for each operation (insert, update, select, delete, or create)?
Have you profiled and found that the creation of cursors is a significant source of overhead?
Cursors are a DB-API 2.0 artifact, not necessarily an actual "thing" that exists. They are designed to provide a common interface for executing queries and handling results/iteration. How they are implemented under the hood is up to the database driver. If you're aiming to support DB-API 2.0 compatible drivers, I suggest just using the cursor() method to create a cursor for every query execution. I would recommend NEVER having a singleton or shared cursor.
In SQLite, for example, a cursor is essentially a wrapper around a sqlite3_stmt object, as there's no such thing as a "sqlite3_cursor". The stdlib sqlite3 driver maintains an internal cache of sqlite3_stmt objects to avoid the cost of compiling queries that are frequently used.
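A minimal sketch of the "fresh cursor per execution" pattern described above, using the stdlib sqlite3 driver (the database file, table, and column names are placeholders):

import sqlite3
from contextlib import closing

conn = sqlite3.connect("example.db")   # placeholder database file

def fetch_users(min_age):
    # a brand-new cursor lives only for the duration of this query
    with closing(conn.cursor()) as cur:
        cur.execute("SELECT name FROM users WHERE age >= ?", (min_age,))
        return cur.fetchall()

def add_user(name, age):
    # likewise for writes; the connection, not the cursor, owns the transaction
    with conn, closing(conn.cursor()) as cur:
        cur.execute("INSERT INTO users (name, age) VALUES (?, ?)", (name, age))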
Based on the following context:
open connection with the Oracle database
execute one SELECT query
close the connection
Is it useful to use threads to handle the database connection and query execution?
Does the library cx_Oracle already manage enough to trust its execution?
For example, if the database is overloaded (by other connections), will it keep the connection open indefinitely?
Another scenario: the query has WHERE conditions, which limit the number of rows. But if the number of rows increases drastically, the query will take more time to execute. Is it useful to thread the query execution, like in the following post?
I have seen that cx_Oracle can be combined with threads for multiple insertions, or for more complex software. In a simpler case, I wonder...
Thanks a lot.
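For reference, the kind of pattern the question is contemplating looks roughly like this: run the SELECT in a worker thread so the caller can enforce a timeout. This is only a sketch; the credentials, DSN, query, and 60-second timeout are all placeholders:

import concurrent.futures
import cx_Oracle

def run_query():
    # open the connection, run one SELECT, and clean up when the blocks exit
    with cx_Oracle.connect("user", "password", "host/service") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT col FROM some_table WHERE some_col = :1", [42])
            return cur.fetchall()

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(run_query)
    # raises concurrent.futures.TimeoutError if the query takes too long;
    # note the query itself keeps running in the worker thread
    rows = future.result(timeout=60)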
I have been using Python with RDBMSs (MySQL and PostgreSQL), and I have noticed that I really do not understand how to use a cursor.
Usually, one has their script connect to the DB via a client DB-API (like psycopg2 or MySQLdb):
connection = psycopg2.connect(host='otherhost', etc)
And then one creates a cursor:
cursor = connection.cursor()
And then one can issue queries and commands:
cursor.execute("SELECT * FROM etc")
Now where is the result of the query, I wonder? Is it on the server? Or a little on my client and a little on my server? And then, if we need to access some results, we fetch 'em:
row = cursor.fetchone()
or
rows = cursor.fetchmany()
Now let's say I do not retrieve all the rows and decide to execute another query; what will happen to the previous results? Is there an overhead?
Also, should I create a cursor for every form of command and continuously reuse it for those same commands somehow? I heard psycopg2 can somehow optimize commands that are executed many times but with different values; how, and is it worth it?
Thx
ya, i know it's months old :P
DB-API's cursor appears to be closely modeled after SQL cursors. As far as resource (row) management is concerned, DB-API does not specify whether the client must retrieve all the rows or DECLARE an actual SQL cursor. As long as the fetchXXX interfaces do what they're supposed to, DB-API is happy.
As far as psycopg2 cursors are concerned (as you may well know), "unnamed DB-API cursors" will fetch the entire result set, AFAIK buffered in memory by libpq. "Named DB-API cursors" (a psycopg2 concept that may not be portable) will request rows on demand (via the fetchXXX methods).
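A short sketch contrasting the two kinds of psycopg2 cursor (the DSN and table name are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb host=otherhost")   # placeholder DSN

# Unnamed cursor: the whole result set is transferred and buffered client-side.
cur = conn.cursor()
cur.execute("SELECT * FROM big_table")
first = cur.fetchone()          # rows are already in client memory
cur.close()

# Named cursor: a server-side cursor; rows are fetched on demand.
named = conn.cursor(name="my_server_side_cursor")
named.itersize = 2000           # rows fetched per network round trip when iterating
named.execute("SELECT * FROM big_table")
for row in named:
    pass                        # process row by row without loading everything
named.close()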
As cited by "unbeknown", executemany can be used to optimize multiple runs of the same command. However, it doesn't accommodate the need for prepared statements; when repeated executions of a statement with different parameter sets are not directly sequential, executemany() will perform just as well as execute(). DB-API does "provide" driver authors with the ability to cache executed statements, but its implementation (what's the scope/lifetime of the statement?) is undefined, so it's impossible to set expectations across DB-API implementations.
If you are loading lots of data into PostgreSQL, I would strongly recommend trying to find a way to use COPY.
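For example, a bare-bones sketch of a COPY load through psycopg2's copy_expert (the DSN, file, table, and column names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")   # placeholder DSN
with conn, conn.cursor() as cur, open("data.csv") as f:
    # stream the CSV file straight into the table in a single COPY
    cur.copy_expert(
        "COPY mytable (col1, col2) FROM STDIN WITH (FORMAT csv)", f
    )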
Assuming you're using PostgreSQL, the cursors probably are just implemented using the database's native cursor API. You may want to look at the source code for pg8000, a pure Python PostgreSQL DB-API module, to see how it handles cursors. You might also like to look at the PostgreSQL documentation for cursors.
When you look here at the mysqldb documentation you can see that they implemented different strategies for cursors. So the general answer is: it depends.
Edit: Here is the mysqldb API documentation. There is some info on how each cursor type behaves. The standard cursor stores the result set in the client. So I assume there is an overhead if you don't retrieve all result rows, because even the rows you don't fetch have to be transferred to the client (potentially over the network). My guess is that it is not that different from PostgreSQL.
When you want to optimize SQL statements that you call repeatedly with many values, you should look at cursor.executemany(). It prepares a SQL statement so that it doesn't need to be parsed every time you call it:
cur.executemany('INSERT INTO mytable (col1, col2) VALUES (%s, %s)',
                [('val1', 1), ('val2', 2)])