I have been using Python with RDBMSs (MySQL and PostgreSQL), and I have noticed that I really do not understand how to use a cursor.
Usually, one has one's script connect to the DB via a client DB-API library (like psycopg2 or MySQLdb):
connection = psycopg2.connect(host='otherhost', etc)
And then one creates a cursor:
cursor = connection.cursor()
And then one can issue queries and commands:
cursor.execute("SELECT * FROM etc")
Now where is the result of the query, I wonder? Is it on the server? Or a little on my client and a little on my server? And then, if we need to access some results, we fetch 'em:
rows = cursor.fetchone()
or
rows = cursor.fetchmany()
Now let's say I do not retrieve all the rows and decide to execute another query: what will happen to the previous results? Is there an overhead?
Also, should I create a cursor for every form of command and continuously reuse it for those same commands somehow? I heard psycopg2 can somehow optimize commands that are executed many times but with different values; how, and is it worth it?
Thx
Yeah, I know it's months old :P
DB-API's cursor appears to be closely modeled after SQL cursors. As far as resource (row) management is concerned, DB-API does not specify whether the client must retrieve all the rows or DECLARE an actual SQL cursor. As long as the fetchXXX interfaces do what they're supposed to, DB-API is happy.
As far as psycopg2 cursors are concerned (as you may well know), "unnamed DB-API cursors" fetch the entire result set, AFAIK buffered in memory by libpq. "Named DB-API cursors" (a psycopg2 concept that may not be portable) request the rows on demand via the fetchXXX methods.
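To make the difference concrete, here is a minimal sketch of the two cursor flavours in psycopg2 (the connection details and the table name big_table are just placeholders):

import psycopg2

conn = psycopg2.connect(host='otherhost', dbname='mydb')  # placeholder connection details

# Unnamed cursor: the whole result set is buffered client-side by libpq
# as soon as execute() returns.
cur = conn.cursor()
cur.execute("SELECT * FROM big_table")
first_batch = cur.fetchmany(100)   # served from the client-side buffer

# Named cursor: psycopg2 DECLAREs a server-side cursor, and the fetchXXX
# calls (or iteration) pull rows from the server on demand.
named = conn.cursor(name='my_server_side_cursor')
named.itersize = 2000              # rows fetched per round trip while iterating
named.execute("SELECT * FROM big_table")
for row in named:
    pass                           # rows are streamed, never all held in memory at once

named.close()
cur.close()
conn.close()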
As cited by "unbeknown", executemany can be used to optimize multiple runs of the same command. However, it doesn't address the need for prepared statements: when repeated executions of a statement with different parameter sets are not directly sequential, executemany() performs just as well as execute(). DB-API does "provide" driver authors with the ability to cache executed statements, but the implementation (what's the scope/lifetime of the statement?) is undefined, so it's impossible to set expectations across DB-API implementations.
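If you do want real prepared statements in PostgreSQL for repeats that are not back to back, one workaround (not part of DB-API itself) is to prepare them yourself in SQL. A rough sketch, where the statement name insert_item, the table mytable, and the connection details are made up:

import psycopg2

conn = psycopg2.connect(host='otherhost', dbname='mydb')  # placeholder connection details
cur = conn.cursor()

# Prepare once per session; the prepared statement lives until the connection
# is closed or it is DEALLOCATEd.
cur.execute("PREPARE insert_item (text, int) AS "
            "INSERT INTO mytable (col1, col2) VALUES ($1, $2)")

# Execute it whenever needed; other, unrelated queries can run in between.
cur.execute("EXECUTE insert_item (%s, %s)", ('val1', 1))
cur.execute("EXECUTE insert_item (%s, %s)", ('val2', 2))

conn.commit()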
If you are loading lots of data into PostgreSQL, I would strongly recommend trying to find a way to use COPY.
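A sketch of what that can look like through psycopg2's copy_expert (the file name data.csv, the table mytable, and the connection details are placeholders):

import psycopg2

conn = psycopg2.connect(host='otherhost', dbname='mydb')  # placeholder connection details
cur = conn.cursor()

# Stream a CSV file straight into the table; usually far faster than
# row-by-row INSERTs, even batched ones.
with open('data.csv') as f:
    cur.copy_expert("COPY mytable (col1, col2) FROM STDIN WITH CSV", f)

conn.commit()
cur.close()
conn.close()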
Assuming you're using PostgreSQL, the cursors probably are just implemented using the database's native cursor API. You may want to look at the source code for pg8000, a pure Python PostgreSQL DB-API module, to see how it handles cursors. You might also like to look at the PostgreSQL documentation for cursors.
If you look at the MySQLdb documentation, you can see that they implemented different strategies for cursors. So the general answer is: it depends.
Edit: Here is the MySQLdb API documentation. There is some info on how each cursor type behaves. The standard cursor stores the result set in the client. So I assume there is an overhead if you don't retrieve all result rows, because even the rows you don't fetch have to be transferred to the client (potentially over the network). My guess is that it is not that different from PostgreSQL.
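If you want to see the difference yourself, MySQLdb lets you pass a server-side cursor class to cursor(); a rough sketch (the connection details and the table big_table are placeholders):

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host='otherhost', db='mydb')  # placeholder connection details

# Default Cursor: the whole result set is copied to the client up front.
buffered = conn.cursor()
buffered.execute("SELECT * FROM big_table")
some_rows = buffered.fetchmany(100)   # the rest is already sitting in client memory

# SSCursor: rows stay on the server until fetched, but you must read (or close)
# the result before issuing another query on this connection.
streaming = conn.cursor(MySQLdb.cursors.SSCursor)
streaming.execute("SELECT * FROM big_table")
for row in streaming:
    pass

streaming.close()
buffered.close()
conn.close()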
When you want to optimize SQL statements that you call repeatedly with many values, you should look at cursor.executemany(). It prepares a SQL statement so that it doesn't need to be parsed every time you call it:
cur.executemany('INSERT INTO mytable (col1, col2) VALUES (%s, %s)',
                [('val1', 1), ('val2', 2)])
In the case of SQLite, it is not clear whether we can easily commit right after each dataframe insert (assuming that auto-commit is off by default, following the Python database wrapper convention).
Using the simplest sqlalchemy API flow:
db_engine = db.create_engine(...)
for ...:
    # slowly compute some_df, takes a lot of time
    some_df.to_sql(..., con=db_engine)
How can we make sure that every .to_sql is committed?
For motivation, imagine that each write reflects the result of a potentially very long computation, and we do not want to lose a huge batch of such computations, nor any single one of them, if a machine goes down or if the Python sqlalchemy engine object is garbage collected before all its writes have actually drained into the database.
I believe auto-commit is off by default, and for SQLite there is no way of changing that in the create_engine call. What might be the simplest, safest way of adding auto-commit behavior ― or explicitly committing after every dataframe write ― when using the simplistic .to_sql API?
Or must the code be refactored to use a different api flow to accomplish that?
You can set the connection to autocommit by:
db_engine = db_engine.execution_options(autocommit=True)
From https://docs.sqlalchemy.org/en/13/core/connections.html#understanding-autocommit:
The “autocommit” feature is only in effect when no Transaction has otherwise been declared. This means the feature is not generally used with the ORM, as the Session object by default always maintains an ongoing Transaction.
In your code you have not presented any explicit transactions, and so the engine used as the con is in autocommit mode (as implemented by SQLA).
Note that SQLAlchemy implements its own autocommit that is independent from the DB-API driver's possible autocommit / non-transactional features.
Hence "the simplest, safest way for adding auto-commit behavior ― or explicitly committing after every dataframe write" is what you already had, unless to_sql() emits some funky statements that SQLA does not recognize as data-changing operations, which it does not, at least of late.
It might be that the SQLA autocommit feature is on the way out in the next major release, but we'll have to wait and see.
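If you would rather not rely on SQLA's autocommit at all (especially if it does get removed), one explicit alternative is to hand .to_sql() a connection wrapped in its own transaction, so each dataframe is committed before the next computation starts. A sketch, where the URL, the table name results, and the compute()/work_items names are made up:

import sqlalchemy as db

db_engine = db.create_engine('sqlite:///results.db')   # placeholder URL

for work_item in work_items:                            # placeholder loop
    some_df = compute(work_item)                        # the long computation
    # engine.begin() opens a connection and commits when the block exits,
    # so this write is durable before the loop continues.
    with db_engine.begin() as conn:
        some_df.to_sql('results', con=conn, if_exists='append', index=False)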
As a pet project I have been writing my own ORM, to help me better understand the decisions made by production-grade ORMs like Peewee or the more complex SQLAlchemy.
In line with my title's question: is it better to spawn one cursor and reuse it for multiple SQL executions, or to spawn a new cursor for each transaction?
I've already guessed at avoiding state issues (transactions with no commit), but is there another reason why it would be better to have one cursor for each operation (insert, update, select, delete, or create)?
Have you profiled and found that the creation of cursors is a significant source of overhead?
Cursors are a DB-API 2.0 artifact, not necessarily an actual "thing" that exists. They are designed to provide a common interface for executing queries and handling results/iteration. How they are implemented under the hood is up to the database driver. If you're aiming to support DB-API 2.0 compatible drivers, I suggest just using the cursor() method to create a cursor for every query execution. I would recommend NEVER having a singleton or shared cursor.
In SQLite, for example, a cursor is essentially a wrapper around a sqlite3_stmt object, as there's no such thing as a "sqlite3_cursor". The stdlib sqlite3 driver maintains an internal cache of sqlite3_stmt objects to avoid the cost of compiling queries that are frequently used.
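As a concrete illustration with the stdlib sqlite3 driver (the table and helper names are made up), creating a cursor per operation stays cheap precisely because the compiled statements are cached on the connection:

import sqlite3

# cached_statements sets how many compiled sqlite3_stmt objects the connection
# keeps around, so repeated queries skip recompilation even though we create
# a fresh cursor every time.
conn = sqlite3.connect(':memory:', cached_statements=128)
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

def insert_item(name):
    cur = conn.cursor()                 # new cursor per operation
    cur.execute("INSERT INTO items (name) VALUES (?)", (name,))
    cur.close()

for n in ('a', 'b', 'c'):
    insert_item(n)
conn.commit()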
Based on the following context:
open connection with the Oracle database
execute one SELECT query
close the connection
Is it useful to use threads to handle the database connection and query execution?
Does the library cx_Oracle already manage enough to trust its execution?
For example, if the database is overloaded (by other connections), will it keep the connection open indefinitely?
Another scenario: the query has WHERE conditions which limit the number of rows, but if the number of rows increases drastically, the query will take more time to execute. Is it useful to thread the query execution, as in the following post?
I have seen that cx_Oracle can be combined with threads for multiple insertions, or in more complex software. In a simpler case, I wonder...
Thanks a lot.
Is opening/closing db cursor costly operation? What is the best practice, to use a different cursor or to reuse the same cursor between different sql executions? Does it matter if a transaction consists of executions performed on same or different cursors belonging to same connection?
Thanks.
This will depend a lot on your database as well as your chosen Python implementation. Have you tried profiling a few short test operations?
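For example, a quick-and-dirty comparison with the stdlib sqlite3 module (absolute numbers will look very different against a networked MySQL or PostgreSQL server):

import sqlite3
import timeit

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")

def fresh_cursor_each_time():
    cur = conn.cursor()
    cur.execute("SELECT x FROM t")
    cur.fetchall()
    cur.close()

shared_cur = conn.cursor()

def reuse_one_cursor():
    shared_cur.execute("SELECT x FROM t")
    shared_cur.fetchall()

print("new cursor per query:", timeit.timeit(fresh_cursor_each_time, number=10000))
print("one shared cursor:   ", timeit.timeit(reuse_one_cursor, number=10000))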
I'm using SQLAlchemy as the ORM within an application I've been building for some time.
So far, it's been quite a painless ORM to implement and use; however, a recent feature I'm working on requires a persistent & distributed queue (list & worker) style implementation, which I've built in MySQL and Python.
It's all worked quite well until I tested it in a scaled environment.
I've used InnoDB row-level locking to ensure each row is only read once; while the row is locked, I update an 'in_use' value to make sure that others don't grab the entry.
Since MySQL doesn't offer a NOWAIT option like Postgres or Oracle do, I've run into locking issues where worker threads hang and wait for the locked row to become available.
In an attempt to overcome this limitation, I've tried to put all the required processing into a single statement and run it through the ORM's execute() method; however, SQLAlchemy is refusing to return the query result.
Here's an example.
My SQL statement is:
SELECT id INTO @update_id FROM myTable WHERE in_use=0 ORDER BY id LIMIT 1 FOR UPDATE;
UPDATE myTable SET in_use=1 WHERE id=@update_id;
SELECT * FROM myTable WHERE id=@update_id;
And I run this code in the console:
engine = create_engine('mysql://<user details>@<server details>/myDatabase', pool_recycle=90, echo=True)
result = engine.execute(sqlStatement)
result.fetchall()
Only to get this result
[]
I'm certain the statement is running since I can see the update take effect in the database, and if I execute through the mysql terminal or other tools, I get the modified row returned.
It just seems to be SQLAlchemy that doesn't want to acknowledge the returned row.
Is there anything specific that needs to be done to ensure that the ORM picks up the response?
Cheers
You have executed 3 queries and MySQLdb creates a result set for each. You have to fetch the first result, then call cursor.nextset(), fetch the second, and so on.
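A sketch of what that looks like at the DB-API level, assuming you talk to MySQLdb directly and enable multi-statement support on the connection (connection details are placeholders):

import MySQLdb
from MySQLdb.constants import CLIENT

conn = MySQLdb.connect(host='<server details>', db='myDatabase',
                       client_flag=CLIENT.MULTI_STATEMENTS)  # placeholder details
cur = conn.cursor()
cur.execute("""
    SELECT id INTO @update_id FROM myTable WHERE in_use=0 ORDER BY id LIMIT 1 FOR UPDATE;
    UPDATE myTable SET in_use=1 WHERE id=@update_id;
    SELECT * FROM myTable WHERE id=@update_id;
""")

# Each statement produces its own result; walk them with nextset() and keep the
# rows of the last SELECT (statements without a result set have description=None).
rows = None
while True:
    if cur.description is not None:
        rows = cur.fetchall()
    if cur.nextset() is None:
        break

conn.commit()   # releases the FOR UPDATE row lock
cur.close()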
This answers your question, but it won't be useful for you, because it won't solve the locking issue. You have to understand how FOR UPDATE works first: it locks the returned rows until the end of the transaction. To avoid a long lock wait you have to keep that transaction as short as possible: SELECT ... FOR UPDATE, UPDATE ... SET in_use=1 ..., COMMIT. You don't actually need to put them into a single SQL statement; 3 execute() calls will be fine too. But you have to commit before the long computation, otherwise the lock will be held too long and updating in_use (an offline lock) is meaningless. And sure, you can do the same thing using the ORM too.
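A sketch of that shape, holding the row lock only while claiming the entry (connection details are placeholders):

import MySQLdb

conn = MySQLdb.connect(host='<server details>', db='myDatabase')  # placeholder details
cur = conn.cursor()

# Claim one free row; the InnoDB row lock is held only until the commit below.
cur.execute("SELECT id FROM myTable WHERE in_use=0 ORDER BY id LIMIT 1 FOR UPDATE")
row = cur.fetchone()
if row is not None:
    task_id = row[0]
    cur.execute("UPDATE myTable SET in_use=1 WHERE id=%s", (task_id,))
conn.commit()           # releases the row lock immediately

# ... do the long processing here, with no row lock held ...

if row is not None:
    # mark the entry as done (or delete it) in a second, short transaction
    cur.execute("UPDATE myTable SET in_use=0 WHERE id=%s", (task_id,))
    conn.commit()
cur.close()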