How can I get the id of the created record in SQLAlchemy?
I'm doing:
engine.execute("insert into users values (1,'john')")
When you execute a plain text statement, you're at the mercy of the DBAPI you're using as to whether or not the new PK value is available and via what means. With the SQLite and MySQL DBAPIs you'll have it as result.lastrowid, which just gives you the value of .lastrowid for the cursor. With PG, Oracle, etc., there's no ".lastrowid" - as someone else said you can use "RETURNING" for those, in which case results are available via result.fetchone() (although using RETURNING with Oracle, again not taking advantage of SQLAlchemy expression constructs, requires several awkward steps), or if RETURNING isn't available you can use direct sequence access (NEXTVAL in PG), or a "post fetch" operation (CURRVAL in PG, @@identity or scope_identity() in MSSQL).
Sounds complicated, right? That's why you're better off using table.insert(). SQLAlchemy's primary system of providing newly generated PKs is designed to work with these constructs. Once you're there, the result.last_inserted_ids() method gives you the newly generated (possibly composite) PK in all cases, regardless of backend. The above methods of .lastrowid, sequence execution, RETURNING, etc. are all dealt with for you (0.6 uses RETURNING when available).
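Put together, a minimal sketch of that approach (the table definition and engine URL here are only illustrative; on recent SQLAlchemy versions the method is spelled result.inserted_primary_key rather than last_inserted_ids()):

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

# Hypothetical table definition for illustration; adjust to your real schema.
metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(50)),
)

engine = create_engine("sqlite://")  # placeholder engine for the sketch
metadata.create_all(engine)

with engine.begin() as conn:
    result = conn.execute(users.insert().values(name="john"))
    # SQLAlchemy resolves .lastrowid / sequences / RETURNING behind the scenes.
    print(result.inserted_primary_key)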
There's an extra clause you can add: RETURNING
i.e.
INSERT INTO users (name, address) VALUES ('richo', 'beaconsfield') RETURNING id
Then just retrieve a row like your insert was a SELECT statement.
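If you are issuing the textual INSERT yourself, a hedged sketch of fetching the RETURNING value (assuming a PostgreSQL users table with name and address columns, as in the example above):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder URL

with engine.begin() as conn:
    result = conn.execute(
        text("INSERT INTO users (name, address) "
             "VALUES (:name, :address) RETURNING id"),
        {"name": "richo", "address": "beaconsfield"},
    )
    new_id = result.scalar()  # read the RETURNING row as if it were a SELECT
    print(new_id)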
I'm wondering what the best strategy is for using insert-only permissions to a postgres db with Peewee. I'd like this in order to be certain that a specific user can't read any data back out of the database.
I granted INSERT permissions to my table, 'test', in postgres. But I've run into the problem that when I try to save new rows with something like:
thing = Test(value=1)
thing.save()
The sql actually contains a RETURNING clause that needs more permissions (namely, SELECT) than just insert:
INSERT INTO "test" ("value") VALUES (1) RETURNING "test"."id"
Seems like the same SQL is generated when I try to use query = Test.insert(value=1); query.execute() as well.
From looking around, it seems like you need to either grant SELECT privileges or use a more exotic feature like "row level security" in Postgres. Is there any way to go about this with Peewee out of the box? Or another suggestion of how to add new rows with truly write-only permissions?
You can omit the returning clause by explicitly writing your INSERT query and supplying a blank RETURNING. Peewee uses RETURNING whenever possible so that the auto-generated PK can be recovered in a single operation, but it is possible to disable it:
# Empty call to returning will disable the RETURNING clause:
iq = Test.insert(value=1).returning()
iq.execute()
You can also override this for all INSERT operations by setting the returning_clause attribute on the DB to False:
db = PostgresqlDatabase(...)
db.returning_clause = False
This is not an officially supported approach, though, and may have unintended side-effects or weird behavior - caveat emptor.
Is there a performance advantage (or disadvantage) when using default instead of server_default for mapping table column default values when using SQLAlchemy with PostgreSQL?
My understanding is that default renders the expression in the INSERT (usually) and that server_default places the expression in the CREATE TABLE statement. Seems like server_default is analogous to typical handling of defaults directly in the db such as:
CREATE TABLE example (
id serial PRIMARY KEY,
updated timestamptz DEFAULT now()
);
...but it is not clear to me if it is more efficient to handle defaults on INSERT or via table creation.
Would there be any performance improvement or degradation for row inserts if each of the default parameters in the example below were changed to server_default?
from uuid import uuid4
from sqlalchemy import Column, Boolean, DateTime, Integer
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func
Base = declarative_base()
class Item(Base):
    __tablename__ = 'item'
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    count = Column(Integer, nullable=False, default=0)
    flag = Column(Boolean, nullable=False, default=False)
    updated = Column(DateTime(timezone=True), nullable=False, default=func.now())
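For concreteness, a hedged sketch of what the server_default variant might look like (the class and table names are changed only to avoid clashing with the model above, and uuid_generate_v4() assumes the uuid-ossp extension is installed; gen_random_uuid() is an alternative on newer PostgreSQL):

from sqlalchemy import Column, Boolean, DateTime, Integer, text
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func

Base = declarative_base()

class ItemServerDefaults(Base):
    __tablename__ = 'item_server_defaults'

    # server_default expressions are rendered into the CREATE TABLE DDL
    id = Column(UUID(as_uuid=True), primary_key=True,
                server_default=text("uuid_generate_v4()"))
    count = Column(Integer, nullable=False, server_default=text("0"))
    flag = Column(Boolean, nullable=False, server_default=text("false"))
    updated = Column(DateTime(timezone=True), nullable=False,
                     server_default=func.now())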
NOTE: The best explanation I found so far for when to use default instead of server_default does not address performance (see Mike Bayer's SO answer on the subject). My oversimplified summary of that explanation is that default is preferred over server_default when...
The db can't handle the expression you need or want to use for the default value.
You can't or don't want to modify the schema directly.
...so the question remains as to whether performance should be considered when choosing between default and server_default?
It is impossible to give you a 'this is faster' answer, because performance per default value expression can vary widely, both on the server and in Python. A function to retrieve the current time behaves differently from a scalar default value.
Next, you must realise that defaults can be provided in five different ways:
Client-side scalar defaults. A fixed value, such as 0 or True. The value is used in an INSERT statement.
Client-side Python function. Called each time a default is needed, produces the value to insert, used the same way as a scalar default from there on out. These can be context sensitive (have access to the current execution context with values to be inserted).
Client-side SQL expression; this generates an extra piece of SQL expression that is then used in the query and executed on the server to produce a value.
Server-side DDL expressions are SQL expressions stored in the table definition, so they are part of the schema. The server uses these to fill in a value for any column omitted from an INSERT statement, or when a column value is set to DEFAULT in an INSERT or UPDATE statement.
Server-side implicit defaults or triggers, where other DDL such as triggers or specific database features provide a default value for columns.
Note that when it comes to a SQL expression determining the default value, be that a client-side SQL expression, a server-side DDL expression, or a trigger, it makes very little difference to the database where the default value expression comes from. The query executor needs to know how to produce values for a given column; once that's parsed out of the DML statement or the schema definition, the server still has to execute the expression for each row.
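As a rough illustration, here is how each of these flavours might be spelled in a SQLAlchemy Column definition (a sketch; FetchedValue simply tells SQLAlchemy that a trigger or other implicit server-side mechanism supplies the value):

from datetime import datetime, timezone
from sqlalchemy import Column, DateTime, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.schema import FetchedValue
from sqlalchemy.sql import func

Base = declarative_base()

class Example(Base):
    __tablename__ = 'example'
    id = Column(Integer, primary_key=True)

    # 1. client-side scalar default
    count = Column(Integer, default=0)
    # 2. client-side Python function, called once per INSERT
    created = Column(DateTime(timezone=True),
                     default=lambda: datetime.now(timezone.utc))
    # 3. client-side SQL expression, rendered into the INSERT statement
    updated = Column(DateTime(timezone=True), default=func.now())
    # 4. server-side DDL expression, rendered into CREATE TABLE
    touched = Column(DateTime(timezone=True), server_default=func.now())
    # 5. server-side implicit default or trigger provides the value
    audited = Column(DateTime(timezone=True), server_default=FetchedValue())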
Choosing between these options is rarely going to be based on performance alone; performance should be at most one of several aspects you consider. There are many factors involved here:
default with a scalar or Python function directly produces a Python default value, then sends the new value to the server when inserting. Python code can access the default value before the data is inserted into the database.
A client-side SQL expression, a server_default value, and server-side implicit defaults and triggers all have the server generate the default, which then must be fetched by the client if you want to be able to access it in the same SQLAlchemy session. You can't access the value until the object has been inserted into the database.
Depending on the exact query and database support, SQLAlchemy may have to make extra SQL queries to either generate a default before the INSERT statement or run a separate SELECT afterwards to fetch the defaults that have been inserted. You can control when this happens (directly when inserting, or on first access after flushing) with the eager_defaults mapper configuration; see the sketch after this list.
If you have multiple clients on different platforms accessing the same database, a server_default or other default attached to the schema (such as a trigger) ensures that all clients will use the same defaults regardless of platform, while defaults implemented in Python can't be accessed by other platforms.
When using PostgreSQL, SQLAlchemy can make use of the RETURNING clause for DML statements, which gives a client access to server-side generated defaults in a single step.
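For the last two points, a hedged sketch of the eager_defaults option mentioned above: on PostgreSQL, SQLAlchemy can fold the fetch of server-generated values into the INSERT via RETURNING, and on other backends it may issue a follow-up SELECT at flush time.

from sqlalchemy import Column, DateTime, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func

Base = declarative_base()

class Item(Base):
    __tablename__ = 'item'
    # Fetch server-generated defaults as part of the flush instead of
    # expiring them and loading them lazily on first attribute access.
    __mapper_args__ = {'eager_defaults': True}

    id = Column(Integer, primary_key=True)
    updated = Column(DateTime(timezone=True), nullable=False,
                     server_default=func.now())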
So when using a server_default column default that calculates a new value for each row (not a scalar value), you save a small amount of Python-side time, and save a small amount of network bandwidth as you are not sending data for that column over to the database. The database could be faster creating that same value, or it could be slower; it largely depends on the type of operation. If you need to have access to the generated default value from Python, in the same transaction, you do then have to wait for a return stream of data, parsed out by SQLAlchemy. All these details can become insignificant compared to everything else that happens around inserting or updating rows, however.
Do understand that an ORM is not suitable for high-performance bulk row inserts or updates; quoting from the SQLAlchemy Performance FAQ entry:
The SQLAlchemy ORM uses the unit of work pattern when synchronizing changes to the database. This pattern goes far beyond simple “inserts” of data. It includes that attributes which are assigned on objects are received using an attribute instrumentation system which tracks changes on objects as they are made, includes that all rows inserted are tracked in an identity map which has the effect that for each row SQLAlchemy must retrieve its “last inserted id” if not already given, and also involves that rows to be inserted are scanned and sorted for dependencies as needed. Objects are also subject to a fair degree of bookkeeping in order to keep all of this running, which for a very large number of rows at once can create an inordinate amount of time spent with large data structures, hence it’s best to chunk these.
Basically, unit of work is a large degree of automation in order to automate the task of persisting a complex object graph into a relational database with no explicit persistence code, and this automation has a price.
ORMs are basically not intended for high-performance bulk inserts - this is the whole reason SQLAlchemy offers the Core in addition to the ORM as a first-class component.
Because an ORM like SQLAlchemy comes with a hefty overhead price, any performance difference between a server-side and a Python-side default quickly disappears in the noise of ORM operations.
So if you are concerned about performance for large-quantity insert or update operations, you would want to use bulk operations for those, and enable the psycopg2 batch execution helpers to really get a speed boost. When using these bulk operations, I'd expect server-side defaults to improve performance just by saving bandwidth moving row data from Python to the server, but how much depends on the exact nature of the default values.
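A hedged sketch of what that could look like with a Core bulk insert and the psycopg2 executemany helpers enabled on the engine (the executemany_mode flag exists on the psycopg2 dialect from SQLAlchemy 1.3 onwards and its accepted values vary by version; the table here is only illustrative):

from sqlalchemy import Column, DateTime, Integer, MetaData, Table, create_engine
from sqlalchemy.sql import func

# executemany_mode turns on psycopg2's batching helpers for multi-row statements.
engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/mydb",  # placeholder URL
    executemany_mode="batch",
)

metadata = MetaData()
item = Table(
    "item", metadata,
    Column("id", Integer, primary_key=True),
    Column("count", Integer, nullable=False),
    Column("updated", DateTime(timezone=True), nullable=False,
           server_default=func.now()),
)
metadata.create_all(engine)

rows = [{"count": n} for n in range(10000)]

with engine.begin() as conn:
    # "updated" is omitted from the row dicts; the server fills it in
    # from its DEFAULT, so no default data is sent over the wire.
    conn.execute(item.insert(), rows)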
If ORM insert and update performance outside of bulk operations is a big issue for you, you need to test your specific options. I'd start with the SQLAlchemy examples.performance package and add your own test suite using two models that differ only in a single server_default and default configuration.
There's something else important to consider beyond just comparing the performance of the two.
If you need to add a new column created_at (NOT NULL) to an existing table User that already has rows in it, default will not work.
With default, the migration will fail with an error saying it cannot insert NULL values into the existing rows. This can cause significant trouble if you want to keep your data, even just for testing.
With server_default, the database will fill in the current DateTime value for all pre-existing rows during the migration.
So in this case, only server_default will work.
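For instance, an Alembic-style migration that adds a NOT NULL created_at column only works in one step when the default lives on the server (a sketch; the table and column names are illustrative):

import sqlalchemy as sa
from alembic import op

def upgrade():
    # server_default lets PostgreSQL backfill the existing rows,
    # so the NOT NULL constraint can be satisfied immediately.
    op.add_column(
        'user',
        sa.Column('created_at', sa.DateTime(timezone=True),
                  nullable=False, server_default=sa.func.now()),
    )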
I'm migrating a Django site from MySQL to PostgreSQL. The quantity of data isn't huge, so I've taken a very simple approach: I've just used the built-in Django serialize and deserialize routines to create JSON records, and then load them in the new instance, loop over the objects, and save each one to the new database.
This works very nicely, with one hiccup: after loading all the records, I run into an IntegrityError when I try to add new data after loading the old records. The Postgres equivalent of a MySQL autoincrement ID field is a serial field, but the internal counter for serial fields isn't incremented when id values are specified explicitly. As a result, Postgres tries to start numbering records at 1 -- already used -- causing a constraint violation. (This is a known issue in Django, marked wontfix.)
There are quite a few questions and answers related to this, but none of the answers seem to address the issue directly in the context of Django. This answer gives an example of the query you'd need to run to update the counter, but I try to avoid making explicit queries when possible. I could simply delete the ID field before saving and let Postgres do the numbering itself, but there are ForeignKey references that will be broken in that case. And everything else works beautifully!
It would be nice if Django provided a routine for doing this that intelligently handles any edge cases. (This wouldn't fix the bug, but it would allow developers to work around it in a consistent and correct way.) Do we really have to just use a raw query to fix this? It seems so barbaric.
If there's really no such routine, I will simply do something like the below, which directly runs the query suggested in the answer linked above. But in that case, I'd be interested to hear about any potential issues with this approach, or any other information about what I might be doing wrong. For example, should I just modify the records to use UUIDs instead, as this suggests?
Here's the raw approach (edited to reflect a simplified version of what I actually wound up doing). It's pretty close to Pere Picornell's answer, but his looks more robust to me.
from django.db import connection

table = model._meta.db_table
cur = connection.cursor()
cur.execute(
    "SELECT setval('{}_id_seq', (SELECT max(id) FROM {}))".format(table, table)
)
About the debate: my case is a one-time migration, and my decision was to run this function right after I finish each table's migration, although you could call it anytime you suspect integrity could be broken.
from django.db import connections

def synchronize_last_sequence(model):
    # PostgreSQL auto-increments (called sequences) don't update their last value
    # if you manually specify an ID. This sets the sequence to the last used id.
    sequence_name = model._meta.db_table + "_" + model._meta.pk.name + "_seq"
    with connections['default'].cursor() as cursor:
        cursor.execute(
            "SELECT setval('" + sequence_name + "', (SELECT max(" + model._meta.pk.name + ") FROM " +
            model._meta.db_table + "))"
        )
    print("Last auto-incremental number for sequence " + sequence_name + " synchronized.")
Which I did using the SQL query you proposed in your question.
It's been very useful to find your post. Thank you!
It should work with custom PKs but not with multi-field PKs.
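For a one-time migration it can be convenient to run this over every installed model; a hedged usage sketch using Django's app registry:

from django.apps import apps

for model in apps.get_models():
    # Only meaningful for models with an auto-incrementing integer primary key.
    if model._meta.pk.get_internal_type() in ('AutoField', 'BigAutoField'):
        synchronize_last_sequence(model)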
One option is to use natural keys during serialization and deserialization. That way when you insert it into PostgreSQL, it will auto-increment the primary key field and keep everything inline.
The downside to this approach is that you need to have a set of unique fields for each model that don't include the id.
I have a query using Flask-SQLAlchemy in which I want to delete all the stocks from the database whose ticker matches one in a list. This is the current query I have:
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete()
Where new_tickers is a list of str of valid tickers.
The error I am getting is the following:
sqlalchemy.exc.InvalidRequestError: Could not evaluate current criteria in Python: "Cannot evaluate clauselist with operator <function comma_op at 0x1104e4730>". Specify 'fetch' or False for the synchronize_session parameter.
You need to use one of the following options for bulk delete:
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session=False)
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session='evaluate')
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session='fetch')
Basically, SQLAlchemy maintains the session in Python as you issue various SQLAlchemy methods. When you delete entries, how will SQLAlchemy remove the deleted rows from the session? This is controlled by a parameter to the delete method, synchronize_session. synchronize_session has three possible values:
'evaluate': evaluates the query's criteria directly in Python to determine which objects need to be removed from the session. This is the default and is very efficient, but it is not very robust, and complicated queries cannot be evaluated. If it can't evaluate the query, it raises sqlalchemy.orm.evaluator.UnevaluatableError.
'fetch': performs a SELECT before the delete and uses the result to determine which objects in the session need to be removed. This is less efficient (potentially much less efficient), but it can handle any valid query.
False: doesn't attempt to update the session at all, so it's very efficient; however, if you continue to use the session after the delete you may get inaccurate results.
Which option you use depends heavily on how your code uses the session. In most simple cases where you just need to delete rows based on a complicated query, False should work fine. (The example in the question fits this scenario.)
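If you do keep using the session after a delete with synchronize_session=False, one hedged workaround (assuming the Flask-SQLAlchemy instance is named db, as is conventional) is to expire the session so stale Stock objects are reloaded on next access:

Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session=False)
# Stock objects already loaded in the session may now be stale;
# expiring them forces a fresh load from the database on next access.
db.session.expire_all()
db.session.commit()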
SQLAlchemy Delete Method Reference
Try it with this code:
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session=False)
https://docs.sqlalchemy.org/en/latest/orm/query.html?highlight=delete#sqlalchemy.orm.query.Query.delete
What's the potential pitfall of always using 'implicit_returning': False in SQLAlchemy?
I've encountered problems a number of times when working on MSSQL tables that have triggers defined, and since the DB is in replication, all of the tables have triggers.
I'm not sure now what the problem exactly is. It has something to do with auto-increment fields - maybe because I'm prefetching the auto-incremented value so I can insert it in another table.
If I don't set 'implicit_returning': False for the table, when I try to insert values, I get this error:
The target table of the DML statement cannot have any enabled triggers
if the statement contains an OUTPUT clause without INTO clause.
So what if I put __table_args__ = {'implicit_returning': False} into all mapped classes just to be safe?
Particularly frustrating for me is that the local DB I use for development and testing is not in replication and doesn't need that option, but the production DB is replicated, so when I deploy changes they sometimes don't work. :)
As you probably already know, the cause of your predicament is described in SQLAlchemy Docs as the following:
SQLAlchemy by default uses OUTPUT INSERTED to get at newly generated primary key values via IDENTITY columns or other server side defaults. MS-SQL does not allow the usage of OUTPUT INSERTED on tables that have triggers. To disable the usage of OUTPUT INSERTED on a per-table basis, specify implicit_returning=False for each Table which has triggers.
If you set your SQLAlchemy engine to echo the SQL, you will see that by default, it does this:
INSERT INTO [table] (id, ...) OUTPUT inserted.[id] VALUES (...)
But if you disable implicit_returning, it does this instead:
INSERT INTO [table] (id, ...) VALUES (...); select scope_identity()
So the question, "Is there any harm in disabling implicit_returning for all tables just in case?" is really, "Is there any disadvantage to using SCOPE_IDENTITY() instead of OUTPUT INSERTED?"
I'm no expert, but I get the impression that although OUTPUT INSERTED is the preferred method these days, SCOPE_IDENTITY() is usually fine too. In the past, SQL Server 2008 (and maybe earlier versions too?) had a bug where SCOPE_IDENTITY sometimes didn't return the correct value, but I hear that has now been fixed (see this question for more detail). (On the other hand, other techniques like @@IDENTITY and IDENT_CURRENT() are still dangerous since they can return the wrong value in corner cases. See this answer and the others on that same page for more detail.)
The big advantage that OUTPUT INSERTED still has is that it can work for cases where you are inserting multiple rows via a single INSERT statement. Is that something you are doing with SQLAlchemy? Probably not, right? So it doesn't matter.
Note that if you are going to have to disable implicit_returning for many tables, you could avoid a bit of boilerplate by making a mixin for it (and whichever other columns and properties you want all of the tables to inherit):
class AutoincTriggerMixin():
    __table_args__ = {
        'implicit_returning': False
    }

    id = Column(Integer, primary_key=True, autoincrement=True)


class SomeModel(AutoincTriggerMixin, Base):
    some_column = Column(String(1000))
    ...
See this page in the SQLAlchemy documentation for more detail. As an added bonus, it makes it more obvious which tables involve triggers.