SQLAlchemy "default" vs "server_default" performance - python

Is there a performance advantage (or disadvantage) when using default instead of server_default for mapping table column default values when using SQLAlchemy with PostgreSQL?
My understanding is that default renders the expression in the INSERT (usually) and that server_default places the expression in the CREATE TABLE statement. Seems like server_default is analogous to typical handling of defaults directly in the db such as:
CREATE TABLE example (
id serial PRIMARY KEY,
updated timestamptz DEFAULT now()
);
...but it is not clear to me if it is more efficient to handle defaults on INSERT or via table creation.
Would there be any performance improvement or degradation for row inserts if each of the default parameters in the example below were changed to server_default?
from uuid import uuid4
from sqlalchemy import Column, Boolean, DateTime, Integer
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func
Base = declarative_base()
class Item(Base):
    __tablename__ = 'item'

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    count = Column(Integer, nullable=False, default=0)
    flag = Column(Boolean, nullable=False, default=False)
    updated = Column(DateTime(timezone=True), nullable=False, default=func.now())
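For reference, here is a sketch of what the same model might look like with every default moved to server_default (continuing the imports above; ItemServerSide and its table name are made up for illustration, and gen_random_uuid() assumes PostgreSQL 13+ or the pgcrypto extension):

from sqlalchemy import text

class ItemServerSide(Base):
    __tablename__ = 'item_server_side'

    # gen_random_uuid() is built in on PostgreSQL 13+; older versions need pgcrypto
    id = Column(UUID(as_uuid=True), primary_key=True,
                server_default=text("gen_random_uuid()"))
    count = Column(Integer, nullable=False, server_default=text("0"))
    flag = Column(Boolean, nullable=False, server_default=text("false"))
    updated = Column(DateTime(timezone=True), nullable=False, server_default=func.now())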
NOTE: The best explanation I found so far for when to use default instead of server_default does not address performance (see Mike Bayer's SO answer on the subject). My oversimplified summary of that explanation is that default is preferred over server_default when...
The db can't handle the expression you need or want to use for the default value.
You can't or don't want to modify the schema directly.
...so the question remains as to whether performance should be considered when choosing between default and server_default?

It is impossible to give you a 'this is faster' answer, because performance per default value expression can vary widely, both on the server and in Python. A function to retrieve the current time behaves differently from a scalar default value.
Next, you must realise that defaults can be provided in five different ways (a short sketch of the first four follows the list):
Client-side scalar defaults. A fixed value, such as 0 or True. The value is used in an INSERT statement.
Client-side Python function. Called each time a default is needed, produces the value to insert, used the same way as a scalar default from there on out. These can be context sensitive (have access to the current execution context with values to be inserted).
Client-side SQL expression; this generates an extra piece of SQL expression that is then used in the query and executed on the server to produce a value.
Server-side DDL expressions are SQL expressions that are stored in the table definition, so they are part of the schema. The server uses these to fill in a value for any column omitted from an INSERT statement, or when a column value is set to DEFAULT in an INSERT or UPDATE statement.
Server-side implicit defaults or triggers, where other DDL such as triggers or specific database features provide a default value for columns.
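A minimal sketch of the first four styles side by side (model and column names are invented for illustration; triggers live entirely in the database and are not shown):

from uuid import uuid4

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql import func

Base = declarative_base()

class DefaultStyles(Base):
    __tablename__ = 'default_styles_example'

    id = Column(Integer, primary_key=True)
    # 1. Client-side scalar default, rendered into the INSERT parameters
    count = Column(Integer, default=0)
    # 2. Client-side Python function, called once per row being inserted
    token = Column(String, default=lambda: uuid4().hex)
    # 3. Client-side SQL expression, embedded in the INSERT statement
    created = Column(DateTime(timezone=True), default=func.now())
    # 4. Server-side DDL expression, rendered into CREATE TABLE ... DEFAULT now()
    updated = Column(DateTime(timezone=True), server_default=func.now())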
Note that when it comes to a SQL expression determining the default value, be that a client-side SQL expression, a server-side DDL expression, or a trigger, it makes very little difference to the database where the default value expression is coming from. The query executor needs to know how to produce a value for a given column; once that is parsed out of the DML statement or the schema definition, the server still has to execute the expression for each row.
Choosing between these options is rarely going to be based on performance alone; performance should at most be one of several aspects you consider. There are many factors involved here:
default with a scalar or Python function directly produces a Python default value, then sends the new value to the server when inserting. Python code can access the default value before the data is inserted into the database.
A client-side SQL expression, a server_default value, and server-side implicit defaults and triggers all have the server generate the default, which then must be fetched by the client if you want to be able to access it in the same SQLAlchemy session. You can't access the value until the object has been inserted into the database.
Depending on the exact query and database support, SQLAlchemy may have to issue extra SQL queries to either generate a default before the INSERT statement or run a separate SELECT afterwards to fetch the defaults that have been inserted. You can control when this happens (directly when inserting, or on first access after flushing) with the eager_defaults mapper configuration; see the sketch after this list.
If you have multiple clients on different platforms accessing the same database, a server_default or other default attached to the schema (such as a trigger) ensures that all clients use the same defaults, while defaults implemented in Python can't be used by clients on other platforms.
When using PostgreSQL, SQLAlchemy can make use of the RETURNING clause for DML statements, which gives a client access to server-side generated defaults in a single step.
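As a rough sketch of the fetch-back behaviour (assumes SQLAlchemy 1.4+; the model, table name and connection URL are placeholders):

from sqlalchemy import Column, DateTime, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.sql import func

Base = declarative_base()
engine = create_engine("postgresql+psycopg2://user:pass@localhost/example")  # placeholder URL

class Order(Base):
    __tablename__ = 'order_example'
    # Ask the ORM to fetch server-generated defaults eagerly; on PostgreSQL this
    # happens via the RETURNING clause of the INSERT itself.
    __mapper_args__ = {'eager_defaults': True}

    id = Column(Integer, primary_key=True)
    created = Column(DateTime(timezone=True), server_default=func.now())

Base.metadata.create_all(engine)

with Session(engine) as session:
    order = Order()
    session.add(order)
    session.flush()
    # The value came back with the INSERT; without eager_defaults, touching the
    # attribute here would trigger a separate SELECT after the flush.
    print(order.created)
    session.commit()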
So when using a server_default column default that calculates a new value for each row (not a scalar value), you save a small amount of Python-side time, and save a small amount of network bandwidth as you are not sending data for that column over to the database. The database could be faster creating that same value, or it could be slower; it largely depends on the type of operation. If you need to have access to the generated default value from Python, in the same transaction, you do then have to wait for a return stream of data, parsed out by SQLAlchemy. All these details can become insignificant compared to everything else that happens around inserting or updating rows, however.
Do understand that an ORM is not suitable for high-performance bulk row inserts or updates; quoting from the SQLAlchemy Performance FAQ entry:
The SQLAlchemy ORM uses the unit of work pattern when synchronizing changes to the database. This pattern goes far beyond simple “inserts” of data. It includes that attributes which are assigned on objects are received using an attribute instrumentation system which tracks changes on objects as they are made, includes that all rows inserted are tracked in an identity map which has the effect that for each row SQLAlchemy must retrieve its “last inserted id” if not already given, and also involves that rows to be inserted are scanned and sorted for dependencies as needed. Objects are also subject to a fair degree of bookkeeping in order to keep all of this running, which for a very large number of rows at once can create an inordinate amount of time spent with large data structures, hence it’s best to chunk these.
Basically, unit of work is a large degree of automation in order to automate the task of persisting a complex object graph into a relational database with no explicit persistence code, and this automation has a price.
ORMs are basically not intended for high-performance bulk inserts - this is the whole reason SQLAlchemy offers the Core in addition to the ORM as a first-class component.
Because an ORM like SQLAlchemy comes with a hefty overhead price, any performance difference between a server-side and a Python-side default quickly disappears in the noise of ORM operations.
So if you are concerned about performance for large-quantity insert or update operations, you would want to use bulk operations for those, and enable the psycopg2 batch execution helpers to really get a speed boost. When using these bulk operations, I'd expect server-side defaults to improve performance just by saving bandwidth moving row data from Python to the server, but how much depends on the exact nature of the default values.
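A sketch of what that might look like (the executemany_mode flag belongs to the psycopg2 dialect and its accepted values differ between SQLAlchemy releases, e.g. "batch" in 1.3 vs "values_plus_batch" in 1.4, so check your version's docs; table and URL are placeholders):

from sqlalchemy import Column, Integer, MetaData, Table, create_engine

metadata = MetaData()
items = Table(
    "item_bulk_demo", metadata,
    Column("id", Integer, primary_key=True),
    Column("count", Integer),
)

engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/example",
    executemany_mode="values_plus_batch",
)
metadata.create_all(engine)

# One Core executemany() call; no per-object ORM bookkeeping at all.
with engine.begin() as conn:
    conn.execute(items.insert(), [{"count": n} for n in range(1000)])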
If ORM insert and update performance outside of bulk operations is a big issue for you, you need to test your specific options. I'd start with the SQLAlchemy examples.performance package and add your own test suite using two models that differ only in a single server_default and default configuration.
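A starting point for such a test could be two models that are identical except for how the default is declared (reusing the imports and Base from the question; the timing harness is left to examples.performance):

class WithClientDefault(Base):
    __tablename__ = 'perf_client_default'

    id = Column(Integer, primary_key=True)
    updated = Column(DateTime(timezone=True), nullable=False, default=func.now())

class WithServerDefault(Base):
    __tablename__ = 'perf_server_default'

    id = Column(Integer, primary_key=True)
    updated = Column(DateTime(timezone=True), nullable=False, server_default=func.now())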

There's something else important to consider beyond just comparing the performance of the two.
If you need to add a new NOT NULL column create_at to an existing table User that already has data in it, default will not work.
If you use default, the database upgrade will fail with an error saying a NULL value cannot be inserted for the existing rows in the table. This causes significant trouble if you want to keep your data, even if it is just test data.
With server_default, on the other hand, the database fills the current datetime value in for all previously existing rows during the upgrade.
So in this case, only server_default will work.
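As a hedged sketch of that migration with Alembic (table and column names follow the answer's example; assumes Alembic is used for schema upgrades):

import sqlalchemy as sa
from alembic import op

def upgrade():
    # server_default lets the database backfill every existing row,
    # so the NOT NULL constraint can be applied in the same step.
    op.add_column(
        'user',
        sa.Column('create_at', sa.DateTime(timezone=True),
                  nullable=False, server_default=sa.func.now()),
    )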

Related

Setting read_policy in AppEngine Python

In this document it is mentioned that the default read_policy setting is ndb.EVENTUAL_CONSISTENCY.
After I did a bulk delete of entity items from the Datastore, versions of the app I pulled up continued to read the old data, so I've tried to figure out how to change this to STRONG_CONSISTENCY, with no success, including:
entity.query().fetch(read_policy=ndb.STRONG_CONSISTENCY) and
...fetch(options=ndb.ContextOptions(read_policy=ndb.STRONG_CONSISTENCY))
The error I get is
BadArgumentError: read_policy argument invalid ('STRONG_CONSISTENCY')
How does one change this default? More to the point, how can I ensure that NDB will go to the Datastore to load a result rather than relying on an old cached value? (Note that after the bulk delete the datastore browser tells me the entity is gone.)
You cannot change that default; it is also the only option available. From the very doc you referenced (no other options are mentioned):
Description
Set this to ndb.EVENTUAL_CONSISTENCY if, instead of waiting for the
Datastore to finish applying changes to all returned results, you wish
to get possibly-not-current results faster.
The same is confirmed by inspecting the google.appengine.ext.ndb.context.py file (no STRONG_CONSISTENCY definition in it):
# Constant for read_policy.
EVENTUAL_CONSISTENCY = datastore_rpc.Configuration.EVENTUAL_CONSISTENCY
The EVENTUAL_CONSISTENCY ends up in ndb via the google.appengine.ext.ndb.__init__.py:
from context import *
__all__ += context.__all__
You might be able to avoid the error using a hack like this:
from google.appengine.datastore.datastore_rpc import Configuration
...fetch(options=ndb.ContextOptions(read_policy=Configuration.STRONG_CONSISTENCY))
However, I think that only applies to reading the entities for the keys obtained through the query, not to obtaining the list of keys themselves. That list comes from the index the query uses, which is always eventually consistent - the root cause of your deleted entities still appearing in the results (for a while, until the index is updated). From Keys-only Global Query Followed by Lookup by Key:
But it should be noted that a keys-only global query can not exclude
the possibility of an index not yet being consistent at the time of
the query, which may result in an entity not being retrieved at all.
The result of the query could potentially be generated based on
filtering out old index values. In summary, a developer may use a
keys-only global query followed by lookup by key only when an
application requirement allows the index value not yet being
consistent at the time of a query.
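A hedged sketch of that pattern with ndb (the model is a placeholder; the get_multi lookups are strongly consistent, but the keys-only query itself still reads the eventually consistent index):

from google.appengine.ext import ndb

class Item(ndb.Model):
    name = ndb.StringProperty()

# Keys-only global query: the index may still list recently deleted entities...
keys = Item.query().fetch(keys_only=True)

# ...but the follow-up lookup by key is strongly consistent, so deleted entities
# come back as None and can be filtered out.
entities = [e for e in ndb.get_multi(keys) if e is not None]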
Potentially of interest: Bulk delete datastore entity older than 2 days

Should I always use 'implicit_returning':False in SQLAlchemy?

What's the potential pitfall of always using 'implicit_returning': False in SQLAlchemy?
I've encountered problems a number of times when working on MSSQL tables that have triggers defined, and since the DB is in replication, all of the tables have triggers.
I'm not sure now what the problem exactly is. It has something to do with auto-increment fields - maybe because I'm prefetching the auto-incremented value so I can insert it in another table.
If I don't set 'implicit_returning': False for the table, when I try to insert values, I get this error:
The target table of the DML statement cannot have any enabled triggers
if the statement contains an OUTPUT clause without INTO clause.
So what if I put __table_args__ = {'implicit_returning': False} into all mapped classes just to be safe?
Particularly frustrating for me is that the local DB I use for development & testing is not in replication and doesn't need that option, but the production DB is replicated, so when I deploy changes they sometimes don't work. :)
As you probably already know, the cause of your predicament is described in SQLAlchemy Docs as the following:
SQLAlchemy by default uses OUTPUT INSERTED to get at newly generated
primary key values via IDENTITY columns or other server side
defaults. MS-SQL does not allow the usage of OUTPUT INSERTED on
tables that have triggers. To disable the usage of OUTPUT INSERTED
on a per-table basis, specify implicit_returning=False for each
Table which has triggers.
If you set your SQLAlchemy engine to echo the SQL, you will see that by default, it does this:
INSERT INTO [table] (id, ...) OUTPUT inserted.[id] VALUES (...)
But if you disable implicit_returning, it does this instead:
INSERT INTO [table] (id, ...) VALUES (...); select scope_identity()
So the question, "Is there any harm in disabling implicit_returning for all tables just in case?" is really, "Is there any disadvantage to using SCOPE_IDENTITY() instead of OUTPUT INSERTED?"
I'm no expert, but I get the impression that although OUTPUT INSERTED is the preferred method these days, SCOPE_IDENTITY() is usually fine too. In the past, SQL Server 2008 (and maybe earlier versions too?) had a bug where SCOPE_IDENTITY sometimes didn't return the correct value, but I hear that has now been fixed (see this question for more detail). (On the other hand, other techniques like @@IDENTITY and IDENT_CURRENT() are still dangerous since they can return the wrong value in corner cases. See this answer and the others on that same page for more detail.)
The big advantage that OUTPUT INSERTED still has is that it can work for cases where you are inserting multiple rows via a single INSERT statement. Is that something you are doing with SQLAlchemy? Probably not, right? So it doesn't matter.
Note that if you are going to have to disable implicit_returning for many tables, you could avoid a bit of boilerplate by making a mixin for it (and whichever other columns and properties you want all of the tables to inherit):
class AutoincTriggerMixin():
    __table_args__ = {
        'implicit_returning': False
    }

    id = Column(Integer, primary_key=True, autoincrement=True)


class SomeModel(AutoincTriggerMixin, Base):
    some_column = Column(String(1000))
    ...
See this page in the SQLAlchemy documentation for more detail. As an added bonus, it makes it more obvious which tables involve triggers.
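For what it's worth, the same per-table switch works outside declarative models too, if you happen to define tables with Core (names here are invented):

from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData()

# implicit_returning=False disables OUTPUT INSERTED for this table only, so the
# dialect falls back to SELECT scope_identity() after the INSERT.
orders = Table(
    "orders", metadata,
    Column("id", Integer, primary_key=True, autoincrement=True),
    Column("description", String(200)),
    implicit_returning=False,
)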

How do I load data in many-to-many relation using pony orm?

Here are my entities:
class Article(db.Entity):
    id = PrimaryKey(int, auto=True)
    creation_time = Required(datetime)
    last_modification_time = Optional(datetime, default=datetime.now)
    title = Required(str)
    contents = Required(str)
    authors = Set('Author')


class Author(db.Entity):
    id = PrimaryKey(int, auto=True)
    first_name = Required(str)
    last_name = Required(str)
    articles = Set(Article)
And here is the code I'm using to get some data:
return left_join((article, author) for article in entities.Article
for author in article.authors).prefetch(entities.Author)[:]
Whether I'm using the prefetch method or not, the generated sql always looks the same:
SELECT DISTINCT "article"."id", "t-1"."author"
FROM "article" "article"
LEFT JOIN "article_author" "t-1"
ON "article"."id" = "t-1"."article"
And then when I iterated over the results, pony is issuing yet another query (queries):
SELECT "id", "creation_time", "last_modification_time", "title", "contents"
FROM "article"
WHERE "id" = %(p1)s
SELECT "id", "first_name", "last_name"
FROM "author"
WHERE "id" IN (%(p1)s, %(p2)s)
The desired behavior for me would be if the orm would issue just one query that would load all the data needed. So how do I achieve that?
The author of PonyORM is here. We don't want to load all these objects using just one query, because that would be inefficient.
The only benefit of using a single query to load a many-to-many relation is to reduce the number of round-trips to the database. But replacing three queries with one is not a major improvement. When your database server is located near your application server, these round-trips are actually very fast compared with processing the resulting data in Python.
On the other hand, when both sides of a many-to-many relation are loaded in the same query, the same object's data is inevitably repeated over and over in multiple rows. This has several drawbacks:
The amount of data transferred from the database becomes much larger than when no duplicate information is transferred. In your example, if you have ten articles and each is written by three authors, the single query will return thirty rows, with large fields like article.contents duplicated multiple times. Separate queries transfer the minimum amount of data possible; the difference in size can easily be an order of magnitude, depending on the specific many-to-many relation.
The database server is usually written in a compiled language like C and works very fast. The same is true for the networking layer. But Python code is interpreted, and the time consumed by Python code is (contrary to some opinions) usually much greater than the time spent in the database. You can see the profiling tests that were performed by the SQLAlchemy author Mike Bayer, after which he came to the conclusion:
A great misconception I seem to encounter often is the notion that communication with the database takes up a majority of the time spent in a database-centric Python application. This perhaps is a common wisdom in compiled languages such as C or maybe even Java, but generally not in Python. Python is very slow, compared to such systems (...) Whether a database driver (DBAPI) is written in pure Python or in C will incur significant additional Python-level overhead. For just the DBAPI alone, this can be as much as an order of magnitude slower.
When all data of a many-to-many relation is loaded in the same query and the same data is repeated in many rows, it is necessary to parse all of this repeated data in Python just to throw most of it away. As Python is the slowest part of the process, such an "optimization" may lead to decreased performance.
As support for my words I can point to the Django ORM. That ORM has two methods which can be used for query optimization. The first one, called select_related, loads all related objects in a single query, while the more recently added method, called prefetch_related, loads objects the way Pony does by default. According to Django users, the second method works much faster:
In some scenarios, we have found up to a 30% speed improvement.
The database is required to perform joins, which consume precious resources of the database server.
While Python code is the slowest part when processing a single request, the database server's CPU time is a shared resource used by all parallel requests. You can scale Python code easily by starting multiple Python processes on different servers, but it is much harder to scale the database. Because of this, in a high-load application it is better to offload work from the database server to the application server, so that work can be done in parallel by multiple application servers.
When the database performs a join, it needs to spend additional time doing it. But for Pony it is irrelevant whether the database makes the join or not, because the objects will be interlinked inside the ORM identity map in either case. So the work the database does when performing the join is just a useless expense of database time. On the other hand, using the identity map pattern, Pony can link objects equally fast regardless of whether they arrive in the same database row or not.
Returning to the number of round-trips, Pony has a dedicated mechanism to eliminate the "N+1 query" problem. The "N+1 query" anti-pattern arises when an ORM sends hundreds of very similar queries, each of which loads a separate object from the database. Many ORMs suffer from this problem. But Pony can detect it and replace the repeated N queries with a single query which loads all necessary objects at once. This mechanism is very efficient and can greatly reduce the number of round-trips. But when we speak about loading a many-to-many relation, there are no N queries here; there are just three queries which are more efficient when executed separately, so there is no benefit in trying to execute a single query instead.
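For example, accepting the three batched queries, a prefetch-based load could look like this (a sketch using the question's entities; the exact SQL Pony issues depends on the version):

from pony.orm import db_session, select

with db_session:
    # Pony loads the articles, the link table and the authors in a few batched
    # queries instead of one query per article.
    articles = select(a for a in Article).prefetch(Article.authors)[:]
    for article in articles:
        names = [au.first_name + ' ' + au.last_name for au in article.authors]
        print(article.title, names)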
To summarize, I need to say that ORM performance is very important to us, the Pony ORM developers. Because of that, we don't want to implement loading a many-to-many relation in a single query, as it would most certainly be slower than our current solution.
So, to answer your question: you cannot load both sides of a many-to-many relation in a single query. And I think this is a good thing.
This should work:
from pony.orm import select

# Join through the relationship attribute rather than comparing ids directly.
select((article, author) for article in Article for author in article.authors)

Column default value persisted to the table

I am currently using a Column that has the following signature:
Column('my_column', DateTime, default=datetime.datetime.utcnow)
I am trying to figure out how to change that in order to be able to do vanilla SQL inserts (INSERT INTO ...) rather than go through SQLAlchemy. Basically I want to know how to persist the default on the table without losing this functionality of setting the column to the current UTC time.
The database I am using is PostgreSQL.
There are multiple ways to have SQLAlchemy define how a value should be set on insert/update. You can view them in the documentation.
The way you're doing it right now (defining a default argument for the column) will only affect when SQLAlchemy is generating insert statements. In this case, it will call the callable function (in your case, the datetime.datetime.utcnow function) and use that value.
However, if you're going to be running straight SQL code, this function will never be run, since you'll be bypassing SQLAlchemy altogether.
What you probably want to do is use the Server Side Default ability of SQLAlchemy.
Try out this code instead:
from sqlalchemy.sql import func
...
Column('my_column', DateTime, server_default=func.current_timestamp())
SQLAlchemy should now generate a table that will cause the database to automatically insert the current date into the my_column column. The reason this works is that now the default is being used at the database side, so you shouldn't need SQLAlchemy anymore for the default.
Note: I think this will actually insert local time (as opposed to UTC). Hopefully, though, you can figure out the rest from here.
If you wanted to insert UTC time you would have to do something like this:
from sqlalchemy import text
...
created = Column(DateTime,
                 server_default=text("(now() at time zone 'utc')"),
                 nullable=False)
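With the default stored in the schema, a vanilla INSERT that simply omits the column gets the UTC timestamp from the database. A sketch (table and column names are illustrative; the URL is a placeholder):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/example")  # placeholder URL

with engine.begin() as conn:
    # my_table / other_column are placeholder names; my_column is omitted, so the
    # database fills it in from its DEFAULT expression.
    conn.execute(text("INSERT INTO my_table (other_column) VALUES ('some value')"))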

getting the id of a created record in SQLAlchemy

How can I get the id of the created record in SQLAlchemy?
I'm doing:
engine.execute("insert into users values (1,'john')")
When you execute a plain text statement, you're at the mercy of the DBAPI you're using as to whether or not the new PK value is available and via what means. With the SQLite and MySQL DBAPIs you'll have it as result.lastrowid, which just gives you the value of .lastrowid for the cursor. With PG, Oracle, etc., there's no ".lastrowid" - as someone else said you can use "RETURNING" for those, in which case results are available via result.fetchone() (although using RETURNING with Oracle, again not taking advantage of SQLAlchemy expression constructs, requires several awkward steps), or if RETURNING isn't available you can use direct sequence access (NEXTVAL in PG), or a "post fetch" operation (CURRVAL in PG, @@identity or scope_identity() in MSSQL).
Sounds complicated, right? That's why you're better off using table.insert(). SQLAlchemy's primary system of providing newly generated PKs is designed to work with these constructs. Once you're there, the result.last_inserted_ids() method gives you the newly generated (possibly composite) PK in all cases, regardless of backend. The above methods of .lastrowid, sequence execution, RETURNING etc. are all dealt with for you (0.6 uses RETURNING when available).
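A sketch of that approach (on modern SQLAlchemy the accessor is spelled result.inserted_primary_key rather than last_inserted_ids(); the connection URL is a placeholder):

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(50)),
)

engine = create_engine("postgresql+psycopg2://user:pass@localhost/example")  # placeholder URL
metadata.create_all(engine)

with engine.begin() as conn:
    result = conn.execute(users.insert().values(name="john"))
    # The dialect picks RETURNING, lastrowid, sequence access, etc. for you.
    new_id = result.inserted_primary_key[0]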
There's an extra clause you can add: RETURNING
ie
INSERT INTO users (name, address) VALUES ('richo', 'beaconsfield') RETURNING id
Then just retrieve a row like your insert was a SELECT statement.
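Through SQLAlchemy, a hedged sketch of running that statement and reading the returned id (placeholder URL; assumes the users table from the question has name and address columns):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/example")  # placeholder URL

with engine.begin() as conn:
    new_id = conn.execute(
        text("INSERT INTO users (name, address) "
             "VALUES ('richo', 'beaconsfield') RETURNING id")
    ).scalar()
    print(new_id)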
