I have a SQLAlchemy model Foo which contains a lazy-loaded relationship bar which points to another model that also has a lazy-loaded relationship foobar.
When querying normally I would use this code to ensure that all objects are loaded with a single query:
session.query(Foo).options(joinedload('bar').joinedload('foobar'))
However, now I have a case where a base class already provides me a Foo instance that was retrieved using session.query(Foo).one(), so the relationships are lazy-loaded (which is the default, and I don't want to change that).
For a single level of nesting I wouldn't mind it being loaded once I access foo.bar, but since I also need to access foo.bar[x].foobar, I'd really prefer to avoid sending queries in a loop (which would happen whenever I access foobar).
I'm looking for a way to make SQLAlchemy load the foo.bar relationship while also using the joinedload strategy for foobar.
I ran into a similar situation recently, and ended up doing the following:
eager_loaded = db.session.query(Bar).options(joinedload('foobar')) \
    .filter_by(bar_fk=foo.foo_pk).all()
Assuming you can recreate the bar join condition in the filter_by arguments, all the objects in the collection will be loaded into the identity map, and foo.bar[x].foobar will not need to go to the database.
One caveat: It looks like the identity map may dispose of the loaded entities if they are no longer strongly referenced - thus the assignment to eager_loaded.
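That caveat can be illustrated with plain Python: SQLAlchemy's identity map holds entities weakly by default, much like a weakref.WeakValueDictionary, so an object with no other strong reference can be garbage-collected out of the map. A minimal stdlib sketch of that behavior (the Entity class is a hypothetical stand-in for a mapped object):

```python
import gc
import weakref

class Entity:
    """Stand-in for an ORM-mapped object; weak-referenceable plain object."""
    def __init__(self, pk):
        self.pk = pk

# The map keeps only weak references, like SQLAlchemy's identity map.
identity_map = weakref.WeakValueDictionary()

strong_ref = Entity(1)       # analogous to `eager_loaded = query(...).all()`
identity_map[1] = strong_ref
identity_map[2] = Entity(2)  # no strong reference kept anywhere

gc.collect()                 # entity 2 is now collectable

print(1 in identity_map)  # True  - still strongly referenced
print(2 in identity_map)  # False - disposed once unreferenced
```

This is why the answer assigns the result to eager_loaded: as long as that list exists, every loaded Bar (and its foobar) stays pinned in the identity map.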
The SQLAlchemy wiki contains the Disjoint Eager Loading recipe. A query is issued for the parent collection, then the children are queried and combined. For the most part, this was implemented in SQLAlchemy as the subquery strategy, but the recipe covers the case where you explicitly need to make the query later, not just separately.
The idea is that you order the child query and group the results by the remote columns linking the relationship, then populate the attribute for each parent item with the group of children. The following is slightly modified from the recipe to allow passing in a custom child query with extra options, rather than building it from the parent query. This does mean that you have to construct the child query more carefully: if your parent query has filters, then the child should join and filter as well, to prevent loading unneeded rows.
from itertools import groupby
from sqlalchemy.orm import attributes

def disjoint_load(parents, rel, q):
    local_cols, remote_cols = zip(*rel.prop.local_remote_pairs)
    q = q.join(rel).order_by(*remote_cols)
    if rel.prop.order_by:
        q = q.order_by(*rel.prop.order_by)
    collections = dict(
        (k, list(v))
        for k, v in groupby(q, lambda x: tuple(getattr(x, c.key) for c in remote_cols)))
    for p in parents:
        attributes.set_committed_value(
            p, rel.key,
            collections.get(tuple(getattr(p, c.key) for c in local_cols), ()))
    return parents
# load the parents
devices = session.query(Device).filter(Device.active).all()

# build the child query with extras, use the same filter
findings = session.query(Finding
    ).join(Device.findings
    ).filter(Device.active
    ).options(joinedload(Finding.scans))

for d in disjoint_load(devices, Device.findings, findings):
    print(d.cn, len(d.findings))
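The core of the recipe — ordering child rows by the linking key and slicing them into per-parent groups — is plain itertools.groupby. A stdlib sketch with hypothetical stand-in rows (no SQLAlchemy needed) shows why the child query must be ordered first:

```python
from itertools import groupby
from types import SimpleNamespace

# Hypothetical child rows as they would come back from the child query,
# each carrying the foreign key linking it to its parent.
children = [
    SimpleNamespace(device_id=1, name="f1"),
    SimpleNamespace(device_id=2, name="f2"),
    SimpleNamespace(device_id=1, name="f3"),
]

# groupby only merges *adjacent* items, so sort by the linking key first --
# this is what the .order_by(*remote_cols) in the recipe is for.
children.sort(key=lambda c: c.device_id)
collections = {k: list(v) for k, v in groupby(children, key=lambda c: c.device_id)}

parents = [SimpleNamespace(id=1), SimpleNamespace(id=2), SimpleNamespace(id=3)]
for p in parents:
    # set_committed_value() does this assignment without triggering lazy loads
    p.findings = collections.get(p.id, [])

print([len(p.findings) for p in parents])  # [2, 1, 0]
```

A parent with no children gets an empty collection, which is exactly what keeps later attribute access from falling back to a lazy load.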
I have an SQLAlchemy mapped class MyClass, and two aliases for it. I can eager-load a relationship MyClass.relationship on each alias separately using selectinload() like so:
alias_1, alias_2 = aliased(MyClass), aliased(MyClass)
q = session.query(alias_1, alias_2).options(
selectinload(alias_1.relationship),
selectinload(alias_2.relationship))
However, this results in 2 separate SQL queries on MyClass.relationship (in addition to the main query on MyClass, but this is irrelevant to the question). Since these 2 queries on MyClass.relationship are to the same table, I think that it should be possible to merge the primary keys generated within the IN clause in these queries, and just run 1 query on MyClass.relationship.
My best guess for how to do this is:
alias_1, alias_2 = aliased(MyClass), aliased(MyClass)
q = session.query(alias_1, alias_2).options(
selectinload(MyClass.relationship))
But it clearly didn't work:
sqlalchemy.exc.ArgumentError: Mapped attribute "MyClass.relationship" does not apply to any of the root entities in this query, e.g. aliased(MyClass), aliased(MyClass). Please specify the full path from one of the root entities to the target attribute.
Is there a way to do this in SQLAlchemy?
So, this is exactly the same issue we had. The docs explain how to do it.
You need to add selectin_polymorphic. For anyone else: if you are using with_polymorphic in your select, remove it.
from sqlalchemy.orm import selectin_polymorphic

query = session.query(MyClass).options(
    selectin_polymorphic(MyClass, [alias_1, alias_2]),
    selectinload(MyClass.relationship)
)
Question
If I have an existing query for an entity, how can I restrict it to return only results of a polymorphic subclass of this entity?
Details
Using the Employee / Engineer / Manager setup from the Mapping Class Inheritance Hierarchies section of the sqlalchemy documentation:
Imagine I have a complex query that I got from somewhere and whose original definition I don't want to change:
def get_complex_employee_query():
    """ Super complex query """
    query = session.query(Employee).filter(Employee.name.like('John %'))
    [... imagine a bunch of other `.filter()`, `.join()` and/or `.options()` here ...]
    return query

query = get_complex_employee_query()
I know I can filter the query to list only engineers by doing
query = query.filter(Employee.type == 'engineer')
But suppose that:
there are polymorphic subclasses of Engineer,
or that I don't know/care that the Employee is polymorphic_on the column type,
or that I don't know/care about the correct value for the column type to filter only to engineers
Is there a way to apply a .filter(), .options() or some other method of query that will restrict the query to engineers (and subclasses) without knowledge of the specific fields that configure the polymorphic inheritance?
I'm ok with a .join() as well, as long as I don't have to know/care about the primary/foreign key relationships between the classes/tables that are part of the polymorphic hierarchy.
In short
I'd like something like Model.relationship.of_type() method, but for queries instead of relationships.
Is there such a thing?
Short answer
from sqlalchemy import inspect

[...]

eng_mapper = inspect(Engineer)
query.filter(
    eng_mapper.polymorphic_on.in_(
        m.polymorphic_identity
        for m in eng_mapper.polymorphic_iterator()
    ),
)
I'd prefer a slightly less verbose incantation, but this works and doesn't require knowledge of the specific configuration of the polymorphic hierarchy.
Details
When inspect() is called on an ORM mapped class, it returns the Mapper for that class. This is identical to the Model.__mapper__ class attribute.
The Mapper contains all information needed to introspect the polymorphic hierarchy. In particular:
.polymorphic_on is the field (column) in the model at the top of the hierarchy that contains the polymorphic identity value for a record (e.g. for Engineer that would be the Employee.type field).
.polymorphic_identity is the value that each instance of the mapped model will have in the .polymorphic_on field (e.g. for Engineer that would be "engineer").
.polymorphic_iterator() iterates over a collection of model Mappers that includes Model.__mapper__ and the .__mapper__ of all subclasses of Model recursively (e.g. for Engineer that would be an iterator containing only the Engineer.__mapper__).
To make it more readable, one could easily turn the above filter expression into a function:
from sqlalchemy import inspect

def filter_instances_of(cls):
    mapper = inspect(cls)
    return mapper.polymorphic_on.in_(
        m.polymorphic_identity for m in mapper.polymorphic_iterator()
    )
And use it like:
query = query.filter(
    filter_instances_of(Engineer),
    [... other filter criteria ...]
)
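What polymorphic_iterator() contributes can be sketched without SQLAlchemy at all: it is essentially a recursive walk over a class and its subclasses, collecting each one's identity. A hypothetical stdlib analogue using __subclasses__(), with a toy hierarchy standing in for the mapped models:

```python
class Employee:
    polymorphic_identity = "employee"

class Engineer(Employee):
    polymorphic_identity = "engineer"

class SeniorEngineer(Engineer):
    polymorphic_identity = "senior_engineer"

class Manager(Employee):
    polymorphic_identity = "manager"

def identities_of(cls):
    """Recursively yield the identity of cls and every subclass,
    mirroring what Mapper.polymorphic_iterator() provides."""
    yield cls.polymorphic_identity
    for sub in cls.__subclasses__():
        yield from identities_of(sub)

# Filtering on Engineer would then produce IN ('engineer', 'senior_engineer'),
# catching subclasses without knowing the discriminator values up front.
print(sorted(identities_of(Engineer)))  # ['engineer', 'senior_engineer']
```

The real mapper walk works off the configured mappers rather than __subclasses__(), but the shape of the result — the class's own identity plus those of all descendants — is the same.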
I'm familiar with the joinedload and subqueryload options in Sqlalchemy, and I'm using them to query a large result set that's later expunged from the session and cached.
Is there a way to verify that every possible relationship from the top-level model on down is eager-loaded at this point?
The supported way to ensure that you've eager-loaded all the relationships you need is to put lazy="raise" on all of your relationships. It won't tell you that you did something wrong until you do it, but EAFP.
children = relationship(Child, lazy="raise")
The following iterator will iterate over all relationships reachable from a given model. It yields tuples of (model_class, relationship_name). You can modify it to look at prop.lazy or similar, or use it to construct loader options to eagerly load the right things, or whatever seems appropriate.
from sqlalchemy import inspect

def recursive_relations(model, already_traversed=None):
    if already_traversed is None:
        already_traversed = set()
    inspection = inspect(model)
    already_traversed.add(model)
    for name, prop in inspection.relationships.items():
        yield (model, name)
        if prop.mapper.class_ not in already_traversed:
            already_traversed.add(prop.mapper.class_)
            yield from recursive_relations(prop.mapper.class_, already_traversed)
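The visited-set pattern in that iterator is what keeps mutually referencing models (e.g. a Parent ↔ Child backref pair) from recursing forever. The same shape on a hypothetical plain dict graph, with made-up model and relation names:

```python
def walk_relations(graph, model, already_traversed=None):
    """Yield (model, relation_name) pairs reachable from model.
    graph maps a model name to {relation_name: target_model_name}."""
    if already_traversed is None:
        already_traversed = set()
    already_traversed.add(model)
    for name, target in graph[model].items():
        yield (model, name)
        if target not in already_traversed:
            already_traversed.add(target)
            yield from walk_relations(graph, target, already_traversed)

# Parent and Child reference each other, like a backref pair.
graph = {
    "Parent": {"children": "Child"},
    "Child": {"parent": "Parent", "toys": "Toy"},
    "Toy": {},
}
print(sorted(walk_relations(graph, "Parent")))
# [('Child', 'parent'), ('Child', 'toys'), ('Parent', 'children')]
```

Each relationship is still reported (Child's backref to Parent appears in the output), but the already-visited target is never descended into a second time.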
Is there a more elegant/faster way to do this?
Here are my models:
class X(Model):
    (...)

class A(Model):
    xs = ManyToManyField(X)

class B(A):
    (...)

class C(A):  # NOTE: not relevant here
    (...)

class Y(Model):
    b = ManyToOneField(B)
Here is what I want to do:
def foo(x):
    # NOTE: b's pk (b.a_ptr) is actually a's pk (a.id)
    return Y.objects.filter(b__in=x.a_set.all())
But it returns the following error:
<repr(<django.db.models.query.QuerySet at 0x7f6f2c159a90>) failed:
django.core.exceptions.FieldError: Cannot resolve keyword u'a_ptr'
into field. Choices are: ...enumerate all fields of A...>
And here is what I'm doing right now in order to minimize the queries:
def foo(x):
    a_set = x.a_set
    a_set.model = B  # NOTE: force the model to be B
    return Y.objects.filter(b__in=a_set.all())
It works but it's not one line. It would be cool to have something like this:
def foo(x):
    return Y.objects.filter(b__in=x.a_set._as(B).all())
The simpler way in Django seems to be the following:
def foo(x):
    return Y.objects.filter(b__in=B.objects.filter(pk__in=x.a_set.all()))
...but this makes useless sub-queries in SQL:
SELECT Y.* FROM Y WHERE Y.b IN (
    SELECT B.a_ptr FROM B WHERE B.a_ptr IN (
        SELECT A.id FROM A WHERE ...));
This is what I want:
SELECT Y.* FROM Y WHERE Y.b IN (SELECT A.id FROM A WHERE ...);
(This SQL example is slightly simplified: the relation between A and X is actually a many-to-many table, which I substituted with SELECT A.id FROM A WHERE ... for clarity's sake.)
You might get a better result if you follow the relationship one more step, rather than using in.
Y.objects.filter(b__xs=x)
I know next to nothing about databases, but there are many warnings in the Django literature about concrete inheritance class B(A) causing database inefficiency because of repeated INNER JOIN operations.
The underlying problem is that relational databases aren't object-orientated.
If you use Postgres, then Django has native support (since 1.8) for HStore fields, which are searchable but schema-less collections of keyword-value pairs. I intend to use one of these in my base class (your A) to eliminate any need for derived classes (your B and C). Instead, I'll maintain a type field in A (equal to "B" or "C" in your example) and use that and/or careful key-presence checking to access subfields of the HStore field which are "guaranteed" to exist if (say) instance.type == "C" (a constraint that isn't enforced at the DB integrity level).
I also discovered the django-hstore module which existed prior to Django 1.8, and which has even greater functionality. I've yet to do anything with it so cannot comment further.
I also found reference to wide tables in the literature, where you simply define a model A with a potentially large number of fields which contain useful data only if the object is of type B, and yet more fields for types C, D, ... I don't have the DB knowledge to assess the cost of lots of blank or null fields attached to every row of a table. Clearly if you have a million sheep records, one pedigree-racehorse record and one stick-insect record as "subclasses" of Animal in the same wide table, it gets silly.
I'd appreciate feedback on either idea if you have the DB understanding that I lack.
Oh, and for completeness, there's the django-polymorphic module, but that builds on concrete inheritance with its attendant inefficiencies.
I have a Query object which was initially configured to lazyload() all relations on a model:
query = session.query(Article).options(lazyload('author'))
Is it possible to revert the relationship loading back to default? E.g. the relationship was configured with lazy='joined', and I want the query to have joinedload() behavior without using joinedload() explicitly.
I was expecting defaultload() to have this behavior, but in fact it does not: it references the query default instead of the relationship default. So I'm searching for some kind of resetload() solution.
The reason for doing this is because I'm creating a JSON-based query syntax, and no relations should be loaded unless the user explicitly names them.
Currently, I'm using lazyload() on all relations that were not explicitly requested, but want to go the other way around: lazyload() all relations first, and then override it for some of them.
This would have made the code more straightforward.
Just to be clear:
By default, all inter-object relationships are lazy loading.
http://docs.sqlalchemy.org/en/latest/orm/loading.html
So we are talking about a case in which a relation has been specifically marked as eager loading, then the queries are configured as lazy loading, then you want to "override the override" as it were.
Chaining calls to options will override earlier calls. I did test this a bit.
q = s.query(User)  # lazy loads 'addresses'
q = s.query(User).options(contains_eager('addresses'))  # eager loads
q = s.query(User).options(contains_eager('addresses'))\
    .options(lazyload('addresses'))  # lazy loads
q = s.query(User).options(contains_eager('addresses'))\
    .options(lazyload('addresses'))\
    .options(contains_eager('addresses'))  # eager loads
However, it sounds like you're talking about just reverting the lazyload option, whereas the above case involves an explicit change to eager loading.
The defaultload docstring says its use case is to be chained to other loader options, so I don't think it's related.
Based on a glance through the source, I don't think this behavior is supported. When you update the loading strategy option, it updates a dictionary with the new loading strategy and I don't think there's still a reference to the old strategy, at least as far as I can tell.
You could keep a reference to the query object before .options(lazyload(...)), or just have an option to generate the query with or without the lazyload on everything.
To force everything to lazyload, ignoring what was specified on the relationship, you can use the '*' target. From the docs:
affecting all relationships not otherwise specified in the query. This
feature is available by passing the string '*' as the argument to any
of these options:
session.query(Article).options(lazyload('*'))
Then you can specify whatever load types you want per relationship or relationship chain.
# not sure how you are mapping json data to relationships
# once you know the relationships, you can build a list of them to load
my_loads = [joinedload(rel) for rel in json_rel_data]
query = session.query(Article).options(lazyload('*'), *my_loads)
# query lazy loads **everything** except the explicitly set joined loads
If you are joining on the relationships for query purposes, you can use contains_eager instead of joinedload in the options to use the already joined relationship.
my_eagers = [contains_eager(rel) for rel in json_rel_joins]
my_loads = [joinedload(rel) for rel in json_rel_loads]
query = session.query(Article
).join(*json_rel_joins
).options(lazyload('*'), *my_eagers, *my_loads)