SQLAlchemy ORM select multiple entities from subquery

I need to query multiple entities, something like session.query(Entity1, Entity2), only from a subquery rather than directly from the tables. The docs have something about selecting one entity from a subquery but I can't find how to select more than one, either in the docs or by experimentation.
My use case is that I need to filter the tables underlying the mapped classes by a window function, which in PostgreSQL can only be done in a subquery or CTE.
EDIT: The subquery spans a JOIN of both tables so I can't just do aliased(Entity1, subquery).

from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    bs = relationship("B")

class B(Base):
    __tablename__ = "b"
    id = Column(Integer, primary_key=True)
    a_id = Column(Integer, ForeignKey('a.id'))

e = create_engine("sqlite://", echo=True)
Base.metadata.create_all(e)

s = Session(e)
s.add_all([A(bs=[B(), B()]), A(bs=[B()])])
s.commit()

# with_labels() here is to disambiguate A.id and B.id.
# Without it, you'd see a warning:
# "Column 'id' on table being replaced by another column with the same key."
subq = s.query(A, B).join(A.bs).with_labels().subquery()

# method 1 - select_from()
print(s.query(A, B).select_from(subq).all())

# method 2 - alias them both. "subq" renders
# once because FROM objects render based on object
# identity.
a_alias = aliased(A, subq)
b_alias = aliased(B, subq)
print(s.query(a_alias, b_alias).all())
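
Tying this back to the question's window-function use case: the same aliased-entity pattern works when the subquery filters by a window function. A minimal sketch, assuming the goal is to keep only the first B per A (row_number() needs a database with window-function support, such as PostgreSQL):

rn = func.row_number().over(partition_by=A.id, order_by=B.id).label("rn")
subq = s.query(A, B, rn).join(A.bs).with_labels().subquery()
a_alias = aliased(A, subq)
b_alias = aliased(B, subq)
# the window function's result can only be filtered from an enclosing query
print(s.query(a_alias, b_alias).filter(subq.c.rn == 1).all())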

I was trying to do something like the original question: join a filtered table with another filtered table using an outer join. I was struggling because it's not at all obvious how to:

- create a SQLAlchemy query that returns entities from both tables. @zzzeek's answer showed me how to do that: get_session().query(A, B).
- use a query as a table in such a query. @zzzeek's answer showed me how to do that too: filtered_a = aliased(A, A.query().filter(...).subquery()).
- use an OUTER join between the two entities. Using select_from() after outerjoin() destroys the join condition between the tables, resulting in a cross join. From @zzzeek's answer I guessed that if a is aliased(), then you can include a in the query() and also .outerjoin(a), and it won't be joined a second time, and that appears to work.

Following either of @zzzeek's suggested approaches directly resulted in a cross join (combinatorial explosion), because one of my models uses inheritance and SQLAlchemy added the parent tables outside the inner SELECT without any conditions! I think this is a bug in SQLAlchemy. The approach that I adopted in the end was:
filtered_a = aliased(A, A.query().filter(...).subquery("filtered_a"))
filtered_b = aliased(B, B.query().filter(...).subquery("filtered_b"))
query = get_session().query(filtered_a, filtered_b)
query = query.outerjoin(filtered_b, filtered_a.relation_to_b)
query = query.order_by(filtered_a.some_column)
for a, b in query:
    ...
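
For a concrete version of this pattern, here is a sketch using the A/B models from the accepted answer (the filters are purely illustrative):

filtered_a = aliased(A, s.query(A).filter(A.id > 0).subquery("filtered_a"))
filtered_b = aliased(B, s.query(B).filter(B.id > 0).subquery("filtered_b"))
q = s.query(filtered_a, filtered_b)
# outer join along the relationship; filtered_b is the right-hand target
q = q.outerjoin(filtered_b, filtered_a.bs)
q = q.order_by(filtered_a.id)
for a, b in q:
    print(a, b)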

Related

How to query with like() when using many-to-many relationships in SQLAlchemy?

I have the following many-to-many relationship defined in SQLAlchemy:
training_ids_association_table = db.Table(
    "training_ids_association",
    db.Model.metadata,
    Column("training_id", Integer, ForeignKey("training_sessions.id")),
    Column("ids_id", Integer, ForeignKey("image_data_sets.id")),
)

class ImageDataSet(db.Model):
    __tablename__ = "image_data_sets"
    id = Column(Integer, primary_key=True)
    tags = Column(String)
    trainings = relationship("TrainingSession", secondary=training_ids_association_table,
                             back_populates="image_data_sets")

class TrainingSession(db.Model):
    __tablename__ = "training_sessions"
    id = Column(Integer, primary_key=True)
    image_data_sets = relationship("ImageDataSet", secondary=training_ids_association_table,
                                   back_populates="trainings")
Note the field ImageDataSet.tags, which can contain a list of string items (i.e. tags) separated by a slash character. If possible I would rather stick to that format instead of creating a new table just for these tags.
What I want now is to query the TrainingSession table for all entries that have a certain tag set in their related ImageDataSets. Now, if an ImageDataSet has only one tag saved in the tags field, then the following works:
TrainingSession.query.filter(TrainingSession.image_data_sets.any(tags=find_tag))
However, as soon as there are multiple tags in the tags field (e.g. something like "tag1/tag2/tag3"), this filter of course no longer works. So I tried it with a like:
.filter(TrainingSession.image_data_sets.like(f'%{find_tag}%'))
But this leads to a NotImplementedError in SQLAlchemy. So is there a way to achieve what I am trying to do here, or do I necessarily need another table for the tags per ImageDataSet?
You can apply filters on a related model's columns if you join that model first:

query = session.query(TrainingSession). \
    join(TrainingSession.image_data_sets). \
    filter(ImageDataSet.tags.like(f"%{find_tag}%"))
This query is translated to the following SQL statement:
SELECT training_sessions.id FROM training_sessions
JOIN training_ids_association ON training_sessions.id = training_ids_association.training_id
JOIN image_data_sets ON image_data_sets.id = training_ids_association.ids_id
WHERE image_data_sets.tags LIKE %(find_tag)s
Note that you may stumble into a problem with storing tags as strings with separators. If some records have tags tag1, tag12, tag123, they will all pass the filter LIKE '%tag1%'.
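
If you do keep the slash-separated string, one workaround (a sketch, using the "/" separator described in the question) is to pad both the column and the pattern with the separator, so only whole tags can match:

# "/tag1/tag2/tag3/" LIKE "%/tag1/%" matches tag1 but not tag12
padded_tags = "/" + ImageDataSet.tags + "/"
query = session.query(TrainingSession). \
    join(TrainingSession.image_data_sets). \
    filter(padded_tags.like(f"%/{find_tag}/%"))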
It would be better to switch to an ARRAY column if your database supports this column type (PostgreSQL, for example). Your column may then be defined like this:
tags = Column(ARRAY(String))
And the query may look like this:

query = session.query(TrainingSession). \
    join(TrainingSession.image_data_sets). \
    filter(ImageDataSet.tags.any(find_tag))
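
On PostgreSQL this renders roughly as:

SELECT training_sessions.id FROM training_sessions
JOIN training_ids_association ON training_sessions.id = training_ids_association.training_id
JOIN image_data_sets ON image_data_sets.id = training_ids_association.ids_id
WHERE %(param_1)s = ANY (image_data_sets.tags)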

Bulk Upsert with SQLAlchemy Postgres

I'm following the SQLAlchemy documentation here to write a bulk upsert statement with Postgres. For demonstration purposes, I have a simple table MyTable:
class MyTable(base):
    __tablename__ = 'mytable'
    id = Column(types.Integer, primary_key=True)
    test_value = Column(types.Text)
Creating a generic insert statement is simple enough:
from sqlalchemy.dialects import postgresql
values = [{'id': 0, 'test_value': 'a'}, {'id': 1, 'test_value': 'b'}]
insert_stmt = postgresql.insert(MyTable.__table__).values(values)
The problem I run into is when I try to add the "on conflict" part of the upsert.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
Trying to execute this statement yields a ProgrammingError:
from sqlalchemy import create_engine
engine = create_engine('postgres://localhost/db_name')
engine.execute(update_stmt)
>>> ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'dict'
I think my misunderstanding is in constructing the statement with the on_conflict_do_update method. Does anyone know how to construct this statement? I have looked at other questions on StackOverflow (e.g. here) but I can't seem to find a way to address the above error.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
index_elements should be either a list of strings or a list of column objects, so either [MyTable.id] or ['id']. (This part is correct.)
set_ should be a dictionary with column names as keys and valid SQL update expressions as values. You can reference values from the insert block using the excluded attribute. So to get the result you are hoping for here, you would want set_={'test_value': insert_stmt.excluded.test_value}. (The error you made is that data= in the example isn't a magic argument: it was the name of the column on their example table.)
So, the whole thing would be
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_={'test_value': insert_stmt.excluded.test_value}
)
Of course, in a real-world example I usually want to change more than one column. In that case I would do something like...
update_columns = {col.name: col for col in insert_stmt.excluded if col.name not in ('id', 'datetime_created')}
update_statement = insert_stmt.on_conflict_do_update(index_elements=['id'], set_=update_columns)
(This example would overwrite every column except for the id and datetime_created columns)
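
Putting the pieces together, a minimal end-to-end sketch reusing MyTable and the values list from the question (datetime_created above was just an example column name):

from sqlalchemy import create_engine
from sqlalchemy.dialects import postgresql

engine = create_engine('postgresql://localhost/db_name')
values = [{'id': 0, 'test_value': 'a'}, {'id': 1, 'test_value': 'b'}]
insert_stmt = postgresql.insert(MyTable.__table__).values(values)
upsert_stmt = insert_stmt.on_conflict_do_update(
    index_elements=['id'],
    # on conflict, take the value that would have been inserted
    set_={'test_value': insert_stmt.excluded.test_value},
)
engine.execute(upsert_stmt)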

Adding a join to an SQL Alchemy expression that already has a select_from()

Note: this is a question about SQLAlchemy's expression language, not the ORM
SQLAlchemy is fine for adding WHERE or HAVING clauses to an existing query:
q = select([bmt_gene.c.id]).select_from(bmt_gene)
q = q.where(bmt_gene.c.ensembl_id == "ENSG00000000457")
print q
SELECT bmt_gene.id
FROM bmt_gene
WHERE bmt_gene.ensembl_id = %s
However if you try to add a JOIN in the same way you'll get an exception:
q = select([bmt_gene.c.id]).select_from(bmt_gene)
q = q.join(bmt_gene_name)
sqlalchemy.exc.NoForeignKeysError: Can't find any foreign key relationships between 'Select object' and 'bmt_gene_name'
If you specify the columns it creates a subquery (which is incomplete SQL anyway):
q = select([bmt_gene.c.id]).select_from(bmt_gene)
q = q.join(bmt_gene_name, q.c.id == bmt_gene_name.c.gene_id)
(SELECT bmt_gene.id AS id FROM bmt_gene)
JOIN bmt_gene_name ON id = bmt_gene_name.gene_id
But what I actually want is this:
SELECT
    bmt_gene.id AS id
FROM
    bmt_gene
    JOIN bmt_gene_name ON id = bmt_gene_name.gene_id
Edit: Adding the JOIN has to happen after the creation of the initial query expression q. The idea is that I make a basic query skeleton, then iterate over all the joins requested by the user and add them to the query.
Can this be done in SQL Alchemy?
The first error (NoForeignKeysError) means that your table lacks a foreign key definition. Fix this if you don't want to write join clauses by hand:
from sqlalchemy.types import Integer
from sqlalchemy.schema import MetaData, Table, Column, ForeignKey

meta = MetaData()
bmt_gene_name = Table(
    'bmt_gene_name', meta,
    Column('id', Integer, primary_key=True),
    Column('gene_id', Integer, ForeignKey('bmt_gene.id')),
    # ...
)
The joins in the SQLAlchemy expression language work a little differently from what you might expect. You need to create a Join object in which you join all the tables, and only then provide it to the Select object:
q = select([bmt_gene.c.id])
q = q.where(bmt_gene.c.ensembl_id == 'ENSG00000000457')

j = bmt_gene  # Initial table in the join chain.
table_list = [bmt_gene_name, some_other_table, ...]
for table in table_list:
    j = j.join(table)

q = q.select_from(j)
The reason why you see the subquery in your join is that a Select object is treated like a table (which essentially it is), and you asked to join that table to another one.
You can access the current select_from of a query with the froms attribute, and then join it with another table and update the select_from.
As explained in the documentation, calling select_from usually adds another selectable to the FROM list, however:
Passing a Join that refers to an already present Table or other selectable will have the effect of concealing the presence of that selectable as an individual element in the rendered FROM list, instead rendering it into a JOIN clause.
So you can add a join like this, for example:
q = select([bmt_gene.c.id]).select_from(bmt_gene)
q = q.select_from(
    join(q.froms[0], bmt_gene_name,
         bmt_gene.c.id == bmt_gene_name.c.gene_id)
)

How can I prevent sqlalchemy from prefixing the column names of a CTE?

Consider the following query codified via SQLAlchemy.
# Create a CTE that performs a join and gets some values
x_cte = session.query(SomeTable.col1,
                      OtherTable.col5) \
    .select_from(SomeTable) \
    .join(OtherTable, SomeTable.col2 == OtherTable.col3) \
    .filter(OtherTable.col6 == 34) \
    .cte(name='x')
# Create a subquery that splits the CTE based on the value of col1
# and computes the quartile for positive col1 and assigns a dummy
# "quartile" for negative and zero col1
subquery = session.query(x_cte,
                         literal('-1', sqlalchemy.INTEGER).label('quartile')) \
    .filter(x_cte.col1 <= 0) \
    .union_all(session.query(x_cte,
                             sqlalchemy.func.ntile(4).over(order_by=x_cte.col1).label('quartile'))
               .filter(x_cte.col1 > 0)) \
    .subquery()
# Compute some aggregate values for each quartile
result = session.query(sqlalchemy.func.avg(subquery.columns.x_col1),
                       sqlalchemy.func.avg(subquery.columns.x_col5),
                       subquery.columns.x_quartile) \
    .group_by(subquery.columns.x_quartile) \
    .all()
Sorry for the length, but this is similar to my real query. In my real code, I've given a more descriptive name to my CTE, and my CTE has far more columns for which I must compute the average. (It's also actually a weighted average weighted by a column in the CTE.)
The real "problem" is purely one of trying to keep my code more clear and shorter. (Yes, I know. This query is already a monster and hard to read, but the client insists on this data being available.) Notice that in the final query, I must refer to my columns as subquery.columns.x_[column name]; this is because SQLAlchemy is prefixing my column name with the CTE name. I would just like for SQLAlchemy to leave off my CTE's name when generating column names, but since I have many columns, I would prefer not to list them individually in my subquery. Leaving off the CTE name would make my column names (which are long enough on their own) shorter and slightly more readable; I can guarantee that the columns are unique. How can I do this?
Using Python 2.7.3 with SQLAlchemy 0.7.10.
You're not being too specific about what "x_" is here, but if that's the final result, use label() to give the result columns whatever name you want:
row = session.query(func.avg(foo).label('foo_avg'), func.avg(bar).label('bar_avg')).first()
foo_avg = row['foo_avg'] # indexed access
bar_avg = row.bar_avg # attribute access
Edit: I'm not able to reproduce the "x_" here. Here's a test:
from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    x = Column(Integer)
    y = Column(Integer)

s = Session()

subq = s.query(A).cte(name='x')
subq2 = s.query(subq, (subq.c.x + subq.c.y)).filter(A.x == subq.c.x).subquery()

print s.query(A).join(subq2, A.id == subq2.c.id).\
    filter(subq2.c.x == A.x, subq2.c.y == A.y)
Above, you can see I can refer to subq2.c.<colname> without issue; there is no "x_" prepended. If you can specify your SQLAlchemy version information and fill out your example fully, I can run it as-is in order to reproduce your issue.
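
If prefixed keys do show up in your version, one workaround in the spirit of the label() suggestion above is to label the CTE's columns explicitly, so downstream code sees predictable names regardless of the CTE name (a sketch based on the question's models):

x_cte = session.query(SomeTable.col1.label('col1'),
                      OtherTable.col5.label('col5')) \
    .select_from(SomeTable) \
    .join(OtherTable, SomeTable.col2 == OtherTable.col3) \
    .cte(name='x')
# downstream access is then subquery.columns.col1 rather than x_col1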

How to count rows with SELECT COUNT(*) with SQLAlchemy?

I'd like to know if it's possible to generate a SELECT COUNT(*) FROM TABLE statement in SQLAlchemy without explicitly asking for it with execute().
If I use:
session.query(table).count()
then it generates something like:
SELECT count(*) AS count_1
FROM (SELECT table.col1 AS col1, table.col2 AS col2, ... FROM table)
which is significantly slower in MySQL with InnoDB. I am looking for a solution that doesn't require the table to have a known primary key, as suggested in Get the number of rows in table using SQLAlchemy.
Query for just a single known column:
session.query(MyTable.col1).count()
I managed to render the following SELECT with SQLAlchemy on both layers.
SELECT count(*) AS count_1
FROM "table"
Usage from the SQL Expression layer
from sqlalchemy import select, func, Integer, Table, Column, MetaData

metadata = MetaData()
table = Table("table", metadata,
              Column('primary_key', Integer),
              Column('other_column', Integer)  # just to illustrate
              )

print(select([func.count()]).select_from(table))
Usage from the ORM layer
You just subclass Query (you probably have one anyway) and provide a specialized count() method, like this one:
from sqlalchemy.orm import Query
from sqlalchemy.sql.expression import func

class BaseQuery(Query):
    def count_star(self):
        count_query = (self.statement.with_only_columns([func.count()])
                       .order_by(None))
        return self.session.execute(count_query).scalar()
Please note that order_by(None) resets the ordering of the query, which is irrelevant to the counting.
Using this method you can have a count(*) on any ORM query that will honor all the filter and join conditions already specified.
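
To make count_star() available everywhere, the subclass can be wired into the session via sessionmaker's query_cls argument (a sketch; MyModel is a placeholder):

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine, query_cls=BaseQuery)
session = Session()
n = session.query(MyModel).filter(MyModel.id > 10).count_star()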
I needed to do a count of a very complex query with many joins. I was using the joins as filters, so I only wanted to know the count of the actual objects. count() was insufficient, but I found the answer in the docs here:
http://docs.sqlalchemy.org/en/latest/orm/tutorial.html
The code would look something like this (to count user objects):
from sqlalchemy import func
session.query(func.count(User.id)).scalar()
In addition to the Usage from the ORM layer in the accepted answer: count(*) can be done for the ORM using query.with_entities(func.count()), like this:
session.query(MyModel).with_entities(func.count()).scalar()
It can also be used in more complex cases, when we have joins and filters; the important thing here is to place with_entities after the joins, otherwise SQLAlchemy could raise the "Don't know how to join" error.
For example:
- we have a User model (id, name) and a Song model (id, title, genre)
- we have user-song data in a UserSong model (user_id, song_id, is_liked), where user_id + song_id is the primary key
We want to get the number of a user's liked rock songs:
SELECT count(*)
FROM user_song
JOIN song ON user_song.song_id = song.id
WHERE user_song.user_id = %(user_id)s
AND user_song.is_liked IS 1
AND song.genre = 'rock'
This query can be generated in the following way:

from sqlalchemy import and_, func

user_id = 1
query = session.query(UserSong)
query = query.join(Song, Song.id == UserSong.song_id)
query = query.filter(
    and_(
        UserSong.user_id == user_id,
        UserSong.is_liked.is_(True),
        Song.genre == 'rock'
    )
)
# Note: important to place `with_entities` after the join
query = query.with_entities(func.count())
liked_count = query.scalar()
If you are using the SQL Expression Style approach, there is another way to construct the count statement, provided you already have your table object.
Preparations to get the table object (there are other ways as well):
import sqlalchemy

database_engine = sqlalchemy.create_engine("connection string")

# Populate the existing database via reflection into sqlalchemy objects
database_metadata = sqlalchemy.MetaData()
database_metadata.reflect(bind=database_engine)

table_object = database_metadata.tables.get("table_name")  # just to illustrate how to get the table_object
Issuing the count query on the table_object
query = table_object.count()
# This will produce something like the following, where id is a primary key
# column in "table_name" automatically selected by sqlalchemy:
# 'SELECT count(table_name.id) AS tbl_row_count FROM table_name'
count_result = database_engine.scalar(query)
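
Note that Table.count() was deprecated in SQLAlchemy 1.1 and later removed; an equivalent via the expression language would be:

from sqlalchemy import func, select

query = select([func.count()]).select_from(table_object)
count_result = database_engine.scalar(query)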
I'm not clear on what you mean by "without explicitly asking for it with execute()", so this might be exactly what you are not asking for.
OTOH, this might help others.
You can just run the textual SQL:
your_query="""
SELECT count(*) from table
"""
the_count = session.execute(text(your_query)).scalar()
The same idea, but parameterized (an f-string query is vulnerable to SQL injection):

from sqlalchemy import text

def test_query(val: str) -> int:
    query = text("SELECT count(*) FROM table WHERE col1 = :val")
    with database_engine.connect() as conn:
        return conn.execute(query, {"val": val}).scalar()
query = session.query(table.column).filter(...).with_entities(func.count(table.column.distinct()))
count = query.scalar()
This worked for me. It gives the query:
SELECT count(DISTINCT table.column) AS count_1
FROM table WHERE ...
Below is a way to find the count of any ORM query:

aliased_query = query.subquery()  # alias() expects a Core selectable; an ORM Query needs .subquery() first
db.session.query(func.count('*')).select_from(aliased_query).scalar()

See the SQLAlchemy reference documentation if you want to explore more options or read the details.
