In SQLAlchemy (PostgreSQL DB), I would like to create a bounded sum function, for lack of a better term. The goal is to create a running total that is kept within a defined range.
Currently, I have something that works great for calculating a running total without the bounds. Something like this:
from sqlalchemy.sql import func

foos = (
    db.query(
        Foo.id,
        Foo.points,
        Foo.timestamp,
        func.sum(Foo.points).over(order_by=Foo.timestamp).label('running_total')
    )
    .filter(...)
    .all()
)
However, I would like to be able to bound this running total to always be within a specific range, let's say [-100, 100]. So we would get something like this (see running_total):
{'timestamp': 1, 'points': 75, 'running_total': 75}
{'timestamp': 2, 'points': 50, 'running_total': 100}
{'timestamp': 3, 'points': -100, 'running_total': 0}
{'timestamp': 4, 'points': -50, 'running_total': -50}
{'timestamp': 5, 'points': -75, 'running_total': -100}
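Expressed as plain Python, just to pin down the clamping behaviour I'm after (a reference sketch, not the query itself):

def bounded_running_total(points, lo=-100, hi=100):
    # clamp the running total into [lo, hi] after every step
    total = 0
    for p in points:
        total = min(max(total + p, lo), hi)
        yield total

list(bounded_running_total([75, 50, -100, -50, -75]))
# -> [75, 100, 0, -50, -100]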
Any ideas?
Unfortunately, no built-in aggregate can help you achieve your expected output with window function calls.
You could get the expected output with manually calculating the rows one-by-one with a recursive CTE:
with recursive t as (
    (select *, points running_total
     from foo
     order by timestamp
     limit 1)
    union all
    (select foo.*, least(greatest(t.running_total + foo.points, -100), 100)
     from foo, t
     where foo.timestamp > t.timestamp
     order by foo.timestamp
     limit 1)
)
select timestamp,
       points,
       running_total
from t;
Unfortunately, this will be very hard to achieve with SQLAlchemy.
Your other option is, to write a custom aggregate for your specific needs, like:
create function bounded_add(int_state anyelement, next_value anyelement, next_min anyelement, next_max anyelement)
returns anyelement
immutable
language sql
as $func$
select least(greatest(int_state + next_value, next_min), next_max);
$func$;
create aggregate bounded_sum(next_value anyelement, next_min anyelement, next_max anyelement)
(
sfunc = bounded_add,
stype = anyelement,
initcond = '0'
);
With this, you just need to replace your call to sum with a call to bounded_sum:
select timestamp,
points,
bounded_sum(points, -100.0, 100.0) over (order by timestamp) running_total
from foo;
This latter solution will probably scale better too.
http://rextester.com/LKCUK93113
Note: my initial answer was wrong; see the edit below.
In raw SQL, you'd do this using the greatest and least functions.
Something like this:
LEAST(GREATEST(SUM(myfield) OVER (window_clause), lower_bound), upper_bound)
The SQLAlchemy expression language allows one to write that almost identically:
import sqlalchemy as sa
import sqlalchemy.ext.declarative as dec

base = dec.declarative_base()

class Foo(base):
    __tablename__ = 'foo'
    id = sa.Column(sa.Integer, primary_key=True)
    points = sa.Column(sa.Integer, nullable=False)
    timestamp = sa.Column('tstamp', sa.Integer)

upper_, lower_ = 100, -100
win_expr = sa.func.sum(Foo.points).over(order_by=Foo.timestamp)
bound_expr = sa.func.least(sa.func.greatest(win_expr, lower_), upper_).label('bounded_running_total')
stmt = sa.select([Foo.id, Foo.points, Foo.timestamp, bound_expr])

str(stmt)
# prints output:
# SELECT foo.id, foo.points, foo.tstamp, least(greatest(sum(foo.points) OVER (ORDER BY foo.tstamp), :greatest_1), :least_1) AS bounded_running_total
# FROM foo
# alternatively, using session.query you can also fetch results
from sqlalchemy.orm import sessionmaker

DB = sessionmaker()
db = DB()

foos_stmt = db.query(Foo.id, Foo.points, Foo.timestamp, bound_expr).filter(...)
str(foos_stmt)
# prints output:
# SELECT foo.id, foo.points, foo.tstamp, least(greatest(sum(foo.points) OVER (ORDER BY foo.tstamp), :greatest_1), :least_1) AS bounded_running_total
# FROM foo
foos = foos_stmt.all()
EDIT: As user #pozs points out in the comments, the above does not produce the intended results.
Two alternate approaches have been presented by #pozs. Here, I've adapted the first, recursive query approach, constructed via sqlalchemy.
import sqlalchemy as sa
import sqlalchemy.ext.declarative as dec
import sqlalchemy.orm as orm

base = dec.declarative_base()

class Foo(base):
    __tablename__ = 'foo'
    id = sa.Column(sa.Integer, primary_key=True)
    points = sa.Column(sa.Integer, nullable=False)
    timestamp = sa.Column('tstamp', sa.Integer)

upper_, lower_ = 100, -100

# anchor member: the chronologically first row, with its own points as the starting sum
t = sa.select([
    Foo.timestamp,
    Foo.points,
    Foo.points.label('bounded_running_sum')
]).order_by(Foo.timestamp).limit(1).cte('t', recursive=True)

t_aliased = orm.aliased(t, name='ta')

# recursive member: pick the next row by timestamp and clamp the sum to [lower_, upper_]
bounded_sum = t.union_all(
    sa.select([
        Foo.timestamp,
        Foo.points,
        sa.func.greatest(sa.func.least(Foo.points + t_aliased.c.bounded_running_sum, upper_), lower_)
    ]).where(Foo.timestamp > t_aliased.c.tstamp)
    .order_by(Foo.timestamp).limit(1)
)

stmt = sa.select([bounded_sum])
# inspect the query:
from sqlalchemy.dialects import postgresql
print(stmt.compile(dialect=postgresql.dialect(),
                   compile_kwargs={'literal_binds': True}))
# prints output:
# WITH RECURSIVE t(tstamp, points, bounded_running_sum) AS
# ((SELECT foo.tstamp, foo.points, foo.points AS bounded_running_sum
# FROM foo ORDER BY foo.tstamp
# LIMIT 1) UNION ALL (SELECT foo.tstamp, foo.points, greatest(least(foo.points + ta.bounded_running_sum, 100), -100) AS greatest_1
# FROM foo, t AS ta
# WHERE foo.tstamp > ta.tstamp ORDER BY foo.tstamp
# LIMIT 1))
# SELECT t.tstamp, t.points, t.bounded_running_sum
# FROM t
I used this link from the documentation as a reference to construct the above; it also highlights how one may use the session instead to work with recursive CTEs.
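For example, a session-based variant of the final select might look like this (a sketch; the engine binding is an assumption, not shown above):

session = orm.Session(bind=engine)  # assumption: `engine` has been configured elsewhere
for row in session.query(bounded_sum.c.tstamp,
                         bounded_sum.c.points,
                         bounded_sum.c.bounded_running_sum):
    print(row)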
This would be the pure sqlalchemy method to generate the required results.
The 2nd approach suggested by #pozs could also be used via sqlalchemy.
The solution would have to be a variant of this section from the documentation.
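For illustration, assuming the bounded_sum aggregate from the first answer has already been created in the database, the window call translates almost one-to-one, since custom aggregates are just generic functions to SQLAlchemy (a sketch, not tested here):

# assumes the bounded_sum(next_value, next_min, next_max) aggregate exists in the DB
bound_expr = sa.func.bounded_sum(Foo.points, lower_, upper_).over(
    order_by=Foo.timestamp).label('bounded_running_sum')
stmt = sa.select([Foo.id, Foo.points, Foo.timestamp, bound_expr])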
Related
I have a Postgres query (via SQLAlchemy) that selects matching rows using complex criteria:
original_query = session.query(SomeTable).filter(*complex_filters)
I don't know exactly how the query is constructed, I only have access to the resulting Query instance.
Now I want to use this "opaque" query (black-box for the purposes of this question) to construct other queries, from the same table using the exact same criteria, but with additional logic on top of the matched original_query rows. For example with SELECT DISTINCT(column) on top:
another_query = session.query(SomeTable.column).distinct().?select_from_query?(original_query)
or
SELECT SUM(tab_value) FROM (
    SELECT tab.key AS tab_key, tab.value AS tab_value  -- inner query, fixed
    FROM tab
    WHERE tab.product_id IN (1, 2)  -- simplified; the inner query is quite complex
) AS tbl
WHERE tab_key = 'length';
or
SELECT tab_key, COUNT(*) FROM (
    SELECT tab.key AS tab_key, tab.value AS tab_value
    FROM tab
    WHERE tab.product_id IN (1, 2)
) AS tbl
GROUP BY tab_key;
etc.
How to implement that ?select_from_query? part cleanly, in SQLAlchemy?
Basically, how to do SELECT dynamic FROM (SELECT fixed) in SqlAlchemy?
Motivation: the inner Query object comes from a different part of code. I don't have control over how it is constructed, and want to avoid duplicating its logic ad-hoc for each SELECT that I have to run on top of it. I want to re-use that query, but add additional logic on top (as per the examples above).
original_query is just a SQLAlchemy query API object; you can apply additional filters and criteria to it. The query API is generative: each Query() operation returns a new (immutable) instance, and your starting point (original_query) is unaffected.
This includes using Query.distinct() to add a DISTINCT() clause, Query.with_entities() to alter what columns are part of the query, and Query.values() to execute your query but return just specific single column values.
Use either .distinct(<column>).with_entities(<column>) to create a new query object (which can be further re-used):
another_query = original_query.distinct(SomeTable.column).with_entities(SomeTable.column)
or just use .distinct(<column>).values(<column>) to get an iterator of (column_value,) tuple results right there and then:
distinct_values = original_query.distinct(SomeTable.column).values(SomeTable.column)
Note that .values() executes the query immediately, like .all() would, while .with_entities() gives you back a new Query object with just the single column (and .all() or iteration or slicing would then execute and return the results).
Demo, using a contrived Foo model (executing against sqlite to make it easier to demo quickly):
>>> from sqlalchemy import *
>>> from sqlalchemy.ext.declarative import declarative_base
>>> from sqlalchemy.orm import sessionmaker
>>> Base = declarative_base()
>>> class Foo(Base):
...     __tablename__ = "foo"
...     id = Column(Integer, primary_key=True)
...     bar = Column(String)
...     spam = Column(String)
...
>>> engine = create_engine('sqlite:///:memory:', echo=True)
>>> session = sessionmaker(bind=engine)()
>>> Base.metadata.create_all(engine)
2019-06-10 13:10:43,910 INFO sqlalchemy.engine.base.Engine PRAGMA table_info("foo")
2019-06-10 13:10:43,910 INFO sqlalchemy.engine.base.Engine ()
2019-06-10 13:10:43,911 INFO sqlalchemy.engine.base.Engine
CREATE TABLE foo (
    id INTEGER NOT NULL,
    bar VARCHAR,
    spam VARCHAR,
    PRIMARY KEY (id)
)
2019-06-10 13:10:43,911 INFO sqlalchemy.engine.base.Engine ()
2019-06-10 13:10:43,913 INFO sqlalchemy.engine.base.Engine COMMIT
>>> original_query = session.query(Foo).filter(Foo.id.between(17, 42))
>>> print(original_query) # show what SQL would be executed for this query
SELECT foo.id AS foo_id, foo.bar AS foo_bar, foo.spam AS foo_spam
FROM foo
WHERE foo.id BETWEEN ? AND ?
>>> another_query = original_query.distinct(Foo.bar).with_entities(Foo.bar)
>>> print(another_query) # print the SQL again, don't execute
SELECT DISTINCT foo.bar AS foo_bar
FROM foo
WHERE foo.id BETWEEN ? AND ?
>>> distinct_values = original_query.distinct(Foo.bar).values(Foo.bar) # executes!
2019-06-10 13:10:48,470 INFO sqlalchemy.engine.base.Engine SELECT DISTINCT foo.bar AS foo_bar
FROM foo
WHERE foo.id BETWEEN ? AND ?
2019-06-10 13:10:48,470 INFO sqlalchemy.engine.base.Engine (17, 42)
In the above demo, the original query selects certain Foo instances with a BETWEEN filter; adding .distinct(Foo.bar).values(Foo.bar) then executes a query for just the DISTINCT foo.bar column, with the same BETWEEN filter in place. Similarly, by using .with_entities(), we were given a new query object for just that single column, but the filter is still part of that new query.
Your added example works just the same way; you don't actually need to have a sub-select there, as the same query can be expressed as:
SELECT sum(tab.value)
FROM tab
WHERE tab.product_id IN (1, 2) AND tab_key = 'length';
which can be achieved simply by adding extra filters and then using .with_entities() to replace the selected columns with your SUM():
summed_query = (
    original_query
    .filter(Tab.key == 'length')  # add a filter
    .with_entities(func.sum(Tab.value))
)
or, in terms of the above Foo demo:
>>> print(original_query.filter(Foo.spam == 42).with_entities(func.sum(Foo.bar)))
SELECT sum(foo.bar) AS sum_1
FROM foo
WHERE foo.id BETWEEN ? AND ? AND foo.spam = ?
There are use-cases for sub-queries (such as limiting results from a specific table in a join), but this is not one of those.
If you do need a sub-query, then the query API has Query.from_self() (for simpler cases) and Query.subquery().
For example, if you needed to select only aggregated rows from the original query and filter on the aggregated values via HAVING, and then join the results with another table for the highest row id for each group and some further filtering, then you need a subquery:
summed_col = func.sum(SomeTable.some_column)
max_id = func.max(SomeTable.primary_key)
summed_results_by_eggs = (
    original_query
    .with_entities(max_id, summed_col)  # only select highest id and the sum
    .group_by(SomeTable.other_column)   # per group
    .having(summed_col > 10)            # where the sum is high enough
    .from_self(summed_col)              # give us the summed value as a subselect
    .join(                              # join these rows with another table
        OtherTable,
        OtherTable.foreign_key == max_id  # using the highest id
    )
    .filter(OtherTable.some_column < 1000)  # and filter some more
)
The above selects only the summed SomeTable.some_column values where that sum is greater than 10, joining the highest SomeTable.id value of each group against the other table. This query has to use a sub-query, because you want to limit the eligible SomeTable rows before joining against the other table.
To demo this, I added a second table Eggs:
>>> from sqlalchemy.orm import relationship
>>> class Eggs(Base):
...     __tablename__ = "eggs"
...     id = Column(Integer, primary_key=True)
...     foo_id = Column(Integer, ForeignKey(Foo.id))
...     foo = relationship(Foo, backref="eggs")
...
>>> summed_col = func.sum(Foo.bar)
>>> max_id = func.max(Foo.id)
>>> print(
...     original_query
...     .with_entities(max_id, summed_col)
...     .group_by(Foo.spam)
...     .having(summed_col > 10)
...     .from_self(summed_col)
...     .join(Eggs, Eggs.foo_id == max_id)
...     .filter(Eggs.id < 1000)
... )
SELECT anon_1.sum_2 AS sum_1
FROM (SELECT max(foo.id) AS max_1, sum(foo.bar) AS sum_2
FROM foo
WHERE foo.id BETWEEN ? AND ? GROUP BY foo.spam
HAVING sum(foo.bar) > ?) AS anon_1 JOIN eggs ON eggs.foo_id = anon_1.max_1
WHERE eggs.id < ?
The Query.from_self() method takes new entities to use in the outer query; if you omit those, then all columns are pulled out. In the above I pulled out the summed column value; without that argument, the MAX(Foo.id) column would also be selected.
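A quick sketch of that difference (not run here), reusing the Foo model from the demo:

inner = session.query(Foo.id, Foo.bar).filter(Foo.spam == 'x')
print(inner.from_self())         # no entities passed: both id and bar are kept
print(inner.from_self(Foo.bar))  # only the bar column is pulled out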
I have some time series data, organized as sets of time series where each Timeseries instance has a one-to-many relationship with Point instances. Below is a simplified representation of the data.
tables.py:
class Timeseries(Base):
    __tablename__ = "timeseries"
    id = Column("id", Integer, primary_key=True)
    points = relationship("Point", back_populates="ts")

class Point(Base):
    __tablename__ = "point"
    id = Column("id", Integer, primary_key=True)
    t = Column("t", Float)
    v = Column("v", Float)
    ts_id = Column(Integer, ForeignKey("timeseries.id"))
    ts = relationship("Timeseries", back_populates="points")
Question: I'm trying to come up with a query with these kinds of columns: "timeseries_id", "id", "t", "v", "id_next", "t_next", "v_next". That is, I want to be able to see each point's data alongside the next point's data in the time series, in chronological order, but I've been struggling to get a table that doesn't pull in extra elements from an implicit join. (Edit: An important point is that I want to be able to get this list using 100% queries and subquery objects in sqlalchemy, because I need to use this queried table in further joins, filters, etc.) Here's the basic start of what I got. (Note that I haven't run this code, since this is a simplified version of my actual database, but it's the same idea.)
# The point data actually in the database.
sq = (session.query(
        Timeseries.id.label("timeseries_id"),
        Point.id,
        Point.t,
        Point.v)
    .select_from(
        join(Timeseries, Point, Timeseries.id == Point.ts_id))
    .group_by('timeseries_id')
    .subquery())

# first point manually added to each list in query
sq_first = (session.query(
        Timeseries.id.label("timeseries_id"),
        sa.literal_column("-1", Integer).label("id"),  # Some unused Point.id value
        sa.literal_column(-math.inf, Float).label("t"),
        sa.literal_column(-math.inf, Float).label("v"))
    .select_from(
        join(Timeseries, Point, Timeseries.id == Point.ts_id))
    .subquery())

# last point manually added to each list in query.
sq_last = (session.query(
        Timeseries.id.label("timeseries_id"),
        sa.literal_column("-2", Integer).label("id"),  # Another unused Point.id value
        sa.literal_column(math.inf, Float).label("t"),
        sa.literal_column(math.inf, Float).label("v"))
    .select_from(
        join(Timeseries, Point, Timeseries.id == Point.ts_id))
    .subquery())

# Append each timeseries in `sq` table with last point
sq_points_curr = session.query(sa.union_all(sq_first, sq)).subquery()
sq_points_next = session.query(sa.union_all(sq, sq_last)).subquery()
Assuming what I've done so far is useful, this is the part where I get stuck:
# I guess rename the columns in `sq_points_next`, appending "_next" to them...
sq_points_next = (session.query(
        sq_points_curr.c.timeseries_id,
        sq_points_curr.c.id.label("id_next"),
        sq_points_curr.c.t.label("t_next"),
        sq_points_curr.c.v.label("v_next"))
    .subquery())

# ... and then perform a join along "timeseries_id" somehow to get the table I originally wanted...
sq_point_pairs = (session.query(
        Timeseries.id.label("timeseries_id"),
        "id",
        "t",
        "v",
        "id_next",
        "t_next",
        "v_next")
    .select_from(
        sq_points, sq_points_next, sq_points.timeseries_id == sq_points_next.timeseries_id)
)
I'm not even sure this last part would compile at this point, since again it is adapted/simplified from real code, but it doesn't yield a table of adjacent points in time, etc.
Edit (August 10, 2019):
The following simplified query from Nathan is most certainly the right approach and close to working, but it raises errors for sqlite.
sq = session.query(
    Timeseries.id.label("timeseries_id"),
    Point.t.label("point_t"),
    func.lead(Point.t).over().label('point_after_t')
).select_from(
    join(Timeseries, Point, Timeseries.id == Point.ts_id)
).order_by(Timeseries.id)

print(sq.all())
Assuming you can get a recent enough version of the sqlite3 python module working (e.g. by using Anaconda), you can use the LEAD window function to accomplish your goal. In order to use the results of the LEAD function in further queries, you'll need to use a CTE as well. The following approach worked for me with the schema you gave:
sq = session.query(
    Timeseries.id.label("timeseries_id"),
    Point.id.label("point_id"),
    Point.t.label("point_t"),
    Point.v.label("point_v"),
    func.lead(Point.id).over().label('point_after_id'),
    func.lead(Point.v).over().label('point_after_v'),
    func.lead(Point.t).over().label('point_after_t')
).select_from(
    join(Timeseries, Point, Timeseries.id == Point.ts_id)
).order_by(Timeseries.id)

with_after = sq.cte()
session.execute(with_after.select().where(
    with_after.c.point_v < with_after.c.point_after_v)).fetchall()
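Note that the over() calls above have no partitioning or ordering, so LEAD can pair a point with the first point of the next timeseries. If that matters (my assumption about the desired semantics), a partitioned, ordered window avoids it; each lead() call would become:

func.lead(Point.t).over(
    partition_by=Timeseries.id,  # keep LEAD within a single timeseries
    order_by=Point.t             # "next" means next in time
).label('point_after_t')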
Rather than jumping through hoops to get the query to produce the paired results you are looking for, why not just retrieve all the points data related to a particular Timeseries row and then recombine the data into the pairs you are looking for? For example:
from operator import attrgetter

def to_dict(a, b):
    # formats a pair of points rows into a dict object
    return {
        'timeseries_id': a.ts_id,
        'id': a.id, 't': a.t, 'v': a.v,
        'id_next': b.id, 't_next': b.t, 'v_next': b.v
    }

def timeseries_pairs(session, ts_id):
    # queries the db for a particular Timeseries row, and combines points pairs
    ts = session.query(Timeseries).\
        filter(Timeseries.id == ts_id).\
        first()
    ts.points.sort(key=attrgetter('t'))
    pairs = [to_dict(a, b) for a, b in zip(ts.points, ts.points[1:])]
    last = ts.points[-1]
    pairs.append({
        'timeseries_id': last.ts_id,
        'id': last.id, 't': last.t, 'v': last.v,
        'id_next': None, 't_next': None, 'v_next': None
    })
    return pairs

# pass the session and a timeseries id to return a list of points pairs
timeseries_pairs(session, 1)
I am trying to use a temp table with SQLAlchemy and join it against an existing table. This is what I have so far:
engine = db.get_engine(db.app, 'MY_DATABASE')
df = pd.DataFrame({"id": [1, 2, 3], "value": [100, 200, 300], "date": [date.today(), date.today(), date.today()]})

temp_table = db.Table('#temp_table',
                      db.Column('id', db.Integer),
                      db.Column('value', db.Integer),
                      db.Column('date', db.DateTime))
temp_table.create(engine)
df.to_sql(name='tempdb.dbo.#temp_table',
          con=engine,
          if_exists='append',
          index=False)

query = db.session.query(ExistingTable.id).join(temp_table, temp_table.c.id == ExistingTable.id)
out_df = pd.read_sql(query.statement, engine)
temp_table.drop(engine)
return out_df.to_dict('records')
This doesn't return any results because the insert statements that to_sql does don't get run (I think this is because they are run using sp_prepexec, but I'm not entirely sure about that).
I then tried just writing out the SQL statement (CREATE TABLE #temp_table..., INSERT INTO #temp_table..., SELECT [id] FROM...) and then running pd.read_sql(query, engine). I get the error message
This result object does not return rows. It has been closed automatically.
I guess this is because the statement does more than just SELECT?
How can I fix this issue? (Either solution would work, although the first would be preferable as it avoids hard-coded SQL.) To be clear, I can't modify the schema in the existing database; it's a vendor database.
In case the number of records to be inserted in the temporary table is small/moderate, one possibility would be to use a literal subquery or a values CTE instead of creating a temporary table.
# MODEL
class ExistingTable(Base):
    __tablename__ = 'existing_table'
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String)
    # ...
Assume also that the following data is to be inserted into the temp table:
# This data is retrieved from another database and used for filtering
rows = [
    (1, 100, datetime.date(2017, 1, 1)),
    (3, 300, datetime.date(2017, 3, 1)),
    (5, 500, datetime.date(2017, 5, 1)),
]
Create a CTE or a sub-query containing that data:
stmts = [
    # NOTE: optimization to reduce the size of the statement:
    # make the type cast only for the first row; for the other rows the DB engine will infer the type
    sa.select([
        sa.cast(sa.literal(i), sa.Integer).label("id"),
        sa.cast(sa.literal(v), sa.Integer).label("value"),
        sa.cast(sa.literal(d), sa.DateTime).label("date"),
    ]) if idx == 0 else
    sa.select([sa.literal(i), sa.literal(v), sa.literal(d)])  # no type cast
    for idx, (i, v, d) in enumerate(rows)
]
subquery = sa.union_all(*stmts)

# Choose one option below.
# I personally prefer B because one could reuse the CTE multiple times in the same query.
# subquery = subquery.alias("temp_table")  # option A
subquery = subquery.cte(name="temp_table")  # option B
Create the final query with the required joins and filters:
query = (
    session
    .query(ExistingTable.id)
    .join(subquery, subquery.c.id == ExistingTable.id)
    # .filter(subquery.c.date >= XXX_DATE)
)

# TEMP: Test result output
for res in query:
    print(res)
Finally, get the pandas data frame:
out_df = pd.read_sql(query.statement, engine)
result = out_df.to_dict('records')
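As an aside, on SQLAlchemy 1.4+ the same idea can be written with the built-in values() construct instead of a hand-rolled UNION ALL (a sketch, assuming that version is available):

from sqlalchemy import column, values

temp_table = values(
    column("id", sa.Integer),
    column("value", sa.Integer),
    column("date", sa.Date),
    name="temp_table",
).data(rows)

query = (
    session
    .query(ExistingTable.id)
    .join(temp_table, temp_table.c.id == ExistingTable.id)
)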
You can try another solution: a process-keyed table.
A process-keyed table is simply a permanent table that serves as a temp table. To permit processes to use the table simultaneously, the table has an extra column to identify the process. The simplest way to do this is the global variable @@spid (@@spid is the process id in SQL Server).
...
One alternative for the process-key is to use a GUID (data type uniqueidentifier).
http://www.sommarskog.se/share_data.html#prockeyed
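A rough sketch of that pattern (table and column names here are illustrative, not from the linked article):

import uuid
import sqlalchemy as sa

# permanent table acting as a shared temp table; process_key marks row ownership
process_keyed = sa.Table(
    'process_keyed', sa.MetaData(),
    sa.Column('process_key', sa.String(36)),  # e.g. a GUID per process
    sa.Column('id', sa.Integer),
    sa.Column('value', sa.Integer),
    sa.Column('date', sa.DateTime),
)

my_key = str(uuid.uuid4())
# insert rows tagged with my_key, join against them, then clean up:
# DELETE FROM process_keyed WHERE process_key = :my_key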
Consider the following query codified via SQLAlchemy.
# Create a CTE that performs a join and gets some values
x_cte = session.query(SomeTable.col1
                     ,OtherTable.col5
                     ) \
               .select_from(SomeTable) \
               .join(OtherTable, SomeTable.col2 == OtherTable.col3) \
               .filter(OtherTable.col6 == 34) \
               .cte(name='x')

# Create a subquery that splits the CTE based on the value of col1
# and computes the quartile for positive col1 and assigns a dummy
# "quartile" for negative and zero col1
subquery = session.query(x_cte
                        ,literal('-1', sqlalchemy.INTEGER).label('quartile')
                        ) \
                  .filter(x_cte.c.col1 <= 0) \
                  .union_all(session.query(x_cte
                                          ,sqlalchemy.func.ntile(4).over(order_by=x_cte.c.col1).label('quartile')
                                          )
                                    .filter(x_cte.c.col1 > 0)
                  ) \
                  .subquery()

# Compute some aggregate values for each quartile
result = session.query(sqlalchemy.func.avg(subquery.columns.x_col1)
                      ,sqlalchemy.func.avg(subquery.columns.x_col5)
                      ,subquery.columns.x_quartile
                      ) \
                .group_by(subquery.columns.x_quartile) \
                .all()
Sorry for the length, but this is similar to my real query. In my real code, I've given a more descriptive name to my CTE, and my CTE has far more columns for which I must compute the average. (It's also actually a weighted average weighted by a column in the CTE.)
The real "problem" is purely one of trying to keep my code more clear and shorter. (Yes, I know. This query is already a monster and hard to read, but the client insists on this data being available.) Notice that in the final query, I must refer to my columns as subquery.columns.x_[column name]; this is because SQLAlchemy is prefixing my column name with the CTE name. I would just like for SQLAlchemy to leave off my CTE's name when generating column names, but since I have many columns, I would prefer not to list them individually in my subquery. Leaving off the CTE name would make my column names (which are long enough on their own) shorter and slightly more readable; I can guarantee that the columns are unique. How can I do this?
Using Python 2.7.3 with SQLAlchemy 0.7.10.
You're not being too specific about what "x_" is here, but if that's the final result, use label() to give the result columns whatever name you want:
row = session.query(func.avg(foo).label('foo_avg'), func.avg(bar).label('bar_avg')).first()

foo_avg = row['foo_avg']  # indexed access
bar_avg = row.bar_avg     # attribute access
Edit: I'm not able to reproduce the "x_" here. Here's a test:
from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    x = Column(Integer)
    y = Column(Integer)

s = Session()

subq = s.query(A).cte(name='x')

subq2 = s.query(subq, (subq.c.x + subq.c.y)).filter(A.x == subq.c.x).subquery()

print s.query(A).join(subq2, A.id == subq2.c.id).\
    filter(subq2.c.x == A.x, subq2.c.y == A.y)
Above, you can see I can refer to subq2.c.<colname> without issue; there is no "x" prepended. If you can specify your SQLAlchemy version information and fill out your example fully, I can run it as-is in order to reproduce your issue.
I need to query multiple entities, something like session.query(Entity1, Entity2), only from a subquery rather than directly from the tables. The docs have something about selecting one entity from a subquery but I can't find how to select more than one, either in the docs or by experimentation.
My use case is that I need to filter the tables underlying the mapped classes by a window function, which in PostgreSQL can only be done in a subquery or CTE.
EDIT: The subquery spans a JOIN of both tables so I can't just do aliased(Entity1, subquery).
from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    bs = relationship("B")

class B(Base):
    __tablename__ = "b"
    id = Column(Integer, primary_key=True)
    a_id = Column(Integer, ForeignKey('a.id'))

e = create_engine("sqlite://", echo=True)
Base.metadata.create_all(e)

s = Session(e)
s.add_all([A(bs=[B(), B()]), A(bs=[B()])])
s.commit()
# with_labels() here is to disambiguate A.id and B.id.
# without it, you'd see a warning
# "Column 'id' on table being replaced by another column with the same key."
subq = s.query(A, B).join(A.bs).with_labels().subquery()
# method 1 - select_from()
print s.query(A, B).select_from(subq).all()
# method 2 - alias them both. "subq" renders
# once because FROM objects render based on object
# identity.
a_alias = aliased(A, subq)
b_alias = aliased(B, subq)
print s.query(a_alias, b_alias).all()
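Either way, iterating the query yields (A, B) pairs of ordinary mapped objects, e.g.:

for a, b in s.query(a_alias, b_alias):
    print a.id, b.id, b.a_id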
I was trying to do something like the original question: join a filtered table with another filtered table using an outer join. I was struggling because it's not at all obvious how to:
1. create a SQLAlchemy query that returns entities from both tables. #zzzeek's answer showed me how to do that: get_session().query(A, B).
2. use a query as a table in such a query. #zzzeek's answer showed me how to do that too: filtered_a = aliased(A, A.query().filter(...).subquery()).
3. use an OUTER join between the two entities. Using select_from() after outerjoin() destroys the join condition between the tables, resulting in a cross join. From #zzzeek's answer I guessed that if a is aliased(), then you can include a in the query() and also .outerjoin(a), and it won't be joined a second time, and that appears to work.
Following either of #zzzeek's suggested approaches directly resulted in a cross join (combinatorial explosion), because one of my models uses inheritance, and SQLAlchemy added the parent tables outside the inner SELECT without any conditions! I think this is a bug in SQLAlchemy. The approach that I adopted in the end was:
filtered_a = aliased(A, A.query().filter(...).subquery("filtered_a"))
filtered_b = aliased(B, B.query().filter(...).subquery("filtered_b"))
query = get_session().query(filtered_a, filtered_b)
query = query.outerjoin(filtered_b, filtered_a.relation_to_b)
query = query.order_by(filtered_a.some_column)
for a, b in query:
...