SELECT DISTINCT FROM <subquery> GROUP BY HAVING Query in SQLAlchemy - python

I want to write a query in SQLAlchemy that is the equivalent of the following:
SELECT DISTINCT <column>
FROM <subquery> AS result
GROUP BY <columns>
HAVING COUNT(*) = 1
I already have the subquery that I want to use, but I am not sure how to add the rest of it. How can I implement this?
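One way this might be written (a sketch only; existing_query, session and the column names are placeholders for whatever you already have) is to wrap the subquery and chain distinct(), group_by() and having() onto it:
from sqlalchemy import func

# Sketch only: `existing_query` stands in for the subquery already built,
# and the column names below are placeholders.
result = existing_query.subquery('result')

q = (
    session.query(result.c.some_column)         # SELECT DISTINCT <column>
    .distinct()
    .group_by(result.c.col_a, result.c.col_b)   # GROUP BY <columns>
    .having(func.count() == 1)                  # HAVING COUNT(*) = 1
)
rows = q.all()
Note that func.count() with no arguments renders as COUNT(*).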

Related

count subquery in sqlalchemy

I'm having some trouble translating a subquery into sqlalchemy. I have two tables that both have a store_id column that is a foreign key (but it isn't a direct many-to-many relationship), and I need to return the id, store_id and name from table 1 along with the number of records from table 2 that have the same store_id. I know the SQL that I would use to return those records; I'm just not sure how to do it using sqlalchemy.
SELECT
    table_1.id,
    table_1.store_id,
    table_1.name,
    (
        SELECT count(table_2.id)
        FROM table_2
        WHERE table_1.store_id = table_2.store_id
    ) AS store_count
FROM table_1;
This post actually answered my question. I must have missed it when I was searching initially. My solution below.
Generate sql with subquery as a column in select statement using SQLAlchemy
from sqlalchemy import func

store_count = session.query(func.count(Table2.id)).filter(Table2.store_id == Table1.store_id)
results = session.query(Table1.id, Table1.name, Table1.store_id, store_count.label("store_count"))
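As a usage sketch (not part of the original answer), the label() call is what renders the inner Query as a correlated scalar column in the SELECT list, and the correlation can be made explicit with correlate():
from sqlalchemy import func

store_count = (
    session.query(func.count(Table2.id))
    .filter(Table2.store_id == Table1.store_id)
    .correlate(Table1)          # reference the outer table_1 instead of joining it
    .label("store_count")       # render as a scalar column in the SELECT list
)

for id_, store_id, name, count_ in session.query(
    Table1.id, Table1.store_id, Table1.name, store_count
):
    print(id_, store_id, name, count_)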

Update Django model based on the row number of rows produced by a subquery on the same model

I have a PostgreSQL UPDATE query which updates a field (global_ranking) of every row in a table, based on the ROW_NUMBER() of each row in that same table sorted by another field (rating). Additionally, the update is partitioned, so that the ranking of each row is relative only to those rows which belong to the same language.
In short, I'm updating the ranking of each player in a game, based on their current rating.
The PostgreSQL query looks like this:
UPDATE stats_userstats
SET
global_ranking = sub.row_number
FROM (
SELECT id, ROW_NUMBER() OVER (
PARTITION BY language
ORDER BY rating DESC
) AS row_number
FROM stats_userstats
) sub
WHERE stats_userstats.id = sub.id;
I'm also using Django, and it'd be fun to learn how to express this query using the Django ORM, if possible.
At first, it seemed like Django had everything necessary to express the query, including the ability to use PostgreSQL's ROW_NUMBER() windowing function, but my best attempt sets every row's ranking to 1:
from django.db.models import F, OuterRef, Subquery
from django.db.models.expressions import Window
from django.db.models.functions import RowNumber
UserStats.objects.update(
    global_ranking=Subquery(
        UserStats.objects.filter(
            id=OuterRef('id')
        ).annotate(
            row_number=Window(
                expression=RowNumber(),
                partition_by=[F('language')],
                order_by=F('rating').desc()
            )
        ).values('row_number')
    )
)
I used from django.db import connection; print(connection.queries) to see the query produced by that Django ORM statement, and got this vaguely similar SQL statement:
UPDATE "stats_userstats"
SET "global_ranking" = (
SELECT ROW_NUMBER() OVER (
PARTITION BY U0."language"
ORDER BY U0."rating" DESC
) AS "row_number"
FROM "stats_userstats" U0
WHERE U0."id" = "stats_userstats"."id"
It looks like what I need to do is move the subquery from the SET portion of the query to the FROM, but it's unclear to me how to restructure the Django ORM statement to achieve that.
Any help is greatly appreciated. Thank you!
Subquery filters the queryset by the provided OuterRef, so you're always getting 1: each user is, in fact, first in any ranking when only that one user is considered.
A "correct" query would be:
UserStats.objects.alias(
    row_number=Window(
        expression=RowNumber(),
        partition_by=[F('language')],
        order_by=F('rating').desc()
    )
).update(global_ranking=F('row_number'))
But Django will not allow that:
django.core.exceptions.FieldError: Window expressions are not allowed in this query
Related Django ticket: https://code.djangoproject.com/ticket/25643
I think you might comment there with your use case.
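One workaround (a sketch only, not part of the original answer) is to read the rankings with an annotated query, where window expressions are allowed, and write them back with bulk_update(). It is heavier than a single UPDATE ... FROM, but it avoids the FieldError:
from django.db.models import F
from django.db.models.expressions import Window
from django.db.models.functions import RowNumber

# Window expressions are allowed in SELECT queries, so compute the ranking
# there and write it back in bulk.
ranked = UserStats.objects.annotate(
    row_number=Window(
        expression=RowNumber(),
        partition_by=[F('language')],
        order_by=F('rating').desc(),
    )
)

to_update = []
for stats in ranked:
    stats.global_ranking = stats.row_number
    to_update.append(stats)

UserStats.objects.bulk_update(to_update, ['global_ranking'])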

sqlalchemy using INTERSECT and UNNEST

I'm trying to translate a raw SQL query to SQLAlchemy Core/ORM, but I'm having some difficulties. Here is the SQL:
SELECT
    (SELECT UNNEST(MyTable.my_array_column)
     INTERSECT
     SELECT UNNEST(ARRAY['VAL1', 'VAL2']::varchar[])) AS matched
FROM
    MyTable
WHERE
    my_array_column && ARRAY['VAL1', 'VAL2']::varchar[];
The following query, gives me a FROM clause which I don't need in my nested SELECT:
matched = select([func.unnest(MyTable.my_array_column)]).intersect(select([func.unnest('VAL1', 'VAL2')]))
# SELECT unnest(MyTable.my_array_column) AS unnest_1
# FROM MyTable INTERSECT SELECT unnest(%(unnest_3)s, %(unnest_4)s) AS unnest_2
How can I tell the select not to include the FROM clause? Note that passing a plain string to func.unnest() binds it as a literal parameter rather than referencing the column, so I cannot simply use func.unnest('my_array_column').
Referring to a table of an enclosing query in a subquery is the process of correlation, which SQLAlchemy attempts to do automatically. In this case it doesn't quite work, I believe, because your INTERSECT query is a "selectable", not a scalar value, so SQLAlchemy puts it in the FROM list instead of the SELECT list.
The solution is twofold. We need to make SQLAlchemy put the INTERSECT query in the SELECT list by applying a label, and make it correlate MyTable correctly:
select([
    select([func.unnest(MyTable.my_array_column)])
    .correlate(MyTable)
    .intersect(select([func.unnest('VAL1', 'VAL2')]))
    .label("matched")
]).select_from(MyTable)
# SELECT (SELECT unnest("MyTable".my_array_column) AS unnest_1
#         INTERSECT
#         SELECT unnest(%(unnest_3)s, %(unnest_4)s) AS unnest_2) AS matched
# FROM "MyTable"

SqlAlchemy select with max, group_by and order_by

I have to list the last modified resources for each group; for that I can do this query:
model.Session.query(
    model.Resource, func.max(model.Resource.last_modified)
).group_by(model.Resource.resource_group_id).order_by(
    model.Resource.last_modified.desc())
But SqlAlchemy complains with:
ProgrammingError: (ProgrammingError) column "resource.id" must appear in
the GROUP BY clause or be used in an aggregate function
How I can select only resource_group_id and last_modified columns?
In SQL what I want is this:
SELECT resource_group_id, max(last_modified) AS max_1
FROM resource GROUP BY resource_group_id ORDER BY max_1 DESC
model.Session.query(
    model.Resource.resource_group_id, func.max(model.Resource.last_modified)
).group_by(model.Resource.resource_group_id).order_by(
    func.max(model.Resource.last_modified).desc())
You already got it, but I'll try to explain what's going on with the original query for future reference.
In SQLAlchemy, if you specify query(model.Resource, ...) with a model reference, it lists every column of the resource table in the generated SQL SELECT statement, so your original query would look something like:
SELECT resource.resource_group_id AS resource_group_id,
       resource.extra_column1 AS extra_column1,
       resource.extra_column2 AS extra_column2,
       ...
       max(resource.last_modified) AS max_1
FROM resource
GROUP BY resource_group_id ORDER BY max_1 DESC;
This won't work with a GROUP BY.
A common way to avoid this is to specify explicitly which columns you want to select, by adding them to the query method: .query(model.Resource.resource_group_id).
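If you also need the full Resource rows behind each group's latest modification (the goal stated at the top of this question), one possible sketch, not from the original answer, is to join the aggregate back against the table:
from sqlalchemy import func

# Sketch only: join the per-group maximum back to resource to fetch whole rows.
latest = model.Session.query(
    model.Resource.resource_group_id.label('gid'),
    func.max(model.Resource.last_modified).label('max_modified'),
).group_by(model.Resource.resource_group_id).subquery()

rows = model.Session.query(model.Resource).join(
    latest,
    (model.Resource.resource_group_id == latest.c.gid)
    & (model.Resource.last_modified == latest.c.max_modified),
).all()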

Delete rows without a related record using SQLAlchemy

I have 2 tables; we'll call them table1 and table2. table2 has a foreign key to table1. I need to delete the rows in table1 that have zero child records in table2. The SQL to do this is pretty straightforward:
DELETE FROM table1
WHERE 0 = (SELECT COUNT(*) FROM table2 WHERE table2.table1_id = table1.table1_id);
However, I haven't been able to find a way to translate this query to SQLAlchemy. Trying the straightforward approach:
subquery = session.query(sqlfunc.count(Table2).label('t2_count')).select_from(Table2).filter(Table2.table1_id == Table1.table1_id).subquery()
session.query(Table1).filter(0 == subquery.columns.t2_count).delete()
Just yielded an error:
sqlalchemy.exc.ArgumentError: Only deletion via a single table query is currently supported
How can I perform this DELETE with SQLAlchemy?
Python 2.7
PostgreSQL 9.2.4
SQLAlchemy 0.7.10 (Cannot upgrade due to using GeoAlchemy, but am interested if newer versions would make this easier)
I'm pretty sure this is what you want. You should try it out though. It uses EXISTS.
from sqlalchemy.sql import not_

# This fetches rows in python to determine which ones were removed.
Session.query(Table1).filter(not_(Table1.table2s.any())).delete(
    synchronize_session='fetch')

# If you will not be referencing more Table1 objects in this session then you
# can just ignore syncing the session.
Session.query(Table1).filter(not_(Table1.table2s.any())).delete(
    synchronize_session=False)
Explanation of the argument for delete():
http://docs.sqlalchemy.org/en/rel_0_8/orm/query.html#sqlalchemy.orm.query.Query.delete
Example with exists(using any() above uses EXISTS):
http://docs.sqlalchemy.org/en/rel_0_8/orm/tutorial.html#using-exists
Here is the SQL that should be generated:
DELETE FROM table1 WHERE NOT (EXISTS (SELECT 1
FROM table2
WHERE table1.id = table2.table1_id))
If you are using declarative, I think there is a way to access the underlying Table2 table object, and then you could just use the SQL layer of sqlalchemy to do exactly what you want (a sketch follows below), although you run into the same issue of your Session getting out of sync.
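For example, a rough sketch of that Core-level delete (written against a more recent SQLAlchemy than the 0.7 used here, so treat it as an assumption rather than something verified on 0.7):
from sqlalchemy import exists, not_

# __table__ gives the Table object behind each declarative class.
t1 = Table1.__table__
t2 = Table2.__table__

stmt = t1.delete().where(
    not_(exists().where(t2.c.table1_id == t1.c.table1_id))
)
session.execute(stmt)
# A Core delete bypasses the Session's identity map, so expire or discard any
# already-loaded Table1 objects afterwards.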
Well, I found one very ugly way to do it. You can do a select with a join to get the rows loaded into memory, then you can delete them individually:
subquery = session.query(Table2.table1_id,
                         sqlalchemy.func.count(Table2.table2_id).label('t1count')) \
    .select_from(Table2) \
    .group_by(Table2.table1_id) \
    .subquery()

rows = session.query(Table1) \
    .select_from(Table1) \
    .outerjoin(subquery, Table1.table1_id == subquery.c.table1_id) \
    .filter(subquery.c.t1count == None) \
    .all()

for r in rows:
    session.delete(r)
This is not only nasty to write, it's also pretty nasty performance-wise. For starters, you have to bring the table1 rows into memory. Second, if you were like me and had a line like this on Table2's class definition:
table1 = orm.relationship(Table1, backref=orm.backref('table2s'))
then SQLAlchemy will actually perform a query to pull the related table2 rows into memory, too (even though there aren't any). Even worse, because you have to loop over the list (I tried just passing in the list; didn't work), it does so one table1 row at a time. So if you're deleting 10 rows, it's 21 individual queries (1 for the initial select, 1 for each relationship pull, and 1 for each delete). Maybe there are ways to mitigate that; I would have to go through the documentation to see. All this for things I don't even want in my database, much less in memory.
I won't mark this as the answer. I want a cleaner, more efficient way of doing this, but this is all I have for now.
