In SQL, I can sum two counts like
SELECT (
(SELECT count(*) FROM a WHERE val=42)
+
(SELECT count(*) FROM b WHERE val=42)
)
How do I perform this query with the Django ORM?
The closest I got is
a.objects.filter(val=42).order_by().values_list('id', flat=True).union(
b.objects.filter(val=42).order_by().values_list('id', flat=True)
).count()
This works fine if the returned count is small, but it seems bad if there are a lot of rows that the database must hold in memory just to count them.
Your solution can be only slightly simplified by using values('pk') instead of values_list('id', flat=True); this affects only the type of the output rows, while the underlying SQL of both querysets is the same:
SELECT id FROM a WHERE val=42 UNION SELECT id FROM b WHERE val=42
and the .count() method only wraps a query around the subquery:
SELECT COUNT(*) FROM (... subquery ...)
It is not necessary for a database backend to hold all the values in memory; it can count them as it reads and then discard them (not verified).
Similarly, if you run a simple SELECT COUNT(id) FROM a, the database doesn't need to collect the id values.
Subqueries of the form SELECT count(*) FROM a WHERE val=42 inside a bigger query are not possible, because Django doesn't use lazy evaluation for aggregations and evaluates them immediately.
The evaluation can be postponed, e.g. by grouping by an expression that has only one possible value, such as GROUP BY (i >= 0) (or by an outer reference, if that worked), but the query plan can be worse.
Another problem is that a SELECT is not possible without a table. Therefore I use an unimportant row of an unimportant table as the base of the query.
Example:
qs = Unimportant.objects.filter(pk=unimportant_pk).values('id').annotate(
total_a=a.objects.filter(val=42).order_by().values('val')
.annotate(cnt=models.Count('*')).values('cnt'),
total_b=b.objects.filter(val=42).order_by().values('val')
.annotate(cnt=models.Count('*')).values('cnt')
)
It is not pretty, but the two count subqueries can easily be evaluated in parallel by the database:
SELECT
id,
(SELECT COUNT(*) AS cnt FROM a WHERE val=42 GROUP BY val) AS total_a,
(SELECT COUNT(*) AS cnt FROM b WHERE val=42 GROUP BY val) AS total_b
FROM unimportant WHERE id = unimportant_pk
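Reading the result is then a single row fetch; a minimal usage sketch (the final addition happens in Python, and the `or 0` handles the NULL that an empty group produces):
row = qs.get()  # exactly one row, thanks to pk=unimportant_pk
total = (row['total_a'] or 0) + (row['total_b'] or 0)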
The Django docs confirm that a simpler solution doesn't exist:
Using aggregates within a Subquery expression
...
... This is the only way to perform an aggregation within a Subquery, as using aggregate() attempts to evaluate the queryset (and if there is an OuterRef, this will not be possible to resolve).
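For reference, the pattern the docs describe looks roughly like this; a sketch with hypothetical Post and Comment models (not from the question above):
from django.db.models import Count, OuterRef, Subquery

# group the inner queryset by the outer reference, then aggregate it
comments = Comment.objects.filter(post=OuterRef('pk')).order_by().values('post')
total_comments = comments.annotate(total=Count('pk')).values('total')
posts = Post.objects.annotate(comment_count=Subquery(total_comments))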
Related
I have three queries and another table called output_table. This code works, but it needs to be executed in a single REPLACE INTO query (query 1). I know this involves nesting and subqueries, but I have no idea if this is possible, since my key is the set of DISTINCT coin values from target_currency.
How can I rewrite queries 2 and 3 so that they execute within query 1, i.e. in the REPLACE INTO query instead of the individual UPDATE statements?
1. conn3.cursor().execute(
"""REPLACE INTO coin_best_returns(coin) SELECT DISTINCT target_currency FROM output_table"""
)
2. conn3.cursor().execute(
"""UPDATE coin_best_returns SET
highest_price = (SELECT MAX(ask_price_usd) FROM output_table WHERE coin_best_returns.coin = output_table.target_currency),
lowest_price = (SELECT MIN(bid_price_usd) FROM output_table WHERE coin_best_returns.coin = output_table.target_currency)"""
)
3. conn3.cursor().execute(
"""UPDATE coin_best_returns SET
highest_market = (SELECT exchange FROM output_table WHERE coin_best_returns.highest_price = output_table.ask_price_usd),
lowest_market = (SELECT exchange FROM output_table WHERE coin_best_returns.lowest_price = output_table.bid_price_usd)"""
)
You can do it with the help of some window functions, a subquery, and an inner join. The version below is pretty lengthy, but it is less complicated than it may appear. It uses window functions in a subquery to compute the needed per-currency statistics, and factors this out into a common table expression to facilitate joining it to itself.
Other than the inline comments, the main reason for the complication is original query number 3. Queries (1) and (2) could easily be combined as a single, simple, aggregate query, but the third query is not as easily addressed. To keep the exchange data associated with the corresponding ask and bid prices, this query uses window functions instead of aggregate queries. This also provides a vehicle different from DISTINCT for obtaining one result per currency.
Here's the bare query:
WITH output_stats AS (
-- The ask and bid information for every row of output_table, every row
-- augmented by the needed maximum ask and minimum bid statistics
SELECT
target_currency as tc,
ask_price_usd as ask,
bid_price_usd as bid,
exchange as market,
MAX(ask_price_usd) OVER (PARTITION BY target_currency) as high,
ROW_NUMBER() OVER (
PARTITION BY target_currency, ask_price_usd ORDER BY exchange DESC)
as ask_rank,
MIN(bid_price_usd) OVER (PARTITION BY target_currency) as low,
ROW_NUMBER() OVER (
PARTITION BY target_currency, bid_price_usd ORDER BY exchange ASC)
as bid_rank
FROM output_table
)
REPLACE INTO coin_best_returns(
-- you must, of course, include all the columns you want to fill in the
-- upsert column list
coin,
highest_price,
lowest_price,
highest_market,
lowest_market)
SELECT
-- ... and select a value for each column
asks.tc,
asks.ask,
bids.bid,
asks.market,
bids.market
FROM output_stats asks
JOIN output_stats bids
ON asks.tc = bids.tc
WHERE
-- These conditions choose exactly one asks row and one bids row
-- for each currency
asks.ask = asks.high
AND asks.ask_rank = 1
AND bids.bid = bids.low
AND bids.bid_rank = 1
Note well that unlike the original query 3, this will consider only exchange values associated with the target currency for setting the highest_market and lowest_market columns in the destination table. I'm supposing that that's what you really want, but if not, then a different strategy will be needed.
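For completeness, the combined statement replaces all three original calls with one; a sketch, assuming the same conn3 connection as in the question, with COMBINED_SQL holding the full query above:
conn3.cursor().execute(COMBINED_SQL)  # COMBINED_SQL is a placeholder for the query text
conn3.commit()                        # persist the upsert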
I have a model that registers some kind of event and the date on which it occurs. I need to calculate: 1) the count of events for each date, and 2) the cumulative count of events over time.
My model looks something like this:
class Event(models.Model):
    date = models.DateField()
    ...
Calculating 1) is pretty straightforward, but I'm having trouble calculating the cumulative sum. I tried something like this:
query_set = Event.objects.values("date") \
.annotate(count=Count("date")) \
.annotate(cumcount=Window(Sum("count"), order_by="date"))
But I'm getting this error:
Cannot compute Sum('count'): 'count' is an aggregate
Edit: Ideally, I'd like to have a query set equivalent to this SQL query, using Django's ORM:
SELECT date,
COUNT(date) as count,
SUM(COUNT(date)) OVER(ORDER BY date) acc_count
FROM event_event
GROUP BY date
In some cases, performing an aggregate of an aggregate is not valid in SQL, whether you're using the ORM or not, for example MAX(SUM(...)).
In your case you can do it with a raw query (as already mentioned in the other answer(s) and in your question).
Or using the ORM as follows:
from django.db.models import Count, OuterRef, Sum, Window

subquery = (
Event.objects.filter(date=OuterRef("date")) # we need this for the join
.values("date") # this to create the group by
.annotate(subcount=Count("date")) # the aggregate function
)
qs = Event.objects.values("date").annotate(count=Count("date")).annotate(
    cumcount=Window(Sum(subquery.values("subcount")), order_by="date")
    # above we can use Sum with the subquery;
    # we can also replace it with any aggregate function we want
).values("date", "count", "cumcount")
It will generate the following SQL:
SELECT
"app_event"."date",
COUNT("app_event"."date") AS "count",
SUM((SELECT
COUNT(U0."date") AS "subcount"
FROM
"app_event" U0
WHERE
U0."date" = ("app_event"."date")
GROUP BY
U0."date"
)) OVER ( ORDER BY "app_event"."date")
AS "cumcount"
FROM
"app_event"
GROUP BY
"app_event"."date"
It's surprisingly common to see developers wanting to convert a SQL query to a Django QuerySet.
In this case, as the OP already knows SQL, they might be better off just performing a raw SQL query.
There are different ways one can go about doing it, like executing custom SQL directly.
from django.db import connection
def my_custom_sql(self):
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT date, COUNT(date) AS count,
                   SUM(COUNT(date)) OVER (ORDER BY date) AS acc_count
            FROM event_event
            GROUP BY date
        """)
Then, call cursor.fetchone() or cursor.fetchall() inside the with block to retrieve the resulting rows.
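Note that the fetch has to happen before the with block exits, because the cursor is closed on exit; a minimal sketch (the helper name is hypothetical):
from django.db import connection

def daily_and_cumulative_counts():
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT date, COUNT(date) AS count,
                   SUM(COUNT(date)) OVER (ORDER BY date) AS acc_count
            FROM event_event
            GROUP BY date
        """)
        return cursor.fetchall()  # [(date, count, acc_count), ...]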
I have got 2 column properties which use the same query, but just return different columns:
action_time = column_property(
select([Action.created_at]).where((Action.id == id)).order_by(desc(Action.created_at)).limit(1)
)
action_customer = column_property(
select([Action.customer_id]).where((Action.id == id)).order_by(desc(Action.created_at)).limit(1)
)
The SQL query that is produced will have two subqueries, one for each of the properties. This means that if I'd like to add a few more similar properties, the SQL query will end up with N subqueries.
I am wondering whether it is possible to have one LEFT OUTER JOIN that is used by multiple column_property definitions?
I want to get all the columns of a table along with max(timestamp), grouped by name.
What I have tried so far is:
normal_query = "Select max(timestamp) as time from table"
event_list = normal_query \
.distinct(Table.name)\
.filter_by(**filter_by_query) \
.filter(*queries) \
.group_by(*group_by_fields) \
.order_by('').all()
The query I get:
SELECT DISTINCT ON (schema.table.name) , max(timestamp)....
This query basically returns two columns, name and timestamp,
whereas the query I want is:
SELECT DISTINCT ON (schema.table.name) * from table order by ....
which returns all the columns in that table, which is the expected behavior; I am able to get all the columns this way. How could I write this down in Python to arrive at that statement? Basically, the asterisk is missing.
Can somebody help me?
What you seem to be after is the DISTINCT ON ... ORDER BY idiom in PostgreSQL for selecting greatest-n-per-group results (N = 1). So instead of grouping and aggregating, just:
event_list = Table.query.\
distinct(Table.name).\
filter_by(**filter_by_query).\
filter(*queries).\
order_by(Table.name, Table.timestamp.desc()).\
all()
This will end up selecting rows "grouped" by name, having the greatest timestamp value.
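If you want to double-check which columns the statement selects, printing the query renders its SQL; a quick sketch (with a PostgreSQL bind the DISTINCT ON clause renders as shown):
query = Table.query.\
    distinct(Table.name).\
    order_by(Table.name, Table.timestamp.desc())
print(query)
# roughly:
# SELECT DISTINCT ON (schema.table.name) schema.table.id, schema.table.name, ...
# FROM schema.table ORDER BY schema.table.name, schema.table.timestamp DESC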
You do not want to use the asterisk most of the time, not in your application code anyway, unless you're doing manual ad-hoc queries. The asterisk is basically "all columns from the FROM table/relation", which might then break your assumptions later, if you add columns, reorder them, and such.
In case you'd like to order the resulting rows based on timestamp in the final result, you can use for example Query.from_self() to turn the query to a subquery, and order in the enclosing query:
event_list = Table.query.\
distinct(Table.name).\
filter_by(**filter_by_query).\
filter(*queries).\
order_by(Table.name, Table.timestamp.desc()).\
from_self().\
order_by(Table.timestamp.desc()).\
all()
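As an aside, Query.from_self() is deprecated as of SQLAlchemy 1.4; the same shape can be built by selecting from an aliased subquery instead. A sketch, assuming the Flask-SQLAlchemy db handle:
from sqlalchemy.orm import aliased

inner = Table.query.\
    distinct(Table.name).\
    filter_by(**filter_by_query).\
    filter(*queries).\
    order_by(Table.name, Table.timestamp.desc()).\
    subquery()
latest = aliased(Table, inner)  # Table entity mapped onto the subquery
event_list = db.session.query(latest).\
    order_by(latest.timestamp.desc()).\
    all()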
I have a DB query that matches the desired rows. Let's say (for simplicity):
select * from stats where id in (1, 2);
Now I want to extract several frequency statistics (count of distinct values) for multiple columns, across these matching rows:
-- `stats.status` is one such column
select status, count(*) from stats where id in (1, 2) group by 1 order by 2 desc;
-- `stats.category` is another column
select category, count(*) from stats where id in (1, 2) group by 1 order by 2 desc;
-- etc.
Is there a way to re-use the same underlying query in SQLAlchemy? Raw SQL works too.
Or even better, return all the histograms at once, in a single command?
I'm mostly interested in performance, because I don't want Postgres to run the same row-matching many times, once for each column, over and over. The only change is which column is used for the histogram grouping. Otherwise it's the same set of rows.
I don't want Postgres to run the same row-matching many times
That's one of the motivations behind the GROUPING SETS functionality. Try this model:
SELECT category, status, count(*)
FROM stats where id in (1,2)
GROUP BY grouping sets ((category),(status));
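Each output row of a grouping-sets query belongs to exactly one set, so the column from the other set comes back NULL. A sketch of running the query and splitting the two histograms apart (the DSN is a placeholder, and this assumes category and status are never NULL in the data; otherwise use GROUPING() to tell the sets apart):
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///mydb")  # hypothetical DSN
with engine.connect() as conn:
    rows = conn.execute(text("""
        SELECT category, status, count(*) AS cnt
        FROM stats WHERE id IN (1, 2)
        GROUP BY GROUPING SETS ((category), (status))
    """)).fetchall()

# rows from the (category) set have status IS NULL, and vice versa
category_hist = [(r.category, r.cnt) for r in rows if r.status is None]
status_hist = [(r.status, r.cnt) for r in rows if r.category is None]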
User Abelisto's comment and the other answer both have the correct SQL required to generate the histograms for multiple fields in a single query.
The only edit I would suggest to their efforts is to add an ORDER BY clause, as it seems from the OP's attempts that more frequent labels are desired at the top of the result. You might find that sorting the results in Python rather than in the database is simpler (see the sketch after the query below); in that case, disregard the complexity brought on by the ORDER BY clause.
Thus, the modified query would be:
SELECT category, status, count(*)
FROM stats
WHERE id IN (1, 2)
GROUP BY GROUPING SETS (
(category), (status)
)
ORDER BY
GROUPING(category, status), 3 DESC
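If you instead sort in Python as suggested above, drop the ORDER BY and sort the fetched rows; a one-line sketch, assuming rows of (category, status, count) tuples:
rows.sort(key=lambda r: r[2], reverse=True)  # most frequent labels first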
It is also possible to express the same query using SQLAlchemy.
from sqlalchemy import *
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Stats(Base):
    __tablename__ = 'stats'
    id = Column(Integer, primary_key=True)
    category = Column(Text)
    status = Column(Text)
stmt = select(
[Stats.category, Stats.status, func.count(1)]
).where(
Stats.id.in_([1, 2])
).group_by(
func.grouping_sets(tuple_(Stats.category),
tuple_(Stats.status))
).order_by(
func.grouping(Stats.category, Stats.status),
func.count(1).desc()
)
Investigating the output, we see that it generates the desired query (extra newlines added in output for legibility)
print(stmt.compile(compile_kwargs={'literal_binds': True}))
# outputs:
SELECT stats.category, stats.status, count(1) AS count_1
FROM stats
WHERE stats.id IN (1, 2)
GROUP BY GROUPING SETS((stats.category), (stats.status))
ORDER BY grouping(stats.category, stats.status), count(1) DESC
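Executing the statement is then the usual flow; a short sketch, assuming an engine pointing at the right PostgreSQL database:
from sqlalchemy import create_engine

engine = create_engine("postgresql:///mydb")  # hypothetical DSN
with engine.connect() as conn:
    for category, status, count in conn.execute(stmt):
        print(category, status, count)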