Django nested aggregates with group_by - python

I have a query
select avg(total) from
(select date, sum(value) as total from table group by some_field, date) as res
group by week(date)
This is how I get the sum of the metrics for each date and then show the average, grouped by week. How can I get the same result with the Django ORM, without using a raw query?
So far I have only managed it by passing this query to raw().
I can also get the results of the inner select with:
model_obj.objects.filter(some_field='some_value')\
.values('date', 'some_field')\
.order_by('date', 'some_field')\
.annotate(total=Sum('value'))\
.all()
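To check what the nested query returns, the SQL from the question can be run against an in-memory SQLite database from Python. This is only a sketch: the table name `metrics`, the sample rows, and SQLite's `strftime('%Y-%W', ...)` standing in for `week()` are assumptions for illustration, not part of the original schema.

```python
import sqlite3


def weekly_average_of_daily_totals(rows):
    """rows: iterable of (some_field, iso_date, value) tuples (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metrics (some_field TEXT, date TEXT, value REAL)")
    conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)
    query = """
        SELECT strftime('%Y-%W', date) AS week, AVG(total) AS avg_total
        FROM (
            -- inner aggregate: one total per (some_field, date)
            SELECT date, SUM(value) AS total
            FROM metrics
            GROUP BY some_field, date
        ) AS res
        -- outer aggregate: average of those totals per week
        GROUP BY week
        ORDER BY week
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


sample_rows = [
    ("a", "2024-01-01", 1),
    ("a", "2024-01-01", 2),
    ("a", "2024-01-02", 5),
    ("a", "2024-01-08", 10),
]
weekly = weekly_average_of_daily_totals(sample_rows)
```

The result can serve as a reference when comparing against whatever the ORM version produces for the same data.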

Related

How to join these 2 tables by date with ORM

I have two querysets:
A = Bids.objects.filter(*args, **kwargs).annotate(
    highest_priority=Case(*[
        When(data_source=data_source, then=Value(i))
        for i, data_source in enumerate(data_source_order_list)
    ])
).order_by("date", "highest_priority")
B = A.values("date").annotate(Min("highest_priority")).order_by("date")
The first query gives me all objects in the selected time range, with the proper data sources and values. I use highest_priority to decide which item should be selected. All items carry additional data.
The second query gives me grouped information about the items for every date. It does not contain important values such as price. So I assume I have to join these two tables and filter where a.highest_priority = b.highest_priority, because then I get a queryset of objects with only one item per date.
I have tried distinct, but it does not work with .first()/.last(). Annotating gives me dicts for the grouped rows, and grouping only by date cuts out a lot of important data, yet I have to group only by date.
The tables look like this:
A
B
How do I join them? Once they are joined, I could easily match highest_priority against highest_priority and get one item per date with a single database query. I want to use the ORM: I could just apply distinct and build the list in Python, but I do not want to hammer the database with multiple queries per date.
See if this suggestion works:
SELECT *, (to_char(a.date, 'YYYYMMDD')::integer)*highest_priority AS prioritycalc
FROM table A
JOIN table B ON (to_char(a.date, 'YYYYMMDD')::integer)*highest_priority = (to_char(b.date, 'YYYYMMDD')::integer)*highest_priority
ORDER BY prioritycalc DESC;
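The selection rule described in the question (one row per date, taken from the data source that comes first in the priority list) can be written in plain Python as a reference to test any ORM or SQL solution against. The dict-based rows and the function name `best_bid_per_date` are hypothetical and not part of the original code.

```python
def best_bid_per_date(bids, data_source_order):
    """Keep, for each date, the bid whose data source ranks first.

    bids: iterable of dicts with at least "date" and "data_source" keys
    data_source_order: list of data sources, highest priority first
    """
    priority = {source: rank for rank, source in enumerate(data_source_order)}
    best = {}
    for bid in bids:
        current = best.get(bid["date"])
        # A lower rank means a higher priority, mirroring Min("highest_priority").
        if current is None or priority[bid["data_source"]] < priority[current["data_source"]]:
            best[bid["date"]] = bid
    return [best[date] for date in sorted(best)]


sample_bids = [
    {"date": "2024-01-01", "data_source": "y", "price": 2},
    {"date": "2024-01-01", "data_source": "x", "price": 1},
    {"date": "2024-01-02", "data_source": "y", "price": 3},
]
selected = best_bid_per_date(sample_bids, ["x", "y"])
```

Each selected row still carries the full set of fields, such as price, which is what the grouped queryset B loses.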

How to compute cumulative sum of a count field in Django

I have a model that registers some kind of event and the date on which it occurs. I need to calculate: 1) the count of events for each date, and 2) the cumulative count of events over time.
My model looks something like this:
class Event(models.Model):
    date = models.DateField()
    ...
Calculating 1) is pretty straightforward, but I'm having trouble calculating the cumulative sum. I tried something like this:
query_set = Event.objects.values("date") \
.annotate(count=Count("date")) \
.annotate(cumcount=Window(Sum("count"), order_by="date"))
But I'm getting this error:
Cannot compute Sum('count'): 'count' is an aggregate
Edit: Ideally, I'd like to have a query set equivalent to this SQL query, using Django's ORM:
SELECT date,
COUNT(date) as count,
SUM(COUNT(date)) OVER(ORDER BY date) acc_count
FROM event_event
GROUP BY date
In some cases an aggregate of an aggregate is not valid SQL, whether you're using the ORM or not, for example MAX(SUM(...)).
In your case you can do it with a raw query (as already mentioned in the other answer and in your question).
Or with the ORM, as follows:
from django.db.models import Count, OuterRef, Sum, Window

subquery = (
    Event.objects.filter(date=OuterRef("date"))  # we need this for the join
    .values("date")  # this creates the GROUP BY
    .annotate(subcount=Count("date"))  # the aggregate function
)
Event.objects.values("date").annotate(count=Count("date")).annotate(
    # Sum wraps the subquery here; any other aggregate function could be used instead
    cumcount=Window(Sum(subquery.values("subcount")), order_by="date")
).values("date", "count", "cumcount")
It will generate the following SQL:
SELECT
"app_event"."date",
COUNT("app_event"."date") AS "count",
SUM((SELECT
COUNT(U0."date") AS "subcount"
FROM
"app_event" U0
WHERE
U0."date" = ("app_event"."date")
GROUP BY
U0."date"
)) OVER ( ORDER BY "app_event"."date")
AS "cumcount"
FROM
"app_event"
GROUP BY
"app_event"."date"
It's surprisingly common to see developers wanting to convert an SQL query to a Django QuerySet.
In this case, as OP already knows SQL, OP might be better off just running a raw SQL query.
There are different ways to go about it, such as executing custom SQL directly.
from django.db import connection

def my_custom_sql(self):
    with connection.cursor() as cursor:
        # A triple-quoted string lets the statement span several lines.
        cursor.execute("""
            SELECT date, COUNT(date) AS count,
                   SUM(COUNT(date)) OVER (ORDER BY date) AS acc_count
            FROM event_event
            GROUP BY date
        """)
Then, call cursor.fetchone() or cursor.fetchall() to return the resulting rows.
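If the running total does not have to be computed in the database, it can also be derived in Python from the per-date counts. The following is a minimal sketch on plain date values (for example those returned by `values_list("date", flat=True)`); the function name `counts_with_running_total` is an assumption for illustration.

```python
from collections import Counter
from itertools import accumulate


def counts_with_running_total(dates):
    """Return (date, count, cumulative_count) tuples ordered by date."""
    per_date = sorted(Counter(dates).items())
    # accumulate() yields the running sum of the per-date counts.
    running = accumulate(count for _, count in per_date)
    return [(date, count, total) for (date, count), total in zip(per_date, running)]


sample_dates = ["2024-01-02", "2024-01-01", "2024-01-02", "2024-01-03"]
cumulative = counts_with_running_total(sample_dates)
```

This trades one extra pass in Python for a simpler query, which may be acceptable when the number of distinct dates is small.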

What is the sql equivalent function of the python function .size()?

I am trying to solve a problem in BigQuery: list the customers with consistent transactions over 6 months. I have already solved it with Python, but I don't know how to replicate the code in SQL. This is the code:
df.groupby(['Month','accounttoken'])['transactionid'].value_counts()
a=df[df.groupby(['Month','accounttoken'])['transactionid'].transform('count')>=5]
df_grouped = a.groupby(['Month', 'accounttoken','Name']).size().reset_index(name='num_transactions')
a1 = df_grouped[df_grouped['num_transactions']>=5]
This is what I have done with sql so far
select Month, Name,accounttoken,count(transactionid) no_of_trans from data
group by Month, accounttoken,Name
having count(transactionid)>=5
I think what I need is the equivalent of the .size() function
count(*) counts the number of rows in a group.
SELECT Month, accounttoken, Name, count(*) AS num_transactions
FROM data
GROUP BY Month, accounttoken, Name
HAVING count(*) >= 5
You can use this SQL query to replace the last two lines of your Python code. I expect the SQL query you wrote works as well.
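To see that GROUP BY with count(*) and HAVING plays the role of `.size()` followed by the `>= 5` filter, the query can be tried on a small in-memory SQLite table from Python. The sample rows and the function name `frequent_groups` are made up for illustration, and SQLite stands in for BigQuery here.

```python
import sqlite3


def frequent_groups(rows, minimum=5):
    """rows: (Month, accounttoken, Name, transactionid) tuples (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (Month TEXT, accounttoken TEXT, Name TEXT, transactionid INTEGER)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT Month, accounttoken, Name, COUNT(*) AS num_transactions
        FROM data
        GROUP BY Month, accounttoken, Name
        HAVING COUNT(*) >= ?
        ORDER BY Month, accounttoken
    """
    result = conn.execute(query, (minimum,)).fetchall()
    conn.close()
    return result


# Five transactions for one customer, two for another.
sample = [("Jan", "t1", "Ann", i) for i in range(5)]
sample += [("Jan", "t2", "Bob", i) for i in range(5, 7)]
groups = frequent_groups(sample)
```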

Sum numeric values from different tables in one query

In SQL, I can sum two counts like
SELECT (
(SELECT count(*) FROM a WHERE val=42)
+
(SELECT count(*) FROM b WHERE val=42)
)
How do I perform this query with the Django ORM?
The closest I got is
a.objects.filter(val=42).order_by().values_list('id', flat=True).union(
b.objects.filter(val=42).order_by().values_list('id', flat=True)
).count()
This works fine if the returned count is small, but seems bad if there's a lot of rows that the database must hold in memory just to count them.
Your solution can be simplified only a little, by using values('pk') instead of values_list('id', flat=True). That changes only the type of the output rows; the underlying SQL of both querysets is the same:
SELECT id FROM a WHERE val=42 UNION SELECT id FROM b WHERE val=42
and the .count() method only wraps that subquery in an outer query:
SELECT COUNT(*) FROM (... subquery ...)
The database backend does not necessarily hold all the values in memory. It can also just count them and discard them (not checked).
Similarly, if you run a simple SELECT COUNT(id) FROM a, it doesn't need to collect the ids.
Subqueries of the form SELECT count(*) FROM a WHERE val=42 inside a bigger query are not possible directly, because Django does not evaluate aggregations lazily; it evaluates them immediately.
The evaluation can be postponed, e.g. by grouping by some expression that has only one possible value, such as GROUP BY (i >= 0) (or by an outer reference, if that works), but the query plan can be worse.
Another problem is that a SELECT without a table is not possible in the ORM. Therefore I will use an unimportant row of an unimportant table as the base of the query.
Example:
qs = Unimportant.objects.filter(pk=unimportant_pk).values('id').annotate(
    total_a=a.objects.filter(val=42).order_by().values('val')
    .annotate(cnt=models.Count('*')).values('cnt'),
    total_b=b.objects.filter(val=42).order_by().values('val')
    .annotate(cnt=models.Count('*')).values('cnt')
)
It is not nice, but it could be easily parallelized
SELECT
id,
(SELECT COUNT(*) AS cnt FROM a WHERE val=42 GROUP BY val) AS total_a,
(SELECT COUNT(*) AS cnt FROM b WHERE val=42 GROUP BY val) AS total_b
FROM unimportant WHERE id = unimportant_pk
The Django docs confirm that a simple solution doesn't exist.
Using aggregates within a Subquery expression
...
... This is the only way to perform an aggregation within a Subquery, as using aggregate() attempts to evaluate the queryset (and if there is an OuterRef, this will not be possible to resolve).
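For comparison, the SQL from the question (two scalar subqueries added together) can be checked in isolation against an in-memory SQLite database. This is only a sketch: the table contents and the helper names `count_in_both` and `make_demo_connection` are assumptions. Note also that UNION removes duplicate rows, so the union-based count in the question can be lower than this sum when both tables share ids.

```python
import sqlite3


def count_in_both(conn, val):
    """Return the number of rows with the given val in tables a and b combined."""
    query = """
        SELECT (SELECT COUNT(*) FROM a WHERE val = :val)
             + (SELECT COUNT(*) FROM b WHERE val = :val)
    """
    return conn.execute(query, {"val": val}).fetchone()[0]


def make_demo_connection():
    """Build two small tables named a and b, as in the question (sample data is made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, val INTEGER)")
    conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, val INTEGER)")
    conn.executemany("INSERT INTO a (val) VALUES (?)", [(42,), (42,), (7,)])
    conn.executemany("INSERT INTO b (val) VALUES (?)", [(42,), (1,)])
    return conn


demo_conn = make_demo_connection()
total = count_in_both(demo_conn, 42)
```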

Calculate Max of Sum of an annotated field over a grouped by query in Django ORM?

To keep it simple, I have four tables (A, B, Category and Relation). The Relation table stores the Intensity of A in B, and Category stores the type of B.
A <--- Relation ---> B ---> Category
(So the relation between A and B is n to n, while the relation between B and Category is n to 1.)
I need an ORM query that groups Relation records by Category and A, then calculates the Sum of Intensity for each (Category, A) pair (this seems simple so far), and then annotates the Max of the calculated Sum in each Category.
My code is something like:
A.objects.values('B_id').annotate(AcSum=Sum(Intensity)).annotate(Max(AcSum))
Which throws the error:
django.core.exceptions.FieldError: Cannot compute Max('AcSum'): 'AcSum' is an aggregate
The django-group-by package gives the same error.
For further information please also see this stackoverflow question.
I am using Django 2 and PostgreSQL.
Is there a way to achieve this using ORM, if there is not, what would be the solution using raw SQL expression?
Update
After a lot of struggling I found out that what I wrote was indeed an aggregation; what I actually want is the maximum AcSum over the As in each category. So I suppose I have to group the result once more after the AcSum calculation. Based on this insight I found a Stack Overflow question that asks about the same concept (it was asked 1 year, 2 months ago and has no accepted answer).
Chaining another values('id') onto the queryset works neither as a group-by nor as a filter for the output attributes; it removes AcSum from the set. Adding AcSum to values() is also not an option, because it changes the grouped result set.
I think what I am trying to do is regroup the grouped query based on the fields inside a column (i.e. id).
Any thoughts?
You can't do an aggregate of an aggregate Max(Sum()), it's not valid in SQL, whether you're using the ORM or not. Instead, you have to join the table to itself to find the maximum. You can do this using a subquery. The below code looks right to me, but keep in mind I don't have something to run this on, so it might not be perfect.
from django.db.models import OuterRef, Q, Subquery, Sum
annotation = {
    'AcSum': Sum('intensity')
}
# The basic query is on Relation grouped by A and Category, annotated
# with the Sum of intensity
query = Relation.objects.values('a', 'b__category').annotate(**annotation)
# The subquery is joined to the outerquery on the Category
sub_filter = Q(b__category=OuterRef('b__category'))
# The subquery is grouped by A and Category and annotated with the Sum
# of intensity, which is then ordered descending so that when a LIMIT 1
# is applied, you get the Max.
subquery = Relation.objects.filter(sub_filter).values(
'a', 'b__category').annotate(**annotation).order_by(
'-AcSum').values('AcSum')[:1]
query = query.annotate(max_intensity=Subquery(subquery))
This should generate SQL like:
SELECT a_id, category_id,
(SELECT SUM(U0.intensity) AS AcSum
FROM RELATION U0
JOIN B U1 on U0.b_id = U1.id
WHERE U1.category_id = B.category_id
GROUP BY U0.a_id, U1.category_id
ORDER BY SUM(U0.intensity) DESC
LIMIT 1
) AS max_intensity
FROM Relation
JOIN B on Relation.b_id = B.id
GROUP BY Relation.a_id, B.category_id
It may be more performant to eliminate the join in the Subquery by using a backend-specific feature like array_agg (Postgres) or GROUP_CONCAT (MySQL) to collect the Relation.ids that are grouped together in the outer query. But I don't know what backend you're using.
Something like this should work for you. I couldn't test it myself, so please let me know the result:
Relation.objects.annotate(
    b_category=F('B__Category')
).values(
    'A', 'b_category'
).annotate(
    SumIntensityPerCategory=Sum('Intensity')
).values(
    'A', MaxIntensitySumPerCategory=Max('SumIntensityPerCategory')
)
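Since none of the ORM snippets above were tested by their authors, it can help to have a plain-Python reference for the intended result: sum Intensity per (Category, A), then take the maximum of those sums per Category. The tuple layout and the function name `max_sum_per_category` are assumptions for illustration.

```python
from collections import defaultdict


def max_sum_per_category(relations):
    """relations: iterable of (a_id, category_id, intensity) tuples.

    Returns {category_id: largest per-A sum of intensity in that category}.
    """
    # First level: Sum of intensity for each (category, a) pair.
    sums = defaultdict(int)
    for a_id, category_id, intensity in relations:
        sums[(category_id, a_id)] += intensity

    # Second level: Max of those sums within each category.
    best = {}
    for (category_id, _a_id), total in sums.items():
        if category_id not in best or total > best[category_id]:
            best[category_id] = total
    return best


sample_relations = [
    ("a1", "c1", 2),
    ("a1", "c1", 3),
    ("a2", "c1", 4),
    ("a1", "c2", 1),
]
maxima = max_sum_per_category(sample_relations)
```

Comparing this dictionary with the max_intensity values from the subquery approach on the same rows is one way to confirm that the generated SQL does what is intended.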
