Aggregating a windowed queryset in Django - python

Background
Suppose we have a set of questions, and a set of students that answered these questions.
The answers have been reviewed, and scores have been assigned, on some unknown range.
Now, we need to normalize the scores with respect to the extreme values within each question.
For example, if question 1 has a minimum score of 4 and a maximum score of 12, those scores would be normalized to 0 and 1 respectively. Scores in between are interpolated linearly (as described e.g. in Normalization to bring in the range of [0,1]).
Then, for each student, we would like to know the mean of the normalized scores for all questions combined.
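To make the target concrete, here is a small pure-Python sketch of the calculation described above, independent of Django. The sample data and the helper name mean_normalized_scores are made up for illustration, and skipping questions whose scores are all equal is an assumption (the description does not say how to handle that case).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: (student, question, raw score).
scores = [
    ("alice", "q1", 4.0), ("bob", "q1", 12.0), ("carol", "q1", 8.0),
    ("alice", "q2", 10.0), ("bob", "q2", 20.0), ("carol", "q2", 20.0),
]


def mean_normalized_scores(rows):
    """Return {student: mean of min-max normalized scores}."""
    # Collect the raw values per question to find each question's extremes.
    by_question = defaultdict(list)
    for _, question, value in rows:
        by_question[question].append(value)
    limits = {q: (min(v), max(v)) for q, v in by_question.items()}

    normalized = defaultdict(list)
    for student, question, value in rows:
        low, high = limits[question]
        if high == low:
            continue  # assumption: skip questions where every score is equal
        normalized[student].append((value - low) / (high - low))
    return {student: mean(values) for student, values in normalized.items()}


result = mean_normalized_scores(scores)
```

With this data, question q1 ranges from 4 to 12 and q2 from 10 to 20, so carol's scores of 8 and 20 normalize to 0.5 and 1.0.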
Minimal example
Here's a very naive minimal implementation, just to illustrate what we would like to achieve:
from statistics import mean
from django.db import models
class Question(models.Model):
    pass
class Student(models.Model):
    def mean_normalized_score(self):
        normalized_scores = []
        for score in self.score_set.all():
            normalized_scores.append(score.normalized_value())
        return mean(normalized_scores) if normalized_scores else None
class Score(models.Model):
    student = models.ForeignKey(to=Student, on_delete=models.CASCADE)
    question = models.ForeignKey(to=Question, on_delete=models.CASCADE)
    value = models.FloatField()
    def normalized_value(self):
        # One extra query per score: fetch the extremes for this score's question.
        limits = Score.objects.filter(question=self.question).aggregate(
            min=models.Min('value'), max=models.Max('value'))
        return (self.value - limits['min']) / (limits['max'] - limits['min'])
This works, but it is quite inefficient: every call to normalized_value() issues its own aggregate query, so the number of database queries grows with the number of scores.
Goal
Instead of the implementation above, I would prefer to offload the number-crunching on to the database.
What I've tried
Consider, for example, these two use cases:
list the normalized_value for all Score objects
list the mean_normalized_score for all Student objects
The first use case can be covered using window functions in a query, something like this:
from django.db.models import F, Max, Min, Window
w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
annotated_scores = Score.objects.annotate(
    normalized_value=(F('value') - w_min) / (w_max - w_min))
This works nicely, so the Score.normalized_value() method from the example is no longer needed.
Now, I would like to do something similar for the second use case, to replace the Student.mean_normalized_score() method by a single database query.
The raw SQL could look something like this (for sqlite):
SELECT id, student_id, AVG(normalized_value) AS mean_normalized_score
FROM (
    SELECT
        myapp_score.*,
        (
            (myapp_score.value - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id))
            / (MAX(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)
               - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id))
        ) AS normalized_value
    FROM myapp_score
)
GROUP BY student_id
I can make this work as a raw Django query, but I have not yet been able to reproduce this query using Django's ORM.
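For reference, the raw SQL above can be tried outside Django with Python's built-in sqlite3 module. This is only a sketch: the table layout mirrors what Django would presumably generate for the Score model, the sample rows are invented, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myapp_score ("
    "id INTEGER PRIMARY KEY, student_id INTEGER, question_id INTEGER, value REAL)"
)
# Hypothetical sample rows: (student_id, question_id, value).
conn.executemany(
    "INSERT INTO myapp_score (student_id, question_id, value) VALUES (?, ?, ?)",
    [(1, 1, 4.0), (2, 1, 12.0), (3, 1, 8.0), (1, 2, 10.0), (2, 2, 20.0), (3, 2, 20.0)],
)

SQL = """
SELECT student_id, AVG(normalized_value) AS mean_normalized_score
FROM (
    SELECT
        student_id,
        (value - MIN(value) OVER (PARTITION BY question_id))
        / (MAX(value) OVER (PARTITION BY question_id)
           - MIN(value) OVER (PARTITION BY question_id)) AS normalized_value
    FROM myapp_score
)
GROUP BY student_id
ORDER BY student_id
"""
# The inner SELECT normalizes per question; the outer one averages per student.
rows = conn.execute(SQL).fetchall()
```

Here rows holds one (student_id, mean_normalized_score) tuple per student, which is the shape the ORM query is supposed to reproduce.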
I've tried building on the annotated_scores queryset described above, using Django's Subquery, annotate(), aggregate(), Prefetch, and combinations of those, but I must be making a mistake somewhere.
Probably the closest I've gotten is this:
subquery = Subquery(annotated_scores.values('normalized_value'))
Score.objects.values('student_id').annotate(mean=Avg(subquery))
But this is incorrect.
Could someone point me in the right direction, without resorting to raw queries?

I may have found a way to do this using subqueries. The main obstacle is that, at least in Django, we cannot aggregate over window functions, which is what blocks calculating the mean of the normalized values. I've added comments on the lines to explain what I'm trying to do:
from django.db.models import (
    Avg, ExpressionWrapper, F, FloatField, Max, Min, OuterRef, Subquery)
# Get the minimum score per question
min_subquery = (
    Score.objects.filter(question=OuterRef('question'))
    .values('question').annotate(min=Min('value')))
# Get the maximum score per question
max_subquery = (
    Score.objects.filter(question=OuterRef('question'))
    .values('question').annotate(max=Max('value')))
# Calculate the normalized value per score, then get the average by grouping by students
mean_subquery = Score.objects.filter(student=OuterRef('pk')).annotate(
    min=Subquery(min_subquery.values('min')[:1]),
    max=Subquery(max_subquery.values('max')[:1]),
    normalized=ExpressionWrapper(
        (F('value') - F('min')) / (F('max') - F('min')),
        output_field=FloatField()),
).values('student').annotate(mean=Avg('normalized'))
# Get the calculated mean per student
Student.objects.annotate(mean=Subquery(mean_subquery.values('mean')[:1]))
The resulting SQL is:
SELECT
    "student"."id",
    "student"."name",
    (
        SELECT
            AVG(
                (
                    (
                        V0."value" - (
                            SELECT MIN(U0."value") AS "min"
                            FROM "score" U0
                            WHERE U0."question_id" = (V0."question_id")
                            GROUP BY U0."question_id"
                            LIMIT 1
                        )
                    ) / (
                        (
                            SELECT MAX(U0."value") AS "max"
                            FROM "score" U0
                            WHERE U0."question_id" = (V0."question_id")
                            GROUP BY U0."question_id"
                            LIMIT 1
                        ) - (
                            SELECT MIN(U0."value") AS "min"
                            FROM "score" U0
                            WHERE U0."question_id" = (V0."question_id")
                            GROUP BY U0."question_id"
                            LIMIT 1
                        )
                    )
                )
            ) AS "mean"
        FROM "score" V0
        WHERE V0."student_id" = ("student"."id")
        GROUP BY V0."student_id"
        LIMIT 1
    ) AS "mean"
FROM "student"

As mentioned by @bdbd, and judging from this Django issue, it appears that annotating a windowed queryset is not yet possible (using Django 3.2).
As a temporary workaround, I refactored @bdbd's excellent Subquery solution as follows.
class ScoreQuerySet(models.QuerySet):
    def annotate_normalized(self):
        w_min = Subquery(self.filter(
            question=OuterRef('question')).values('question').annotate(
            min=Min('value')).values('min')[:1])
        w_max = Subquery(self.filter(
            question=OuterRef('question')).values('question').annotate(
            max=Max('value')).values('max')[:1])
        return self.annotate(normalized=(F('value') - w_min) / (w_max - w_min))
    def aggregate_student_mean(self):
        return self.annotate_normalized().values('student_id').annotate(
            mean=Avg('normalized'))
class Score(models.Model):
    objects = ScoreQuerySet.as_manager()
    ...
Note: If necessary, we can add more Student lookups to the values() in aggregate_student_mean(), e.g. student__name, as long as we take care not to mess up the grouping.
Now, if it ever becomes possible to filter and annotate windowed querysets, we can simply replace the Subquery lines by the much simpler Window implementation:
w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])

Related

Complex Django query involving an ArrayField & coefficients

On the one hand, let's consider this Django model:
from uuid import uuid4
from django.contrib.postgres.fields import ArrayField
from django.db import models
class Entry(models.Model):
    id = models.UUIDField(primary_key=True, default=uuid4, editable=False)
    value = models.DecimalField(decimal_places=12, max_digits=22)
    items = ArrayField(base_field=models.UUIDField(null=False, blank=False), default=list)
On the other hand, let's say we have this dictionary:
coefficients = {item1_uuid: item1_coef, item2_uuid: item2_coef, ... }
Entry.value is intended to be distributed among the Entry.items according to coefficients.
Using Django ORM, what would be the most efficient way (in a single SQL query) to get the sum of the values of my Entries for a single Item, given the coefficients?
For instance, for item1 below I want to get 168.5454..., that is to say 100 * 1 + 150 * (0.2 / (0.2 + 0.35)) + 70 * 0.2.
Entry ID   Value   Items
uuid1      100     [item1_uuid]
uuid2      150     [item1_uuid, item2_uuid]
uuid3      70      [item1_uuid, item2_uuid, item3_uuid]
coefficients = { item1_uuid: Decimal("0.2"), item2_uuid: Decimal("0.35"), item3_uuid: Decimal("0.45") }
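The expected figure can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not of the ORM query: the string keys stand in for the UUIDs, the helper name value_for_item is made up, and it assumes every item appearing in an entry has a coefficient.

```python
from decimal import Decimal

# Hypothetical stand-ins for the UUIDs in the table above.
coefficients = {
    "item1": Decimal("0.2"),
    "item2": Decimal("0.35"),
    "item3": Decimal("0.45"),
}
entries = [
    (Decimal("100"), ["item1"]),
    (Decimal("150"), ["item1", "item2"]),
    (Decimal("70"), ["item1", "item2", "item3"]),
]


def value_for_item(item, entries, coefficients):
    """Sum each entry's value weighted by the item's share of that entry."""
    total = Decimal(0)
    for value, items in entries:
        if item not in items:
            continue
        # The item's share is its coefficient over the sum of the
        # coefficients of all items present in this entry.
        weight_sum = sum(coefficients[i] for i in items)
        total += value * coefficients[item] / weight_sum
    return total


item1_total = value_for_item("item1", entries, coefficients)
```

For item1 the three terms are 100 * (0.2 / 0.2), 150 * (0.2 / 0.55) and 70 * (0.2 / 1.0), which add up to roughly 168.5454.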
Bonus question: how could I adapt my models for this query to run faster? I've deliberately chosen to use an ArrayField and decided not to use a ManyToManyField, was that a bad idea? How to know where I could add db_index[es] for this specific query?
I am using Python 3.10, Django 4.1, and Postgres 14.
I've found a solution to my own question, but I'm sure someone here could come up with a more efficient & cleaner approach.
The idea here is to chain the .alias() methods (cf. Django documentation) and the conditional expressions with Case and When in a for loop.
This results in an overly complex query, which at least does work as expected:
from decimal import Decimal
from django.db.models import Case, F, Q, Sum, Value, When
def get_value_for_item(coefficients, item):
    item_coef = coefficients.get(item.pk, Decimal(0))
    if not item_coef:
        return Decimal(0)
    several = Q(items__len__gt=1)
    queryset = (
        Entry.objects
        .filter(items__contains=[item.pk])
        .alias(total=Case(When(several, then=Value(Decimal(0)))))
    )
    # Accumulate the coefficients of every item present in a multi-item entry.
    for k, v in coefficients.items():
        has_k = Q(items__contains=[k])
        queryset = queryset.alias(
            total=Case(
                When(several & has_k, then=Value(v) + F("total")),
                default="total",
            )
        )
    return (
        queryset.annotate(
            coef_applied=Case(
                When(several, then=Value(item_coef) / F("total") * F("value")),
                default="value",
            )
        ).aggregate(Sum("coef_applied", default=Decimal(0)))
    )["coef_applied__sum"]
With the example I gave in my question and for item1, the output of this function is Decimal(168.5454...) as expected.

Calculate the ForeignKey type Percentage (individual) Django ORM

I want to calculate the percentage of each car type using the Django ORM, i.e. group all of the cars by type and calculate each type's percentage. I have multiple solutions, but they are old-fashioned and iterative. I am going to use this query on a dashboard where multiple queries are already calculating different analytics. I don't want to compromise on performance, which is why I prefer a single query. Here is the structure of my tables (written in Django):
class CarType(models.Model):
    name = models.CharField(max_length=50)
class Car(models.Model):
    car_type = models.ForeignKey(CarType, on_delete=models.CASCADE)
I have a utility function that has the following details:
input => cars: (Queryset) of cars Django.
output => list of all car_types (dictionaries) having percentage.
[{'car_type': 'car01', 'percentage': 70, 'this_car_type_count': 20}, ...]
What I've tried so far:
cars.annotate(
    total=Count('pk')
).annotate(
    car_type_name=F('car_type__name')
).values(
    'car_type_name'
).annotate(
    car_type_count=Count('car_type_name'),
    percentage=Cast(F('car_type_count') * 100.0 / F('total'), FloatField()),
)
But this solution gives 100% for all car types. I know this weird behavior is because of the values() I'm using, but I'm kind of stuck here.
F('total') will be the count of cars within each group (each car type) not the total count of the whole table. This is why you always get 100%. You can achieve what you want in two queries:
total = cars.count()
cars.annotate(
    car_type_name=F('car_type__name')
).values(
    'car_type_name'
).annotate(
    car_type_count=Count('car_type_name'),
    percentage=Cast(F('car_type_count') * 100.0 / total, FloatField())
)
If you really want to do this in one query instead of two, the computation of total will need to be a window function instead of a regular aggregate.
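At the SQL level, the single-query variant can look like the sketch below, run here through Python's sqlite3 module rather than the ORM. The table and column names are assumptions loosely modelled on the CarType and Car models, the rows are invented, and SQLite 3.25 or newer is assumed for window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE car_type (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE car (id INTEGER PRIMARY KEY, car_type_id INTEGER REFERENCES car_type(id));
    INSERT INTO car_type (id, name) VALUES (1, 'sedan'), (2, 'suv');
    INSERT INTO car (car_type_id) VALUES (1), (1), (1), (2);
    """
)

rows = conn.execute(
    """
    SELECT
        car_type.name,
        COUNT(*) AS car_type_count,
        -- SUM(COUNT(*)) OVER () adds up the per-group counts across all groups,
        -- so the denominator is the total number of cars, not the group size.
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage
    FROM car
    JOIN car_type ON car_type.id = car.car_type_id
    GROUP BY car_type.name
    ORDER BY car_type.name
    """
).fetchall()
```

The window is evaluated after the GROUP BY, which is why it sees one row per car type and can sum their counts into the overall total.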

Django annotation on (model → FK → model) relation

Galaxies across the universe host millions/billions of stars, each belonging to a specific type, depending on its physical properties (Red stars, Blue Supergiant, White Dwarf, etc). For each Star in my database, I'm trying to find the number of distinct galaxies that are also home for some star of that same type.
class Galaxy(Model):
    ...
class Star(Model):
    galaxy = ForeignKey(Galaxy, related_name='stars')
    type = CharField(...)
Performing this query individually for each Star might be comfortably done by:
star = <some_Star>
desired_galaxies = Galaxy.objects.filter(stars__type=star.type).distinct()
desired_count = desired_galaxies.count()
Or even, albeit more redundant:
desired_count = Star.objects.filter(galaxy__stars__type=star.type).values('galaxy').distinct().count()
This gets a little fuzzier when I try to get the count result for all the stars in a "single" query:
all_stars = Star.objects.annotate(desired_count=...)
The main reason I want to do that is to be capable of sorting Star.objects.order_by('desired_count') in a clean way.
What I've tried so far:
Star.objects.annotate(desired_count=Count('galaxy', filter=Q(galaxy__stars__type=F('type')), distinct=True))
But this annotates 1 for every star. I guess I'll have to go for OuterRef, Subquery here, but not sure on how.
You can use GROUP BY to get the count:
Star.objects.values('type').annotate(desired_count=Count('galaxy', distinct=True)).values('type', 'desired_count')
Django doesn't provide a way to define multi-valued relationships between models that don't involve foreign keys yet. If it did you could do something like
class Galaxy(Model):
    ...
class Star(Model):
    galaxy = ForeignKey(Galaxy, related_name='stars')
    type = CharField(...)
    same_type_stars = Relation(
        'self', from_fields=('type',), to_fields=('type',)
    )
Star.objects.annotate(
    galaxies_count=Count('same_type_stars__galaxy', distinct=True)
)
Which would result in something along the lines of:
SELECT
    star.*,
    COUNT(DISTINCT same_star_type.galaxy_id) galaxies_count
FROM star
LEFT JOIN star same_star_type ON (same_star_type.type = star.type)
GROUP BY star.id
If you want to achieve something similar, you'll need to use a subquery for now:
Star.objects.annotate(
    galaxies_count=Subquery(Star.objects.filter(
        type=OuterRef('type'),
    ).values('type').values(
        inner_count=Count('galaxy', distinct=True),
    ))
)
Which would result in something along the lines of:
SELECT
    star.*,
    (
        SELECT COUNT(DISTINCT inner_star.galaxy_id)
        FROM star inner_star
        WHERE inner_star.type = star.type
        GROUP BY inner_star.type
    ) galaxies_count
FROM star
This will likely perform poorly on some databases that don't materialize correlated subqueries (e.g. MySQL). In all cases, make sure you index Star.type, otherwise you'll get bad performance no matter what. A composite index on ('type', 'galaxy') might be even better, as it might allow an index-only scan (e.g. on PostgreSQL).
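The correlated-subquery form can be tried out with Python's sqlite3 module. This is a sketch only: the star table is a simplified stand-in for what the Star model would map to, the rows are invented, and the index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE star (id INTEGER PRIMARY KEY, galaxy_id INTEGER, type TEXT);
    -- Composite index as suggested above (the name is made up).
    CREATE INDEX star_type_galaxy ON star (type, galaxy_id);
    INSERT INTO star (id, galaxy_id, type) VALUES
        (1, 1, 'red'), (2, 1, 'red'), (3, 2, 'red'),
        (4, 2, 'blue'), (5, 3, 'white'), (6, 3, 'red');
    """
)

rows = conn.execute(
    """
    SELECT
        star.id,
        (
            SELECT COUNT(DISTINCT inner_star.galaxy_id)
            FROM star inner_star
            WHERE inner_star.type = star.type
        ) AS galaxies_count
    FROM star
    ORDER BY galaxies_count, star.id
    """
).fetchall()
```

Red stars appear in galaxies 1, 2 and 3, so every red star gets a count of 3, while the single blue and white stars get 1. Sorting by the computed column is what the question wanted from order_by('desired_count').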

Django ORM filter by Max column value of two related models

I have 3 related models:
class Program(Model):
    ...  # which aggregates ProgramVersions
class ProgramVersion(Model):
    program = ForeignKey(Program)
    index = IntegerField()
class UserProgramVersion(Model):
    user = ForeignKey(User)
    version = ForeignKey(ProgramVersion)
    index = IntegerField()
ProgramVersion and UserProgramVersion are orderable models based on index field - object with highest index in the table is considered latest/newest object (this is handled by some custom logic, not relevant).
I would like to select all latest UserProgramVersion's, i.e. latest UPV's which point to the same Program.
This can be handled by this UserProgramVersion queryset method:
def latest_user_program_versions(self):
    latest = self\
        .order_by('version__program_id', '-version__index', '-index')\
        .distinct('version__program_id')
    return self.filter(id__in=latest)
This works fine; however, I'm looking for a solution which does NOT use .distinct().
I tried something like this:
def latest_user_program_versions(self):
    latest = self\
        .annotate(
            max_version_index=Max('version__index'),
            max_index=Max('index'))\
        .filter(
            version__index=F('max_version_index'),
            index=F('max_index'))
    return self.filter(id__in=latest)
This, however, does not work.
Use Subquery() expressions, available since Django 1.11. The example in the docs is similar, and its purpose is also to get the newest item for the required parent records.
(You could probably start from that example with your objects, but I also wrote a more complete, more complicated suggestion to avoid possible performance pitfalls.)
from django.db.models import OuterRef, Subquery
...
def latest_user_program_versions(self, *args, **kwargs):
    # You should filter users by args or kwargs here, for performance reasons.
    # If you do it here, it is also applied to the subquery: much faster on a big db.
    qs = self.filter(*args, **kwargs)
    parent = Program.objects.filter(pk__in=qs.values('version__program'))
    newest = (
        qs.filter(version__program=OuterRef('pk'))
        .order_by('-version__index', '-index')
    )
    pks = (
        parent.annotate(newest_id=Subquery(newest.values('pk')[:1]))
        .values_list('newest_id', flat=True)
    )
    # Optionally uncomment this to have it compiled as two shorter SQL queries.
    # pks = list(pks)
    return self.filter(pk__in=pks)
If you considerably improve it, write the solution in your answer.
EDIT Your problem in your second solution:
Nobody can saw off the branch they are sitting on, not even in SQL, but you can sit on a temporary copy of it in a subquery and survive :-) That is also why I ask for a filter at the beginning. The second problem is that Max('version__index') and Max('index') could come from two different objects, so no row may satisfy both conditions.
EDIT2: Verified: The internal SQL from my query is complicated, but seems correct.
SELECT app_userprogramversion.id,...
FROM app_userprogramversion
WHERE app_userprogramversion.id IN
    (SELECT
        (SELECT U0.id
         FROM app_userprogramversion U0
         INNER JOIN app_programversion U2 ON (U0.version_id = U2.id)
         WHERE (U0.user_id = 123 AND U2.program_id = (V0.id))
         ORDER BY U2.index DESC, U0.index DESC LIMIT 1
        ) AS newest_id
     FROM app_program V0 WHERE V0.id IN
        (SELECT U2.program_id AS Col1
         FROM app_userprogramversion U0
         INNER JOIN app_programversion U2 ON (U0.version_id = U2.id)
         WHERE U0.user_id = 123
        )
    )
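The second problem named in the EDIT above (the two maxima coming from different objects) is easy to see on plain Python tuples. The rows below are invented for illustration and stand for the UserProgramVersion records of a single program.

```python
# Hypothetical rows for one program: (upv_id, version_index, upv_index).
rows = [
    (1, 3, 1),  # newest program version, older user record
    (2, 2, 5),  # older program version, newest user record
]

# Independent maxima, as in Max('version__index') and Max('index').
max_version_index = max(r[1] for r in rows)  # taken from row 1
max_index = max(r[2] for r in rows)          # taken from row 2
both_maxima = [r for r in rows if r[1] == max_version_index and r[2] == max_index]

# Ordering by the pair, as in order_by('-version__index', '-index')[:1].
newest = max(rows, key=lambda r: (r[1], r[2]))
```

No row holds both maxima at once, so both_maxima is empty, while ordering by the pair always picks exactly one row per program.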

django - annotate() instead of distinct()

I am stuck on this issue.
I have two models:
Location and Rate.
Each location has its rate, possibly multiple rates.
I want to get the locations ordered by their rates, ascending.
Obviously, order_by() and distinct() don't work together:
locations = Location.objects.filter(**s_kwargs).order_by('locations_rate__rate').distinct('id')
Then I read the docs and came across annotate(), but I am not sure whether I have to use a function inside annotate().
If I do this:
locations = Location.objects.filter(**s_kwargs).annotate(rate=Count('locations_rate__rate')).order_by('rate')
this counts the rates and orders by that count. I want to get the locations with their rates, ordered by the value of those rates.
my model definitions are:
class Location(models.Model):
    name = models.TextField()
    adres = models.TextField()
class Rate(models.Model):
    location = models.ForeignKey(Location,related_name='locations_rate')
    rate = models.IntegerField(max_length=2)
    price_rate = models.IntegerField(max_length=2) #<--- added now
    datum = models.DateTimeField(auto_now_add=True,blank=True) #<--- added now
Well, the issue is not how to write the query in Django for the problem you described. It's that your problem is either incorrect or not properly thought through. Let me explain with an example:
Suppose you have two Location objects, l1 and l2. l1 has two Rate objects related to it, r1 and r3, such that r1.rate = 1 and r3.rate = 3, and l2 has one Rate object related to it, r2, such that r2.rate = 2. Now what should the order of your query's result be: l1 followed by l2, or l2 followed by l1? One of l1's rates is less than l2's rate and the other one is greater.
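The ambiguity in this example can be spelled out in plain Python. The dictionary below just restates l1 and l2 from the paragraph above; which aggregate to sort by is a choice the question leaves open. In the ORM, the two options would correspond to annotating with Min('locations_rate__rate') or Max('locations_rate__rate') before ordering.

```python
# Rates per location, as in the example: l1 has rates 1 and 3, l2 has rate 2.
rates = {"l1": [1, 3], "l2": [2]}

# Sorting by each location's lowest rate puts l1 first...
by_lowest = sorted(rates, key=lambda name: min(rates[name]))
# ...while sorting by each location's highest rate puts l2 first.
by_highest = sorted(rates, key=lambda name: max(rates[name]))
```

Both orderings are "ascending by rate", so the query needs an explicit aggregate to be well defined.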
Try this:
from django.db.models import Count, Sum
# if you want to annotate by count of rates
locations = Location.objects.filter(**s_kwargs) \
    .annotate(rate_count=Count('locations_rate')) \
    .order_by('rate_count')
# if you want to annotate on values of rate, e.g. Sum
locations = Location.objects.filter(**s_kwargs) \
    .annotate(rate_sum=Sum('locations_rate__rate')) \
    .order_by('rate_sum')
Possibly you want something like this:
locations = (Location.objects.filter(**s_kwargs)
             .values('locations_rate__rate')
             .annotate(Count('locations_rate__rate'))
             .order_by('locations_rate__rate'))
You need the Count() since you actually need a GROUP BY query, and GROUP BY only works with aggregate functions like COUNT or SUM.
Anyway I think your problem can be solved with normal distinct():
locations = (Location.objects.filter(**s_kwargs)
             .order_by('locations_rate__rate')
             .distinct('locations_rate__rate'))
Why would you want to use annotate() instead?
I haven't tested both but hope it helps.
annotate(*args, **kwargs): Annotates each object in the QuerySet with the provided list of aggregate values (averages, sums, etc.) that have been computed over the objects that are related to the objects in the QuerySet.
So if you only want to get the locations ordered by their rates, ascending, you don't have to use annotate().
You can try this:
loc = Location.objects.all()
rate = Rate.objects.filter(location__in=loc).order_by('rate')
