Galaxies across the universe host millions or billions of stars, each belonging to a specific type depending on its physical properties (Red Star, Blue Supergiant, White Dwarf, etc.). For each Star in my database, I'm trying to find the number of distinct galaxies that are also home to at least one star of that same type.
class Galaxy(Model):
    ...
class Star(Model):
    galaxy = ForeignKey(Galaxy, related_name='stars')
    type = CharField(...)
Performing this query individually for each Star might be comfortably done by:
star = <some_Star>
desired_galaxies = Galaxy.objects.filter(stars__type=star.type).distinct()
desired_count = desired_galaxies.count()
Or even, albeit more redundant:
desired_count = Star.objects.filter(galaxy__stars__type=star.type).values('galaxy').distinct().count()
This gets a little fuzzier when I try to get the count result for all the stars in a "single" query:
all_stars = Star.objects.annotate(desired_count=...)
The main reason I want to do that is to be able to sort with Star.objects.order_by('desired_count') in a clean way.
What I've tried so far:
Star.objects.annotate(desired_count=Count('galaxy', filter=Q(galaxy__stars__type=F('type')), distinct=True))
But this annotates 1 for every star. I guess I'll have to use OuterRef and Subquery here, but I'm not sure how.
You can use GROUP BY to get the count:
Star.objects.values('type').annotate(desired_count=Count('galaxy', distinct=True)).values('type', 'desired_count')
Django doesn't provide a way to define multi-valued relationships between models that don't involve foreign keys yet. If it did you could do something like
class Galaxy(Model):
    ...
class Star(Model):
    galaxy = ForeignKey(Galaxy, related_name='stars')
    type = CharField(...)
    same_type_stars = Relation(
        'self', from_fields=('type',), to_fields=('type',)
    )
Star.objects.annotate(
    galaxies_count=Count('same_type_stars__galaxy', distinct=True)
)
This would result in something along these lines:
SELECT
star.*,
COUNT(DISTINCT same_star_type.galaxy_id) galaxies_count
FROM star
LEFT JOIN star same_star_type ON (same_star_type.type = star.type)
GROUP BY star.id
If you want to achieve something similar, you'll need to use a subquery for now:
Star.objects.annotate(
    galaxies_count=Subquery(
        Star.objects.filter(
            type=OuterRef('type'),
        ).values('type').values(
            inner_count=Count('galaxy', distinct=True),
        )
    )
)
This would result in something along these lines:
SELECT
star.*,
(
SELECT COUNT(DISTINCT inner_star.galaxy_id)
FROM star inner_star
WHERE inner_star.type = star.type
GROUP BY inner_star.type
) galaxies_count
FROM star
This will likely perform poorly on databases that don't materialize correlated subqueries (e.g. MySQL). In any case, make sure you index Star.type, otherwise you'll get bad performance no matter what. A composite index on ('type', 'galaxy') might be even better, as it might allow an index-only scan (e.g. on PostgreSQL).
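To see the shape of that correlated subquery without a Django project, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the index name and the helper galaxies_per_star() are assumptions made up for this illustration; they only mirror the Star model above and are not what Django generates verbatim.

```python
import sqlite3


def galaxies_per_star(rows):
    """Return {star_id: number of distinct galaxies hosting a star of the same type}.

    rows is an iterable of (star_id, galaxy_id, type) tuples (hypothetical data).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE star (id INTEGER PRIMARY KEY, galaxy_id INTEGER, type TEXT)"
    )
    # Composite index on (type, galaxy_id), as suggested in the answer above.
    conn.execute("CREATE INDEX star_type_galaxy ON star (type, galaxy_id)")
    conn.executemany(
        "INSERT INTO star (id, galaxy_id, type) VALUES (?, ?, ?)", rows
    )
    result = conn.execute(
        """
        SELECT star.id,
               (SELECT COUNT(DISTINCT inner_star.galaxy_id)
                FROM star AS inner_star
                WHERE inner_star.type = star.type) AS galaxies_count
        FROM star
        ORDER BY galaxies_count, star.id
        """
    ).fetchall()
    conn.close()
    return dict(result)


sample = [
    (1, 10, "red"), (2, 10, "red"), (3, 20, "red"),
    (4, 20, "blue"),
    (5, 30, "white"), (6, 30, "white"),
]
counts = galaxies_per_star(sample)
# Red stars live in galaxies 10 and 20, so every red star gets 2.
```

Because galaxies_count is an ordinary selected column, ordering by it is straightforward, which is what the question was after.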
I want to calculate the percentage of all car types using the Django ORM, that is, group all of the cars by their type and calculate each type's percentage. I have multiple solutions, but they are old-fashioned and iterative. I am going to use this query on a dashboard where multiple queries are already calculating different analytics. I don't want to compromise on performance, which is why I prefer a single query. Here is the structure of my tables, written as Django models:
class CarType(models.Model):
    name = models.CharField(max_length=50)
class Car(models.Model):
    car_type = models.ForeignKey(CarType, on_delete=models.CASCADE)
I have a utility function that has the following details:
input => cars: (Queryset) of cars Django.
output => list of all car_types (dictionaries) having percentage.
[{'car_type': 'car01', 'percentage': 70, 'this_car_type_count': 20}, ...]
What I've tried so far:
cars.annotate(
    total=Count('pk')
).annotate(
    car_type_name=F('car_type__name')
).values(
    'car_type_name'
).annotate(
    car_type_count=Count('car_type_name'),
    percentage=Cast(F('car_type_count') * 100.0 / F('total'), FloatField()),
)
But this solution gives 100% for every car type. I know this weird behavior is caused by the values() call I'm using, but I'm kind of stuck here.
F('total') will be the count of cars within each group (each car type) not the total count of the whole table. This is why you always get 100%. You can achieve what you want in two queries:
total = cars.count()
cars.annotate(
    car_type_name=F('car_type__name')
).values(
    'car_type_name'
).annotate(
    car_type_count=Count('car_type_name'),
    percentage=Cast(F('car_type_count') * 100.0 / total, FloatField())
)
If you really want to do this in one query instead of two, the computation of total will need to be a window function instead of a regular aggregate.
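As a rough illustration of the window-function idea, here is a sketch at the SQL level using Python's built-in sqlite3 module rather than the ORM. It assumes SQLite 3.25 or newer (the first version with window functions); the single-table car schema and the helper type_percentages() are simplifications invented for this example.

```python
import sqlite3


def type_percentages(car_types):
    """Return per-type counts and percentages for a list of car type names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE car (id INTEGER PRIMARY KEY, car_type TEXT)")
    conn.executemany(
        "INSERT INTO car (car_type) VALUES (?)", [(t,) for t in car_types]
    )
    rows = conn.execute(
        """
        SELECT car_type,
               COUNT(*) AS this_car_type_count,
               -- SUM(...) OVER () runs over all groups, so it yields the table total
               COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage
        FROM car
        GROUP BY car_type
        ORDER BY car_type
        """
    ).fetchall()
    conn.close()
    return [
        {"car_type": t, "this_car_type_count": n, "percentage": p}
        for t, n, p in rows
    ]


percentages = type_percentages(["suv", "suv", "suv", "sedan"])
# sedan: 1 of 4 cars -> 25.0, suv: 3 of 4 cars -> 75.0
```

The per-group COUNT(*) stays a regular aggregate, while the total is computed by a window over the grouped rows, so both values are available in one statement.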
Background
Suppose we have a set of questions, and a set of students that answered these questions.
The answers have been reviewed, and scores have been assigned, on some unknown range.
Now, we need to normalize the scores with respect to the extreme values within each question.
For example, if question 1 has a minimum score of 4 and a maximum score of 12, those scores would be normalized to 0 and 1 respectively. Scores in between are interpolated linearly (as described e.g. in Normalization to bring in the range of [0,1]).
Then, for each student, we would like to know the mean of the normalized scores for all questions combined.
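To make the arithmetic concrete, here is a small database-free sketch in plain Python. The function name mean_normalized_scores() and the tuple layout are made up for this illustration, and skipping questions where all scores are equal is an assumption (the normalization is undefined there).

```python
from collections import defaultdict
from statistics import mean


def mean_normalized_scores(scores):
    """scores: list of (student, question, value) tuples.

    Returns {student: mean of that student's min-max normalized scores}.
    """
    # Collect the raw values per question to find each question's extremes.
    values_by_question = defaultdict(list)
    for _, question, value in scores:
        values_by_question[question].append(value)
    limits = {q: (min(v), max(v)) for q, v in values_by_question.items()}

    normalized_by_student = defaultdict(list)
    for student, question, value in scores:
        low, high = limits[question]
        if high == low:
            continue  # assumption: skip questions where normalization is undefined
        normalized_by_student[student].append((value - low) / (high - low))
    return {s: mean(v) for s, v in normalized_by_student.items()}


result = mean_normalized_scores([
    ("a", "q1", 4), ("b", "q1", 8), ("c", "q1", 12),
    ("a", "q2", 0), ("b", "q2", 10), ("c", "q2", 5),
])
# q1 maps 4 -> 0.0, 8 -> 0.5, 12 -> 1.0; q2 maps 0 -> 0.0, 10 -> 1.0, 5 -> 0.5
```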
Minimal example
Here's a very naive minimal implementation, just to illustrate what we would like to achieve:
from statistics import mean
from django.db import models
class Question(models.Model):
    pass
class Student(models.Model):
    def mean_normalized_score(self):
        normalized_scores = []
        for score in self.score_set.all():
            normalized_scores.append(score.normalized_value())
        return mean(normalized_scores) if normalized_scores else None
class Score(models.Model):
    student = models.ForeignKey(to=Student, on_delete=models.CASCADE)
    question = models.ForeignKey(to=Question, on_delete=models.CASCADE)
    value = models.FloatField()
    def normalized_value(self):
        # One extra query per score: fetch the extremes for this score's question.
        limits = Score.objects.filter(question=self.question).aggregate(
            min=models.Min('value'), max=models.Max('value'))
        return (self.value - limits['min']) / (limits['max'] - limits['min'])
This works well, but it is quite inefficient in terms of database queries, etc.
Goal
Instead of the implementation above, I would prefer to offload the number-crunching onto the database.
What I've tried
Consider, for example, these two use cases:
list the normalized_value for all Score objects
list the mean_normalized_score for all Student objects
The first use case can be covered using window functions in a query, something like this:
w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
annotated_scores = Score.objects.annotate(
    normalized_value=(F('value') - w_min) / (w_max - w_min))
This works nicely, so the Score.normalized_value() method from the example is no longer needed.
Now, I would like to do something similar for the second use case, to replace the Student.mean_normalized_score() method by a single database query.
The raw SQL could look something like this (for sqlite):
SELECT id, student_id, AVG(normalized_value) AS mean_normalized_score
FROM (
SELECT
myapp_score.*,
((myapp_score.value - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)) / (MAX(myapp_score.value) OVER (PARTITION BY myapp_score.question_id) - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)))
AS normalized_value
FROM myapp_score
)
GROUP BY student_id
I can make this work as a raw Django query, but I have not yet been able to reproduce this query using Django's ORM.
I've tried building on the annotated_scores queryset described above, using Django's Subquery, annotate(), aggregate(), Prefetch, and combinations of those, but I must be making a mistake somewhere.
Probably the closest I've gotten is this:
subquery = Subquery(annotated_scores.values('normalized_value'))
Score.objects.values('student_id').annotate(mean=Avg(subquery))
But this is incorrect.
Could someone point me in the right direction, without resorting to raw queries?
I may have found a way to do this using subqueries. The main obstacle is that, at least in Django, we cannot aggregate over window functions, and that is what blocks the calculation of the mean of the normalized values. I've added comments to the code to explain what I'm trying to do:
# Get the minimum score per question
min_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(min=Min('value'))
# Get the maximum score per question
max_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(max=Max('value'))
# Calculate the normalized value per score, then get the average by grouping by students
mean_subquery = Score.objects.filter(student=OuterRef('pk')).annotate(
    min=Subquery(min_subquery.values('min')[:1]),
    max=Subquery(max_subquery.values('max')[:1]),
    normalized=ExpressionWrapper((F('value') - F('min')) / (F('max') - F('min')), output_field=FloatField())
).values('student').annotate(mean=Avg('normalized'))
# Get the calculated mean per student
Student.objects.annotate(mean=Subquery(mean_subquery.values('mean')[:1]))
The resulting SQL is:
SELECT
    "student"."id",
    "student"."name",
    (
        SELECT
            AVG(
                (
                    (
                        V0."value" - (
                            SELECT MIN(U0."value") AS "min"
                            FROM "score" U0
                            WHERE U0."question_id" = (V0."question_id")
                            GROUP BY U0."question_id"
                            LIMIT 1
                        )
                    ) / (
                        (
                            SELECT MAX(U0."value") AS "max"
                            FROM "score" U0
                            WHERE U0."question_id" = (V0."question_id")
                            GROUP BY U0."question_id"
                            LIMIT 1
                        ) - (
                            SELECT MIN(U0."value") AS "min"
                            FROM "score" U0
                            WHERE U0."question_id" = (V0."question_id")
                            GROUP BY U0."question_id"
                            LIMIT 1
                        )
                    )
                )
            ) AS "mean"
        FROM "score" V0
        WHERE V0."student_id" = ("student"."id")
        GROUP BY V0."student_id"
        LIMIT 1
    ) AS "mean"
FROM "student"
As mentioned by @bdbd, and judging from this Django issue, it appears that annotating a windowed queryset is not yet possible (using Django 3.2).
As a temporary workaround, I refactored @bdbd's excellent Subquery solution as follows.
class ScoreQuerySet(models.QuerySet):
    def annotate_normalized(self):
        w_min = Subquery(self.filter(
            question=OuterRef('question')).values('question').annotate(
            min=Min('value')).values('min')[:1])
        w_max = Subquery(self.filter(
            question=OuterRef('question')).values('question').annotate(
            max=Max('value')).values('max')[:1])
        return self.annotate(normalized=(F('value') - w_min) / (w_max - w_min))
    def aggregate_student_mean(self):
        return self.annotate_normalized().values('student_id').annotate(
            mean=Avg('normalized'))
class Score(models.Model):
    objects = ScoreQuerySet.as_manager()
    ...
Note: If necessary, we can add more Student lookups to the values() call in aggregate_student_mean(), e.g. student__name, as long as we take care not to mess up the grouping.
Now, if it ever becomes possible to filter and annotate windowed querysets, we can simply replace the Subquery lines by the much simpler Window implementation:
w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
I have 3 related models:
class Program(Model):
    ...  # which aggregates ProgramVersions
class ProgramVersion(Model):
    program = ForeignKey(Program)
    index = IntegerField()
class UserProgramVersion(Model):
    user = ForeignKey(User)
    version = ForeignKey(ProgramVersion)
    index = IntegerField()
ProgramVersion and UserProgramVersion are orderable models based on index field - object with highest index in the table is considered latest/newest object (this is handled by some custom logic, not relevant).
I would like to select all latest UserProgramVersion's, i.e. latest UPV's which point to the same Program.
This can be handled by this UserProgramVersion queryset method:
def latest_user_program_versions(self):
    latest = self\
        .order_by('version__program_id', '-version__index', '-index')\
        .distinct('version__program_id')
    return self.filter(id__in=latest)
This works fine; however, I'm looking for a solution which does NOT use .distinct().
I tried something like this:
def latest_user_program_versions(self):
    latest = self\
        .annotate(
            max_version_index=Max('version__index'),
            max_index=Max('index'))\
        .filter(
            version__index=F('max_version_index'),
            index=F('max_index'))
    return self.filter(id__in=latest)
This, however, does not work.
Use Subquery() expressions in Django 1.11. The example in docs is similar and the purpose is also to get the newest item for required parent records.
(You could probably start from that example using your own objects, but I have also written a more complete and more complicated suggestion that avoids possible performance pitfalls.)
from django.db.models import OuterRef, Subquery
...
def latest_user_program_versions(self, *args, **kwargs):
    # You should filter users by args or kwargs here, for performance reasons.
    # A filter applied here is also applied to the subquery, which is much faster on a big database.
    qs = self.filter(*args, **kwargs)
    parent = Program.objects.filter(pk__in=qs.values('version__program'))
    newest = (
        qs.filter(version__program=OuterRef('pk'))
        .order_by('-version__index', '-index')
    )
    pks = (
        parent.annotate(newest_id=Subquery(newest.values('pk')[:1]))
        .values_list('newest_id', flat=True)
    )
    # Uncomment this if you prefer it to be compiled into two shorter SQL queries.
    # pks = list(pks)
    return self.filter(pk__in=pks)
If you considerably improve it, write the solution in your answer.
EDIT Your problem in your second solution:
Nobody can saw off the branch they are sitting on, not even in SQL, but I can sit on a temporary copy of it in a subquery and survive :-) That is also why I ask for a filter at the beginning. The second problem is that Max('version__index') and Max('index') could come from two different objects, so no valid intersection is found.
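The second problem can be shown with plain Python tuples standing in for rows. The data below is hypothetical, and the last line only mimics what ordering by ('-version__index', '-index') and taking the first row does.

```python
# Hypothetical rows: (upv_id, program_id, version_index, upv_index)
rows = [
    (1, 1, 2, 1),  # newest program version, but a low UserProgramVersion index
    (2, 1, 1, 5),  # older program version, but the highest UserProgramVersion index
]

# Taking each maximum independently picks values from two different rows.
max_version_index = max(r[2] for r in rows)  # 2, from row 1
max_index = max(r[3] for r in rows)          # 5, from row 2

# No single row has both maxima, so filtering on both finds nothing.
both_maxima = [
    r for r in rows if r[2] == max_version_index and r[3] == max_index
]

# Ordering by the pair (version_index, upv_index) always yields one real row.
latest = max(rows, key=lambda r: (r[2], r[3]))
```

Here both_maxima is empty while latest is row 1, which is why the subquery orders by both fields and takes the first result instead of comparing against two separate maxima.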
EDIT2: Verified: The internal SQL from my query is complicated, but seems correct.
SELECT app_userprogramversion.id,...
FROM app_userprogramversion
WHERE app_userprogramversion.id IN
(SELECT
(SELECT U0.id
FROM app_userprogramversion U0
INNER JOIN app_programversion U2 ON (U0.version_id = U2.id)
WHERE (U0.user_id = 123 AND U2.program_id = (V0.id))
ORDER BY U2.index DESC, U0.index DESC LIMIT 1
) AS newest_id
FROM app_program V0 WHERE V0.id IN
(SELECT U2.program_id AS Col1
FROM app_userprogramversion U0
INNER JOIN app_programversion U2 ON (U0.version_id = U2.id)
WHERE U0.user_id = 123
)
)
Using Django ORM, can one do something like queryset.objects.annotate(Count('queryset_objects', gte=VALUE)). Catch my drift?
Here's a quick example to use for illustrating a possible answer:
In a Django website, content creators submit articles, and regular users view (i.e. read) the said articles. Articles can either be published (i.e. available for all to read), or in draft mode. The models depicting these requirements are:
class Article(models.Model):
    author = models.ForeignKey(User)
    published = models.BooleanField(default=False)
class Readership(models.Model):
    reader = models.ForeignKey(User)
    which_article = models.ForeignKey(Article)
    what_time = models.DateTimeField(auto_now_add=True)
My question is: How can I get all published articles, sorted by unique readership from the last 30 mins? I.e. I want to count how many distinct (unique) views each published article got in the last half an hour, and then produce a list of articles sorted by these distinct views.
I tried:
date = datetime.now()-timedelta(minutes=30)
articles = Article.objects.filter(published=True).extra(select = {
"views" : """
SELECT COUNT(*)
FROM myapp_readership
JOIN myapp_article on myapp_readership.which_article_id = myapp_article.id
WHERE myapp_readership.reader_id = myapp_user.id
AND myapp_readership.what_time > %s """ % date,
}).order_by("-views")
This raised the error: syntax error at or near "01" (where "01" came from the datetime object interpolated inside extra). It's not much to go on.
For django >= 1.8
Use Conditional Aggregation:
from django.db.models import Count, Case, When, IntegerField
Article.objects.annotate(
    numviews=Count(Case(
        When(readership__what_time__lt=threshold, then=1),
        output_field=IntegerField(),
    ))
)
Explanation:
A normal query over your articles will be annotated with a numviews field. That field is built as a CASE/WHEN expression wrapped in Count: it returns 1 for each readership row matching the criteria and NULL for rows that do not. Count ignores NULLs and counts only the remaining values.
You will get zeros on articles that haven't been viewed recently and you can use that numviews field for sorting and filtering.
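The NULL-skipping behaviour of COUNT can be checked directly with Python's built-in sqlite3 module. This is only a sketch: the two tables are trimmed-down stand-ins for the models in the question, the sample rows are invented, and the comparison here is written as `>` so that views newer than the cutoff are counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE article (id INTEGER PRIMARY KEY, published INTEGER);
    CREATE TABLE readership (
        id INTEGER PRIMARY KEY,
        reader_id INTEGER,
        which_article_id INTEGER,
        what_time TEXT
    );
    INSERT INTO article (id, published) VALUES (1, 1), (2, 1);
    INSERT INTO readership (reader_id, which_article_id, what_time) VALUES
        (7, 1, '2015-11-18 11:10:00'),  -- recent view by reader 7
        (7, 1, '2015-11-18 11:20:00'),  -- second recent view by reader 7
        (8, 1, '2015-11-18 10:00:00');  -- old view by reader 8
    """
)

cutoff = "2015-11-18 11:04:00"
rows = conn.execute(
    """
    SELECT article.id,
           -- CASE yields NULL for non-matching rows, which COUNT skips
           COUNT(CASE WHEN readership.what_time > ? THEN 1 END) AS numviews,
           COUNT(DISTINCT CASE WHEN readership.what_time > ?
                               THEN readership.reader_id END) AS unique_views
    FROM article
    LEFT OUTER JOIN readership
        ON article.id = readership.which_article_id
    GROUP BY article.id
    ORDER BY article.id
    """,
    (cutoff, cutoff),
).fetchall()
conn.close()
# Article 1: two recent views by one reader; article 2: never viewed.
```

Note that the cutoff is passed as a bound parameter instead of being formatted into the SQL string, which avoids the quoting problem behind the syntax error in the question.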
Query behind this for PostgreSQL will be:
SELECT
"app_article"."id",
"app_article"."author",
"app_article"."published",
COUNT(
CASE WHEN "app_readership"."what_time" < 2015-11-18 11:04:00.000000+01:00 THEN 1
ELSE NULL END
) as "numviews"
FROM "app_article" LEFT OUTER JOIN "app_readership"
ON ("app_article"."id" = "app_readership"."which_article_id")
GROUP BY "app_article"."id", "app_article"."author", "app_article"."published"
If we want to count only unique readers, we can pass distinct=True to Count and make the When clause return the value we want to be distinct on.
from django.db.models import Count, Case, When, CharField, F
Article.objects.annotate(
    numviews=Count(Case(
        When(readership__what_time__lt=threshold, then=F('readership__reader')),  # it can also be `readership__reader_id`, it doesn't matter
        output_field=CharField(),
    ), distinct=True)
)
That will produce:
SELECT
"app_article"."id",
"app_article"."author",
"app_article"."published",
COUNT(
DISTINCT CASE WHEN "app_readership"."what_time" < 2015-11-18 11:04:00.000000+01:00 THEN "app_readership"."reader_id"
ELSE NULL END
) as "numviews"
FROM "app_article" LEFT OUTER JOIN "app_readership"
ON ("app_article"."id" = "app_readership"."which_article_id")
GROUP BY "app_article"."id", "app_article"."author", "app_article"."published"
For django < 1.8 and PostgreSQL
You can just use raw for executing SQL statement created by newer versions of django. Apparently there is no simple and optimized method for querying that data without using raw (even with extra there are some problems with injecting required JOIN clause).
# threshold is the cutoff datetime; it is passed as a query parameter
# so that the database driver quotes it correctly.
Article.objects.raw(
    'SELECT'
    ' "app_article"."id",'
    ' "app_article"."author",'
    ' "app_article"."published",'
    ' COUNT('
    '  DISTINCT CASE WHEN "app_readership"."what_time" < %s THEN "app_readership"."reader_id"'
    '  ELSE NULL END'
    ' ) AS "numviews"'
    ' FROM "app_article" LEFT OUTER JOIN "app_readership"'
    '  ON ("app_article"."id" = "app_readership"."which_article_id")'
    ' GROUP BY "app_article"."id", "app_article"."author", "app_article"."published"',
    [threshold],
)
For django >= 2.0 you can use Conditional aggregation with a filter argument in the aggregate functions:
from datetime import timedelta
from django.utils import timezone
from django.db.models import Count, Q  # need import
Article.objects.annotate(
    numviews=Count(
        'readership__reader__id',
        filter=Q(readership__what_time__gt=timezone.now() - timedelta(minutes=30)),
        distinct=True
    )
)
I am stuck on this issue:
I have two models, Location and Rate.
Each location has its rate, possibly multiple rates.
I want to get locations ordered by their rates, ascending.
Obviously, order_by() and distinct() don't work together:
locations = Location.objects.filter(**s_kwargs).order_by('locations_rate__rate').distinct('id')
Then I read the docs and came across annotate(), but I am not sure which function I have to use inside annotate().
If I do this:
locations = Location.objects.filter(**s_kwargs).annotate(rate=Count('locations_rate__rate')).order_by('rate')
But this counts the rates and orders by that count. I want to get locations with their rates, ordered by the value of those rates.
My model definitions are:
class Location(models.Model):
    name = models.TextField()
    adres = models.TextField()
class Rate(models.Model):
    location = models.ForeignKey(Location, related_name='locations_rate')
    rate = models.IntegerField(max_length=2)
    price_rate = models.IntegerField(max_length=2)  # <--- added now
    datum = models.DateTimeField(auto_now_add=True, blank=True)  # <--- added now
Well, the issue is not how to write the Django query for the problem you described. It's that your problem is either incorrect or not properly thought through. Let me explain with an example:
Suppose you have two Location objects, l1 and l2. l1 has two Rate objects related to it, r1 and r3, such that r1.rate = 1 and r3.rate = 3, and l2 has one Rate object related to it, r2, such that r2.rate = 2. Now what should the order of your query's result be: l1 followed by l2, or l2 followed by l1? One of l1's rates is less than l2's rate and the other one is greater than l2's rate.
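The ambiguity is easy to reproduce in plain Python with the same numbers. The dictionary below is just a stand-in for the two locations and their rates; in the ORM the two choices would roughly correspond to annotating with Min('locations_rate__rate') or Max('locations_rate__rate') before ordering.

```python
# Rates per location, taken from the example above.
rates = {"l1": [1, 3], "l2": [2]}

# Ordering by each location's lowest rate puts l1 first (1 < 2).
by_lowest = sorted(rates, key=lambda loc: min(rates[loc]))

# Ordering by each location's highest rate puts l2 first (2 < 3).
by_highest = sorted(rates, key=lambda loc: max(rates[loc]))
```

Both orders are "ascending by rate", so the query has to state which single value per location it sorts on.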
Try this:
from django.db.models import Count, Sum
# if you want to order by the number of rates
locations = Location.objects.filter(**s_kwargs) \
    .annotate(rate_count=Count('locations_rate')) \
    .order_by('rate_count')
# if you want to order by an aggregate of the rate values, e.g. Sum
locations = Location.objects.filter(**s_kwargs) \
    .annotate(rate_sum=Sum('locations_rate__rate')) \
    .order_by('rate_sum')
Possibly you want something like this:
locations = (Location.objects.filter(**s_kwargs)
.values('locations_rate__rate')
.annotate(Count('locations_rate__rate'))
.order_by('locations_rate__rate'))
You need the Count() since you actually need a GROUP BY query, and GROUP BY only works with aggregate functions like COUNT or SUM.
Anyway I think your problem can be solved with normal distinct():
locations = (Location.objects.filter(**s_kwargs)
.order_by('locations_rate__rate')
.distinct('locations_rate__rate'))
Why would you want to use annotate() instead?
I haven't tested either of them, but I hope it helps.
annotate(*args, **kwargs): Annotates each object in the QuerySet with the provided list of aggregate values (averages, sums, etc.) that have been computed over the objects that are related to the objects in the QuerySet.
So if you only want to get locations ordered by their rates, ascending, you don't have to use annotate().
You can try this:
loc = Location.objects.all()
rate = Rate.objects.filter(location__in=loc).order_by('rate')