How can you achieve this with one query:
upcoming_events = Event.objects.order_by('date').filter(date__gte=today)
try:
return upcoming_events[0]
except IndexError:
return Event.objects.all().order_by('-date')[0]
My idea is to do something like this:
Event.objects.filter(Q(date__gte=today) | Q(date is max date))[0]
But I don't know how to implement the max date. Maybe I've just to do it with Func. Or When or Case in django.db.expressions might be helpful.
Here is a solution with one query (thanks to this post) but I'm not sure if it's faster than the implementation in my question:
from django.db.models import Max, Value, Q
latest = (
Event.objects
.all()
.annotate(common=Value(1))
.values('common')
.annotate(latest=Max('date'))
.values('latest')
)
events = Event.objects.order_by('date').filter(
Q(date__gte=datetime.date.today()) | Q(date=latest)
)
return events[0]
In my case I finally just took the rows in question (the two latest) and checked for the right event on Python level.
Related
Background
Suppose we have a set of questions, and a set of students that answered these questions.
The answers have been reviewed, and scores have been assigned, on some unknown range.
Now, we need to normalize the scores with respect to the extreme values within each question.
For example, if question 1 has a minimum score of 4 and a maximum score of 12, those scores would be normalized to 0 and 1 respectively. Scores in between are interpolated linearly (as described e.g. in Normalization to bring in the range of [0,1]).
Then, for each student, we would like to know the mean of the normalized scores for all questions combined.
Minimal example
Here's a very naive minimal implementation, just to illustrate what we would like to achieve:
class Question(models.Model):
pass
class Student(models.Model):
def mean_normalized_score(self):
normalized_scores = []
for score in self.score_set.all():
normalized_scores.append(score.normalized_value())
return mean(normalized_scores) if normalized_scores else None
class Score(models.Model):
student = models.ForeignKey(to=Student, on_delete=models.CASCADE)
question = models.ForeignKey(to=Question, on_delete=models.CASCADE)
value = models.FloatField()
def normalized_value(self):
limits = Score.objects.filter(question=self.question).aggregate(
min=models.Min('value'), max=models.Max('value'))
return (self.value - limits['min']) / (limits['max'] - limits['min'])
This works well, but it is quite inefficient in terms of database queries, etc.
Goal
Instead of the implementation above, I would prefer to offload the number-crunching on to the database.
What I've tried
Consider, for example, these two use cases:
list the normalized_value for all Score objects
list the mean_normalized_score for all Student objects
The first use case can be covered using window functions in a query, something like this:
w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
annotated_scores = Score.objects.annotate(
normalized_value=(F('value') - w_min) / (w_max - w_min))
This works nicely, so the Score.normalized_value() method from the example is no longer needed.
Now, I would like to do something similar for the second use case, to replace the Student.mean_normalized_score() method by a single database query.
The raw SQL could look something like this (for sqlite):
SELECT id, student_id, AVG(normalized_value) AS mean_normalized_score
FROM (
SELECT
myapp_score.*,
((myapp_score.value - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)) / (MAX(myapp_score.value) OVER (PARTITION BY myapp_score.question_id) - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)))
AS normalized_value
FROM myapp_score
)
GROUP BY student_id
I can make this work as a raw Django query, but I have not yet been able to reproduce this query using Django's ORM.
I've tried building on the annotated_scores queryset described above, using Django's Subquery, annotate(), aggregate(), Prefetch, and combinations of those, but I must be making a mistake somewhere.
Probably the closest I've gotten is this:
subquery = Subquery(annotated_scores.values('normalized_value'))
Score.objects.values('student_id').annotate(mean=Avg(subquery))
But this is incorrect.
Could someone point me in the right direction, without resorting to raw queries?
I may have found a way to do this using subqueries. The main thing is at least from django, we cannot use the window functions on aggregates, so that's what is blocking the calculation of the mean of the normalized values. I've added comments on the lines to explain what I'm trying to do:
# Get the minimum score per question
min_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(min=Min('value'))
# Get the maximum score per question
max_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(max=Max('value'))
# Calculate the normalized value per score, then get the average by grouping by students
mean_subquery = Score.objects.filter(student=OuterRef('pk')).annotate(
min=Subquery(min_subquery.values('min')[:1]),
max=Subquery(max_subquery.values('max')[:1]),
normalized=ExpressionWrapper((F('value') - F('min'))/(F('max') - F('min')), output_field=FloatField())
).values('student').annotate(mean=Avg('normalized'))
# Get the calculated mean per student
Student.objects.annotate(mean=Subquery(mean_subquery.values('mean')[:1]))
The resulting SQL is:
SELECT
"student"."id",
"student"."name",
(
SELECT
AVG(
(
(
V0."value" - (
SELECT
MIN(U0."value") AS "min"
FROM
"score" U0
WHERE
U0."question_id" = (V0."question_id")
GROUP BY
U0."question_id"
LIMIT
1
)
) / (
(
SELECT
MAX(U0."value") AS "max"
FROM
"score" U0
WHERE
U0."question_id" = (V0."question_id")
GROUP BY
U0."question_id"
LIMIT
1
) - (
SELECT
MIN(U0."value") AS "min"
FROM
"score" U0
WHERE
U0."question_id" = (V0."question_id")
GROUP BY
U0."question_id"
LIMIT
1
)
)
)
) AS "mean"
FROM
"score" V0
WHERE
V0."student_id" = ("student"."id")
GROUP BY
V0."student_id"
LIMIT
1
) AS "mean"
FROM
"student"
As mentioned by #bdbd, and judging from this Django issue, it appears that annotating a windowed queryset is not yet possible (using Django 3.2).
As a temporary workaround, I refactored #bdbd's excellent Subquery solution as follows.
class ScoreQuerySet(models.QuerySet):
def annotate_normalized(self):
w_min = Subquery(self.filter(
question=OuterRef('question')).values('question').annotate(
min=Min('value')).values('min')[:1])
w_max = Subquery(self.filter(
question=OuterRef('question')).values('question').annotate(
max=Max('value')).values('max')[:1])
return self.annotate(normalized=(F('value') - w_min) / (w_max - w_min))
def aggregate_student_mean(self):
return self.annotate_normalized().values('student_id').annotate(
mean=Avg('normalized'))
class Score(models.Model):
objects = ScoreQuerySet.as_manager()
...
Note: If necessary, we can add more Student lookups to the values() in aggregate_student_mean(), e.g. student__name. As long as we take care not to mess up the grouping.
Now, if it ever becomes possible to filter and annotate windowed querysets, we can simply replace the Subquery lines by the much simpler Window implementation:
w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
I am using elasticsearch-dsl python library to connect to elasticsearch and do aggregations.
I am following code
search.aggs.bucket('per_date', 'terms', field='date')\
.bucket('response_time_percentile', 'percentiles', field='total_time',
percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
This works fine but returns only 10 results in response.aggregations.per_ts.buckets
I want all the results
I have tried one solution with size=0 as mentioned in this question
search.aggs.bucket('per_ts', 'terms', field='ts', size=0)\
.bucket('response_time_percentile', 'percentiles', field='total_time',
percents=percentiles, hdr={"number_of_significant_value_digits": 1})
response = search.execute()
But this results in error
TransportError(400, u'parsing_exception', u'[terms] failed to parse field [size]')
I had the same issue. I finally found this solution:
s = Search(using=client, index="jokes").query("match", jks_content=keywords).extra(size=0)
a = A('terms', field='jks_title.keyword', size=999999)
s.aggs.bucket('by_title', a)
response = s.execute()
After 2.x, size=0 for all bucket results won't work anymore, please refer to this thread. Here in my example I just set the size equal 999999. You can pick a large number according to your case.
It is recommended to explicitly set reasonable value for size a number
between 1 to 2147483647.
Hope this helps.
This is a bit older but I ran into the same issue. What I wanted was basically an iterator that i could use to go through all aggregations that i got back (i also have a lot of unique results).
The best thing i found is to create a python generator like this
def scan_aggregation_results():
i=0
partitions=20
while i < partitions:
s = Search(using=elastic, index='my_index').extra(size=0)
agg = A('terms', field='my_field.keyword', size=999999,
include={"partition": i, "num_partitions": partitions})
s.aggs.bucket('my_agg', agg)
result = s.execute()
for item in result.aggregations.my_agg.buckets:
yield my_field.key
i = i + 1
# in other parts of the code just do
for item in scan_aggregation_results():
print(item) # or do whatever you want with it
The magic here is that elastic will automatically partition the number of results by 20, ie the number of partitions i define. I just have to set the size to something large enough to hold a single partition, in this case the result can be up to 20 million items large (or 20*999999). If you have much less items, like me, to return (like 20000) then you will just have 1000 results per query in your bucket, regardless that you defined a much larger size.
Using the generator construct as outlined above you can then even get rid of that and create your own scanner so to speak, iterating over all results individually, just what i wanted.
You should read the documentation.
So in your case, this should be like this :
search.aggs.bucket('per_date', 'terms', field='date')\
.bucket('response_time_percentile', 'percentiles', field='total_time',
percents=percentiles, hdr={"number_of_significant_value_digits": 1})[0:50]
response = search.execute()
I have 3 related models:
Program(Model):
... # which aggregates ProgramVersions
ProgramVersion(Model):
program = ForeignKey(Program)
index = IntegerField()
UserProgramVersion(Model):
user = ForeignKey(User)
version = ForeignKey(ProgramVersion)
index = IntegerField()
ProgramVersion and UserProgramVersion are orderable models based on index field - object with highest index in the table is considered latest/newest object (this is handled by some custom logic, not relevant).
I would like to select all latest UserProgramVersion's, i.e. latest UPV's which point to the same Program.
this can be handled by this UserProgramVersion queryset:
def latest_user_program_versions(self):
latest = self\
.order_by('version__program_id', '-version__index', '-index')\
.distinct('version__program_id')
return self.filter(id__in=latest)
this works fine however im looking for a solution which does NOT use .distinct()
I tried something like this:
def latest_user_program_versions(self):
latest = self\
.annotate(
'max_version_index'=Max('version__index'),
'max_index'=Max('index'))\
.filter(
'version__index'=F('max_version_index'),
'index'=F('max_index'))
return self.filter(id__in=latest)
this however does not work
Use Subquery() expressions in Django 1.11. The example in docs is similar and the purpose is also to get the newest item for required parent records.
(You could start probably by that example with your objects, but I wrote also a complete more complicated suggestion to avoid possible performance pitfalls.)
from django.db.models import OuterRef, Subquery
...
def latest_user_program_versions(self, *args, **kwargs):
# You should filter users by args or kwargs here, for performance reasons.
# If you do it here it is applied also to subquery - much faster on a big db.
qs = self.filter(*args, **kwargs)
parent = Program.objects.filter(pk__in=qs.values('version__program'))
newest = (
qs.filter(version__program=OuterRef('pk'))
.order_by('-version__index', '-index')
)
pks = (
parent.annotate(newest_id=Subquery(newest.values('pk')[:1]))
.values_list('newest_id', flat=True)
)
# Maybe you prefer to uncomment this to be it compiled by two shorter SQLs.
# pks = list(pks)
return self.filter(pk__in=pks)
If you considerably improve it, write the solution in your answer.
EDIT Your problem in your second solution:
Nobody can cut a branch below him, neither in SQL, but I can sit on its temporary copy in a subquery, to can survive it :-) That is also why I ask for a filter at the beginning. The second problem is that Max('version__index') and Max('index') could be from two different objects and no valid intersection is found.
EDIT2: Verified: The internal SQL from my query is complicated, but seems correct.
SELECT app_userprogramversion.id,...
FROM app_userprogramversion
WHERE app_userprogramversion.id IN
(SELECT
(SELECT U0.id
FROM app_userprogramversion U0
INNER JOIN app_programversion U2 ON (U0.version_id = U2.id)
WHERE (U0.user_id = 123 AND U2.program_id = (V0.id))
ORDER BY U2.index DESC, U0.index DESC LIMIT 1
) AS newest_id
FROM app_program V0 WHERE V0.id IN
(SELECT U2.program_id AS Col1
FROM app_userprogramversion U0
INNER JOIN app_programversion U2 ON (U0.version_id = U2.id)
WHERE U0.user_id = 123
)
)
I'm currently running into a problem, trying to build dynamic queries for Elasticsearch in Python. To make a query I use Q shortсut from elasticsearch_dsl. This is something I try to implement
...
s = Search(using=db, index="reestr")
condition = {"attr_1_":"value 1", "attr_2_":"value 2"} # try to build query from this
must = []
for key in condition:
must.append(Q('match',key=condition[key]))
But that in fact results to this condition:
[Q('match',key="value 1"),Q('match',key="value 2")]
However, what I want is:
[Q('match',attr_1_="value 1"),Q('match',attr_2_="value 2")]
IMHO, the way this library does queries is not effective. I think this syntax:
Q("match","attrubute_name"="attribute_value")
is much more powerful and makes it possible to do a lot more things, than this one:
Q("match",attribute_name="attribute_value")
It seems, as if it is impossible to dynamically build attribute_names. Or it is, of course, possible that I do not know the right way to do it.
Suppose,
filters = {'condition1':['value1'],'condition2':['value3','value4']}
Code:
filters = data['filters_data']
must_and = list() # The condition that has only one value
should_or = list() # The condition that has more than 1 value
for key in filters:
if len(filters[key]) > 1:
for item in filters[key]:
should_or.append(Q("match", **{key:item}))
else:
must_and.append(Q("match", **{key:filters[key][0]}))
q1 = Bool(must=must_and)
q2 = Bool(should=should_or)
s = s.query(q1).query(q2)
result = s.execute()
One can also use terms, that can directly accept the list of values and no need of complicated for loops,
Code:
for key in filters:
must_and.append(Q("terms", **{key:filters[key]}))
q1 = Bool(must=must_and)
Scenario
I have a table student. it has following attributes
name,
age,
school_passout_date,
college_start_date
I need a report to know what is the avg number of days student get free between the passing the school and starting college.
Current approach
Currently i am irritating over the range of values finding days for each student and getting its avg.
Problem
That is highly inefficient when the record set gets bigger.
Question
Is there any ability in the Django ORM that gives me totals days between the two dates?
Possibility
I am looking for something like this.
Students.objects.filter(school_passed=True, started_college=True).annotate(total_days_between=Count('school_passout_date', 'college_start_date'), Avg_days=Avg('school_passout_date', 'college_start_date'))
You can do this like so:
Model.objects.annotate(age=Cast(ExtractDay(TruncDate(Now()) - TruncDate(F('created'))), IntegerField()))
This lets you work with the integer value, eg you could then do something like this:
from django.db.models import IntegerField, F
from django.db.models.functions import Cast, ExtractDay, TruncDate
qs = (
Model
.objects
.annotate(age=Cast(ExtractDay(TruncDate(Now()) - TruncDate(F('created'))), IntegerField()))
.annotate(age_bucket=Case(
When(age__lt=30, then=Value('new')),
When(age__lt=60, then=Value('current')),
default=Value('aged'),
output_field=CharField(),
))
)
This question is very old but Django ORM is much more advanced now.
It's possible to do this using F() functions.
from django.db.models import Avg, F
college_students = Students.objects.filter(school_passed=True, started_college=True)
duration = college_students.annotate(avg_no_of_days=Avg( F('college_start_date') - F('school_passout_date') )
Mathematically, according to the (expected) fact that the pass out date is allway later than the start date, you can just get an average off all your start date, and all your pass out date, and make the difference.
This gives you a solution like that one
from django.db.models import Avg
avg_start_date = Students.objects.filter(school_passed=True, started_college=True).aggregate(Avg('school_start_date'))
avg_passout_date = Students.objects.filter(school_passed=True, started_college=True).aggregate(Avg('school_passout_date'))
avg_time_at_college = avg_passout_date - avg_start_date
Django currently only accept aggregation for 4 function : Max, Min, Count, et Average, so this is a little tricky to do.
Then the solution is using the method extra . That way:
Students.objects.
extra(select={'difference': 'school_passout_date' - 'college_start_date'}).
filter('school_passed=True, started_college=True)
But then, you still have to do the average on the server side