While constructing a complex QuerySet with several annotations, I ran into an issue that I could reproduce with the following simple setup.
Here are the models:
class Player(models.Model):
    name = models.CharField(max_length=200)

class Unit(models.Model):
    player = models.ForeignKey(Player, on_delete=models.CASCADE,
                               related_name='unit_set')
    rarity = models.IntegerField()

class Weapon(models.Model):
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE,
                             related_name='weapon_set')
With my test database, I obtain the following (correct) results:
Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))
[{'id': 1, 'name': 'James', 'weapon_count': 23},
{'id': 2, 'name': 'Max', 'weapon_count': 41},
{'id': 3, 'name': 'Bob', 'weapon_count': 26}]
Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))
[{'id': 1, 'name': 'James', 'rarity_sum': 42},
{'id': 2, 'name': 'Max', 'rarity_sum': 89},
{'id': 3, 'name': 'Bob', 'rarity_sum': 67}]
If I now combine both annotations in the same QuerySet, I obtain different (inaccurate) results:
Player.objects.annotate(
    weapon_count=Count('unit_set__weapon_set', distinct=True),
    rarity_sum=Sum('unit_set__rarity'))
[{'id': 1, 'name': 'James', 'weapon_count': 23, 'rarity_sum': 99},
{'id': 2, 'name': 'Max', 'weapon_count': 41, 'rarity_sum': 183},
{'id': 3, 'name': 'Bob', 'weapon_count': 26, 'rarity_sum': 113}]
Notice how rarity_sum now has different values than before. Removing distinct=True does not affect the result. I also tried using the DistinctSum function from this answer, in which case all rarity_sum values are set to 18 (also inaccurate).
Why is this? How can I combine both annotations in the same QuerySet?
Edit: here is the sqlite query generated by the combined QuerySet:
SELECT "sandbox_player"."id",
"sandbox_player"."name",
COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
The data used for the results above is available here.
This isn't a problem with the Django ORM; it's just the way relational databases work. When you're constructing simple querysets like
Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))
or
Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))
the ORM does exactly what you expect it to do: join Player with Weapon
SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT("sandbox_weapon"."id") AS "weapon_count"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit"
ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon"
ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
or Player with Unit
SELECT "sandbox_player"."id", "sandbox_player"."name", SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
and perform either COUNT or SUM aggregation on them.
Note that although the first query has two joins across three tables, the intermediate table Unit is neither referenced in the SELECT columns nor in the GROUP BY clause. The only role Unit plays here is to join Player with Weapon.
Now if you look at your third queryset, things get more complicated. Again, as in the first query, the joins span three tables, but now Unit is referenced in SELECT, as there is a SUM aggregation over Unit.rarity:
SELECT "sandbox_player"."id",
"sandbox_player"."name",
COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
And this is the crucial difference between the second and the third queries. In the second query, you're joining Player to Unit, so a single Unit will be listed once for each player that it references.
But in the third query, you're joining Player to Unit and then Unit to Weapon, so a single Unit will be listed not only once for each player that it references, but also once for each weapon that references that Unit.
Let's take a look at a simple example:
insert into sandbox_player values (1, "player_1");
insert into sandbox_unit values(1, 10, 1);
insert into sandbox_weapon values (1, 1), (2, 1);
One player, one unit and two weapons that reference the same unit.
Confirm that the problem exists:
>>> from sandbox.models import Player
>>> from django.db.models import Count, Sum
>>> Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2}]>
>>> Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'rarity_sum': 10}]>
>>> Player.objects.annotate(
... weapon_count=Count('unit_set__weapon_set', distinct=True),
... rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 20}]>
From this example it's easy to see that the problem is that in the combined query the unit is listed twice, once for each of the weapons referencing it:
sqlite> SELECT "sandbox_player"."id",
...> "sandbox_player"."name",
...> "sandbox_weapon"."id",
...> "sandbox_unit"."rarity"
...> FROM "sandbox_player"
...> LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
...> LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id");
id name id rarity
---------- ---------- ---------- ----------
1 player_1 1 10
1 player_1 2 10
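The fan-out can be reproduced without a database at all. Below is a pure-Python sketch of the two LEFT JOINs over the same one-player, one-unit, two-weapons data; the dictionaries are hypothetical stand-ins for the table rows. It shows why SUM sees the unit's rarity once per weapon while COUNT(DISTINCT) stays correct:

```python
# Hypothetical rows mirroring the one-player / one-unit / two-weapons example.
players = [{"id": 1, "name": "player_1"}]
units = [{"id": 1, "rarity": 10, "player_id": 1}]
weapons = [{"id": 1, "unit_id": 1}, {"id": 2, "unit_id": 1}]

# Join player -> unit -> weapon: each unit row is repeated once per weapon.
joined = [
    (p, u, w)
    for p in players
    for u in units if u["player_id"] == p["id"]
    for w in weapons if w["unit_id"] == u["id"]
]

# COUNT(DISTINCT weapon.id): duplicates are collapsed, so this is still 2.
weapon_count = len({w["id"] for (_, _, w) in joined})

# SUM(unit.rarity): summed over the duplicated rows, so 10 becomes 20.
rarity_sum = sum(u["rarity"] for (_, u, _) in joined)

print(weapon_count, rarity_sum)  # -> 2 20
```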
What should you do?
As #ivissani mentioned, one of the easiest solutions would be to write subqueries for each of the aggregations:
>>> from django.db.models import Count, IntegerField, OuterRef, Subquery, Sum
>>> weapon_count = Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).filter(pk=OuterRef('pk'))
>>> rarity_sum = Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).filter(pk=OuterRef('pk'))
>>> qs = Player.objects.annotate(
... weapon_count=Subquery(weapon_count.values('weapon_count'), output_field=IntegerField()),
... rarity_sum=Subquery(rarity_sum.values('rarity_sum'), output_field=IntegerField())
... )
>>> qs.values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 10}]>
which produces the following SQL
SELECT "sandbox_player"."id", "sandbox_player"."name",
(
SELECT COUNT(U2."id") AS "weapon_count"
FROM "sandbox_player" U0
LEFT OUTER JOIN "sandbox_unit" U1
ON (U0."id" = U1."player_id")
LEFT OUTER JOIN "sandbox_weapon" U2
ON (U1."id" = U2."unit_id")
WHERE U0."id" = ("sandbox_player"."id")
GROUP BY U0."id", U0."name"
) AS "weapon_count",
(
SELECT SUM(U1."rarity") AS "rarity_sum"
FROM "sandbox_player" U0
LEFT OUTER JOIN "sandbox_unit" U1
ON (U0."id" = U1."player_id")
WHERE U0."id" = ("sandbox_player"."id")
GROUP BY U0."id", U0."name") AS "rarity_sum"
FROM "sandbox_player"
A few notes to complement rktavi's excellent answer:
1) This issue has apparently been considered a bug for 10 years already. It is even referred to in the official documentation.
2) While converting my actual project's QuerySets to subqueries (as per rktavi's answer), I noticed that combining bare-bones annotations (for the distinct=True counts that always worked correctly) with a Subquery (for the sums) yields extremely long processing times (35 s vs. 100 ms) and incorrect results for the sum. This is true in my actual setup (11 filtered counts on various nested relations and 1 filtered sum on a multiply-nested relation, SQLite3) but cannot be reproduced with the simple models above. This issue can be tricky because another part of your code could add an annotation to your QuerySet (e.g. a Table.order_FOO() function), leading to the issue.
3) With the same setup, I have anecdotal evidence that subquery-type QuerySets are faster than bare-bones annotation QuerySets (in cases where you have only distinct=True counts, of course). I could observe this both with local SQLite3 (83 ms vs 260 ms) and hosted PostgreSQL (320 ms vs 540 ms).
As a result of the above, I will completely avoid using bare-bone annotations in favour of subqueries.
Based on the excellent answer from #rktavi, I created two helpers classes that simplify the Subquery/Count and Subquery/Sum patterns:
from django.db.models import PositiveIntegerField, Subquery

class SubqueryCount(Subquery):
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = PositiveIntegerField()

class SubquerySum(Subquery):
    template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

    def __init__(self, queryset, column, output_field=None, **extra):
        if output_field is None:
            output_field = queryset.model._meta.get_field(column)
        super().__init__(queryset, output_field, column=column, **extra)
One can use these helpers like so:
from django.db.models import OuterRef
weapons = Weapon.objects.filter(unit__player_id=OuterRef('id'))
units = Unit.objects.filter(player_id=OuterRef('id'))
qs = Player.objects.annotate(weapon_count=SubqueryCount(weapons),
                             rarity_sum=SubquerySum(units, 'rarity'))
Thanks #rktavi for your amazing answer!!
Here's my use case:
Using Django DRF.
I needed to get Sum and Count from different FKs inside the annotate, so that it would all be part of one queryset in order to add these fields to the ordering_fields in DRF.
The Sum and Count were clashing and returning wrong results.
Your answer really helped me put it all together.
The annotate was occasionally returning the dates as strings, so I needed to Cast them to DateTimeField.
donation_filter = Q(payments__status='donated') & ~Q(payments__payment_type__payment_type='coupon')

total_donated_SQ = User.objects.annotate(
    total_donated=Sum('payments__sum', filter=donation_filter)
).filter(pk=OuterRef('pk'))

message_count_SQ = User.objects.annotate(
    message_count=Count('events__id', filter=Q(events__event_id=6))
).filter(pk=OuterRef('pk'))

queryset = User.objects.annotate(
    total_donated=Subquery(total_donated_SQ.values('total_donated'), output_field=IntegerField()),
    last_donation_date=Cast(Max('payments__updated', filter=donation_filter), output_field=DateTimeField()),
    message_count=Subquery(message_count_SQ.values('message_count'), output_field=IntegerField()),
    last_message_date=Cast(Max('events__updated', filter=Q(events__event_id=6)), output_field=DateTimeField())
)
I have a database from https://www.imdb.com. I would like to find which directors (if any) have obtained a score higher than 9 for all their titles.
Crew: 'title_id', 'person_id', 'category', 'job', 'characters'
[('tt0000060', 'nm0005690', 'director', None, '\\N'),
('tt0000060', 'nm0005658', 'cinematographer', None, '\\N'),
('tt0000361', 'nm0349785', 'director', None, '\\N'),
('tt0000376', 'nm0466448', 'actor', None, '\\N'),
('tt0000376', 'nm0617272', 'actress', None, '["Salome"]')...
Ratings: 'title_id', 'rating', 'votes'
[('tt0000060', 7.8, 59),
('tt0000361', 8.1, 10),
('tt0000376', 7.8, 6),
('tt0000417', 8.2, 38836),
('tt0000505', 8.1, 11),
('tt0000738', 7.8, 11)...
My code is:
import sqlalchemy
from sqlalchemy import create_engine, text, inspect
engine = create_engine('sqlite:///newIMDB.db')
inspector = inspect(engine)
print(inspector.get_table_names()) #['crew', 'episodes', 'people', 'ratings', 'titles']
conn = engine.connect()
stmt = text ("SELECT category,rating FROM(SELECT * FROM crew INNER JOIN ratings
ON crew.title_id = ratings.title_id)
WHERE category=director AND rating > 9 LIMIT 10;" )
result = conn.execute(stmt)
result.fetchall()
Where is my error?
The first thing I notice here is that your query spans multiple lines, but you have used single double quotes (i.e., "). For strings spanning multiple lines in Python you need triple quotes.
So can you try with something like...
stmt = text ("""SELECT category,rating FROM(SELECT * FROM crew
INNER JOIN ratings
ON crew.title_id = ratings.title_id)
WHERE category='director' AND rating > 9 LIMIT 10;""" )
There are also other ways to write strings spanning multiple lines; the rest of the ways are for you to find out!
Also, any string literal (like 'director') inside the query needs to be quoted.
(Thank you slothrop for pointing this out)
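For completeness, here is a runnable sketch of the corrected query using Python's standard-library sqlite3 with a throwaway in-memory database; the column subset and sample rows are invented to mirror the question's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in tables (hypothetical rows, only the columns the query needs).
conn.executescript("""
    CREATE TABLE crew (title_id TEXT, person_id TEXT, category TEXT);
    CREATE TABLE ratings (title_id TEXT, rating REAL, votes INTEGER);
    INSERT INTO crew VALUES ('tt1', 'nm1', 'director'), ('tt2', 'nm2', 'actor');
    INSERT INTO ratings VALUES ('tt1', 9.5, 100), ('tt2', 9.8, 50);
""")

# Triple-quoted Python string, and 'director' quoted as a SQL string literal.
stmt = """
    SELECT category, rating
    FROM crew INNER JOIN ratings ON crew.title_id = ratings.title_id
    WHERE category = 'director' AND rating > 9
    LIMIT 10;
"""
rows = conn.execute(stmt).fetchall()
print(rows)  # -> [('director', 9.5)]
```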
How could I specify the order of columns in SELECT query in Django ORM?
I am trying to union elements from two tables, but apparently elements in a union are matched by the order of columns in SELECT instead of the names of the columns (even if the names of the columns are the same).
Consider following Models:
class Person(models.Model):
    first_name = models.CharField(max_length=256)
    last_name = models.CharField(max_length=256)
    age = models.IntegerField()

class Car(models.Model):
    number = models.IntegerField()
    brand = models.CharField(max_length=256)
    name = models.CharField(max_length=256)
and the following piece of code:
Person.objects.create(first_name="John", last_name="Smith", age=25)
Car.objects.create(number=42, name="Cybertruck", brand="Tesla")
q1 = Person.objects.all().annotate(name=F('first_name'), group=F('last_name'), number=F('age')).values(
    'name', 'group', 'number')
q2 = Car.objects.all().annotate(group=F('brand')).values('name', 'group', 'number')
data = q1.union(q2)
print(data.query)
assert list(data) == [
    {'name': 'John', 'group': 'Smith', 'number': 25},
    {'name': 'Cybertruck', 'group': 'Tesla', 'number': 42},
]
As you can see, I put the correct order in .values().
One could expect the columns in the union to be matched in the order passed to values() (or by column name), but this is what happens:
SELECT "testunion_person"."first_name" AS "name", "testunion_person"."last_name" AS "group", "testunion_person"."age" AS "number" FROM "testunion_person" UNION SELECT "testunion_car"."name", "testunion_car"."number", "testunion_car"."brand" AS "group" FROM "testunion_car"
In the query, "testunion_car"."number" comes before "testunion_car"."brand", which gives the Car row in the UNION the values:
{'name': 'Cybertruck', 'group': '42', 'number': 'Tesla'}
EDIT: I am using 2.2 (LTS) version of Django
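The positional matching can be demonstrated directly in SQLite, independently of Django. A minimal sketch with the standard-library sqlite3 module and invented literal rows (UNION ALL keeps the two selects in order):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNION pairs columns strictly by position; the aliases in the first
# SELECT are ignored when matching the second SELECT's columns.
rows = conn.execute("""
    SELECT 'John' AS name, 'Smith' AS grp, 25 AS number
    UNION ALL
    SELECT 'Cybertruck', 42, 'Tesla'
""").fetchall()
print(rows)
# -> [('John', 'Smith', 25), ('Cybertruck', 42, 'Tesla')]
# The second row's 42 lands under 'grp' and 'Tesla' under 'number'.
```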
Instead of specifying the aliases under annotate(), you can also specify them straight under values():
q1 = Person.objects.all().values(
    name=F('first_name'), group=F('last_name'), xnumber=F('age'))
q2 = Car.objects.all().values(
    'name', group=F('brand'), xnumber=F('number'))
I noticed that even then, it wasn't ordering the fields properly. I renamed the number field to xnumber to avoid conflicts with the model field of the same name and everything is grouped properly.
You can set the order of the fields using .values_list.
qs1 = Person.objects.values_list('name', 'group', 'number')
qs2 = Car.objects.values_list('brand', 'name', 'number')
qs1.union(qs2)
Check the docs for more detailed explanation.
Not a Django bug. Although the query columns are not ordered as in values(), the queryset displays the right order:
In [13]: print(data)
<QuerySet [{'name': 'Cybertruck', 'group': 42, 'number': 'Tesla'}, {'name': 'John', 'group': 'Smith', 'number': 25}]>
That is because the data is reordered after being fetched from the database. Source code snippet of QuerySet:
class QuerySet:
    def __iter__(self):
        """
        The queryset iterator protocol uses three nested iterators in the
        default case:
        1. sql.compiler.execute_sql()
           - Returns 100 rows at time (constants.GET_ITERATOR_CHUNK_SIZE)
             using cursor.fetchmany(). This part is responsible for
             doing some column masking, and returning the rows in chunks.
        2. sql.compiler.results_iter()
           - Returns one row at time. At this point the rows are still just
             tuples. In some cases the return values are converted to
             Python values at this location.
        3. self.iterator()
           - Responsible for turning the rows into model objects.
        """
        self._fetch_all()
        return iter(self._result_cache)
Let's say I have the following models
class Photo(models.Model):
    tags = models.ManyToManyField(Tag)

class Tag(models.Model):
    name = models.CharField(max_length=50)
In a view I have a list with active filters called categories.
I want to filter Photo objects which have all tags present in categories.
I tried:
Photo.objects.filter(tags__name__in=categories)
But this matches any item in categories, not all items.
So if categories were ['holiday', 'summer'], I want Photos with both a holiday and a summer tag.
Can this be achieved?
Summary:
One option is, as suggested by jpic and sgallen in the comments, to add .filter() for each category. Each additional filter adds more joins, which should not be a problem for a small set of categories.
There is the aggregation approach. This query would be shorter and perhaps quicker for a large set of categories.
You also have the option of using custom queries.
Some examples
Test setup:
class Photo(models.Model):
    tags = models.ManyToManyField('Tag')

class Tag(models.Model):
    name = models.CharField(max_length=50)

    def __unicode__(self):
        return self.name
In [2]: t1 = Tag.objects.create(name='holiday')
In [3]: t2 = Tag.objects.create(name='summer')
In [4]: p = Photo.objects.create()
In [5]: p.tags.add(t1)
In [6]: p.tags.add(t2)
In [7]: p.tags.all()
Out[7]: [<Tag: holiday>, <Tag: summer>]
Using chained filters approach:
In [8]: Photo.objects.filter(tags=t1).filter(tags=t2)
Out[8]: [<Photo: Photo object>]
Resulting query:
In [17]: print Photo.objects.filter(tags=t1).filter(tags=t2).query
SELECT "test_photo"."id"
FROM "test_photo"
INNER JOIN "test_photo_tags" ON ("test_photo"."id" = "test_photo_tags"."photo_id")
INNER JOIN "test_photo_tags" T4 ON ("test_photo"."id" = T4."photo_id")
WHERE ("test_photo_tags"."tag_id" = 3 AND T4."tag_id" = 4 )
Note that each filter adds more JOINS to the query.
Using annotation approach:
In [29]: from django.db.models import Count
In [30]: Photo.objects.filter(tags__in=[t1, t2]).annotate(num_tags=Count('tags')).filter(num_tags=2)
Out[30]: [<Photo: Photo object>]
Resulting query:
In [32]: print Photo.objects.filter(tags__in=[t1, t2]).annotate(num_tags=Count('tags')).filter(num_tags=2).query
SELECT "test_photo"."id", COUNT("test_photo_tags"."tag_id") AS "num_tags"
FROM "test_photo"
LEFT OUTER JOIN "test_photo_tags" ON ("test_photo"."id" = "test_photo_tags"."photo_id")
WHERE ("test_photo_tags"."tag_id" IN (3, 4))
GROUP BY "test_photo"."id", "test_photo"."id"
HAVING COUNT("test_photo_tags"."tag_id") = 2
ANDed Q objects would not work:
In [9]: from django.db.models import Q
In [10]: Photo.objects.filter(Q(tags__name='holiday') & Q(tags__name='summer'))
Out[10]: []
In [11]: from operator import and_
In [12]: Photo.objects.filter(reduce(and_, [Q(tags__name='holiday'), Q(tags__name='summer')]))
Out[12]: []
Resulting query:
In [25]: print Photo.objects.filter(Q(tags__name='holiday') & Q(tags__name='summer')).query
SELECT "test_photo"."id"
FROM "test_photo"
INNER JOIN "test_photo_tags" ON ("test_photo"."id" = "test_photo_tags"."photo_id")
INNER JOIN "test_tag" ON ("test_photo_tags"."tag_id" = "test_tag"."id")
WHERE ("test_tag"."name" = holiday AND "test_tag"."name" = summer )
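The empty result is expected: both conditions apply to the same joined row, and a single tag row cannot have two names at once. A pure-Python sketch over hypothetical joined rows contrasts this with the chained-filter behaviour, which compares two separate copies of the join:

```python
# Hypothetical joined rows: each carries exactly one tag name.
joined_rows = [
    {"photo_id": 1, "tag_name": "holiday"},
    {"photo_id": 1, "tag_name": "summer"},
]

# ANDed Q objects: both conditions on the same row -> always empty.
anded = [
    r for r in joined_rows
    if r["tag_name"] == "holiday" and r["tag_name"] == "summer"
]

# Chained .filter() calls join the tag table twice, so each condition
# applies to a different copy of the join -> photo 1 matches.
photos_with_both = {
    r1["photo_id"]
    for r1 in joined_rows if r1["tag_name"] == "holiday"
    for r2 in joined_rows
    if r2["photo_id"] == r1["photo_id"] and r2["tag_name"] == "summer"
}

print(anded, photos_with_both)  # -> [] {1}
```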
Another approach that works, although PostgreSQL only, is using django.contrib.postgres.fields.ArrayField:
Example copied from docs:
>>> Post.objects.create(name='First post', tags=['thoughts', 'django'])
>>> Post.objects.create(name='Second post', tags=['thoughts'])
>>> Post.objects.create(name='Third post', tags=['tutorial', 'django'])
>>> Post.objects.filter(tags__contains=['thoughts'])
<QuerySet [<Post: First post>, <Post: Second post>]>
>>> Post.objects.filter(tags__contains=['django'])
<QuerySet [<Post: First post>, <Post: Third post>]>
>>> Post.objects.filter(tags__contains=['django', 'thoughts'])
<QuerySet [<Post: First post>]>
ArrayField has some more powerful features such as overlap and index transforms.
This also can be done by dynamic query generation using the Django ORM and some Python magic :)
from functools import reduce
from operator import and_
from django.db.models import Q

categories = ['holiday', 'summer']
res = Photo.objects.filter(reduce(and_, [Q(tags__name=c) for c in categories]))
The idea is to generate an appropriate Q object for each category and then combine them with the AND operator into one QuerySet. E.g. for your example it'd be equal to
res = Photo.objects.filter(Q(tags__name='holiday') & Q(tags__name='summer'))
If you struggled with this problem as I did and nothing mentioned above helped you, maybe this one will solve your issue.
Instead of chaining filters, in some cases it is better just to store the ids from the previous filter:
tags = [1, 2]
for tag in tags:
    ids = list(queryset.filter(tags__id=tag).values_list("id", flat=True))
    queryset = queryset.filter(id__in=ids)
Using this approach will help you avoid stacking JOINs in the SQL query.
I use a little function that iterates filters over a list, for a given operator and a column name:
def exclusive_in(cls, column, operator, value_list):
    myfilter = column + '__' + operator
    query = cls.objects
    for value in value_list:
        query = query.filter(**{myfilter: value})
    return query
and this function can be called like this:
exclusive_in(Photo, 'tags__name', 'iexact', ['holiday', 'summer'])
It also works with any class and more tags in the list; the operator can be any lookup such as 'iexact', 'in', 'contains', ...
My solution:
Let's say author is a list of elements that all need to match, so:
for a in author:
    queryset = queryset.filter(authors__author_first_name=a)
    if not queryset:
        break
query = Photo.objects.all()
for category in categories:
    query = query.filter(tags__name=category)
This piece of code filters your photos down to those that have all the tag names coming from categories.
If we want to do it dynamically, follow the example:
tag_ids = [t1.id, t2.id]
qs = Photo.objects.all()
for tag_id in tag_ids:
    qs = qs.filter(tags__id=tag_id)
print(qs)
queryset = Photo.objects.filter(tags__name="vacaciones") | Photo.objects.filter(tags__name="verano")
In sqlalchemy (postgresql DB), I would like to create a bounded sum function, for lack of a better term. The goal is to create a running total within a defined range.
Currently, I have something that works great for calculating a running total without the bounds. Something like this:
from sqlalchemy.sql import func
foos = (
    db.query(
        Foo.id,
        Foo.points,
        Foo.timestamp,
        func.sum(Foo.points).over(order_by=Foo.timestamp).label('running_total')
    )
    .filter(...)
    .all()
)
However, I would like to be able to bound this running total to always be within a specific range, let's say [-100, 100]. So we would get something like this (see running_total):
{'timestamp': 1, 'points': 75, 'running_total': 75}
{'timestamp': 2, 'points': 50, 'running_total': 100}
{'timestamp': 3, 'points': -100, 'running_total': 0}
{'timestamp': 4, 'points': -50, 'running_total': -50}
{'timestamp': 5, 'points': -75, 'running_total': -100}
Any ideas?
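To make the expected behaviour precise, here is a pure-Python sketch of the clamping rule, using the sample points from the table above (the function name and bounds are just for illustration):

```python
def bounded_running_total(points, lower=-100, upper=100):
    """Running total where every intermediate value is clamped to [lower, upper]."""
    total = 0
    out = []
    for p in points:
        # Add the next value, then clamp the accumulator into the bounds.
        total = min(max(total + p, lower), upper)
        out.append(total)
    return out

print(bounded_running_total([75, 50, -100, -50, -75]))
# -> [75, 100, 0, -50, -100]
```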
Unfortunately, no built-in aggregate can help you achieve your expected output with window function calls.
You could get the expected output by manually calculating the rows one by one with a recursive CTE:
with recursive t as (
(select *, points running_total
from foo
order by timestamp
limit 1)
union all
(select foo.*, least(greatest(t.running_total + foo.points, -100), 100)
from foo, t
where foo.timestamp > t.timestamp
order by foo.timestamp
limit 1)
)
select timestamp,
points,
running_total
from t;
Unfortunately, this will be very hard to achieve with SQLAlchemy.
Your other option is to write a custom aggregate for your specific needs, like:
create function bounded_add(int_state anyelement, next_value anyelement, next_min anyelement, next_max anyelement)
returns anyelement
immutable
language sql
as $func$
select least(greatest(int_state + next_value, next_min), next_max);
$func$;
create aggregate bounded_sum(next_value anyelement, next_min anyelement, next_max anyelement)
(
sfunc = bounded_add,
stype = anyelement,
initcond = '0'
);
With this, you just need to replace your call to sum with a call to bounded_sum:
select timestamp,
points,
bounded_sum(points, -100.0, 100.0) over (order by timestamp) running_total
from foo;
This latter solution will probably scale better too.
http://rextester.com/LKCUK93113
Note: my initial answer is wrong, see the edit below.
In raw sql, you'd do this using greatest & least functions.
Something like this:
LEAST(GREATEST(SUM(myfield) OVER (window_clause), lower_bound), upper_bound)
The SQLAlchemy expression language allows one to write that almost identically:
import sqlalchemy as sa
import sqlalchemy.ext.declarative as dec

base = dec.declarative_base()

class Foo(base):
    __tablename__ = 'foo'
    id = sa.Column(sa.Integer, primary_key=True)
    points = sa.Column(sa.Integer, nullable=False)
    timestamp = sa.Column('tstamp', sa.Integer)

upper_, lower_ = 100, -100
win_expr = sa.func.sum(Foo.points).over(order_by=Foo.timestamp)
bound_expr = sa.func.least(sa.func.greatest(win_expr, lower_), upper_).label('bounded_running_total')

stmt = sa.select([Foo.id, Foo.points, Foo.timestamp, bound_expr])
str(stmt)
# prints output:
# SELECT foo.id, foo.points, foo.tstamp, least(greatest(sum(foo.points) OVER (ORDER BY foo.tstamp), :greatest_1), :least_1) AS bounded_running_total
# FROM foo

# alternatively, using session.query you can also fetch results
from sqlalchemy.orm import sessionmaker

DB = sessionmaker()
db = DB()
foos_stmt = db.query(Foo.id, Foo.points, Foo.timestamp, bound_expr).filter(...)
str(foos_stmt)
# prints output:
# SELECT foo.id, foo.points, foo.tstamp, least(greatest(sum(foo.points) OVER (ORDER BY foo.tstamp), :greatest_1), :least_1) AS bounded_running_total
# FROM foo
foos = foos_stmt.all()
EDIT As user #pozs points out in the comments, the above does not produce the intended results.
Two alternate approaches have been presented by #pozs. Here, I've adapted the first, recursive query approach, constructed via sqlalchemy.
import sqlalchemy as sa
import sqlalchemy.ext.declarative as dec
import sqlalchemy.orm as orm

base = dec.declarative_base()

class Foo(base):
    __tablename__ = 'foo'
    id = sa.Column(sa.Integer, primary_key=True)
    points = sa.Column(sa.Integer, nullable=False)
    timestamp = sa.Column('tstamp', sa.Integer)

upper_, lower_ = 100, -100

t = sa.select([
    Foo.timestamp,
    Foo.points,
    Foo.points.label('bounded_running_sum')
]).order_by(Foo.timestamp).limit(1).cte('t', recursive=True)

t_aliased = orm.aliased(t, name='ta')

bounded_sum = t.union_all(
    sa.select([
        Foo.timestamp,
        Foo.points,
        sa.func.greatest(sa.func.least(Foo.points + t_aliased.c.bounded_running_sum, upper_), lower_)
    ]).order_by(Foo.timestamp).limit(1)
)

stmt = sa.select([bounded_sum])

# inspect the query:
from sqlalchemy.dialects import postgresql
print(stmt.compile(dialect=postgresql.dialect(),
                   compile_kwargs={'literal_binds': True}))
# prints output:
# WITH RECURSIVE t(tstamp, points, bounded_running_sum) AS
# ((SELECT foo.tstamp, foo.points, foo.points AS bounded_running_sum
# FROM foo ORDER BY foo.tstamp
# LIMIT 1) UNION ALL (SELECT foo.tstamp, foo.points, greatest(least(foo.points + ta.bounded_running_sum, 100), -100) AS greatest_1
# FROM foo, t AS ta ORDER BY foo.tstamp
# LIMIT 1))
# SELECT t.tstamp, t.points, t.bounded_running_sum
# FROM t
I used this link from the documentation as a reference to construct the above, which also highlights how one may use the session instead to work with recursive CTEs
This would be the pure sqlalchemy method to generate the required results.
The 2nd approach suggested by #pozs could also be used via sqlalchemy.
The solution would have to be a variant of this section from the documentation