I have the following models which I'm testing with SQLite3 and MySQL:
# (various model fields extraneous to discussion removed...)
class Run(models.Model):
    runNumber = models.IntegerField()

class Snapshot(models.Model):
    t = models.DateTimeField()

class SnapshotRun(models.Model):
    snapshot = models.ForeignKey(Snapshot)
    run = models.ForeignKey(Run)
    # other fields which make it possible to have multiple distinct Run objects per Snapshot
I want a query which will give me a set of runNumbers & snapshot IDs for which the Snapshot.id is below some specified value. Naively I would expect this to work:
print(SnapshotRun.objects.filter(snapshot__id__lte=ss_id)
      .order_by("run__runNumber", "-snapshot__id")
      .distinct("run__runNumber", "snapshot__id")
      .values("run__runNumber", "snapshot__id"))
But this blows up with
NotImplementedError: DISTINCT ON fields is not supported by this database backend
for both database backends. Postgres is unfortunately not an option.
Time to fall back to raw SQL?
Update:
Since Django's ORM won't help me out of this one (thanks @jknupp), I did manage to get the following raw SQL to work:
cursor.execute("""
    SELECT r.runNumber, ssr1.snapshot_id
    FROM livedata_run AS r
    JOIN livedata_snapshotrun AS ssr1
      ON ssr1.id =
      (
        SELECT id
        FROM livedata_snapshotrun AS ssr2
        WHERE ssr2.run_id = r.id
          AND ssr2.snapshot_id <= %s
        ORDER BY snapshot_id DESC
        LIMIT 1
      );
""", [max_ss_id])  # params must be a sequence, even for a single value
Here livedata is the Django app these tables live in.
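For anyone who wants to try the "latest snapshot per run" query without a Django project, here is a minimal sketch using Python's built-in sqlite3 module. The table names (run, snapshotrun) and the sample rows are made up for illustration, and the placeholder is ? rather than %s because this talks to sqlite3 directly instead of going through Django's cursor.

```python
import sqlite3

# Throwaway in-memory database; tables and rows are hypothetical stand-ins
# for livedata_run and livedata_snapshotrun.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE run (id INTEGER PRIMARY KEY, runNumber INTEGER);
    CREATE TABLE snapshotrun (
        id INTEGER PRIMARY KEY,
        snapshot_id INTEGER,
        run_id INTEGER
    );
    INSERT INTO run VALUES (1, 100), (2, 200);
    -- run 100 appears in snapshots 1, 2 and 5; run 200 in snapshots 2 and 3
    INSERT INTO snapshotrun VALUES
        (1, 1, 1), (2, 2, 1), (3, 5, 1), (4, 2, 2), (5, 3, 2);
""")


def latest_snapshot_per_run(conn, max_ss_id):
    """Return (runNumber, snapshot_id) for the newest snapshot <= max_ss_id of each run."""
    sql = """
        SELECT r.runNumber, ssr1.snapshot_id
        FROM run AS r
        JOIN snapshotrun AS ssr1
          ON ssr1.id = (
              SELECT ssr2.id
              FROM snapshotrun AS ssr2
              WHERE ssr2.run_id = r.id
                AND ssr2.snapshot_id <= ?
              ORDER BY ssr2.snapshot_id DESC
              LIMIT 1
          )
        ORDER BY r.runNumber
    """
    return conn.execute(sql, (max_ss_id,)).fetchall()


rows = latest_snapshot_per_run(conn, 3)
print(rows)
```

Runs that have no snapshot at or below the limit drop out of the result, because the subquery yields NULL and the join condition never matches.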
The note in the Django documentation is pretty clear:
Note:
Any fields used in an order_by() call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction with distinct(). If you order by fields from a related model, those fields will be added to the selected columns and they may make otherwise duplicate rows appear to be distinct. Since the extra columns don’t appear in the returned results (they are only there to support ordering), it sometimes looks like non-distinct results are being returned.
Similarly, if you use a values() query to restrict the columns selected, the columns used in any order_by() (or default model ordering) will still be involved and may affect uniqueness of the results.
The moral here is that if you are using distinct() be careful about ordering by related models. Similarly, when using distinct() and values() together, be careful when ordering by fields not in the values() call.
Also, below that:
This ability to specify field names (with distinct) is only available in PostgreSQL.
I'm struggling (again) with Django's annotate functionality where the actual SQL query is quite clear to me.
Goal:
I want to get the number of users with a certain let's say status (it could be just any column of the model).
Approach(es):
1) User.objects.values('status').annotate(count=Count('*'))
This results in the following SQL query:
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.id
ORDER BY users_user.id ASC
However, this will give me a queryset of all users each "annotated" with the count value. This is kind of the behaviour I would have expected.
2) User.objects.values('status').annotate(count=Count('*')).order_by()
This results in the following SQL query:
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.status
No ORDER BY, and now the GROUP BY argument is the status column. This is not what I expected, but the result I was looking for.
Question:
Why does Django's order_by() without any argument affect the SQL GROUP BY argument? (Or broader, why does the second approach "work"?)
Some details:
django 2.2.9
postgres 9.4
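This is explained in the Django aggregation documentation, in the section on how values() and annotate() interact with order_by() and default ordering: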
Fields that are mentioned in the order_by() part of a queryset (or which are used in the default ordering on a model) are used when selecting the output data, even if they are not otherwise specified in the values() call.
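To see the effect outside Django, here is a small sketch with Python's sqlite3 module. It approximates the two query shapes: one where the ordering column (id) ends up in the GROUP BY, and one where the ordering has been cleared. The table contents are invented, and the exact SQL Django emits may differ between versions.

```python
import sqlite3

# In-memory table with made-up rows standing in for users_user.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_user (id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO users_user (status) VALUES
        ('active'), ('active'), ('banned'), ('active');
""")

# Ordering column included in GROUP BY: every user forms its own group.
per_user = conn.execute(
    "SELECT status, COUNT(*) FROM users_user GROUP BY status, id ORDER BY id"
).fetchall()

# Ordering cleared (what .order_by() with no arguments achieves):
# one group per status.
per_status = conn.execute(
    "SELECT status, COUNT(*) FROM users_user GROUP BY status ORDER BY status"
).fetchall()

print(per_user)
print(per_status)
```

The first result has one row per user with a count of 1 each; the second has one row per status, which is the result the question was after.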
There is a simple SQL table with 3 columns: id, sku, last_update
and a very simple SQL statement: SELECT DISTINCT sku FROM product_data ORDER BY last_update ASC
What would be a django view code for the aforesaid SQL statement?
This code:
q = ProductData.objects.values('sku').distinct().order_by('sku')
returns 145 results
whereas this statement:
q = ProductData.objects.values('sku').distinct().order_by('last_update')
returns over 1000 results
Why is it so? Can someone, please, help?
Thanks a lot in advance!
The difference is that the first query selects distinct (sku) rows, while the second selects distinct (sku, last_update) rows. Any fields used in order_by() are also included in the SQL SELECT, so DISTINCT is applied to a different set of columns, resulting in a different count.
Take a look at the queries Django generates (str() of the queryset's .query attribute shows the SQL). They should look something like the following:
Query #1
>>> str(ProductData.objects.values('sku').distinct().order_by('sku').query)
'SELECT DISTINCT "yourproject_productdata"."sku" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."sku" ASC'
Query #2
>>> str(ProductData.objects.values('sku').distinct().order_by('last_update').query)
'SELECT DISTINCT "yourproject_productdata"."sku", "yourproject_productdata"."last_update" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."last_update" ASC'
This behaviour is described in the distinct() documentation:
Any fields used in an order_by() call are included in the SQL SELECT
columns. This can sometimes lead to unexpected results when used in
conjunction with distinct(). If you order by fields from a related
model, those fields will be added to the selected columns and they may
make otherwise duplicate rows appear to be distinct. Since the extra
columns don’t appear in the returned results (they are only there to
support ordering), it sometimes looks like non-distinct results are
being returned.
Similarly, if you use a values() query to restrict the columns
selected, the columns used in any order_by() (or default model
ordering) will still be involved and may affect uniqueness of the
results.
The moral here is that if you are using distinct() be careful about
ordering by related models. Similarly, when using distinct() and
values() together, be careful when ordering by fields not in the
values() call.
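The same effect can be reproduced with plain SQL. The sketch below uses Python's sqlite3 module and three invented rows: DISTINCT over sku alone, DISTINCT over (sku, last_update), and one possible way to get one row per sku ordered by update time, by aggregating instead of using DISTINCT. Treat the last query as an assumption about what the asker wanted (ordering each sku by its most recent update).

```python
import sqlite3

# In-memory table with made-up rows standing in for product_data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_data (
        id INTEGER PRIMARY KEY,
        sku TEXT,
        last_update TEXT
    );
    INSERT INTO product_data (sku, last_update) VALUES
        ('A', '2020-01-01'), ('A', '2020-01-02'), ('B', '2020-01-01');
""")

# DISTINCT over sku only: one row per sku.
skus_only = conn.execute(
    "SELECT DISTINCT sku FROM product_data ORDER BY sku"
).fetchall()

# DISTINCT over (sku, last_update): 'A' now appears twice.
with_order_col = conn.execute(
    "SELECT DISTINCT sku, last_update FROM product_data ORDER BY last_update, sku"
).fetchall()

# One row per sku, ordered by each sku's most recent update.
by_latest = conn.execute(
    "SELECT sku FROM product_data GROUP BY sku ORDER BY MAX(last_update)"
).fetchall()

print(len(skus_only), len(with_order_col), by_latest)
```

In ORM terms the last query would roughly correspond to values('sku').annotate(latest=Max('last_update')).order_by('latest'), though that has not been tested against the asker's schema.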
I have a reasonably complex queryset thus, which rolls up data by isoweek:
>>> MyThing.objects.all().count()
30000
>>> qs = MyThing.objects.all().order_by('date').annotate(
...     dw=DateWeek('date'),  # uses WEEK function
...     dy=ExtractYear('date')
... ).values(
...     'dy', 'dw', 'group_id'
... ).annotate(
...     sum_count=Sum('count')
... ).values_list('dw', 'dy', 'group_id', 'sum_count')
>>> qs.count()
2000
So far so good. The problem is when I coerce this queryset into a list:
>>> len(list(qs))
30000
Why is this happening? How can I get the list of grouped values that the queryset purports to have when I count() it directly?
To solve this problem, remove the .order_by('date'). Although date is not included in the output, it is still part of the query that runs when the queryset is evaluated, so rows are grouped per date as well as per week and the row count inflates like that.
If you want to order the output, call .order_by('dy', 'dw') after adding those annotations.
You can also add an .order_by() with no arguments to clear any ordering set previously, for instance from the Model class definition default ordering.
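As a rough illustration of why the row count changes, here is a sqlite3 sketch with four made-up rows. The column names day and cnt are stand-ins for the question's date and count fields, and strftime('%W', ...) is used in place of the custom DateWeek function, so this only approximates the real query.

```python
import sqlite3

# In-memory table with invented rows standing in for MyThing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mything (
        id INTEGER PRIMARY KEY,
        day TEXT,
        group_id INTEGER,
        cnt INTEGER
    );
    -- three rows in one week, one row in the following week
    INSERT INTO mything (day, group_id, cnt) VALUES
        ('2021-03-01', 1, 5), ('2021-03-02', 1, 7),
        ('2021-03-03', 1, 1), ('2021-03-08', 1, 2);
""")

# Grouping by week and group only: one row per week.
weekly = conn.execute("""
    SELECT strftime('%W', day) AS dw, SUM(cnt)
    FROM mything
    GROUP BY dw, group_id
    ORDER BY dw
""").fetchall()

# Ordering by day pulls day into the grouping: one row per day again.
per_day = conn.execute("""
    SELECT strftime('%W', day) AS dw, SUM(cnt)
    FROM mything
    GROUP BY dw, group_id, day
    ORDER BY day
""").fetchall()

print(len(weekly), len(per_day))
```

The weekly query collapses the four rows into two, while the second query returns all four, which mirrors the 2000 versus 30000 difference in the question.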
The reason for this behavior is explained in the Django docs:
Any fields used in an order_by() call are included in the SQL SELECT
columns. This can sometimes lead to unexpected results when used in
conjunction with distinct(). If you order by fields from a related
model, those fields will be added to the selected columns and they may
make otherwise duplicate rows appear to be distinct. Since the extra
columns don’t appear in the returned results (they are only there to
support ordering), it sometimes looks like non-distinct results are
being returned.
Similarly, if you use a values() query to restrict the columns
selected, the columns used in any order_by() (or default model
ordering) will still be involved and may affect uniqueness of the
results.
The moral here is that if you are using distinct() be careful about
ordering by related models. Similarly, when using distinct() and
values() together, be careful when ordering by fields not in the
values() call.
I have a MySQL table with 13M rows. I can query the db directly as
SELECT DISTINCT(refdate) FROM myTable
The query takes 0.15 seconds and is great.
The equivalent table defined as a Django model and queried as
myTable.objects.values('refdate').distinct()
takes a very long time. Is it because there are too many items in the list before distinct()? How do I do this in a manner that doesn't bring everything down?
Thank you @solarissmoke for the pointer to connection.queries.
I was expecting to see
SELECT DISTINCT refdate FROM myTable
Instead, I got
SELECT DISTINCT refdate, nodeIndex, nodeType FROM myTable ORDER BY nodeIndex, refdate, nodeType
I then looked at myTable defined in models.py.
unique_together = (('nodeIndex', 'refdate', 'nodeType'), )
ordering = ['nodeIndex', 'refdate', 'nodeType']
From the "Interaction with default ordering or order_by()" section of the Django documentation:
normally you won’t want extra columns playing a part in the result, so clear out the ordering, or at least make sure it’s restricted only to those fields you also select in a values() call.
So I tried order_by() to flush the previously defined ordering and voila!
myTable.objects.values('refdate').order_by().distinct()
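You can try this (note that passing field names to distinct() is only supported on PostgreSQL, so it will not work on the MySQL table in the question):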
myTable.objects.all().distinct('refdate')
class Category(models.Model):
    pass

class Item(models.Model):
    cat = models.ForeignKey(Category)
I want to select exactly one item for each category. What is the query syntax to do this?
Your question isn't entirely clear: since you didn't say otherwise, I'm going to assume that you don't care which item is selected for each category, just that you need any one. If that isn't the case, please update the question to clarify.
tl;dr version: there is no documented way to explicitly use GROUP BY statements in Django, except by using a raw query. See the bottom for code to do so.
The problem is that doing what you're looking for in SQL itself requires a bit of a hack. You can easily try this example by entering sqlite3 :memory: at the command line:
CREATE TABLE category
(
    id INT
);
CREATE TABLE item
(
    id INT,
    category_id INT
);
INSERT INTO category VALUES (1);
INSERT INTO category VALUES (2);
INSERT INTO category VALUES (3);
INSERT INTO item VALUES (1,1);
INSERT INTO item VALUES (2,2);
INSERT INTO item VALUES (3,3);
INSERT INTO item VALUES (4,1);
INSERT INTO item VALUES (5,2);
SELECT id, category_id, COUNT(category_id) FROM item GROUP BY category_id;
returns
4|1|2
5|2|2
3|3|1
Which is what you're looking for (one item id for each category id), albeit with an extraneous COUNT. The count (or some other aggregate function) is needed in order to apply the GROUP BY.
Note: this will ignore categories that don't contain any items, which seems like sensible behaviour.
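Which item id comes back for each category in the query above is left to the database. If a predictable choice is wanted, one option (a sketch, not the only way) is to aggregate the id itself, for example with MIN(id). The snippet below replays the same sample data through Python's sqlite3 module.

```python
import sqlite3

# Same sample rows as the SQL session above, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INT, category_id INT);
    INSERT INTO item VALUES (1, 1), (2, 2), (3, 3), (4, 1), (5, 2);
""")

# MIN(id) makes the chosen item per category deterministic:
# the lowest item id in each category.
rows = conn.execute("""
    SELECT MIN(id), category_id
    FROM item
    GROUP BY category_id
    ORDER BY category_id
""").fetchall()

print(rows)
```

Swapping MIN for MAX would pick the highest item id per category instead.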
Now the question becomes, how to do this in Django?
The obvious answer is to use Django's aggregation/annotation support, in particular, combining annotate with values as is recommended elsewhere to GROUP queries in Django.
Reading those posts, it would seem we could accomplish what we're looking for with
Item.objects.values('id').annotate(unneeded_count=Count('category_id'))
However, this doesn't work. What Django does here is not just GROUP BY "category_id", but groups by all fields selected (i.e. GROUP BY "id", "category_id"); see note 1 below. I don't believe there is a way (in the public API, at least) to change this behaviour.
The solution is to fall back to raw SQL:
qs = Item.objects.raw('SELECT *, COUNT(category_id) FROM myapp_item GROUP BY category_id')
1: Note that you can inspect what queries Django is running with:
from django.db import connection
print(connection.queries[-1])  # connection.queries is only populated when DEBUG = True
Edit:
There are a number of other possible approaches, but most have (possibly severe) performance problems. Here are a couple:
1. Select an item from each category.
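items = []
for c in Category.objects.all():
    item = c.item_set.first()  # item_set is a manager, so it cannot be indexed directly
    if item is not None:  # skip categories that have no items
        items.append(item)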
This is a more clear and flexible approach, but has the obvious disadvantage of requiring many more database hits.
2. Use select_related
items = Item.objects.select_related()
and then do the grouping/filtering yourself (in Python).
Again, this is perhaps more clear than using raw SQL and only requires one query, but this one query could be very large (it will return all items and their categories) and doing the grouping/filtering yourself is probably less efficient than letting the database do it for you.
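For the second approach, the grouping step in Python could look roughly like the sketch below. It uses a namedtuple as a hypothetical stand-in for Item instances (with a real queryset you would iterate over Item.objects.select_related() instead), and it keeps the first item seen for each category.

```python
from collections import namedtuple

# Hypothetical stand-in for Item model instances.
Item = namedtuple("Item", ["id", "category_id"])


def first_item_per_category(items):
    """Keep the first item encountered for each category_id."""
    chosen = {}
    for item in items:
        # setdefault stores the item only if the category is not present yet
        chosen.setdefault(item.category_id, item)
    return list(chosen.values())


items = [Item(1, 1), Item(2, 2), Item(3, 3), Item(4, 1), Item(5, 2)]
picked = first_item_per_category(items)
print([(i.id, i.category_id) for i in picked])
```

This still loads every item into memory, so it carries the cost described above; it only shows that the filtering itself is a few lines.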