There is a simple SQL table with 3 columns: id, sku, last_update
and a very simple SQL statement: SELECT DISTINCT sku FROM product_data ORDER BY last_update ASC
What would the Django view code for this SQL statement be?
This code:
q = ProductData.objects.values('sku').distinct().order_by('sku')
returns 145 results
whereas this statement:
q = ProductData.objects.values('sku').distinct().order_by('last_update')
returns over 1000 results
Why is it so? Can someone, please, help?
Thanks a lot in advance!
The difference is that the first query returns a list of (sku) values, while the second returns a list of (sku, last_update) pairs. This happens because any fields used in order_by() are also included in the SQL SELECT, so DISTINCT is applied to a different set of columns, which results in a different count.
Take a look at the queries Django generates (the queryset's .query attribute holds the SQL); they should look something like the following:
Query #1
>>> str(ProductData.objects.values('sku').distinct().order_by('sku').query)
'SELECT DISTINCT "yourproject_productdata"."sku" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."sku" ASC'
Query #2
>>> str(ProductData.objects.values('sku').distinct().order_by('last_update').query)
'SELECT DISTINCT "yourproject_productdata"."sku", "yourproject_productdata"."last_update" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."last_update" ASC'
This behaviour is described in the distinct() documentation:
Any fields used in an order_by() call are included in the SQL SELECT
columns. This can sometimes lead to unexpected results when used in
conjunction with distinct(). If you order by fields from a related
model, those fields will be added to the selected columns and they may
make otherwise duplicate rows appear to be distinct. Since the extra
columns don’t appear in the returned results (they are only there to
support ordering), it sometimes looks like non-distinct results are
being returned.
Similarly, if you use a values() query to restrict the columns
selected, the columns used in any order_by() (or default model
ordering) will still be involved and may affect uniqueness of the
results.
The moral here is that if you are using distinct() be careful about
ordering by related models. Similarly, when using distinct() and
values() together, be careful when ordering by fields not in the
values() call.
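The effect is easy to reproduce outside Django. The following is a minimal sketch using Python's built-in sqlite3 module with made-up rows (the table contents are an assumption for illustration, not the asker's data). It runs the two SQL statements shown above, plus one possible way to get one row per sku ordered by time: grouping by sku and ordering by its most recent update. In the ORM that last query would presumably be written as ProductData.objects.values('sku').annotate(latest=Max('last_update')).order_by('latest').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_data (id INTEGER PRIMARY KEY, sku TEXT, last_update TEXT)"
)
# Hypothetical sample data: three skus, five rows.
conn.executemany(
    "INSERT INTO product_data (sku, last_update) VALUES (?, ?)",
    [
        ("A", "2020-01-01"),
        ("A", "2020-01-05"),
        ("B", "2020-01-02"),
        ("B", "2020-01-03"),
        ("C", "2020-01-04"),
    ],
)

# Ordering by a selected column: DISTINCT sees only sku.
by_sku = conn.execute(
    "SELECT DISTINCT sku FROM product_data ORDER BY sku"
).fetchall()

# Ordering by last_update adds it to the SELECT list, so DISTINCT
# now compares (sku, last_update) pairs and every row is unique.
by_pair = conn.execute(
    "SELECT DISTINCT sku, last_update FROM product_data ORDER BY last_update"
).fetchall()

# One row per sku, ordered by that sku's most recent update.
grouped = conn.execute(
    "SELECT sku, MAX(last_update) AS latest FROM product_data "
    "GROUP BY sku ORDER BY latest"
).fetchall()

print(len(by_sku), len(by_pair), grouped)
```

Whether MAX or MIN is the right aggregate depends on which timestamp should decide the position of a sku that was updated several times.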
I'm struggling (again) with Django's annotate functionality where the actual SQL query is quite clear to me.
Goal:
I want to get the number of users with a certain let's say status (it could be just any column of the model).
Approach(es):
1) User.objects.values('status').annotate(count=Count('*'))
This results in the following SQL query
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.id
ORDER BY users_user.id ASC
However, this will give me a queryset of all users each "annotated" with the count value. This is kind of the behaviour I would have expected.
2) User.objects.values('status').annotate(count=Count('*')).order_by()
This results in the following SQL query
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.status
No ORDER BY, and now the GROUP BY argument is the status column. This is not what I expected, but the result I was looking for.
Question:
Why does Django's order_by() without any argument affect the SQL GROUP BY argument? (Or broader, why does the second approach "work"?)
Some details:
django 2.2.9
postgres 9.4
This is explained in the Django aggregation documentation:
Fields that are mentioned in the order_by() part of a queryset (or which are used in the default ordering on a model) are used when selecting the output data, even if they are not otherwise specified in the values() call.
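To see why the ordering changes the grouping, here is a small sketch with Python's sqlite3 module and made-up users. It assumes, as the ORDER BY in the question suggests, that the model has a default ordering on id, so Django adds id to the GROUP BY next to status. The first query mimics that, the second mimics what is left once order_by() has cleared the ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_user (id INTEGER PRIMARY KEY, status TEXT)")
# Hypothetical sample data: four users, two distinct statuses.
conn.executemany(
    "INSERT INTO users_user (status) VALUES (?)",
    [("active",), ("active",), ("banned",), ("active",)],
)

# Ordering by id drags id into the GROUP BY: each user is its own group.
per_user = conn.execute(
    "SELECT status, COUNT(*) FROM users_user GROUP BY status, id ORDER BY id"
).fetchall()

# With the ordering cleared, only status is left in the GROUP BY.
per_status = dict(
    conn.execute(
        "SELECT status, COUNT(*) FROM users_user GROUP BY status"
    ).fetchall()
)

print(per_user)
print(per_status)
```

Note that the Django 2.2 release notes deprecate this interaction for Meta.ordering, and from Django 3.1 the default ordering is ignored in such GROUP BY queries; an explicit order_by('some_field') still affects the grouping.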
I have a reasonably complex queryset that rolls up data by ISO week:
>>> MyThing.objects.all().count()
30000
>>> qs = MyThing.objects.all().order_by('date').annotate(
...     dw=DateWeek('date'),  # uses WEEK function
...     dy=ExtractYear('date')
... ).values(
...     'dy', 'dw', 'group_id'
... ).annotate(
...     sum_count=Sum('count')
... ).values_list('dw', 'dy', 'group_id', 'sum_count')
>>> qs.count()
2000
So far so good. The problem is when I coerce this queryset into a list:
>>> len(list(qs))
30000
Why is this happening? How can I get the list of grouped values that the queryset purports to have when I count() it directly?
To solve this problem, remove the .order_by('date'). Although the date is not included in the output, the ordering field is still added to the query's GROUP BY, so every distinct date becomes its own group and the number of rows inflates like that.
If you want to order the output, call .order_by('dy', 'dw') after adding those annotations.
You can also add an .order_by() with no arguments to clear any ordering set previously, for instance from the Model class definition default ordering.
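The inflation can be illustrated in plain Python, without a database. This is only a sketch: the rows, the roll_up helper and its include_date flag are hypothetical, and it uses ISO weeks from datetime rather than the WEEK function from the question. Adding the date to the grouping key, which is what ordering by date effectively does, produces one group per day instead of one per week.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical rows standing in for MyThing: (date, group_id, count).
start = date(2021, 1, 4)  # a Monday, first day of ISO week 1 of 2021
rows = [(start + timedelta(days=offset), 1, 10) for offset in range(14)]


def roll_up(rows, include_date):
    """Sum counts per (iso_year, iso_week, group_id), optionally per date too."""
    totals = defaultdict(int)
    for day, group_id, count in rows:
        iso_year, iso_week, _weekday = day.isocalendar()
        key = (iso_year, iso_week, group_id)
        if include_date:
            # Mimics the ordering column being appended to GROUP BY.
            key += (day,)
        totals[key] += count
    return dict(totals)


with_date = roll_up(rows, include_date=True)
without_date = roll_up(rows, include_date=False)

print(len(with_date), len(without_date))
```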
The reason for this behavior is explained in the Django docs, in the note on distinct() and order_by() quoted above: any fields used in an order_by() call are included in the SQL SELECT columns, even when values() restricts the output, and they may affect the uniqueness of the results.
I have a MySQL table with 13M rows. I can query the db directly as
SELECT DISTINCT(refdate) FROM myTable
The query takes 0.15 seconds and is great.
The equivalent table defined as a Django model and queried as
myTable.objects.values('refdate').distinct()
takes a very long time. Is it because there are too many items in the list before distinct()? How do I do this in a manner that doesn't bring everything down?
Thank you @solarissmoke for the pointer to connection.queries.
I was expecting to see
SELECT DISTINCT refdate FROM myTable
Instead, I got
SELECT DISTINCT refdate, itemIndex, itemType FROM myTable ORDER BY itemIndex, refdate, itemType.
I then looked at myTable defined in models.py.
unique_together = (('nodeIndex', 'refdate', 'nodeType'), )
ordering = ['nodeIndex', 'refdate', 'nodeType']
From "Interaction with default ordering or order_by()":
normally you won’t want extra columns playing a part in the result, so clear out the ordering, or at least make sure it’s restricted only to those fields you also select in a values() call.
So I tried order_by() to flush the previously defined ordering and voila!
myTable.objects.values('refdate').order_by().distinct()
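You can try this (on PostgreSQL only, since passing field names to distinct() is not supported by other backends such as MySQL):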
myTable.objects.all().distinct('refdate')
I have two models, Track and Pair. Each Pair has a track1, track2 and popularity. I'm trying to get an ordered list by popularity (descending) of pairs, with no two pairs having the same track1. Here's what I've tried so far:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me the following error:
ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
...so I tried this:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('popularity', 'track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me entries with duplicate track1__ids. Does anyone know of a way of solving this problem? I'm guessing I'll have to use raw() or something similar but I don't know how I'd approach a problem like this. I'm using PostgreSQL for the DB backend so DISTINCT should be supported.
First, let's clarify: DISTINCT is standard SQL, while DISTINCT ON is a PostgreSQL extension.
The error (DISTINCT ON expressions must match initial ORDER BY expressions) indicates that you should fix your ORDER BY clause, not the DISTINCT ON (if you change the latter, you'll end up with different results, as you have already experienced).
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
This will give you your expected results:
lstPairs = Pair.objects.order_by('track1__id','-popularity').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
In SQL:
SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
FROM pairs
ORDER BY track1__id, popularity DESC
But the rows will probably come back in the wrong order: sorted by track1__id rather than by popularity.
If you want your original order, you can use a sub-query here:
SELECT *
FROM (
    SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
    FROM pairs
    ORDER BY track1__id, popularity DESC
    -- LIMIT here, if necessary
) AS best_pairs
ORDER BY popularity DESC, track1__id
See the documentation on distinct.
First:
On PostgreSQL only, you can pass positional arguments (*fields) in order to specify the names of fields to which the DISTINCT should apply.
You don't specify what your database backend is; if it is not PostgreSQL, you have no chance of making it work.
Second:
When you specify field names, you must provide an order_by() in the QuerySet, and the fields in order_by() must start with the fields in distinct(), in the same order.
I think you should either use raw(), or fetch the entire list of Pairs ordered by popularity and then filter for track1 uniqueness in Python.
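The Python-side filtering could look roughly like the sketch below. The function name top_pairs_unique_track1 and the sample tuples are made up for illustration; in a real project the tuples would presumably come from something like Pair.objects.values_list('track1__id', 'track2__id', 'popularity'). The function walks the pairs from most to least popular and keeps only the first pair seen for each track1.

```python
def top_pairs_unique_track1(pairs, limit):
    """Return up to `limit` (track1_id, track2_id, popularity) tuples,
    most popular first, with no track1_id appearing twice."""
    seen = set()
    result = []
    # Sort by popularity descending, then track1_id ascending as a tie-breaker.
    for track1_id, track2_id, popularity in sorted(
        pairs, key=lambda pair: (-pair[2], pair[0])
    ):
        if len(result) >= limit:
            break
        if track1_id in seen:
            continue  # a more popular pair for this track1 was already kept
        seen.add(track1_id)
        result.append((track1_id, track2_id, popularity))
    return result


# Hypothetical sample data.
pairs = [(1, 2, 50), (1, 3, 90), (2, 1, 70), (3, 1, 70), (2, 3, 10)]
top_two = top_pairs_unique_track1(pairs, limit=2)
all_unique = top_pairs_unique_track1(pairs, limit=10)

print(top_two)
print(all_unique)
```

This loads every pair into memory, so it is only reasonable when the table is small enough; otherwise doing the work in the database is likely the better trade-off.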
I have the following models which I'm testing with SQLite3 and MySQL:
# (various model fields extraneous to discussion removed...)
class Run(models.Model):
    runNumber = models.IntegerField()

class Snapshot(models.Model):
    t = models.DateTimeField()

class SnapshotRun(models.Model):
    snapshot = models.ForeignKey(Snapshot)
    run = models.ForeignKey(Run)
    # other fields which make it possible to have multiple distinct Run objects per Snapshot
I want a query which will give me a set of runNumbers & snapshot IDs for which the Snapshot.id is below some specified value. Naively I would expect this to work:
print SnapshotRun.objects.filter(snapshot__id__lte=ss_id)\
.order_by("run__runNumber", "-snapshot__id")\
.distinct("run__runNumber", "snapshot__id")\
.values("run__runNumber", "snapshot__id")
But this blows up with
NotImplementedError: DISTINCT ON fields is not supported by this database backend
for both database backends. Postgres is unfortunately not an option.
Time to fall back to raw SQL?
Update:
Since Django's ORM won't help me out of this one (thanks @jknupp) I did manage to get the following raw SQL to work:
cursor.execute("""
    SELECT r.runNumber, ssr1.snapshot_id
    FROM livedata_run AS r
    JOIN livedata_snapshotrun AS ssr1
        ON ssr1.id =
        (
            SELECT id
            FROM livedata_snapshotrun AS ssr2
            WHERE ssr2.run_id = r.id
            AND ssr2.snapshot_id <= %s
            ORDER BY snapshot_id DESC
            LIMIT 1
        );
""", [max_ss_id])  # params must be a sequence, even for a single value
Here livedata is the Django app these tables live in.
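For anyone who wants to experiment with this query, here is a self-contained sketch using Python's sqlite3 module (which uses ? instead of %s as the placeholder). The rows are invented, and the tables are reduced to the columns used above. It also runs a simpler GROUP BY / MAX query that returns the same rows under the assumption that runNumber is unique per Run; in the ORM that variant would presumably be SnapshotRun.objects.filter(snapshot__id__lte=ss_id).values('run__runNumber').annotate(latest=Max('snapshot__id')), which does not need DISTINCT ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE livedata_run (id INTEGER PRIMARY KEY, runNumber INTEGER);
    CREATE TABLE livedata_snapshotrun (
        id INTEGER PRIMARY KEY, snapshot_id INTEGER, run_id INTEGER
    );
    -- Hypothetical sample data: two runs, five snapshot links.
    INSERT INTO livedata_run (id, runNumber) VALUES (1, 100), (2, 200);
    INSERT INTO livedata_snapshotrun (snapshot_id, run_id) VALUES
        (1, 1), (2, 1), (3, 1), (2, 2), (4, 2);
    """
)

max_ss_id = 3

# The correlated sub-query from the post: for each run, pick the
# snapshotrun row with the highest snapshot_id not above the limit.
correlated = conn.execute(
    """
    SELECT r.runNumber, ssr1.snapshot_id
    FROM livedata_run AS r
    JOIN livedata_snapshotrun AS ssr1
      ON ssr1.id = (
          SELECT ssr2.id
          FROM livedata_snapshotrun AS ssr2
          WHERE ssr2.run_id = r.id AND ssr2.snapshot_id <= ?
          ORDER BY ssr2.snapshot_id DESC
          LIMIT 1
      )
    ORDER BY r.runNumber
    """,
    [max_ss_id],
).fetchall()

# Alternative: aggregate instead of picking a row.
grouped = conn.execute(
    """
    SELECT r.runNumber, MAX(ssr.snapshot_id)
    FROM livedata_run AS r
    JOIN livedata_snapshotrun AS ssr ON ssr.run_id = r.id
    WHERE ssr.snapshot_id <= ?
    GROUP BY r.runNumber
    ORDER BY r.runNumber
    """,
    [max_ss_id],
).fetchall()

print(correlated)
print(grouped)
```

The aggregate form only returns the grouped column and the maximum; if other columns of the winning snapshotrun row are needed, the correlated sub-query remains the more direct option.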
The note in the Django documentation is pretty clear:
Note:
Any fields used in an order_by() call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction with distinct(). If you order by fields from a related model, those fields will be added to the selected columns and they may make otherwise duplicate rows appear to be distinct. Since the extra columns don’t appear in the returned results (they are only there to support ordering), it sometimes looks like non-distinct results are being returned.
Similarly, if you use a values() query to restrict the columns selected, the columns used in any order_by() (or default model ordering) will still be involved and may affect uniqueness of the results.
The moral here is that if you are using distinct() be careful about ordering by related models. Similarly, when using distinct() and values() together, be careful when ordering by fields not in the values() call.
Also, below that:
This ability to specify field names (with distinct) is only available in PostgreSQL.