I'm struggling (again) with Django's annotate functionality where the actual SQL query is quite clear to me.
Goal:
I want to get the number of users for each value of a given column, say status (it could be any column of the model).
Approach(es):
1) User.objects.values('status').annotate(count=Count('*'))
This results in the following SQL query:
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.id
ORDER BY users_user.id ASC
However, this will give me a queryset of all users each "annotated" with the count value. This is kind of the behaviour I would have expected.
2) User.objects.values('status').annotate(count=Count('*')).order_by()
This results in the following SQL query:
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.status
No ORDER BY, and now the GROUP BY argument is the status column. This is not what I expected, but the result I was looking for.
Question:
Why does Django's order_by() without any argument affect the SQL GROUP BY argument? (Or broader, why does the second approach "work"?)
Some details:
django 2.2.9
postgres 9.4
This is explained here
Fields that are mentioned in the order_by() part of a queryset (or which are used in the default ordering on a model) are used when selecting the output data, even if they are not otherwise specified in the values() call.
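The effect can be reproduced at the SQL level without Django. The following is a minimal sketch using Python's built-in sqlite3 module. The rows are made up for illustration, and the two statements are hand-written approximations of the queries with and without the implicit ordering column, not output captured from Django.

```python
import sqlite3

# In-memory table standing in for users_user (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_user (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO users_user (status) VALUES (?)",
    [("active",), ("active",), ("inactive",), ("active",)],
)

# The ordering column (id) also takes part in GROUP BY,
# so every group contains exactly one row.
grouped_with_id = conn.execute(
    "SELECT status, COUNT(*) FROM users_user "
    "GROUP BY status, id ORDER BY id"
).fetchall()

# Ordering cleared: only the column named in values() is grouped.
grouped_by_status = sorted(
    conn.execute(
        "SELECT status, COUNT(*) FROM users_user GROUP BY status"
    ).fetchall()
)

print(grouped_with_id)
print(grouped_by_status)
```

The first result has one row per user with a count of 1, while the second has one row per status, which matches the difference between the two approaches in the question.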
Related
There is a simple SQL table with 3 columns: id, sku, last_update
and a very simple SQL statement: SELECT DISTINCT sku FROM product_data ORDER BY last_update ASC
What would the Django view code for this SQL statement be?
This code:
q = ProductData.objects.values('sku').distinct().order_by('sku')
returns 145 results
whereas this statement:
q = ProductData.objects.values('sku').distinct().order_by('last_update')
returns over 1000 results
Why is it so? Can someone, please, help?
Thanks a lot in advance!
The difference is that the first query returns a list of (sku) values, while the second returns a list of (sku, last_update) pairs. Any fields used in order_by() are also included in the SQL SELECT, so DISTINCT is applied to a different set of columns, which results in a different count.
Take a look at the queries Django generates; they should look something like the following:
Query #1
>>> str(ProductData.objects.values('sku').distinct().order_by('sku').query)
'SELECT DISTINCT "yourproject_productdata"."sku" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."sku" ASC'
Query #2
>>> str(ProductData.objects.values('sku').distinct().order_by('last_update').query)
'SELECT DISTINCT "yourproject_productdata"."sku", "yourproject_productdata"."last_update" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."last_update" ASC'
This behaviour is described in the distinct documentation
Any fields used in an order_by() call are included in the SQL SELECT
columns. This can sometimes lead to unexpected results when used in
conjunction with distinct(). If you order by fields from a related
model, those fields will be added to the selected columns and they may
make otherwise duplicate rows appear to be distinct. Since the extra
columns don’t appear in the returned results (they are only there to
support ordering), it sometimes looks like non-distinct results are
being returned.
Similarly, if you use a values() query to restrict the columns
selected, the columns used in any order_by() (or default model
ordering) will still be involved and may affect uniqueness of the
results.
The moral here is that if you are using distinct() be careful about
ordering by related models. Similarly, when using distinct() and
values() together, be careful when ordering by fields not in the
values() call.
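To make the row counts concrete, here is a small sketch with Python's built-in sqlite3 module. The table contents are invented for illustration. The third statement is one possible way (an assumption, not something taken from the answers above) to get each sku once while still ordering by update time, by aggregating last_update per sku.

```python
import sqlite3

# Illustrative stand-in for the product_data table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_data "
    "(id INTEGER PRIMARY KEY, sku TEXT, last_update TEXT)"
)
conn.executemany(
    "INSERT INTO product_data (sku, last_update) VALUES (?, ?)",
    [
        ("A", "2020-01-01"),
        ("A", "2020-01-05"),
        ("B", "2020-01-02"),
        ("B", "2020-01-03"),
        ("C", "2020-01-04"),
    ],
)

# DISTINCT over sku alone: one row per sku.
sku_only = conn.execute(
    "SELECT DISTINCT sku FROM product_data ORDER BY sku"
).fetchall()

# DISTINCT over (sku, last_update): the ordering column breaks up duplicates.
with_order_col = conn.execute(
    "SELECT DISTINCT sku, last_update FROM product_data ORDER BY last_update"
).fetchall()

# One row per sku, ordered by its most recent update.
by_latest = conn.execute(
    "SELECT sku, MAX(last_update) AS latest FROM product_data "
    "GROUP BY sku ORDER BY latest"
).fetchall()

print(len(sku_only), len(with_order_col))
print(by_latest)
```

In the ORM, the aggregated form would presumably be written as ProductData.objects.values('sku').annotate(latest=Max('last_update')).order_by('latest'), with Max imported from django.db.models. Check the generated SQL via the queryset's query attribute before relying on it.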
So the question is "is it possible to make the following query using Django ORM without raw statements?"
SELECT
my_table.column_a + my_table.column_b
FROM
my_table
The example for which it would be suitable from my point of view:
We have a model:
class MyOperations(models.Model):
operation_start_time = models.DateTimeField()
At some point we create a record and set the field value to Now (or update an existing record; it doesn't matter):
MyOperations.objects.create(operation_start_time=functions.Now())
Now we want to know how much time has already passed. I would expect that Django ORM produces the following SQL statement to request data from the database (let's assume that we use MySQL backend):
SELECT
TIMESTAMPDIFF(MICROSECOND, `myapp_myoperations`.`operation_start_time`, CURRENT_TIMESTAMP) AS `time_spent`
FROM
`myapp_myoperations`
WHERE ...
So is there a way to achieve this without raw statements?
For now I settled on the following solution:
MyOperations.objects.values('operation_start_time').annotate(
diff=ExpressionWrapper(functions.Now() - F('operation_start_time'),
output_field=DurationField()
)).filter(...)
It produces the following SQL statement:
SELECT
`myapp_myoperations`.`operation_start_time`,
TIMESTAMPDIFF(MICROSECOND, `myapp_myoperations`.`operation_start_time`, CURRENT_TIMESTAMP) AS `diff`
FROM
`myapp_myoperations`
WHERE ...
Or in the Python response object representation:
{'operation_start_time': datetime(...), 'diff': timedelta(...)}
Is there a way to get the response dict with only diff, since this is the only field I am interested in?
Django ORM produced a query which requests operation_start_time, just as we had written. But if I remove the call to values() altogether, it produces a query which requests all table columns.
Solution which produces the expected SQL
We just need to move the call to values() to a point where diff is already known to the query:
MyOperations.objects.annotate(
diff=ExpressionWrapper(functions.Now() - F('operation_start_time'),
output_field=DurationField()
)).values('diff').filter(...)
You can use values() on your calculated field, so a query like
MyOperations.objects.values('operation_start_time').annotate(
diff=ExpressionWrapper(functions.Now() - F('operation_start_time'),
output_field=DurationField()
)).values('diff')
should give you a resulting queryset containing only your calculated 'diff'.
I have a MySQL table with 13M rows. I can query the db directly as
SELECT DISTINCT(refdate) FROM myTable
The query takes 0.15 seconds and is great.
The equivalent table defined as a Django model and queried as
myTable.objects.values('refdate').distinct()
takes a very long time. Is it because there are too many items in the list before distinct()? How do I do this in a manner that doesn't bring everything down?
Thank you @solarissmoke for the pointer to connection.queries.
I was expecting to see
SELECT DISTINCT refdate FROM myTable
Instead, I got
SELECT DISTINCT refdate, nodeIndex, nodeType FROM myTable ORDER BY nodeIndex, refdate, nodeType
I then looked at myTable defined in models.py.
unique_together = (('nodeIndex', 'refdate', 'nodeType'), )
ordering = ['nodeIndex', 'refdate', 'nodeType']
From Interaction with default ordering or order_by
normally you won’t want extra columns playing a part in the result, so clear out the ordering, or at least make sure it’s restricted only to those fields you also select in a values() call.
So I tried order_by() to flush the previously defined ordering and voila!
myTable.objects.values('refdate').order_by().distinct()
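You can try this (note that passing field names to distinct() is only supported on PostgreSQL, so it will not work on MySQL):
myTable.objects.all().distinct('refdate')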
I have the following models which I'm testing with SQLite3 and MySQL:
# (various model fields extraneous to discussion removed...)
class Run(models.Model):
runNumber = models.IntegerField()
class Snapshot(models.Model):
t = models.DateTimeField()
class SnapshotRun(models.Model):
snapshot = models.ForeignKey(Snapshot)
run = models.ForeignKey(Run)
# other fields which make it possible to have multiple distinct Run objects per Snapshot
I want a query which will give me a set of runNumbers & snapshot IDs for which the Snapshot.id is below some specified value. Naively I would expect this to work:
print SnapshotRun.objects.filter(snapshot__id__lte=ss_id)\
.order_by("run__runNumber", "-snapshot__id")\
.distinct("run__runNumber", "snapshot__id")\
.values("run__runNumber", "snapshot__id")
But this blows up with
NotImplementedError: DISTINCT ON fields is not supported by this database backend
for both database backends. Postgres is unfortunately not an option.
Time to fall back to raw SQL?
Update:
Since Django's ORM won't help me out of this one (thanks @jknupp) I did manage to get the following raw SQL to work:
cursor.execute("""
SELECT r.runNumber, ssr1.snapshot_id
FROM livedata_run AS r
JOIN livedata_snapshotrun AS ssr1
ON ssr1.id =
(
SELECT id
FROM livedata_snapshotrun AS ssr2
WHERE ssr2.run_id = r.id
AND ssr2.snapshot_id <= %s
ORDER BY snapshot_id DESC
LIMIT 1
);
""", [max_ss_id])
Here livedata is the Django app these tables live in.
The note in the Django documentation is pretty clear:
Note:
Any fields used in an order_by() call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction with distinct(). If you order by fields from a related model, those fields will be added to the selected columns and they may make otherwise duplicate rows appear to be distinct. Since the extra columns don’t appear in the returned results (they are only there to support ordering), it sometimes looks like non-distinct results are being returned.
Similarly, if you use a values() query to restrict the columns selected, the columns used in any order_by() (or default model ordering) will still be involved and may affect uniqueness of the results.
The moral here is that if you are using distinct() be careful about ordering by related models. Similarly, when using distinct() and values() together, be careful when ordering by fields not in the values() call.
Also, below that:
This ability to specify field names (with distinct) is only available in PostgreSQL.
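On backends without DISTINCT ON, "latest snapshot per run" can often be expressed with GROUP BY and MAX instead. The sketch below uses Python's built-in sqlite3 module with simplified, made-up tables (run and snapshotrun stand in for livedata_run and livedata_snapshotrun). It is an alternative formulation, not the raw SQL from the question.

```python
import sqlite3

# Simplified stand-ins for the Run and SnapshotRun tables (illustrative data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE run (id INTEGER PRIMARY KEY, runNumber INTEGER);
    CREATE TABLE snapshotrun (
        id INTEGER PRIMARY KEY, snapshot_id INTEGER, run_id INTEGER
    );
    INSERT INTO run (id, runNumber) VALUES (1, 100), (2, 200);
    INSERT INTO snapshotrun (snapshot_id, run_id)
        VALUES (1, 1), (2, 1), (3, 1), (2, 2), (5, 2);
    """
)

max_ss_id = 4  # hypothetical upper bound on the snapshot id

# For each run number, take the highest snapshot id not exceeding the bound.
latest_per_run = conn.execute(
    "SELECT r.runNumber, MAX(ssr.snapshot_id) "
    "FROM snapshotrun AS ssr JOIN run AS r ON r.id = ssr.run_id "
    "WHERE ssr.snapshot_id <= ? "
    "GROUP BY r.runNumber ORDER BY r.runNumber",
    (max_ss_id,),
).fetchall()

print(latest_per_run)
```

The ORM counterpart would presumably be SnapshotRun.objects.filter(snapshot__id__lte=ss_id).values('run__runNumber').annotate(latest=Max('snapshot__id')), which has not been tested against these exact models. Note that this groups by runNumber, so two Run rows sharing a number would be merged, unlike the per-row subquery in the raw SQL.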
This is my code:
[app.system_name for app in App.objects.all().distinct('system_name')]
Gives me:
[u'blog', u'files', u'calendar', u'tasks', u'statuses', u'wiki', u'wiki', u'blog', u'files', u'blog', u'ideas', u'calendar', u'wiki', u'wiki', u'statuses', u'tasks', u'survey', u'blog']
As you might expect I want all the unique values of the field system_name, but now I just get all App instances back.
Specifying fields in distinct is only supported in Django 1.4+. If you're running 1.3, it's just ignoring it.
If you are running Django 1.4, you must add an order_by clause that includes and starts with all the fields in distinct.
Even then, specifying fields with distinct is only supported on PostgreSQL. If you're running something else, such as MySQL, you're out of luck.
All this information is in the docs.
You have to order by the same field name when using distinct with a field name:
App.objects.order_by('system_name').distinct('system_name')
From the doc:
When you specify field names, you must provide an order_by() in the
QuerySet, and the fields in order_by() must start with the fields in
distinct(), in the same order.
For example, SELECT DISTINCT ON (a) gives you the first row for each
value in column a. If you don't specify an order, you'll get some
arbitrary row.
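The "first row for each value" semantics, and why the ordering has to start with the distinct fields, can be mimicked in plain Python. This is only an illustration of the idea with invented rows and a hypothetical helper named distinct_on. It is not how PostgreSQL or Django implement it.

```python
from itertools import groupby
from operator import itemgetter

# Invented rows standing in for App records.
rows = [
    {"system_name": "wiki", "id": 7},
    {"system_name": "blog", "id": 8},
    {"system_name": "blog", "id": 1},
    {"system_name": "wiki", "id": 6},
    {"system_name": "files", "id": 2},
]


def distinct_on(rows, key, order_by):
    """Keep the first row per `key` after sorting by (key, order_by)."""
    # Sorting by the key first makes equal keys adjacent, which groupby needs;
    # the secondary sort decides which row counts as "first" in each group.
    ordered = sorted(rows, key=itemgetter(key, order_by))
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


first_per_name = distinct_on(rows, "system_name", "id")
print(first_per_name)
```

Without the sort on the key, equal values would not be adjacent and the same system_name could show up more than once, which is the analogue of omitting the matching order_by().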
You can use values_list() together with distinct(); passing flat=True returns plain values instead of 1-tuples:
App.objects.values_list('system_name', flat=True).distinct()