Django Models - SELECT DISTINCT(foo) FROM table is too slow

Django Models - SELECT DISTINCT(foo) FROM table is too slow - python

I have a MySQL table with 13M rows. I can query the db directly as
SELECT DISTINCT(refdate) FROM myTable
The query takes 0.15 seconds and is great.
The equivalent table defined as a Django model and queried as
myTable.objects.values(`refdate`).distinct()
takes a very long time. Is it because there are too many items in the list before distinct(). How do I do this in a manner that doesn't bring everything down?

Thank you #solarissmoke for the pointer to connection.queries.
I was expecting to see
SELECT DISTINCT refdate FROM myTable
Instead, I got
SELECT DISTINCT refdate, itemIndex, itemType FROM myTable ORDER BY itemIndex, refdate, itemType.
I then looked at myTable defined in models.py.
unique_together = (('nodeIndex', 'refdate', 'nodeType'), )
ordering = ['nodeIndex', 'refdate', 'nodeType']
From Interaction with default ordering or order_by
normally you won’t want extra columns playing a part in the result, so clear out the ordering, or at least make sure it’s restricted only to those fields you also select in a values() call.
So I tried order_by() to flush the previously defined ordering and voila!
myTable.objects.values('refdate').order_by().distinct()

You can try this:
myTable.objects.all().distinct('refdate')

Related

Nesting Django QuerySets

Is there a way to create a queryset that operates on a nested queryset?
The simplest example I can think of to explain what I'm trying to accomplish is by demonstration.
I would like to write code something like
SensorReading.objects.filter(reading=1).objects.filter(meter=1)
resulting in SQL looking like
SELECT * FROM (
SELECT * FROM SensorReading WHERE reading=1
) WHERE sensor=1;
More specifically I have a model representing readings from sensors
class SensorReading(models.Model):
sensor=models.PositiveIntegerField()
timestamp=models.DatetimeField()
reading=models.IntegerField()
With this I am creating a queryset that annotates every sensor with the elapsed time since the previous reading in seconds
readings = (
SensorReading.objects.filter(**filters)
.annotate(
previous_read=Window(
expression=window.Lead("timestamp"),
partition_by=[F("sensor"),],
order_by=["timestamp",],
frame=RowRange(start=-1, end=0),
)
)
.annotate(delta=Abs(Extract(F("timestamp") - F("previous_read"), "epoch")))
)
I now want to aggregate those per sensor to see the minimum and maximum elapsed time between readings from every sensor. I initially tried
readings.values("sensor").annotate(max=Max('delta'),min=Min('delta'))[0]
however, this fails because window values cannot be used inside the aggregate.
Are there any methods or libraries to solve this without needing to resort to raw SQL? Or have I just overlooked a simpler solution to the problem?

The short answer is Yes you can, using the id__in lookup and a subquery in the filter method. function from the django.db.models module.
The long answer is how? :
you can create a subquery that retrieves the filtered SensorReading objects, and then use that subquery in the main queryset (For example):
from django.db.models import Subquery
subquery = SensorReading.objects.filter(reading=1).values('id')
readings = SensorReading.objects.filter(id__in=Subquery(subquery), meter=1)
The above code will generate SQL that is similar to what you described in your example:
SELECT * FROM SensorReading
WHERE id IN (SELECT id FROM SensorReading WHERE reading=1)
AND meter=1;
Another way is to chain the filter() on the queryset that you have created, and add the second filter on top of it
readings = (
SensorReading.objects.filter(**filters)
.annotate(
previous_read=Window(
expression=window.Lead("timestamp"),
partition_by=[F("sensor"),],
order_by=["timestamp",],
frame=RowRange(start=-1, end=0),
)
)
.annotate(delta=Abs(Extract(F("timestamp") - F("previous_read"), "epoch")))
.filter(sensor=1)
)
UPDATE:
As you commented below, you can use the RawSQL function from the django.db.models module to to aggregate the window function values without running the subquery multiple times. This allows you to include raw SQL in the queryset, and use the results of that SQL in further querysets or aggregations.
For example, you can create a raw SQL query that retrieves the filtered SensorReading objects, the previous_read and delta fields with the window function applied, and then use that SQL in the main queryset:
from django.db.models import RawSQL
raw_sql = '''
SELECT id, sensor, timestamp, reading,
LAG(timestamp) OVER (PARTITION BY sensor ORDER BY timestamp) as previous_read,
ABS(EXTRACT(EPOCH FROM timestamp - LAG(timestamp) OVER (PARTITION BY sensor ORDER BY timestamp))) as delta
FROM myapp_sensorreading
WHERE reading = 1
'''
readings = SensorReading.objects.raw(raw_sql)
You can then use the readings queryset to aggregate the data as you need, for example:
aggregated_data = readings.values("sensor").annotate(max=Max('delta'),min=Min('delta'))
Just be aware of the security implications of using raw SQL, as it allows you to include user input directly in the query, which could lead to SQL injection attacks. Be sure to properly validate and sanitize any user input that you use in a raw SQL query.

Ended up rolling my own solution, basically introspecting the queryset to create a fake table to use in the creation of a new query set and setting the alias to a node that knows to render the SQL for the inner query
allows me to do something like
readings = (
NestedQuery(
SensorReading.objects.filter(**filters)
.annotate(
previous_read=Window(
expression=window.Lead("timestamp"),
partition_by=[F("sensor"),],
order_by=[
"timestamp",
],
frame=RowRange(start=-1, end=0),
)
)
.annotate(delta=Abs(Extract(F("timestamp") - F("previous_read"), "epoch")))
)
.values("sensor")
.annotate(min=Min("delta"), max=Max("delta"))
)
code is available on github, and I've published it on pypi
https://github.com/Simage/django-nestedquery
I have no doubt that I'm leaking the tables or some such nonsense still and this should be considered proof of concept, not any sort of production code.

Django group_by argument depending on order_by

I'm struggling (again) with Django's annotate functionality where the actual SQL query is quite clear to me.
Goal:
I want to get the number of users with a certain let's say status (it could be just any column of the model).
Approach(es):
1) User.objects.values('status').annotate(count=Count('*'))
This results into the following SQL query
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.id
ORDER BY usser_user.id ASC
However, this will give me a queryset of all users each "annotated" with the count value. This is kind of the behaviour I would have expected.
2) User.objects.values('status').annotate(count=Count('*')).order_by()
This results into the following SQL query
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.status
No ORDER BY, and now the GROUP BY argument is the status column. This is not what I expected, but the result I was looking for.
Question:
Why does Django's order_by() without any argument affect the SQL GROUP BY argument? (Or broader, why does the second approach "work"?)
Some details:
django 2.2.9
postgres 9.4

This is explained here
Fields that are mentioned in the order_by() part of a queryset (or which are used in the default ordering on a model) are used when selecting the output data, even if they are not otherwise specified in the values() call.

What is the Django query for the below SQL statement?

There is a simple SQL table with 3 columns: id, sku, last_update
and a very simple SQL statement: SELECT DISTINCT sku FROM product_data ORDER BY last_update ASC
What would be a django view code for the aforesaid SQL statement?
This code:
q = ProductData.objects.values('sku').distinct().order_by('sku')
returns 145 results
whereas this statement:
q = ProductData.objects.values('sku').distinct().order_by('last_update')
returns over 1000 results
Why is it so? Can someone, please, help?
Thanks a lot in advance!

The difference is that in the first query the result is a list of (sku)s, in the second is a list of (sku, last_update)s, this because any fields included in the order_by, are also included in the SQL SELECT, thus the distinct is applied to a different set or records, resulting in a different count.
Take a look to the queries Django generates, they should be something like the followings:
Query #1
>>> str(ProductData.objects.values('sku').distinct().order_by('sku'))
'SELECT DISTINCT "yourproject_productdata"."sku" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."sku" ASC'
Query #2
>>> str(ProductData.objects.values('sku').distinct().order_by('last_update'))
'SELECT DISTINCT "yourproject_productdata"."sku", "yourproject_productdata"."last_update" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."last_update" ASC'
This behaviour is described in the distinct documentation
Any fields used in an order_by() call are included in the SQL SELECT
columns. This can sometimes lead to unexpected results when used in
conjunction with distinct(). If you order by fields from a related
model, those fields will be added to the selected columns and they may
make otherwise duplicate rows appear to be distinct. Since the extra
columns don’t appear in the returned results (they are only there to
support ordering), it sometimes looks like non-distinct results are
being returned.
Similarly, if you use a values() query to restrict the columns
selected, the columns used in any order_by() (or default model
ordering) will still be involved and may affect uniqueness of the
results.
The moral here is that if you are using distinct() be careful about
ordering by related models. Similarly, when using distinct() and
values() together, be careful when ordering by fields not in the
values() call.

Django: Selecting distinct values on maximum foreign key value

I have the following models which I'm testing with SQLite3 and MySQL:
# (various model fields extraneous to discussion removed...)
class Run(models.Model):
runNumber = models.IntegerField()
class Snapshot(models.Model):
t = models.DateTimeField()
class SnapshotRun(models.Model):
snapshot = models.ForeignKey(Snapshot)
run = models.ForeignKey(Run)
# other fields which make it possible to have multiple distinct Run objects per Snapshot
I want a query which will give me a set of runNumbers & snapshot IDs for which the Snapshot.id is below some specified value. Naively I would expect this to work:
print SnapshotRun.objects.filter(snapshot__id__lte=ss_id)\
.order_by("run__runNumber", "-snapshot__id")\
.distinct("run__runNumber", "snapshot__id")\
.values("run__runNumber", "snapshot__id")
But this blows up with
NotImplementedError: DISTINCT ON fields is not supported by this database backend
for both database backends. Postgres is unfortunately not an option.
Time to fall back to raw SQL?
Update:
Since Django's ORM won't help me out of this one (thanks #jknupp) I did manage to get the following raw SQL to work:
cursor.execute("""
SELECT r.runNumber, ssr1.snapshot_id
FROM livedata_run AS r
JOIN livedata_snapshotrun AS ssr1
ON ssr1.id =
(
SELECT id
FROM livedata_snapshotrun AS ssr2
WHERE ssr2.run_id = r.id
AND ssr2.snapshot_id <= %s
ORDER BY snapshot_id DESC
LIMIT 1
);
""", max_ss_id)
Here livedata is the Django app these tables live in.

The note in the Django documentation is pretty clear:
Note:
Any fields used in an order_by() call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction with distinct(). If order by fields from a related model, those fields will be added to the selected columns and they may make otherwise duplicate rows appear to be distinct. Since the extra columns don’t appear in the returned results (they are only there to support ordering), it sometimes looks like non-distinct results are being returned.
Similarly, if you use a values() query to restrict the columns selected, the columns used in any order_by() (or default model ordering) will still be involved and may affect uniqueness of the results.
The moral here is that if you are using distinct() be careful about ordering by related models. Similarly, when using distinct() and values() together, be careful when ordering by fields not in the values() call.
Also, below that:
This ability to specify field names (with distinct) is only available in PostgreSQL.

Query syntax to select exactly one item for each category

class Category(models.Model):
pass
class Item(models.Model):
cat = models.ForeignKey(Category)
I want to select exactly one item for each category, which is the query syntax for do this?

Your question isn't entirely clear: since you didn't say otherwise, I'm going to assume that you don't care which item is selected for each category, just that you need any one. If that isn't the case, please update the question to clarify.
tl;dr version: there is no documented
way to explicitly use GROUP BY
statements in Django, except by using
a raw query. See the bottom for code to do so.
The problem is that in doing what you're looking for in SQL itself requires a bit of a hack. You can easily try this example with by entering sqlite3 :memory: at the command line:
CREATE TABLE category
(
id INT
);
CREATE TABLE item
(
id INT,
category_id INT
);
INSERT INTO category VALUES (1);
INSERT INTO category VALUES (2);
INSERT INTO category VALUES (3);
INSERT INTO item VALUES (1,1);
INSERT INTO item VALUES (2,2);
INSERT INTO item VALUES (3,3);
INSERT INTO item VALUES (4,1);
INSERT INTO item VALUES (5,2);
SELECT id, category_id, COUNT(category_id) FROM item GROUP BY category_id;
returns
4|1|2
5|2|2
3|3|1
Which is what you're looking for (one item id for each category id), albeit with an extraneous COUNT. The count (or some other aggregate function) is needed in order to apply the GROUP BY.
Note: this will ignore categories that don't contain any items, which seems like sensible behaviour.
Now the question becomes, how to do this in Django?
The obvious answer is to use Django's aggregation/annotation support, in particular, combining annotate with values as is recommend elsewhere to GROUP queries in Django.
Reading those posts, it would seem we could accomplish what we're looking for with
Item.objects.values('id').annotate(unneeded_count=Count('category_id'))
However this doesn't work. What Django does here is not just GROUP BY "category_id", but groups by all fields selected (ie GROUP BY "id", "category_id")1. I don't believe there is a way (in the public API, at least) to change this behaviour.
The solution is to fall back to raw SQL:
qs = Item.objects.raw('SELECT *, COUNT(category_id) FROM myapp_item GROUP BY category_id')
1: Note that you can inspect what queries Django is running with:
from django.db import connection
print connection.queries[-1]
Edit:
There are a number of other possible approaches, but most have (possibly severe) performance problems. Here are a couple:
1. Select an item from each category.
items = []
for c in Category.objects.all():
items.append(c.item_set[0])
This is a more clear and flexible approach, but has the obvious disadvantage of requiring many more database hits.
2. Use select_related
items = Item.objects.select_related()
and then do the grouping/filtering yourself (in Python).
Again, this is perhaps more clear than using raw SQL and only requires one query, but this one query could be very large (it will return all items and their categories) and doing the grouping/filtering yourself is probably less efficient than letting the database do it for you.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Django Models - SELECT DISTINCT(foo) FROM table is too slow - python

You can try this: myTable.objects.all().distinct('refdate')

Related

Nesting Django QuerySets

Django group_by argument depending on order_by

What is the Django query for the below SQL statement?

Django: Selecting distinct values on maximum foreign key value

Query syntax to select exactly one item for each category

Categories

Resources