How do I select distinct values in Django? - python

This is my code:
[app.system_name for app in App.objects.all().distinct('system_name')]
Gives me:
[u'blog', u'files', u'calendar', u'tasks', u'statuses', u'wiki', u'wiki', u'blog',
 u'files', u'blog', u'ideas', u'calendar', u'wiki', u'wiki', u'statuses', u'tasks',
 u'survey', u'blog']
As you might expect I want all the unique values of the field system_name, but now I just get all App instances back.

Specifying fields in distinct is only supported in Django 1.4+. If you're running 1.3, it's just ignoring it.
If you are running Django 1.4, you must add an order_by clause that includes and starts with all the fields in distinct.
Even then, specifying fields with distinct is only supported on PostgreSQL. If you're running something else, such as MySQL, you're out of luck.
All this information is in the docs.

You have to order by the same field name when using distinct with a field name:
App.objects.order_by('system_name').distinct('system_name')
From the doc:
When you specify field names, you must provide an order_by() in the
QuerySet, and the fields in order_by() must start with the fields in
distinct(), in the same order.
For example, SELECT DISTINCT ON (a) gives you the first row for each
value in column a. If you don't specify an order, you'll get some
arbitrary row.
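To make the "first row for each value" rule concrete, here is a minimal sketch that emulates DISTINCT ON in plain Python. It does not use Django or PostgreSQL; the helper name distinct_on and the sample rows are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def distinct_on(rows, key, order):
    """Emulate SELECT DISTINCT ON (key) ... ORDER BY key, order in plain Python."""
    # Sort by the distinct key first, then by the tie-breaking column,
    # mirroring the rule that ORDER BY must start with the DISTINCT ON fields.
    ordered = sorted(rows, key=itemgetter(key, order))
    # Keep only the first row of each run of equal key values.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


# Hypothetical rows standing in for App instances.
apps = [
    {"id": 3, "system_name": "wiki"},
    {"id": 1, "system_name": "blog"},
    {"id": 2, "system_name": "blog"},
    {"id": 4, "system_name": "wiki"},
]
first_per_name = distinct_on(apps, key="system_name", order="id")
print(first_per_name)
```

Without the second sort column, which row survives for each system_name would depend on the input order, which is the "arbitrary row" the docs warn about.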

You can use values_list() when using distinct().
App.objects.values_list('system_name').distinct()
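For reference, the query above should translate to roughly a plain SELECT DISTINCT on a single column, which works on every backend. The following is a minimal sketch using Python's built-in sqlite3 module rather than Django itself; the table name app and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for the App table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app (id INTEGER PRIMARY KEY, system_name TEXT)")
names = ["blog", "files", "calendar", "wiki", "wiki", "blog", "files", "blog"]
conn.executemany("INSERT INTO app (system_name) VALUES (?)", [(n,) for n in names])

# Roughly the SQL expected from values_list('system_name').distinct()
# when no ordering is applied to the queryset.
rows = conn.execute("SELECT DISTINCT system_name FROM app").fetchall()

# Each row is a 1-tuple; unpack and sort for a stable result.
unique_names = sorted(name for (name,) in rows)
print(unique_names)
```

In Django, passing flat=True to values_list() returns the bare values instead of 1-tuples, which removes the unpacking step.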

Related

Django group_by argument depending on order_by

I'm struggling (again) with Django's annotate functionality where the actual SQL query is quite clear to me.
Goal:
I want to get the number of users with a certain let's say status (it could be just any column of the model).
Approach(es):
1) User.objects.values('status').annotate(count=Count('*'))
This results into the following SQL query
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.id
ORDER BY users_user.id ASC
However, this will give me a queryset of all users each "annotated" with the count value. This is kind of the behaviour I would have expected.
2) User.objects.values('status').annotate(count=Count('*')).order_by()
This results into the following SQL query
SELECT users_user.status, COUNT(*) as count
FROM users_user
GROUP BY users_user.status
No ORDER BY, and now the GROUP BY argument is the status column. This is not what I expected, but the result I was looking for.
Question:
Why does Django's order_by() without any argument affect the SQL GROUP BY argument? (Or broader, why does the second approach "work"?)
Some details:
django 2.2.9
postgres 9.4
This is explained in the Django documentation on aggregation:
Fields that are mentioned in the order_by() part of a queryset (or which are used in the default ordering on a model) are used when selecting the output data, even if they are not otherwise specified in the values() call.
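The effect of the two GROUP BY clauses from the question can be reproduced outside Django. This is a minimal sketch with Python's built-in sqlite3 module; the table layout follows the SQL in the question, while the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users_user (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO users_user (status) VALUES (?)",
    [("active",), ("active",), ("banned",), ("active",)],
)

# Approach 1: the ordering column (id) ends up in GROUP BY,
# so every user forms its own group with a count of 1.
per_user = conn.execute(
    "SELECT status, COUNT(*) FROM users_user GROUP BY id ORDER BY id ASC"
).fetchall()

# Approach 2: with the ordering cleared, grouping happens on status alone.
# sorted() is applied in Python only to make the result order stable.
per_status = sorted(
    conn.execute("SELECT status, COUNT(*) FROM users_user GROUP BY status").fetchall()
)

print(per_user)
print(per_status)
```

The first query returns one row per user and the second one row per status, which matches the difference the question observed between the two querysets.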

Django remove duplicates from .values_list query while preserving order

I have a model say MyModel which contains a CharField type. The model has a default meta ordering which should be preserved. I am using the following query to get the list of types -
MyModel.objects.all().values_list('type', flat=True).distinct()
However, the types are getting repeated. I can do .order_by('type').distinct() but that will change the ordering which I don't want. Is there any way to get the list of types in order without manually creating a list in python? Alternative faster solutions are also welcome.
Django version - 1.11
distinct() is not deduplicating on type because you didn't specify the field.
Use this code:
MyModel.objects.all().values_list('type', flat=True).distinct("type")
instead of this code
MyModel.objects.all().values_list('type', flat=True).distinct()
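You can try this (note that flat=True is only accepted by values_list(), not values()):
MyModel.objects.all().values_list('type', flat=True).order_by('type').distinct()
It should work for you, although the result is ordered by type rather than by the model's default ordering.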
You can do this in 2 steps:
First, get the id of the records with unique types and save them in a list:
ids = list(MyModel.objects.values_list('id', flat=True).order_by('type').distinct('type'))
Then do the filter using the ids:
MyModel.objects.values_list('type', flat=True).filter(id__in=ids)
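If doing the last step in Python is acceptable after all, an order-preserving de-duplication is short. This is only a sketch of that fallback, not a database-side solution; the helper name unique_in_order and the sample values are hypothetical, and it assumes Python 3.7+ where dicts keep insertion order.

```python
def unique_in_order(values):
    """Drop repeated values while keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so fromkeys() keeps
    # the position of the first occurrence of every value.
    return list(dict.fromkeys(values))


# Hypothetical result of iterating values_list('type', flat=True)
# in the model's default ordering.
types_in_model_order = ["news", "blog", "news", "event", "blog"]
unique_types = unique_in_order(types_in_model_order)
print(unique_types)
```

This still fetches every row from the database, so it trades query simplicity for transfer size on large tables.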

Length of Django queryset result changes when coerced to a list

I have a reasonably complex queryset thus, which rolls up data by isoweek:
>>> MyThing.objects.all().count()
30000
>>> qs = MyThing.objects.all().order_by('date').annotate(
...     dw=DateWeek('date'),  # uses WEEK function
...     dy=ExtractYear('date')
... ).values(
...     'dy', 'dw', 'group_id'
... ).annotate(
...     sum_count=Sum('count')
... ).values_list('dw', 'dy', 'group_id', 'sum_count')
>>> qs.count()
2000
So far so good. The problem is when I coerce this queryset into a list:
>>> len(list(qs))
30000
Why is this happening? How can I get the list of grouped values that the queryset purports to have when I count() it directly?
To solve this problem, remove the .order_by('date'). Although it is not included in the output, the database backend is still considering it at every row, causing the number of rows to inflate like that.
If you want to order the output, add .order_by('dy', 'dw') after those annotations.
You can also add an .order_by() with no arguments to clear any ordering set previously, for instance from the Model class definition default ordering.
The reason for this behavior is explained in the django docs:
Any fields used in an order_by() call are included in the SQL SELECT
columns. This can sometimes lead to unexpected results when used in
conjunction with distinct(). If you order by fields from a related
model, those fields will be added to the selected columns and they may
make otherwise duplicate rows appear to be distinct. Since the extra
columns don’t appear in the returned results (they are only there to
support ordering), it sometimes looks like non-distinct results are
being returned.
Similarly, if you use a values() query to restrict the columns
selected, the columns used in any order_by() (or default model
ordering) will still be involved and may affect uniqueness of the
results.
The moral here is that if you are using distinct() be careful about
ordering by related models. Similarly, when using distinct() and
values() together, be careful when ordering by fields not in the
values() call.
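The way an extra ordering column defeats DISTINCT can be seen directly in SQL. Below is a minimal sketch with Python's built-in sqlite3 module; the table mymodel and its rows are made up, and the second query only approximates what an ORM might generate when an ordering field is added to the selected columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mymodel (id INTEGER PRIMARY KEY, type TEXT)")
conn.executemany(
    "INSERT INTO mymodel (type) VALUES (?)", [("a",), ("b",), ("a",), ("b",)]
)

# DISTINCT over the requested column only: duplicates collapse.
only_type = conn.execute("SELECT DISTINCT type FROM mymodel").fetchall()

# DISTINCT over the requested column plus an ordering column (id here):
# every (type, id) pair is unique, so nothing collapses.
with_order_col = conn.execute(
    "SELECT DISTINCT type, id FROM mymodel ORDER BY id"
).fetchall()

print(len(only_type), len(with_order_col))
```

Because the extra column is dropped before results are returned, the caller only sees the type values and the duplicates look unexplained.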

Django: distinct on a foreign key, then ordering

I have two models, Track and Pair. Each Pair has a track1, track2 and popularity. I'm trying to get an ordered list by popularity (descending) of pairs, with no two pairs having the same track1. Here's what I've tried so far:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me the following error:
ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
...so I tried this:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('popularity', 'track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me entries with duplicate track1__ids. Does anyone know of a way of solving this problem? I'm guessing I'll have to use raw() or something similar but I don't know how I'd approach a problem like this. I'm using PostgreSQL for the DB backend so DISTINCT should be supported.
First, let's clarify: DISTINCT is standard SQL, while DISTINCT ON is a PostgreSQL extension.
The error (DISTINCT ON expressions must match initial ORDER BY expressions) indicates that you should fix your ORDER BY clause, not the DISTINCT ON (if you change the latter, you'll end up with different results, as you already experienced).
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
This will give you your expected results:
lstPairs = Pair.objects.order_by('track1__id','-popularity').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
In SQL:
SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
FROM pairs
ORDER BY track1__id, popularity DESC
But probably in a wrong order.
If you want your original order, you can use a sub-query here:
SELECT *
FROM (
SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
FROM pairs
ORDER BY track1__id, popularity DESC
-- LIMIT here, if necessary
) AS best_pairs
ORDER BY popularity DESC, track1__id
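The two-step idea (pick the best row per track1, then re-sort the survivors by popularity) can be sketched in plain Python. This is an emulation for illustration, not the ORM call; the function name top_pairs and the sample tuples are hypothetical, and the limit is applied after the final sort, which is an assumption about the intended result.

```python
from itertools import groupby
from operator import itemgetter


def top_pairs(pairs, limit):
    """Best pair per track1, then the overall list ordered by popularity."""
    # Step 1 (inner query): order by track1, then popularity descending,
    # and keep the first row per track1, like DISTINCT ON (track1).
    ordered = sorted(pairs, key=lambda p: (p[0], -p[2]))
    best = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]
    # Step 2 (outer query): re-sort by popularity descending, track1 ascending.
    best.sort(key=lambda p: (-p[2], p[0]))
    return best[:limit]


# (track1_id, track2_id, popularity) tuples; the data is made up.
pairs = [(1, 2, 10), (1, 3, 50), (2, 1, 30), (3, 1, 40), (3, 2, 5)]
result = top_pairs(pairs, limit=2)
print(result)
```

Doing this in Python means loading all pairs first, so it is mainly useful for checking what the SQL version is expected to return.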
See the documentation on distinct.
First:
On PostgreSQL only, you can pass positional arguments (*fields) in order to specify the names of fields to which the DISTINCT should apply.
You don't specify what your database backend is; if it is not PostgreSQL, you have no chance of making it work.
Second:
When you specify field names, you must provide an order_by() in the QuerySet, and the fields in order_by() must start with the fields in distinct(), in the same order.
I think that you should use raw(), or get the entire list of Pairs ordered by popularity and then make the filtering by track1 uniqueness in Python.

Django: Selecting distinct values on maximum foreign key value

I have the following models which I'm testing with SQLite3 and MySQL:
# (various model fields extraneous to discussion removed...)
class Run(models.Model):
    runNumber = models.IntegerField()

class Snapshot(models.Model):
    t = models.DateTimeField()

class SnapshotRun(models.Model):
    snapshot = models.ForeignKey(Snapshot)
    run = models.ForeignKey(Run)
    # other fields which make it possible to have multiple distinct Run objects per Snapshot
I want a query which will give me a set of runNumbers & snapshot IDs for which the Snapshot.id is below some specified value. Naively I would expect this to work:
print SnapshotRun.objects.filter(snapshot__id__lte=ss_id)\
.order_by("run__runNumber", "-snapshot__id")\
.distinct("run__runNumber", "snapshot__id")\
.values("run__runNumber", "snapshot__id")
But this blows up with
NotImplementedError: DISTINCT ON fields is not supported by this database backend
for both database backends. Postgres is unfortunately not an option.
Time to fall back to raw SQL?
Update:
Since Django's ORM won't help me out of this one (thanks @jknupp), I did manage to get the following raw SQL to work:
cursor.execute("""
    SELECT r.runNumber, ssr1.snapshot_id
    FROM livedata_run AS r
    JOIN livedata_snapshotrun AS ssr1
    ON ssr1.id =
    (
        SELECT id
        FROM livedata_snapshotrun AS ssr2
        WHERE ssr2.run_id = r.id
        AND ssr2.snapshot_id <= %s
        ORDER BY snapshot_id DESC
        LIMIT 1
    );
""", [max_ss_id])  # query parameters are passed as a sequence
Here livedata is the Django app these tables live in.
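As a possible alternative to the correlated subquery, the same "latest snapshot per run" result can be obtained with GROUP BY and MAX. The sketch below swaps in that technique and runs it with Python's built-in sqlite3 module; the table and column names follow the raw SQL above, while the sample rows and the function name latest_snapshot_per_run are made up. Note that sqlite3 uses ? as the placeholder where Django cursors use %s.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE livedata_run (id INTEGER PRIMARY KEY, runNumber INTEGER);
    CREATE TABLE livedata_snapshotrun (
        id INTEGER PRIMARY KEY, snapshot_id INTEGER, run_id INTEGER
    );
    INSERT INTO livedata_run (id, runNumber) VALUES (1, 100), (2, 200);
    INSERT INTO livedata_snapshotrun (snapshot_id, run_id)
        VALUES (1, 1), (2, 1), (3, 1), (2, 2), (4, 2);
""")


def latest_snapshot_per_run(conn, max_ss_id):
    """Highest snapshot_id <= max_ss_id for every run that has one."""
    query = """
        SELECT r.runNumber, MAX(ssr.snapshot_id)
        FROM livedata_run AS r
        JOIN livedata_snapshotrun AS ssr ON ssr.run_id = r.id
        WHERE ssr.snapshot_id <= ?
        GROUP BY r.id, r.runNumber
        ORDER BY r.runNumber
    """
    # Parameters are passed as a sequence, even when there is only one.
    return conn.execute(query, [max_ss_id]).fetchall()


latest = latest_snapshot_per_run(conn, 3)
print(latest)
```

The same grouping can probably be expressed in the ORM with values('run__runNumber') followed by annotate() using Max('snapshot__id'), but that variant is untested here.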
The note in the Django documentation is pretty clear:
Note:
Any fields used in an order_by() call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction with distinct(). If you order by fields from a related model, those fields will be added to the selected columns and they may make otherwise duplicate rows appear to be distinct. Since the extra columns don’t appear in the returned results (they are only there to support ordering), it sometimes looks like non-distinct results are being returned.
Similarly, if you use a values() query to restrict the columns selected, the columns used in any order_by() (or default model ordering) will still be involved and may affect uniqueness of the results.
The moral here is that if you are using distinct() be careful about ordering by related models. Similarly, when using distinct() and values() together, be careful when ordering by fields not in the values() call.
Also, below that:
This ability to specify field names (with distinct) is only available in PostgreSQL.
