I have two models, Track and Pair. Each Pair has a track1, track2 and popularity. I'm trying to get an ordered list by popularity (descending) of pairs, with no two pairs having the same track1. Here's what I've tried so far:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me the following error:
ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
...so I tried this:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('popularity', 'track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me entries with duplicate track1__ids. Does anyone know of a way of solving this problem? I'm guessing I'll have to use raw() or something similar but I don't know how I'd approach a problem like this. I'm using PostgreSQL for the DB backend so DISTINCT should be supported.
First, let's clarify: DISTINCT is standard SQL, while DISTINCT ON is a PostgreSQL extension.
The error (DISTINCT ON expressions must match initial ORDER BY expressions) indicates that you should fix your ORDER BY clause, not the DISTINCT ON (if you change the latter, you'll end up with different results, as you already experienced).
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
This will give you your expected results:
lstPairs = Pair.objects.order_by('track1__id','-popularity').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
In SQL:
SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
FROM pairs
ORDER BY track1__id, popularity DESC
But the rows will probably be in the wrong order (sorted by track1__id rather than by popularity).
If you want your original order, you can use a sub-query here:
SELECT *
FROM (
SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
FROM pairs
ORDER BY track1__id, popularity DESC
) AS best_pairs
ORDER BY popularity DESC, track1__id
-- LIMIT here, if necessary
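To make the two steps concrete, here is a small pure-Python sketch of what the sub-query approach computes: pick the most popular row per track1, then re-sort the survivors by popularity. The sample tuples and the helper name best_pair_per_track1 are made up for illustration; this only mimics the semantics and is not how PostgreSQL or Django actually execute the query.

```python
from itertools import groupby
from operator import itemgetter


def best_pair_per_track1(pairs, limit=None):
    """pairs: iterable of (track1_id, track2_id, popularity) tuples (hypothetical data)."""
    # Step 1: ORDER BY track1_id, popularity DESC
    ordered = sorted(pairs, key=lambda p: (p[0], -p[2]))
    # Step 2: DISTINCT ON (track1_id) keeps the first row of each track1 group
    firsts = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]
    # Step 3: outer ORDER BY popularity DESC, track1_id
    firsts.sort(key=lambda p: (-p[2], p[0]))
    return firsts if limit is None else firsts[:limit]


pairs = [(1, 2, 10), (1, 3, 50), (2, 1, 30), (3, 4, 50), (2, 5, 5)]
result = best_pair_per_track1(pairs)
print(result)
```

The inner sort is what decides which row wins within each track1 group, which is why the ORDER BY has to start with the DISTINCT ON column.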
See the documentation on distinct.
First:
On PostgreSQL only, you can pass positional arguments (*fields) in order to specify the names of fields to which the DISTINCT should apply.
You don't specify which database backend you use; if it is not PostgreSQL, this will not work.
Second:
When you specify field names, you must provide an order_by() in the QuerySet, and the fields in order_by() must start with the fields in distinct(), in the same order.
I think you should use raw(), or fetch the entire list of Pairs ordered by popularity and then filter for track1 uniqueness in Python.
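A minimal sketch of the Python-side filtering, assuming the rows arrive already sorted by popularity descending as (track1_id, track2_id, popularity) tuples (in Django they might come from something like Pair.objects.order_by('-popularity').values_list(...), but here they are plain hard-coded tuples). The function name unique_by_track1 is hypothetical.

```python
def unique_by_track1(pairs_by_popularity, limit):
    """Keep the first pair seen for each track1 id, up to `limit` pairs.

    pairs_by_popularity: (track1_id, track2_id, popularity) tuples that are
    already sorted by popularity descending (assumed data shape).
    """
    seen = set()
    unique = []
    for track1_id, track2_id, popularity in pairs_by_popularity:
        if track1_id in seen:
            continue  # a more popular pair with this track1 was already kept
        seen.add(track1_id)
        unique.append((track1_id, track2_id, popularity))
        if len(unique) == limit:
            break  # stop early; the remaining rows are not needed
    return unique


rows = [(1, 3, 50), (3, 4, 50), (2, 1, 30), (1, 2, 10), (2, 5, 5)]
top_two = unique_by_track1(rows, limit=2)
print(top_two)
```

Because the loop stops as soon as enough pairs are collected, a lazily evaluated source would not need to be read in full.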
There is a simple SQL table with 3 columns: id, sku, last_update
and a very simple SQL statement: SELECT DISTINCT sku FROM product_data ORDER BY last_update ASC
What would the Django code for this SQL statement be?
This code:
q = ProductData.objects.values('sku').distinct().order_by('sku')
returns 145 results
whereas this statement:
q = ProductData.objects.values('sku').distinct().order_by('last_update')
returns over 1000 results
Why is it so? Can someone, please, help?
Thanks a lot in advance!
The difference is that the first query returns a list of (sku) values, while the second returns a list of (sku, last_update) pairs. Any fields included in order_by() are also included in the SQL SELECT, so the DISTINCT is applied to a different set of columns, resulting in a different count.
Take a look at the queries Django generates; they should be something like the following:
Query #1
>>> str(ProductData.objects.values('sku').distinct().order_by('sku').query)
'SELECT DISTINCT "yourproject_productdata"."sku" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."sku" ASC'
Query #2
>>> str(ProductData.objects.values('sku').distinct().order_by('last_update').query)
'SELECT DISTINCT "yourproject_productdata"."sku", "yourproject_productdata"."last_update" FROM "yourproject_productdata" ORDER BY "yourproject_productdata"."last_update" ASC'
This behaviour is described in the distinct documentation:
Any fields used in an order_by() call are included in the SQL SELECT
columns. This can sometimes lead to unexpected results when used in
conjunction with distinct(). If you order by fields from a related
model, those fields will be added to the selected columns and they may
make otherwise duplicate rows appear to be distinct. Since the extra
columns don’t appear in the returned results (they are only there to
support ordering), it sometimes looks like non-distinct results are
being returned.
Similarly, if you use a values() query to restrict the columns
selected, the columns used in any order_by() (or default model
ordering) will still be involved and may affect uniqueness of the
results.
The moral here is that if you are using distinct() be careful about
ordering by related models. Similarly, when using distinct() and
values() together, be careful when ordering by fields not in the
values() call.
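The effect is easy to reproduce outside Django. The sketch below uses the standard-library sqlite3 module with a tiny made-up product_data table (the rows are invented for illustration) and compares DISTINCT over sku alone with DISTINCT over (sku, last_update). It also shows one possible way to get one row per sku ordered by update time, using GROUP BY with MIN(last_update).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_data (id INTEGER PRIMARY KEY, sku TEXT, last_update TEXT)"
)
conn.executemany(
    "INSERT INTO product_data (sku, last_update) VALUES (?, ?)",
    [("A", "2020-01-01"), ("A", "2020-01-02"), ("B", "2020-01-03"), ("B", "2020-01-03")],
)

# DISTINCT over sku only: one row per sku.
sku_only = conn.execute(
    "SELECT DISTINCT sku FROM product_data ORDER BY sku"
).fetchall()

# DISTINCT over (sku, last_update): rows that differ only in last_update all survive.
sku_and_date = conn.execute(
    "SELECT DISTINCT sku, last_update FROM product_data ORDER BY last_update"
).fetchall()

# One row per sku, ordered by each sku's earliest update.
grouped = conn.execute(
    "SELECT sku FROM product_data GROUP BY sku ORDER BY MIN(last_update)"
).fetchall()

print(len(sku_only), len(sku_and_date), grouped)
```

In Django, the GROUP BY variant would roughly correspond to values('sku').annotate(first_update=Min('last_update')).order_by('first_update'), where first_update is just a name chosen here.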
I don't understand what distinct means or how to use it. I have searched for related answers, but it seems like distinct is somehow related to lists. I'd really appreciate the help.
list_of_stocks = db.stocks.distinct("symbol")
As the OP confirmed, this is a PyMongo call to a MongoDB database, which allows for a distinct find, in the form of:
db.collection_name.distinct("property_name")
This returns all distinct values for a given property in a collection.
Optionally, if you specify a document filter (effectively a find() filter) as a second parameter, your query will first be reduced by that filter, then the distinct will be applied. For example:
list_of_stocks = db.stocks.distinct("symbol", {"exchange": "NASDAQ"})
The distinct keyword is used in databases to return a record set containing only the distinct values of a particular column.
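If it helps to see the idea without a database, here is a rough pure-Python stand-in for what distinct does on a list of documents. The function distinct_values and the sample documents are invented for illustration; it only handles exact-match filters, whereas real MongoDB queries support far more, and MongoDB does not promise any particular order for the returned values.

```python
def distinct_values(documents, field, query=None):
    """Rough stand-in for collection.distinct(field, query) on a list of dicts."""
    query = query or {}
    values = []
    for doc in documents:
        if any(doc.get(key) != wanted for key, wanted in query.items()):
            continue  # document does not match the filter
        if field not in doc:
            continue  # nothing to collect from this document
        if doc[field] not in values:
            values.append(doc[field])  # keep each value only once
    return values


stocks = [
    {"symbol": "AAPL", "exchange": "NASDAQ"},
    {"symbol": "AAPL", "exchange": "NASDAQ"},
    {"symbol": "IBM", "exchange": "NYSE"},
    {"symbol": "MSFT", "exchange": "NASDAQ"},
]
all_symbols = distinct_values(stocks, "symbol")
nasdaq_symbols = distinct_values(stocks, "symbol", {"exchange": "NASDAQ"})
print(all_symbols, nasdaq_symbols)
```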
I want to filter by key value (by id, if possible).
The developers console offers searching data by key value.
I want to do the same in my code, like this:
DataModel.query(DataModel.key > ndb.Key('DataModel', id_value)).order(
DataModel.date,
DataModel.times).fetch(2000)
This raises an error.
My id_value is an integer.
How can I search and filter to get the data that have a higher id than id_value?
The inequality filter on the key is fine. What's wrong is that you cannot combine an inequality filter on one property with a sort order on a different property. To quote https://cloud.google.com/appengine/docs/python/ndb/queries, under Limitations: ...
combining too many filters, using inequalities for multiple
properties, or combining an inequality with a sort order on a
different property are all currently disallowed.
The last one of these three limitations is what you're running into.
One option is to fetch all the results from the query and then sort them in your program (instead of in the .order() clause).
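A second option is to filter on all three fields to reduce the number of results, and then sort them as above. Be aware that the limitation quoted above also rules out inequality filters on more than one property, so the query below may be rejected as written: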
DataModel.query(DataModel.key > ndb.Key('DataModel', id_value),
DataModel.date > some_date,
DataModel.times > some_times)
A third option is to use MapReduce to process large quantities of data.
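The first option (filter by key in the query, sort in your program) could look roughly like the sketch below. To keep it runnable without App Engine, a namedtuple called Row stands in for fetched DataModel entities and the id comparison is done in Python; with NDB, the key filter would stay in the query and only the sorted() call would run in memory. Row, newer_than and the sample data are assumptions made for illustration.

```python
from collections import namedtuple
from datetime import date

# Hypothetical stand-in for fetched DataModel entities.
Row = namedtuple("Row", ["id", "date", "times"])


def newer_than(rows, id_value):
    """Keep rows with id > id_value, then sort by (date, times) in memory."""
    matching = [row for row in rows if row.id > id_value]
    return sorted(matching, key=lambda row: (row.date, row.times))


rows = [
    Row(5, date(2015, 3, 2), 1),
    Row(2, date(2015, 3, 1), 4),
    Row(7, date(2015, 3, 1), 9),
    Row(9, date(2015, 3, 1), 3),
]
result = newer_than(rows, 4)
print([row.id for row in result])
```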
I have two classes with a many-to-many relationship, Items and Categories.
Categories have an associated value.
I would like to query for all Items for which the highest Categories.value (if there is any) is less than a given value.
So far I have tried queries like this:
from sqlalchemy.sql import functions
Session.query(Items).join(Categories,Items.categories).filter(functions.max(Categories.value)<3.14).all()
But in this case I get a (OperationalError) misuse of aggregate function max() error.
Is there a way to make this query?
You need GROUP BY and HAVING instead of just WHERE for filtering on an aggregate.
Session.query(Items).join(Items.categories).group_by(Items.id).having(functions.max(Categories.value)<3.14).all()
Edit: To also include Items without any category, I believe you can do an outer join and put an OR in the HAVING clause:
Session.query(Items).outerjoin(Items.categories).group_by(Items.id)\
.having( (functions.max(Categories.value)<3.14) | (functions.count(Categories.id)==0) )\
.all()
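For reference, the SQL that this kind of query boils down to can be tried with the standard-library sqlite3 module. The table and column names below (items, categories, item_categories) are assumptions about the schema, and the rows are made up. The point of the sketch is that MAX() over an item with no categories is NULL, so the COUNT(...) = 0 branch is what keeps such items in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY);
CREATE TABLE categories (id INTEGER PRIMARY KEY, value REAL);
CREATE TABLE item_categories (item_id INTEGER, category_id INTEGER);
INSERT INTO items VALUES (1), (2), (3);
INSERT INTO categories VALUES (1, 1.0), (2, 2.5), (3, 9.9);
-- item 1: highest value 2.5, item 2: highest value 9.9, item 3: no categories
INSERT INTO item_categories VALUES (1, 1), (1, 2), (2, 2), (2, 3);
""")

query = """
SELECT i.id
FROM items AS i
LEFT OUTER JOIN item_categories AS ic ON ic.item_id = i.id
LEFT OUTER JOIN categories AS c ON c.id = ic.category_id
GROUP BY i.id
HAVING MAX(c.value) < ? OR COUNT(c.id) = 0
ORDER BY i.id
"""
matching_ids = [row[0] for row in conn.execute(query, (3.14,))]
print(matching_ids)
```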
class Category(models.Model):
    pass
class Item(models.Model):
    cat = models.ForeignKey(Category)
I want to select exactly one item for each category. What is the query syntax to do this?
Your question isn't entirely clear: since you didn't say otherwise, I'm going to assume that you don't care which item is selected for each category, just that you need any one. If that isn't the case, please update the question to clarify.
tl;dr version: there is no documented
way to explicitly use GROUP BY
statements in Django, except by using
a raw query. See the bottom for code to do so.
The problem is that doing what you're looking for in SQL itself requires a bit of a hack. You can easily try this example by entering sqlite3 :memory: at the command line:
CREATE TABLE category
(
id INT
);
CREATE TABLE item
(
id INT,
category_id INT
);
INSERT INTO category VALUES (1);
INSERT INTO category VALUES (2);
INSERT INTO category VALUES (3);
INSERT INTO item VALUES (1,1);
INSERT INTO item VALUES (2,2);
INSERT INTO item VALUES (3,3);
INSERT INTO item VALUES (4,1);
INSERT INTO item VALUES (5,2);
SELECT id, category_id, COUNT(category_id) FROM item GROUP BY category_id;
returns
4|1|2
5|2|2
3|3|1
Which is what you're looking for (one item id for each category id), albeit with an extraneous COUNT. The count (or some other aggregate function) is needed in order to apply the GROUP BY.
Note: this will ignore categories that don't contain any items, which seems like sensible behaviour.
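Which item id comes back for each group in the query above is left to the database, since id is neither grouped nor aggregated. If a predictable choice matters, one option is to aggregate the id itself, for example with MIN(id). The sketch below runs that variant against the same sample rows using Python's built-in sqlite3 module; picking the lowest id is just an assumption about what "any one item" could mean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INT, category_id INT)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2)],
)

# MIN(id) is an aggregate, so the item chosen per category is well defined.
rows = conn.execute(
    "SELECT MIN(id), category_id FROM item GROUP BY category_id ORDER BY category_id"
).fetchall()
print(rows)
```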
Now the question becomes, how to do this in Django?
The obvious answer is to use Django's aggregation/annotation support, in particular combining annotate with values, as is recommended elsewhere for GROUP BY queries in Django.
Reading those posts, it would seem we could accomplish what we're looking for with
Item.objects.values('id').annotate(unneeded_count=Count('category_id'))
However, this doesn't work. What Django does here is not just GROUP BY "category_id"; it groups by all selected fields (i.e. GROUP BY "id", "category_id"; see note 1 below). I don't believe there is a way (in the public API, at least) to change this behaviour.
The solution is to fall back to raw SQL:
qs = Item.objects.raw('SELECT *, COUNT(category_id) FROM myapp_item GROUP BY category_id')
1: Note that you can inspect what queries Django is running with:
from django.db import connection
# connection.queries is only populated when DEBUG = True
print(connection.queries[-1])
Edit:
There are a number of other possible approaches, but most have (possibly severe) performance problems. Here are a couple:
1. Select an item from each category.
items = []
for c in Category.objects.all():
    item = c.item_set.first()  # None if the category has no items
    if item is not None:
        items.append(item)
This is a more clear and flexible approach, but has the obvious disadvantage of requiring many more database hits.
2. Use select_related
items = Item.objects.select_related()
and then do the grouping/filtering yourself (in Python).
Again, this is perhaps more clear than using raw SQL and only requires one query, but this one query could be very large (it will return all items and their categories) and doing the grouping/filtering yourself is probably less efficient than letting the database do it for you.