Django ORM: Joining QuerySets - python

I'm trying to use the Django ORM for a task that requires a JOIN in SQL. I
already have a workaround that accomplishes the same task with multiple queries
and some off-DB processing, but I'm not satisfied by the runtime complexity.
First, I'd like to give you a short introduction to the relevant part of my
model. After that, I'll explain the task in English, SQL and (inefficient) Django ORM.
The Model
In my CMS model, posts are multi-language: For each post and each language, there can be one instance of the post's content. Also, when editing posts, I don't UPDATE, but INSERT new versions of them.
So, PostContent is unique on post, language and version. Here's the class:
class PostContent(models.Model):
    """ contains all versions of a post, in all languages. """
    language = models.ForeignKey(Language)
    post = models.ForeignKey(Post)            # the Post object itself only
    version = models.IntegerField(default=0)  # contains slug and id.
    # further metadata and content left out
    class Meta:
        unique_together = (("post", "language", "version"),)
The Task in SQL
And this is the task: I'd like to get a list of the most recent versions of all posts in each language, using the ORM. In SQL, this translates to a JOIN on a subquery that does GROUP BY and MAX to get the maximum of version for each unique pair of post and language. The perfect answer to this question would be a number of ORM calls that produce the following SQL statement:
SELECT
    id,
    post_id,
    version,
    v
FROM
    cms_postcontent,
    (SELECT
        post_id as p,
        max(version) as v,
        language_id as l
    FROM
        cms_postcontent
    GROUP BY
        post_id,
        language_id
    ) as maxv
WHERE
    post_id=p
    AND version=v
    AND language_id=l;
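To make the intended result concrete, here is a minimal sketch that runs an equivalent join against an in-memory SQLite database using Python's sqlite3 module. The table is a cut-down, hypothetical stand-in for cms_postcontent and the sample rows are invented for illustration; the join is written with an explicit JOIN ... ON, which should select the same rows as the comma-style join above.

```python
import sqlite3

# Hypothetical, cut-down stand-in for the cms_postcontent table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cms_postcontent ("
    "id INTEGER PRIMARY KEY, post_id INTEGER, language_id INTEGER, version INTEGER)"
)
conn.executemany(
    "INSERT INTO cms_postcontent (post_id, language_id, version) VALUES (?, ?, ?)",
    [(1, 1, 0), (1, 1, 1), (1, 2, 0), (2, 1, 0), (2, 1, 1), (2, 1, 2)],
)

# Join every row to the per-(post, language) maximum version.
latest = conn.execute(
    """
    SELECT pc.id, pc.post_id, pc.language_id, pc.version
    FROM cms_postcontent AS pc
    JOIN (SELECT post_id AS p, language_id AS l, MAX(version) AS v
          FROM cms_postcontent
          GROUP BY post_id, language_id) AS maxv
      ON pc.post_id = maxv.p AND pc.language_id = maxv.l AND pc.version = maxv.v
    ORDER BY pc.post_id, pc.language_id
    """
).fetchall()
print(latest)
```

Each returned row is the highest version for one (post, language) pair, which is what the ORM calls would ideally produce in a single query.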
Solution in Django
My current solution using the Django ORM does not produce such a JOIN, but two separate SQL
queries, and one of those queries can become very large. I first execute the subquery (the inner SELECT from above):
maxv = PostContent.objects.values('post','language').annotate(
max_version=Max('version'))
Now, instead of joining maxv, I explicitly ask for every single post in maxv, by
filtering PostContent.objects.all() for each tuple of post, language, max_version. The resulting SQL looks like
SELECT * FROM PostContent WHERE
post=P1 and language=L1 and version=V1
OR post=P2 and language=L2 and version=V2
OR ...;
In Django:
from functools import reduce
from operator import or_
from django.db.models import Q
# one Q per row of maxv: post AND language AND version
conjunc = [Q(post=pc['post'], language=pc['language'],
             version=pc['max_version']) for pc in maxv]
# OR the per-row conditions together (maxv must not be empty)
result = PostContent.objects.filter(reduce(or_, conjunc))
If maxv is sufficiently small, e.g. when retrieving a single post, this might be
a good solution, but the size of the query and the time to create it grow linearly with
the number of posts. The complexity of parsing the query is also at least linear.
Is there a better way to do this, apart from using raw SQL?

You can join (in the sense of union) querysets with the | operator, as long as the querysets query the same model.
However, it sounds like you want something like PostContent.objects.order_by('version').distinct('language'); as you can't quite do that in 1.3.1, consider using values in combination with distinct() to get the effect you need.

Related

Django join-like query with no where condition

Say I have 3 models in Django
class Instrument(models.Model):
    ticker = models.CharField(max_length=30, unique=True, db_index=True)
class Instrument_df(models.Model):
    instrument = models.OneToOneField(
        Instrument,
        on_delete=models.CASCADE,
        primary_key=True,
    )
class Quote(models.Model):
    instrument = models.ForeignKey(Instrument, on_delete=models.CASCADE)
I just want to query all Quotes that correspond to an instrument of 'DF' type. In SQL I would perform the join of Quote and Instrument_df on field id.
Using Django's ORM I came up with
Quote.objects.filter(instrument__instrument_df__instrument_id__gte=-1)
I think this does the job, but I see two drawbacks:
1) I am joining 3 tables, when in fact table Instrument would not need to be involved.
2) I had to insert the trivial id > -1 condition, that holds always. This looks awfully artificial.
How should this query be written?
Thanks!
Assuming Instrument_df has other fields not shown in the snippet (else this table is just useless and could be replaced by a flag in Instrument), a possible solution could be to use either a subquery or two queries:
# with a subquery
dfids = Instrument_df.objects.values_list("instrument", flat=True)
Quote.objects.filter(instrument__in=dfids)
# with two queries (can be faster on MySQL)
dfids = list(Instrument_df.objects.values_list("instrument", flat=True))
Quote.objects.filter(instrument__in=dfids)
Whether this will perform better than your actual solution depends on your db vendor and version (MySQL was known for being very bad at handling subqueries, don't know if it's still the case) and actual content.
But I think the best solution here would be a plain raw query - this is a bit less portable and may require more care in case of a schema update (hint: use a custom manager and write this query as a manager method so you have one single point of truth - you don't want to scatter your views with raw sql queries).
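As a rough illustration of the two-table join such a raw query would contain, here is a sketch using Python's sqlite3 module. The table and column names are hypothetical (Django would normally prefix table names with the app label) and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names; Django would normally add an app prefix.
conn.executescript(
    """
    CREATE TABLE instrument (id INTEGER PRIMARY KEY, ticker TEXT UNIQUE);
    CREATE TABLE instrument_df (instrument_id INTEGER PRIMARY KEY REFERENCES instrument(id));
    CREATE TABLE quote (id INTEGER PRIMARY KEY, instrument_id INTEGER REFERENCES instrument(id));
    INSERT INTO instrument (id, ticker) VALUES (1, 'AAA'), (2, 'BBB'), (3, 'CCC');
    INSERT INTO instrument_df (instrument_id) VALUES (1), (3);
    INSERT INTO quote (id, instrument_id) VALUES (10, 1), (11, 2), (12, 3), (13, 3);
    """
)

# Join quote directly to instrument_df; the instrument table is never read.
df_quote_ids = [
    row[0]
    for row in conn.execute(
        "SELECT q.id FROM quote AS q "
        "JOIN instrument_df AS d ON d.instrument_id = q.instrument_id "
        "ORDER BY q.id"
    )
]
print(df_quote_ids)
```

In a Django project, similar SQL could presumably be passed to Quote.objects.raw(...) from a manager method, as long as the select list includes the primary key and uses the real table names.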

select in (select ..) using ORM django

How can I make a query
select name where id in (select id from ...)
using Django ORM? I think I could do this with one for loop to obtain some results and another for loop to use them, but that doesn't seem practical. Writing the SQL query is simpler, and I think doing this in Python should be just as simple.
I have these models:
class Invoice(models.Model):
    factura_id = models.IntegerField(unique=True)
    created_date = models.DateTimeField()
    store_id = models.ForeignKey(Store, blank=False)
class invoicePayments(models.Model):
    invoice = models.ForeignKey(Factura)
    date = models.DateTimeField()  # auto_now=True
    money = models.DecimalField(max_digits=9, decimal_places=0)
I need to get the payments of an invoice, filtered by store_id and date of payment.
I can write this query in MySQL using select ... in (select ...). It is a simple query, but the only way I can think of to do something similar with the Django ORM is a for loop, and I don't like that idea:
invoiceXstore = invoice.objects.filter(local=3)
for a in invoiceXstore:
    payments = invoicePayments.objects.filter(invoice=a.id,
        date__range=["2016-05-01", "2016-05-06"])
You can traverse ForeignKey relations using double underscores (__) in Django ORM. For example, your query could be implemented as:
payments = invoicePayments.objects.filter(invoice__store_id=3,
date__range=["2016-05-01", "2016-05-06"])
I guess you renamed your classes to English before posting here. In this case, you may need to change the first part to factura__local=3.
As a side note, it is recommended to rename your model class to InvoicePayments (with a capital I) to be more compliant with PEP8.
Your mysql raw query is a sub query.
select name where id in (select id from ...)
In MySQL this will usually be slower than an INNER JOIN (see http://dev.mysql.com/doc/refman/5.7/en/rewriting-subqueries.html), so you can rewrite your raw query as an INNER JOIN, which will look like this:
SELECT ip.* FROM invoicepayments ip INNER JOIN invoice i ON
ip.invoice_id = i.id
You can then use a WHERE clause to apply the filtering.
The looping approach you have tried does work, but it is not recommended because it executes one query per invoice. Instead you can do:
InvoicePayments.objects.filter(invoice__local=3,
date__range=("2016-05-01", "2016-05-06"))
I am not quite sure what 'local' stands for because your model does not show any field like that. Please update your model with the correct field or edit the query as appropriate.
To learn about __range see https://docs.djangoproject.com/en/1.9/ref/models/querysets/#range
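To check that the subquery form and the INNER JOIN form select the same payments, here is a small sketch with Python's sqlite3 module. The tables are simplified, hypothetical stand-ins for the two models, the rows are invented, and dates are stored as ISO strings so that BETWEEN compares them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables standing in for the invoice and payment models.
conn.executescript(
    """
    CREATE TABLE invoice (id INTEGER PRIMARY KEY, store_id INTEGER);
    CREATE TABLE invoicepayments (
        id INTEGER PRIMARY KEY, invoice_id INTEGER, date TEXT, money INTEGER);
    INSERT INTO invoice (id, store_id) VALUES (1, 3), (2, 3), (3, 4);
    INSERT INTO invoicepayments (id, invoice_id, date, money) VALUES
        (1, 1, '2016-05-02', 100),
        (2, 2, '2016-05-10', 200),
        (3, 3, '2016-05-03', 300),
        (4, 2, '2016-05-05', 400);
    """
)
params = (3, "2016-05-01", "2016-05-06")

# Form 1: select ... where id in (select ...)
with_subquery = conn.execute(
    "SELECT ip.id FROM invoicepayments AS ip "
    "WHERE ip.invoice_id IN (SELECT i.id FROM invoice AS i WHERE i.store_id = ?) "
    "AND ip.date BETWEEN ? AND ? ORDER BY ip.id",
    params,
).fetchall()

# Form 2: the same filter expressed as an INNER JOIN
with_join = conn.execute(
    "SELECT ip.id FROM invoicepayments AS ip "
    "INNER JOIN invoice AS i ON ip.invoice_id = i.id "
    "WHERE i.store_id = ? AND ip.date BETWEEN ? AND ? ORDER BY ip.id",
    params,
).fetchall()

print(with_subquery, with_join)
```

Both queries return the payments of store 3 that fall inside the date range, which is also what the double-underscore filter asks the ORM for.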

Django: Selecting distinct values on maximum foreign key value

I have the following models which I'm testing with SQLite3 and MySQL:
# (various model fields extraneous to discussion removed...)
class Run(models.Model):
    runNumber = models.IntegerField()
class Snapshot(models.Model):
    t = models.DateTimeField()
class SnapshotRun(models.Model):
    snapshot = models.ForeignKey(Snapshot)
    run = models.ForeignKey(Run)
    # other fields which make it possible to have multiple distinct Run objects per Snapshot
I want a query which will give me a set of runNumbers & snapshot IDs for which the Snapshot.id is below some specified value. Naively I would expect this to work:
print SnapshotRun.objects.filter(snapshot__id__lte=ss_id)\
.order_by("run__runNumber", "-snapshot__id")\
.distinct("run__runNumber", "snapshot__id")\
.values("run__runNumber", "snapshot__id")
But this blows up with
NotImplementedError: DISTINCT ON fields is not supported by this database backend
for both database backends. Postgres is unfortunately not an option.
Time to fall back to raw SQL?
Update:
Since Django's ORM won't help me out of this one (thanks @jknupp) I did manage to get the following raw SQL to work:
cursor.execute("""
    SELECT r.runNumber, ssr1.snapshot_id
    FROM livedata_run AS r
    JOIN livedata_snapshotrun AS ssr1
    ON ssr1.id =
    (
        SELECT id
        FROM livedata_snapshotrun AS ssr2
        WHERE ssr2.run_id = r.id
        AND ssr2.snapshot_id <= %s
        ORDER BY snapshot_id DESC
        LIMIT 1
    );
""", [max_ss_id])
Here livedata is the Django app these tables live in.
The note in the Django documentation is pretty clear:
Note:
Any fields used in an order_by() call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction with distinct(). If you order by fields from a related model, those fields will be added to the selected columns and they may make otherwise duplicate rows appear to be distinct. Since the extra columns don’t appear in the returned results (they are only there to support ordering), it sometimes looks like non-distinct results are being returned.
Similarly, if you use a values() query to restrict the columns selected, the columns used in any order_by() (or default model ordering) will still be involved and may affect uniqueness of the results.
The moral here is that if you are using distinct() be careful about ordering by related models. Similarly, when using distinct() and values() together, be careful when ordering by fields not in the values() call.
Also, below that:
This ability to specify field names (with distinct) is only available in PostgreSQL.
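Since DISTINCT ON is out of reach on SQLite and MySQL, one portable way to get the highest snapshot id per run is a plain GROUP BY with MAX. The sketch below uses Python's sqlite3 module; the tables are simplified, hypothetical versions of the livedata tables and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified versions of the livedata tables.
conn.executescript(
    """
    CREATE TABLE run (id INTEGER PRIMARY KEY, runNumber INTEGER);
    CREATE TABLE snapshotrun (id INTEGER PRIMARY KEY, snapshot_id INTEGER, run_id INTEGER);
    INSERT INTO run (id, runNumber) VALUES (1, 100), (2, 200);
    INSERT INTO snapshotrun (snapshot_id, run_id) VALUES (1, 1), (2, 1), (3, 1), (2, 2), (5, 2);
    """
)
max_ss_id = 3

# Highest snapshot id at or below max_ss_id for each run, without DISTINCT ON.
rows = conn.execute(
    "SELECT r.runNumber, MAX(ssr.snapshot_id) "
    "FROM run AS r JOIN snapshotrun AS ssr ON ssr.run_id = r.id "
    "WHERE ssr.snapshot_id <= ? "
    "GROUP BY r.id, r.runNumber ORDER BY r.runNumber",
    (max_ss_id,),
).fetchall()
print(rows)
```

In the ORM, combining values() on the run number with annotate() and Max on the snapshot id is the usual way to get a GROUP BY of this shape, although that variant is not tested here.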

Query syntax to select exactly one item for each category

class Category(models.Model):
    pass
class Item(models.Model):
    cat = models.ForeignKey(Category)
I want to select exactly one item for each category. What is the query syntax to do this?
Your question isn't entirely clear: since you didn't say otherwise, I'm going to assume that you don't care which item is selected for each category, just that you need any one. If that isn't the case, please update the question to clarify.
tl;dr version: there is no documented way to explicitly use GROUP BY statements in Django, except by using a raw query. See the bottom for code to do so.
The problem is that doing what you're looking for in SQL itself requires a bit of a hack. You can easily try this example by entering sqlite3 :memory: at the command line:
CREATE TABLE category
(
id INT
);
CREATE TABLE item
(
id INT,
category_id INT
);
INSERT INTO category VALUES (1);
INSERT INTO category VALUES (2);
INSERT INTO category VALUES (3);
INSERT INTO item VALUES (1,1);
INSERT INTO item VALUES (2,2);
INSERT INTO item VALUES (3,3);
INSERT INTO item VALUES (4,1);
INSERT INTO item VALUES (5,2);
SELECT id, category_id, COUNT(category_id) FROM item GROUP BY category_id;
returns
4|1|2
5|2|2
3|3|1
Which is what you're looking for (one item id for each category id), albeit with an extraneous COUNT. The count (or some other aggregate function) is needed in order to apply the GROUP BY.
Note: this will ignore categories that don't contain any items, which seems like sensible behaviour.
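One caveat about the query above: id is a bare column that is neither grouped nor aggregated, so SQLite takes its value from an arbitrary row in each group, and stricter databases may reject the statement. A sketch that names the chosen item explicitly with MIN(id), run through Python's sqlite3 module on the same sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (id INT, category_id INT);
    INSERT INTO item VALUES (1,1);
    INSERT INTO item VALUES (2,2);
    INSERT INTO item VALUES (3,3);
    INSERT INTO item VALUES (4,1);
    INSERT INTO item VALUES (5,2);
    """
)

# MIN(id) picks the lowest item id per category instead of relying on a bare column.
rows = conn.execute(
    "SELECT MIN(id), category_id FROM item GROUP BY category_id ORDER BY category_id"
).fetchall()
print(rows)
```

This still returns one item id per category, but which one is now stated in the query itself.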
Now the question becomes, how to do this in Django?
The obvious answer is to use Django's aggregation/annotation support, in particular, combining annotate with values as is recommended elsewhere to GROUP queries in Django.
Reading those posts, it would seem we could accomplish what we're looking for with
Item.objects.values('id').annotate(unneeded_count=Count('category_id'))
However this doesn't work. What Django does here is not just GROUP BY "category_id", but groups by all fields selected (ie GROUP BY "id", "category_id")1. I don't believe there is a way (in the public API, at least) to change this behaviour.
The solution is to fall back to raw SQL:
qs = Item.objects.raw('SELECT *, COUNT(category_id) FROM myapp_item GROUP BY category_id')
1: Note that you can inspect what queries Django is running with:
from django.db import connection
print(connection.queries[-1])  # only populated when DEBUG = True
Edit:
There are a number of other possible approaches, but most have (possibly severe) performance problems. Here are a couple:
1. Select an item from each category.
items = []
for c in Category.objects.all():
    # item_set is a manager, so index a QuerySet; an empty category raises IndexError
    items.append(c.item_set.all()[0])
This is a more clear and flexible approach, but has the obvious disadvantage of requiring many more database hits.
2. Use select_related
items = Item.objects.select_related()
and then do the grouping/filtering yourself (in Python).
Again, this is perhaps more clear than using raw SQL and only requires one query, but this one query could be very large (it will return all items and their categories) and doing the grouping/filtering yourself is probably less efficient than letting the database do it for you.
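The "grouping yourself in Python" step could look something like the sketch below. The Item namedtuple and the sample rows are hypothetical stand-ins for objects already fetched from the database, so the snippet runs without Django.

```python
from collections import namedtuple

# Hypothetical stand-in for Item rows already fetched from the database.
Item = namedtuple("Item", ["id", "category_id"])
fetched = [Item(1, 1), Item(2, 2), Item(3, 3), Item(4, 1), Item(5, 2)]

first_per_category = {}
for item in fetched:
    # setdefault keeps the first item seen for each category and ignores the rest.
    first_per_category.setdefault(item.category_id, item)

print(sorted(i.id for i in first_per_category.values()))
```

The loop is a single pass over the fetched rows, but every item still has to be transferred from the database first, which is the cost mentioned above.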

How can I make a Django query for the first occurrence of a foreign key in a column?

Basically, I have a table with a bunch of foreign keys and I'm trying to query only the first occurrence of a particular key by the "created" field. Using the Blog/Entry example, if the Entry model has a foreign key to Blog and a foreign key to User, then how can I construct a query to select all Entries in which a particular User has written the first one for the various Blogs?
class Blog(models.Model):
    ...
class User(models.Model):
    ...
class Entry(models.Model):
    blog = models.ForeignKey(Blog)
    user = models.ForeignKey(User)
I assume there's some magic I'm missing to select the first entries of a blog and that I can simply filter further down to a particular user by appending:
query = Entry.objects.magicquery.filter(user=user)
But maybe there's some other more efficient way. Thanks!
query = Entry.objects.filter(user=user).order_by('id')[0]
Basically order by id (lowest to highest), and slice it to get only the first hit from the QuerySet.
I don't have a Django install available right now to test my line, so please check the documentation in case I have a typo above:
order by
limiting querysets
By the way, an interesting note from the "limiting querysets" section of the manual:
To retrieve a single object rather than a list (e.g. SELECT foo FROM bar LIMIT 1), use a simple index instead of a slice. For example, this returns the first Entry in the database, after ordering entries alphabetically by headline:
Entry.objects.order_by('headline')[0]
EDIT: OK, this is the best I could come up with so far (after your comment and mine). It doesn't return Entry objects, but their ids as entry_id.
query = Entry.objects.values('blog').filter(user=user).annotate(Count('blog')).annotate(entry_id=Min('id'))
I'll keep looking for a better way.
Ancient question, I realise - @zalew's response is close but will likely result in the error:
ProgrammingError: SELECT DISTINCT ON expressions must match initial
ORDER BY expressions
To correct this, try aligning the ordering and distinct parts of the query:
Entry.objects.filter(user=user).distinct("blog").order_by("blog", "created")
As a bonus, in case of multiple entries being created at exactly the same time (unlikely, but you never know!), you could add determinism by including id in the order_by:
Entry.objects.filter(user=user).distinct("blog").order_by("blog", "created", "id")
I can't test it at the moment:
Entry.objects.filter(user=user).distinct("blog").order_by("id")
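Note that filtering by user before picking one row per blog yields the user's own earliest entry in each blog, which is not the same as the entries where the user wrote the blog's very first entry. Under that second reading of the question, the SQL could look like the sketch below, run with Python's sqlite3 module. The entry table, its columns and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified entry table; created is an ISO date string.
conn.executescript(
    """
    CREATE TABLE entry (id INTEGER PRIMARY KEY, blog_id INTEGER, user_id INTEGER, created TEXT);
    INSERT INTO entry (id, blog_id, user_id, created) VALUES
        (1, 1, 7, '2020-01-01'),
        (2, 1, 8, '2020-01-02'),
        (3, 2, 8, '2020-01-01'),
        (4, 2, 7, '2020-01-03'),
        (5, 3, 7, '2020-01-05');
    """
)
user_id = 7

# Keep an entry only if it is the earliest one in its blog and was written by user_id.
rows = conn.execute(
    """
    SELECT e.id, e.blog_id
    FROM entry AS e
    WHERE e.user_id = ?
      AND e.id = (SELECT e2.id FROM entry AS e2
                  WHERE e2.blog_id = e.blog_id
                  ORDER BY e2.created, e2.id
                  LIMIT 1)
    ORDER BY e.blog_id
    """,
    (user_id,),
).fetchall()
print(rows)
```

Here user 7 wrote the first entry of blogs 1 and 3 but not of blog 2, so only those two entries come back. Ordering the inner query by created and then id keeps the choice deterministic when two entries share a timestamp.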
