Nesting Django QuerySets - python

Is there a way to create a queryset that operates on a nested queryset?
The simplest example I can think of to explain what I'm trying to accomplish is by demonstration.
I would like to write code something like
SensorReading.objects.filter(reading=1).objects.filter(sensor=1)
resulting in SQL looking like
SELECT * FROM (
SELECT * FROM SensorReading WHERE reading=1
) WHERE sensor=1;
More specifically I have a model representing readings from sensors
class SensorReading(models.Model):
    sensor = models.PositiveIntegerField()
    timestamp = models.DateTimeField()
    reading = models.IntegerField()
With this I am creating a queryset that annotates every sensor with the elapsed time since the previous reading in seconds
readings = (
    SensorReading.objects.filter(**filters)
    .annotate(
        previous_read=Window(
            expression=window.Lead("timestamp"),
            partition_by=[F("sensor")],
            order_by=["timestamp"],
            frame=RowRange(start=-1, end=0),
        )
    )
    .annotate(delta=Abs(Extract(F("timestamp") - F("previous_read"), "epoch")))
)
I now want to aggregate those per sensor to see the minimum and maximum elapsed time between readings from every sensor. I initially tried
readings.values("sensor").annotate(max=Max('delta'),min=Min('delta'))[0]
however, this fails because window values cannot be used inside the aggregate.
Are there any methods or libraries to solve this without needing to resort to raw SQL? Or have I just overlooked a simpler solution to the problem?

The short answer is: yes, you can, using the id__in lookup together with Subquery (from django.db.models) inside filter().
The long answer is how:
You can create a subquery that retrieves the filtered SensorReading objects, then use that subquery in the main queryset. For example:
from django.db.models import Subquery
subquery = SensorReading.objects.filter(reading=1).values('id')
readings = SensorReading.objects.filter(id__in=Subquery(subquery), sensor=1)
The above code will generate SQL that is similar to what you described in your example:
SELECT * FROM SensorReading
WHERE id IN (SELECT id FROM SensorReading WHERE reading=1)
AND sensor=1;
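If it helps to see that shape concretely, here is a standalone sqlite3 sketch of the nested-subquery SQL (the table name, columns, and data are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sensorreading (id INTEGER PRIMARY KEY, sensor INT, reading INT)"
)
con.executemany(
    "INSERT INTO sensorreading (sensor, reading) VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 2)],
)
# Outer filter applied on top of an inner filtered set, via id IN (subquery).
rows = con.execute(
    """
    SELECT id, sensor, reading FROM sensorreading
    WHERE id IN (SELECT id FROM sensorreading WHERE reading = 1)
      AND sensor = 1
    """
).fetchall()
print(rows)  # only the row for sensor 1 with reading 1
```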
Another way is to chain filter() onto the queryset you have already created, adding the second filter on top of it:
readings = (
    SensorReading.objects.filter(**filters)
    .annotate(
        previous_read=Window(
            expression=window.Lead("timestamp"),
            partition_by=[F("sensor")],
            order_by=["timestamp"],
            frame=RowRange(start=-1, end=0),
        )
    )
    .annotate(delta=Abs(Extract(F("timestamp") - F("previous_read"), "epoch")))
    .filter(sensor=1)
)
UPDATE:
As you commented below, you can use the manager's raw() method to aggregate the window function values without running the subquery multiple times. This lets you express the window function directly in SQL and map the results onto model instances.
For example, you can write a raw SQL query that retrieves the filtered SensorReading rows along with the previous_read and delta fields computed by the window function, and execute it through raw():
raw_sql = '''
SELECT id, sensor, timestamp, reading,
LAG(timestamp) OVER (PARTITION BY sensor ORDER BY timestamp) as previous_read,
ABS(EXTRACT(EPOCH FROM timestamp - LAG(timestamp) OVER (PARTITION BY sensor ORDER BY timestamp))) as delta
FROM myapp_sensorreading
WHERE reading = 1
'''
readings = SensorReading.objects.raw(raw_sql)
Note that raw() returns a RawQuerySet, which does not support values() or annotate(); to aggregate per sensor, either fold the aggregation into the SQL itself (wrap the query above in an outer SELECT with MIN(delta) and MAX(delta) grouped by sensor) or aggregate the fetched rows in Python.
Just be aware of the security implications of using raw SQL, as it allows you to include user input directly in the query, which could lead to SQL injection attacks. Be sure to properly validate and sanitize any user input that you use in a raw SQL query.
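To make that injection point concrete: pass user input through the params argument of raw() (with %s placeholders) instead of formatting it into the SQL string. The same principle, sketched standalone with sqlite3's ? placeholders and a made-up readings table; SQLite's julianday() stands in for Postgres's EXTRACT(EPOCH FROM ...):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensorreading (id INTEGER PRIMARY KEY, sensor INT, ts TEXT)")
con.executemany(
    "INSERT INTO sensorreading (sensor, ts) VALUES (?, ?)",
    [(1, "2023-01-01 00:00:00"), (1, "2023-01-01 00:00:30"), (2, "2023-01-01 00:00:00")],
)
# The ? placeholder keeps user input out of the SQL string; Django's
# Manager.raw() takes values the same way via its params argument.
rows = con.execute(
    """
    SELECT sensor,
           LAG(ts) OVER (PARTITION BY sensor ORDER BY ts) AS previous_read,
           CAST(ROUND((julianday(ts) - julianday(
               LAG(ts) OVER (PARTITION BY sensor ORDER BY ts))) * 86400) AS INT) AS delta
    FROM sensorreading
    WHERE sensor = ?
    ORDER BY ts
    """,
    (1,),
).fetchall()
print(rows)  # first reading has no predecessor (delta None); second is 30 seconds later
```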

Ended up rolling my own solution: basically introspecting the queryset to create a fake table for use in building a new queryset, and setting the alias to a node that knows how to render the SQL for the inner query.
This allows me to do something like
readings = (
    NestedQuery(
        SensorReading.objects.filter(**filters)
        .annotate(
            previous_read=Window(
                expression=window.Lead("timestamp"),
                partition_by=[F("sensor")],
                order_by=["timestamp"],
                frame=RowRange(start=-1, end=0),
            )
        )
        .annotate(delta=Abs(Extract(F("timestamp") - F("previous_read"), "epoch")))
    )
    .values("sensor")
    .annotate(min=Min("delta"), max=Max("delta"))
)
The code is available on GitHub, and I've published it on PyPI:
https://github.com/Simage/django-nestedquery
I have no doubt that I'm still leaking table aliases or some such nonsense, and this should be considered proof of concept, not any sort of production code.

Related

Update Django model based on the row number of rows produced by a subquery on the same model

I have a PostgreSQL UPDATE query which updates a field (global_ranking) of every row in a table, based on the ROW_NUMBER() of each row in that same table sorted by another field (rating). Additionally, the update is partitioned, so that the ranking of each row is relative only to those rows which belong to the same language.
In short, I'm updating the ranking of each player in a game, based on their current rating.
The PostgreSQL query looks like this:
UPDATE stats_userstats
SET global_ranking = sub.row_number
FROM (
    SELECT id, ROW_NUMBER() OVER (
        PARTITION BY language
        ORDER BY rating DESC
    ) AS row_number
    FROM stats_userstats
) sub
WHERE stats_userstats.id = sub.id;
I'm also using Django, and it'd be fun to learn how to express this query using the Django ORM, if possible.
At first, it seemed like Django had everything necessary to express the query, including the ability to use PostgreSQL's ROW_NUMBER() windowing function, but my best attempt sets every row's ranking to 1:
from django.db.models import F, OuterRef, Subquery
from django.db.models.expressions import Window
from django.db.models.functions import RowNumber
UserStats.objects.update(
    global_ranking=Subquery(
        UserStats.objects.filter(
            id=OuterRef('id')
        ).annotate(
            row_number=Window(
                expression=RowNumber(),
                partition_by=[F('language')],
                order_by=F('rating').desc(),
            )
        ).values('row_number')
    )
)
I used from django.db import connection; print(connection.queries) to see the query produced by that Django ORM statement, and got this vaguely similar SQL statement:
UPDATE "stats_userstats"
SET "global_ranking" = (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY U0."language"
        ORDER BY U0."rating" DESC
    ) AS "row_number"
    FROM "stats_userstats" U0
    WHERE U0."id" = "stats_userstats"."id"
)
It looks like what I need to do is move the subquery from the SET portion of the query to the FROM, but it's unclear to me how to restructure the Django ORM statement to achieve that.
Any help is greatly appreciated. Thank you!
Subquery filters its queryset by the provided OuterRef, so you always get 1: each user is in fact first in a ranking computed over only their own row.
A "correct" query would be:
UserStats.objects.alias(
    row_number=Window(
        expression=RowNumber(),
        partition_by=[F('language')],
        order_by=F('rating').desc(),
    )
).update(global_ranking=F('row_number'))
But Django will not allow that:
django.core.exceptions.FieldError: Window expressions are not allowed in this query
Related Django ticket: https://code.djangoproject.com/ticket/25643
I think you might comment there with your use case.
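Until that ticket is resolved, one workaround is to compute the row numbers with the window function in a plain SELECT (which is allowed) and then write them back, e.g. via bulk_update() in Django. A standalone sqlite3 sketch of that two-step shape, with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE stats_userstats "
    "(id INTEGER PRIMARY KEY, language TEXT, rating INT, global_ranking INT)"
)
con.executemany(
    "INSERT INTO stats_userstats (language, rating) VALUES (?, ?)",
    [("en", 10), ("en", 30), ("en", 20), ("fr", 5)],
)
# Step 1: the window function in a plain SELECT, which is always allowed.
ranks = con.execute(
    "SELECT id, ROW_NUMBER() OVER (PARTITION BY language ORDER BY rating DESC) "
    "FROM stats_userstats"
).fetchall()
# Step 2: write the computed ranks back (the Django analogue would be a loop
# over the annotated queryset followed by bulk_update()).
con.executemany(
    "UPDATE stats_userstats SET global_ranking = ? WHERE id = ?",
    [(rank, pk) for pk, rank in ranks],
)
final = con.execute(
    "SELECT id, global_ranking FROM stats_userstats ORDER BY id"
).fetchall()
print(final)  # [(1, 3), (2, 1), (3, 2), (4, 1)]
```

This does two round trips instead of one UPDATE...FROM, but it avoids the ORM restriction entirely.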

Is it possible to use Django ORM to select no table fields receiving only values calculated based on real columns?

So the question is "is it possible to make the following query using Django ORM without raw statements?"
SELECT
    my_table.column_a + my_table.column_b
FROM
    my_table
The example for which it would be suitable from my point of view:
We have a model:
class MyOperations(models.Model):
    operation_start_time = models.DateTimeField()
At some point we create a record and set the field value to Now() (or we update some existing record; it doesn't matter):
MyOperations.objects.create(operation_start_time=functions.Now())
Now we want to know how much time has already passed. I would expect that Django ORM produces the following SQL statement to request data from the database (let's assume that we use a MySQL backend):
SELECT
    TIMESTAMPDIFF(MICROSECOND, `myapp_myoperations`.`operation_start_time`, CURRENT_TIMESTAMP) AS `time_spent`
FROM
    `myapp_myoperations`
WHERE ...
So, is there a way to achieve this without raw statements?
For now I settled on the following solution:
MyOperations.objects.values('operation_start_time').annotate(
    diff=ExpressionWrapper(
        functions.Now() - F('operation_start_time'),
        output_field=DurationField(),
    )
).filter(...)
It produces the following SQL statement:
SELECT
    `myapp_myoperations`.`operation_start_time`,
    TIMESTAMPDIFF(MICROSECOND, `myapp_myoperations`.`operation_start_time`, CURRENT_TIMESTAMP) AS `time_spent`
FROM
    `myapp_myoperations`
WHERE ...
Or in the Python response object representation:
{'operation_start_time': datetime(...), 'diff': timedelta(...)}
Is there a way to get the response dict with only diff, since that is the only field I am interested in? The ORM requests operation_start_time just as we had written; but if I remove the call to values() entirely, it produces a query that requests all table columns.
Solution that produces the expected SQL
Just move the call to values() to a point where diff is already known to the query:
MyOperations.objects.annotate(
    diff=ExpressionWrapper(
        functions.Now() - F('operation_start_time'),
        output_field=DurationField(),
    )
).values('diff').filter(...)
You can use values() on your calculated field, so a query like
MyOperations.objects.values('operation_start_time').annotate(
    diff=ExpressionWrapper(
        functions.Now() - F('operation_start_time'),
        output_field=DurationField(),
    )
).values('diff')
should give you a resulting queryset containing only your calculated 'diff'.
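For what it's worth, the SQL shape both answers are driving at (a select list containing only an expression over real columns, no bare table fields) can be sanity-checked standalone with sqlite3, using a made-up table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (column_a INT, column_b INT)")
con.executemany("INSERT INTO my_table VALUES (?, ?)", [(1, 2), (10, 20)])
# No real columns in the select list, only an expression over them --
# the shape that .annotate(...).values('diff') asks the ORM to emit.
rows = con.execute("SELECT column_a + column_b AS total FROM my_table").fetchall()
print(rows)  # [(3,), (30,)]
```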

Django Models - SELECT DISTINCT(foo) FROM table is too slow

I have a MySQL table with 13M rows. I can query the db directly as
SELECT DISTINCT(refdate) FROM myTable
The query takes 0.15 seconds and is great.
The equivalent table defined as a Django model and queried as
myTable.objects.values('refdate').distinct()
takes a very long time. Is it because there are too many items in the list before distinct()? How do I do this in a manner that doesn't bring everything down?
Thank you @solarissmoke for the pointer to connection.queries.
I was expecting to see
SELECT DISTINCT refdate FROM myTable
Instead, I got
SELECT DISTINCT refdate, itemIndex, itemType FROM myTable ORDER BY itemIndex, refdate, itemType
I then looked at myTable defined in models.py.
unique_together = (('nodeIndex', 'refdate', 'nodeType'), )
ordering = ['nodeIndex', 'refdate', 'nodeType']
From Interaction with default ordering or order_by
normally you won’t want extra columns playing a part in the result, so clear out the ordering, or at least make sure it’s restricted only to those fields you also select in a values() call.
So I tried order_by() to flush the previously defined ordering and voila!
myTable.objects.values('refdate').order_by().distinct()
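To see why the extra ordering columns hurt, compare the row counts of the two DISTINCT queries on a toy table (sqlite3, made-up data): DISTINCT over (refdate, itemIndex, itemType) returns one row per distinct tuple, which can be vastly more than the distinct refdate values alone.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE myTable (refdate TEXT, itemIndex INT, itemType TEXT)")
con.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [("2020-01-01", 1, "a"), ("2020-01-01", 2, "a"), ("2020-01-02", 1, "a")],
)
# DISTINCT over the one column of interest:
only_refdate = con.execute("SELECT DISTINCT refdate FROM myTable").fetchall()
# DISTINCT over the column plus the default-ordering columns Django added:
with_ordering_cols = con.execute(
    "SELECT DISTINCT refdate, itemIndex, itemType FROM myTable"
).fetchall()
print(len(only_refdate), len(with_ordering_cols))  # 2 3
```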
You can try this:
myTable.objects.all().distinct('refdate')
(Note that passing field names to distinct() is only supported on PostgreSQL.)

Django ORM: Joining QuerySets

I'm trying to use the Django ORM for a task that requires a JOIN in SQL. I
already have a workaround that accomplishes the same task with multiple queries
and some off-DB processing, but I'm not satisfied by the runtime complexity.
First, I'd like to give you a short introduction to the relevant part of my
model. After that, I'll explain the task in English, SQL and (inefficient) Django ORM.
The Model
In my CMS model, posts are multi-language: For each post and each language, there can be one instance of the post's content. Also, when editing posts, I don't UPDATE, but INSERT new versions of them.
So, PostContent is unique on post, language and version. Here's the class:
class PostContent(models.Model):
    """Contains all versions of a post, in all languages."""
    language = models.ForeignKey(Language)
    post = models.ForeignKey(Post)  # the Post object itself only contains slug and id
    version = models.IntegerField(default=0)
    # further metadata and content left out

    class Meta:
        unique_together = (("post", "language", "version"),)
The Task in SQL
And this is the task: I'd like to get a list of the most recent versions of all posts in each language, using the ORM. In SQL, this translates to a JOIN on a subquery that does GROUP BY and MAX to get the maximum of version for each unique pair of post and language. The perfect answer to this question would be a number of ORM calls that produce the following SQL statement:
SELECT
    id, post_id, version, v
FROM
    cms_postcontent,
    (SELECT
        post_id AS p,
        MAX(version) AS v,
        language_id AS l
    FROM
        cms_postcontent
    GROUP BY
        post_id, language_id
    ) AS maxv
WHERE
    post_id = p
    AND version = v
    AND language_id = l;
Solution in Django
My current solution using the Django ORM does not produce such a JOIN, but two separate SQL queries, and one of those queries can become very large. I first execute the subquery (the inner SELECT from above):
maxv = PostContent.objects.values('post', 'language').annotate(
    max_version=Max('version'))
Now, instead of joining maxv, I explicitly ask for every single post in maxv, by
filtering PostContent.objects.all() for each tuple of post, language, max_version. The resulting SQL looks like
SELECT * FROM PostContent WHERE
post=P1 and language=L1 and version=V1
OR post=P2 and language=L2 and version=V2
OR ...;
In Django:
from functools import reduce
from django.db.models import Q

conjunc = [
    Q(version=pc['max_version']) & Q(post=pc['post']) & Q(language=pc['language'])
    for pc in maxv
]
result = PostContent.objects.filter(reduce(lambda disjunc, x: disjunc | x, conjunc))
If maxv is sufficiently small, e.g. when retrieving a single post, this might be
a good solution, but the size of the query and the time to create it grow linearly with
the number of posts. The complexity of parsing the query is also at least linear.
Is there a better way to do this, apart from using raw SQL?
You can join (in the sense of union) querysets with the | operator, as long as the querysets query the same model.
However, it sounds like you want something like PostContent.objects.order_by('version').distinct('language'); as you can't quite do that in 1.3.1, consider using values in combination with distinct() to get the effect you need.
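For reference, the greatest-n-per-group JOIN from the question can be exercised standalone; a sqlite3 sketch with made-up data, written with an explicit JOIN ... ON instead of the implicit-join WHERE (same result):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cms_postcontent "
    "(id INTEGER PRIMARY KEY, post_id INT, language_id INT, version INT)"
)
con.executemany(
    "INSERT INTO cms_postcontent (post_id, language_id, version) VALUES (?, ?, ?)",
    [(1, 1, 0), (1, 1, 1), (1, 2, 0), (2, 1, 0), (2, 1, 2)],
)
# Join each row against the (post, language, max version) subquery so only
# the latest version per post/language pair survives.
rows = con.execute(
    """
    SELECT pc.id, pc.post_id, pc.language_id, pc.version
    FROM cms_postcontent AS pc
    JOIN (SELECT post_id, language_id, MAX(version) AS v
          FROM cms_postcontent
          GROUP BY post_id, language_id) AS maxv
      ON pc.post_id = maxv.post_id
     AND pc.language_id = maxv.language_id
     AND pc.version = maxv.v
    ORDER BY pc.id
    """
).fetchall()
print(rows)  # latest version of each (post, language) pair
```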

Group by date in a particular format in SQLAlchemy

I have a table called logs which has a datetime field.
I want to select the date and count of rows based on a particular date format.
How do I do this using SQLAlchemy?
I don't know of a generic SQLAlchemy answer. Most databases support some form of date formatting, typically via functions. SQLAlchemy supports calling functions via sqlalchemy.sql.func. So for example, using SQLAlchemy over a Postgres back end, and a table my_table(foo varchar(30), when timestamp) I might do something like
my_table = metadata.tables['my_table']
foo = my_table.c['foo']
the_date = func.date_trunc('month', my_table.c['when'])
stmt = select(foo, the_date).group_by(the_date)
engine.execute(stmt)
To group by date truncated to month. But keep in mind that in that example, date_trunc() is a Postgres datetime function. Other databases will be different. You didn't mention the underlying database. If there's a database-independent way to do it, I've never found one. In my case I run production and test against Postgres and unit tests against SQLite, and have resorted to using SQLite user-defined functions in my unit tests to emulate Postgres datetime functions.
Does counting yield the same result when you just group by the unformatted datetime column? If so, you could just run the query and use Python date's strftime() method afterwards. i.e.
query = select([logs.c.datetime, func.count(logs.c.datetime)]).group_by(logs.c.datetime)
results = session.execute(query).fetchall()
results = [(t[0].strftime("..."), t[1]) for t in results]
I don't know SQLAlchemy, so I could be off-target. However, I think that all you need is:
SELECT date_formatter(datetime_field, "format-specification") AS dt_field, COUNT(*)
FROM logs
GROUP BY date_formatter(datetime_field, "format-specification")
ORDER BY 1;
OK, maybe you don't need the ORDER BY, and maybe it would be better to re-specify the date expression. There are likely to be alternatives, such as:
SELECT dt_field, COUNT(*)
FROM (SELECT date_formatter(datetime_field, "format-specification") AS dt_field
FROM logs) AS necessary
GROUP BY dt_field
ORDER BY dt_field;
And so on and so forth. Basically, you format the datetime field and then proceed to do the grouping etc on the formatted value.
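As a concrete instance of that pattern: on SQLite the formatting function is strftime(), which SQLAlchemy would reach via func.strftime(...). A plain DB-API sketch with a made-up logs table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (dt TEXT)")
con.executemany(
    "INSERT INTO logs VALUES (?)",
    [("2023-01-05 10:00:00",), ("2023-01-20 11:00:00",), ("2023-02-01 09:00:00",)],
)
# Format the datetime to year-month, then group and count on the formatted value.
rows = con.execute(
    """
    SELECT strftime('%Y-%m', dt) AS month, COUNT(*)
    FROM logs
    GROUP BY strftime('%Y-%m', dt)
    ORDER BY 1
    """
).fetchall()
print(rows)  # [('2023-01', 2), ('2023-02', 1)]
```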
