Using jsonb_array_elements in sqlalchemy query - python

I'm using SQLAlchemy ORM and trying to figure out how to produce a PostgreSQL query something along the lines of:
SELECT
payments.*
FROM
payments,
jsonb_array_elements(payments.data #> '{refunds}') refunds
WHERE
(refunds ->> 'created_at')
BETWEEN '2018-12-01T19:30:38Z' AND '2018-12-02T19:30:38Z';
Though with the start date inclusive and stop date exclusive.
I've been able to get close with:
refundsInDB = db_session.query(Payment).\
filter(Payment.data['refunds', 0, 'created_at'].astext >= startTime,
Payment.data['refunds', 0, 'created_at'].astext < stopTime ).\
all()
However, this only works if the refund (which is a nested array in the JSONB data) is the first element in the list of {'refunds':[]} whereas the SQL query above will work regardless of the position in the refund list.
After a good bit of searching, it looks like there are some temporary recipes in an old SQLAlchemy GitHub issue, one of which talks about using jsonb_array_elements in a query, but I haven't been able to make it work quite the way I'd like.
If it helps my Payment.data JSONB is exactly like the Payment object from the Square Connect v1 API and I am trying to search the data using the created_at date found in the nested refunds list of refund objects.

Use Query.select_from() and a function expression alias to perform the query:
from sqlalchemy import column, func
from sqlalchemy.dialects.postgresql import JSONB
# Create a column object for referencing the `value` attribute produced by
# the set of jsonb_array_elements
val = column('value', type_=JSONB)
refundsInDB = db_session.query(Payment).\
select_from(
Payment,
func.jsonb_array_elements(Payment.data['refunds']).alias()).\
filter(val['created_at'].astext >= startTime,
val['created_at'].astext < stopTime ).\
all()
Note that unnesting the jsonb array and filtering on it may produce multiple rows per Payment, but SQLAlchemy returns only distinct entities when the query is for a single entity.
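To sanity-check what the unnest-and-filter query should return, the same rule can be written in plain Python. This is only a sketch of the query's semantics, not of SQLAlchemy itself: the function name payments_with_refund_between and the sample data are made up for illustration, and it assumes every created_at is an ISO 8601 UTC string in the same format, so that string comparison matches chronological order (which is also what the astext comparison relies on).

```python
def payments_with_refund_between(payments, start, stop):
    """Return payments with at least one refund where start <= created_at < stop.

    Each payment is returned at most once, even if several refunds match,
    mirroring the distinct-entity behaviour described above.
    """
    matches = []
    for payment in payments:
        refunds = payment.get("data", {}).get("refunds") or []
        # Any position in the refunds list counts, not just index 0.
        if any(start <= r.get("created_at", "") < stop for r in refunds):
            matches.append(payment)
    return matches


# Hypothetical sample data shaped like the JSONB column in the question.
payments = [
    {"id": 1, "data": {"refunds": [
        {"created_at": "2018-11-30T10:00:00Z"},
        {"created_at": "2018-12-01T20:00:00Z"},   # second element matches
    ]}},
    {"id": 2, "data": {"refunds": [
        {"created_at": "2018-12-02T19:30:38Z"},   # equals stop, so excluded
    ]}},
    {"id": 3, "data": {"refunds": []}},
    {"id": 4, "data": {"refunds": [
        {"created_at": "2018-12-01T19:30:38Z"},   # equals start, so included
        {"created_at": "2018-12-02T00:00:00Z"},
    ]}},
]

found = payments_with_refund_between(
    payments, "2018-12-01T19:30:38Z", "2018-12-02T19:30:38Z"
)
matched_ids = [p["id"] for p in found]
```

Comparing the ids returned by the real query against matched_ids for a small fixture is one way to confirm the inclusive start and exclusive stop behave as intended.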

Related

Nesting Django QuerySets

Is there a way to create a queryset that operates on a nested queryset?
The simplest example I can think of to explain what I'm trying to accomplish is by demonstration.
I would like to write code something like
SensorReading.objects.filter(reading=1).objects.filter(sensor=1)
resulting in SQL looking like
SELECT * FROM (
SELECT * FROM SensorReading WHERE reading=1
) WHERE sensor=1;
More specifically I have a model representing readings from sensors
class SensorReading(models.Model):
    sensor = models.PositiveIntegerField()
    timestamp = models.DateTimeField()
    reading = models.IntegerField()
With this I am creating a queryset that annotates every sensor with the elapsed time since the previous reading in seconds
readings = (
SensorReading.objects.filter(**filters)
.annotate(
previous_read=Window(
expression=window.Lead("timestamp"),
partition_by=[F("sensor"),],
order_by=["timestamp",],
frame=RowRange(start=-1, end=0),
)
)
.annotate(delta=Abs(Extract(F("timestamp") - F("previous_read"), "epoch")))
)
I now want to aggregate those per sensor to see the minimum and maximum elapsed time between readings from every sensor. I initially tried
readings.values("sensor").annotate(max=Max('delta'),min=Min('delta'))[0]
however, this fails because window values cannot be used inside the aggregate.
Are there any methods or libraries to solve this without needing to resort to raw SQL? Or have I just overlooked a simpler solution to the problem?
The short answer is yes: use the id__in lookup with a Subquery (from the django.db.models module) inside filter().
The longer answer is how:
You can create a subquery that retrieves the filtered SensorReading objects and then use it in the main queryset, for example:
from django.db.models import Subquery
subquery = SensorReading.objects.filter(reading=1).values('id')
readings = SensorReading.objects.filter(id__in=Subquery(subquery), sensor=1)
The above code will generate SQL that is similar to what you described in your example:
SELECT * FROM SensorReading
WHERE id IN (SELECT id FROM SensorReading WHERE reading=1)
AND sensor=1;
Another way is to chain the filter() on the queryset that you have created, and add the second filter on top of it
readings = (
SensorReading.objects.filter(**filters)
.annotate(
previous_read=Window(
expression=window.Lead("timestamp"),
partition_by=[F("sensor"),],
order_by=["timestamp",],
frame=RowRange(start=-1, end=0),
)
)
.annotate(delta=Abs(Extract(F("timestamp") - F("previous_read"), "epoch")))
.filter(sensor=1)
)
UPDATE:
As you commented below, you can use Manager.raw() to run a raw SQL query that computes the window function values in one pass, without running the subquery multiple times. The rows it returns are mapped to model instances, and the extra selected columns are available as attributes on them.
For example, you can write a raw SQL query that retrieves the filtered SensorReading rows together with previous_read and delta computed by the window function:
raw_sql = '''
SELECT id, sensor, timestamp, reading,
LAG(timestamp) OVER (PARTITION BY sensor ORDER BY timestamp) as previous_read,
ABS(EXTRACT(EPOCH FROM timestamp - LAG(timestamp) OVER (PARTITION BY sensor ORDER BY timestamp))) as delta
FROM myapp_sensorreading
WHERE reading = 1
'''
readings = SensorReading.objects.raw(raw_sql)
Note, however, that raw() returns a RawQuerySet, which does not support values(), annotate(), or aggregate(), so the per-sensor minimum and maximum cannot be chained onto readings the way they could on a normal queryset. Either iterate over readings and aggregate in Python, or do the aggregation in SQL by wrapping the query above in an outer SELECT with GROUP BY sensor.
Be aware of the security implications of raw SQL: building the query string from user input can lead to SQL injection. Pass any user-supplied values through the params argument of raw() instead of formatting them into the string.
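Since the underlying goal is a GROUP BY over window-function output, the SQL shape needed is an outer query wrapped around the windowed one. The following sketch shows that shape with Python's built-in sqlite3 module rather than Django; the table name sensor_reading, the integer ts column, and the sample rows are assumptions for illustration, and it needs an SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_reading (sensor INTEGER, ts INTEGER, reading INTEGER)"
)
# ts is a plain epoch-seconds integer to keep the arithmetic simple.
conn.executemany(
    "INSERT INTO sensor_reading (sensor, ts, reading) VALUES (?, ?, ?)",
    [(1, 0, 1), (1, 10, 1), (1, 40, 1), (2, 5, 1), (2, 25, 1)],
)

# Inner query: seconds since the previous reading of the same sensor.
# Outer query: aggregate those deltas per sensor. MIN/MAX skip the NULL
# delta that LAG() yields for each sensor's first reading.
per_sensor = conn.execute(
    """
    SELECT sensor, MIN(delta) AS min_delta, MAX(delta) AS max_delta
    FROM (
        SELECT sensor,
               ts - LAG(ts) OVER (PARTITION BY sensor ORDER BY ts) AS delta
        FROM sensor_reading
        WHERE reading = 1
    ) AS windowed
    GROUP BY sensor
    ORDER BY sensor
    """
).fetchall()
```

The window function has to be evaluated in the inner SELECT because aggregates cannot be applied to window expressions at the same query level, which is the limitation the question ran into.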
I ended up rolling my own solution: it introspects the queryset to create a fake table for building a new queryset, and sets the alias to a node that knows how to render the SQL for the inner query.
It allows me to do something like
readings = (
NestedQuery(
SensorReading.objects.filter(**filters)
.annotate(
previous_read=Window(
expression=window.Lead("timestamp"),
partition_by=[F("sensor"),],
order_by=[
"timestamp",
],
frame=RowRange(start=-1, end=0),
)
)
.annotate(delta=Abs(Extract(F("timestamp") - F("previous_read"), "epoch")))
)
.values("sensor")
.annotate(min=Min("delta"), max=Max("delta"))
)
The code is available on GitHub, and I've published it on PyPI:
https://github.com/Simage/django-nestedquery
I have no doubt that I'm leaking the tables or some such nonsense still and this should be considered proof of concept, not any sort of production code.

Is it possible to use Django ORM to select no table fields receiving only values calculated based on real columns?

So the question is "is it possible to make the following query using Django ORM without raw statements?"
SELECT
my_table.column_a + my_table.column_b
FROM
my_table
The example for which it would be suitable from my point of view:
We have a model:
class MyOperations(models.Model):
operation_start_time = models.DateTimeField()
At some point we create a record and set the field value to Now (or we update an existing record; it doesn't matter):
MyOperations.objects.create(operation_start_time=functions.Now())
Now we want to know how much time has already passed. I would expect that Django ORM produces the following SQL statement to request data from the database (let's assume that we use MySQL backend):
SELECT
TIMESTAMPDIFF(MICROSECOND, `myapp_myoperations`.`operation_start_time`, CURRENT_TIMESTAMP) AS `time_spent`
FROM
`myapp_myoperations`
WHERE ...
So is there a way to achieve this without raw statements?
For now I settled on the following solution:
MyOperations.objects.values('operation_start_time').annotate(
diff=ExpressionWrapper(functions.Now() - F('operation_start_time'),
output_field=DurationField()
)).filter(...)
It produces the following SQL statement:
SELECT
`myapp_myoperations`.`operation_start_time`,
TIMESTAMPDIFF(MICROSECOND, `myapp_myoperations`.`operation_start_time`, CURRENT_TIMESTAMP) AS `diff`
FROM
`myapp_myoperations`
WHERE ...
Or in the Python response object representation:
{'operation_start_time': datetime(...), 'diff': timedelta(...)}
Is there a way to get the response dict with only diff, since this is the only field I am interested in?
Django ORM produced a query that requests operation_start_time, just as we had written. But if I remove the call to values altogether, it produces a query that requests all table columns.
Solution that produces the expected SQL:
We should just move the call to values to the point where diff is already known to the query:
MyOperations.objects.annotate(
diff=ExpressionWrapper(functions.Now() - F('operation_start_time'),
output_field=DurationField()
)).values('diff').filter(...)
You can use values() on your calculated field, so a query like
MyOperations.objects.values('operation_start_time').annotate(
diff=ExpressionWrapper(functions.Now() - F('operation_start_time'),
output_field=DurationField()
)).values('diff')
should give you a resulting queryset containing only your calculated 'diff'.

How to filter on calculated column of a query and meanwhile preserve mapped entities

I have a query which selects an entity A and some calculated fields
q = session.query(Recipe, func.avg(Recipe.somefield)).join(.....)
I then use what I select in a way which assumes I can subscript result with "Recipe" string:
for entry in q.all():
recipe=entry.Recipe # Access KeyedTuple by Recipe attribute
...
Now I need to wrap my query in an additional select, say to filter by calculated field AVG:
q=q.subquery();
q=session.query(q).filter(q.c.avg_1 > 1)
And now I cannot access entry.Recipe anymore!
Is there a way to make SQLAlchemy adapt a query to an enclosing one, like aliased(adapt_on_names=True) or select_from_entity()?
I tried using those but got an error.
As Michael Bayer mentioned in a relevant Google Group thread, such adaptation is already done via the Query.from_self() method. My problem was that in this case I didn't know how to refer to the column I wanted to filter on.
This is because it is calculated, i.e. there is no table to refer to!
I could resort to using literals (.filter('avg_1>10')), but I'd prefer to stay in the more ORM style.
So, this is what I came up with: an explicit column expression
row_number_column = func.row_number().over(
partition_by=Recipe.id
).label('row_number')
query = query.add_column(
row_number_column
)
query = query.from_self().filter(row_number_column == 1)

SQLAlchemy Filter based on a function of a field of a table

I am trying to filter a query based on a function of one of the properties of the table. For example, assume I have a table Days which has a property day of type DateTime. Now I just want to select the rows where the day happens to be a Monday, something like:
db.query(Days).filter(Days.day.strftime('%w')=='1')
but this does not work! SQLAlchemy somehow thinks day is another table and that strftime is not a property of that table. What is the correct way of doing it?
When writing queries, you have to realize that the columns you are accessing are SQLAlchemy columns, not Python values. They will only be Python values once you are looking at an actual row from the result of the query. When writing a query, you need to phrase it in terms of SQL expressions.
The date expressions differ between databases. For PostgreSQL, use the extract() function, which returns a number 0 (Sunday) through 6 (Saturday). For MySQL, use the dayofweek function, which returns a number 1 (Sunday) through 7 (Saturday).
# PostgreSQL
from sqlalchemy.sql import extract
session.query(MyTable).filter(extract('dow', MyTable.my_date) == 1)
# MySQL
from sqlalchemy import func
session.query(MyTable).filter(func.dayofweek(MyTable.my_date) == 2)
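Because the two databases number weekdays differently, it can help to derive the filter value from a Python date instead of hard-coding it. The helper names below are hypothetical; the sketch only encodes the numbering conventions stated above (PostgreSQL dow: Sunday is 0 through Saturday is 6; MySQL DAYOFWEEK: Sunday is 1 through Saturday is 7).

```python
from datetime import date


def postgres_dow(d):
    """Value of PostgreSQL extract('dow', ...) for date d: Sunday=0 .. Saturday=6."""
    # isoweekday() is Monday=1 .. Sunday=7, so Sunday wraps around to 0.
    return d.isoweekday() % 7


def mysql_dayofweek(d):
    """Value of MySQL DAYOFWEEK(...) for date d: Sunday=1 .. Saturday=7."""
    return d.isoweekday() % 7 + 1


monday = date(2024, 1, 1)    # a Monday
sunday = date(2024, 1, 7)
saturday = date(2024, 1, 6)

monday_values = (postgres_dow(monday), mysql_dayofweek(monday))
```

For a Monday this gives 1 for PostgreSQL and 2 for MySQL, matching the constants used in the two filters above.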

Group by date in a particular format in SQLAlchemy

I have a table called logs which has a datetime field.
I want to select the date and count of rows based on a particular date format.
How do I do this using SQLAlchemy?
I don't know of a generic SQLAlchemy answer. Most databases support some form of date formatting, typically via functions. SQLAlchemy supports calling functions via sqlalchemy.sql.func. So for example, using SQLAlchemy over a Postgres back end, and a table my_table(foo varchar(30), when timestamp) I might do something like
my_table = metadata.tables['my_table']
foo = my_table.c['foo']
the_date = func.date_trunc('month', my_table.c['when'])
stmt = select(foo, the_date).group_by(the_date)
engine.execute(stmt)
This groups by date truncated to month. But keep in mind that in that example, date_trunc() is a Postgres datetime function. Other databases will be different. You didn't mention the underlying database. If there's a database-independent way to do it, I've never found one. In my case I run production and test against Postgres and unit tests against SQLite, and have resorted to using SQLite user-defined functions in my unit tests to emulate Postgres datetime functions.
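As a rough idea of what such a SQLite user-defined function could look like, here is a sketch using the standard library's sqlite3.Connection.create_function. It is not the answerer's actual code: the date_trunc stand-in below handles only the 'month' and 'day' units, assumes timestamps are stored as ISO-formatted text, and uses made-up sample rows.

```python
import sqlite3
from datetime import datetime


def date_trunc(unit, value):
    """Very small stand-in for PostgreSQL's date_trunc(); 'month' and 'day' only."""
    if value is None:
        return None
    dt = datetime.fromisoformat(value)
    if unit == "month":
        dt = dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    elif unit == "day":
        dt = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    else:
        raise ValueError("unsupported unit: %r" % (unit,))
    return dt.isoformat(sep=" ")


conn = sqlite3.connect(":memory:")
# Register the Python function under the name the Postgres-style SQL expects.
conn.create_function("date_trunc", 2, date_trunc)

conn.execute('CREATE TABLE my_table (foo TEXT, "when" TEXT)')
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [
        ("a", "2018-12-01 19:30:38"),
        ("b", "2018-12-15 08:00:00"),
        ("c", "2019-01-02 00:00:01"),
    ],
)

by_month = conn.execute(
    """
    SELECT date_trunc('month', "when") AS month_start, COUNT(*)
    FROM my_table
    GROUP BY month_start
    ORDER BY month_start
    """
).fetchall()
```

With the function registered on the test connection, the same date_trunc('month', ...) expression that runs against Postgres can be executed in SQLite-backed unit tests.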
Does counting yield the same result when you just group by the unformatted datetime column? If so, you could just run the query and use Python date's strftime() method afterwards. i.e.
query = select([logs.c.datetime, func.count(logs.c.datetime)]).group_by(logs.c.datetime)
results = session.execute(query).fetchall()
results = [(t[0].strftime("..."), t[1]) for t in results]
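One caveat with formatting after the fact: grouping by the raw datetime yields one row per distinct timestamp, so several rows can end up with the same formatted label and their counts still need to be summed. A minimal sketch of that second pass, with a hypothetical helper name and sample rows standing in for the query results:

```python
from collections import Counter
from datetime import datetime


def regroup_counts(rows, fmt):
    """Sum (datetime, count) rows into totals keyed by the formatted date."""
    totals = Counter()
    for stamp, count in rows:
        totals[stamp.strftime(fmt)] += count
    return sorted(totals.items())


# Hypothetical rows as returned by grouping on the unformatted column.
rows = [
    (datetime(2018, 12, 1, 19, 30, 38), 2),
    (datetime(2018, 12, 1, 8, 0, 0), 1),
    (datetime(2018, 12, 2, 0, 0, 0), 4),
]

daily = regroup_counts(rows, "%Y-%m-%d")
```

This moves the grouping work into Python, which is fine for modest result sets but transfers every distinct timestamp from the database first.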
I don't know SQLAlchemy, so I could be off-target. However, I think that all you need is:
SELECT date_formatter(datetime_field, 'format-specification') AS dt_field, COUNT(*)
FROM logs
GROUP BY date_formatter(datetime_field, 'format-specification')
ORDER BY 1;
OK, maybe you don't need the ORDER BY, and maybe it would be better to re-specify the date expression. There are likely to be alternatives, such as:
SELECT dt_field, COUNT(*)
FROM (SELECT date_formatter(datetime_field, 'format-specification') AS dt_field
FROM logs) AS necessary
GROUP BY dt_field
ORDER BY dt_field;
And so on and so forth. Basically, you format the datetime field and then proceed to do the grouping etc on the formatted value.
