Coalesce results in a QuerySet - python

I have the following models:
class Property(models.Model):
    name = models.CharField(max_length=100)

    def is_available(self, avail_date_from, avail_date_to):
        # Check against the owner's specified availability
        available_periods = self.propertyavailability_set \
            .filter(date_from__lte=avail_date_from,
                    date_to__gte=avail_date_to) \
            .count()
        if available_periods == 0:
            return False
        return True
class PropertyAvailability(models.Model):
    de_property = models.ForeignKey(Property, verbose_name='Property')
    date_from = models.DateField(verbose_name='From')
    date_to = models.DateField(verbose_name='To')
    rate_sun_to_thurs = models.IntegerField(verbose_name='Nightly rate: Sun to Thurs')
    rate_fri_to_sat = models.IntegerField(verbose_name='Nightly rate: Fri to Sat')
    rate_7_night_stay = models.IntegerField(blank=True, null=True, verbose_name='Weekly rate')
    minimum_stay_length = models.IntegerField(default=1, verbose_name='Min. length of stay')

    class Meta:
        unique_together = ('date_from', 'date_to')
Essentially, each Property has its availability specified with instances of PropertyAvailability. From this, the Property.is_available() method checks to see if the Property is available during a given period by querying against PropertyAvailability.
This code works fine except for the following scenario:
Example data
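For illustration, assume one Property with these PropertyAvailability rows (hypothetical values, consistent with the description below):

    #   date_from     date_to
    1   2017-01-01    2017-01-05
    2   2017-01-06    2017-01-12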
Using the current Property.is_available() method, if I were to search for availability between the 2nd of Jan, 2017 and the 5th of Jan, 2017, it'd work because it matches #1.
But if I were to search between the 4th of Jan, 2017 and the 8th of Jan, 2017, it wouldn't return anything: the requested range overlaps multiple rows, so it matches neither #1 nor #2.
I read this earlier (it describes a similar problem and solves it by coalescing results) but had trouble expressing that with Django's ORM or getting it to work as raw SQL.
So, how can I write a query (preferably using the ORM) that will do this? Or perhaps there's a better solution that I'm unaware of?
Other notes
Both avail_date_from and avail_date_to must match up with PropertyAvailability's date_from and date_to fields:
avail_date_from must be >= PropertyAvailability.date_from
avail_date_to must be <= PropertyAvailability.date_to
This is because I need to query that a Property is available within a given period.
Software specs
Django 1.11
PostgreSQL 9.3.16

My solution would be to check whether the date_from or the date_to fields of PropertyAvailability are contained in the period we're interested in. I do this using Q objects. As mentioned in the comments above, we also need to include the PropertyAvailability objects that encompass the entire period we're interested in. If we find more than one instance, we must check if the availability objects are continuous.
from datetime import timedelta

from django.db.models import Q


class Property(models.Model):
    name = models.CharField(max_length=100)

    def is_available(self, avail_date_from, avail_date_to):
        date_range = (avail_date_from, avail_date_to)
        # Check against the owner's specified availability
        query_filter = (
            # One of the record's date fields falls within date_range
            Q(date_from__range=date_range) |
            Q(date_to__range=date_range) |
            # OR date_range falls between one record's date_from and date_to
            Q(date_from__lte=avail_date_from, date_to__gte=avail_date_to)
        )
        available_periods = self.propertyavailability_set \
            .filter(query_filter) \
            .order_by('date_from')
        # BEWARE! This might use a lot of memory if the number of returned
        # rows is large! I do this because negative indexing of a QuerySet
        # is not supported.
        available_periods = list(available_periods)

        if len(available_periods) == 1:
            # Must check if the single period covers the whole range
            return (
                available_periods[0].date_from <= avail_date_from and
                available_periods[0].date_to >= avail_date_to
            )
        elif len(available_periods) > 1:
            # Must check if the periods are continuous and cover the range
            if (
                available_periods[0].date_from > avail_date_from or
                available_periods[-1].date_to < avail_date_to
            ):
                return False
            period_end = available_periods[0].date_to
            for available_period in available_periods[1:]:
                if available_period.date_from - period_end > timedelta(days=1):
                    return False
                # max() so an overlapping period that ends earlier doesn't
                # shrink the covered window
                period_end = max(period_end, available_period.date_to)
            return True
        else:
            return False
I feel the need to mention, though, that the database model does not guarantee that there are no overlapping PropertyAvailability objects in your database. In addition, the unique constraint should most likely include the de_property field.
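For instance, a sketch of that stronger constraint:

    class PropertyAvailability(models.Model):
        ...
        class Meta:
            unique_together = ('de_property', 'date_from', 'date_to')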

What you should be able to do is aggregate the data you wish to query against, and combine any overlapping (or adjacent) ranges.
Postgres doesn't have a built-in way of doing this: it has operators for unioning and combining adjacent ranges, but nothing that will aggregate a collection of overlapping/adjacent ranges into one.
However, you can write a query that will combine them, although how to do it with the ORM is not obvious (yet).
Here is one solution (left as a comment on http://schinckel.net/2014/11/18/aggregating-ranges-in-postgres/#comment-2834554302, and tweaked to combine adjacent ranges, which appears to be what you want):
SELECT int4range(MIN(LOWER(value)), MAX(UPPER(value))) AS value
FROM (SELECT value,
MAX(new_start) OVER (ORDER BY value) AS left_edge
FROM (SELECT value,
CASE WHEN LOWER(value) <= MAX(le) OVER (ORDER BY value)
THEN NULL
ELSE LOWER(value) END AS new_start
FROM (SELECT value,
lag(UPPER(value)) OVER (ORDER BY value) AS le
FROM range_test
) s1
) s2
) s3
GROUP BY left_edge;
One way to make this queryable from within the ORM is to put it in a Postgres VIEW, and have a model that references this.
However, it is worth noting that this queries the whole source table, so you may want to have filtering applied; probably by de_property.
Something like:
CREATE OR REPLACE VIEW property_aggregatedavailability AS (
  SELECT de_property,
         MIN(date_from) AS date_from,
         MAX(date_to) AS date_to
  FROM (SELECT de_property,
               date_from,
               date_to,
               MAX(new_from) OVER (PARTITION BY de_property
                                   ORDER BY date_from) AS left_edge
        FROM (SELECT de_property,
                     date_from,
                     date_to,
                     CASE WHEN date_from <= MAX(le) OVER (PARTITION BY de_property
                                                          ORDER BY date_from)
                          THEN NULL
                          ELSE date_from
                     END AS new_from
              FROM (SELECT de_property,
                           date_from,
                           date_to,
                           LAG(date_to) OVER (PARTITION BY de_property
                                              ORDER BY date_from) AS le
                    FROM property_propertyavailability
                   ) s1
             ) s2
       ) s3
  GROUP BY de_property, left_edge
);
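To query the view from the ORM, a minimal sketch of an unmanaged model (untested; note Django needs a primary key, so either add a row_number() OVER () column named id to the view, or accept an imperfect pk choice as below):

    class PropertyAggregatedAvailability(models.Model):
        # NOTE: primary_key=True here is a convenience assumption; a property
        # with several disjoint availability windows would violate it, so a
        # synthetic id column on the view is the safer option.
        de_property = models.ForeignKey(Property, primary_key=True,
                                        db_column='de_property')
        date_from = models.DateField()
        date_to = models.DateField()

        class Meta:
            managed = False  # no migrations; this maps the view above
            db_table = 'property_aggregatedavailability'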
As an aside, you might want to consider using Postgres's date range objects, because then you can prevent start > finish (automatically), but also prevent overlapping periods for a given property, using exclusion constraints.
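A hedged sketch of how that could be wired up from Django with a raw-SQL migration (the constraint name, migration dependency, and the de_property_id column name are my assumptions; the btree_gist extension is needed for the equality part of the constraint):

    from django.db import migrations


    class Migration(migrations.Migration):
        dependencies = [
            ('property', '0002_propertyavailability'),  # hypothetical
        ]

        operations = [
            migrations.RunSQL(
                sql=[
                    "CREATE EXTENSION IF NOT EXISTS btree_gist;",
                    # Reject rows whose [date_from, date_to] range overlaps
                    # an existing range for the same property.
                    "ALTER TABLE property_propertyavailability "
                    "ADD CONSTRAINT no_overlapping_availability "
                    "EXCLUDE USING gist (de_property_id WITH =, "
                    "daterange(date_from, date_to, '[]') WITH &&);",
                ],
                reverse_sql=[
                    "ALTER TABLE property_propertyavailability "
                    "DROP CONSTRAINT no_overlapping_availability;",
                ],
            ),
        ]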
Finally, an alternative solution might be to have a derived table that stores unavailability, based on taking the available periods and reversing them. This makes writing the query simpler, as you can write a direct overlap test and negate it (i.e., a property is available for a given period iff there are no overlapping unavailable periods). I do that in a production system for staff availability/unavailability, where many checks need to be made. Note that this is a denormalised solution, and relies on trigger functions (or other updates) to keep it in sync.
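A minimal sketch of that idea (PropertyUnavailability is a hypothetical model, and keeping it in sync with PropertyAvailability via triggers is the part not shown):

    class PropertyUnavailability(models.Model):
        de_property = models.ForeignKey(Property)
        date_from = models.DateField()
        date_to = models.DateField()


    def is_available(self, avail_date_from, avail_date_to):
        # Available iff no unavailable period overlaps the requested range.
        return not self.propertyunavailability_set.filter(
            date_from__lte=avail_date_to,
            date_to__gte=avail_date_from,
        ).exists()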

Related

How do I include columns properly in a GROUP BY clause of my Django aggregate query?

I'm using Django and Python 3.7. I have the following model ...
class ArticleStat(models.Model):
    objects = ArticleStatManager()
    article = models.ForeignKey(Article, on_delete=models.CASCADE,
                                related_name='articlestats')
    elapsed_time_in_seconds = models.IntegerField(default=0, null=False)
    score = models.FloatField(default=0, null=False)
I want to write a MAX/GROUP BY query subject to certain conditions. Specifically I want each row to contain
MAX(ArticleStat.elapsed_time_in_seconds)
ArticleStat.Article.id
ArticleStat.Article.title
ArticleStat.score
in which the columns "ArticleStat.Article.id," "ArticleStat.Article.title," and "ArticleStat.score" are unique per result set row. So I tried this ...
def get_current_articles(self, article):
    qset = ArticleStat.objects.values('article__id', 'article__title', 'score').filter(
        article__article=article).values('elapsed_time_in_seconds').annotate(
        max_date=Max('elapsed_time_in_seconds'))
    print(qset.query)
    return qset
However, the resulting SQL does not include the values I want to use in my GROUP BY clause (notice that neither "article" nor "score" is in the GROUP BY) ...
SELECT "myproject_articlestat"."elapsed_time_in_seconds",
MAX("myproject_articlestat"."elapsed_time_in_seconds") AS "max_date"
FROM "myproject_articlestat"
INNER JOIN "myproject_article" ON ("myproject_articlestat"."article_id" = "myproject_article"."id")
WHERE ("myproject_article"."article_id" = 2) GROUP BY "myproject_articlestat"."elapsed_time_in_seconds"
How do I modify my Django query to generate SQL consistent with what I want?
I don't think the answer by @Oleg will work, but it's close.
A Subquery expression can be used to select a single value from another queryset. To accomplish this, you sort by the value you wish to target, then select the first value.
sq = (
    ArticleStat.objects
    .filter(article=OuterRef('pk'))
    .order_by('-elapsed_time_in_seconds')
)

articles = (
    Article.objects
    .annotate(
        max_date=Subquery(sq.values('elapsed_time_in_seconds')[:1]),
        score=Subquery(sq.values('score')[:1]),
    )
    # .values('id', 'path', 'title', 'score', 'max_date')
)
You should not use 'elapsed_time_in_seconds' in your .values(..) clause, and you should add the GROUP BY columns to the .order_by(..) clause:
qset = ArticleStat.objects.filter(
    article__article=article
).values(
    'article__path', 'article__title', 'score'
).annotate(
    max_date=Max('elapsed_time_in_seconds')
).order_by('article__path', 'article__title', 'score')
This will thus make a QuerySet of dictionaries such that each dictionary contains four elements: 'article__path', 'article__title', 'score', and 'max_date'.
As far as I can see, there is no one-step way of doing this in Django.
One way of doing it is with two queries: use .annotate() to add the max id of the related object revision for each object, then fetch all those object revisions.
Example:
objects = Object.objects.all().annotate(revision_id=Max('objectrevision__id'))
objectrevisions = ObjectRevision.objects.filter(
    id__in=[o.revision_id for o in objects])
This is untested and also a bit slow, so maybe you can try writing custom SQL instead, as mentioned by Wolfram Kriesing in the blog here.
If I understand correctly from all the comments:
The result is to get Articles (id, path, title) filtered by the article argument of the get_current_articles method, with additional data: the maximum elapsed_time_in_seconds over all the ArticleStats of each filtered article, and also the score of the ArticleStat with that maximum elapsed_time.
If so, then the base query can be on the Article model: Article.objects.filter(article=article).
We can annotate it with the Max() of the corresponding ArticleStats. This can be done directly on the main query, .annotate(max_date=Max('articlestats__elapsed_time_in_seconds')), or with a Subquery on ArticleStat filtered the same way as the base query (we want the subquery to run on the same set of article objects as the main query), i.e.
max_sq = ArticleStat.objects.filter(
    article__article=OuterRef('article')
).annotate(max_date=Max('elapsed_time_in_seconds'))
Now, to add the score column to the result: Max() is an aggregate function, so it carries no row info. To get the score for the maximum elapsed_time, we can make another subquery and filter by the max elapsed_time from the previous column.
Note: this filter can return multiple ArticleStat objects with the same maximum elapsed_time for an article, but we will use only the first one. It is up to your data to ensure the filter returns only one row, or to provide additional filtering or ordering so that the first result is the required one.
score_sq = ArticleStat.objects.filter(
    article__article=OuterRef('article'),
    elapsed_time_in_seconds=OuterRef('max_date')
)
And use our subqueries in the main query:
qset = Article.objects.filter(
    article=article
).annotate(
    max_date=Max('articlestats__elapsed_time_in_seconds'),
    # or:
    # max_date=Subquery(
    #     max_sq.values('max_date')[:1]
    # ),
    score=Subquery(
        score_sq.values('score')[:1]
    )
).values(
    'id', 'path', 'title', 'score', 'max_date'
)
And a somewhat tricky option that does not use the Max() function at all, emulating it with ORDER BY and fetching the first row:
artstat_sq = ArticleStat.objects.filter(
    article__article=OuterRef('article')
).order_by('-elapsed_time_in_seconds', '-score')
# a single order_by() call replaces any base ordering from Meta
qset = Article.objects.filter(
    article=article
).annotate(
    max_date=Subquery(
        artstat_sq.values('elapsed_time_in_seconds')[:1]
    ),
    score=Subquery(
        artstat_sq.values('score')[:1]
    )
).values(
    'id', 'path', 'title', 'score', 'max_date'
)

Better Alternate instead of using chained union queries in Django ORM

I needed to achieve something like this in the Django ORM:
(SELECT * FROM `stats` WHERE MODE = 1 ORDER BY DATE DESC LIMIT 2)
UNION
(SELECT * FROM `stats` WHERE MODE = 2 ORDER BY DATE DESC LIMIT 2)
UNION
(SELECT * FROM `stats` WHERE MODE = 3 ORDER BY DATE DESC LIMIT 2)
UNION
(SELECT * FROM `stats` WHERE MODE = 6 ORDER BY DATE DESC LIMIT 2)
UNION
(SELECT * FROM `stats` WHERE MODE = 5 AND is_completed != 3 ORDER BY DATE DESC)
# mode 5 can return more than 100 records so NO LIMIT here
for which I wrote this:
query_run_now_job_ids = Stats.objects.filter(mode=5).exclude(is_completed=3).order_by('-date')
list_of_active_job_ids = Stats.objects.filter(mode=1).order_by('-date')[:2].union(
    Stats.objects.filter(mode=2).order_by('-date')[:2],
    Stats.objects.filter(mode=3).order_by('-date')[:2],
    Stats.objects.filter(mode=6).order_by('-date')[:2],
    query_run_now_job_ids)
But somehow the returned list_of_active_job_ids is unordered, i.e. list_of_active_job_ids.ordered returns False, due to which, when this queryset is passed to the Paginator class, it gives:
UnorderedObjectListWarning:
Pagination may yield inconsistent results with an unordered object_list
I have already set ordering in class Meta in models.py
class Meta:
    ordering = ['-date']
Without the paginator the query works fine and the page loads, but with the paginator the view never finishes; it keeps on loading.
Is there any better alternative for achieving this without using a chain of unions?
So I tried another alternative to the above MySQL query, but I'm stuck on another problem: how to write the condition for mode = 5 in this query:
SELECT
    MODE,
    SUBSTRING_INDEX(GROUP_CONCAT(`job_id` SEPARATOR ','), ',', 2) AS job_id_list,
    SUBSTRING_INDEX(GROUP_CONCAT(`total_calculations` SEPARATOR ','), ',', 2) AS total_calculations
FROM `stats`
GROUP BY MODE
ORDER BY DATE DESC
Even if I were able to write this query, it would lead me to another challenge: converting it to the Django ORM.
So why is my query not ordered even when I have set ordering in class Meta?
Also, if not this query, is there any better alternative for achieving this?
Help would be appreciated!
I'm using Python 2.7 and Django 1.11.
While subqueries may be ordered, the resulting union data is not. You need to explicitly define the ordering.
from django.db import models


def make_query(mode, index):
    return (
        Stats.objects.filter(mode=mode)
        .annotate(_sort=models.Value(index, models.IntegerField()))
        .order_by('-date')
    )


list_of_active_job_ids = make_query(1, 1)[:2].union(
    make_query(2, 2)[:2],
    make_query(3, 3)[:2],
    make_query(6, 4)[:2],
    make_query(5, 5).exclude(is_completed=3)
).order_by('_sort', '-date')
All I did was add a new literal value field, _sort, that has a different value for each subquery, and then order by it in the final query. The rest of the code is just there to reduce duplication. It would have been even cleaner if it weren't for that mode=6 subquery.
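With the explicit ordering in place, the union can be handed to the paginator without the UnorderedObjectListWarning; a minimal usage sketch (the page size is arbitrary):

    from django.core.paginator import Paginator

    paginator = Paginator(list_of_active_job_ids, 25)  # 25 rows per page
    page_one = paginator.page(1)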

Sqlalchemy mySQL optimize query

General:
I need to create a statistics tool from a given DB with many hundreds of thousands of entries. So I never need to write to the DB, only read data.
Problem:
I have a user table; in my case I select 20k users (between two dates). Now I need to select only those users (from these 20k) who spent money at least once.
To do so I have 3 different tables that record whether a user spent money (so we work here with 4 tables in total):
User, Transaction_1, Transaction_2, Transaction_3
What I did so far:
In the User model I have created a property which checks whether the user appears at least once in one of the Transaction tables:
@property
def spent_money_once(self):
    spent_money_atleast_once = False

    in_transactions = Transaction_1.query.filter(Transaction_1.user_id == self.id).first()
    if in_transactions:
        spent_money_atleast_once = True
        return spent_money_atleast_once

    in_transactionsVK = Transaction_2.query.filter(Transaction_2.user_id == self.id).first()
    if in_transactionsVK:
        spent_money_atleast_once = True
        return spent_money_atleast_once

    in_transactionsStripe = Transaction_3.query.filter(Transaction_3.user_id == self.id).first()
    if in_transactionsStripe:
        spent_money_atleast_once = True
        return spent_money_atleast_once

    return spent_money_atleast_once
Then I created two counters for male and female users, so I can count how many of these 20k users spent money at least once:
males_payed_atleast_once = 0
females_payed_atleast_once = 0

for male_user in male_users.all():
    if male_user.spent_money_once is True:
        males_payed_atleast_once += 1

for female_user in female_users.all():
    if female_user.spent_money_once is True:
        females_payed_atleast_once += 1
But this takes a really long time to calculate, around 40-60 minutes. I have never worked with such huge amounts of data; maybe this is normal?
Additional info:
In case you wonder how male_users and female_users look like:
# Note: is this even efficient? If .all() completes the query, then I need to
# store the .all() result in variables; otherwise every time I call .all() it
# takes time.
global all_users
global male_users
global female_users

all_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date)
male_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date, Users.gender == "1")
female_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date, Users.gender == "2")
I am trying to save certain queries in global variables to improve performance.
I am using Python 3 | Flask | Sqlalchemy for this task. The DB is MySQL.
I have now tried a completely different approach using join, and it is way faster: the query that used to take ~60 minutes now completes in 10 seconds:
# males
paying_males_1 = male_users.join(Transaction_1, Transaction_1.user_id == Users.id).all()
paying_males_2 = male_users.join(Transaction_2, Transaction_2.user_id == Users.id).all()
paying_males_3 = male_users.join(Transaction_3, Transaction_3.user_id == Users.id).all()

males_payed_all = paying_males_1 + paying_males_2 + paying_males_3
males_payed_atleast_once = len(set(males_payed_all))
I simply join each table and call .all(); the results are plain lists. After that I merge all the lists and convert them to a set, so only the unique users remain. The last step is to count them by using len() on the set.
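If memory becomes a concern, a variant of the same idea (a sketch, untested) fetches only the user ids instead of full model objects:

    paying_male_ids = set()
    for t in (Transaction_1, Transaction_2, Transaction_3):
        rows = male_users.join(t, t.user_id == Users.id) \
                         .with_entities(Users.id).all()
        paying_male_ids.update(row[0] for row in rows)

    males_payed_atleast_once = len(paying_male_ids)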
Assuming you need to aggregate the info of the 3 tables together before counting, this will be a bit faster:
SELECT userid, SUM(ct) AS total
FROM (
        ( SELECT userid, COUNT(*) AS ct FROM trans1 GROUP BY userid )
        UNION ALL
        ( SELECT userid, COUNT(*) AS ct FROM trans2 GROUP BY userid )
        UNION ALL
        ( SELECT userid, COUNT(*) AS ct FROM trans3 GROUP BY userid )
     ) AS t
GROUP BY userid
HAVING total >= 1
I recommend you test this in the mysql command-line tool, then figure out how to convert it to Python 3 | Flask | Sqlalchemy.
Funny thing about packages that "hide the database": you still need to understand how the database works if you are going to do anything non-trivial.
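A rough SQLAlchemy translation of that aggregate might look like this (a sketch, untested; it assumes the Transaction_1/2/3 models above and a session such as db.session from Flask-SQLAlchemy):

    from sqlalchemy import func

    # One COUNT-per-user query per transaction table.
    counts = [
        session.query(t.user_id.label('userid'), func.count().label('ct'))
               .group_by(t.user_id)
        for t in (Transaction_1, Transaction_2, Transaction_3)
    ]

    # UNION ALL the three, then sum the counts per user in an outer query.
    u = counts[0].union_all(*counts[1:]).subquery()
    paying_users = (
        session.query(u.c.userid, func.sum(u.c.ct).label('total'))
        .group_by(u.c.userid)
        .having(func.sum(u.c.ct) >= 1)  # redundant, kept to mirror the SQL
        .all()
    )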

How to convert SQL scalar subquery to SQLAlchemy expression

I need a little help expressing code like this in SQLAlchemy:
SELECT
    s.agent_id,
    s.property_id,
    p.address_zip,
    (
        SELECT v.valuation
        FROM property_valuations v
        WHERE v.zip_code = p.address_zip
        ORDER BY ABS(DATEDIFF(v.as_of, s.date_sold))
        LIMIT 1
    ) AS back_valuation
FROM sales s
JOIN properties p ON s.property_id = p.id
The inner subquery is meant to get, from the table property_valuations with columns (zip_code INT, valuation DECIMAL, as_of DATE), the valuation closest to the date of sale in the sales table. I know how to rewrite most of it, but I'm completely stuck on the order_by expression: I cannot prepare the subquery so as to pass the ordering member later.
Currently I have the following queries:
subquery = (
    session.query(PropertyValuation)
    .filter(PropertyValuation.zip_code == Property.address_zip)
    .order_by(func.abs(func.datediff(PropertyValuation.as_of, Sale.date_sold)))
    .limit(1)
)

query = session.query(Sale).join(Sale.property_)
How to combine these queries together?
Use as_scalar(), or label():
subquery = (
    session.query(PropertyValuation.valuation)
    .filter(PropertyValuation.zip_code == Property.address_zip)
    .order_by(func.abs(func.datediff(PropertyValuation.as_of, Sale.date_sold)))
    .limit(1)
)

query = session.query(Sale.agent_id,
                      Sale.property_id,
                      Property.address_zip,
                      # `subquery.as_scalar()` or
                      subquery.label('back_valuation')) \
    .join(Property)
Using as_scalar() limits returned columns and rows to 1, so you cannot get the whole model object using it (as query(PropertyValuation) is a select of all the attributes of PropertyValuation), but getting just the valuation attribute works.
but I completely stuck on order_by expression - I cannot prepare subquery to pass ordering member later.
There's no need to pass it later. Your current way of declaring the subquery is fine as it is, since SQLAlchemy can automatically correlate FROM objects to those of an enclosing query. I tried creating models that somewhat represent what you have, and here's how the query above works out (with added line-breaks and indentation for readability):
In [10]: print(query)
SELECT sale.agent_id AS sale_agent_id,
sale.property_id AS sale_property_id,
property.address_zip AS property_address_zip,
(SELECT property_valuations.valuation
FROM property_valuations
WHERE property_valuations.zip_code = property.address_zip
ORDER BY abs(datediff(property_valuations.as_of, sale.date_sold))
LIMIT ? OFFSET ?) AS back_valuation
FROM sale
JOIN property ON property.id = sale.property_id

extra column based on aggregated results in Django

I have a table storing KLine information including security id, max/min price and date and want to calculate the gains for each security in a certain period. Here's my function
def get_rising_rate(start, end):
    return models.KLine.objects.\
        filter(start_time__gt=start).\
        filter(start_time__lt=end).\
        values("security__id").\
        annotate(min_price=django_models.Min("min_price_in_cent"),
                 max_price=django_models.Max("max_price_in_cent")).\
        extra(select={"gain": "max_price_in_cent/min_price_in_cent"}).\
        order_by("gain")
But I got the following error:
django.db.utils.OperationalError: (1247, "Reference 'max_price' not supported (reference to group function)")
I can do the query with raw SQL like
SELECT
`security_id`,
`min_price`,
`max_price`,
`max_price`/`min_price` AS gain
FROM(
SELECT
`security_id`,
MIN(`min_price_in_cent`) AS `min_price`,
MAX(`max_price_in_cent`) AS `max_price`
FROM `stocks_kline`
WHERE `start_time` > '2014-12-31 16:00:00' AND `start_time` < '2015-12-31 16:00:00'
GROUP BY `security_id`
) AS A
ORDER BY gain DESC
But I wonder if there's a more "django" way to get it done. I've searched "django join queryset", "django queryset as derived tables" but can't get a solution.
Thanks in advance.
Can you try this:
def get_rising_rate(start, end):
    return models.KLine.objects.\
        filter(start_time__gt=start).\
        filter(start_time__lt=end).\
        values("security_id").\
        annotate(min_price=django_models.Min("min_price_in_cent"),
                 max_price=django_models.Max("max_price_in_cent")).\
        extra(select={"gain": "max(max_price_in_cent)/min(min_price_in_cent)"}).\
        order_by("gain")
