SQLAlchemy ignoring specific fields in a query - python

I am using SQLAlchemy with Flask to talk to a Postgres DB. I have a Customer model with a date_of_birth field defined like this:
class Customer(Base):
    __tablename__ = 'customer'
    id = Column(Integer, primary_key=True)
    date_of_birth = Column(Date)
Now, I am trying to filter these by a minimum age like this:
today = date.today()
q.filter(
    today.year - extract('year', Customer.date_of_birth)
    - cast(
        (today.month, today.day)
        < (extract('month', Customer.date_of_birth), extract('day', Customer.date_of_birth)),
        Integer,
    )
    >= 5
)
But the generated SQL seems to ignore the day part and replace it with a constant. It looks like this
SELECT customer.date_of_birth AS customer_date_of_birth,
FROM customer
WHERE (2017 - EXTRACT(year FROM customer.date_of_birth)) - CAST(EXTRACT(month FROM customer.date_of_birth) > 2 AS INTEGER) >= 5
The generated SQL is exactly the same when I remove the day part from the query. Why is SQLAlchemy ignoring it?

This is because you're comparing two tuples:
(today.month, today.day) < (extract('month', Customer.date_of_birth), extract('day', Customer.date_of_birth))
Python compares tuples element by element: it looks at the first elements, and only if those are equal does it move on to the second. In your case the whole expression therefore collapses to a comparison of the first elements, and the day part never makes it into the SQL.
Instead of a plain Python tuple on the SQL side, use SQLAlchemy's tuple_ construct, like this:
(today.month, today.day) < tuple_(extract('month', Customer.date_of_birth), extract('day', Customer.date_of_birth))
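Putting it together, the filter might look like this (a sketch, assuming q is a query over Customer; tuple_, extract, cast and Integer all come from sqlalchemy):
from datetime import date
from sqlalchemy import Integer, cast, extract, tuple_

today = date.today()
q = q.filter(
    today.year - extract('year', Customer.date_of_birth)
    - cast(
        (today.month, today.day)
        < tuple_(extract('month', Customer.date_of_birth),
                 extract('day', Customer.date_of_birth)),
        Integer,
    )
    >= 5
)
On PostgreSQL this renders the month/day pair as a row-value comparison, so both parts end up in the generated SQL.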

Related

Django ORM: How to group on a value and get a different value of last element in that group

I have been trying to tackle this problem all week, but I just can't seem to find the solution.
Basically I want to group on 2 values (user and assignment), then take the last element of each group based on date and sum those scores. Below is a description of the problem.
With Postgres this would be easily solved by using .distinct("value"), but unfortunately I do not use Postgres.
Any help would be much appreciated!
UserAnswer
- user
- assignment
- date
- answer
- score
So I want to group on all user/assignment combinations, then get the score of the last element in each group. So basically:
user_1, assignment_1, 2019, score 1
user_1, assignment_1, 2020, score 2 <- Take this one
user_2, assignment_1, 2020, score 1
user_2, assignment_1, 2021, score 2 <- Take this one
My best attempt is using annotation but then I do not have the score value anymore:
(UserAnswer.objects
    .filter(user=student, assignment__in=assignments)
    .values("user", "assignment")
    .annotate(latest_date=Max('date')))
In the end, I had to use a raw query rather than Django's ORM.
subquery2 = UserAnswer.objects.raw(
    "SELECT id, user_id, assignment_id, score, MAX(date) AS latest_date "
    "FROM soforms_useranswer "
    "GROUP BY user_id, assignment_id"
)
# The raw queryset from the query above is very similar to the
# queryset you get from a Django ORM query. The difference is that
# 'id' and 'score' are now part of the selected fields, so we can
# retrieve them later, like below.
sum2 = 0
for obj in subquery2:
    print(obj.score)
    sum2 += obj.score
print('sum2 is')
print(sum2)
Here, I assumed that both user and assignment are foreign keys, something like below:
class Assignment(models.Model):
    name = models.CharField(max_length=50)

class UserAnswer(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE, related_name='answers')
    assignment = models.ForeignKey(Assignment, on_delete=models.CASCADE)
    # assignment = models.CharField(max_length=200)
    score = models.IntegerField()
    date = models.DateTimeField(default=timezone.now)
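For completeness, on databases without DISTINCT ON, "latest row per group" can often be expressed purely in the ORM with a correlated subquery. This is only an untested sketch against the models above, not part of the original answer:
from django.db.models import OuterRef, Subquery

# Latest score for each (user, assignment) pair via a correlated subquery.
latest_per_group = (
    UserAnswer.objects
    .filter(user=OuterRef('user'), assignment=OuterRef('assignment'))
    .order_by('-date')
)

rows = (
    UserAnswer.objects
    .filter(user=student, assignment__in=assignments)
    .values('user', 'assignment')
    .annotate(latest_score=Subquery(latest_per_group.values('score')[:1]))
    .distinct()
)

total = sum(row['latest_score'] for row in rows)
The summation is done in Python here, mirroring the loop in the raw-query version.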

Identify which values in a list don't exist in a table column using SQLAlchemy

I have a list cities = ['Rome', 'Barcelona', 'Budapest', 'Ljubljana']
Then, I have an SQLAlchemy model as follows:
class Fly(Base):
    __tablename__ = 'fly'
    pkid = Column('pkid', INTEGER(unsigned=True), primary_key=True, nullable=False)
    city = Column('city', VARCHAR(45), unique=True, nullable=False)
    country = Column('country', VARCHAR(45))
    flight_no = Column('Flight', VARCHAR(45))
I need to check whether ALL the values in the given cities list exist in the fly table, using SQLAlchemy. Return true only if ALL the cities exist in the table. If even a single city doesn't exist in the table, I need to return false along with the list of cities that don't exist. How can I do that? Any ideas/hints/suggestions? I'm using MySQL.
One way would be to create a (temporary) relation based on the given list and take the set difference between it and the cities from the fly table. In other words, create a union of the values from the list [1]:
from sqlalchemy import union, select, literal
cities_union = union(*[select([literal(v)]) for v in cities])
Then take the difference:
sq = cities_union.select().except_(select([Fly.city]))
and check that no rows are left after the difference:
res = session.query(~exists(sq)).scalar()
For the list of cities missing from the fly table, omit the (NOT) EXISTS and fetch the rows instead:
res = session.execute(sq).fetchall()
[1] Other database vendors may offer alternative methods for producing relations from arrays, such as PostgreSQL and its unnest().
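Putting the pieces together, an end-to-end version might look like this (a sketch assuming a configured session and the Fly model above, written against the 1.x-style select([...]) API used in the answer):
from sqlalchemy import exists, literal, select, union

cities = ['Rome', 'Barcelona', 'Budapest', 'Ljubljana']

# Build a relation out of the Python list: one row per city.
cities_union = union(*[select([literal(v)]) for v in cities])

# Cities from the list that are not present in the fly table.
sq = cities_union.select().except_(select([Fly.city]))

all_present = session.query(~exists(sq)).scalar()  # True if nothing is missing
missing = [row[0] for row in session.execute(sq)]   # names of the missing cities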

Coalesce results in a QuerySet

I have the following models:
class Property(models.Model):
    name = models.CharField(max_length=100)

    def is_available(self, avail_date_from, avail_date_to):
        # Check against the owner's specified availability
        available_periods = self.propertyavailability_set \
            .filter(date_from__lte=avail_date_from,
                    date_to__gte=avail_date_to) \
            .count()
        if available_periods == 0:
            return False
        return True

class PropertyAvailability(models.Model):
    de_property = models.ForeignKey(Property, verbose_name='Property')
    date_from = models.DateField(verbose_name='From')
    date_to = models.DateField(verbose_name='To')
    rate_sun_to_thurs = models.IntegerField(verbose_name='Nightly rate: Sun to Thurs')
    rate_fri_to_sat = models.IntegerField(verbose_name='Nightly rate: Fri to Sat')
    rate_7_night_stay = models.IntegerField(blank=True, null=True, verbose_name='Weekly rate')
    minimum_stay_length = models.IntegerField(default=1, verbose_name='Min. length of stay')

    class Meta:
        unique_together = ('date_from', 'date_to')
Essentially, each Property has its availability specified with instances of PropertyAvailability. From this, the Property.is_available() method checks to see if the Property is available during a given period by querying against PropertyAvailability.
This code works fine except for the following scenario:
Example data
Using the current Property.is_available() method, if I were to search for availability between the 2nd of Jan, 2017 and the 5th of Jan, 2017, it would work because it matches #1.
But if I were to search between the 4th of Jan, 2017 and the 8th of Jan, 2017, it wouldn't return anything, because the date range overlaps multiple records: it matches neither #1 nor #2.
I read this earlier (which introduced a similar problem and solution through coalescing results) but had trouble writing that using Django's ORM or getting it to work with raw SQL.
So, how can I write a query (preferably using the ORM) that will do this? Or perhaps there's a better solution that I'm unaware of?
Other notes
Both avail_date_from and avail_date_to must match up with PropertyAvailability's date_from and date_to fields:
avail_date_from must be >= PropertyAvailability.date_from
avail_date_to must be <= PropertyAvailability.date_to
This is because I need to query that a Property is available within a given period.
Software specs
Django 1.11
PostgreSQL 9.3.16
My solution would be to check whether the date_from or the date_to fields of PropertyAvailability are contained in the period we're interested in. I do this using Q objects. As mentioned in the comments above, we also need to include the PropertyAvailability objects that encompass the entire period we're interested in. If we find more than one instance, we must check if the availability objects are continuous.
from datetime import timedelta
from django.db.models import Q

class Property(models.Model):
    name = models.CharField(max_length=100)

    def is_available(self, avail_date_from, avail_date_to):
        date_range = (avail_date_from, avail_date_to)
        # Check against the owner's specified availability
        query_filter = (
            # One of the records' date fields falls within date_range
            Q(date_from__range=date_range) |
            Q(date_to__range=date_range) |
            # OR date_range falls between one record's date_from and date_to
            Q(date_from__lte=avail_date_from, date_to__gte=avail_date_to)
        )
        available_periods = self.propertyavailability_set \
            .filter(query_filter) \
            .order_by('date_from')

        # BEWARE! This might suck up a lot of memory if the number of returned rows is large!
        # I do this because negative indexing of a `QuerySet` is not supported.
        available_periods = list(available_periods)

        if len(available_periods) == 1:
            # must check if availability matches the range
            return (
                available_periods[0].date_from <= avail_date_from and
                available_periods[0].date_to >= avail_date_to
            )
        elif len(available_periods) > 1:
            # must check if the periods are continuous and match the range
            if (
                available_periods[0].date_from > avail_date_from or
                available_periods[-1].date_to < avail_date_to
            ):
                return False
            period_end = available_periods[0].date_to
            for available_period in available_periods[1:]:
                if available_period.date_from - period_end > timedelta(days=1):
                    return False
                else:
                    period_end = available_period.date_to
            return True
        else:
            return False
I feel the need to mention, though, that the database model does not guarantee that there are no overlapping PropertyAvailability objects in your database. In addition, the unique constraint should most likely also contain the de_property field.
What you should be able to do is aggregate the data you wish to query against, and combine any overlapping (or adjacent) ranges.
Postgres doesn't have a built-in way of doing this: it has operators for unioning and combining adjacent ranges, but no aggregate that collapses a collection of overlapping/adjacent ranges into one.
However, you can write a query that will combine them, although how to do it with the ORM is not obvious (yet).
Here is one solution (left as a comment on http://schinckel.net/2014/11/18/aggregating-ranges-in-postgres/#comment-2834554302, and tweaked to combine adjacent ranges, which appears to be what you want):
SELECT int4range(MIN(LOWER(value)), MAX(UPPER(value))) AS value
FROM (SELECT value,
             MAX(new_start) OVER (ORDER BY value) AS left_edge
      FROM (SELECT value,
                   CASE WHEN LOWER(value) <= MAX(le) OVER (ORDER BY value)
                        THEN NULL
                        ELSE LOWER(value) END AS new_start
            FROM (SELECT value,
                         lag(UPPER(value)) OVER (ORDER BY value) AS le
                  FROM range_test
                 ) s1
           ) s2
     ) s3
GROUP BY left_edge;
One way to make this queryable from within the ORM is to put it in a Postgres VIEW, and have a model that references this.
However, it is worth noting that this queries the whole source table, so you may want to have filtering applied; probably by de_property.
Something like:
CREATE OR REPLACE VIEW property_aggregatedavailability AS (
  SELECT de_property,
         MIN(date_from) AS date_from,
         MAX(date_to) AS date_to
  FROM (SELECT de_property,
               date_from,
               date_to,
               MAX(new_from) OVER (PARTITION BY de_property
                                   ORDER BY date_from) AS left_edge
        FROM (SELECT de_property,
                     date_from,
                     date_to,
                     CASE WHEN date_from <= MAX(le) OVER (PARTITION BY de_property
                                                          ORDER BY date_from)
                          THEN NULL
                          ELSE date_from
                     END AS new_from
              FROM (SELECT de_property,
                           date_from,
                           date_to,
                           LAG(date_to) OVER (PARTITION BY de_property
                                              ORDER BY date_from) AS le
                    FROM property_propertyavailability
                   ) s1
             ) s2
       ) s3
  GROUP BY de_property, left_edge
);
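To query that view from the ORM, an unmanaged model pointed at it could look like the sketch below (the field and table names are assumptions taken from the SQL above; a view has no real primary key, so one column is designated purely for Django's benefit):
from django.db import models

class PropertyAggregatedAvailability(models.Model):
    de_property = models.ForeignKey('Property', on_delete=models.DO_NOTHING,
                                    db_column='de_property')
    date_from = models.DateField(primary_key=True)  # stand-in PK for Django
    date_to = models.DateField()

    class Meta:
        managed = False  # backed by the database view, not a migration
        db_table = 'property_aggregatedavailability'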
As an aside, you might want to consider using Postgres's date range types, because then you can prevent start > finish automatically, and can also prevent overlapping periods for a given property using an exclusion constraint.
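For illustration only: with a much newer Django (3.0+) than the 1.11 used here, and a DateRangeField instead of the two date columns, such a constraint could be declared directly on the model. This is a sketch of the idea, not a drop-in change:
from django.contrib.postgres.constraints import ExclusionConstraint
from django.contrib.postgres.fields import DateRangeField, RangeOperators
from django.db import models

class PropertyAvailability(models.Model):
    de_property = models.ForeignKey('Property', on_delete=models.CASCADE)
    period = DateRangeField()  # would replace date_from/date_to

    class Meta:
        constraints = [
            # No two availability rows for the same property may overlap.
            # Requires the btree_gist extension for the equality operator.
            ExclusionConstraint(
                name='exclude_overlapping_availability',
                expressions=[
                    ('period', RangeOperators.OVERLAPS),
                    ('de_property', RangeOperators.EQUAL),
                ],
            ),
        ]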
Finally, an alternative solution might be to have a derived table that stores unavailability, built by taking the available periods and inverting them. This makes the query simpler, as you can write a direct overlap test and negate it (i.e., a property is available for a given period iff there are no overlapping unavailable periods). I do that in a production system for staff availability/unavailability, where many checks need to be made. Note that this is a denormalised solution, and it relies on trigger functions (or other updates) to keep it in sync.

MySQL, join vs string parsing

I should warn up front that this is a somewhat long question, so please bear with me. One of my projects (started very recently) had a table which looked like this:
name (varchar)
scores (text)
Example value is like this -
['AAA', '[{"score": 3.0, "subject": "Algebra"}, {"score": 5.0, "subject": "geography"}]']
As you can see the second field is the string representation of a JSON array.
In good faith, I redesigned this table into the following two tables
table-name:
id - Int, auto_inc, primary_key
name - varchar
table-scores:
id - int, auto_inc, primary_key
subject - varchar
score- float
name - int, FK to table-name
I have the following code in my Python file to represent the tables (at this point I assume that you are familiar with Python and SQLAlchemy, so I will skip the imports to keep it shorter):
Base = declarative_base()

class Name(Base):
    __tablename__ = "name_table"
    id = Column(Integer, primary_key=True)
    name = Column(String(255), index=True)

class Score(Base):
    __tablename__ = "score_table"
    id = Column(Integer, primary_key=True)
    subject = Column(String(255), index=True)
    score = Column(Float)
    name = Column(ForeignKey('Name.id'), nullable=False, index=True)
    Name = relationship(u'Name')
The first table has ~ 778284 rows whereas the second table has ~ 907214 rows.
After declaring them and populating them with the initial data, I ran an experiment. The goal: find all the subjects whose score is > 5.0 for a given name (for the moment, please assume that name is unique across the DB), run the same process 100 times, and take the average to find out how long the query takes. Here is what I am doing (please assume session is a valid DB session obtained before calling this function):
def test_time():
    for i in range(0, 100):
        scores = session.query(Score, Name.name).join(Name).filter(Name.name == 'AAA').filter(Score.score > 5.0).all()
        array = []
        for score in scores:
            array.append((score[0].subject, score[0].score))
I am not doing anything with the array I am creating, but I call this function (which runs the query 100 times) and use default_timer from timeit to measure the elapsed time. These are the results for three runs:
Average - 0.10969632864
Average - 0.105748419762
Average - 0.105768380165
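For reference, a timing harness along these lines (hypothetical; the one actually used isn't shown in the question) could look like:
from timeit import default_timer

def run_timed(fn, runs=3):
    # Time fn() a few times and print the average per query; fn itself
    # already loops 100 times internally, as test_time() does above.
    for _ in range(runs):
        start = default_timer()
        fn()
        print('Average -', (default_timer() - start) / 100)

run_timed(test_time)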
Now, as I was curious, I created another quick and dirty Python file and declared the following class there:
class DirtyTable(Base):
    __tablename__ = "dirty_table"
    name = Column(String(255), primary_key=True)
    scores = Column(Text)
Then I created the following function to achieve the same goal, but this time reading the data from the second field, parsing it back into Python objects, looping over all the elements of the list, and adding to the array only those elements whose score value is > 5.0. Here it goes:
def dirty_timer():
    for i in range(0, 100):
        scores = session.query(DirtyTable).filter_by(name='AAA').all()
        for score in scores:
            x = json.loads(score.scores)
            array = []
            for member in x:
                if member['score'] > 5.0:
                    array.append((member['subject'], member['score']))
These are the times for three runs:
Average - 0.0288228917122
Average - 0.0296836185455
Average - 0.0298663306236
Am I missing something? Normalizing the DB (which I believe is what I did by breaking the original table into two tables) gave me a worse result. How is that possible? What is wrong with my approach?
Please let me know your thoughts. Sorry for the long post, but I had to explain everything properly.

Filter query in SQLAlchemy with date and hybrid_property

I've got a database table with columns year_id, month_id and day_id (all NUMBER). I want to make a query which filters by date. To simplify it, I want to add a hybrid_property that joins the mentioned fields together into a date.
class MyModel(Base):
    __table__ = Base.metadata.tables['some_table']

    @hybrid_property
    def created_on(self):
        return date(self.year_id, self.month_id, self.day_id)
But when I run the query
session.query(MyModel).filter(MyModel.created_on == date(2010, 2, 2))
I get the error TypeError: an integer is required (got type InstrumentedAttribute).
Is there another way of doing such a filter? I can't modify the DB schema (so the fields have to stay as they are), but at the same time it's awkward to compare dates against three separate columns.
You need to add an expression to your hybrid attribute, and it may differ depending on your RDBMS. See below for an example that works on SQLite:
@hybrid_property
def created_on(self):
    return date(self.year_id, self.month_id, self.day_id)

@created_on.expression
def created_on(cls):
    # @todo: create RDBMS-specific function
    # return func.date(cls.year_id, cls.month_id, cls.day_id)
    # below works on SQLite: format to YYYY-MM-DD
    return (func.cast(cls.year_id, String) + '-' +
            func.substr('0' + func.cast(cls.month_id, String), -2) + '-' +
            func.substr('0' + func.cast(cls.day_id, String), -2))
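For PostgreSQL, the class-level expression could likely be built with make_date() instead of string formatting (a sketch, assuming PostgreSQL 9.4+ where make_date() is available):
from sqlalchemy import func

@created_on.expression
def created_on(cls):
    # make_date(year, month, day) returns a real DATE on PostgreSQL,
    # so comparisons against datetime.date values work directly.
    return func.make_date(cls.year_id, cls.month_id, cls.day_id)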
