The sqlalchemy core query builder appears to unnest and relocate CTE queries to the "top" of the compiled sql.
I'm converting an existing Postgres query that selects deeply joined data as a single JSON object. The syntax is pretty contrived but it significantly reduces network overhead for large queries. The goal is to build the query dynamically using the sqlalchemy core query builder.
Here's a minimal working example of a nested CTE
with res_cte as (
select
account_0.name acct_name,
(
with offer_cte as (
select
offer_0.id
from
offer offer_0
where
offer_0.account_id = account_0.id
)
select
array_agg(offer_cte.id)
from
offer_cte
) as offer_arr
from
account account_0
)
select
acct_name::text, offer_arr::text
from res_cte
Result
acct_name, offer_arr
---------------------
oliver, null
rachel, {3}
buddy, {4,5}
(my incorrect use of) the core query builder attempts to unnest offer_cte and results in every offer.id being associated with every account_name in the result.
There's no need to re-implement this exact query in an answer, any example that results in a similarly nested CTE would be perfect.
I just implemented the nesting cte feature. It should land with 1.4.24 release.
Pull request: https://github.com/sqlalchemy/sqlalchemy/pull/6709
import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base
# Model declaration
Base = declarative_base()
class Offer(Base):
__tablename__ = "offer"
id = sa.Column(sa.Integer, primary_key=True)
account_id = sa.Column(sa.Integer, nullable=False)
class Account(Base):
__tablename__ = "account"
id = sa.Column(sa.Integer, primary_key=True)
name = sa.Column(sa.TEXT, nullable=False)
# Query construction
account_0 = sa.orm.aliased(Account)
# Watch the nesting keyword set to True
offer_cte = (
sa.select(Offer.id)
.where(Offer.account_id == account_0.id)
.select_from(Offer)
.correlate(account_0).cte("offer_cte", nesting=True)
)
offer_arr = sa.select(sa.func.array_agg(offer_cte.c.id).label("offer_arr"))
res_cte = sa.select(
account_0.name.label("acct_name"),
offer_arr.scalar_subquery().label("offer_arr"),
).cte("res_cte")
final_query = sa.select(
sa.cast(res_cte.c.acct_name, sa.TEXT),
sa.cast(res_cte.c.offer_arr, sa.TEXT),
)
It constructs this query that returns the result you expect:
WITH res_cte AS
(
SELECT
account_1.name AS acct_name
, (
WITH offer_cte AS
(
SELECT
offer.id AS id
FROM
offer
WHERE
offer.account_id = account_1.id
)
SELECT
array_agg(offer_cte.id) AS offer_arr
FROM
offer_cte
) AS offer_arr
FROM
account AS account_1
)
SELECT
CAST(res_cte.acct_name AS TEXT) AS acct_name
, CAST(res_cte.offer_arr AS TEXT) AS offer_arr
FROM
res_cte
Related
I have a table of events generated by devices, with this structure:
class Events(db.Model):
id = db.Column(db.Integer, primary_key=True, autoincrement=True)
timestamp_event = db.Column(db.DateTime, nullable=False, index=True)
device_id = db.Column(db.Integer, db.ForeignKey('devices.id'), nullable=True)
which I have to query joined to:
class Devices(db.Model):
id = db.Column(db.Integer, primary_key=True, autoincrement=True)
dev_name = db.Column(db.String(50))
so I can retrieve Device data for every Event.
I´m doing a ranking of the 20 top max events generated in a single hour. It already works, but as my Events table grows (over 1M rows now) the query gets slower and slower. This is my code. Any ideas on how to optimize the query? Maybe a composite index device.id + timestamp_event? Would that work even if searching for a part of the timedate column?
pkd = db.session.query(db.func.count(Events.id),
db.func.date_format(Events.timestamp_event,'%d/%m %H'),\
Devices.dev_name).select_from(Events).join(Devices)\
.filter(Events.timestamp_event >= (datetime.now() - timedelta(days=peak_days)))\
.group_by(db.func.date_format(Events.timestamp_event,'%Y%M%D%H'))\
.group_by(Events.device_id)\
.order_by(db.func.count(Events.id).desc()).limit(20).all()
Here´s sample output of first 3 rows of the query: Number of events, when (DD/MM HH), and which device:
[(2710, '15/01 16', 'Device 002'),
(2612, '11/01 17', 'Device 033'),
(2133, '13/01 15', 'Device 002'),...]
and here´s SQL generated by SQLAlchemy:
SELECT count(events.id) AS count_1,
date_format(events.timestamp_event,
%(date_format_2)s) AS date_format_1,
devices.id AS devices_id,
devices.dev_name AS devices_dev_name
FROM events
INNER JOIN devices ON devices.id = events.device_id
WHERE events.timestamp_event >= %(timestamp_event_1)s
GROUP BY date_format(events.timestamp_event, %(date_format_3)s), events.device_id
ORDER BY count(events.id) DESC
LIMIT %(param_1)s
# This example is for postgresql.
# I'm not sure what db you are using but the date formatting
# is different.
with Session(engine) as session:
# Use subquery to select top 20 event creating device ids
# for each hour since the beginning of the peak.
hour_fmt = "dd/Mon HH24"
hour_col = func.to_char(Event.created_on, hour_fmt).label('event_hour')
event_count_col = func.count(Event.id).label('event_count')
sub_q = select(
event_count_col,
hour_col,
Event.device_id
).filter(
Event.created_on > get_start_of_peak()
).group_by(
hour_col, Event.device_id
).order_by(
event_count_col.desc()
).limit(
20
).alias()
# Now join in the devices to the top ids to get the names.
results = session.execute(
select(
sub_q.c.event_count,
sub_q.c.event_hour,
Device.name
).join_from(
sub_q,
Device,
sub_q.c.device_id == Device.id
).order_by(
sub_q.c.event_count.desc(),
Device.name
)
).all()
I'm trying to translate the following query to sqlAlchemy and can't seem to figure it out (I'm not even far yet):
"SELECT time, version_id FROM ( \
SELECT \
time, \
version_id, \
LAG(software_version_id) OVER (ORDER BY time) as previous_version_id \
FROM device_checkins WHERE device_id = 001 AND time BETWEEN '2020-01-01' AND '2021-04-01') tt \
WHERE previous_version_id IS NULL OR version_id != previous_version_id;"
As far as I could figure out, I need the select function that sqlAlchemy provides but I'm running into trouble.
Of course, in the python representation, we have a DeviceCheckin model with all the fields that are used here. I'd love all the help you might be able to provide.
class DeviceCheckin(ModelBase):
__tablename__ = "device_checkins"
time = db.Column(DateTimeUtc(), nullable=False)
device_id = db.Column(sa.BigInteger, nullable=False)
device = db.relationship(...)
software_version_id = db.Column(...)
software_version = db.relationship(...)
You could draw up the subquery expression as follows:
import sqlalchemy as sa
#...
entities = [
DeviceCheckin.time,
DeviceCheckin.version_id,
sa.func.lag(
DeviceCheckin.software_version_id
).over(
order_by=DeviceChecking.time
).label('previous_version_id')
]
condition = (
(DeviceCheckin.device_id == '001')
& (DeviceCheckin.time.between('2020-01-01', '2021-04-01'))
)
subq = DeviceCheckin.query.with_entities(entities).filter(condition).subquery()
Then select from it in the following manner:
condition = (
subq.c.previous_version_id.is_(None)
| (subq.c.version_id != subq.c.previous_version_id)
)
entities = [subq.c.time, subq.c.version_id]
query = subq.select(entities).where(condition)
results = db.session.execute(query)
I apologize in advance if my question is banal: I am a total beginner of SQL.
I want to create a simple database, with two tables: Students and Answers.
Basically, each student will answer three question (possible answers are True or False for each question), and his answers will be stored in Answers table.
Students can have two "experience" levels: "Undergraduate" and "Graduate".
What is the best way to obtain all Answers that were given by Students with "Graduate" experience level?
This is how I define SQLAlchemy classes for entries in Students and Answers tables:
import random
from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, String, Date, Boolean, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship
db_uri = "sqlite:///simple_answers.db"
db_engine = create_engine(db_uri)
db_connect = db_engine.connect()
Session = sessionmaker()
Session.configure(bind=db_engine)
db_session = Session()
Base = declarative_base()
class Student(Base):
__tablename__ = "Students"
id = Column(Integer, primary_key=True)
experience = Column(String, nullable=False)
class Answer(Base):
__tablename__ = "Answers"
id = Column(Integer, primary_key=True)
student_id = Column(Integer, ForeignKey("Students.id"), nullable=False)
answer = Column(Boolean, nullable=False)
Base.metadata.create_all(db_connect)
Then, I insert some random entries in the database:
categories_experience = ["Undergraduate", "Graduate"]
categories_answer = [True, False]
n_students = 20
n_answers_by_each_student = 3
random.seed(1)
for _ in range(n_students):
student = Student(experience=random.choice(categories_experience))
db_session.add(student)
db_session.commit()
answers = [Answer(student_id=student.id, answer=random.choice(categories_answer))
for _ in range(n_answers_by_each_student)]
db_session.add_all(answers)
db_session.commit()
Then, I obtain Student.id of all "Graduate" students:
ids_graduates = db_session.query(Student.id).filter(Student.experience == "Graduate").all()
ids_graduates = [result.id for result in ids_graduates]
And finally, I select Answers from "Graduate" Students using .in_ operator:
answers_graduates = db_session.query(Answer).filter(Answer.student_id.in_(ids_graduates)).all()
I manually checked the answers, and they are right. But, since I am a total beginner of SQL, I suspect that there is some better way to achieve the same result.
Is there such an objectively "best" way (more Pythonic, more efficient...)? I would like to achieve my result with SQLAlchemy, possibly using the ORM interface.
When I asked the question, I was in a hurry.
Since then, I have had the time to study SQLAlchemy ORM documentation.
There are two recommended ways to filter tables based on values from another table.
The first way is actually very similar to what I had originally tried:
query_graduates = (
db_session
.query(User.id)
.filter(User.experience == "Graduate")
)
query_answers_graduates = (
db_session
.query(Answer)
.filter(Answer.user_id.in_(query_graduates))
)
answers_graduates = query_answers_graduates.all()
It uses .in_ operator, which accepts as argument either a list of objects, or another query.
The second way uses .join method:
query_answers_graduates = (
db_session
.query(Answer)
.join(User)
.filter(User.experience == "Graduate")
)
The second approach is more concise. I timed both solutions, and the second approach, which uses .join, is slightly faster.
You are mentioning SQL but I am confused if you want to do this particular step in Python or SQL. If SQL, something like this could work:
select * from Students s
inner join Answers a on s.id = a.student_id
where s.experience = "Graduate";
Updated code
I have never used SQLAlchemy before but something similar to this may work...
sql = """select s.Id, a.answer from Students s
inner join Answers a on s.id = a.student_id
where s.experience = "Graduate";"""
with db_session as con:
rows = con.execute(sql)
for row in rows:
print(row)
I'm trying to write the following sql query with sqlalchemy ORM:
SELECT * FROM
(SELECT *, row_number() OVER(w)
FROM (select distinct on (grandma_id, author_id) * from contents) as c
WINDOW w AS (PARTITION BY grandma_id ORDER BY RANDOM())) AS v1
WHERE row_number <= 4;
This is what I've done so far:
s = Session()
unique_users_contents = (s.query(Content).distinct(Content.grandma_id,
Content.author_id)
.subquery())
windowed_contents = (s.query(Content,
func.row_number()
.over(partition_by=Content.grandma_id,
order_by=func.random()))
.select_from(unique_users_contents)).subquery()
contents = (s.query(Content).select_from(windowed_contents)
.filter(row_number >= 4)) ## how can I reference the row_number() value?
result = contents
for content in result:
print "%s\t%s\t%s" % (content.id, content.grandma_id,
content.author_id)
As you can see it's pretty much modeled, but I have no idea how to reference the row_number() result of the subquery from the outer query where. I tried something like windowed_contents.c.row_number and adding a label() call on the window func but it's not working, couldn't find any similar example in the official docs or in stackoverflow.
How can this be accomplished? And also, could you suggest a better way to do this query?
windowed_contents.c.row_number against a label() is how you'd do it, works for me (note the select_entity_from() method is new in SQLA 0.8.2 and will be needed here in 0.9 vs. select_from()):
from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Content(Base):
__tablename__ = 'contents'
grandma_id = Column(Integer, primary_key=True)
author_id = Column(Integer, primary_key=True)
s = Session()
unique_users_contents = s.query(Content).distinct(
Content.grandma_id, Content.author_id).\
subquery('c')
q = s.query(
Content,
func.row_number().over(
partition_by=Content.grandma_id,
order_by=func.random()).label("row_number")
).select_entity_from(unique_users_contents).subquery()
q = s.query(Content).select_entity_from(q).filter(q.c.row_number <= 4)
print q
I'm trying to adapt some part of a MySQLdb application to sqlalchemy in declarative base. I'm only beginning with sqlalchemy.
The legacy tables are defined something like:
student: id_number*, semester*, stateid, condition, ...
choice: id_number*, semester*, choice_id, school, program, ...
We have 3 tables for each of them (student_tmp, student_year, student_summer, choice_tmp, choice_year, choice_summer), so each pair (_tmp, _year, _summer) contains information for a specific moment.
select *
from `student_tmp`
inner join `choice_tmp` using (`id_number`, `semester`)
My problem is the information that is important to me is to get the equivalent of the following select:
SELECT t.*
FROM (
(
SELECT st.*, ct.*
FROM `student_tmp` AS st
INNER JOIN `choice_tmp` as ct USING (`id_number`, `semester`)
WHERE (ct.`choice_id` = IF(right(ct.`semester`, 1)='1', '3', '4'))
AND (st.`condition` = 'A')
) UNION (
SELECT sy.*, cy.*
FROM `student_year` AS sy
INNER JOIN `choice_year` as cy USING (`id_number`, `semester`)
WHERE (cy.`choice_id` = 4)
AND (sy.`condition` = 'A')
) UNION (
SELECT ss.*, cs.*
FROM `student_summer` AS ss
INNER JOIN `choice_summer` as cs USING (`id_number`, `semester`)
WHERE (cs.`choice_id` = 3)
AND (ss.`condition` = 'A')
)
) as t
* used for shorten the select, but I'm actually only querying for about 7 columns out of the 50 availables.
This information is used in many flavors... "Do I have new students? Do I still have all students from a given date? Which students are subscribed after the given date? etc..." The result of this select statement is to be saved in another database.
Would it be possible for me to achieve this with a single view-like class? The information is read-only so I don't need to be able to modify/create/delte. Or do I have to declare a class for each table (ending up with 6 classes) and every time I need to query, I have to remember to filter?
Thanks for pointers.
EDIT: I don't have modification access to the database (I cannot create a view). Both databases may not be on the same server (so I cannot create a view on my second DB).
My concern is to avoid the full table scan before filtering on condition and choice_id.
EDIT 2: I've set up declarative classes like this:
class BaseStudent(object):
id_number = sqlalchemy.Column(sqlalchemy.String(7), primary_key=True)
semester = sqlalchemy.Column(sqlalchemy.String(5), primary_key=True)
unique_id_number = sqlalchemy.Column(sqlalchemy.String(7))
stateid = sqlalchemy.Column(sqlalchemy.String(12))
condition = sqlalchemy.Column(sqlalchemy.String(3))
class Student(BaseStudent, Base):
__tablename__ = 'student'
choices = orm.relationship('Choice', backref='student')
#class StudentYear(BaseStudent, Base):...
#class StudentSummer(BaseStudent, Base):...
class BaseChoice(object):
id_number = sqlalchemy.Column(sqlalchemy.String(7), primary_key=True)
semester = sqlalchemy.Column(sqlalchemy.String(5), primary_key=True)
choice_id = sqlalchemy.Column(sqlalchemy.String(1))
school = sqlalchemy.Column(sqlalchemy.String(2))
program = sqlalchemy.Column(sqlalchemy.String(5))
class Choice(BaseChoice, Base):
__tablename__ = 'choice'
__table_args__ = (
sqlalchemy.ForeignKeyConstraint(['id_number', 'semester',],
[Student.id_number, Student.semester,]),
)
#class ChoiceYear(BaseChoice, Base): ...
#class ChoiceSummer(BaseChoice, Base): ...
Now, the query that gives me correct SQL for one set of table is:
q = session.query(StudentYear, ChoiceYear) \
.select_from(StudentYear) \
.join(ChoiceYear) \
.filter(StudentYear.condition=='A') \
.filter(ChoiceYear.choice_id=='4')
but it throws an exception...
"Could not locate column in row for column '%s'" % key)
sqlalchemy.exc.NoSuchColumnError: "Could not locate column in row for column '*'"
How do I use that query to create myself a class I can use?
If you can create this view on the database, then you simply map the view as if it was a table. See Reflecting Views.
# DB VIEW
CREATE VIEW my_view AS -- #todo: your select statements here
# SA
my_view = Table('my_view', metadata, autoload=True)
# define view object
class ViewObject(object):
def __repr__(self):
return "ViewObject %s" % str((self.id_number, self.semester,))
# map the view to the object
view_mapper = mapper(ViewObject, my_view)
# query the view
q = session.query(ViewObject)
for _ in q:
print _
If you cannot create a VIEW on the database level, you could create a selectable and map the ViewObject to it. The code below should give you the idea:
student_tmp = Table('student_tmp', metadata, autoload=True)
choice_tmp = Table('choice_tmp', metadata, autoload=True)
# your SELECT part with the columns you need
qry = select([student_tmp.c.id_number, student_tmp.c.semester, student_tmp.stateid, choice_tmp.school])
# your INNER JOIN condition
qry = qry.where(student_tmp.c.id_number == choice_tmp.c.id_number).where(student_tmp.c.semester == choice_tmp.c.semester)
# other WHERE clauses
qry = qry.where(student_tmp.c.condition == 'A')
You can create 3 queries like this, then combine them with union_all and use the resulting query in the mapper:
view_mapper = mapper(ViewObject, my_combined_qry)
In both cases you have to ensure though that a PrimaryKey is properly defined on the view, and you might need to override the autoloaded view, and specify the primary key explicitely (see the link above). Otherwise you will either receive an error, or might not get proper results from the query.
Answer to EDIT-2:
qry = (session.query(StudentYear, ChoiceYear).
select_from(StudentYear).
join(ChoiceYear).
filter(StudentYear.condition == 'A').
filter(ChoiceYear.choice_id == '4')
)
The result will be tuple pairs: (Student, Choice).
But if you want to create a new mapped class for the query, then you can create a selectable as the sample above:
student_tmp = StudentTmp.__table__
choice_tmp = ChoiceTmp.__table__
.... (see sample code above)
This is to show what I ended up doing, any comment welcomed.
class JoinedYear(Base):
__table__ = sqlalchemy.select(
[
StudentYear.id_number,
StudentYear.semester,
StudentYear.stateid,
ChoiceYear.school,
ChoiceYear.program,
],
from_obj=StudentYear.__table__.join(ChoiceYear.__table__),
) \
.where(StudentYear.condition == 'A') \
.where(ChoiceYear.choice_id == '4') \
.alias('YearView')
and I will elaborate from there...
Thanks #van