SQLAlchemy optimize join query time - python

I have a table of events generated by devices, with this structure:
class Events(db.Model):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    timestamp_event = db.Column(db.DateTime, nullable=False, index=True)
    device_id = db.Column(db.Integer, db.ForeignKey('devices.id'), nullable=True)
which I have to query joined to:
class Devices(db.Model):
    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    dev_name = db.Column(db.String(50))
so I can retrieve Device data for every Event.
I'm building a ranking of the top 20 event counts generated in a single hour. It already works, but as my Events table grows (over 1M rows now) the query gets slower and slower. This is my code. Any ideas on how to optimize the query? Maybe a composite index on device_id + timestamp_event? Would that work even when searching on only part of the datetime column?
pkd = db.session.query(db.func.count(Events.id),
                       db.func.date_format(Events.timestamp_event, '%d/%m %H'),
                       Devices.dev_name).select_from(Events).join(Devices)\
    .filter(Events.timestamp_event >= (datetime.now() - timedelta(days=peak_days)))\
    .group_by(db.func.date_format(Events.timestamp_event, '%Y%M%D%H'))\
    .group_by(Events.device_id)\
    .order_by(db.func.count(Events.id).desc()).limit(20).all()
Here's sample output for the first 3 rows of the query: number of events, when (DD/MM HH), and which device:
[(2710, '15/01 16', 'Device 002'),
(2612, '11/01 17', 'Device 033'),
(2133, '13/01 15', 'Device 002'),...]
and here's the SQL generated by SQLAlchemy:
SELECT count(events.id) AS count_1,
date_format(events.timestamp_event,
%(date_format_2)s) AS date_format_1,
devices.id AS devices_id,
devices.dev_name AS devices_dev_name
FROM events
INNER JOIN devices ON devices.id = events.device_id
WHERE events.timestamp_event >= %(timestamp_event_1)s
GROUP BY date_format(events.timestamp_event, %(date_format_3)s), events.device_id
ORDER BY count(events.id) DESC
LIMIT %(param_1)s

# This example is for PostgreSQL.
# I'm not sure which database you are using, but the date formatting
# is different.
with Session(engine) as session:
    # Use a subquery to select the top 20 event-creating device ids
    # for each hour since the beginning of the peak.
    hour_fmt = "dd/Mon HH24"
    hour_col = func.to_char(Event.created_on, hour_fmt).label('event_hour')
    event_count_col = func.count(Event.id).label('event_count')
    sub_q = select(
        event_count_col,
        hour_col,
        Event.device_id
    ).filter(
        Event.created_on > get_start_of_peak()
    ).group_by(
        hour_col, Event.device_id
    ).order_by(
        event_count_col.desc()
    ).limit(
        20
    ).alias()
    # Now join the devices to the top ids to get the names.
    results = session.execute(
        select(
            sub_q.c.event_count,
            sub_q.c.event_hour,
            Device.name
        ).join_from(
            sub_q,
            Device,
            sub_q.c.device_id == Device.id
        ).order_by(
            sub_q.c.event_count.desc(),
            Device.name
        )
    ).all()
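The shape of that query (aggregate and limit first, then join only the surviving rows to the devices table) can be sketched with nothing but the standard library. This is a minimal illustration, assuming an in-memory SQLite database, made-up sample rows, and SQLite's strftime standing in for date_format / to_char; it is not the asker's MySQL schema or data.

```python
import sqlite3

# In-memory SQLite database with a cut-down version of the two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE devices (id INTEGER PRIMARY KEY, dev_name TEXT);
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    timestamp_event TEXT NOT NULL,
    device_id INTEGER REFERENCES devices(id)
);
""")
conn.executemany("INSERT INTO devices VALUES (?, ?)",
                 [(1, "Device 001"), (2, "Device 002")])
# Hypothetical sample data: (timestamp, device id).
conn.executemany(
    "INSERT INTO events (timestamp_event, device_id) VALUES (?, ?)",
    [
        ("2024-01-15 16:05:00", 2), ("2024-01-15 16:20:00", 2),
        ("2024-01-15 16:40:00", 2), ("2024-01-15 16:10:00", 1),
        ("2024-01-15 17:00:00", 1), ("2024-01-15 17:30:00", 1),
    ],
)

# The inner query groups and limits on the events table alone;
# the join to devices then only touches at most 20 rows.
top_hours = conn.execute("""
SELECT top_rows.event_count, top_rows.event_hour, devices.dev_name
FROM (
    SELECT count(*) AS event_count,
           strftime('%d/%m %H', timestamp_event) AS event_hour,
           device_id
    FROM events
    WHERE timestamp_event >= ?
    GROUP BY strftime('%Y%m%d%H', timestamp_event), device_id
    ORDER BY event_count DESC
    LIMIT 20
) AS top_rows
JOIN devices ON devices.id = top_rows.device_id
ORDER BY top_rows.event_count DESC, devices.dev_name
""", ("2024-01-01 00:00:00",)).fetchall()

for row in top_hours:
    print(row)
```

Whether this is actually faster on a given database depends on the engine and its query plan, so it is worth comparing both forms with EXPLAIN on real data.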

Related

sqlalchemy join returns results from first table only

so this is my issue, I have the following tables:
class ClientCampaings(Base):
    __tablename__ = 'client_campaign'
    campaign_id = Column(INTEGER, primary_key=True)
    client_id = Column(VARCHAR(50))
    campaign_name = Column(VARCHAR(45))
    campaign_status = Column(VARCHAR(45))
    campaign_type = Column(VARCHAR(45))
    registration_date = Column(DATE)

class ClientKpi(Base):
    __tablename__ = 'client_kpi'
    kpi_id = Column(INTEGER, primary_key=True)
    kpi_name = Column(VARCHAR(45))
    cost_conv = Column(FLOAT)
    quality_score = Column(FLOAT)

class KpiAssigment(Base):
    __tablename__ = 'kpi_assigment'
    assigment_id = Column(INTEGER, primary_key=True)
    kpi_id = Column(INTEGER, ForeignKey("client_kpi.kpi_id"))
    campaign_id = Column(INTEGER, ForeignKey("client_campaign.campaign_id"))
    assigned_by = Column(VARCHAR(45))
    timestamp = Column(TIMESTAMP)
    # Basic One To Many relation
    client_campaign = relationship("ClientCampaings")
    client_kpi = relationship("ClientKpi")
Then I run the following query:
from database.session import MySqlConnection
from database.models import KpiAssigment,ClientKpi
db = MySqlConnection(database='db_goes_here').db_session()
kpi=db.query(KpiAssigment)\
.join(ClientKpi)\
.filter(KpiAssigment.kpi_id==ClientKpi.kpi_id).all()
Which I thought was going to be something like this:
SELECT kpi_assigment.*,
client_kpi.*
FROM kpi_assigment
INNER JOIN client_kpi
ON kpi_assigment.kpi_id=client_kpi.kpi_id
However, when I run the SQLAlchemy query I get back only the results from the first table:
{'_sa_instance_state': <sqlalchemy.orm.state.InstanceState object at 0x7fb3d0ace910>, 'kpi_id': 1, 'assigned_by': 'xxx@email.net', 'assigment_id': 2, 'campaign_id': XXXXXXXXX, 'timestamp': datetime.datetime(2022, 12, 7, 17, 5, 8)}
I was looking to get an INNER JOIN and have also the data from ClientKpi table.
I read these related questions but still can't work out why this isn't working:
SqlAlchemy Outer Join Only Returns One Table
How to join data from two tables in SQLAlchemy?
and I did follow their documentation
https://docs.sqlalchemy.org/en/14/orm/query.html#sqlalchemy.orm.Query.join
Any thoughts?
Thanks
The query
kpi=db.query(KpiAssigment)\
.join(ClientKpi)\
.filter(KpiAssigment.kpi_id==ClientKpi.kpi_id).all()
will only select the KpiAssigment entity; to select the corresponding ClientKpi objects as well, include the ClientKpi model in the query:
kpi=db.query(KpiAssigment, ClientKpi).join(ClientKpi)
The .filter is redundant as SQLAlchemy will use the declared foreign keys to create the JOIN.
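If you want the related ClientKpi for each KpiAssigment, you can use the relationship without an explicit join. Because client_kpi is a many-to-one relationship here, it holds a single object rather than a collection:
client_kpi = some_kpi_assignment.client_kpi
# do something with client_kpi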
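A minimal sketch of both approaches, assuming an in-memory SQLite database and a trimmed-down copy of the two models (the sample values are made up). Querying two entities returns one row per match, and each row unpacks into a (KpiAssigment, ClientKpi) pair:

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class ClientKpi(Base):
    __tablename__ = "client_kpi"
    kpi_id = Column(Integer, primary_key=True)
    kpi_name = Column(String(45))
    cost_conv = Column(Float)


class KpiAssigment(Base):
    __tablename__ = "kpi_assigment"
    assigment_id = Column(Integer, primary_key=True)
    kpi_id = Column(Integer, ForeignKey("client_kpi.kpi_id"))
    assigned_by = Column(String(45))
    client_kpi = relationship("ClientKpi")  # many-to-one: a single object


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Hypothetical sample rows.
    session.add(ClientKpi(kpi_id=1, kpi_name="cpc", cost_conv=6.0))
    session.add(KpiAssigment(assigment_id=2, kpi_id=1, assigned_by="someone"))
    session.commit()

    # Both entities in the query: each row holds both objects.
    rows = session.query(KpiAssigment, ClientKpi).join(ClientKpi).all()
    pairs = [(assigment.assigment_id, kpi.kpi_name) for assigment, kpi in rows]

    # Only the first entity in the query: reach ClientKpi via the relationship.
    via_relationship = session.query(KpiAssigment).one().client_kpi.kpi_name

print(pairs, via_relationship)
```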
I'm still trying to find out whether there is a better answer to this. In the meantime, I did the following and got what I was looking for:
query = db.query(KpiAssigment.campaign_id,
ClientKpi.cost_conv,
ClientKpi.quality_score)\
.join(ClientKpi)\
.filter(KpiAssigment.kpi_id==ClientKpi.kpi_id)\
.all()
which returns the result as a list of tuples:
print(query)
>>>[(1, 6.0, 2), (2, 6.0, 2)]

SQLAlchemy Nested CTE Query

The sqlalchemy core query builder appears to unnest and relocate CTE queries to the "top" of the compiled sql.
I'm converting an existing Postgres query that selects deeply joined data as a single JSON object. The syntax is pretty contrived but it significantly reduces network overhead for large queries. The goal is to build the query dynamically using the sqlalchemy core query builder.
Here's a minimal working example of a nested CTE
with res_cte as (
select
account_0.name acct_name,
(
with offer_cte as (
select
offer_0.id
from
offer offer_0
where
offer_0.account_id = account_0.id
)
select
array_agg(offer_cte.id)
from
offer_cte
) as offer_arr
from
account account_0
)
select
acct_name::text, offer_arr::text
from res_cte
Result
acct_name, offer_arr
---------------------
oliver, null
rachel, {3}
buddy, {4,5}
(My incorrect use of) the core query builder attempts to unnest offer_cte, which results in every offer.id being associated with every acct_name in the result.
There's no need to re-implement this exact query in an answer, any example that results in a similarly nested CTE would be perfect.
I just implemented the nesting CTE feature. It should land with the 1.4.24 release.
Pull request: https://github.com/sqlalchemy/sqlalchemy/pull/6709
import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base
# Model declaration
Base = declarative_base()
class Offer(Base):
    __tablename__ = "offer"
    id = sa.Column(sa.Integer, primary_key=True)
    account_id = sa.Column(sa.Integer, nullable=False)

class Account(Base):
    __tablename__ = "account"
    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.TEXT, nullable=False)

# Query construction
account_0 = sa.orm.aliased(Account)
# Watch the nesting keyword set to True
offer_cte = (
    sa.select(Offer.id)
    .where(Offer.account_id == account_0.id)
    .select_from(Offer)
    .correlate(account_0).cte("offer_cte", nesting=True)
)
offer_arr = sa.select(sa.func.array_agg(offer_cte.c.id).label("offer_arr"))
res_cte = sa.select(
    account_0.name.label("acct_name"),
    offer_arr.scalar_subquery().label("offer_arr"),
).cte("res_cte")
final_query = sa.select(
    sa.cast(res_cte.c.acct_name, sa.TEXT),
    sa.cast(res_cte.c.offer_arr, sa.TEXT),
)
It constructs this query that returns the result you expect:
WITH res_cte AS
(
SELECT
account_1.name AS acct_name
, (
WITH offer_cte AS
(
SELECT
offer.id AS id
FROM
offer
WHERE
offer.account_id = account_1.id
)
SELECT
array_agg(offer_cte.id) AS offer_arr
FROM
offer_cte
) AS offer_arr
FROM
account AS account_1
)
SELECT
CAST(res_cte.acct_name AS TEXT) AS acct_name
, CAST(res_cte.offer_arr AS TEXT) AS offer_arr
FROM
res_cte

SQLAlchemy: How to keep the record ID and its relationship record in sync for new records (pre-commit)?

When creating new records, I'd expect that foreign key fields, and their relationship object would stay in sync (if I change one the other would change to reflect), but this doesn't seem to be the case. Is this possible to do?
Given the following:
Base = declarative_base()

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    fullname = Column(String)
    password = Column(String)
    equipment = relationship('Equipment', backref='user')

class Equipment(Base):
    __tablename__ = 'equipment'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('user.id'), nullable=False)
    name = Column(String)

engine = create_engine('sqlite:///:memory:', echo=True)
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)
conn = session()
conn.add_all([
    User(name='bill', fullname='Bill W.', password='rlrrlrll'),  # id=1
    User(name='tony', fullname='Tony I.', password='EADGBe'),  # id=2
    User(name='ozzy', fullname='Ozzy O.', password='durrrr'),  # id=3
    User(name='geezer', fullname='Terence B.', password='password'),  # id=4
])
I can create related records in either of the two ways:
guitar = Equipment(
    user=conn.query(User).filter(User.name == 'tony').one(),
    name='Gibson SG')
drums = Equipment(
    user_id=1,
    name='Ludwigs')
Following these lines I'd expect guitar.user_id to be 2, and drums.user to be the 'bill' object, but in both cases they're None. After I conn.add()/conn.commit() then it starts working a little more like I'd expect (both complementary fields return non-None values).
Is there any way for this to work pre-commit? I'd like to be able to construct new records either way (by ID or by object), and in library functions be able to reliably access the ID or object.
You can do this by flushing:
conn.add(guitar)
conn.add(drums)
conn.flush()
Flushing emits the INSERT queries but does not COMMIT, meaning you can ROLLBACK later if you need to.
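A minimal sketch of that behaviour, assuming an in-memory SQLite database and a trimmed-down copy of the models from the question (only the columns needed here are kept). The foreign key on guitar is empty before the flush and filled in afterwards, and drums can reach its user object once it has been flushed:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class User(Base):
    __tablename__ = "user"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    equipment = relationship("Equipment", backref="user")


class Equipment(Base):
    __tablename__ = "equipment"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey("user.id"), nullable=False)
    name = Column(String)


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([User(name="bill"), User(name="tony")])
    session.flush()
    bill = session.query(User).filter(User.name == "bill").one()
    tony = session.query(User).filter(User.name == "tony").one()
    tony_id = tony.id

    guitar = Equipment(user=tony, name="Gibson SG")     # built from the object
    drums = Equipment(user_id=bill.id, name="Ludwigs")  # built from the id
    fk_before_flush = guitar.user_id                    # not synchronised yet

    session.add_all([guitar, drums])
    session.flush()  # emits the INSERTs without committing

    fk_after_flush = guitar.user_id      # now copied from tony.id
    drums_user_name = drums.user.name    # loaded through the relationship

    session.rollback()  # nothing was committed, so this undoes the INSERTs

print(fk_before_flush, fk_after_flush, drums_user_name)
```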

SQLAlchemy: hybrid_property expression and subquery

I am trying to do a complex hybrid_property using SQLAlchemy: my model is
class Consultation(Table):
    patient_id = Column(Integer)
    patient = relationship('Patient', backref=backref('consultations', lazy='dynamic'))

class Exam(Table):
    consultation_id = Column(Integer)
    consultation = relationship('Consultation', backref=backref('exams', lazy='dynamic'))

class VitalSign(Table):
    exam_id = Column(Integer)
    exam = relationship('Exam', backref=backref('vital', lazy='dynamic'))
    vital_type = Column(String)
    value = Column(String)

class Patient(Table):
    patient_data = Column(String)

    @hybrid_property
    def last_consultation_validity(self):
        last_consultation = self.consultations.order_by(Consultation.created_at.desc()).first()
        if last_consultation:
            last_consultation_conclusions = last_consultation.exams.filter_by(exam_type='conclusions').first()
            if last_consultation_conclusions:
                last_consultation_validity = last_consultation_conclusions.vital_signs.filter_by(sign_type='validity_date').first()
                if last_consultation_validity:
                    return last_consultation_validity
        return None

    @last_consultation_validity.expression
    def last_consultation_validity(cls):
        subquery = select([Consultation.id.label('last_consultation_id')]).\
            where(Consultation.patient_id == cls.id).\
            order_by(Consultation.created_at.desc()).limit(1)
        j = join(VitalSign, Exam).join(Consultation)
        return select([VitalSign.value]).select_from(j).select_from(subquery).\
            where(and_(Consultation.id == subquery.c.last_consultation_id, VitalSign.sign_type == 'validity_date'))
As you can see my model is quite complicated.
Patients get Consultations. Exams and VitalSigns are cascading data for the Consultations. Not every consultation gets a validity, and a new consultation makes the validity of earlier consultations irrelevant: I only want the validity from the last consultation. If a patient has a validity only in previous consultations, I'm not interested in it.
What I would like to do is to be able to order by the hybrid_property last_consultation_validity.
The output SQL looks ok to me:
SELECT vital_sign.value
FROM (SELECT consultation.id AS last_consultation_id
FROM consultation, patient
WHERE consultation.patient_id = patient.id ORDER BY consultation.created_at DESC
LIMIT ? OFFSET ?), vital_sign JOIN exam ON exam.id = vital_sign.exam_id JOIN consultation ON consultation.id = exam.consultation_id
WHERE consultation.id = last_consultation_id AND vital_sign.sign_type = ?
But when I order the patients by last_consultation_validity, the rows do not get ordered ...
When I execute the same select outside of the hybrid_property to retrieve the date for each patient (just setting the patient.id), I get the correct values. Surprisingly, the SQL is slightly different: patient is removed from the FROM in the SELECT.
So I'm actually wondering if this is a bug in SQLAlchemy or if I'm doing something wrong ... Any help would be greatly appreciated.

Sqlalchemy ID field isn't populated when relationship with another table is set up

I'm trying to set up Sqlalchemy and am running into problems with setting up relationships between tables. Most likely it's misunderstanding on my part.
A table is set up like so. The important line is the one with two asterisks on either side, setting up the relationship to table "jobs".
class Clocktime(Base):
    """Table for clockin/clockout values
    ForeignKeys exist for Job and Employee
    many to one -> employee
    many to one -> job
    """
    __tablename__ = "clocktimes"
    id = Column(Integer, primary_key=True)
    time_in = Column(DateTime)
    time_out = Column(DateTime)
    employee_id = Column(Integer, ForeignKey('employees.id'))
    **job_id = Column(Integer, ForeignKey('jobs.id'))**
    # employee = many to one relationship with Employee
    # job = many to one relationship with Job

    @property
    def timeworked(self):
        return self.time_out - self.time_in

    @property
    def __str__(self):
        formatter = "Employee: {employee.name}, "\
                    "Job: {job.abbr}, "\
                    "Start: {self.time_in}, "\
                    "End: {self.time_out}, "\
                    "Hours Worked: {self.timeworked}, "\
                    "ID# {self.id}"
        return formatter.format(employee=self.employee, job=self.job, self=self)
Now, the jobs table follows. Check the asterisked line:
class Job(Base):
    """Table for jobs
    one to many -> clocktimes
    note that rate is cents/hr"""
    __tablename__ = "jobs"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))
    abbr = Column(String(16))
    rate = Column(Integer)  # cents/hr
    **clocktimes = relationship('Clocktime', backref='job', order_by=id)**

    def __str__(self):
        formatter = "Name: {name:<50} {abbr:>23}\n" \
                    "Rate: ${rate:<7.2f}/hr {id:>62}"
        return formatter.format(name=self.name,
                                abbr="Abbr: " + str(self.abbr),
                                rate=self.rate/100.0,
                                id="ID# " + str(self.id))
When a user starts a new task, the following code is executed in order to write the relevant data to tables jobs and clocktimes:
new_task_job = [Job(abbr=abbrev, name=project_name, rate=p_rate), Clocktime(time_in=datetime.datetime.now())]
for i in new_task_job:
    session.add(i)
session.commit()
start_time = datetime.datetime.now()
status = 1
Then, when the user takes a break...
new_break = Clocktime(time_out=datetime.datetime.now())
session.add(new_break)
session.commit()
If you look in the screenshot, the job_id field isn't being populated. Shouldn't it be populated with the primary key (id) from the jobs table, per
job_id = Column(Integer, ForeignKey('jobs.id'))
or am I missing something? I'm assuming that I'm to write code to do that, but I don't want to break anything that Sqlalchemy is trying to do in the backend. This should be a one job to many clocktimes, since a person can spend several days per task.
Checking the docs, it looks like you've set up a collection of Clocktime objects on Job called clocktimes and a .job attribute on Clocktime that will refer to the parent Job object.
The expected behaviour is:
>>> c1 = Clocktime()
>>> j1 = Job()
>>> j1.clocktimes
[]
>>> print(c1.job)
None
When you populate j1.clocktimes with an object, you should also see c1.job get a non-None value.
>>> j1.clocktimes.append(c1)
>>> j1.clocktimes
[an instance of `Clocktime`]
>>> c1.job
[an instance of `Job`]
Do you find that behaviour? I don't see in your code where you populate clocktimes so the population of job is never triggered.
I think you are expecting the addition of ForeignKey to the column definition to do something it doesn't do. The ForeignKey constraint you put on job_id simply means that its values are constrained to those that exist in the id column of the jobs table.
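A minimal sketch of linking the two objects so that job_id gets filled in, assuming an in-memory SQLite database and a trimmed-down copy of the two models (the job name is made up). Passing job= to Clocktime uses the backref declared on Job.clocktimes, and SQLAlchemy copies the primary key into job_id when the session is flushed or committed:

```python
import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Job(Base):
    __tablename__ = "jobs"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))
    clocktimes = relationship("Clocktime", backref="job")


class Clocktime(Base):
    __tablename__ = "clocktimes"
    id = Column(Integer, primary_key=True)
    time_in = Column(DateTime)
    job_id = Column(Integer, ForeignKey("jobs.id"))


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    job = Job(name="Project X")
    # Setting .job also appends the object to job.clocktimes via the backref.
    clock = Clocktime(time_in=datetime.datetime.now(), job=job)
    in_collection = clock in job.clocktimes

    session.add_all([job, clock])
    session.commit()

    # After the commit the foreign key mirrors the parent's primary key.
    job_pk, clock_fk = job.id, clock.job_id

print(in_collection, job_pk, clock_fk)
```

The same link can be made from the other side with job.clocktimes.append(clock).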
