I would like to have the row number as a column of my queries. Since I am using MySQL, I cannot use the built-in func.row_number() of SQLAlchemy. The result of this query is going to be paginated, so I would like to compute the row number before the split happens.
session.query(MyModel.id, MyModel.date, "row_number")
I tried to use a hybrid_property to increment a static variable inside the MyModel class that I reset before my query, but it didn't work.
@hybrid_property
def row_number(self):
    cls = self.__class__
    cls.row_index = cls.row_index + 1
    return literal(self.row_index)

@row_number.expression
def row_number(cls):
    cls.row_index = cls.row_index + 1
    return literal(cls.row_index)
I also tried to mix a subquery with this solution:
session.query(myquery.subquery(), literal("@rownum := @rownum + 1 AS row_number"))
But I didn't find a way to make a textual join for (SELECT @rownum := 0) r.
Any suggestions?
EDIT
For the moment, I am looping over the results of the paginated query and assigning the number calculated from the current page to each row.
SQLAlchemy allows you to use text() in some places, but not arbitrarily. I especially cannot find an easy/documented way of using it in columns or joins. However, you can write your entire query in SQL and still get ORM objects out of it. Example:
query = session.query(Foobar, "rownum")
query = query.from_statement(
    "select foobar.*, cast(@row := @row + 1 as unsigned) as rownum"
    " from foobar, (select @row := 0) as init"
)
That being said, I don't really see the problem with something like enumerate(query.all()) either. Note that if you use a LIMIT expression, the row numbers you get from MySQL will be for the final result and will still need to have the page start index added. That is, it's not "before the split" by default. If you want to have the starting row added for you in MySQL you can do something like this:
prevrow = 42
query = session.query(Foobar, "rownum")
query = query.from_statement(sqlalchemy.text(
    "select foobar.*, cast(@row := @row + 1 as unsigned) as rownum"
    " from foobar, (select @row := :prevrow) as init"
).bindparams(prevrow=prevrow))
In this case the numbers will start at 43 since it's pre-incrementing.
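If client-side numbering is enough, here is a minimal sketch of the enumerate() approach with the page start added back in (page and page_size are hypothetical pagination values, and ordering by Foobar.id is an assumption):

page, page_size = 2, 20   # hypothetical pagination values
offset = page * page_size
rows = session.query(Foobar).order_by(Foobar.id).offset(offset).limit(page_size).all()
numbered = [(offset + i + 1, row) for i, row in enumerate(rows)]   # 1-based row numbers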
Related
I am building an API which can return children of resources if the user requests it. For example, a user has messages. I want the query to be able to limit the number of message objects that are returned.
I found a useful tip about limiting the number of objects in child collections here. Basically, it suggests the following flow:
class User(...):
    # ...
    messages = relationship('Messages', order_by='desc(Messages.date)', lazy='dynamic')

user = User.query.one()
user.messages.limit(10)
My use case involves sometimes returning large numbers of users.
If I were to follow the advice in that link and use .limit(), then I would need to iterate over the entire collection of users, calling .limit() on each one. This is much less efficient than, say, using LIMIT in the original SQL expression which created the collection.
My question is whether it is possible, using declarative, to efficiently (N+0 queries) load a large collection of objects while limiting the number of children in their child collections using SQLAlchemy?
UPDATE
To be clear, the below is what I am trying to avoid.
users = User.query.all()
messages = {}
for user in users:
    messages[user.id] = user.messages.limit(10).all()
I want to do something more like:
users = User.query.option(User.messages.limit(10)).all()
This answer comes from Mike Bayer on the SQLAlchemy Google group. I'm posting it here to help folks:
TLDR:
I used version 1 of Mike's answer to solve my problem because, in this case, I do not have foreign keys involved in this relationship and so cannot make use of LATERAL. Version 1 worked great, but be sure to note the effect of offset. It threw me off during testing for a while because I didn't notice it was set to something other than 0.
Code Block for version 1:
subq = s.query(Messages.date).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).\
    limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
    Messages,
    and_(User.id == Messages.user_id, Messages.date > subq)
).options(contains_eager(User.messages))
Mike's Answer
So you should ignore whether or not it uses "declarative", which has nothing to do with querying; in fact, at first ignore Query too, because first and foremost this is a SQL problem. You want one SQL statement that does this. What query in SQL would load lots of rows from the primary table, joined to the first ten rows of the secondary table for each primary?
LIMIT is tricky because it's not actually part of the usual "relational algebra" calculation. It's outside of that because it's an artificial limit on rows. For example, my first thought on how to do this was wrong:
select * from users left outer join (select * from messages limit 10) as anon_1 on users.id = anon_1.user_id
This is wrong because it only gets the first ten messages in the aggregate, disregarding user. We want to get the first ten messages for each user, which means we need to do this "select from messages limit 10" individually for each user. That is, we need to correlate somehow. A correlated subquery, though, is not usually allowed as a FROM element and is only allowed as a SQL expression; it can only return a single column and a single row. We can't normally JOIN to a correlated subquery in plain vanilla SQL. We can, however, correlate inside the ON clause of the JOIN to make this possible in vanilla SQL.
But first, if we are on a modern Postgresql version, we can break that usual rule of correlation and use a keyword called LATERAL, which allows correlation in a FROM clause. LATERAL is only supported by modern Postgresql versions, and it makes this easy:
select * from users left outer join lateral
    (select * from messages where messages.user_id = users.id
     order by messages.date desc limit 10) as anon_1
    on users.id = anon_1.user_id
We support the LATERAL keyword as of SQLAlchemy 1.1. The query above looks like this:
subq = s.query(Messages).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).limit(10).subquery().lateral()

q = s.query(User).outerjoin(subq).\
    options(contains_eager(User.messages, alias=subq))
Note that above, in order to SELECT both users and messages and produce them into the User.messages collection, the "contains_eager()" option must be used and for that the "dynamic" has to go away. This is not the only option, you can for example build a second relationship for User.messages that doesn't have the "dynamic" or you can just load from query(User, Message) separately and organize the result tuples as needed.
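As a rough sketch of that last alternative (loading tuples and organizing them yourself), assuming the User/Messages models from the POC further down; note that without a limiting join condition this loads every message:

from collections import defaultdict

rows = s.query(User, Messages).join(Messages, User.id == Messages.user_id).all()
messages_by_user = defaultdict(list)
for user, message in rows:
    messages_by_user[user.id].append(message)   # organize the result tuples per user

The same date-bound join condition used in the non-LATERAL version below can be added to the join to keep each user's list at ten messages.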
If you aren't using Postgresql, or are using a version of Postgresql that doesn't support LATERAL, the correlation has to be worked into the ON clause of the join instead. The SQL looks like:
select * from users left outer join messages on
    users.id = messages.user_id and messages.date > (
        select date from messages where messages.user_id = users.id
        order by date desc limit 1 offset 10)
Here, in order to jam the LIMIT in there, we are actually stepping through the first 10 rows with OFFSET and then doing LIMIT 1 to get the date that represents the lower bound date we want for each user. Then we have to join while comparing on that date, which can be expensive if this column isn't indexed and also can be inaccurate if there are duplicate dates.
This query looks like:
subq = s.query(Messages.date).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).\
    limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
    Messages,
    and_(User.id == Messages.user_id, Messages.date > subq)
).options(contains_eager(User.messages))
These kinds of queries are the kind that I don't trust without a good test, so POC below includes both versions including a sanity check.
from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base
import datetime

Base = declarative_base()

class User(Base):
    __tablename__ = 'user'
    id = Column(Integer, primary_key=True)
    messages = relationship(
        'Messages', order_by='desc(Messages.date)')

class Messages(Base):
    __tablename__ = 'message'
    id = Column(Integer, primary_key=True)
    user_id = Column(ForeignKey('user.id'))
    date = Column(Date)

e = create_engine("postgresql://scott:tiger@localhost/test", echo=True)
Base.metadata.drop_all(e)
Base.metadata.create_all(e)

s = Session(e)
s.add_all([
    User(id=i, messages=[
        Messages(id=(i * 20) + j, date=datetime.date(2017, 3, j))
        for j in range(1, 20)
    ]) for i in range(1, 51)
])
s.commit()

top_ten_dates = set(datetime.date(2017, 3, j) for j in range(10, 20))

def run_test(q):
    all_u = q.all()
    assert len(all_u) == 50
    for u in all_u:
        messages = u.messages
        assert len(messages) == 10
        for m in messages:
            assert m.user_id == u.id
        received = set(m.date for m in messages)
        assert received == top_ten_dates

# version 1. no LATERAL
s.close()
subq = s.query(Messages.date).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).\
    limit(1).offset(10).correlate(User).as_scalar()

q = s.query(User).join(
    Messages,
    and_(User.id == Messages.user_id, Messages.date > subq)
).options(contains_eager(User.messages))
run_test(q)

# version 2. LATERAL
s.close()
subq = s.query(Messages).\
    filter(Messages.user_id == User.id).\
    order_by(Messages.date.desc()).limit(10).subquery().lateral()

q = s.query(User).outerjoin(subq).\
    options(contains_eager(User.messages, alias=subq))
run_test(q)
If you apply limit and then call .all() on it, you will get all objects at once; it will not fetch objects one by one, causing the performance issue you mentioned.
Simply apply the limit and get all objects:
users = User.query.limit(50).all()
print(len(users))
>>50
Or for child objects/relationships:
user = User.query.one()
all_messages = user.messages.limit(10).all()
users = User.query.all()
messages = {}
for user in users:
    messages[user.id] = user.messages.limit(10).all()
So, I think you'll need to load the messages in a second query and then later associate them with your users somehow.
The following is database dependent; as discussed in this question, MySQL does not support LIMIT inside a subquery used with IN, but SQLite at least will parse the query. I didn't look at the plan to see if it did a good job.
The following code will find all the message objects you care about. You then need to associate them with users.
I've tested this to confirm that it produces a query sqlite can parse; I have not confirmed that sqlite or any other database does the right thing with this query.
I had to cheat a bit and use the text primitive to refer to the outer users.id column in the select, because SQLAlchemy kept wanting to include an additional join to users in the inner select subquery.
from sqlalchemy import Column, Integer, String, ForeignKey, alias
from sqlalchemy.sql import text
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String)

class Message(Base):
    __tablename__ = 'messages'
    user_id = Column(Integer, ForeignKey(User.id), nullable=False)
    id = Column(Integer, primary_key=True)

s = Session()

m1 = alias(Message.__table__)
user_query = s.query(User)  # add any user filtering you want
inner_query = s.query(m1.c.id).filter(m1.c.user_id == text('users.id')).limit(10)
all_messages_you_want = s.query(Message).join(User).filter(Message.id.in_(inner_query))
To associate the messages with users, you could do something like the following, assuming your Message has a user relation and your user objects have a got_child_message method that does whatever you like with each message:
users_resulting = user_query.all()  # load objects into session and hold a reference
for m in all_messages_you_want:
    m.user.got_child_message(m)
Because you already have the users in the session, and because the relation is on User's primary key, m.user resolves to a query.get against the identity map.
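As a quick illustration of that identity-map behaviour (a sketch; user_id_example is a hypothetical primary key of one of the users loaded above):

u1 = s.query(User).get(user_id_example)   # no SELECT emitted; found in the identity map
u2 = s.query(User).get(user_id_example)
assert u1 is u2   # the same Python object both times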
I hope this helps you get somewhere.
@melchoir's answer is the best. I'm basically putting this here for future selves.
I played around with the above answer, and it works. I needed it mostly to limit the number of associations returned before passing them into a Marshmallow serializer.
Some issues for clarification:
the subquery runs per association, hence it finds the corresponding date to use as the bound properly
think of the limit/offset as "give me 1 (limit) record starting at the next X (offset)": it finds the Xth record in the date ordering, and then the main query gives everything back from that point. It's quite smart
it appears that if the association has fewer than X records, it returns nothing, as the offset is past the records, and so the main query does not return a record
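A quick way to see that offset behaviour (a sketch, assuming the DeviceHistory model from the code below and a hypothetical some_uuid):

boundary = db.session.query(DeviceHistory.created_at) \
    .filter(DeviceHistory.device_id == some_uuid) \
    .order_by(DeviceHistory.created_at.desc()) \
    .limit(1).offset(10).scalar()
print(boundary)   # None when the device has 10 or fewer histories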
Using the above as a template, I came up with the answer below. The initial query/count guard is due to the issue that if the associated records are fewer than the offset, nothing is found. In addition, I needed to add an outerjoin in case there are no associations at all.
In the end, I found this query to be a bit of ORM voodoo and didn't want to go that route. Instead, I exclude the histories from the device serializer and require a second history lookup using the device ID. That set can be paginated and makes everything a bit cleaner.
Both methods work; it just comes down to why you need the one query versus a couple. In the above, there were probably business reasons to get everything back more efficiently with the single query. For my use case, readability and convention trumped the voodoo.
@classmethod
def get_limited_histories(cls, uuid, limit=10):
    count = DeviceHistory.query.filter(DeviceHistory.device_id == uuid).count()
    if count > limit:
        sq = db.session.query(DeviceHistory.created_at) \
            .filter(DeviceHistory.device_id == Device.uuid) \
            .order_by(DeviceHistory.created_at.desc()) \
            .limit(1).offset(limit).correlate(Device)

        return db.session.query(Device).filter(Device.uuid == uuid) \
            .outerjoin(DeviceHistory,
                       and_(DeviceHistory.device_id == Device.uuid, DeviceHistory.created_at > sq)) \
            .options(contains_eager(Device.device_histories)).all()[0]
It then behaves like Device.query.get(id), but as Device.get_limited_histories(id).
ENJOY
query = "SELECT serialno from registeredpcs where ipaddress = "
usercheck = query + "'%s'" %fromIP
#print("query"+"-"+usercheck)
print(usercheck)
rs = cursor.execute(usercheck)
print(rs)
row = rs
#print(row)
#rs = cursor.rowcount()
if int(row) == 1:
query = "SELECT report1 from registeredpcs where serialno = "
firstreport = query + "'%s'" %rs
result = cursor.execute(firstreport)
print(result)
elif int(row) == 0:
query_new = "SELECT * from registeredpcs"
cursor.execute(query_new)
newrow = cursor.rowcount()+1
print(new row)
What I am trying to do here is fetch the serialno values from the db when they match a certain ipaddress. This query is working fine. As it should the query result set rs is 0. Now I am trying to use that value and do something else in the if/else construct. Basically I am trying to check for unique values in the db based on the ipaddress value. But I am getting this error:
error: uncaptured python exception, closing channel smtpd.SMTPChannel connected
192.168.1.2:3630 at 0x2e47c10 (class 'TypeError': 'int' object is not
callable [C:\Python34\lib\asyncore.py|read|83]
[C:\Python34\lib\asyncore.py|handle_read_event|442]
[C:\Python34\lib\asynchat.py|handle_read|171]
[C:\Python34\lib\smtpd.py|found_terminator|342] [C:/Users/Dev-
P/PycharmProjects/CR Server Local/LRS|process_message|43])
I know I am making some very basic mistake. I think it's the 'int' object is not callable part that's causing the error, but I just can't put my finger on it. I tried using the rowcount() method; it didn't help.
rowcount is an attribute, not a method; you shouldn't call it.
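In other words (a minimal sketch, assuming a MySQLdb-style cursor as in the question):

rs = cursor.execute(usercheck)   # execute() returns the number of matched rows here
print(cursor.rowcount)           # attribute access: no parentheses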
"I know I am making some very basic mistake" : well, Daniel Roseman alreay adressed the cause of your main error, but there are a couple other mistakes in your code:
query = "SELECT serialno from registeredpcs where ipaddress = "
usercheck = query + "'%s'" % fromIP
rs = cursor.execute(usercheck)
This part is hard to read (you're using both string concatenation and string formatting for no good reason), brittle (try this with fromIP = "'foo'"), and very, very unsafe. You want to use parameterized queries instead, i.e.:
# nb check your exact db-api module for the correct placeholder,
# MySQLdb uses '%s' but some other use '?' instead
query = "SELECT serialno from registeredpcs where ipaddress=%s"
params = [fromIP,]
rs = cursor.execute(query, params)
"As it should the query result set rs is 0"
This is actually plain wrong. cursor.execute() returns the number of rows affected (selected, created, updated, deleted) by the query. The "result set" is really the cursor itself. You can fetch results using cursor.fetchone(), cursor.fetchall(), or more simply (and more efficiently if you want to work on the whole result set with constant memory use) by iterating over the cursor, i.e.:
for row in cursor:
    print(row)
Let's continue with your code:
row = rs
if int(row) == 1:
    # ...
elif int(row) == 0:
    # ...
The first line is useless: it only makes row an alias of rs, and it is badly named; it's not a "row" (one line of results from your query), it's an int. Since it's already an int, converting it to int is also useless. And finally, unless ipaddress is a unique key in your table, your query might return more than one row.
If what you want is the effective value(s) for the serialno field for records matching fromIP, you have to fetch the row(s):
row = cursor.fetchone() # first row, as a tuple
then get the value, which in this case will be the first item in row:
serialno = row[0]
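Putting the pieces together, a minimal corrected sketch of the lookup might look like this (still assuming a MySQLdb-style cursor and the fromIP variable from the question):

query = "SELECT serialno FROM registeredpcs WHERE ipaddress = %s"
cursor.execute(query, [fromIP])
row = cursor.fetchone()
if row is not None:
    serialno = row[0]   # the first (and only) selected column
    print(serialno)
else:
    print("no registered pc found for this address")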
I am currently working on getting the following query to run with SQLAlchemy.
SELECT *
FROM
qsreport_test.JiraFehlerFreigabeReleaseTestData a
WHERE a.dtime = (
SELECT
max(dtime)
FROM
qsreport_test.JiraFehlerFreigabeReleaseTestData b
WHERE
b.key=a.key
AND
b.testplanname=a.testplanname
AND
b.priority=a.priority
)
GROUP BY
a.testplanname, a.priority
I searched here and found
SQLAlchemy - subquery in a WHERE clause
and I also know
http://docs.sqlalchemy.org/en/rel_0_9/orm/tutorial.html#using-subqueries
but I still do not get my query to work.
Here are two of my tries. The first one can be executed but does not give me the same result as the manually executed query. The second one can be executed and is now working (it was missing a parenthesis before).
Try 1:
subq = session.query(
    JiraFehlerFreigabeReleaseTestData.key,
    JiraFehlerFreigabeReleaseTestData.testplanname,
    JiraFehlerFreigabeReleaseTestData.priority,
    func.max(JiraFehlerFreigabeReleaseTestData.dtime).label('max_dtime')
).filter(
    JiraFehlerFreigabeReleaseTestData.key == key
).filter(
    JiraFehlerFreigabeReleaseTestData.dtime <= dtime
).subquery()

res = session.query(JiraFehlerFreigabeReleaseTestData).filter(
    and_(
        JiraFehlerFreigabeReleaseTestData.key == subq.c.key,
        JiraFehlerFreigabeReleaseTestData.testplanname == subq.c.testplanname,
        JiraFehlerFreigabeReleaseTestData.priority == subq.c.priority,
        JiraFehlerFreigabeReleaseTestData.dtime == subq.c.max_dtime
    )
).group_by(
    JiraFehlerFreigabeReleaseTestData.testplanname
).order_by(
    JiraFehlerFreigabeReleaseTestData.dtime.desc()
).all()
Try 2:
b = aliased(JiraFehlerFreigabeReleaseTestData, name='b')
res = session.query(JiraFehlerFreigabeReleaseTestData).filter(
    JiraFehlerFreigabeReleaseTestData.dtime == (
        session.query(func.max(b.dtime))
        .filter(b.key == JiraFehlerFreigabeReleaseTestData.key)
        .filter(b.testplanname == JiraFehlerFreigabeReleaseTestData.testplanname)
        .filter(b.priority == JiraFehlerFreigabeReleaseTestData.priority)
    )
).group_by(
    JiraFehlerFreigabeReleaseTestData.testplanname,
    JiraFehlerFreigabeReleaseTestData.priority
).order_by(
    JiraFehlerFreigabeReleaseTestData.dtime.desc()
).all()
Please show me what I'm doing wrong. My problem is that I do not completely understand how to reference the table of the main query in the subquery, especially as the two tables are the same in both queries. Perhaps there is a straightforward way to convert my manual SQL query to SQLAlchemy ORM syntax?
Edit:
My second version is working. I was just missing one parenthesis. I fixed the code above.
But why is my first try giving me a different result?
I've not worked with SQLAlchemy yet, but perhaps you can solve your problem if you aren't using a subquery. Take a look at the query below. Not tested, and freely translated from yours ;)
SELECT a.*
FROM qsreport_test.JiraFehlerFreigabeReleaseTestData a
INNER JOIN qsreport_test.JiraFehlerFreigabeReleaseTestData b
ON b.key = a.key
AND b.testplanname=a.testplanname
AND b.priority=a.priority
GROUP BY a.testplanname, a.priority
HAVING a.dtime = max(b.dtime)
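A rough SQLAlchemy translation of that join-based query might look like the sketch below (untested, and assuming the JiraFehlerFreigabeReleaseTestData model plus the aliased, and_, and func imports already used in the question):

a = JiraFehlerFreigabeReleaseTestData
b = aliased(JiraFehlerFreigabeReleaseTestData, name='b')
res = session.query(a).join(
    b,
    and_(b.key == a.key,
         b.testplanname == a.testplanname,
         b.priority == a.priority)
).group_by(
    a.testplanname, a.priority
).having(
    a.dtime == func.max(b.dtime)
).all()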
I'm creating a Python program that connects to MySQL.
I need to check if a table contains the number 1 to show that it has connected successfully. This is my code thus far:
xcnx.execute('CREATE TABLE settings(status INT(1) NOT NULL)')
xcnx.execute('INSERT INTO settings(status) VALUES(1)')
cnx.commit()
sqlq = "SELECT * FROM settings WHERE status = '1'"
xcnx.execute(sqlq)
results = xcnx.fetchall()
if results == '1':
    print 'yep its connected'
else:
    print 'nope not connected'
What have I missed? I am an SQL noob. Thanks, guys.
I believe the most efficient "does it exist" query is just to do a count:
sqlq = "SELECT COUNT(1) FROM settings WHERE status = '1'"
xcnx.execute(sqlq)
if xcnx.fetchone()[0]:
    # exists
Instead of asking the database to perform any count operations on fields or rows, you are just asking it to return a 1 or 0 if the result produces any matches. This is much more efficient than returning actual records and counting the amount client side, because it saves serialization and deserialization on both sides, and the data transfer.
In [22]: c.execute("select count(1) from settings where status = 1")
Out[22]: 1L # rows
In [23]: c.fetchone()[0]
Out[23]: 1L # count found a match
In [24]: c.execute("select count(1) from settings where status = 2")
Out[24]: 1L # rows
In [25]: c.fetchone()[0]
Out[25]: 0L # count did not find a match
count(*) is going to be the same as count(1). In your case, because you are creating a new table, it is going to show 1 result. If you had 10,000 matches it would be 10000. But all you care about in your test is whether it is NOT 0, so you can perform a bool truth test.
Update
Actually, it is even faster to just use the rowcount, and not even fetch results:
In [15]: if c.execute("select (1) from settings where status = 1 limit 1"):
             print True
True

In [16]: if c.execute("select (1) from settings where status = 10 limit 1"):
             print True

In [17]:
This is also how Django's ORM does a queryObject.exists().
If all you want to do is check whether you have successfully established a connection, then why are you trying to create a table, insert a row, and then retrieve data from it?
You could simply do the following...
sqlq = "SELECT * FROM settings WHERE status = '1'"
xcnx.execute(sqlq)
results = xcnx.fetchone()
if results =='1':
print 'yep its connected'
else:
print 'nope not connected'
In fact, if your program has not thrown an exception so far, that indicates you have established the connection successfully. (Do check the code above; I'm not sure if fetchone will return a tuple, string, or int in this case.)
By the way, if for some reason you do need to create the table, I would suggest dropping it before you exit so that your program runs successfully the second time.
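For example (a sketch using the cursor from the question):

xcnx.execute('DROP TABLE IF EXISTS settings')   # so the CREATE TABLE succeeds on the next run
cnx.commit()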
When you run results = xcnx.fetchall(), the return value is a sequence of tuples that contain the row values. Therefore, when you check results == '1', you are trying to compare a sequence to a constant, which will return False. In your case, a single row with a single value will be returned, so you could try this:
results = xcnx.fetchall()
# Get the value of the returned row, which will be 0 with a non-match
if results[0][0]:
    print 'yep its connected'
else:
    print 'nope not connected'
You could alternatively use a DictCursor (when creating the cursor, use .cursor(MySQLdb.cursors.DictCursor)), which would make things a bit easier to interpret code-wise, but the result is the same:
if results[0]['COUNT(*)']:
    # Continues...
Also, not a big deal in this case, but you are comparing an integer value to a string. MySQL will do the type conversion, but you could use SELECT COUNT(*) FROM settings WHERE status = 1 and save a (very small) bit of processing.
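Putting that together, a minimal sketch of the whole check (assuming the xcnx cursor from the question):

xcnx.execute("SELECT COUNT(*) FROM settings WHERE status = 1")
(count,) = xcnx.fetchone()   # one row, one column
if count:
    print 'yep its connected'
else:
    print 'nope not connected'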
I recently improved my efficiency: instead of doing a SELECT query first, I just add a unique (primary) index to the column and then insert. MySQL will only add the row if it doesn't already exist.
So instead of 2 statements:
Query MySQL for existence
Query MySQL to insert the data
just do 1, and it will only succeed if the value is unique:
Query MySQL to insert the data
1 query is better than 2.
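A sketch of this approach in MySQL terms; note that a plain INSERT raises a duplicate-key error, so something like INSERT IGNORE is assumed here (reusing the xcnx cursor from above):

# one-time setup: a unique index so duplicate rows are rejected
xcnx.execute("ALTER TABLE settings ADD UNIQUE INDEX uniq_status (status)")
# single statement: inserts only when the value is not already present
xcnx.execute("INSERT IGNORE INTO settings (status) VALUES (1)")
cnx.commit()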
Reading: http://code.google.com/appengine/docs/python/datastore/gqlreference.html
I want to use:
IN
but am unsure how to make it work. Let's assume the following:
class User(db.Model):
    name = db.StringProperty()

class UniqueListOfSavedItems(db.Model):
    str = db.StringProperty()
    datesaved = db.DateTimeProperty()

class UserListOfSavedItems(db.Model):
    name = db.ReferenceProperty(User, collection='user')
    str = db.ReferenceProperty(UniqueListOfSavedItems, collection='itemlist')
How can I do a query which gets me the list of saved items for a user? Obviously I can do:
q = db.Gql("SELECT * FROM UserListOfSavedItems WHERE name :=", user[0].name)
but that gets me a list of keys. I want to now take that list and get it into a query to get the str field out of UniqueListOfSavedItems. I thought I could do:
q2 = db.Gql("SELECT * FROM UniqueListOfSavedItems WHERE := str in q")
but something's not right... any ideas? Is it (I'm at my day job, so can't test this now):
q2 = db.Gql("SELECT * FROM UniqueListOfSavedItems __key__ := str in q)
side note: what a devilishly difficult problem to search on because all I really care about is the "IN" operator.
Since you have a list of keys, you don't need to do a second query - you can do a batch fetch, instead. Try this:
#and this should get me the items that a user saved
useritems = db.get(saveditemkeys)
(Note you don't even need the guard clause: a db.get on 0 entities is short-circuited appropriately.)
What's the difference, you may ask? Well, a db.get takes about 20-40ms. A query, on the other hand (GQL or not) takes about 160-200ms. But wait, it gets worse! The IN operator is implemented in Python, and translates to multiple queries, which are executed serially. So if you do a query with an IN filter for 10 keys, you're doing 10 separate 160ms-ish query operations, for a total of about 1.6 seconds latency. A single db.get, in contrast, will have the same effect and take a total of about 30ms.
+1 to Adam for getting me on the right track. Based on his pointer, and doing some searching at Code Search, I have the following solution.
usersaveditems = User.Gql("SELECT * FROM UserListOfSavedItems WHERE user = :1", userkey)

saveditemkeys = []
for item in usersaveditems:
    # this should create a list of keys (references) to the saved item table
    saveditemkeys.append(item.str)

if len(saveditemkeys) > 0:
    # and this should get me the items that a user saved
    useritems = db.Gql("SELECT * FROM UniqueListOfSavedItems WHERE __key__ IN :1", saveditemkeys)