SQLAlchemy distinct ignoring order_by - python

I'm trying to run a query from my Flask application using SQLAlchemy where I order the results by the timestamp in descending order. From the ordered results I want to pull the distinct sender_ids. Unfortunately, distinct ignores my requested order_by and returns the data in the table's natural order.
messagesReceived = Message.query.filter_by(recipient_id=user_id).order_by(Message.timestamp.desc()).group_by(Message.sender_id).distinct()
I'm still new to SQLAlchemy, and I haven't encountered this in the tutorials and lessons I've done. I tried googling, but I don't think I'm phrasing the question correctly to find the answer. I'm currently trying to wrap my head around subqueries, as I think that might be a way to make this work, but I'm asking here in the meantime.

Your SQL query is illogical. You select the entire message, but group by sender_id. Unless sender_id is unique, it is indeterminate which row in each group the other values are selected from. ORDER BY is logically performed after GROUP BY, and since timestamp is now indeterminate, so is the resulting order. Some SQL DBMSs do not even allow such a query to run, as it is not allowed by the SQL standard.
To fetch distinct sender_ids ordered by each sender's latest timestamp, group by sender_id and order by the aggregate (descending, to match the original intent):

messagesReceived = db.session.query(Message.sender_id).\
    filter_by(recipient_id=user_id).\
    group_by(Message.sender_id).\
    order_by(db.func.max(Message.timestamp).desc())
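If you need the full Message rows (not just the ids), a common pattern is a subquery that finds each sender's latest timestamp and joins it back to the table. A hedged sketch, assuming the Message model from the question and that timestamps are unique per sender:

latest = db.session.query(
        Message.sender_id,
        db.func.max(Message.timestamp).label('max_ts'),
    ).\
    filter_by(recipient_id=user_id).\
    group_by(Message.sender_id).\
    subquery()

# Join each sender's max timestamp back to pick the matching message row.
messagesReceived = db.session.query(Message).\
    join(latest, db.and_(
        Message.sender_id == latest.c.sender_id,
        Message.timestamp == latest.c.max_ts,
    )).\
    order_by(Message.timestamp.desc()).\
    all()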

If you're using PostgreSQL,
messagesReceived = Message.query.filter_by(recipient_id=user_id).order_by(Message.sender_id.asc(), Message.timestamp.desc()).distinct(Message.sender_id).all()
There may be no need to group by sender_id. Distinct can also be applied at the Query level (it affects the entire query, not just one column), as described in the SQLAlchemy docs, so the order_by must lead with sender_id, after which distinct can be applied to that specific column.
This DISTINCT ON form is specific to PostgreSQL, however, so if you're using another DB, I would recommend the distinct expression instead:
from sqlalchemy import distinct, func, select

# Counts the number of distinct names across the whole table.
stmt = select([func.count(distinct(users_table.c.name))])

Related

Does Django provide any built-in way to update PostgreSQL autoincrement counters?

I'm migrating a Django site from MySQL to PostgreSQL. The quantity of data isn't huge, so I've taken a very simple approach: I've just used the built-in Django serialize and deserialize routines to create JSON records, and then load them in the new instance, loop over the objects, and save each one to the new database.
This works very nicely, with one hiccup: after loading all the records, I run into an IntegrityError when I try to add new data after loading the old records. The Postgres equivalent of a MySQL autoincrement ID field is a serial field, but the internal counter for serial fields isn't incremented when id values are specified explicitly. As a result, Postgres tries to start numbering records at 1 -- already used -- causing a constraint violation. (This is a known issue in Django, marked wontfix.)
There are quite a few questions and answers related to this, but none of the answers seem to address the issue directly in the context of Django. This answer gives an example of the query you'd need to run to update the counter, but I try to avoid making explicit queries when possible. I could simply delete the ID field before saving and let Postgres do the numbering itself, but there are ForeignKey references that will be broken in that case. And everything else works beautifully!
It would be nice if Django provided a routine for doing this that intelligently handles any edge cases. (This wouldn't fix the bug, but it would allow developers to work around it in a consistent and correct way.) Do we really have to just use a raw query to fix this? It seems so barbaric.
If there's really no such routine, I will simply do something like the below, which directly runs the query suggested in the answer linked above. But in that case, I'd be interested to hear about any potential issues with this approach, or any other information about what I might be doing wrong. For example, should I just modify the records to use UUIDs instead, as this suggests?
Here's the raw approach (edited to reflect a simplified version of what I actually wound up doing). It's pretty close to Pere Picornell's answer, but his looks more robust to me.
from django.db import connection

table = model._meta.db_table
cur = connection.cursor()
# Set the sequence to max(id) so the next insert gets max(id) + 1.
cur.execute(
    "SELECT setval('{}_id_seq', (SELECT max(id) FROM {}))".format(table, table)
)
About the debate: my case is a one-time migration, and my decision was to run this function right after finishing each table's migration, although you could call it any time you suspect the sequence could be out of sync.
from django.db import connections

def synchronize_last_sequence(model):
    # PostgreSQL auto-increments (called sequences) don't advance when you
    # insert rows with explicitly specified IDs. This resets the sequence
    # to the highest existing primary key value.
    sequence_name = model._meta.db_table + "_" + model._meta.pk.name + "_seq"
    with connections['default'].cursor() as cursor:
        cursor.execute(
            "SELECT setval('" + sequence_name + "', (SELECT max(" + model._meta.pk.name + ") FROM " +
            model._meta.db_table + "))"
        )
    print("Last auto-increment number for sequence " + sequence_name + " synchronized.")
This does it using the SQL query you proposed in your question.
It's been very useful to find your post. Thank you!
It should work with custom PKs, but not with multi-field PKs.
One option is to use natural keys during serialization and deserialization. That way, when you insert the data into PostgreSQL, it will auto-increment the primary key field itself and keep everything in line.
The downside to this approach is that each model needs a set of unique fields that don't include the id.
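A hedged sketch of the natural-key hooks, using a hypothetical Person model (natural_key() and get_by_natural_key() are the standard Django extension points):

from django.db import models

class PersonManager(models.Manager):
    # Lets the deserializer find a row by its natural key instead of its id.
    def get_by_natural_key(self, first_name, last_name):
        return self.get(first_name=first_name, last_name=last_name)

class Person(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    objects = PersonManager()

    class Meta:
        unique_together = [('first_name', 'last_name')]

    # Serialized in place of the primary key.
    def natural_key(self):
        return (self.first_name, self.last_name)

Dump with python manage.py dumpdata --natural-foreign --natural-primary; with --natural-primary the id is omitted from the JSON entirely, so PostgreSQL assigns fresh serial values (and advances the sequence) on load.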

Different WHERE clauses depending on table schema

I'm trying to make a MySQL dump of a database using the mysqldump command-line tool, and I'm looking to do something kind of unique. Essentially, if a table contains a certain column, dump according to those constraints; otherwise, dump the whole table. But due to the nature of mysqldump's command-line options, this all has to be in one WHERE clause.
In pseudocode, this is what I have to do:
if table.columns contains "user_id":
    where = "user_id=1"  // dump rows where user_id=1
else:
    where = "TRUE"       // dump the whole table
But it all has to be in one WHERE clause that will apply to every table mysqldump comes across.
The --where option for mysqldump can be used to pass an expression to the SELECT for each table, but it passes the same expression to all tables.
If you need to use different WHERE clauses for different subsets of tables, you'll need to run more than one mysqldump command and execute each one against a different set of tables.
Re your comment:
No, the references to columns in an SQL query are fixed at the time the query is parsed, which means every column reference must refer to a column that exists in the table. There's no way to write an SQL expression that uses a column if it exists and ignores the reference if it doesn't.
To do what you're describing, you would have to query the INFORMATION_SCHEMA to see what columns exist in the table, and then format the SQL query for the dump conditionally, based on the columns you find do exist in the table.
But in mysqldump, you don't have the opportunity to do that. It would have to be implemented in the code for mysqldump.
Feel free to get the source code for mysqldump and add your own logic to it. Seems like a lot of work though.
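That said, you can script the conditional yourself outside mysqldump. A hedged sketch in Python, assuming a local server, the mysql and mysqldump binaries on your PATH, and hypothetical database/credential values:

import subprocess

DB = 'mydb'            # hypothetical database name
AUTH = ['-u', 'root']  # hypothetical credentials

def run_query(query):
    # -N suppresses column headers; -e executes the given statement.
    out = subprocess.check_output(['mysql', *AUTH, '-N', '-e', query])
    return set(out.decode().split())

# Ask INFORMATION_SCHEMA which tables actually have a user_id column.
with_user_id = run_query(
    "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.COLUMNS "
    "WHERE TABLE_SCHEMA = '{}' AND COLUMN_NAME = 'user_id'".format(DB)
)
without_user_id = run_query('SHOW TABLES FROM {}'.format(DB)) - with_user_id

# Two dumps: constrained where the column exists, unconstrained elsewhere.
with open('dump.sql', 'wb') as f:
    if with_user_id:
        f.write(subprocess.check_output(
            ['mysqldump', *AUTH, '--where=user_id=1', DB, *sorted(with_user_id)]))
    if without_user_id:
        f.write(subprocess.check_output(
            ['mysqldump', *AUTH, DB, *sorted(without_user_id)]))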

flask-sqlalchemy delete query failing with "Could not evaluate current criteria in Python"

I have a query using flask-sqlalchemy in which I want to delete all the stocks from the database whose ticker matches one in a list. This is the current query I have:
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete()
where new_tickers is a list of valid ticker strings.
The error I am getting is the following:
sqlalchemy.exc.InvalidRequestError: Could not evaluate current criteria in Python: "Cannot evaluate clauselist with operator <function comma_op at 0x1104e4730>". Specify 'fetch' or False for the synchronize_session parameter.
You need to use one of the following options for the bulk delete:
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session=False)
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session='evaluate')
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session='fetch')
Basically, SQLAlchemy maintains the session in Python as you issue various SQLAlchemy methods. When you delete rows in bulk, how should SQLAlchemy remove the corresponding objects from the session? This is controlled by a parameter of the delete method, synchronize_session, which has three possible values:
'evaluate': evaluates the query's criteria directly in Python to determine which objects need to be removed from the session. This is the default and is very efficient, but it is not very robust, and complicated criteria cannot be evaluated. If it can't evaluate the query, it raises sqlalchemy.orm.evaluator.UnevaluatableError.
'fetch': performs a SELECT before the delete and uses that result to determine which objects in the session need to be removed. This is less efficient (potentially much less efficient), but it can handle any valid query.
False: doesn't attempt to update the session at all, so it's very efficient; however, if you continue to use the session after the delete, you may get inaccurate results.
Which option you use depends heavily on how your code uses the session. In most simple cases, where you just need to delete rows based on a complicated filter, False should work fine (the example in the question fits this scenario); see the sketch below.
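A hedged sketch of the False pitfall, using the models from the question:

# Objects loaded before the bulk delete stay in the session untouched.
stocks = Stock.query.filter(Stock.ticker.in_(new_tickers)).all()
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session=False)
# At this point `stocks` still holds live-looking instances of deleted rows.
db.session.commit()  # the commit expires the session, so later reads hit the DB again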
SQLAlchemy Delete Method Reference
Try it with this code:
Stock.query.filter(Stock.ticker.in_(new_tickers)).delete(synchronize_session=False)
https://docs.sqlalchemy.org/en/latest/orm/query.html?highlight=delete#sqlalchemy.orm.query.Query.delete

Get last few records put in table in order of ID with SQLAlchemy

I have a table from which I would like to get the last 3 records, in order of when they were added to the database. I have the following:
session.query(User).order_by(User.id.desc()).limit(3)
This gets the last 3 records from the database, but they come back in reverse order because I sort by descending ID. I would like the records in ascending order, but chaining multiple order_by calls doesn't seem to work.
Is there a workaround or solution? Thanks.
I don't see any way to do that other than using a subquery to perform the second order by you want, to generate SQL as suggested by this answer: Get another order after limit with mysql
I believe you'll have to do something like this in SQLAlchemy:
session.query(
    session.query(User).order_by(User.id.desc()).limit(3)
    .subquery().alias('sUser')
).order_by('sUser.id')
This is untested code and I'm pretty sure you'll end up with a series of KeyedTuple objects, not instances of your User class, although there might be a way to fix that.
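One way to get User instances back out of the subquery might be aliased(), which can map an entity onto an arbitrary selectable; a hedged sketch:

from sqlalchemy.orm import aliased

# Inner query: the newest three rows; outer query: re-sort them ascending.
subq = session.query(User).order_by(User.id.desc()).limit(3).subquery()
last_three = aliased(User, subq)  # maps User's columns onto the subquery
users = session.query(last_three).order_by(last_three.id).all()  # User instances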
However, as you're doing this with the ORM, I don't see much of a point in doing that with an SQLAlchemy subquery at the SQL level if you can do it with Python. Something like:
reversed(session.query(User).order_by(User.id.desc()).limit(3).all())
Or even this, if you want a list and not a reverse iterator:
session.query(User).order_by(User.id.desc()).limit(3).all()[::-1]
The truth is, I had a similar problem, but when I saw how you wrote the command I realized what I needed to do, and it worked. So:
First, thanks!
Second, I think you need a command along these lines:
desc3user = User.query.order_by(User.id.desc()).limit(3)

getting the id of a created record in SQLAlchemy

How can I get the id of the created record in SQLAlchemy?
I'm doing:
engine.execute("insert into users values (1,'john')")
When you execute a plain text statement, you're at the mercy of the DBAPI you're using as to whether or not the new PK value is available, and via what means. With the SQLite and MySQL DBAPIs you'll have it as result.lastrowid, which just gives you the value of .lastrowid for the cursor. With PG, Oracle, etc., there's no .lastrowid; as someone else said, you can use RETURNING for those, in which case results are available via result.fetchone() (although using RETURNING with Oracle, again without SQLAlchemy expression constructs, requires several awkward steps). If RETURNING isn't available, you can use direct sequence access (NEXTVAL in PG) or a "post fetch" operation (CURRVAL in PG, @@identity or scope_identity() in MSSQL).
Sounds complicated, right? That's why you're better off using table.insert(). SQLAlchemy's primary system for providing newly generated PKs is designed to work with these constructs. Once you're there, the result.last_inserted_ids() method (inserted_primary_key in modern SQLAlchemy) gives you the newly generated (possibly composite) PK in all cases, regardless of backend. The above methods of .lastrowid, sequence execution, RETURNING, etc. are all dealt with for you (0.6 uses RETURNING when available).
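A minimal sketch of the table.insert() approach, assuming a users table with a database-generated id column:

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine('sqlite://')
metadata = MetaData()
users = Table('users', metadata,
              Column('id', Integer, primary_key=True),
              Column('name', String))
metadata.create_all(engine)

with engine.begin() as conn:
    result = conn.execute(users.insert().values(name='john'))
    # The portable way to read the new PK; SQLAlchemy picks lastrowid,
    # RETURNING, or sequence access as appropriate for the backend.
    print(result.inserted_primary_key)  # e.g. (1,)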
There's an extra clause you can add: RETURNING
ie
INSERT INTO users (name, address) VALUES ('richo', 'beaconsfield') RETURNING id
Then just retrieve a row as if your insert were a SELECT statement.
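A hedged sketch of that with plain text SQL, assuming a backend that supports RETURNING (PostgreSQL, for example) and that users.id is generated by the database:

from sqlalchemy import text

with engine.begin() as conn:
    row = conn.execute(
        text("INSERT INTO users (name) VALUES (:name) RETURNING id"),
        {'name': 'john'},
    ).fetchone()
    new_id = row[0]  # the id the database assigned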
