Why is SQLAlchemy so slow when dealing with many results?

I'm having a problem with the performance of my SQLAlchemy application, and I've noticed that queries are slow even when they are pretty much as simple as they could be:
data = [x[0] for x in db.query(MyTable.id).all()]
# takes ~30 ms after being warmed up, 4500 results
Using raw queries isn't better:
raw = [x[0] for x in db.execute("SELECT id FROM mytable").all()]
# takes ~30 ms after warmup, same results
But I've discovered that aggregating the query to JSON on the DB side, then parsing the JSON on the client is pretty fast:
raw_json = json.loads(db.execute("SELECT JSON_ARRAYAGG(id) FROM mytable").scalar_one())
# takes ~3ms, same results again
This is consistent with the timings I'm getting when using other clients to issue the original SQL statement, but it's very verbose to write my queries like this.
So, my question is:
~~Why is the second approach three times faster than the first one? They are generating the same query. The x3 overhead seems to scale with the amount of data.~~ EDIT: that was a mistake in my testing setup
Why is the third approach ten times faster than the second one? Is there a more efficient method of getting results out of a query other than db.execute(...).all()? (.scalars().all() does not seem to make a difference.)
I'm on SQLAlchemy 1.4 with pymysql 1.0.2 and Python 3.10 right now, but could probably switch.

It takes time to send 4500 rows to the client -- parse the query, run the optimizer, fetch, serialize, send through the network, collect the rows, and deserialize. The parse/optimize work happens once, but every step after it is repeated for each of the 4500 rows.
Sending one row, even if it took 4500 rows to build that one row, avoids much of the overhead.
(For further discussion, please provide the generated SQL and SHOW CREATE TABLE. There are many other possible issues.)
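If the JSON-aggregation route stays the fastest, the verbosity can be hidden behind a small helper. A minimal sketch, assuming MySQL's JSON_ARRAYAGG and trusted (non-user-supplied) table/column names, since they are interpolated into the SQL rather than bound as parameters:
import json
from sqlalchemy import text

def fetch_column_as_json(db, table, column="id"):
    # Aggregate the column into a single JSON document on the server, so
    # only one row crosses the wire, then parse it on the client.
    payload = db.execute(
        text(f"SELECT JSON_ARRAYAGG({column}) FROM {table}")
    ).scalar_one()
    return json.loads(payload) if payload is not None else []

# data = fetch_column_as_json(db, "mytable")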

Related

Loading PostgreSQL table to in-memory cache for x-dimensional lookup in python/django

I'm working on a Python/Django webapp...and I'm really just digging into Python for the first time (typically I'm a .NET/core stack dev)...currently running PostgreSQL on the backend. I have about 9-10 simple (2-dimensional) lookup tables that will be hit very, very often in real time, and that I would like to cache in memory.
Ideally, I'd like to do this with Postgres itself, but it may be that another data engine and/or some other library will be suited to help with this (I'm not super familiar with Python libraries).
Goals would be:
Lookups are handled in memory (data footprint will never be "large").
Ideally, results could be cached after the first pull (by complete parameter signature) to save time, although this is somewhat optional as I'm assuming an in-memory lookup would be pretty quick anyway.
Also optional, but ideally, even though the lookup tables are stored separately in the db for importing/human-readability/editing purposes, I would think generating an x-dimensional array for the lookup when loaded into memory would be optimal. Though there are about 9-10 lookup tables total, there are only maybe 10-15 values per table (some smaller) and probably a total of only maybe 15 parameters for the complete lookup against all tables. Basically it's 9-10 tables of modifiers for an equation... so given certain values we look up x/y values in each table, get the value, and add them together.
So I guess I'm looking for a library and/or suitable backend that handles the in-memory loading and caching (again, the total size of this footprint in RAM will never be a factor)... and possibly can automatically resolve the x lookup tables into a single in-memory x-dimensional table for efficiency (rather than making 9-10 lookups separately)... and caching these results for repeated use when all parameters match a previous query (unless the lookup performs so quickly this is irrelevant).
Lookup tables are not huge... I would say, if I were to write code to break down each x/y value/range and create one giant x-dimensional lookup table by hand, it would probably end up with maybe 15 fields and 150-ish rows... so we aren't talking about very much data... but it will be hit very, very often and I don't want to perform these lookups every time against the actual DB.
Recommendations for an engine/library suited best for this (with a preference for still being able to use postgresql for the persistent storage) are greatly appreciated.
You don't have to do anything special to achieve that: if you use the tables frequently, PostgreSQL will make them stay in cache automatically.
If you need to have the tables in cache right from the start, use pg_prewarm. It allows you to explicitly load certain tables into cache and can automatically restore the state of the cache as it was before the last shutdown.
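For illustration, a minimal sketch of warming a table explicitly with pg_prewarm, run through SQLAlchemy here (any client can issue the same SQL); the connection URL and table name are placeholders, and installing the extension needs sufficient privileges:
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder URL
with engine.begin() as conn:
    # Install the extension once, then pull the lookup table into shared_buffers.
    conn.execute(text("CREATE EXTENSION IF NOT EXISTS pg_prewarm"))
    conn.execute(text("SELECT pg_prewarm('my_lookup_table')"))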
Once the tables are cached, they will only cause I/O when you write to them.
The efficient in-memory data structures you envision sound like a premature micro-optimization to me. I'd bet that these small lookup tables won't cause a performance problem (if you have indexes on all foreign keys).

Get first AND last element with SQLAlchemy

In my Python (Flask) code, I need to get the first element and the last one sorted by a given variable from a SQLAlchemy query.
I first wrote the following code:
first_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)).first()
# Do some things
last_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.asc(Valuation.date)).first()
# Do other things
As these queries can be heavy for the PostgreSQL database, and as I am duplicating my code, I think it would be better to use only one request, but I don't know SQLAlchemy well enough to do it...
(For example, when are queries actually executed?)
What is the best solution to this problem?
1) How to get First and Last record from a sql query? This covers getting the first and last records in one query.
2) Here are the docs on the SQLAlchemy Query API. Pay particular attention to union_all (to implement the answers from above).
It also has info on when queries are triggered: basically, queries are triggered when you call methods that return results, like first() or all(). That means Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)) on its own will not emit a query to the database.
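To illustrate that last point (the filter value here is a hypothetical placeholder):
q = Valuation.query.filter_by(portfolio_id=42).order_by(sqlalchemy.desc(Valuation.date))
# No SQL has been sent yet; q is just a Query object.
first_valuation = q.first()  # a SELECT ... LIMIT 1 is emitted here
everything = q.all()         # a second, separate SELECT is emitted here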
Also, if memory is not a problem, I'd say get all() objects from your first query and just get first and last result via python:
results = Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)).all()
first_valuation = results[0]
last_valuation = results[-1]
It will be faster than performing two (even unified) queries, but will potentially eat a lot of memory, if your database is large enough.
No need to complicate the process so much.
first_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)).first()
# Do some things
last_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.asc(Valuation.date)).first()
This is what you have, and it's good enough. It's not heavy for any database. If you think it's becoming too heavy, you can always add an index, as sketched below.
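For instance, a minimal sketch assuming a Flask-SQLAlchemy model with a date column; indexing the column you sort on lets the database answer each ORDER BY ... LIMIT 1 with an index lookup instead of a full sort:
class Valuation(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    # index=True creates a B-tree index on date, so ORDER BY date
    # ASC/DESC ... LIMIT 1 becomes a cheap index lookup.
    date = db.Column(db.DateTime, index=True)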
Don't try to get all the results using all() and then pick items out of the list. When you call all(), it loads everything into memory, which is extremely bad if you have a lot of results. It's much better to execute just two queries to get those two items.

Effective memory management while querying from RDBMS with SQLAlchemy

I have a web service that needs to query 10000 numbers from an RDBMS and then return them as JSON. There are two issues here.
How do I concatenate them efficiently? I believe doing result += next_id is not a good idea. The Stack Overflow advice is to use .join -- it is elegant, but I am not sure how it handles memory allocation.
As far as I understand, .fetchall() could be somewhat expensive here, because it initially loads the whole result set into memory.
Is there a way in Python to fetch row by row, take just the one number from each row, and add it to the result efficiently?
The sample is a bit artificial, for simplicity's sake.
"Probably Memory hog" short solution that I have in mind look approximately like this:
s = text("select id from users where ... ")
connection = engine.connect()
with connection:
    rows = connection.execute(s).fetchall()
    return "[" + ','.join(str(r[0]) for r in rows) + "]"  # json array
I know all this looks artificial, and it may not be a good idea to fetch 10000 records at once, but I want to understand best practices for Python memory management.
In the Java world, where I come from, there is the StringBuilder class and a way of fetching row by row from the DB.
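One possible shape for this, sketched under the assumption of a MySQL/pymysql setup like the one in the main question (the URL and query are placeholders): stream the rows in batches instead of using fetchall(), and build the string with join, since repeated += copies the growing string on every concatenation.
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@localhost/mydb")  # placeholder URL

def ids_as_json_array(engine):
    # Stream the result instead of buffering it: the driver keeps only
    # one batch in memory at a time (where the DBAPI supports it).
    s = text("SELECT id FROM users")  # placeholder query, WHERE clause omitted
    chunks = []
    with engine.connect() as conn:
        result = conn.execution_options(stream_results=True).execute(s)
        while True:
            batch = result.fetchmany(1000)
            if not batch:
                break
            # join is linear in the total output length, unlike repeated +=
            chunks.append(",".join(str(row[0]) for row in batch))
    return "[" + ",".join(chunks) + "]"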

memory-efficient built-in SqlAlchemy iterator/generator?

I have a ~10M record MySQL table that I interface with using SqlAlchemy. I have found that queries on large subsets of this table will consume too much memory even though I thought I was using a built-in generator that intelligently fetched bite-sized chunks of the dataset:
for thing in session.query(Things):
    analyze(thing)
To avoid this, I find I have to build my own iterator that bites off in chunks:
lastThingID = None
while True:
    q = query
    if lastThingID is not None:
        q = q.filter(Thing.id < lastThingID)
    # order by id descending so that "id < lastThingID" walks backwards
    # through the table one chunk at a time
    things = q.order_by(Thing.id.desc()).limit(querySize).all()
    if not things:
        break
    for thing in things:
        lastThingID = thing.id
        analyze(thing)
Is this normal or is there something I'm missing regarding SA built-in generators?
The answer to this question seems to indicate that the memory consumption is not to be expected.
Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets a hold of one result, the whole result set is in memory.
But then, the way Query works is that it fully loads the given result set by default before returning your objects to you. The rationale here concerns queries that are more than simple SELECT statements. For example, in joins to other tables that may return the same object identity multiple times in one result set (common with eager loading), the full set of rows needs to be in memory so that the correct results can be returned; otherwise collections and such might be only partially populated.
So Query offers an option to change this behavior through yield_per(). This call will cause the Query to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections, so it's basically for when you really know what you're doing. Also, if the underlying DBAPI pre-buffers rows, there will still be that memory overhead, so the approach only scales slightly better than not using it.
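For reference, a minimal sketch of yield_per() on the query from the question (the batch size of 1000 is arbitrary):
# Stream ORM objects in batches of 1000 instead of loading the full
# result set up front; don't combine this with eager-loaded collections.
for thing in session.query(Things).yield_per(1000):
    analyze(thing)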
I hardly ever use yield_per(); instead, I use a better version of the LIMIT approach you suggest above using window functions. LIMIT and OFFSET have a huge problem that very large OFFSET values cause the query to get slower and slower, as an OFFSET of N causes it to page through N rows - it's like doing the same query fifty times instead of one, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select. I then emit individual SELECT statements that each pull from one of those windows at a time.
The window function approach is on the wiki and I use it with great success.
Also note: not all databases support window functions; you need Postgresql, Oracle, or SQL Server. IMHO using at least Postgresql is definitely worth it - if you're using a relational database, you might as well use the best.
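Roughly, the windowed approach looks like the sketch below (a paraphrase of the idea, not the wiki recipe verbatim; Thing and Thing.id stand in for the mapped class and its ordered, indexed column):
from sqlalchemy import func

def windowed_chunks(session, entity, column, windowsize):
    # Grab every windowsize-th value of the ordered column; these values
    # delimit the "windows" of the table.
    rownum = func.row_number().over(order_by=column).label("rownum")
    subq = session.query(column.label("val"), rownum).subquery()
    boundaries = [
        val
        for (val,) in session.query(subq.c.val).filter(
            subq.c.rownum % windowsize == 1
        )
    ]
    # Emit one ranged SELECT per window; only one window's rows are in
    # memory at a time, and OFFSET is never used.
    for i, start in enumerate(boundaries):
        q = session.query(entity).filter(column >= start)
        if i + 1 < len(boundaries):
            q = q.filter(column < boundaries[i + 1])
        yield from q.order_by(column)

# for thing in windowed_chunks(session, Thing, Thing.id, 1000):
#     analyze(thing)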
I am not a database expert, but when using SQLAlchemy as a simple Python abstraction layer (i.e., not using the ORM Query object) I've come up with a satisfying solution to query a 300M-row table without exploding memory usage...
Here is a dummy example:
from sqlalchemy import create_engine, select
conn = create_engine("DB URL...").connect()
q = select([huge_table])
proxy = conn.execution_options(stream_results=True).execute(q)
Then, I use the SQLAlchemy fetchmany() method to iterate over the results in an infinite while loop:
while 'batch not empty':  # equivalent of 'while True', but clearer
    batch = proxy.fetchmany(100000)  # 100,000 rows at a time
    if not batch:
        break
    for row in batch:
        ...  # Do your stuff here
proxy.close()
This method allowed me to do all kinds of data aggregation without any dangerous memory overhead.
NOTE: stream_results works with Postgres and the psycopg2 adapter, but I guess it won't work with every DBAPI or every database driver...
There is an interesting usecase in this blog post that inspired my above method.
I've been looking into efficient traversal/paging with SQLAlchemy and would like to update this answer.
I think you can use the slice call to properly limit the scope of a query, and then reuse it efficiently.
Example:
window_size = 10  # or whatever limit you like
window_idx = 0
while True:
    start, stop = window_size * window_idx, window_size * (window_idx + 1)
    things = query.slice(start, stop).all()
    if not things:  # .all() returns a list, never None
        break
    for thing in things:
        analyze(thing)
    if len(things) < window_size:
        break
    window_idx += 1
In the spirit of Joel's answer, I use the following:
WINDOW_SIZE = 1000

def qgen(query):
    start = 0
    while True:
        stop = start + WINDOW_SIZE
        things = query.slice(start, stop).all()
        if len(things) == 0:
            break
        for thing in things:
            yield thing
        start += WINDOW_SIZE
Using LIMIT/OFFSET is bad, because the database has to skip over all {OFFSET} rows first, so the larger the OFFSET, the longer the query takes.
Windowed queries also gave me bad results on a large table with a lot of data (you wait too long for the first results, which is not good in my case for a chunked web response).
The best approach is given here: https://stackoverflow.com/a/27169302/450103. In my case I solved the problem simply by putting an index on a datetime field and fetching the next batch with datetime >= previous_datetime, as sketched below. Stupid, because I had already used that index in other cases, but thought that a windowed query would be better for fetching all the data. In my case I was wrong.
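A minimal sketch of that keyset-style approach, assuming a mapped Thing class with an indexed created_at datetime column (the names are placeholders):
def iter_by_datetime(session, batch_size=1000):
    # Keyset pagination: remember the last created_at seen and ask for
    # the next batch starting after it; the index keeps every query cheap
    # no matter how deep into the table we are. Assumes created_at values
    # are unique -- otherwise use >= plus a tiebreaker column.
    last_seen = None
    while True:
        q = session.query(Thing).order_by(Thing.created_at)
        if last_seen is not None:
            q = q.filter(Thing.created_at > last_seen)
        batch = q.limit(batch_size).all()
        if not batch:
            break
        for thing in batch:
            yield thing
        last_seen = batch[-1].created_at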
AFAIK, the first variant still gets all the tuples from the table (with one SQL query) but builds the ORM presentation for each entity when iterating. So it is more efficient than building a list of all entities before iterating but you still have to fetch all the (raw) data into memory.
Thus, using LIMIT on huge tables sounds like a good idea to me.
If you're working with Postgres or an RDBMS that supports cursors, it is very simple to iterate efficiently over a lot of rows:
with db_session() as session:
    # stream_results gives a server-side cursor where supported;
    # partitions() iterates the result in chunks of 1000 rows.
    result = session.execute(select(Thing.id).execution_options(stream_results=True))
    for partition in result.partitions(1000):
        for item in partition:
            analyze(item)
This creates a forward cursor that fetches the result in batches of 1000 rows, which results in minimal memory usage on the server and on the client.

Google Application Engine slow in case of Python

I am reading a "table" in Python on GAE that has 1000 rows, and the program stops because the time limit is reached. (So it takes at least 20 seconds.)
Is that possible that GAE is that slow? Is there a way to fix that?
Is this because I use free service and I do not pay for it?
Thank you.
The code itself is this:
liststocks = []
userall = user.all()  # this has three fields username... trying to optimise by this line
stocknamesall = stocknames.all()  # 1 field, name of the stocks; trying to optimise by this line too
for u in userall:  # userall has 1000 users
    for stockname in stocknamesall:  # 4 stocks
        astock = stocksowned()  # it is also a "table", no relevance I think
        astock.quantity = random.randint(1, 100)
        astock.nameid = u.key()
        astock.stockid = stockname.key()
        liststocks.append(astock)
GAE is slow when used inefficiently. Like any framework, sometimes you have to know a little bit about how it works in order to efficiently use it. Luckily, I think there is an easy improvement that will help your code a lot.
It is faster to use fetch() explicitly instead of using the iterator. The iterator causes entities to be fetched in "small batches" - each "small batch" results in a round-trip to the datastore to get more data. If you use fetch(), then you'll get all the data at once with just one round-trip to the datastore. In short, use fetch() if you know you are going to need lots of results.
In this case, using fetch() will help a lot - you can easily get all your users and stocknames in one round-trip to the datastore each. Right now you're making lots of extra round-trips to the datastore and re-fetching stockname entities too!
Try this (you said your table has 1000 rows, so I use fetch(1000) to make sure you get all the results; use a larger number if needed):
userall = user.all().fetch(1000)
stocknamesall = stocknames.all().fetch(1000)
# rest of the code as-is
To see where you could make additional improvements, please try out AppStats so you can see exactly why your request is taking so long. You might even consider posting a screenshot (like this) of the appstats info about your request along with your post.
