I have a ~10M record MySQL table that I interface with using SQLAlchemy. I have found that queries on large subsets of this table consume too much memory, even though I thought I was using a built-in generator that intelligently fetched bite-sized chunks of the dataset:
for thing in session.query(Things):
    analyze(thing)
To avoid this, I find I have to build my own iterator that bites off in chunks:
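lastThingID = None
while True:
    # Walk the table from the highest id downwards, one chunk at a time.
    chunk = query.order_by(Thing.id.desc())
    if lastThingID is not None:
        chunk = chunk.filter(Thing.id < lastThingID)
    things = chunk.limit(querySize).all()
    if not things:
        break
    for thing in things:
        lastThingID = thing.id
        analyze(thing)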
Is this normal or is there something I'm missing regarding SA built-in generators?
The answer to this question seems to indicate that the memory consumption is not to be expected.
Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets a hold of one result, the whole result set is in memory.
But then, the way Query works is that it fully loads the given result set by default before returning your objects to you. The rationale here regards queries that are more than simple SELECT statements. For example, in joins to other tables that may return the same object identity multiple times in one result set (common with eager loading), the full set of rows needs to be in memory so that the correct results can be returned; otherwise collections and such might be only partially populated.
So Query offers an option to change this behavior through yield_per(). This call will cause the Query to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections, so it is basically for when you really know what you're doing. Also, if the underlying DBAPI pre-buffers rows, there will still be that memory overhead, so the approach only scales slightly better than not using it.
I hardly ever use yield_per(); instead, I use a better version of the LIMIT approach you suggest above using window functions. LIMIT and OFFSET have a huge problem that very large OFFSET values cause the query to get slower and slower, as an OFFSET of N causes it to page through N rows - it's like doing the same query fifty times instead of one, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select. I then emit individual SELECT statements that each pull from one of those windows at a time.
The window function approach is on the wiki and I use it with great success.
Also note: not all databases support window functions; you need Postgresql, Oracle, or SQL Server. IMHO using at least Postgresql is definitely worth it - if you're using a relational database, you might as well use the best.
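A minimal sketch of the windowed idea described above, not the wiki recipe itself. It uses the standard-library sqlite3 module only so that it runs anywhere (SQLite 3.25 and later also provides row_number()); the table `thing`, the column `payload`, and the helper names are hypothetical. One query pre-fetches the id that starts each window, then one range SELECT is emitted per window, so no OFFSET is ever needed.

```python
import sqlite3


def window_starts(conn, window_size):
    """Return the id that begins each window of `window_size` rows."""
    rows = conn.execute(
        """
        SELECT id FROM (
            SELECT id, row_number() OVER (ORDER BY id) AS rn FROM thing
        )
        WHERE (rn - 1) % ? = 0
        ORDER BY id
        """,
        (window_size,),
    ).fetchall()
    return [row[0] for row in rows]


def iter_windows(conn, window_size):
    """Yield one list of rows per window, using range predicates on id."""
    starts = window_starts(conn, window_size)
    for i, start in enumerate(starts):
        if i + 1 < len(starts):
            # Bounded window: from this start up to (not including) the next one.
            sql = "SELECT id, payload FROM thing WHERE id >= ? AND id < ? ORDER BY id"
            params = (start, starts[i + 1])
        else:
            # Last window is open-ended.
            sql = "SELECT id, payload FROM thing WHERE id >= ? ORDER BY id"
            params = (start,)
        yield conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thing (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO thing (id, payload) VALUES (?, ?)",
    [(i * 3, f"row-{i}") for i in range(1, 11)],  # ids with gaps: 3, 6, ..., 30
)
chunks = list(iter_windows(conn, 4))
```

Each range SELECT can use the primary-key index, so the cost per chunk should stay roughly constant instead of growing with the position in the table.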
I am not a database expert, but when using SQLAlchemy as a simple Python abstraction layer (i.e., not using the ORM Query object) I've come up with a satisfying solution to query a 300M-row table without exploding memory usage...
Here is a dummy example:
from sqlalchemy import create_engine, select
conn = create_engine("DB URL...").connect()
q = select([huge_table])
proxy = conn.execution_options(stream_results=True).execute(q)
Then, I use the SQLAlchemy fetchmany() method to iterate over the results in an infinite while loop:
while 'batch not empty':  # equivalent of 'while True', but clearer
    batch = proxy.fetchmany(100000)  # 100,000 rows at a time
    if not batch:
        break
    for row in batch:
        pass  # Do your stuff here...
proxy.close()
This method allowed me to do all kind of data aggregation without any dangerous memory overhead.
NOTE: stream_results works with Postgres and the psycopg2 adapter, but I guess it won't work with every DBAPI or every database driver...
There is an interesting usecase in this blog post that inspired my above method.
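The fetchmany() loop can also be wrapped in a small generator so the calling code stays flat. This is only a sketch: it uses the standard-library sqlite3 cursor, which exposes the same DB-API fetchmany() call, rather than a SQLAlchemy result proxy, and the helper name `iter_batches` is made up for illustration.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of at most `batch_size` rows until the cursor is exhausted."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE huge_table (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany(
    "INSERT INTO huge_table (value) VALUES (?)", [(i,) for i in range(10)]
)

cursor = conn.execute("SELECT value FROM huge_table ORDER BY id")
batch_sizes = []
total = 0
for batch in iter_batches(cursor, 4):
    batch_sizes.append(len(batch))
    total += sum(row[0] for row in batch)  # aggregate without holding all rows
cursor.close()
```

Only one batch is held in Python at a time; whether the driver itself buffers the full result set is a separate matter, as noted above.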
I've been looking into efficient traversal/paging with SQLAlchemy and would like to update this answer.
I think you can use the slice call to properly limit the scope of a query and you could efficiently reuse it.
Example:
window_size = 10  # or whatever limit you like
window_idx = 0
while True:
    start, stop = window_size * window_idx, window_size * (window_idx + 1)
    things = query.slice(start, stop).all()
    if not things:  # .all() returns a list, never None
        break
    for thing in things:
        analyze(thing)
    if len(things) < window_size:
        break
    window_idx += 1
In the spirit of Joel's answer, I use the following:
WINDOW_SIZE = 1000
def qgen(query):
    start = 0
    while True:
        stop = start + WINDOW_SIZE
        things = query.slice(start, stop).all()
        if len(things) == 0:
            break
        for thing in things:
            yield thing
        start += WINDOW_SIZE
Using LIMIT/OFFSET is bad, because the database has to scan past all {OFFSET} rows first, so the larger the OFFSET, the longer the request takes.
Using a windowed query also gave me bad results on a large table with a large amount of data (you wait too long for the first results, which is not good in my case for a chunked web response).
The best approach is given here: https://stackoverflow.com/a/27169302/450103. In my case I resolved the problem simply by using an index on a datetime field and fetching the next query with datetime>=previous_datetime. Stupid, because I had used that index in different cases before, but thought that for fetching all data a windowed query would be better. In my case I was wrong.
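A sketch of that index-driven ("keyset") paging, using the standard-library sqlite3 module so it is self-contained. The `event` table and its columns are hypothetical. One deliberate difference from the datetime>=previous_datetime filter described above: the sketch adds the id as a tie-breaker, so rows sharing a timestamp are neither repeated nor skipped across pages.

```python
import sqlite3


def iter_keyset(conn, page_size):
    """Yield (id, created_at) rows in order, one indexed page at a time."""
    last_ts = last_id = None
    while True:
        if last_ts is None:
            rows = conn.execute(
                "SELECT id, created_at FROM event "
                "ORDER BY created_at, id LIMIT ?",
                (page_size,),
            ).fetchall()
        else:
            # Resume strictly after the last (created_at, id) pair seen.
            rows = conn.execute(
                "SELECT id, created_at FROM event "
                "WHERE created_at > ? OR (created_at = ? AND id > ?) "
                "ORDER BY created_at, id LIMIT ?",
                (last_ts, last_ts, last_id, page_size),
            ).fetchall()
        if not rows:
            break
        yield from rows
        last_id, last_ts = rows[-1]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE INDEX ix_event_created ON event (created_at, id)")
conn.executemany(
    "INSERT INTO event (id, created_at) VALUES (?, ?)",
    [
        (1, "2024-01-01T00:00:00"),
        (2, "2024-01-01T00:00:00"),
        (3, "2024-01-01T00:00:00"),  # three rows share one timestamp
        (4, "2024-01-02T00:00:00"),
        (5, "2024-01-03T00:00:00"),
    ],
)
seen_ids = [row[0] for row in iter_keyset(conn, 2)]
```

The first page comes back as quickly as any later one, which matters for the chunked web response case mentioned above.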
AFAIK, the first variant still gets all the tuples from the table (with one SQL query) but builds the ORM representation for each entity only when iterating. So it is more efficient than building a list of all entities before iterating, but you still have to fetch all the (raw) data into memory.
Thus, using LIMIT on huge tables sounds like a good idea to me.
If you're working with Postgres or an RDBMS that supports cursors, it is very simple to iterate efficiently over a lot of rows:
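with db_session() as session:
    # stream_results asks the driver for a server-side cursor;
    # session.stream() itself is only available on AsyncSession.
    result = session.execute(select(Thing.id), execution_options={"stream_results": True})
    for partition in result.partitions(1000):
        for item in partition:
            analyze(item)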
This creates a forward cursor that fetches the result in batches of 1000 rows, which results in minimal memory usage on the server and on the client.
Related
I'm having a problem with the performance of my SQLAlchemy application, and I've noticed that queries are slow even when they are pretty much as simple as they could be:
data = [x[0] for x in db.query(MyTable.id).all()]
# takes ~30 ms after being warmed up, 4500 results
Using raw queries isn't better:
raw = [x[0] for x in db.execute("SELECT id FROM mytable").all()]
# takes ~30 ms after warmup, same results
But I've discovered that aggregating the query to JSON on the DB side, then parsing the JSON on the client is pretty fast:
raw_json = json.loads(db.execute("SELECT JSON_ARRAYAGG(id) FROM mytable").scalar_one())
# takes ~3ms, same results again
This is consistent with the timings I'm getting when using other clients to issue the original SQL statement, but it's very verbose to write my queries like this.
So, my question is:
~~Why is the second approach three times faster than the first one? They are generating the same query. The x3 overhead seems to scale with the amount of data.~~ EDIT: that was a mistake in my testing setup
Why is the third approach ten times faster than the second one? Is there a more efficient method of getting results out of a query other than db.execute(...).all()? (.scalars().all() does not seem to make a difference.)
I'm on SQLAlchemy 1.4 with pymysql 1.0.2 and Python 3.10 right now, but could probably switch.
It takes time to send 4500 rows to the client: parse the query, run the optimizer, then fetch, serialize, send over the network, collect, and deserialize each row. The per-row steps are done 4500 times.
Sending one row, even if it took 4500 rows to build that one row, avoids much of the overhead.
(For further discussion, please provide the generated SQL and SHOW CREATE TABLE. There are many other possible issues.)
I have a large database of elements each of which has unique key. Every so often (once a minute) I get a load more items which need to be added to the database but if they are duplicates of something already in the database they are discarded.
My question is - is it better to...:
Get Django to give me a list (or set) of all of the unique keys and then, before trying to add each new item, check if its key is in the list or,
have a try/except statement around the save call on the new item and rely on Django catching duplicates?
Cheers,
Jack
If you're using MySQL, you have the power of INSERT IGNORE at your fingertips, and that would be the most performant solution. You can execute custom SQL queries using the cursor API directly (https://docs.djangoproject.com/en/1.9/topics/db/sql/#executing-custom-sql-directly).
If you are using Postgres or some other data-store that does not support INSERT IGNORE then things are going to be a bit more complicated.
In the case of Postgres, you can use rules to essentially make your own version of INSERT IGNORE.
It would look something like this:
CREATE RULE "insert_ignore" AS ON INSERT TO "some_table"
WHERE EXISTS (SELECT 1 FROM some_table WHERE pk=NEW.pk) DO INSTEAD NOTHING;
Whatever you do, avoid the "select all rows and check first" approach, as the worst-case performance is O(n) in Python and it essentially short-circuits any performance advantage afforded by your database, since the check is performed on the app machine (and is eventually memory-bound).
The try/except approach is marginally better than the "select all rows" approach but it still requires constant hand-off to the app server to deal with each conflict, albeit much quicker. Better to make the database do the work.
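To make "let the database do the work" concrete, here is a small sketch using the standard-library sqlite3 module, whose INSERT OR IGNORE plays the same role as MySQL's INSERT IGNORE. The `item` table and the sample rows are hypothetical, and in a Django project the statement would go through the cursor API mentioned above rather than a raw sqlite3 connection. (PostgreSQL 9.5 and later offer INSERT ... ON CONFLICT DO NOTHING for the same purpose.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The unique key is enforced by the database itself.
conn.execute("CREATE TABLE item (key TEXT PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO item (key, payload) VALUES ('a', 'existing')")

# A new batch: one key already stored, one key repeated inside the batch.
incoming = [
    ("a", "duplicate of stored row"),
    ("b", "new"),
    ("c", "new"),
    ("b", "duplicate within batch"),
]

before = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]
# Rows that violate the primary key are silently skipped by the database.
conn.executemany(
    "INSERT OR IGNORE INTO item (key, payload) VALUES (?, ?)", incoming
)
conn.commit()
after = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]

inserted = after - before
stored = dict(conn.execute("SELECT key, payload FROM item ORDER BY key"))
```

No key list is ever pulled into Python, and no exception handling is needed per conflicting row.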
In my Python (Flask) code, I need to get the first element and the last one sorted by a given variable from a SQLAlchemy query.
I first wrote the following code :
first_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)).first()
# Do some things
last_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.asc(Valuation.date)).first()
# Do other things
As these queries can be heavy for the PostgreSQL database, and as I am duplicating my code, I think it will be better to use only one request, but I don't know SQLAlchemy enough to do it...
(When queries are effectively triggered, for example ?)
What is the best solution to this problem ?
1) How to get First and Last record from a sql query? this is about how to get first and last records in one query.
2) Here are docs on sqlalchemy query. Specifically pay attention to union_all (to implement answers from above).
It also has info on when queries are triggered (basically, queries are triggered when you use methods that return results, like first() or all(). That means Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)) will not emit a query to the database).
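For illustration, the SQL that a union_all of the two ordered queries boils down to can be tried with the standard-library sqlite3 module. This is a sketch of the single-round-trip idea, not the SQLAlchemy code itself; the `valuation` table, its columns, and the 'first'/'last' labels are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE valuation (id INTEGER PRIMARY KEY, date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO valuation (date, amount) VALUES (?, ?)",
    [("2020-03-01", 30.0), ("2020-01-01", 10.0), ("2020-02-01", 20.0)],
)

# Each branch sorts and limits inside its own subquery, then the two
# single-row results are glued together with UNION ALL.
rows = conn.execute(
    """
    SELECT 'first' AS which, date, amount
    FROM (SELECT date, amount FROM valuation ORDER BY date ASC LIMIT 1)
    UNION ALL
    SELECT 'last' AS which, date, amount
    FROM (SELECT date, amount FROM valuation ORDER BY date DESC LIMIT 1)
    """
).fetchall()

by_label = {which: (date, amount) for which, date, amount in rows}
earliest = by_label["first"]
latest = by_label["last"]
```

The label column avoids relying on the order in which the two branches come back.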
Also, if memory is not a problem, I'd say get all() objects from your first query and just get first and last result via python:
results = Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)).all()
first_valuation = results[0]
last_valuation = results[-1]
It will be faster than performing two (even unified) queries, but will potentially eat a lot of memory, if your database is large enough.
No need to complicate the process so much.
first_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.desc(Valuation.date)).first()
# Do some things
last_valuation = Valuation.query.filter_by(..).order_by(sqlalchemy.asc(Valuation.date)).first()
This is what you have, and it's good enough. It's not heavy for any database. If you think it's becoming too heavy, you can always add an index.
Don't try to get all the results using all() and then index into the list. When you call all(), it loads everything into memory, which is extremely bad if you have a lot of results. It's a lot better to execute just two queries to get those items.
I have a web service that needs to query 10000 numbers from an RDBMS and then return them in the form of JSON. There are two issues here.
How to concatenate them in an efficient way? I believe making result += next id is not a good idea. The Stack Overflow advice is to use .join - it is elegant, but I am not sure how it handles memory allocation
As far as I understand .fetchall() could be somewhat expensive here, because it initiall
Is there a way in Python to fetch row by row, take only one number from each row, and add it to the result in some efficient way?
The sample is a bit artificial, for simplicity.
The "probably a memory hog" short solution that I have in mind looks approximately like this:
s = text("select id from users where ... ")
connection = engine.connect()
with connection:
    rows = connection.execute(s).fetchall()
    return "[" + ",".join(str(r[0]) for r in rows) + "]"  # json array
I know all this looks artificial and it is not a good idea to fetch 10000 records at once, but I want to understand best practices of Python memory management.
In the Java world, where I come from, there is the StringBuilder class and a way of fetching row by row from the DB.
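As a rough sketch of what the StringBuilder pattern could look like in Python: io.StringIO is the closest standard-library analogue, and DB-API cursors can be iterated row by row. The example uses sqlite3 so it is self-contained; the `users` table and the helper name are hypothetical. For comparison it also builds the same string with ",".join(...), which, as far as I know, CPython implements by collecting the pieces first and then allocating the result once.

```python
import io
import json
import sqlite3


def ids_as_json_array(cursor):
    """Build a JSON array string while iterating the cursor row by row."""
    out = io.StringIO()
    out.write("[")
    for i, row in enumerate(cursor):
        if i:
            out.write(",")  # separator before every element except the first
        out.write(str(row[0]))
    out.write("]")
    return out.getvalue()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users (id) VALUES (?)", [(i,) for i in range(1, 6)])

# StringBuilder-like version: rows are consumed one at a time.
streamed = ids_as_json_array(conn.execute("SELECT id FROM users ORDER BY id"))

# join-based version: the generator is fed straight from the cursor.
joined = (
    "["
    + ",".join(str(r[0]) for r in conn.execute("SELECT id FROM users ORDER BY id"))
    + "]"
)
```

For 10000 integers either form is likely to be cheap; the main saving in both is that fetchall() never builds a full list of row objects.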
For a model like:
class Thing(ndb.Model):
    visible = ndb.BooleanProperty()
    made_by = ndb.KeyProperty(kind=User)
    belongs_to = ndb.KeyProperty(kind=AnotherThing)
Essentially performing an 'or' query, but comparing different properties so I can't use a built in OR... I want to get all Thing (belonging to a particular AnotherThing) which either have visible set to True or visible is False and made_by is the current user.
Which would be less demanding on the datastore (ie financially cost less):
Query to get everything, ie: Thing.query(Thing.belongs_to == some_thing.key) and iterate through the results, storing the visible ones, and the ones that aren't visible but are made_by the current user?
Query to get the visible ones, ie: Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == True) and query separately to get the non-visible ones by the current user, ie: Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == False, Thing.made_by == current_user)?
Number 1. would get many unneeded results, like non-visible Things by other users - which I think is many reads of the datastore? 2. is two whole queries though, which is also possibly unnecessarily heavy, right? I'm still trying to work out what kinds of interaction with the database cause what kinds of costs.
I'm using ndb, tasklets and memcache where necessary, in case that's relevant.
Number two is going to cost less, for two reasons. First, you pay for each read of the datastore and for each entity returned by a query, so the first option costs more: it reads and returns all the data. With the second option you only pay for what you need.
Second, you also pay for backend or frontend time. The first method spends time iterating through all the results, whereas the second needs no such filtering.
I can't see a way where the first option is better. (maybe if you only have a few entities??)
To understand how reads and queries cost you scroll down a little on:
https://developers.google.com/appengine/docs/billing
You will see how Read, Writes and Smalls are added up for reads, writes and queries.
I would also just query for ones that are owned by the current user instead of visible=false and owner=current; this way you don't need a composite index, which will save some time. You can also make visible a partial index, saving some space as well (only index it when true, assuming you never need to query for false ones). You will need to do a little work to remove duplicates, but that is probably not too bad.
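The duplicate removal mentioned above could look something like the following. It is a plain-Python sketch with no datastore calls: `ThingRow` is a hypothetical stand-in for the entities returned by the two queries, and the string keys stand in for ndb keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThingRow:
    """Hypothetical stand-in for an entity returned by a query."""
    key: str
    visible: bool
    made_by: str


def merge_unique(*result_sets):
    """Concatenate result sets, keeping the first occurrence of each key."""
    seen = set()
    merged = []
    for results in result_sets:
        for thing in results:
            if thing.key not in seen:
                seen.add(thing.key)
                merged.append(thing)
    return merged


# Results of "all visible things" and "all things made by the current user".
visible_things = [ThingRow("t1", True, "alice"), ThingRow("t2", True, "bob")]
owned_things = [ThingRow("t2", True, "bob"), ThingRow("t3", False, "bob")]

merged = merge_unique(visible_things, owned_things)
merged_keys = [thing.key for thing in merged]
```

The overlap is exactly the visible things made by the current user, so the extra work is proportional to the size of the two result sets, not to the whole kind.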
You are probably best benchmarking both cases using real-world data. It's hard to determine things like this in the abstract, as there are many subtleties that may affect overall performance.
I would expect option 2 to be better though. Loading tons of objects that you don't care about is simply going to put a heavy burden on the data store that I don't think an extra query would be comparable to. Of course, it depends on how many extra things, etc.