I need to read and join a lot of rows (~500k) from a PostgreSQL database and write them into a MySQL database.
My naive approach looks like this
entrys = Entry.query.yield_per(500)

for entry in entrys:
    for location in entry.locations:
        mysql_location = MySQLLocation(entry.url)
        mysql_location.id = location.id
        mysql_location.entry_id = entry.id
        [...]
        mysql_location.city = location.city.name
        mysql_location.county = location.county.name
        mysql_location.state = location.state.name
        mysql_location.country = location.country.name
        db.session.add(mysql_location)

db.session.commit()
Every Entry has about 1 to 100 Locations.
This script has been running for about 20 hours and already consumes more than 4 GB of memory, since everything is kept in memory until the session is committed.
When I tried committing earlier, I ran into problems like this.
How do I improve the query performance? It needs to get a lot faster, since the number of rows will grow to about 2500k over the next months.
Your naive approach is flawed for the very reason you already know: what eats your memory is the model objects sitting in memory, waiting to be flushed to MySQL.
The easiest way would be to not use the ORM for conversion ops at all. Use the SQLAlchemy table objects directly, as they're also much faster.
You can also create two sessions and bind each of the two engines to its own session. Then you can commit the MySQL session after each batch.
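A minimal sketch of the per-batch commit idea, not the original poster's code. It assumes two plain DB-API connections and uses the standard-library sqlite3 module as a stand-in for both PostgreSQL and MySQL so that it runs anywhere. The table and column names (entry, location, mysql_location, url, city) are hypothetical, and other drivers typically use %s instead of ? as the placeholder.

```python
import sqlite3

BATCH_SIZE = 500  # assumed batch size; tune for your data


def copy_locations(src, dst, batch_size=BATCH_SIZE):
    """Stream joined rows from src and insert them into dst batch by batch."""
    read = src.cursor()
    # Let the database do the join instead of walking ORM relationships.
    read.execute(
        "SELECT l.id, e.id, e.url, l.city "
        "FROM entry AS e JOIN location AS l ON l.entry_id = e.id "
        "ORDER BY l.id"
    )
    copied = 0
    while True:
        batch = read.fetchmany(batch_size)
        if not batch:
            break
        dst.executemany(
            "INSERT INTO mysql_location (id, entry_id, url, city) "
            "VALUES (?, ?, ?, ?)",
            batch,
        )
        dst.commit()  # commit per batch so pending rows never pile up
        copied += len(batch)
    return copied


def demo():
    """Build two throwaway in-memory databases and copy between them."""
    src = sqlite3.connect(":memory:")
    dst = sqlite3.connect(":memory:")
    src.executescript(
        "CREATE TABLE entry (id INTEGER PRIMARY KEY, url TEXT);"
        "CREATE TABLE location "
        "(id INTEGER PRIMARY KEY, entry_id INTEGER, city TEXT);"
    )
    src.executemany(
        "INSERT INTO entry VALUES (?, ?)", [(1, "http://a"), (2, "http://b")]
    )
    src.executemany(
        "INSERT INTO location VALUES (?, ?, ?)",
        [(i, 1 + i % 2, f"city{i}") for i in range(1, 8)],
    )
    dst.execute(
        "CREATE TABLE mysql_location "
        "(id INTEGER PRIMARY KEY, entry_id INTEGER, url TEXT, city TEXT)"
    )
    copied = copy_locations(src, dst, batch_size=3)
    stored = dst.execute("SELECT COUNT(*) FROM mysql_location").fetchone()[0]
    return copied, stored
```

Only one batch of plain tuples is held in Python at a time, which is the point of committing per batch.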
I'm working on a Python/Django web app, and I'm really just digging into Python for the first time (typically I'm a .NET/Core stack dev). It currently runs PostgreSQL on the backend. I have about 9-10 simple (two-dimensional) lookup tables that will be hit very, very often in real time, and I would like to cache them in memory.
Ideally, I'd like to do this with Postgres itself, but it may be that another data engine and/or some other library will be suited to help with this (I'm not super familiar with Python libraries).
Goals would be:
Lookups are handled in memory (data footprint will never be "large").
Ideally, results could be cached after the first pull (by complete parameter signature) to optimize time, although this is somewhat optional, as I'm assuming an in-memory lookup would be pretty quick anyway.
Also optional, but ideally, even though the lookup tables are stored separately in the db for importing/human-readability/editing purposes, I would think generating an x-dimensional array for the lookup when loaded into memory would be optimal. Though there are about 9-10 lookup tables total, there are only maybe 10-15 values per table (some smaller) and probably only about 15 parameters total for the complete lookup against all tables. Basically it's 9-10 tables of modifiers for an equation, so given certain values we look up x/y values in each table, get the value, and add them together.
So I guess I'm looking for a library and/or suitable backend that handles the in-memory loading and caching (again, the total size of this footprint in RAM will never be a factor), possibly can automatically resolve the x lookup tables into a single in-memory x-dimensional table for efficiency (rather than making 9-10 lookups separately), and caches these results for repeated use when all parameters match a previous query (unless the lookup performs so quickly that this is irrelevant).
The lookup tables are not huge. I would say that if I were to write code to break down each x/y value/range and create one giant x-dimensional lookup table by hand, it would probably end up with maybe 15 fields and 150 rows or so. So we aren't talking about very much data, but it will be hit very, very often and I don't want to perform these lookups every time against the actual DB.
Recommendations for an engine/library suited best for this (with a preference for still being able to use postgresql for the persistent storage) are greatly appreciated.
You don't have to do anything special to achieve that: if you use the tables frequently, PostgreSQL will make them stay in cache automatically.
If you need to have the tables in cache right from the start, use pg_prewarm. It allows you to explicitly load certain tables into cache and can automatically restore the state of the cache as it was before the last shutdown.
Once the tables are cached, they will only cause I/O when you write to them.
The efficient in-memory data structures you envision sound like a premature micro-optimization to me. I'd bet that these small lookup tables won't cause a performance problem (if you have indexes on all foreign keys).
I'm looking for ideas on how to improve a report that takes up to 30 minutes to process on the server. I'm currently working with Django and MySQL, but if there is a solution that requires changing the language or SQL database, I'm open to it.
The report I'm talking about reads multiple Excel files and inserts all the rows from those files into a table (the report table), which ends up with 12K to 15K records and around 50 columns. This part doesn't take that much time.
Once I have all the records in the report table, I start applying multiple phases of business logic, so I end up with something like this:
def create_report():
    business_logic_1()
    business_logic_2()
    business_logic_3()
    business_logic_4()
Each business_logic_X function does something very similar: it starts by doing a ReportModel.objects.all(), then applies multiple calculations (checking dates, quantities, etc.) and updates the record. Since it's a 12K-record table, this quickly adds time to the complete report.
The reason I'm running multiple functions separately, rather than doing all the processing in one pass, is that the logic from the first function needs to be completed for the logic in the next functions to work (e.g. the first function finds all related records and applies the same status to all of them).
The first thing that I know could be optimized is somehow caching the objects.all() instead of calling it in each function but I'm not sure how to pass it to the next function without saving the records first.
I already optimized the report a bit by using update_fields on the save method of the functions and that saved a bit of time.
My question is, is there a better approach to this kind of problem? Is Django/MySQL the right stack for this?
What takes time is the business logic that you're doing in Django, because it makes several round trips between the database and the application.
It sounds like there are several tables involved, so I would suggest that you write your query in raw SQL and, once you have the results, pull them into the application if you need them there.
The ORM has a raw() method that you can use. Or you could drop down to an even lower level and interface with your database directly.
Unless I see more of what you do, I can't give any more specific advice.
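To illustrate the round-trip point, here is a rough sketch of one business-logic phase written as a single set-based UPDATE instead of a Python loop over every record. It is not the asker's logic: the report table, its columns (due_date, quantity, status) and the "late" rule are all hypothetical, and sqlite3 stands in for MySQL so the example is self-contained.

```python
import sqlite3


def apply_status_phase(conn, cutoff):
    """Run one phase as a single statement; the rows never leave the database."""
    cur = conn.execute(
        "UPDATE report SET status = 'late' "
        "WHERE due_date < ? AND quantity > 0",
        (cutoff,),
    )
    conn.commit()  # the next phase sees the finished result of this one
    return cur.rowcount  # number of rows the phase changed


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE report ("
        "id INTEGER PRIMARY KEY, due_date TEXT, quantity INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO report (due_date, quantity, status) VALUES (?, ?, 'open')",
        [("2024-01-01", 5), ("2024-03-01", 2), ("2024-01-15", 0)],
    )
    changed = apply_status_phase(conn, "2024-02-01")
    statuses = [
        row[0] for row in conn.execute("SELECT status FROM report ORDER BY id")
    ]
    return changed, statuses
```

Because each phase commits before the next one starts, the ordering requirement between phases still holds; only the per-record traffic goes away.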
I have a ~10M record MySQL table that I interface with using SqlAlchemy. I have found that queries on large subsets of this table will consume too much memory even though I thought I was using a built-in generator that intelligently fetched bite-sized chunks of the dataset:
for thing in session.query(Things):
    analyze(thing)
To avoid this, I find I have to build my own iterator that bites off in chunks:
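lastThingID = None
while True:
    # Walk the table from the highest id downwards, one chunk at a time.
    chunk = query.order_by(Thing.id.desc())
    if lastThingID is not None:
        chunk = chunk.filter(Thing.id < lastThingID)
    things = chunk.limit(querySize).all()
    if not things:
        break
    for thing in things:
        lastThingID = thing.id
        analyze(thing)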
Is this normal or is there something I'm missing regarding SA built-in generators?
The answer to this question seems to indicate that the memory consumption is not to be expected.
Most DBAPI implementations fully buffer rows as they are fetched - so usually, before the SQLAlchemy ORM even gets a hold of one result, the whole result set is in memory.
But then, the way Query works is that it fully loads the given result set by default before returning your objects to you. The rationale here regards queries that are more than simple SELECT statements. For example, in joins to other tables that may return the same object identity multiple times in one result set (common with eager loading), the full set of rows needs to be in memory so that the correct results can be returned; otherwise collections and such might be only partially populated.
So Query offers an option to change this behavior through yield_per(). This call will cause the Query to yield rows in batches, where you give it the batch size. As the docs state, this is only appropriate if you aren't doing any kind of eager loading of collections so it's basically if you really know what you're doing. Also, if the underlying DBAPI pre-buffers rows, there will still be that memory overhead so the approach only scales slightly better than not using it.
I hardly ever use yield_per(); instead, I use a better version of the LIMIT approach you suggest above using window functions. LIMIT and OFFSET have a huge problem that very large OFFSET values cause the query to get slower and slower, as an OFFSET of N causes it to page through N rows - it's like doing the same query fifty times instead of one, each time reading a larger and larger number of rows. With a window-function approach, I pre-fetch a set of "window" values that refer to chunks of the table I want to select. I then emit individual SELECT statements that each pull from one of those windows at a time.
The window function approach is on the wiki and I use it with great success.
Also note: not all databases support window functions; you need PostgreSQL, Oracle, or SQL Server. IMHO using at least PostgreSQL is definitely worth it - if you're using a relational database, you might as well use the best.
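A rough sketch of the window-function idea described above, not the wiki recipe itself. It first fetches the id that starts each window using row_number(), then emits one range SELECT per window. The thing table and its columns are hypothetical. The example runs on the standard-library sqlite3 module purely as a stand-in; this assumes the bundled SQLite is version 3.25 or later, which is when SQLite gained window functions.

```python
import sqlite3


def window_starts(conn, window_size):
    """Return the id that begins each window of `window_size` rows."""
    rows = conn.execute(
        "SELECT id FROM ("
        "  SELECT id, row_number() OVER (ORDER BY id) AS rn FROM thing"
        ") WHERE (rn - 1) % ? = 0 ORDER BY id",
        (window_size,),
    )
    return [row[0] for row in rows]


def iter_windows(conn, window_size):
    """Yield one list of rows per window, using a range query for each."""
    starts = window_starts(conn, window_size)
    for i, start in enumerate(starts):
        if i + 1 < len(starts):
            # Half-open range: from this window's start up to the next start.
            cur = conn.execute(
                "SELECT id, name FROM thing "
                "WHERE id >= ? AND id < ? ORDER BY id",
                (start, starts[i + 1]),
            )
        else:
            # The last window has no upper bound.
            cur = conn.execute(
                "SELECT id, name FROM thing WHERE id >= ? ORDER BY id",
                (start,),
            )
        yield cur.fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE thing (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO thing VALUES (?, ?)",
        [(i, f"t{i}") for i in range(1, 30, 3)],  # ids with gaps: 1, 4, 7, ...
    )
    return [[row[0] for row in window] for window in iter_windows(conn, 4)]
```

Each range query can use the primary-key index, so no SELECT has to skip over earlier rows the way a growing OFFSET does.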
I am not a database expert, but when using SQLAlchemy as a simple Python abstraction layer (i.e., not using the ORM Query object) I've come up with a satisfying solution to query a 300M-row table without exploding memory usage...
Here is a dummy example:
from sqlalchemy import create_engine, select
conn = create_engine("DB URL...").connect()
q = select([huge_table])
proxy = conn.execution_options(stream_results=True).execute(q)
Then, I use the SQLAlchemy fetchmany() method to iterate over the results in an infinite while loop:
while 'batch not empty':  # equivalent of 'while True', but clearer
    batch = proxy.fetchmany(100000)  # 100,000 rows at a time
    if not batch:
        break
    for row in batch:
        pass  # Do your stuff here...
proxy.close()
This method allowed me to do all kind of data aggregation without any dangerous memory overhead.
NOTE: stream_results works with Postgres and the psycopg2 adapter, but I guess it won't work with every DBAPI or every database driver...
There is an interesting usecase in this blog post that inspired my above method.
I've been looking into efficient traversal/paging with SQLAlchemy and would like to update this answer.
I think you can use the slice call to properly limit the scope of a query and you could efficiently reuse it.
Example:
window_size = 10  # or whatever limit you like
window_idx = 0
while True:
    start, stop = window_size * window_idx, window_size * (window_idx + 1)
    things = query.slice(start, stop).all()
    if not things:  # .all() returns an empty list when nothing is left
        break
    for thing in things:
        analyze(thing)
    if len(things) < window_size:
        break
    window_idx += 1
In the spirit of Joel's answer, I use the following:
WINDOW_SIZE = 1000
def qgen(query):
    start = 0
    while True:
        stop = start + WINDOW_SIZE
        things = query.slice(start, stop).all()
        if len(things) == 0:
            break
        for thing in things:
            yield thing
        start += WINDOW_SIZE
Using LIMIT/OFFSET is bad, because the database has to scan past all {OFFSET} rows first, so the larger the OFFSET, the longer the request takes.
A windowed query also gave me bad results on a large table with a large amount of data (you wait too long for the first results, which is no good in my case for a chunked web response).
The best approach is given here: https://stackoverflow.com/a/27169302/450103. In my case I resolved the problem simply by using an index on a datetime field and fetching the next query with datetime>=previous_datetime. Stupid, because I had used that index in different cases before, but thought that a windowed query would be better for fetching all the data. In my case I was wrong.
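A minimal sketch of that index-based paging, under some assumptions of mine. The event table and its columns are hypothetical, sqlite3 stands in for the real database, and the id tie-breaker is an addition on top of the plain datetime>=previous_datetime comparison so that rows sharing a timestamp are neither skipped nor repeated.

```python
import sqlite3


def iter_by_keyset(conn, page_size):
    """Yield rows in (created_at, id) order, one indexed page at a time."""
    last = ("", 0)  # sorts before every real (created_at, id) pair
    while True:
        page = conn.execute(
            "SELECT created_at, id, payload FROM event "
            "WHERE created_at > ? OR (created_at = ? AND id > ?) "
            "ORDER BY created_at, id LIMIT ?",
            (last[0], last[0], last[1], page_size),
        ).fetchall()
        if not page:
            break
        yield from page
        # Remember where this page ended; the next query starts right after it.
        last = (page[-1][0], page[-1][1])


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE event "
        "(id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
    )
    conn.execute("CREATE INDEX event_created ON event (created_at, id)")
    conn.executemany(
        "INSERT INTO event VALUES (?, ?, ?)",
        [
            (1, "2024-01-01T00:00:00", "a"),
            (2, "2024-01-01T00:00:00", "b"),  # duplicate timestamps on purpose
            (3, "2024-01-01T00:00:00", "c"),
            (4, "2024-01-02T00:00:00", "d"),
            (5, "2024-01-03T00:00:00", "e"),
        ],
    )
    return [row[1] for row in iter_by_keyset(conn, page_size=2)]
```

The first page comes back as quickly as any later page, which matters for a chunked response.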
AFAIK, the first variant still gets all the tuples from the table (with one SQL query) but builds the ORM presentation for each entity when iterating. So it is more efficient than building a list of all entities before iterating but you still have to fetch all the (raw) data into memory.
Thus, using LIMIT on huge tables sounds like a good idea to me.
If you're working with Postgres or an RDBMS that supports cursors, it is very simple to iterate efficiently over a lot of rows:
with db_session() as session:
    for partition in session.stream(select(Thing.id)).partitions(1000):
        for item in partition:
            analyze(item)
This creates a forward cursor that fetches the result in batches of 1000 rows, which results in minimal memory usage on the server and on the client.
Suppose that I have a huge SQLite file (say, 500[MB]). Can 10 different python instances access this file at the same time and update different records of it? Note, the emphasis here is on different records.
For example, suppose that the SQLite file has say 1M rows:
instance 1 will deal with (and update) rows 0 - 100000
instance 2 will deal with (and update) rows 100001 - 200000
.........................
instance 10 will deal with (and update) rows 900001 - 1000000
Meaning, each python instance will only be updating a unique subset of the file. Will this work, or will I have serious integrity issues?
Updated, thanks to André Caron.
You can do that, but only read operations support concurrency in SQLite, since the entire database is locked on any write operation. The SQLite engine will return the SQLITE_BUSY status in this situation (if the wait exceeds the default access timeout). Also consider that this depends heavily on how well file locking is implemented for the given OS and file system. In general I wouldn't recommend the proposed solution, especially considering that the DB file is quite large, but you can try.
It would be better to use a server-process-based database (MySQL, PostgreSQL, etc.) to implement the desired app behaviour.
Somewhat. Only one instance can write at any single time, which means concurrent writes will block (refer to the SQLite FAQ). You won't get integrity issues, but I'm not sure you'll really benefit from concurrency, although that depends on how much you write versus how much you read.
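The locking behaviour described in these answers can be observed with a small experiment. This is only a sketch: it uses two connections in one process rather than ten separate Python instances, a hypothetical rows table, and timeout=0 so the second writer fails immediately instead of waiting.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(db_path):
    """Show that a write to a *different* row still hits the database lock."""
    # isolation_level=None lets us issue BEGIN/COMMIT ourselves.
    a = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    b = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        a.execute(
            "CREATE TABLE IF NOT EXISTS rows "
            "(id INTEGER PRIMARY KEY, val INTEGER)"
        )
        a.execute("INSERT OR REPLACE INTO rows VALUES (1, 0), (2, 0)")

        a.execute("BEGIN IMMEDIATE")  # writer A takes the write lock
        a.execute("UPDATE rows SET val = 1 WHERE id = 1")
        try:
            # Writer B touches another row, but the lock covers the whole file.
            b.execute("UPDATE rows SET val = 2 WHERE id = 2")
            blocked = False
        except sqlite3.OperationalError:  # "database is locked" (SQLITE_BUSY)
            blocked = True
        a.execute("COMMIT")

        # Once A has committed, B's write goes through.
        b.execute("UPDATE rows SET val = 2 WHERE id = 2")
        values = b.execute("SELECT val FROM rows ORDER BY id").fetchall()
        return blocked, values
    finally:
        a.close()
        b.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        return second_writer_is_blocked(os.path.join(tmp, "demo.sqlite"))
```

With a non-zero timeout the second writer would wait for the lock instead of failing at once, so the data stays consistent but the writes are serialized.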
I am in the middle of a project that involves grabbing numerous pieces of information out of 70 GB worth of XML documents and loading them into a relational database (in this case Postgres). I am currently using Python scripts and psycopg2 to do the inserts and whatnot. I have found that as the number of rows in some of the tables increases (the largest is at around 5 million rows), the speed of the script (inserts) has slowed to a crawl. What once took a couple of minutes now takes about an hour.
What can I do to speed this up? Was I wrong in using Python and psycopg2 for this task? Is there anything I can do to the database that may speed up this process? I get the feeling I am going about this in entirely the wrong way.
Considering the process was fairly efficient before and slowed down only now that the dataset has grown, my guess is it's the indexes. You may try dropping the indexes on the table before the import and recreating them after it's done. That should speed things up.
What are the settings for wal_buffers and checkpoint_segments? For large transactions, you have to tweak some settings. Check the manual.
Consider the book PostgreSQL 9.0 High Performance as well, there is much more to tweak than just the database configuration to get high performance.
I'd try to use COPY instead of inserts. This is what backup tools use for fast loading.
Check whether all foreign keys from this table have a corresponding index on the target table. Or better, drop them temporarily before copying and recreate them afterwards.
Increase checkpoint_segments from the default 3 (which means 3*16MB=48MB) to a much higher number; try, for example, 32 (512MB). Make sure you have enough space for this much additional data. (PostgreSQL 9.5 and later replace checkpoint_segments with max_wal_size.)
If you can afford to recreate or restore your database cluster from scratch in case of system crash or power failure then you can start Postgres with "-F" option, which will enable OS write cache.
Take a look at http://pgbulkload.projects.postgresql.org/
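As a rough illustration of the COPY suggestion from Python, the sketch below writes a batch of rows into an in-memory CSV buffer and hands it to psycopg2's cursor.copy_expert(). The helper names and the table and column arguments are hypothetical, conn is assumed to be an already open psycopg2 connection, and the table and column names are interpolated directly into the statement, so they must come from trusted code rather than user input.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize an iterable of row tuples into an in-memory CSV file."""
    buf = io.StringIO()
    # None is written as an empty field, which COPY in CSV mode reads as NULL.
    csv.writer(buf, lineterminator="\n").writerows(rows)
    buf.seek(0)
    return buf


def copy_rows(conn, table, columns, rows):
    """Load `rows` into `table` with one COPY instead of many INSERTs.

    `conn` is assumed to be an open psycopg2 connection.
    """
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    with conn.cursor() as cur:
        cur.copy_expert(sql, rows_to_csv_buffer(rows))
    conn.commit()
```

A call might look like copy_rows(conn, "item", ["id", "name"], parsed_rows), issued once per batch of records parsed from the XML files.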
There is a list of hints on this topic in the Populating a Database section of the documentation. You might speed up general performance using the hints in Tuning Your PostgreSQL Server as well.
The overhead of checking foreign keys might be growing as the table size increases, which is made worse because you're loading a single record at a time. If you're loading 70GB worth of data, it will be far faster to drop foreign keys during the load, then rebuild them when it's imported. This is particularly true if you're using single INSERT statements. Switching to COPY instead is not a guaranteed improvement either, due to how the pending trigger queue is managed--the issues there are discussed in that first documentation link.
From the psql prompt, you can find the name of the constraint enforcing your foreign key and then drop it using that name like this:
\d tablename
ALTER TABLE tablename DROP CONSTRAINT constraint_name;
When you're done with loading, you can put it back using something like:
ALTER TABLE tablename ADD CONSTRAINT constraint_name FOREIGN KEY (other_table) REFERENCES other_table (join_column);
One useful trick to find out the exact syntax to use for the restore is to do pg_dump --schema-only on your database. The dump from that will show you how to recreate the structure you have right now.
I'd look at the rollback logs. They've got to be getting pretty big if you're doing this in one transaction.
If that's the case, perhaps you can try committing a smaller transaction batch size. Chunk it into smaller blocks of records (1K, 10K, 100K, etc.) and see if that helps.
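A small sketch of committing in fixed-size chunks, so that the block size can be varied and timed. The item table is hypothetical and sqlite3 is used only as a stand-in so the example is self-contained; with psycopg2 the same loop would apply with %s placeholders.

```python
import sqlite3
from itertools import islice


def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing the input."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_in_chunks(conn, records, chunk_size=10_000):
    """Insert `records`, committing once per chunk; return the commit count."""
    commits = 0
    for chunk in chunked(records, chunk_size):
        conn.executemany("INSERT INTO item (id, name) VALUES (?, ?)", chunk)
        conn.commit()  # one transaction per chunk, not per row or per load
        commits += 1
    return commits


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
    records = ((i, f"name{i}") for i in range(25))  # a generator, as a parser might produce
    commits = insert_in_chunks(conn, records, chunk_size=10)
    total = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]
    return commits, total
```

Running the load with a few different chunk_size values (1K, 10K, 100K) is a cheap way to find out whether transaction size is the bottleneck.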
First, 5 million rows is nothing; insert speed should not differ much whether the table has 100k or 1 million rows.
1-2 indexes won't slow it down that much (if the fill factor is set to 70-90, considering each major import is 1/10 of the table).
Python with psycopg2 is quite fast.
A small tip: you could use the xml2 database extension to read/work with the data.
Small example from
https://dba.stackexchange.com/questions/8172/sql-to-read-xml-from-file-into-postgresql-database
duffymo is right: try to commit in chunks of 10000 inserts (committing only at the end or after each insert is quite expensive).
Autovacuum might be bloating if you do a lot of deletes and updates; you can turn it off temporarily at the start for certain tables. Set work_mem and maintenance_work_mem according to your server's available resources ...
For inserts, increase wal_buffers (in 9.0 and higher it is set automatically by default, -1); if you use PostgreSQL version 8, you should increase it manually.
You could also turn fsync off and test wal_sync_method (be cautious: changing this may make your database crash-unsafe if a sudden power failure or hardware crash occurs).
Try to drop foreign keys, disable triggers, or set conditions for triggers not to run/skip execution.
Use prepared statements for inserts; cast variables.
You could try to insert data into an unlogged table to temporarily hold the data.
Do the inserts have WHERE conditions or values from a sub-query, functions, or the like?