I am in the middle of a project that involves extracting numerous pieces of information from 70GB worth of XML documents and loading them into a relational database (in this case Postgres). I am currently using Python scripts and psycopg2 to do the inserts and whatnot. I have found that as the number of rows in some of the tables increases (the largest is at around 5 million rows), the speed of the script (the inserts) has slowed to a crawl. What once took a couple of minutes now takes about an hour.
What can I do to speed this up? Was I wrong in using Python and psycopg2 for this task? Is there anything I can do to the database that may speed up this process? I get the feeling I am going about this in entirely the wrong way.
Considering the process was fairly efficient before and only slowed down once the dataset grew, my guess is it's the indexes. Try dropping the indexes on the table before the import and recreating them after it's done. That should speed things up.
What are the settings for wal_buffers and checkpoint_segments? For large transactions, you have to tweak some settings. Check the manual.
Consider the book PostgreSQL 9.0 High Performance as well, there is much more to tweak than just the database configuration to get high performance.
I'd try to use COPY instead of inserts. This is what backup tools use for fast loading.
Check that every foreign key from this table has a corresponding index on the target table. Or better, drop the foreign keys temporarily before copying and recreate them afterwards.
Increase checkpoint_segments from the default of 3 (which means 3*16MB=48MB) to a much higher number; try, for example, 32 (512MB). Make sure you have enough disk space for this much additional data.
If you can afford to recreate or restore your database cluster from scratch in case of a system crash or power failure, then you can start Postgres with the "-F" option, which disables fsync calls and so lets the OS write cache do its work.
Take a look at http://pgbulkload.projects.postgresql.org/
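The COPY suggestion above can be sketched from Python roughly as follows. This is a minimal sketch, not a tuned loader: it assumes a psycopg2 cursor (whose copy_expert method streams a file-like object to the server), and the table and column names are hypothetical. Rows are serialized to CSV in memory, so for a 70GB load you would call it once per reasonably sized chunk of parsed rows.

```python
import csv
import io


def rows_to_copy_buffer(rows):
    """Serialize rows into an in-memory CSV file that COPY can read.

    None is written as an empty unquoted field, which COPY in CSV
    format reads back as NULL.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    buf.seek(0)
    return buf


def copy_rows(cursor, table, columns, rows):
    """Load all rows with a single COPY ... FROM STDIN statement.

    `cursor` is assumed to be a psycopg2 cursor. `table` and `columns`
    are interpolated into the SQL text, so they must come from trusted
    code, never from the XML input.
    """
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    cursor.copy_expert(sql, rows_to_copy_buffer(rows))
```

A call would look like copy_rows(cur, "items", ("id", "name"), parsed_rows) followed by conn.commit(), where items, id and name stand in for your own schema.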
There is a list of hints on this topic in the Populating a Database section of the documentation. You might speed up general performance using the hints in Tuning Your PostgreSQL Server as well.
The overhead of checking foreign keys might be growing as the table size increases, which is made worse because you're loading a single record at a time. If you're loading 70GB worth of data, it will be far faster to drop foreign keys during the load, then rebuild them when it's imported. This is particularly true if you're using single INSERT statements. Switching to COPY instead is not a guaranteed improvement either, due to how the pending trigger queue is managed--the issues there are discussed in that first documentation link.
From the psql prompt, you can find the name of the constraint enforcing your foreign key and then drop it using that name like this:
\d tablename
ALTER TABLE tablename DROP CONSTRAINT constraint_name;
When you're done with loading, you can put it back using something like:
ALTER TABLE tablename ADD CONSTRAINT constraint_name FOREIGN KEY (join_column) REFERENCES other_table (join_column);
One useful trick to find out the exact syntax to use for the restore is to do pg_dump --schema-only on your database. The dump from that will show you how to recreate the structure you have right now.
I'd look at the rollback logs. They've got to be getting pretty big if you're doing this in one transaction.
If that's the case, perhaps you can try committing a smaller transaction batch size. Chunk it into smaller blocks of records (1K, 10K, 100K, etc.) and see if that helps.
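Committing in fixed-size batches could look something like the sketch below. It is written against the generic Python DB-API and demonstrated on an in-memory SQLite database purely so that it runs anywhere; the items table is made up for the demonstration. With psycopg2 the same function should work unchanged, except that the placeholders in the SQL are %s rather than ?.

```python
import itertools
import sqlite3


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_in_batches(conn, sql, rows, batch_size=10000):
    """Run `sql` for every row, committing once per batch.

    Returns the number of commits that were issued.
    """
    cursor = conn.cursor()
    commits = 0
    for batch in batched(rows, batch_size):
        cursor.executemany(sql, batch)
        conn.commit()
        commits += 1
    return commits


# Demonstration against an in-memory SQLite database (a stand-in only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
rows = ((i, "item %d" % i) for i in range(25000))
commit_count = insert_in_batches(conn, "INSERT INTO items VALUES (?, ?)", rows)
```

Because rows is consumed lazily, the whole data set never has to sit in memory at once; try a few batch sizes and measure.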
First, 5 million rows is nothing; insert speed should not differ much whether the table holds 100k or 1 million rows.
One or two indexes won't slow it down that much (if the fill factor is set to 70-90, considering each major import is 1/10 of the table).
Python with psycopg2 is quite fast.
A small tip: you could use the xml2 database extension to read and work with the data.
There is a small example at
https://dba.stackexchange.com/questions/8172/sql-to-read-xml-from-file-into-postgresql-database
duffymo is right: try to commit in chunks of 10000 inserts (committing only at the end, or after each insert, is quite expensive).
Autovacuum activity might be a factor if you do a lot of deletes and updates; you can turn it off temporarily at the start for certain tables. Set work_mem and maintenance_work_mem according to your server's available resources.
For inserts, increase wal_buffers (in 9.1 and higher it is tuned automatically by the default value of -1); if you use an older version such as PostgreSQL 8, you should increase it manually.
You could also turn fsync off and test wal_sync_method (be cautious: changing this may make your database crash-unsafe if a sudden power failure or hardware crash occurs).
Try to drop foreign keys, disable triggers, or set conditions so that triggers skip execution.
Use prepared statements for inserts, and cast variables.
You could try inserting the data into an unlogged table to hold it temporarily.
Do the inserts have WHERE conditions, or values coming from a sub-query, functions or the like?
I'm using Postgres and I have multiple schemas with identical tables, which are added dynamically by the application code.
foo, bar, baz, abc, xyz, ...,
I want to be able to query all the schemas as if they are a single table
!!! I don't want to query all the schemas one by one and combine the results
I want to "combine"(not sure if this would be considered a huge join) the tables across schemas and then run the query.
For example, an order by query shouldn't be like
1. schema_A.result_1
2. schema_A.result_3
3. schema_B.result_2
4. schema_B.result_4
but instead it should be
1. schema_A.result_1
2. schema_B.result_2
3. schema_A.result_3
4. schema_B.result_4
If possible I don't want to generate a query that goes like
SELECT schema_A.table_X.field_1, schema_B.table_X.field_1 FROM schema_A.table_X, schema_B.table_X
But I want that to be taken care of in postgresql, in the database.
Generating a query with all the schemas (namespaces) appended can make my queries HUGE, with ~50 fields and ~50 schemas.
Since these tables are generated, I also cannot inherit them from some global table and query that instead.
I'd also like to know if this is simply not possible at a reasonable speed.
EXTRA:
I'm using django and django-tenants so I'd also accept any answer that actually helps me generate the entire query and run it to get a global queryset EVEN THOUGH it would be really slow.
Your question isn't as much of a question as it is an admission that you've got a really terrible database and application design. It's as if you partitioned something that didn't need to be partitioned, or partitioned it in the wrong way.
Since you're doing something awkward, the database itself won't provide you with any elegant solution. Instead, you'll have to get more and more awkward until the regret becomes too much to bear and you redesign your database and/or your application.
I urge you to repent now, the sooner the better.
After that giant caveat based on a haughty moral position, I acknowledge that the only reason we answer questions here is to get imaginary internet points. And so, my answer is this: use a view that unions all of the values together and presents them as if they came from one table. I can't make any sense of the "order by query", so I just ignore it for now. Maybe you mean that you want the results in a certain order; if so, you can add constants to each SELECT operand of each UNION ALL and ORDER BY that constant column coming out of the union. But if the order of the rows matters, I'd assert that you are showing yet another symptom of a poor database design.
You can programmatically update the view whenever you update or create the new schemas and their catalogs.
A working example is here: http://sqlfiddle.com/#!17/c09265/1
with this schema creation and population code:
CREATE SCHEMA Fooey;
CREATE SCHEMA Junk;
CREATE TABLE Fooey.Baz (SomeInteger INT);
CREATE TABLE Junk.Baz (SomeInteger INT);
INSERT INTO Fooey.Baz (SomeInteger) VALUES (17), (34), (51);
INSERT INTO Junk.Baz (SomeInteger) VALUES (13), (26), (39);
CREATE VIEW AllOfThem AS
SELECT 'FromFooey' AS SourceSchema, SomeInteger FROM Fooey.Baz
UNION ALL
SELECT 'FromJunk' AS SourceSchema, SomeInteger FROM Junk.Baz;
and this query:
SELECT *
FROM AllOfThem
ORDER BY SourceSchema;
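Regenerating such a view from application code whenever a tenant schema is added could be sketched like this. The function and parameter names are hypothetical, and the sketch only builds the SQL text; you would execute the result through whatever connection you already have (for example a Django database cursor). Note that it double-quotes identifiers, which makes them case-sensitive, so names must be passed exactly as PostgreSQL stores them (lowercase for the unquoted names in the example above).

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_union_view(view_name, schemas, table, columns):
    """Return a CREATE OR REPLACE VIEW statement that unions `table`
    from every schema and tags each row with its source schema."""
    if not schemas:
        raise ValueError("at least one schema is required")
    column_list = ", ".join(quote_ident(c) for c in columns)
    selects = []
    for schema in schemas:
        # The schema name also appears as a string literal, so escape quotes.
        literal = "'" + schema.replace("'", "''") + "'"
        selects.append(
            "SELECT {} AS source_schema, {} FROM {}.{}".format(
                literal, column_list, quote_ident(schema), quote_ident(table)
            )
        )
    return "CREATE OR REPLACE VIEW {} AS\n{};".format(
        quote_ident(view_name), "\nUNION ALL\n".join(selects)
    )
```

For instance, build_union_view("allofthem", ["fooey", "junk"], "baz", ["someinteger"]) produces the same shape of view as the hand-written one. An ORDER BY on any column of the view then sorts rows across all schemas at once, which is the interleaved ordering the question asks for.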
Why are per-tenant schemas a bad design?
This design favors laziness over scalability. If you don't want to make changes to your application, you can simply slam connections to a particular schema and keep working without any code changes. Adding more tenants means adding more schemas, which it sounds like you've automated. Adding many schemas will eventually make database management cumbersome (what if you have thousands or millions of tenants?) and even if you have only a few, the dynamic nature of the list and the difficulty of writing system-wide queries are issues that you've already discovered.
Consider instead combining everything and adding the tenant ID as part of a key on each table. In that case, adding more tenants means adding more rows. Any summary queries trivially come from single tables, and all of the features and power of the database implementation and its query language are at your fingertips without any fuss whatsoever.
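As a rough illustration of the single-table layout, the sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL; the orders table and its columns are invented for the example. The tenant ID is part of the primary key, a per-tenant query is just a WHERE clause, and a query across all tenants needs no unions at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        tenant_id INTEGER NOT NULL,
        order_id  INTEGER NOT NULL,
        amount    INTEGER NOT NULL,
        PRIMARY KEY (tenant_id, order_id)
    )
    """
)
# Adding a tenant means adding rows, not schemas.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 17), (1, 2, 51), (2, 1, 13), (2, 2, 39)],
)


def orders_for_tenant(conn, tenant_id):
    """Everything belonging to one tenant."""
    return conn.execute(
        "SELECT order_id, amount FROM orders "
        "WHERE tenant_id = ? ORDER BY order_id",
        (tenant_id,),
    ).fetchall()


def all_orders_by_amount(conn):
    """A system-wide query: rows from all tenants, interleaved by amount."""
    return conn.execute(
        "SELECT tenant_id, order_id, amount FROM orders ORDER BY amount"
    ).fetchall()
```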
It's simply false that a database design can't be changed, even in an existing and busy system. It takes a lot of effort to do it, but it can be done and people do it all the time. That's why getting the database design right as early as possible is important.
The README of the django-tenants package you're using describes their decision to trade off towards laziness, and cites a whitepaper that outlines many of the shortcomings and alternatives of that method.
I'm working on a Python/Django webapp...and I'm really just digging into Python for the first time (typically I'm a .NET/core stack dev)...currently running PostgreSQL on the backend. I have about 9-10 simple (2 dimensional) lookup tables that will be hit very, very often in real time, and I would like to cache them in memory.
Ideally, I'd like to do this with Postgres itself, but it may be that another data engine and/or some other library will be suited to help with this (I'm not super familiar with Python libraries).
Goals would be:
Lookups are handled in memory (data footprint will never be "large").
Ideally, results could be cached after the first pull (by complete parameter signature) to optimize time, although this is somewhat optional as I'm assuming in-memory lookup would be pretty quick anyway....and
Also optional, but ideally, even though the lookup tables are stored separately in the db for importing/human-readability/editing purposes, I would think generating an x-dimensional array for the lookup when loaded into memory would be optimal. Though there are about 9-10 lookup tables total, there are only maybe 10-15 values per table (some smaller) and probably only about 15 parameters total for the complete lookup against all tables. Basically it's 9-10 tables of a modifier for an equation....so given certain values we look up x/y values in each table, get the value, and add them together.
So I guess I'm looking for a library and/or suitable backend that handles the in-memory loading and caching (again, total size of this footprint in RAM will never be a factor)... and possibly can automatically resolve the x lookup tables into a single in-memory x-dimensional table for efficiency (rather than making 9-10 look-ups separately)....and caching these results for repeated use when all parameters match a previous query (unless the lookup performs so quickly this is irrelevant).
Lookup tables are not huge...I would say, if I were to write code to break down each x/y value/range and create one giant x-dimensional lookup table by hand, it would probably end up with maybe 15 fields and 150 rows-ish...so we aren't talking very much data....but it will be hit very, very often and I don't want to perform these lookups every time against the actual DB.
Recommendations for an engine/library suited best for this (with a preference for still being able to use postgresql for the persistent storage) are greatly appreciated.
You don't have to do anything special to achieve that: if you use the tables frequently, PostgreSQL will make them stay in cache automatically.
If you need to have the tables in cache right from the start, use pg_prewarm. It allows you to explicitly load certain tables into cache and can automatically restore the state of the cache as it was before the last shutdown.
Once the tables are cached, they will only cause I/O when you write to them.
The efficient in-memory data structures you envision sound like a premature micro-optimization to me. I'd bet that these small lookup tables won't cause a performance problem (if you have indexes on all foreign keys).
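If you do want to warm the cache explicitly, calling pg_prewarm from Python might look like the sketch below. It assumes a psycopg2 cursor (hence the %s placeholder), a role that is allowed to create the extension, and hypothetical table names; pg_prewarm itself returns the number of blocks it loaded.

```python
def prewarm_tables(cursor, table_names):
    """Load each table into PostgreSQL's buffer cache with pg_prewarm.

    `cursor` is assumed to be a psycopg2 cursor. Returns a dict that
    maps each table name to the number of blocks pg_prewarm reported.
    """
    cursor.execute("CREATE EXTENSION IF NOT EXISTS pg_prewarm")
    loaded = {}
    for name in table_names:
        # The cast turns the table name parameter into a regclass value.
        cursor.execute("SELECT pg_prewarm(%s::regclass)", (name,))
        loaded[name] = cursor.fetchone()[0]
    return loaded
```

You would typically run this once at application startup, for example prewarm_tables(cur, ["lookup_a", "lookup_b"]).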
I was wondering if it's possible to create an SQLite database where I can select rows by rowid with O(1).
I started using an SQLite database in one of my projects and discovered that selecting rows from bigger databases takes longer than selecting rows from smaller databases. I started searching online and stumbled upon this article. Apparently, when selecting by rowid, instead of going straight to the rowid, SQLite performs a binary search to get to the requested rowid. This is a very logical solution, because we can delete rows from the database, and in that case going straight to the rowid won't work.
But in my case I have an "immutable" database: after creating the database I'm not changing it. Thus, all the rowids are present and in the correct order.
So I was wondering if it's possible to either create a special database or use a specific query command which tells SQLite to select by accessing the rowid without any binary search.
If there are other alternatives to SQLite that can perform better for my case, please inform me about them (though in my project I can't load the db into memory, and the access to different db's simultaneously should be instantaneous).
Thanks.
If you do not need the full power of SQLite, you could use a simple key-value store through the dbm module. It uses hashing and could perform better than an ISAM index. But you will lose ordering (among other features like SQL...)
I ended up using mmap. Because I had millions of lines of the same length, I just saved those lines to a binary file and mapped it with mmap. Then, to access line k, I simply asked mmap to read from offset k * (length_of_line).
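A minimal sketch of the dbm approach follows. The file name and records are invented, and which on-disk backend dbm picks (and therefore how it performs) depends on your Python build, so treat this as something to benchmark rather than a guaranteed win. Keys and values are stored as bytes.

```python
import dbm
import os
import tempfile


def build_store(path, records):
    """Write (key, value) pairs into a dbm file; values must be bytes."""
    with dbm.open(path, "c") as db:
        for key, value in records:
            db[str(key).encode()] = value


def lookup(path, key):
    """Return the stored bytes for `key`, or None if it is absent."""
    with dbm.open(path, "r") as db:
        return db.get(str(key).encode())


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rows")
    build_store(path, [(1, b"first row"), (2, b"second row")])
    found = lookup(path, 2)
    missing = lookup(path, 99)
```

In real use you would keep the database open between lookups instead of reopening it for every key.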
I used the snippet code from the answer here to test the solution quickly, though I believe it can be optimized further than this simple code.
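The fixed-length-record idea can be sketched as follows. This is not the snippet referred to above, just an illustration under assumed values: the 16-byte record size, the space padding and the file name are all hypothetical. Record k starts at byte offset k * RECORD_SIZE, so no search is needed.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 16  # hypothetical fixed record length in bytes


def write_records(path, records):
    """Write each record padded with spaces to exactly RECORD_SIZE bytes."""
    with open(path, "wb") as f:
        for record in records:
            if len(record) > RECORD_SIZE:
                raise ValueError("record longer than RECORD_SIZE")
            f.write(record.ljust(RECORD_SIZE, b" "))


def read_record(mapped, k):
    """Return record k by slicing the mapping at k * RECORD_SIZE."""
    start = k * RECORD_SIZE
    if k < 0 or start + RECORD_SIZE > len(mapped):
        raise IndexError("record index out of range")
    return mapped[start:start + RECORD_SIZE].rstrip(b" ")


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    write_records(path, [b"alpha", b"beta", b"gamma"])
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            second = read_record(mapped, 1)
            total = len(mapped) // RECORD_SIZE
```

Because the file is memory-mapped rather than read, only the pages that are actually touched get loaded, which fits the constraint of not holding the whole data set in memory.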
I have a large database of elements, each of which has a unique key. Every so often (once a minute) I get a load more items which need to be added to the database, but if they are duplicates of something already in the database they are discarded.
My question is - is it better to...:
Get Django to give me a list (or set) of all of the unique keys and then, before trying to add each new item, check if its key is in the list or,
have a try/except statement around the save call on the new item and rely on Django catching duplicates?
Cheers,
Jack
If you're using MySQL, you have the power of INSERT IGNORE at your fingertips and that would be the most performant solution. You can execute custom SQL queries using the cursor API directly. (https://docs.djangoproject.com/en/1.9/topics/db/sql/#executing-custom-sql-directly)
If you are using Postgres or some other data-store that does not support INSERT IGNORE then things are going to be a bit more complicated.
In the case of Postgres, versions 9.5 and later offer INSERT ... ON CONFLICT DO NOTHING, which has the same effect; on older versions you can use rules to essentially make your own version of INSERT IGNORE.
It would look something like this:
CREATE RULE "insert_ignore" AS ON INSERT TO "some_table"
WHERE EXISTS (SELECT 1 FROM some_table WHERE pk=NEW.pk) DO INSTEAD NOTHING;
Whatever you do, avoid the "selecting all rows and checking first approach" as the worst-case performance is O(n) in Python and essentially short-circuits any performance advantage afforded by your database since the check is being performed on the app machine (and also eventually memory-bound).
The try/except approach is marginally better than the "select all rows" approach but it still requires constant hand-off to the app server to deal with each conflict, albeit much quicker. Better to make the database do the work.
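To make "let the database do the work" concrete, here is a small sketch using SQLite's INSERT OR IGNORE as a runnable stand-in; the items table and its columns are invented. On MySQL the statement would use INSERT IGNORE, and on PostgreSQL 9.5 or later INSERT ... ON CONFLICT DO NOTHING, with the placeholder style of the driver in use.

```python
import sqlite3


def add_new_items(conn, items):
    """Insert items, letting the database skip keys that already exist.

    Returns the number of rows that were actually added.
    """
    count_sql = "SELECT COUNT(*) FROM items"
    before = conn.execute(count_sql).fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO items (item_key, payload) VALUES (?, ?)",
        items,
    )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before


# Demonstration against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_key TEXT PRIMARY KEY, payload TEXT)")
first_added = add_new_items(conn, [("a", "one"), ("b", "two")])
second_added = add_new_items(conn, [("b", "duplicate"), ("c", "three")])
```

The duplicate is dropped inside the database, so the application never has to fetch the existing keys or handle an exception per conflicting row.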
I'm going to have two independent programs (using SqlAlchemy / ORM / Declarative)
that will inevitably try to access the same database-file/table(SQLite) at the same time.
They could both want to read or write to that table.
Will there be a conflict when this happens?
If the answer is yes, how could this be handled?
SQLite is resistant to the kind of issues you describe. http://www.sqlite.org/howtocorrupt.html gives you details on what could cause problems, and they're generally isolated from anything the code might accidentally do.
If you're concerned due to the nature of your application data access, use BEGIN TRANSACTION and COMMIT/ROLLBACK as appropriate. If your transactions are single query access (that is, you're not reading a value in one query and then changing it in another relative to what you already read), this should not be necessary.
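A read-then-write transaction of the kind described above might be sketched like this. It uses the standard library sqlite3 module directly rather than SQLAlchemy, two connections in one process stand in for the two programs, and the counters table is made up. BEGIN IMMEDIATE takes the write lock before the read, so the value cannot change between the SELECT and the write; the timeout is how long a connection waits for the other writer before giving up.

```python
import os
import sqlite3
import tempfile


def connect(path):
    # isolation_level=None leaves transaction control to explicit
    # BEGIN/COMMIT statements; timeout is the lock wait in seconds.
    return sqlite3.connect(path, timeout=5.0, isolation_level=None)


def increment(conn, name):
    """Read a counter and write back value + 1 inside one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT value FROM counters WHERE name = ?", (name,)
        ).fetchone()
        new_value = (row[0] if row else 0) + 1
        conn.execute(
            "INSERT OR REPLACE INTO counters (name, value) VALUES (?, ?)",
            (name, new_value),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return new_value


# Demonstration: two connections to the same database file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shared.db")
    program_a = connect(path)
    program_b = connect(path)
    program_a.execute(
        "CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)"
    )
    increment(program_a, "hits")
    increment(program_b, "hits")
    final_value = increment(program_a, "hits")
    program_a.close()
    program_b.close()
```

If the other program holds the lock for longer than the timeout, BEGIN IMMEDIATE raises sqlite3.OperationalError ("database is locked"), which the caller can catch and retry.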