I have a MySQL (actually MariaDB 5.5.52) database described roughly as follows:
CREATE TABLE table1 (
id INT NOT NULL AUTO_INCREMENT,
col1 INT,
col2 VARCHAR(32),
col3 VARCHAR(128),
PRIMARY KEY (id),
UNIQUE KEY index1 (col1, col2, col3)
);
There are more columns, but all are inside the UNIQUE key, and there are no other keys in the table.
I run multiple threads of a Python script that inserts into this database. Each thread does around 100-1000 inserts using mysql.connector's executemany, such as
ins_string = "INSERT IGNORE INTO table1 ({0}) VALUES ({1});"
cursor.executemany(ins_string.format(fields, string_symbols), values)
I run into consistent deadlock problems. I presume these are caused because each thread takes gap locks between rows of table1 in a semi-random order, based on the order in which my Python list values is generated. This is somewhat validated by testing: when I build a new database from scratch using 24 threads, the deadlock rate per executemany() statement is > 80%, but by the time there are a million-plus rows in the database the deadlock rate is near zero.
I had considered the possibility that the deadlocks are a result of threads competing for the AUTO_INCREMENT lock, but in InnoDB's default 'consecutive' lock mode it doesn't seem like this should happen: each thread is supposed to hold a table-level AUTO-INC lock until the end of the INSERT. However, the way the AUTO_INCREMENT and INSERT locks interact is confusing to me, so if I have this wrong, please let me know.
So if the problem is caused by the random-ish ordering of the unique key, I need some way of ordering the insert statements in Python before passing them to MySQL. The index is hashed in some way by MySQL and then ordered. How can I replicate the hashing/ordering in Python?
I'm asking about a solution to my diagnosis of the problem here, but if you see that my diagnosis is wrong, again, please let me know.
Why have ID, since you have a UNIQUE key that could be promoted to PRIMARY?
Regardless, sort the bulk insert rows on (col1, col2, col3) before building the executemany.
If that is not sufficient, then decrease the number of rows in each executemany. 100 rows gets within about 10% of the theoretical best. If 100 rows decreases the frequency of deadlocks below, say, 10%, then you are probably very close to the optimal balance between bulk-loading speed and the slowdown from replaying deadlocked batches.
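A minimal sketch of sorting plus batching with a deadlock retry, assuming mysql.connector as in the question and that each row tuple begins with (col1, col2, col3); the chunk size and retry budget are illustrative, and 1213 is InnoDB's deadlock error code:
import mysql.connector

CHUNK = 100        # batch size; tune against your observed deadlock rate
MAX_RETRIES = 5    # illustrative retry budget

def insert_sorted(cnx, ins_string, values):
    # Sort on the UNIQUE key columns so every thread walks the index
    # gaps in the same order (assumes rows start with col1, col2, col3).
    values.sort(key=lambda row: (row[0], row[1], row[2]))
    cursor = cnx.cursor()
    for start in range(0, len(values), CHUNK):
        batch = values[start:start + CHUNK]
        for attempt in range(MAX_RETRIES):
            try:
                cursor.executemany(ins_string, batch)
                cnx.commit()
                break
            except mysql.connector.Error as err:
                cnx.rollback()
                if err.errno != 1213 or attempt == MAX_RETRIES - 1:
                    raise  # not a deadlock, or retries exhausted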
How many CPU cores do you have?
Are there other indexes you are not showing us? Every UNIQUE index factors into this problem. Non-unique indexes are not a problem. Please provide the full SHOW CREATE TABLE.
Related
I'm using Postgres, and I have multiple schemas with identical tables that are dynamically added by the application code.
foo, bar, baz, abc, xyz, ...,
I want to be able to query all the schemas as if they are a single table
!!! I don't want to query all the schemas one by one and combine the results
I want to "combine" the tables across schemas (not sure if this would be considered a huge join) and then run the query.
For example, an ORDER BY query shouldn't return results like
1. schema_A.result_1
2. schema_A.result_3
3. schema_B.result_2
4. schema_B.result_4
but instead it should be
1. schema_A.result_1
2. schema_B.result_2
3. schema_A.result_3
4. schema_B.result_4
If possible I don't want to generate a query that goes like
SELECT schema_A.table_X.field_1, schema_B.table_X.field_1 FROM schema_A.table_X, schema_B.table_X
But I want that to be taken care of in postgresql, in the database.
Generating a query with all the schemas (namespaces) appended can make my queries HUGE, with ~50 fields and ~50 schemas.
Since these tables are generated I also cannot inherit them from some global table and query that instead.
I'd also like to know if this just isn't possible at a reasonable speed.
EXTRA:
I'm using Django and django-tenants, so I'd also accept any answer that actually helps me generate the entire query and run it to get a global queryset, EVEN THOUGH it would be really slow.
Your question isn't as much a question as it is an admission that you've got a really terrible database and application design. It's as if you partitioned something that didn't need to be partitioned, or partitioned it in the wrong way.
Since you're doing something awkward, the database itself won't provide you with any elegant solution. Instead, you'll have to get more and more awkward until the regret becomes too much to bear and you redesign your database and/or your application.
I urge you to repent now, the sooner the better.
After that giant caveat based on a haughty moral position, I acknowledge that the only reason we answer questions here is to get imaginary internet points. And so, my answer is this: use a view that unions all of the values together and presents them as if they came from one table. I can't make any sense of the "order by query", so I'll ignore it for now. Maybe you mean that you want the results in a certain order; if so, you can add constants to each SELECT operand of each UNION ALL and ORDER BY that constant column coming out of the union. But if the order of the rows matters, I'd assert that you are showing yet another symptom of a poor database design.
You can programmatically update the view whenever you create or update schemas and their catalogs.
A working example is here: http://sqlfiddle.com/#!17/c09265/1
with this schema creation and population code:
CREATE SCHEMA Fooey;
CREATE SCHEMA Junk;
CREATE TABLE Fooey.Baz (SomeInteger INT);
CREATE TABLE Junk.Baz (SomeInteger INT);
INSERT INTO Fooey.Baz (SomeInteger) VALUES (17), (34), (51);
INSERT INTO Junk.Baz (SomeInteger) VALUES (13), (26), (39);
CREATE VIEW AllOfThem AS
SELECT 'FromFooey' AS SourceSchema, SomeInteger FROM Fooey.Baz
UNION ALL
SELECT 'FromJunk' AS SourceSchema, SomeInteger FROM Junk.Baz;
and this query:
SELECT *
FROM AllOfThem
ORDER BY SourceSchema;
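The view can be regenerated programmatically whenever a schema is added. A sketch with psycopg2, where the table name table_x and view name all_table_x are placeholders (real code should quote identifiers, for example with psycopg2.sql):
import psycopg2

def rebuild_union_view(conn, table="table_x", view="all_table_x"):
    with conn.cursor() as cur:
        # Find every schema that contains the per-tenant table.
        cur.execute(
            "SELECT table_schema FROM information_schema.tables "
            "WHERE table_name = %s ORDER BY table_schema", (table,))
        schemas = [row[0] for row in cur.fetchall()]
        if not schemas:
            return
        selects = [
            "SELECT '{0}' AS source_schema, * FROM {0}.{1}".format(s, table)
            for s in schemas]
        cur.execute("CREATE OR REPLACE VIEW {0} AS {1}".format(
            view, " UNION ALL ".join(selects)))
    conn.commit()
Call it from the same code path that creates a new schema, and the union view stays current.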
Why are per-tenant schemas a bad design?
This design favors laziness over scalability. If you don't want to make changes to your application, you can simply point connections at a particular schema and keep working without any code changes. Adding more tenants means adding more schemas, which it sounds like you've automated. Adding many schemas will eventually make database management cumbersome (what if you have thousands or millions of tenants?), and even with only a few, the dynamic nature of the list and the difficulty of writing system-wide queries are problems you've already discovered.
Consider instead combining everything and adding the tenant ID as part of a key on each table. In that case, adding more tenants means adding more rows. Any summary queries trivially come from single tables, and all of the features and power of the database implementation and its query language are at your fingertips without any fuss whatsoever.
It's simply false that a database design can't be changed, even in an existing and busy system. It takes a lot of effort to do it, but it can be done and people do it all the time. That's why getting the database design right as early as possible is important.
The README of the django-tenants package you're using describes their decision to trade off toward laziness, and cites a whitepaper that outlines many of the shortcomings of, and alternatives to, that method.
In Python, I'm using SQLite's executemany() function with INSERT INTO to insert stuff into a table. If I pass executemany() a list of things to add, can I rely on SQLite inserting those things from the list in order? The reason is because I'm using INTEGER PRIMARY KEY to autoincrement primary keys. For various reasons, I need to know the new auto-incremented primary key of a row around the time I add it to the table (before or after, but around that time), so it would be very convenient to simply be able to assume that the primary key will go up one for every consecutive element of the list I'm passing executemany(). I already have the highest existing primary key before I start adding stuff, so I can increment a variable to keep track of the primary key I expect executemany() to give each inserted row. Is this a sound idea, or does it presume too much?
(I guess the alternative is to use execute() one-by-one with sqlite3_last_insert_rowid(), but that's slower than using executemany() for many thousands of entries.)
Python's sqlite3 module executes the statement with the list values in the correct order.
Note: if the code already knows the to-be-generated ID value, then you should insert this value explicitly so that you get an error if this expectation turns out to be wrong.
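A sketch of that defensive pattern with sqlite3; the items table is made up for illustration:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

names = ["apple", "banana", "cherry"]
next_id = conn.execute(
    "SELECT COALESCE(MAX(id), 0) FROM items").fetchone()[0] + 1

# Pair each row with the id we expect it to receive; if another writer
# claimed one of these ids, sqlite3.IntegrityError is raised instead of
# the numbering silently drifting.
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(next_id + i, name) for i, name in enumerate(names)])
conn.commit()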
I am in the middle of a project that involves grabbing numerous pieces of information out of 70GB worth of XML documents and loading it into a relational database (in this case Postgres). I am currently using Python scripts and psycopg2 to do the inserts and so on. I have found that as the number of rows in some of the tables increases (the largest is at around 5 million rows), the speed of the script (inserts) has slowed to a crawl. What was once taking a couple of minutes now takes about an hour.
What can I do to speed this up? Was I wrong in using Python and psycopg2 for this task? Is there anything I can do to the database that may speed up this process? I get the feeling I am going about this in entirely the wrong way.
Considering the process was fairly efficient before and only slowed down once the dataset grew, my guess is it's the indexes. You may try dropping the indexes on the table before the import and recreating them after it's done. That should speed things up.
What are the settings for wal_buffers and checkpoint_segments? For large transactions, you have to tweak some settings. Check the manual.
Consider the book PostgreSQL 9.0 High Performance as well, there is much more to tweak than just the database configuration to get high performance.
I'd try to use COPY instead of inserts. This is what backup tools use for fast loading.
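A sketch of a COPY-based load using psycopg2's copy_expert; the table, columns, and two-field rows are placeholders, and the values are assumed to contain no tabs, newlines, or backslashes:
import io
import psycopg2

def copy_rows(conn, rows):
    # COPY ... FROM STDIN reads tab-separated text by default.
    buf = io.StringIO()
    for a, b in rows:
        buf.write("{0}\t{1}\n".format(a, b))
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert("COPY target_table (col_a, col_b) FROM STDIN", buf)
    conn.commit()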
Check that all foreign keys from this table have a corresponding index on the target table. Or better: drop them temporarily before copying and recreate them afterwards.
Increase checkpoint_segments from the default 3 (which means 3 × 16 MB = 48 MB) to a much higher number; try, for example, 32 (512 MB). Make sure you have enough space for this much additional data.
If you can afford to recreate or restore your database cluster from scratch in case of a system crash or power failure, then you can start Postgres with the "-F" option, which will enable the OS write cache.
Take a look at http://pgbulkload.projects.postgresql.org/
There is a list of hints on this topic in the Populating a Database section of the documentation. You might speed up general performance using the hints in Tuning Your PostgreSQL Server as well.
The overhead of checking foreign keys might be growing as the table size increases, which is made worse because you're loading a single record at a time. If you're loading 70GB worth of data, it will be far faster to drop foreign keys during the load, then rebuild them when it's imported. This is particularly true if you're using single INSERT statements. Switching to COPY instead is not a guaranteed improvement either, due to how the pending trigger queue is managed; the issues there are discussed in that first documentation link.
From the psql prompt, you can find the name of the constraint enforcing your foreign key and then drop it using that name like this:
\d tablename
ALTER TABLE tablename DROP CONSTRAINT constraint_name;
When you're done with loading, you can put it back using something like:
ALTER TABLE tablename ADD CONSTRAINT constraint_name FOREIGN KEY (column_name) REFERENCES other_table (join_column);
One useful trick to find out the exact syntax to use for the restore is to do pg_dump --schema-only on your database. The dump from that will show you how to recreate the structure you have right now.
I'd look at the rollback logs. They've got to be getting pretty big if you're doing this in one transaction.
If that's the case, perhaps you can try committing a smaller transaction batch size. Chunk it into smaller blocks of records (1K, 10K, 100K, etc.) and see if that helps.
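A sketch of that chunked-commit idea with psycopg2; the batch size, table, and statement are placeholders:
BATCH = 10000  # try 1K, 10K, 100K and compare

def load_in_batches(conn, rows):
    cur = conn.cursor()
    for i, row in enumerate(rows, start=1):
        cur.execute("INSERT INTO target_table VALUES (%s, %s)", row)
        if i % BATCH == 0:
            conn.commit()  # bound the size of each transaction
    conn.commit()          # flush the final partial batch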
First, 5 million rows is nothing; the insert rate should not change much whether the table holds 100K or 1 million rows.
One or two indexes won't slow it down that much (if the fill factor is set to 70-90, considering each major import is 1/10 of the table).
Python with psycopg2 is quite fast.
A small tip: you could use the database extension xml2 to read/work with the data. There is a small example at
https://dba.stackexchange.com/questions/8172/sql-to-read-xml-from-file-into-postgresql-database
duffymo is right: try to commit in chunks of 10000 inserts (committing only at the end, or after each insert, is quite expensive).
Autovacuum might be thrashing if you do a lot of deletes and updates; you can turn it off temporarily at the start for certain tables. Set work_mem and maintenance_work_mem according to your server's available resources...
For inserts, increase wal_buffers (from 9.1 it defaults to -1, which auto-tunes); if you use PostgreSQL version 8, you should increase it manually.
You could also turn fsync off and test wal_sync_method (be cautious: changing these may make your database crash-unsafe if a sudden power failure or hardware crash occurs).
Try to drop foreign keys, disable triggers, or set conditions so triggers skip execution.
Use prepared statements for inserts, and cast variables.
You could try inserting the data into an unlogged table to hold it temporarily; see the sketch after this list.
Do the inserts have WHERE conditions, or values from a sub-query, functions, or the like?
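A sketch of the unlogged-table staging idea from the list above; UNLOGGED requires PostgreSQL 9.1 or later, and the table names and two-column shape are placeholders:
def stage_and_move(conn, rows):
    with conn.cursor() as cur:
        # Unlogged tables skip WAL, so the staging inserts are cheap;
        # their contents are lost on a crash, which is fine for staging.
        cur.execute(
            "CREATE UNLOGGED TABLE IF NOT EXISTS staging "
            "(LIKE target_table)")
        cur.executemany("INSERT INTO staging VALUES (%s, %s)", rows)
        cur.execute("INSERT INTO target_table SELECT * FROM staging")
        cur.execute("TRUNCATE staging")
    conn.commit()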
I am looking for a simple way to query an update or insert based on whether the row exists in the first place. I am trying to use Python's MySQLdb right now.
This is how I execute my query:
self.cursor.execute("""UPDATE `inventory`
SET `quantity` = `quantity`+{1}
WHERE `item_number` = {0}
""".format(item_number,quantity));
I have seen four ways to accomplish this:
INSERT ... ON DUPLICATE KEY UPDATE. Unfortunately, the primary key is already taken up as a unique ID, so I can't use this.
REPLACE. Same as above; I believe it relies on a primary key to work properly.
mysql_affected_rows(). Usually you can use this after updating the row to see if anything was affected. I don't believe MySQLdb in Python supports this feature.
Of course the last ditch effort: Make a SELECT query, fetchall, then update or insert based on the result. Basically I am just trying to keep the queries to a minimum, so 2 queries instead of 1 is less than ideal right now.
Basically I am wondering if I missed any other way to accomplish this before going with option 4. Thanks for your time.
MySQL DOES allow you to have unique indexes besides the primary key, and INSERT ... ON DUPLICATE KEY UPDATE will do the update if any unique index has a duplicate, not just the PK.
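For example, a sketch of the single-query form, assuming you first add a unique index such as CREATE UNIQUE INDEX idx_item ON inventory (item_number); parameter binding also avoids the string formatting in the question:
cursor.execute(
    "INSERT INTO inventory (item_number, quantity) "
    "VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE quantity = quantity + VALUES(quantity)",
    (item_number, quantity))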
However, I'd probably still go for the "two queries" approach. You are doing this in a transaction, right?
Do the update
Check the rows affected, if it's 0 then do the insert
OR
Attempt the insert
If it failed because of a unique index violation, do the update (NB: You'll want to check the error code to make sure it didn't fail for some OTHER reason)
The former is good if the row will usually exist already, but it can cause a race (or deadlock) if you do it outside a transaction or if your isolation level is not high enough.
Creating a unique index on item_number in your inventory table sounds like a good idea to me, because I imagine (without knowing the details of your schema) that one item should only have a single stock level (assuming your system doesn't allow multiple stock locations etc).
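With that unique index in place, here is a sketch of the insert-first variant using MySQLdb; 1062 is MySQL's duplicate-key error code, so any other failure is re-raised:
import MySQLdb

def upsert(conn, item_number, quantity):
    cur = conn.cursor()
    try:
        cur.execute(
            "INSERT INTO inventory (item_number, quantity) "
            "VALUES (%s, %s)", (item_number, quantity))
    except MySQLdb.IntegrityError as err:
        if err.args[0] != 1062:  # some OTHER failure; don't mask it
            raise
        cur.execute(
            "UPDATE inventory SET quantity = quantity + %s "
            "WHERE item_number = %s", (quantity, item_number))
    conn.commit()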
DB Table:
id int(6)
message char(5)
I have to add a record (message) to the DB table. In case of a duplicate message (the message already exists with a different id), I want to delete (or somehow inactivate) both of the messages and get their IDs in reply.
Is it possible to perform this with only one query? Any performance tips?
P.S.
I use PostgreSQL.
The main problem I'm worried about is the need to use locks when performing this with two or more queries...
Many thanks!
If you really want to worry about locking, do this.
UPDATE table SET status='INACTIVE' WHERE id = 'key';
If this succeeds, there was a duplicate.
INSERT the additional inactive record. Do whatever else you want with your duplicates.
If this fails, there was no duplicate.
INSERT the new active record.
Commit.
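A sketch of this recipe with psycopg2, assuming a messages table with id, message, and status columns as in the discussion above:
def add_message(conn, new_id, message):
    with conn.cursor() as cur:
        # The UPDATE doubles as the duplicate check and seizes the
        # row lock immediately.
        cur.execute(
            "UPDATE messages SET status = 'INACTIVE' "
            "WHERE message = %s AND status = 'ACTIVE' "
            "RETURNING id", (message,))
        dup = cur.fetchone()
        if dup:
            # Duplicate: store the new copy as inactive too and
            # report both ids.
            cur.execute(
                "INSERT INTO messages (id, message, status) "
                "VALUES (%s, %s, 'INACTIVE')", (new_id, message))
            conn.commit()
            return dup[0], new_id
        cur.execute(
            "INSERT INTO messages (id, message, status) "
            "VALUES (%s, %s, 'ACTIVE')", (new_id, message))
        conn.commit()
        return (new_id,)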
This seizes an exclusive lock right away. The alternatives aren't quite as nice.
Starting with an INSERT and checking for duplicates doesn't seize a lock until you start updating. It's not clear whether this is a problem or not.
Starting with a SELECT would need an added LOCK TABLE to ensure that the SELECT holds the row it found so it can be updated. If no row is found, the insert will work fine.
If you have multiple concurrent writers and two writers could attempt access at the same time, you may not be able to tolerate row-level locking.
Consider this.
Process A does a LOCK ROW and a SELECT but finds no row.
Process B does a LOCK ROW and a SELECT but finds no row.
Process A does an INSERT and a COMMIT.
Process B does an INSERT and a COMMIT. You now have duplicate active records.
Multiple concurrent insert/update transactions will only work with table-level locking. Yes, it's a potential slow-down. Three rules: (1) Keep your transactions as short as possible, (2) release the locks as quickly as possible, (3) handle deadlocks by retrying.
You could write a procedure with both of those commands in it, but it may make more sense to use an insert trigger to check for duplicates (or a nightly job, if it's not time-sensitive).
It is a little difficult to understand your exact requirement. Let me rephrase it two ways:
You want both of the entries with the same message in the table (with different IDs), and want to know the IDs for some further processing (marking them as inactive, etc.). For this, you could write a procedure with the separate queries. I don't think you can achieve this with one query.
You do not want either of the entries in the table (I got this from 'I want to delete'). For this, you only have to check if the message already exists and then delete the row if it does, else insert it. I don't think this can be achieved with one query either.
If performance is a constraint during insert, you could insert without any checks and then periodically sanitize the database.
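A sketch of such a periodic sanitize pass, keeping only the lowest id per duplicated message; the messages table follows the question:
def dedupe(conn):
    with conn.cursor() as cur:
        # DELETE ... USING is PostgreSQL's join-delete syntax.
        cur.execute(
            "DELETE FROM messages a USING messages b "
            "WHERE a.message = b.message AND a.id > b.id")
    conn.commit()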