In Python, I'm using SQLite's executemany() with INSERT INTO to insert rows into a table. If I pass executemany() a list of things to add, can I rely on SQLite inserting those things in list order? The reason I ask is that I'm using INTEGER PRIMARY KEY to autoincrement the primary keys. For various reasons, I need to know the new auto-incremented primary key of a row around the time I add it to the table (before or after, but around that time), so it would be very convenient to be able to assume that the primary key goes up by one for each consecutive element of the list I pass to executemany(). I already have the highest existing primary key before I start adding rows, so I can increment a variable to keep track of the primary key I expect executemany() to give each inserted row. Is this a sound idea, or does it presume too much?
(I guess the alternative is to use execute() one-by-one with sqlite3_last_insert_rowid(), but that's slower than using executemany() for many thousands of entries.)
Python's sqlite3 module executes the statement with the list values in the correct order.
Note: if the code already knows the to-be-generated ID value, then you should insert this value explicitly so that you get an error if this expectation turns out to be wrong.
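For example, a minimal self-contained sketch of that idea (the items table and its columns are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

names = ["apple", "banana", "cherry"]

# Highest existing primary key (0 if the table is empty).
start = conn.execute("SELECT COALESCE(MAX(id), 0) FROM items").fetchone()[0]

# Predict each row's id and insert it explicitly; a wrong assumption then
# raises sqlite3.IntegrityError instead of silently drifting out of sync.
rows = [(start + i, name) for i, name in enumerate(names, start=1)]
conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", rows)
conn.commit()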
I have a huge list of dicts (or objects) that all have exactly the same fields. I need to search this collection to retrieve a single object matching given criteria (there could be many matches, but I take only the first one).
The search criteria don't involve all of the fields, and the fields involved differ from search to search, so turning the list into a dictionary keyed on hashed values is impossible.
This looks like a job for a database, so I'm thinking about using an in-memory SQLite database for it. The problem is that I'd need to write some kind of wrapper to translate SQL queries into the Python API, which makes me think that perhaps there's already a solution for that somewhere.
Maybe someone has already had a similar problem and there's a tool that will help me with this? Or is SQLite the only way?
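If it helps to picture the in-memory SQLite route being considered, here is a rough, self-contained sketch (the records table and the name/color/size fields are made up for illustration):

import sqlite3

# Hypothetical records; in practice these are the dicts with identical fields.
records = [
    {"name": "a", "color": "red", "size": 3},
    {"name": "b", "color": "blue", "size": 5},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT, color TEXT, size INTEGER)")
conn.executemany(
    "INSERT INTO records (name, color, size) VALUES (:name, :color, :size)",
    records,
)

# Any subset of fields can serve as criteria; LIMIT 1 takes the first match.
match = conn.execute(
    "SELECT * FROM records WHERE color = ? AND size > ? LIMIT 1",
    ("blue", 4),
).fetchone()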
I have a MySQL (actually MariaDB 5.5.52) database described roughly as follows:
CREATE TABLE table1 (
    id INT NOT NULL AUTO_INCREMENT,
    col1 INT,
    col2 VARCHAR(32),
    col3 VARCHAR(128),
    PRIMARY KEY (id),
    UNIQUE KEY index1 (col1, col2, col3)
);
There are more columns, but all are inside the UNIQUE key, and there are no other keys in the table.
I run multiple threads of a Python script that insert into this database. Each thread does around 100-1000 inserts using mysql.connector's executemany(), such as:
ins_string = "INSERT IGNORE INTO table1 ({0}) VALUES ({1});"
cursor.executemany(ins_string.format(fields, string_symbols), values)
I run into consistent deadlock problems. I presume these deadlocks occur because each thread acquires locks on rows (and the gaps between them) of table1 in a semi-random order, based on the order in which my Python list values is generated. This is somewhat supported by testing: when I build a new database from scratch using 24 threads, the deadlock rate per executemany() statement is > 80%, but by the time there are a million-plus rows in the database the deadlock rate is near zero.
I had considered the possibility that the deadlocks result from threads competing for the AUTO_INCREMENT lock, but in the default InnoDB 'consecutive' lock mode it doesn't seem like this should happen: each thread is supposed to get a table-level lock only until the end of the INSERT. However, the way the AUTO_INCREMENT and INSERT locks interact is confusing to me, so if I have this wrong, please let me know.
So, if the problem is caused by the random-ish ordering on the unique key, I need some way of ordering the insert rows in Python before passing them to MySQL. The index is hashed in some way by MySQL and then ordered. How can I replicate that hashing/ordering in Python?
I'm asking about a solution to my diagnosis of the problem here, but if you see that my diagnosis is wrong, again, please let me know.
Why have ID, since you have a UNIQUE key that could be promoted to PRIMARY?
Regardless, sort the bulk insert rows on (col1, col2, col3) before building the executemany.
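A minimal sketch of that sorting step, reusing the names from the question's snippet and assuming each tuple in values begins with (col1, col2, col3):

# Sort so every thread walks the unique index in the same order.
values.sort(key=lambda row: (row[0], row[1], row[2]))
cursor.executemany(ins_string.format(fields, string_symbols), values)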
If that is not sufficient, then decrease the number of rows in each executemany(). About 100 rows per batch gets within roughly 10% of the theoretical best insert speed. If batches of 100 bring the frequency of deadlocks below, say, 10%, then you are probably very close to the optimal balance between bulk-loading speed and the slowdown from replaying deadlocks.
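A simple way to cap the batch size (again reusing the names from the question; cnx is assumed to be the mysql.connector connection):

BATCH_SIZE = 100  # roughly 100 rows per executemany()

sql = ins_string.format(fields, string_symbols)
for i in range(0, len(values), BATCH_SIZE):
    cursor.executemany(sql, values[i:i + BATCH_SIZE])
    cnx.commit()  # commit per batch; `cnx` is the mysql.connector connection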
How many CPU cores do you have?
Are there other indexes you are not showing us? Every UNIQUE index factors into this problem. Non-unique indexes are not a problem. Please provide the full SHOW CREATE TABLE.
I am wondering: if I don't use an auto-increment id as the primary key in MySQL, but generate the key some other way, could I replace the auto id with a bson.objectid.ObjectId?
According to ObjectId description, it's composed of:
a 4-byte value representing the seconds since the Unix epoch
a 3-byte machine identifier
a 2-byte process id
a 3-byte counter, starting with a random value.
It seems this would provide a unique, non-duplicating key. Is it a good idea?
You certainly could do this. One issue though is that since this can't be set by the database itself, you'll need to write some Python code to ensure it is set on save.
Since you're not using MongoDB, though, I wonder why you want to use a BSON id. Instead you might want to consider using UUID, which can indeed be set automatically by the db.
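As a rough illustration of both options (nothing here is specific to your schema; the bson import requires the bson or pymongo package):

import uuid
from bson.objectid import ObjectId

# Option 1: generate an ObjectId in application code before every insert,
# since MySQL cannot set this itself.
object_id = str(ObjectId())      # 24-character hex string

# Option 2: a UUID, which a db default or ORM can often fill in automatically.
row_uuid = str(uuid.uuid4())     # 36-character string

print(object_id, row_uuid)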
I have a large database of elements each of which has unique key. Every so often (once a minute) I get a load more items which need to be added to the database but if they are duplicates of something already in the database they are discarded.
My question is - is it better to...:
Get Django to give me a list (or set) of all of the unique keys and then, before trying to add each new item, check if its key is in the list or,
have a try/except statement around the save call on the new item and rely on Django catching duplicates?
Cheers,
Jack
If you're using MySQL, you have the power of INSERT IGNORE at your fingertips, and that would be the most performant solution. You can execute custom SQL queries using the cursor API directly (https://docs.djangoproject.com/en/1.9/topics/db/sql/#executing-custom-sql-directly).
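A rough sketch of that with Django's raw cursor (the table and column names are placeholders, not your actual schema):

from django.db import connection

def bulk_insert_ignore(rows):
    # rows is an iterable of (unique_key, value) tuples; rows whose unique
    # key already exists are silently skipped by INSERT IGNORE.
    with connection.cursor() as cursor:
        cursor.executemany(
            "INSERT IGNORE INTO myapp_item (unique_key, value) VALUES (%s, %s)",
            rows,
        )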
If you are using Postgres or some other data-store that does not support INSERT IGNORE then things are going to be a bit more complicated.
In the case of Postgres, you can use rules to essentially make your own version of INSERT IGNORE.
It would look something like this:
CREATE RULE "insert_ignore" AS ON INSERT TO "some_table"
WHERE EXISTS (SELECT 1 FROM some_table WHERE pk=NEW.pk) DO INSTEAD NOTHING;
Whatever you do, avoid the "select all rows and check first" approach: its worst-case performance is O(n) in Python, and it essentially short-circuits any performance advantage afforded by your database, since the check is performed on the app machine (and will eventually be memory-bound as well).
The try/except approach is marginally better than the "select all rows" approach but it still requires constant hand-off to the app server to deal with each conflict, albeit much quicker. Better to make the database do the work.
I am looking for a simple way to run an update or an insert depending on whether the row exists in the first place. I am trying to use Python's MySQLdb right now.
This is how I execute my query:
self.cursor.execute("""UPDATE `inventory`
SET `quantity` = `quantity`+{1}
WHERE `item_number` = {0}
""".format(item_number,quantity));
I have seen four ways to accomplish this:
DUPLICATE KEY. Unfortunately the primary key is already taken up as a unique ID so I can't use this.
REPLACE. Same as above, I believe it relies on a primary key to work properly.
mysql_affected_rows(). Usually you can use this after updating the row to see if anything was affected. I don't believe MySQLdb in Python supports this feature.
Of course the last ditch effort: Make a SELECT query, fetchall, then update or insert based on the result. Basically I am just trying to keep the queries to a minimum, so 2 queries instead of 1 is less than ideal right now.
Basically I am wondering if I missed any other way to accomplish this before going with option 4. Thanks for your time.
MySQL DOES allow you to have unique indexes, and INSERT ... ON DUPLICATE KEY UPDATE will do the update if any unique index has a duplicate, not just the PK.
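For instance, assuming a unique index on item_number, a sketch of the single-statement version (matching the style of your snippet):

# Insert the row, or add to the existing quantity when item_number already exists.
self.cursor.execute(
    """INSERT INTO `inventory` (`item_number`, `quantity`)
       VALUES (%s, %s)
       ON DUPLICATE KEY UPDATE `quantity` = `quantity` + VALUES(`quantity`)""",
    (item_number, quantity),
)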
However, I'd probably still go for the "two queries" approach. You are doing this in a transaction, right?
Do the update
Check the rows affected, if it's 0 then do the insert
OR
Attempt the insert
If it failed because of a unique index violation, do the update (NB: You'll want to check the error code to make sure it didn't fail for some OTHER reason)
The former is good if the row will usually exist already, but it can cause a race condition (or deadlock) if you do it outside a transaction or if your isolation level is not high enough.
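A sketch of the latter ("attempt the insert") variant with MySQLdb, where error code 1062 is MySQL's "duplicate entry" (names follow your snippet):

import MySQLdb

try:
    self.cursor.execute(
        "INSERT INTO `inventory` (`item_number`, `quantity`) VALUES (%s, %s)",
        (item_number, quantity),
    )
except MySQLdb.IntegrityError as exc:
    # 1062 = ER_DUP_ENTRY; re-raise anything else so unrelated failures
    # aren't silently turned into updates.
    if exc.args[0] != 1062:
        raise
    self.cursor.execute(
        "UPDATE `inventory` SET `quantity` = `quantity` + %s WHERE `item_number` = %s",
        (quantity, item_number),
    )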
Creating a unique index on item_number in your inventory table sounds like a good idea to me, because I imagine (without knowing the details of your schema) that one item should only have a single stock level (assuming your system doesn't allow multiple stock locations etc).