Most effective way to avoid inserting duplicates into a database? - python

I will describe my situation first in order to make the following question as clear as possible.
For simplicity, let's say I have a table in a MySQL database (InnoDB) with records about dogs, with the following structure:
dog_id (PK) | dog_name
There are 10,000,000 rows in the table (each represents a unique dog) and an index built on the dog_name column.
My program searches through vet records that I need to process. Each record is connected with a dog, and there are roughly 100 records per dog. I want to find the dogs which have not been inserted into the database yet.
That means the record being processed can, 100 times in a row, be about a dog that is already in the database, so the dog doesn't have to be added. But sometimes (roughly a 1:100 ratio, as mentioned) I need to add a dog because it is the first time the program has encountered a record about it. (I hope this example makes my situation clear.)
My question is:
What is the most effective way to verify that the dog has not been inserted into the database yet?
1. Load all the dog names (suppose all the dogs in the world have unique names) into a set in the program's memory and check whether each dog is in the set. If it is, I skip the record; if it is not, I insert the dog.
2. Define the column as UNIQUE and try to insert all the records. When the database raises an error because of the uniqueness constraint, I just skip the dog and continue.
3. Query the database every time I process a record to find out whether the dog is already there. If it is, I skip the record; if it is not, I insert the dog into the table.
To give you as much information as I can: I use Python, SQLAlchemy, MySQL, and InnoDB.

You should use dog_name as the primary key, and then use
INSERT INTO dogs (dog_name) VALUES ('[NAME HERE]') ON DUPLICATE KEY UPDATE dog_name='[NAME HERE]';
This will only insert unique dog names. If you still want to use a numerical ID for each dog, you can set that column to auto increment, but the primary key should be the dog names (assuming all are unique).
SQLAlchemy does not have this functionality built in, but you can force it to issue a similar query with session.merge().
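For example, a minimal sketch of the session.merge() route, assuming a declaratively mapped Dog class with dog_name as its primary key (merge() first SELECTs by primary key and only INSERTs when the row is missing):

from sqlalchemy import Column, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Dog(Base):
    __tablename__ = "dogs"
    dog_name = Column(String(255), primary_key=True)

def upsert_dog(session, name):
    # merge() looks the row up by primary key and inserts it only if it is
    # missing, which approximates INSERT ... ON DUPLICATE KEY UPDATE here.
    session.merge(Dog(dog_name=name))
    session.commit()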

Something like option 2 or option 3 will work best; they should take similar amounts of time, and which one wins will depend on exactly how MySQL/InnoDB decides that a collision has occurred. I don't actually know; it's possible that insert with a UNIQUE key triggers the same operation as a SELECT. Prototype both and profile performance.
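As a rough sketch of option 2 (assuming a UNIQUE index on dog_name and that s is your SQLAlchemy Session object, as in the snippets below):

from sqlalchemy import text
from sqlalchemy.exc import IntegrityError

def insert_dog_if_new(dog_name):
    # Option 2: just attempt the INSERT and treat a uniqueness violation
    # as "this dog is already in the table".
    try:
        s.execute(text("INSERT INTO dogs (dog_name) VALUES (:name)"),
                  {"name": dog_name})
        s.commit()
    except IntegrityError:
        s.rollback()  # duplicate dog_name; skip this record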
If performance is an issue, you can always hand-code the SELECT statement since it's relatively simple. This cuts out the Python-side overhead of constructing the SQL; that's normally not a huge issue, but SQLAlchemy can add dozens of layers of function calls to support its ability to construct arbitrary queries. You can short-circuit those calls using Python string formatting.
Assuming that 's' is your SQLAlchemy Session object:
def dog_in_db(dog_name):
    # Hand-built SQL; note the quotes around %s so the name is a string literal.
    q = "SELECT COUNT(*) FROM dogs WHERE dog_name = '%s';" % dog_name
    res = s.execute(q)
    return res.first()[0] > 0
You could also run a plain SELECT and check whether any rows are returned:
q = "SELECT dog_id FROM dogs WHERE dog_name = '%s';" % dog_name
res = s.execute(q)
return res.first() is not None
Assuming that your option 1 means loading all of the names from the database, it will be slow. MySQL will always perform any single operation it supports faster than Python can; and what you're doing here is exactly the same single operation (finding a member in a list).

Related

Profile Millions of Text Files In Parallel Using An Sqlite Counter?

A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type spec has had enhancements, such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decade-long evolution of these file types, it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.
I built a running counter such that every time I see properties (familiar or not) I add 1 to its tally. The sqlite tally board looks like this:
In the special event that I see an unfamiliar property, I add it to the tally. On a typeA file that looks like:
I've got this system down! But it's slow: about 3M files per 36 hours in one process. Originally I was using this trick to pass SQLite a list of properties needing an increment.
placeholder = '?'  # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder for dummy_var in properties)
sql = """UPDATE tally_board
         SET %s = %s + 1
         WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)
I learned that's a bad idea because
SQLite string search is much slower than an indexed search
several hundreds of properties (some 160 characters long) make for really long sql queries
using %s instead of ? is bad security practice... (not a concern ATM)
A "fix" was to maintain a script-side property-to-rowid hash of the tally, used in this loop:
1. Read the file for new_properties
2. Read tally_board for rowid, property
3. Generate a script-side client_hash from step 2's read
4. Write rows to tally_board for every new_property not in property (nothing incremented yet); update client_hash with the new properties
5. Look up the rowid for every row in new_properties using the client_hash
6. Write the increment for every rowid (now a proxy for property) to tally_board
Step 6 looks like:
sql = """UPDATE tally_board
         SET %s = %s + 1
         WHERE rowid IN %s""" % (type_name, type_name, tuple(target_rows))
cur.execute(sql)
The problems with this are:
It's still slow!
It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 right before threadB completes step 6.
A solution to the race condition is to give steps 2-6 an exclusive lock on the DB, though it doesn't look like plain reads can acquire such a lock.
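One way to approximate that from Python's sqlite3 module (a sketch, assuming you manage transactions yourself by opening the connection with isolation_level=None) is to wrap steps 2-6 in an exclusive transaction:

import sqlite3

# Autocommit mode so BEGIN/COMMIT are issued explicitly by us, not by the module.
conn = sqlite3.connect("tally.db", isolation_level=None)   # hypothetical DB file
cur = conn.cursor()

cur.execute("BEGIN EXCLUSIVE")   # blocks other readers and writers (default rollback-journal mode)
try:
    # ... steps 2-6 from the loop above: read tally_board, add new
    # properties, look up rowids, write the increments ...
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")
    raise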
Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.
There may be luck in an approach like that, but I'm unsure how to rewrite it to increment the tally.
How about a change of table schema? Instead of a column per type, have a type column. Then you have unique rows identified by property and type, like this:
|rowid|prop    |type |count|
|=====|========|=====|=====|
|1    |prop_foo|typeA|215  |
|2    |prop_foo|typeB|456  |
This means you can enter a transaction for each and every property of each and every file separately and let sqlite worry about races. So for each property you encounter, immediately issue a complete transaction that computes the next total and upserts the record identified by the property name and file type.
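With that schema, each per-property transaction can be a single UPSERT. A minimal sketch, assuming SQLite 3.24+ (for ON CONFLICT ... DO UPDATE) and a UNIQUE constraint on (prop, type); the file name is illustrative:

import sqlite3

conn = sqlite3.connect("tally.db")
conn.execute("""CREATE TABLE IF NOT EXISTS tally_board (
                    prop  TEXT NOT NULL,
                    type  TEXT NOT NULL,
                    count INTEGER NOT NULL DEFAULT 0,
                    UNIQUE (prop, type))""")

def bump(prop, type_name):
    # Insert the (prop, type) row with count 1, or increment it if it already exists.
    with conn:   # each call is its own transaction; SQLite handles the locking
        conn.execute(
            """INSERT INTO tally_board (prop, type, count) VALUES (?, ?, 1)
               ON CONFLICT (prop, type) DO UPDATE SET count = count + 1""",
            (prop, type_name))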
The following sped things up immensely:
Wrote less often to SQLite. Holding most of my intermediate results in memory and then updating the DB with them every 50k files cut the execution time to about a third (35 hours to 11.5 hours); see the sketch below.
Moving the data onto my PC (for some reason my USB 3.0 port was transferring data well below USB 2.0 rates). This cut the execution time to about a fifth (11.5 hours to 2.5 hours).
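A sketch of the first point, assuming the tally_board table and UPSERT from the sketch above; all_files, parse_properties() and file_type() are hypothetical stand-ins for the real file list and parser:

from collections import Counter
import sqlite3

conn = sqlite3.connect("tally.db")   # tally_board as created in the sketch above
FLUSH_EVERY = 50_000                 # files processed between DB writes
pending = Counter()                  # (prop, type) -> increments not yet written

def flush():
    # One transaction for the whole accumulated batch.
    with conn:
        conn.executemany(
            """INSERT INTO tally_board (prop, type, count) VALUES (?, ?, ?)
               ON CONFLICT (prop, type) DO UPDATE SET count = count + excluded.count""",
            [(p, t, n) for (p, t), n in pending.items()])
    pending.clear()

# Hypothetical driver loop.
for i, path in enumerate(all_files, start=1):
    for prop in parse_properties(path):
        pending[(prop, file_type(path))] += 1
    if i % FLUSH_EVERY == 0:
        flush()
flush()   # write out whatever is left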

SQLAlchemy Delete And Insert in Same Transaction

I am using SQLAlchemy and, inside one transaction, I want to do the following:
Delete all records that meet a certain criterion (say Cars.color == "red").
Then insert all cars that meet another criterion (say Cars.type == "Honda").
Now let's say that my database is just a table with 3 columns (name, color, and type), with name as the primary key.
If my database already has a car that is red, of type Honda, and named Bob, I can't just say
Cars.query.filter(Cars.color == "red").delete()
# add all Hondas
db.session.commit()
because the "add all Hondas" step will fail: I could potentially be adding a car named Bob with color red. How can I do the deletion and have the insertions follow as part of one action?
For reference, I am using MySQL.
You could try this:
db.session.query(Cars).filter(Cars.color == "red").delete()
# add all Hondas
for h in Hondas:
    db.session.add(h)
db.session.commit()
Caveat lector: I do not think your current code sets up the transaction properly. Apart from that, I don't believe the problem you describe exists - session.commit() flushes the changes in sequence, so whether or not session.flush() is called, you should be able to insert the Hondas at the point you marked: the red car will be deleted before the insert hits the DB.
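A sketch that also makes the transaction handling explicit (assuming the db.session and Cars model from the question, with Hondas as the iterable of new Cars objects):

try:
    # The bulk DELETE below is emitted immediately inside the session's open
    # transaction, so the old red "Bob" row is gone before the new rows are
    # flushed at commit time.
    db.session.query(Cars).filter(Cars.color == "red").delete()
    for h in Hondas:
        db.session.add(h)
    db.session.commit()       # DELETE and INSERTs succeed together...
except Exception:
    db.session.rollback()     # ...or are rolled back together
    raise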

SQLite get id and insert when using executemany

I am optimising my code and reducing the number of queries. These used to be run in a loop, but I am trying to restructure my code as shown below. How do I get the second query to use, for each row, the id inserted by the first query? Assume that the datasets are in the right order too.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be in a for loop but were taking too long, so I am trying to avoid running the query individually for each row. I believe there might be a way to combine this into one query, but I am unsure how that would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
You can do this implicitly if your program is single-threaded as follows:
1. Set the AUTOINCREMENT keyword in your table definition. This will guarantee that any newly generated row IDs will be higher than any that currently appear in the table.
2. Immediately before the first statement, determine the highest ROWID in use in the table:
oldmax ← Execute("SELECT max(ROWID) FROM nodes")
3. Perform the first insert as before.
4. Read back the row IDs that were actually assigned with a SELECT statement:
NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax)
5. Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
6. Perform the second insert as before.
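A Python sketch of steps 2-6, keeping the question's self.c cursor and assuming each new_values tuple is (node_value, parent_node_id) as in the original executemany call:

# Step 2: remember the highest rowid before inserting (0 for an empty table).
oldmax = self.c.execute("SELECT COALESCE(MAX(rowid), 0) FROM nodes").fetchone()[0]

# Step 3: the first insert, unchanged.
self.c.executemany(
    "INSERT INTO nodes (node_value, node_group) "
    "VALUES (?, (SELECT node_group FROM nodes WHERE node_id = ?) + 1)",
    new_values)

# Step 4: read back the rowids SQLite actually assigned, in ascending order.
new_ids = [row[0] for row in self.c.execute(
    "SELECT rowid FROM nodes WHERE rowid > ? ORDER BY rowid ASC", (oldmax,))]

# Step 5: pair each parent id from new_values with its new child id.
connection_values = [(parent, child)
                     for (_, parent), child in zip(new_values, new_ids)]

# Step 6: the second insert, unchanged.
self.c.executemany(
    "INSERT INTO connections (parent, child, strength) VALUES (?, ?, 1)",
    connection_values)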
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.
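For instance (a sketch only; insert_nodes_and_connections is a hypothetical wrapper around steps 2-6 above, and the lock must span everything from reading the old maximum through the final insert):

import threading

nodes_write_lock = threading.Lock()   # shared by every thread that writes to nodes

def add_nodes_safely(new_values):
    # Serialize the whole read-back sequence so no other thread can insert
    # rows between the MAX(rowid) check and the SELECT of the new rowids.
    with nodes_write_lock:
        insert_nodes_and_connections(new_values)   # hypothetical wrapper around steps 2-6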

sqlite tracking IDs - find missing integers in a seq

First, I am not even sure whether I am asking the right question, sorry for that. SQL is new to me. I have a table that I create in SQLite like this:
CREATE TABLE ENTRIES (ID INTEGER PRIMARY KEY AUTOINCREMENT, DATA BLOB NOT NULL)
This is all fine if I only ever add entries: as I create them, the ID increments. Let us say I added 7 entries. Now I delete 3 of them:
DELETE FROM ENTRIES WHERE ID = 3
DELETE FROM ENTRIES WHERE ID = 4
DELETE FROM ENTRIES WHERE ID = 5
The entries I now have are: 1, 2, 6, 7.
The next time I add an entry it will have ID=8.
So, my question is:
How do I get the next 3 entries to take the IDs 3, 4, 5, so that only the 4th new entry gets ID 8? I realize this is similar to "SQL: find missing IDs in a table", and it is maybe also a general programming (not just SQL) problem. So, I would be happy to see some Python and SQLite solutions.
Thanks,
Oz
I don't think that's the way auto-incrementing fields work. SQLite keeps a counter of the last used integer and will never 'fill in' the deleted values. If you want to get the next 3 rows after an ID, you could do:
SELECT * FROM ENTRIES WHERE ID > 2 LIMIT 3;
This will give you the next three rows with an ID greater than 2.
Alternatively, you could add a 'deleted' flag so that rows are never actually removed from your database.
You can't. SQLite will never re-use deleted IDs, for database integrity reasons. Let's assume you have a second table which has a foreign key which references the first table. If, for some reason, a corresponding row is removed without removing the rows which reference it (using the primary ID) as well, it will point to the wrong row.
Example: If you remove a person record without removing the purchases as well, the purchase records will point to the wrong person once you re-assign the old ID.
─────────────────────      ────────────────────
Table 1 – customers        Table 2 – purchase
─────────────────────      ────────────────────
*ID  <───────┐             *ID
Name         │             Item
Address      └──────────── Customer
Phone                      Price
This is why pretty much any database engine out there assigns primary IDs strictly incrementally. They are database internals; you usually shouldn't touch them. If you need to assign your own IDs, add a separate column (and think twice before doing so).
If you want to keep track of the number of rows, you can query it like this: SELECT Count(*) FROM table_name.

Efficient way of phrasing multiple tuple pair WHERE conditions in SQL statement

I want to perform an SQL query that is logically equivalent to the following:
DELETE FROM pond_pairs
WHERE
((pond1 = 12) AND (pond2 = 233)) OR
((pond1 = 12) AND (pond2 = 234)) OR
((pond1 = 12) AND (pond2 = 8)) OR
((pond1 = 13) AND (pond2 = 6547)) OR
((pond1 = 13879) AND (pond2 = 6))
I will have hundreds of thousands of pond1-pond2 pairs. I have an index on (pond1, pond2).
My limited SQL knowledge came up with several approaches:
Run the whole query as is.
Batch the query up into smaller queries with n WHERE conditions
Save the pond1-pond2 pairs into a new table, and use a subquery in the WHERE clause to identify the rows to delete.
Convert the python logic which identifies rows to delete into a stored procedure. Note that I am unfamiliar with programming stored procedures and thus this would probably involve a steep learning curve.
I am using postgres if that is relevant.
For a large number of pond1-pond2 pairs to be deleted in a single DELETE, I would create a temporary table and join on it.
-- Create the temp table:
CREATE TEMP TABLE foo AS SELECT * FROM (VALUES (1,2), (1,3)) AS sub (pond1, pond2);
-- Delete
DELETE FROM bar
USING foo            -- the joined table
WHERE bar.pond1 = foo.pond1
  AND bar.pond2 = foo.pond2;
I would do 3 (with a JOIN rather than a subquery) and measure the time of the DELETE query alone (without creating the table and inserting). This is a good starting point, because JOINing is a very common and well-optimized operation, so it will be hard to beat that time. Then you can compare that time to your current approach.
You can also try the following approach (see the sketch below):
Sort the pairs in the same order as the index.
Delete using method 2 from your description (probably in a single transaction).
Sorting before the delete gives better index-read performance, because there's a greater chance for the disk cache to work.
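A sketch of that sorted, batched variant, assuming psycopg2 and that pairs is the list of (pond1, pond2) tuples to delete; the connection string and batch size are illustrative:

import psycopg2

conn = psycopg2.connect("dbname=ponds")   # hypothetical connection parameters
BATCH = 1000                              # pairs per DELETE statement

pairs.sort()                              # same order as the (pond1, pond2) index

with conn, conn.cursor() as cur:          # one transaction for all batches
    for i in range(0, len(pairs), BATCH):
        batch = tuple(pairs[i:i + BATCH])
        # psycopg2 expands a tuple of (pond1, pond2) tuples into a
        # row-value IN list: ... IN ((12, 233), (12, 234), ...)
        cur.execute("DELETE FROM pond_pairs WHERE (pond1, pond2) IN %s",
                    (batch,))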
With hundreds of thousands of pairs, you cannot do 1 (run the query as is), because the SQL statement would be too long.
3 is good if you already have the pairs in a table. If not, you would need to insert them first; and if you do not need them later, you might just as well run the same number of DELETE statements instead of INSERT statements.
How about a prepared statement in a loop, maybe batched (if Python supports that)?
begin transaction
prepare statement "DELETE FROM pond_pairs WHERE ((pond1 = ?) AND (pond2 = ?))"
loop over your data (in Python), and run the statement with one pair (or add to batch)
commit
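In Python with psycopg2 that could look roughly like this (pairs is assumed to be the list of (pond1, pond2) tuples; psycopg2 does not expose an explicit PREPARE, so executemany inside one transaction stands in for the prepared statement):

import psycopg2

conn = psycopg2.connect("dbname=ponds")   # hypothetical connection parameters

with conn:                                # begin/commit one transaction
    with conn.cursor() as cur:
        # executemany runs the same parameterized DELETE once per pair.
        cur.executemany(
            "DELETE FROM pond_pairs WHERE pond1 = %s AND pond2 = %s",
            pairs)                        # pairs: iterable of (pond1, pond2) tuples
conn.close()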
Where are the pairs coming from? If you can write a SELECT statement to identify them, you can just move this condition into the WHERE clause of your delete.
DELETE FROM pond_pairs WHERE (pond1, pond2) IN (SELECT pond1, pond2 FROM ...)
