Profile Millions of Text Files in Parallel Using an SQLite Counter? - python

A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type's spec has gained enhancements, such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decades-long evolution of these file types, it makes sense to inspect all 14 million of them iteratively, calmly, before dying beneath their crushing weight.
I built a running counter such that every time I see a property (familiar or not) I add 1 to its tally. The SQLite tally board looks like this:
In the special event that I see an unfamiliar property, I add it to the tally. On a typeA file that looks like:
I've got this system down! But it's slow: about 3M files per 36 hours in one process. Originally I was using this trick to pass SQLite a list of properties needing an increment.
placeholder = '?'  # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder for dummy_var in properties)
sql = """UPDATE tally_board
SET %s = %s + 1
WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)
I learned that's a bad idea because:
SQLite string search is much slower than indexed search
several hundred properties (some 160 characters long) make for really long SQL queries
using %s instead of ? is bad security practice... (not a concern ATM)
A "fix" was to maintain a script side property-rowid hash of the tally used in this loop:
1. Read the file for new_properties
2. Read tally_board for rowid, property
3. Generate a script-side client_hash from step 2's read
4. Write rows to tally_board for every new_property not in property (nothing incremented yet). Update client_hash with the new properties
5. Look up the rowid for every row in new_properties using the client_hash
6. Write an increment to every rowid (now a proxy for property) in tally_board
Step 6 looks like:
sql = """UPDATE tally_board
SET %s = %s + 1
WHERE rowid IN %s""" %(type_name, type_name, tuple(target_rows))
cur.execute
The problems with this are:
It's still slow!
It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 right before threadB completes step 6.
A solution to the race condition is to give steps 2-6 an exclusive lock on the DB, though it doesn't look like reads can acquire such a lock.
Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.
There may be some luck in something like this, but I'm unsure how to rewrite it to increment the tally.
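A rough sketch of the kind of upsert I have in mind (assuming SQLite 3.24+ for ON CONFLICT, a UNIQUE index on property, and a default of 0 for the per-type count columns; the column name still has to be interpolated because column names can't be parameterized):
sql = """INSERT INTO tally_board (property, %s) VALUES (?, 1)
ON CONFLICT(property) DO UPDATE SET %s = %s + 1""" % (type_name, type_name, type_name)
cursor.executemany(sql, [(prop,) for prop in properties])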

How about a change of table schema? Instead of a column per type, have a type column. Then you have unique rows identified by property and type, like this:
|rowid|prop    |type |count|
============================
|1    |prop_foo|typeA|215  |
|2    |prop_foo|typeB|456  |
This means you can enter a transaction for each and every property of each and every file separately and let sqlite worry about races. So for each property you encounter, immediately issue a complete transaction that computes the next total and upserts the record identified by the property name and file type.
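A minimal sketch of that per-property transaction, assuming SQLite 3.24+ for ON CONFLICT and a table named tally with a UNIQUE(prop, type) constraint matching the example rows above (all names are illustrative):
import sqlite3

conn = sqlite3.connect("tally.db", timeout=30)  # wait on a locked database instead of raising immediately

def bump(prop, file_type):
    with conn:  # one small transaction: commits on success, rolls back on error
        conn.execute(
            """INSERT INTO tally (prop, type, count) VALUES (?, ?, 1)
               ON CONFLICT(prop, type) DO UPDATE SET count = count + 1""",
            (prop, file_type),
        )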

The following sped things up immensely:
Writing to SQLite less often. Holding most of my intermediate results in memory and then updating the DB with them every 50k files cut the execution time to about a third (35 hours to 11.5 hours); see the sketch below.
Moving the data onto my PC (for some reason my USB 3.0 port was transferring data well below USB 2.0 rates). This cut the execution time to about a fifth (11.5 hours to 2.5 hours).
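A minimal sketch of the write-less-often idea, assuming the prop/type/count schema suggested above and SQLite 3.24+ for ON CONFLICT; the table name, flush interval and helper names are illustrative:
from collections import Counter

pending = Counter()    # (prop, type) -> increments accumulated since the last flush
FLUSH_EVERY = 50_000   # files processed between database writes

def record(prop, file_type):
    pending[(prop, file_type)] += 1

def flush(conn):
    # One transaction per flush instead of one per property.
    with conn:
        conn.executemany(
            """INSERT INTO tally (prop, type, count) VALUES (?, ?, ?)
               ON CONFLICT(prop, type) DO UPDATE SET count = count + excluded.count""",
            [(p, t, n) for (p, t), n in pending.items()],
        )
    pending.clear()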

Related

How can I improve PyODBC performance of a single row select (i.e. a lookup)?

I want to improve the performance of an SQL Select call via ODBC/pyODBC.
This is not against a large database (maybe 10K rows), pulling a unique record (15 columns) from the table; the combined size of the 15 columns is about 500 bytes. I'm using pyODBC with fetchone, and the fastest I have been able to get it down to is about 2 seconds. It used to be roughly 3.5 seconds. I've set the encoding and decoding to UTF-8 to match the database. I have confirmed that the transaction level is READ UNCOMMITTED.
I am using the DataDirect ODBC driver on Ubuntu Linux.
I can't seem to get it below 2 seconds, but if I run it from a SQL tool (like DbVisualizer or DBeaver) the row returns in 0.3 to 0.4 seconds. It's a very simple query with one WHERE clause on a uniquely indexed column. No wildcards, no EXISTS, etc.
Is this just a minimum amount of time that pyodbc takes to process a query?
query = 'select order_num, pick_ticket_num, package_id, ship_via, name, contact, address1, address2, address3, city, state, postal_code, country, phone from dbc.v_dmv5 where package_id = ?'
cursor.execute(query, sqlArgs)
row = cursor.fetchone()
I also tried using turbodbc which results in the same performance level of about 2s for the query. But running this exact query in any sql processor is essentially immediate.
It's definitely not the parameterization either, since I've hard-coded a value into the WHERE clause and it still takes 2+ seconds to execute.
Based on the discussion in the comments to the question, especially
If you run it in iSQL - the result for the data comes back instantaneously - like immediately shows up. But the cursor doesn't come back for another 2 seconds.
and
But if I loop fetchone 3x from execute to fetch, it literally is 2s, 2s, 2s
it appears that the driver retrieves the last value in the last row, for example ...
so63171038 27b8-1b2c EXIT SQLGetData with return code 0 (SQL_SUCCESS)
    HSTMT 0x000000997E788440
    UWORD 2
    SWORD -8 <SQL_C_WCHAR>
    PTR 0x00000099746BD0A0 [ 6] "bar"
    SQLLEN 4096
    SQLLEN * 0x00000099722EE380 (6)
... and then when it calls SQLFetch again to see if there is more information to retrieve (and there isn't) ...
so63171038 27b8-1b2c ENTER SQLFetch
    HSTMT 0x000000997E788440
so63171038 27b8-1b2c EXIT SQLFetch with return code 100 (SQL_NO_DATA_FOUND)
    HSTMT 0x000000997E788440
... that's what is introducing the ~2 second delay.
It definitely looks like a driver (or perhaps database) issue to me.
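If you want to see exactly where the time goes on the Python side, a small timing sketch like the one below (the connection string and package_id value are placeholders) separates execute() from fetchone():
import time
import pyodbc

conn = pyodbc.connect("DSN=mydsn;UID=user;PWD=secret")  # placeholder connection string
cursor = conn.cursor()

query = ('select order_num, pick_ticket_num, package_id, ship_via, name, contact, '
         'address1, address2, address3, city, state, postal_code, country, phone '
         'from dbc.v_dmv5 where package_id = ?')

t0 = time.perf_counter()
cursor.execute(query, ('PKG123',))   # hypothetical package_id
t1 = time.perf_counter()
row = cursor.fetchone()
t2 = time.perf_counter()
print('execute: %.3fs  fetchone: %.3fs' % (t1 - t0, t2 - t1))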

select random row from cassandra

I have the following table:
CREATE TABLE prosfiles (
    name_file text,
    beginpros timestamp,
    humandate timestamp,
    lastpros timestamp,
    originalname text,
    pros int,
    uploaded int,
    uploader text,
    PRIMARY KEY (name_file)
);
CREATE INDEX prosfiles_pros_idx ON prosfiles (pros);
In this table I keep the location of several CSV files which are processed by a Python script. As I have several scripts running at the same time processing those files, I use this table to keep control and avoid two scripts starting to process the same file at the same time (in the 'pros' column, 0 means the file has not been processed, 1 means it has been processed, and 1010 means it is currently being processed by another script).
Each script runs the following query to pick the file to process:
"select name_file from prosfiles where pros = 0 limit 1"
but this always returns the first row that matches the condition.
I would like to run a query that returns a random row from all the ones with pros = 0.
In MySQL I've used "order by rand()", but in Cassandra I don't know how to sort the results randomly.
It looks like you're using Cassandra as a queue, and that's not the best usage pattern for it; use RabbitMQ/SQS/any other queue service instead. Also, Cassandra does not support sorting at all, and that is by design:
a sort would require a lot of computation inside the database if you are trying to sort 1B rows.
a sort is not an easy task in a distributed environment: you have to ask all nodes holding the data to perform it.
But if you know what you are doing, you can revise your database schema to be more suitable for this type of workload:
split your source table into two different tables: the first with the full file information, and the second with the queue itself, containing only the ids of the files to process.
your worker process reads a random row from the queue table (see below how to read a ~random row from Cassandra by primary key).
the worker deletes the target id from the queue and updates your targets table with processing information.
This way of doing things can still lead to errors:
multiple workers can get the same target at once.
if you have a lot of workers and targets, Cassandra's compaction process will kill the performance of your DIY queue.
To read a pseudo-random row from a table by its primary key you can use this query: select * from some_table where token(id_column) > some_random_long_value limit 1, but it also has its cons:
if you have a small set of targets, it will sporadically return an empty result because your some_random_long_value will be higher than the token of any existing key (the sketch below simply wraps around in that case).
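A hedged sketch of that token trick with the DataStax Python driver, against the prosfiles table from the question; the contact point and keyspace name are placeholders, and the token range assumes the default Murmur3 partitioner:
import random
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('mykeyspace')   # placeholder contact point and keyspace

rand_token = random.randint(-2**63, 2**63 - 1)           # Murmur3 token range
row = session.execute(
    "SELECT name_file FROM prosfiles WHERE token(name_file) > %s LIMIT 1",
    (rand_token,),
).one()
if row is None:
    # The random token landed past every existing key; wrap around to the start.
    row = session.execute("SELECT name_file FROM prosfiles LIMIT 1").one()
Note that this only randomizes the starting partition; it does not filter on pros = 0, and it does not stop two workers from picking the same file.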

SQLite get id and insert when using executemany

I am optimising my code and reducing the number of queries. These used to be in a loop, but I am trying to restructure my code to be done like this. How do I get the second query working so that it uses the id entered in the first query for each row? Assume that the datasets are in the right order too.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be in a for loop but were taking too long, so I am trying to avoid using a for loop and running each query individually. I believe there might be a way to combine it into one query, but I am unsure how this would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
You can do this implicitly if your program is single-threaded as follows:
1. Set the AUTOINCREMENT keyword in your table definition. This will guarantee that any generated row IDs will be higher than any that appear in the table currently.
2. Immediately before the first statement, determine the highest ROWID in use in the table:
   oldmax ← Execute("SELECT max(ROWID) FROM nodes")
3. Perform the first insert as before.
4. Read back the row IDs that were actually assigned with a select statement:
   NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax)
5. Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
6. Perform the second insert as before.
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.
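A minimal sketch of that read-back approach, using the table and column names from the question and assuming new_values pairs each node_value with its parent node_id (and that nothing else writes to nodes in the meantime):
oldmax = self.c.execute("SELECT COALESCE(MAX(rowid), 0) FROM nodes").fetchone()[0]

self.c.executemany(
    "INSERT INTO nodes (node_value, node_group) "
    "VALUES (?, (SELECT node_group FROM nodes WHERE node_id = ?) + 1)",
    new_values,
)

new_ids = [row[0] for row in self.c.execute(
    "SELECT rowid FROM nodes WHERE rowid > ? ORDER BY rowid ASC", (oldmax,)
)]

# Pair each parent node_id from new_values with the freshly assigned child rowid.
connection_values = [(parent_id, child_id)
                     for (_, parent_id), child_id in zip(new_values, new_ids)]
self.c.executemany(
    "INSERT INTO connections (parent, child, strength) VALUES (?, ?, 1)",
    connection_values,
)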

Most effective way how to avoid inserting duplicates into database?

I will describe my situation first in order to make the following question as clear as possible.
For simplicity, let say I have a table in MySQL database (InnoDB) with records about dogs with structure as follows:
dog_id (PK) | dog_name
And there are 10,000,000 rows in the table (each represents a unique dog) and an index built on the dog_name column.
My program searches through vets' records that I need to process. Each record is somehow connected with a dog, and there are roughly 100 records per dog. And I want to find dogs which have not been inserted into the database yet.
That means that 100 times in a row the record being processed can be about a dog which is already in the database, and therefore the dog doesn't have to be added. But sometimes it happens (as mentioned before, a 1:100 ratio) that I need to add a dog to the database because it is the first time the program has approached a record about that dog. (I hope this example makes my situation clear.)
My question is:
What is the most effective way to verify that the dog has not been inserted into the database yet?
1. Load all the dog names (suppose all the dogs in the world have unique names) into the program's memory (a set) and check whether the dog is in the set or not. When it is in the set I skip the record; when it is not, I insert the dog.
2. Define the column as UNIQUE and try to insert all the records. When there is a database error because of the uniqueness violation, I just skip the dog and continue.
3. Query the database to find out whether the dog is in the database every time I process a record; if it is in the database I skip the record, and if it is not I insert the dog into the table.
To give you as much information as I can: I use Python, SQLAlchemy, MySQL, and InnoDB.
You should use dog_name as the primary key, and then use
INSERT INTO dogs (dog_name) VALUES ('[NAME HERE]') ON DUPLICATE KEY UPDATE dog_name='[NAME HERE]';
This will only insert unique dog names. If you still want to use a numerical ID for each dog, you can set that column to auto increment, but the primary key should be the dog names (assuming all are unique).
SQLAlchemy does not have this functionality built in, but you can force it to make a similar query with session.merge().
Something like option 2 or option 3 will work best; they should take similar amounts of time, and which one wins will depend on exactly how MySQL/InnoDB decides that a collision has occurred. I don't actually know; it's possible that an insert with a UNIQUE key triggers the same operation as a SELECT. Prototype both and profile their performance.
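A hedged sketch of option 2 with SQLAlchemy, assuming a UNIQUE index on dog_name; the helper name is made up:
from sqlalchemy import text
from sqlalchemy.exc import IntegrityError

def add_dog_if_new(session, dog_name):
    # Rely on the UNIQUE constraint and treat a collision as "already there".
    try:
        session.execute(text("INSERT INTO dogs (dog_name) VALUES (:name)"),
                        {"name": dog_name})
        session.commit()
        return True
    except IntegrityError:
        session.rollback()  # duplicate dog; skip it
        return False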
If performance is an issue, you can always hand-code the SELECT statement, since it's relatively simple. This cuts out the Python overhead of constructing the SQL; that's normally not a huge issue, but SQLAlchemy can add dozens of layers of function calls to support its ability to construct arbitrary queries. You can short-circuit those calls using Python string formatting.
Assuming that 's' is your SQLAlchemy Session object:
def dog_in_db(dog_name):
    # Note: dog_name is interpolated directly into the SQL, so it must be trusted/escaped.
    q = "SELECT COUNT(*) FROM dogs WHERE dog_name = '%s';" % dog_name
    res = s.execute(q)
    return res.first()[0] > 0
You could also try SELECTing the row and checking whether any rows are returned:
q = "SELECT dog_id FROM dogs WHERE dog_name = '%s';" % dog_name
res = s.execute(q)
return res.rowcount > 0
Assuming that your option 1 means loading all of the names from the database, it will be slow. MySQL will always perform any single operation it supports faster than Python can; and what you're doing here is exactly the same single operation (finding a member in a list).

Python-PostgreSQL psycopg2 interface --> executemany

I am currently analyzing a Wikipedia dump file; I am extracting a bunch of data from it using Python and persisting it into a PostgreSQL db. I am always trying to make things go faster, because this file is huge (18 GB). In order to interface with PostgreSQL I am using psycopg2, but this module seems to mimic many other such DBAPIs.
Anyway, I have a question concerning cursor.executemany(command, values); it seems to me like executing an executemany once every 1000 values or so is better than calling cursor.execute(command % value) for each of these 5 million values (please confirm or correct me!).
But, you see, I am using executemany to INSERT 1000 rows into a table which has a UNIQUE integrity constraint; this constraint is not verified in Python beforehand, since that would either require me to SELECT all the time (which seems counterproductive) or require me to use more than 3 GB of RAM. All this to say that I count on Postgres to warn me when my script tries to INSERT an already existing row, by catching the psycopg2.DatabaseError.
When my script detects such a non-UNIQUE INSERT, it calls connection.rollback() (which throws away up to 1000 rows every time, and kind of makes the executemany worthless) and then INSERTs all the values one by one.
Since psycopg2 is so poorly documented (as are so many great modules...), I cannot find an efficient and effective workaround. I have reduced the number of values INSERTed per executemany from 1000 to 100 in order to reduce the likelihood of a non-UNIQUE INSERT per executemany, but I am pretty certain there is a way to just tell psycopg2 to ignore these exceptions or to tell the cursor to continue the executemany.
Basically, this seems like the kind of problem which has a solution so easy and popular, that all I can do is ask in order to learn about it.
Thanks again!
Just copy all the data into a scratch table with the psql \copy command, or use the psycopg2 cursor.copy_from() method. Then:
insert into mytable
select * from (
    select distinct *
    from scratch
) uniq
where not exists (
    select 1
    from mytable
    where mytable.mykey = uniq.mykey
);
This will dedup and run much faster than any combination of inserts.
-dg
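A hedged sketch of that approach with psycopg2 (the connection string and the load_rows helper are made up; scratch, mytable and mykey are the names from the answer, and scratch is assumed to have the same columns as mytable):
import io
import psycopg2

conn = psycopg2.connect("dbname=wiki")   # placeholder connection string
cur = conn.cursor()

def load_rows(rows):
    # Bulk-load everything, duplicates included, into the scratch table.
    buf = io.StringIO("".join("\t".join(map(str, row)) + "\n" for row in rows))
    cur.copy_from(buf, "scratch")
    # Then move only the genuinely new rows across and clear the scratch table.
    cur.execute("""
        INSERT INTO mytable
        SELECT * FROM (SELECT DISTINCT * FROM scratch) uniq
        WHERE NOT EXISTS (
            SELECT 1 FROM mytable WHERE mytable.mykey = uniq.mykey
        );
    """)
    cur.execute("TRUNCATE scratch;")
    conn.commit()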
I had the same problem and searched here for many days, collecting a lot of hints to form a complete solution. Even if the question is outdated, I hope this will be useful to others.
1) Forget about removing indexes/constraints and recreating them later; the benefits are marginal or worse.
2) executemany is better than execute, as it prepares the statement for you. You can get the same results yourself with a command like the following, to gain about 300% speed:
# To run only once:
sqlCmd = """PREPARE myInsert (int, timestamp, real, text) AS
   INSERT INTO myBigTable (idNumber, date_obs, result, user)
   SELECT $1, $2, $3, $4 WHERE NOT EXISTS
   (SELECT 1 FROM myBigTable WHERE (idNumber, date_obs, user) = ($1, $2, $4));"""
curPG.execute(sqlCmd)
cptInsert = 0  # To let you commit from time to time

#... inside the big loop:
curPG.execute("EXECUTE myInsert(%s,%s,%s,%s);", myNewRecord)
allreadyExists = (curPG.rowcount < 1)
if not allreadyExists:
    cptInsert += 1
    if cptInsert % 10000 == 0:
        conPG.commit()
This dummy table example has a unique constraint on (idNumber, date_obs, user).
3) The best solution is to use COPY_FROM and a TRIGGER to manage the unique key BEFORE INSERT. This gave me 36x more speed: I started with normal inserts at 500 records/sec, and with "copy" I got over 18,000 records/sec. Sample code in Python with psycopg2:
ioResult = StringIO.StringIO()  # To use a virtual file as a buffer
cptInsert = 0  # To let you commit from time to time - memory has limitations

#... inside the big loop:
print >> ioResult, "\t".join(map(str, myNewRecord))
cptInsert += 1
if cptInsert % 10000 == 0:
    ioResult = flushCopyBuffer(ioResult, curPG)

#... after the loop:
ioResult = flushCopyBuffer(ioResult, curPG)

def flushCopyBuffer(bufferFile, cursorObj):
    bufferFile.seek(0)  # Little detail where the daemon lurks...
    cursorObj.copy_from(bufferFile, 'myBigTable',
                        columns=('idNumber', 'date_obs', 'value', 'user'))
    cursorObj.connection.commit()
    bufferFile.close()
    bufferFile = StringIO.StringIO()
    return bufferFile
That's it for the Python part. Now the PostgreSQL trigger that avoids the psycopg2.IntegrityError exception which would otherwise get all of the COPY command's records rejected:
CREATE OR REPLACE FUNCTION chk_exists()
  RETURNS trigger AS $BODY$
DECLARE
    curRec RECORD;
BEGIN
    -- Check if record's key already exists or is empty (file's last line is)
    IF NEW.idNumber IS NULL THEN
        RETURN NULL;
    END IF;
    SELECT INTO curRec * FROM myBigTable
        WHERE (idNumber, date_obs, user) = (NEW.idNumber, NEW.date_obs, NEW.user);
    IF NOT FOUND THEN  -- OK, keep it
        RETURN NEW;
    ELSE
        RETURN NULL;  -- Oops, throw it away or update the current record
    END IF;
END;
$BODY$ LANGUAGE plpgsql;
Now link this function to the trigger of your table:
CREATE TRIGGER chk_exists_before_insert
BEFORE INSERT ON myBigTable FOR EACH ROW EXECUTE PROCEDURE chk_exists();
This seems like a lot of work, but PostgreSQL is a very fast beast when it doesn't have to interpret SQL over and over. Have fun.
"When my script detects such a non-UNIQUE INSERT, it connection.rollback() (which makes ups to 1000 rows everytime, and kind of makes the executemany worthless) and then INSERTs all values one by one."
The question doesn't really make a lot of sense.
Does EVERY block of 1,000 rows fail due to non-unique rows?
Does 1 block of 1,000 rows fail (out of 5,000 such blocks)? If so, then executemany helps for 4,999 out of 5,000 and is far from "worthless".
Are you worried about this non-Unique insert? Or do you have actual statistics on the number of times this happens?
If you've switched from 1,000-row blocks to 100-row blocks, you can -- obviously -- determine whether there's a performance advantage for 1,000-row blocks, 100-row blocks and 1-row blocks.
Please actually run the actual program with the actual database and different block sizes and post the numbers.
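A minimal sketch of such a measurement, where insert_batch() stands in for whatever executemany-based routine is being tested (both names are placeholders):
import time

def time_batches(rows, batch_size, insert_batch):
    # Time how long it takes to push all rows through in blocks of batch_size.
    start = time.perf_counter()
    for i in range(0, len(rows), batch_size):
        insert_batch(rows[i:i + batch_size])
    return time.perf_counter() - start

# for size in (1, 100, 1000):
#     print(size, time_batches(sample_rows, size, insert_batch))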
Using a MERGE statement instead of an INSERT would solve your problem.
