I am currently analyzing a Wikipedia dump file; I am extracting a bunch of data from it using Python and persisting it into a PostgreSQL db. I am always trying to make things go faster, because this file is huge (18 GB). To interface with PostgreSQL I am using psycopg2, but this module seems to mimic many other such DBAPIs.
Anyway, I have a question concerning cursor.executemany(command, values); it seems to me like executing an executemany once every 1000 values or so is better than calling cursor.execute(command % value) for each of these 5 million values (please confirm or correct me!).
But, you see, I am using executemany to INSERT 1000 rows into a table which has a UNIQUE integrity constraint; this constraint is not verified in Python beforehand, because that would either require me to SELECT all the time (which seems counter-productive) or require more than 3 GB of RAM. All this to say that I count on Postgres to warn me when my script tries to INSERT an already existing row, by catching the psycopg2.DatabaseError.
When my script detects such a non-UNIQUE INSERT, it calls connection.rollback() (which throws away up to 1000 rows every time, and kind of makes the executemany worthless) and then INSERTs all values one by one.
Since psycopg2 is so poorly documented (as are so many great modules...), I cannot find an efficient and effective workaround. I have reduced the number of values INSERTed per executemany from 1000 to 100 in order to reduce the likelihood of a non-UNIQUE INSERT per executemany, but I am pretty certain there is a way to just tell psycopg2 to ignore these exceptions or to tell the cursor to continue the executemany.
Basically, this seems like the kind of problem which has a solution so easy and popular, that all I can do is ask in order to learn about it.
Thanks again!
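For reference, a minimal sketch of the batch-then-fallback pattern described above, assuming psycopg2 and a hypothetical pages table with a UNIQUE constraint on title (the DSN, table and columns are stand-ins):

import psycopg2

# Hypothetical connection and schema; the real dump parser would feed the rows.
conn = psycopg2.connect("dbname=wikipedia user=postgres")
cur = conn.cursor()

def flush(batch):
    """Try the whole batch at once; on a constraint violation, redo it row by row."""
    try:
        cur.executemany("INSERT INTO pages (title, length) VALUES (%s, %s)", batch)
        conn.commit()
    except psycopg2.DatabaseError:
        conn.rollback()              # the rollback discards the whole batch
        for row in batch:
            try:
                cur.execute("INSERT INTO pages (title, length) VALUES (%s, %s)", row)
                conn.commit()
            except psycopg2.DatabaseError:
                conn.rollback()      # skip the duplicate row

def insert_all(rows, batch_size=1000):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)

insert_all([("Foo", 10), ("Bar", 20), ("Foo", 10)])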
Just copy all the data into a scratch table with the psql \copy command, or use the psycopg cursor.copy_in() method. Then:
insert into mytable
select * from (
    select distinct *
    from scratch
) uniq
where not exists (
    select 1
    from mytable
    where mytable.mykey = uniq.mykey
);
This will dedupe and run much faster than any combination of inserts.
-dg
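For reference, a minimal sketch of bulk-loading the scratch table from Python; this uses psycopg2's copy_from, and the scratch table layout, column names and sample rows are assumptions:

import io
import psycopg2

conn = psycopg2.connect("dbname=wikipedia user=postgres")
cur = conn.cursor()

rows = [("Foo", 10), ("Bar", 20), ("Foo", 10)]   # stand-in for the parsed dump data

# Write the rows into an in-memory tab-separated buffer...
buf = io.StringIO()
for mykey, length in rows:
    buf.write("%s\t%s\n" % (mykey, length))
buf.seek(0)

# ...bulk-load it into the scratch table, then run the INSERT ... WHERE NOT EXISTS above.
cur.copy_from(buf, 'scratch', columns=('mykey', 'length'))
conn.commit()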
I had the same problem and searched here for many days to collect a lot of hints and form a complete solution. Even if the question is outdated, I hope this will be useful to others.
1) Forget about removing indexes/constraints and recreating them later; the benefits are marginal or worse.
2) executemany is better than execute because it prepares the statement for you. You can get the same result yourself with a command like the following and gain about 300% in speed:
# To run only once:
sqlCmd = """PREPARE myInsert (int, timestamp, real, text) AS
    INSERT INTO myBigTable (idNumber, date_obs, result, user)
    SELECT $1, $2, $3, $4 WHERE NOT EXISTS
        (SELECT 1 FROM myBigTable WHERE (idNumber, date_obs, user) = ($1, $2, $4));"""
curPG.execute(sqlCmd)
cptInsert = 0  # To let you commit from time to time

# ... inside the big loop:
curPG.execute("EXECUTE myInsert(%s,%s,%s,%s);", myNewRecord)
allreadyExists = (curPG.rowcount < 1)
if not allreadyExists:
    cptInsert += 1
    if cptInsert % 10000 == 0:
        conPG.commit()
This dummy table example has a unique constraint on (idNumber, date_obs, user).
3) The best solution is to use copy_from and a TRIGGER that manages the unique key BEFORE INSERT. This gave me a 36x speedup: I started with normal inserts at 500 records/sec, and with "copy" I got over 18,000 records/sec. Sample code in Python with psycopg2:
ioResult = StringIO.StringIO()  # To use a virtual file as a buffer
cptInsert = 0  # To let you commit from time to time - memory has limitations

# ... inside the big loop:
print >> ioResult, "\t".join(map(str, myNewRecord))
cptInsert += 1
if cptInsert % 10000 == 0:
    ioResult = flushCopyBuffer(ioResult, curPG)

# ... after the loop:
ioResult = flushCopyBuffer(ioResult, curPG)

def flushCopyBuffer(bufferFile, cursorObj):
    bufferFile.seek(0)  # Little detail where the daemon lurks...
    cursorObj.copy_from(bufferFile, 'myBigTable',
                        columns=('idNumber', 'date_obs', 'value', 'user'))
    cursorObj.connection.commit()
    bufferFile.close()
    bufferFile = StringIO.StringIO()
    return bufferFile
That's it for the Python part. Now the PostgreSQL trigger that avoids the psycopg2.IntegrityError exception which would otherwise get all of the COPY command's records rejected:
CREATE OR REPLACE FUNCTION chk_exists()
RETURNS trigger AS $BODY$
DECLARE
    curRec RECORD;
BEGIN
    -- Check if the record's key already exists or is empty (the file's last line is)
    IF NEW.idNumber IS NULL THEN
        RETURN NULL;
    END IF;
    SELECT INTO curRec * FROM myBigTable
        WHERE (idNumber, date_obs, user) = (NEW.idNumber, NEW.date_obs, NEW.user);
    IF NOT FOUND THEN -- OK, keep it
        RETURN NEW;
    ELSE
        RETURN NULL; -- Oops, throw it away or update the current record
    END IF;
END;
$BODY$ LANGUAGE plpgsql;
Now link this function to the trigger of your table:
CREATE TRIGGER chk_exists_before_insert
BEFORE INSERT ON myBigTable FOR EACH ROW EXECUTE PROCEDURE chk_exists();
This seems like a lot of work, but PostgreSQL is a very fast beast when it doesn't have to interpret SQL over and over. Have fun.
"When my script detects such a non-UNIQUE INSERT, it connection.rollback() (which makes ups to 1000 rows everytime, and kind of makes the executemany worthless) and then INSERTs all values one by one."
The question doesn't really make a lot of sense.
Does EVERY block of 1,000 rows fail due to non-unique rows?
Does 1 block of 1,000 rows fail (out of 5,000 such blocks)? If so, then the executemany helps for 4,999 out of 5,000 blocks and is far from "worthless".
Are you worried about this non-unique insert? Or do you have actual statistics on the number of times this happens?
If you've switched from 1,000 row blocks to 100 row blocks, you can -- obviously -- determine if there's a performance advantage for 1,000 row blocks, 100 row blocks and 1 row blocks.
Please actually run the program against the actual database with different block sizes and post the numbers.
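If it helps, a minimal sketch of such a measurement with psycopg2; the table, columns and sample data are stand-ins, and each run is rolled back so every block size starts from the same table state:

import time
import psycopg2

conn = psycopg2.connect("dbname=wikipedia user=postgres")
cur = conn.cursor()

sample = [("row-%d" % i, i) for i in range(100000)]   # stand-in data, all unique

for size in (1, 100, 1000):
    start = time.time()
    for i in range(0, len(sample), size):
        cur.executemany("INSERT INTO pages (title, length) VALUES (%s, %s)",
                        sample[i:i + size])
    elapsed = time.time() - start
    conn.rollback()    # discard the test rows so the next block size sees the same table
    print("block size %5d: %.1f s" % (size, elapsed))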
Using a MERGE statement instead of an INSERT would solve your problem.
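For what it's worth, PostgreSQL only gained MERGE in version 15; on earlier versions the closest equivalent is INSERT ... ON CONFLICT DO NOTHING (available since 9.5), which skips duplicate keys instead of aborting the batch. A minimal psycopg2 sketch with assumed table and column names:

import psycopg2

conn = psycopg2.connect("dbname=wikipedia user=postgres")
cur = conn.cursor()

# Rows whose title already exists are silently skipped; the rest are inserted.
cur.executemany(
    "INSERT INTO pages (title, length) VALUES (%s, %s) ON CONFLICT (title) DO NOTHING",
    [("Foo", 10), ("Bar", 20), ("Foo", 10)])
conn.commit()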
Related
A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type's spec has had enhancements such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decades-long evolution of these file types, it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.
I built a running counter such that every time I see properties (familiar or not) I add 1 to its tally. The sqlite tally board looks like this:
In the special event I see an unfamiliar property I add them to the tally. On a typeA file that looks like:
I've got this system down! But it's slow, at about 3M files per 36 hours in one process. Originally I was using this trick to pass SQLite a list of properties needing an increment.
placeholder = '?'  # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder for dummy_var in properties)
sql = """UPDATE tally_board
         SET %s = %s + 1
         WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)
I learned that's a bad idea because
SQLite string search is much slower than an indexed search
several hundred properties (some 160 characters long) make for really long SQL queries
using %s instead of ? is bad security practice... (not a concern ATM)
A "fix" was to maintain a script side property-rowid hash of the tally used in this loop:
Read file for new_properties
Read tally_board for rowid, property
Generate script side client_hash from 2's read
Write rows to tally_board for every new_property not in property (nothing incremented yet). Update client_hash with new properties
Lookup rowid for every row in new_properties using the client_hash
Write increment to every rowid (now a proxy for property) to tally_board
Step 6. looks like
sql = """UPDATE tally_board
SET %s = %s + 1
WHERE rowid IN %s""" %(type_name, type_name, tuple(target_rows))
cur.execute
The problem with this is
It's still slow!
It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 right before threadB completes step 6.
A solution to the race condition is to give steps 2-6 an exclusive lock on the db, though it doesn't look like reads can acquire that kind of lock.
Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.
There may be luck in something like this but I'm unsure how to rewrite it to increment the tally.
How about a change of table schema? Instead of a column per type, have a type column. Then you have unique rows identified by property and type, like this:
|rowid|prop    |type |count|
|=====|========|=====|=====|
|1    |prop_foo|typeA|215  |
|2    |prop_foo|typeB|456  |
This means you can enter a transaction for each and every property of each and every file separately and let sqlite worry about races. So for each property you encounter, immediately issue a complete transaction that computes the next total and upserts the record identified by the property name and file type.
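A minimal sketch of that per-property transaction with Python's sqlite3; it assumes the reshaped table above and SQLite 3.24+ for the ON CONFLICT upsert syntax:

import sqlite3

conn = sqlite3.connect("tally.db")
conn.execute("""CREATE TABLE IF NOT EXISTS tally_board (
                    prop  TEXT NOT NULL,
                    type  TEXT NOT NULL,
                    count INTEGER NOT NULL DEFAULT 0,
                    UNIQUE (prop, type))""")

def bump(prop, file_type):
    # One complete transaction per property: insert the row if it is new,
    # otherwise increment the existing count; SQLite serialises the writers.
    with conn:
        conn.execute("""INSERT INTO tally_board (prop, type, count) VALUES (?, ?, 1)
                        ON CONFLICT (prop, type) DO UPDATE SET count = count + 1""",
                     (prop, file_type))

bump("prop_foo", "typeA")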
The following sped things up immensely:
Writing less often to SQLite. Holding most of my intermediate results in memory and updating the DB with them every 50k files cut the execution time to about a third (35 hours down to 11.5 hours).
Moving the data onto my PC (for some reason my USB 3.0 port was transferring data well below USB 2.0 rates). This cut the execution time to about a fifth (11.5 hours down to 2.5 hours).
I am optimising my code and reducing the number of queries. These used to be in a loop, but I am trying to restructure my code so it is done like this. How do I get the second query working so that it uses the id inserted by the first query for each row? Assume that the datasets are in the right order.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be in a for loop but were taking too long, so I am trying to avoid running the query individually for each row. I believe there might be a way to combine it into one query, but I am unsure how this would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
You can do this implicitly if your program is single-threaded as follows:
1. Set the AUTOINCREMENT keyword in your table definition. This will guarantee that any generated row IDs will be higher than any that appear in the table currently.
2. Immediately before the first statement, determine the highest ROWID in use in the table:
   oldmax ← Execute("SELECT max(ROWID) FROM nodes")
3. Perform the first insert as before.
4. Read back the row IDs that were actually assigned with a SELECT statement:
   NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax)
5. Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
6. Perform the second insert as before.
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.
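A minimal sqlite3 sketch of the algorithm above; the nodes and connections tables are assumed to already exist with the columns from the question, and new_values holds (node_value, parent node_id) tuples as before:

import sqlite3

conn = sqlite3.connect("graph.db")
c = conn.cursor()

new_values = [("leaf-a", 1), ("leaf-b", 1)]   # (node_value, parent node_id) tuples

# Step 2: remember the highest rowid before the batch insert.
oldmax = c.execute("SELECT COALESCE(MAX(rowid), 0) FROM nodes").fetchone()[0]

# Step 3: first insert, exactly as in the question.
c.executemany(
    "INSERT INTO nodes (node_value, node_group) VALUES "
    "(?, (SELECT node_group FROM nodes WHERE node_id = ?) + 1)",
    new_values)

# Step 4: read back the rowids SQLite actually assigned, in ascending order.
new_ids = [r[0] for r in
           c.execute("SELECT rowid FROM nodes WHERE rowid > ? ORDER BY rowid ASC",
                     (oldmax,))]

# Step 5: pair each parent id from new_values with the corresponding new child id.
connection_values = [(parent_id, child_id)
                     for (_, parent_id), child_id in zip(new_values, new_ids)]

# Step 6: second insert, exactly as in the question.
c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?, ?, 1)",
              connection_values)
conn.commit()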
I've hit a strange inconsistency problem with SQL Server inserts using a stored procedure. I'm calling a stored procedure from Python via pyodbc by running a loop to call it multiple times for inserting multiple rows in a table.
It seems to work normally most of the time, but after a while it will just stop working in the middle of the loop. At that point even if I try to call it just once via the code it doesn't insert anything. I don't get any error messages in the Python console and I actually get back the incremented identities for the table as though the data were actually inserted, but when I go look at the data, it isn't there.
If I call the stored procedure from within SQL Server Management Studio and pass in data, it inserts it and shows the incremented identity number as though the other records had been inserted even though they are not in the database.
It seems I reach a certain limit on the number of times I can call the stored procedure from Python and it just stops working.
I'm making sure to disconnect after I finish looping through the inserts and other stored procedures written in the same way and sent via the same database connection still work as usual.
I've tried restarting the computer with SQL Server and sometimes it will let me call the stored procedure from Python a few more times, but that eventually stops working as well.
I'm wondering if it is something to do with calling the stored procedure in a loop too quickly, but that doesn't explain why after restarting the computer, it doesn't allow any more inserts from the stored procedure.
I've done lots of searching online, but haven't found anything quite like this.
Here is the stored procedure:
USE [Test_Results]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[insertStepData]
    @TestCaseDataId int,
    @StepNumber nchar(10),
    @StepDateTime nvarchar(50)
AS
    SET NOCOUNT ON;

    BEGIN TRANSACTION

    DECLARE @newStepId int

    INSERT INTO TestStepData (
        TestCaseDataId,
        StepNumber,
        StepDateTime
    )
    VALUES (
        @TestCaseDataId,
        @StepNumber,
        @StepDateTime
    )

    SET @newStepId = SCOPE_IDENTITY();

    SELECT @newStepId
    FROM TestStepData

    COMMIT TRANSACTION
Here is the method I use to call a stored procedure and get back the id number ('conn' is an active database connection via pyodbc):
def CallSqlServerStoredProc(self, conn, procName, *args):
    sql = """DECLARE @ret int
             EXEC @ret = %s %s
             SELECT @ret""" % (procName, ','.join(['?'] * len(args)))
    return int(conn.execute(sql, args).fetchone()[0])
Here is where I'm passing in the stored procedure to insert:
....
for testStep in testStepData:
    testStepId = self.CallSqlServerStoredProc(conn, "insertStepData", testCaseId, testStep["testStepNumber"], testStep["testStepDateTime"])
    conn.commit()
    time.sleep(1)
....
SET @newStepId = SCOPE_IDENTITY();
SELECT @newStepId
FROM StepData
looks mighty suspicious to me:
SCOPE_IDENTITY() returns numeric(38,0) which is larger than int. A conversion error may occur after some time. Update: now that we know the IDENTITY column is int, this is not an issue (SCOPE_IDENTITY() returns the last value inserted into that column in the current scope).
SELECT into a variable doesn't guarantee its value if more than one record is returned. Besides, I don't get the idea behind overwriting the identity value we already have. In addition to that, the number of values returned by the last statement is equal to the number of rows in that table, which is increasing quickly - this is a likely cause of the degradation. In brief, the last statement is not just useless, it's detrimental.
The 2nd statement also makes these statements misbehave:
EXEC @ret = %s %s
SELECT @ret
Since the procedure doesn't RETURN anything but SELECTs a single time, this chunk actually returns two data sets: 1) a single @newStepId value (from EXEC, yielded by the SELECT @newStepId <...>); 2) a single NULL (from SELECT @ret). fetchone() reads the 1st data set by default so you don't notice this, but it doesn't work towards performance or correctness anyway.
Bottom line
Replace the 2nd statement with RETURN @newStepId.
Data not in the database problem
I believe it's caused by RETURN before COMMIT TRANSACTION. Make it the other way round.
In the original form, I believe it was caused by the long-working SELECT and/or possible side-effects from the SELECT not-to-a-variable being inside a transaction.
I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with that many database entries.
But using py-postgresql and the .prepare() statement, I would hope I could fetch entries on a "yield" basis and thus avoid filling up my memory with the results from the database, which apparently I can't?
This is what I've got so far:
import postgresql
user = 'test'
passwd = 'test'
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time time FROM mytable")
uniqueue_days = []
with db.xact():
    for row in result():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])
print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering result() probably fetches all results before looping through them?
Is there a way to get the postgresql library to "page" or batch the results, say 60k per round, or perhaps even rework the query to do more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format prior to adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so testing it will take a few more changes. I still recommend it.
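For reference, a minimal sketch of the psycopg2 route with a named (server-side) cursor, which streams rows from a portal in chunks instead of materialising the whole result set; the connection parameters mirror the ones in the question:

import psycopg2

conn = psycopg2.connect("host=192.168.1.1 dbname=mydb user=test password=test")

# A named cursor is backed by a server-side portal; `itersize` controls how many
# rows are fetched per round trip, so memory use stays flat.
cur = conn.cursor(name='big_scan')
cur.itersize = 60000
cur.execute("SELECT time FROM mytable")

unique_days = set()
for (ts,) in cur:
    unique_days.add(ts)

print(sorted(unique_days))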
You could let the database do all the heavy lifting.
Ex: Instead of reading all the data into Python and then calculating unique_dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce sort order on unique_dates returned then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line:
Ex:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j] ;
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert the dates back into Unix timestamps.
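For the timestamp-to-date part of the original question, a minimal Python sketch; going the other way (a date string back to a Unix timestamp) follows the same pattern with datetime.strptime and .timestamp():

from datetime import datetime, timezone

def unix_to_day(ts):
    """Convert a Unix timestamp in seconds to a 'YYYY-MM-DD' string (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

print(unix_to_day(1356998400))   # -> 2013-01-01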
I want to perform an SQL query that is logically equivalent to the following:
DELETE FROM pond_pairs
WHERE
((pond1 = 12) AND (pond2 = 233)) OR
((pond1 = 12) AND (pond2 = 234)) OR
((pond1 = 12) AND (pond2 = 8)) OR
((pond1 = 13) AND (pond2 = 6547)) OR
((pond1 = 13879) AND (pond2 = 6))
I will have hundreds of thousands of pond1-pond2 pairs. I have an index on (pond1, pond2).
My limited SQL knowledge came up with several approaches:
Run the whole query as is.
Batch the query up into smaller queries with n WHERE conditions
Save the pond1-pond2 pairs into a new table, and use a subquery in the WHERE clause to identify the rows to delete.
Convert the python logic which identifies rows to delete into a stored procedure. Note that I am unfamiliar with programming stored procedures and thus this would probably involve a steep learning curve.
I am using postgres if that is relevant.
For a large number of pond1-pond2 pairs to be deleted in a single DELETE, I would create a temporary table and join on this table.
-- Create the temp table:
CREATE TEMP TABLE foo AS SELECT * FROM (VALUES (1,2), (1,3)) AS sub (pond1, pond2);

-- Delete
DELETE FROM bar
USING
    foo -- the joined table
WHERE
    bar.pond1 = foo.pond1
AND
    bar.pond2 = foo.pond2;
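If the pairs live in a Python list, a minimal psycopg2 sketch of this approach; the temp table name is an assumption, while pond_pairs and its columns come from the question:

import io
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

pairs = [(12, 233), (12, 234), (12, 8), (13, 6547), (13879, 6)]

cur.execute("CREATE TEMP TABLE pairs_to_delete (pond1 int, pond2 int)")

# COPY is the quickest way to get hundreds of thousands of pairs into the temp table.
buf = io.StringIO("".join("%d\t%d\n" % p for p in pairs))
cur.copy_from(buf, 'pairs_to_delete', columns=('pond1', 'pond2'))

cur.execute("""DELETE FROM pond_pairs
               USING pairs_to_delete d
               WHERE pond_pairs.pond1 = d.pond1
                 AND pond_pairs.pond2 = d.pond2""")
conn.commit()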
I would do 3 (with a JOIN rather than a subquery) and measure the time of the DELETE query itself (without the table creation and inserts). This is a good starting point, because JOINing is a very common and well-optimized operation, so it will be hard to beat that time. Then you can compare that time to your current approach.
You can also try the following approach:
Sort the pairs the same way as in the index.
Delete using method 2 from your description (probably in a single transaction).
Sorting before the delete will give better index read performance, because there is a greater chance for the hard-drive cache to be hit.
With hundreds of thousands of pairs, you cannot do 1 (run the query as is), because the SQL statement would be too long.
3 is good if you already have the pairs in a table. If not, you would need to insert them first. If you do not need them later, you might just as well run the same number of DELETE statements instead of INSERT statements.
How about a prepared statement in a loop, maybe batched (if Python supports that)? See the sketch after these steps.
begin transaction
prepare statement "DELETE FROM pond_pairs WHERE ((pond1 = ?) AND (pond2 = ?))"
loop over your data (in Python), and run the statement with one pair (or add to batch)
commit
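With psycopg2 that might look like the following minimal sketch; executemany stands in for the batch, and everything runs in one transaction until the commit (connection details are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

pairs = [(12, 233), (12, 234), (12, 8), (13, 6547), (13879, 6)]

# One parametrised DELETE, executed once per pair, committed as a single transaction.
cur.executemany("DELETE FROM pond_pairs WHERE pond1 = %s AND pond2 = %s", pairs)
conn.commit()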
Where are the pairs coming from? If you can write a SELECT statement to identify them, you can just move this condition into the WHERE clause of your delete.
DELETE FROM pond_pairs WHERE (pond1, pond2) IN (SELECT pond1, pond2 FROM ...... )