I have written a Python script that takes a 1.5 GB XML file, parses out data and feeds it to a database using copy_from. It invokes the following function every 1000 parsed nodes. There are about 170k nodes in all, which update about 300k rows or more. It starts out quite fast and then gets progressively slower as time goes on. Any ideas on why this is happening and what I can do to fix it?
Here is the function where I feed the data to the db.
import cStringIO
import psycopg2

def db_update(val_str, tbl, cols):
    conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
    cur = conn.cursor()
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=cols)
    conn.commit()
I haven't included the xml parsing as I don't think that's an issue. Without the db the parser executes in under 2 minutes.
There are several things that can slow inserts as tables grow:
Triggers that have to do more work as the DB grows
Indexes, which get more expensive to update as they grow
Disable any non-critical triggers, or if that isn't possible re-design them to run in constant time.
Drop indexes, then create them after the data has been loaded. If you need any indexes for the actual INSERTs or UPDATEs you'll need to keep them and wear the cost.
If you're doing lots of UPDATEs, consider VACUUMing the table periodically, or setting autovacuum to run very aggressively. That'll help Pg re-use space rather than more expensively allocating new space from the file system, and will help avoid table bloat.
You'll also save time by not re-connecting for each block of work; maintain one connection for the whole run.
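A minimal sketch putting those points together, assuming a single shared connection and a non-critical index that can be dropped for the load; the table, column, and index names are placeholders, not from the question:

import cStringIO
import psycopg2

conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()

# Hypothetical non-critical index: drop it before the bulk load, rebuild after.
cur.execute("DROP INDEX IF EXISTS my_table_attr_idx")
conn.commit()

def db_update(cur, val_str, tbl, cols):
    """COPY one block of tab-separated rows through the shared cursor."""
    buf = cStringIO.StringIO(val_str)
    cur.copy_from(buf, tbl, sep='\t', columns=cols)

# ... parse the XML, calling db_update(cur, block, 'my_table', ('col_a', 'col_b'))
#     every 1000 nodes and conn.commit() every so often, without reconnecting ...

conn.commit()
cur.execute("CREATE INDEX my_table_attr_idx ON my_table (col_a)")
conn.commit()

conn.autocommit = True                 # VACUUM cannot run inside a transaction block
cur.execute("VACUUM ANALYZE my_table")
cur.close()
conn.close()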
From personal experience, copy_from doesn't update any indexes after you commit anything, so you will have to do it later. I would move your conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>") and cur = conn.cursor() outside of the function and do a commit() when you've finished inserting everything (I suggest committing every ~100k rows or it will start getting slow).
Also, it may seem stupid, but it happened to me a lot of times: make sure you reset your val_str after you call db_update. For me, when the copy_from/inserts start to go slower it's because I'm inserting the same rows plus more rows.
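For that second point, a hedged sketch of the calling loop using the db_update() from the question; parse_nodes and node_to_tsv are hypothetical helpers standing in for the XML parsing, and the key line is clearing val_str after every flush:

val_str = ''
for count, node in enumerate(parse_nodes(xml_path), start=1):   # hypothetical parser
    val_str += node_to_tsv(node)                                 # hypothetical formatter
    if count % 1000 == 0:
        db_update(val_str, 'my_table', ('col_a', 'col_b'))
        val_str = ''    # reset, or every later COPY re-sends all earlier rows

if val_str:             # flush the final partial block
    db_update(val_str, 'my_table', ('col_a', 'col_b'))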
I am using the following and I haven't seen any hit on performance so far:
import psycopg2
import psycopg2.extras
local_conn_string = """
host='localhost'
port='5432'
dbname='backupdata'
user='postgres'
password='123'"""
local_conn = psycopg2.connect(local_conn_string)
local_cursor = local_conn.cursor(
    'cursor_unique_name',
    cursor_factory=psycopg2.extras.DictCursor)
I have added the following output to my code to test run-time (and I am parsing a LOT of rows, more than 30,000,000):
Parsed 2600000 rows in 00:25:21
Parsed 2700000 rows in 00:26:19
Parsed 2800000 rows in 00:27:16
Parsed 2900000 rows in 00:28:15
Parsed 3000000 rows in 00:29:13
Parsed 3100000 rows in 00:30:11
I have to mention I don't "copy" anything; I am moving my rows from a remote PostgreSQL to a local one, and in the process creating a few more tables to index my data better than before, as 30,000,000+ rows is a bit too much to handle with regular queries.
NB: The time is cumulative; it is not the time per query.
I believe it has to do with the way my cursor is created.
EDIT1:
I am using the following to run my query:
local_cursor.execute("""SELECT * FROM data;""")

row_count = 0
for row in local_cursor:
    if row_count % 100000 == 0 and row_count != 0:
        print("Parsed %s rows in %s" % (row_count,
                                        my_timer.get_time_hhmmss()))
    parse_row(row)
    row_count += 1

print("Finished running script!")
print("Parsed %s rows" % row_count)
my_timer is a timer class I've made, and the parse_row(row) function formats my data, transfers it to my local DB and eventually deletes it from the remote DB once the data is verified as having been moved.
EDIT2:
It takes roughly 1 minute to parse every 100,000 rows in my DB, even after around 4,000,000 rows:
Parsed 3800000 rows in 00:36:56
Parsed 3900000 rows in 00:37:54
Parsed 4000000 rows in 00:38:52
Parsed 4100000 rows in 00:39:50
My question is about memory and performance when querying a large amount of data and then processing it.
Long story short: because of a bug, I am querying a table and getting all results between two timestamps. My Python script crashed because it ran out of memory (the table is very wide and holds a massive JSON object per row), so I changed the query to return only the primary key of each row.
select id from *table_name*
where updated_on between %(date_one)s and %(date_two)s
order by updated_on asc
From here I loop through and query each row one by one by its primary key to get the row data.
for primary_key in *query_results*:
    row_data = data_helper.get_by_id(primary_key)
    # from here I do some formatting and throw a message on a queue processor,
    # this is not heavy processing
Example:
queue_helper.put_message_on_queue('update_related_tables', message_dict)
My question is: is this a "good" way of doing this? Do I need to help Python with garbage collection, or will Python clean up the memory after each iteration of the loop?
That must be a very wide table, because the row count doesn't seem too crazy. Anyway, you can make a lazy function to yield the data x rows at a time. You haven't said how you're executing your query, but here is a SQLAlchemy/psycopg implementation:
with engine.connect() as conn:
    result = conn.execute(*query*)
    while True:
        chunk = result.fetchmany(x)
        if not chunk:
            break
        for row in chunk:
            heavy_processing(row)
This is pretty similar to #it's-yer-boy-chet's answer, except it uses the lower-level psycopg2 library instead of sqlalchemy. Iterating over the cursor calls cursor.fetchone() under the hood and hands you one row at a time; note, though, that psycopg2's default client-side cursor still transfers the whole result set on execute(), so if memory is the real concern you can pass a name to conn.cursor() to get a server-side cursor. I'm not sure it provides any performance benefit over sqlalchemy; it's probably doing basically the same thing under the hood.
If you still need more performance after that, I'd look into a different database connection library like asyncpg; a minimal sketch follows the psycopg2 example below.
import psycopg2

conn = psycopg2.connect(user='user', password='password', host='host', database='database')
cursor = conn.cursor()
cursor.execute(query)
for row in cursor:      # iterating the cursor yields one row at a time
    message_dict = format_message(row)
    queue_helper.put_message_on_queue('update_related_tables', message_dict)
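And if you do try asyncpg, a minimal sketch might look like the following. The connection parameters mirror the psycopg2 example, format_message and queue_helper come from the question, and the rest is an assumption about typical asyncpg usage (its cursors are server-side and must live inside a transaction):

import asyncio
import asyncpg

async def main():
    conn = await asyncpg.connect(user='user', password='password',
                                 host='host', database='database')
    try:
        async with conn.transaction():
            # The server-side cursor streams rows instead of buffering them all.
            async for row in conn.cursor(query):
                message_dict = format_message(row)
                queue_helper.put_message_on_queue('update_related_tables', message_dict)
    finally:
        await conn.close()

asyncio.run(main())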
I am working on a program to clone rows in my database from one user to another. It works by selecting the rows, editing a few values and then inserting them back.
I also need to store the newly inserted rowIDs with their existing counterparts so I can clone some other link tables later on.
My code looks like the following:
import mysql.connector
from collections import namedtuple
con = mysql.connector.connect(host='127.0.0.1')
selector = con.cursor(prepared=True)
insertor = con.cursor(prepared=True)
user_map = {}
selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)
for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid
selector.close()
insertor.close()
When running this code, I get the following error:
mysql.connector.errors.InternalError: Unread result found
I'm assuming this is because I am trying to run an INSERT while I am still looping over the SELECT, but I thought using two cursors would fix that. Why do I still get this error with multiple cursors?
I found a solution using fetchall(), but I was afraid that would use too much memory as there could be thousands of results returned from the SELECT.
import mysql.connector
from collections import namedtuple
con = mysql.connector.connect(host='127.0.0.1')
cursor = con.cursor(prepared=True)
user_map = {}
cursor.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', cursor.column_names)
for curr_row in map(Row._make, cursor.fetchall()):
    new_row = curr_row._replace(userID=None, companyID=95)
    cursor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = cursor.lastrowid
cursor.close()
This works, but it's not very fast. I was thinking that not using fetchall() would be quicker, but it seems if I do not fetch the full result set then MySQL yells at me.
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Yes. Use two MySQL connections: one for reading and the other for writing.
The performance impact isn't too bad, as long as you don't have thousands of instances of the program trying to connect to the same MySQL server.
One connection is reading a result set, and the other is inserting rows to the end of the same table, so you shouldn't have a deadlock. It would be helpful if the WHERE condition you use to read the table could explicitly exclude the rows you're inserting, if there's a way to tell the new rows apart from the old rows.
At some level, the performance impact of two connections doesn't matter because you don't have much choice. The only other way to do what you want to do is slurp the whole result set into RAM in your program, close your reading cursor, and then write.
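A hedged sketch of the two-connection version, adapted directly from the code in the question (the host and the hard-coded company IDs are kept as-is; add whatever user/password/database arguments your setup needs):

import mysql.connector
from collections import namedtuple

read_con = mysql.connector.connect(host='127.0.0.1')    # streams the SELECT
write_con = mysql.connector.connect(host='127.0.0.1')   # receives the INSERTs

selector = read_con.cursor(prepared=True)
insertor = write_con.cursor(prepared=True)

user_map = {}
selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)

for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

write_con.commit()
selector.close()
insertor.close()
read_con.close()
write_con.close()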
I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with that much data.
But using py-postgresql and its .prepare() statement, I had hoped I could fetch entries on a "yield" basis and thus avoid filling up my memory with the full result set from the database, which apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time FROM mytable")

uniqueue_days = []
with db.xact():
    for row in results():
        if row['time'] not in uniqueue_days:
            uniqueue_days.append(row['time'])

print(uniqueue_days)
Before even getting to if row['time'] not in uniqueue_days: I run out of memory, which isn't so strange considering results() probably fetches all the rows before looping through them?
Is there a way to get the postgresql library to "page" or batch the results, say 60k per round, or perhaps even rework the query to do more of the work on the database side?
Thanks in advance!
Edit: I should mention the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format before adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could loop over the cursor, or use fetchone, to get just one row at a time, as psycopg2 can use a server-side portal (a named cursor) to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the standard Python DB-API, so switching will take a few more changes. I still recommend trying it.
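For example, a rough psycopg2 sketch using a named (server-side) cursor; the connection details and table come from your question, while itersize is just set near the 60k batch size you mentioned:

import psycopg2

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb',
                        user='test', password='test')
cur = conn.cursor(name='unique_days_cursor')   # named => server-side portal
cur.itersize = 60000                           # rows fetched per network round trip

cur.execute("SELECT time FROM mytable")
uniqueue_days = set()
for (t,) in cur:                               # rows arrive lazily, batch by batch
    uniqueue_days.add(t)                       # convert to %Y-%m-%d here if needed

cur.close()
conn.close()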
You could let the database do all the heavy lifting.
Ex: Instead of reading all the data into Python and then calculating unique_dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce sort order on unique_dates returned then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line:
Ex:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j]
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
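A hedged sketch of that chunking idea with psycopg2-style parameter passing instead of string concatenation, reusing a cursor like the one in the answer above; the to_timestamp() cast and the process() call are my assumptions, not from the question:

# unique_dates: the list of dates returned by the DISTINCT query above
for day in unique_dates:
    cur.execute(
        "SELECT * FROM mytable WHERE to_timestamp(time)::date = %s",
        (day,),
    )
    for row in cur.fetchall():
        process(row)        # placeholder for the per-chunk work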
I will leave it for you to figure out how to convert the dates back into Unix timestamps.
As part of building a Data Warehouse, I have to query a source database table for about 75M rows.
What I want to do with the 75M rows is some processing and then adding the result into another database. Now, this is quite a lot of data, and I've had success with mainly two approaches:
1) Exporting the query to a CSV file using the "SELECT ... INTO" capabilities of MySQL and using the fileinput module of python to read it, and
2) connecting to the MySQL database using MySQLdb's SScursor (the default cursor puts the query in the memory, killing the python script) and fetch the results in chunks of about 10k rows (which is the chunk size I've found to be the fastest).
The first approach is a SQL query executed "by hand" (takes about 6 minutes) followed by a python script reading the csv-file and processing it. The reason I use fileinput to read the file is that fileinput doesn't load the whole file into the memory from the beginning, and works well with larger files. Just traversing the file (reading every line in the file and calling pass) takes about 80 seconds, that is 1M rows/s.
The second approach is a python script executing the same query (also takes about 6 minutes, or slightly longer) and then a while-loop fetching chunks of rows for as long as there is any left in the SScursor. Here, just reading the lines (fetching one chunk after another and not doing anything else) takes about 15 minutes, or approximately 85k rows/s.
The two numbers (rows/s) above are perhaps not really comparable, but when benchmarking the two approaches in my application, the first one takes about 20 minutes (of which about five is MySQL dumping into a CSV file), and the second one takes about 35 minutes (of which about five minutes is the query being executed). This means that dumping and reading to/from a CSV file is about twice as fast as using an SScursor directly.
This would be no problem if it did not restrict the portability of my system: a "SELECT ... INTO" statement requires MySQL to have write privileges, and I suspect it is not as safe as using cursors. On the other hand, 15 minutes (and growing, as the source database grows) is not really something I can spare on every build.
So, am I missing something? Is there any known reason for the SScursor to be so much slower than dumping/reading to/from a CSV file, such as fileinput being C-optimized where the SScursor is not? Any ideas on how to proceed with this problem? Anything to test? I would believe that the SScursor could be as fast as the first approach, but after reading all I can find about the matter, I'm stumped.
Now, to the code:
Not that I think the query is of any problem (it's as fast as I can ask for and takes similar time in both approaches), but here it is for the sake of completeness:
SELECT LT.SomeID, LT.weekID, W.monday, GREATEST(LT.attr1, LT.attr2)
FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
ORDER BY LT.someID ASC, LT.weekID ASC;
The primary code in the first approach is something like this
import fileinput

INPUT_PATH = 'path/to/csv/dump/dump.csv'

event_list = []
ID = -1
for line in fileinput.input([INPUT_PATH]):
    split_line = line.split(';')
    if split_line[0] == ID:
        event_list.append(split_line[1:])
    else:
        process_function(ID, event_list)
        event_list = [split_line[1:]]
        ID = split_line[0]
process_function(ID, event_list)
The primary code in the second approach is:
import MySQLdb
...opening connection, defining SScursor called ssc...
CHUNK_SIZE = 100000
query_stmt = """SELECT LT.SomeID, LT.weekID, W.monday,
GREATEST(LT.attr1, LT.attr2)
FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
ORDER BY LT.someID ASC, LT.weekID ASC"""
ssc.execute(query_stmt)
event_list = []
ID = -1
data_chunk = ssc.fetchmany(CHUNK_SIZE)
while data_chunk:
for row in data_chunk:
if row[0] == ID:
event_list.append([ row[1], row[2], row[3] ])
else:
process_function(ID,event_list)
event_list = [[ row[1], row[2], row[3] ]]
ID = row[0]
data_chunk = ssc.fetchmany(CHUNK_SIZE)
process_function(ID,event_list)
At last, I'm on Ubuntu 13.04 with MySQL server 5.5.31. I use Python 2.7.4 with MySQLdb 1.2.3. Thank you for staying with me this long!
After using cProfile I found a lot of time being spent implicitly constructing Decimal objects, since that was the numeric type returned from the SQL query into my Python script. In the first approach, the Decimal value was written to the CSV file as an integer and then read as such by the Python script. The CSV file I/O "flattened" the data, making the script faster. The two scripts are now about the same speed (the second approach is still just a tad slower).
I also did some conversion of the date in the MySQL database to integer type. My query is now:
SELECT LT.SomeID,
LT.weekID,
CAST(DATE_FORMAT(W.monday,'%Y%m%d') AS UNSIGNED),
CAST(GREATEST(LT.attr1, LT.attr2) AS UNSIGNED)
FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
ORDER BY LT.someID ASC, LT.weekID ASC;
This almost eliminates the difference in processing time between the two approaches.
The lesson here is that when doing large queries, post-processing of data types DOES matter! Rewriting the query to reduce function calls on the Python side can improve the overall processing speed significantly.
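A related option I did not use here, sketched under the assumption that the DECIMAL values really are integral, is to override MySQLdb's converter map so the driver hands back plain ints instead of constructing Decimal objects:

import MySQLdb
from MySQLdb.constants import FIELD_TYPE
from MySQLdb.converters import conversions

conv = conversions.copy()
conv[FIELD_TYPE.DECIMAL] = int       # only safe for integral values; use float otherwise
conv[FIELD_TYPE.NEWDECIMAL] = int

db = MySQLdb.connect(host='localhost', user='user', passwd='passwd',
                     db='mydb', conv=conv)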
I am using Python 2.7 and SQLite. I am building a database with millions of rows. I would like to only write out to disk occasionally, with the idea this will improve performance. My thought was to only call commit() from time to time. I have tried that with the code below. The selects in the middle show that we get consistent reads. But, when I look on disc, I see a file example.db-journal. This must be where the data is being cached. In which case this would gain me nothing in terms of performance. Is there a way to have the inserts collect in memory, and then flush them to disc? Is there a better way to do this?
Here is my sample code:
import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('CREATE TABLE if not exists stocks (date text, trans text, symbol text, qty real, price real)')
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get 2 rows as expected.
print c.fetchall()
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
conn.commit()
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get all the rows as expected.
print c.fetchall()
conn.close()
Update:
Figured I would give an update with some code in case anyone runs across this problem.
I am processing 5+ million lines from a text file and needed a place to store the data for further processing. I originally had all the data in memory but was running out of memory, so I switched to SQLite as a disc cache. My original in-memory version of the processing took ~36 secs per 50,000 rows from the original text file.
After measuring, my first cut at an SQLite version of the batch processing took ~660 seconds per 50,000 lines. Based on the comments (thanks to the posters), I came up with the following code:
self.conn = sqlite3.connect('myDB.db', isolation_level='Exclusive')
self.cursor = self.conn.cursor()
self.cursor.execute('PRAGMA synchronous = 0')
self.cursor.execute('PRAGMA journal_mode = OFF')
In addition, I commit after processing 1000 lines from my text file.
if lineNum % 1000 == 0:
    self.conn.commit()
With that, 50,000 lines from the text file now take ~40 seconds. So I added about 11% to the overall time, but memory usage is constant, which is more important.
Firstly, are you sure you need this? For reading, the OS should cache the file anyway, and if you write a lot, not syncing to disc means you can lose data easily.
If you measure and identify this as a bottleneck, you can use an in-memory database via connect(':memory:') and get an iterator returning an SQL dump on demand: http://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.iterdump
import sqlite3, os
in_memory = sqlite3.connect(':memory:')
# do stuff
con = sqlite3.connect('existing_db.db')
con.execute('drop table stocks')
for line in in_memory.iterdump():
    con.execute(line)
Again, measure if you need this. If you have enough data that it matters, think hard about using a different data store, for example a full blown DBMS like postgres.
In your case, the connection is not in autocommit mode: with the default isolation_level, the sqlite3 module implicitly opens a transaction before the first INSERT and keeps it open until you call commit(). That open transaction is exactly why you see the example.db-journal file; the pending changes are recorded there and only written into the main database file when the transaction is committed. See the transaction-control section of the sqlite3 Python docs.
So you are already doing the right thing: inserting a large quantity of rows should ideally be done within a single transaction. The connection records all the incoming INSERT statements in the journal file and delays writing to the database file until the transaction is committed. Even though your execution is limited by I/O, writing to the journal file is not a serious performance penalty.
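To make that concrete, a small sketch with the stocks table from the question; with the default isolation level the first INSERT opens a transaction implicitly, so all the rows accumulate in the journal and are written to example.db only at commit():

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('CREATE TABLE if not exists stocks (date text, trans text, symbol text, qty real, price real)')

rows = [('2006-01-05', 'BUY', 'RHAT', 100, 35.14)] * 5000     # stand-in batch

c.executemany('INSERT INTO stocks VALUES (?,?,?,?,?)', rows)  # journaled, not yet flushed
conn.commit()                                                 # one write to the main db file
conn.close()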