Python MySQL query not completing

I am having problems with a Python script which is basically just analysing a CSV file line-by-line and then inserting each line into a MySQL table using a FOR loop:
f = csv.reader(open(filePath, "r"))
i = 1
for line in f:
    if i > skipLines:
        vals = nullify(line)
        try:
            cursor.execute(query, vals)
        except TypeError:
            sys.exc_clear()
    i += 1
return
Where the query is of the form:
query = ("insert into %s" % tableName) + (" values (%s)" % placeholders)
This works perfectly fine for every file it is used on, with one exception - the largest file. It stops at a different point each time - sometimes it reaches 600,000 records, sometimes 900,000. But there are about 4,000,000 records in total.
I can't figure out why it is doing this. The table type is MyISAM and there is plenty of disk space available. The table reaches about 35MB when it stops. max_allowed_packet is set to 16MB, but I don't think that is a problem since it is executing line-by-line?
Anyone have any ideas what this could be? Not sure whether it is Python, MySQL or the MySQLdb module that is responsible for this.
Thanks in advance.

Have you tried MySQL's LOAD DATA INFILE function?
query = "LOAD DATA INFILE '/path/to/file' INTO TABLE atable FIELDS TERMINATED BY ',' ENCLOSED BY '\"' ESCAPED BY '\\\\'"
cursor.execute( query )
You can always pre-process the CSV file (at least that's what I do :)
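For example, here is a minimal sketch of cleaning the file before handing it to LOAD DATA INFILE, reusing filePath, skipLines and nullify() from the question; the output path is just the placeholder used in the query above:
import csv

with open(filePath, "r") as src, open("/path/to/file", "w") as dst:
    writer = csv.writer(dst)
    for i, row in enumerate(csv.reader(src), start=1):
        if i > skipLines:              # drop the header lines the loop skips
            writer.writerow(nullify(row))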
Another thing worth trying would be bulk inserts. You could try to insert multiple rows with one query:
INSERT INTO x (a,b)
VALUES
('1', 'one'),
('2', 'two'),
('3', 'three')
Oh, yeah, and you don't need to commit since it's the MyISAM engine.
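A hedged Python sketch of the same bulk-insert idea using MySQLdb's executemany(), which sends the parameter sets in batches; the batch size of 1000 is an assumption, not something from the question:
batch_size = 1000          # illustrative; tune for your data
batch = []
for line in f:
    batch.append(nullify(line))
    if len(batch) >= batch_size:
        cursor.executemany(query, batch)
        batch = []
if batch:
    cursor.executemany(query, batch)   # flush the final partial batch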

As S. Lott alluded to, aren't cursors used as handles into transactions?
So at any time the db is giving you the option of rolling back all those pending inserts.
You may simply have too many inserts for one transaction. Try committing the transaction every couple of thousand inserts.
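A minimal sketch of that idea, assuming the connection object is called conn and committing every 2,000 lines (both are assumptions):
for i, line in enumerate(f, start=1):
    if i > skipLines:
        cursor.execute(query, nullify(line))
    if i % 2000 == 0:
        conn.commit()   # flush the pending inserts in batches
conn.commit()           # commit whatever is left at the end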


How to insert into sqlite faster [duplicate]

I read this: Importing a CSV file into a sqlite3 database table using Python
and it seems that everyone suggests using line-by-line reading instead of using bulk .import from SQLite. However, that will make the insertion really slow if you have millions of rows of data. Is there any other way to circumvent this?
Update: I tried the following code to insert line by line, but the speed is not as good as I expected. Is there any way to improve it?
for logFileName in allLogFilesName:
    logFile = codecs.open(logFileName, 'rb', encoding='utf-8')
    for logLine in logFile:
        logLineAsList = logLine.split('\t')
        output.execute('''INSERT INTO log VALUES(?, ?, ?, ?)''', logLineAsList)
    logFile.close()
connection.commit()
connection.close()
Since this is the top result on a Google search I thought it might be nice to update this question.
From the Python sqlite3 docs, you can use:
import sqlite3
persons = [
    ("Hugo", "Boss"),
    ("Calvin", "Klein")
]
con = sqlite3.connect(":memory:")
# Create the table
con.execute("create table person(firstname, lastname)")
# Fill the table
con.executemany("insert into person(firstname, lastname) values (?,?)", persons)
I have used this method to commit over 50k row inserts at a time and it's lightning fast.
Divide your data into chunks on the fly using generator expressions, and make the inserts inside a transaction. Here's a quote from the sqlite optimization FAQ:
Unless already in a transaction, each SQL statement has a new
transaction started for it. This is very expensive, since it requires
reopening, writing to, and closing the journal file for each
statement. This can be avoided by wrapping sequences of SQL statements
with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup
is also obtained for statements which don't alter the database.
Here's how your code might look.
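A minimal sketch of that approach, reusing allLogFilesName and the log table from the question; the chunk size of 100,000 and the small chunking helper are illustrative assumptions, not part of the original answer:
import codecs
import itertools
import sqlite3

def chunks(iterable, size):
    """Yield successive lists of `size` items from an iterable."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

connection = sqlite3.connect('log.db')

for logFileName in allLogFilesName:
    with codecs.open(logFileName, 'rb', encoding='utf-8') as logFile:
        lines = (logLine.split('\t') for logLine in logFile)
        for chunk in chunks(lines, 100000):
            connection.executemany('INSERT INTO log VALUES (?, ?, ?, ?)', chunk)
            connection.commit()   # one transaction per chunk

connection.close()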
Also, sqlite has the ability to import CSV files.
Sqlite can do tens of thousands of inserts per second, just make sure to do all of them in a single transaction by surrounding the inserts with BEGIN and COMMIT. (executemany() does this automatically.)
As always, don't optimize before you know speed will be a problem. Test the easiest solution first, and only optimize if the speed is unacceptable.

Inserting rows while looping over result set

I am working on a program to clone rows in my database from one user to another. It works by selecting the rows, editing a few values and then inserting them back.
I also need to store the newly inserted rowIDs with their existing counterparts so I can clone some other link tables later on.
My code looks like the following:
import mysql.connector
from collections import namedtuple
con = mysql.connector.connect(host='127.0.0.1')
selector = con.cursor(prepared=True)
insertor = con.cursor(prepared=True)
user_map = {}
selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)
for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid
selector.close()
insertor.close()
When running this code, I get the following error:
mysql.connector.errors.InternalError: Unread result found
I'm assuming this is because I am trying to run an INSERT while I am still looping over the SELECT, but I thought using two cursors would fix that. Why do I still get this error with multiple cursors?
I found a solution using fetchall(), but I was afraid that would use too much memory as there could be thousands of results returned from the SELECT.
import mysql.connector
from collections import namedtuple
con = mysql.connector.connect(host='127.0.0.1')
cursor = con.cursor(prepared=True)
user_map = {}
cursor.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', cursor.column_names)
for curr_row in map(Row._make, cursor.fetchall()):
    new_row = curr_row._replace(userID=None, companyID=95)
    cursor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = cursor.lastrowid
cursor.close()
This works, but it's not very fast. I was thinking that not using fetchall() would be quicker, but it seems if I do not fetch the full result set then MySQL yells at me.
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Yes. Use two MySQL connections: one for reading and the other for writing.
The performance impact isn't too bad, as long as you don't have thousands of instances of the program trying to connect to the same MySQL server.
One connection is reading a result set, and the other is inserting rows to the end of the same table, so you shouldn't have a deadlock. It would be helpful if the WHERE condition you use to read the table could explicitly exclude the rows you're inserting, if there's a way to tell the new rows apart from the old rows.
At some level, the performance impact of two connections doesn't matter because you don't have much choice. The only other way to do what you want to do is slurp the whole result set into RAM in your program, close your reading cursor, and then write.
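A hedged sketch of the two-connection approach applied to the question's code; the connection parameters are the same placeholders used in the question:
import mysql.connector
from collections import namedtuple

read_con = mysql.connector.connect(host='127.0.0.1')
write_con = mysql.connector.connect(host='127.0.0.1')

selector = read_con.cursor(prepared=True)
insertor = write_con.cursor(prepared=True)

user_map = {}
selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)

for row in selector:                       # stream rows over the read connection
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

write_con.commit()                         # inserts go through the write connection
selector.close()
insertor.close()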

My program isn't changing the MySQL database yet presents no error

I've written a program to scrape a website for data, place it into several arrays, iterate through each array and place it in a query and then execute the query. The code looks like this:
for count in range(391):
    query = # long query
    values = (doctor_names[count].encode("utf-8"), ...) # continues for about a dozen arrays
    cur.execute(query, values)
cur.close()
db.close()
I run the program and aside from a few truncation warnings everything goes fine. I open the database in MySQL Workbench and nothing has changed. I tried changing the arrays in the values to constant strings and running it but still nothing would change.
I then created an array to hold the last executed query: sql_queries.append(cur._last_executed) and pushed them out to a text file:
fo = open("foo.txt", "wb")
for q in sql_queries:
    fo.write(q)
fo.close()
Which gives me a large text file with multiple queries. When I copy the whole text file and create a new query in MySQL Workbench and execute it, it populates the database as desired. What is my program missing?
If your table is using a transactional storage engine, like InnoDB, then you need to call db.commit() to have the transaction stored:
for count in range(391):
    query = # long query
    values = (doctor_names[count].encode("utf-8"), ...)
    cur.execute(query, values)
db.commit()
cur.close()
db.close()
Note that with a transactional database, besides committing you also have the opportunity to handle errors by rolling back inserts or updates with db.rollback(). The db.commit() is required to finalize the transaction. Otherwise,
Closing a connection without committing the changes first will cause
an implicit rollback to be performed.
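A hedged sketch of combining the commit with rollback-on-error, keeping the question's placeholders for the query and values:
try:
    for count in range(391):
        values = (doctor_names[count].encode("utf-8"), ...)  # built as in the question
        cur.execute(query, values)
    db.commit()        # finalize the transaction
except Exception:
    db.rollback()      # undo every pending insert from this run
    raise
finally:
    cur.close()
    db.close()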

Python/Psycopg2/PostgreSQL Copy_From loop gets slower as it progresses

I have written a Python script that takes a 1.5 G XML file, parses out data and feeds it to a database using copy_from. It invokes the following function every 1000 parsed nodes. There are about 170k nodes in all which update about 300k rows or more. It starts out quite fast and then gets progressively slower as time goes on. Any ideas on why this is happening and what I can do to fix it?
Here is the function where I feed the data to the db.
def db_update(val_str, tbl, cols):
    conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
    cur = conn.cursor()
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=(cols))
    conn.commit()
I haven't included the xml parsing as I don't think that's an issue. Without the db the parser executes in under 2 minutes.
There are several things that can slow inserts as tables grow:
Triggers that have to do more work as the DB grows
Indexes, which get more expensive to update as they grow
Disable any non-critical triggers, or if that isn't possible re-design them to run in constant time.
Drop indexes, then create them after the data has been loaded. If you need any indexes for the actual INSERTs or UPDATEs, you'll need to keep them and wear the cost.
If you're doing lots of UPDATEs, consider VACUUMing the table periodically, or setting autovacuum to run very aggressively. That'll help Pg re-use space rather than more expensively allocating new space from the file system, and will help avoid table bloat.
You'll also save time by not re-connecting for each block of work. Maintain a connection.
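A minimal sketch of reusing one connection across calls, based on the question's db_update(); committing once at the end of the run (or every so many blocks) is an assumption:
import cStringIO
import psycopg2

conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()

def db_update(val_str, tbl, cols):
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=cols)

# ... call db_update() once per block of parsed nodes, then:
conn.commit()
conn.close()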
From personal experience, copy_from doesn't update any indexes after you commit anything, so you will have to do it later. I would move your conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>"); cur = conn.cursor() outside of the function and do a commit() when you've finished inserting everything (I suggest committing every ~100k rows or it will start getting slow).
Also, it may seem stupid, but it has happened to me a lot of times: make sure you reset your val_str after you call db_update. For me, when the copy_from/inserts start to slow down it's because I'm inserting the same rows plus more rows.
I'm using the following and I don't get any hit on performance as far as I have seen:
import psycopg2
import psycopg2.extras
local_conn_string = """
    host='localhost'
    port='5432'
    dbname='backupdata'
    user='postgres'
    password='123'"""
local_conn = psycopg2.connect(local_conn_string)
local_cursor = local_conn.cursor(
    'cursor_unique_name',
    cursor_factory=psycopg2.extras.DictCursor)
I have added the following output to my code to test the run-time (and I am parsing a LOT of rows, more than 30.000.000).
Parsed 2600000 rows in 00:25:21
Parsed 2700000 rows in 00:26:19
Parsed 2800000 rows in 00:27:16
Parsed 2900000 rows in 00:28:15
Parsed 3000000 rows in 00:29:13
Parsed 3100000 rows in 00:30:11
I have to mention I don't "copy" anything, but I am moving my rows from a remote PostgreSQL to a local one, and in the process I create a few more tables to index my data better than before, as 30.000.000+ rows is a bit too much to handle with regular queries.
NB: The time is counting upwards and is not for each query.
I believe it has to do with the way my cursor is created.
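If the named (server-side) cursor is indeed what keeps memory use flat, its fetch batch size can also be tuned; the itersize value below is an assumption for illustration, not something from the original post:
local_cursor = local_conn.cursor(
    'cursor_unique_name',
    cursor_factory=psycopg2.extras.DictCursor)
local_cursor.itersize = 20000  # rows fetched from the server per round trip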
EDIT1:
I am using the following to run my query:
local_cursor.execute("""SELECT * FROM data;""")
row_count = 0
for row in local_cursor:
    if row_count % 100000 == 0 and row_count != 0:
        print("Parsed %s rows in %s" % (row_count,
                                        my_timer.get_time_hhmmss()))
    parse_row(row)
    row_count += 1
print("Finished running script!")
print("Parsed %s rows" % row_count)
The my_timer is a timer class I've made, and the parse_row(row) function formats my data, transfers it to my local DB and eventually deletes it from the remote DB once the data is verified as having been moved to my local DB.
EDIT2:
It takes roughly 1 minute to parse every 100.000 rows in my DB, even after parsing around 4.000.000 rows:
Parsed 3800000 rows in 00:36:56
Parsed 3900000 rows in 00:37:54
Parsed 4000000 rows in 00:38:52
Parsed 4100000 rows in 00:39:50

SQL/Python School Program

I have a school project for a Python script which runs an SQL query and spits out the results to a file. I have this so far and wanted some feedback on whether this looks right or if I am way off (I'm using adodbapi). Thanks much!
import adodbapi
# Connect to the SQL DB
conn = adodbapi.connect("Provider=SQLDB;SERVER= x.x.x.x ;User Id=user;Password=pass;DATABASE=db database;")
curs = conn.cursor()
# Execute SQL query test_file.sql"
query = 'test_file'
curs.execute("SELECT test_file")
rows = curs.fetchall()
for row in rows:
    print test_file | test_file.txt
conn.close()
# Execute SQL query test_file.sql" You are not executing an SQL query from a file. You are executing the SQL query "SELECT test_file".
"SELECT test_file" is not valid SQL syntax for the SELECT query. See this tutorial on the SELECT statement.
rows = curs.fetchall(); for row in rows: ... is not a nice way of iterating through all the results of a query.
If your query returns large number of rows, say a million rows, then all one million rows will have to be transferred from the database to your python program before the loop can start. This could be very slow if the database server is on a remote machine.
Your program will have to allocate memory for the entire data set before work begins. This could be hundreds of megabytes.
The more Pythonic way of doing this is to avoid loading the entire data set into memory unless you have to. Using sqlite3 I would write:
results = curs.execute("SELECT * FROM table_name")
for row in results:
    print(row)
This way only one row is loaded at a time.
print test_file | test_file.txt: the print statement does not support the pipe operator for writing to a file. (Python is not a Linux shell!) See Python File I/O.
Additionally, even if this syntax was correct, you have failed to put the file name in 'quote marks'. Without quotes, Python will interpret test_file.txt as the property txt of a variable called test_file. This will get you a NameError because there is no variable called test_file, or possibly an AttributeError.
If you want to test your code without having to connect to a network database, then use the sqlite3 module. This is a built-in Python library which provides a lightweight database with an interface similar to adodbapi.
import sqlite3
db_conn = sqlite3.connect(":memory:") # connect to temporary database
db_conn.execute("CREATE TABLE colours ( Name TEXT, Red INT, Green INT, Blue INT )")
db_conn.execute("INSERT INTO colours VALUES (?,?,?,?)", ('gray', 128, 128, 128))
db_conn.execute("INSERT INTO colours VALUES (?,?,?,?)", ('blue', 0, 0, 255))
results = db_conn.execute("SELECT * FROM colours")
for row in results:
    print(row)
In future please try running your code, or at least testing that individual lines do what you expect. Trying print test_file | test_file.txt in an interpreter would have given you a TypeError: unsupported operand type(s) for |: 'str' and 'str'.
