Why is 'executemany' so slow compared to just doing an 'IN' query? - python

My MySQL table schema is:
CREATE DATABASE test_db;
USE test_db;
CREATE TABLE test_table (
    id INT AUTO_INCREMENT,
    last_modified DATETIME NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
When I run the following benchmark script, I get:
b1: 20.5559301376
b2: 0.504406929016
from timeit import timeit
import MySQLdb

ids = range(1000)
query_1 = "update test_table set last_modified=UTC_TIMESTAMP() where id=%(id)s"
query_2 = "update test_table set last_modified=UTC_TIMESTAMP() where id in (%s)" % ", ".join(('%s', ) * len(ids))

db = MySQLdb.connect(host="localhost", user="some_user", passwd="some_pwd", db="test_db")

def b1():
    curs = db.cursor()
    # named-style placeholders need one mapping per row
    curs.executemany(query_1, [{'id': i} for i in ids])

def b2():
    curs = db.cursor()
    curs.execute(query_2, ids)

print "b1: %s" % str(timeit(lambda: b1(), number=30))
print "b2: %s" % str(timeit(lambda: b2(), number=30))

db.close()
Why is there such a large difference between executemany and the IN clause?
I'm using Python 2.6.6 and MySQL-python 1.2.3.
The only relevant question I could find was "Why is executemany slow in Python MySQLdb?", but it isn't really what I'm after.

executemany repeatedly goes back and forth to the MySQL server, which then needs to parse the query, perform it, and return results. This is perhaps 10 times as slow as doing everything in a single SQL statement, even if it is more complex.
However, for INSERT, MySQLdb's documentation says that executemany will do the smart thing and construct a multi-row INSERT for you, thereby being efficient.
Hence, a single UPDATE with IN (1,2,3,...) is much more efficient than UPDATE;UPDATE;UPDATE;...
If you have a sequence of ids, then even better would be to say WHERE id BETWEEN 1 AND 1000. This is because it can simply scan the rows rather than looking up each one from scratch. (I am assuming id is indexed, probably as the PRIMARY KEY.)
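A hedged sketch of that BETWEEN variant, reusing db, curs, and ids from the question's script (query_3 is just an illustrative name):
query_3 = ("update test_table set last_modified=UTC_TIMESTAMP() "
           "where id between %s and %s")
curs = db.cursor()
curs.execute(query_3, (min(ids), max(ids)))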
Also, you are probably running with autocommit enabled, which makes each insert/update/delete its own transaction. This adds a lot of overhead to each UPDATE, and it is probably not desirable in this case; I suspect you want the entire 1000-row update to be atomic.
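For example, a minimal sketch of making the whole batch one transaction, reusing db, query_2, and ids from the question's script and assuming MySQLdb's autocommit() method:
db.autocommit(False)          # make sure each UPDATE is not committed on its own
curs = db.cursor()
curs.execute(query_2, ids)    # one UPDATE ... WHERE id IN (...) statement
db.commit()                   # a single commit makes the 1000-row update atomic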
Bottom line: Use executemany only for (a) INSERTs or (b) statements that must be run individually.

Related

Python pymssql executemany delete where value in list

I am using python 2.7 to perform CRUD operations on a MS SQL 2012 DB.
I have a list of IDs called "ComputerIDs".
I want to run a query that deletes all records in the database where the ID is equal to one of the IDs in the list.
I have tried the following, but it does not seem to work:
cursor.executemany("DELETE FROM Computer WHERE ID=%s", ComputerIDs)
I also tried building the query with an IN clause:
sql = 'DELETE FROM Computer WHERE ID IN (%s)'
inlist = ', '.join(map(lambda x: '%s', ComputerIDs))
sql = sql % inlist
cursor.execute(sql, ComputerIDs)
I was able to resolve the issue.
query_string = "DELETE FROM Computer WHERE ID = %s"
cursor.executemany(query_string, ComputerIDs)
Can anyone tell me if this query is parameterized properly and safe from SQL injection?
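For reference, a hedged sketch of both variants with the parameters passed as tuples (assuming pymssql's %s placeholder style and the cursor and ComputerIDs names from above); in each case the IDs go through the driver's parameter binding rather than plain string formatting:
# single round trip: IN clause built from placeholders, values bound separately
placeholders = ', '.join(['%s'] * len(ComputerIDs))
cursor.execute("DELETE FROM Computer WHERE ID IN (%s)" % placeholders,
               tuple(ComputerIDs))
# or: one parameterized DELETE per ID
cursor.executemany("DELETE FROM Computer WHERE ID = %s",
                   [(cid,) for cid in ComputerIDs])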

APSW (or SQLite3) very slow INSERT on executemany

I have found the following issue with APSW (an SQLite wrapper for Python) when inserting rows.
Let's say my data is data = [[1,2],[3,4]]
APSW and SQLite3 allow me to do something like:
apsw.executemany("INSERT INTO Table VALUES(?,?)", data)
or I can write some code that does the following:
sql = "BEGIN TRANSACTION;
INSERT INTO Table Values('1','2');
INERT INTO Table Values('3','4');
COMMINT;"
apsw.execute(sql)
When data is a long list/array/table, the performance of the first method is extremely slow compared to the second one (for 400 rows it can be 20 seconds vs. less than 1!). I do not understand why this is, as it is the method shown in all SQLite Python tutorials for adding data to a table.
Any idea of what may be happening here?
(Disclosure: I am the author of APSW.) If you do not explicitly have a transaction in effect, then SQLite automatically starts one at the beginning of each statement and ends it at the end of each statement. A write transaction is durable, meaning the contents must end up on storage and fsync must be called to ensure they will survive an unexpected power or system failure. Storage is slow!
I recommend using with rather than BEGIN/COMMIT in your case, because it will automatically roll back on error. That makes sure your data insertion either happens completely or not at all. See the documentation for an example.
When you are inserting a lot of data, you will find WAL mode to be more performant.
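To illustrate, a minimal sketch of those suggestions, assuming an on-disk database file named test.db and the two-column table from the question; the WAL pragma is the optional journal mode mentioned above:
import apsw

connection = apsw.Connection("test.db")
cursor = connection.cursor()
cursor.execute("PRAGMA journal_mode=WAL")   # optional: usually faster for bulk writes
data = [[1, 2], [3, 4]]
with connection:                            # one transaction; rolls back on error
    cursor.executemany("INSERT INTO Table VALUES(?,?)", data)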
Thanks to Confuseh I got the following answer:
Executing:
apsw.execute("BEGIN TRANSACTION;")
apsw.executemany("INERT INTO Table VALUES(?,?)", b)
apsw.execute("COMMIT;")
Speeds up the process by A LOT! This seems to be the right way of adding data (vs. my method of creating multiple INSERT statements).
Thank you for this question; the answer helped me when using SQLite with Python. In the end I learned the following, and hope it can help some people:
When connecting to the SQLite database we can use
con = sqlite3.connect(":memory:",isolation_level=None) or con = sqlite3.connect(":memory:")
With isolation_level=None it runs in autocommit mode, which creates too many transactions and becomes too slow. This helps:
cur.execute("BEGIN TRANSACTION")
cur.executemany(....)
cur.execute("COMMIT")
And if you use con = sqlite3.connect(":memory:"), cur.executemany(....) will be fast right away.
Problem
There may be confusion for mysqlclient-python/pymysql users who expect executemany of sqlite3/apsw to rewrite their INSERT INTO table VALUES(?, ?) into a multi-row INSERT statement.
For instance, executemany of mysqlclient-python has this in its docstring:
This method improves performance on multiple-row INSERT and REPLACE. Otherwise it is equivalent to looping over args with execute().
Python stdlib's sqlite3.Cursor.executemany doesn't have this optimisation. It's always loop-equivalent. Here's how to demonstrate it (unless you want to read some C, _pysqlite_query_execute):
import sqlite3
conn = sqlite3.connect(':memory:', isolation_level=None)
conn.set_trace_callback(print)
conn.execute('CREATE TABLE tbl (x INTEGER, y INTEGER)')
conn.executemany('INSERT INTO tbl VALUES(?, ?)', [(i, i ** 2) for i in range(5)])
It prints:
CREATE TABLE tbl (x INTEGER, y INTEGER)
INSERT INTO tbl VALUES(0, 0)
INSERT INTO tbl VALUES(1, 1)
INSERT INTO tbl VALUES(2, 4)
INSERT INTO tbl VALUES(3, 9)
INSERT INTO tbl VALUES(4, 16)
Solution
Thus, you either need to rewrite these INSERTs into a multi-row one (manually or, for instance, with python-sql) to stay in auto-commit mode (isolation_level=None), or wrap your INSERTs in a transaction (with a sensible number of INSERTs per transaction) in the default implicit-commit mode. The latter means the following for the above snippet:
import sqlite3
conn = sqlite3.connect(':memory:')
conn.set_trace_callback(print)
conn.execute('CREATE TABLE tbl (x INTEGER, y INTEGER)')
with conn:
    conn.executemany('INSERT INTO tbl VALUES(?, ?)', [(i, i ** 2) for i in range(5)])
Now it prints:
CREATE TABLE tbl (x INTEGER, y INTEGER)
BEGIN
INSERT INTO tbl VALUES(0, 0)
INSERT INTO tbl VALUES(1, 1)
INSERT INTO tbl VALUES(2, 4)
INSERT INTO tbl VALUES(3, 9)
INSERT INTO tbl VALUES(4, 16)
COMMIT
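For completeness, a hedged sketch of the other option, a manually built multi-row INSERT that stays in auto-commit mode (note that SQLite caps the number of bound parameters per statement, so very large datasets need chunking):
import sqlite3

conn = sqlite3.connect(':memory:', isolation_level=None)
conn.execute('CREATE TABLE tbl (x INTEGER, y INTEGER)')

rows = [(i, i ** 2) for i in range(5)]
placeholders = ', '.join(['(?, ?)'] * len(rows))
params = [value for row in rows for value in row]   # flatten the row tuples
conn.execute('INSERT INTO tbl VALUES %s' % placeholders, params)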
For further bulk-insert performance improvement in SQLite, I'd suggest starting with this overview question.

Python mysql connector prepared INSERT statement cutting off text

I've been trying to insert a large string into a MySQL database using Python's mysql.connector. The problem I'm having is that long strings are getting cut off at some point when using prepared statements. I'm currently using the MySQL Connector/Python that is available on MySQL.com. I used the following code to duplicate the problem I'm having.
db = mysql.connector.connect(**creditials)
cursor = db.cursor()

value = []
for x in range(0, 2000):
    value.append(str(x+1))
value = " ".join(value)

cursor.execute("""
    CREATE TABLE IF NOT EXISTS test (
        pid VARCHAR(50),
        name VARCHAR(120),
        data LONGTEXT,
        PRIMARY KEY(pid)
    )
    """)
db.commit()

#this works as expected
print("Test 1")
cursor.execute("REPLACE INTO test (pid, name, data) VALUES ('try 1', 'Description', '{0}')".format(value))
db.commit()
cursor.close()

#this does not work
print("Test 2")
cursor = db.cursor(prepared=True)
cursor.execute("""REPLACE INTO test (pid, name, data) VALUE (?, ?, ?)""", ('try 2', 'Description2', value))
db.commit()
cursor.close()
Test 1 works as expected and stores all the numbers up to 2000, but test 2 gets cut off right after number 65. I would rather use prepared statements than try to sanitize incoming strings myself. Any help appreciated.
Extra information:
Computer: Windows 7 64 bit
Python: Tried on both python 3.4 and 3.3
MYSQL: 5.6.17 (Came with WAMP)
Library: MySQL Connector/Python
When the MySQL Connector driver processes prepared statements, it uses a lower-level binary protocol to communicate values to the server individually. As such, it tells the server whether the values are INTs or VARCHARs or TEXT, etc. It's not particularly smart about it, and this "behavior" is the result. In this case, it sees that the value is a Python string and tells MySQL that it's a VARCHAR value. The VARCHAR type has a length limit that affects the amount of data being sent to the server. What's worse, the interaction between the long value and the limited data type length can yield some strange behavior.
Ultimately, you have a few options:
Use a file-like object for your string
MySQL Connector treats files and file-like objects as BLOBs and TEXTs (depending on whether the file is open in binary or non-binary mode, respectively). You can leverage this to get the behavior you desire.
import StringIO  # on Python 3, use io.StringIO instead
...
cursor = db.cursor(prepared=True)
cursor.execute("""REPLACE INTO test (pid, name, data) VALUES (?, ?, ?)""",
               ('try 2', 'Description', StringIO.StringIO(value)))
cursor.close()
db.commit()
Don't use MySQL Connector prepared statements
If you don't pass prepared=True when creating your cursor, the driver will generate full, valid SQL statements for each execution. You're not really losing too much by avoiding MySQL prepared statements in this context. You do need to pass your SQL statements in a slightly different form to get proper placeholder sanitization behavior.
cursor = db.cursor()
cursor.execute("""REPLACE INTO test (pid, name, data) VALUES (%s, %s, %s)""",
('try 2', 'Description', value))
cursor.close()
db.commit()
Use another MySQL driver
There are a couple of other Python MySQL drivers:
MySQLdb
oursql

Python MySQLdb doesn't wait for the result

I am trying to run some queries that need to create some temporary tables and then return a result set, but I am unable to do that with the MySQLdb API.
I have already dug into this issue (e.g. here) but without success.
My query is like this:
create temporary table tmp1
select * from table1;
alter table tmp1 add index(somefield);
create temporary table tmp2
select * from table2;
select * from tmp1 inner join tmp2 using(somefield);
This returns an empty result set immediately. If I go to the mysql client and run SHOW FULL PROCESSLIST I can see my queries executing; they take some minutes to complete.
Why does the cursor return immediately instead of waiting for the query to run?
If I try to run another query I get a "Commands out of sync; you can't run this command now" error.
I have already tried setting my connection to autocommit:
db = MySQLdb.connect(host='ip',
                     user='root',
                     passwd='pass',
                     db='mydb',
                     use_unicode=True
                     )
db.autocommit(True)
I have also tried putting every statement in its own cursor.execute() with db.commit() between them, but without success either.
Can you help me figure out what the problem is? I know MySQL doesn't support transactions for some operations like ALTER TABLE, but why doesn't the API wait until everything is finished, like it does with a SELECT?
By the way, I'm trying to do this in an IPython notebook.
I suspect that you're passing your multi-statement SQL string directly to the cursor.execute function. The thing is, each of the statements is a query in its own right so it's unclear what the result set should contain.
Here's an example to show what I mean. The first case is passing a semicolon-separated set of statements to execute, which is what I presume you have currently.
def query_single_sql(cursor):
    print 'query_single_sql'
    sql = []
    sql.append("""CREATE TEMPORARY TABLE tmp1 (id int)""")
    sql.append("""INSERT INTO tmp1 VALUES (1)""")
    sql.append("""SELECT * from tmp1""")
    cursor.execute(';'.join(sql))
    print list(cursor.fetchall())
Output:
query_single_sql
[]
You can see that nothing is returned, even though there is clearly data in the table and a SELECT is used.
The second case is where each statement is executed as an independent query, and the results printed for each query.
def query_separate_sql(cursor):
    print 'query_separate_sql'
    sql = []
    sql.append("""CREATE TEMPORARY TABLE tmp3 (id int)""")
    sql.append("""INSERT INTO tmp3 VALUES (1)""")
    sql.append("""SELECT * from tmp3""")
    for query in sql:
        cursor.execute(query)
        print list(cursor.fetchall())
Output:
query_separate_sql
[]
[]
[(1L,)]
As you can see, we consumed the results of the cursor for each query and the final query has the results we expect.
I suspect that even though you've issued multiple queries, the API only has a handle to the first query executed and so immediately returns when the CREATE TABLE is done. I'd suggest serializing your queries as described in the second example above.
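Applied to the temporary-table workflow from the question, that would look roughly like this (a sketch, assuming the MySQLdb connection db and the table/column names from the question):
cursor = db.cursor()
for stmt in ("CREATE TEMPORARY TABLE tmp1 SELECT * FROM table1",
             "ALTER TABLE tmp1 ADD INDEX(somefield)",
             "CREATE TEMPORARY TABLE tmp2 SELECT * FROM table2"):
    cursor.execute(stmt)                    # each statement runs as its own query
cursor.execute("SELECT * FROM tmp1 INNER JOIN tmp2 USING(somefield)")
rows = cursor.fetchall()                    # the final SELECT now blocks until done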

MySQLdb is extremely slow with large result sets

I executed the following query both in phpMyAdmin & MySQLdb (python).
SELECT *, (SELECT CONCAT(`id`, '|', `name`, '|', `image_code`)
FROM `model_artist` WHERE `id` = `artist_id`) as artist_data,
FIND_IN_SET("metallica", `searchable_words`) as find_0
FROM `model_song` HAVING find_0
phpMyAdmin said the query took 2ms.
My python code said that using MySQLdb the query took 848ms (without even fetching the results).
The python code:
self.db = MySQLdb.connect(host="localhost", user="root", passwd="", db="ibeat")
self.cur = self.db.cursor()
millis = lambda: time.time() * 1000
start_time = millis()
self.cur.execute("""SELECT *, (SELECT CONCAT(`id`, '|', `name`, '|', `image_code`)
    FROM `model_artist` WHERE `id` = `artist_id`) as artist_data,
    FIND_IN_SET("metallica", `searchable_words`) as find_0
    FROM `model_song` HAVING find_0""")
print millis() - start_time
If you expect an SQL query to have a large result set which you then plan to iterate over record-by-record, then you may want to consider using the MySQLdb SSCursor instead of the default cursor. The default cursor stores the result set in the client, whereas the SSCursor stores the result set in the server. Unlike the default cursor, the SSCursor will not incur a large initial delay if all you need to do is iterate over the records one-by-one.
You can find a bit of example code on how to use the SSCursor here.
For example, try:
import MySQLdb.cursors
self.db = MySQLdb.connect(host="localhost", user="root", passwd="", db="ibeat",
                          cursorclass=MySQLdb.cursors.SSCursor)
(The rest of the code can remain the same.)
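A hedged sketch of what record-by-record iteration over the server-side cursor then looks like (handle_row is a placeholder for your own processing):
cur = self.db.cursor()                      # an SSCursor, given the cursorclass above
cur.execute("SELECT * FROM `model_song`")
for row in cur:                             # rows are streamed from the server
    handle_row(row)
cur.close()                                 # finish the result set before new queries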
phpMyAdmin places a limit on all queries so you don't return large result sets in the interface. So if your query normally returns 1,000,000 rows and phpMyAdmin reduces that to 1,000 (or whatever the default is), then you would have to expect a much longer processing time when Python grabs or even queries the entire result set.
Try placing a limit in Python that matches the limit on PHPMyAdmin to compare times.
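For example, a rough sketch of that comparison (1000 is an assumed limit; substitute whatever your phpMyAdmin instance is configured with):
start_time = millis()
self.cur.execute("""SELECT *, (SELECT CONCAT(`id`, '|', `name`, '|', `image_code`)
    FROM `model_artist` WHERE `id` = `artist_id`) as artist_data,
    FIND_IN_SET("metallica", `searchable_words`) as find_0
    FROM `model_song` HAVING find_0
    LIMIT 1000""")
print millis() - start_time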
