APSW (or SQLite3) very slow INSERT on executemany - python

I have found the following issue with APSW (an SQLite wrapper for Python) when inserting rows.
Let's say my data is data = [[1,2],[3,4]]
APSW and SQLite3 allow me to do something like:
apsw.executemany("INSERT INTO Table VALUES(?,?)", data)
or I can write some code that does the following:
sql = """BEGIN TRANSACTION;
INSERT INTO Table VALUES('1','2');
INSERT INTO Table VALUES('3','4');
COMMIT;"""
apsw.execute(sql)
When data is a long list/array/table, the first method is extremely slow compared to the second one (for 400 rows it can be 20 seconds vs. less than 1!). I do not understand why, as the first method is the one shown in all SQLite Python tutorials for adding data to a table.
Any idea of what may be happening here?

(Disclosure: I am the author of APSW). If you do not explicitly have a transaction in effect, then SQLite automatically starts one at the beginning of each statement, and ends at the end of each statement. A write transaction is durable - meaning the contents must end up on storage and fsync called to ensure they will survive an unexpected power or system failure. Storage is slow!
I recommend using with rather than BEGIN/COMMIT in your case, because it will automatically rollback on error. That makes sure your data insertion either completely happens or not at all. See the documentation for an example.
When you are inserting a lot of data, you will find WAL mode to be more performant.
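A minimal sketch of both suggestions, using the standard-library sqlite3 module as a stand-in for APSW (an assumption made so the example runs without extra packages; APSW's own API differs in details). The table tbl, the helper name insert_all_or_nothing and the temporary file path are illustrative only.

```python
import os
import sqlite3
import tempfile


def insert_all_or_nothing(conn, rows):
    """Insert every row in one transaction; roll all of them back on error."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany("INSERT INTO tbl VALUES (?, ?)", rows)


# WAL needs a file-backed database, so a temporary file is used here.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)

# The pragma returns the journal mode SQLite actually selected.
journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

conn.execute("CREATE TABLE tbl (x INTEGER PRIMARY KEY, y INTEGER)")
conn.commit()

insert_all_or_nothing(conn, [(1, 2), (3, 4)])

try:
    # The second row repeats primary key 1, so the whole batch must fail.
    insert_all_or_nothing(conn, [(5, 6), (1, 99)])
except sqlite3.IntegrityError:
    pass

row_count = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]
conn.close()
```

The second call fails on the duplicate key, and the with block rolls back the row (5, 6) that had already been inserted in that batch, so the table still holds only the first two rows.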

Thanks to Confuseh I got the following answer:
Executing:
apsw.execute("BEGIN TRANSACTION;")
apsw.executemany("INSERT INTO Table VALUES(?,?)", data)
apsw.execute("COMMIT;")
speeds up the process by A LOT! This seems to be the right way of adding data (vs. my method of generating multiple INSERT statements).

Thank you for this question; the answers helped me when using SQLite with Python. Here is what I ended up with, in the hope that it helps others:
When connecting to the SQLite database we can use
con = sqlite3.connect(":memory:", isolation_level=None) or con = sqlite3.connect(":memory:")
With isolation_level=None the connection is in autocommit mode, so every statement becomes its own transaction and bulk inserts become very slow. Wrapping the inserts in an explicit transaction helps:
cur.execute("BEGIN TRANSACTION")
cur.executemany(....)
cur.execute("COMMIT")
With con = sqlite3.connect(":memory:"), cur.executemany(....) is fast right away, because the module opens a transaction implicitly before the first INSERT; remember to call con.commit() afterwards.
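The difference between the two connection modes can be observed with the connection's in_transaction attribute. The sketch below is illustrative only: the table t and the variable names are made up, and it checks transaction state rather than timing, since timings depend on the machine and storage.

```python
import sqlite3

rows = [(i, i * i) for i in range(1000)]

# Autocommit mode: the sqlite3 module never issues BEGIN on its own.
auto = sqlite3.connect(":memory:", isolation_level=None)
auto.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
auto.executemany("INSERT INTO t VALUES (?, ?)", rows)
auto_open = auto.in_transaction  # no transaction is left open

# Same connection, but with an explicit transaction around the batch.
auto.execute("BEGIN")
auto.executemany("INSERT INTO t VALUES (?, ?)", rows)
explicit_open = auto.in_transaction  # open until COMMIT runs
auto.execute("COMMIT")

# Default mode: a transaction is opened implicitly before the first INSERT.
default = sqlite3.connect(":memory:")
default.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
default.executemany("INSERT INTO t VALUES (?, ?)", rows)
default_open = default.in_transaction  # open until commit() is called
default.commit()

auto_rows = auto.execute("SELECT COUNT(*) FROM t").fetchone()[0]
default_rows = default.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```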

Problem
There may be some confusion for mysqlclient-python/pymysql users who expect executemany of sqlite3/apsw to rewrite their INSERT INTO table VALUES(?, ?) into a multi-row INSERT statement.
For instance, executemany of mysqlclient-python has this in its docstring:
This method improves performance on multiple-row INSERT and REPLACE. Otherwise it is equivalent to looping over args with execute().
Python stdlib's sqlite3.Cursor.executemany doesn't have this optimisation. It's always loop-equivalent. Here's how to demonstrate it (unless you want to read some C, _pysqlite_query_execute):
import sqlite3
conn = sqlite3.connect(':memory:', isolation_level=None)
conn.set_trace_callback(print)
conn.execute('CREATE TABLE tbl (x INTEGER, y INTEGER)')
conn.executemany('INSERT INTO tbl VALUES(?, ?)', [(i, i ** 2) for i in range(5)])
It prints:
CREATE TABLE tbl (x INTEGER, y INTEGER)
INSERT INTO tbl VALUES(0, 0)
INSERT INTO tbl VALUES(1, 1)
INSERT INTO tbl VALUES(2, 4)
INSERT INTO tbl VALUES(3, 9)
INSERT INTO tbl VALUES(4, 16)
Solution
Thus, you either need to rewrite these INSERTs into a single multi-row INSERT (manually or, for instance, with python-sql) to stay in auto-commit mode (isolation_level=None), or wrap your INSERTs in a transaction (with a sensible number of INSERTs in each) in the default implicit-commit mode. The latter means the following for the above snippet:
import sqlite3
conn = sqlite3.connect(':memory:')
conn.set_trace_callback(print)
conn.execute('CREATE TABLE tbl (x INTEGER, y INTEGER)')
with conn:
    conn.executemany('INSERT INTO tbl VALUES(?, ?)', [(i, i ** 2) for i in range(5)])
Now it prints:
CREATE TABLE tbl (x INTEGER, y INTEGER)
BEGIN
INSERT INTO tbl VALUES(0, 0)
INSERT INTO tbl VALUES(1, 1)
INSERT INTO tbl VALUES(2, 4)
INSERT INTO tbl VALUES(3, 9)
INSERT INTO tbl VALUES(4, 16)
COMMIT
For further bulk-insert performance improvement in SQLite, I'd suggest starting with this overview question.
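The first option from the Solution, rewriting by hand into multi-row INSERTs while staying in auto-commit mode, might look like the sketch below. The helper name insert_multirow and the chunk size are assumptions for illustration; 400 two-column rows per statement keeps the parameter count under the 999-variable default limit of older SQLite builds.

```python
import sqlite3


def insert_multirow(conn, rows, rows_per_statement=400):
    """Insert two-column rows using multi-row INSERT statements.

    Returns the number of INSERT statements that were executed.
    """
    statements = 0
    for start in range(0, len(rows), rows_per_statement):
        chunk = rows[start:start + rows_per_statement]
        # One "(?, ?)" group per row, e.g. "(?, ?), (?, ?), (?, ?)".
        placeholders = ", ".join(["(?, ?)"] * len(chunk))
        # Flatten [(a, b), (c, d)] into [a, b, c, d] to match the placeholders.
        flat_params = [value for row in chunk for value in row]
        conn.execute("INSERT INTO tbl VALUES " + placeholders, flat_params)
        statements += 1
    return statements


conn = sqlite3.connect(":memory:", isolation_level=None)  # auto-commit mode
conn.execute("CREATE TABLE tbl (x INTEGER, y INTEGER)")

statement_count = insert_multirow(conn, [(i, i ** 2) for i in range(1000)])
row_count = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]
```

Here 1000 rows are written with three statements (400, 400 and 200 rows), so only three implicit transactions are committed instead of 1000.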

Related

About python sqlite3 order by

I am learning the Python sqlite3 module. I think this is a very simple problem, but it is blocking my next step. Could you help me?
The sorted result prints fine in the VS Code terminal, but the change is not written to the DB file. I have searched several times but cannot fix it.
When I execute the code, the rows in the DB file are not sorted.
import sqlite3
conn = sqlite3.connect('sqliteDB1.db')
cursor = conn.cursor()
cursor.execute("SELECT * FROM member")
temp123 = cursor.fetchall()
print(temp123)
cursor.execute("SELECT * FROM member ORDER BY -code")
temp321 = cursor.fetchall()
conn.commit()
print(temp321)
conn.close()
A select statement just returns data from a database; it will not modify it. Moreover, tables in SQL databases are inherently unordered sets. They have no intrinsic order, and you should never rely on the order of the rows that happens to be returned unless you explicitly sort it with an order by clause.
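In practice this means sorting at read time, every time the order matters. A small sketch, assuming a member table with an integer code column and a name column (the name column and the sample rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member (code INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO member VALUES (?, ?)",
    [(2, "b"), (3, "c"), (1, "a")],
)
conn.commit()

# The ORDER BY clause sorts the result set only; the stored table is untouched.
descending = conn.execute(
    "SELECT code, name FROM member ORDER BY code DESC"
).fetchall()
conn.close()
```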

Inserting arrays into databases

I am trying to write a general function that will insert a line of data into a table in a database, where the data is an array of unknown length. I am aiming to be able to call this function in any program and write a line of data of any length to the table (assuming the table and the array are the same length).
I have tried adding the array as if it were a single piece of data.
import sqlite3
def add2Db(dbName, tableName, data):
    connection = sqlite3.connect(dbName)
    cur = connection.cursor()
    cur.execute("INSERT INTO "+ tableName +" VALUES (?)", (data))
    connection.commit()
    connection.close()
add2Db("items.db", "allItems", (1, "chair", 5, 4))
This just crashes and gives me an error saying it has 4 columns but only one value was supplied.
SQLite does not support arrays - you have to convert to a TEXT using ','.join() to join your array items into a single string and pass that.
Source: SQLite website
https://www.sqlite.org/datatype3.html
I'm not a Python programmer, but I've been doing SQL a long time. I even wrote my own ORM. My advice is do not write your own SQL query builder. There's a myriad of subtle issues and especially security issues. I elaborate on a few of them below.
Instead, use a well-established SQL Query Builder or ORM. They've already dealt with these issues. Here's an example using SQLAlchemy.
from datetime import date
from sqlalchemy import create_engine, MetaData
# Connect to the database with debugging on.
engine = create_engine('sqlite:///test.sqlite', echo=True)
conn = engine.connect()
# Read the schemas from the database
meta = MetaData()
meta.reflect(bind=engine)
# INSERT INTO users (name, birthday, state, country) VALUES (?, ?, ?, ?)
users = meta.tables['users']
conn.execute(
    users.insert().values(name="Yarrow Hock", birthday=date(1977, 1, 23), state="NY", country="US")
)
SQLAlchemy can do the entire range of SQL operations and will work with different SQL variants. You also get type safety.
conn.execute(
    users.insert().values(name="Yarrow Hock", birthday="in the past", state="NY", country="US")
)
sqlalchemy.exc.StatementError: (exceptions.TypeError) SQLite Date type only accepts Python date objects as input. [SQL: u'INSERT INTO users (name, birthday, state, country) VALUES (?, ?, ?, ?)']
insert into table values (...) relies on column definition order
This relies on the order columns were defined in the schema. This leaves two problems. First is a readability problem.
add2Db(db, 'some_table', (1, 39, 99, 45, 'papa foxtrot', 0, 42, 0, 6))
What does any of that mean? A reader can't tell. They have to go digging into the schema and count columns to figure out what each value means.
Second is a maintenance problem. If, for any reason, the schema is altered and the column order is not exactly the same, this can lead to some extremely difficult to find bugs. For example...
create table users ( name text, birthday date, state text, country text );
vs
create table users ( name text, birthday date, country text, state text );
add2Db(db, 'users', ('Yarrow Hock', date(1977, 1, 23), 'NY', 'US'))
That insert will silently "work" with either column order.
You can fix this by passing in a dictionary and using the keys for column names.
add2Db(db, 'users', dict(name="Yarrow Hock", birthday=date(1977, 1, 23), state="NY", country="US"))
Then we'd produce a query like:
insert into users
(name, birthday, state, country)
values (?, ?, ?, ?)
This leads to the next and much bigger problem.
SQL Injection Attack
Now this opens up a new problem. If we simply stick the table and column names into the query that leaves us open to one of the most common security holes, a SQL Injection Attack. That's where someone can craft a value which when naively used in a SQL statement causes the query to do something else. Like Little Bobby Tables.
While the ? protects against SQL Injection for values, it's still possible to inject via the column names. There's no guarantee the column names can be trusted. Maybe they came from the parameters of a web form?
Protecting table and column names is complicated and easy to get wrong.
The more SQL you write the more likely you're vulnerable to an injection attack.
You have to write code for everything else.
Ok, you've done insert. Now update? select? Don't forget about subqueries, group by, unions, joins...
If you want to write a SQL query builder, cool! If, instead, you have a job to do using SQL, writing yet another SQL query builder is not your job.
It's harder for anyone else to understand.
There's a good chance that any given Python programmer knows how SQLAlchemy works, and there's plenty of tutorials and documentation if they don't. There's no chance they know about your home-rolled SQL functions, and you have to write all the tutorials and docs.
You shouldn't try to write your own ORM without a well-justified need. You will run into a lot of problems; for example, here are 25 quick reasons not to.
Instead, use a popular, proven ORM. I recommend SQLAlchemy as the go-to choice outside of Django. With it you can map a dict of values onto an insert for a model, like insert(schema_name).values(**dict_name) (here's an example of insert/update).
Change your function to this:
def add2Db(dbName, tableName, data):
    num_qs = len(data)
    qm = ','.join(list('?' * num_qs))
    query = """
        INSERT INTO {table}
        VALUES ({qms})
        """.format(table=tableName,
                   qms=qm)
    connection = sqlite3.connect(dbName)
    cur = connection.cursor()
    cur.execute(query, data)
    connection.commit()
    connection.close()
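For comparison, here is a rough sketch of the dictionary-based variant mentioned in the longer answer, with table and column names checked against the schema before they are placed in the SQL text. The helper names quote_identifier and insert_row are hypothetical, and this is only one possible mitigation for untrusted names, not a substitute for a maintained query builder.

```python
import sqlite3


def quote_identifier(name):
    """Quote an SQLite identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def insert_row(conn, table, values):
    """Insert a dict of column -> value after checking names against the schema."""
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in tables:
        raise ValueError("unknown table: %r" % (table,))

    # Column names are the second field of each PRAGMA table_info row.
    columns = {row[1] for row in conn.execute(
        "PRAGMA table_info(%s)" % quote_identifier(table))}
    unknown = set(values) - columns
    if unknown:
        raise ValueError("unknown columns: %s" % sorted(unknown))

    names = list(values)
    query = "INSERT INTO %s (%s) VALUES (%s)" % (
        quote_identifier(table),
        ", ".join(quote_identifier(n) for n in names),
        ", ".join("?" for _ in names),
    )
    with conn:  # commit on success, roll back on error
        conn.execute(query, [values[n] for n in names])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (name TEXT, birthday TEXT, state TEXT, country TEXT)")
insert_row(conn, "users", {
    "name": "Yarrow Hock", "birthday": "1977-01-23",
    "state": "NY", "country": "US",
})
stored = conn.execute("SELECT name, state, country FROM users").fetchall()

try:
    insert_row(conn, "users", {"name); DROP TABLE users; --": "x"})
    rejected = False
except ValueError:
    rejected = True
```

Values still travel through ? placeholders; only names that already exist in the schema ever reach the query string.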

Why is 'executemany' so slow compared to just doing an 'IN' query?

My MySQL table schema is:
CREATE DATABASE test_db;
USE test_db;
CREATE TABLE test_table (
id INT AUTO_INCREMENT,
last_modified DATETIME NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB;
When I run the following benchmark script, I get:
b1: 20.5559301376
b2: 0.504406929016
from timeit import timeit
import MySQLdb
ids = range(1000)
query_1 = "update test_table set last_modified=UTC_TIMESTAMP() where id=%(id)s"
query_2 = "update test_table set last_modified=UTC_TIMESTAMP() where id in (%s)" % ", ".join(('%s', ) * len(ids))
db = MySQLdb.connect(host="localhost", user="some_user", passwd="some_pwd", db="test_db")
def b1():
    curs = db.cursor()
    curs.executemany(query_1, ids)
    db.close()
def b2():
    curs = db.cursor()
    curs.execute(query_2, ids)
    db.close()
print "b1: %s" % str(timeit(lambda:b1(), number=30))
print "b2: %s" % str(timeit(lambda:b2(), number=30))
Why is there such a large difference between executemany and the IN clause?
I'm using Python 2.6.6 and MySQL-python 1.2.3.
The only relevant question I could find was - Why is executemany slow in Python MySQLdb?, but it isn't really what I'm after.
executemany repeatedly goes back and forth to the MySQL server, which then needs to parse the query, perform it, and return results. This is perhaps 10 times as slow as doing everything in a single SQL statement, even if it is more complex.
However, for INSERT, this says that it will do the smart thing and construct a multi-row INSERT for you, thereby being efficient.
Hence, IN(1,2,3,...) is much more efficient than UPDATE;UPDATE;UPDATE...
If you have a sequence of ids, then even better would be to say WHERE id BETWEEN 1 and 1000. This is because it can simply scan the rows rather than looking up each one from scratch. (I am assuming id is indexed, probably as the PRIMARY KEY.)
Also, you are probably running with the settings that make each insert/update/delete into its own "transaction". This adds a lot of overhead to each UPDATE. And it is probably not desirable in this case. I suspect you want the entire 1000-row update to be atomic.
Bottom line: Use executemany only for (a) INSERTs or (b) statements that must be run individually.
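A small sketch of building the single IN (...) statement with one placeholder per id, so that the values are still passed as parameters. The function name build_in_update is made up for illustration; the table and column names are taken from the question.

```python
def build_in_update(ids):
    """Return (query, params) for one UPDATE covering every id in `ids`."""
    ids = list(ids)
    if not ids:
        raise ValueError("ids must not be empty")  # "IN ()" is not valid SQL
    placeholders = ", ".join(["%s"] * len(ids))
    query = (
        "UPDATE test_table SET last_modified = UTC_TIMESTAMP() "
        "WHERE id IN (" + placeholders + ")"
    )
    return query, ids


query, params = build_in_update(range(1, 4))
# With a MySQLdb cursor this would be run as: cursor.execute(query, params)
```

For very long id lists it may be worth splitting the list into several such statements, since the server limits the size of a single packet.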

jaydebeapi set autocommit off for bulk inserts

I have many rows to insert into a table and tried doing it row by row, but it is taking a really long time. I read this link, Python+MySQL - Bulk Insert, and it seems like setting autocommit off can speed things up.
import jaydebeapi
connection = jaydebeapi.connect('com.teradata.jdbc.TeraDriver', ['jdbc:teradata://some url',USER,PASS], ['tdgssconfig.jar','terajdbc4.jar'],)
cur = connection.cursor()
connection.jconn.setAutoCommit(False)
cur.execute('select * from my_table')
connection.commit()
Other queries I perform are:
l = [(1,2,3),(2,4,6).....]
for tup in l:
    cur.execute('my insert statement')
    # this is the really slow part.
When I have the connection.jconn.setAutoCommit(False) line, I always get this error:
[Teradata Database] [TeraJDBC 15.10.00.14] [Error 3932] [SQLState 25000] Only an ET or null statement is legal after a DDL Statement.
When that line and connection.commit() are commented out, the code works fine. What is the right syntax to set autocommit to false?
If speed/efficiency is a concern, you should be using prepared statements and passing your parameters in as the second argument.
You could then also use .executemany():
l = [(1,2,3),(2,4,6).....]
cur.executemany('my insert statement with 3 ? params', l)
#this should be much faster
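A sketch of batching executemany() with one commit per batch. It uses the standard-library sqlite3 module as a stand-in so that it runs anywhere; the helper name insert_in_batches and the batch size are assumptions. The function relies only on generic DB-API calls (cursor(), executemany(), commit()), so the same shape should carry over to a jaydebeapi connection with autocommit disabled, though that has not been verified here against Teradata.

```python
import sqlite3
from itertools import islice


def insert_in_batches(connection, sql, rows, batch_size=1000):
    """Run `sql` for every row via executemany(), committing once per batch."""
    iterator = iter(rows)
    cursor = connection.cursor()
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        cursor.executemany(sql, batch)
        connection.commit()  # one commit per batch instead of one per row
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (a INTEGER, b INTEGER, c INTEGER)")

inserted = insert_in_batches(
    conn,
    "INSERT INTO my_table VALUES (?, ?, ?)",
    ((i, 2 * i, 3 * i) for i in range(2500)),
)
stored = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
```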

Can I set user-defined variable in Python MySQLdb?

So my problem is this: I have a query that uses a MySQL user-defined variable, like:
SET @X:=0; SELECT @X:=@X+1 FROM some_table, and this code returns a column with the values 1-1000.
However, this query doesn't work if I send it through MySQLdb in Python.
connection = MySQLdb.Connect(host='xxx', user='xxx', passwd='xxx', db='xxx')
cursor = connection.cursor()
cursor.execute("""SET @X:=0;SELECT @X:=@X+1 FROM some_table""")
rows = cursor.fetchall()
print rows
It prints an empty tuple.
How can I solve this?
Thanks
Try to execute one query at a time:
cursor.execute("SET @X:=0;")
cursor.execute("SELECT @X:=@X+1 FROM some_table")
Try it as two queries.
If you want it to be one query, the examples in the comments to the MySQL User Variables documentation look like this:
SELECT @rownum:=@rownum+1 rownum, t.* FROM (SELECT @rownum:=1) r, mytable t;
or
SELECT if(@a, @a:=@a+1, @a:=1) as rownum
See http://dev.mysql.com/doc/refman/5.1/en/user-variables.html
