Best way to insert ~20 million rows using Python/MySQL

I need to store a defaultdict containing ~20M entries in a database. The dictionary maps a string to a string, so the table has two columns and no primary key, because the key is constructed later.
Things I've tried:
executemany, passing in the set of keys and values in the dictionary. Works well when the number of values is < ~1M.
Executing single statements. Works, but slow.
Using transactions:
con = sqlutils.getconnection()
cur = con.cursor()
print len(self.table)
cur.execute("SET FOREIGN_KEY_CHECKS = 0;")
cur.execute("SET UNIQUE_CHECKS = 0;")
cur.execute("SET AUTOCOMMIT = 0;")
i = 0
for k in self.table:
    cur.execute("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s);", (k, str(self.hashtable[k])))
    i += 1
    if i % 10000 == 0:
        print i
#cur.executemany("INSERT INTO " + sqlutils.gettablename(self.sequence) + " (key, matches) values (%s, %s)", [(k, str(self.table[k])) for k in self.table])
cur.execute("SET UNIQUE_CHECKS = 1;")
cur.execute("SET FOREIGN_KEY_CHECKS = 1;")
cur.execute("COMMIT")
con.commit()
cur.close()
con.close()
print "Finished", self.sequence, "in %.3f sec" % (time.time() - t)
This is a recent conversion from SQLite to MySQL. Oddly enough, I'm getting much better performance when I use SQLite (30s to insert 3M rows in SQLite, 480s in MySQL). Unfortunately, MySQL is a necessity because the project will be scaled up in the future.
Edit
Using LOAD DATA INFILE works like a charm. Thanks to all who helped! Inserting 3.2M rows takes me ~25s.

MySQL can insert multiple rows with one query: INSERT INTO table (key1, key2) VALUES ("value_key1", "value_key2"), ("another_value_key1", "another_value_key2"), ("and_again", "and_again...");
Also, you could write your data to a file and use LOAD DATA, which MySQL designed to insert rows "at very high speed" (according to the MySQL documentation).
I don't know whether "file writing" + LOAD DATA will be faster than inserting multiple rows per query (or several such queries, if MySQL limits the statement size).
It depends on your hardware (writing a file is fast on an SSD), your file system configuration, your MySQL configuration, and so on. You have to test on your production environment to see which solution is fastest for you.
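A minimal sketch of the file-based route, assuming a tab-separated dump and values that contain no tabs, newlines, or backslashes. The table name `matches_table`, the file name, and the helpers `dump_to_tsv` and `load_data_statement` are hypothetical. The `LOAD DATA LOCAL INFILE` statement is only built as a string here; executing it needs a MySQL connection with `local_infile` enabled.

```python
import os
import tempfile


def dump_to_tsv(mapping, path):
    """Write key/value pairs as tab-separated lines and return the row count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        for key, value in mapping.items():
            # Assumption: neither key nor value contains tabs or newlines.
            handle.write("{}\t{}\n".format(key, value))
            count += 1
    return count


def load_data_statement(path, table):
    """Build a LOAD DATA statement for a file written by dump_to_tsv.

    `table` and `path` are interpolated directly, so both must be trusted.
    """
    sql_path = path.replace("\\", "/")  # avoid backslash escapes in the SQL literal
    return (
        "LOAD DATA LOCAL INFILE '{path}' INTO TABLE `{table}` "
        "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
        "(`key`, `matches`)"
    ).format(path=sql_path, table=table)


# Tiny stand-in for the 20M-entry dictionary from the question.
sample = {"abc": "1,2,3", "def": "4,5"}
tsv_path = os.path.join(tempfile.gettempdir(), "matches_dump.tsv")
rows_written = dump_to_tsv(sample, tsv_path)
statement = load_data_statement(tsv_path, "matches_table")
# With a real connection this would be: cur.execute(statement); con.commit()
```

Writing the file is a single sequential pass over the dictionary, so the Python side stays cheap and the server does the parsing in bulk.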

Instead of inserting directly, generate an SQL file (using extended inserts etc.) and then feed it to MySQL; this will save you quite a lot of overhead.
NB: you'll also save some execution time if you avoid recomputing constant values in your loop, i.e.:
for k in self.table:
    xxx = sqlutils.gettablename(self.sequence)
    do_something_with(xxx, k)
=>
xxx = sqlutils.gettablename(self.sequence)
for k in self.table:
    do_something_with(xxx, k)
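The extended-insert idea can also be applied without an intermediate SQL file, by sending one parameterised multi-row statement per chunk. The sketch below is only an illustration: it uses the standard-library sqlite3 module (placeholder `?`) so that it runs without a server, whereas MySQL drivers such as MySQLdb use `%s`. The table `kv`, its columns, and the helpers `chunked` and `insert_in_chunks` are hypothetical names.

```python
import sqlite3
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def insert_in_chunks(conn, pairs, chunk_size=400, placeholder="?"):
    """Insert (key, value) pairs using one multi-row INSERT per chunk."""
    cur = conn.cursor()
    row = "({0}, {0})".format(placeholder)  # built once, outside the loop
    total = 0
    for chunk in chunked(pairs, chunk_size):
        sql = "INSERT INTO kv (k, v) VALUES " + ", ".join([row] * len(chunk))
        flat = [item for pair in chunk for item in pair]
        cur.execute(sql, flat)
        total += len(chunk)
    conn.commit()  # single commit at the end
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT, v TEXT)")
data = {"key%d" % i: str(i) for i in range(1000)}
inserted = insert_in_chunks(conn, data.items())
stored = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```

On MySQL the usable chunk size is bounded by the server's `max_allowed_packet` setting, so it is worth treating `chunk_size` as a tuning knob rather than a fixed value.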

Related

Efficient string Match with SQL and Python

I want to know the best approach to string matching with Python and a PSQL database. My db contains pub names and zip codes. I want to check whether there are observations referring to the same pub but spelled differently by mistake.
Conceptually, I was thinking of looping through all the names and, for each other row in the same zip code, obtaining a string similarity metric using strsim. If this metric is above a threshold, I insert the pair into another SQL table which stores the match candidates.
I think I am being inefficient. In "pseudo-code", having pub_table, candidates_table and using the JaroWinkler function, I mean to do something like:
from similarity.jarowinkler import JaroWinkler

jarowinkler = JaroWinkler()
cur = conn.cursor()
cur.execute("SELECT name, zip FROM pub_table")
rows = cur.fetchall()
for r in rows:
    # r[0] is the name, r[1] is the zip code
    cur.execute("SELECT name FROM pub_table WHERE zip = %s", (r[1],))
    search = cur.fetchall()
    for pub in search:
        if jarowinkler.similarity(r[0], pub[0]) > threshold:
            insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
                         "VALUES (%s, %s, %s)")
            cur.execute(insertion, (r[0], pub[0], r[1]))
cur.close()
conn.commit()
conn.close()
I am sorry if I am not being clear (novice here). Any guidance on string matching using PSQL and Python will be highly appreciated. Thank you.

Both SELECT queries run against the same pub_table table, and the inner query on the zip match repeats for every row of pub_table. You can get the zip equality comparison directly in one query by doing an INNER JOIN of pub_table with itself.
SELECT p1.name, p2.name, p1.zip
FROM pub_table p1,
     pub_table p2
WHERE p1.zip = p2.zip
  AND p1.name != p2.name  -- this line assumes your original pub_table
                          -- has unique names for each "same pub in same zip"
                          -- and prevents the entries from matching with themselves.
That would reduce your code to just the outer query plus the inner check and insert, without needing a second query:
cur.execute("<my query as above>")
rows = cur.fetchall()
for r in rows:
    # r[0] and r[1] are the names. r[2] is the zip code
    if jarowinkler.similarity(r[0], r[1]) > threshold:
        insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
                     "VALUES (%s, %s, %s)")
        # since r is already a tuple with the columns in the right order,
        # you can pass `r` directly instead of `(r[0], r[1], r[2])`
        cur.execute(insertion, r)
Another change: the insertion string is always the same, so you can move it to before the for loop and keep only the parameterised cur.execute(insertion, r) inside the loop. Otherwise, you're just redefining the same static string over and over.
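The self-join approach can be tried end to end with the sketch below. It is an illustration under stated assumptions, not the asker's setup: the standard-library sqlite3 module stands in for PostgreSQL (so the placeholder is `?` rather than `%s`), and `difflib.SequenceMatcher` stands in for the JaroWinkler metric. The sample rows and the threshold are made up. It also uses `p1.name < p2.name` instead of `!=`, a variation that keeps each unordered pair only once.

```python
import sqlite3
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for jarowinkler.similarity: returns a ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE pub_table (name TEXT, zip TEXT)")
cur.execute("CREATE TABLE candidates_table (name1 TEXT, name2 TEXT, zip TEXT)")
cur.executemany(
    "INSERT INTO pub_table (name, zip) VALUES (?, ?)",
    [
        ("The Red Lion", "N1"),
        ("The Red Lyon", "N1"),
        ("Crown and Anchor", "N1"),
        ("The Red Lion", "SW2"),
    ],
)

# One query returns every pair of distinct names sharing a zip code.
cur.execute(
    "SELECT p1.name, p2.name, p1.zip "
    "FROM pub_table p1 JOIN pub_table p2 ON p1.zip = p2.zip "
    "WHERE p1.name < p2.name"
)
threshold = 0.85  # illustrative value
candidates = [r for r in cur.fetchall() if similarity(r[0], r[1]) > threshold]

# The insert statement is defined once and all rows are sent in one call.
insertion = "INSERT INTO candidates_table (name1, name2, zip) VALUES (?, ?, ?)"
cur.executemany(insertion, candidates)
conn.commit()
```

Filtering in Python first and inserting afterwards also keeps the read cursor and the writes from interleaving.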

Python code optimisation for an sql insert

I have the following code and I'm running it on some big data (2 hours processing time). I'm looking into CUDA for GPU acceleration, but in the meantime can anyone suggest ways to optimise the following code?
It takes a 3D point from dataset 'T' and finds the minimum distance to a point in another dataset 'B'.
Is there any time saved by sending the results to a list first and then inserting them into the database table?
All suggestions welcome.
conn = psycopg2.connect("<details>")
cur = conn.cursor()
for i in range(len(B)):
    i2 = i + 1
    # point=T[i]
    point = B[i:i2]
    # print(B[i])
    # print(B[i:i2])
    disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
    print("Base: ", end='')
    print(i, end='')
    print(" of ", end='')
    print(len(B), end='')
    print(" ", end='')
    print(disti)
    cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""",
                (xi[i], yi[i], zi[i], disti))
    conn.commit()
cur.close()
############## EDIT #############
Code update:
conn = psycopg2.connect("<details>")
cur = conn.cursor()
disti = []
for i in range(len(T)):
    i2 = i + 1
    point = T[i:i2]
    disti.append(scipy.spatial.distance.cdist(point, B, metric='euclidean').min())
    print("Top: " + str(i) + " of " + str(len(T)))
# Insert code to go here once I figure out the syntax
######## EDIT ########
The solution, with a lot of help from Alex:
from scipy.spatial.distance import cdist

cur = conn.cursor()
# list for accumulating insert-params
insert_params = []
for i in range(len(T)):
    XA = [T[i]]
    disti = cdist(XA, B, metric='euclidean').min()
    insert_params.append((xi[i], yi[i], zi[i], disti))
    print("Top: " + str(i) + " of " + str(len(T)))
# Only one instruction to insert everything
cur.executemany("INSERT INTO pc_processing.pc_dist_top_tmp (x,y,z,dist) values (%s, %s, %s, %s)",
                insert_params)
conn.commit()
For timing comparison:
initial code: 0:00:50.225644
without multiline prints: 0:00:47.934012
taking commit out of the loop: 0:00:25.411207
I'm assuming the only way to make it faster is to get CUDA working?

There are two solutions:
1) Commit once at the end, or commit in chunks if len(B) is very large.
2) Prepare a list of the data you are inserting and do a bulk insert, passing one array per column, e.g.:
insert into pc_processing.pc_dist_base_tmp (x, y, z, dist) select * from unnest(array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4], array[1, 2, 3, 4]);

OK. Let's collect all the suggestions from the comments.
Suggestion 1. Commit as rarely as possible, and don't print at all
conn = psycopg2.connect("<details>")
cur = conn.cursor()
for i in range(len(B)):
    i2 = i + 1
    point = B[i:i2]
    disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
    cur.execute("""INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)""", (xi[i], yi[i], zi[i], disti))
conn.commit()  # Note that you commit only once. Be careful with **really** big chunks of data
cur.close()
If you really need debug information inside your loops, use logging.
You will be able to turn logging on and off as needed.
Suggestion 2. executemany to the rescue
conn = psycopg2.connect("<details>")
cur = conn.cursor()
insert_params = []  # list for accumulating insert-params
for i in range(len(B)):
    i2 = i + 1
    point = B[i:i2]
    disti = scipy.spatial.distance.cdist(point, T, metric='euclidean').min()
    insert_params.append((xi[i], yi[i], zi[i], disti))
# Only one instruction to insert everything
cur.executemany("INSERT INTO pc_processing.pc_dist_base_tmp (x,y,z,dist) values (%s, %s, %s, %s)", insert_params)
conn.commit()
cur.close()
Suggestion 3. Don't use psycopg2 at all. Use BULK operations
Instead of calling cur.execute and conn.commit, write a CSV file,
and then use COPY to load the created file.
The bulk solution should give the best performance, but it takes some effort to make it work.
Choose what is appropriate for you, depending on how much speed you need.
Good luck
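As a variation on Suggestion 3, COPY can also be driven through psycopg2 itself, without a temporary file, by streaming an in-memory CSV buffer to `cursor.copy_expert()`. The sketch below is hedged accordingly: the helper name `copy_rows` is hypothetical, and the function only assumes that `cur` offers psycopg2's `copy_expert(sql, file)` method.

```python
import csv
import io


def copy_rows(cur, table, columns, rows):
    """Send rows to PostgreSQL with COPY ... FROM STDIN using an in-memory CSV.

    `cur` is expected to be a psycopg2 cursor (it provides copy_expert).
    `table` and `columns` are interpolated directly, so keep them trusted.
    """
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    buffer.seek(0)  # rewind so COPY reads from the start
    sql = "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns)
    )
    cur.copy_expert(sql, buffer)
    return sql


# Usage with the loop from Suggestion 2 (needs a live connection):
# copy_rows(cur, "pc_processing.pc_dist_base_tmp",
#           ["x", "y", "z", "dist"], insert_params)
# conn.commit()
```

For very large inputs the rows could be written to the buffer in batches rather than all at once, trading a few more COPY calls for lower memory use.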

Try committing when the loop is finished instead of every single iteration

Performance SQLAlchemy and or

I use the following SQLAlchemy code to retrieve some data from a database:
q = session.query(hd_tbl).\
    join(dt_tbl, hd_tbl.c['data_type'] == dt_tbl.c['ID']).\
    filter(or_(and_(hd_tbl.c['object_id'] == get_id(row['object']),
                    hd_tbl.c['data_type'] == get_id(row['type']),
                    hd_tbl.c['data_provider'] == get_id(row['provider']),
                    hd_tbl.c['data_account'] == get_id(row['account']))
               for index, row in data.iterrows())).\
    with_entities(hd_tbl.c['ID'], hd_tbl.c['object_id'],
                  hd_tbl.c['data_type'], hd_tbl.c['data_provider'],
                  hd_tbl.c['data_account'], dt_tbl.c['value_type'])
where hd_tbl and dt_tbl are two tables in the SQL db, and data is a pandas DataFrame typically containing around 1k-9k entries. hd_tbl currently contains around 90k rows.
The execution time seems to grow exponentially with the length of data. The corresponding SQL statement (generated by SQLAlchemy) looks as follows:
SELECT data_header.`ID`, data_header.object_id, data_header.data_type, data_header.data_provider, data_header.data_account, basedata_data_type.value_type
FROM data_header INNER JOIN basedata_data_type ON data_header.data_type = basedata_data_type.`ID`
WHERE data_header.object_id = %s AND data_header.data_type = %s AND data_header.data_provider = %s AND data_header.data_account = %s OR
data_header.object_id = %s AND data_header.data_type = %s AND data_header.data_provider = %s AND data_header.data_account = %s OR
...
data_header.object_id = %s AND data_header.data_type = %s AND data_header.data_provider = %s AND data_header.data_account = %s OR
The tables and columns are fully indexed, and performance is not satisfying. Currently it is way faster to read all the data of hd_tbl and dt_tbl into memory and merge with the pandas merge function. However, this seems to be suboptimal. Does anyone have an idea how to improve the SQLAlchemy call?
EDIT:
I was able to improve performance significantly by using SQLAlchemy's tuple_ in the following way:
header_tuples = [tuple([int(y) for y in tuple(x)]) for x in
                 data_as_int.values]
q = session.query(hd_tbl). \
    join(dt_tbl, hd_tbl.c['data_type'] == dt_tbl.c['ID']). \
    filter(tuple_(hd_tbl.c['object_id'], hd_tbl.c['data_type'],
                  hd_tbl.c['data_provider'],
                  hd_tbl.c['data_account']).in_(header_tuples)). \
    with_entities(hd_tbl.c['ID'], hd_tbl.c['object_id'],
                  hd_tbl.c['data_type'], hd_tbl.c['data_provider'],
                  hd_tbl.c['data_account'], dt_tbl.c['value_type'])
with the corresponding query:
SELECT data_header.`ID`, data_header.object_id, data_header.data_type, data_header.data_provider, data_header.data_account, basedata_data_type.value_type
FROM data_header INNER JOIN basedata_data_type ON data_header.data_type = basedata_data_type.`ID`
WHERE (data_header.object_id, data_header.data_type, data_header.data_provider, data_header.data_account) IN ((%(param_1)s, %(param_2)s, %(param_3)s, %(param_4)s), (%(param_5)s, ...))

I'd recommend creating a composite index on the fields object_id, data_type, data_provider, ... in the same order in which they appear in the table, and making sure they follow the same order in your WHERE condition. It may speed up your requests a bit at the cost of disk space.
You could also use several consecutive small SQL requests instead of one large query with a complex OR condition. Accumulate the extracted data on the application side or, if the amount is large enough, in fast temporary storage (a temporary table, NoSQL, etc.).
In addition, you may check the MySQL configuration and increase the values related to memory per thread, per request, etc. It is a good idea to check whether your composite index fits into available memory; if it does not, it is useless.
I guess DB tuning may help a lot to increase performance. Otherwise, you may analyze your application's architecture to get more significant results.
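The composite index and the temporary-table idea can be combined: load the lookup keys into a temporary table and let the database join against it, instead of sending thousands of OR branches. The sketch below illustrates this with the standard-library sqlite3 module rather than MySQL. The table layout is reduced to the four key columns from the question, and the index name `ix_header_lookup`, the table `wanted_keys`, and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE data_header (ID INTEGER PRIMARY KEY, object_id INTEGER, "
    "data_type INTEGER, data_provider INTEGER, data_account INTEGER)"
)
# Composite index covering the four lookup columns, in lookup order.
cur.execute(
    "CREATE INDEX ix_header_lookup ON data_header "
    "(object_id, data_type, data_provider, data_account)"
)
cur.executemany(
    "INSERT INTO data_header "
    "(object_id, data_type, data_provider, data_account) VALUES (?, ?, ?, ?)",
    [(o, t, 1, 1) for o in range(100) for t in range(3)],
)

# Keys to look up; in the question these would come from the DataFrame.
wanted = [(5, 0, 1, 1), (7, 2, 1, 1), (999, 0, 1, 1)]

cur.execute(
    "CREATE TEMP TABLE wanted_keys (object_id INTEGER, data_type INTEGER, "
    "data_provider INTEGER, data_account INTEGER)"
)
cur.executemany("INSERT INTO wanted_keys VALUES (?, ?, ?, ?)", wanted)

# A single join replaces the long chain of OR-ed AND groups.
cur.execute(
    "SELECT h.ID, h.object_id, h.data_type "
    "FROM data_header h JOIN wanted_keys w "
    "ON h.object_id = w.object_id AND h.data_type = w.data_type "
    "AND h.data_provider = w.data_provider AND h.data_account = w.data_account "
    "ORDER BY h.ID"
)
matches = cur.fetchall()
```

The statement text stays the same size no matter how many keys are looked up, which avoids the parsing and planning cost of a very long WHERE clause.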

Insert list or tuple into table without iteration into postgresql

I am new to Python. What I am trying to achieve is to insert values from my list/tuple into my Redshift table without iteration. I have around 1 million rows and 1 column. Below is the code I am using to create my list/tuple.
cursor1.execute("select domain from url limit 5;")
for record, in cursor1:
    ext = tldextract.extract(record)
    mylist.append(ext.domain + '.' + ext.suffix)
mytuple = tuple(mylist)
I am not sure which is best to use, a tuple or a list. The output of print(mylist) and print(mytuple) is as follows.
List output:
['friv.com', 'steep.tv', 'wordpress.com', 'fineartblogger.net', 'v56.org']
Tuple output:
('friv.com', 'steep.tv', 'wordpress.com', 'fineartblogger.net', 'v56.org')
Now, below is the code I am using to insert the values into my Redshift table, but I am getting an error:
cursor2.execute("INSERT INTO sample(domain) VALUES (%s)", mylist)
or
cursor2.execute("INSERT INTO sample(domain) VALUES (%s)", mytuple)
Error: not all arguments converted during string formatting
Any help is appreciated. If any other detail is required, please let me know and I will edit my question.
UPDATE 1:
I tried the code below and got a different error.
args_str = ','.join(cur.mogrify("(%s)", x) for x in mylist)
cur.execute("INSERT INTO table VALUES " + args_str)
Error: INSERT has more expressions than target columns

I think you're looking for psycopg2's fast execution helpers:
from psycopg2.extras import execute_values
mylist = [('t1',), ('t2',)]
execute_values(cursor2, "INSERT INTO sample(domain) VALUES %s", mylist, page_size=100)
This replaces the %s with up to 100 rows of values per statement. I'm not sure how high you can set page_size, but it should be far more performant.
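The first error in the question comes from passing five parameters to a statement that has a single placeholder; each row has to be its own sequence, even when it has only one column. A small sketch of the required shape follows. It uses the standard-library sqlite3 module so it runs without a server, which means the placeholder is `?` here, whereas psycopg2 uses `%s`.

```python
import sqlite3

mylist = ["friv.com", "steep.tv", "wordpress.com", "fineartblogger.net", "v56.org"]

# One 1-tuple per row. The trailing comma matters: ("x") is just a string.
rows = [(domain,) for domain in mylist]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sample (domain TEXT)")
cur.executemany("INSERT INTO sample (domain) VALUES (?)", rows)
conn.commit()

stored = [r[0] for r in cur.execute("SELECT domain FROM sample ORDER BY rowid")]
```

The same list of 1-tuples is the shape that `execute_values` expects as its argument list.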

Finally found a solution. For some reason cur.mogrify was not giving me a proper SQL string for the insert. I built my own SQL string, and it works a lot faster than cur.executemany():
# Values are interpolated directly, so this is only safe for trusted input.
sql = ', '.join("('" + domain + "')" for domain in mylist)
cursor1.execute("INSERT into sample(domain) values " + sql)
Thanks for your help guys!

Bulk insert with SQLAlchemy ORM

Is there any way to get SQLAlchemy to do a bulk insert rather than inserting each individual object, i.e.,
doing:
INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)
rather than:
INSERT INTO `foo` (`bar`) VALUES (1)
INSERT INTO `foo` (`bar`) VALUES (2)
INSERT INTO `foo` (`bar`) VALUES (3)
I've just converted some code to use SQLAlchemy rather than raw SQL, and although it is now much nicer to work with, it seems to be slower (up to a factor of 10). I'm wondering if this is the reason.
Maybe I could improve the situation by using sessions more efficiently. At the moment I have autocommit=False and do a session.commit() after I've added some stuff. However, this seems to cause the data to go stale if the DB is changed elsewhere: even if I do a new query, I still get old results back.
Thanks for your help!

SQLAlchemy introduced that in version 1.0.0:
Bulk operations - SQLAlchemy docs
With these operations, you can now do bulk inserts or updates!
For instance, you can do:
s = Session()
objects = [
    User(name="u1"),
    User(name="u2"),
    User(name="u3")
]
s.bulk_save_objects(objects)
s.commit()
Here, a bulk insert will be made.

The sqlalchemy docs have a writeup on the performance of various techniques that can be used for bulk inserts:
ORMs are basically not intended for high-performance bulk inserts -
this is the whole reason SQLAlchemy offers the Core in addition to the
ORM as a first-class component.
For the use case of fast bulk inserts, the SQL generation and
execution system that the ORM builds on top of is part of the Core.
Using this system directly, we can produce an INSERT that is
competitive with using the raw database API directly.
Alternatively, the SQLAlchemy ORM offers the Bulk Operations suite of
methods, which provide hooks into subsections of the unit of work
process in order to emit Core-level INSERT and UPDATE constructs with
a small degree of ORM-based automation.
The example below illustrates time-based tests for several different
methods of inserting rows, going from the most automated to the least.
With cPython 2.7, runtimes observed:
classics-MacBook-Pro:sqlalchemy classic$ python test.py
SQLAlchemy ORM: Total time for 100000 records 12.0471920967 secs
SQLAlchemy ORM pk given: Total time for 100000 records 7.06283402443 secs
SQLAlchemy ORM bulk_save_objects(): Total time for 100000 records 0.856323003769 secs
SQLAlchemy Core: Total time for 100000 records 0.485800027847 secs
sqlite3: Total time for 100000 records 0.487842082977 sec
Script:
import time
import sqlite3

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()
DBSession = scoped_session(sessionmaker())
engine = None


class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))


def init_sqlalchemy(dbname='sqlite:///sqlalchemy.db'):
    global engine
    engine = create_engine(dbname, echo=False)
    DBSession.remove()
    DBSession.configure(bind=engine, autoflush=False, expire_on_commit=False)
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)


def test_sqlalchemy_orm(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer()
        customer.name = 'NAME ' + str(i)
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_pk_given(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    for i in xrange(n):
        customer = Customer(id=i+1, name="NAME " + str(i))
        DBSession.add(customer)
        if i % 1000 == 0:
            DBSession.flush()
    DBSession.commit()
    print(
        "SQLAlchemy ORM pk given: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_orm_bulk_insert(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    n1 = n
    while n1 > 0:
        n1 = n1 - 10000
        DBSession.bulk_insert_mappings(
            Customer,
            [
                dict(name="NAME " + str(i))
                for i in xrange(min(10000, n1))
            ]
        )
    DBSession.commit()
    print(
        "SQLAlchemy ORM bulk_save_objects(): Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )
    print(
        "SQLAlchemy Core: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " secs")


def init_sqlite3(dbname):
    conn = sqlite3.connect(dbname)
    c = conn.cursor()
    c.execute("DROP TABLE IF EXISTS customer")
    c.execute(
        "CREATE TABLE customer (id INTEGER NOT NULL, "
        "name VARCHAR(255), PRIMARY KEY(id))")
    conn.commit()
    return conn


def test_sqlite3(n=100000, dbname='sqlite3.db'):
    conn = init_sqlite3(dbname)
    c = conn.cursor()
    t0 = time.time()
    for i in xrange(n):
        row = ('NAME ' + str(i),)
        c.execute("INSERT INTO customer (name) VALUES (?)", row)
    conn.commit()
    print(
        "sqlite3: Total time for " + str(n) +
        " records " + str(time.time() - t0) + " sec")


if __name__ == '__main__':
    test_sqlalchemy_orm(100000)
    test_sqlalchemy_orm_pk_given(100000)
    test_sqlalchemy_orm_bulk_insert(100000)
    test_sqlalchemy_core(100000)
    test_sqlite3(100000)

As far as I know, there is no way to get the ORM to issue bulk inserts. I believe the underlying reason is that SQLAlchemy needs to keep track of each object's identity (i.e., new primary keys), and bulk inserts interfere with that. For example, assuming your foo table contains an id column and is mapped to a Foo class:
x = Foo(bar=1)
print x.id
# None
session.add(x)
session.flush()
# BEGIN
# INSERT INTO foo (bar) VALUES(1)
# COMMIT
print x.id
# 1
Since SQLAlchemy picked up the value for x.id without issuing another query, we can infer that it got the value directly from the INSERT statement. If you don't need subsequent access to the created objects via the same instances, you can skip the ORM layer for your insert:
Foo.__table__.insert().execute([{'bar': 1}, {'bar': 2}, {'bar': 3}])
# INSERT INTO foo (bar) VALUES ((1,), (2,), (3,))
SQLAlchemy can't match these new rows with any existing objects, so you'll have to query them anew for any subsequent operations.
As far as stale data is concerned, it's helpful to remember that the session has no built-in way to know when the database is changed outside of the session. In order to access externally modified data through existing instances, the instances must be marked as expired. This happens by default on session.commit(), but can be done manually by calling session.expire_all() or session.expire(instance). An example (SQL omitted):
x = Foo(bar=1)
session.add(x)
session.commit()
print x.bar
# 1
foo.update().execute(bar=42)
print x.bar
# 1
session.expire(x)
print x.bar
# 42
session.commit() expires x, so the first print statement implicitly opens a new transaction and re-queries x's attributes. If you comment out the first print statement, you'll notice that the second one now picks up the correct value, because the new query isn't emitted until after the update.
This makes sense from the point of view of transactional isolation - you should only pick up external modifications between transactions. If this is causing you trouble, I'd suggest clarifying or re-thinking your application's transaction boundaries instead of immediately reaching for session.expire_all().

I usually do it using add_all.
from app import session
from models import User
objects = [User(name="u1"), User(name="u2"), User(name="u3")]
session.add_all(objects)
session.commit()

Direct support was added to SQLAlchemy as of version 0.8
As per the docs, connection.execute(table.insert().values(data)) should do the trick. (Note that this is not the same as connection.execute(table.insert(), data) which results in many individual row inserts via a call to executemany). On anything but a local connection the difference in performance can be enormous.

SQLAlchemy introduced that in version 1.0.0:
Bulk operations - SQLAlchemy docs
With these operations, you can now do bulk inserts or updates!
For instance (if you want the lowest overhead for simple table INSERTs), you can use Session.bulk_insert_mappings():
loadme = [(1, 'a'),
          (2, 'b'),
          (3, 'c')]
dicts = [dict(bar=t[0], fly=t[1]) for t in loadme]
s = Session()
s.bulk_insert_mappings(Foo, dicts)
s.commit()
Or, if you want, skip the loadme tuples and write the dictionaries directly into dicts (but I find it easier to leave all the wordiness out of the data and load up a list of dictionaries in a loop).

Piere's answer is correct but one issue is that bulk_save_objects by default does not return the primary keys of the objects, if that is of concern to you. Set return_defaults to True to get this behavior.
The documentation is here.
foos = [Foo(bar='a'), Foo(bar='b'), Foo(bar='c')]
session.bulk_save_objects(foos, return_defaults=True)
for foo in foos:
    assert foo.id is not None
session.commit()

This is a way:
values = [1, 2, 3]
Foo.__table__.insert().execute([{'bar': x} for x in values])
This will insert like this:
INSERT INTO `foo` (`bar`) VALUES (1), (2), (3)
Reference: The SQLAlchemy FAQ includes benchmarks for various commit methods.

All roads lead to Rome, but some of them cross mountains and some require ferries; if you want to get there quickly, just take the motorway.
In this case the motorway is the execute_batch() function of psycopg2. The documentation says it best:
The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips the performance can be orders of magnitude better than using executemany().
In my own test, execute_batch() is approximately twice as fast as executemany(), and it gives the option to configure the page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).
The same feature can easily be enabled if you are using SQLAlchemy, by setting use_batch_mode=True as a parameter when you instantiate the engine with create_engine().

The best answer I found so far was in sqlalchemy documentation:
http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow
There is a complete example of a benchmark of possible solutions.
As shown in the documentation:
bulk_save_objects is not the best solution, but its performance is acceptable.
The second-best implementation in terms of readability, I think, was the one using SQLAlchemy Core:
def test_sqlalchemy_core(n=100000):
    init_sqlalchemy()
    t0 = time.time()
    engine.execute(
        Customer.__table__.insert(),
        [{"name": 'NAME ' + str(i)} for i in xrange(n)]
    )
The context of this function is given in the documentation article.

Sqlalchemy supports bulk insert
bulk_list = [
    Foo(
        bar=1,
    ),
    Foo(
        bar=2,
    ),
    Foo(
        bar=3,
    ),
]
db.session.bulk_save_objects(bulk_list)
db.session.commit()
