I am using Python 2.7 and SQLite, and I am building a database with millions of rows. I would like to write out to disk only occasionally, with the idea that this will improve performance. My thought was to call commit() only from time to time, and I have tried that with the code below. The selects in the middle show that we get consistent reads. But when I look on disk, I see a file example.db-journal; this must be where the data is being cached, in which case I gain nothing in terms of performance. Is there a way to have the inserts collect in memory and then flush them to disk? Is there a better way to do this?
Here is my sample code:
import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('CREATE TABLE if not exists stocks (date text, trans text, symbol text, qty real, price real)')
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get 2 rows as expected.
print c.fetchall()
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
conn.commit()
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get all the rows as expected.
print c.fetchall()
conn.close()
Update:
Figured I would give an update with some code in case anyone runs across this problem.
I am processing 5+ million lines from a text file and needed a place to store the data for more processing. I originally had all the data in memory, but, was running out of memory. So, I switched to SQLite for a disc cache. My original in memory version of the processing took ~36 secs per 50,000 rows from the original text file.
After measuring, my first cut on SQLite version of the batch processing took ~660 seconds for 50,000 lines. Based on the comments (thanks to the posters), I came up with the following code:
self.conn = sqlite3.connect('myDB.db', isolation_level='Exclusive')
self.cursor = self.conn.cursor()
self.cursor.execute('PRAGMA synchronous = 0')
self.cursor.execute('PRAGMA journal_mode = OFF')
In addition, I commit after processing 1000 lines from my text file.
if lineNum % 1000 == 0:
    self.conn.commit()
With that, processing 50,000 lines from the text file now takes ~40 seconds. So I added about 11% to the overall time, but memory use is constant, which is more important.
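Putting the pieces together, a minimal sketch of the pattern (the file name 'input.txt' and the single-column table are placeholders for illustration, not my actual schema):

import sqlite3

conn = sqlite3.connect('myDB.db', isolation_level='Exclusive')
cursor = conn.cursor()
cursor.execute('PRAGMA synchronous = 0')
cursor.execute('PRAGMA journal_mode = OFF')
cursor.execute('CREATE TABLE IF NOT EXISTS lines (lineNum INTEGER, content TEXT)')

with open('input.txt') as textFile:           # placeholder input file
    for lineNum, line in enumerate(textFile, start=1):
        cursor.execute('INSERT INTO lines VALUES (?, ?)', (lineNum, line.rstrip('\n')))
        if lineNum % 1000 == 0:               # commit in batches of 1000 lines
            conn.commit()

conn.commit()                                 # commit the final partial batch
conn.close()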
Firstly, are you sure you need this? For reading, the OS should cache the file anyway, and if you write a lot, not syncing to disc means you can lose data easily.
If you measure and identify this as a bottleneck, you can use an in-memory database (connect(':memory:')) and get an iterator returning an SQL dump on demand: http://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.iterdump
import sqlite3

in_memory = sqlite3.connect(':memory:')
# do stuff

con = sqlite3.connect('existing_db.db')
con.execute('drop table stocks')
for line in in_memory.iterdump():
    con.execute(line)
Again, measure if you need this. If you have enough data that it matters, think hard about using a different data store, for example a full blown DBMS like postgres.
In your case, you are creating a DB connection in autocommit mode, which means that every time you execute an INSERT statement, the database starts a transaction, executes the statement and commits. So your commit is, in this case, meaningless. See the sqlite3 module documentation in the Python docs.
But you are correct that inserting a large quantity of rows should ideally be done within a single transaction. This signals the connection that it should record all the incoming INSERT statements in the journal file, but delay writing to the database file until the transaction is committed. Even though execution is still limited by I/O operations, writing to the journal file is not a serious performance penalty.
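For illustration, a minimal sketch against the stocks table from the question, assuming the connection is put into autocommit mode (isolation_level=None) so the transaction boundaries can be controlled explicitly; the repeated dummy row is just filler:

import sqlite3

conn = sqlite3.connect('example.db', isolation_level=None)   # autocommit; we manage the transaction ourselves
c = conn.cursor()

rows = [('2006-01-05', 'BUY', 'RHAT', 100, 35.14)] * 10000   # dummy data for illustration

c.execute('BEGIN')                                           # one transaction for the whole batch
c.executemany('INSERT INTO stocks VALUES (?, ?, ?, ?, ?)', rows)
c.execute('COMMIT')                                          # a single journal write instead of one per INSERT

conn.close()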
Related
We are dealing with some performance issues when adding the DISTINCT keyword to our SQL queries.
The problem occurs only in the following scenario: 100000 entries (or more) with only ~1% (or less) distinct values among them.
We boiled the issue down to the following minimal Python example (but it's not related to Python; MySQL Workbench behaves the same):
import mysql.connector
import time
import numpy as np

conn = mysql.connector.connect(user='user', password='password', host='server',
                               database='database', raise_on_warnings=True, autocommit=False)
cursor = conn.cursor()

# define amount of entries
max_exponent = 4.7
n_entry = 10**max_exponent

# fill table with 10, 100, ... distinct entries
for n_distinct in np.logspace(1, max_exponent, num=int(max_exponent)):
    # Drop the BENCHMARK table if it already exists and create a new one
    cursor.execute("DROP TABLE IF EXISTS BENCHMARK")
    cursor.execute('CREATE TABLE BENCHMARK(ID INT)')

    # create distinct number set and insert random permutation of it into table
    distinct_numbers = range(int(n_distinct))
    random_numbers = np.random.randint(len(distinct_numbers), size=int(n_entry))
    value_string = ','.join([f"({i_name})" for i_name in random_numbers])
    mySql_insert_query = f"INSERT INTO BENCHMARK (ID) VALUES {value_string}"
    print(f'filling table with {n_entry:.0f} random values of {n_distinct:.0f} distinct numbers')
    cursor.execute(mySql_insert_query)
    conn.commit()

    # benchmark distinct call
    start = time.time()
    sql_query = 'SELECT DISTINCT ID from BENCHMARK'
    cursor.execute(sql_query)
    result = cursor.fetchall()
    print(f'Time to read {len(result)} distinct values: {time.time()-start:.2f}')

conn.close()
The extracted benchmark times show a counter-intuitive behaviour: the time suddenly increases for fewer distinct values in the table.
If we run the query without DISTINCT, the times drop to 170 ms, independent of the number of distinct entries.
We cannot make any sense of this dependence (except for some "hardware limitation", but 100000 entries should be ... negligible?), so we ask you for insight into what the root cause of this behaviour might be.
The machine we are using for the database has the following specs:
CPU: Intel i5 @ 3.3GHz (CPU load goes to 30% during execution)
RAM: 8 GB (mysqld takes about 2.4GB, does not rise during query execution, InnoDB buffer usage stays at 42%, buffer_size = 4GB)
HDD: 500GB, ~90% empty
OS, MySQL: Windows 10, MySQL Server 8.0.18
Thanks for reading!
Having versus not having an index on id is likely to make a huge difference.
At some point, MySQL shifts gears. There are multiple ways to do a GROUP BY or DISTINCT query:
Keep a hash in memory and count how many of each value there are.
Write to a temp table, sort it, then go through it counting the distinct values.
If there is a usable index, skip from one value to the next.
The Optimizer cannot necessarily predict the best way for a given situation, so there could be times when it fails to pick the optimal approach. There was probably no way in the old 5.5 version (almost a decade old) to get insight into what the Optimizer chose to do. Newer versions have EXPLAIN FORMAT=JSON and the "Optimizer Trace".
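As an illustration, a small sketch of how you could pull the JSON plan from Python with the same mysql.connector cursor used in your benchmark (8.0 syntax):

# Ask the Optimizer how it plans to execute the DISTINCT query.
cursor.execute('EXPLAIN FORMAT=JSON SELECT DISTINCT ID FROM BENCHMARK')
plan_json = cursor.fetchone()[0]   # one row, one column holding the JSON plan
print(plan_json)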
Another issue is I/O. Reading data from disk can slow down a query ten-fold. However, this does not seem to be an issue since the table is rather small. And you seem to run the query immediately after building the table; that is, the table is probably entirely cached in RAM (the buffer_pool).
I hope this adds some specifics to the Comments that say that benchmarking is difficult.
I read this: Importing a CSV file into a sqlite3 database table using Python
and it seems that everyone suggests using line-by-line reading instead of using bulk .import from SQLite. However, that will make the insertion really slow if you have millions of rows of data. Is there any other way to circumvent this?
Update: I tried the following code to insert line by line, but the speed is not as good as I expected. Is there any way to improve it?
for logFileName in allLogFilesName:
    logFile = codecs.open(logFileName, 'rb', encoding='utf-8')
    for logLine in logFile:
        logLineAsList = logLine.split('\t')
        output.execute('''INSERT INTO log VALUES(?, ?, ?, ?)''', logLineAsList)
    logFile.close()

connection.commit()
connection.close()
Since this is the top result on a Google search I thought it might be nice to update this question.
From the python sqlite docs you can use
import sqlite3

persons = [
    ("Hugo", "Boss"),
    ("Calvin", "Klein")
]

con = sqlite3.connect(":memory:")

# Create the table
con.execute("create table person(firstname, lastname)")

# Fill the table
con.executemany("insert into person(firstname, lastname) values (?,?)", persons)
I have used this method to commit over 50k row inserts at a time and it's lightning fast.
Divide your data into chunks on the fly using generator expressions, and make the inserts inside a transaction. Here's a quote from the SQLite optimization FAQ:
Unless already in a transaction, each SQL statement has a new transaction started for it. This is very expensive, since it requires reopening, writing to, and closing the journal file for each statement. This can be avoided by wrapping sequences of SQL statements with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup is also obtained for statements which don't alter the database.
Here's roughly how your code might look:
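(A sketch only — it keeps the names from your snippet, assumes the log lines are tab-separated with exactly four fields, and uses an arbitrary chunk size of 1000.)

import codecs
from itertools import islice

def chunks(iterable, size=1000):
    # Yield successive lists of at most `size` items from `iterable`.
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            break
        yield chunk

for logFileName in allLogFilesName:
    logFile = codecs.open(logFileName, 'rb', encoding='utf-8')
    rows = (logLine.split('\t') for logLine in logFile)              # generator expression, nothing held in memory
    for chunk in chunks(rows):
        output.executemany('INSERT INTO log VALUES(?, ?, ?, ?)', chunk)
        connection.commit()                                          # one transaction per 1000 rows
    logFile.close()

connection.close()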
Also, the sqlite3 command-line shell has the ability to import CSV files directly (the .import command).
SQLite can do tens of thousands of inserts per second; just make sure to do all of them in a single transaction by surrounding the inserts with BEGIN and COMMIT. (With Python's sqlite3 module, executemany() runs all of its inserts inside a single implicit transaction.)
As always, don't optimize before you know speed will be a problem. Test the easiest solution first, and only optimize if the speed is unacceptable.
I've written a program to scrape a website for data, place it into several arrays, iterate through each array and place it in a query and then execute the query. The code looks like this:
for count in range(391):
    query = #long query
    values = (doctor_names[count].encode("utf-8"), ...) #continues for about a dozen arrays
    cur.execute(query, values)

cur.close()
db.close()
I run the program and aside from a few truncation warnings everything goes fine. I open the database in MySQL Workbench and nothing has changed. I tried changing the arrays in the values to constant strings and running it but still nothing would change.
I then created an array to hold the last executed query: sql_queries.append(cur._last_executed) and pushed them out to a text file:
fo = open("foo.txt", "wb")
for q in sql_queries:
    fo.write(q)
fo.close()
Which gives me a large text file with multiple queries. When I copy the whole text file and create a new query in MySQL Workbench and execute it, it populates the database as desired. What is my program missing?
If your table is using a transactional storage engine, like InnoDB, then you need to call db.commit() to have the transaction stored:
for count in range(391):
    query = #long query
    values = (doctor_names[count].encode("utf-8"), ...)
    cur.execute(query, values)

db.commit()
cur.close()
db.close()
Note that with a transactional database, besides committing you also have the opportunity to handle errors by rolling back inserts or updates with db.rollback(). The db.commit() is required to finalize the transaction. Otherwise,
Closing a connection without committing the changes first will cause
an implicit rollback to be performed.
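For example, a sketch of wrapping your loop so that a failure rolls the whole batch back instead of leaving it half-finished (the query and values placeholders stand for the ones in your snippet):

try:
    for count in range(391):
        query = "..."                                        # your long query, unchanged
        values = (doctor_names[count].encode("utf-8"), ...)  # your dozen arrays, unchanged
        cur.execute(query, values)
    db.commit()       # finalize the transaction only if every insert succeeded
except Exception:
    db.rollback()     # otherwise undo the partial batch
    raise
finally:
    cur.close()
    db.close()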
I have written a Python script that takes a 1.5 GB XML file, parses out data and feeds it to a database using copy_from. It invokes the following function every 1000 parsed nodes. There are about 170k nodes in all, which update about 300k rows or more. It starts out quite fast and then gets progressively slower as time goes on. Any ideas on why this is happening and what I can do to fix it?
Here is the function where I feed the data to the db.
def db_update(val_str, tbl, cols):
    conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
    cur = conn.cursor()
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=(cols))
    conn.commit()
I haven't included the xml parsing as I don't think that's an issue. Without the db the parser executes in under 2 minutes.
There are several things that can slow inserts as tables grow:
Triggers that have to do more work as the DB grows
Indexes, which get more expensive to update as they grow
Disable any non-critical triggers, or if that isn't possible re-design them to run in constant time.
Drop indexes, then create them after the data has been loaded. If you need any indexes for the actual INSERTs or UPDATEs you'll need to keep them and wear the cost.
If you're doing lots of UPDATEs, consider VACUUMing the table periodically, or setting autovacuum to run very aggressively. That'll help Pg re-use space rather than more expensively allocating new space from the file system, and will help avoid table bloat.
You'll also save time by not re-connecting for each block of work. Maintain a connection.
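A sketch of what that could look like — the index, table and column names (my_table_some_idx, my_table, some_column) are placeholders, and the connection string is the one from the question:

import cStringIO
import psycopg2

# Connect once, up front, instead of once per 1000-node batch.
conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()

# Drop non-critical indexes before the bulk load (placeholder index name).
cur.execute("DROP INDEX IF EXISTS my_table_some_idx")
conn.commit()

def db_update(val_str, tbl, cols):
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=cols)
    conn.commit()

# ... call db_update() for each block of parsed nodes ...

# Recreate the index once, after all the data is in (placeholder column name).
cur.execute("CREATE INDEX my_table_some_idx ON my_table (some_column)")
conn.commit()
conn.close()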
From personal experience, copy_from doesn't update any indexes after you commit anything, so you will have to do it later. I would move your conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>"); cur = conn.cursor() outside of the function and do a commit() when you've finished inserting everything (I suggest committing every ~100k rows or it will start getting slow).
Also, it may seem stupid, but it happened to me a lot of times: make sure you reset your val_str after you call db_update. For me, when the copy_from / inserts start to go slower it's because I'm inserting the same rows plus more rows.
I am using the following and I haven't seen any hit on performance so far:
import psycopg2
import psycopg2.extras

local_conn_string = """
    host='localhost'
    port='5432'
    dbname='backupdata'
    user='postgres'
    password='123'"""
local_conn = psycopg2.connect(local_conn_string)
local_cursor = local_conn.cursor(
    'cursor_unique_name',
    cursor_factory=psycopg2.extras.DictCursor)
I have added the following output to my code to test the run-time (and I am parsing a LOT of rows, more than 30.000.000):
Parsed 2600000 rows in 00:25:21
Parsed 2700000 rows in 00:26:19
Parsed 2800000 rows in 00:27:16
Parsed 2900000 rows in 00:28:15
Parsed 3000000 rows in 00:29:13
Parsed 3100000 rows in 00:30:11
I have to mention I don't "copy" anything. I am moving my rows from a remote PostgreSQL database to a local one, and in the process I create a few more tables to index my data better than before, as 30.000.000+ rows are a bit too much to handle with regular queries.
NB: The time is counting upwards and is not for each query.
I believe it has to do with the way my cursor is created: a named cursor in psycopg2 is a server-side cursor, so the rows are streamed instead of all being fetched at once.
EDIT1:
I am using the following to run my query:
local_cursor.execute("""SELECT * FROM data;""")

row_count = 0
for row in local_cursor:
    if (row_count % 100000 == 0 and row_count != 0):
        print("Parsed %s rows in %s" % (row_count,
                                        my_timer.get_time_hhmmss()))
    parse_row(row)
    row_count += 1

print("Finished running script!")
print("Parsed %s rows" % row_count)
The my_timer is a timer class I've made, and the parse_row(row) function formats my data, transfers it to my local DB and eventually deletes it from the remote DB once the data is verified as having been moved to my local DB.
EDIT2:
It takes roughly 1 minute to parse every 100.000 rows in my DB, even after parsing around 4.000.000 rows:
Parsed 3800000 rows in 00:36:56
Parsed 3900000 rows in 00:37:54
Parsed 4000000 rows in 00:38:52
Parsed 4100000 rows in 00:39:50
I am having problems with a Python script which is basically just analysing a CSV file line-by-line and then inserting each line into a MySQL table using a FOR loop:
f = csv.reader(open(filePath, "r"))
i = 1
for line in f:
    if (i > skipLines):
        vals = nullify(line)
        try:
            cursor.execute(query, vals)
        except TypeError:
            sys.exc_clear()
    i += 1
return
Where the query is of the form:
query = ("insert into %s" % tableName) + (" values (%s)" % placeholders)
This is working perfectly fine with every file it is used for with one exception - the largest file. It stops at different points each time - sometimes it reaches 600,000 records, sometimes 900,000. But there are about 4,000,000 records in total.
I can't figure out why it is doing this. The table type is MyISAM. Plenty of disk space is available. The table reaches about 35MB when it stops. max_allowed_packet is set to 16MB, but I don't think that is a problem as it is executing line-by-line?
Anyone have any ideas what this could be? Not sure whether it is Python, MySQL or the MySQLdb module that is responsible for this.
Thanks in advance.
Have you tried MySQL's LOAD DATA INFILE function?
query = "LOAD DATA INFILE '/path/to/file' INTO TABLE atable FIELDS TERMINATED BY ',' ENCLOSED BY '\"' ESCAPED BY '\\\\'"
cursor.execute( query )
You can always pre-process the CSV file (at least that's what I do :)
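For example, a sketch of the kind of pre-processing I mean — rewriting empty fields as \N so that LOAD DATA reads them as NULL (the paths are placeholders):

import csv

with open('/path/to/file') as src, open('/path/to/file.clean', 'w') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # Empty fields become \N, which LOAD DATA interprets as NULL.
        writer.writerow([field if field != '' else r'\N' for field in row])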
Another thing worth trying would be bulk inserts. You could try to insert multiple rows with one query:
INSERT INTO x (a,b)
VALUES
('1', 'one'),
('2', 'two'),
('3', 'three')
Oh, yeah, and you don't need to commit since it's the MyISAM engine.
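A sketch of doing that from Python with executemany(), which MySQLdb turns into a single multi-row INSERT per call for simple INSERT ... VALUES queries (the batch size of 1000 is arbitrary; f, filePath, skipLines, nullify() and query are the names from your code):

f = csv.reader(open(filePath, "r"))
i = 1
batch = []
for line in f:
    if (i > skipLines):
        batch.append(nullify(line))
    if len(batch) >= 1000:
        cursor.executemany(query, batch)   # one multi-row INSERT instead of 1000 single-row ones
        batch = []
    i += 1

if batch:                                  # flush whatever is left
    cursor.executemany(query, batch)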
As S. Lott alluded to, aren't cursors used as handles into transactions?
So at any time the db is giving you the option of rolling back all those pending inserts.
You may simply have too many inserts for one transaction. Try committing the transaction every couple of thousand inserts.
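Something along these lines, keeping your loop as-is and just adding a periodic commit (2000 is an arbitrary choice, and connection here stands for whatever your MySQLdb connection object is called):

f = csv.reader(open(filePath, "r"))
i = 1
for line in f:
    if (i > skipLines):
        vals = nullify(line)
        try:
            cursor.execute(query, vals)
        except TypeError:
            sys.exc_clear()
    if i % 2000 == 0:
        connection.commit()    # flush the pending inserts every couple of thousand rows
    i += 1

connection.commit()            # and once more at the end for the remainder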