How to insert into sqlite faster [duplicate] - python

I read this: Importing a CSV file into a sqlite3 database table using Python
and it seems that everyone suggests using line-by-line reading instead of using bulk .import from SQLite. However, that will make the insertion really slow if you have millions of rows of data. Is there any other way to circumvent this?
Update: I tried the following code to insert line by line, but the speed is not as good as I expected. Is there any way to improve it?
for logFileName in allLogFilesName:
    logFile = codecs.open(logFileName, 'rb', encoding='utf-8')
    for logLine in logFile:
        logLineAsList = logLine.split('\t')
        output.execute('''INSERT INTO log VALUES(?, ?, ?, ?)''', logLineAsList)
    logFile.close()
connection.commit()
connection.close()

Since this is the top result on a Google search I thought it might be nice to update this question.
From the Python sqlite3 docs, you can use executemany():
import sqlite3

persons = [
    ("Hugo", "Boss"),
    ("Calvin", "Klein")
]

con = sqlite3.connect(":memory:")
# Create the table
con.execute("create table person(firstname, lastname)")
# Fill the table
con.executemany("insert into person(firstname, lastname) values (?, ?)", persons)
I have used this method to commit over 50k row inserts at a time and it's lightning fast.

Divide your data into chunks on the fly using generator expressions and make the inserts inside a transaction. Here's a quote from the SQLite optimization FAQ:
Unless already in a transaction, each SQL statement has a new
transaction started for it. This is very expensive, since it requires
reopening, writing to, and closing the journal file for each
statement. This can be avoided by wrapping sequences of SQL statements
with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup
is also obtained for statements which don't alter the database.
Here's how your code might look. This is a minimal sketch: the chunks and log_lines helpers and the 'log.db' filename are illustrative, and allLogFilesName is the list from your question.
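import codecs
import sqlite3
from itertools import chain, islice

def chunks(iterable, size=10000):
    # Lazily yield successive chunks of `size` rows from any iterable.
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

def log_lines(file_names):
    # Generator over the tab-separated fields of every log line.
    for name in file_names:
        with codecs.open(name, 'r', encoding='utf-8') as log_file:
            for line in log_file:
                yield line.rstrip('\n').split('\t')

connection = sqlite3.connect('log.db')
for chunk in chunks(log_lines(allLogFilesName)):
    # executemany consumes the whole chunk inside one implicit transaction;
    # committing once per chunk avoids a journal write for every row.
    connection.executemany('INSERT INTO log VALUES (?, ?, ?, ?)', chunk)
    connection.commit()
connection.close()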
Also, the sqlite3 command-line shell can import CSV files directly (.mode csv followed by .import file.csv tablename).

SQLite can do tens of thousands of inserts per second; just make sure to do all of them in a single transaction by surrounding the inserts with BEGIN and COMMIT. (executemany() does this automatically.)
As always, don't optimize before you know speed will be a problem. Test the easiest solution first, and only optimize if the speed is unacceptable.

Related

My program isn't changing MySQL database yet presents no error

I've written a program to scrape a website for data, place it into several arrays, iterate through each array and place it in a query and then execute the query. The code looks like this:
for count in range(391):
    query = #long query
    values = (doctor_names[count].encode("utf-8"), ...) #continues for about a dozen arrays
    cur.execute(query, values)
cur.close()
db.close()
I run the program and aside from a few truncation warnings everything goes fine. I open the database in MySQL Workbench and nothing has changed. I tried changing the arrays in the values to constant strings and running it but still nothing would change.
I then created an array to hold the last executed query: sql_queries.append(cur._last_executed) and pushed them out to a text file:
fo = open("foo.txt", "wb")
for q in sql_queries:
    fo.write(q)
fo.close()
This gives me a large text file with multiple queries. When I copy the whole text file into a new query in MySQL Workbench and execute it, it populates the database as desired. What is my program missing?
If your table uses a transactional storage engine, like InnoDB, then you need to call db.commit() to have the transaction stored:
for count in range(391):
    query = #long query
    values = (doctor_names[count].encode("utf-8"), ...)
    cur.execute(query, values)
db.commit()
cur.close()
db.close()
Note that with a transactional database, besides committing you also have the opportunity to handle errors by rolling back inserts or updates with db.rollback(). The db.commit() is required to finalize the transaction. Otherwise,
Closing a connection without committing the changes first will cause
an implicit rollback to be performed.
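Here is a minimal sketch of that error handling, assuming the MySQLdb driver (implied by the cur._last_executed attribute in your question); query and values are the ones from your loop:

import MySQLdb

try:
    for count in range(391):
        values = (doctor_names[count].encode("utf-8"), ...)  # continues as in the question
        cur.execute(query, values)
    db.commit()        # finalize the transaction: the inserts become permanent
except MySQLdb.Error:
    db.rollback()      # undo every insert queued in this transaction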

Memory efficient way of fetching postgresql unique dates?

I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with it.
But using py-postgresql and the .prepare() statement, I was hoping I could fetch entries on a "yield" basis and thus avoid filling up my memory with all the results from the database, which apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
result = db.prepare("SELECT time FROM mytable")
unique_days = []
with db.xact():
    for row in result():
        if not row['time'] in unique_days:
            unique_days.append(row['time'])
print(unique_days)
Before even getting to if not row['time'] in unique_days: I run out of memory, which isn't so strange considering result() probably fetches all the results before looping through them?
Is there a way to get the py-postgresql library to "page" or batch down the results, say 60k per round, or perhaps even rework the query so the database does more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format prior to adding them to the unique_days list.
If you were using the better-supported psycopg2 extension, you could use a loop over a named cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back such a cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so testing the switch will take a few more changes. I still recommend making it.
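If you do switch, here is a minimal sketch using a psycopg2 named (server-side) cursor; the connection details are the ones from your question, and the cursor name and itersize value are arbitrary choices of mine:

import psycopg2

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb',
                        user='test', password='test')
# A named cursor is backed by a server-side portal, so psycopg2 streams
# rows in batches of `itersize` instead of loading all 30 million at once.
cur = conn.cursor(name='day_scan')
cur.itersize = 60000
cur.execute("SELECT time FROM mytable")

unique_days = set()  # a set makes the membership test O(1)
for (t,) in cur:
    unique_days.add(t)
cur.close()
conn.close()
print(unique_days)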
You could let the database do all the heavy lifting.
Ex: Instead of reading all the data into Python and then calculating unique dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce a sort order on the unique dates returned, then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
ORDER BY 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line:
Ex:
SELECT * FROM mytable WHERE time BETWEEN %s AND %s;
where the two placeholders are bound to UNIQUE_DATES[i] and UNIQUE_DATES[j], parameters you would pass from Python.
I will leave it to you to figure out how to convert the dates back into Unix timestamps.
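For that last step, a hypothetical helper (day_bounds is my name, not a library function) that turns a 'YYYY-MM-DD' date back into the Unix-timestamp range covering that day, assuming the timestamps are UTC:

import calendar
from datetime import datetime

def day_bounds(day):
    # Convert 'YYYY-MM-DD' into the half-open Unix-timestamp
    # range [start, start + 86400) covering that day in UTC.
    start = calendar.timegm(datetime.strptime(day, '%Y-%m-%d').timetuple())
    return start, start + 86400

# e.g. use these as the BETWEEN bounds of the chunked query above
start, end = day_bounds('2013-03-27')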

Python SQLite cache in memory

I am using Python 2.7 and SQLite. I am building a database with millions of rows. I would like to only write out to disk occasionally, with the idea this will improve performance. My thought was to only call commit() from time to time. I have tried that with the code below. The selects in the middle show that we get consistent reads. But, when I look on disc, I see a file example.db-journal. This must be where the data is being cached. In which case this would gain me nothing in terms of performance. Is there a way to have the inserts collect in memory, and then flush them to disc? Is there a better way to do this?
Here is my sample code:
import sqlite3
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('CREATE TABLE if not exists stocks (date text, trans text, symbol text, qty real, price real)')
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get 2 rows as expected.
print c.fetchall()
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
conn.commit()
t = ('RHAT',)
c.execute('SELECT date, symbol, trans FROM stocks WHERE symbol=?', t)
# Here, we get all the rows as expected.
print c.fetchall()
conn.close()
Update:
Figured I would give an update with some code in case anyone runs across this problem.
I am processing 5+ million lines from a text file and needed a place to store the data for more processing. I originally had all the data in memory but was running out of memory. So I switched to SQLite as a disk cache. My original in-memory version of the processing took ~36 secs per 50,000 rows from the original text file.
After measuring, my first cut at a SQLite version of the batch processing took ~660 seconds for 50,000 lines. Based on the comments (thanks to the posters), I came up with the following code:
self.conn = sqlite3.connect('myDB.db', isolation_level='Exclusive')
self.cursor = self.conn.cursor()
self.cursor.execute('PRAGMA synchronous = 0')
self.cursor.execute('PRAGMA journal_mode = OFF')
In addition, I commit after processing 1000 lines from my text file.
if lineNum % 1000 == 0:
    self.conn.commit()
With that, 50,000 lines from the text file now take ~40 seconds. So I added about 11% to the overall time, but memory usage is constant, which is more important.
Firstly, are you sure you need this? For reading, the OS should cache the file anyway, and if you write a lot, not syncing to disc means you can lose data easily.
If you measure and identify this as a bottleneck, you can use an in-memory database using connect(':memory:') and get an iterator returning an sql dump on demand: http://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.iterdump
import sqlite3

in_memory = sqlite3.connect(':memory:')
# do stuff
con = sqlite3.connect('existing_db.db')
con.execute('drop table stocks')
for line in in_memory.iterdump():
    con.execute(line)
Again, measure whether you need this. If you have enough data that it matters, think hard about using a different data store, for example a full-blown DBMS like Postgres.
In your case, you are creating a db connection in autocommit mode, which means that every time you execute an INSERT statement, the database starts a transaction, executes the statement, and commits. So your commit() is, in this case, meaningless. See the sqlite3 section of the Python docs.
But you are correct that inserting a large quantity of rows should ideally be done within a transaction. This signals the connection that it should record all the incoming INSERT statements in the journal file but delay writing to the database file until the transaction is committed. Even though your execution is limited by I/O operations, writing to the journal file is not a serious performance penalty.
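To make that concrete, here is a minimal sketch contrasting the two modes, reusing the stocks table from the question (isolation_level=None puts sqlite3 into true autocommit, so the explicit BEGIN/COMMIT pair marks the single transaction):

import sqlite3

conn = sqlite3.connect('example.db', isolation_level=None)  # autocommit mode
conn.execute('CREATE TABLE IF NOT EXISTS stocks '
             '(date text, trans text, symbol text, qty real, price real)')

rows = [('2006-01-05', 'BUY', 'RHAT', 100, 35.14)] * 1000

# In autocommit mode each INSERT would commit (and cycle the journal) by
# itself; wrapping the whole batch in one transaction means one journal write.
conn.execute('BEGIN')
conn.executemany('INSERT INTO stocks VALUES (?, ?, ?, ?, ?)', rows)
conn.execute('COMMIT')
conn.close()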

optimize pymssql code

I am inserting records into SQL Server from Python using pymssql. The database takes 2 milliseconds to execute a query, yet it inserts only 6 rows per second, so the problem must be on the code side. How can I optimize the following code, and what is the fastest method to insert records?
def save(self):
    conn = pymssql.connect(host=dbHost, user=dbUser,
                           password=dbPassword, database=dbName, as_dict=True)
    cur = conn.cursor()
    self.pageURL = self.pageURL.replace("'", "''")
    query = "my query is there"
    cur.execute(query)
    conn.commit()
    conn.close()
It looks like you're creating a new connection per insert there. That's probably the major reason for the slowdown: building new connections is typically quite slow. Create the connection outside the method and you should see a large improvement. You can also create the cursor outside the function and re-use it, which will be another speedup.
Depending on your situation, you may also want to use the same transaction for more than a single insertion. This changes the behaviour a little -- since a transaction is supposed to be atomic and either completely succeeds or completely fails -- but committing a transaction is typically a slow operation, because it has to be certain the whole operation succeeded.
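A sketch of that restructuring; the Saver class and the pages table are illustrative, while dbHost, dbUser, dbPassword and dbName come from the question:

import pymssql

class Saver(object):
    def __init__(self):
        # Open the connection and cursor once, instead of once per insert.
        self.conn = pymssql.connect(host=dbHost, user=dbUser,
                                    password=dbPassword, database=dbName,
                                    as_dict=True)
        self.cur = self.conn.cursor()

    def save(self, page_url):
        # Parameter binding lets pymssql handle the quoting, so the manual
        # replace("'", "''") from the question is no longer needed.
        self.cur.execute("INSERT INTO pages (url) VALUES (%s)", (page_url,))

    def close(self):
        self.conn.commit()   # one commit covers the whole batch
        self.conn.close()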
In addition to Thomas' great advice,
I'd suggest you look into executemany()*, e.g.:
cur.executemany("INSERT INTO persons VALUES (%d, %s)",
                [(1, 'John Doe'), (2, 'Jane Doe')])
...where the second argument of executemany() should be a sequence of rows to insert.
This brings up another point:
You probably want to send your query and query parameters as separate arguments to either execute() or executemany(). This will allow the PyMSSQL module to handle any quoting issues for you.
*executemany() as described in the Python DB-API:
.executemany(operation, seq_of_parameters)
Prepare a database operation (query or command) and then execute it against all parameter sequences or mappings found in the sequence seq_of_parameters.

Python MySQL query not completing

I am having problems with a Python script which is basically just analysing a CSV file line by line and then inserting each line into a MySQL table using a for loop:
f = csv.reader(open(filePath, "r"))
i = 1
for line in f:
    if (i > skipLines):
        vals = nullify(line)
        try:
            cursor.execute(query, vals)
        except TypeError:
            sys.exc_clear()
    i += 1
return
Where the query is of the form:
query = ("insert into %s" % tableName) + (" values (%s)" % placeholders)
This is working perfectly fine with every file it is used for with one exception - the largest file. It stops at different points each time - sometimes it reaches 600,000 records, sometimes 900,000. But there are about 4,000,000 records in total.
I can't figure out why it is doing this. The table type is MyISAM. Plenty of disk space is available. The table reaches about 35MB when it stops. max_allowed_packet is set to 16MB, but I don't think that is the problem as it is executing line by line?
Anyone have any ideas what this could be? Not sure whether it is Python, MySQL or the MySQLdb module that is responsible for this.
Thanks in advance.
Have you tried MySQL's LOAD DATA INFILE function?
query = "LOAD DATA INFILE '/path/to/file' INTO TABLE atable FIELDS TERMINATED BY ',' ENCLOSED BY '\"' ESCAPED BY '\\\\'"
cursor.execute( query )
You can always pre-process the CSV file (at least that's what I do :)
Another thing worth trying would be bulk inserts. You could try to insert multiple rows with one query:
INSERT INTO x (a,b)
VALUES
('1', 'one'),
('2', 'two'),
('3', 'three')
Oh, yeah, and you don't need to commit since it's the MyISAM engine.
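Note that with MySQLdb you don't have to build that multi-row statement by hand: its executemany() recognizes a plain INSERT ... VALUES statement and batches the rows into a single multi-row INSERT for you. A minimal sketch, assuming the table x from the snippet above and the cursor from the question:

rows = [('1', 'one'), ('2', 'two'), ('3', 'three')]
# MySQLdb rewrites this into one multi-row INSERT under the hood.
cursor.executemany("INSERT INTO x (a, b) VALUES (%s, %s)", rows)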
As S. Lott alluded to, aren't cursors used as handles into transactions?
So at any time the db is giving you the option of rolling back all those pending inserts.
You may simply have too many inserts for one transaction. Try committing the transaction every couple of thousand inserts.
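Here is a sketch of that change applied to the loop from the question; 2000 is an arbitrary batch size, and db is assumed to be the MySQLdb connection the cursor came from:

f = csv.reader(open(filePath, "r"))
for i, line in enumerate(f, 1):
    if i > skipLines:
        vals = nullify(line)
        try:
            cursor.execute(query, vals)
        except TypeError:
            sys.exc_clear()
    if i % 2000 == 0:
        db.commit()   # flush the pending inserts in batches
db.commit()           # commit the final partial batch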
