Python SQLITE3 Inserting Backwards

I have a small piece of code which inserts some data into a database. However, the data is being inserted in reverse order.
If I commit after the for loop has run through, it inserts backwards; if I commit as part of the for loop, it inserts in the correct order, but it is much slower.
How can I commit after the for loop but still retain the correct order?
import subprocess, sqlite3
output4 = subprocess.Popen(['laZagne.exe', 'all'], stdout=subprocess.PIPE).communicate()[0]
lines4 = output4.splitlines()
conn = sqlite3.connect('DBNAME')
cur = conn.cursor()
for j in lines4:
    print j
    cur.execute('insert into Passwords (PassString) VALUES (?)', (j,))
conn.commit()
conn.close()

You can't rely on any ordering in SQL database tables. Insertion takes place in an implementation-dependent manner, and where rows end up depends entirely on the storage implementation used and the data that is already there.
As such, no reversing takes place; if you are selecting data from the table again and these rows come back in a reverse order, then that's a coincidence and not a choice the database made.
If rows must come back in a specific order, use ORDER BY when selecting. You could order by ROWID, for example, which generally increases monotonically for new rows and thus approximates insertion order. See ROWIDs and the INTEGER PRIMARY KEY.
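For example, a minimal sketch of committing once after the loop and then reading the rows back in insertion order (table and column names taken from the question; the sample lines4 list stands in for the subprocess output):
import sqlite3

lines4 = [b'first line', b'second line']  # stand-in for the laZagne output in the question

conn = sqlite3.connect('DBNAME')
cur = conn.cursor()

# Queue all inserts, then commit once for the whole batch.
cur.executemany('INSERT INTO Passwords (PassString) VALUES (?)',
                ((line,) for line in lines4))
conn.commit()

# ROWID generally increases for new rows, so ordering by it approximates
# insertion order when reading the data back.
for (password,) in cur.execute('SELECT PassString FROM Passwords ORDER BY ROWID'):
    print(password)

conn.close()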

Related

purging a huge data mysql table using python

I have a table with about 1000M rows and need an automated script that keeps only the last 7 days of data and deletes anything older. I want to do it in Python using a chunking approach, deleting chunk by chunk.
Is there any Python library that supports this chunking concept for MySQL?
If not, can anyone suggest the best way to do chunked deletes with MySQL?
I'm unaware of a Python package that has an API for "chunking" deletes from a MySQL table. SQLAlchemy provides a fluent interface that can do this, but it's not much different from the raw SQL. I suggest using PyMySQL.
import datetime
import pymysql.cursors

connection = pymysql.connect(
    host='host',
    user='user',
    password='password',
    database='database'
)

seven_days_before_now = datetime.datetime.now() - datetime.timedelta(days=7)
chunksize = 1000

with connection.cursor() as cursor:
    sql = 'DELETE FROM `mytable` WHERE `timestamp` < %s ORDER BY `id` LIMIT %s;'
    num_deleted = None
    while num_deleted != 0:
        num_deleted = cursor.execute(sql, (seven_days_before_now, chunksize))
        connection.commit()
The LIMIT just limits the number of deleted rows to the chunksize. The ORDER BY ensures that the DELETE is deterministic, and it sorts by the primary key because the primary key is guaranteed to be indexed; so even though it sorts for each chunk, at least it's sorting on an indexed column. Remove the ORDER BY if deterministic behavior is unnecessary; it will result in faster execution.
You'll need to replace the connection details, table name, column name and chunksize. Also, this solution assumes that the table has a column named id which is the primary key and an auto-incrementing integer. You'll need to make some changes if your schema differs.
As Bernd Buffen commented: the correct way to get the behavior you desire is to partition the table. Please consider a migration to do so.
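If you do go the partitioning route, here is a rough, hypothetical sketch of what it could look like (partition names and layout are made up; note that MySQL requires the partitioning column to appear in every unique key, so the primary key may need to include timestamp):
import pymysql

connection = pymysql.connect(host='host', user='user',
                             password='password', database='database')

with connection.cursor() as cursor:
    # One-time migration: partition the table by day (example partitions only).
    cursor.execute("""
        ALTER TABLE `mytable`
        PARTITION BY RANGE (TO_DAYS(`timestamp`)) (
            PARTITION p20211001 VALUES LESS THAN (TO_DAYS('2021-10-02')),
            PARTITION p20211002 VALUES LESS THAN (TO_DAYS('2021-10-03')),
            PARTITION pmax VALUES LESS THAN MAXVALUE
        )
    """)
    # Daily cleanup: dropping a whole partition is a metadata operation and is
    # far cheaper than DELETEing millions of rows.
    cursor.execute("ALTER TABLE `mytable` DROP PARTITION p20211001")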
And, of course: stop using Python 2, it's been unsupported for almost two years as of the first version of this answer.

SQLite: Batch aggregation with insert

I would like to do the following:
cur.execute("SELECT key, SUM(val) FROM table GROUP BY key")
cur.executemany("INSERT INTO table_sums VALUES(?,?)",(row for row in cur))
in a single SQLite statement with batch processing if possible, that is it does the sum only for a number of keys, inserts, continues till all are processed.
Apparently I am using Python right now but as I am asking for a single statement (if exists), I don't think this should matter. If it doesn't exist, perhaps there is an efficient(!) work-around in Python?
EDIT: To avoid a SELECT WHERE query, it would actually be desirable not to produce complete sums for a subset of keys, but to just sum over the first n rows and store the resulting sums so far, then continue with the next n...
The two SQL statements can be combined into one using a CTE (a temporary view):
WITH tempsums AS (
    SELECT key, SUM(val)
    FROM "table"
    WHERE key IN :batch
    GROUP BY key
)
INSERT INTO table_sums SELECT * FROM tempsums
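From Python, one way to drive this in batches is to build the IN list per chunk, since sqlite3 cannot bind a Python list to a single placeholder. A rough sketch using the question's table and column names (the batch size and the extra DISTINCT-keys pass are assumptions):
import sqlite3

conn = sqlite3.connect('example.db')
cur = conn.cursor()

# Fetch the distinct keys once, then aggregate and insert n keys at a time.
keys = [row[0] for row in cur.execute('SELECT DISTINCT key FROM "table"')]
n = 1000

for i in range(0, len(keys), n):
    batch = keys[i:i + n]
    placeholders = ','.join('?' for _ in batch)
    cur.execute(
        'WITH tempsums AS ('
        '  SELECT key, SUM(val) FROM "table"'
        '  WHERE key IN ({}) GROUP BY key'
        ') INSERT INTO table_sums SELECT * FROM tempsums'.format(placeholders),
        batch,
    )

conn.commit()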

Inserting rows while looping over result set

I am working on a program to clone rows in my database from one user to another. It works by selecting the rows, editing a few values and then inserting them back.
I also need to store the newly inserted rowIDs with their existing counterparts so I can clone some other link tables later on.
My code looks like the following:
import mysql.connector
from collections import namedtuple
con = mysql.connector.connect(host='127.0.0.1')
selector = con.cursor(prepared=True)
insertor = con.cursor(prepared=True)
user_map = {}
selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)
for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid
selector.close()
insertor.close()
When running this code, I get the following error:
mysql.connector.errors.InternalError: Unread result found
I'm assuming this is because I am trying to run an INSERT while I am still looping over the SELECT, but I thought using two cursors would fix that. Why do I still get this error with multiple cursors?
I found a solution using fetchall(), but I was afraid that would use too much memory as there could be thousands of results returned from the SELECT.
import mysql.connector
from collections import namedtuple
con = mysql.connector.connect(host='127.0.0.1')
cursor = con.cursor(prepared=True)
user_map = {}
cursor.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', cursor.column_names)
for curr_row in map(Row._make, cursor.fetchall()):
    new_row = curr_row._replace(userID=None, companyID=95)
    cursor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = cursor.lastrowid
cursor.close()
This works, but it's not very fast. I thought that not using fetchall() would be quicker, but it seems that if I don't fetch the full result set, MySQL complains.
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Yes. Use two MySQL connections: one for reading and the other for writing.
The performance impact isn't too bad, as long as you don't have thousands of instances of the program trying to connect to the same MySQL server.
One connection is reading a result set, and the other is inserting rows to the end of the same table, so you shouldn't have a deadlock. It would be helpful if the WHERE condition you use to read the table could explicitly exclude the rows you're inserting, if there's a way to tell the new rows apart from the old rows.
At some level, the performance impact of two connections doesn't matter, because you don't have much choice. The only other way to do what you want is to slurp the whole result set into RAM in your program, close your reading cursor, and then write.
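A minimal sketch of the two-connection version, reusing the code from your question (same table and placeholder values):
import mysql.connector
from collections import namedtuple

read_con = mysql.connector.connect(host='127.0.0.1')   # connection used for reading
write_con = mysql.connector.connect(host='127.0.0.1')  # connection used for writing

selector = read_con.cursor(prepared=True)
insertor = write_con.cursor(prepared=True)

user_map = {}
selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)

for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

write_con.commit()
selector.close()
insertor.close()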

SQLite get id and insert when using executemany

I am optimising my code and reducing the number of queries. These used to be in a loop, but I am trying to restructure my code to work like this. How do I get the second query working so that it uses the ID inserted by the first query for each row? Assume that the datasets are in the right order too.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be in a for loop but were taking too long, so I am trying to avoid running them individually. I believe there might be a way to combine them into one query, but I am unsure how this would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
You can do this implicitly if your program is single-threaded as follows:
1. Set the AUTOINCREMENT keyword in your table definition. This will guarantee that any generated row IDs will be higher than any that appear in the table currently.
2. Immediately before the first statement, determine the highest ROWID in use in the table:
   oldmax ← Execute("SELECT max(ROWID) FROM nodes")
3. Perform the first insert as before.
4. Read back the row IDs that were actually assigned with a SELECT statement:
   NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax)
5. Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
6. Perform the second insert as before.
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.
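Putting the steps together in Python, using the names from your code (self.c and new_values as in the question; the zip-based pairing assumes executemany inserts new_values in order, and COALESCE guards against an empty table):
# Steps 1-2: AUTOINCREMENT is set on nodes; record the current highest ROWID.
oldmax = self.c.execute("SELECT COALESCE(MAX(ROWID), 0) FROM nodes").fetchone()[0]

# Step 3: first insert, unchanged.
self.c.executemany(
    "INSERT INTO nodes (node_value, node_group) "
    "VALUES (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)",
    new_values)

# Step 4: read back the row IDs that SQLite actually assigned.
new_ids = [row[0] for row in self.c.execute(
    "SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", (oldmax,))]

# Step 5: pair each parent ID from new_values with its freshly assigned child ID.
connection_values = [(parent_id, new_id)
                     for (_, parent_id), new_id in zip(new_values, new_ids)]

# Step 6: second insert, unchanged.
self.c.executemany(
    "INSERT INTO connections (parent, child, strength) VALUES (?,?,1)",
    connection_values)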

Is sqlite3 fetchall necessary?

I just started using sqlite3 with Python.
I would like to know the difference between:
cursor = db.execute("SELECT customer FROM table")
for row in cursor:
    print row[0]
and
cursor = db.execute("SELECT customer FROM table")
for row in cursor.fetchall():
    print row[0]
Apart from the fact that cursor is <type 'sqlite3.Cursor'> and cursor.fetchall() is <type 'list'>, both of them give the same result.
Is there a difference, a preference, or specific cases where one is preferred over the other?
fetchall() reads all records into memory, and then returns that list.
When iterating over the cursor itself, rows are read only when needed.
This is more efficient when you have much data and can handle the rows one by one.
The main difference is precisely the call to fetchall(). Calling fetchall() returns a list filled with all the elements remaining in your initial query (all elements if you haven't fetched anything yet). This has several drawbacks:
Increments memory usage: by storing all the query's elements in a list, which could be huge
Bad performance: filling the list can be quite slow if there are many elements
Process termination: If it is a huge query then your program might crash by running out of memory.
When you instead use the cursor iterator (for e in cursor:), you get the query's rows lazily. This means rows are returned one by one, only when the program requires them.
The output of your two code snippets is certainly the same, but internally there's a big performance difference between calling fetchall() and iterating over the cursor.
Hope this helps!
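As a middle ground not covered above, sqlite3 cursors also offer fetchmany(), which reads a fixed-size batch per call. A small sketch with the query from the question ("table" is quoted here because it is an SQL keyword):
import sqlite3

db = sqlite3.connect('example.db')
cursor = db.execute('SELECT customer FROM "table"')

while True:
    rows = cursor.fetchmany(1000)   # at most 1000 rows held in memory per batch
    if not rows:
        break
    for row in rows:
        print(row[0])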
