How to retrieve data from SQLite faster in python

I have the following info in my database (example):
longitude (real): 70.74
userid (int): 12
This is how I fetch it:
import sqlite3 as lite
con = lite.connect(dbpath)
with con:
    cur = con.cursor()
    cur.execute('SELECT latitude, userid FROM message')
    con.commit()
    print "executed"
    while True:
        tmp = cur.fetchone()
        if tmp != None:
            info.append([tmp[0],tmp[1]])
        else:
            break
to get the same info in the form [70.74, 12].
What else can I do to speed up this process? At 10,000,000 rows this takes approximately 50 seconds. I am aiming for 200,000,000 rows, but I never get through that many, possibly due to a memory leak or something like that.

From the sqlite3 documentation:
A Row instance serves as a highly optimized row_factory for Connection objects. It tries to mimic a tuple in most of its features.
Since a Row closely mimics a tuple, depending on your needs you may not even need to unpack the results.
However, since your numerical types are stored as strings, we do need to do some processing. As @Jon Clements pointed out, the cursor is an iterable, so we can just use a comprehension, obtaining the floats and ints at the same time.
import sqlite3 as lite
with lite.connect(dbpath) as conn:
    cur = conn.execute('SELECT latitude, userid FROM message')
    items = [[float(x[0]), int(x[1])] for x in cur]
EDIT: We're not making any changes, so we don't need to call commit.
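For very large tables it may also help to pull rows in batches rather than one fetchone() call per row. The following is a minimal Python 3 sketch of that idea, not a benchmarked solution: the in-memory database, the sample rows, the load_pairs helper and the batch size are all assumptions made for illustration.

```python
import sqlite3


def load_pairs(conn, batch_size=10000):
    """Return [[latitude, userid], ...], reading the result set in batches."""
    cur = conn.execute("SELECT latitude, userid FROM message")
    items = []
    while True:
        batch = cur.fetchmany(batch_size)  # list of up to batch_size tuples
        if not batch:
            break
        items.extend([float(lat), int(uid)] for lat, uid in batch)
    return items


# Hypothetical sample data standing in for the real message table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (latitude REAL, userid INTEGER)")
conn.executemany("INSERT INTO message VALUES (?, ?)", [(70.74, 12), (59.91, 7)])
pairs = load_pairs(conn)
conn.close()
```

Note that holding 200,000,000 two-element lists in memory is expensive regardless of how they are fetched, so processing each batch and discarding it may be the more realistic route.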

Related

How to store and query hex values in mysqldb

I want to use a thermal printer with raspberry pi. I want to receive the printer vendor id and product id from mysql database. My columns are of type varchar.
My code is
import MySQLdb
from escpos.printer import Usb
db= MySQLdb.connect(host=HOST, port=PORT,user=USER, passwd=PASSWORD, db=database)
cursor = db.cursor()
sql = ("select * from printerdetails")
cursor.execute(sql)
result = cursor.fetchall()
db.close()
for row in result:
    printer_vendor_id = row[2]
    printer_product_id = row[3]
    input_end_point = row[4]
    output_end_point = row[5]
print printer_vendor_id,printer_product_id,input_end_point,output_end_point
Printer = Usb(printer_vendor_id,printer_product_id,0,input_end_point,output_end_point)
Printer.text("Hello World")
Printer.cut()
but it does not work. The ids are strings: the print command shows 0x154f 0x0517 0x82 0x02. In my case
Printer = Usb(0x154f,0x0517,0,0x82,0x02)
works fine. How could I store the same ids in the database and use them to configure the printer?

Your problem is that your call to Usb is expecting integers, which works if you call it like this
Printer = Usb(0x154f,0x0517,0,0x82,0x02)
but your database call is returning tuples of hexadecimal values stored as strings. So you need to convert those strings to integers, like this:
for row in result:
    printer_vendor_id = int(row[2],16)
    printer_product_id = int(row[3],16)
    input_end_point = int(row[4],16)
    output_end_point = int(row[5],16)
Now if you do
print printer_vendor_id,printer_product_id,input_end_point,output_end_point
you will get
5455 1303 130 2
which might look wrong, but isn't. You can check by asking for the integers to be shown in hex format:
print ','.join('0x{0:04x}'.format(i) for i in (printer_vendor_id,printer_product_id,input_end_point,output_end_point))
0x154f,0x0517,0x0082,0x0002
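The round trip between the stored strings and integers can be checked in isolation. This is a small Python 3 sketch; the parse_hex helper and the sample tuple are hypothetical and only mirror the values quoted in the question.

```python
def parse_hex(value):
    """Convert a hex string such as '0x154f' (the prefix is optional) to an int."""
    return int(value, 16)


# Hypothetical values as they would come back from the varchar columns.
stored = ("0x154f", "0x0517", "0x82", "0x02")
ids = [parse_hex(v) for v in stored]

# Format the integers back as four-digit hex to confirm nothing was lost.
formatted = ",".join("0x{0:04x}".format(i) for i in ids)
```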
I should point out that this only works because your database table contains only one row. The for row in result loop goes through all of the rows in your table, but there happens to be only one, which is okay. If there were more, your code would always end up with the last row of the table, because it doesn't check the identifier of the row and so repeatedly assigns values to the same variables until it runs out of data.
The way to fix that is to put a where clause in your SQL select statement. Something like
"select * from printerdetails where id = '{0}'".format(printer_id)
Now, because I don't know what your database table looks like, the column name id is almost certainly wrong. And very likely the datatype also: it might very well not be a string.
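Rather than formatting the value into the SQL string, the row filter can be passed as a query parameter. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs on its own; the table layout, the id column and the fetch_printer helper are assumptions. With MySQLdb the placeholder would be %s instead of ?.

```python
import sqlite3


def fetch_printer(conn, printer_id):
    """Return (vendor_id, product_id) as ints for one row, or None if absent."""
    row = conn.execute(
        "SELECT vendor_id, product_id FROM printerdetails WHERE id = ?",
        (printer_id,),  # value is bound by the driver, not pasted into the SQL
    ).fetchone()
    if row is None:
        return None
    return int(row[0], 16), int(row[1], 16)


# Hypothetical table contents; the real schema is not known.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE printerdetails (id INTEGER PRIMARY KEY, vendor_id TEXT, product_id TEXT)"
)
conn.executemany(
    "INSERT INTO printerdetails VALUES (?, ?, ?)",
    [(1, "0x154f", "0x0517"), (2, "0x04b8", "0x0202")],
)
selected = fetch_printer(conn, 1)
missing = fetch_printer(conn, 99)
conn.close()
```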

Pythonic way to mock the pyodbc.Row

What is the pythonic way to do the proper unittesting of a function that depends on the SQL query made by pyodbc? As I understand the best way is to mock the function that returns output from SQL server. The problem is what the mock should return?
My setup:
In lib1:
def selectSQL(connection, query):
    cursor = connection.cursor()
    cursor.execute(query)
    return cursor.fetchall()
In lib2:
def function_to_be_tested(cxnx):
    my_query = "SELECT r1, r2 FROM t1"
    rows = lib1.selectSQL(cxnx, my_query)
    # do something with the rows like:
    a = 0
    for row in rows:
        a += row.r1 * row.r2
    return a
I have come up with the following solution:
Print the output of lib1.selectSQL(cxnx, my_query) to a file
Insert the data from lib1.selectSQL into a namedtuple
out_tuple = namedtuple('out1', ["r1", "r2"])
printed_data = [(1,2),(2,3)]
out = [out_tuple(*row) for row in printed_data]
def test_mockSelectSQL(self):
    lib1.selectSQL = MagicMock()
    lib1.selectSQL.side_effect = [out]
    # 1*2 + 2*3 = 8
    self.assertEqual(lib2.function_to_be_tested(True), 8)
My only concern is that the mock returns namedtuple not the pyodbc.Row like the original function. I have checked following sites in search for the information on how to properly create pyodbc.Row:
https://github.com/mkleehammer/pyodbc/blob/master/tests2/informixtests.py
https://github.com/mkleehammer/pyodbc/wiki/Row
In the pyodbc unit tests there is no constructor for it, and I have not found one in the source code either (but I am a novice, so I might have missed it). However, I found the following information in the Row documentation:
However, there are some pyodbc additions that make them very convenient:
Values can be accessed by column name.
The Cursor.description values can be accessed even after the cursor is closed.
Values can be replaced.
Rows from the same select statement share memory.
So it seems that the namedtuple in fact behaves in the same way as pyodbc.Row (when it comes to accessing the values). Is there a more pythonic way to do a unittest on pyodbc.Row? Can one assume that this is a good mock?

Further to the suggestion from @Nullman in a comment to the question, if you wanted to use an in-memory database you might try using the SQLite ODBC driver so you can return actual pyodbc.Row objects like so:
import pyodbc
conn_str = 'Driver=SQLite3 ODBC Driver;Database=:memory:'
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
# create test data
crsr.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, dtm DATETIME)")
crsr.execute("INSERT INTO table1 (dtm) VALUES ('2017-07-26 08:08:08')")
# test retrieval
crsr.execute("SELECT * FROM table1")
print(crsr.fetchall())
# prints:
# [(1, datetime.datetime(2017, 7, 26, 8, 8, 8))]
crsr.close()
cnxn.close()
I just tested it and it worked for me in PyCharm on Windows.

Python foreach not looping properly

I'm writing a script that formats a bunch of csv files into one csv file.
To do this, I'm using a couple of cursor tables in python via sqlite.
Here is my code - currently I'm just trying to get every row in gsap that is associated with a code that is in gsap_locs to print
data = c.execute("SELECT * from gsap_locs")
for row in data:
    print row[0]
    d2 = c.execute("select date, cardtype, volume, transactions from gsap where gsaploc=?", (row[0],))
    for r2 in d2:
        print r2
However, my code is only returning one row. I know that the problem isn't in the first for because when I take out everything after print row[0] it prints out all of the values from the first select.
Why is it escaping out of my first for after my second for runs without satisfying the conditions of the first for?

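You are missing the fetchall or fetchone calls.
It's a common mistake: we assume that execute has already fetched the data, but you should use fetch. Calling fetchall() on the first query also stores its rows in a list before the cursor is reused for the second query.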
To retrieve data after executing a SELECT statement, you can either treat the cursor as an iterator, call the cursor’s fetchone() method to retrieve a single matching row, or call fetchall() to get a list of the matching rows.
import sqlite3
conn = sqlite3.connect('gasp.sqlite')
c = conn.cursor()
c.execute("SELECT * FROM gsap_locs")
rows = c.fetchall()
for row in rows:
    print row[0]
    c.execute("select * from gsap where loc=?", (row[0],))
    d2 = c.fetchall()
    for r2 in d2:
        print r2
conn.close()

It looks like a cursor can only track one executed statement (and hand out one iterator) at a time, so the second execute discards the remaining results of the first. You might want to keep the results of the first operation in memory by calling tuple on it:
data = tuple(c.execute("SELECT * from gsap_locs"))
for row in data:
    ...
Be sure to have enough memory to hold all the results from the first query.
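If holding the first result set in memory is a concern, another option is to give each query its own cursor, so that the inner execute does not reset the outer iteration. A minimal Python 3 sketch with made-up tables and rows (the column names code and volume are assumptions):

```python
import sqlite3

# Hypothetical miniature versions of the two tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gsap_locs (code TEXT)")
conn.execute("CREATE TABLE gsap (gsaploc TEXT, volume INTEGER)")
conn.executemany("INSERT INTO gsap_locs VALUES (?)", [("A",), ("B",)])
conn.executemany("INSERT INTO gsap VALUES (?, ?)", [("A", 1), ("A", 2), ("B", 3)])

outer = conn.cursor()  # iterates over the locations
inner = conn.cursor()  # separate cursor for the per-location lookup

found = []
for (code,) in outer.execute("SELECT code FROM gsap_locs ORDER BY code"):
    inner.execute(
        "SELECT volume FROM gsap WHERE gsaploc = ? ORDER BY volume", (code,)
    )
    for (volume,) in inner:
        found.append((code, volume))
conn.close()
```

A single JOIN between the two tables would usually be simpler still, if the script does not need the per-location loop for other reasons.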

cleaning a Postgres table of bad rows

I have inherited a Postgres database, and am currently in the process of cleaning it. I have created an algorithm to find the rows where the data is bad. The algorithm is encoded in a function called checkProblem(). Using this, I am able to count the bad rows in each table, as shown below ...
schema = findTables(dbName)
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
cur = conn.cursor()
results = []
for t in tqdm(sorted(schema.keys())):
    n = 0
    cur.execute('select * from %s'%t)
    for i, cs in enumerate(tqdm(cur)):
        if checkProblem(cs):
            n += 1
    results.append({
        'tableName': t,
        'totalRows': i+1,
        'badRows' : n,
    })
cur.close()
conn.close()
print pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
Now, I need to delete the rows that are bad. I have two different ways of doing it. First, I can write the clean rows to a temporary table, and rename the table. I think that this option is too memory-intensive. It would be much better if I were able to just delete the specific record at the cursor. Is this even an option?
Otherwise, what is the best way of deleting a record under such circumstances? I am guessing that this should be a relatively common thing that database administrators do ...

Deleting the specific record as you go is certainly the better option. You can do something like:
del_cur = conn.cursor()
for i, cs in enumerate(tqdm(cur)):
    if checkProblem(cs):
        # assumes cs is a tuple with cs[0] being the record id
        # a second cursor is used because calling execute() on cur
        # would discard the rows that are still being iterated
        del_cur.execute('delete from %s where id=%%s' % t, (cs[0],))
conn.commit()
Or you can store the ids of the bad records and then do something like
DELETE FROM table WHERE id IN (id1,id2,id3,id4)
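The second variant (collect the ids first, then delete them in one statement) might look like the following Python 3 sketch. It uses sqlite3 as a stand-in so that it runs without a Postgres server; the readings table, its columns and check_problem are hypothetical. With psycopg2 the placeholder would be %s rather than ?.

```python
import sqlite3


def check_problem(row):
    """Hypothetical stand-in for checkProblem: a missing value is 'bad'."""
    return row[1] is None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 1.5), (2, None), (3, 2.5), (4, None)],
)

# First pass: only remember the ids of the bad rows.
bad_ids = [
    row[0]
    for row in conn.execute("SELECT id, value FROM readings")
    if check_problem(row)
]

# Second pass: one DELETE with one placeholder per id.
if bad_ids:
    placeholders = ", ".join("?" for _ in bad_ids)
    conn.execute("DELETE FROM readings WHERE id IN (%s)" % placeholders, bad_ids)
    conn.commit()

remaining = [row[0] for row in conn.execute("SELECT id FROM readings ORDER BY id")]
conn.close()
```

For tables with a very large number of bad rows, the id list would probably need to be split into chunks, since databases limit the size of a single statement.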

SQL multiple inserts with Python

UPDATE
After passing execute() a list of rows as per Nathan's suggestion, below, the code executes further but still gets stuck on the execute function. The error message reads:
query = query % db.literal(args)
TypeError: not all arguments converted during string formatting
So it still isn't working. Does anybody know why there is a type error now?
END UPDATE
I have a large mailing list in .xls format. I am using python with xlrd to retrieve the name and email from the xls file into two lists. Now I want to put each name and email into a mysql database. I'm using MySQLdb for this part. Obviously I don't want to do an insert statement for every list item.
Here's what I have so far.
from xlrd import open_workbook, cellname
import MySQLdb
dbname = 'h4h'
host = 'localhost'
pwd = 'P#ssw0rd'
user = 'root'
book = open_workbook('h4hlist.xls')
sheet = book.sheet_by_index(0)
mailing_list = {}
name_list = []
email_list = []
for row in range(sheet.nrows):
    """name is in the 0th col. email is the 4th col."""
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    if name and email:
        mailing_list[name] = email
for n, e in sorted(mailing_list.iteritems()):
    name_list.append(n)
    email_list.append(e)
db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.execute("""INSERT INTO mailing_list (name,email) VALUES (%s,%s)""",
               (name_list, email_list))
The problem occurs when the cursor executes. This is the error: _mysql_exceptions.OperationalError: (1241, 'Operand should contain 1 column(s)') I tried putting my query into a var initially, but then it just barfed up a message about passing a tuple to execute().
What am I doing wrong? Is this even possible?
The list is huge and I definitely can't afford to put the insert into a loop. I looked at using LOAD DATA INFILE, but I really don't understand how to format the file or the query and my eyes bleed when I have to read MySQL docs. I know I could probably use some online xls to mysql converter, but this is a learning exercise for me as well. Is there a better way?

You need to give executemany() a list of rows. You don't need to break the name and email out into separate lists, just create one list with both of the values in it.
rows = []
for row in range(sheet.nrows):
    """name is in the 0th col. email is the 4th col."""
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    rows.append((name, email))
db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.executemany("""INSERT INTO mailing_list (name,email) VALUES (%s,%s)""", rows)
Update: as @JonClements mentions, it should be executemany() not execute().

To fix TypeError: not all arguments converted during string formatting - you need to use the cursor.executemany(...) method, as this accepts an iterable of tuples (more than one row), while cursor.execute(...) expects the parameter to be a single row value.
After the command is executed, you need to ensure that the transaction is committed to make the changes active in the database by using db.commit().
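Both points (executemany() with a list of row tuples, followed by a commit) can be tried without a MySQL server. The Python 3 sketch below uses sqlite3 purely as a stand-in, so the placeholder is ? where MySQLdb expects %s; the sample rows are made up.

```python
import sqlite3

# Hypothetical rows; in the question they come from the spreadsheet.
rows = [("Jim", "jim@example.com"), ("Lucy", "lucy@example.com")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mailing_list (name TEXT, email TEXT)")

cur = conn.cursor()
# One statement, executed once per tuple in rows.
cur.executemany("INSERT INTO mailing_list (name, email) VALUES (?, ?)", rows)
conn.commit()  # make the inserts permanent

stored = conn.execute(
    "SELECT name, email FROM mailing_list ORDER BY name"
).fetchall()
conn.close()
```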

If you are interested in high performance, this answer may be better.
Compared to the executemany method, a single multi-row execute like the one below is much faster:
INSERT INTO mailing_list (name,email) VALUES ('Jim','jim@yahoo.com'),('Lucy','Lucy@gmail.com')
You can easily modify the answer from @Nathan Villaescusa to get the new code.
cursor.execute("""INSERT INTO mailing_list (name,email) VALUES {}""".format(",".join(str(i) for i in rows)))
Here is my own test result:
executemany: 10000 runs take 220 seconds
execute: 10000 runs take 12 seconds
The speed difference is roughly 18 times (220 s versus 12 s).

Taking up the idea of @PengjuZhao, it should work to simply add one single placeholder for all values to be passed. The difference from @PengjuZhao's answer is that the values are passed as a second parameter to the execute() function, which should be safe against injection attacks because the values are only evaluated at runtime (in contrast to ".format()").
cursor.execute("""INSERT INTO mailing_list (name,email) VALUES (%s)""", ",".join(str(i) for i in rows))
If this does not work properly, try the approach below.
####
@PengjuZhao's answer suggests that executemany() either has a large Python overhead or issues multiple execute() statements where this is not needed; otherwise executemany() would not be so much slower than a single execute() statement.
Here is a function that combines @NathanVillaescusa's and @PengjuZhao's answers in a single execute() approach.
The solution builds a dynamic number of placeholders to be added to the sql statement. It is a manually built execute() statement with multiple "%s" placeholders, which likely outperforms the executemany() statement.
For example, at 2 columns, inserting 100 rows:
execute(): 200 times "%s" (= dependent on the number of rows)
executemany(): just 2 times "%s" (= independent of the number of rows).
There is a chance that this solution has the high speed of @PengjuZhao's answer without risking injection attacks.
Prepare parameters of the function:
You will store your values in 1-dimensional numpy arrays arr_name and arr_email, which are then converted into one flat list of values, row by row (name, email, name, email, ...). Alternatively, you use the approach of @NathanVillaescusa.
from itertools import chain
# interleave the two arrays row by row and flatten the pairs into one list
listAllValues = list(chain.from_iterable(
    zip(arr_name.tolist(), arr_email.tolist())
))
column_names = 'name, email'
table_name = 'mailing_list'
Get sql query with placeholders:
The numRows = int((len(listAllValues))/numColumns) simply avoids passing the number of rows. If you insert 6 values in listAllValues at 2 columns this would make 6/2 = 3 rows then, obviously.
def getSqlInsertMultipleRowsInSqlTable(table_name, column_names, listAllValues):
    numColumns = len(column_names.split(","))
    numRows = int((len(listAllValues))/numColumns)
    placeholdersPerRow = "("+', '.join(['%s'] * numColumns)+")"
    placeholders = ', '.join([placeholdersPerRow] * numRows)
    sqlInsertMultipleRowsInSqlTable = "insert into `{table}` ({columns}) values {values};".format(table=table_name, columns=column_names, values=placeholders)
    return sqlInsertMultipleRowsInSqlTable
strSqlQuery = getSqlInsertMultipleRowsInSqlTable(table_name, column_names, listAllValues)
Execute strSqlQuery
Final step:
db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.execute(strSqlQuery, listAllValues)
This solution is hopefully without the risk of injection attacks as in @PengjuZhao's answer since it fills the sql statement only with placeholders instead of values. The values are only passed separately in listAllValues at this point here, where strSqlQuery has only placeholders instead of values:
cursor.execute(strSqlQuery, listAllValues)
The execute() statement gets the sql statement with placeholders %s and the list of values in two separate parameters, as it is done in @NathanVillaescusa's answer. I am still not sure whether this avoids injection attacks. It is my understanding that injection attacks can only occur if the values are put directly in the sql statement, please comment if I am wrong.
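The placeholder-building idea can be exercised end to end with the standard library. The Python 3 sketch below is an illustration under stated assumptions, not the answer's exact function: build_multirow_insert is a hypothetical helper, sqlite3 stands in for MySQLdb (so the placeholder defaults to ? instead of %s), and the rows are made up.

```python
import sqlite3
from itertools import chain


def build_multirow_insert(table_name, column_names, num_rows, placeholder="?"):
    """Build 'INSERT ... VALUES (?, ?), (?, ?), ...' containing placeholders only.

    table_name and column_names are inserted into the SQL text directly,
    so they must come from trusted code, never from user input.
    """
    per_row = "(" + ", ".join([placeholder] * len(column_names)) + ")"
    values = ", ".join([per_row] * num_rows)
    return "INSERT INTO {} ({}) VALUES {}".format(
        table_name, ", ".join(column_names), values
    )


rows = [
    ("Jim", "jim@example.com"),
    ("Lucy", "lucy@example.com"),
    ("Ann", "ann@example.com"),
]
sql = build_multirow_insert("mailing_list", ["name", "email"], len(rows))
flat_values = list(chain.from_iterable(rows))  # name, email, name, email, ...

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mailing_list (name TEXT, email TEXT)")
conn.execute(sql, flat_values)  # values are bound by the driver
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM mailing_list").fetchone()[0]
conn.close()
```

Because databases limit the size of one statement and the number of bound parameters, a very large list would likely have to be sent in chunks of a few hundred or thousand rows each.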
