cx_Oracle: How do I iterate over a result set? - python

There are several ways to iterate over a result set. What are the tradeoffs of each?

The canonical way is to use the built-in cursor iterator.
curs.execute('select * from people')
for row in curs:
    print row
You can use fetchall() to get all rows at once.
for row in curs.fetchall():
    print row
It can be convenient to use this to create a Python list containing the values returned:
curs.execute('select first_name from people')
names = [row[0] for row in curs.fetchall()]
This can be useful for smaller result sets, but it can have bad side effects if the result set is large:
You have to wait for the entire result set to be returned to your client process.
You may eat up a lot of memory in your client to hold the built-up list.
It may take a while for Python to construct and deconstruct the list, which you are going to discard immediately anyway.
If you want to keep memory bounded without giving up batching, see the fetchmany() sketch below.
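As a middle ground between the iterator and fetchall(), a rough sketch using fetchmany(), assuming the same curs cursor and people table as above (the batch size of 500 is an arbitrary choice):

curs.execute('select first_name from people')
while True:
    # fetchmany() returns at most 500 rows per call, and an empty
    # list once the result set is exhausted.
    batch = curs.fetchmany(500)
    if not batch:
        break
    for row in batch:
        print(row[0])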
If you know there's a single row being returned in the result set, you can call fetchone() to get the single row.
curs.execute('select max(x) from t')
maxValue = curs.fetchone()[0]
Finally, you can loop over the result set fetching one row at a time. In general, there's no particular advantage in doing this over using the iterator.
row = curs.fetchone()
while row:
    print row
    row = curs.fetchone()

My preferred way is the cursor iterator, but first setting the arraysize property of the cursor.
curs.execute('select * from people')
curs.arraysize = 256
for row in curs:
    print row
In this example, cx_Oracle will fetch rows from Oracle 256 at a time, reducing the number of network round trips that need to be performed.

There's also the way psycopg seems to do it... From what I gather, it seems to create dictionary-like row proxies that map key lookups onto the memory block returned by the query. In that case, fetching the whole answer and working with a similar proxy factory over the rows seems like a useful idea. Come to think of it though, it feels more like Lua than Python.
Also, this should be applicable to all PEP 249 (DB-API 2.0) interfaces, not just Oracle, or did you mean just fastest using Oracle?
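For reference, a small sketch of that dictionary-style row access, assuming psycopg2 and the same people table as above (the connection string is a placeholder):

import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=test")  # placeholder connection parameters
# DictCursor yields rows that behave both as sequences and as mappings.
curs = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
curs.execute('select first_name, last_name from people')
for row in curs:
    # Key lookup and positional access both work on the row proxy.
    print(row['first_name'], row[1])
conn.close()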

Related

Inserting rows while looping over result set

I am working on a program to clone rows in my database from one user to another. It works by selecting the rows, editing a few values, and then inserting them back.
I also need to store the newly inserted rowIDs with their existing counterparts so I can clone some other link tables later on.
My code looks like the following:
import mysql.connector
from collections import namedtuple
con = mysql.connector.connect(host='127.0.0.1')
selector = con.cursor(prepared=True)
insertor = con.cursor(prepared=True)
user_map = {}
selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)
for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid
selector.close()
insertor.close()
When running this code, I get the following error:
mysql.connector.errors.InternalError: Unread result found
I'm assuming this is because I am trying to run an INSERT while I am still looping over the SELECT, but I thought using two cursors would fix that. Why do I still get this error with multiple cursors?
I found a solution using fetchall(), but I was afraid that would use too much memory as there could be thousands of results returned from the SELECT.
import mysql.connector
from collections import namedtuple
con = mysql.connector.connect(host='127.0.0.1')
cursor = con.cursor(prepared=True)
user_map = {}
cursor.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', cursor.column_names)
for curr_row in map(Row._make, cursor.fetchall()):
    new_row = curr_row._replace(userID=None, companyID=95)
    cursor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = cursor.lastrowid
cursor.close()
This works, but it's not very fast. I was thinking that not using fetchall() would be quicker, but it seems if I do not fetch the full result set then MySQL yells at me.
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Yes. Use two MySQL connections: one for reading and the other for writing.
The performance impact isn't too bad, as long as you don't have thousands of instances of the program trying to connect to the same MySQL server.
One connection is reading a result set, and the other is inserting rows to the end of the same table, so you shouldn't have a deadlock. It would be helpful if the WHERE condition you use to read the table could explicitly exclude the rows you're inserting, if there's a way to tell the new rows apart from the old rows.
At some level, the performance impact of two connections doesn't matter because you don't have much choice. The only other way to do what you want to do is slurp the whole result set into RAM in your program, close your reading cursor, and then write.
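A rough sketch of the two-connection approach, adapting the code from the question (the host, the four-column users table, and the company IDs are all assumptions carried over from it):

import mysql.connector
from collections import namedtuple

# Separate connections: one dedicated to reading, one to writing.
read_con = mysql.connector.connect(host='127.0.0.1')
write_con = mysql.connector.connect(host='127.0.0.1')

selector = read_con.cursor(prepared=True)
insertor = write_con.cursor(prepared=True)

user_map = {}
# The WHERE clause (companyID = 56) already excludes the rows being
# inserted (companyID = 95), as suggested above.
selector.execute('SELECT * FROM users WHERE companyID = ?', (56,))
Row = namedtuple('users', selector.column_names)
for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

write_con.commit()  # commit the writes (assuming autocommit is off)
selector.close()
insertor.close()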

Why is querying a table so much slower after sorting it?

I have a Python program that uses Pytables and queries a table in this simple manner:
def get_element(table, somevar):
    rows = table.where("colname == somevar")
    row = next(rows, None)
    if row:
        return elem_from_row(row)
To reduce the query time, I decided to try to sort the table with table.copy(sortby='colname'). This indeed improved the query time (spent in where), but it increased the time spent in the next() built-in function by several orders of magnitude! What could be the reason?
This slowdown occurs only when there is another column in the table, and the slowdown increases with the element size of that other column.
To help me understand the problem and make sure this was not related to something else in my program, I made this minimum working example reproducing the problem:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tables
import time
import sys

def create_set(sort, withdata):
    #Table description with or without data
    tabledesc = {
        'id': tables.UIntCol()
    }
    if withdata:
        tabledesc['data'] = tables.Float32Col(2000)
    #Create table with CSI'ed id
    fp = tables.open_file('tmp.h5', mode='w')
    table = fp.create_table('/', 'myset', tabledesc)
    table.cols.id.create_csindex()
    #Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()
    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')
    fp.flush()
    return fp

def get_element(table, i):
    #By construction, i always exists in the table
    rows = table.where('id == i')
    row = next(rows, None)
    if row:
        return {'id': row['id']}
    return None

sort = sys.argv[1] == 'sort'
withdata = sys.argv[2] == 'withdata'

fp = create_set(sort, withdata)
start_time = time.time()
table = fp.root.myset
for i in xrange(500):
    get_element(table, i)
print("Queried the set in %.3fs" % (time.time() - start_time))
fp.close()
And here is some console output showing the figures:
$ ./timedset.py nosort nodata
Queried the set in 0.718s
$ ./timedset.py sort nodata
Queried the set in 0.003s
$ ./timedset.py nosort withdata
Queried the set in 0.597s
$ ./timedset.py sort withdata
Queried the set in 5.846s
Some notes:
The rows are actually sorted in all cases, so it seems to be linked to the table being aware of the sort rather than just the data being sorted.
If instead of creating the file, I read it from disk, same results.
The issue occurs only when the data column is present, even though I never write to it or read it. I noticed that the time difference increases "in stages" as the size of the column (the number of floats) increases. The slowdown must be linked to internal data movements or I/O.
If I don't use the next function, but instead use a for row in rows and trust that there is only one result, the slowdown still occurs.
Accessing an element from a table by some sort of id (sorted or not) sounds like a basic feature, so I must be missing the typical way of doing it with pytables. What is it?
And why such a terrible slowdown? Is it a bug that I should report?
I finally understood what's going on.
Long story short
The root cause is a bug and it was on my side: I was not flushing the data before making the copy in case of sort. As a result, the copy was based on data that was not complete, and so was the new sorted table. This is what caused the slowdown, and flushing when appropriate led to a less surprising result:
...
    #Fill the table with sorted ids
    row = table.row
    for i in xrange(500):
        row['id'] = i
        row.append()
    fp.flush() # <--
    #Force a sort if asked for
    if sort:
        newtable = table.copy(newname='sortedset', sortby='id')
        table.remove()
        newtable.rename('myset')
    fp.flush() # <--
    return fp
...
But why?
I realized my mistake when I decided to inspect and compare the structure and data of the tables "not sorted" vs "sorted". I noticed that in the sorted case, the table had fewer rows. The number varied seemingly randomly from 0 to about 450 depending on the size of the data column. Moreover, in the sorted table, the id of all the rows was set to 0. I guess that when creating a table, pytables initializes the columns and may or may not pre-create some of the rows with some initial value. This "may or may not" probably depends on the size of the row and the computed chunksize.
As a result, when querying the sorted table, all queries but the one with id == 0 had no result. I initially thought that raising and catching the StopIteration error was what caused the slowdown, but that would not explain why the slowdown depends on the size of the data column.
After reading some of the code from pytables (notably table.py and tableextension.pyx), I think what happens is the following: when a column is indexed, pytables will first try to use this index to speed up the search. If some matching rows are found, only these rows will be read. But if the index indicates that no row matches the query, for some reason pytables falls back to an "in kernel" search, which iterates over and reads all the rows. This requires reading the full rows from disk in multiple I/Os, and this is why the size of the data column mattered. Also, under a certain size of that column, pytables did not "pre-create" some rows on disk, resulting in a sorted table with no rows at all. This is why on the graph the search is very fast when the column size is under 525: iterating over 0 rows doesn't take much time.
I am not clear on why the iterator falls back on an "in kernel" search. If the searched id is clearly out of the index bounds, I don't see any reason to search for it anyway... Edit: After a closer look at the code, it turns out this is because of a bug. It is present in the version I am using (3.1.1), but has been fixed in 3.2.0.
The irony
What really makes me cry is that I forgot to flush before copying only in the example of the question. In my actual program, this bug is not present! What I also did not know, but found out while investigating the question, is that by default pytables does not propagate indexes. This has to be requested explicitly with propindexes=True. This is why the search was slower after sorting in my application...
So moral of the story:
Indexing is good: use it
But don't forget to propagate them when sorting a table (see the sketch after this list)
Make sure your data is on disk before reading it...
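A minimal sketch of the two fixes together, assuming the same table, fp file handle, and column names as in the example above:

fp.flush()  # make sure all appended rows are on disk before copying

# Copy the table sorted by the indexed column, and carry the indexes
# over to the new table so later where() queries stay fast.
newtable = table.copy(newname='sortedset', sortby='id', propindexes=True)
table.remove()
newtable.rename('myset')
fp.flush()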

In python's sqlite3 module, will using cursor.fetchone() as the condition for a while loop fetch a row?

I am using python's sqlite3 module.
I want to iterate through the rows in my table until none are left. I want to do this one at a time instead of using fetchall because the table is quite large. I read in the python sqlite3 documentation that curs.fetchone() returns None when no rows are left in the cursor's selection.
So I had the idea to run my code in a while fetchone(): loop, with the 'actual' fetchone() call occurring inside the loop, like so:
while c.fetchone():
    row = c.fetchone()
    print(foo(row))
I can't tell if this loop is actually fetching a row once in the while statement and once in the body, or if it is only in the body. Which one is it?
It will call fetchone() twice per iteration: the row fetched in the while condition is discarded, and the body then fetches and prints the next one, so you would skip every other row. It is not a good idea. Since sqlite3.Cursor is iterable, you can do something like this instead:
for row in c:
    print(foo(row))
Alternatively, you can use an infinite while loop and fetch data in batches using fetchmany():
# Let's take 1000 rows at a time
batch_size = 1000
while True:
    rows = c.fetchmany(batch_size)
    if not rows: break
    for row in rows:
        print(foo(row))

Is sqlite3 fetchall necessary?

I just started using sqlite3 with Python.
I would like to know the difference between:
cursor = db.execute("SELECT customer FROM table")
for row in cursor:
    print row[0]
and
cursor = db.execute("SELECT customer FROM table")
for row in cursor.fetchall():
    print row[0]
Except that cursor is of <type 'sqlite3.Cursor'> and cursor.fetchall() is of <type 'list'>, both of them give the same result.
Is there a difference, a preference, or specific cases where one is preferred over the other?
fetchall() reads all records into memory, and then returns that list.
When iterating over the cursor itself, rows are read only when needed.
This is more efficient when you have much data and can handle the rows one by one.
The main difference is precisely the call to fetchall(). By issuing fetchall(), it will return a list object filled with all the elements remaining in your initial query (all elements, if you haven't fetched anything yet). This has several drawbacks:
Increased memory usage: all the query's elements are stored in a list, which could be huge
Bad performance: filling the list can be quite slow if there are many elements
Process termination: if it is a huge query, then your program might crash by running out of memory
When you instead use the cursor iterator (for e in cursor:), you get the query's rows lazily; that is, they are returned one by one, only when the program requires them.
Surely the output of your two code snippets is the same, but internally there's a huge performance difference between using fetchall() and using only the cursor.
Hope this helps!

How can I "fetch two" with python-mysql?

I have a table, and I want to execute a query that will return the values of two rows:
cursor.execute("""SELECT `egg_id`
FROM `groups`
WHERE `id` = %s;""", (req_id))
req_egg = str(cursor.fetchone())
print req_egg
The column egg_id has two rows it can return for that query, however the above code will only print the first result -- I want it to also show the second, how would I get both values?
Edit: Would there be any way to store each one in a separate variable, with fetchmany?
In this case, you can use fetchmany() to fetch a specified number of rows:
req_egg = cursor.fetchmany(2)
Edit:
But be aware: if you have a table with many rows but only need two, then you should also use a LIMIT in your SQL query; otherwise all rows are returned from the database, but only two are used by your program.
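To address the edit in the question: fetchmany() returns a sequence of rows, so you can unpack it into separate variables. A small sketch, assuming the query really did return two single-column rows (the variable names are just placeholders):

rows = cursor.fetchmany(2)
# Each row is a tuple such as (egg_id,); check the count before
# unpacking, since fewer than two rows may have come back.
if len(rows) == 2:
    (first_egg,), (second_egg,) = rows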
Call .fetchone() a second time, and it will return the next result.
Otherwise if you are 100% positively sure that your query would always return exactly two results, even if you've had a bug or inconsistent data in the database, then just do a .fetchall() and capture both results.
Try this:
Cursor.fetchmany(size=2)
Documentation for sqlite3 (which also implements dbapi): http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany
