How to use server side cursors with psycopg2 - python

I have a table with 4 million rows and I use psycopg2 to execute a:
SELECT * FROM ... WHERE query
I hadn't heard of server-side cursors before, and I am reading that they are good practice when you expect lots of results.
I find the documentation a bit limited and I have some basic questions.
First I declare the server-side cursor as:
cur = conn.cursor('cursor-name')
then I execute the query as:
cur.itersize = 10000
sqlstr = "SELECT clmn1, clmn2 FROM public.table WHERE clmn1 LIKE 'At%'"
cur.execute(sqlstr)
My question is: What do I do now? How do I get the results?
Do I iterate through the rows as:
row = cur.fetchone()
while row:
    row = cur.fetchone()
or do I use fetchmany() and do this:
row = cur.fetchmany(10)
But in the second case how can I "scroll" the results?
Also what is the point of itersize?

Psycopg2 has a nice interface for working with server side cursors. This is a possible template to use:
with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:
        cursor.itersize = 20000
        query = "SELECT * FROM ..."
        cursor.execute(query)
        for row in cursor:
            # process row
The code above creates the connection and automatically places the query result into a server side cursor. The value itersize sets the number of rows that the client will pull down at a time from the server side cursor. The value you use should balance number of network calls versus memory usage on the client. For example, if your result count is three million, an itersize value of 2000 (the default value) will result in 1500 network calls. If the memory consumed by 2000 rows is light, increase that number.
When using for row in cursor you are of course working with one row at a time, but Psycopg2 will prefetch itersize rows at a time for you.
If you want to use fetchmany for some reason, you could do something like this:
while True:
    rows = cursor.fetchmany(100)
    if len(rows) > 0:
        for row in rows:
            # process row
    else:
        break
This usage of fetchmany will not trigger a network call to the server for more rows until the prefetched batch has been exhausted. (This is a convoluted example that provides nothing over the code above, but demonstrates how to use fetchmany should there be a need.)

I tend to do something like this when I don't want to load millions of rows at once. You can turn a program into quite a memory hog if you load millions of rows into memory. Especially if you're making python domain objects out of those rows or something like that. I'm not sure if the uuid4 in the name is necessary, but my thought is that I want individual server side cursors that don't overlap if two processes make the same query.
from typing import Iterable
from uuid import uuid4

import psycopg2

def fetch_things() -> Iterable[MyDomainObject]:
    with psycopg2.connect(database_connection_string) as conn:
        with conn.cursor(name=f"my_name_{uuid4()}") as cursor:
            cursor.itersize = 500_000
            query = "SELECT * FROM ..."
            cursor.execute(query)
            for row in cursor:
                yield MyDomainObject(row)
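A quick usage sketch of the generator above (process() is a hypothetical consumer); rows stream from the server itersize rows at a time instead of being loaded all at once:
for thing in fetch_things():
    process(thing)  # hypothetical consumer; rows arrive lazily from the named server-side cursor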
I'm interested if anyone knows if this creates a storage problem on the SQL server or anything like that.

In addition to cur.fetchmany(n), you can also use PostgreSQL cursors directly:
cur.execute("declare foo cursor for select * from generate_series(1,1000000)")
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# ...
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# and so on
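To keep fetching until the cursor is exhausted, the same idea can be wrapped in a loop (a sketch following the answer's naming; the batch size of 100 is arbitrary). Note that DECLARE needs an open transaction, which psycopg2 provides by default since autocommit is off.
cur.execute("declare foo cursor for select * from generate_series(1,1000000)")
while True:
    cur.execute("fetch forward 100 from foo")
    rows = cur.fetchall()
    if not rows:
        break
    for row in rows:
        pass  # process row
cur.execute("close foo")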

Related

Python mySQL query - date between is slow unless LIMIT is added

I'm having a strange issue with pymysql and Python. I have a table where date_rec is one of the 3 columns composing the primary key. If I do this select, it takes forever to get the result:
query = f"SELECT * FROM string WHERE date_rec BETWEEN {date_before} AND {date_after} ORDER BY date_rec"
with connection.cursor() as cursor:
cursor.execute(query)
result = cursor.fetchone()
for row in result:
print(row)
However, if I add a limit of 5000, it works super fast, even though there are only 1290 records to be found. The 5000 number doesn't matter... 50,000 fixes the problem exactly the same way (just as fast). As long as it's more than 1290, I get all the records.
query = f"SELECT * FROM string WHERE date_rec BETWEEN {date_before} AND {date_after} ORDER BY date_rec LIMIT 5000"
with connection.cursor() as cursor:
cursor.execute(query)
result = cursor.fetchone()
for row in result:
print(row)
Can someone explain what's happening here and how to make the first case work as fast as the second? Thanks.
EDIT:
3 columns compose primary key:
date_rec
customer_number
order_number
So I ran EXPLAIN in MySQL Workbench and got this:
[EXPLAIN output: limit-less query]
[EXPLAIN output: query with 5000 limit]
So MySQL wasn't using the index for whatever reason. Putting "USE INDEX(PRIMARY)" inside the query fixed the problem.
I don't have an explanation as to why adding a LIMIT clause speeds up the query, but if you want to tune the query, then consider adding the following index:
CREATE INDEX idx ON string (date_rec);
This index will let MySQL quickly filter off records not inside the date range, and it also provides the ordering needed in the ORDER BY clause.
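For illustration, here is the question's query with the USE INDEX (PRIMARY) hint the asker ended up using; with the idx index above in place, MySQL should be able to pick it without a hint. The f-string interpolation is kept from the question, though parameterized queries would be safer:
query = (f"SELECT * FROM string USE INDEX (PRIMARY) "
         f"WHERE date_rec BETWEEN {date_before} AND {date_after} "
         f"ORDER BY date_rec")
with connection.cursor() as cursor:
    cursor.execute(query)
    for row in cursor.fetchall():
        print(row)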

Python does cursor execute load all data

I am trying to query a large dataset (10 million rows) while trying to prevent running out of memory, but I am not familiar with Python and I am confused by the different opinions regarding execute(), the cursor iterator and fetchone().
Am I right to assume that cursor.execute() does not load all the data into memory, and that only when I call fetchone() will it load one row of data?
from mysql.connector import MySQLConnection

def query():
    conn = MySQLConnection(host=conf['host'],
                           port=conf['port'],
                           user=conf['user'],
                           password=conf['password'],
                           database=conf['database'])
    cursor = conn.cursor(buffered=True)
    cursor.execute('SELECT * FROM TABLE')  # 10 million rows
Does this cursor iterator do the same thing as fetchone()?
for row in cursor:
    print(row)
Is my code snippet safe for handling 10 million rows of data? If not, how can I safely iterate over the data without running out of memory?
My first suggestion is to use from mysql.connector import connect, which by default will use the C extension (CMySQLConnection), instead of from mysql.connector import MySQLConnection (pure Python).
If for some reason you want to use the pure Python version, you can pass use_pure=True to connect().
The second suggestion is to paginate the results. If you use a buffered cursor, it will fetch the entire result set from the server; I don't know if you want that with 10M rows (a pagination sketch follows the references below).
Here are some references:
https://dev.mysql.com/doc/refman/8.0/en/limit-optimization.html
https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursorbuffered.html
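Here is a sketch of the pagination idea, assuming the table has an auto-increment id column to page on (keyset pagination keeps each query cheap, unlike an ever-growing OFFSET); my_table and its columns stand in for the question's table, and the conf dict is reused from the question:
from mysql.connector import connect

conn = connect(host=conf['host'],
               port=conf['port'],
               user=conf['user'],
               password=conf['password'],
               database=conf['database'])
cursor = conn.cursor()  # unbuffered by default, so only one page is held in memory at a time

last_id = 0
page_size = 10000
while True:
    cursor.execute(
        "SELECT id, col1, col2 FROM my_table WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, page_size))
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        pass  # process one row at a time
    last_id = rows[-1][0]  # id is the first selected column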
Taken from MySQL documentation:
The fetchone() method is used by fetchall() and fetchmany(). It is also used when a cursor is used as an iterator.
The following example shows two equivalent ways to process a query result. The first uses fetchone() in a while loop, the second uses the cursor as an iterator:
# Using a while loop
cursor.execute("SELECT * FROM employees")
row = cursor.fetchone()
while row is not None:
    print(row)
    row = cursor.fetchone()

# Using the cursor as iterator
cursor.execute("SELECT * FROM employees")
for row in cursor:
    print(row)
It also stated that:
You must fetch all rows for the current query before executing new statements using the same connection.
If you are worried about performance issues, you should use fetchmany(n) in a while loop until you have fetched all of the results, like so:
def result_iter(cursor, arraysize=1000):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
This behavior adheres to PEP 249, which describes which methods DB connectors should implement and how. A partial answer is given in this thread.
Basically, the implementation of fetchall vs fetchmany vs fetchone is up to the developers of the library, depending on the database's capabilities, but it would make sense, in the case of fetchmany and fetchone, for the unfetched/remaining results to be kept server side until requested by another call or until the cursor object is destroyed.
So in conclusion, I think it is safe to assume that calling the execute method does not, in this case (mysqldb), dump all the data from the query into memory.

Inserting rows while looping over result set

I am working on a program to clone rows in my database from one user to another. It works by selecting the rows, editing a few values and then inserting them back.
I also need to store the newly inserted rowIDs with their existing counterparts so I can clone some other link tables later on.
My code looks like the following:
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')

selector = con.cursor(prepared=True)
insertor = con.cursor(prepared=True)

user_map = {}

selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)

for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

selector.close()
insertor.close()
When running this code, I get the following error:
mysql.connector.errors.InternalError: Unread result found
I'm assuming this is because I am trying to run an INSERT while I am still looping over the SELECT, but I thought using two cursors would fix that. Why do I still get this error with multiple cursors?
I found a solution using fetchall(), but I was afraid that would use too much memory as there could be thousands of results returned from the SELECT.
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')
cursor = con.cursor(prepared=True)

user_map = {}

cursor.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', cursor.column_names)

for curr_row in map(Row._make, cursor.fetchall()):
    new_row = curr_row._replace(userID=None, companyID=95)
    cursor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = cursor.lastrowid

cursor.close()
This works, but it's not very fast. I was thinking that not using fetchall() would be quicker, but it seems if I do not fetch the full result set then MySQL yells at me.
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Yes. Use two MySQL connections: one for reading and the other for writing.
The performance impact isn't too bad, as long as you don't have thousands of instances of the program trying to connect to the same MySQL server.
One connection is reading a result set, and the other is inserting rows to the end of the same table, so you shouldn't have a deadlock. It would be helpful if the WHERE condition you use to read the table could explicitly exclude the rows you're inserting, if there's a way to tell the new rows apart from the old rows.
At some level, the performance impact of two connections doesn't matter because you don't have much choice. The only other way to do what you want to do is slurp the whole result set into RAM in your program, close your reading cursor, and then write.
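A sketch of the two-connection approach, reusing the code from the question (the host and the 4-column users table are carried over as-is; the final commit is an addition):
import mysql.connector
from collections import namedtuple

read_con = mysql.connector.connect(host='127.0.0.1')
write_con = mysql.connector.connect(host='127.0.0.1')

selector = read_con.cursor(prepared=True)
insertor = write_con.cursor(prepared=True)

user_map = {}

selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)

for row in selector:  # rows stream in on the read connection
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

write_con.commit()
selector.close()
insertor.close()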

Python/Psycopg2/PostgreSQL Copy_From loop gets slower as it progresses

I have written a Python script that takes a 1.5 GB XML file, parses out data and feeds it to a database using copy_from. It invokes the following function every 1000 parsed nodes. There are about 170k nodes in all, which update about 300k rows or more. It starts out quite fast and then gets progressively slower as time goes on. Any ideas on why this is happening and what I can do to fix it?
Here is the function where I feed the data to the db.
def db_update(val_str, tbl, cols):
    conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
    cur = conn.cursor()
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=(cols))
    conn.commit()
I haven't included the xml parsing as I don't think that's an issue. Without the db the parser executes in under 2 minutes.
There are several things that can slow inserts as tables grow:
Triggers that have to do more work as the DB grows
Indexes, which get more expensive to update as they grow
Disable any non-critical triggers, or if that isn't possible re-design them to run in constant time.
Drop indexes, then create them after the data has been loaded. If you need any indexes for the actual INSERTs or UPDATEs you'll need to keep them and wear the cost.
If you're doing lots of UPDATEs, consider VACUUMing the table periodically, or setting autovacuum to run very aggressively. That'll help Pg re-use space rather than more expensively allocating new space from the file system, and will help avoid table bloat.
You'll also save time by not re-connecting for each block of work. Maintain a connection.
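A rough sketch of maintaining one connection, following the question's db_update (io.StringIO stands in for the Python 2 cStringIO, and the batched commit interval is an assumption):
import io

import psycopg2

conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()

def db_update(cur, val_str, tbl, cols):
    # reuse the long-lived cursor instead of reconnecting on every call
    cur.copy_from(io.StringIO(val_str), tbl, sep='\t', columns=cols)

# in the parsing loop, commit every so often rather than per call, e.g.:
# db_update(cur, val_str, tbl, cols)
# if nodes_parsed % 100000 == 0:   # nodes_parsed is hypothetical
#     conn.commit()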
From personal experience, copy_from doesn't update any indexes after you commit anything, so you will have to do it later. I would move your conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>"); cur = conn.cursor() outside of the function and do a commit() when you've finished inserting everything (I suggest committing every ~100k rows or it will start getting slow).
Also, it may seem stupid, but it has happened to me a lot of times: make sure you reset your val_str after you call db_update. For me, when the copy_from/inserts start to go slower it's because I'm inserting the same rows plus more rows.
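A sketch of the reset being described (nodes and parse_node() are hypothetical stand-ins for the question's XML parsing; tbl and cols come from the question's function signature):
val_str = ""
for i, node in enumerate(nodes, start=1):
    val_str += parse_node(node)  # append this node's tab-separated values
    if i % 1000 == 0:
        db_update(val_str, tbl, cols)
        val_str = ""  # reset, otherwise the same rows get copied again on the next call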
I'm using the following and I don't get any hit on performance as far as I have seen:
import psycopg2
import psycopg2.extras

local_conn_string = """
    host='localhost'
    port='5432'
    dbname='backupdata'
    user='postgres'
    password='123'"""

local_conn = psycopg2.connect(local_conn_string)
local_cursor = local_conn.cursor(
    'cursor_unique_name',
    cursor_factory=psycopg2.extras.DictCursor)
I have added the following output to my code to test the run time (and I am parsing a LOT of rows, more than 30.000.000):
Parsed 2600000 rows in 00:25:21
Parsed 2700000 rows in 00:26:19
Parsed 2800000 rows in 00:27:16
Parsed 2900000 rows in 00:28:15
Parsed 3000000 rows in 00:29:13
Parsed 3100000 rows in 00:30:11
I have to mention I don't "copy" anything, but I am moving my rows from a remote PostgreSQL to a local one, and in the process I create a few more tables to index my data better than before, as 30.000.000+ rows is a bit too much to handle with regular queries.
NB: The time is counting upwards and is not for each query.
I believe it has to do with the way my cursor is created.
EDIT1:
I am using the following to run my query:
local_cursor.execute("""SELECT * FROM data;""")

row_count = 0
for row in local_cursor:
    if (row_count % 100000 == 0 and row_count != 0):
        print("Parsed %s rows in %s" % (row_count,
                                        my_timer.get_time_hhmmss()))
    parse_row(row)
    row_count += 1

print("Finished running script!")
print("Parsed %s rows" % row_count)
my_timer is a timer class I've made, and the parse_row(row) function formats my data, transfers it to my local DB and eventually deletes it from the remote DB once the data is verified as having been moved to my local DB.
EDIT2:
It takes roughly 1 minute to parse every 100.000 rows in my DB, even after parsing around 4.000.000 queries:
Parsed 3800000 rows in 00:36:56
Parsed 3900000 rows in 00:37:54
Parsed 4000000 rows in 00:38:52
Parsed 4100000 rows in 00:39:50

offset/limit programmatically with python mysqldb

I'm trying to process data from a database table. Loading it all at once will hog most of the computer's resources. Here's my current code:
cursor.execute("SELECT * from sometable")
rs = cursor.fetchall()
.
# process 1: set operation
.
for a in rs:
    # process 2: put data in another db
Is there a way to reduce resource usage? Like getting rows bit by bit under a loop?
You can use cursor.fetchone(). Read about it here. Example usage:
cursor.execute(query)
numrows = int(cursor.rowcount)
for x in range(0, numrows):
    row = cursor.fetchone()
You can also consider using LIMIT in the mysql query:
cursor.execute("SELECT * from sometable LIMIT 0, 100")
Finally, avoid using the star operator and only select the columns you need.
Using cursor.fetchone() will probably still cause resource issues because of the SELECT * statement before it. I believe the best way to do this is to limit the query results with LIMIT and just loop through them all.
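Here's a sketch combining both suggestions with MySQLdb: page through the table with LIMIT/OFFSET and select only the columns you need (the connection parameters, table and column names are placeholders):
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='pw', db='mydb')
cursor = conn.cursor()

batch_size = 1000
offset = 0
while True:
    cursor.execute("SELECT col1, col2 FROM sometable LIMIT %s OFFSET %s",
                   (batch_size, offset))
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        pass  # process 2: put data in another db
    offset += batch_size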
