psycopg2 leaking memory after large query

I'm running a large query in a Python script against my PostgreSQL database using psycopg2 (I upgraded to version 2.5). After the query finishes, I close the cursor and connection and even run gc, but the process still consumes a huge amount of memory (7.3 GB, to be exact). Am I missing a cleanup step?
import psycopg2
conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()
del rows
cursor.close()
conn.close()
import gc
gc.collect()

I ran into a similar problem and after a couple of hours of blood, sweat and tears, found the answer simply requires the addition of one parameter.
Instead of
cursor = conn.cursor()
write
cursor = conn.cursor(name="my_cursor_name")
or simpler yet
cursor = conn.cursor("my_cursor_name")
The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors
I found the instructions a little confusing, in that I thought I'd need to rewrite my SQL to include "DECLARE my_cursor_name ...." and then a "FETCH count 2000 FROM my_cursor_name", but it turns out psycopg does all of that for you under the hood if you simply override the "name=None" default parameter when creating a cursor.
The suggestion of using fetchone or fetchmany doesn't resolve the problem by itself since, if you leave the name parameter unset, psycopg will by default attempt to load the entire result set into RAM. The only other thing you may need to do (besides passing a name) is change the cursor.itersize attribute from the default of 2000 to, say, 1000 if you still have too little memory.

Joeblog has the correct answer. The way you deal with the fetching is important but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.
import datetime as dt
import psycopg2
import sys
import time
conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')
curPG.itersize = 100000 # Rows fetched at one time from the server
curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS !!
cptLigne = 0
for rec in curPG:
    cptLigne += 1
    if cptLigne % 10000 == 0:
        print('.', end='')
        sys.stdout.flush() # To see the progression
conPG.commit() # Also close the cursor
conPG.close()
As you will see, the dots arrive rapidly in groups and then pause while the next buffer of rows (itersize) is fetched, so you don't need to use fetchmany for performance. When I run this with /usr/bin/time -v, I get the result in less than 3 minutes, using only 200 MB of RAM (instead of 60 GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM either, as it uses a temporary table.

Please see the answer by @joeblog for the better solution.
First, you shouldn't need all that RAM in the first place. What you should be doing here is fetching chunks of the result set. Don't do a fetchall(). Instead, use the much more efficient cursor.fetchmany method. See the psycopg2 documentation.
Now, the explanation for why it isn't freed, and why that isn't a memory leak in the formally correct use of that term.
Most processes don't release memory back to the OS when it's freed, they just make it available for re-use elsewhere in the program.
Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to the object. Indirect references are rather inefficient, especially on modern CPUs where chasing pointers around does horrible things to performance.
What usually ends up happening, unless the program exercises extra caution, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.
The OS can't tell whether the program considers this memory still in use or not, so it can't just claim it back. Since the program doesn't tend to access the memory the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.
It's possible to write programs that hand memory back to the OS, but I'm not sure that you can do it with Python.
See also:
python - memory not being given back to kernel
Why doesn't memory get released to system after large queries (or series of queries) in django?
Releasing memory in Python
So: this isn't actually a memory leak. If you do something else that uses lots of memory, the process shouldn't grow much, if at all; it will re-use the previously freed memory from the last big allocation.

Related

Long-running Python program (using Pandas) keeps ramping up memory usage

I'm running a python script that handles and processes data using Pandas functions inside an infinite loop. But the program seems to be leaking memory over time.
This is the graph produced by the memory-profiler package:
Sadly, I cannot identify the source of the increasing memory usage. To my knowledge, all data (pandas timeseries) are stored in the object Obj, and I track the memory usage of this object using the pandas function .memory_usage and the objsize function get_deep_size(). According to their output, the memory usage should be stable around 90-100 MB. Other than this, I don't see where memory can ramp up.
It may be useful to know that the python program is running inside a docker container.
Below is a simplified version of the script which should illuminate the basic working principle.
from datetime import datetime
from time import sleep
import objsize
from dateutil import relativedelta
def update_data(Obj, now_utctime):
    # attaining the newest timeseries data
    new_data = requests.get(host, start=Obj.data[0].index, end=now_utctime)
    Obj.data.append(new_data)
    # cut off data older than 1 day
    Obj.data.truncate(before=now_utctime-relativedelta.relativedelta(days=1))
class ExampleClass():
    def __init__(self):
        now_utctime = datetime.utcnow()
        # keep the initial day of data on the instance so update_data can reach it
        self.data = requests.get(host, start=now_utctime-relativedelta.relativedelta(days=1), end=now_utctime)
Obj = ExampleClass()
while True:
    update_data(Obj, datetime.utcnow())
    logger.info(f"Average at {datetime.utcnow()} is at {Obj.data.mean()}")
    logger.info(f"Stored timeseries memory usage at {Obj.data.memory_usage(deep=True)* 10 ** -6} MB")
    logger.info(f"Stored Object memory usage at {objsize.get_deep_size(Obj) * 10 ** -6} MB")
    sleep(60)
Any advice into where memory could ramp up, or how to further investigate, would be appreciated.
EDIT: Looking at the chart, it makes sense that there will be spikes before I truncate, but since the data ingress is steady I don't know why it wouldn't normalize, but remain at a higher point. Then there is this sudden drop after every 4th cycle, even though the process does not have another, broader cycle that could explain this ...

As suggested by moooeeeep, the increase of memory usage was related to a memory leak, the exact source of which remains to be identified. However, I was able to resolve the issue by manually calling the garbage collector after every loop, via gc.collect().
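A minimal sketch of the workaround described above, assuming the body of the loop can be wrapped in a callable. The names run_cycles and step are hypothetical and do not come from the original script. gc.collect() returns the number of unreachable objects it found, so logging that value may help show whether cyclic garbage is piling up between iterations.

```python
import gc


def run_cycles(step, cycles):
    """Call step() `cycles` times, forcing a full collection after each call.

    Returns the per-iteration counts reported by gc.collect().
    """
    collected = []
    for _ in range(cycles):
        step()
        # Force a full collection instead of waiting for the automatic
        # thresholds; this reclaims reference cycles left behind by step().
        collected.append(gc.collect())
    return collected
```

In the script from the question, step would be something like a function that calls update_data and writes the log lines, with the sleep kept outside of it.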

Make python process writes be scheduled for writeback immediately without being marked dirty

We are building a Python framework that captures data from a framegrabber card through a cffi interface. After some manipulation, we try to write RAW images (numpy arrays, using the tofile method) to disk at a rate of around 120 MB/s. We are well aware that our disks are capable of handling this throughput.
The problem we were experiencing was dropped frames, often with entire seconds of data completely missing from the framegrabber output. What we found was that these frame drops were occurring when our Debian system hit the dirty_background_ratio set in sysctl. The system was kicking off its background flush, which would choke the framegrabber and cause it to skip frames.
Not surprisingly, setting the dirty_background_ratio to 0% got rid of the problem entirely. (It is worth noting that even small values like 1% and 2% still resulted in ~40% frame loss.)
So, my question is: is there any way to get this Python process to write in such a way that the data is immediately scheduled for writeout, bypassing the dirty buffer entirely?
Thanks

So here's one way I've managed to do it.
By using the numpy memmap object you can instantiate an array that directly corresponds to a part of the disk. Calling the flush() method or Python's del causes the array to sync to disk, completely bypassing the OS's buffer. I've successfully written ~280 GB to disk at max throughput using this method.
Will continue researching.

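Another option is to get the OS file descriptor (for example via the file object's fileno() method) and call os.fsync on it. This forces the written data out to disk right away instead of leaving it sitting in the dirty buffer.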
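A minimal sketch of the os.fsync approach, assuming each frame is available as a bytes-like object. The function name write_and_sync and its parameters are hypothetical; only open, flush, fileno and os.fsync are standard library calls.

```python
import os


def write_and_sync(path, payload):
    """Write a bytes-like payload to path and block until the OS has flushed it."""
    with open(path, "wb") as f:
        written = f.write(payload)
        f.flush()             # push Python's userspace buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to write this file's dirty pages now
    return written
```

With a numpy array, frame.tobytes() could be passed as the payload, or frame.tofile(f) could replace the f.write call. Since os.fsync blocks until the data has been written, it may be worth running it off the acquisition thread so that it does not stall the framegrabber itself.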

speed improvement in postgres INSERT command

I am writing a program to load data into a particular database. This is what I am doing right now ...
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
cur = conn.cursor()
lRows = len(rows)
i, iN = 0, 1000
while True:
    if iN >= lRows:
        # write the last of the data, and break ...
        iN = lRows
        values = [dict(zip(header, r)) for r in rows[i:iN]]
        cur.executemany( insertString, values )
        conn.commit()
        break
    values = [dict(zip(header, r)) for r in rows[i:iN]]
    cur.executemany( insertString, values )
    conn.commit()
    i += 1000
    iN += 1000
cur.close()
conn.close()
I am aware of this question about the use of the COPY command. However, I need to do some bookkeeping on my files before I can upload them into a database. Hence I am using Python in this manner.
I have a couple of questions on how to make things faster ...
1. Would it be better (or possible) to do many cur.executemany() statements and a single conn.commit() at the end? This means that I would put a single conn.commit() statement just before the cur.close() statement.
2. I have always seen other people use cur.executemany() for batches of 1000 or so records. Is this generally the case, or is it possible to just do a single cur.executemany() on the entire set of records that I read from the file? I would potentially have hundreds of thousands of records, or maybe a little over a million. (I have sufficient RAM to fit the entire file in memory.) How do I know the upper limit on the number of records that I can upload at any one time?
3. I am making a fresh connection to the database for every file that I open. I am doing this because the process takes many days to complete, and I don't want connection issues to corrupt the entirety of the data if the connection is lost at any time. I have over a thousand files to go through. Are these thousand connections going to be a significant part of the time used by the process?
4. Are there any other things that I am doing in the program that I shouldn't be doing, and that could shorten the total time for the process?
Thanks very much for any help that I can get. Sorry for the questions being so basic. I am just starting with databases in Python, and for some reason, I don't seem to have any definitive answer to any of these questions right now.

1. As you mentioned in point 3, you are worried about the database connection breaking. If you call conn.commit() only once after all the inserts, you can easily lose data that was inserted but not yet committed if the connection breaks before conn.commit(). If you call conn.commit() after each cur.executemany(), you won't lose everything, only the last batch. So it's up to you and depends on the workflow you need to support.
2. The number of records per batch is a trade-off between insertion speed and other things. You need to choose a value that satisfies your requirements; you can test your script with 1000 records per batch and with 10000 per batch and check the difference.
The case of inserting a whole file within one cur.executemany() has the advantage of atomicity: if it has been executed, all records from this particular file have been inserted, so we're back to point 1.
3. I think the cost of establishing a new connection does not really matter in your case. If, say, it takes one second to establish a new connection, then with 1000 files that is 1000 seconds spent on connections over a run that takes days.
4. The program itself looks fine, but I would still recommend that you take a look at the COPY command (COPY ... FROM is the direction used for loading) with UNLOGGED or TEMPORARY tables; it will really speed up your imports.
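The commit-per-batch trade-off discussed above can be sketched as follows. This is a hedged illustration, not the asker's code: batches and insert_in_batches are hypothetical helper names, and conn is assumed to be a DB-API connection such as the one returned by psycopg2.connect().

```python
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_in_batches(conn, insert_sql, rows, size=1000):
    """Run executemany once per batch and commit after each batch.

    Returns the number of batches committed.
    """
    committed = 0
    cur = conn.cursor()
    try:
        for chunk in batches(rows, size):
            cur.executemany(insert_sql, chunk)
            conn.commit()  # a dropped connection now costs at most one batch
            committed += 1
    finally:
        cur.close()
    return committed
```

Changing size (for example 1000 versus 10000) is then a one-argument experiment, and moving conn.commit() out of the loop gives the single-commit variant from point 1.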

64 bit python fills up memory until computer freezes with no memerror

I used to run 32-bit Python on a 32-bit OS, and whenever I accidentally appended values to an array in an infinite loop or tried to load too big a file, Python would just stop with an out-of-memory error. However, I now use 64-bit Python on a 64-bit OS, and instead of raising an exception, Python uses up every last bit of memory and causes my computer to freeze, so I am forced to restart it.
I looked around Stack Overflow and it doesn't seem as if there is a good way to control or limit memory usage. For example, this solution: How to set memory limit for thread or process in python? limits the resources Python can use, but it would be impractical to paste into every piece of code I want to write.
How can I prevent this from happening?

I don't know if this will be the solution for anyone else but me, as my case was very specific, but I thought I'd post it here in case someone could use my procedure.
I had a VERY large dataset with millions of rows of data. When I queried this data from a PostgreSQL database, I used up a lot of my available memory (63.9 GB available in total on a Windows 10 64-bit PC using Python 3.x 64-bit), and each of my queries used around 28-40 GB of memory, as the rows of data had to be kept in memory while Python did calculations on them. I used the psycopg2 module to connect to PostgreSQL.
My initial procedure was to perform the calculations and then append the results to a list, which I would return from my methods. I quite quickly ended up with too much stored in memory and my PC started freaking out (it froze up, logged me out of Windows, the display driver stopped responding, etc.).
Therefore I changed my approach to use Python generators. And since I wanted to store the data I did calculations on back in my database, I would write each row to the database as soon as I was done performing calculations on it.
def fetch_rows(cursor, arraysize=1000):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
And with this approach I would do calculations on my yielded result by using my generator:
def main():
    connection_string = "...."
    connection = psycopg2.connect(connection_string)
    cursor = connection.cursor()
    # Using generator
    for row in fetch_rows(cursor):
        # placeholder functions
        result = do_calculations(row)
        write_to_db(result)
This procedure does however indeed require that you have enough physical RAM to store the data in memory.
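The RAM requirement mentioned here most likely comes from the default client-side cursor, which pulls the whole result set into the client when the query is executed. A minimal sketch, assuming a psycopg2 connection, that combines the generator idea with a named (server-side) cursor; stream_rows and the default cursor name are hypothetical, while name= and itersize are the psycopg2 features described in the first answer on this page.

```python
def stream_rows(conn, query, name="stream_cursor", itersize=2000):
    """Yield rows one at a time from a server-side (named) cursor.

    `conn` is assumed to be a psycopg2 connection; passing a name to
    conn.cursor() is what makes the cursor server-side.
    """
    cur = conn.cursor(name=name)
    try:
        cur.itersize = itersize  # rows pulled from the server per round trip
        cur.execute(query)
        for row in cur:
            yield row
    finally:
        cur.close()
```

One caveat under the same assumptions: if write_to_db commits on the same connection, the transaction holding the named cursor ends, so a second connection for the writes (or a cursor created with withhold=True) may be needed.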
I hope this helps whomever is out there with same problems.

Python in Windows: large number of inserts using pyodbc causes memory leak

I am trying to populate a MS SQL 2005 database using python on windows. I am inserting millions of rows, and by 7 million I am using almost a gigabyte of memory. The test below eats up 4 megs of RAM for each 100k rows inserted:
import pyodbc
connection=pyodbc.connect('DRIVER={SQL Server};SERVER=x;DATABASE=x;UID=x;PWD=x')
cursor=connection.cursor()
connection.autocommit=True
while 1:
    cursor.execute("insert into x (a,b,c,d, e,f) VALUES (?,?,?,?,?,?)",1,2,3,4,5,6)
connection.close()
Hack solution: I ended up spawning a new process using the multiprocessing module to return memory. Still confused about why inserting rows in this way consumes so much memory. Any ideas?

I had the same issue, and it looks like a pyodbc issue with parameterized inserts: http://code.google.com/p/pyodbc/issues/detail?id=145
Temporarily switching to a static insert with the VALUES clause populated eliminates the leak, until I try a build from the current source.

I faced the same problem too.
I had to read more than 50 XML files, each about 300 MB, and load them into SQL Server 2005.
I tried the following:
Using the same cursor by dereferencing.
Closing/opening the connection.
Setting the connection to None.
Finally I ended up bootstrapping each XML file load using the Process module.
Now I have replaced that process with IronPython and System.Data.SqlClient.
This gives better performance and also a better interface.

Maybe close & re-open the connection every million rows or so?
Sure it doesn't solve anything, but if you only have to do this once you could get on with life!

Try creating a separate cursor for each insert. Reuse the cursor variable each time through the loop to implicitly dereference the previous cursor. Add a connection.commit after each insert.
You may only need something as simple as a time.sleep(0) at the bottom of each loop to allow the garbage collector to run.

You could also try forcing a garbage collection every once in a while with gc.collect() after importing the gc module.
Another option might be to use cursor.executemany() and see if that clears up the problem. The nasty thing about executemany(), though, is that it takes a sequence rather than an iterator (so you can't pass it a generator). I'd try the garbage collector first.
EDIT: I just tested the code you posted, and I am not seeing the same issue. Are you using an old version of pyodbc?
