Long-running Python program (using Pandas) keeps ramping up memory usage - python

I'm running a python script that handles and processes data using Pandas functions inside an infinite loop. But the program seems to be leaking memory over time.
This is the graph produced by the memory-profiler package:
Sadly, I cannot identify the source of the increasing memory usage. To my knowledge, all data (pandas timeseries) are stored in the object Obj, and I track the memory usage of this object using the pandas function .memory_usage and the objsize function get_deep_size(). According to their output, the memory usage should be stable around 90-100 MB. Other than this, I don't see where memory can ramp up.
It may be useful to know that the python program is running inside a docker container.
Below is a simplified version of the script which should illuminate the basic working principle.
import logging   # assumed in the original script (logger is used below)
import requests  # assumed in the original script
from datetime import datetime
from time import sleep
import objsize
from dateutil import relativedelta

logger = logging.getLogger(__name__)

def update_data(Obj, now_utctime):
    # fetch the newest timeseries data
    new_data = requests.get(host, start=Obj.data[0].index, end=now_utctime)
    Obj.data = Obj.data.append(new_data)
    # cut off data older than 1 day
    Obj.data = Obj.data.truncate(before=now_utctime - relativedelta.relativedelta(days=1))

class ExampleClass():
    def __init__(self):
        now_utctime = datetime.utcnow()
        self.data = requests.get(host, start=now_utctime - relativedelta.relativedelta(days=1), end=now_utctime)

Obj = ExampleClass()

while True:
    update_data(Obj, datetime.utcnow())
    logger.info(f"Average at {datetime.utcnow()} is at {Obj.data.mean()}")
    logger.info(f"Stored timeseries memory usage at {Obj.data.memory_usage(deep=True) * 10 ** -6} MB")
    logger.info(f"Stored Object memory usage at {objsize.get_deep_size(Obj) * 10 ** -6} MB")
    sleep(60)
Any advice into where memory could ramp up, or how to further investigate, would be appreciated.
EDIT: Looking at the chart, it makes sense that there are spikes right before I truncate, but since the data ingress is steady I don't understand why the usage doesn't settle back down instead of remaining at a higher level. There is also a sudden drop after every 4th cycle, even though the process has no other, broader cycle that could explain this ...

As suggested by moooeeeep, the increase in memory usage was related to a memory leak, the exact source of which remains to be identified. However, I was able to resolve the issue by manually calling the garbage collector after every loop iteration, via gc.collect().
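For reference, a minimal sketch of that workaround, reusing the Obj and update_data names from the simplified script above:

import gc
from datetime import datetime
from time import sleep

while True:
    update_data(Obj, datetime.utcnow())
    # Force a full collection pass each iteration; this reclaims cyclic
    # garbage that CPython's generational collector would otherwise only
    # pick up sporadically during the long sleep intervals.
    gc.collect()
    sleep(60)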

Related

long-running python program ram usage

I am currently working on a project where a Python program is supposed to be running for several days, essentially in an endless loop until a user intervenes.
I have observed that the RAM usage (as shown in the Windows Task Manager) rises slowly but steadily, for example from ~80 MB at program start to ~120 MB after one day. To get a closer look at this, I started to log the allocated memory with
tracemalloc.get_traced_memory() at regular intervals throughout the program execution. The output was written to the time series db (see image below).
tracemalloc output for one day runtime
To me it looks like the memory that is needed for the program does not accumulate over time. How does this fit with the output of the Windows Task Manager? Should I go through my program to search for growing data structures?
Thank you very much in advance!
Okay, turns out the answer is: no, this is not proper behaviour, the RAM usage can stay absolutely stable. I have tested this for three weeks now and the RAM usage never exceeded 80 MB.
The problem was in the usage of the influxdb v2 client.
You need to close both the write_api (implicitly done with the "with... as write_api:" statement) and the client itself (explicitly done via the "client.close()" in the example below).
In my previous version that had increasing memory usage, I only closed the write_api and not the client.
client = influxdb_client.InfluxDBClient(url=self.url, token=self.token, org=self.org)
with client.write_api(write_options=SYNCHRONOUS) as write_api:
    # force datatypes, because influx does not do fluffy ducktyping
    datapoint = influxdb_client.Point("TaskPriorities")\
        .tag("task_name", str(task_name))\
        .tag("run_uuid", str(run_uuid))\
        .tag("task_type", str(task_type))\
        .field("priority", float(priority))\
        .field("process_time_h", float(process_time))\
        .time(time.time_ns())
    answer = write_api.write(bucket=self.bucket, org=self.org, record=datapoint)
client.close()
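A slightly more defensive variant of the same pattern (just a sketch; the tag and field names are the placeholders from the example above) uses try/finally so the client is closed even if the write raises:

import time
import influxdb_client
from influxdb_client.client.write_api import SYNCHRONOUS

client = influxdb_client.InfluxDBClient(url=self.url, token=self.token, org=self.org)
try:
    with client.write_api(write_options=SYNCHRONOUS) as write_api:
        datapoint = influxdb_client.Point("TaskPriorities")\
            .tag("task_name", str(task_name))\
            .field("priority", float(priority))\
            .time(time.time_ns())
        write_api.write(bucket=self.bucket, org=self.org, record=datapoint)
finally:
    # Closing the client releases its internal connection pool; forgetting
    # this (and only closing the write_api) was the source of the growth here.
    client.close()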

Amount of memory used in a function?

Which Python module can I use to calculate the amount of memory spent executing a function? I'm using memory_profiler, but it shows the amount of memory used by each line of the algorithm; in this case I want one that shows the total amount used.
You can use tracemalloc to do what memory_profiler does automatically. It's a little unfriendly, but I think it does what you want pretty well.
Just follow the code snippet below.
import tracemalloc

def my_func():
    tracemalloc.start()
    ## Your code
    print(tracemalloc.get_traced_memory())
    tracemalloc.stop()
The output is given in the form (current, peak), i.e. current is the memory the code is currently using and peak is the maximum amount of memory the program used while executing.
The code is from geeksforgeeks. Check it out for more info. Also, there are a couple of other methods to trace the memory explained inside the link. Make sure to check it out.
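If you want the per-function total as a reusable helper, a small decorator along these lines (a sketch, not part of the linked article) reports the current and peak allocations traced during a single call:

import functools
import tracemalloc

def measure_memory(func):
    # Decorator that prints the current and peak traced memory of one call.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return func(*args, **kwargs)
        finally:
            current, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print(f"{func.__name__}: current={current / 1e6:.2f} MB, peak={peak / 1e6:.2f} MB")
    return wrapper

@measure_memory
def my_func():
    return [i ** 2 for i in range(100_000)]

my_func()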

psycopg2 leaking memory after large query

I'm running a large query in a python script against my postgres database using psycopg2 (I upgraded to version 2.5). After the query is finished, I close the cursor and connection, and even run gc, but the process still consumes a ton of memory (7.3gb to be exact). Am I missing a cleanup step?
import psycopg2
conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()
del rows
cursor.close()
conn.close()
import gc
gc.collect()
I ran into a similar problem and after a couple of hours of blood, sweat and tears, found the answer simply requires the addition of one parameter.
Instead of
cursor = conn.cursor()
write
cursor = conn.cursor(name="my_cursor_name")
or simpler yet
cursor = conn.cursor("my_cursor_name")
The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors
I found the instructions a little confusing in that I thought I'd need to rewrite my SQL to include
"DECLARE my_cursor_name ...." and then a "FETCH count 2000 FROM my_cursor_name" but it turns out psycopg does that all for you under the hood if you simply overwrite the "name=None" default parameter when creating a cursor.
The suggestion above of using fetchone or fetchmany doesn't resolve the problem, since, if you leave the name parameter unset, psycopg will by default attempt to load the entire query result into RAM. The only other thing you may need to do (besides declaring a name parameter) is change the cursor.itersize attribute from the default 2000 to, say, 1000 if you still have too little memory.
Joeblog has the correct answer. The way you deal with the fetching is important but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.
import datetime as dt
import psycopg2
import sys
import time

conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')
curPG.itersize = 100000  # Rows fetched at one time from the server
curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS !!
cptLigne = 0
for rec in curPG:
    cptLigne += 1
    if cptLigne % 10000 == 0:
        print('.', end='')
        sys.stdout.flush()  # To see the progression
conPG.commit()  # Also closes the cursor
conPG.close()
As you will see, the dots appear in rapid bursts, then pause while the next buffer of rows (itersize) is fetched, so you don't need to use fetchmany for performance. When I run this with /usr/bin/time -v, I get the result in less than 3 minutes, using only 200 MB of RAM (instead of 60 GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM either, as it uses a temporary table.
Please see the next answer by @joeblog for the better solution.
First, you shouldn't need all that RAM in the first place. What you should be doing here is fetching chunks of the result set. Don't do a fetchall(). Instead, use the much more efficient cursor.fetchmany method. See the psycopg2 documentation.
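Combining both answers (a sketch using the placeholder connection string and query from the question, plus a hypothetical process() handler): use a named, server-side cursor and drain it in chunks with fetchmany, so only one batch is ever held in client memory:

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
# A named cursor is a server-side cursor: rows are streamed in batches
# instead of being materialised on the client at execute() time.
cursor = conn.cursor(name="my_cursor_name")
cursor.itersize = 2000  # rows transferred per round trip when iterating

cursor.execute("""large query""")
while True:
    rows = cursor.fetchmany(2000)
    if not rows:
        break
    for row in rows:
        process(row)  # hypothetical per-row handling

cursor.close()
conn.close()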
Now, the explanation of why the memory isn't freed, and why that isn't a memory leak in the formally correct use of that term.
Most processes don't release memory back to the OS when it's freed, they just make it available for re-use elsewhere in the program.
Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to the object. Indirect references are rather inefficient, especially on modern CPUs where chasing pointers around does horrible things to performance.
What usually ends up happening, unless extra caution is exercised by the program, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.
The OS can't tell whether the program considers this memory still in use or not, so it can't just claim it back. Since the program doesn't tend to access the memory the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.
It's possible to write programs that hand memory back to the OS, but I'm not sure that you can do it with Python.
See also:
python - memory not being given back to kernel
Why doesn't memory get released to system after large queries (or series of queries) in django?
Releasing memory in Python
So: this isn't actually a memory leak. If you do something else that uses lots of memory, the process shouldn't grow much if at all, it'll re-use the previously freed memory from the last big allocation.

live plotting in matplotlib while performing a measurement that takes time

I would like to perform a measurement and plot a graph while the measurement is running. This measurement takes quite some time in Python (it has to retrieve data over a slow connection). The problem is that the graph freezes while measuring. The measurement consists of setting a center wavelength and then measuring some signal.
My program looks something like this:
# this is just some arbitrary library that has the functions set_wavelength and
# perform_measurement
from measurement_module import set_wavelength, perform_measurement
from pylab import *

xdata = np.linspace(600, 1000, 30)  # this will be the x axis
ydata = np.zeros(len(xdata))        # this will be the y data

for i in range(len(xdata)):
    # this call takes approx 1 s
    set_wavelength(xdata[i])
    # this takes approx 10 s
    ydata[i] = perform_measurement(xdata[i])
    # now I would like to plot the measured data
    plot(xdata, ydata)
    draw()
This works when it is run in IPython with the -pylab switch on, but while the measurement is running the figure freezes. How can I modify the behaviour to get an interactive plot while measuring?
You cannot simply use pylab.ion(), because python is busy while performing the measurements.
regards,
Dirk
You can, though it may be a bit awkward, run the data gathering as a separate process. I find Popen in the subprocess module quite handy. Let that data-gathering script save its results to disk somewhere, and use Popen.poll() to check whether it has completed.
It ought to work.
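A rough sketch of that approach (the script name and result file are made up): launch the data-gathering script with Popen, poll it from the plotting side, and re-read whatever it has written to disk so far:

import subprocess
import time

# Hypothetical worker script that appends each measurement to results.csv
proc = subprocess.Popen(["python", "gather_data.py"])

while proc.poll() is None:  # None means the child is still running
    # re-read the partial results.csv and refresh the plot here
    time.sleep(1.0)

print("measurement finished with return code", proc.returncode)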
I recommend buffering the data in large chunks and rendering/re-rendering when the buffer fills up. If you want it to be non-blocking, look at greenlets.
from gevent.greenlet import Greenlet
import copy

def render(buffer):
    '''
    do rendering stuff
    '''
    pass

buff = ''
while not_finished:
    buff = connection.read()
    g = Greenlet(render, copy.deepcopy(buff))
    g.start()
Slow input and output is the perfect time to use threads and queues in Python. Threads have their limitations, but this is a case where they work easily and effectively.
Outline of how to do this:
Generally the GUI (e.g., the matplotlib window) needs to be in the main thread, so do the data collection in a second thread. In the data thread, check for new data coming in (and if you do this in some kind of infinite polling loop, put in a short time.sleep to release the thread occasionally). Then, whenever needed, let the main thread know that there's new data to be processed/displayed. Exactly how to do this depends on the details of your program and your GUI: you could use a flag in the data thread that you check from the main thread, or a threading.Event, or, e.g., if you have a wx backend for matplotlib, wx.CallAfter is easy. I recommend looking through one of the many Python threading tutorials to get a sense of it; threading with a GUI usually has a few issues of its own, so also do a quick search on threading with your particular backend. This sounds cumbersome as I explain it so briefly, but it's really pretty easy and powerful, and will be smoother than, e.g., reading and writing to the same file from different processes. A minimal sketch of the idea follows below.
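A sketch of that pattern (reusing the set_wavelength/perform_measurement names from the question, and a queue instead of a flag; only the main thread touches the plot):

import queue
import threading

import numpy as np
import matplotlib.pyplot as plt

from measurement_module import set_wavelength, perform_measurement

data_queue = queue.Queue()
xdata = np.linspace(600, 1000, 30)

def acquire():
    # Runs in a background thread, so the slow calls no longer block the GUI.
    for x in xdata:
        set_wavelength(x)
        data_queue.put((x, perform_measurement(x)))

threading.Thread(target=acquire, daemon=True).start()

plt.ion()
fig, ax = plt.subplots()
line, = ax.plot([], [], "o-")
xs, ys = [], []
while len(xs) < len(xdata):
    try:
        # Only the main thread touches matplotlib objects.
        x, y = data_queue.get(timeout=0.1)
        xs.append(x)
        ys.append(y)
        line.set_data(xs, ys)
        ax.relim()
        ax.autoscale_view()
    except queue.Empty:
        pass
    plt.pause(0.05)  # keeps the GUI event loop responsive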
Take a look at Traits and Chaco, Enthought's type system and plotting library. They provide a nice abstraction to solve the problem you're running into. A Chaco plot will update itself whenever any of its dependencies change.

How do I find what is using memory in a Python process in a production system?

My production system occasionally exhibits a memory leak I have not been able to reproduce in a development environment. I've used a Python memory profiler (specifically, Heapy) with some success in the development environment, but it can't help me with things I can't reproduce, and I'm reluctant to instrument our production system with Heapy because it takes a while to do its thing and its threaded remote interface does not work well in our server.
What I think I want is a way to dump a snapshot of the production Python process (or at least gc.get_objects), and then analyze it offline to see where it is using memory. How do I get a core dump of a python process like this? Once I have one, how do I do something useful with it?
Using Python's gc garbage collector interface and sys.getsizeof() it's possible to dump all the python objects and their sizes. Here's the code I'm using in production to troubleshoot a memory leak:
import cPickle
import gc
import os
import sys

import psutil

def memory_dump():
    dump = open("memory.pickle", 'wb')
    xs = []
    for obj in gc.get_objects():
        i = id(obj)
        size = sys.getsizeof(obj, 0)
        # referrers = [id(o) for o in gc.get_referrers(obj) if hasattr(o, '__class__')]
        referents = [id(o) for o in gc.get_referents(obj) if hasattr(o, '__class__')]
        if hasattr(obj, '__class__'):
            cls = str(obj.__class__)
            xs.append({'id': i, 'class': cls, 'size': size, 'referents': referents})
    cPickle.dump(xs, dump)

rss = psutil.Process(os.getpid()).get_memory_info().rss
# Dump variables if using more than 100MB of memory
if rss > 100 * 1024 * 1024:
    memory_dump()
    os.abort()
Note that I'm only saving data from objects that have a __class__ attribute because those are the only objects I care about. It should be possible to save the complete list of objects, but you will need to take care choosing other attributes. Also, I found that getting the referrers for each object was extremely slow so I opted to save only the referents. Anyway, after the crash, the resulting pickled data can be read back like this:
with open("memory.pickle", 'rb') as dump:
objs = cPickle.load(dump)
Added 2017-11-15
The Python 3.6 version is here:
import gc
import sys
import _pickle as cPickle

def memory_dump():
    with open("memory.pickle", 'wb') as dump:
        xs = []
        for obj in gc.get_objects():
            i = id(obj)
            size = sys.getsizeof(obj, 0)
            # referrers = [id(o) for o in gc.get_referrers(obj) if hasattr(o, '__class__')]
            referents = [id(o) for o in gc.get_referents(obj) if hasattr(o, '__class__')]
            if hasattr(obj, '__class__'):
                cls = str(obj.__class__)
                xs.append({'id': i, 'class': cls, 'size': size, 'referents': referents})
        cPickle.dump(xs, dump)
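Once a dump has been loaded back (as shown earlier), a quick way to see which classes dominate is to aggregate the per-object sizes; a sketch building on the dict format used in the dump:

from collections import Counter

counts = Counter()
sizes = Counter()
for o in objs:  # objs as loaded from memory.pickle above
    counts[o['class']] += 1
    sizes[o['class']] += o['size']

# Print the ten classes holding the most (shallow) bytes
for cls, total in sizes.most_common(10):
    print(f"{cls}: {counts[cls]} objects, {total / 1e6:.1f} MB")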
I will expand on Brett's answer from my recent experience. The Dozer package is well maintained, and despite advancements like the addition of tracemalloc to the stdlib in Python 3.4, its gc.get_objects counting chart is my go-to tool for tackling memory leaks. Below I use dozer > 0.7, which had not been released at the time of writing (well, because I contributed a couple of fixes there recently).
Example
Let's look at a non-trivial memory leak. I'll use Celery 4.4 here and will eventually uncover a feature which causes the leak (and because it's a bug/feature kind of thing, it can be called mere misconfiguration, caused by ignorance). So there's a Python 3.6 venv where I pip install celery < 4.5, and I have the following module.
demo.py
import time

import celery

redis_dsn = 'redis://localhost'
app = celery.Celery('demo', broker=redis_dsn, backend=redis_dsn)

@app.task
def subtask():
    pass

@app.task
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)

if __name__ == '__main__':
    task.delay().get()
Basically a task which schedules a bunch of subtasks. What can go wrong?
I'll use procpath to analyse Celery node memory consumption. pip install procpath. I have 4 terminals:
1. procpath record -d celery.sqlite -i1 "$..children[?('celery' in @.cmdline)]" to record the Celery node's process tree stats
2. docker run --rm -it -p 6379:6379 redis to run Redis, which will serve as the Celery broker and result backend
3. celery -A demo worker --concurrency 2 to run the node with 2 workers
4. python demo.py to finally run the example
(4) will finish in under 2 minutes.
Then I use sqliteviz (pre-built version) to visualise what procpath has recorded. I drop the celery.sqlite there and use this query:
SELECT datetime(ts, 'unixepoch', 'localtime') ts, stat_pid, stat_rss / 256.0 rss
FROM record
And in sqliteviz I create a line chart trace with X=ts, Y=rss, and add a split transform with By=stat_pid. The resulting chart is:
This shape is likely pretty familiar to anyone who fought with memory leaks.
Finding leaking objects
Now it's time for dozer. I'll show the non-instrumented case (you can instrument your code in a similar way if you're able to). To inject the Dozer server into the target process I'll use Pyrasite. There are two things to know about it:
To run it, ptrace has to be configured as "classic ptrace permissions": echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope, which may be a security risk
There are non-zero chances that your target Python process will crash
With that caveat I:
pip install https://github.com/mgedmin/dozer/archive/3ca74bd8.zip (that's to-be 0.8 I mentioned above)
pip install pillow (which dozer uses for charting)
pip install pyrasite
After that I can get Python shell in the target process:
pyrasite-shell 26572
And inject the following, which will run Dozer's WSGI application using the stdlib's wsgiref server.
import threading
import wsgiref.simple_server

import dozer

def run_dozer():
    app = dozer.Dozer(app=None, path='/')
    with wsgiref.simple_server.make_server('', 8000, app) as httpd:
        print('Serving Dozer on port 8000...')
        httpd.serve_forever()

threading.Thread(target=run_dozer, daemon=True).start()
Opening http://localhost:8000 in a browser, you should see something like:
After that I run python demo.py from (4) again and wait for it to finish. Then in Dozer I set "Floor" to 5000, and here's what I see:
Two types related to Celery grow as the subtasks are scheduled:
celery.result.AsyncResult
vine.promises.promise
weakref.WeakMethod has the same shape and numbers and must be caused by the same thing.
Finding root cause
At this point, from the leaking types and the trends, it may already be clear what's going on in your case. If it's not, Dozer has a "TRACE" link per type, which allows tracing a chosen object (e.g. seeing its attributes) as well as its referrers (gc.get_referrers) and referents (gc.get_referents), and then continuing the process by traversing the graph.
But a picture says a thousand words, right? So I'll show how to use objgraph to render a chosen object's dependency graph.
pip install objgraph
apt-get install graphviz
Then:
I run python demo.py from (4) again
in Dozer I set floor=0, filter=AsyncResult
and click "TRACE" which should yield
Then in Pyrasite shell run:
objgraph.show_backrefs([objgraph.at(140254427663376)], filename='backref.png')
The PNG file should contain:
Basically there's some Context object containing a list called _children, which in turn contains many instances of celery.result.AsyncResult, which leak. Changing the filter to celery.*context in Dozer, here's what I see:
So the culprit is celery.app.task.Context. Searching for that type would certainly lead you to the Celery task page. Quickly searching for "children" there, here's what it says:
trail = True
If enabled the request will keep track of subtasks started by this task, and this information will be sent with the result (result.children).
Disabling the trail by setting trail=False like:
@app.task(trail=False)
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)
Then restarting the Celery node from (3) and python demo.py from (4) yet again, shows this memory consumption.
Problem solved!
Could you record the traffic (via a log) on your production site, then re-play it on your development server instrumented with a python memory debugger? (I recommend dozer: http://pypi.python.org/pypi/Dozer)
Make your program dump core, then clone an instance of the program on a sufficiently similar box using gdb. There are special macros to help with debugging python programs within gdb, but if you can get your program to concurrently serve up a remote shell, you could just continue the program's execution, and query it with python.
I have never had to do this, so I'm not 100% sure it'll work, but perhaps the pointers will be helpful.
I don't know how to dump an entire python interpreter state and restore it. It would be useful, I'll keep my eye on this answer in case anyone else has ideas.
If you have an idea where the memory is leaking, you can add checks of the refcounts of your objects. For example:
x = SomeObject()
# ... later ...
oldRefCount = sys.getrefcount(x)
suspiciousFunction(x)
if oldRefCount != sys.getrefcount(x):
    print("Possible memory leak...")
You could also check for reference counts higher than some number that is reasonable for your app. To take it further, you could modify the Python interpreter to do these kinds of checks by replacing the Py_INCREF and Py_DECREF macros with your own. This might be a bit dangerous in a production app, though.
Here is an essay with more info on debugging these sorts of things. It's more geared for plugin authors but most of it applies.
Debugging Reference Counts
The gc module has some functions that might be useful, like listing all objects the garbage collector found to be unreachable but cannot free, or a list of all objects being tracked.
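For instance, a quick interactive check along these lines (a sketch) shows how many objects the collector is tracking and whether anything uncollectable has accumulated:

import gc

gc.collect()
# Uncollectable objects the collector found unreachable but could not free
# (typically only populated with gc.DEBUG_SAVEALL set, or with problematic
# finalizers on older Python versions)
print(len(gc.garbage))

# All objects currently tracked by the collector
print(len(gc.get_objects()))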
If you have a suspicion which objects might leak, the weakref module could be handy to find out if/when objects are collected.
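For example, a weakref.finalize callback (a minimal sketch, with a made-up Suspect class) tells you whether and when a suspect object is actually collected:

import gc
import weakref

class Suspect:
    pass

obj = Suspect()
weakref.finalize(obj, lambda: print("Suspect instance was collected"))

del obj
gc.collect()  # if nothing prints here, something still holds a reference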
Meliae looks promising:
This project is similar to heapy (in the 'guppy' project), in its attempt to understand how memory has been allocated.
Currently, its main difference is that it splits the task of computing summary statistics, etc of memory consumption from the actual scanning of memory consumption. It does this, because I often want to figure out what is going on in my process, while my process is consuming huge amounts of memory (1GB, etc). It also allows dramatically simplifying the scanner, as I don't allocate python objects while trying to analyze python object memory consumption.
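Roughly, using it looks like this (a sketch based on Meliae's documented API; check the current docs for the exact calls):

# In the running (leaking) process: write a snapshot of every live object
from meliae import scanner
scanner.dump_all_objects('memory.json')

# Later, offline: load the snapshot and summarise usage by type
from meliae import loader
om = loader.load('memory.json')
print(om.summarize())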
