I'm looking for an example that purposely makes a memory leak in Python.
It should be as short and simple as possible and ideally not use non-standard dependencies (that could simply do the memory leak in C code) or multi-threading/processing.
I've seen memory leaks achieved before but only when bad things were being done to libraries such as matplotlib. Also, there are many questions about how to find and fix memory leaks in Python, but they all seem to be big programs with lots of external dependencies.
I'm asking because I want to know how good Python's GC really is. I know it detects reference cycles. However, can it be tricked? Is there some way to leak memory? It may be impossible to solve the most restrictive version of this problem. In that case, I'd be very happy to see a rigorous argument why. Ideally, the answer should refer to the actual implementation and not just state that "an ideal garbage collector would be ideal and disallow memory leaks".
For nitpicking purposes: An ideal solution to the problem would be a program like this:
# Use Python version at least v3.10
# May use imports.
# Bonus points for only standard library.
# If the problem is unsolvable otherwise (please argue that it is),
# then you may use e.g. Numpy, Scipy, Pandas. Minus points for Matplotlib.
def memleak():
    # do whatever you want but only within this function
    # No global variables!
    # Bonus points for no observable side-effects (besides memory use)
    # ...

for _ in range(100):
    memleak()
The function must return and be called multiple times. Goals in order of bonus points (high number = many bonus points):
1. The program keeps using more memory until it crashes.
2. After calling the function multiple times (e.g. the 100 specified above), the program may continue doing other (normal) things such that the memory leaked during the function is never freed.
3. Like 2, but the memory cannot be freed, even by calling gc manually and similar means.
One way to "trick" CPython's garbage collector into leaking memory is by invalidating an object's reference count. We can do this by creating an extraneous strong reference that never gets deleted.
To create a new strong reference, we need to invoke Py_IncRef (or Py_NewRef) from Python's C API. This can be done via ctypes.pythonapi:
import ctypes
import sys
# Create C API callable
inc_ref = ctypes.pythonapi.Py_IncRef
inc_ref.argtypes = [ctypes.py_object]
inc_ref.restype = None
# Create an arbitrary object.
obj = object()
# Print the number of references to obj.
# This should be 2:
# - one for the global variable 'obj'
# - one for the argument inside of 'sys.getrefcount'
print(sys.getrefcount(obj))
# Create a new strong reference.
inc_ref(obj)
# Print the number of references to obj.
# This should be 3 now.
print(sys.getrefcount(obj))
outputs
2
3
Concretely, you can write your memleak function as
import ctypes

def memleak():
    # Create C API callable
    inc_ref = ctypes.pythonapi.Py_IncRef
    inc_ref.argtypes = [ctypes.py_object]
    inc_ref.restype = None
    # Allocate a large object
    obj = list(range(10_000_000))
    # Increment its ref count
    inc_ref(obj)
    # the extra reference is never released, so obj outlives this function

memleak()  # leaks memory
An object with an extra strong reference that nothing will ever release is never freed by reference counting, and it won't be detected as unreachable by the optional cyclic garbage collector either. Running gc manually via
gc.collect()
will have no effect.
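To check this claim on a small scale, here is a minimal sketch for CPython. Payload and survives_collection are hypothetical names introduced only for this illustration; a plain class instance is used instead of a list because lists do not support weak references. The weak reference acts as a probe that tells us whether the object is still alive after gc.collect().

```python
import ctypes
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large object (instances support weakrefs)."""


def survives_collection(leak: bool) -> bool:
    # Same C API callable as in the answer above (CPython only).
    inc_ref = ctypes.pythonapi.Py_IncRef
    inc_ref.argtypes = [ctypes.py_object]
    inc_ref.restype = None

    obj = Payload()
    probe = weakref.ref(obj)  # does not keep obj alive
    if leak:
        inc_ref(obj)          # extra strong reference that is never released
    del obj                   # drop the only named reference
    gc.collect()              # manual collection, as in the question's goal 3
    return probe() is not None


print(survives_collection(leak=False))
print(survives_collection(leak=True))
```

Without the extra reference the object is gone as soon as the name is deleted; with it, the object is still alive after the manual collection.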
Related
I'm new in Python.
Let's say, I use large pandas data frames.
My code looks something like:
all_data = pd.read_csv(huge_file_name)
part_data = all_data[['ColumnName1', 'ColumnName2', 'ColumnName3']]
data_filtered = part_data.loc[part_data['ColumnName2'] == -1]
and so on.
Is there some way for Python to delete all_data, part_data and other variables that are no longer used?
I can write del var_name, but that will make the code very dirty.
I could also reuse the same name for all variables, but that doesn't look good either.
Thank you all in advance!
The del keyword is the way to do it; I'm not sure there's much to be done about your concern for making the code "dirty." Python people like to say that explicit is better than implicit, and this would be an instance of that.
Otherwise declare the intermediate variables within a function scope and the space used by those variables will be freed (or rather marked for "garbage collection"; see below) when the function terminates.
So you could:
import gc
import pandas as pd

all_data = pd.read_csv(huge_file_name)
# select the columns of interest
part_data = all_data[['ColumnName1', 'ColumnName2', 'ColumnName3']]
# keep only the rows where ColumnName2 equals -1
data_filtered = part_data.loc[part_data['ColumnName2'] == -1]
del all_data, part_data
# and if you're impatient for that memory to be freed, like RIGHT now
gc.collect()
Or you could:
import gc
import pandas as pd

def filter_data(infile):
    all_data = pd.read_csv(infile)
    # select the columns of interest
    part_data = all_data[['ColumnName1', 'ColumnName2', 'ColumnName3']]
    # keep only the rows where ColumnName2 equals -1
    return part_data.loc[part_data['ColumnName2'] == -1]

data_filtered = filter_data(huge_file_name)
# force out-of-scope variables to be garbage collected RIGHT now
gc.collect()
The del keyword releases a variable from the local scope so it can be (eventually) garbage collected, but the memory freed when variables go out of scope may not be immediately returned to the operating system. The SO thread AMC helpfully pointed you to has details.
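To see the function-scope behaviour without loading a real CSV, here is a minimal sketch for CPython. Frame and load_and_filter are hypothetical stand-ins for a DataFrame and for the filtering function; weakref.finalize registers a callback that runs when the intermediate object is actually destroyed.

```python
import weakref


class Frame:
    """Hypothetical stand-in for a large DataFrame."""


def load_and_filter(events):
    all_data = Frame()  # intermediate object, local to this function
    # Record the moment all_data is destroyed.
    weakref.finalize(all_data, events.append, "all_data freed")
    result = Frame()    # the object we actually want to keep
    return result


events = []
filtered = load_and_filter(events)
print(events)  # on CPython: ['all_data freed']
```

On CPython the intermediate object is destroyed as soon as the function returns, with no del and no gc.collect(). Other implementations may delay this until their next collection.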
Garbage collection strategies are PhD-level computer science stuff, but my intuition is that GC is only triggered when there is some "pressure" on the Python runtime to release some memory; as in, new variable declarations that would need to use some memory previously in use by out-of-scope variables.
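For CPython specifically, the cyclic collector's schedule can be inspected rather than guessed at. As far as I know, it is driven by counts of tracked allocations compared against thresholds, not by memory pressure, while plain reference counting frees most objects the moment their last reference goes away. A minimal sketch using only the gc module:

```python
import gc

# Per-generation thresholds that trigger an automatic cyclic collection.
thresholds = gc.get_threshold()
# Current per-generation counters that are compared against the thresholds.
counts = gc.get_count()

print("thresholds:", thresholds)
print("current counts:", counts)
```

The exact numbers and their meaning vary between CPython versions, so treat the output as a diagnostic rather than something to rely on.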
You were careful to point out that this is a large CSV file being read into a single (Pandas) data structure, but be mindful of the fact that out-of-scope variables are normally automatically garbage collected, and usually you do not need to micro-manage this process yourself.
Here is some background on garbage collection in Python that you may find illuminating, and here is a discussion of other times when del is useful (deleting slices out of a list, for example).
I am trying to write a Python module which checks the consistency of the MAC addresses stored in the HW memory. The scale could go up to 80K MAC addresses. But when I make multiple calls to get a list of MAC addresses through a Python method, the memory does not get freed and eventually I run out of memory.
An example of what I am doing is:
import resource
import copy

def get_list():
    list1 = None
    list1 = []
    for j in range(1,10):
        for i in range(0,1000000):
            list1.append('abcdefg')
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)
    return list1

for i in range(0,5):
    x=get_list()
On executing the script, I get:
45805
53805
61804
69804
77803
85803
93802
101801
109805
118075
126074
134074
142073
150073
158072
166072
174071
182075
190361
198361
206360
214360
222359
230359
238358
246358
254361
262365
270364
278364
286363
294363
302362
310362
318361
326365
334368
342368
350367
358367
366366
374366
382365
390365
398368
i.e. the memory usage reported keeps going up.
Is it that I am looking at the memory usage in a wrong way?
And if not, is there a way to keep the memory usage from going up between function calls in a loop? (In my case with MAC addresses, I do not request the same list of MAC addresses again. I get the list from a different section of the HW memory, i.e. all the calls to get MAC addresses are valid, but after each call the data obtained is useless and can be discarded.)
Python is a managed language. Memory is, generally speaking, the concern of the implementation rather than the average developer. The system is designed to reclaim memory that you are no longer using automatically.
If you are using CPython, an object will be destroyed when its reference count reaches zero, or when the cyclic garbage collector finds and collects it. If you want to reclaim the memory belonging to an object, you need to ensure that no references to it remain, or at least that it is not reachable from any stack frame's variables. That is to say, it should not be possible to refer to the data you want reclaimed, either directly or through some expression such as foo.bar[42], from any currently executing function.
If you are using another implementation, such as PyPy, the rules may vary. In particular, reference counting is not required by the Python language standard, so objects may not go away until the next garbage collection run (and then you may have to wait for the right generation to be collected).
For older versions of Python (prior to Python 3.4), you also need to worry about reference cycles which involve finalizers (__del__() methods). The old garbage collector cannot collect such cycles, so they will (basically) get leaked. Most built-in types do not have finalizers, are not capable of participating in reference cycles, or both, but this is a legitimate concern if you are creating your own classes.
For your use case, you should empty or replace the list when you no longer need its contents (with e.g. list1 = [] or del list1[:]), or return from the function which created it (assuming it's a local variable, rather than a global variable or some other such thing). If you find that you are still running out of memory after that, you should either switch to a lower-overhead language like C or invest in more memory. For more complicated cases, you can use the gc module to test and evaluate how the garbage collector is interacting with your program.
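On the measurement question: ru_maxrss is the peak resident set size of the process, so it can never go down, even after memory has been released. A minimal sketch, assuming you only care about memory allocated by Python itself, uses tracemalloc from the standard library to compare current and peak usage. get_list here is a scaled-down stand-in for the function in the question.

```python
import tracemalloc


def get_list(n=100_000):
    # Scaled-down version of the question's function.
    items = []
    for _ in range(n):
        items.append('abcdefg')
    return items


tracemalloc.start()

data = get_list()
current_with_list, _ = tracemalloc.get_traced_memory()

del data  # drop the only reference to the list
current_after_del, peak = tracemalloc.get_traced_memory()

tracemalloc.stop()

print("current with list:", current_with_list)
print("current after del:", current_after_del)
print("peak:", peak)
```

The current figure drops once the list is discarded, while the peak stays where it was, which is the same pattern a peak-only counter like ru_maxrss would hide.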
Try this. It might not always free the memory, since the objects may still be in use.
See if it works:
import gc
gc.collect()
Read in Python CFFI documentation:
The interface is based on LuaJIT’s FFI (...)
Read on LuaJIT website (about ffi.gc()):
This function allows safe integration of unmanaged resources into the automatic memory management of the LuaJIT garbage collector. Typical usage:
local p = ffi.gc(ffi.C.malloc(n), ffi.C.free)
...
p = nil -- Last reference to p is gone.
-- GC will eventually run finalizer: ffi.C.free(p)
So, using Python CFFI, do you have to trigger the destruction of the last reference to a variable instantiated using ffi.gc (i.e. one that needs a special deallocation function because some parts of it are dynamically allocated) by setting it to, e.g., ffi.NULL?
Python is designed so that all objects are garbage collected as soon as there are no more references to them (or soon afterwards), like any other garbage-collected language (including Lua). The trick of setting p = None explicitly (or del p) merely makes sure that the local variable p does not keep the object alive. It is pointless (barring special cases) if, for example, it is one of the last things done in the function. You don't need it any more than you need it to free, say, a variable that contains a regular string object.
I'm reading in a collection of objects (tables like sqlite3 tables or dataframes) from an Object Oriented DataBase, most of which are small enough that the Python garbage collector can handle without incident. However, when they get larger in size (less than 10 MB's) the GC doesn't seem to be able to keep up.
Pseudocode looks like this:
walk = walkgenerator('/path')
objs = objgenerator(walk)
with db.transaction(bundle=True, maxSize=10000, maxParts=10):
    oldobj = None
    oldtable = None
    for obj in objs:
        currenttable = obj.table
        if oldtable and oldtable in currenttable:
            db.delete(oldobj.path)
        del oldtable
        oldtable = currenttable
        del oldobj
        oldobj = obj
        if not count % 100:
            gc.collect()
I'm looking for an elegant way to manage memory while allowing Python to handle it when possible.
Perhaps embarrassingly, I've tried using del to help clean up reference counts.
I've tried gc.collect() at varying modulo counts in my for loops:
100 (no difference),
1 (slows loop quite a lot, and I will still get a memory error of some type),
3 (loop is still slow but memory still blows up eventually)
Suggestions are appreciated!!!
Particularly, if you can give me tools to assist with introspection. I've used Windows Task Manager here, and it seems to more or less randomly spring a memory leak. I've limited the transaction size as much as I feel comfortable, and that seems to help a little bit.
There's not enough info here to say much, but what I do have to say wouldn't fit in a comment so I'll post it here ;-)
First, and most importantly, in CPython garbage collection is mostly based on reference counting. gc.collect() won't do anything for you (except burn time) unless trash objects are involved in reference cycles (an object A can be reached from itself by following a chain of pointers transitively reachable from A). You create no reference cycles in the code you showed, but perhaps the database layer does.
So, after you run gc.collect(), does memory use go down at all? If not, running it is pointless.
I expect it's most likely that the database layer is holding references to objects longer than necessary, but digging into that requires digging into exact details of how the database layer is implemented.
One way to get clues is to print the result of sys.getrefcount() applied to various large objects:
>>> import sys
>>> bigobj = [1] * 1000000
>>> sys.getrefcount(bigobj)
2
As the docs say, the result is generally 1 larger than you might hope, because the refcount of getrefcount()'s argument is temporarily incremented by 1 simply because it is being used (temporarily) as an argument.
So if you see a refcount greater than 2, del won't free the object.
Another way to get clues is to pass the object to gc.get_referrers(). That returns a list of objects that directly refer to the argument (provided that a referrer participates in Python's cyclic gc).
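As a minimal sketch of that kind of introspection: Cache below is a hypothetical stand-in for a layer (such as a database wrapper) that quietly holds on to objects. The names are assumptions made for illustration only.

```python
import gc
import sys


class Cache:
    """Hypothetical layer that keeps references longer than the caller expects."""

    def __init__(self):
        self.items = []


cache = Cache()
big = [0] * 1000
cache.items.append(big)  # the hidden extra reference

# Higher than the baseline for a freshly created object, hinting at another holder.
print("refcount:", sys.getrefcount(big))

# Ask the collector which tracked containers refer to big directly.
referrers = gc.get_referrers(big)
found = any(r is cache.items for r in referrers)
print("held by cache.items:", found)
```

The referrer list also contains the namespace holding the name big, so expect some noise; the useful entries are the ones you did not know about.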
BTW, you need to be clearer about what you mean by "doesn't seem to work" and "blows up eventually". Can't guess. What exactly goes wrong? For example, is MemoryError raised? Something else? Tracebacks often yield a world of useful clues.
I have created some Python code which creates an object in a loop, and in every iteration overwrites this object with a new one of the same type. This is done 10,000 times, and Python takes up 7 MB of memory every second until my 3 GB of RAM is used. Does anyone know of a way to remove the objects from memory?
I think this is a circular reference problem (though the question isn't explicit about this).
One way to solve it is to invoke garbage collection manually. When you run the garbage collector manually, it sweeps objects involved in circular references too.
import gc

for i in range(10000):
    j = myObj()
    processObj(j)
    # assuming the reference count is not zero but the
    # object is no longer usable after the iteration
    if i % 100 == 0:
        gc.collect()
Don't run the garbage collector too often, because it has its own overhead; e.g. if you run it on every iteration, the loop will become extremely slow.
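To make the circular-reference case concrete, here is a minimal sketch for CPython. Node and make_cycle are hypothetical names for this illustration. Automatic collection is disabled for the duration so that the only collection is the explicit one.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.other = None


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # a and b now refer to each other
    return weakref.ref(a)    # probe that does not keep a alive


gc.disable()                 # rule out an automatic collection in between
probe = make_cycle()
alive_before = probe() is not None  # refcounting alone cannot free the pair
unreachable = gc.collect()          # the cyclic collector can
alive_after = probe() is not None
gc.enable()

print("alive before collect:", alive_before)
print("alive after collect:", alive_after)
print("objects found unreachable:", unreachable)
```

If the probe is already dead before gc.collect() in your own program, there is no cycle and manual collection will not help.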
You haven't provided enough information - this depends on the specifics of the object you are creating and what else you're doing with it in the loop. If the object does not create circular references, it should be deallocated on the next iteration. For example, the code
for x in range(100000):
    obj = " " * 10000000
will not result in ever-increasing memory allocation.
This is an old problem that was corrected for some types in Python 2.5. What was happening was that Python was not very good at releasing the memory of things like empty lists/dictionaries/tuples/floats/ints. In Python 2.5 this was fixed... mostly. However, floats and ints are singletons for comparisons, so once one of those is created it stays around as long as the interpreter is alive. I've been bitten worst by this when dealing with large amounts of floats, since they have a nasty habit of being unique. This was characterized for Python 2.4 and later updated when the fix was folded into Python 2.5.
The best way I've found around it is to upgrade to python 2.5 or newer to take care of the lists/dictionaries/tuples issue. For numbers the only solution is to not let large amounts of numbers get into python. I've done it with my own wrapper to a c++ object, but I have the impression that numpy.array will give similar results.
As a post script I have no idea what has happened to this in python 3, but I'm suspicious that numbers are still part of a singleton. So the memory leak is actually a feature of the language.
If you're creating circular references, your objects won't be deallocated immediately, but have to wait for a GC cycle to run.
You could use the weakref module to address this problem, or explicitly del your objects after use.
I found that in my case (with Python 2.5.1), with circular references involving classes that have __del__() methods, not only was garbage collection not happening in a timely manner, the __del__() methods of my objects were never getting called, even when the script exited. So I used weakref to break the circular references and all was well.
Kudos to Miles who provided all the information in his comments for me to put this together.
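A minimal sketch of the weakref approach, assuming a parent/child structure (Parent and Child are hypothetical classes for illustration): the child keeps only a weak back-reference, so no cycle is formed and the parent is freed by reference counting alone on CPython.

```python
import gc
import weakref


class Child:
    parent = None  # will hold a weak reference to the parent


class Parent:
    def __init__(self):
        self.children = []

    def add(self, child):
        # Weak back-reference: the child does not keep the parent alive.
        child.parent = weakref.ref(self)
        self.children.append(child)


gc.disable()  # show that the cyclic collector is not needed here
p = Parent()
c = Child()
p.add(c)

probe = weakref.ref(p)
del p                            # last strong reference to the parent
freed = probe() is None          # freed immediately by refcounting
parent_gone = c.parent() is None # the weak back-reference now yields None
gc.enable()

print("parent freed:", freed)
print("child sees no parent:", parent_gone)
```

Code that follows the back-reference has to handle the None case, since the parent may already be gone.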
Here's one thing you can do at the REPL to force a dereferencing of a variable:
>>> x = 5
>>> x
5
>>> del x
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
weakref can be used for code with circular object structures, as in the explained example.