I'm new to Python.
Let's say I work with large pandas DataFrames.
My code looks something like:
all_data = pd.read_csv(huge_file_name)
part_data = all_data[['ColumnName1', 'ColumnName2', 'ColumnName3']]
data_filtered = part_data.loc[part_data['ColumnName2'] == -1]
and so on.
Is there some way for Python to delete all_data, part_data, and other variables that are no longer used?
I can write del var_name, but that clutters the code.
I could also reuse the same name for every variable, but that doesn't look good either.
Thank you all in advance!
The del keyword is the way to do it; I'm not sure there's much to be done about your concern for making the code "dirty." Python people like to say that explicit is better than implicit, and this would be an instance of that.
Otherwise, declare the intermediate variables within a function scope; the space used by those variables will be freed (or rather, marked for "garbage collection"; see below) when the function returns.
So you could:
import gc
all_data = pd.read_csv(huge_file_name)
part_data = all_data[['ColumnName1', 'ColumnName2', 'ColumnName3']]
data_filtered = part_data.loc[part_data['ColumnName2'] == -1]
del all_data, part_data
# and if you're impatient for that memory to be freed, like RIGHT now
gc.collect()
Or you could:
import gc
def filter_data(infile):
    all_data = pd.read_csv(infile)
    part_data = all_data[['ColumnName1', 'ColumnName2', 'ColumnName3']]
    return part_data.loc[part_data['ColumnName2'] == -1]
data_filtered = filter_data(huge_file_name)
# force out-of-scope variables to be garbage collected RIGHT now
gc.collect()
The del keyword releases a name from the local scope so that the object it referenced can (eventually) be garbage collected, but the memory freed when variables go out of scope may not be returned to the operating system immediately. The SO thread AMC helpfully pointed you to has details.
Garbage collection strategies are PhD-level computer science stuff, but the CPython picture is simpler than it sounds: most objects are freed as soon as their reference count drops to zero, and the cyclic collector is triggered not by memory "pressure" but when the number of allocations minus deallocations crosses a per-generation threshold.
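You can inspect those thresholds and the live counters yourself; a minimal sketch (the default threshold values may differ between CPython versions):
import gc

# Each generation has an allocation-count threshold; a generation-0
# collection runs when (allocations - deallocations) exceeds the first one.
print(gc.get_threshold())  # e.g. (700, 10, 10) on CPython
print(gc.get_count())      # current counters for generations 0, 1, 2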
You were careful to point out that this is a large CSV file being read into a single pandas data structure, but be mindful that out-of-scope variables are normally garbage collected automatically; you usually do not need to micro-manage this process yourself.
Here is some background on garbage collection in Python that you may find illuminating, and here is a discussion of other times when del is useful (deleting slices out of a list, for example).
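To illustrate that last point, a quick sketch of del applied to a list slice, which removes elements in place rather than unbinding a name:
nums = list(range(10))
del nums[2:5]    # deletes the elements at indices 2..4 in place
print(nums)      # [0, 1, 5, 6, 7, 8, 9]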
I'm looking for an example that purposely makes a memory leak in Python.
It should be as short and simple as possible and ideally not use non-standard dependencies (that could simply do the memory leak in C code) or multi-threading/processing.
I've seen memory leaks achieved before but only when bad things were being done to libraries such as matplotlib. Also, there are many questions about how to find and fix memory leaks in Python, but they all seem to be big programs with lots of external dependencies.
The reason for asking this is about how good Python's GC really is. I know it detects reference cycles. However, can it be tricked? Is there some way to leak memory? It may be impossible to solve the most restrictive version of this problem. In that case, I'm very happy to see a rigorous argument why. Ideally, the answer should refer to the actual implementation and not just state that "an ideal garbage collector would be ideal and disallow memory leaks".
For nitpicking purposes: An ideal solution to the problem would be a program like this:
# Use Python version at least v3.10.
# May use imports.
# Bonus points for only standard library.
# If the problem is unsolvable otherwise (please argue that it is),
# then you may use e.g. NumPy, SciPy, Pandas. Minus points for Matplotlib.
def memleak():
    # do whatever you want, but only within this function
    # No global variables!
    # Bonus points for no observable side effects (besides memory use)
    ...
for _ in range(100):
    memleak()
The function must return and be called multiple times. Goals, in order of bonus points (higher number = more bonus points):
1. The program keeps using more memory until it crashes.
2. After calling the function multiple times (e.g. the 100 specified above), the program may continue doing other (normal) things, such that the memory leaked during the function is never freed.
3. Like 2, but the memory cannot be freed even by calling gc manually and similar means.
One way to "trick" CPython's garbage collector into leaking memory is by invalidating an object's reference count. We can do this by creating an extraneous strong reference that never gets deleted.
To create a new strong reference, we need to invoke Py_IncRef (or Py_NewRef) from Python's C API. This can be done via ctypes.pythonapi:
import ctypes
import sys
# Create C API callable
inc_ref = ctypes.pythonapi.Py_IncRef
inc_ref.argtypes = [ctypes.py_object]
inc_ref.restype = None
# Create an arbitrary object.
obj = object()
# Print the number of references to obj.
# This should be 2:
# - one for the global variable 'obj'
# - one for the argument inside of 'sys.getrefcount'
print(sys.getrefcount(obj))
# Create a new strong reference.
inc_ref(obj)
# Print the number of references to obj.
# This should be 3 now.
print(sys.getrefcount(obj))
outputs
2
3
Concretely, you can write your memleak function as
import ctypes

def memleak():
    # Create C API callable
    inc_ref = ctypes.pythonapi.Py_IncRef
    inc_ref.argtypes = [ctypes.py_object]
    inc_ref.restype = None
    # Allocate a large object
    obj = list(range(10_000_000))
    # Increment its ref count
    inc_ref(obj)
    # obj will have a dangling reference after this function exits

memleak()  # leaks memory
An object with a dangling strong reference will never be freed by reference counting, and won't be detected as an unreachable object by the optional garbage collector. Running the GC manually via
gc.collect()
will have no effect.
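One way to observe the leak, reusing the memleak function above; a sketch using the standard library's tracemalloc (the exact numbers will vary by platform):
import gc
import tracemalloc

tracemalloc.start()
memleak()
gc.collect()  # has no effect on the leaked list
current, peak = tracemalloc.get_traced_memory()
print(f'still allocated after gc.collect(): {current / 1e6:.1f} MB')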
I have a list which will get really big, so I save the list to my HDD and continue with an empty one. My question is: when I do myList = [], will the old data be deleted, or will it remain somewhere in RAM? I fear that the pointer of myList will just point somewhere else and the old data will not be touched.
myList = []
for i in range(bigNumber1):
    for k in range(bigNumber2):
        myList.append(bigData(i, k))
    savemat("data" + str(i), {"data": myList})
    myList = []
Good day.
In Python, as in many other programming languages, objects that are no longer referenced are found and cleared from memory by the garbage collector. How exactly this is done under the hood is covered in more detail here:
https://stackify.com/python-garbage-collection/
Happy coding!
Python uses Garbage Collection for memory management (read more here).
The garbage collector attempts to reclaim memory which was allocated
by the program, but is no longer referenced—also called garbage.
So your data will automatically be deleted. However, if you want to be sure that the memory is freed at a particular point, you can call the GC directly with
import gc
gc.collect()
This is not recommended though.
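If you want to convince yourself that the old list really is reclaimed when you rebind the name, here is a minimal sketch using weakref from the standard library (Big is a hypothetical stand-in for your bigData objects; in CPython the reclamation happens immediately via reference counting, while other implementations may defer it):
import weakref

class Big:
    pass

myList = [Big() for _ in range(3)]
probe = weakref.ref(myList[0])  # watch one element without keeping it alive

myList = []                     # rebinding drops the old list and its contents
print(probe())                  # None: the old element has been reclaimed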
I have some quite complex Python (2.7 on Ubuntu) code which is leaking memory unexpectedly. To break it down: it is a method which is repeatedly called (and itself calls different methods) and returns a very small object. After the method finishes, the used memory is not released. As far as I know, it is not unusual to reserve some memory for later use, but with big enough input my machine eventually consumes all memory and freezes. This is not the case if I run it in a subprocess with concurrent.futures' ProcessPoolExecutor, so I have to assume it is not my code but some underlying problem?!
Is this a known issue? Might it be a problem in 3rd party libraries I am using (e.g. PyQgis)? Where should I start to search for the problem?
Some more background, to eliminate silly reasons (because I am still somewhat of a beginner):
The method uses some global variables, but to my understanding these should only be visible in the file where they are declared, and in any case should be overwritten on the next call of the method?!
To clarify in pseudocode:
def main():
    # load input from file
    for x in input:
        result = extra_file.initialization(x)
        # here is the point where memory should get released, in my opinion

# extra file
def initialization(x):
    global input
    input = x
    result_container = []
    while not result:
        part_of_result = method1()
        result_container.append(part_of_result)
        if result_container fulfills condition to be the final result:
            result = result_container
    del input
    return result

def method1():
    # do stuff
    method2()
    # do stuff
    return part_of_result

def method2():
    # do stuff with input, not altering it

Numerous different methods and global variables are involved, and the global declaration is used to avoid passing around five or so input variables through multiple methods which don't even use them.
Should I try using garbage collection? All references after finishing the method should be deleted and python itself should take care of it?
Definitely try using garbage collection. I don't believe it's a known problem.
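A minimal sketch of what that could look like in your main loop, using the names from the pseudocode above:
import gc

def main():
    # load input from file
    for x in input:
        result = extra_file.initialization(x)
        gc.collect()  # force a cycle-collection pass after each iteration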
I am trying to write a Python module which checks the consistency of the MAC addresses stored in the HW memory. The scale could go up to 80K MAC addresses. But when I make multiple calls to get a list of MAC addresses through a Python method, the memory does not get freed up and eventually I run out of memory.
An example of what I am doing is:
import resource
import copy

def get_list():
    list1 = None
    list1 = []
    for j in range(1, 10):
        for i in range(0, 1000000):
            list1.append('abcdefg')
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000)
    return list1

for i in range(0, 5):
    x = get_list()
On executing the script, I get:
45805
53805
61804
69804
77803
85803
93802
101801
109805
118075
126074
134074
142073
150073
158072
166072
174071
182075
190361
198361
206360
214360
222359
230359
238358
246358
254361
262365
270364
278364
286363
294363
302362
310362
318361
326365
334368
342368
350367
358367
366366
374366
382365
390365
398368
i.e. the memory usage reported keeps going up.
Am I looking at the memory usage in the wrong way?
And if not, is there a way to keep the memory usage from growing between function calls in a loop? (In my case with MAC addresses, I do not request the same list of MAC addresses again; I get the list from a different section of the HW memory, i.e. all the calls to get MAC addresses are valid, but after each call the data obtained is useless and can be discarded.)
Python is a managed language. Memory is, generally speaking, the concern of the implementation rather than the average developer. The system is designed to reclaim memory that you are no longer using automatically.
If you are using CPython, an object will be destroyed when its reference count reaches zero, or when the cyclic garbage collector finds and collects it. If you want to reclaim the memory belonging to an object, you need to ensure that no references to it remain, or at least that it is not reachable from any stack frame's variables. That is to say, it should not be possible to refer to the data you want reclaimed, either directly or through some expression such as foo.bar[42], from any currently executing function.
If you are using another implementation, such as PyPy, the rules may vary. In particular, reference counting is not required by the Python language standard, so objects may not go away until the next garbage collection run (and then you may have to wait for the right generation to be collected).
For older versions of Python (prior to Python 3.4), you also need to worry about reference cycles which involve finalizers (__del__() methods). The old garbage collector cannot collect such cycles, so they will (basically) get leaked. Most built-in types do not have finalizers, are not capable of participating in reference cycles, or both, but this is a legitimate concern if you are creating your own classes.
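As an aside on that last point, here is a minimal sketch of the kind of finalizer cycle that leaked on old interpreters; since Python 3.4 (PEP 442) it is collected normally:
import gc

class Leaky(object):
    def __del__(self):
        pass

a = Leaky()
a.self_ref = a     # reference cycle involving an object with a finalizer
del a
gc.collect()
print(gc.garbage)  # [] on modern Python; pre-3.4 the cycle was parked here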
For your use case, you should empty or replace the list when you no longer need its contents (with e.g. list1 = [] or del list1[:]), or return from the function which created it (assuming it's a local variable, rather than a global variable or some other such thing). If you find that you are still running out of memory after that, you should either switch to a lower-overhead language like C or invest in more memory. For more complicated cases, you can use the gc module to test and evaluate how the garbage collector is interacting with your program.
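A sketch of that advice applied to your loop (get_mac_chunk is a hypothetical stand-in for your HW read):
def get_mac_chunk():
    # pretend this reads one chunk of MAC addresses from the hardware
    return ['00:11:22:33:44:%02x' % i for i in range(256)]

for _ in range(5):
    macs = get_mac_chunk()  # rebinding frees the previous chunk's list
    # ... check consistency of this chunk ...
    del macs[:]             # or macs.clear(): empties the list in place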
Try this: it might not always free the memory, as it may still be in use.
See if it works:
import gc
gc.collect()
I have this code, saved as so.py:
import gc
gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_LEAK)

class GUI():
    #########################################
    def set_func(self):
        self.functions = {}
        self.functions[100] = self.userInput

    #########################################
    def userInput(self):
        a = 1

g = GUI()
g.set_func()
print gc.collect()
print gc.garbage
I have two questions:
Why does gc.collect() not report any unreachable objects the first time the module is imported? It reports unreachable objects only on reload().
Is there any quick way to fix this function-mapping circular reference, i.e. self.functions[100] = self.userInput? My old project has a lot of these function-mapping circular references, and I'm looking for a quick, one-line way to change the code. Currently what I do is del g.functions for all these functions at the end.
The first time you import the module, nothing is collected because you hold a reference to the so module and all other objects are referenced by it, so they are all alive and the garbage collector has nothing to collect.
When you reload(so), the module is re-executed, overriding all previous references, so the old values no longer have any references.
You do have a reference cycle in:
self.functions[100] = self.userInput
since self.userInput is a bound method, it holds a reference to self. So self references the functions dictionary, which references the userInput bound method, which references self, and the gc will have to collect those objects.
It depends on what you are trying to do. From your code it is not clear how you use the self.functions dictionary, and depending on that, different options may be viable.
The simplest way to break the cycle is to simply not create the self.functions attribute, but pass the dictionary around explicitly.
If self.functions only references bound methods, you could store the names of the methods instead of the methods themselves:
self.functions[100] = self.userInput.__name__
and then you can call the method doing:
getattr(self, self.functions[100])()
or you can do:
from operator import methodcaller
call_method = methodcaller(self.functions[100])
call_method(self) # calls self.userInput()
I don't really understand what you mean by "Currently what I do is del g.functions for all these functions at the end." Which functions are you talking about?
Also, is this really a problem? Are you experiencing a real memory leak?
Note that the garbage collector reports the objects as unreachable not as uncollectable. This means that the objects are freed even if they are part of a reference cycle. So no memory leak should happen.
In fact adding del g.functions is useless because the objects are going to be freed anyway, so the one line fix is to simply remove all those del statements, since they don't do anything at all.
The fact that they are put into gc.garbage is because gc.DEBUG_LEAK implies the gc.DEBUG_SAVEALL flag, which makes the collector put all unreachable objects into gc.garbage, not just the uncollectable ones.
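You can check this yourself; a quick sketch reusing the GUI class from the question (with the debug flags cleared, the unreachable cycle is simply freed and nothing lands in gc.garbage):
import gc
gc.set_debug(0)      # clear DEBUG_LEAK / DEBUG_SAVEALL

g = GUI()
g.set_func()
del g
print gc.collect()   # > 0: the cycle was found and freed
print gc.garbage     # []: nothing was uncollectable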
The nature of reload is that the module is re-executed. The new definitions supersede the old ones, so the old values become unreachable. By contrast, on the first import, there are no superseded definitions, so naturally there is nothing to become unreachable.
One way is to pass the functions object as a parameter to set_func rather than assigning it as an instance attribute. This breaks the cycle while still letting you pass the functions object wherever it's needed.
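A minimal sketch of that approach, adapted from the code in the question (the caller owns the dictionary, so self never indirectly references its own bound methods):
class GUI():
    def set_func(self, functions):
        # the mapping lives in a dict owned by the caller, not on self
        functions[100] = self.userInput

    def userInput(self):
        a = 1

functions = {}
g = GUI()
g.set_func(functions)
functions[100]()   # calls g.userInput()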