I have been working on some code. My usual approach is to solve all of the pieces of the problem first, creating the loops and other pieces of code I need as I work through it, and then, if I expect to reuse the code, I go back through it and group the parts that belong together into functions.
I have just noticed that creating functions and calling them seems to be much more memory-efficient than writing the lines of code inline and deleting containers as I finish with them.
For example:

def someFunction(aList):
    # ... do things to aList that create a dictionary aDict ...
    return aDict
seems to release more memory at the end than
# ... do things to aList that create a dictionary ...
del aList
Is this expected behavior?
EDIT: added example code

When this function finishes running, PF Usage shows an increase of about 100 MB; filingList has about 8 million lines.
from collections import defaultdict

def getAllCIKS(filingList):
    cikDICT = defaultdict(int)
    for filing in filingList:
        if filing.startswith('.'):
            del filing
            continue
        cik = filing.split('^')[0].strip()
        cikDICT[cik] += 1
        del filing
    ciklist = cikDICT.keys()   # Python 2: keys() returns a list
    ciklist.sort()
    return ciklist

allCIKS = getAllCIKS(open(r'c:\filinglist.txt').readlines())
If I run this instead, I see an increase of almost 400 MB:
cikDICT = defaultdict(int)
for filing in open(r'c:\filinglist.txt').readlines():
    if filing.startswith('.'):
        del filing
        continue
    cik = filing.split('^')[0].strip()
    cikDICT[cik] += 1
    del filing
ciklist = cikDICT.keys()
ciklist.sort()
del cikDICT
EDIT
I have been playing around with this some more today. My observation and question should be refined a bit, since my focus has been on PF Usage; unfortunately, I can only poke at this between my other tasks. However, I am starting to wonder about references versus copies. If I create a dictionary from a list, does the dictionary hold a copy of the values that came from the list, or does it hold references to the values in the list? My bet is that the values are copied instead of referenced.
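A quick identity test would settle that bet; here is a minimal sketch (the names are made up):

original = ['alpha', 'beta']
d = {}
d['first'] = original[0]

# Prints True: the dict entry is the very same object as the list element,
# i.e. the dictionary holds a reference, not a copy.
print(d['first'] is original[0])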
Another thing I noticed is that items in the GC list were items from containers that had been deleted. Does that make sense? So I have a list, and suppose each of the items in the list was [(aTuple), anInteger, [another list]]. When I started learning how to manipulate and inspect the gc objects, I found those objects in the gc even though the list had been forcefully deleted, and even though I passed the 0, 1 and 2 values to the collection method (gc.collect(), I believe) to try to delete them.
I appreciate the insights people have been sharing. Unfortunately, I am always interested in figuring out how things work under the hood.
Maybe you used some local variables in your function, which are implicitly released by reference counting at the end of the function, while they are not released at the end of your code segment?
You can use the Python garbage collector interface provided to more closely examine what (if anything) is being left around in the second case. Specifically, you may want to check out gc.get_objects() to see what is left uncollected, or gc.garbage to see if you have any reference cycles.
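For example, a minimal inspection pass might look like this:

import gc

gc.collect()                  # force a full collection first
print(len(gc.get_objects()))  # every object the collector is currently tracking
print(gc.garbage)             # unreachable objects the collector could not free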
Some extra memory is freed when you return from a function, but that's exactly as much extra memory as was allocated to call the function in the first place. In any case, if you're seeing a large difference, that's likely an artifact of the state of the runtime, and it's not something you should really be worrying about. If you are running low on memory, the way to solve the problem is to keep more data on disk using things like B-trees (or just use a database), or to use algorithms that use less memory. Also, keep an eye out for unnecessary copies of large data structures.
The real memory savings in creating functions is in your short-term memory. By moving something into a function, you reduce the amount of detail you need to remember by encapsulating part of the minutia away.
Maybe you should re-engineer your code to get rid of unnecessary variables (that may not be freed instantly)... how about the following snippet?
myfile = open(r"c:\filinglist.txt")
ciklist = sorted(set(x.split("^")[0].strip() for x in myfile if not x.startswith(".")))
EDIT: I don't know why this answer was voted down... Maybe because it's short? Or maybe because the dude who voted was unable to understand how this one-liner does the same as the code in the question without creating unnecessary temporary containers?
Sigh...
I asked another question about copying lists, and the answers, particularly the one directing me to look at deepcopy, caused me to think about some dictionary behavior. The problem I was experiencing had to do with the fact that the original list is never garbage collected, because the dictionary maintains references to it. I need to use the information about weakref in the Python docs.
Objects referenced by dictionaries seem to stay alive. I think (but am not sure) that the process of pushing the dictionary out of the function forces the copy process and kills the object. This is not complete; I need to do some more research.
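As a starting point for that research, here is a minimal weakref sketch (the Record class is made up, and plain lists can't be weakly referenced, which is why a small class is used):

import weakref

class Record(object):
    pass

cache = weakref.WeakValueDictionary()
r = Record()
cache['key'] = r          # the dictionary does NOT keep r alive

del r                     # drop the last strong reference...
print('key' in cache)     # ...and the entry vanishes: prints False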
Related
Pretty simple question:
I have some code that shows some graphs, and it prepares data for the graphs, and I don't want to waste memory (it's limited)... Is there a way to have a "local scope", so that when we get to the end, everything inside is freed?
I come from C++, where you can define code inside { ... } so that at the end everything is freed, and you don't have to care about anything.
Anything like that in Python?
The only thing I can think of is:
def tmp():
    ... code ...
tmp()
but it's very ugly, and for sure I don't want to list all the del x at the end.
If anything holds a reference to your object, it cannot be freed. By default, anything at the global scope is going to be held in the global namespace (globals()), and as far as the interpreter knows, the very next line of source code could reference it (or, another module could import it from this current module), so globals cannot be implicitly freed, ever.
This forces your hand to either explicitly delete references to objects with del, or to put them within the local scope of a function. This may seem ugly, but if you follow the philosophy that a function should do one thing and do it well (thanks, Unix!), you will already be segmenting your code into functions. For the one-off exceptions where you allocate a lot of memory early in your function and no longer need it midway through, you can del the reference to it.
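A small illustration of that last point (every function here is a hypothetical stand-in, just to make the sketch runnable):

def load_huge_dataset():          # hypothetical: big allocation up front
    return [0] * 10000000

def summarize(raw):               # hypothetical: small derived result
    return {'n': len(raw)}

def format_report(totals):        # hypothetical
    return 'items: %d' % totals['n']

def build_report():
    raw = load_huge_dataset()     # large intermediate
    totals = summarize(raw)
    del raw                       # drop the big list before the rest of the work
    return format_report(totals)

print(build_report())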
I know this isn't the answer you want to hear, but it's the reality of Python. You could accomplish something similar by nesting function defs or classes inside, but this is kinda hacky (or in the class case, which wouldn't require calling/instantiating, extremely hacky).
I will also mention that there is a built-in gc module for interacting with the garbage collector. There, you can trigger an immediate garbage collection (otherwise Python will eventually get around to collecting the things you del refs to), as well as inspect how many references a given object has.
If you're curious where the allocations are happening, you can also use the built in tracemalloc module to trace said allocations.
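A hedged sketch combining both modules (tracemalloc requires Python 3.4+):

import gc
import tracemalloc

tracemalloc.start()

blob = [list(range(1000)) for _ in range(1000)]   # some throwaway allocations
del blob                                          # release our only reference
gc.collect()                                      # ask for an immediate collection

current, peak = tracemalloc.get_traced_memory()   # bytes currently / at peak
print(current, peak)
tracemalloc.stop()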
The mechanism that handles freeing memory in Python is called the garbage collector, and it means there's no reason to use del in the overwhelming majority of Python code.
When programming in Python, you are "not supposed" to care about such low-level things as allocating and freeing memory for your variables.
That being said, putting your code into functions (although preferably called something clearer than tmp()) is most definitely a good idea, as it will make your code much more readable and "Pythonic".
Coming from C++, you have already stumbled onto one of the main differences (drawbacks) of Python, and that is memory management. The Python garbage collector will delete all the objects that fall out of scope. However, freeing an object's memory doesn't guarantee that the memory actually returns to the system; instead, a rather big portion will be kept reserved by the Python program even if it isn't used. If you face a memory problem and you want that memory given back to the system, the only safe method is to run the memory-intensive function in a separate process. Every process in Python has its own interpreter, and any memory consumed by a process is returned to the system when the process exits.
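A minimal sketch of that separate-process approach; crunch is a made-up stand-in for the memory-intensive function:

from multiprocessing import Process

def crunch():
    data = [0] * 50000000   # the large allocation lives only in the child process
    print(len(data))        # ... stand-in for the memory-intensive work ...

if __name__ == '__main__':
    p = Process(target=crunch)
    p.start()
    p.join()   # when the child exits, all of its memory returns to the OS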
I'm transitioning from Matlab/Octave to NumPy/SciPy. When I used Matlab in interactive mode, I used clear or clear some_variable from time to time to remove that variable from memory. For example, before reading some new data to start a new set of experiments, I used to clear data in Matlab.
How could I do the same thing with NumPy/SciPy?
I did some research, and I found there is a command called del, but I heard that del actually doesn't clear memory, but the variable disappears from the namespace instead. Am I right?
That being said, what would be the best way to mimic "clear" of Matlab in NumPy/SciPy?
del obj will work, according to the SciPy mailing list.
If you're working in IPython, then you can also use %xdel obj
...but I heard that "del" actually doesn't clear memory, but the variable disappears from the namespace. Am I right?
Yes, that's correct. That's what garbage collection is: Python will handle clearing the memory when it makes sense to, and you don't need to worry about it, as from your end the variable no longer exists. Your code will behave the same whether or not garbage collection has occurred yet, so there's no need for an alternative to del.
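A small sketch of what that means for NumPy arrays (the sizes are arbitrary):

import numpy as np

data = np.zeros((10000, 10000))   # roughly 800 MB of float64 zeros
view = data[:100]                 # a view holds a reference to the same buffer

del data   # the name is gone, but the buffer survives: 'view' still needs it
del view   # last reference dropped; CPython can free the buffer right away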
If you are curious about the differences between Matlab's and Python's garbage collection / memory allocation, you can read this SO thread on it.
I want my Python program to be deterministic, so I have been using OrderedDicts extensively throughout the code. Unfortunately, while debugging memory leaks today, I discovered that OrderedDicts have a custom __del__ method, making them uncollectable whenever there's a cycle. It's rather unfortunate that there's no warning in the documentation about this.
So what can I do? Is there any deterministic dictionary in the Python standard library that plays nicely with gc? I'd really hate to have to roll my own, especially over a stupid one-line function like this.
Also, is this something I should file a bug report for? I'm not familiar with the Python library's procedures, and what they consider a bug.
Edit: It appears that this is a known bug that was fixed back in 2010. I must have somehow gotten a really old version of 2.7 installed. I guess the best approach is to just include a monkey patch in case the user happens to be running a broken version like me.
If the presence of the __del__ method is problematic for you, just remove it:
>>> import collections
>>> del collections.OrderedDict.__del__
You will gain the ability to use OrderedDicts in a reference cycle. You will lose having the OrderedDict free all its resources immediately upon deletion.
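If you do go the monkey-patching route mentioned in the question's edit, a guarded variant only strips the method when the broken one is actually present:

import collections

# Only relevant on broken 2.7 builds; fixed versions define no __del__ at all.
if hasattr(collections.OrderedDict, '__del__'):
    del collections.OrderedDict.__del__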
It sounds like you've tracked down a bug in OrderedDict that was fixed at some point after your version of 2.7. If it wasn't in any actual released versions, maybe you can just ignore it. But otherwise, yeah, you need a workaround.
I would suggest that, instead of monkeypatching collections.OrderedDict, you should instead use the Equivalent OrderedDict recipe that runs on Python 2.4 or later linked in the documentation for collections.OrderedDict (which does not have the excess __del__). If nothing else, when someone comes along and says "I need to run this on 2.6, how much work is it to port" the answer will be "a little less"…
But two more points:

- Rewriting everything to avoid cycles is a huge amount of effort.
- The fact that you've got cycles in your dictionaries is a red flag that you're doing something wrong (typically using strong refs for a cache or for back-pointers), which is likely to lead to other memory problems, and possibly other bugs. So that effort may turn out to be necessary anyway.
You still haven't explained what you're trying to accomplish; I suspect the "deterministic" thing is just a red herring (especially since dicts actually are deterministic), so the best solution is s/OrderedDict/dict/g.
But if determinism is necessary, you can't depend on the cycle collector, because it's not deterministic, and that means your finalizer ordering and so on all become non-deterministic. It also means your memory usage is non-deterministic—you may end up with a program that stays within your desired memory bounds 99.999% of the time, but not 100%; if those bounds are critically important, that can be worse than failing every time.
Meanwhile, the iteration order of dictionaries isn't specified, but in practice, CPython and PyPy iterate in the order of the hash buckets, not the id (memory location) of either the value or the key, and whatever Jython and IronPython do (they may be using some underlying Java or .NET collection that has different behavior; I haven't tested), it's unlikely that the memory order of the keys would be relevant. (How could you efficiently iterate a hash table based on something like that?) You may have confused yourself by testing with objects that use id for hash, but most objects hash based on value.
For example, take this simple program:
d = {}
d[0] = 0
d[1] = 1
d[2] = 2
for k in d:
    print(k, d[k], id(k), id(d[k]), hash(k))
If you run it repeatedly with CPython 2.7, CPython 3.2, and PyPy 1.9, the keys will always be iterated in order 0, 1, 2. The id columns may also be the same each time (that depends on your platform), but you can fix that in a number of ways—insert in a different order, reverse the order of the values, use string values instead of ints, assign the values to variables and then insert those variables instead of the literals, etc. Play with it enough and you can get every possible order for the id columns, and yet the keys are still iterated in the same order every time.
The order of iteration is not predictable, because to predict it you need the function for converting hash(k) into a bucket index, which depends on information you don't have access to from Python. Even if it's just hash(k) % self._table_size, unless that _table_size is exposed to the Python interface, it's not helpful. (It's a complex function of the sequence of inserts and deletes that could in principle be calculated, but in practice it's silly to try.)
But it is deterministic; if you insert and delete the same keys in the same order every time, the iteration order will be the same every time.
Note that the fix made in Python 2.7 to eliminate the __del__ method and so stop them from being uncollectable does unfortunately mean that every use of an OrderedDict (even an empty one) results in a reference cycle which must be garbage collected. See this answer for more details.
... and every for-loop looked like a list comprehension.
Instead of:

for stuff in all_stuff:
    do(stuff)

I was doing (not assigning the list to anything):

[do(stuff) for stuff in all_stuff]
This is a common pattern found in list-comp how-tos. 1) OK, so no big deal, right? Wrong. 2) Can't this just be code style? Super wrong.
1) Yeah, that was wrong. As NiklasB points out, the point of the HOWTOs is to build up a new list.
2) Maybe, but it's not obvious and explicit, so better not to use it.
I didn't keep in mind that those how-tos were largely command-line based. After my team yelled at me, wondering why the hell I was building up massive lists and then letting them go, it occurred to me that I might be introducing a major memory-related bug.
So here are my questions. If I were to do this in a very long-running process, where lots of data was being consumed, would this "list" just continue consuming my memory until let go? When will the garbage collector claim the memory back? After the scope this list is built in is lost?
My guess is yes, it will keep consuming my memory. I don't know how the Python garbage collector works, but I would venture to say that this list will exist until after the last next is called on all_stuff.
EDIT.
The essence of my question is relayed much more cleanly in this question (thanks for the link, Niklas).
If I were to do this in a very long running process, where lots of data was being consumed, would this "list" just continue consuming my memory until let go?
Absolutely.
When will the garbage collector claim the memory back? After the scope this list is built in is lost?
CPython uses reference counting, so that is the most likely case. Other implementations work differently, so don't count on it.
Thanks to Karl for pointing out that, due to the complex memory management mechanisms used by CPython, this does not mean that the memory is immediately returned to the OS after that.
I don't know how the python garbage collector works, but I would venture to say that this list will exist until after the last next is called on all_stuff.
I don't think any garbage collector works like that. Usually they use mark-and-sweep, so it could be quite some time before the list is garbage collected.
This is a common pattern found in list-comp how-tos.
Absolutely not. The point is that you iterate the list with the purpose of doing something with every item (do is called for its side effects). In all the examples of the List-comp HOWTO, the list is iterated to build up a new list based on the items of the old one. Let's look at an example:
# list comp, creates the list [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[i for i in range(10)]

# loop, does nothing
for i in range(10):
    i  # meh, just an expression which doesn't have an effect
Maybe you'll agree that this loop is utterly senseless, as it doesn't do anything, in contrast to the comprehension, which builds a list. In your example, it's the other way around: the comprehension is completely senseless, because you don't need the list! You can find more information about the issue in a related question.
By the way, if you really want to write that loop in one line, use a generator consumer like deque.extend. This will be slightly slower than a raw for loop in this simple example, though:
>>> from collections import deque
>>> consume = deque(maxlen=0).extend
>>> consume(do(stuff) for stuff in all_stuff)
Try manually doing GC and dumping the statistics.
gc.DEBUG_STATS
Print statistics during collection. This information can be useful when tuning the collection frequency.
From http://docs.python.org/library/gc.html
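For example, a minimal run with that flag turned on:

import gc

gc.set_debug(gc.DEBUG_STATS)   # each collection now prints statistics to stderr
gc.collect()                   # force a collection pass by hand
gc.set_debug(0)                # switch the debug output back off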
The CPython GC will reap it once there are no references to it outside of a cycle. Jython and IronPython follow the rules of the underlying GCs.
If you like that idiom, and do returns something that always evaluates to either True or False, and you would consider a similar alternative with no ugly side effects, you can use a generator expression combined with either any or all.
For functions that return False values (or don't return):
any(do(stuff) for stuff in all_stuff)
For functions that return True values:
all(do(stuff) for stuff in all_stuff)
I don't know how the python garbage collector works, but I would venture to say that this list will exist until after the last next is called on all_stuff.
Well, of course it will, since you're building a list that will have the same number of elements as all_stuff. The interpreter can't discard the list before it's finished, can it? You could call gc.collect between one of these loops and the next, but each list will be fully constructed before it can be reclaimed.
In some cases you could use a generator expression instead of a list comprehension, so it doesn't have to build a list with all your values:
(do_something(i) for i in xrange(1000))
However, you'd still have to "exhaust" that generator in some way...
I'm using the fantastic Eric4 IDE to code Python. It has a built-in tool called 'cyclops', which is apparently looking for cycles. After running it, it gives me a bunch of big bold red letters declaring there to be a multitude of cycles in my code. The problem is that the output is nearly indecipherable; there's no way I'm going to understand what a cycle is by reading its output. I've browsed the web for hours and can't seem to find so much as a blog post. When the cycles pile up to a certain point, the profiler and debugger stop working :(.
My questions: what are cycles, how do I know when I'm making a cycle, and how do I avoid making cycles in Python? Thanks.
A cycle (or "reference loop") is two or more objects referring to each other, e.g.:
alist = []
anoth = [alist]
alist.append(anoth)
or
class Child(object): pass
class Parent(object): pass
c = Child()
p = Parent()
c.parent = p
p.child = c
Of course, these are extremely simple examples with cycles of just two items; real-life examples are often longer and harder to spot. There's no magic bullet telling you that you just made a cycle -- you just need to watch for it. The gc module (whose specific job is to garbage-collect unreachable cycles) can help you diagnose existing cycles (when you set the appropriate debug flags). The weakref module can help you to avoid building cycles when you do need (e.g.) a child and parent to know about each other without creating a reference cycle (make just one of the two mutual references into a weak ref or proxy, or use the handy weak-dictionary containers that the module supplies).
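To make that concrete, here is a minimal sketch of the weak-reference fix applied to the parent/child example above:

import weakref

class Child(object): pass
class Parent(object): pass

c = Child()
p = Parent()
p.child = c                    # strong reference one way...
c.parent = weakref.proxy(p)    # ...weak proxy back, so no cycle is formed

del p                          # p is freed immediately by reference counting;
                               # touching c.parent now raises ReferenceError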
All Cyclops tells you is whether there are objects in your code that refer to themselves through a chain of other objects. This used to be an issue in Python, because the garbage collector wouldn't handle these kinds of objects correctly. That problem has since been, for the most part, fixed.
Bottom line: if you're not observing a memory leak, you don't need to worry about the output of Cyclops in most instances.