I want my Python program to be deterministic, so I have been using OrderedDicts extensively throughout the code. Unfortunately, while debugging memory leaks today, I discovered that OrderedDicts have a custom __del__ method, making them uncollectable whenever there's a cycle. It's rather unfortunate that there's no warning in the documentation about this.
So what can I do? Is there any deterministic dictionary in the Python standard library that plays nicely with gc? I'd really hate to have to roll my own, especially over a stupid one line function like this.
Also, is this something I should file a bug report for? I'm not familiar with the Python library's procedures, and what they consider a bug.
Edit: It appears that this is a known bug that was fixed back in 2010. I must have somehow gotten a really old version of 2.7 installed. I guess the best approach is to just include a monkey patch in case the user happens to be running a broken version like me.
If the presence of the __del__ method is problematic for you, just remove it:
>>> import collections
>>> del collections.OrderedDict.__del__
You will gain the ability to use OrderedDicts in a reference cycle. You will lose having the OrderedDict free all its resources immediately upon deletion.
It sounds like you've tracked down a bug in OrderedDict that was fixed at some point after your version of 2.7. If it wasn't in any actual released versions, maybe you can just ignore it. But otherwise, yeah, you need a workaround.
I would suggest that, instead of monkeypatching collections.OrderedDict, you should instead use the Equivalent OrderedDict recipe that runs on Python 2.4 or later linked in the documentation for collections.OrderedDict (which does not have the excess __del__). If nothing else, when someone comes along and says "I need to run this on 2.6, how much work is it to port" the answer will be "a little less"…
But two more points:
rewriting everything to avoid cycles is a huge amount of effort.
The fact that you've got cycles in your dictionaries is a red flag that you're doing something wrong (typically using strong refs for a cache or for back-pointers), which is likely to lead to other memory problems, and possibly other bugs. So that effort may turn out to be necessary anyway.
You still haven't explained what you're trying to accomplish; I suspect the "deterministic" thing is just a red herring (especially since dicts actually are deterministic), so the best solution is s/OrderedDict/dict/g.
But if determinism is necessary, you can't depend on the cycle collector, because it's not deterministic, and that means your finalizer ordering and so on all become non-deterministic. It also means your memory usage is non-deterministic—you may end up with a program that stays within your desired memory bounds 99.999% of the time, but not 100%; if those bounds are critically important, that can be worse than failing every time.
Meanwhile, the iteration order of dictionaries isn't specified, but in practice, CPython and PyPy iterate in the order of the hash buckets, not the id (memory location) of either the value or the key, and whatever Jython and IronPython do (they may be using some underlying Java or .NET collection that has different behavior; I haven't tested), it's unlikely that the memory order of the keys would be relevant. (How could you efficiently iterate a hash table based on something like that?) You may have confused yourself by testing with objects that use id for hash, but most objects hash based on value.
For example, take this simple program:
d={}
d[0] = 0
d[1] = 1
d[2] = 2
for k in d:
print(k, d[k], id(k), id(d[k]), hash(k))
If you run it repeatedly with CPython 2.7, CPython 3.2, and PyPy 1.9, the keys will always be iterated in order 0, 1, 2. The id columns may also be the same each time (that depends on your platform), but you can fix that in a number of ways—insert in a different order, reverse the order of the values, use string values instead of ints, assign the values to variables and then insert those variables instead of the literals, etc. Play with it enough and you can get every possible order for the id columns, and yet the keys are still iterated in the same order every time.
The order of iteration is not predictable, because to predict it you need the function for converting hash(k) into a bucket index, which depends on information you don't have access to from Python. Even if it's just hash(k) % self._table_size, unless that _table_size is exposed to the Python interface, it's not helpful. (It's a complex function of the sequence of inserts and deletes that could in principle be calculated, but in practice it's silly to try.)
But it is deterministic; if you insert and delete the same keys in the same order every time, the iteration order will be the same every time.
Note that the fix made in Python 2.7 to eliminate the __del__ method and so stop them from being uncollectable does unfortunately mean that every use of an OrderedDict (even an empty one) results in a reference cycle which must be garbage collected. See this answer for more details.
Related
I am doing some experiments with the Python garbage collector, I would like to check if a memory address is used or not. In the following example, I have de-referenced the string (surely) at ls[2]. If I run the garbage collector, I can still see surely at the original address. I would like to be sure that the address is now writable. Is there a way to check it in Python?
from ctypes import string_at
from sys import getsizeof
import gc
ls = ['This','will be','surely','deleted']
idsurely= id(ls[2])
sizesurely = getsizeof(ls[2])
ls[2] = 'probably'
print(ls)
print(string_at(idsurely,sizesurely))
gc.collect()
# I check there is nothing in the garbage
print(gc.garbage)
print(string_at(idsurely,sizesurely))
I am interested in this mainly from a theoretical point of view so I am not saying that is something that has practical usage. My goal is to show how memory works for a tutorial. I want to show that the data is still there and that just that the bytes at the address can be now written. So the output of the script is up to now as expected. I just want to prove the last passage.
Not possible.
There is no central registry of used or unused memory addresses in Python. There isn't even a central registry of all objects (the cyclic GC doesn't know about all of them), and even if you had a registry of all objects, that wouldn't be enough to determine what memory locations are in use. Additionally, you can't just read arbitrary memory addresses, or write to arbitrary deallocated addresses. That'll quickly lead to segfaults or worse.
Finally, I would strongly advise against using this kind of thing in a tutorial even if you did find something to make it work. When you put something in a tutorial, a large fraction of people reading the tutorial are going to think it's something they're supposed to learn. Programming newbies should not be mislead into thinking that examining possibly-deallocated memory locations is something they should be doing.
Your experiments are way off base. id (solely as a CPython implementation detail) does get the memory address of the object in question, but we're talking about the Python object itself, not the data it contains. sys.getsizeof returns a number that roughly corresponds to how much memory the object occupies, but there is no guarantee that memory is contiguous.
By sheer coincidence, this almost works on str (though it will perform a buffer overread if the string in question has cached copies of its UTF-8 or wchar_t form, so you're risking crashing your program), but even then your test is flawed; CPython interns string literals that look like legal variable names, so if the string in question appears as a literal anywhere else in your program (including as the name of some class or function in some module you imported), it won't actually go away when you replace it. Similar implicit caches can occur if the literal string appears in any function, anywhere (it ends up being not only interned, but stored in the constants for that function).
Update: On testing, in an actual script, the reference count for 'surely' when you hold onto a copy of it is 3, which drops to 2 when you replace it with 'probably'. Turns out constants are being cached even at global scope. The only reason the interactive interpreter doesn't exhibit this behavior is that it effectively evals each line separately, so the constant cache is discarded when the eval completes.
And even if all that's not a problem, most (almost all) memory managers (CPython's specialized small object heap and the general heap it's built on) don't actually zero out memory when its released, so if you do look at the same address shortly after it really was released, it'll probably have pretty similar data in it.
Lastly, your gc.collect() call won't change anything except by coincidence (of whatever happens during gc possibly allocating memory by side-effect). str is not a garbage collected type, as it cannot contain references to other Python objects, so it's impossible for it to be a link in a reference cycle, and the CPython garbage collector is solely concerned with collecting cyclic garbage; CPython is reference counted, so anything that's not part of a reference cycle is cleaned up automatically and immediately when the last reference disappears.
The short answer this all leads up to is: There is no way to determine, within CPython, non-heuristically, if a particular memory address has been released to the free store and made available for reuse. CPython's memory management scheme is pure implementation detail, and exposing APIs at that level of detail would create compatibility concerns when people depended on them.
The closest you're going to get is using something like the tracemalloc module to perform basic snapshotting and compute differences in the snapshot. That's not going to give you a window into whether a specific address is still in use though AFAICT; at best it can tell you where an address that's definitely in use was allocated.
The other approach (specific to CPython) you can use is to just check the reference counts before replacing the object; sys.getrefcount for a given name/attribute reports 2, then deling (or rebinding) that name/attribute will release it (assuming no threads that might create additional references between the test and the del/rebind). You expect 2, not 1, because calling sys.getrefcount creates a temporary reference to the object in question. If it reports a number greater than 2, deling/rebinding could still lead to the object being deleted eventually when the cyclic garbage collectors runs, if the object was part of a reference cycle, but for a reference count of 2 (or 1 for something otherwise unnamed, e.g. sys.getrefcount(''.join(('f', '9')) or the like), the behavior will be deterministic.
From the documentation about gc:
... the collector supplements the reference counting already used in Python...
And from gc.is_tracked():
Returns True if the object is currently tracked by the garbage collector, False otherwise. As a general rule, instances of atomic types aren’t tracked and instances of non-atomic types (containers, user-defined objects…) are.
Strings are not tracked by the garbage collector:
In [1]: import gc
In [2]: test = 'surely'
Out[2]: 'surely'
In [3]: gc.is_tracked(test)
Out[3]: False
Looking at the documentation, there doesn't seem to be a method for accessing the reference counting from within the language.
Note that at least for me, using string_at doesn't work from the interactive interpreter. It does work in a script.
By looking at the CPython implementation it seems the return value of a string split() is a list of newly allocated strings. However, since strings are immutable it seems one could have made substrings out of the original string by pointing at the offsets.
Am I understanding the current behavior of CPython correctly ? Are there reasons for not opting for this space optimization ? One reason I can think of is that the parent string cannot be freed until all its substrings are.
Without a crystal ball I can't tell you why CPython does it that way. However, there are some reasons why you might choose to do it that way.
The problem is that a small string might hold a reference to a much larger backing array. For example, suppose I read in a 8 GB HTTP access log file to analyze which user agents access my file the most, and I do that just by fp.read() and then run a regex on the whole file at once rather than going one line at a time.
I want to know about the top 10 most common user agents, so I keep this around in a list.
Then I want to do the same analysis for 100 other files, to see how the top 10 user agents have changed over time. Boom! My program is trying to use 800 GB of memory and gets killed. Why? How do I debug this?
Java used this sharing technique prior to Java 7, so the same reasoning applies. See Java 7 String - substring complexity and JDK-4513622: (str)
keeping a substring of a field prevents GC for object.
Also note that having strings share memory would require you to follow a pointer from the string object to the string data. In CPython, the string data is usually placed directly after a header in memory, so you don't need to follow a pointer. This reduces the number of allocations required and reduces data dependencies when reading strings.
In the current CPython implementation, strings are reference-counted; it is assumed that a string cannot hold references to other objects because a string is not a container. This means that garbage collection does not need to inspect or trace over string objects (because they're entirely covered by the reference counting). But it's actually worse than that: Old versions of Python did not have a tracing garbage collector at all; GC was new in 2.0. Before that, any cyclic garbage would simply leak.
A competently-implemented substring-to-offset algorithm should not form cycles. So in theory, a cyclic garbage collector is not a prerequisite for this. However, because we're doing reference counting instead of tracing, the child objects become responsible for Py_DECREF()ing their parent objects at end-of-life. Otherwise the parent leaks. This means you cannot just chuck the whole string into the free list when it reaches end-of-life; you have to check whether it's a substring, and branching is potentially expensive. Python was historically designed to do string processing (like Perl, but with nicer syntax), which means creating and destroying a lot of strings. Furthermore, all variable names are internally stored as strings, so even if the user is not doing string processing, the interpreter is. Slowing down the string deallocation process by even a little could have a serious impact on performance.
CPython internally uses NUL-terminated strings in addition to storing a length. This is a very early design choice, present since the very first version of Python, and still true in the latest version.
You can see that in Include/unicodeobject.h where PyASCIIObject says "wchar_t representation (null-terminated)" and PyCompactUnicodeObject says "UTF-8 representation (null-terminated)". (Recent CPython implementations select from one of 4 back-end string types, depending on the Unicode encoding needs.)
Many Python extension modules expect a NUL terminated string. It would be difficult to implement substrings as slices into a larger string and preserve the low-level C API. Not impossible, as it could be done using a copy-on-C-API-access. Or Python could require all extension writers to use a new subslice-friendly API. But that complexity is not worthwhile given the problems found from experience in other languages which implement subslice references, as Dietrich Epp described.
I see little in Kevin's answer which is applicable to this question. The decision had nothing do to with the lack of circular garbage collection before Python 2.0, nor could it. Substring slices are implemented with an acyclic data structure. 'Competently-implemented' isn't a relevant requirement as it would take a perverse sort of incompetence or malice to turn it into a cyclic data structure.
Nor would there necessarily be extra branch overhead in the deallocator. If the source string were one type and the substring slice another type, then Python's normal type dispatcher would automatically use the correct deallocator, with no additional overhead. Even if there were an extra branch, we know that branching overhead in this case is not "expensive". Python 3.3 (because of PEP 393) has those 4 back-end Unicode types, and decides what to do based on branching. String access occurs much more often than deallocation, so any dellocation overhead due to branching would be lost in the noise.
It is mostly true that in CPython "variable names are internally stored as strings". (The exception is that local variables are stored as indices into a local array.) However, these names are also interned into a global dictionary using PyUnicode_InternInPlace(). There is therefore no deallocation overhead because these strings are not deallocated, outside of cases involving dynamic dispatch using non-interned strings, like through getattr().
Is checking if a key is in a dictionary - if key in mydict - an atomic operation?
If not, would there be any negative effects if a thread was checking for a key while another thread was modifying the dictionary? The checking thread doesn't modify the dictionary, it just behaves differently depending on the presence of a key.
I don't think "atomic" is really what you're interested in.
If you're not using CPython (or you are, but one of the threads is running C code that's not under the GIL…), it is definitely not atomic, but it may be safe anyway under certain conditions.
If you are using CPython, it's atomic in the sense that in is a single bytecode op (COMPARE_OP 6), and in the possibly more useful sense that the actual hash-table lookup itself definitely happens under the GIL, and any potential equality comparison definitely happens with an object that's guaranteed to be alive. But it may still be unsafe, except under certain conditions.
First, the higher-level operation you're doing here is inherently racy. If thread 1 can be doing d['foo'] = 3 or del d['foo'] at the same time thread 0 is calling 'foo' in d, there is no right answer. It's not a question of atomic or not—there is no sequencing here.
But if you do have some kind of explicit sequencing at the application level, so there is a right answer to get, then you're only guaranteed to get the right answer if both threads hold the GIL. And I think that's what you're asking about, yes?
That's only even possible in CPython—and, even there, it amounts to guaranteeing that no object you ever put in a dict could ever release the GIL when you try to hash or == it, which is a hard thing to guarantee generally.
Now, what if the other thread is just replacing the value associated with a key, not changing the set of keys? Then there is a right answer, and it's available concurrently, as long as the dict implementation avoids mutating the hash table for this operation. At least in versions of CPython up to whatever was released by Jul 29 '10. Alex Martelli indirectly guarantees that in his answer to python dictionary is thread safe?. So, in that restricted case, you are safe in CPython—and likely in other implementations, but you'd want to read the code before relying on that.
As pointed out in the comments, the key you may end up comparing your lookup value with isn't guaranteed to be immutable, so even if the other thread doesn't do anything that changes the set of keys, it's still not absolutely guaranteed that you'll get the right answer. (You may have to craft a pathological key type to make this fail, but it would still be a legal key type.)
I find that I have recently been implementing Mapping interfaces on classes which on the surface fit the model (they are essentially just key-value stores with no more meta-data), but underneath they are sometimes quite complex.
Here are a couple examples of increasing severity:
An object which wraps another mapping converting all objects to strings when set.
An object which uses a local database as a back-end to store the key-value pairs.
An object which makes HTTP requests to remote servers to get/set data.
Lets suppose all of these examples seamlessly implement the Mapping interface, and the only indication that there is something fishy going on is that item access potentially could take a few seconds, and an item may not be retrievable in the same form as stored (if at all). I'm perfectly content with something like the first example, pretty okay with the second, but I'm getting kinda uncomfortable with the last.
The question is, is there a line at which the API for these models should not use item access, even though the underlying structure may feel like it fits on the surface?
It sounds like you are describing the standard anydbm module semantics. And just as anydbm can raise exception anydbm.error, so too could your subclass raise derivatives like MyDbmTimeoutError as needed. Whether you implement it as dictionary operations or function calls, the caller will still have to contend with exceptions anyway (e.g. KeyError, NameError).
I think that arbitrary "tied" hashes exist in Python 2 and 3.x is decent justification for saying it is a reasonable approach. Indeed, I've been looking for (and designing in my head) more complex bindings than simple key ⇒ value mappings without a heavy ORM SQL layer in between.
added: The more I think about it, the more Pythonic tied dictionaries seem to be. A key ⇒ value collection is a dictionary. Whether it lives in core or on disk or across the network is an implementation detail that is best abstracted away. The only substantive differences are increased latency and possible unavailability; however, on a virtual memory based OS, "core" can have higher latency than RAM and in a multiprocessing OS, "core" can become unavailable too. So these are differences in degree only, not kind.
From a strictly philosophical point of view, I don't think that there is a line you can cross with this. If some tool provides the needed functionality, but its API is different, adapt away. The only time you shouldn't do this is if the adapted to API simply is not expressive enough to manipulate the adapted component in a way that is needed.
I wouldn't hesitate to adapt a database into a dict, because that's a great way to manipulate collections, and it's already compatible with a heck of a lot of other code. If I find that my particular application must make calls to the database connections begin(), commit(), and rollback() methods to work right, then a dict won't do, since dict's don't have transaction semantics.
I have been working on some code. My usual approach is to first solve all of the pieces of the problem, creating the loops and other pieces of code I need as I work through the problem and then if I expect to reuse the code I go back through it and group the parts of code together that I think should be grouped to create functions.
I have just noticed that creating functions and calling them seems to be much more efficient than writing lines of code and deleting containers as I am finished with them.
for example:
def someFunction(aList):
do things to aList
that create a dictionary
return aDict
seems to release more memory at the end than
>>do things to alist
>>that create a dictionary
>>del(aList)
Is this expected behavior?
EDIT added example code
When this function finishes running the PF Usage shows an increase of about 100 mb the filingsList has about 8 million lines.
def getAllCIKS(filingList):
cikDICT=defaultdict(int)
for filing in filingList:
if filing.startswith('.'):
del(filing)
continue
cik=filing.split('^')[0].strip()
cikDICT[cik]+=1
del(filing)
ciklist=cikDICT.keys()
ciklist.sort()
return ciklist
allCIKS=getAllCIKS(open(r'c:\filinglist.txt').readlines())
If I run this instead I show an increase of almost 400 mb
cikDICT=defaultdict(int)
for filing in open(r'c:\filinglist.txt').readlines():
if filing.startswith('.'):
del(filing)
continue
cik=filing.split('^')[0].strip()
cikDICT[cik]+=1
del(filing)
ciklist=cikDICT.keys()
ciklist.sort()
del(cikDICT)
EDIT
I have been playing around with this some more today. My observation and question should be refined a bit since my focus has been on the PF Usage. Unfortunately I can only poke at this between my other tasks. However I am starting to wonder about references versus copies. If I create a dictionary from a list does the dictionary container hold a copy of the values that came from the list or do they hold references to the values in the list? My bet is that the values are copied instead of referenced.
Another thing I noticed is that items in the GC list were items from containers that were deleted. Does that make sense? Soo I have a list and suppose each of the items in the list was [(aTuple),anInteger,[another list]]. When I started learning about how to manipulate the gc objects and inspect them I found those objects in the gc even though the list had been forcefully deleted and even though I passed the 0,1 & 2 value to the method that I don't remember to try to still delete them.
I appreciate the insights people have been sharing. Unfortunately I am always interested in figuring out how things work under the hood.
Maybe you used some local variables in your function, which are implicitly released by reference counting at the end of the function, while they are not released at the end of your code segment?
You can use the Python garbage collector interface provided to more closely examine what (if anything) is being left around in the second case. Specifically, you may want to check out gc.get_objects() to see what is left uncollected, or gc.garbage to see if you have any reference cycles.
Some extra memory is freed when you return from a function, but that's exactly as much extra memory as was allocated to call the function in the first place. In any case - if you seeing a large amount of difference, that's likely an artifact of the state of the runtime, and is not something you should really be worrying about. If you are running low on memory, the way to solve the problem is to keep more data on disk using things like b-trees (or just use a database), or use algorithms that use less memory. Also, keep an eye out for making unnecessary copies of large data structures.
The real memory savings in creating functions is in your short-term memory. By moving something into a function, you reduce the amount of detail you need to remember by encapsulating part of the minutia away.
Maybe you should re-engineer your code to get rid of unnecessary variables (that may not be freed instantly)... how about the following snippet?
myfile = file(r"c:\fillinglist.txt")
ciklist = sorted(set(x.split("^")[0].strip() for x in myfile if not x.startswith(".")))
EDIT: I don't know why this answer was voted negative... Maybe because it's short? Or maybe because the dude who voted was unable to understand how this one-liner does the same that the code in the question without creating unnecessary temporal containers?
Sigh...
I asked another question about copying lists and the answers, particularly the answer directing me to look at deepcopy caused me to think about some dictionary behavior. The problem I was experiencing had to do with the fact that the original list is never garbage collected because the dictionary maintains references to the list. I need to use the information about weakref in the Python Docs.
objects referenced by dictionaries seem to stay alive. I think (but am not sure) the process of pushing the dictionary out of the function forces the copy process and kills the object. This is not complete I need to do some more research.