Is checking a dictionary for a key an atomic operation?

Is checking if a key is in a dictionary - if key in mydict - an atomic operation?
If not, would there be any negative effects if a thread was checking for a key while another thread was modifying the dictionary? The checking thread doesn't modify the dictionary, it just behaves differently depending on the presence of a key.

I don't think "atomic" is really what you're interested in.
If you're not using CPython (or you are, but one of the threads is running C code that's not under the GIL…), it is definitely not atomic, but it may be safe anyway under certain conditions.
If you are using CPython, it's atomic in the sense that in is a single bytecode op (COMPARE_OP 6), and in the possibly more useful sense that the actual hash-table lookup itself definitely happens under the GIL, and any potential equality comparison definitely happens with an object that's guaranteed to be alive. But it may still be unsafe, except under certain conditions.
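You can check the single-opcode claim yourself with dis; note the opcode name varies by CPython version (COMPARE_OP 6 historically, CONTAINS_OP since 3.9), but it is still one op either way:

import dis

# Compile a bare membership test and disassemble it; the lookup itself
# is a single opcode (COMPARE_OP on older CPython, CONTAINS_OP on 3.9+).
dis.dis(compile("key in mydict", "<example>", "eval"))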
First, the higher-level operation you're doing here is inherently racy. If thread 1 can be doing d['foo'] = 3 or del d['foo'] at the same time thread 0 is calling 'foo' in d, there is no right answer. It's not a question of atomic or not—there is no sequencing here.
But if you do have some kind of explicit sequencing at the application level, so there is a right answer to get, then you're only guaranteed to get the right answer if both threads hold the GIL. And I think that's what you're asking about, yes?
That's only even possible in CPython—and, even there, it amounts to guaranteeing that no object you ever put in a dict could ever release the GIL when you try to hash or == it, which is a hard thing to guarantee generally.
Now, what if the other thread is just replacing the value associated with a key, not changing the set of keys? Then there is a right answer, and it's available concurrently, as long as the dict implementation avoids mutating the hash table for this operation. Alex Martelli indirectly guarantees that, at least for CPython versions current as of Jul 29 '10, in his answer to "python dictionary is thread safe?". So, in that restricted case, you are safe in CPython, and likely in other implementations, but you'd want to read the code before relying on that.
As pointed out in the comments, the key you may end up comparing your lookup value with isn't guaranteed to be immutable, so even if the other thread doesn't do anything that changes the set of keys, it's still not absolutely guaranteed that you'll get the right answer. (You may have to craft a pathological key type to make this fail, but it would still be a legal key type.)
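If you do need check-then-act semantics across threads, the portable answer is an explicit lock shared by the reader and the writer. A minimal sketch (the names here are illustrative, not from the question):

import threading

mydict = {}
mydict_lock = threading.Lock()

def reader_check(key):
    # The membership test, and anything that depends on it, happens
    # under the lock.
    with mydict_lock:
        return key in mydict

def writer_update(key, value):
    # Mutations take the same lock, so reads and writes are sequenced.
    with mydict_lock:
        mydict[key] = value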

Does `dict.copy()` iterate? Can I use it while modifying the dict in another thread?

I have a dict shared between two threads. One thread is adding and removing entries, and the other thread now needs to iterate over the dict and derive some data from it.
In Python 2, there was items() which would return a list and didn't necessarily iterate over the dict. The suggested way to iterate over a dict you want to modify in Python 3 seems to be to iterate over list(mydict.items()), but that seems like it should only work for one thread; another thread might add or remove items while list() is still using an iterator over the dictionary view returned by items(), right?
There is a copy() method on dict; the documentation doesn't suggest that it can throw a RuntimeError like dictionary and dictionary view iterators can. Can I safely use copy() to snapshot a dict that is being modified by another thread? Then I can just iterate over the snapshot.
Definitionally, dict.copy must iterate; you can't copy all the key/value pairs without iterating.
The rest of the answer depends on your interpreter. It's a safe/atomic operation on the CPython reference interpreter (where the GIL ensures the entire copy operation occurs as a result of a single CALL_METHOD bytecode, and the GIL can only be released between bytecodes), but there's no language guarantee backing this up in general. If your code might run on GIL-free Python interpreters with truly simultaneous threads, you'll need to use locking.
Note that not all seemingly atomic dict operations will work this way. For example, dict.setdefault will be atomic on CPython if all keys involved are built-in types implemented in C (so invoking their __hash__ and __eq__ can't end up back in the interpreter loop, during which the GIL could be released), but it won't work for user-defined class instances with __hash__/__eq__ defined in Python-level code. dict.copy happens to be safe because:
The hashes are cached in the dict, and need not be recomputed
Nothing is being added or removed from the dict, so collisions are impossible while building the new dict (they have optimized code paths for when keys are being inserted and the keys being inserted are guaranteed not to be equal to any existing key, which can only be used in special cases like copy).
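For what that locking might look like on a GIL-free interpreter, here is a minimal sketch (names are illustrative); the writer threads must take the same lock around their mutations:

import threading

shared = {}
shared_lock = threading.Lock()

def snapshot():
    # Copy under the lock; iterate the private copy at leisure afterwards.
    with shared_lock:
        return shared.copy()

for key, value in snapshot().items():
    print(key, value)  # the live dict may keep changing; this copy won't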

python deep copy and function call [duplicate]

I'm doing some things in Python (3.3.3), and I came across something that is confusing me, since to my understanding you get a new id each time a class is called.
Let's say you have this in some .py file:
class someClass: pass
print(someClass())
print(someClass())
The above prints the same id, which confuses me: I'm calling the class again, so it shouldn't be the same, right? Is this how Python works when the same class is called twice in a row, or not? It gives a different id when I wait a few seconds, but not if I call it back to back like the example above.
>>> print(someClass());print(someClass())
<__main__.someClass object at 0x0000000002D96F98>
<__main__.someClass object at 0x0000000002D96F98>
It prints the same thing, but why? I also notice it with loops, for example:
for i in range(10):
    print(someClass())
Is there any particular reason Python does this when the class is called in quick succession? I didn't even know Python did this; is it possibly a bug? If it is not a bug, can someone explain how to make it generate a different id each time the class is called? I'm pretty puzzled, because if I wait, the id changes, but not if I call the same class two or more times in a row.
The id of an object is only guaranteed to be unique during that object's lifetime, not over the entire lifetime of a program. The two someClass objects you create only exist for the duration of the call to print - after that, they are available for garbage collection (and, in CPython, deallocated immediately). Since their lifetimes don't overlap, it is valid for them to share an id.
It is also unsurprising in this case, because of a combination of two CPython implementation details: first, it does garbage collection by reference counting (with some extra magic to avoid problems with circular references), and second, the id of an object is derived from the underlying pointer to the object (i.e., its memory location). So the first object, which was the most recent object allocated, is immediately freed; it isn't too surprising that the next object allocated will end up in the same spot (although this potentially also depends on details of how the interpreter was compiled).
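A quick demonstration of the lifetime rule (the second result is typical CPython behavior, not a language guarantee):

class someClass: pass

a = someClass()
b = someClass()
print(id(a) == id(b))  # False: both objects are alive at the same time

# Each temporary is freed before the next one is allocated, so CPython
# usually reuses the slot and the ids compare equal.
print(id(someClass()) == id(someClass()))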
If you are relying on several objects having distinct ids, you might keep them around, say, in a list, so that their lifetimes overlap. Otherwise, you might implement a class-specific id that has different guarantees, e.g.:

class SomeClass:
    next_id = 0

    def __init__(self):
        self.id = SomeClass.next_id
        SomeClass.next_id += 1
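If instances might be created from multiple threads, a variant built on itertools.count avoids the unlocked read-then-increment above (a sketch; in CPython, calling next() on a C-implemented iterator is effectively atomic under the GIL):

import itertools

class SomeClass:
    _ids = itertools.count()

    def __init__(self):
        self.id = next(SomeClass._ids)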
If you read the documentation for id, it says:
Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
And that's exactly what's happening: you have two objects with non-overlapping lifetimes, because the first one is already out of scope before the second one is ever created.
But don't trust that this will always happen, either. Especially if you need to deal with other Python implementations, or with more complicated classes. All that the language says is that these two objects may have the same id() value, not that they will. And the fact that they do depends on two implementation details:
The garbage collector has to clean up the first object before your code even starts to allocate the second object—which is guaranteed to happen with CPython or any other ref-counting implementation (when there are no circular references), but pretty unlikely with a generational garbage collector as in Jython or IronPython.
The allocator under the covers has to have a very strong preference for reusing recently-freed objects of the same type. This is true in CPython, which has multiple layers of fancy allocators on top of basic C malloc, but most of the other implementations leave a lot more to the underlying virtual machine.
One last thing: the fact that object.__repr__ happens to contain a substring matching the id rendered as a hexadecimal number is just an implementation artifact of CPython that isn't guaranteed anywhere. According to the docs:
If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description…> should be returned.
The fact that CPython's object happens to include hex(id(self)) (actually, I believe it's doing the equivalent of sprintf-ing its pointer through %p, but since CPython's id just returns that same pointer cast to a long, it ends up the same) isn't guaranteed anywhere, even though it has been true since before object even existed in the early 2.x days. You're safe to rely on it for this kind of simple "what's going on here" debugging at the interactive prompt, but don't try to use it beyond that.
I sense a deeper problem here. You should not be relying on id to track unique instances over the lifetime of your program. You should simply see it as a non-guaranteed memory location indicator for the duration of each object instance. If you immediately create and release instances then you may very well create consecutive instances in the same memory location.
Perhaps what you need to do is track a class static counter that assigns each new instance with a unique id, and increments the class static counter for the next instance.
Python releases the first instance because it wasn't retained; since nothing has touched that memory in the meantime, the second instance is allocated at the same location.
Try calling the following:

a = someClass()
for i in range(0, 44):
    print(someClass())
print(a)

You'll see something different. Why? Because the memory that was released by each temporary object in the for loop was reused. On the other hand, a's memory is not reused, since a is retained.
An example where the memory locations (and ids) are not released is:
print([someClass() for i in range(10)])
Now the ids are all unique.

Python ordered garbage collectible dictionary?

I want my Python program to be deterministic, so I have been using OrderedDicts extensively throughout the code. Unfortunately, while debugging memory leaks today, I discovered that OrderedDicts have a custom __del__ method, making them uncollectable whenever there's a cycle. It's rather unfortunate that there's no warning in the documentation about this.
So what can I do? Is there any deterministic dictionary in the Python standard library that plays nicely with gc? I'd really hate to have to roll my own, especially over a stupid one line function like this.
Also, is this something I should file a bug report for? I'm not familiar with the Python library's procedures, and what they consider a bug.
Edit: It appears that this is a known bug that was fixed back in 2010. I must have somehow gotten a really old version of 2.7 installed. I guess the best approach is to just include a monkey patch in case the user happens to be running a broken version like me.
If the presence of the __del__ method is problematic for you, just remove it:
>>> import collections
>>> del collections.OrderedDict.__del__
You will gain the ability to use OrderedDicts in a reference cycle; you will lose the OrderedDict freeing all its resources immediately upon deletion.
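If you go the monkeypatch route the asker mentions, guard it so it's a no-op on fixed versions; a minimal sketch:

import collections

# Only strip __del__ on broken builds where OrderedDict itself defines it.
if '__del__' in vars(collections.OrderedDict):
    del collections.OrderedDict.__del__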
It sounds like you've tracked down a bug in OrderedDict that was fixed at some point after your version of 2.7. If it wasn't in any actual released versions, maybe you can just ignore it. But otherwise, yeah, you need a workaround.
I would suggest that, instead of monkeypatching collections.OrderedDict, you should instead use the Equivalent OrderedDict recipe that runs on Python 2.4 or later linked in the documentation for collections.OrderedDict (which does not have the excess __del__). If nothing else, when someone comes along and says "I need to run this on 2.6, how much work is it to port" the answer will be "a little less"…
But two more points:
First, rewriting everything to avoid cycles would be a huge amount of effort. But the fact that you've got cycles in your dictionaries is a red flag that you're doing something wrong (typically using strong refs for a cache or for back-pointers), which is likely to lead to other memory problems, and possibly other bugs. So that effort may turn out to be necessary anyway.
Second, you still haven't explained what you're trying to accomplish; I suspect the "deterministic" thing is just a red herring (especially since dicts actually are deterministic), so the best solution is s/OrderedDict/dict/g.
But if determinism is necessary, you can't depend on the cycle collector, because it's not deterministic, and that means your finalizer ordering and so on all become non-deterministic. It also means your memory usage is non-deterministic—you may end up with a program that stays within your desired memory bounds 99.999% of the time, but not 100%; if those bounds are critically important, that can be worse than failing every time.
Meanwhile, the iteration order of dictionaries isn't specified, but in practice, CPython and PyPy iterate in the order of the hash buckets, not the id (memory location) of either the key or the value. Whatever Jython and IronPython do (they may be using some underlying Java or .NET collection that has different behavior; I haven't tested), it's unlikely that the memory order of the keys would be relevant. (How could you efficiently iterate a hash table based on something like that?) You may have confused yourself by testing with objects that use id for hash, but most objects hash based on value.
For example, take this simple program:
d = {}
d[0] = 0
d[1] = 1
d[2] = 2
for k in d:
    print(k, d[k], id(k), id(d[k]), hash(k))
If you run it repeatedly with CPython 2.7, CPython 3.2, and PyPy 1.9, the keys will always be iterated in order 0, 1, 2. The id columns may also be the same each time (that depends on your platform), but you can fix that in a number of ways—insert in a different order, reverse the order of the values, use string values instead of ints, assign the values to variables and then insert those variables instead of the literals, etc. Play with it enough and you can get every possible order for the id columns, and yet the keys are still iterated in the same order every time.
The order of iteration is not predictable, because to predict it you need the function for converting hash(k) into a bucket index, which depends on information you don't have access to from Python. Even if it's just hash(k) % self._table_size, unless that _table_size is exposed to the Python interface, it's not helpful. (It's a complex function of the sequence of inserts and deletes that could in principle be calculated, but in practice it's silly to try.)
But it is deterministic; if you insert and delete the same keys in the same order every time, the iteration order will be the same every time.
Note that the fix made in Python 2.7 to eliminate the __del__ method and so stop them from being uncollectable does unfortunately mean that every use of an OrderedDict (even an empty one) results in a reference cycle which must be garbage collected. See this answer for more details.

I found myself swinging the list comprehension hammer

... and every for-loop looked like a list comprehension.
Instead of:
for stuff in all_stuff:
do(stuff)
I was doing (not assigning the list to anything):
[ do(stuff) for stuff in all_stuff ]
This is a common pattern found in list-comp how-tos. 1) OK, so no big deal, right? Wrong. 2) Can't this just be code style? Super wrong.
1) Yea, that was wrong. As NiklasB points out, the point of the HowTos is to build up a new list.
2) Maybe, but it's not obvious and explicit, so better not to use it.
I didn't keep in mind that those how-to's were largely command-line based. After my team yelled at me wondering why the hell I was building up massive lists and then letting them go, it occurred to me that I might be introducing a major memory-related bug.
So here're my questions. If I were to do this in a very long-running process, where lots of data was being consumed, would this "list" just continue consuming my memory until let go? When will the garbage collector claim the memory back? After the scope this list is built in is lost?
My guess is yes, it will keep consuming my memory. I don't know how the python garbage collector works, but I would venture to say that this list will exist until after the last next is called on all_stuff.
EDIT.
The essence of my question is relayed much more cleanly in this question (thanks for the link, Niklas).
If I were to do this in a very long running process, where lots of data was being consumed, would this "list" just continue consuming my memory until let go?
Absolutely.
When will the garbage collector claim the memory back? After the scope this list is built in is lost?
CPython uses reference counting, so that is the most likely case. Other implementations work differently, so don't count on it.
Thanks to Karl for pointing out that, due to the complex memory management mechanisms used by CPython, this does not mean that the memory is immediately returned to the OS after that.
I don't know how the python garbage collector works, but I would venture to say that this list will exist until after the last next is called on all_stuff.
I don't think any garbage collector works like that. Usually they use mark-and-sweep, so it could be quite some time before the list is garbage collected.
This is a common pattern found on list-comp how-to's.
Absolutely not. The point is that you iterate the list with the purpose of doing something with every item (do is called for its side effects). In all the examples of the List-comp HOWTO, the list is iterated to build up a new list based on the items of the old one. Let's look at an example:
# list comp, creates the list [0,1,2,3,4,5,6,7,8,9]
[i for i in range(10)]

# loop, does nothing
for i in range(10):
    i  # meh, just an expression which doesn't have an effect
Maybe you'll agree that this loop is utterly senseless, as it doesn't do anything, in contrast to the comprehension, which builds a list. In your example, it's the other way round: the comprehension is completely senseless, because you don't need the list! You can find more information about the issue in a related question.
By the way, if you really want to write that loop in one line, use a generator consumer like deque.extend. This will be slightly slower than a raw for loop in this simple example, though:
>>> from collections import deque
>>> consume = deque(maxlen=0).extend
>>> consume(do(stuff) for stuff in all_stuff)
Try manually running the GC and dumping the statistics.
gc.DEBUG_STATS
Print statistics during collection. This information can be useful when tuning the collection frequency.
From http://docs.python.org/library/gc.html
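That is, something like this (a minimal sketch):

import gc

gc.set_debug(gc.DEBUG_STATS)  # print per-generation stats on each collection
gc.collect()                  # force a collection; the stats go to stderr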
The CPython GC will reap it once there are no references to it outside of a cycle. Jython and IronPython follow the rules of the underlying GCs.
If you like that idiom, if do returns something that consistently evaluates to either True or False, and if you would consider a similar alternative with no ugly side effects, you can use a generator expression combined with either any or all.
For functions that return False values (or don't return):
any(do(stuff) for stuff in all_stuff)
For functions that return True values:
all(do(stuff) for stuff in all_stuff)
I don't know how the python garbage collector works, but I would venture to say that this list will exist until after the last next is called on all_stuff.
Well, of course it will, since you're building a list with the same number of elements as all_stuff. The interpreter can't discard the list before it's finished, can it? You could call gc.collect between one of these loops and the next, but each one will be fully constructed before it can be reclaimed.
In some cases you could use a generator expression instead of a list comprehension, so it doesn't have to build a list with all your values:
(do_something(i) for i in xrange(1000))
However, you'd still have to "exhaust" that generator in some way...
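One way to exhaust it while keeping only one value alive at a time (a sketch, reusing the hypothetical do_something from above):

for _ in (do_something(i) for i in xrange(1000)):
    pass  # each result becomes garbage as soon as the next is produced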

Are there any Python reference counting/garbage collection gotchas when dealing with C code?

Just for the sheer heck of it, I've decided to create a Scheme binding to libpython so you can embed Python in Scheme programs. I'm already able to call into Python's C API, but I haven't really thought about memory management.
The way mzscheme's FFI works is that I can call a function, and if that function returns a pointer to a PyObject, then I can have it automatically increment the reference count. Then, I can register a finalizer that will decrement the reference count when the Scheme object gets garbage collected. I've looked at the documentation for reference counting, and don't see any problems with this at first glance (although it may be sub-optimal in some cases). Are there any gotchas I'm missing?
Also, I'm having trouble making heads or tails of the cyclic garbage collector documentation. What things will I need to bear in mind here? In particular, how do I make Python aware that I have a reference to something so it doesn't collect it while I'm still using it?
Your link to http://docs.python.org/extending/extending.html#reference-counts is the right place. The Extending and Embedding and Python/C API sections of the documentation are the ones that will explain how to use the C API.
Reference counting is one of the annoying parts of using the C API. The main gotcha is keeping everything straight: Depending on the API function you call, you may or may not own the reference to the object you get. Be careful to understand whether you own it (and thus must not forget to DECREF it or give it to something that will steal it) or are borrowing it (and must INCREF it to keep it, and possibly even to use it during your function). The most common bugs involving this are 1) remembering incorrectly whether you own a reference returned by a particular function and 2) believing you're safe to borrow a reference for longer than you actually are.
You do not have to do anything special for the cyclic garbage collector. It's just there to patch up a flaw in reference counting and doesn't require direct access.
The biggest gotcha I know with ref counting and the C API is the __del__ thing. When you have a borrowed reference to something, you think you can get away without INCREF'ing because you don't give up the GIL while you use that reference. But, if you end up deleting an object (by, for example, removing it from a list), it's possible that you trigger a __del__ call, which might remove the reference you're borrowing from under your feet. Very tricky.
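A contrived pure-Python illustration of the mechanism (the gotcha itself only bites C code holding borrowed references, but this shows that removing an item can run arbitrary __del__ code synchronously):

class Finalized(object):
    def __del__(self):
        # Arbitrary Python runs here, in the middle of the del below.
        # C code borrowing a reference into `container` at this moment
        # could find its object already gone.
        print("__del__ ran during the container mutation")

container = [Finalized()]
del container[0]  # drops the last reference; __del__ fires immediately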
If you INCREF (and then DECREF, of course) all borrowed references as soon as you get them, there shouldn't be any problem.
