What objects are guaranteed to have different identity? (Python)

ORIGINAL QUESTION:
(My question applies to Python 3.2+, but I doubt this has changed since Python 2.7.)
Suppose I use an expression that we usually expect to create an object. Examples: [1,2,3]; 42; 'abc'; range(10); True; open('readme.txt'); MyClass(); lambda x : 2 * x; etc.
Suppose two such expressions are executed at different times and "evaluate to the same value" (i.e., have the same type, and compare as equal). Under what conditions does Python provide what I call a distinct object guarantee that the two expressions actually create two distinct objects (i.e., x is y evaluates as False, assuming the two objects are bound to x and y, and both are in scope at the same time)?
I understand that for objects of any mutable type, the "distinct object guarantee" holds:
x = [1,2]
y = [1,2]
assert x is not y # guaranteed to pass
I also know for certain immutable types (str, int) the guarantee does not hold; and for certain other immutable types (bool, NoneType), the opposite guarantee holds:
x = True
y = not not x
assert x is not y # guaranteed to fail
x = 2
y = 3 - 1
assert x is not y # implementation-dependent; likely to fail in CPython
x = 1234567890
y = x + 1 - 1
assert x is not y # implementation-dependent; likely to pass in CPython
But what about all the other immutable types?
In particular, can two tuples created at different times have the same identity?
The reason I'm interested in this is that I represent nodes in my graph as tuples of int, and the domain model is such that any two nodes are distinct (even if they are represented by tuples with the same values). I need to create sets of nodes. If Python guarantees that tuples created at different times are distinct objects, I could simply subclass tuple to redefine equality to mean identity:
class DistinctTuple(tuple):
    __hash__ = tuple.__hash__
    def __eq__(self, other):
        return self is other
x = (1,2)
y = (1,2)
s = set([x, y])
assert len(s) == 1 # pass; but not what I want
x = DistinctTuple(x)
y = DistinctTuple(y)
s = set([x, y])
assert len(s) == 2 # pass; as desired
But if tuples created at different times are not guaranteed to be distinct, then the above is a terrible technique, which hides a dormant bug that may appear at random and may be very hard to replicate and find. In that case, subclassing won't help; I will actually need to add to each tuple, as an extra element, a unique id. Alternatively, I can convert my tuples to lists. Either way, I'd use more memory. Obviously, I'd prefer not to use these alternatives unless my original subclassing solution is unsafe.
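For reference, the unique-id alternative is simple to sketch (make_node and the module-level counter are hypothetical names, just for illustration):
import itertools
_node_ids = itertools.count()
def make_node(*coords):
    # append a process-unique serial number as the last element, so
    # tuples with equal coordinates still compare (and hash) unequal
    return tuple(coords) + (next(_node_ids),)
a = make_node(1, 2)
b = make_node(1, 2)
assert a != b            # different serial numbers
assert len({a, b}) == 2  # both fit in one set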
My guess is that Python does not offer the "distinct object guarantee" for immutable types, either built-in or user-defined. But I haven't found a clear statement about it in the documentation.
UPDATE 1:
@LuperRouch @larsmans Thank you for the discussion and the answer so far. Here's the last issue I'm still unclear with:
Is there any chance that the creation of an object of a user-defined
type results in a reuse of an existing object?
If this is possible, I'd like to know how I can verify for any class I work with whether it might exhibit such a behavior.
Here's my understanding. Any time an object of a user-defined class is created, the class' __new__() method is called first. If this method is overridden, nothing in the language would prevent the programmer from returning a reference to an existing object, thus violating my "distinct object guarantee". Obviously, I can observe it by examining the class definition.
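For illustration, here is a minimal (hypothetical) interning class whose __new__ hands back a cached instance instead of a fresh one:
class InternedPoint:
    _cache = {}
    def __new__(cls, x, y):
        # reuse the cached instance for (x, y) if one exists
        key = (x, y)
        if key not in cls._cache:
            cls._cache[key] = super().__new__(cls)
        return cls._cache[key]
    def __init__(self, x, y):
        self.x, self.y = x, y
p = InternedPoint(1, 2)
q = InternedPoint(1, 2)
assert p is q  # the "distinct object guarantee" does not hold here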
I am not sure what happens if a user-defined class does not override __new__() (or explicitly relies on the __new__() inherited from the base class). If I write
class MyInt(int):
    pass
the object creation is handled by int.__new__(). I would expect that this means I may sometimes see the following assertion fail:
x = MyInt(1)
y = MyInt(1)
assert x is not y # may fail, since int.__new__() might return the same object twice?
But in my experimentation with CPython I could not achieve such behavior. Does this mean the language provides "distinct object guarantee" for user-defined classes that don't override __new__, or is it just an arbitrary implementation behavior?
UPDATE 2:
While my DistinctTuple turned out to be a perfectly safe implementation, I now understand that my design idea of using DistinctTuple to model nodes is very bad.
The identity operator is already available in the language; making == behave the same way as is is logically superfluous.
Worse, where == could have done something useful, I made it unavailable. For instance, it's quite likely that somewhere in my program I'll want to check whether two nodes are represented by the same pair of integers; == would have been perfect for that - and in fact, that's what it does by default...
Worse yet, most people actually do expect == to compare some "value" rather than identity - even for a user-defined class. They would be caught unawares with my override that only looks at identity.
Finally... the only reason I had to redefine == was to allow multiple nodes with the same tuple representation to be part of a set. This is the wrong way to go about it! It's not == behavior that needs to change, it's the container type! I simply needed to use multisets instead of sets.
In short, while my question may have some value for other situations, I am absolutely convinced that creating class DistinctTuple is a terrible idea for my use case (and I strongly suspect it has no valid use case at all).
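For completeness, the multiset approach is straightforward with collections.Counter from the standard library:
from collections import Counter
nodes = Counter()
nodes[(1, 2)] += 1
nodes[(1, 2)] += 1  # a second node with the same representation
nodes[(3, 4)] += 1
assert nodes[(1, 2)] == 2        # two nodes share this tuple value
assert sum(nodes.values()) == 3  # three nodes in total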

Python reference, section 3, Data model:
for immutable types, operations that compute new values *may* actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed.
(Emphasis added.)
In practice, it seems CPython only caches the empty tuple:
>>> 1 is 1
True
>>> (1,) is (1,)
False
>>> () is ()
True

Is there any chance that the creation of an object of a user-defined type results in a reuse of an existing object?
This will happen if, and only if, the user-defined type is explicitly designed to do that. With __new__() or some metaclass.
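For example, here is a sketch of the metaclass variant (Interning and Point are hypothetical names; the cache-by-arguments policy is just for illustration):
class Interning(type):
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._cache = {}
    def __call__(cls, *args):
        # intercept instance creation and reuse a cached instance
        if args not in cls._cache:
            cls._cache[args] = super().__call__(*args)
        return cls._cache[args]
class Point(metaclass=Interning):
    def __init__(self, x, y):
        self.x, self.y = x, y
assert Point(1, 2) is Point(1, 2)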
I'd like to know how I can verify for any class I work with whether it might exhibit such a behavior.
Use the source, Luke.
When it comes to int, small integers are pre-allocated, and these pre-allocated integers are used wherever you create or calculate with integers. You can't get this working when you do MyInt(1) is MyInt(1), because what you have there are not plain integers. However:
>>> MyInt(1) + MyInt(1) is 2
True
This is because MyInt(1) + MyInt(1) of course does not return a MyInt. It returns an int, because that's what an integer's __add__ returns (and that's where the check for pre-allocated integers occurs as well). If anything, this just shows that subclassing int in general isn't particularly useful. :-)
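If you did want arithmetic to stay inside the subclass, you would have to re-wrap the results yourself; a minimal sketch:
class MyInt(int):
    def __add__(self, other):
        # int.__add__ returns a plain int; wrap it back into MyInt
        return MyInt(int(self) + int(other))
assert type(MyInt(1) + MyInt(1)) is MyInt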
Does this mean the language provides "distinct object guarantee" for user-defined classes that don't override __new__, or is it just an arbitrary implementation behavior?
It doesn't guarantee it, because there is no need to do so. The default behavior is to create a new object. You have to override it if you don't want that to happen. Having a guarantee makes no sense.
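The default behavior is easy to observe (CPython session shown; the default object.__new__ always allocates a fresh instance):
>>> object() is object()
False
>>> class Plain: pass
...
>>> Plain() is Plain()
False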

If Python guarantees that tuples created at different times are distinct objects, I could simply subclass tuple to redefine equality to mean identity.
You seem to be confused about how subclassing works: If B subclasses A, then B gets to use all of A's methods[1] -- but the A methods will be working on instances of B, not of A. This holds even for __new__:
--> class Node(tuple):
...     def __new__(cls):
...         obj = tuple.__new__(cls)
...         print(type(obj))
...         return obj
...
--> n = Node()
<class '__main__.Node'>
As #larsman pointed out in the Python reference:
for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed
However, keep in mind this passage is talking about Python's built-in types, not user-defined types (which can go crazy pretty much any way they like).
I understand the above excerpt to guarantee that Python will never reuse an existing mutable object when a new one is requested, and classes that are user-defined and created in Python code are inherently mutable (again, see note above about crazy user-defined classes).
A more complete Node class (note that, since __eq__ is overridden, you do need to explicitly reference tuple.__hash__[1]):
class Node(tuple):
    __slots__ = tuple()
    __hash__ = tuple.__hash__
    def __eq__(self, other):
        return self is other
    def __ne__(self, other):
        return self is not other
--> n1 = Node()
--> n2 = Node()
--> n1 is n2
False
--> n1 == n2
False
--> n1 != n2
True
--> n1 <= n2
True
--> n1 < n2
False
As you can see from the last two comparisons, you may also want to override the ordering methods (__lt__, __le__, __gt__, __ge__), as in the sketch below.
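For instance, one option (a sketch) is to refuse ordering altogether rather than inherit tuple's content-based ordering:
class Node(tuple):
    __slots__ = tuple()
    __hash__ = tuple.__hash__
    def __eq__(self, other):
        return self is other
    def __ne__(self, other):
        return self is not other
    def __lt__(self, other):
        return NotImplemented  # refuse content-based ordering
    __le__ = __gt__ = __ge__ = __lt__
# In Python 3, Node() < Node() now raises TypeError instead of
# silently comparing tuple contents.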
[1] The only exception I am aware of is __hash__ -- if __eq__ is defined on the subclass but the subclass wants the parent class' __hash__ it has to explicitly say so (this is a Python 3 change).

Related

Distinct python classes instances returning the same class object [duplicate]

How much can I rely on the object's id() and its uniqueness in practice? E.g.:
Does id(a) == id(b) mean a is b or vice versa? What about the opposite?
How safe is it to save an id somewhere to be used later (e.g. into some registry instead of the object itself)?
(Written as a proposed canonical in response to Canonicals for Python: are objects with the same id() the same object, `is` operator, unbound method objects)
According to the id() documentation, an id is only guaranteed to be unique
- for the lifetime of the specific object, and
- within a specific interpreter instance
As such, comparing ids is not safe unless you also somehow ensure that both objects whose ids are taken are still alive at the time of comparison (and are associated with the same Python interpreter instance, but you need to really try to make that become false).
Which is exactly what is does -- which makes comparing ids redundant. If you cannot use the is syntax for whatever reason, there's always operator.is_.
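For example, operator.is_ is handy as a callable:
>>> import operator
>>> x = object(); y = x
>>> operator.is_(x, y)
True
>>> operator.is_(x, object())
False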
Now, whether an object is still alive at the time of comparison is not always obvious (and sometimes is grossly non-obvious):
Accessing some attributes (e.g. bound methods of an object) creates a new object each time. So, the result's id may or may not be the same on each attribute access.
Example:
>>> class C(object): pass
>>> c=C()
>>> c.a=1
>>> c.a is c.a
True # same object each time
>>> c.__init__ is c.__init__
False # a different object each time
# The above two are not the only possible cases.
# An attribute may be implemented to sometimes return the same object
# and sometimes a different one:
@property
def page(self):
    if check_for_new_version():
        self._page = get_new_version()
    return self._page
If an object is created as a result of calculating an expression and not saved anywhere, it's immediately discarded,[1] and any object created after that can take up its id.
This is even true within the same code line. E.g. the result of id(create_foo()) == id(create_bar()) is undefined.
Example:
>>> id([]) #the list object is discarded when id() returns
39733320L
>>> id([]) #a new, unrelated object is created (and discarded, too)
39733320L #its id can happen to be the same
>>> id([[]])
39733640L #or not
>>> id([])
39733640L #you never really know
Due to the above safety requirements when comparing ids, saving an id instead of the object is not very useful because you have to save a reference to the object itself anyway -- to ensure that it stays alive. Neither is there any performance gain: the is implementation is as simple as comparing pointers.
Finally, as an internal optimization (and implementation detail, so this may differ between implementations and releases), CPython reuses some often-used simple objects of immutable types. As of this writing, that includes small integers and some strings. So even if you got them from different places, their ids might coincide.
This does not (technically) violate the above id() documentation's uniqueness promises: the reused object stays alive through all the reuses.
This is also not a big deal because whether two variables point to the same object or not is only practical to know if the object is mutable: if two variables point to the same mutable object, mutating one will (unexpectedly) change the other, too. Immutable types don't have that problem, so for them, it doesn't matter if two variables point to two identical objects or to the same one.
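For example:
>>> a = [1, 2]
>>> b = a          # two names, one mutable object
>>> b.append(3)
>>> a
[1, 2, 3]          # the change is visible through the other name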
[1] Sometimes, this is called an "unnamed expression".

In Python, a class object is an immutable object, but it can be modified -- why?

I am not familiar with Python. I recently found that class objects and instance objects can be dictionary keys, so I assumed class objects and instances are immutable objects. As we all know, a dictionary's keys must be immutable objects, and a tuple used as a key must likewise contain only immutable objects; in other words, if you use a tuple as a dictionary key, it can't contain a list object, etc. But a class object can be a key, and a class object can modify its attributes. This confused me for a long time. Please enlighten me. I also don't understand the class namespace concept, the instance namespace, and the relationship between them. Could you explain this for me? Thanks in advance. The following is my test:
class Student(object):
    name = 'tests'
    pass

dic = {Student: 'test'}  # not an error
print(id(Student))
Student.name = 'modified'
print(id(Student))
You need to be careful here. You're mixing 2 separate (but closely related) concepts.
The first concept is immutability. Immutable objects cannot change once you've created them.
The second concept is hashability -- the ability to construct a consistent integer value from an object that does not change over its lifetime, and to define a consistent equality function.[1] Note, these constraints are very important (as we'll see).
The latter concept determines what can be used as a dictionary key (or set item). By default, class instances have a well defined equality function (two objects are equal iff they have the same id()). Class instances also have an integer value that does not change over their lifetime (their id()). Because the id() is also exposed as the hash() return value, instances of classes (and, classes themselves which are instances of type) are hashable by default.
class Foo(object):
    def __init__(self, a):
        self.a = a

f1 = Foo(1)
f2 = Foo(1)

d = {
    f1: 1,
    f2: 2,
}
Here we have 2 separate Foo instances in our dictionary. Even though their contents are the same, they aren't equal and they have different hash values.
f1 == f2 # False -- They do not have the same id()
hash(f1) == hash(f2) # False. By default, the hash() is the id()
Ok, but not all things are hashable -- e.g. list and set instances aren't hashable. At some point, reference equality isn't so useful anymore. e.g. I write:
d = {[1, 2, 3]: 6}
print(d[[1, 2, 3]])
and (supposing lists hashed by identity) I would get a KeyError. Why? Because my two lists aren't the same list -- they just happen to have the same values. In other words, they are equal, but they don't have reference equality. Now that just starts to get really confusing. To avoid all that confusion, the Python devs decided not to expose the list's id() through the list's hash(). Instead, they raise a TypeError with a (hopefully) more helpful error message:
hash([]) # TypeError: unhashable type: 'list'
Note that equality is overridden to do the natural thing rather than compare by id():
l1 = [1]
l2 = [1]
l1 == l2 # True. Nice.
Alright, so far we've basically said that to put something in a dictionary, we need to have well behaving __hash__ and __eq__ methods and that objects have those by default. Some objects choose to remove them to avoid confusing situations. Where does immutability come in to this?
So far, our world consists of being able to store things in a table and look them up solely by the object's id(). That's super useful sometimes, but it's still really restrictive. I wouldn't be able to use integers in a lookup table naturally if all I can rely on is their id() (what if I store it using a literal but then do a lookup using the result of a computation?). Fortunately, we live in a world that lets us get around that problem -- immutability aids in the construction of a hash() value that isn't tied to the object's id() and isn't in danger of changing during the object's lifetime. This can be super useful because now I can do:
d = {(1, 2, 3): 4}
d[(1, 2) + (3,)] # 4!
Now the two tuples that I used were not the same tuple (they didn't have the same id()), but they are equal, and because they're immutable, we can construct a hash() function that uses the contents of the tuple rather than its id(). This is super useful! Note that if the tuple were mutable and we tried to play this trick, we'd (potentially) violate the condition that hash() must not change over the lifetime of the object.
[1] Consistent here means that if two objects are equal, then they must also have the same hash. This is necessary for resolving hash collisions, which I won't discuss here in detail...

Python: Testing equivalence of sets of custom classes when all instances are unique by definition?

Using Python 2.6, with the set() builtin, not sets.Set.
I have defined some custom data abstraction classes, which will be made members of some sets using the builtin set() object.
The classes are already being stored in a separate structure, before being divided up into sets. All instances of the classes are declared first. No class instances are created or deleted after the first set is declared. No two class instances are ever considered to be "equal" to each other. (Two instances of the class, containing identical data, are considered not the same. A == B is False for all A,B where B is not A.)
Given the above, will there be any reasonable difference between these strategies for testing set_a == set_b?:
Option 1: Store integers in the sets that uniquely identify instances of my class.
Option 2: Store instances of my class, and implement __hash__() and __eq__() to compare id(self) == id(other). (This may not be necessary? Do default implementations of these functions in object just do the same thing but faster?) Possibly use an instance variable that increments every time a new instance calls __init__(). (Not thread safe?)
or,
Option 3: The instances are already stored and looked up in dictionaries keyed by rather long strings. The strings are what most directly represents what the instances are, and are kept unique. I thought storing these strings in the sets would be a RAM overhead and/or create a bunch of extra runtime by calling __eq__() and __hash__(). If this is not the case, I should store the strings directly. (But I think what I've read so far tells me it is the case.)
I'm somewhat new to sets in Python. I've figured out some of what I need to know already, just want to make sure I'm not overlooking something tricky or drawing a false conclusion somewhere.
I might be misunderstanding the question, but this is how Python behaves by default:
class Foo(object):
    pass

a = Foo()
b = Foo()
c = Foo()

x = set([a, b])
y = set([a, b])
z = set([a, c])

print x == y # True
print x == z # False
Do default implementations of these functions in object just do the same thing but faster?
Yes. User-defined classes have __cmp__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns id(x). (See the Python 2 data model docs.)

custom comparison for built-in containers

In my code there's numerous comparisons for equality of various containers (list, dict, etc.). The keys and values of the containers are of types float, bool, int, and str. The built-in == and != worked perfectly fine.
I just learned that the floats used in the values of the containers must be compared using a custom comparison function. I've written that function already (let's call it approxEqual(), and assume that it takes two floats and returns True if they are judged to be equal and False otherwise).
I prefer that the changes to the existing code are kept to a minimum. (New classes/functions/etc can be as complicated as necessary.)
Example:
if dict1 != dict2:
    raise DataMismatch
The dict1 != dict2 condition needs to be rewritten so that any floats used in the values of dict1 and dict2 are compared using the approxEqual() function instead of __eq__.
The actual contents of dictionaries comes from various sources (parsing files, calculations, etc.).
Note: I asked a question earlier about how to override built-in float's eq. That would have been an easy solution, but I learned that Python doesn't allow overriding built-in types' __eq__ operator. Hence this new question.
The only route to altering the way built-in containers check equality is to make them contain as values, instead of the "originals", wrapped values (wrapped in a class that overrides __eq__ and __ne__). This is if you need to alter the way the containers themselves use equality checking, e.g. for the purpose of the in operator where the right-hand side operand is a list -- as well as in containers' method such as their own __eq__ (type(x).__eq__(y) is the typical way Python will perform internally what you code as x == y).
If what you're talking about is performing your own equality checks (without altering the checks performed internally by the containers themselves), then the only way is to change every cont1 == cont2 into (e.g.) same(cont1, cont2, value_same) where value_same is a function accepting two values and returning True or False like == would. That's probably too invasive WRT the criterion you specify.
If you can change the container themselves (i.e., the number of places where container objects are created is much smaller than the number of places where two containers are checked for equality), then using a container subclass which overrides __eq__ is best.
E.g.:
class EqMixin(object):
    def __eq__(self, other):
        return same(self, other, value_same)
(with same being as I mentioned in the second paragraph of this answer) and
class EqM_list(EqMixin, list): pass
(and so forth for other container types you need), then wherever you have (e.g.)
x = list(someiter)
change it into
x = EqM_list(someiter)
and be sure to also catch other ways to create list objects, e.g. replace
x = [bah*2 for bah in buh]
with
x = EqM_list(bah*2 for bah in buh)
and
x = d.keys()
with
x = EqM_list(d.iterkeys())
and so forth.
Yeah, I know, what a bother -- but it's a core principle (and practice;-) of Python that builtin types (be they containers, or value types like e.g. float) themselves cannot be changed. That's a very different philosophy from e.g. Ruby's and Javascript's (and I personally prefer it but I do see how it can seem limiting at times!).
Edit: the OP specific request seems to be (in terms of this answer) "how do I implement same" for the various container types, not how to apply it without changing the == into a function call. If that's correct, then (e.g) without using iterators for simplicity:
def samelist(a, b, samevalue):
    if len(a) != len(b):
        return False
    return all(samevalue(x, y) for x, y in zip(a, b))

def samedict(a, b, samevalue):
    if set(a) != set(b):
        return False
    return all(samevalue(a[x], b[x]) for x in a)
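For instance, with a tolerance-based approxEqual (a sketch; the absolute tolerance is arbitrary):
def approxEqual(x, y, tol=1e-9):
    # floats compare within a tolerance; everything else compares exactly
    if isinstance(x, float) or isinstance(y, float):
        return abs(x - y) <= tol
    return x == y
d1 = {'a': 0.1 + 0.2, 'b': 'text'}
d2 = {'a': 0.3, 'b': 'text'}
assert d1 != d2                       # plain == trips over the floats
assert samedict(d1, d2, approxEqual)  # tolerant comparison succeeds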
Note that this applies to values, as requested, NOT to keys. "Fuzzying up" the equality comparison of a dict's keys (or a set's members) is a REAL problem. Look at it this way: first, how do you guarantee with absolute certainty that samevalue(a, b) and samevalue(b, c) totally imply and ensure samevalue(a, c)? This transitivity condition does not apply to most semi-sensible "fuzzy comparisons" I've ever seen, and yet it's completely indispensable for the hash-table based containers (such as dicts and sets). If you pass that hurdle, then the nightmare of making the hash values somehow "magically" consistent arises -- and what if two actually different keys in one dict "map to" equality in this sense with the same key in the other dict, which of the two corresponding values should be used then...? This way madness lies, if you ask me, so I hope that when you say values you do mean, exactly, values, and not keys!-)

Is there an object unique identifier in Python

This would be similar to the java.lang.Object.hashcode() method.
I need to store objects I have no control over in a set, and make sure that only if two objects are actually the same object (not contain the same values) will the values be overwritten.
id(x)
will do the trick for you. But I'm curious: what's wrong with a set of the objects (which does combine objects by value)?
For your particular problem I would probably keep the set of ids or of wrapper objects. A wrapper object will contain one reference and compare by x==y <==> x.ref is y.ref.
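A minimal sketch of such a wrapper:
class Ref(object):
    def __init__(self, obj):
        self.ref = obj  # keeps the wrapped object alive, too
    def __eq__(self, other):
        # equal only if both wrap the very same object
        return isinstance(other, Ref) and self.ref is other.ref
    def __ne__(self, other):
        return not self == other
    def __hash__(self):
        return id(self.ref)
a = [1, 2]
b = [1, 2]
s = set([Ref(a), Ref(a), Ref(b)])
assert len(s) == 2  # equal values, but two distinct objects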
It's also worth noting that Python objects have a hash function as well. This function is necessary to put an object into a set or dictionary. It can collide for different objects, though good implementations of hash try to make that less likely.
That's what "is" is for.
Instead of testing "if a == b", which tests for the same value,
test "if a is b", which will test for the same identifier.
As ilya n mentions, id(x) produces a unique identifier for an object.
But your question is confusing, since Java's hashCode method doesn't give a unique identifier. Java's hashCode works like most hash functions: it always returns the same value for the same object, two objects that are equal always get equal codes, and unequal hash codes imply unequal objects. In particular, two different and unequal objects can get the same value.
This is confusing because cryptographic hash functions are quite different from this, and more like (though not exactly) the "unique id" that you asked for.
The Python equivalent of Java's hashCode method is hash(x).
You don't have to compare objects before placing them in a set. set() semantics already takes care of this.
class A(object):
    a = 10
    b = 20
    def __hash__(self):
        return hash((self.a, self.b))

a1 = A()
a2 = A()
a3 = A()
a4 = a1
s = set([a1, a2, a3, a4])
s
=> set([<__main__.A object at 0x222a8c>, <__main__.A object at 0x220684>, <__main__.A object at 0x22045c>])
Note: You really don't have to override __hash__ to prove this behaviour :-)
