Related
This question already has an answer here:
Confused about Python’s id() [duplicate]
(1 answer)
Closed 3 years ago.
We know that in Python the same character values within a string have the same ID value (same concept for list & tuple), as for example:
>>> var = 'wwww'
>>> print(id[0])
>>> 88293056
>>> print(id[1])
>>> 88293056
>>> print(id[2])
>>> 88293056
>>> print(id[3])
>>> 88293056
This is because all the positions (0 to 3) are referring the same object w in memory.
But what about the size of the string? If we see the size of variable var then it is showing 29.
>>> sys.getsizeof(var)
>>> 29
>>> sys.getsizeof('wwww')
>>> 29
>>> sys.getsizeof('www')
>>> 28
>>> sys.getsizeof('ww')
>>> 27
>>> sys.getsizeof('w')
>>> 26
Is this means for each character its taking 1 Byte within the string? then why sys.getsizeof('') is returning 25? is it the default size getting allocated for a string?
If all the positions are referring at the same object constant in memory (id value 88293056) then size of the variable var should be same as one character's size.
Similar thing is happening for list also.
>>> var = [1,1]
>>> print(id(var[0]))
>>> 1734203568
>>> print(id(var[1]))
>>> 1734203568
>>> sys.getsizeof(a[1])
14
>>> sys.getsizeof(a[0])
14
>>> sys.getsizeof(var)
>>> 44
Need some explanation about these.
We know that in Python the same character values within a string have the same ID value
Totally wrong.
First, "character values" have no id at all since python has no character type - a one-character string is still a string.
Then, the fact that CPython does intern some strings (as well as some integers) is an implementation detail, it's specific to the CPython implementation and is in no way part of the language itself (other implementations may not do so, or may do it according to other rules etc).
(same concept for list & tuple)
Definitly not:
>>> f = [{"foo":42}, {"foo": 42}]
>>> id(f[0])
139669285840048
>>> id(f[1])
139669257272968
>>> f[0] == f[1]
True
Or did you meant "for lists of strings and tuples of strings" ? If yes, same answer as for strings: it's only a product of CPython's strings interning and is in no way part of the language specifications.
This is because all the positions (0 to 3) are referring the same object w in memory.
Cf above. The string "www" is not made of three references to the string "w", and the evaluation of the expression "www"[0] actually yields a new string built from the first character of "www" - or, in the case of the CPython implementation, it first lookup it's cache, then either build a new string, cache it and return it, or just return the cached one, which is why you get the same ids.
wrt/ sys.getsizeof results, you have to understand that Python is not C. Python objects are not C scalar values but complex data structures (implemented mostly as structs containting pointers to other structs etc in CPython). Also, when a Python object references another (object's attributes, lists or tuples contents, dicts values etc), what's stored in the inner data structure is not the referenced object but only a reference to it, so what sys.getsizeof will use is the size of the reference (a PyObject pointer in CPython), not the size of the referenced object:
>>> l = ["foo"]
>>> sys.getsizeof(l)
80
>>> l[0] = "aaaaaaaaaaaaaaaaaaaaaaaaaaavvvvvvvvvvvvvvv"
>>> sys.getsizeof(l)
80
>>>
Similar thing is happening for list also.
var = [1,1]
CPython also caches "small" integers (for a definition of "small" that changed quite a few times... IIRC in 1.5.2 it was something like 255 or less). What you're seeing here are the size of a CPython int object and the minimal size (a larger list will have a larger size since it has to store more references) of a CPython list object on your system (those values can change depending on CPython version, target OS and compilation flags).
Also, wrt/ lists, CPython has some (once again implementation-specific) optimisations to avoid reallocating memory each time you append something, so do not expect a linear "list len -> list size" relation.
Two string variables are set to the same value. s1 == s2 always returns True, but s1 is s2 sometimes returns False.
If I open my Python interpreter and do the same is comparison, it succeeds:
>>> s1 = 'text'
>>> s2 = 'text'
>>> s1 is s2
True
Why is this?
is is identity testing, and == is equality testing. What happens in your code would be emulated in the interpreter like this:
>>> a = 'pub'
>>> b = ''.join(['p', 'u', 'b'])
>>> a == b
True
>>> a is b
False
So, no wonder they're not the same, right?
In other words: a is b is the equivalent of id(a) == id(b)
Other answers here are correct: is is used for identity comparison, while == is used for equality comparison. Since what you care about is equality (the two strings should contain the same characters), in this case the is operator is simply wrong and you should be using == instead.
The reason is works interactively is that (most) string literals are interned by default. From Wikipedia:
Interned strings speed up string
comparisons, which are sometimes a
performance bottleneck in applications
(such as compilers and dynamic
programming language runtimes) that
rely heavily on hash tables with
string keys. Without interning,
checking that two different strings
are equal involves examining every
character of both strings. This is
slow for several reasons: it is
inherently O(n) in the length of the
strings; it typically requires reads
from several regions of memory, which
take time; and the reads fills up the
processor cache, meaning there is less
cache available for other needs. With
interned strings, a simple object
identity test suffices after the
original intern operation; this is
typically implemented as a pointer
equality test, normally just a single
machine instruction with no memory
reference at all.
So, when you have two string literals (words that are literally typed into your program source code, surrounded by quotation marks) in your program that have the same value, the Python compiler will automatically intern the strings, making them both stored at the same memory location. (Note that this doesn't always happen, and the rules for when this happens are quite convoluted, so please don't rely on this behavior in production code!)
Since in your interactive session both strings are actually stored in the same memory location, they have the same identity, so the is operator works as expected. But if you construct a string by some other method (even if that string contains exactly the same characters), then the string may be equal, but it is not the same string -- that is, it has a different identity, because it is stored in a different place in memory.
The is keyword is a test for object identity while == is a value comparison.
If you use is, the result will be true if and only if the object is the same object. However, == will be true any time the values of the object are the same.
One last thing to note is you may use the sys.intern function to ensure that you're getting a reference to the same string:
>>> from sys import intern
>>> a = intern('a')
>>> a2 = intern('a')
>>> a is a2
True
As pointed out in previous answers, you should not be using is to determine equality of strings. But this may be helpful to know if you have some kind of weird requirement to use is.
Note that the intern function used to be a built-in on Python 2, but it was moved to the sys module in Python 3.
is is identity testing and == is equality testing. This means is is a way to check whether two things are the same things, or just equivalent.
Say you've got a simple person object. If it is named 'Jack' and is '23' years old, it's equivalent to another 23-year-old Jack, but it's not the same person.
class Person(object):
def __init__(self, name, age):
self.name = name
self.age = age
def __eq__(self, other):
return self.name == other.name and self.age == other.age
jack1 = Person('Jack', 23)
jack2 = Person('Jack', 23)
jack1 == jack2 # True
jack1 is jack2 # False
They're the same age, but they're not the same instance of person. A string might be equivalent to another, but it's not the same object.
This is a side note, but in idiomatic Python, you will often see things like:
if x is None:
# Some clauses
This is safe, because there is guaranteed to be one instance of the Null Object (i.e., None).
If you're not sure what you're doing, use the '=='.
If you have a little more knowledge about it you can use 'is' for known objects like 'None'.
Otherwise, you'll end up wondering why things doesn't work and why this happens:
>>> a = 1
>>> b = 1
>>> b is a
True
>>> a = 6000
>>> b = 6000
>>> b is a
False
I'm not even sure if some things are guaranteed to stay the same between different Python versions/implementations.
From my limited experience with Python, is is used to compare two objects to see if they are the same object as opposed to two different objects with the same value. == is used to determine if the values are identical.
Here is a good example:
>>> s1 = u'public'
>>> s2 = 'public'
>>> s1 is s2
False
>>> s1 == s2
True
s1 is a Unicode string, and s2 is a normal string. They are not the same type, but they are the same value.
I think it has to do with the fact that, when the 'is' comparison evaluates to false, two distinct objects are used. If it evaluates to true, that means internally it's using the same exact object and not creating a new one, possibly because you created them within a fraction of 2 or so seconds and because there isn't a large time gap in between it's optimized and uses the same object.
This is why you should be using the equality operator ==, not is, to compare the value of a string object.
>>> s = 'one'
>>> s2 = 'two'
>>> s is s2
False
>>> s2 = s2.replace('two', 'one')
>>> s2
'one'
>>> s2 is s
False
>>>
In this example, I made s2, which was a different string object previously equal to 'one' but it is not the same object as s, because the interpreter did not use the same object as I did not initially assign it to 'one', if I had it would have made them the same object.
The == operator tests value equivalence. The is operator tests object identity, and Python tests whether the two are really the same object (i.e., live at the same address in memory).
>>> a = 'banana'
>>> b = 'banana'
>>> a is b
True
In this example, Python only created one string object, and both a and b refers to it. The reason is that Python internally caches and reuses some strings as an optimization. There really is just a string 'banana' in memory, shared by a and b. To trigger the normal behavior, you need to use longer strings:
>>> a = 'a longer banana'
>>> b = 'a longer banana'
>>> a == b, a is b
(True, False)
When you create two lists, you get two objects:
>>> a = [1, 2, 3]
>>> b = [1, 2, 3]
>>> a is b
False
In this case we would say that the two lists are equivalent, because they have the same elements, but not identical, because they are not the same object. If two objects are identical, they are also equivalent, but if they are equivalent, they are not necessarily identical.
If a refers to an object and you assign b = a, then both variables refer to the same object:
>>> a = [1, 2, 3]
>>> b = a
>>> b is a
True
Reference: Think Python 2e by Allen B. Downey
I believe that this is known as "interned" strings. Python does this, so does Java, and so do C and C++ when compiling in optimized modes.
If you use two identical strings, instead of wasting memory by creating two string objects, all interned strings with the same contents point to the same memory.
This results in the Python "is" operator returning True because two strings with the same contents are pointing at the same string object. This will also happen in Java and in C.
This is only useful for memory savings though. You cannot rely on it to test for string equality, because the various interpreters and compilers and JIT engines cannot always do it.
Actually, the is operator checks for identity and == operator checks for equality.
From the language reference:
Types affect almost all aspects of object behavior. Even the importance of object identity is affected in some sense: for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed. E.g., after a = 1; b = 1, a and b may or may not refer to the same object with the value one, depending on the implementation, but after c = []; d = [], c and d are guaranteed to refer to two different, unique, newly created empty lists. (Note that c = d = [] assigns the same object to both c and d.)
So from the above statement we can infer that the strings, which are immutable types, may fail when checked with "is" and may succeed when checked with "is".
The same applies for int and tuple which are also immutable types.
is will compare the memory location. It is used for object-level comparison.
== will compare the variables in the program. It is used for checking at a value level.
is checks for address level equivalence
== checks for value level equivalence
is is identity testing and == is equality testing (see the Python documentation).
In most cases, if a is b, then a == b. But there are exceptions, for example:
>>> nan = float('nan')
>>> nan is nan
True
>>> nan == nan
False
So, you can only use is for identity tests, never equality tests.
The basic concept, we have to be clear, while approaching this question, is to understand the difference between is and ==.
"is" is will compare the memory location. if id(a)==id(b), then a is b returns true else it returns false.
So, we can say that is is used for comparing memory locations. Whereas,
== is used for equality testing which means that it just compares only the resultant values. The below shown code may acts as an example to the above given theory.
Code
In the case of string literals (strings without getting assigned to variables), the memory address will be same as shown in the picture. so, id(a)==id(b). The remaining of this is self-explanatory.
This question already has answers here:
Why does comparing strings using either '==' or 'is' sometimes produce a different result?
(15 answers)
Closed 8 years ago.
I am new at python and I'm currently exploring some of its core functionalities.
Could you explain me why the following example always return false in case of a string with special characters:
>>> a="x"
>>> b="x"
>>> a is b
True
>>> a="xxx"
>>> b="xxx"
>>> a is b
True
>>> a="xü"
>>> b="xü"
>>> a is b
False
>>> a="ü"
>>> b="ü"
>>> a is b
True
>>> #strange: with one special character it works as expected
I understand that the storage positions are different for strings with special characters on each assignment, I already checked it with the id() function but for which reason python handles strings in this unconsistent way?
Python (the reference implementation at least) has a cache for small integers and strings. I guess unicode strings outside the ASCII range are bigger than the cache threshold (internally unicode is stored using 16 or 32 bit wide characters, UCS-2 or UCS-4) and so they are not cached.
[edit]
Found a more complete answer at: About the changing id of a Python immutable string
Se also: http://www.laurentluce.com/posts/python-string-objects-implementation/
With is you're not testing equality between strings, you're testing equality between objects which is resolved through pointers. So your code:
>>> a="x"
>>> b="x"
>>> a is b
True
is not asking "are a and b the same character?", its asking "are a and b the same object?". Since there's a small object cache (for small integers and one byte strings, as has been said before), the answer is "yes, both variables refer to the same object in memory, the x character small object".
When you work with an object that is not eligible for the cache, as in:
>>> a="xü"
>>> b="xü"
>>> a is b
False
what is going on is that a and b now refer to different objects in memory, so the is operator resolves to false (a and b do not point to the same object!).
If the idea is comparing strings, you should use the == operator instead of is.
Two string variables are set to the same value. s1 == s2 always returns True, but s1 is s2 sometimes returns False.
If I open my Python interpreter and do the same is comparison, it succeeds:
>>> s1 = 'text'
>>> s2 = 'text'
>>> s1 is s2
True
Why is this?
is is identity testing, and == is equality testing. What happens in your code would be emulated in the interpreter like this:
>>> a = 'pub'
>>> b = ''.join(['p', 'u', 'b'])
>>> a == b
True
>>> a is b
False
So, no wonder they're not the same, right?
In other words: a is b is the equivalent of id(a) == id(b)
Other answers here are correct: is is used for identity comparison, while == is used for equality comparison. Since what you care about is equality (the two strings should contain the same characters), in this case the is operator is simply wrong and you should be using == instead.
The reason is works interactively is that (most) string literals are interned by default. From Wikipedia:
Interned strings speed up string
comparisons, which are sometimes a
performance bottleneck in applications
(such as compilers and dynamic
programming language runtimes) that
rely heavily on hash tables with
string keys. Without interning,
checking that two different strings
are equal involves examining every
character of both strings. This is
slow for several reasons: it is
inherently O(n) in the length of the
strings; it typically requires reads
from several regions of memory, which
take time; and the reads fills up the
processor cache, meaning there is less
cache available for other needs. With
interned strings, a simple object
identity test suffices after the
original intern operation; this is
typically implemented as a pointer
equality test, normally just a single
machine instruction with no memory
reference at all.
So, when you have two string literals (words that are literally typed into your program source code, surrounded by quotation marks) in your program that have the same value, the Python compiler will automatically intern the strings, making them both stored at the same memory location. (Note that this doesn't always happen, and the rules for when this happens are quite convoluted, so please don't rely on this behavior in production code!)
Since in your interactive session both strings are actually stored in the same memory location, they have the same identity, so the is operator works as expected. But if you construct a string by some other method (even if that string contains exactly the same characters), then the string may be equal, but it is not the same string -- that is, it has a different identity, because it is stored in a different place in memory.
The is keyword is a test for object identity while == is a value comparison.
If you use is, the result will be true if and only if the object is the same object. However, == will be true any time the values of the object are the same.
One last thing to note is you may use the sys.intern function to ensure that you're getting a reference to the same string:
>>> from sys import intern
>>> a = intern('a')
>>> a2 = intern('a')
>>> a is a2
True
As pointed out in previous answers, you should not be using is to determine equality of strings. But this may be helpful to know if you have some kind of weird requirement to use is.
Note that the intern function used to be a built-in on Python 2, but it was moved to the sys module in Python 3.
is is identity testing and == is equality testing. This means is is a way to check whether two things are the same things, or just equivalent.
Say you've got a simple person object. If it is named 'Jack' and is '23' years old, it's equivalent to another 23-year-old Jack, but it's not the same person.
class Person(object):
def __init__(self, name, age):
self.name = name
self.age = age
def __eq__(self, other):
return self.name == other.name and self.age == other.age
jack1 = Person('Jack', 23)
jack2 = Person('Jack', 23)
jack1 == jack2 # True
jack1 is jack2 # False
They're the same age, but they're not the same instance of person. A string might be equivalent to another, but it's not the same object.
This is a side note, but in idiomatic Python, you will often see things like:
if x is None:
# Some clauses
This is safe, because there is guaranteed to be one instance of the Null Object (i.e., None).
If you're not sure what you're doing, use the '=='.
If you have a little more knowledge about it you can use 'is' for known objects like 'None'.
Otherwise, you'll end up wondering why things doesn't work and why this happens:
>>> a = 1
>>> b = 1
>>> b is a
True
>>> a = 6000
>>> b = 6000
>>> b is a
False
I'm not even sure if some things are guaranteed to stay the same between different Python versions/implementations.
From my limited experience with Python, is is used to compare two objects to see if they are the same object as opposed to two different objects with the same value. == is used to determine if the values are identical.
Here is a good example:
>>> s1 = u'public'
>>> s2 = 'public'
>>> s1 is s2
False
>>> s1 == s2
True
s1 is a Unicode string, and s2 is a normal string. They are not the same type, but they are the same value.
I think it has to do with the fact that, when the 'is' comparison evaluates to false, two distinct objects are used. If it evaluates to true, that means internally it's using the same exact object and not creating a new one, possibly because you created them within a fraction of 2 or so seconds and because there isn't a large time gap in between it's optimized and uses the same object.
This is why you should be using the equality operator ==, not is, to compare the value of a string object.
>>> s = 'one'
>>> s2 = 'two'
>>> s is s2
False
>>> s2 = s2.replace('two', 'one')
>>> s2
'one'
>>> s2 is s
False
>>>
In this example, I made s2, which was a different string object previously equal to 'one' but it is not the same object as s, because the interpreter did not use the same object as I did not initially assign it to 'one', if I had it would have made them the same object.
The == operator tests value equivalence. The is operator tests object identity, and Python tests whether the two are really the same object (i.e., live at the same address in memory).
>>> a = 'banana'
>>> b = 'banana'
>>> a is b
True
In this example, Python only created one string object, and both a and b refers to it. The reason is that Python internally caches and reuses some strings as an optimization. There really is just a string 'banana' in memory, shared by a and b. To trigger the normal behavior, you need to use longer strings:
>>> a = 'a longer banana'
>>> b = 'a longer banana'
>>> a == b, a is b
(True, False)
When you create two lists, you get two objects:
>>> a = [1, 2, 3]
>>> b = [1, 2, 3]
>>> a is b
False
In this case we would say that the two lists are equivalent, because they have the same elements, but not identical, because they are not the same object. If two objects are identical, they are also equivalent, but if they are equivalent, they are not necessarily identical.
If a refers to an object and you assign b = a, then both variables refer to the same object:
>>> a = [1, 2, 3]
>>> b = a
>>> b is a
True
Reference: Think Python 2e by Allen B. Downey
I believe that this is known as "interned" strings. Python does this, so does Java, and so do C and C++ when compiling in optimized modes.
If you use two identical strings, instead of wasting memory by creating two string objects, all interned strings with the same contents point to the same memory.
This results in the Python "is" operator returning True because two strings with the same contents are pointing at the same string object. This will also happen in Java and in C.
This is only useful for memory savings though. You cannot rely on it to test for string equality, because the various interpreters and compilers and JIT engines cannot always do it.
Actually, the is operator checks for identity and == operator checks for equality.
From the language reference:
Types affect almost all aspects of object behavior. Even the importance of object identity is affected in some sense: for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed. E.g., after a = 1; b = 1, a and b may or may not refer to the same object with the value one, depending on the implementation, but after c = []; d = [], c and d are guaranteed to refer to two different, unique, newly created empty lists. (Note that c = d = [] assigns the same object to both c and d.)
So from the above statement we can infer that the strings, which are immutable types, may fail when checked with "is" and may succeed when checked with "is".
The same applies for int and tuple which are also immutable types.
is will compare the memory location. It is used for object-level comparison.
== will compare the variables in the program. It is used for checking at a value level.
is checks for address level equivalence
== checks for value level equivalence
is is identity testing and == is equality testing (see the Python documentation).
In most cases, if a is b, then a == b. But there are exceptions, for example:
>>> nan = float('nan')
>>> nan is nan
True
>>> nan == nan
False
So, you can only use is for identity tests, never equality tests.
The basic concept, we have to be clear, while approaching this question, is to understand the difference between is and ==.
"is" is will compare the memory location. if id(a)==id(b), then a is b returns true else it returns false.
So, we can say that is is used for comparing memory locations. Whereas,
== is used for equality testing which means that it just compares only the resultant values. The below shown code may acts as an example to the above given theory.
Code
In the case of string literals (strings without getting assigned to variables), the memory address will be same as shown in the picture. so, id(a)==id(b). The remaining of this is self-explanatory.
As you may have noticed before, CPython sometimes stores a single copy of identical immutable objects.
e.g.
>>> a = "hello"
>>> b = "hello"
>>> a is b
True
>>> a, b = 7734, 7734
>>> a is b
True
It appears that the hashing for what I assume is heap is performed after type inferencing
>>> a, b = 7734, 07734
>>> a is b
False
>>> a, b = 7734, 017066
>>> a is b
True
Is there any way to introspect the interpreter and print out this supposed heap of immutable objects?
No, interned objects are maintained in a range of locations, no one method exists to list them all.
Strings can be interned, as you discovered, and you can intern strings yourself by using the intern() function.
Small integers between -5 and 256 are interned.
Tuples are reused; the empty tuple (()) is a singleton, and 2000 each of tuple sizes 1 through to 20 are kept cached for recycling. (Just the tuple objects, not the contents).
None is a singleton, as are Ellipsis, NotImplemented, True and False.
As of Python 3.3, instance __dict__ dictionaries can share keys to save on memory.
The compiler can mark immutable (and in certain circumstances, mutable) sourcecode literals as constants, store them as such with the bytecode and re-use them each time the bytecode is run. This applies to strings, numbers, tuples, lists (if used with an in statement) and as of Python 3.2 sets (again, when used with in).
There may be more I haven't discovered yet.
These optimizations all help to avoid too much heap churn. And apart from None, Ellipsis, NotImplemented, True and False being a singletons they are all CPython-specific optimisations, they are not part of the Python language definition itself.
It's a little more complicated than you make it out to be. For instance, in your examples with large integers, the same object is not reused when the uses aren't part of the same expression.
>>> a = 7734
>>> b = 7734
>>> a is b
False
On the other hand, as your first example shows, this does work with strings...but not all strings.
>>> a = "this string includes spaces"
>>> b = "this string includes spaces"
>>> a is b
False
The following objects are actually interned by default: small integers, the empty tuple, and strings that look like Python identifiers. What you're seeing with large integers and other immutable objects is an optimization due to the fact that they're being used in the same expression.