Different storage position of equal strings with special characters [duplicate]

This question already has answers here:
Why does comparing strings using either '==' or 'is' sometimes produce a different result?
(15 answers)
Closed 8 years ago.
I am new to Python and I'm currently exploring some of its core functionality.
Could you explain to me why the following example returns False for some strings with special characters:
>>> a="x"
>>> b="x"
>>> a is b
True
>>> a="xxx"
>>> b="xxx"
>>> a is b
True
>>> a="xü"
>>> b="xü"
>>> a is b
False
>>> a="ü"
>>> b="ü"
>>> a is b
True
>>> #strange: with one special character it works as expected
I understand that the storage locations differ on each assignment for strings with special characters (I already checked this with the id() function), but why does Python handle strings in this inconsistent way?

Python (the reference implementation at least) has a cache for small integers and strings. I guess unicode strings outside the ASCII range are bigger than the cache threshold (internally unicode is stored using 16 or 32 bit wide characters, UCS-2 or UCS-4) and so they are not cached.
[edit]
Found a more complete answer at: About the changing id of a Python immutable string
See also: http://www.laurentluce.com/posts/python-string-objects-implementation/

With is you're not testing equality between strings, you're testing equality between objects which is resolved through pointers. So your code:
>>> a="x"
>>> b="x"
>>> a is b
True
is not asking "are a and b the same character?", it's asking "are a and b the same object?". Since there's a small object cache (for small integers and one-byte strings, as has been said before), the answer is "yes, both variables refer to the same object in memory, the x character small object".
When you work with an object that is not eligible for the cache, as in:
>>> a="xü"
>>> b="xü"
>>> a is b
False
what is going on is that a and b now refer to different objects in memory, so the is operator resolves to false (a and b do not point to the same object!).
If the idea is comparing strings, you should use the == operator instead of is.
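As a minimal sketch of that advice (assuming CPython 3; the variable names are only illustrative), the snippet below assembles the second string at runtime so that it cannot be the very same literal, then compares the two strings both ways:

```python
# Two equal strings that come from different code paths.
a = "xü"
b = "".join(["x", "ü"])  # assembled at runtime rather than taken from a literal

values_equal = (a == b)   # compares the characters
same_object = (a is b)    # compares object identity only

print(values_equal)  # True
print(same_object)   # typically False in CPython; not guaranteed either way
```

Only the first result is something a program should depend on.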

Related

`is` vs `==` for comparing primitives [duplicate]

I've started learning Python (python 3.3) and I was trying out the is operator. I tried this:
>>> b = 'is it the space?'
>>> a = 'is it the space?'
>>> a is b
False
>>> c = 'isitthespace'
>>> d = 'isitthespace'
>>> c is d
True
>>> e = 'isitthespace?'
>>> f = 'isitthespace?'
>>> e is f
False
It seems like the space and the question mark make the is behave differently. What's going on?
EDIT: I know I should be using ==, I just wanted to know why is behaves like this.

Warning: this answer is about the implementation details of a specific Python interpreter. Comparing strings with is is a bad idea.
Well, at least for CPython 3.4/2.7.3, the answer is "no, it is not the whitespace". Or rather, not only the whitespace:
Two string literals will share memory if they are either alphanumeric or reside in the same block (file, function, class or single interpreter command).
An expression that evaluates to a string will result in an object that is identical to the one created using a string literal if and only if it is created using constants and binary/unary operators, and the resulting string is shorter than 21 characters.
Single characters are unique.
Examples
Alphanumeric string literals always share memory:
>>> x='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> y='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> x is y
True
Non-alphanumeric string literals share memory if and only if they share the enclosing syntactic block:
(interpreter)
>>> x='`!##$%^&*() \][=-. >:"?<a'; y='`!##$%^&*() \][=-. >:"?<a';
>>> z='`!##$%^&*() \][=-. >:"?<a';
>>> x is y
True
>>> x is z
False
(file)
x='`!##$%^&*() \][=-. >:"?<a';
y='`!##$%^&*() \][=-. >:"?<a';
z=(lambda : '`!##$%^&*() \][=-. >:"?<a')()
print(x is y)
print(x is z)
Output: True and False
For simple binary operations, the compiler does very simple constant propagation (see peephole.c), but with strings it does so only if the resulting string is shorter than 21 characters. If this is the case, the rules mentioned earlier are in force:
>>> 'a'*10+'a'*10 is 'a'*20
True
>>> 'a'*21 is 'a'*21
False
>>> 'aaaaaaaaaaaaaaaaaaaaa' is 'aaaaaaaa' + 'aaaaaaaaaaaaa'
False
>>> t=2; 'a'*t is 'aa'
False
>>> 'a'.__add__('a') is 'aa'
False
>>> x='a' ; x+='a'; x is 'aa'
False
Single characters always share memory, of course:
>>> chr(0x20) is ' '
True
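One way to observe the compile-time evaluation mentioned above is to compile an expression and look at the constants stored in the resulting code object. This is only a sketch for CPython: the helper name folded_constants is made up for illustration, and exactly what ends up in co_consts differs between versions.

```python
def folded_constants(expression):
    """Compile an expression and return the constants stored in its code object."""
    code = compile(expression, "<example>", "eval")
    return code.co_consts

# If the compiler folded the expression, the final string appears among the constants.
print(folded_constants("'a' * 10"))

# With a name involved, the product cannot be computed at compile time.
print(folded_constants("'a' * t"))

# Whether or not folding happened, the value of the expression is the same.
result = eval(compile("'a' * 10", "<example>", "eval"))
print(result == "aaaaaaaaaa")  # True
```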

To expand on Ignacio’s answer a bit: The is operator is the identity operator. It is used to compare object identity. If you construct two objects with the same contents, then it is usually not the case that the identity check yields True. It works for some small strings because CPython, the reference implementation of Python, stores their contents only once, making all those names refer to the same string object. So the is operator returns True for those.
This however is an implementation detail of CPython and is generally guaranteed neither for CPython nor for any other implementation. So relying on this fact is a bad idea, as it can break any day.
To compare strings, you use the == operator, which compares the equality of objects. Two string objects are considered equal when they contain the same characters. So this is the correct operator to use when comparing strings, and is should generally be avoided unless you explicitly want object identity (example: a is False).
If you are really interested in the details, you can find the implementation of CPython’s strings here. But again: this is an implementation detail, so you should never require this to work.
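To make the distinction concrete, here is a small sketch with a hypothetical Point class (not taken from any of the answers) whose instances compare equal by value while remaining distinct objects:

```python
class Point:
    """Hypothetical value type used only to illustrate == versus is."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # Equality is defined by contents, not by identity.
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


p = Point(1, 2)
q = Point(1, 2)
r = p  # a second name for the same object

print(p == q)  # True: same contents
print(p is q)  # False: two separate objects
print(p is r)  # True: one object, two names
```

Strings behave the same way in principle; the surprising True results only appear because the interpreter sometimes hands back an existing object instead of creating a new one.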

The is operator compares object identity, which is the same thing the id function reports: an id is guaranteed to be unique among simultaneously existing objects. In CPython specifically, id returns the object's memory address. It seems that CPython has consistent memory addresses for strings containing only the characters a-z and A-Z.
However, this seems to be the case only when the string has been assigned to a variable.
Here, the id of "foo" and the id of a are the same. a has been set to "foo" prior to checking the id:
>>> a = "foo"
>>> id(a)
4322269384
>>> id("foo")
4322269384
However, the id of "bar" and the id of a are different when checking the id of "bar" prior to setting a equal to "bar".
>>> id("bar")
4322269224
>>> a = "bar"
>>> id(a)
4322268984
Checking the id of "bar" again after setting a equal to "bar" returns the same id.
>>> id("bar")
4322268984
So it seems that CPython keeps consistent memory addresses for strings containing only a-zA-Z when those strings are assigned to a variable. It's also entirely possible that this is version dependent: I'm running Python 2.7.3 on a MacBook. Others might get entirely different results.

In fact your code amounts to comparing the objects' ids (i.e. their memory addresses in CPython). So instead of your is comparison:
>>> b = 'is it the space?'
>>> a = 'is it the space?'
>>> a is b
False
You can do:
>>> id(a) == id(b)
False
But note that if the two literals appear directly in the comparison, the ids are equal:
>>> id('is it the space?') == id('is it the space?')
True
In fact, within a single expression identical static strings are shared. At the scale of the whole program, however, only word-like strings are shared (so nothing containing spaces or punctuation).
You should not rely on this behavior, as it's not documented anywhere and is an implementation detail.
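The equivalence between is and an id comparison can be sketched as follows. The helper name same_object is an assumption made for this example, and the equivalence only holds while both objects are alive, because an id may be reused once an object has been freed:

```python
def same_object(x, y):
    """Return True when x and y are one object.

    Equivalent to `x is y` as long as both objects exist during the call.
    """
    return id(x) == id(y)


first = [1, 2, 3]
second = [1, 2, 3]
alias = first

print(same_object(first, second))  # False: equal contents, separate lists
print(same_object(first, alias))   # True: two names for one list
print(first == second)             # True: value comparison
```

Lists are used here instead of strings because the interpreter never shares two separately created lists, so the results do not depend on any caching.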

Two or more identical strings of consecutive alphanumeric (only) characters are stored in one structure, thus they share their memory reference. There have been posts about this phenomenon all over the internet since the 1990s. It has evidently always been that way. I have never seen a reasonable guess as to why that's the case. I only know that it is. Furthermore, if you split and re-join alphanumeric strings to remove spaces between words, the resulting identical alphanumeric strings do NOT share a reference, which I find odd. See below:
Add any non-alphanumeric value identically to both strings, and they instantly become copies, but not shared references.
a ="abbacca"; b = "abbacca"; a is b => True
a ="abbacca "; b = "abbacca "; a is b => False
a ="abbacca?"; b = "abbacca?"; a is b => False
~Dr. C.

The 'is' operator compares the actual objects.
c is d should also be False. My guess is that Python performs some optimization, and in that case the two names refer to the same object.

If same elements are referring same objects in Python datatypes (string/list/tuples) then how the size of the variable is determined? [duplicate]

This question already has an answer here:
Confused about Python’s id() [duplicate]
(1 answer)
Closed 3 years ago.
We know that in Python the same character values within a string have the same ID value (same concept for list & tuple), as for example:
>>> var = 'wwww'
>>> print(id(var[0]))
88293056
>>> print(id(var[1]))
88293056
>>> print(id(var[2]))
88293056
>>> print(id(var[3]))
88293056
This is because all the positions (0 to 3) are referring the same object w in memory.
But what about the size of the string? If we look at the size of the variable var, it shows 29.
>>> import sys
>>> sys.getsizeof(var)
29
>>> sys.getsizeof('wwww')
29
>>> sys.getsizeof('www')
28
>>> sys.getsizeof('ww')
27
>>> sys.getsizeof('w')
26
Does this mean that each character takes 1 byte within the string? Then why does sys.getsizeof('') return 25? Is that the default size allocated for a string?
If all the positions refer to the same object in memory (id value 88293056), then the size of the variable var should be the same as one character's size.
Similar thing is happening for list also.
>>> var = [1,1]
>>> print(id(var[0]))
1734203568
>>> print(id(var[1]))
1734203568
>>> sys.getsizeof(var[1])
14
>>> sys.getsizeof(var[0])
14
>>> sys.getsizeof(var)
44
I need some explanation of this.

We know that in Python the same character values within a string have the same ID value
Totally wrong.
First, "character values" have no id at all since python has no character type - a one-character string is still a string.
Then, the fact that CPython does intern some strings (as well as some integers) is an implementation detail, it's specific to the CPython implementation and is in no way part of the language itself (other implementations may not do so, or may do it according to other rules etc).
(same concept for list & tuple)
Definitely not:
>>> f = [{"foo":42}, {"foo": 42}]
>>> id(f[0])
139669285840048
>>> id(f[1])
139669257272968
>>> f[0] == f[1]
True
Or did you mean "for lists of strings and tuples of strings"? If yes, same answer as for strings: it's only a product of CPython's string interning and is in no way part of the language specification.
This is because all the positions (0 to 3) are referring the same object w in memory.
See above. The string "www" is not made of three references to the string "w", and evaluating the expression "www"[0] actually yields a new string built from the first character of "www". In the case of the CPython implementation, it first looks up its cache, then either builds a new string, caches it and returns it, or just returns the cached one, which is why you get the same ids.
With regard to the sys.getsizeof results, you have to understand that Python is not C. Python objects are not C scalar values but complex data structures (implemented in CPython mostly as structs containing pointers to other structs, etc.). Also, when a Python object references another (an object's attributes, the contents of lists or tuples, dict values, etc.), what's stored in the inner data structure is not the referenced object but only a reference to it, so what sys.getsizeof counts is the size of the reference (a PyObject pointer in CPython), not the size of the referenced object:
>>> l = ["foo"]
>>> sys.getsizeof(l)
80
>>> l[0] = "aaaaaaaaaaaaaaaaaaaaaaaaaaavvvvvvvvvvvvvvv"
>>> sys.getsizeof(l)
80
>>>
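The same point can be sketched as a script. This assumes CPython, and the helper name shallow_total is made up for this example; it adds the sizes of the direct items to the size of the container, without recursing any further:

```python
import sys

short_item = ["foo"]
long_item = ["a" * 1000]

# The list itself only stores one reference in both cases.
print(sys.getsizeof(short_item) == sys.getsizeof(long_item))  # True in CPython


def shallow_total(container):
    """Size of the container plus the sizes of its direct items (no recursion)."""
    return sys.getsizeof(container) + sum(sys.getsizeof(item) for item in container)


# Once the items are counted too, the list holding the long string is bigger.
print(shallow_total(short_item) < shallow_total(long_item))  # True
```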
Similar thing is happening for list also.
var = [1,1]
CPython also caches "small" integers (for a definition of "small" that has changed quite a few times... IIRC in 1.5.2 it was something like 255 or less). What you're seeing here is the size of a CPython int object and the minimal size (a larger list will have a larger size since it has to store more references) of a CPython list object on your system (those values can change depending on CPython version, target OS and compilation flags).
Also, wrt/ lists, CPython has some (once again implementation-specific) optimisations to avoid reallocating memory each time you append something, so do not expect a linear "list len -> list size" relation.
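A quick way to see this non-linear growth is to record the reported size after every append. This is a sketch assuming CPython; the exact numbers depend on the version and platform, so none are shown here:

```python
import sys

items = []
sizes = []
for i in range(32):
    items.append(i)
    sizes.append(sys.getsizeof(items))

# Count how many appends actually changed the reported size.
growth_steps = sum(1 for before, after in zip(sizes, sizes[1:]) if after != before)

print(sizes)
print(growth_steps, "size changes over", len(sizes) - 1, "appends")
```

The size stays flat for several appends and then jumps, because spare slots are allocated ahead of time.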

Python: id() behavior in the Interpreter [duplicate]

This question already has answers here:
when does Python allocate new memory for identical strings?
(5 answers)
Closed 9 years ago.
I came across this weird behaviour which happens only in an interactive Python session, but not when I write a script and execute it.
String is an immutable data type in Python, hence:
>>> s2='string'
>>> s1='string'
>>> s1 is s2
True
Now, the weird part:
>>> s1='a string'
>>> s2='a string'
>>> s1 is s2
False
I have seen that having a whitespace in the string causes this behaviour. If I put this in a script and run it, the result is True in both cases.
Would anyone have a clue about this? Thanks.
EDIT:
Okay, the above question and answers give some ideas. Now here is another experiment:
>>> s2='astringbstring'
>>> s1='astringbstring'
>>> s1 is s2
True
In this case the strings are definitely longer than 'a string', but are still having the same identifiers.

Many thanks to @eryksun for the corrections!
This is because of a mechanism called interning in Python:
Enter string in the table of “interned” strings and return the
interned string – which is string itself or a copy. Interning strings
is useful to gain a little performance on dictionary lookup – if the
keys in a dictionary are interned, and the lookup key is interned, the
key comparisons (after hashing) can be done by a pointer compare
instead of a string compare. Normally, the names used in Python
programs are automatically interned, and the dictionaries used to hold
module, class or instance attributes have interned keys.
Changed in version 2.3: Interned strings are not immortal (like they
used to be in Python 2.2 and before); you must keep a reference to the
return value of intern() around to benefit from it.
CPython will automatically intern certain short strings (one-letter strings, keywords, strings without spaces that have been assigned) to increase lookup and comparison speed: e.g., 'dog' is 'dog' will be a pointer comparison instead of a full string comparison. However, automatically interning all (longer) strings would require a lot more memory, which is not always feasible, so such strings may not share the same identity, which makes the results of id() differ. For example:
# different id when not assigned
In [146]: id('dog')
Out[146]: 4380547672
In [147]: id('dog')
Out[147]: 4380547552
# if assigned, the strings will be interned (though depends on implementation)
In [148]: a = 'dog'
In [149]: b = 'dog'
In [150]: id(a)
Out[150]: 4380547352
In [151]: id(b)
Out[151]: 4380547352
In [152]: a is b
Out[152]: True
For integers, at least on my machine, CPython automatically caches values up to 256:
In [18]: id(256)
Out[18]: 140511109257408
In [19]: id(256)
Out[19]: 140511109257408
In [20]: id(257)
Out[20]: 140511112156576
In [21]: id(257)
Out[21]: 140511110188504
UPDATE thanks to @eryksun: in this case the string 'a string' is not interned because CPython only interns strings that look like identifiers (ASCII letters, digits, and underscore, so no spaces), not because of its length as I initially assumed.
For more details, you can also refer to Alex Martelli's answer here.
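Interning can also be requested explicitly. The sketch below uses sys.intern, which is the Python 3 spelling (in Python 2 it is the built-in intern() quoted above), and builds the strings at runtime so that they start out as separate objects:

```python
import sys

# Build two equal strings at runtime so they are not merged as literals.
words = ["a", "string"]
s1 = " ".join(words)
s2 = " ".join(words)

i1 = sys.intern(s1)
i2 = sys.intern(s2)

print(s1 == s2)  # True
print(i1 is i2)  # True: both names now refer to the single interned copy
```

This is mainly useful when many equal strings are used as dictionary keys; for ordinary comparisons == remains the right tool.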

Python values actually aren't equivalent [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Python “is” operator behaves unexpectedly with integers
I'm learning Python, and am curious as to why:
x = 500
x is 500
returns False, but:
y = 100
y is 100
returns True?

Python reuses small integers. That is, all 1s (for example) are the same 1 object. In CPython the range is -5 to 256, though this is an implementation detail that should not be relied upon. I am pretty sure Jython and IronPython, for example, handle this differently.
The reason this works out fine is that ints are immutable. That is, you can't change a 4 to a 5 in place. If a has a value of 4, a = 5 actually points a to a different object rather than changing the value a contains. Python doesn't share any mutable types (such as lists) where unexpectedly having multiple references to the same object might cause problems.
You should use == for comparing most things. is checks whether two references point to the same object; it is roughly equivalent to id(x) == id(y).
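A small sketch of the difference, assuming CPython: the integers are created at runtime with int() so that the compiler cannot merge equal literals, and the identity results in the comments are what CPython is expected to print, not something the language promises.

```python
# Create the integers at runtime so equal literals cannot be merged.
small_a, small_b = int("100"), int("100")
big_a, big_b = int("500"), int("500")

print(small_a == small_b, big_a == big_b)  # True True

print(small_a is small_b)  # True in CPython (cached small integer)
print(big_a is big_b)      # False in CPython (two separate objects)
```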

is tests for identity - x is y asks if they are the same object, not if they are simply 'equivalent'. So you also have, eg:
>>> x = []
>>> y = []
>>> z = x
>>> x is y
False
>>> x is z
True
For equivalence, you want to test equality:
>>> x = 500
>>> x == 500
True
Python (or, at least, CPython, the major implementation) does some optimisations so that certain immutable objects only exist once throughout the lifetime of the interpreter. So, every 5 throughout your program will be the same integer object. The same thing happens with string literals, for example.

"is" compares object IDs and "==" compares object values. So, if you need to compare values, go with "==", and if you want to compare objects, go with "is".
Since everything in Python is an object, "is" only has to compare object IDs. It's faster, but sometimes unpredictable. You need to be very sure of what you are doing to use "is" for simple comparison.
About the situation above, I found the following remark here: http://docs.python.org/c-api/int.html
The current implementation keeps an array of integer objects for all
integers between -5 and 256, when you create an int in that range you
actually just get back a reference to the existing object. So it
should be possible to change the value of 1. I suspect the behaviour
of Python in this case is undefined. :-)
So, you can do the following test and see this behaviour:
>>> a = 256
>>> id(a)
19707932
>>> id(256)
19707932
>>> a = 257
>>> id(a)
26286076
>>> id(257)
26286064
So, for integers above 256, "is" will not work reliably. Be careful using "is" for comparison.
