Something about the id of objects of type str (in Python 2.7) puzzles me. The str type is immutable, so I would expect that once a string is created, it will always have the same id. I don't think I'm phrasing this well, so instead I'll post an example input and output sequence.
>>> id('so')
140614155123888
>>> id('so')
140614155123848
>>> id('so')
140614155123808
So at this point, the id changes every time. However, once a variable points at that string, things change:
>>> so = 'so'
>>> id('so')
140614155123728
>>> so = 'so'
>>> id(so)
140614155123728
>>> not_so = 'so'
>>> id(not_so)
140614155123728
So it looks like it freezes the id once a variable holds that value. Indeed, after del so and del not_so, the output of id('so') starts changing again.
This is not the same behaviour as with (small) integers.
I know there is no real connection between immutability and having the same id; still, I am trying to figure out the source of this behaviour. I believe that someone who's familiar with Python's internals would be less surprised than me, so I am trying to reach the same point...
Update
Trying the same with a different string gave different results...
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
Now it is equal...
CPython does not promise to intern all strings by default, but in practice, a lot of places in the Python codebase do reuse already-created string objects. A lot of Python internals use (the C-equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals will produce different strings.
Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) also stores the most recent expression result in the _ name, which muddles up things some more.
As such, you will see the same id crop up from time to time.
Running just the line id(<string literal>) in the REPL goes through several steps:
The line is compiled, which includes creating a constant for the string object:
>>> compile("id('foo')", '<stdin>', 'single').co_consts
('foo', None)
This shows the constants stored with the compiled bytecode; in this case a string 'foo' and the None singleton. Simple expressions that produce an immutable value may be optimised at this stage; see the note on optimizers below.
On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:
>>> import dis
>>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
1 0 LOAD_NAME 0 (id)
3 LOAD_CONST 0 ('foo')
6 CALL_FUNCTION 1
9 PRINT_EXPR
10 LOAD_CONST 1 (None)
13 RETURN_VALUE
The code object is not referenced by anything, reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.
Python can then perhaps reuse the same memory location for a new string object, if you re-run the same code. This usually leads to the same memory address being printed if you repeat this code. This does depend on what else you do with your Python memory.
ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
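A minimal sketch of why id() values are only meaningful for objects that are alive at the same time. The helper name make_so is hypothetical, and the behaviour is CPython-specific; the string is built at run time so that it is not a shared code constant.

```python
def make_so():
    # Build the string at run time so it is not a shared code constant.
    return "".join(["s", "o"])

# Both objects are alive at once, so their ids must differ.
a = make_so()
b = make_so()
alive_ids_differ = id(a) != id(b)
print(alive_ids_differ)

# Here each temporary is freed before the next one is created, so CPython
# may (but need not) hand out the same address again.
print(id(make_so()) == id(make_so()))
```

The second comparison has no guaranteed result, which is exactly the unpredictability described above.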
Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:
if (all_name_chars(v)) {
    PyObject *w = v;
    PyUnicode_InternInPlace(&v);
    if (w != v) {
        PyTuple_SET_ITEM(tuple, i, v);
        modified = 1;
    }
}
where all_name_chars() is documented as
/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.
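The interning of identifier-like constants can be observed from Python itself. This is only a sketch for CPython: the helper string_constant is a made-up name, and the result for the second literal depends on the interpreter version and build, so no output is claimed for it.

```python
def string_constant(source):
    # Compile a one-line statement and return its first str constant.
    code = compile(source, "<example>", "single")
    return next(c for c in code.co_consts if isinstance(c, str))

# Identifier-like literal: separate code objects end up sharing one object.
name_like_shared = string_constant("x = 'so'") is string_constant("y = 'so'")
print(name_like_shared)

# Literal with a space and punctuation: not covered by the rule quoted above,
# so sharing between separate code objects is not expected here.
print(string_constant("x = 'so what?'") is string_constant("y = 'so what?'"))
```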
Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:
>>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
True
If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:
>>> some_var = 'Look ma, spaces and punctuation!'
>>> some_other_var = 'Look ma, spaces and punctuation!'
>>> id(some_var)
4493058384
>>> id(some_other_var)
4493058456
>>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
>>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
>>> foo is bar
False
>>> foo == bar
True
The Python compiler either uses the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peephole optimizer limits its output to a sequence of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit for strings of 4096 characters. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings if the resulting string fits within the optimizer limits of your current Python version.
E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:
>>> import dis, sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
>>> dis.dis("'foo' * 20")
1 0 LOAD_CONST 0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
2 RETURN_VALUE
and, on Python 3.6:
>>> dis.dis("'foo' * 6")
1 0 LOAD_CONST 2 ('foofoofoofoofoofoo')
2 RETURN_VALUE
>>> dis.dis("'foo' * 7")
1 0 LOAD_CONST 0 ('foo')
2 LOAD_CONST 1 (7)
4 BINARY_MULTIPLY
6 RETURN_VALUE
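Instead of reading the disassembly, one can also check whether an expression was folded by looking for its value among the code object's constants. A rough sketch, assuming CPython 3.7 or newer; is_folded is a hypothetical helper, not part of any library.

```python
def is_folded(expression):
    # Compile the expression and check whether its final value is already
    # stored among the code object's constants.
    code = compile(expression, "<example>", "eval")
    value = eval(code)
    return any(type(c) is type(value) and c == value for c in code.co_consts)

print(is_folded("'foo' * 6"))     # short result
print(is_folded("'a' * 10000"))   # result longer than the limits quoted above
```

Note that a bare literal such as "'foo'" trivially counts as folded with this check, so it is only useful for expressions that actually combine constants.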
This behavior is specific to the Python interactive shell. If I put the following in a .py file:
print id('so')
print id('so')
print id('so')
and execute it, I receive the following output:
2888960
2888960
2888960
In CPython, a string literal is treated as a constant, which we can see in the bytecode of the snippet above:
2 0 LOAD_GLOBAL 0 (id)
3 LOAD_CONST 1 ('so')
6 CALL_FUNCTION 1
9 PRINT_ITEM
10 PRINT_NEWLINE
3 11 LOAD_GLOBAL 0 (id)
14 LOAD_CONST 1 ('so')
17 CALL_FUNCTION 1
20 PRINT_ITEM
21 PRINT_NEWLINE
4 22 LOAD_GLOBAL 0 (id)
25 LOAD_CONST 1 ('so')
28 CALL_FUNCTION 1
31 PRINT_ITEM
32 PRINT_NEWLINE
33 LOAD_CONST 0 (None)
36 RETURN_VALUE
The same constant (i.e. the same string object) is loaded 3 times, so the IDs are the same.
In your first example a new instance of the string 'so' is created each time, hence different id.
In the second example you are binding the string to a variable and Python can then maintain a shared copy of the string.
A simpler way to understand the behaviour is to read the following: Data Types and Variables.
Section "A String Pecularity" illustrates your question using special characters as example.
So while Python is not guaranteed to intern strings, it will frequently reuse the same string, and the is operator may mislead. It's important to know that you shouldn't use id or is to check strings for equality.
To demonstrate this, here is one way I've discovered to force a new string, in Python 2.6 at least:
>>> so = 'so'
>>> new_so = '{0}'.format(so)
>>> so is new_so
False
and here's a bit more Python exploration:
>>> id(so)
102596064
>>> id(new_so)
259679968
>>> so == new_so
True
The following two pieces of code are equivalent, but the first one takes about 700 MB of memory, while the second takes only about 100 MB (according to Windows Task Manager). What is happening here?
def a():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        lst.append(t)
    return lst
_ = a()

def a():
    lst = []
    for i in range(10**7):
        t = "a" * 2
        lst.append(t)
    return lst
_ = a()
@vurmux presented the right reason for the different memory usage: string interning. However, some important details seem to be missing.
The CPython implementation interns some strings during compilation, e.g. "a"*2. For more info about how/why "a"*2 gets interned, see this SO post.
Clarification: As @MartijnPieters correctly pointed out in his comment, the important thing is whether the compiler performs constant folding (e.g. evaluates the multiplication of the two constants in "a"*2) or not. If constant folding is done, the resulting constant is used and all elements in the list are references to the same object; otherwise they are not. Even though all such string constants get interned (constant folding performed => string interned), it was sloppy to speak about interning: constant folding is the key here, as it also explains the behaviour for types which have no interning at all, for example floats (if we used t=42*2.0).
Whether constant folding has happened can easily be verified with the dis module (I call your second version a2()):
>>> import dis
>>> dis.dis(a2)
...
4 18 LOAD_CONST 2 ('aa')
20 STORE_FAST 2 (t)
...
As we can see, the multiplication isn't performed at run time; instead, the result of the multiplication (which was computed at compile time) is loaded directly. The resulting list consists of references to the same object (the constant loaded with 18 LOAD_CONST 2):
>>> len({id(s) for s in a2()})
1
There, only 8 bytes per reference are needed, which means about 80 MB (+ overallocation of the list + memory needed for the interpreter).
In Python 3.7, constant folding isn't performed if the resulting string has more than 4096 characters, so replacing "a"*2 with "a"*4097 leads to the following bytecode:
>>> dis.dis(a1)
...
4 18 LOAD_CONST 2 ('a')
20 LOAD_CONST 3 (4097)
22 BINARY_MULTIPLY
24 STORE_FAST 2 (t)
...
Now the multiplication isn't precalculated, and the references in the resulting list will be to different objects.
The optimizer is not yet smart enough to recognize that t is actually "a" in t=t*2; otherwise it would be able to perform the constant folding. For now, the resulting bytecode for your first version (I call it a1()) is:
...
5 22 LOAD_CONST 3 (2)
24 LOAD_FAST 2 (t)
26 BINARY_MULTIPLY
28 STORE_FAST 2 (t)
...
and it will return a list with 10^7 different objects (all of them equal) inside:
>>> len({id(s) for s in a1()})
10000000
i.e. you will need about 56 bytes per string (sys.getsizeof returns 51, but because the pymalloc memory allocator is 8-byte aligned, 5 bytes will be wasted) + 8 bytes per reference (assuming a 64-bit CPython version), thus about 610 MB (+ overallocation of the list + memory needed for the interpreter).
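The arithmetic behind both estimates can be written out as a short sketch. The function name estimated_bytes and its parameters are made up for illustration; the sizes (51 bytes per string, 8-byte alignment, 8-byte references) are the figures quoted in this answer and may differ on other CPython versions.

```python
def estimated_bytes(n_items, object_size, alignment=8, reference_size=8):
    # Round the object size up to the allocator's alignment, then add one
    # list slot (a pointer) per item.
    padded = -(-object_size // alignment) * alignment
    return n_items * (padded + reference_size)

n = 10**7
distinct = estimated_bytes(n, object_size=51)  # one 'aa' object per item
shared = n * 8                                 # references only, one shared object
print(distinct, distinct / 2**20)
print(shared, shared / 2**20)
```

With these inputs the distinct-object case comes to 640,000,000 bytes (roughly 610 when divided by 2**20) and the shared case to 80,000,000 bytes, both before list overallocation and interpreter overhead.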
You can enforce the interning of the string via sys.intern:
import sys
def a1_interned():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        # here ensure that the string object gets interned;
        # the returned value is the interned version
        t = sys.intern(t)
        lst.append(t)
    return lst
And really, we can now see not only that less memory is needed, but also that the list has references to the same object (see it online for a slightly smaller size (10^5) here):
>>> len({id(s) for s in a1_interned()})
1
>>> all(s == "aa" for s in a1_interned())
True
String interning can save a lot of memory, but it is sometimes tricky to understand, whether/why a string gets interned or not. Calling sys.intern explicitly eliminates this uncertainty.
The existence of additional temporary objects referenced by t is not the problem: CPython uses reference counting for memory management, so an object gets deleted as soon as there are no references to it, without any interaction from the garbage collector, which in CPython is only used to break up cycles (this differs from, for example, Java's GC, as Java doesn't use reference counting). Thus, temporary variables are really temporaries: those objects cannot accumulate and have any impact on memory usage.
The problem with the temporary variable t is only that it prevents peephole optimization during the compilation, which is performed for "a"*2 but not for t*2.
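The reference-counting point above can be observed directly. A small sketch, assuming CPython: the class Temp and the function rebind_and_report are hypothetical stand-ins, used because str objects cannot be weakly referenced.

```python
import weakref

class Temp:
    """Stand-in for a temporary object (str itself cannot be weakly referenced)."""

def rebind_and_report():
    freed = []
    t = Temp()
    # Record the moment the first object is destroyed.
    weakref.finalize(t, freed.append, "first object freed")
    t = Temp()  # rebinding drops the last reference to the first object
    return list(freed)

print(rebind_and_report())
```

On CPython the finalizer has already run by the time the function returns, without any call to the cyclic garbage collector; other Python implementations may delay it.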
This difference exists because of string interning in the Python interpreter:
String interning is the method of caching particular strings in memory as they are instantiated. The idea is that, since strings in Python are immutable objects, only one instance of a particular string is needed at a time. By storing an instantiated string in memory, any future references to that same string can be directed to refer to the singleton already in existence, instead of taking up new memory.
Let me show it in a simple example:
>>> t1 = 'a'
>>> t2 = t1 * 2
>>> t2 is 'aa'
False
>>> t1 = 'a'
>>> t2 = 'a'*2
>>> t2 is 'aa'
True
When you use the first variant, Python string interning is not used, so the interpreter creates additional internal variables to store temporary data. It can't optimize multi-line code this way.
I am not a Python guru, but I think the interpreter works this way:
t = "a"
t = t * 2
In the first line it creates an object for t. In the second line it creates a temporary object for the t on the right of the = sign and writes the result to a third place in memory (with the GC called later). So the second variant should use at least 3 times less memory than the first.
P.S. You can read more about string interning here.
In Python, for comparisons like this, does Python create a temporary object for the string constant "help" and then continue with the equality comparison? The object would be GCed at some point.
s1 = "nohelp"
if s1 == "help":
    pass  # Blah Blah
String literals, like all Python constants, are created during compile time, when the source code is translated to byte code. And because all Python strings are immutable the interpreter can re-use the same string object if it encounters the same string literal in multiple places. It can even do that if the literal string is created via concatenation of literals, but not if the string is built by concatenating a string literal to an existing string object.
Here's a short demo that creates a few identical strings inside and outside of functions. It also dumps the disassembled byte code of one of the functions.
from __future__ import print_function
from dis import dis
def f1(s):
    a = "help"
    print('f1', id(s), id(a))
    return s > a

def f2(s):
    a = "help"
    print('f2', id(s), id(a))
    return s > a

a = "help"
print(id(a))
print(f1("he" + "lp"))
b = "h"
print(f2(b + "elp"))
print("\nf1")
dis(f1)
Typical output on a 32-bit machine running Python 2.6.6:
3073880672
f1 3073880672 3073880672
False
f2 3073636576 3073880672
False
f1
26 0 LOAD_CONST 1 ('help')
3 STORE_FAST 1 (a)
27 6 LOAD_GLOBAL 0 (print)
9 LOAD_CONST 2 ('f1')
12 LOAD_GLOBAL 1 (id)
15 LOAD_FAST 0 (s)
18 CALL_FUNCTION 1
21 LOAD_GLOBAL 1 (id)
24 LOAD_FAST 1 (a)
27 CALL_FUNCTION 1
30 CALL_FUNCTION 3
33 POP_TOP
28 34 LOAD_FAST 0 (s)
37 LOAD_FAST 1 (a)
40 COMPARE_OP 4 (>)
43 RETURN_VALUE
Note that the ids of all the "help" strings are identical, apart from the one constructed with b + "elp".
(BTW, Python will concatenate adjacent string literals, so instead of writing "he" + "lp" I could've written "he" "lp", or even "he""lp").
The string literals themselves are not freed until the process is cleaning itself up at termination, however a string like b would be GC'ed if it went out of scope.
Note that in CPython (standard Python) when objects are GC'ed their memory is returned to Python's allocation system for recycling, not to the OS. Python does return unneeded memory to the OS, but only in special circumstances. See Releasing memory in Python and Why doesn't memory get released to system after large queries (or series of queries) in django?
Another question that discusses this topic: Why strings object are cached in python
Something about the id of objects of type str (in python 2.7) puzzles me. The str type is immutable, so I would expect that once it is created, it will always have the same id. I believe I don't phrase myself so well, so instead I'll post an example of input and output sequence.
>>> id('so')
140614155123888
>>> id('so')
140614155123848
>>> id('so')
140614155123808
so in the meanwhile, it changes all the time. However, after having a variable pointing at that string, things change:
>>> so = 'so'
>>> id('so')
140614155123728
>>> so = 'so'
>>> id(so)
140614155123728
>>> not_so = 'so'
>>> id(not_so)
140614155123728
So it looks like it freezes the id, once a variable holds that value. Indeed, after del so and del not_so, the output of id('so') start changing again.
This is not the same behaviour as with (small) integers.
I know there is not real connection between immutability and having the same id; still, I am trying to figure out the source of this behaviour. I believe that someone whose familiar with python's internals would be less surprised than me, so I am trying to reach the same point...
Update
Trying the same with a different string gave different results...
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
Now it is equal...
CPython does not promise to intern all strings by default, but in practice, a lot of places in the Python codebase do reuse already-created string objects. A lot of Python internals use (the C-equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals will produce different strings.
Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) also stores the most recent expression result in the _ name, which muddles up things some more.
As such, you will see the same id crop up from time to time.
Running just the line id(<string literal>) in the REPL goes through several steps:
The line is compiled, which includes creating a constant for the string object:
>>> compile("id('foo')", '<stdin>', 'single').co_consts
('foo', None)
This shows the stored constants with the compiled bytecode; in this case a string 'foo' and the None singleton. Simple expressions consisting of that produce an immutable value may be optimised at this stage, see the note on optimizers, below.
On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:
>>> import dis
>>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
1 0 LOAD_NAME 0 (id)
3 LOAD_CONST 0 ('foo')
6 CALL_FUNCTION 1
9 PRINT_EXPR
10 LOAD_CONST 1 (None)
13 RETURN_VALUE
The code object is not referenced by anything, reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.
Python can then perhaps reuse the same memory location for a new string object, if you re-run the same code. This usually leads to the same memory address being printed if you repeat this code. This does depend on what else you do with your Python memory.
ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:
if (all_name_chars(v)) {
PyObject *w = v;
PyUnicode_InternInPlace(&v);
if (w != v) {
PyTuple_SET_ITEM(tuple, i, v);
modified = 1;
}
}
where all_name_chars() is documented as
/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.
Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:
>>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
True
If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:
>>> some_var = 'Look ma, spaces and punctuation!'
>>> some_other_var = 'Look ma, spaces and punctuation!'
>>> id(some_var)
4493058384
>>> id(some_other_var)
4493058456
>>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
>>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
>>> foo is bar
False
>>> foo == bar
True
The Python compiler either uses the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peepholder limits it's output to a sequence of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit for strings of 4096 characters. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings if the resulting string fits within the optimizer limits of your current Python version.
E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:
>>> import dis, sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
>>> dis.dis("'foo' * 20")
1 0 LOAD_CONST 0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
2 RETURN_VALUE
and
>>> dis.dis("'foo' * 6")
1 0 LOAD_CONST 2 ('foofoofoofoofoofoo')
2 RETURN_VALUE
>>> dis.dis("'foo' * 7")
1 0 LOAD_CONST 0 ('foo')
2 LOAD_CONST 1 (7)
4 BINARY_MULTIPLY
6 RETURN_VALUE
This behavior is specific to the Python interactive shell. If I put the following in a .py file:
print id('so')
print id('so')
print id('so')
and execute it, I receive the following output:
2888960
2888960
2888960
In CPython, a string literal is treated as a constant, which we can see in the bytecode of the snippet above:
  2           0 LOAD_GLOBAL              0 (id)
              3 LOAD_CONST               1 ('so')
              6 CALL_FUNCTION            1
              9 PRINT_ITEM
             10 PRINT_NEWLINE
  3          11 LOAD_GLOBAL              0 (id)
             14 LOAD_CONST               1 ('so')
             17 CALL_FUNCTION            1
             20 PRINT_ITEM
             21 PRINT_NEWLINE
  4          22 LOAD_GLOBAL              0 (id)
             25 LOAD_CONST               1 ('so')
             28 CALL_FUNCTION            1
             31 PRINT_ITEM
             32 PRINT_NEWLINE
             33 LOAD_CONST               0 (None)
             36 RETURN_VALUE
The same constant (i.e. the same string object) is loaded 3 times, so the IDs are the same.
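The same thing can be checked without reading bytecode. A minimal sketch, assuming CPython on Python 3 (so exec() is a function); collect_ids is a hypothetical helper that compiles the repeated statement as one unit, the way a .py file is compiled.

```python
def collect_ids(source_line, repeat=3):
    """Compile `source_line` repeated `repeat` times as ONE code object,
    run it, and return the code object plus the ids it recorded."""
    code = compile(source_line * repeat, '<demo>', 'exec')
    namespace = {'seen': []}
    exec(code, namespace)
    return code, namespace['seen']


code, seen = collect_ids("seen.append(id('so'))\n")

# The literal is stored once on the code object...
print(code.co_consts.count('so'))
# ...so every id() call sees that one object.
print(len(set(seen)))
```

In the interactive shell each line is compiled separately, so each line gets its own code object and its own constant.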
In your first example a new instance of the string 'so' is created each time, hence different id.
In the second example you are binding the string to a variable and Python can then maintain a shared copy of the string.
A simpler way to understand this behaviour is to read the tutorial "Data Types and Variables".
Its section "A String Pecularity" illustrates your question using special characters as an example.
So while Python is not guaranteed to intern strings, it will frequently reuse the same string, and this may mislead you. It's important to know that you shouldn't use id or is to test strings for equality.
To demonstrate this, here is one way I've discovered to force a new string, in Python 2.6 at least:
>>> so = 'so'
>>> new_so = '{0}'.format(so)
>>> so is new_so
False
and here's a bit more Python exploration:
>>> id(so)
102596064
>>> id(new_so)
259679968
>>> so == new_so
True
The is operator compares the memory addresses of two objects, and returns True if they're the same. Why, then, does it not work reliably with strings?
Code #1
>>> a = "poi"
>>> b = "poi"
>>> a is b
True
Code #2
>>> ktr = "today is a fine day"
>>> ptr = "today is a fine day"
>>> ktr is ptr
False
I have created two strings whose content is the same but they are living on different memory addresses. Why is the output of the is operator not consistent?
I believe it has to do with string interning. In essence, the idea is to store only a single copy of each distinct string, to increase performance on some operations.
Basically, the reason why a is b works is because (as you may have guessed) there is a single immutable string that is referenced by Python in both cases. When a string is large (and some other factors that I don't understand, most likely), this isn't done, which is why your second example returns False.
EDIT: And in fact, the odd behavior seems to be a side-effect of the interactive environment. If you take your same code and place it into a Python script, both a is b and ktr is ptr return True.
a = "poi"
b = "poi"
print a is b # Prints 'True'
ktr = "today is a fine day"
ptr = "today is a fine day"
print ktr is ptr # Prints 'True'
This makes sense, since it'd be easy for Python to parse a source file and look for duplicate string literals within it. If you create the strings dynamically, then it behaves differently even in a script.
a = "p" + "oi"
b = "po" + "i"
print a is b # Oddly enough, prints 'True'
ktr = "today is" + " a fine day"
ptr = "today is a f" + "ine day"
print ktr is ptr # Prints 'False'
As for why a is b still results in True, perhaps the allocated string is small enough to warrant a quick search through the interned collection, whereas the other one is not?
is tests identity. It works on some smaller strings (because they are cached) but not on other, bigger strings, since there str is NOT ptr. [thanks erykson]
See this code:
>>> import dis
>>> def fun():
... str = 'today is a fine day'
... ptr = 'today is a fine day'
... return (str is ptr)
...
>>> dis.dis(fun)
  2           0 LOAD_CONST               1 ('today is a fine day')
              3 STORE_FAST               0 (str)
  3           6 LOAD_CONST               1 ('today is a fine day')
              9 STORE_FAST               1 (ptr)
  4          12 LOAD_FAST                0 (str)
             15 LOAD_FAST                1 (ptr)
             18 COMPARE_OP               8 (is)
             21 RETURN_VALUE
>>> id(str)
26652288
>>> id(ptr)
27604736
#hence this comparison returns false: ptr is str
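Notice the IDs of str and ptr are different. (These ids belong to str and ptr assigned on separate lines at the interactive prompt. Inside fun both names load the same constant, LOAD_CONST 1, so fun() itself returns True.)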
BUT:
>>> x = "poi"
>>> y = "poi"
>>> id(x)
26650592
>>> id(y)
26650592
#hence this comparison returns true : x is y
The IDs of x and y are the same. Hence the is operator compares ids, not values.
See the link below for a discussion on when and why Python will allocate a different memory location for identical strings (read the question as well).
When does python allocate new memory for identical strings
Also, sys.intern on Python 3.x and intern on Python 2.x should help you allocate the strings in the same memory location, regardless of the size of the string.
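A minimal sketch of explicit interning, written for Python 3 (on Python 2 the builtin intern would take the place of sys.intern). The build helper is made up; it joins the pieces at run time so the compiler cannot fold them into one constant.

```python
import sys


def build(parts):
    """Join pieces at run time, avoiding compile-time constant folding."""
    return ''.join(parts)


ktr = build(['today is', ' a fine day'])
ptr = build(['today is a f', 'ine day'])

# Equal values, but no promise that they are the same object.
print(ktr == ptr)

# sys.intern() returns the one canonical object for equal strings,
# so the interned results are identical regardless of length.
print(sys.intern(ktr) is sys.intern(ptr))
```

Note that you have to keep using the objects returned by sys.intern(); the original ktr and ptr are not changed by the call.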
is is not the same as ==.
Basically, is checks if the two objects are the same, while == compares the values of those objects (strings, like everything in python, are objects).
So you should use is when you really know what objects you're looking at (i.e. you've made the objects, or are comparing with None as the question comments point out), and you want to know if two variables are referencing the exact same object in memory.
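As a sketch of the "objects you made yourself" case: a sentinel object and None are both safe to test with is, while values should go through ==. The names _MISSING and describe are invented for this example.

```python
# Hypothetical sentinel: a unique object we created ourselves, so an
# identity test against it is always well defined.
_MISSING = object()


def describe(value=_MISSING):
    """Tell apart 'no argument', an explicit None, and a real value."""
    if value is _MISSING:      # identity: our own object
        return 'no argument'
    if value is None:          # identity: None is a singleton
        return 'explicit None'
    if value == 'so':          # equality: compare string VALUES with ==
        return 'the string so'
    return 'something else'


print(describe())
print(describe(None))
print(describe(''.join(['s', 'o'])))
```

The last call passes a string built at run time, and == still matches it whether or not the interpreter happens to reuse an existing 'so' object.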
In your examples, however, you're looking at str objects that python is handling behind the scenes, so without diving deep into how python works, you don't really know what to expect. You would have the same problem with ints or floats. Other answers do a good job of explaining the "behind the scenes" stuff (string interning), but you mostly shouldn't have to worry about it in day-to-day programming.
Note that this is a CPython-specific optimization. If you want your code to be portable, you should avoid relying on it. For example, in PyPy:
>>>> a = "hi"
>>>> b = "hi"
>>>> a is b
False
It's also worth pointing out that a similar thing happens for small integers:
>>> a = 12
>>> b = 12
>>> a is b
True
which again you should not rely on, because other implementations might not include this optimization.