Does python create an object for string constants in equality comparisons? - python

In python for comparisons like this, does python create a temporary object for the string constant "help" and then continue with the equality comparison ? The object would be GCed after some point.
s1 = "nohelp"
if s1 == "help":
# Blah Blah

String literals, like all Python constants, are created during compile time, when the source code is translated to byte code. And because all Python strings are immutable the interpreter can re-use the same string object if it encounters the same string literal in multiple places. It can even do that if the literal string is created via concatenation of literals, but not if the string is built by concatenating a string literal to an existing string object.
Here's a short demo that creates a few identical strings inside and outside of functions. It also dumps the disassembled byte code of one of the functions.
from __future__ import print_function
from dis import dis
def f1(s):
a = "help"
print('f1', id(s), id(a))
return s > a
def f2(s):
a = "help"
print('f2', id(s), id(a))
return s > a
a = "help"
print(id(a))
print(f1("he" + "lp"))
b = "h"
print(f2(b + "elp"))
print("\nf1")
dis(f1)
typical output on a 32 bit machine running Python 2.6.6
3073880672
f1 3073880672 3073880672
False
f2 3073636576 3073880672
False
f1
26 0 LOAD_CONST 1 ('help')
3 STORE_FAST 1 (a)
27 6 LOAD_GLOBAL 0 (print)
9 LOAD_CONST 2 ('f1')
12 LOAD_GLOBAL 1 (id)
15 LOAD_FAST 0 (s)
18 CALL_FUNCTION 1
21 LOAD_GLOBAL 1 (id)
24 LOAD_FAST 1 (a)
27 CALL_FUNCTION 1
30 CALL_FUNCTION 3
33 POP_TOP
28 34 LOAD_FAST 0 (s)
37 LOAD_FAST 1 (a)
40 COMPARE_OP 4 (>)
43 RETURN_VALUE
Note that the ids of all the "help" strings are identical, apart from the one constructed with b + "elp".
(BTW, Python will concatenate adjacent string literals, so instead of writing "he" + "lp" I could've written "he" "lp", or even "he""lp").
The string literals themselves are not freed until the process is cleaning itself up at termination, however a string like b would be GC'ed if it went out of scope.
Note that in CPython (standard Python) when objects are GC'ed their memory is returned to Python's allocation system for recycling, not to the OS. Python does return unneeded memory to the OS, but only in special circumstances. See Releasing memory in Python and Why doesn't memory get released to system after large queries (or series of queries) in django?
Another question that discusses this topic: Why strings object are cached in python

Related

Unicode subscripts and superscripts in identifiers, why does Python consider XU == Xᵘ == Xᵤ?

Python allows unicode identifiers. I defined Xᵘ = 42, expecting XU and Xᵤ to result in a NameError. But in reality, when I define Xᵘ, Python (silently?) turns Xᵘ into Xu, which strikes me as somewhat of an unpythonic thing to do. Why is this happening?
>>> Xᵘ = 42
>>> print((Xu, Xᵘ, Xᵤ))
(42, 42, 42)
Python converts all identifiers to their NFKC normal form; from the Identifiers section of the reference documentation:
All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
The NFKC form of both the super and subscript characters is the lowercase u:
>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'Xᵘ Xᵤ')
'Xu Xu'
So in the end, all you have is a single identifier, Xu:
>>> import dis
>>> dis.dis(compile('Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))', '', 'exec'))
1 0 LOAD_CONST 0 (42)
2 STORE_NAME 0 (Xu)
2 4 LOAD_NAME 1 (print)
6 LOAD_NAME 0 (Xu)
8 LOAD_NAME 0 (Xu)
10 LOAD_NAME 0 (Xu)
12 BUILD_TUPLE 3
14 CALL_FUNCTION 1
16 POP_TOP
18 LOAD_CONST 1 (None)
20 RETURN_VALUE
The above disassembly of the compiled bytecode shows that the identifiers have been normalised during compilation; this happens during parsing, any identifiers are normalised when creating the AST (Abstract Parse Tree) which the compiler uses to produce bytecode.
Identifiers are normalized to avoid many potential 'look-alike' bugs, where you'd otherwise could end up using both find() (using the U+FB01 LATIN SMALL LIGATURE FI character followed by the ASCII nd characters) and find() and wonder why your code has a bug.
Python, as of version 3.0, supports non-ASCII identifiers. When parsing the identifiers are converted using NFKC normalization and any identifiers where the normalized value is the same are considered the same identifier.
See PEP 3131 for more details. https://www.python.org/dev/peps/pep-3131/

'is' vs '==' in input and assignment [duplicate]

Something about the id of objects of type str (in python 2.7) puzzles me. The str type is immutable, so I would expect that once it is created, it will always have the same id. I believe I don't phrase myself so well, so instead I'll post an example of input and output sequence.
>>> id('so')
140614155123888
>>> id('so')
140614155123848
>>> id('so')
140614155123808
so in the meanwhile, it changes all the time. However, after having a variable pointing at that string, things change:
>>> so = 'so'
>>> id('so')
140614155123728
>>> so = 'so'
>>> id(so)
140614155123728
>>> not_so = 'so'
>>> id(not_so)
140614155123728
So it looks like it freezes the id, once a variable holds that value. Indeed, after del so and del not_so, the output of id('so') start changing again.
This is not the same behaviour as with (small) integers.
I know there is not real connection between immutability and having the same id; still, I am trying to figure out the source of this behaviour. I believe that someone whose familiar with python's internals would be less surprised than me, so I am trying to reach the same point...
Update
Trying the same with a different string gave different results...
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
Now it is equal...
CPython does not promise to intern all strings by default, but in practice, a lot of places in the Python codebase do reuse already-created string objects. A lot of Python internals use (the C-equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals will produce different strings.
Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) also stores the most recent expression result in the _ name, which muddles up things some more.
As such, you will see the same id crop up from time to time.
Running just the line id(<string literal>) in the REPL goes through several steps:
The line is compiled, which includes creating a constant for the string object:
>>> compile("id('foo')", '<stdin>', 'single').co_consts
('foo', None)
This shows the stored constants with the compiled bytecode; in this case a string 'foo' and the None singleton. Simple expressions consisting of that produce an immutable value may be optimised at this stage, see the note on optimizers, below.
On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:
>>> import dis
>>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
1 0 LOAD_NAME 0 (id)
3 LOAD_CONST 0 ('foo')
6 CALL_FUNCTION 1
9 PRINT_EXPR
10 LOAD_CONST 1 (None)
13 RETURN_VALUE
The code object is not referenced by anything, reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.
Python can then perhaps reuse the same memory location for a new string object, if you re-run the same code. This usually leads to the same memory address being printed if you repeat this code. This does depend on what else you do with your Python memory.
ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:
if (all_name_chars(v)) {
PyObject *w = v;
PyUnicode_InternInPlace(&v);
if (w != v) {
PyTuple_SET_ITEM(tuple, i, v);
modified = 1;
}
}
where all_name_chars() is documented as
/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.
Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:
>>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
True
If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:
>>> some_var = 'Look ma, spaces and punctuation!'
>>> some_other_var = 'Look ma, spaces and punctuation!'
>>> id(some_var)
4493058384
>>> id(some_other_var)
4493058456
>>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
>>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
>>> foo is bar
False
>>> foo == bar
True
The Python compiler either uses the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peepholder limits it's output to a sequence of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit for strings of 4096 characters. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings if the resulting string fits within the optimizer limits of your current Python version.
E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:
>>> import dis, sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
>>> dis.dis("'foo' * 20")
1 0 LOAD_CONST 0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
2 RETURN_VALUE
and
>>> dis.dis("'foo' * 6")
1 0 LOAD_CONST 2 ('foofoofoofoofoofoo')
2 RETURN_VALUE
>>> dis.dis("'foo' * 7")
1 0 LOAD_CONST 0 ('foo')
2 LOAD_CONST 1 (7)
4 BINARY_MULTIPLY
6 RETURN_VALUE
This behavior is specific to the Python interactive shell. If I put the following in a .py file:
print id('so')
print id('so')
print id('so')
and execute it, I receive the following output:
2888960
2888960
2888960
In CPython, a string literal is treated as a constant, which we can see in the bytecode of the snippet above:
2 0 LOAD_GLOBAL 0 (id)
3 LOAD_CONST 1 ('so')
6 CALL_FUNCTION 1
9 PRINT_ITEM
10 PRINT_NEWLINE
3 11 LOAD_GLOBAL 0 (id)
14 LOAD_CONST 1 ('so')
17 CALL_FUNCTION 1
20 PRINT_ITEM
21 PRINT_NEWLINE
4 22 LOAD_GLOBAL 0 (id)
25 LOAD_CONST 1 ('so')
28 CALL_FUNCTION 1
31 PRINT_ITEM
32 PRINT_NEWLINE
33 LOAD_CONST 0 (None)
36 RETURN_VALUE
The same constant (i.e. the same string object) is loaded 3 times, so the IDs are the same.
In your first example a new instance of the string 'so' is created each time, hence different id.
In the second example you are binding the string to a variable and Python can then maintain a shared copy of the string.
A more simplified way to understand the behaviour is to check the following Data Types and Variables.
Section "A String Pecularity" illustrates your question using special characters as example.
So while Python is not guaranteed to intern strings, it will frequently reuse the same string, and is may mislead. It's important to know that you shouldn't check id or is for equality of strings.
To demonstrate this, one way I've discovered to force a new string in Python 2.6 at least:
>>> so = 'so'
>>> new_so = '{0}'.format(so)
>>> so is new_so
False
and here's a bit more Python exploration:
>>> id(so)
102596064
>>> id(new_so)
259679968
>>> so == new_so
True

Python: Why do certain strings match using "is" and others do not? [duplicate]

Something about the id of objects of type str (in python 2.7) puzzles me. The str type is immutable, so I would expect that once it is created, it will always have the same id. I believe I don't phrase myself so well, so instead I'll post an example of input and output sequence.
>>> id('so')
140614155123888
>>> id('so')
140614155123848
>>> id('so')
140614155123808
so in the meanwhile, it changes all the time. However, after having a variable pointing at that string, things change:
>>> so = 'so'
>>> id('so')
140614155123728
>>> so = 'so'
>>> id(so)
140614155123728
>>> not_so = 'so'
>>> id(not_so)
140614155123728
So it looks like it freezes the id, once a variable holds that value. Indeed, after del so and del not_so, the output of id('so') start changing again.
This is not the same behaviour as with (small) integers.
I know there is not real connection between immutability and having the same id; still, I am trying to figure out the source of this behaviour. I believe that someone whose familiar with python's internals would be less surprised than me, so I am trying to reach the same point...
Update
Trying the same with a different string gave different results...
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
Now it is equal...
CPython does not promise to intern all strings by default, but in practice, a lot of places in the Python codebase do reuse already-created string objects. A lot of Python internals use (the C-equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals will produce different strings.
Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) also stores the most recent expression result in the _ name, which muddles up things some more.
As such, you will see the same id crop up from time to time.
Running just the line id(<string literal>) in the REPL goes through several steps:
The line is compiled, which includes creating a constant for the string object:
>>> compile("id('foo')", '<stdin>', 'single').co_consts
('foo', None)
This shows the stored constants with the compiled bytecode; in this case a string 'foo' and the None singleton. Simple expressions consisting of that produce an immutable value may be optimised at this stage, see the note on optimizers, below.
On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:
>>> import dis
>>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
1 0 LOAD_NAME 0 (id)
3 LOAD_CONST 0 ('foo')
6 CALL_FUNCTION 1
9 PRINT_EXPR
10 LOAD_CONST 1 (None)
13 RETURN_VALUE
The code object is not referenced by anything, reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.
Python can then perhaps reuse the same memory location for a new string object, if you re-run the same code. This usually leads to the same memory address being printed if you repeat this code. This does depend on what else you do with your Python memory.
ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:
if (all_name_chars(v)) {
PyObject *w = v;
PyUnicode_InternInPlace(&v);
if (w != v) {
PyTuple_SET_ITEM(tuple, i, v);
modified = 1;
}
}
where all_name_chars() is documented as
/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.
Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:
>>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
True
If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:
>>> some_var = 'Look ma, spaces and punctuation!'
>>> some_other_var = 'Look ma, spaces and punctuation!'
>>> id(some_var)
4493058384
>>> id(some_other_var)
4493058456
>>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
>>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
>>> foo is bar
False
>>> foo == bar
True
The Python compiler either uses the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peepholder limits it's output to a sequence of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit for strings of 4096 characters. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings if the resulting string fits within the optimizer limits of your current Python version.
E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:
>>> import dis, sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
>>> dis.dis("'foo' * 20")
1 0 LOAD_CONST 0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
2 RETURN_VALUE
and
>>> dis.dis("'foo' * 6")
1 0 LOAD_CONST 2 ('foofoofoofoofoofoo')
2 RETURN_VALUE
>>> dis.dis("'foo' * 7")
1 0 LOAD_CONST 0 ('foo')
2 LOAD_CONST 1 (7)
4 BINARY_MULTIPLY
6 RETURN_VALUE
This behavior is specific to the Python interactive shell. If I put the following in a .py file:
print id('so')
print id('so')
print id('so')
and execute it, I receive the following output:
2888960
2888960
2888960
In CPython, a string literal is treated as a constant, which we can see in the bytecode of the snippet above:
2 0 LOAD_GLOBAL 0 (id)
3 LOAD_CONST 1 ('so')
6 CALL_FUNCTION 1
9 PRINT_ITEM
10 PRINT_NEWLINE
3 11 LOAD_GLOBAL 0 (id)
14 LOAD_CONST 1 ('so')
17 CALL_FUNCTION 1
20 PRINT_ITEM
21 PRINT_NEWLINE
4 22 LOAD_GLOBAL 0 (id)
25 LOAD_CONST 1 ('so')
28 CALL_FUNCTION 1
31 PRINT_ITEM
32 PRINT_NEWLINE
33 LOAD_CONST 0 (None)
36 RETURN_VALUE
The same constant (i.e. the same string object) is loaded 3 times, so the IDs are the same.
In your first example a new instance of the string 'so' is created each time, hence different id.
In the second example you are binding the string to a variable and Python can then maintain a shared copy of the string.
A more simplified way to understand the behaviour is to check the following Data Types and Variables.
Section "A String Pecularity" illustrates your question using special characters as example.
So while Python is not guaranteed to intern strings, it will frequently reuse the same string, and is may mislead. It's important to know that you shouldn't check id or is for equality of strings.
To demonstrate this, one way I've discovered to force a new string in Python 2.6 at least:
>>> so = 'so'
>>> new_so = '{0}'.format(so)
>>> so is new_so
False
and here's a bit more Python exploration:
>>> id(so)
102596064
>>> id(new_so)
259679968
>>> so == new_so
True

python string with whitespace doesn't point to same memory address [duplicate]

Something about the id of objects of type str (in python 2.7) puzzles me. The str type is immutable, so I would expect that once it is created, it will always have the same id. I believe I don't phrase myself so well, so instead I'll post an example of input and output sequence.
>>> id('so')
140614155123888
>>> id('so')
140614155123848
>>> id('so')
140614155123808
so in the meanwhile, it changes all the time. However, after having a variable pointing at that string, things change:
>>> so = 'so'
>>> id('so')
140614155123728
>>> so = 'so'
>>> id(so)
140614155123728
>>> not_so = 'so'
>>> id(not_so)
140614155123728
So it looks like it freezes the id, once a variable holds that value. Indeed, after del so and del not_so, the output of id('so') start changing again.
This is not the same behaviour as with (small) integers.
I know there is not real connection between immutability and having the same id; still, I am trying to figure out the source of this behaviour. I believe that someone whose familiar with python's internals would be less surprised than me, so I am trying to reach the same point...
Update
Trying the same with a different string gave different results...
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
Now it is equal...
CPython does not promise to intern all strings by default, but in practice, a lot of places in the Python codebase do reuse already-created string objects. A lot of Python internals use (the C-equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals will produce different strings.
Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) also stores the most recent expression result in the _ name, which muddles up things some more.
As such, you will see the same id crop up from time to time.
Running just the line id(<string literal>) in the REPL goes through several steps:
The line is compiled, which includes creating a constant for the string object:
>>> compile("id('foo')", '<stdin>', 'single').co_consts
('foo', None)
This shows the stored constants with the compiled bytecode; in this case a string 'foo' and the None singleton. Simple expressions consisting of that produce an immutable value may be optimised at this stage, see the note on optimizers, below.
On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:
>>> import dis
>>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
1 0 LOAD_NAME 0 (id)
3 LOAD_CONST 0 ('foo')
6 CALL_FUNCTION 1
9 PRINT_EXPR
10 LOAD_CONST 1 (None)
13 RETURN_VALUE
The code object is not referenced by anything, reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.
Python can then perhaps reuse the same memory location for a new string object, if you re-run the same code. This usually leads to the same memory address being printed if you repeat this code. This does depend on what else you do with your Python memory.
ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:
if (all_name_chars(v)) {
PyObject *w = v;
PyUnicode_InternInPlace(&v);
if (w != v) {
PyTuple_SET_ITEM(tuple, i, v);
modified = 1;
}
}
where all_name_chars() is documented as
/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.
Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:
>>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
True
If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:
>>> some_var = 'Look ma, spaces and punctuation!'
>>> some_other_var = 'Look ma, spaces and punctuation!'
>>> id(some_var)
4493058384
>>> id(some_other_var)
4493058456
>>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
>>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
>>> foo is bar
False
>>> foo == bar
True
The Python compiler either uses the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peepholder limits it's output to a sequence of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit for strings of 4096 characters. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings if the resulting string fits within the optimizer limits of your current Python version.
E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:
>>> import dis, sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
>>> dis.dis("'foo' * 20")
1 0 LOAD_CONST 0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
2 RETURN_VALUE
and
>>> dis.dis("'foo' * 6")
1 0 LOAD_CONST 2 ('foofoofoofoofoofoo')
2 RETURN_VALUE
>>> dis.dis("'foo' * 7")
1 0 LOAD_CONST 0 ('foo')
2 LOAD_CONST 1 (7)
4 BINARY_MULTIPLY
6 RETURN_VALUE
This behavior is specific to the Python interactive shell. If I put the following in a .py file:
print id('so')
print id('so')
print id('so')
and execute it, I receive the following output:
2888960
2888960
2888960
In CPython, a string literal is treated as a constant, which we can see in the bytecode of the snippet above:
2 0 LOAD_GLOBAL 0 (id)
3 LOAD_CONST 1 ('so')
6 CALL_FUNCTION 1
9 PRINT_ITEM
10 PRINT_NEWLINE
3 11 LOAD_GLOBAL 0 (id)
14 LOAD_CONST 1 ('so')
17 CALL_FUNCTION 1
20 PRINT_ITEM
21 PRINT_NEWLINE
4 22 LOAD_GLOBAL 0 (id)
25 LOAD_CONST 1 ('so')
28 CALL_FUNCTION 1
31 PRINT_ITEM
32 PRINT_NEWLINE
33 LOAD_CONST 0 (None)
36 RETURN_VALUE
The same constant (i.e. the same string object) is loaded 3 times, so the IDs are the same.
In your first example a new instance of the string 'so' is created each time, hence different id.
In the second example you are binding the string to a variable and Python can then maintain a shared copy of the string.
A more simplified way to understand the behaviour is to check the following Data Types and Variables.
Section "A String Pecularity" illustrates your question using special characters as example.
So while Python is not guaranteed to intern strings, it will frequently reuse the same string, and is may mislead. It's important to know that you shouldn't check id or is for equality of strings.
To demonstrate this, one way I've discovered to force a new string in Python 2.6 at least:
>>> so = 'so'
>>> new_so = '{0}'.format(so)
>>> so is new_so
False
and here's a bit more Python exploration:
>>> id(so)
102596064
>>> id(new_so)
259679968
>>> so == new_so
True

In which scenario it is useful to use Disassembly on python?

The dis module can be effectively used to disassemble Python methods, functions and classes into low-level interpreter instructions.
I know that dis information can be used for:
1. Find race condition in programs that use threads
2. Find possible optimizations
From your experience, do you know any other scenarios where Disassembly Python feature could be useful?
dis is useful, for example, when you have different code doing the same thing and you wonder where the performance difference lies in.
Example: list += [item] vs list.append(item)
def f(x): return 2*x
def f1(func, nums):
result = []
for item in nums:
result += [fun(item)]
return result
def f2(func, nums):
result = []
for item in nums:
result.append(fun(item))
return result
timeit.timeit says that f2(f, range(100)) is approximately twice as fast than f1(f, range(100). Why?
(Interestingly f2 is roughly as fast as map(f, range(100)) is.)
f1
You can see the whole output of dis by calling dis.dis(f1), here is line 4.
4 19 LOAD_FAST 2 (result)
22 LOAD_FAST 1 (fun)
25 LOAD_FAST 3 (item)
28 CALL_FUNCTION 1
31 BUILD_LIST 1
34 INPLACE_ADD
35 STORE_FAST 2 (result)
38 JUMP_ABSOLUTE 13
>> 41 POP_BLOCK
f2
Again, here is only line 4:
4 19 LOAD_FAST 2 (result)
22 LOAD_ATTR 0 (append)
25 LOAD_FAST 1 (fun)
28 LOAD_FAST 3 (item)
31 CALL_FUNCTION 1
34 CALL_FUNCTION 1
37 POP_TOP
38 JUMP_ABSOLUTE 13
>> 41 POP_BLOCK
Spot the difference
In f1 we need to:
Call fun on item (opcode 28)
Make a list out of it (opcode 31, expensive!)
Add it to result (opcode 34)
Store the returned value in result (opcode 35)
In f2, instead, we just:
Call fun on item (opcode 31)
Call append on result (opcode 34; C code: fast!)
This explains why the (imho) more expressive list += [value] is much slower than the list.append() method.
Other than that, dis.dis is mainly useful for curiosity and for trying to reconstruct code out of .pyc files you don't have the source of without spending a fortune :)
I see the dis module as being, essentially, a learning tool. Understanding what opcodes a certain snippet of Python code generates is a start to getting more "depth" to your grasp of Python -- rooting the "abstract" understanding of its semantics into a sample of (a bit more) concrete implementation. Sometimes the exact reason a certain Python snippet behaves the way it does may be hard to grasp "top-down" with pure reasoning from the "rules" of Python semantics: in such cases, reinforcing the study with some "bottom-up" verification (based on a possible implementation, of course -- other implementations would also be possible;-) can really help the study's effectiveness.
For day-to-day Python programming, not much. However, it is useful if you want to find out why doing something one way is faster than another way. I've also sometimes used it to figure out exactly how the interpreter handles some obscure bits of code. But really, I come up with a practical use-case for it very infrequently.
On the other hand, if your goal is to understand python rather than just being able to program in it, then it is an invaluable tool. For instance, ever wonder how function definition works? Here you go:
>>> def f():
... def foo(x=[1, 2, 3]):
... y = [4,]
... return x + y
...
>>> dis(f)
2 0 LOAD_CONST 1 (1)
3 LOAD_CONST 2 (2)
6 LOAD_CONST 3 (3)
9 BUILD_LIST 3
12 LOAD_CONST 4 (<code object foo at 0xb7690770, file "<stdin>", line 2>)
15 MAKE_FUNCTION 1
18 STORE_FAST 0 (foo)
21 LOAD_CONST 0 (None)
24 RETURN_VALUE
You can see that this happens by pushing the constants 1, 2, and 3 onto the stack, putting what's in the stack into a list, loading that into a code object, making the code function into an object, and storing it into a variable foo.

Categories