How does Python's re-declaration of variables work internally? - python

I'm fairly new to Python, so this might seem like a trivial question to some. But I'm curious about what Python does internally when you bind a new object to a variable while referring to the previous object bound to the same name. Please see the code below as an example. I understand that Python breaks the bond with the original object 'hello' and binds the name to the new object, but what is the sequence of events here? How does Python break the bond with the original object yet also refer to it?
greeting = 'hello'
greeting = f'y{greeting[1:len(greeting)]}'
In addition to the explanation, I would also very much appreciate some context. I understand that strings are immutable, but what about other types like floats and integers?
Does it matter whether I understand how Python operates internally? If so, where would be a good place to learn more about Python's internals?
Hope I'm being clear with my questions.

An explanation through the medium of the disassembly:
>>> dis.dis('''greeting = 'hello'
... greeting = f'y{greeting[1:len(greeting)]}'
... ''')
1 0 LOAD_CONST 0 ('hello')
2 STORE_NAME 0 (greeting)
2 4 LOAD_CONST 1 ('y')
6 LOAD_NAME 0 (greeting)
8 LOAD_CONST 2 (1)
10 LOAD_NAME 1 (len)
12 LOAD_NAME 0 (greeting)
14 CALL_FUNCTION 1
16 BUILD_SLICE 2
18 BINARY_SUBSCR
20 FORMAT_VALUE 0
22 BUILD_STRING 2
24 STORE_NAME 0 (greeting)
26 LOAD_CONST 3 (None)
28 RETURN_VALUE
The number on the far left indicates where the bytecode for a particular line begins. Line 1 is pretty self-explanatory, so I'll explain line 2.
As you might notice, your f-string doesn't survive compilation: it is compiled down to raw opcodes that mix loading constant segments with evaluating the formatting placeholders, eventually leaving all the fragments of the final string on top of the stack. Once they're all there, BUILD_STRING 2 puts them together (it says "take the top two values off the stack and combine them into a single string").
greeting is just a name holding a binding. It doesn't actually hold a value, just a reference to whatever object it's currently bound to. And the original reference is pushed onto the stack (with LOAD_NAME) entirely before the STORE_NAME that pops the top of the stack and rebinds greeting.
In short, the reason it works is that the value of greeting is no longer needed by the time it's replaced; it's used to make the new string, then discarded in favor of the new string.
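The same ordering can be observed without f-strings: the right-hand side is evaluated in full (reading the old object) before the name is rebound. A minimal sketch:

```python
greeting = 'hello'
old_id = id(greeting)

# The right-hand side runs first, reading the old string;
# only then does STORE_NAME rebind the name to the new string.
greeting = 'y' + greeting[1:]

print(greeting)                # yello
print(id(greeting) == old_id)  # False: greeting now names a different object
```

The old 'hello' object is only discarded (or kept alive by other references, such as the code object's constants) after the new string has been built from it.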

In your second line, Python evaluates the right side of the assignment statement, which creates a string that uses the old binding for greeting. Only after evaluating that expression does it handle the assignment operator, which binds that string to the name. It's all very linear.
Floats and integers are also immutable. Mutable built-in types include lists, dictionaries, and sets (and most user-defined objects). In fact, it's not clear how you would modify an integer object in any case; you can't reach inside the object. It's important to remember that in this case:
i = 3
j = 4
i = i + j
the last line just creates a new integer object (7) and binds it to i. None of this attempts to modify the integer object 3.
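The rebinding after arithmetic can be checked the same way; a small sketch:

```python
i = 3
j = 4
old = id(i)
i = i + j            # builds a new int object, then rebinds the name i
print(i)             # 7
print(id(i) == old)  # False: i names a different object; 3 was never modified
```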
I wrote this article that tries to delineate the difference between Python objects and the names we use:
https://github.com/timrprobocom/documents/blob/main/UnderstandingPythonObjects.md

Related

'is' vs '==' in input and assignment [duplicate]

Something about the id of objects of type str (in Python 2.7) puzzles me. The str type is immutable, so I would expect that once it is created, it will always have the same id. I don't think I'm phrasing this very well, so instead I'll post an example input and output sequence.
>>> id('so')
140614155123888
>>> id('so')
140614155123848
>>> id('so')
140614155123808
So in the meantime, it changes all the time. However, after having a variable point at that string, things change:
>>> so = 'so'
>>> id('so')
140614155123728
>>> so = 'so'
>>> id(so)
140614155123728
>>> not_so = 'so'
>>> id(not_so)
140614155123728
So it looks like the id freezes once a variable holds that value. Indeed, after del so and del not_so, the output of id('so') starts changing again.
This is not the same behaviour as with (small) integers.
I know there is no real connection between immutability and having the same id; still, I am trying to figure out the source of this behaviour. I believe that someone who's familiar with Python's internals would be less surprised than me, so I am trying to reach the same point...
Update
Trying the same with a different string gave different results...
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
>>> id('hello')
139978087896384
Now it is equal...
CPython does not promise to intern all strings by default, but in practice, a lot of places in the Python codebase do reuse already-created string objects. A lot of Python internals use (the C equivalent of) the sys.intern() function call to explicitly intern Python strings, but unless you hit one of those special cases, two identical Python string literals can still produce distinct string objects.
Python is also free to reuse memory locations, and Python will also optimize immutable literals by storing them once, at compile time, with the bytecode in code objects. The Python REPL (interactive interpreter) also stores the most recent expression result in the _ name, which muddles up things some more.
As such, you will see the same id crop up from time to time.
Running just the line id(<string literal>) in the REPL goes through several steps:
The line is compiled, which includes creating a constant for the string object:
>>> compile("id('foo')", '<stdin>', 'single').co_consts
('foo', None)
This shows the constants stored with the compiled bytecode; in this case the string 'foo' and the None singleton. Simple expressions that produce an immutable value may be optimised away at this stage; see the note on optimizers below.
On execution, the string is loaded from the code constants, and id() returns the memory location. The resulting int value is bound to _, as well as printed:
>>> import dis
>>> dis.dis(compile("id('foo')", '<stdin>', 'single'))
1 0 LOAD_NAME 0 (id)
3 LOAD_CONST 0 ('foo')
6 CALL_FUNCTION 1
9 PRINT_EXPR
10 LOAD_CONST 1 (None)
13 RETURN_VALUE
The code object is not referenced by anything, reference count drops to 0 and the code object is deleted. As a consequence, so is the string object.
Python can then perhaps reuse the same memory location for a new string object, if you re-run the same code. This usually leads to the same memory address being printed if you repeat this code. This does depend on what else you do with your Python memory.
ID reuse is not predictable; if in the meantime the garbage collector runs to clear circular references, other memory could be freed and you'll get new memory addresses.
Next, the Python compiler will also intern any Python string stored as a constant, provided it looks enough like a valid identifier. The Python code object factory function PyCode_New will intern any string object that contains only ASCII letters, digits or underscores, by calling intern_string_constants(). This function recurses through the constants structures and for any string object v found there executes:
if (all_name_chars(v)) {
    PyObject *w = v;
    PyUnicode_InternInPlace(&v);
    if (w != v) {
        PyTuple_SET_ITEM(tuple, i, v);
        modified = 1;
    }
}
where all_name_chars() is documented as
/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
Since you created strings that fit that criterion, they are interned, which is why you see the same ID being used for the 'so' string in your second test: as long as a reference to the interned version survives, interning will cause future 'so' literals to reuse the interned string object, even in new code blocks and bound to different identifiers. In your first test, you don't save a reference to the string, so the interned strings are discarded before they can be reused.
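You can also intern strings explicitly from Python code; sys.intern() always returns the canonical object, regardless of which characters the string contains:

```python
import sys

# These literals contain spaces and punctuation, so the compiler
# will not intern them on its own; sys.intern() forces one shared object.
a = sys.intern('hello, world!')
b = sys.intern('hello, world!')
print(a is b)   # True: both names refer to the single interned string
```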
Incidentally, your new name so = 'so' binds a string to a name that contains the same characters. In other words, you are creating a global whose name and value are equal. As Python interns both identifiers and qualifying constants, you end up using the same string object for both the identifier and its value:
>>> compile("so = 'so'", '<stdin>', 'single').co_names[0] is compile("so = 'so'", '<stdin>', 'single').co_consts[0]
True
If you create strings that are either not code object constants, or contain characters outside of the letters + numbers + underscore range, you'll see the id() value not being reused:
>>> some_var = 'Look ma, spaces and punctuation!'
>>> some_other_var = 'Look ma, spaces and punctuation!'
>>> id(some_var)
4493058384
>>> id(some_other_var)
4493058456
>>> foo = 'Concatenating_' + 'also_helps_if_long_enough'
>>> bar = 'Concatenating_' + 'also_helps_if_long_enough'
>>> foo is bar
False
>>> foo == bar
True
The Python compiler uses either the peephole optimizer (Python versions < 3.7) or the more capable AST optimizer (3.7 and newer) to pre-calculate (fold) the results of simple expressions involving constants. The peephole optimizer limits its output to a sequence of length 20 or less (to prevent bloating code objects and memory use), while the AST optimizer uses a separate limit of 4096 characters for strings. This means that concatenating shorter strings consisting only of name characters can still lead to interned strings, if the resulting string fits within the optimizer limits of your current Python version.
E.g. on Python 3.7, 'foo' * 20 will result in a single interned string, because constant folding turns this into a single value, while on Python 3.6 or older only 'foo' * 6 would be folded:
>>> import dis, sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
>>> dis.dis("'foo' * 20")
1 0 LOAD_CONST 0 ('foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo')
2 RETURN_VALUE
and
>>> dis.dis("'foo' * 6")
1 0 LOAD_CONST 2 ('foofoofoofoofoofoo')
2 RETURN_VALUE
>>> dis.dis("'foo' * 7")
1 0 LOAD_CONST 0 ('foo')
2 LOAD_CONST 1 (7)
4 BINARY_MULTIPLY
6 RETURN_VALUE
This behavior is specific to the Python interactive shell. If I put the following in a .py file:
print id('so')
print id('so')
print id('so')
and execute it, I receive the following output:
2888960
2888960
2888960
In CPython, a string literal is treated as a constant, which we can see in the bytecode of the snippet above:
2 0 LOAD_GLOBAL 0 (id)
3 LOAD_CONST 1 ('so')
6 CALL_FUNCTION 1
9 PRINT_ITEM
10 PRINT_NEWLINE
3 11 LOAD_GLOBAL 0 (id)
14 LOAD_CONST 1 ('so')
17 CALL_FUNCTION 1
20 PRINT_ITEM
21 PRINT_NEWLINE
4 22 LOAD_GLOBAL 0 (id)
25 LOAD_CONST 1 ('so')
28 CALL_FUNCTION 1
31 PRINT_ITEM
32 PRINT_NEWLINE
33 LOAD_CONST 0 (None)
36 RETURN_VALUE
The same constant (i.e. the same string object) is loaded 3 times, so the IDs are the same.
In your first example a new instance of the string 'so' is created each time, hence different id.
In the second example you are binding the string to a variable and Python can then maintain a shared copy of the string.
A simpler way to understand the behaviour is to read the chapter Data Types and Variables; its section "A String Peculiarity" illustrates your question using special characters as an example.
So while Python is not guaranteed to intern strings, it will frequently reuse the same string object, and this can be misleading. The important takeaway: never test strings for equality with id() or is; use == instead.
To demonstrate this, one way I've discovered to force a new string in Python 2.6 at least:
>>> so = 'so'
>>> new_so = '{0}'.format(so)
>>> so is new_so
False
and here's a bit more Python exploration:
>>> id(so)
102596064
>>> id(new_so)
259679968
>>> so == new_so
True

id() of numpy.float64 objects are the same, even if their values differ? [duplicate]

This question already has answers here:
Why is the id of a Python class not unique when called quickly?
(6 answers)
Closed 7 years ago.
I get the following:
import numpy
print id(numpy.float64(100)) == id(numpy.float64(10))
print numpy.float64(100) == numpy.float64(10)
gives:
True
False
Note that if I create the two float64 objects and then compare them then it appears to work as expected:
a = numpy.float64(10)
b = numpy.float64(100)
print a==b, id(a)==id(b)
gives:
False
False
Based on https://docs.python.org/2/reference/datamodel.html shouldn't the id of two objects always differ if their values differ? How can I get the id to match if the values are different?
Is this some sort of bug in numpy?
This looks like a quirk of memory reuse rather than a NumPy bug.
The line
id(numpy.float64(100)) == id(numpy.float64(10))
first creates the object numpy.float64(100) and then calls the id function on it. The object is then immediately deallocated, because its reference count drops to zero once id() returns. Its memory slot is free to be reused by any new object that is created.
When numpy.float64(10) is created, it occupies the same memory location, hence the memory addresses returned by id compare equal.
This chain of events is perhaps clearer when you look at the bytecode:
>>> dis.dis('id(numpy.float64(100)) == id(numpy.float64(10))')
0 LOAD_NAME 0 (id)
3 LOAD_NAME 1 (numpy)
6 LOAD_ATTR 2 (float64)
9 LOAD_CONST 0 (100)
12 CALL_FUNCTION 1 (1 positional, 0 keyword pair) # call numpy.float64(100)
15 CALL_FUNCTION 1 (1 positional, 0 keyword pair) # get id of object
# gc runs and frees memory occupied by numpy.float64(100)
18 LOAD_NAME 0 (id)
21 LOAD_NAME 1 (numpy)
24 LOAD_ATTR 2 (float64)
27 LOAD_CONST 1 (10)
30 CALL_FUNCTION 1 (1 positional, 0 keyword pair) # call numpy.float64(10)
33 CALL_FUNCTION 1 (1 positional, 0 keyword pair) # get id of object
36 COMPARE_OP 2 (==) # compare the two ids
39 RETURN_VALUE
Imagine that you put a book on a shelf, and then someone notices that you're not using it and takes the book off the shelf to free some space. You then move to put a different book on the shelf at a convenient location.
If you suddenly realize that you had two different books at the very same location, do you think there's a bug in reality? ;-)
After you call id on numpy.float64(100), you have no way to refer to that float object you made, and so the interpreter is perfectly free to reuse that id.
According to the Docs, the id() function is described like this
id(object)
Return the "identity" of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
So that may happen, but not necessarily always. Because the two objects do not exist at the same time, they can end up with the same id() value.
However, in this case
a = numpy.float64(10)
b = numpy.float64(100)
print a==b, id(a)==id(b)
They do exist simultaneously, so they cannot have the same id.
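The same effect is easy to reproduce with plain Python objects, no NumPy required; a small sketch:

```python
class Box:
    pass

# Non-overlapping lifetimes: the first Box is already freed when the
# second is allocated, so CPython may hand out the same memory slot
# (and therefore the same id). Frequently True in CPython, never guaranteed.
maybe_same = id(Box()) == id(Box())

# Overlapping lifetimes: keeping references forces two distinct objects.
a, b = Box(), Box()
print(id(a) == id(b))   # False, always
```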

Python: Why do certain strings match using "is" and others do not? [duplicate]


python string with whitespace doesn't point to same memory address [duplicate]


Yield only as many are required from a generator

I wish to yield from a generator only as many items as are required.
In the following code
a, b, c = itertools.count()
I receive this exception:
ValueError: too many values to unpack
I've seen several related questions; however, I have zero interest in the remaining items from the generator. I only wish to receive as many as I ask for, without providing that quantity in advance.
It seems to me that Python determines the number of items you want, but then proceeds to try to read and store more than that number (raising the ValueError).
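That is exactly what happens: CPython unpacks n items and then calls next() once more to verify the iterator is exhausted. A small experiment (my own, with hypothetical names) makes the extra read visible:

```python
consumed = []

def counting(it):
    # Record every item the unpacking machinery actually pulls.
    for x in it:
        consumed.append(x)
        yield x

try:
    a, b, c = counting(iter(range(10)))
except ValueError:
    pass

print(consumed)   # [0, 1, 2, 3]: three items, plus one extra probe
```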
How can I yield only as many items as I require, without passing in how many items I want?
Update0
To aid in understanding the approximate behaviour I believe should be possible, here's a code sample:
def unpack_seq(ctx, items, seq):
    for name in items:
        setattr(ctx, name, seq.next())

import itertools, sys
unpack_seq(sys.modules[__name__], ["a", "b", "c"], itertools.count())
print a, b, c
If you can improve this code please do.
Alex Martelli's answer suggests to me that the bytecode op UNPACK_SEQUENCE is responsible for the limitation. I don't see why this operation should require the generated sequence to match the expected length exactly.
Note that Python 3 has different unpack syntaxes which probably invalidate technical discussion in this question.
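For reference, Python 3's extended unpacking looks like this, but it does not help here: the starred target eagerly consumes the rest of the iterable, so it would loop forever on itertools.count():

```python
# Works for finite iterables only; a starred target drains the iterator.
a, b, *rest = range(5)
print(a, b, rest)   # 0 1 [2, 3, 4]
```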
You need a deep bytecode hack - what you request cannot be done at Python source code level, but (if you're willing to commit to a specific version and release of Python) may be doable by post-processing the bytecode after Python has compiled it. Consider, e.g.:
>>> def f(someit):
... a, b, c = someit()
...
>>> dis.dis(f)
2 0 LOAD_FAST 0 (someit)
3 CALL_FUNCTION 0
6 UNPACK_SEQUENCE 3
9 STORE_FAST 1 (a)
12 STORE_FAST 2 (b)
15 STORE_FAST 3 (c)
18 LOAD_CONST 0 (None)
21 RETURN_VALUE
>>>
As you see, the UNPACK_SEQUENCE 3 bytecode occurs without the iterator someit being given the least indication of it (the iterator has already been called!) -- so you must prefix it in the bytecode with a "get exactly N items" operation, e.g.:
>>> def g(someit):
... a, b, c = limit(someit(), 3)
...
>>> dis.dis(g)
2 0 LOAD_GLOBAL 0 (limit)
3 LOAD_FAST 0 (someit)
6 CALL_FUNCTION 0
9 LOAD_CONST 1 (3)
12 CALL_FUNCTION 2
15 UNPACK_SEQUENCE 3
18 STORE_FAST 1 (a)
21 STORE_FAST 2 (b)
24 STORE_FAST 3 (c)
27 LOAD_CONST 0 (None)
30 RETURN_VALUE
where limit is your own function, easily implemented (e.g. via itertools.islice). So the original two-bytecode "load fast; call function" sequence (just before the unpack-sequence bytecode) must become this kind of five-bytecode sequence, with a load-global bytecode for limit before the original sequence, and the load-const; call-function pair after it.
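For completeness, a minimal sketch of such a limit helper built on itertools.islice (the name limit matches the disassembly above, but the body is my guess, not code from the answer):

```python
from itertools import count, islice

def limit(iterable, n):
    # Materialize exactly n items so unpacking sees a matching length.
    return tuple(islice(iterable, n))

a, b, c = limit(count(), 3)
print(a, b, c)   # 0 1 2
```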
You can of course implement this bytecode munging in a decorator.
Alternatively (for generality) you could work on altering the function's original source, e.g. via parsing and alteration of the AST, and recompiling to byte code (but that does require the source to be available at decoration time, of course).
Is this worth it for production use? Absolutely not, of course: a ridiculous amount of work for a tiny bit of "syntax sugar". However, it can be an instructive project to pursue to gain mastery in bytecode hacking, AST hacking, and other black-magic tricks that you will probably never need but are surely cool to know when you want to move beyond the level of mere language wizard to that of world-class guru. Indeed, I suspect that those who are motivated to become gurus are typically people who can't fail to yield to the fascination of such language-internals mechanics... and those who actually make it to that exalted level are the subset wise enough to realize such endeavors are "just play", and to pursue them as a spare-time activity (the mental equivalent of weight lifting, much like, say, sudoku or crosswords ;-) without letting them interfere with the important task: delivering value to users by deploying solid, clear, simple, well-performing, well-tested, well-documented code, most often without even the slightest hint of black magic to it ;-).
You need to make sure that the number of items on both sides matches. One way is to use islice from the itertools module:
from itertools import islice
a, b, c = islice(itertools.count(), 3)
Python does not work the way you desire. In any assignment, like
a, b, c = itertools.count()
the right-hand side is evaluated first, before the left-hand side.
The right-hand side cannot know how many items are on the left-hand side unless you tell it.
Or use a list comprehension, since you know the desired number of items:
ic = itertools.count()
a,b,c = [ic.next() for i in range(3)]
or even simpler:
a,b,c = range(3)
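In Python 3 generators have no .next() method; the equivalent of the comprehension above uses the built-in next() (same idea, modern spelling):

```python
import itertools

ic = itertools.count()
# Unpacking a generator expression of exactly three items.
a, b, c = (next(ic) for _ in range(3))
print(a, b, c)   # 0 1 2
```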
