I found a strange behavior with the semicolon ";" in Python.
>>> x=20000;y=20000
>>> x is y
True
>>> x=20000
>>> y=20000
>>> x is y
False
>>> x=20000;
>>> y=20000
>>> x is y
False
Why does the first test return "True", and the others return "False"? My Python version is 3.6.5.
In the interactive interpreter, the first line with the semicolon is read and evaluated in one pass. As such, the interpreter recognizes that 20000 is the same immutable int value in both assignments, and so can (it doesn't have to, but does) make x and y references to the same object.
The important point is that this is simply an optimization that the interactive interpreter chooses to make; it's not something guaranteed by the language or some special property of the ; that joins two statements into one.
In the following two examples, by the time y = 20000 is read and evaluated, x = 20000 (with or without the semicolon) has already been evaluated and forgotten. Since 20000 isn't in the range [-5, 256] of pre-allocated int values, CPython doesn't try to find an existing 20000 object already in memory; it just creates a new one for y.
The is operator checks whether two expressions refer to the same object in memory; it is not meant for checking equality (use == for that). For what it's worth, you can treat the fact that it sometimes returns True and sometimes False for equal integers as a matter of luck (even though, strictly, it isn't).
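To see the difference concretely, here is a minimal sketch using lists (list literals always create new objects, so the result doesn't depend on any caching):

a = [1, 2, 3]
b = [1, 2, 3]
print(a == b)   # True: the lists hold equal values
print(a is b)   # False: they are two distinct objects in memory
b = a
print(a is b)   # True: both names now refer to the same object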
For example, the results are different in an interactive session and in a standalone program:
$ cat test.py
x = 200000; y = 200000
print(x is y)
xx = 200000
yy = 200000
print(xx is yy)
$ python test.py
True
True
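The script prints True twice because the whole module is compiled as one code block, and the compiler reuses a single 200000 constant within it. A quick way to check this (CPython-specific introspection; the source string below is just a condensed stand-in for test.py):

src = "x = 200000; y = 200000\nxx = 200000\nyy = 200000"
code = compile(src, "<string>", "exec")
print(code.co_consts)   # (200000, None) on CPython: one shared constant object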
Here is another example:
>>> x = 50 + 50; y = 50 + 50
>>> x is y
True
>>> x = 5000 + 5000; y = 5000 + 5000
>>> x is y
False
This happens because the interpreter caches small integers, so they are always the same object: 50 + 50 evaluates to 100, which is inside the cached range, while 5000 + 5000 evaluates to 10000, which is not, so each addition in the second case creates a new 10000 object. It has nothing to do with the semicolon.
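If you want to convince yourself that the semicolon plays no role, one trick is to construct the values at run time (here via int() on a string) so that no compile-time constant sharing can happen:

x = int("20000"); y = int("20000")
print(x is y)   # False even on a single line: the semicolon never mattered
print(x == y)   # True: the values are, of course, equal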
here is my code
x = 5
y = 5
print(x is y)
print(id(x))
print(id(y))
and the output is
True
1903991482800
1903991482800
I don't know why x and y have the same location here.
Please help me understand this problem!
Thanks!
Your issue involves a technically complicated concept, but I will try to explain it to you in simple terms.
Let's say the number 3 is stored somewhere in your memory, at an address like 'xxyyzz'. When you declare a = 3, what the Python interpreter actually does is make the variable 'a' point to that memory location. Similarly, when you declare another variable b = 3, 'b' also points to the memory location 'xxyyzz'. The 'is' operator in Python compares the memory addresses of the objects the variables refer to, so you get id(a) == id(b) as True. (This sharing happens reliably here because CPython pre-allocates the small integers in [-5, 256]; values outside that range generally get fresh objects.)
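Here is a minimal sketch of that behavior; note that the cached range [-5, 256] and the int-on-string trick used to defeat compile-time constant sharing are CPython implementation details:

x = 5
y = 5
print(x is y, id(x) == id(y))   # True True: both names point to the cached 5 object

big1 = int("12345")             # built at run time, outside the small-int cache
big2 = int("12345")
print(big1 is big2)             # False on CPython: two distinct objects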
Hope this helps!
I have noticed that it's common for beginners to make the following simple logical error. Since they genuinely don't understand the problem, a) their questions can't really be said to be caused by a typo (a full explanation would be useful); b) they lack the understanding necessary to create a proper example, explain the problem with proper terminology, and ask clearly. So, I am asking on their behalf, to make a canonical duplicate target.
Consider this code example:
x = 1
y = x + 2
for _ in range(5):
    x = x * 2  # so it will be 2 the first time, then 4, then 8, then 16, then 32
    print(y)
Each time through the loop, x is doubled. Since y was defined as x + 2, why doesn't it change when x changes? How can I make it so that the value is automatically updated, and I get the expected output
4
6
10
18
34
?
Declarative programming
Many beginners expect Python to work this way, but it does not. Worse, they may inconsistently expect it to work that way. Carefully consider this line from the example:
x = x * 2
If assignments were like mathematical formulas, we'd have to solve for x here. The only possible (numeric) value for x would be zero, since any other number is not equal to twice that number. And how should we account for the fact that the code previously says x = 1? Isn't that a contradiction? Should we get an error message for trying to define x two different ways? Or expect x to blow up to infinity, as the program keeps trying to double the old value of x?
Of course, none of those things happen. Like most programming languages in common use, Python is an imperative language, meaning that lines of code describe actions that occur in a defined order. Where there is a loop, the code inside the loop is repeated; where there is something like if/else, some code might be skipped; but in general, code within the same "block" simply happens in the order that it's written.
In the example, first x = 1 happens, so x is equal to 1. Then y = x + 2 happens, which makes y equal to 3 for the time being. That happened because of the assignment, not because of any ongoing relationship between y and x. Thus, when x changes later on in the code, that does not cause y to change.
Going with the (control) flow
So, how do we make y change? The simplest answer is: the same way that we gave it this value in the first place - by assignment, using =. In fact, thinking about the x = x * 2 code again, we already have seen how to do this.
In the example code, we want y to change multiple times - once each time through the loop, since that is where print(y) happens. What value should be assigned? It depends on x - the current value of x at that point in the process, which is determined by using... x. Just like how x = x * 2 checks the existing value of x, doubles it, and changes x to that doubled result, so we can write y = x + 2 to check the existing value of x, add two, and change y to be that new value.
Thus:
x = 1
for _ in range(5):
    x = x * 2
    y = x + 2
    print(y)
All that changed is that the line y = x + 2 is now inside the loop. We want that update to happen every time that x = x * 2 happens, immediately after that happens (i.e., so that the change is made in time for the print(y)). So, that directly tells us where the code needs to go.
Defining relationships
Suppose there were multiple places in the program where x changes:
x = x * 2
y = x + 2
print(y)
x = 24
y = x + 2
print(y)
Eventually, it will get annoying to remember to update y after every line of code that changes x. It's also a potential source of bugs, one that will get worse as the program grows.
In the original code, the idea behind writing y = x + 2 was to express a relationship between x and y: we want the code to treat y as if it meant the same thing as x + 2, anywhere that it appears. In mathematical terms, we want to treat y as a function of x.
In Python, like most other programming languages, we express the mathematical concept of a function using... a function. In Python specifically, we use the def statement to write functions. It looks like:
def y(z):
    return z + 2
We can write whatever code we like inside the function, and when the function is "called", that code will run, much like our existing "top-level" code runs. When Python first encounters the block starting with def, though, it only creates a function from that code - it doesn't run the code yet.
So, now we have something named y, which is a function that takes in some z value and gives back (i.e., returns) the result of calculating z + 2. We can call it by writing something like y(x), which will give it our existing x value and evaluate to the result of adding 2 to that value.
Notice that z here is the function's own name for the value that was passed in, and it does not have to match our own name for that value. In fact, we don't need our own name for that value at all: for example, we can write y(1), and the function will compute 3.
What do we mean by "evaluating to", or "giving back", or "returning"? Simply, the code that calls the function is an expression, just like 1 + 2, and when the value is computed, it gets used in place, in the same way. So, for example, a = y(1) will make a be equal to 3:
The function receives a value 1, calling it z internally.
The function computes z + 2, i.e. 1 + 2, getting a result of 3.
The function returns the result of 3.
That means that y(1) evaluated to 3; thus, the code proceeds as if we had put 3 where the y(1) is.
Now we have the equivalent of a = 3.
For more about using functions, see How do I get a result (output) from a function? How can I use the result later?.
Going back to the beginning of this section, we can therefore use calls to y directly for our prints:
x = x * 2
print(y(x))
x = 24
print(y(x))
We don't need to "update" y when x changes; instead, we determine the value when and where it is used. Of course, we technically could have done that anyway: it only matters that y is "correct" at the points where it's actually used for something. But by using the function, the logic for the x + 2 calculation is wrapped up, given a name, and put in a single place. We don't need to write x + 2 every time. It looks trivial in this example, but y(x) would do the trick no matter how complicated the calculation is, as long as x is the only needed input. The calculation only needs to be written once: inside the function definition, and everything else just says y(x).
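Putting it all together, the complete program then reads:

def y(z):
    return z + 2

x = 1
for _ in range(5):
    x = x * 2
    print(y(x))   # prints 4, 6, 10, 18, 34 -- the expected output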
It's also possible to make the y function use the x value directly from our "top-level" code, rather than passing it in explicitly. This can be useful, but in the general case it gets complicated and can make code much harder to understand and prone to bugs. For a proper understanding, please read Using global variables in a function and Short description of the scoping rules?.
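For completeness, here is a minimal sketch of that global-variable approach (generally discouraged for the reasons given in those links):

x = 1

def y():
    return x + 2   # looks up the global x each time it is called

x = x * 2
print(y())   # 4: the current global x (now 2) plus 2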
The following two snippets are equivalent, but the first one takes about 700 MB of memory, while the second takes only about 100 MB (measured via the Windows Task Manager). What happens here?
def a():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        lst.append(t)
    return lst

_ = a()
def a():
    lst = []
    for i in range(10**7):
        t = "a" * 2
        lst.append(t)
    return lst

_ = a()
@vurmux gives the right reason for the different memory usage: string interning. However, some important details seem to be missing.
The CPython implementation interns some strings during compilation, e.g. "a"*2 - for more info about how/why "a"*2 gets interned, see this SO post.
Clarification: as @MartijnPieters has correctly pointed out in his comment, the important thing is whether the compiler performs constant folding (i.e., evaluates the multiplication of the two constants in "a"*2) or not. If constant folding is done, the resulting constant is used and all elements in the list are references to the same object; otherwise they are not. Even if all folded string constants here also get interned, it was still sloppy to speak about interning: constant folding is the key, as it also explains the behavior for types which have no interning at all, for example floats (if we were to use t = 42 * 2.0).
Whether constant folding has happened can easily be verified with the dis module (I call your second version a2()):
>>> import dis
>>> dis.dis(a2)
...
4 18 LOAD_CONST 2 ('aa')
20 STORE_FAST 2 (t)
...
As we can see, the multiplication isn't performed at run time; instead, the result of the multiplication (computed at compile time) is loaded directly, so the resulting list consists of references to the same object (the constant loaded with 18 LOAD_CONST 2):
>>> len({id(s) for s in a2()})
1
Then only 8 bytes per reference are needed, which means about 80 MB of memory (+ overallocation of the list + memory needed for the interpreter).
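A quick back-of-the-envelope check of that figure, assuming a 64-bit build with 8-byte pointers:

refs = 10**7 * 8              # one 8-byte reference per list element
print(refs / 2**20, "MiB")    # ~76 MiB, in line with the ~80 MB estimate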
In Python 3.7, constant folding isn't performed if the resulting string has more than 4096 characters, so replacing "a"*2 with "a"*4097 leads to the following bytecode:
>>> dis.dis(a1)
...
4 18 LOAD_CONST 2 ('a')
20 LOAD_CONST 3 (4097)
22 BINARY_MULTIPLY
24 STORE_FAST 2 (t)
...
Now the multiplication isn't precalculated, and the references in the resulting list will be to different objects.
The optimizer is not yet smart enough to recognize that t is actually "a" in t = t*2; otherwise it would be able to perform the constant folding. For now, the resulting bytecode for your first version (I call it a1()) is:
...
5 22 LOAD_CONST 3 (2)
24 LOAD_FAST 2 (t)
26 BINARY_MULTIPLY
28 STORE_FAST 2 (t)
...
and it will return a list with 10^7 different objects (all of them equal) inside:
>>> len({id(s) for s in a1()})
10000000
i.e. you will need about 56 bytes per string (sys.getsizeof returns 51, but because the pymalloc memory allocator is 8-byte aligned, 5 bytes will be wasted) plus 8 bytes per reference (assuming a 64-bit CPython version), thus about 610 MB (+ overallocation of the list + memory needed for the interpreter).
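And the corresponding arithmetic for this unfolded version:

per_elem = 56 + 8                         # one 2-char str object + one reference
print(10**7 * per_elem / 2**20, "MiB")    # ~610 MiB, matching the estimate above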
You can enforce the interning of the string via sys.intern:
import sys

def a1_interned():
    lst = []
    for i in range(10**7):
        t = "a"
        t = t * 2
        # here, ensure that the string object gets interned;
        # the returned value is the interned version
        t = sys.intern(t)
        lst.append(t)
    return lst
And really, we can now see not only that less memory is needed, but also that the list holds references to the same object (see it online for a slightly smaller size (10^5) here):
>>> len({id(s) for s in a1_interned()})
1
>>> all(s == "aa" for s in a1_interned())
True
String interning can save a lot of memory, but it is sometimes tricky to understand whether or why a string gets interned. Calling sys.intern explicitly eliminates this uncertainty.
The existence of additional temporary objects referenced by t is not the problem: CPython uses reference counting for memory management, so an object gets deleted as soon as there are no references to it, without any involvement of the garbage collector, which in CPython is only used to break up reference cycles (unlike, for example, Java's GC, since Java doesn't use reference counting). Thus, temporary variables really are temporary: those objects cannot accumulate and impact memory usage.
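A small sketch of that immediate reclamation; keep in mind that sys.getrefcount counts its own argument as one extra reference:

import sys

t = "a" * 4097             # long enough that CPython builds it at run time (not folded)
print(sys.getrefcount(t))  # typically 2: the name t, plus the temporary argument reference
t = "something else"       # the old string's refcount drops to 0 and it is freed at once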
The problem with the temporary variable t is only that it prevents the peephole optimization during compilation, which is performed for "a"*2 but not for t*2.
This difference exists because of string interning in the Python interpreter:
String interning is the method of caching particular strings in memory as they are instantiated. The idea is that, since strings in Python are immutable objects, only one instance of a particular string is needed at a time. By storing an instantiated string in memory, any future references to that same string can be directed to refer to the singleton already in existence, instead of taking up new memory.
Let me show it in a simple example:
>>> t1 = 'a'
>>> t2 = t1 * 2
>>> t2 is 'aa'
False
>>> t1 = 'a'
>>> t2 = 'a'*2
>>> t2 is 'aa'
True
When you use the first variant, Python's string interning is not used, so the interpreter creates additional internal variables to store temporary data. It can't optimize multi-line code this way.
I am not a Python guru, but I think the interpreter works this way:
t = "a"
t = t * 2
In the first line it creates an object for t. In the second line it creates a temporary object for the t to the right of the = sign and writes the result to a third place in memory (with garbage collection happening later). So the second variant should use at least 3 times less memory than the first.
P.S. You can read more about string interning here.
So Python 3.6.2 has some weird behavior in its assignment of ids for integer values.
For any integer value in the range [-5, 256], any variable assigned that value gets the same id as any other variable with the same value. This effect can be seen below.
>>> a, b = -5, -5
>>> id(a), id(b)
(1355597296, 1355597296)
>>> a, b = -6, -6
>>> id(a), id(b)
(2781041259312, 2781041260912)
In fact, to see the ID pairs in action, you can just run this simple program that prints out the number and id in the range that I'm talking about...
for val in range(-6, 258):
    print(format(val, ' 4d'), ':', format(id(val), '11x'))
If you add some other variables with values outside this range, you will see that the ids of the boundary values (i.e. -6 and 257) change within the Python interpreter, but the ids of the values inside the range never do.
This means (at least to me) that Python has taken the liberty of hardcoding the addresses of the objects holding values in a seemingly arbitrary range of numbers.
In practice, this can be a little dangerous for a beginning Python learner: since the ids assigned are the same within what is a normal range of operation for beginners, they may be inclined to use logic that might get them in trouble, even though it seemingly works and makes sense...
One possible (though a bit odd) problem might be printing an incrementing number:
a = 0
b = 10
while a is not b:
    a = a + 1
    print(a)
This logic, though not written in the standard Pythonic way, works and is fine as long as b is in the range of statically defined numbers [-5, 256].
However, as soon as b is raised out of this range, we see the same strange behavior. In this case, it actually throws the code into an infinite loop.
I know that using 'is' to compare values is really not a good idea, but it produces inconsistent results here that are not immediately obvious to someone new to the language, and it would be especially confusing for new programmers who mistakenly relied on this method.
So my question is...
a) Why (was Python written to behave this way), and
b) Should it be changed?
p.s. In order to properly demonstrate the range in a usable script, I had to do some odd tweaks that really are improper code. However, I still stand by my argument, since my method would not show any results if this odd glitch didn't exist.
for val in range(-6, 300):
    a = int(float(val))
    b = int(float(val))
    print(format(a, ' 4d'), format(id(a), '11x'), ':', format(b, ' 4d'), format(id(b), '11x'), ':', a is b)
    val = val + 1
The int(float(val)) is necessary to force Python to give each value a new address/id, rather than a pointer to the cached object it would otherwise be accessing.
This is documented behavior of Python:
The current implementation keeps an array of integer objects for all integers between -5 and 256, when you create an int in that range you actually just get back a reference to the existing object.
source
It helps to save memory and to make operations a bit faster.
It is implementation-specific. For example, IronPython re-uses integers in a range between -1000 and 1000.
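One way to probe the cached range without odd tricks is to build each value at run time from a string; the exact boundaries remain a CPython implementation detail, so results may differ on other interpreters:

for n in (-6, -5, 0, 256, 257):
    a = int(str(n))   # constructed at run time, so no compile-time constant sharing
    b = int(str(n))
    print(n, a is b)  # typically True only for -5, 0 and 256 on CPython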
I am using Anaconda (Python 3.6).
In the interactive mode, I did object identity test for positive integers >256:
# Interactive test 1
>>> x = 1000
>>> y = 1000
>>> x is y
False
Clearly, large integers (>256) written on separate lines are not reused in interactive mode.
But if we write the assignment in one line, the large positive integer object is reused:
# Interactive test 2
>>> x, y = 1000, 1000
>>> x is y
True
That is, in interactive mode, writing the integer assignments on one line or on separate lines makes a difference for reusing integer objects (>256). For integers in [-5, 256] (as described at https://docs.python.org/2/c-api/int.html), the caching mechanism ensures that only one object is created, whether or not the assignments are on the same line.
Now let's consider small negative integers less than -5 (any negative integer outside the range [-5, 256] would serve the purpose); surprising results come out:
# Interactive test 3
>>> x, y = -6, -6
>>> x is y
False # inconsistent with the large positive integer 1000
>>> -6 is -6
False
>>> id(-6), id(-6), id(-6)
(2280334806256, 2280334806128, 2280334806448)
>>> a = b = -6
>>> a is b
True # different result from a, b = -6, -6
Clearly, this demonstrates an inconsistency in object identity tests between large positive integers (>256) and small negative integers (<-5). And for small negative integers (<-5), writing a, b = -6, -6 versus a = b = -6 also makes a difference (in contrast, it doesn't matter which form is used for large integers). Any explanations for these strange behaviors?
For comparison, let's move on to an IDE run (I am using PyCharm with the same Python 3.6 interpreter). I run the following script:
# IDE test case
x = 1000
y = 1000
print(x is y)
It prints True, different from the interactive run. Thanks to @Ahsanul Haque, who already gave a nice explanation of the inconsistency between the IDE run and the interactive run. But it still remains to answer my question about the inconsistency between large positive integers and small negative integers in the interactive run.
Only one copy of a particular constant is created for a particular piece of source code and reused if needed again. So, in PyCharm, you are getting x is y == True.
But in the interpreter, things are different. Here, only one line/statement runs at a time, and a particular constant is created anew for each line; it is not reused in the next line. So x is not y here.
But if you initialize on the same line, you get the same behavior (reusing the same constant):
>>> x,y = 1000, 1000
>>> x is y
True
>>> x = 1000
>>> y = 1000
>>> x is y
False
Edit:
A block is a piece of Python program text that is executed as a unit.
In an IDE, the whole module gets executed at once, i.e. the whole module is one block. But in interactive mode, each instruction is actually a block of code that is executed at once.
As I said earlier, a particular constant is created once for a block of code and reused if it reappears in that block again.
This is the main difference between the IDE and the interpreter.
Then why does the interpreter give the same output as the IDE for smaller numbers? This is where integer caching comes into consideration.
If the numbers are small, they are cached and reused across code blocks, so we get the same id even in the interpreter.
But if they are big, they are not cached; rather, a new copy is created each time. So, as expected, the ids are different.
Hope this makes sense now.
When you run 1000 is 1000 in the interactive shell or as part of a bigger script, CPython generates bytecode like:
In [3]: dis.dis('1000 is 1000')
...:
1 0 LOAD_CONST 0 (1000)
2 LOAD_CONST 0 (1000)
4 COMPARE_OP 8 (is)
6 RETURN_VALUE
What it does is:
Loads two constants (LOAD_CONST pushes co_consts[consti] onto the stack -- docs)
Compares them using is (True if operands refer to the same object; False otherwise)
Returns the result
As CPython only creates one Python object for a constant used in a code block, 1000 is 1000 will result in a single integer constant being created:
In [4]: code = compile('1000 is 1000', '<string>', 'single') # code object
In [5]: code.co_consts # constants used by the code object
Out[5]: (1000, None)
According to the bytecode above, Python will load that same object twice and compare it with itself, so the expression will evaluate to True:
In [6]: eval(code)
Out[6]: True
The results are different for -6, because -6 is not immediately recognized as a constant:
In [7]: ast.dump(ast.parse('-6'))
Out[7]: 'Module(body=[Expr(value=UnaryOp(op=USub(), operand=Num(n=6)))])'
-6 is an expression negating the value of the integer literal 6.
Nevertheless, the bytecode for -6 is -6 is virtually the same as the first bytecode sample:
In [8]: dis.dis('-6 is -6')
1 0 LOAD_CONST 1 (-6)
2 LOAD_CONST 2 (-6)
4 COMPARE_OP 8 (is)
6 RETURN_VALUE
So Python loads two -6 constants and compares them using is.
How does the -6 expression become a constant? CPython has a peephole optimizer, capable of optimizing simple expressions involving constants by evaluating them right after the compilation, and storing the results in the table of constants.
As of CPython 3.6, folding unary operations is handled by fold_unaryops_on_constants in Python/peephole.c. In particular, - (unary minus) is evaluated by PyNumber_Negative that returns a new Python object (-6 is not cached). After that, the newly created object is inserted to the consts table. However, the optimizer does not check whether the result of the expression can be reused, so the results of identical expressions end up being distinct Python objects (again, as of CPython 3.6).
To illustrate this, I'll compile the -6 is -6 expression:
In [9]: code = compile('-6 is -6', '<string>', 'single')
There are two -6 constants in the co_consts tuple:
In [10]: code.co_consts
Out[10]: (6, None, -6, -6)
and they have different memory addresses
In [11]: [id(const) for const in code.co_consts if const == -6]
Out[11]: [140415435258128, 140415435258576]
Of course, this means that -6 is -6 evaluates to False:
In [12]: eval(code)
Out[12]: False
For the most part, the explanation above remains valid in the presence of variables. When executed in the interactive shell, these three lines
>>> x = 1000
>>> y = 1000
>>> x is y
False
are parts of three different code blocks, so the 1000 constant won't be reused. However, if you put them all in one code block (like a function body) the constant will be reused.
In contrast, the x, y = 1000, 1000 line is always executed as one code block (even in the interactive shell), and therefore CPython always reuses the constant. In x, y = -6, -6, the -6 isn't reused, for the reasons explained in the first part of my answer.
x = y = -6 is trivial. Since there's exactly one Python object involved, x is y would return True even if you replaced -6 with something else.
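A quick illustration of that last point, using a value that is certainly not cached:

x = y = [1, 2, 3]   # one object bound to two names -- works even for mutable values
print(x is y)       # True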
To complement Ahsanul Haque's answer, try this in any IDE:
x = 1000
y = 1000
print (x is y)
print('\ninitial id x: ',id(x))
print('initial id y: ',id(y))
x=2000
print('\nid x after change value: ',id(x))
print('id y after change x value: ', id(y))
initial id x: 139865953872336
initial id y: 139865953872336
id x after change value: 139865953872304
id y after change x value: 139865953872336
Very likely you will see the same id for x and y. Then run the code in the interactive interpreter, and the ids will be different:
>>> x = 1000
>>> y = 1000
>>> id(x)
139865953870576
>>> id(y)
139865953872368
See Here.