Python's special treatment of certain integers. Why? [duplicate]

This question already has an answer here:
What's with the integer cache maintained by the interpreter?
(1 answer)
Closed 2 years ago.
I came across this phrase:
"Python keeps an array of ints between -5 and 256. When you create
an int in that range, you get a reference to a pre-existing object"
You can verify with this code:
def check(n, d):
    a = b = n
    a -= d
    b -= d
    return a is b
Now, check(500, 10) returns False, but check(500, 300) returns True. Why would Python do such a thing? Isn't it a perfect recipe for bugs?

CPython (the reference interpreter) does it because it saves a lot of memory (and a small amount of execution time) to have the most commonly used ints served from a cache. Incrementing a number produces a shared cached object, not a unique value at each place you do the increment. Iterating a bytes or bytearray object can go much faster by directly pulling the cached entries. It's not a language guarantee though, so never write real code that relies on it (the snippet above is fine as a demonstration).
It's not a bug factory because:
1. Relying on object identity tests for ints is a terrible idea in the first place; you should always be using == to compare ints (a short demonstration follows below), and
2. ints are immutable; it's impossible to modify the cached entries without writing intentionally evil ctypes or C extension modules. Normal Python code can't trigger bugs due to this cache.
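A short demonstration of both points. The int("...") calls force the objects to be built at run time, keeping compile-time constant folding out of the picture; the identity results are CPython behavior, not a language guarantee:

>>> x = int("999")   # built at run time; 999 is outside the -5..256 cache
>>> y = int("999")
>>> x == y           # comparing by value: always correct
True
>>> x is y           # comparing by identity: two distinct objects here
False
>>> a = int("7")
>>> b = int("7")
>>> a is b           # True, but only because CPython caches -5..256
True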


How is Python statement x=x+1 implemented?

In C, a statement like x = x + 1 changes the contents of the memory already allocated for x. But in Python, since a variable can have different types, the x on the left and right sides of = may be of different types, which means they may refer to different pieces of memory. If so, after x changes its reference from the old memory to the new memory, the old memory can be reclaimed by the garbage collection mechanism. If that is the case, the following code might trigger the garbage collection process many times and would thus be very inefficient:
for i in range(1000000000):
    i = i + 1
Is my guess correct?
Update:
I need to correct the typo in the code to make the question clearer:
x = 0
for i in range(1000000000):
    x = x + 1
@SvenMarnach, do you mean that the integers 0, 1, 2, ..., 999999999 (which the label x once referred to) would all exist in memory if garbage collection were not activated?
id can be used to track the 'allocation' of memory to objects. It should be used with caution, but here I think it's illuminating. id is a bit like a C pointer - that is, somehow related to 'where' the object is located in memory.
In [18]: for i in range(0, 1000, 100):
    ...:     print(i, id(i))
    ...:     i = i + 1
    ...:     print(i, id(i))
    ...:
0 10914464
1 10914496
100 10917664
101 10917696
200 10920864
201 10920896
300 140186080959760
301 140185597404720
400 140186080959760
401 140185597404720
...
900 140186080959760
901 140185597404720
In [19]: id(1)
Out[19]: 10914496
Small integers (-5 through 256) are cached - that is, integer 1, once created, is 'reused'.
In [20]: id(202)
Out[20]: 10920928 # same id as in the loop
In [21]: id(302)
Out[21]: 140185451618128 # different id
In [22]: id(901)
Out[22]: 140185597404208
In [23]: id(i)
Out[23]: 140185597404720 # = 901, but different id
In this loop, the first few iterations create or reuse small integers. But it appears that when creating larger integers, it is 'reusing' memory. It may not be full-blown garbage collection, but the interpreter is evidently optimized to avoid unnecessary memory use.
Generally, Python programmers shouldn't focus on these details. Write clean, reliable Python code. In this example, modifying the iteration variable inside the loop is poor practice (even if it is just an example).
You are mostly correct, though I think a few clarifications may help.
First, the concept of a variable is rather different in C and in Python. In C, a variable generally references a fixed location in memory, as you stated yourself. In Python, a variable is just a label that can be attached to any object. An object can have multiple such labels, or none at all, and labels can be freely moved between objects. An assignment in C copies a new value to a memory location, while an assignment in Python attaches a new label to an object.
Integers are also very different in the two languages. In C, an integer has a fixed size and stores its value in a format native to the hardware. In Python, integers have arbitrary precision. They are stored as an array of "digits" (usually 30-bit integers in CPython) together with an object header storing type information. Bigger integers occupy more memory than smaller integers.
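You can watch integers grow with sys.getsizeof. The exact byte counts below are from a 64-bit CPython 3.11 and will differ between builds and versions, but the pattern of roughly four bytes per extra 30-bit digit holds:

>>> import sys
>>> sys.getsizeof(1)        # object header plus one 30-bit digit
28
>>> sys.getsizeof(2**30)    # needs a second digit
32
>>> sys.getsizeof(2**60)    # a third digit
36
>>> sys.getsizeof(10**100)  # a dozen digits
72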
Moreover, integer objects in Python are immutable – they can't be changed once created. This means every arithmetic operation creates a new integer object. So the loop in your code indeed creates a new integer object in each iteration.
However, this isn't the only overhead. The loop also creates a new integer object for i in each iteration, which is dropped at the end of the loop body. And the arithmetic operation is dynamic: Python needs to look up the type of x and its __add__() method in each iteration to figure out how to add objects of this type. And function call overhead in Python is rather high.
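That dynamic dispatch is visible in the bytecode. Here is a quick sketch with the dis module; the opcode is named BINARY_ADD on CPython 3.10 and earlier and BINARY_OP on 3.11+, but in both cases the operand types and the __add__ lookup are resolved only when the opcode executes:

import dis

# The whole addition compiles down to a single generic opcode; nothing
# about the types of x or 1 is decided at compile time.
dis.dis("x = x + 1")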
Garbage collection and memory allocation on the other hand are rather fast in CPython. Garbage collection for integers relies completely on reference counting (no reference cycles possible here), which is fast. And for allocation, CPython uses an arena allocator for small objects that can quickly reuse memory slots without calling the system allocator.
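The reference counting half of that is easy to observe from Python. sys.getrefcount reports one extra reference because its own argument temporarily holds one; the exact numbers are CPython-specific:

>>> import sys
>>> x = []                 # a brand-new object with a single label
>>> sys.getrefcount(x)     # 1 for x, plus 1 for the function argument
2
>>> y = x                  # attach a second label to the same object
>>> sys.getrefcount(x)
3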
So in summary, yes, compared to the same code in C, this code will run awfully slow in Python. A modern C compiler would simply compute the result of this loop at compile time and load the result to a register, so it would finish basically immediately. If raw speed for integer arithmetic is what you want, don't write that code in Python.

How does `is` work in the case of ephemeral objects sharing the same memory address? [duplicate]

This question already has answers here:
How can two Python objects have same id but 'is' operator returns False?
(2 answers)
Why is the id of a Python class not unique when called quickly?
(6 answers)
Unnamed Python objects have the same id
(2 answers)
Closed 4 years ago.
Note that this question might be (is?) specific to CPython.
Say you have some list, and check copies of the list for identity against each other:
>>> a=list(range(10))
>>> b,c=a[:],a[:]
>>> b is c
False
>>> id(b), id(c)
(3157888272304, 3157888272256)
No great shakes there. But if we do this in a more ephemeral way, things might seem a bit weird at first:
>>> a[:] is a[:]
False # <- two ephemeral copies not the same object (duh)
>>> id(a[:]),id(a[:])
(3157888272544, 3157888272544) # <- but two other ephemerals share the same id..? hmm....
...until we recognize what is probably going on here. I have not confirmed it by looking at the CPython implementation (I can barely read C, so it would be a waste of time, to be honest), but it at least seems obvious that even though two objects have the same id, CPython is smart enough to know that they aren't the same object.
Assuming this is correct, my question is: what criteria does CPython use to determine that the two ephemeral objects are not the same object, given that they have the same id (presumably for efficiency reasons - see below)? Is it perhaps looking at the time each was marked for garbage collection? The time each was created? Or something else...?
My theory on why they have the same id is that, likely, CPython knows an ephemeral copy of the list was already made and is waiting to be garbage collected, and it just efficiently re-uses the same memory location. It would be great if an answer could clarify/confirm this as well.
Two immutable objects sharing the same address would, as you suspect, be indistinguishable from each other.
The thing is that when you do a[:] is a[:], the two objects are not at the same address - in order for the identity operator is to compare both objects, both operands have to exist - so there is still a reference to the object on the left-hand side when the native code for is actually runs.
On the other hand, when you do id(a[:]), id(a[:]), the object inside the parentheses of the first call is left without any references as soon as that id call is done, and is destroyed, freeing its memory block to be reused by the second a[:].
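A small experiment showing both situations; the last result depends on CPython's allocator reusing the freed block, so it is "usually", not "guaranteed":

>>> a = list(range(10))
>>> x = a[:]               # keep references: both copies stay alive,
>>> y = a[:]               # so they must occupy different addresses
>>> x is y
False
>>> id(x) == id(y)
False
>>> id(a[:]) == id(a[:])   # first copy dies first; its block is usually reused
True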

Is the empty tuple in Python a "constant" [duplicate]

This question already has answers here:
compare object to empty tuple with the 'is' operator in Python 2.x
(4 answers)
Closed 6 years ago.
I want to make my code more (memory-)efficient. Right now we have a lot of functions that take an iterable as a parameter, like:
def foo(para, meter, iterable):
    # ...
    pass
and sometimes we have to provide it an empty list to do its work properly: foo(14, 25, []). The problem is that a new list is constructed each time: it requires an allocation on the heap, and an empty list takes 64 bytes of memory (on my own machine, tested with sys.getsizeof([])), whereas the empty tuple takes only a (potentially one-time) 48 bytes.
I was therefore wondering whether the empty tuple is a constant. Since tuples are immutable, one could easily make the tuple of length 0 (so ()) a constant in the program. This would eliminate the "construction time" (well, there is none, since it would only set a reference to the constant) and reduce the amount of memory allocated.
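A minimal sketch of that idea, assuming foo (the placeholder function from above) only reads from the iterable and never mutates it:

def foo(para, meter, iterable=()):
    # () is safe as a shared default precisely because tuples are immutable;
    # the classic mutable-default pitfall of using [] does not apply here
    for item in iterable:
        pass  # ... do the actual work ...

foo(14, 25)          # callers no longer need to allocate an empty list
foo(14, 25, [1, 2])  # passing a real list still works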
My question is whether there are guarantees regarding the Python interpreter (that is any popular interpreter) that the empty tuple is indeed a constant such that () does not require construction time nor allocates additional memory.
Testing it with id(..) seems to support the theory that there is indeed only one zero-tuple:
>>> id(())
140290183798856
>>> a = ()
>>> id(a)
140290183798856
but it could be possible that at runtime the Python interpreter forks the tuple for some reason.
In CPython, the empty tuple is a singleton. Only one copy is created, ever, then reused whenever you use () or use tuple() on an empty generator.
The PyTuple_New() function essentially does this:
if (size == 0 && free_list[0]) {
    op = free_list[0];
    Py_INCREF(op);
    // ...
    return (PyObject *) op;
}
So if the tuple size is 0 (empty) and free_list[0] object exists (the existing empty tuple singleton), just use that.
See How is tuple implemented in CPython? for more details on free_list; CPython will also re-use already-created tuple instances up to length 20.
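Both behaviors are easy to check interactively; the free-list reuse in the second half is "often", not "always":

>>> a = ()
>>> b = tuple()
>>> c = (1, 2)[:0]    # slicing down to length 0
>>> a is b is c       # all three names refer to the one cached empty tuple
True
>>> t = (1, 2, 3)
>>> addr = id(t)
>>> del t             # the 3-tuple's slot goes back on the free list
>>> u = (4, 5, 6)
>>> id(u) == addr     # often True: the freed slot was handed straight to u
True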
This is an implementation detail. Other implementations (Jython, IronPython, PyPy) do not have to do the same.

Integers vs. floats in Python: cannot understand the behavior

I was playing a bit in my python shell while learning about mutability of objects.
I found something strange:
>>> x=5.0
>>> id(x)
48840312
>>> id(5.0)
48840296
>>> x=x+3.0
>>> id(x) # why did x (now 8.0) keep the same id as 5.0?
48840296
>>> id(5.0)
36582128
>>> id(5.0)
48840344
Why is the id of 5.0 reused after the statement x=x+3.0?
Fundamentally, the answer to your question is "calling id() on numbers will give you unpredictable results". The reason for this is that unlike languages like Java, where primitives literally are their value in memory, "primitives" in Python are still objects, and no guarantee is provided that exactly the same object will be used every time, merely that a functionally equivalent one will be.
CPython caches the values of the integers from -5 to 256 for efficiency (ensuring that calls to id() on them will always be the same), since these are commonly used and can be effectively cached; however, nothing about the language requires this to be the case, and other implementations may choose not to do so.
Whenever you write a float literal in Python, you're asking the interpreter to convert the string into a valid numerical object. If it can, Python will reuse existing objects, but if it cannot easily determine whether such an object exists already, it will simply create a new one.
This is not to say that numbers in Python are mutable - they aren't. Any instance of a number, such as 5.0, in Python cannot be changed by the user after being created. However there's nothing wrong, as far as the interpreter is concerned, with constructing more than one instance of the same number.
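You can verify this yourself; building the floats at run time with float("...") keeps compile-time constant sharing from muddying the result:

>>> a = float("5.0")
>>> b = float("5.0")
>>> a == b    # same value
True
>>> a is b    # distinct objects: CPython has no float cache like the small-int one
False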
Your specific example of the object representing x = 5.0 being reused for the value of x += 3.0 is an implementation detail. Under the covers, CPython may, if it sees fit, reuse numerical objects, both integers and floats, to avoid the costly activity of constructing a whole new object. I stress however, this is an implementation detail; it's entirely possible certain cases will not display this behavior, and CPython could at any time change its number-handling logic to no longer behave this way. You should avoid writing any code that relies on this quirk.
The alternative, as eryksun points out, is simply that you stumbled on an object being garbage collected and replaced in the same location. From the user's perspective, there's no difference between the two cases, and this serves to stress that id() should not be used on "primitives".
The Devil is in the details
PyObject* PyInt_FromLong(long ival)
Return value: New reference.
Create a new integer object with a value of ival.
The current implementation keeps an array of integer objects for all integers between -5 and 256; when you create an int in that range you actually just get back a reference to the existing object. So it should be possible to change the value of 1. I suspect the behaviour of Python in this case is undefined. :-)
Note: this is true only for CPython and may not apply to other Python implementations.

How is the __len__ method implemented in various container types in Python? [duplicate]

This question already has answers here:
Cost of len() function
(5 answers)
Closed 9 years ago.
Until now, when I used the len function with various container types (let's say the list type for now), I assumed that each container type has a field member which stores the length of that particular object. Coming from Java, this made a lot of sense. But when I came to think about it, I wasn't sure it was true, and this confused me.
Whenever I use the len function on an object that implements __len__, does it calculate the length by iterating over the object's elements, or does it just return a stored length immediately?
The question actually came to me from using the built-in dict type. I added a lot of elements to a dictionary and eventually needed to get the number of elements in it. Because I wasn't sure about the time complexity of the len function, I decided to count the elements as I inserted them... but I'm not sure this is the right solution to my problem.
This is an example code for my question:
d = {}
count = 0
for i in range(10 ** 6):
    d[i] = True
    count += 1
VS
d = {i: True for i in range(10 ** 6)}
count = len(d)
The second solution looks nicer (and shorter) to me... and I know that in theory the time complexity is the same whether or not len is instant; but in the second solution I'm afraid it iterates over the 10 ** 6 elements twice (once for the dictionary comprehension, and once for the length calculation).
Enlighten me please.
You are very definitely over-thinking this. Python is not really the language that you should be using if you're worried about optimising at this level.
That said, on the whole Python's containers do know their own lengths without having to iterate. The built-in types are implemented in C (in the CPython implementation), and while I'd have to dig into the actual code to find out exactly where it's implemented, len is a constant-time call for all of the built-in containers.
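To make that concrete, here is a minimal sketch (Bag is a made-up class) of how a container can answer len() in constant time: it keeps its contents in a structure that tracks its own size, and len() just dispatches to __len__. CPython's built-ins do the moral equivalent in C; a list, for instance, stores its length in its struct's ob_size field:

class Bag:
    """A toy container that always knows its own length."""

    def __init__(self):
        self._items = []

    def add(self, item):
        self._items.append(item)

    def __len__(self):
        # len(bag) lands here; no iteration over the elements happens,
        # because the underlying list already stores its size
        return len(self._items)

bag = Bag()
bag.add("spam")
bag.add("eggs")
print(len(bag))  # 2

So in your example, len(d) does not walk the million entries; the dict has been maintaining a count all along, and the comprehension-plus-len version iterates over range(10 ** 6) only once.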
