I'm doing some things in Python (3.3.3), and I came across something that is confusing me, since to my understanding a class produces an object with a new id each time it is called.
Let's say you have this in some .py file:
class someClass: pass
print(someClass())
print(someClass())
The above prints the same id twice, which confuses me: I'm calling the class twice, so the ids shouldn't be the same, right? Is this how Python works when the same class is called twice in a row, or not? It gives a different id when I wait a few seconds, but if I call it back to back like the example above, it doesn't seem to work that way, which confuses me.
>>> print(someClass());print(someClass())
<__main__.someClass object at 0x0000000002D96F98>
<__main__.someClass object at 0x0000000002D96F98>
It returns the same thing, but why? I also notice it with ranges, for example:
for i in range(10):
    print(someClass())
Is there any particular reason for Python doing this when the class is called in quick succession? I didn't even know Python did this; is it possibly a bug? If it is not a bug, can someone explain to me how to fix it, or a method so it generates a different id each time the class is called? I'm pretty puzzled about how it is doing this, because if I wait, the id changes, but not if I call the same class two or more times in a row.
The id of an object is only guaranteed to be unique during that object's lifetime, not over the entire lifetime of a program. The two someClass objects you create only exist for the duration of the call to print - after that, they are available for garbage collection (and, in CPython, deallocated immediately). Since their lifetimes don't overlap, it is valid for them to share an id.
It is also unsurprising in this case, because of a combination of two CPython implementation details: first, it does garbage collection by reference counting (with some extra magic to avoid problems with circular references), and second, the id of an object is related to the value of the underlying pointer for the object (i.e., its memory location). So the first object, which was the most recently allocated object, is immediately freed; it isn't too surprising that the next object allocated will end up in the same spot (although this potentially also depends on details of how the interpreter was compiled).
If you are relying on several objects having distinct ids, you might keep them around - say, in a list - so that their lifetimes overlap. Otherwise, you might implement a class-specific id that has different guarantees, e.g.:
class SomeClass:
    next_id = 0
    def __init__(self):
        # hand out a sequential id and advance the class-level counter
        self.id = SomeClass.next_id
        SomeClass.next_id += 1
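With that counter in place, instances created back to back report distinct ids (a minimal usage sketch of the class above):
print(SomeClass().id)  # 0
print(SomeClass().id)  # 1
print(SomeClass().id)  # 2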
If you read the documentation for id, it says:
Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
And that's exactly what's happening: you have two objects with non-overlapping lifetimes, because the first one is already out of scope before the second one is ever created.
But don't trust that this will always happen, either. Especially if you need to deal with other Python implementations, or with more complicated classes. All that the language says is that these two objects may have the same id() value, not that they will. And the fact that they do depends on two implementation details:
The garbage collector has to clean up the first object before your code even starts to allocate the second object—which is guaranteed to happen with CPython or any other ref-counting implementation (when there are no circular references), but pretty unlikely with a generational garbage collector as in Jython or IronPython.
The allocator under the covers has to have a very strong preference for reusing recently-freed objects of the same type. This is true in CPython, which has multiple layers of fancy allocators on top of basic C malloc, but most of the other implementations leave a lot more to the underlying virtual machine.
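Both details can be seen in a small experiment; the address reuse is a typical CPython outcome, not a guarantee:
a = object()
addr = id(a)
del a                  # refcount hits zero; CPython frees the object immediately
b = object()
print(id(b) == addr)   # often True on CPython; typically False on other implementations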
One last thing: the fact that object.__repr__ happens to contain the id as a hexadecimal number is just an implementation artifact of CPython that isn't guaranteed anywhere. According to the docs:
If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value (given an appropriate environment). If this is not possible, a string of the form <...some useful description…> should be returned.
The fact that CPython's object puts hex(id(self)) into its repr (actually, I believe it's doing the equivalent of sprintf-ing its pointer through %p, but since CPython's id just returns the same pointer cast to a long, that ends up being the same) isn't guaranteed anywhere, even if it has been true since before object even existed in the early 2.x days. You're safe to rely on it for this kind of simple "what's going on here" debugging at the interactive prompt, but don't try to use it beyond that.
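For example, on CPython the two happen to line up (modulo zero-padding and letter case):
o = object()
print(repr(o))     # e.g. <object object at 0x7f3a2c1b4e50>
print(hex(id(o)))  # the same address as a hexadecimal number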
I sense a deeper problem here. You should not be relying on id to track unique instances over the lifetime of your program. You should simply see it as a non-guaranteed memory location indicator for the duration of each object instance. If you immediately create and release instances then you may very well create consecutive instances in the same memory location.
Perhaps what you need to do is track a class static counter that assigns each new instance with a unique id, and increments the class static counter for the next instance.
It's releasing the first instance, since it wasn't retained; then, since nothing has happened to that memory in the meantime, the second instantiation lands in the same location.
Try calling the following:
a = someClass()
for i in range(0, 44):
    print(someClass())
print(a)
You'll see something different. Why? Because the memory that was released by each object in the for loop was reused. On the other hand, a's memory is not reused, since a is retained.
An example where the memory location (and id) is not released is:
print([someClass() for i in range(10)])
Now the ids are all unique.
Related
I'm writing a small application in Python 3 where objects (players) hold sets of connected objects of the same type (other players) together with data that belong to the pair of objects (the result of a played game). From time to time it is necessary to delete objects (in my example when they drop out of a tournament). If this is the case, the to-be-deleted object also must be removed from all sets of all connected objects.
So, when an object detects that it shall be deleted, it shall walk through all connected objects in its own set and call the method .remove(self) on all connected objects. When this is done, it is ready to be destroyed.
Is it possible to have this done by simply calling del player42? I've read What is the __del__ method and how do I call it? and from there (and from other resources) I learned that the method __del__ is not reliable, because it will be called only when the to-be-deleted object is garbage collected, but this can happen much later than it really should be performed.
Is there another "magic" method in Python 3 objects that will be called immediately when the del command is performed on the object?
The del command deletes a specific reference to an object, not the object itself. There may still be many other references to that object (especially in parent/child relationships) that will prevent the object from being garbage collected, and __del__ from being called. If you have time-dependent and order-specific teardown requirements, you should implement a method on your objects like destroy() that you can call explicitly.
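A minimal sketch of that pattern, with hypothetical Player and destroy names:
class Player:
    def __init__(self, name):
        self.name = name
        self.opponents = set()   # connected players

    def connect(self, other):
        self.opponents.add(other)
        other.opponents.add(self)

    def destroy(self):
        # explicitly unlink this player from everyone who references it
        for other in self.opponents:
            other.opponents.discard(self)
        self.opponents.clear()
Once player.destroy() has run, a plain del player (or letting the name go out of scope) drops the last reference and the object is reclaimed.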
If you really want to use __del__ for some reason, you could make heavy use of weakrefs to prevent hard references that would prevent garbage collection, but it would be less deterministic with the potential for race conditions and other hard to diagnose bugs.
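If you do go that route, weakref.WeakSet makes the back references non-owning, so dropping the last hard reference to a player removes it from the other players' sets automatically (a sketch with the same hypothetical names):
import weakref

class Player:
    def __init__(self, name):
        self.name = name
        # weak back references: they don't keep the other players alive
        self.opponents = weakref.WeakSet()

    def connect(self, other):
        self.opponents.add(other)
        other.opponents.add(self)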
I guess this is not about memory management, but about getting rid of the back references from the objects to their containers. As garbage collection can assumed to be highly non-deterministic, you should not rely on it to perform runtime-relevant operations such as removing objects from a collection.
Instead, design your system in a different, less coupled way. For example, don't keep the items in collections associated with players -- store them in separate inventories, and access them only through them. You can then just delete objects from the inventory. This is a bit similar to certain forms of database normalization.
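As a rough sketch of that inventory idea (all names hypothetical), the pairwise results live in one central mapping instead of inside the players, so removing a player touches only that mapping:
results = {}   # frozenset({player_a, player_b}) -> game result

def record_game(a, b, result):
    results[frozenset((a, b))] = result

def remove_player(p):
    # drop every pairing that involves p; no back references to chase
    for pair in [k for k in results if p in k]:
        del results[pair]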
To achieve this kind of design (updating things referred to from different places), games tend to use special design patterns, for example entity component systems.
Question
Suppose that I have implemented two Python types using the C extension API and that the types are identical (same data layouts/C struct) with the exception of their names and a few methods. Assuming that all methods respect the data layout, can you safely change the type of an object from one of these types into the other in a C function?
Notably, as of Python 3.9, there appears to be a function Py_SET_TYPE, but the documentation is not clear as to whether/when this is safe to do. I'm interested in knowing both how to use this function safely and whether types can be safely changed prior to version 3.9.
Motivation
I'm writing a Python C extension to implement a Persistent Hash Array Mapped Trie (PHAMT); in case it's useful, the source code is here (as of writing, it is at this commit). A feature I would like to add is the ability to create a Transient Hash Array Mapped Trie (THAMT) from a PHAMT. THAMTs can be created from PHAMTs in O(1) time and can be mutated in-place efficiently. Critically, THAMTs have the exact same underlying C data-structure as PHAMTs—the only real difference between a PHAMT and a THAMT is a few methods encapsulated by their Python types. This common structure allows one to very efficiently turn a THAMT back into a PHAMT once one has finished performing a set of edits. (This pattern typically reduces the number of memory allocations when performing a large number of updates to a PHAMT).
A very convenient way to implement the conversion from THAMT to PHAMT would be to simply change the type pointers of the THAMT objects from the THAMT type to the PHAMT type. I am confident that I can write code that safely navigates this change, but I can imagine that doing so might, for example, break the Python garbage collector.
(To be clear: the motivation is just context as to how the question arose. I'm not looking for help implementing the structures described in the Motivation, I'm looking for an answer to the Question, above.)
The supported way
It is officially possible to change an object's type in Python, as long as the memory layouts are compatible... but this is mostly limited to types not implemented in C. With some restrictions, it is possible to do
# Python attribute assignment, not C struct member assignment
obj.__class__ = some_new_class
to change an object's class, with one of the restrictions being that both the old and new classes must be "heap types", which all classes implemented in Python are and most classes implemented in C are not. (types.ModuleType and subclasses of that type are also specifically permitted, despite types.ModuleType not being a heap type. See the source for exact restrictions.)
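A quick illustration with two ordinary Python classes (both heap types with identical layouts):
class A:
    def speak(self):
        return "I'm an A"

class B:
    def speak(self):
        return "I'm a B"

obj = A()
obj.__class__ = B     # allowed: both are heap types with compatible layouts
print(obj.speak())    # I'm a B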
If you want to create a heap type from C, you can, but the interface is pretty different from the normal way of defining Python types from C. Plus, for __class__ assignment to work, you have to not set the Py_TPFLAGS_IMMUTABLETYPE flag, and that means that people will be able to monkey-patch your classes in ways you might not like (or maybe you see that as an upside).
If you want to go that route, I suggest looking at the CPython 3.10 _functools module source code for an example. (They set the Py_TPFLAGS_IMMUTABLETYPE flag, which you'll have to make sure not to do.)
The unsupported way
There was an attempt at one point to allow __class__ assignment for non-heap types, as long as the memory layouts worked. It got abandoned because it caused problems with some built-in immutable types, where the interpreter likes to reuse instances. For example, allowing (1).__class__ = SomethingElse would have caused a lot of problems. You can read more in the big comment in the source code for the __class__ setter. (The comment is slightly out of date, particularly regarding the Py_TPFLAGS_IMMUTABLETYPE flag, which was added after the comment was written.)
As far as I know, this was the only problem, and I don't think any more problems have been added since then. The interpreter isn't going to aggressively reuse instances of your classes, so as long as you're not doing anything like that, and the memory layouts are compatible, I think changing the type of your objects should work for now, even for non-heap-types. However, it is not officially supported, so even if I'm right about this working for now, there's no guarantee it'll keep working.
Py_SET_TYPE only sets an object's type pointer. It doesn't do any refcount fixing that might be needed. It's a very low-level operation. If neither the old class nor the new class are heap types, no extra refcount fixing is needed, but if the old class is a heap type, you will have to decref the old class, and if the new class is a heap type, you will have to incref the new class.
If you need to decref the old class, make sure to do it after changing the object's class and possibly incref'ing the new class.
According to the language reference, chapter 3 "Data model" (see here):
An object’s type determines the operations that the object supports (e.g., “does it have a length?”) and also defines the possible values for objects of that type. The type() function returns an object’s type (which is an object itself). Like its identity, an object’s type is also unchangeable.[1]
which, to my mind, states that the type must never change, and changing it would be illegal as it would break the language specification. The footnote, however, states that
[1] It is possible in some cases to change an object’s type, under certain controlled conditions. It generally isn’t a good idea though, since it can lead to some very strange behaviour if it is handled incorrectly.
I don't know of any method to change the type of an object from within Python itself, so the "possible" may indeed refer to the CPython function.
As far as I can see, a PyObject is defined internally as
struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
};
So the reference counting should still work. On the other hand you will segfault the interpreter if you set the type to something that is not a PyTypeObject, or if the pointer is free()d, so the usual caveats.
Apart from that, I agree that the specification is a little ambiguous, but the question of "legality" may not have a good answer. The long and short of it seems to me to be "do not change types unless you know what you are doing, and if you are not hacking on CPython itself, you do not know what you are doing".
Edit: The Py_SET_TYPE function was added in Python 3.9 based on this commit. Apparently, people used to just set the type using
Py_TYPE(obj) = typeobj;
So the inclusion (without being formally announced, as far as I can see) is more akin to adding a convenience function.
I was reading the documentation for attrs. It says:
Please note that true immutability is impossible in Python
I am wondering what the reason for that is. Why can someone not have an immutable list in Python while it is possible in C++? What is the main difference here?
TLDR; "True Immutability" is only possible on an impervious stone tablet, but it's counter-productive to the discussion of mutability, and why it is used / is important. It's not worth being technically correct at the expense of being practically wrong.
This is a bad argument of semantics. Python allows re-defining variable names with different types in an otherwise strongly typed language, which is where some of the confusion comes from, but to be clear the object a variable name refers to can very much be properly immutable.
Take for instance a tuple with a few numbers in it:
>>> tup_A = (1,2,3)
It is not possible to change the values of any of the objects in the tuple:
>>> tup_A[0] = 10
TypeError: 'tuple' object does not support item assignment
It is possible to overwrite the variable name tup_A with some other value, but then it will be a different object entirely even if it was related to the original. For example a slice of a tuple creates an entirely new object rather than a view of the original:
>>> id(tup_A)
2887473131072
>>> tup_A = tup_A[:1]
>>> id(tup_A)
2887473037616
I believe the article mentioned may also be somewhat referring to the possibility of creating custom immutable types (classes). This is also a bad argument, because there are plenty of mechanisms to enforce immutability. In particular, the tools for customizing attribute access, and the @property decorator, can be used to great effect for this. Once these methods are used to implement immutability, one would have to intentionally break the class to mutate data which was not meant to be mutated. This is of course possible because Python is primarily distributed as source code, but the same could theoretically be said for the Python C API. Tuples don't have to be immutable if you re-write Python, but that's so far beyond the point, it's fair to say it's just wrong.
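For instance, a read-only point type, sketched with @property and no setters:
class Point:
    def __init__(self, x, y):
        self._x = x
        self._y = y

    @property
    def x(self):          # no setter defined, so assignment raises
        return self._x

    @property
    def y(self):
        return self._y

p = Point(1, 2)
print(p.x)    # 1
p.x = 10      # AttributeError: the property has no setter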
Immutability is a tool with a specific purpose. It is a good idea to use it whenever possible so an accidental slip-up will produce an error message rather than a silent bug. If you encounter errors like this, you should never ask "how can I mutate this value which was intended to be immutable?", but rather ask "why is this value not meant to be mutated, and how am I intended to utilize it?"
P.S. You could probably even mutate a tuple without editing cpython using the ctypes library by getting the actual memory locations of the objects contained within it, and overwriting the pointers, but this would break lots of things (like garbage collection ref counting). Don't do this. It's another one of those "so far beyond the point" things.
Actually, other languages like C++ treat variables like storage containers, but Python treats them as references to memory addresses. Lists can be modified in place, i.e. in the same memory location, but we also have tuples, whose values can't be updated in place.
I don't know exactly why Python treats variables this way, but I think it is necessary for dynamic typing. The claim that true immutability isn't possible may refer to this dynamic typing feature. You can Google to learn more.
Please let me know if this is what you wanted.
True immutability is impossible if your memory is mutable.
Think of immutability as checks, but not a hard guarantee that everything will stay the same.
Are there any caveats similar to variable caching when using threading.Timer in Python?
I'm observing an effect similar to not using the "volatile" keyword in other languages, but I've read that this doesn't apply to Python, and yet something is off. In summary, two methods (threads?) agree on the identity of a list variable but disagree on its contents.
I have a class with member variable self.x (a list) which is assigned once in constructor and then its identity never changes (but it can be cleared and refilled).
A Timer is also started in the constructor that periodically updates the contents of self.x. I'm not using any locks though I probably should (eventually).
Other users of the class instance then sometimes try to read the contents of the list.
Problem: At the end of the timer handler, the list is populated correctly (I've printed its contents in the logs) but when a user of the class instance reads the instance variable, it's empty! (more specifically, same value as it was initialized in the constructor, i.e. if I put some other item in there, it'll be that value).
The weird thing is that the getter and the timer agree on the identity of the list! (id(self.x) returns the same value). Also, I was not able to repro this in tests, even though I'm doing the same thing.
Any idea what I might be doing wrong?
Thanks in advance!
Seems like I misunderstood how multiprocessing works. What happened is the code forked into multiple processes and even though the id() of objects remained the same, I didn't realize they were in different (forked) processes. I knew this was happening but I thought that the fork would take care of copying the timer as well. I've changed the code to create the object (and timers) once the fork is complete and it seems to have solved the problem.
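The pitfall is easy to reproduce directly with os.fork (a Unix-only sketch): the child receives a copy of the parent's address space, so id() values match across the processes even though the objects are now independent:
import os

data = ['initial']
pid = os.fork()
if pid == 0:                           # child process
    data.append('from child')
    print('child: ', id(data), data)   # same id as in the parent...
    os._exit(0)
else:
    os.wait()
    print('parent:', id(data), data)   # ...but the parent's list is unchanged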
I am doing some experiments with the Python garbage collector, and I would like to check if a memory address is used or not. In the following example, I have dropped the reference to the string ('surely') at ls[2]. If I run the garbage collector, I can still see surely at the original address. I would like to be sure that the address is now writable. Is there a way to check it in Python?
from ctypes import string_at
from sys import getsizeof
import gc
ls = ['This', 'will be', 'surely', 'deleted']
idsurely = id(ls[2])
sizesurely = getsizeof(ls[2])
ls[2] = 'probably'
print(ls)
print(string_at(idsurely, sizesurely))
gc.collect()
# I check there is nothing in the garbage
print(gc.garbage)
print(string_at(idsurely, sizesurely))
I am interested in this mainly from a theoretical point of view, so I am not saying that it is something that has practical usage. My goal is to show how memory works for a tutorial. I want to show that the data is still there and that the bytes at the address can now be written. The output of the script is so far as expected; I just want to prove this last step.
Not possible.
There is no central registry of used or unused memory addresses in Python. There isn't even a central registry of all objects (the cyclic GC doesn't know about all of them), and even if you had a registry of all objects, that wouldn't be enough to determine what memory locations are in use. Additionally, you can't just read arbitrary memory addresses, or write to arbitrary deallocated addresses. That'll quickly lead to segfaults or worse.
Finally, I would strongly advise against using this kind of thing in a tutorial even if you did find something to make it work. When you put something in a tutorial, a large fraction of people reading the tutorial are going to think it's something they're supposed to learn. Programming newbies should not be misled into thinking that examining possibly-deallocated memory locations is something they should be doing.
Your experiments are way off base. id (solely as a CPython implementation detail) does get the memory address of the object in question, but we're talking about the Python object itself, not the data it contains. sys.getsizeof returns a number that roughly corresponds to how much memory the object occupies, but there is no guarantee that memory is contiguous.
By sheer coincidence, this almost works on str (though it will perform a buffer overread if the string in question has cached copies of its UTF-8 or wchar_t form, so you're risking crashing your program), but even then your test is flawed; CPython interns string literals that look like legal variable names, so if the string in question appears as a literal anywhere else in your program (including as the name of some class or function in some module you imported), it won't actually go away when you replace it. Similar implicit caches can occur if the literal string appears in any function, anywhere (it ends up being not only interned, but stored in the constants for that function).
Update: On testing, in an actual script, the reference count for 'surely' when you hold onto a copy of it is 3, which drops to 2 when you replace it with 'probably'. Turns out constants are being cached even at global scope. The only reason the interactive interpreter doesn't exhibit this behavior is that it effectively evals each line separately, so the constant cache is discarded when the eval completes.
And even if all that's not a problem, most (almost all) memory managers (including CPython's specialized small object heap and the general heap it's built on) don't actually zero out memory when it's released, so if you do look at the same address shortly after it really was released, it'll probably have pretty similar data in it.
Lastly, your gc.collect() call won't change anything except by coincidence (of whatever happens during gc possibly allocating memory by side-effect). str is not a garbage collected type, as it cannot contain references to other Python objects, so it's impossible for it to be a link in a reference cycle, and the CPython garbage collector is solely concerned with collecting cyclic garbage; CPython is reference counted, so anything that's not part of a reference cycle is cleaned up automatically and immediately when the last reference disappears.
The short answer this all leads up to is: There is no way to determine, within CPython, non-heuristically, if a particular memory address has been released to the free store and made available for reuse. CPython's memory management scheme is pure implementation detail, and exposing APIs at that level of detail would create compatibility concerns when people depended on them.
The closest you're going to get is using something like the tracemalloc module to perform basic snapshotting and compute differences in the snapshot. That's not going to give you a window into whether a specific address is still in use though AFAICT; at best it can tell you where an address that's definitely in use was allocated.
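A minimal sketch of that snapshot-and-diff workflow with tracemalloc:
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
data = [str(i) for i in range(10000)]   # allocate something measurable
after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, 'lineno')[:3]:
    print(stat)   # shows where live memory was allocated, not what was freed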
The other approach (specific to CPython) you can use is to just check the reference counts before replacing the object. If sys.getrefcount for a given name/attribute reports 2, then deling (or rebinding) that name/attribute will release it (assuming no threads that might create additional references between the test and the del/rebind). You expect 2, not 1, because calling sys.getrefcount creates a temporary reference to the object in question. If it reports a number greater than 2, deling/rebinding could still lead to the object being deleted eventually when the cyclic garbage collector runs, if the object was part of a reference cycle, but for a reference count of 2 (or 1 for something otherwise unnamed, e.g. sys.getrefcount(''.join(('f', '9'))) or the like), the behavior will be deterministic.
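A small demonstration of that reference-count check:
import sys

x = object()
print(sys.getrefcount(x))   # 2: x itself plus getrefcount's temporary reference

y = x
print(sys.getrefcount(x))   # 3: rebinding x alone would no longer free the object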
From the documentation about gc:
... the collector supplements the reference counting already used in Python...
And from gc.is_tracked():
Returns True if the object is currently tracked by the garbage collector, False otherwise. As a general rule, instances of atomic types aren’t tracked and instances of non-atomic types (containers, user-defined objects…) are.
Strings are not tracked by the garbage collector:
In [1]: import gc
In [2]: test = 'surely'
In [3]: gc.is_tracked(test)
Out[3]: False
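Containers, by contrast, generally are tracked:
In [4]: gc.is_tracked(['surely'])
Out[4]: True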
Looking at the documentation, apart from sys.getrefcount there doesn't seem to be a way of accessing the reference counts from within the language.
Note that at least for me, using string_at doesn't work from the interactive interpreter. It does work in a script.