Presumably both mylist.reverse() and list.reverse(mylist) end up executing reverse_slice in listobject.c via list_reverse_impl or PyList_Reverse. But how do they actually get there? What are the paths from the Python expressions to the C code in that C file? What connects them? And which of those two reverse functions (if any) do they go through?
Update for the bounty: Dimitris' answer (Update 2: I mean the original version, before it was expanded) and the comments under it explain parts of it, but I'm still missing a few things and would like to see a comprehensive answer.
How do the two paths from the two Python expressions converge? If I understand things correctly, disassembling and discussing the byte code and what happens to the stack, particularly LOAD_METHOD, would clarify this. (As somewhat done by the comments under Dimitris' answer.)
What is the "unbound method" pushed onto the stack? Is it a "C function" (which one?) or a "Python object"?
How can I tell that it's the list_reverse function in the listobject.c.h file? I don't think the Python interpreter is like "let's look for a file that sounds similar and for a function that sounds similar". I rather suspect that the list type is defined somewhere and is somehow "registered" under the name "list", and that the reverse function is "registered" under the name "reverse" (maybe that's what the LIST_REVERSE_METHODDEF macro does?).
I'm not interested (for this question) about stack frames, argument handling, and similar things (so maybe not much of what goes on inside call_function). Really what interests me here is what I said originally, the path from the Python expressions to that C code in that C file. And preferably how I can find such paths in general.
To explain my motivation: For another question I wanted to know what C code does the work when I call list.reverse(mylist). I was fairly confident I had found it by browsing around and searching for the names. But I want to be more certain and generally understand the connections better.
PyList_Reverse is part of the C-API, you'd call it if you were manipulating Python lists in C, it isn't used in any of the two cases.
These both go through list_reverse_impl (actually list_reverse which wraps list_reverse_impl) which is the C function that implements both list.reverse and list_instance.reverse.
Both calls are handled by call_function in ceval, getting there after the CALL_METHOD opcode generated for them is executed (dis.dis the statements to see it). call_function has undergone a good deal of change in Python 3.8 (with the introduction of PEP 590), so what happens from there on is probably too big a subject to get into in a single question.
Additional Questions:
How do the two paths from the two python expressions converge? If I understand things correctly, disassembling and discussing the byte code and what happens to the stack, particularly LOAD_METHOD, would clarify this.
Let's start after both expressions compile to their respective bytecode representations:
l = [1, 2, 3, 4]
Case A, for l.reverse() we have:
1 0 LOAD_NAME 0 (l)
2 LOAD_METHOD 1 (reverse)
4 CALL_METHOD 0
6 RETURN_VALUE
Case B, for list.reverse(l) we have:
1 0 LOAD_NAME 0 (list)
2 LOAD_METHOD 1 (reverse)
4 LOAD_NAME 2 (l)
6 CALL_METHOD 1
8 RETURN_VALUE
We can safely ignore the RETURN_VALUE opcode; it doesn't really matter here.
Let's focus on the individual implementations for each op code, namely, LOAD_NAME, LOAD_METHOD and CALL_METHOD. We can see what gets pushed onto the value stack by viewing what operations are called on it. (Note, it is initialized to point to the value stack located inside the frame object for each expression.)
LOAD_NAME:
What is performed in this case is pretty straightforward. Given our name, l or list in each case (each name is found in co->co_names, a tuple that stores the names we use inside the code object), the steps are:
Look for the name inside locals. If found, go to 4.
Look for the name inside globals. If found, go to 4.
Look for the name inside builtins. If found, go to 4.
Push the value denoted by the name onto the stack. If the name was not found in any of these scopes, raise a NameError.
In case A, the name l is found in the globals. In case B, it is found in the builtins. So, after the LOAD_NAME, the stack looks like:
Case A: stack_pointer -> [1, 2, 3, 4]
Case B: stack_pointer -> <type list>
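As a rough illustration, here is a hand-wavy Python sketch of that lookup order (the names are made up; the real logic lives in C inside the evaluation loop in ceval.c):

def load_name(name, frame_locals, frame_globals, builtins):
    # Try each scope in order; the first hit is what gets pushed
    # onto the value stack.
    for scope in (frame_locals, frame_globals, builtins):
        if name in scope:
            return scope[name]
    raise NameError("name '%s' is not defined" % name)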
LOAD_METHOD:
First, I should note that this opcode is generated only if an attribute access is performed (i.e. obj.attr). You could also grab a method and call it via a = obj.attr and then a(), but that would result in a CALL_FUNCTION opcode being generated (see further down for a bit more).
After loading the name of the callable (reverse in both cases) we search the object on the top of the stack (either [1, 2, 3, 4] or list) for a method named reverse. This is done with _PyObject_GetMethod, its documentation states:
Return 1 if a method is found, 0 if it's a regular attribute
from __dict__ or something returned by using a descriptor
protocol.
A method is only found in Case A, where we access the attribute (reverse) through an instance of list. In Case B, the callable is returned after the descriptor protocol is invoked, so the return value is 0 (but we do, of course, get the object back!).
Here we diverge on the value returned:
In Case A:
SET_TOP(meth);
PUSH(obj); // self
We have a SET_TOP followed by a PUSH. The method replaces the instance at the top of the stack, and the instance is then pushed back on top of it. In this case, stack_pointer now looks like:
stack_pointer -> [1, 2, 3, 4]
<reverse method of lists>
In Case B we have:
SET_TOP(NULL);
Py_DECREF(obj);
PUSH(meth);
Again a SET_TOP followed by a PUSH. The reference count of obj (i.e. list) is decreased because, as far as I can tell, it isn't really needed anymore. In this case, the stack now looks like so:
stack_pointer -> <reverse method of lists>
NULL
For case B, we have an additional LOAD_NAME. Following the previous steps, the stack for Case B now becomes:
stack_pointer -> [1, 2, 3, 4]
<reverse method of lists>
NULL
Pretty similar.
CALL_METHOD:
This doesn't make any modifications to the stack by itself. In both cases it results in a call to call_function, passing the thread state, the stack pointer and the number of positional arguments (oparg).
The only difference is in the expression used to pass the positional arguments.
For case A we need to account for the implicit self that should be inserted as the first positional argument. Since the op code generated for it doesn't signal that a positional argument has been passed (because none have been explicitly passed):
4 CALL_METHOD 0
we call call_function with oparg + 1 = 0 + 1 = 1 to signal that one positional argument exists on the stack ([1, 2, 3, 4]).
In case B, where we explicitly pass the instance as a first argument, this is accounted for:
6 CALL_METHOD 1
so the call to call_function can immediately pass oparg as the value for the positional arguments.
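(For reference, the listings above can be reproduced with the dis module; dis.dis accepts a source string directly, so something like this works:)

import dis
dis.dis("l.reverse()")        # Case A
dis.dis("list.reverse(l)")    # Case B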
What is the "unbound method" pushed onto the stack? Is it a "C function" (which one?) or a "Python object"?
It is a Python object that wraps around a C function. The Python object is a method descriptor and the C function it wraps is list_reverse.
All built-in methods and functions are implemented in C. During initialization, CPython initializes all builtins (see list here) and adds wrappers around all the methods. These wrappers (objects) are descriptors that are used to implement Methods and Functions.
When a method is retrieved from a class via one of its instances, it is said to be bound to that instance. This can be seen by peeking at the __self__ attribute assigned to it:
m = [1, 2, 3, 4].reverse
m() # reverses the list it is bound to, i.e. its __self__
print(m.__self__) # [4, 3, 2, 1]
This method can still be called even though the name of the instance is gone; it remains bound to that instance. (NOTE: this case is handled by the CALL_FUNCTION opcode, not the LOAD/CALL_METHOD ones.)
An unbound method is one that is not yet bound to an instance. list.reverse is unbound, it is waiting to be invoked through an instance to bind to it.
Something being unbound doesn't mean that it cannot be called: list.reverse works just fine if you explicitly pass the self argument yourself. Remember that methods are just special functions that (among other things) implicitly receive self as a first argument after being bound to an instance.
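To see both flavors in action (plain Python, nothing CPython-specific):

l = [1, 2, 3, 4]
list.reverse(l)    # unbound: we supply the instance explicitly as self
print(l)           # [4, 3, 2, 1]

m = l.reverse      # bound: the instance is stored in m.__self__
m()                # no arguments needed
print(l)           # [1, 2, 3, 4]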
How can I tell that it's the list_reverse function in the listobject.c.h file?
This is easy, you can see the list's methods getting initialized in listobject.c. LIST_REVERSE_METHODDEF is simply a macro that, when substituted, adds the list_reverse function to that method table. The tp_methods of a list are then wrapped inside function objects as stated previously.
Things might seem complicated here because CPython uses an internal tool, Argument Clinic, to automate argument handling. This moves definitions around a bit, obfuscating things slightly.
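You can also verify this wrapper machinery from Python itself with ordinary introspection:

print(list.__dict__['reverse'])   # <method 'reverse' of 'list' objects>
print(type(list.reverse))         # <class 'method_descriptor'>
print(type([].reverse))           # <class 'builtin_function_or_method'>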
Related
Would somebody please clear up some technicalities for me.
In my course, it says that a variable doesn't contain a value per se, but a reference in computer memory where the value can be found.
For example:
a = [1, 2, 3]
a contains the reference to the location in computer memory where [1, 2, 3] can be found, sort of like an address.
Does this mean that in my computer, this value of [1, 2, 3] already exists in memory, or am I creating this value [1, 2, 3] on the spot?
a = [1, 2, 3]
causes the following actions by the Python interpreter:
Construct a list containing the elements 1, 2, 3.
Create the variable a.
Make the variable from step 2 refer to the list from step 1.
A disassembly of the function might actually be enlightening in this case. Note: this answer is specific to the implementation and version of Python. This was generated with CPython 3.8.9.
Consider the function:
def x():
    a = [1,2,3]
Very simple. Assign the list [1,2,3] to a local variable a.
Now let's look at the byte code that Python generated for this function:
import dis
dis.dis(x)
2 0 LOAD_CONST 1 (1)
2 LOAD_CONST 2 (2)
4 LOAD_CONST 3 (3)
6 BUILD_LIST 3
8 STORE_FAST 0 (a)
10 LOAD_CONST 0 (None)
12 RETURN_VALUE
I won't get into detail about what all these byte codes mean, but you can see the list of instructions the Python compiler has turned that simple function into. It loads three constants (1, 2, and 3) onto Python's stack, uses the BUILD_LIST 3 operation to build a list from three items on the stack, and replaces them with a reference to the new list. This reference is then STOREd in the local variable 0 (which the programmer named a). Further code would use this.
So, the compiler actually translates your function into, roughly, the commands for "build a new list with contents 1, 2, 3" and "store the reference into a".
So, for a local variable, it is a "slot" (that the programmer has named 'a', and the compiler has named 0) with a reference to a list it just built.
Side note: The constants 1, 2, and 3 that are loaded onto the stack actually exist as references to integer objects in Python, which can have their own functions. For efficiency, CPython keeps a cache of common small numbers so there aren't copies. This is not true of many other programming languages. For example, C and Java can both have variables that contain just an integer without being a reference to an object.
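You can watch that cache in action (CPython-specific behavior; the cached range, -5 through 256, is an implementation detail, not a language guarantee):

a = 256
b = 256
print(a is b)    # True: both names refer to the single cached 256 object

n = 1000
x = n + 1
y = n + 1
print(x is y)    # False in CPython: 1001 lies outside the cache, so each
                 # addition produces a fresh int object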
In my course, it says that a variable doesn't contain a value per se, but a reference in computer memory where the value can be found.
That's true, depending on the definition of "memory" and definition of "value".
Memory might refer to virtual memory or RAM (physical memory). You don't have access to physical RAM directly in Python.
The definition of memory might include CPU registers or it might not. Some might argue that a CPU register is not memory. But still, a value may be stored there. Not in Python, though.
"value" may be an "address" again.
sort of like an address.
IMHO, it's good enough to think of it being an address. It doesn't behave like a pointer in C++, though. You can't directly write to it and the address may change over time.
this value of [1, 2, 3] already exists in memory
First, it exists in your PY file on disk. The Python interpreter will load that file into memory, so yes, those "values" exist in memory as text characters.
The interpreter may then either
find that those values already exist in memory and reuse those, or
find a place for these values and store them in a different format (not as text characters but as int objects).
or am I creating this value [1, 2, 3] on the spot?
As mentioned before, it's kinda both. They exist in memory as text before and are then created again as their proper data types or not created because they already exist.
In addition to the excellent answers by @Thomas Weller and @Barmar, you can see the identity (in CPython, the memory address) of each object by using id() once it is bound to a variable.
a = [1,2,3]
hex(id(a))
'0x1778bff00'
Furthermore, as @Max points out in their comment, this list object is in this case just storing multiple int objects, which have their own memory locations. These can be checked by the same logic -
[hex(id(i)) for i in a]
['0x10326e930', '0x10326e950', '0x10326e970']
Now, if you create another list object b which stores the 3 int objects and the previously defined list object a, you can see these refer to the same memory locations -
b = [1,2,3,a]
[hex(id(i)) for i in b]
['0x10326e930', '0x10326e950', '0x10326e970', '0x1778bff00']
And this also shows the behavior of self-referencing objects, such as an object that stores itself. But for this, b has to be defined once initially, since without memory allocated to b you won't be able to store it in another object (in this case, store it in itself).
b = [1,2,3,b]
hex(id(b)) #'0x1590ad480'
[hex(id(i)) for i in b]
['0x10326e930', '0x10326e950', '0x10326e970', '0x17789f380']
However, if you assign the same list of elements to 2 different variables, while the int objects still point to the same memory, the 2 variables have different memory locations, as expected -
d = [1,2,3]
e = [1,2,3]
print('d ->',hex(id(d)))
print('e ->',hex(id(e)))
print('elements of d ->',[hex(id(i)) for i in d])
print('elements of e ->',[hex(id(i)) for i in e])
d -> 0x16a838840
e -> 0x16a37d880
elements of d -> ['0x10326e930', '0x10326e950', '0x10326e970']
elements of e -> ['0x10326e930', '0x10326e950', '0x10326e970']
Redefining a variable with the same elements will keep the same int objects the elements point to, but the memory location of the list itself changes.
d = [1,2,3]
print('old d ->',hex(id(d)))
d = [1,2,3]
print('new d ->',hex(id(d)))
old d -> 0x16a5e7040
new d -> 0x16a839080
What does a variable actually contain?
In one sense, that's not a question that even makes sense to ask about Python.
In another sense, it depends on the Python implementation. The semantics of variables in Python are simple, but somewhat abstract: a variable associates an object with a name. That's it.
In a = [1,2,3], the name is a and the object is a value of type list. Until the name a goes out of scope, is deleted with del a, or is assigned a new value, a refers to the list [1,2,3].
There is no deeper level, like "a is an address in memory where the list can be found". Python doesn't have a concept of an address space that you can access by location: it just has names for objects that exist... somewhere. Where that somewhere might be isn't important, and Python doesn't provide any way to find out. The only two things you can do with a name are 1) look up its value and 2) make it refer to some other value.
This question already has answers here: "Least Astonishment" and the Mutable Default Argument
I had a very difficult time with understanding the root cause of a problem in an algorithm. Then, by simplifying the functions step by step I found out that evaluation of default arguments in Python doesn't behave as I expected.
The code is as follows:
class Node(object):
    def __init__(self, children=[]):
        self.children = children
The problem is that every instance of Node class shares the same children attribute, if the attribute is not given explicitly, such as:
>>> n0 = Node()
>>> n1 = Node()
>>> id(n1.children)
Out[0]: 25000176
>>> id(n0.children)
Out[0]: 25000176
I don't understand the logic of this design decision. Why did the Python designers decide that default arguments are to be evaluated at definition time? This seems very counter-intuitive to me.
The alternative would be quite heavyweight -- storing "default argument values" in the function object as "thunks" of code to be executed over and over again every time the function is called without a specified value for that argument -- and would make it much harder to get early binding (binding at def time), which is often what you want. For example, in Python as it exists:
def ack(m, n, _memo={}):
    key = m, n
    if key not in _memo:
        if m == 0: v = n + 1
        elif n == 0: v = ack(m-1, 1)
        else: v = ack(m-1, ack(m, n-1))
        _memo[key] = v
    return _memo[key]
...writing a memoized function like the above is quite an elementary task. Similarly:
for i in range(len(buttons)):
buttons[i].onclick(lambda i=i: say('button %s', i))
...the simple i=i, relying on the early-binding (definition time) of default arg values, is a trivially simple way to get early binding. So, the current rule is simple, straightforward, and lets you do all you want in a way that's extremely easy to explain and understand: if you want late binding of an expression's value, evaluate that expression in the function body; if you want early binding, evaluate it as the default value of an arg.
The alternative, forcing late binding for both situations, would not offer this flexibility, and would force you to jump through hoops (such as wrapping your function in a closure factory) every time you needed early binding, as in the above examples -- yet more heavyweight boilerplate forced on the programmer by this hypothetical design decision (beyond the "invisible" ones of generating and repeatedly evaluating thunks all over the place).
In other words, "There should be one, and preferably only one, obvious way to do it [1]": when you want late binding, there's already a perfectly obvious way to achieve it (since all of the function's code is only executed at call time, obviously everything evaluated there is late-bound); having default-arg evaluation produce early binding gives you an obvious way to achieve early binding as well (a plus!-) rather than giving TWO obvious ways to get late binding and no obvious way to get early binding (a minus!-).
[1]: "Although that way may not be obvious at first unless you're Dutch."
The issue is this.
It's too expensive to evaluate a function as an initializer every time the function is called.
0 is a simple literal. Evaluate it once, use it forever.
int is a function (like list) that would have to be evaluated each time it's required as an initializer.
The construct [] is a literal, like 0, that means "this exact object".
The problem is that some people expect it to mean list(), as in "evaluate this function for me, please, to get the object that is the initializer".
It would be a crushing burden to add the necessary if statement to do this evaluation all the time. It's better to take all arguments as literals and not do any additional function evaluation as part of trying to do a function evaluation.
Also, more fundamentally, it's technically impossible to implement argument defaults as function evaluations.
Consider, for a moment, the recursive horror of this kind of circularity. Let's say that instead of default values being literals, we allow them to be functions which are evaluated each time a parameter's default value is required.
[This would parallel the way collections.defaultdict works.]
def aFunc(a=another_func):
    return a*2

def another_func(b=aFunc):
    return b*3
What is the value of another_func()? To get the default for b, it must evaluate aFunc, which requires an eval of another_func. Oops.
Of course in your situation it is difficult to understand. But you must see that evaluating default args every time would lay a heavy runtime burden on the system.
Also you should know, that in case of container types this problem may occur -- but you could circumvent it by making the thing explicit:
def __init__(self, children=None):
    if children is None:
        children = []
    self.children = children
The workaround for this, discussed here (and very solid), is:
class Node(object):
    def __init__(self, children=None):
        self.children = [] if children is None else children
As for why: look for an answer from von Löwis, but it's likely because the function definition makes a code object due to the architecture of Python, and there might not be a facility for working with reference types like this in default arguments.
I thought this was counterintuitive too, until I learned how Python implements default arguments.
A function is an object. When Python executes the def statement, it creates the function object, evaluates the defaults at that point, puts them into a tuple, and adds that tuple as an attribute of the function named func_defaults (__defaults__ in Python 3). Then, when a function is called, if the call doesn't provide a value, Python grabs the default value out of func_defaults.
For instance:
>>> class C():
...     pass
>>> def f(x=C()):
...     pass
>>> f.func_defaults
(<__main__.C instance at 0x0298D4B8>,)
So all calls to f that don't provide an argument will use the same instance of C, because that's the default value.
As far as why Python does it this way: well, that tuple could contain functions that would get called every time a default argument value was needed. Apart from the immediately obvious problem of performance, you start getting into a universe of special cases, like storing literal values instead of functions for non-mutable types to avoid unnecessary function calls. And of course there are performance implications galore.
The actual behavior is really simple. And there's a trivial workaround, in the case where you want a default value to be produced by a function call at runtime:
def f(x=None):
    if x is None:
        x = g()
This comes from Python's emphasis on syntax and execution simplicity. A def statement occurs at a certain point during execution. When the Python interpreter reaches that point, it evaluates the code in that line, and then creates a code object from the body of the function, which will be run later, when you call the function.
It's a simple split between function declaration and function body. The declaration is executed when it is reached in the code. The body is executed at call time. Note that the declaration is executed every time it is reached, so you can create multiple functions by looping.
funcs = []
for x in xrange(5):
    def foo(x=x, lst=[]):
        lst.append(x)
        return lst
    funcs.append(foo)
for func in funcs:
    print "1: ", func()
    print "2: ", func()
Five separate functions have been created, with a separate list created each time the function declaration was executed. On each pass through funcs, the same function is executed twice, using the same list each time. This gives the results:
1: [0]
2: [0, 0]
1: [1]
2: [1, 1]
1: [2]
2: [2, 2]
1: [3]
2: [3, 3]
1: [4]
2: [4, 4]
Others have given you the workaround of using param=None, and assigning a list in the body if the value is None, which is fully idiomatic Python. It's a little ugly, but the simplicity is powerful, and the workaround is not too painful.
Edited to add: For more discussion on this, see effbot's article here: http://effbot.org/zone/default-values.htm, and the language reference, here: http://docs.python.org/reference/compound_stmts.html#function
I'll provide a dissenting opinion, by addressing the main arguments in the other posts.
Evaluating default arguments when the function is executed would be bad for performance.
I find this hard to believe. If default argument assignments like foo='some_string' really add an unacceptable amount of overhead, I'm sure it would be possible to identify assignments to immutable literals and precompute them.
If you want a default assignment with a mutable object like foo = [], just use foo = None, followed by foo = foo or [] in the function body.
While this may be unproblematic in individual instances, as a design pattern it's not very elegant. It adds boilerplate code and obscures default argument values. Patterns like foo = foo or ... don't work if foo can be an object like a numpy array with undefined truth value. And in situations where None is a meaningful argument value that may be passed intentionally, it can't be used as a sentinel and this workaround becomes really ugly.
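In such cases a dedicated sentinel object is a common escape hatch (shown here as a sketch; the name _MISSING is arbitrary):

_MISSING = object()   # unique marker no caller can accidentally supply

def append_to(value, target=_MISSING):
    if target is _MISSING:   # works even when None is a meaningful argument
        target = []
    target.append(value)
    return target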
The current behaviour is useful for mutable default objects that should be shared across function calls.
I would be happy to see evidence to the contrary, but in my experience this use case is much less frequent than mutable objects that should be created anew every time the function is called. To me it also seems like a more advanced use case, whereas accidental default assignments with empty containers are a common gotcha for new Python programmers. Therefore, the principle of least astonishment suggests default argument values should be evaluated when the function is executed.
In addition, it seems to me that there exists an easy workaround for mutable objects that should be shared across function calls: initialise them outside the function.
So I would argue that this was a bad design decision. My guess is that it was chosen because its implementation is actually simpler and because it has a valid (albeit limited) use case. Unfortunately, I don't think this will ever change, since the core Python developers want to avoid a repeat of the amount of backwards incompatibility that Python 3 introduced.
Python function definitions are just code, like all the other code; they're not "magical" in the way that some languages are. For example, in Java you could refer "now" to something defined "later":
public static void foo() { bar(); }
public static void main(String[] args) { foo(); }
public static void bar() {}
but in Python
def foo(): bar()
foo() # boom! "bar" has no binding yet
def bar(): pass
foo() # ok
So, the default argument is evaluated at the moment that that line of code is evaluated!
Because if they had, then someone would post a question asking why it wasn't the other way around :-p
Suppose now that they had. How would you implement the current behaviour if needed? It's easy to create new objects inside a function, but you cannot "uncreate" them (you can delete them, but it's not the same).
For instance:
a = some_process_that_generates_integer_result()
b = a
Someone told me that b and a will point to the same integer object, and thus b would modify the reference count of that object. The code is executed in the function PyObject* ast2obj_expr(void* _o) in Python-ast.c:
static PyObject* ast2obj_object(void *o)
{
    if (!o)
        o = Py_None;
    Py_INCREF((PyObject*)o);
    return (PyObject*)o;
}
......
case Num_kind:
    result = PyType_GenericNew(Num_type, NULL, NULL);
    if (!result) goto failed;
    value = ast2obj_object(o->v.Num.n);
    if (!value) goto failed;
    if (PyObject_SetAttrString(result, "n", value) == -1)
        goto failed;
    Py_DECREF(value);
    break;
However, I think modifying the reference count without a change of ownership is really futile. What I expect is that each variable holding a primitive value (float, integer, etc.) always has its own value, instead of referring to the same object.
And when executing my simple test code, I found that the breakpoint in the above Num_kind branch was never reached:
def some_function(x, y):
    return (x+y)*(x-y)
a = some_function(666666,66666)
print a
b = a
print a
print b
b = a + 999999
print a
print b
b = a
print a
print b
I'm using the python2.7-dbg program provided by Debian. I'm sure the program and the source code match, because many other breakpoints work properly.
So, what does CPython actually do on primitive type objects?
First of all, there are no “primitive objects” in Python. Everything is an object, of the same kind, and they are all handled in the same way on the language level. As such, the following assignments work the same way regardless of the values which are assigned:
a = some_process_that_generates_integer_result()
b = a
In Python, assignments are always reference copies. So whatever the function returns, its reference is copied into the variable a. And then in the second line, the reference is again copied into the variable b. As such, both variables will refer to the exact same object.
You can easily verify this by using the id() function which will tell you the identity of an object:
print id(a)
print id(b)
This will print the same identifying number twice. Note though, that while doing just this, you copied the reference two more times: it's not variables that are passed to functions but copies of references.
This is different from other languages where you often differentiate between “call by value” and “call by reference”. The former means that you create a copy of the value and pass it to a function, which means that new memory is allocated for that value; the latter means that the actual reference is passed and changes to that reference affect the original variable as well.
What Python does is often called “call by assignment”: every function call where you pass arguments is essentially an assignment into new variables (which are then available to the function). And an assignment copies the reference.
When everything is an object, this is actually a very simple strategy. And as I said above, what happens with integers is then no different from what happens to other objects. The only “special” thing about integers is that they are immutable, so you cannot change their values. This means that an integer object always refers to the exact same value. This makes it easy to share the object (in memory) between multiple variables. Every operation that yields a new result gives you a different object, so when you do a series of arithmetic operations, you are actually changing what object a variable is pointing to all the time.
The same happens with other immutable objects too, for example strings. Every operation that yields a changed string gives you a different string object.
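A quick way to see this:

s = "abc"
print(id(s))    # identity of the original string object
s = s + "d"     # builds a brand new string; the name s is rebound
print(id(s))    # a different identity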
Assignments with mutable objects however are the same too. It’s just that changing the value of those objects is possible, so they appear different. Consider this example:
a = [1] # creates a new list object
b = a # copies the reference to that same list object
c = [2] # creates a new list object
b = a + c # concats the two lists and creates a new list object
d = b
# at this point, we have *three* list objects
d.append(3) # mutates the list object
print(d)
print(b) # same result since b and d reference the same list object
Now, coming back to your question and the C code you cited: you are actually looking at the wrong part of CPython for an explanation. The AST is the abstract syntax tree that the parser creates when parsing a file. It reflects the syntax structure of a program but says nothing about the actual run-time behavior yet.
The code you showed for the Num_kind is actually responsible for creating Num AST objects. You can get an idea of this when using the ast module:
>>> import ast
>>> doc = ast.parse('foo = 5')
# the document contains an assignment
>>> doc.body[0]
<_ast.Assign object at 0x0000000002322278>
# the target of that assignment has the id `foo`
>>> doc.body[0].targets[0].id
'foo'
# and the value of that assignment is the `Num` object that was
# created in that C code, with that `n` property containing the value
>>> doc.body[0].value
<_ast.Num object at 0x00000000023224E0>
>>> doc.body[0].value.n
5
If you want to get an idea of the actual evaluation of Python code, you should first look at the byte code. The byte code is what is being executed at run-time by the virtual machine. You can use the dis module to see byte code for Python code:
>>> def test():
...     foo = 5
>>> import dis
>>> dis.dis(test)
2 0 LOAD_CONST 1 (5)
3 STORE_FAST 0 (foo)
6 LOAD_CONST 0 (None)
9 RETURN_VALUE
As you can see, there are two major byte code instructions here: LOAD_CONST and STORE_FAST. LOAD_CONST will just load a constant value onto the evaluation stack. In this example, we just load a constant number, but we could also load the value from a function call instead (try playing with the dis module to figure out how it works).
The assignment itself is made using STORE_FAST. The byte code interpreter does the following for that instruction:
TARGET(STORE_FAST)
{
    v = POP();
    SETLOCAL(oparg, v);
    FAST_DISPATCH();
}
So it essentially gets the value (the reference to the integer object) from the stack, and then calls SETLOCAL, which essentially just assigns the value to the local variable.
Note though, that this does not increase the reference count of that value. That’s what happens with LOAD_CONST, or any other byte code instruction that gets a value from somewhere:
TARGET(LOAD_CONST)
{
    x = GETITEM(consts, oparg);
    Py_INCREF(x);
    PUSH(x);
    FAST_DISPATCH();
}
So tl;dr: Assignments in Python are always reference copies. References are also copied whenever a value is used (but in many other situations that copied reference only exists for a short time). The AST is responsible for creating an object representation of parsed programs (only the syntax), while the byte code interpreter runs the previously compiled byte code to do actual stuff at run-time and deal with real objects.
I have a framework with some C-like language. Now I'm re-writing that framework and the language is being replaced with Python.
I need to find appropriate Python replacement for the following code construction:
SomeFunction(&arg1)
What this does is a C-style pass-by-reference so the variable can be changed inside the function call.
My ideas:
Just return the value, like v = SomeFunction(arg1). This is not so good, because my generic function can have a lot of arguments, like SomeFunction(1,2,'qqq','vvv',.... and many more), and I want to give the user the ability to get the value she wants.
Return a collection of all the arguments, no matter whether they have changed or not, like resulting_list = SomeFunction(1,2,'qqq','vvv',.... and many more) and then interesting_value = resulting_list[3]. This can be improved by giving names to the values and returning a dictionary: interesting_value = resulting_dict['magic_value1'].
It's not good because we have constructions like
DoALotOfStaff( [SomeFunction1(1,2,3,&arg1,'qq',val2),
SomeFunction2(1,&arg2,v1),
AnotherFunction(),
...
], flags1, my_var,... )
And I wouldn't like to burden the user with lists of lists of variables, with names or indexes she (the user) should know. The kind-of-references would be very useful here...
Final Response
I compiled all the answers with my own ideas and was able to produce the solution. It works.
Usage
SomeFunction(1,12, get.interesting_value)
AnotherFunction(1, get.the_val, 'qq')
Explanation
Anything prefixed with get. is a kind-of reference, and its value will be filled in by the function. There is no need to define the value beforehand.
Limitation - currently I support only numbers and strings, but these are sufficient for my use case.
Implementation
I wrote a Getter class which overrides __getattribute__ and produces any variable on demand.
Each newly created variable has a pointer to its container Getter and supports a method set(self, value).
When set() is called, it checks whether the value is an int or a string and creates an object inheriting from int or str accordingly, but with the addition of the same set() method. With this new object we replace our instance in the Getter container. (A rough sketch follows below.)
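Here is a minimal sketch of that idea (my own reconstruction from the description above, not the original code; all names are illustrative and only int and str are handled):

class Getter(object):
    # Produces a fresh placeholder for any attribute accessed on it.
    def __getattr__(self, name):
        ref = _Ref(self, name)
        setattr(self, name, ref)
        return ref

class _Ref(object):
    # A not-yet-filled kind-of reference; functions call set() on it.
    def __init__(self, container, name):
        self._container = container
        self._name = name
    def set(self, value):
        setattr(self._container, self._name,
                _wrap(value, self._container, self._name))

def _wrap(value, container, name):
    # Build an int/str subclass carrying its own set() method,
    # so the value can be overwritten again later.
    base = type(value)
    if base not in (int, str):
        raise TypeError("only numbers and strings are supported")
    def set(self, new_value):
        setattr(container, name, _wrap(new_value, container, name))
    return type(base.__name__.title() + "Ref", (base,), {"set": set})(value)

get = Getter()

def SomeFunction(a, b, out):
    out.set(a + b)        # "fills" the caller's kind-of reference

SomeFunction(1, 12, get.interesting_value)
print(get.interesting_value)   # 13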
Thank you everybody. I will mark as "answer" the response which led me on my way, but all of you helped me somehow.
I would say that your best, cleanest bet would be to construct an object containing the values to be passed and/or modified - this single object can be passed (and will automatically be passed by reference) as a single parameter, and the members can be modified to return the new values.
This will simplify the code enormously and you can cope with optional parameters, defaults, etc., cleanly.
>>> class C:
...     def __init__(self):
...         self.a = 1
...         self.b = 2
...
>>> c = C()
>>> def f(o):
...     o.a = 23
...
>>> f(c)
>>> c
<__main__.C instance at 0x7f6952c013f8>
>>> c.a
23
>>>
Note
I am sure that you could extend this idea to have a class of parameter that carries immutable and mutable data into your function with fixed member names, plus storing the names of the parameters actually passed, and then on return maps the mutable values back into the caller's parameter names. This technique could then be wrapped into a decorator.
I have to say that it sounds like a lot of work compared to re-factoring your existing code to a more object oriented design.
This is how Python works already:
def func(arg):
    arg += ['bar']

arg = ['foo']
func(arg)
print arg
Here, the change to arg automatically propagates back to the caller.
For this to work, you have to be careful to modify the arguments in place instead of re-binding them to new objects. Consider the following:
def func(arg):
    arg = arg + ['bar']

arg = ['foo']
func(arg)
print arg
Here, func rebinds arg to refer to a brand new list and the caller's arg remains unchanged.
Python doesn't come with this sort of thing built in. You could make your own class which provides this behavior, but it will only support a slightly more awkward syntax where the caller would construct an instance of that class (equivalent to a pointer in C) before calling your functions. It's probably not worth it. I'd return a "named tuple" (look it up) instead--I'm not sure any of the other ways are really better, and some of them are more complex.
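A named tuple version of the earlier example might look like this (a sketch; all names are made up):

from collections import namedtuple

Result = namedtuple("Result", ["status", "interesting_value"])

def some_function(a, b):
    return Result(status="ok", interesting_value=a + b)

r = some_function(1, 12)
print(r.interesting_value)   # 13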
There is a major inconsistency here. The drawbacks you're describing against the proposed solutions are related to such subtle rules of good design, that your question becomes invalid. The whole problem lies in the fact that your function violates the Single Responsibility Principle and other guidelines related to it (function shouldn't have more than 2-3 arguments, etc.). There is really no smart compromise here:
either you accept one of the proposed solutions (i.e. Steve Barnes's answer concerning your own wrappers or John Zwinck's answer concerning usage of named tuples) and refrain from focusing on good design subtleties (as your whole design is bad anyway at the moment)
or you fix the design. Then your current problem will disappear as you won't have the God Objects/Functions (the name of the function in your example - DoALotOfStuff really speaks for itself) to deal with anymore.
This question already has answers here: Why variable = object doesn't work like variable = number
There is this code:
# assignment behaviour for integer
a = b = 0
print a, b # prints 0 0
a = 4
print a, b # prints 4 0 - different!
# assignment behaviour for class object
class Klasa:
    def __init__(self, num):
        self.num = num
a = Klasa(2)
b = a
print a.num, b.num # prints 2 2
a.num = 3
print a.num, b.num # prints 3 3 - the same!
Questions:
Why does the assignment operator work differently for fundamental types and class objects (for fundamental types it appears to copy by value, for class objects by reference)?
How do I copy class objects by value only?
How do I make references to fundamental types, like int& b = a in C++?
This is a stumbling block for many Python users. The object reference semantics are different from what C programmers are used to.
Let's take the first case. When you say a = b = 0, a new int object is created with value 0 and two references to it are created (one is a and another is b). These two variables point to the same object (the integer which we created). Now, we run a = 4. A new int object of value 4 is created and a is made to point to that. This means, that the number of references to 4 is one and the number of references to 0 has been reduced by one.
Compare this with a = 4 in C where the area of memory which a "points" to is written to. a = b = 4 in C means that 4 is written to two pieces of memory - one for a and another for b.
Now the second case. a = Klasa(2) creates an object of type Klasa, increments its reference count by one and makes a point to it. b = a simply takes what a points to, makes b point to the same thing and increments the reference count of that thing by one. It's the same as what would happen if you did a = b = Klasa(2). Trying to print a.num and b.num gives the same result since you're dereferencing the same object and printing an attribute value. You can use the id builtin function to see that the object is the same (id(a) and id(b) will return the same identifier). Now, you change the object by assigning a value to one of its attributes. Since a and b point to the same object, you'd expect the change in value to be visible when the object is accessed via a or b. And that's exactly how it is.
Now, for the answers to your questions.
The assignment operator doesn't work differently for these two. All it does is add a reference to the RValue and makes the LValue point to it. It's always "by reference" (although this term makes more sense in the context of parameter passing than simple assignments).
If you want copies of objects, use the copy module.
As I said in point 1, when you do an assignment, you always shift references. Copying is never done unless you ask for it.
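For example, with the Klasa class from the question:

import copy

a = Klasa(2)
b = copy.copy(a)      # a shallow copy: b is a new, independent object
b.num = 3
print a.num, b.num    # 2 3 - changing b no longer affects a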
Quoting from Data Model
Objects are Python’s abstraction for data. All data in a Python
program is represented by objects or by relations between objects. (In
a sense, and in conformance to Von Neumann’s model of a “stored
program computer,” code is also represented by objects.)
From Python's point of view, "fundamental data types" are fundamentally different from those in C/C++; they exist to map C/C++ data types into Python. So let's leave that aside for the time being and consider the fact that all data are objects and are manifestations of some class. Every object has an ID (somewhat like an address), a value, and a type.
All objects are copied by reference. For example:
>>> x=20
>>> y=x
>>> id(x)==id(y)
True
>>>
The only way to have a new instance is by creating one.
>>> x=3
>>> id(x)==id(y)
False
>>> x==y
False
This may sound complicated at first, but to simplify a bit, Python made some types immutable. For example, you can't change a string. You have to slice it and create a new string object.
Often, copying by reference gives unexpected results. For example,
x=[[0]*8]*8 may give you the feeling that it creates a two-dimensional list of 0s. But in fact it creates a list of eight references to the same inner list object. So assigning to x[1][1] would end up changing all the duplicate rows at the same time.
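For example, a shortened 3x3 version of the same trap:

x = [[0]*3]*3
x[1][1] = 5
print x    # [[0, 5, 0], [0, 5, 0], [0, 5, 0]] - every "row" changed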
The copy module provides a method called deepcopy to create a new instance of the object rather than a shallow copy. This is beneficial when you intend to have two distinct objects and manipulate them separately, just as you intended in your second example.
To extend your example
>>> class Klasa:
...     def __init__(self, num):
...         self.num = num
>>> import copy
>>> a = Klasa(2)
>>> b = copy.deepcopy(a)
>>> print a.num, b.num # prints 2 2
2 2
>>> a.num = 3
>>> print a.num, b.num # prints 3 2 - different!
3 2
It doesn't work differently. In your first example, you changed a so that a and b reference different objects. In your second example, you did not, so a and b still reference the same object.
Integers, by the way, are immutable. You can't modify their value. All you can do is make a new integer and rebind your reference. (like you did in your first example)
Suppose you and I have a common friend. If I decide that I no longer like her, she is still your friend. On the other hand, if I give her a gift, your friend received a gift.
Assignment doesn't copy anything in Python, and "copy by reference" is somewhere between awkward and meaningless (as you actually point out in one of your comments). Assignment causes a variable to begin referring to a value. There aren't separate "fundamental types" in Python; while some of them are built-in, int is still a class.
In both cases, assignment causes the variable to refer to whatever it is that the right-hand-side evaluates to. The behaviour you're seeing is exactly what you should expect in that environment, per the metaphor. Whether your "friend" is an int or a Klasa, assigning to an attribute is fundamentally different from reassigning the variable to a completely other instance, with the correspondingly different behaviour.
The only real difference is that the int doesn't happen to have any attributes you can assign to. (That's the part where the implementation actually has to do a little magic to restrict you.)
You are confusing two different concepts of a "reference". The C++ T& is a magical thing that, when assigned to, updates the referred-to object in-place, and not the reference itself; it can never be "reseated" once the reference is initialized. This is useful in a language where most things are values. In Python, everything is a reference to begin with. The Pythonic reference is more like an always-valid, never-null, not-usable-for-arithmetic, automatically-dereferenced pointer. Assignment causes the reference to start referring to a different thing completely. You can't "update the referred-to object in-place" by replacing it wholesale, because Python's objects just don't work like that. You can, of course, update its internal state by playing with its attributes (if there are any accessible ones), but those attributes are, themselves, also all references.
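A short illustration of that last point (plain Python, nothing implementation-specific):

a = [1, 2, 3]
b = a
b = [4, 5, 6]    # rebinds the name b to a new object; a is untouched
print(a)         # [1, 2, 3]

c = a
c[0] = 99        # mutates the object both names refer to
print(a)         # [99, 2, 3]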