I was wondering if there is a proper Python convention to distinguish between functions that alter their arguments in place and functions that leave their arguments intact and return an altered copy. For example, consider two functions that apply some permutation. Function f takes a list as an argument and shuffles the elements around, while function g takes a list, copies it, applies f and then returns the altered copy.
For the above example, I figured that f could be called permute(x), while g could be permutation(x), using verbs and nouns to distinguish. This is not always optimal, though, and in some cases it might lead to confusion as to whether an argument is going to be changed along the way - especially if f were to have some other value to return.
Is there a standard way to deal with this problem?
There is no handy naming convention, at least none written down in places like PEP 8.
The Python standard library does use such a convention to a certain extent. Compare:
listobj.sort()
listobj.reverse()
with
sorted_listobj = sorted(listobj)
reversed_iterator = reversed(listobj)
Using a verb when acting on the object directly, an adjective when returning a new object.
The convention isn't consistent. When enumerating an iterable, you use enumerate(), not enumerated(). filter() doesn't alter the input sequence in place. round() doesn't touch the input number object. compile() produces a new bytecode object. Etc. But none of those operations have in-place equivalents in the Python standard library anyway, and I am not aware of any use of adjectives where the input is altered in-place.
Related
In python and (optionally) pep8, is there a conventional a way to signal to the user that a passed parameter (e.g. a dict) will be modified by the function being called?
Returning None from the function is used as an indicator that the object may have been modified in-place. For example, sort() returns None, while sorted() returns a sorted copy of the input list and leaves the input itself alone (sort() being a method on the list, sorted() a built-in function).
It's not the best indicator: having good documentation and spelling it out near the top of a doc-string is probably better (for example as random.shuffle does).
But it is what built-in and standard library functions seem to do.
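A minimal illustration of that convention, using only built-ins:

```python
nums = [3, 1, 2]
result = nums.sort()       # in-place: mutates nums...
assert result is None      # ...and signals it by returning None
assert nums == [1, 2, 3]

data = [3, 1, 2]
copy = sorted(data)        # returns a new sorted list
assert copy == [1, 2, 3]
assert data == [3, 1, 2]   # the input is left untouched
```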
Brief note on some other libraries: numpy.sort returns a sorted copy and does not modify its input in place, so it behaves differently from the built-in list.sort method. Pandas functions and methods often take an inplace boolean argument, which tends to default to False, so by default a modified copy is returned.
I have recently discovered that lists in python are automatically passed by reference (unless the notation array[:] is used). For example, these two functions do the same thing:
def foo(z):
    z.append(3)

def bar(z):
    z.append(3)
    return z
x = [1, 2]
y = [1, 2]
foo(x)
bar(y)
print(x, y)
Before now, I always returned arrays that I manipulated, because I thought I had to. Now I understand it's superfluous (and perhaps inefficient), but it seems like returning values is generally good practice for code readability. My question is: are there any issues with either of these methods, and what are the best practices? Is there a third option that I am missing? I'm sorry if this has been asked before, but I couldn't find anything that really answers my question.
This answer works on the assumption that the decision as to whether to modify your input in-place or return a copy has already been made.
As you noted, whether or not to return a modified object is a matter of opinion, since the result is functionally equivalent. In general, it is considered good form to not return a list that is modified in-place. According to the Zen of Python (item #2):
Explicit is better than implicit.
This is borne out in the standard library. List methods are notorious for this on Stack Overflow: list.append, list.insert, list.extend, list.sort, etc. all modify in place and return None.
Numpy also uses this pattern frequently, since it often deals with large data sets that would be impractical to copy and return. A common example is the array method numpy.ndarray.sort, not to be confused with the top-level function numpy.sort, which returns a new copy.
The idea is something that is very much a part of the Python way of thinking. Here is an excerpt from Guido's email that explains the whys and wherefores:
I find the chaining form a threat to readability; it requires that the reader must be intimately familiar with each of the methods. The second [unchained] form makes it clear that each of these calls acts on the same object, and so even if you don't know the class and its methods very well, you can understand that the second and third call are applied to x (and that all calls are made for their side-effects), and not to something else.
Python built-ins, as a rule, will not do both, to avoid confusion over whether the function/method modifies its argument in place or returns a new value. When modifying in place, no return is performed (making it implicitly return None). The exceptions are cases where a mutating function returns something other than the object mutated (e.g. dict.pop, dict.setdefault).
It's generally a good idea to follow the same pattern, to avoid confusion.
The "best practice" is technically to not modify the thing at all:
def baz(z):
    return z + [3]
x = [1, 2]
y = baz(x)
print(x, y)
but in general it's clearer if you restrict yourself to either returning a new object or modifying an object in-place, but not both at once.
There are examples in the standard library that both modify an object in-place and return something (the foremost example being list.pop()), but that's a special case because it's not returning the object that was modified.
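For instance, dict.pop mutates the dictionary but returns the removed value, not the dictionary itself:

```python
d = {"a": 1, "b": 2}
value = d.pop("a")       # mutates d and returns the value that was removed
assert value == 1
assert d == {"b": 2}

# Passing a default avoids a KeyError for missing keys
assert d.pop("missing", None) is None
```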
There is no strict rule, of course. However, a function should either do something or return something. So you'd better either modify the list in place without returning anything, or return a new one, leaving the original unchanged.
Note: the list is not exactly passed by reference; it's the value of the reference that is actually passed. Keep that in mind if you re-assign the parameter inside the function - the caller's variable will not be affected.
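To see the difference, compare mutating through the reference with rebinding the local name:

```python
def mutate(z):
    z.append(3)      # mutates the object the caller also sees

def rebind(z):
    z = z + [3]      # rebinds the local name only; the caller is unaffected

x = [1, 2]
mutate(x)
assert x == [1, 2, 3]

y = [1, 2]
rebind(y)
assert y == [1, 2]
```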
I realize this may be a bit broad, but I thought this was an interesting question that I haven't really seen an answer to. It may be hidden in the Python documentation somewhere, but as I'm new to Python I haven't gone through all of it yet.
So.. are there any general rules about things that we cannot assign to variables? Everything in Python is an object, and we can use variables for the typical standard usage of storing strings, integers, lists, aliases to other variables, references to classes, etc. - and, if we're clever, even something along the lines of the below (off the top of my head), wherever it may be useful:
var = lambda: some_function()
or storing the result of a comparison to clean code up, such as:
var = some_value < some_value ...
So, that being said I've never come across anything that I couldn't store as a variable if I really wanted to, and was wondering if there really are any limitations?
You can't store syntactical constructs in a variable. For example, you can't do
command = break
while condition:
    if other_condition:
        command
or
operator = +
three = 1 operator 2
You can't really store expressions and statements as objects in Python.
Sure, you can wrap an expression in a lambda, and you can wrap a series of statements in a code object or callable, but you can't easily manipulate them. For instance, changing all instances of addition to multiplication is not readily possible.
To some extent, this can be worked around with the ast module, which provides for parsing Python code into abstract syntax trees. You can then manipulate the trees, instead of the code itself, and pass it to compile() to turn it back into a code object.
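As a sketch of that workaround, an ast.NodeTransformer can rewrite every addition into a multiplication and compile() the result - exactly the kind of manipulation that is not readily possible on the code itself:

```python
import ast

class AddToMul(ast.NodeTransformer):
    """Replace every binary + with * in the tree."""
    def visit_BinOp(self, node):
        self.generic_visit(node)          # transform nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node

tree = ast.parse("1 + 2 + 3", mode="eval")
new_tree = ast.fix_missing_locations(AddToMul().visit(tree))
code = compile(new_tree, "<ast>", "eval")
assert eval(code) == 6                    # (1 * 2) * 3
```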
However, this is a form of indirection, compensating for a feature Python itself lacks. ast can't really compare to the anything-goes flexibility of (say) Lisp macros.
According to the Language Reference, the right hand side of an assignment statement can be an 'expression list' or a 'yield expression'. An expression list is a comma-separated list of one or more expressions. You need to follow this through several more tokens to come up with anything concrete, but ultimately you can find that an 'expression' is any number of objects (literals or variable names, or the result of applying a unary operator such as not, ~ or - to a nested expression_list) chained together by any binary operator (such as the arithmetic, comparison or bitwise operators, or logical and and or) or the ternary a if condition else b.
You can also note in other parts of the language reference that an 'expression' is exactly something you can use as an argument to a function, or as the first part (before the for) of a list comprehension or generator expression.
This is a fairly broad definition - in fact, it amounts to "anything Python resolves to an object". But it does leave out a few things - for example, you can't directly store the less-than operator < in a variable, since it isn't a valid expression by itself (it has to be between two other expressions) and you have to put it in a function that uses it instead. Similarly, most of the Python keywords aren't expressions (the exceptions are True, False and None, which are all canonical names for certain objects).
Note especially that functions are also objects, and hence the name of a function (without calling it) is a valid expression. This means that your example:
var = lambda: some_function()
can be written as:
var = some_function
By definition, a variable is something which can vary, or change. In its broadest sense, a variable is no more than a way of referring to a location in memory in your given program. Another way to think of a variable is as a container to place your information in.
Unlike in popular statically typed languages, variable declaration in Python is not required. You can place pretty much anything in a variable so long as you can come up with a name for it. Furthermore, in addition to the value of a variable in Python being capable of changing, the type often can as well.
To address your question, I would say the limitations on a variable in Python relate only to a few basic necessary attributes:
A name
A scope
A value
(Usually) a type
As a result, things like operators (+ or * for instance) cannot be stored in a variable as they do not meet these basic requirements, and in general you cannot store expressions themselves as variables (unless you're wrapping them in a lambda expression).
As mentioned by Kevin, it's also worth noting that it is possible to sort of store an operator in a variable using the operator module. However, even then you cannot perform the kinds of manipulations that a variable is otherwise subject to, as really you are just making an ordinary value assignment. An example of the operator module:
import operator

operations = {"+": operator.add,
              "-": operator.sub}

operator_variable_string = input('Give me an operand:')
operator_function = operations[operator_variable_string]
result = operator_function(8, 4)
Is there a Python static analysis tool which can detect when function parameters are mutated, therefore causing a side-effect?
that is
def foo(x):
    x.append("x at the end")
will change the calling scope x when x is a list.
Can this reliably be detected? I'm asking because such a tool would make it easier to comply with pure functional approaches.
I suppose a decorator could be used to warn about it (for development) but this wouldn't be as reliable as static analysis.
Your foo function will mutate its argument if it's called with a list - but if it's called with something different, it might raise an exception, or do something that doesn't mutate it.
Similarly, you can write a type that mutates itself every time you call len on it; then a function that just prints the length of its argument would be mutating its argument.
It's even worse if you use an operator like +=, which will call the (generally-mutating) __iadd__ method on types that have it, like list, but will call the (non-mutating) __add__ method on types that don't, like tuple. So, what are you going to do in those cases?
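A concrete illustration of the += asymmetry:

```python
lst = [1, 2]
lst_alias = lst
lst += [3]                      # list.__iadd__ mutates in place
assert lst_alias == [1, 2, 3]   # the alias sees the change

tup = (1, 2)
tup_alias = tup
tup += (3,)                     # tuple has no __iadd__; __add__ builds a new tuple
assert tup_alias == (1, 2)      # the alias still refers to the old object
```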
For that matter, even a for loop over an argument is mutating if you pass in an iterator, but (usually) not if you pass in a sequence.
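For example, a function that merely sums its argument "mutates" an iterator by consuming it, but leaves a list alone:

```python
def total(seq):
    return sum(seq)

it = iter([1, 2, 3])
assert total(it) == 6
assert list(it) == []        # the iterator is now exhausted: a caller-visible change

lst = [1, 2, 3]
assert total(lst) == 6
assert lst == [1, 2, 3]      # the list is unaffected
```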
If you just want to make a list of frequently-mutating method names and operators and search for those, that wouldn't be too hard to write as an AST visitor. But that's going to give you a lot of both false negatives and false positives.
This is exactly the kind of problem that static typing was designed to solve. Python doesn't have static typing built in, but it's possible to build it on top of Python.
First, if you're using Python 3.x, you can use annotations to store the types of the parameters. For example:
def foo(x: MutableSequence) -> None:
    x.append("x at the end")
Now you know, from the fact that it takes a MutableSequence (or a list) rather than a Sequence, that it intends to mutate its parameter. And, even if it doesn't do so now, some future version might well do so, so you should trust its annotations anyway.
And now you can solve your problem the same way you would in Haskell or ML: your pure functional code takes a Sequence and it calls functions with that Sequence, and you just need to ensure that none of those functions is defined to take a MutableSequence, right?
That last part is the hard part. Python doesn't stop me from writing this:
def foo(x: Sequence) -> None:
    x.append("x at the end")
For that, you need a static type checker. Guido has been pushing to standardize annotations to allow the mypy static checker to become a semi-official part of Python. It's not completely finished yet, and it's not as powerful a type system as typical typed functional languages, but it will handle most Python code well enough for what you're looking for. But mypy isn't the only static type checker available; there are others if you search.
Anyway, with a type checker, that foo function would fail with an error explaining that Sequence has no such method append. And if, on the other hand, foo were properly defined as taking a MutableSequence, your functional code that calls it with a Sequence would fail with an error explaining that Sequence is not a subtype of MutableSequence.
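Note that at runtime the annotations are only documentation - Python itself does not enforce them, which is why a checker like mypy is needed. A small sketch (the function names here are just for illustration):

```python
from typing import MutableSequence, Sequence

def append_item(x: MutableSequence) -> None:
    # The MutableSequence annotation advertises the intent to mutate
    x.append("x at the end")

def length(x: Sequence) -> int:
    # Sequence advertises read-only access
    return len(x)

items = [1, 2]
append_item(items)               # a list satisfies MutableSequence
assert items == [1, 2, "x at the end"]
assert length((1, 2, 3)) == 3    # a tuple satisfies Sequence but not MutableSequence
```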
Using the double-star syntax in a function definition, we obtain a regular dictionary. The problem is that it loses the order in which the user passed the arguments. Sometimes, we would like to know in which order keyword arguments were passed to the function.
Since a function call does not usually involve many arguments, I don't think it is a performance problem, so I wonder why the default is not to maintain the order.
I know we can use:
from collections import OrderedDict

def my_func(kwargs):
    print(kwargs)

my_func(OrderedDict(a=1, b=42))
But it is less concise than:
def my_func(**kwargs):
    print(kwargs)
my_func(a=1, b=42)
[EDIT 1]:
1) I thought there were 2 cases:
I need to know the order, this behaviour is known by the user through the documentation.
I do not need the order, so I do not care if it is ordered or not.
I did not consider that, even if the user knows the function uses the order, he could use:
a = dict(a=1, b=42)
my_func(**a)
because he did not know that a dict is not ordered (even if he should know).
2) I thought that the overhead would not be huge in case of a few arguments, so the benefits of having a new possibility to manage arguments would be superior to this downside.
But it seems (from Joe's answer) that the overhead is not negligible.
[EDIT 2]:
It seems that PEP 468 -- Preserving the order of **kwargs in a function -- is going in this direction.
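For the record, PEP 468 was accepted and implemented in Python 3.6, so on modern versions the concise form does preserve the order:

```python
def order_of(**kwargs):
    # Since Python 3.6 (PEP 468), **kwargs preserves the call-site order
    return list(kwargs)

assert order_of(a=1, b=42) == ["a", "b"]
assert order_of(b=42, a=1) == ["b", "a"]
```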
Because dictionaries are not ordered by definition. I think it really is that simple. The point of kwargs is to take care of exactly those formal parameters which are not ordered. If you did know the order then you could receive them as 'normal' parameters or *args.
Here is a dictionary definition.
CPython implementation detail: Keys and values are listed in an
arbitrary order which is non-random, varies across Python
implementations, and depends on the dictionary’s history of insertions
and deletions.
http://docs.python.org/2/library/stdtypes.html#dict
Python's dictionaries are central to the way the whole language works, so they are highly optimised. Adding ordering would impact performance and require more storage and processing overhead.
You may have a case where that's not true, but I think that's more exceptional than common. Adding a feature 'just in case' for a very hot code path is not a sensible design decision.
EDIT:
Just FYI
>>> timeit.timeit(stmt="z = dict(x)", setup='x = ((("one", "two"), ("three", "four"), ("five", "six")))', number=1000000)
1.6569631099700928
>>> timeit.timeit(stmt="z = OrderedDict(x)", setup='from collections import OrderedDict; x = ((("one", "two"), ("three", "four"), ("five", "six")))', number=1000000)
31.618864059448242
That's about a 30x speed difference in constructing a smallish 'normal' size dictionary. OrderedDict is part of the standard library, so I don't imagine there's much more performance that can be squeezed out of it.
As a counter-argument, here is an example of the complicated semantics this would cause. There are a couple of cases here:
The function always gets an unordered dictionary.
The function always gets an ordered dictionary - given this, we don't know if the order has any meaning, as if the user passes in an unordered data structure, the order will be arbitrary, while the data type implies order.
The function gets whatever is passed in - this seems ideal, but it's not that simple.
What about the case of some_func(a=1, b=2, **unordered_dict)? There is implicit ordering in the original keyword arguments, but then the dict is unordered. There is no clear choice here between ordered or not.
Given this, I'd say that ordering the keyword arguments wouldn't be useful, as it would be impossible to tell if the order is just an arbitrary one. This would cloud the semantics of function calling.
Given that, any benefit gained by making this a part of calling is lost - instead, just expect an OrderedDict as an argument.
If your function's arguments are so correlated that both name and order matter, consider using a specific data structure or define a class to hold them. Chances are, you'll want them together in other places in your code, and possibly define other functions/methods that use them.
Retrieving the order of keyword arguments passed via **kwargs would be extremely useful in the particular project I am working on: a kind of n-d numpy array with meaningful dimensions (currently called dimarray), particularly useful for geophysical data handling.
I have posted a developed question with examples here:
How to retrieve the original order of key-word arguments passed to a function call?