Pass by object reference good practices - python

I come from C++, and I am struggling to get a sense of safety when programming in Python (for instance misspelling can create extremely hard to find bugs, but that is not the point here).
Here I would like to understand how I can avoid doing horrible things by adhering to good practices.
The simple function below is perfectly fine in C++ but creates what I can only call a monstrosity in Python.
def fun(x):
    x += 1
    x = x + 1
    return x
When I call it
import numpy as np

var1 = 1
print(fun(var1), var1)
var2 = np.array([1])
print(fun(var2), var2)
I get
3 1
[3] [2]
Apart from the lack of homogeneous behaviour (which is already terrible), the second case is particularly hideous. The external variable is modified only by some of the instructions!
I know in detail why it happens, so that is not my question. The point is that when constructing a complex program, I do not want to have to be extra careful with all these context-dependent and highly implicit technicalities.
There must be some good practice I can strictly adhere to that will prevent me from inadvertently producing the code above. I can think of ways, but they seem to overcomplicate the code, making C++ look like a more high level language.
What good practice should I follow to avoid that monstrosity?
Thanks!
[EDIT] Some clarification: What I struggle with is the fact that Python makes a type-dependent and context-dependent choice of creating a temporary. Again, I know the rules. However in C++ the choice is done by the programmer and clear throughout the whole function, while that is not the case in Python. Python requires the programmer to know quite some technicalities of the operations done on the argument in order to figure out if at that point Python is working on a temporary or on the original.
Notice that I constructed a function which both returns a value and has a side effect just to show my point.
The point is that a programmer might want to write that function to simply have side effects (no return statement), and midway through the function Python decides to build a temporary, so some side effects are not applied.
On the other hand the programmer might not want side effects, and instead get some (and hard to predict ones).
In C++ the above is simply and clearly handled. In Python it is rather technical and requires knowing what triggers the generation of temporaries and what not. As I need to explain this to my students, I would like to give them a simple rule that will prevent them from falling into those traps.

Good practices to avoid such pitfalls:
Functions which modify inputs should not return anything (e.g. list.sort)
Functions which do not modify the input should return the modified value (e.g. sorted)
Your fun does both, which goes against the conventions followed by most standard library code and popular third-party Python libraries. Breaking this "unwritten rule" is the cause of the particularly hideous result there.
Generally speaking, it's best if functions are kept "pure" when possible. It's easier to reason about a pure and stateless function, and they're easier to test.
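As a minimal sketch of the two conventions above (the function names are hypothetical, chosen only for illustration), one version mutates its argument and returns nothing, while the other leaves its argument alone and returns a new object:

```python
def increment_in_place(values):
    """Mutate the list that was passed in; return nothing, like list.sort."""
    for index in range(len(values)):
        values[index] += 1


def incremented(values):
    """Leave the input untouched; return a new list, like sorted."""
    return [value + 1 for value in values]


data = [1, 2, 3]
result = increment_in_place(data)  # data is modified, result is None
fresh = incremented(data)          # data is left alone, fresh is a new list
print(data, result, fresh)
```

A caller can tell from the call site which kind of function it is: a bare call means a side effect, an assignment means a new value.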
A "sense of safety" when programming in Python comes from having a good test suite. As an interpreted and dynamic programming language, almost everything in Python happens at runtime. There is very little to protect you at compile time - pretty much only the syntax errors will be found. This is great for flexibility, e.g. virtually anything can be monkeypatched at runtime. With great power comes great responsibility. It is not unusual for a Python project to have twice as much test code as there is library code.

The one good practice that jumps to mind is command-query separation:
A function or method should only ever either compute and return something, or do something, at least when it comes to outside-observable behavior.
There are very few acceptable exceptions (think e.g. the pop method of a Stack data structure: it returns something, and does something) but those tend to be in places where it's so idiomatic, you wouldn't expect it any other way.
And when a function does something to its input values, that should be that function's sole purpose. That way, there's no nasty surprises.
Now for the inconsistent behavior between a "primitive" type and a more complex type, it's easiest to code defensively and assume that it's a reference anyway.
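One possible reading of "code defensively" is to copy the argument before touching it whenever the function is not supposed to have side effects. The sketch below uses a plain list instead of a NumPy array so that it runs without extra packages; the function name is hypothetical, and copying is assumed to be affordable for the data in question.

```python
import copy


def fun_without_side_effects(x):
    """Work on a private copy so the caller's object is never modified."""
    x = copy.deepcopy(x)  # assumption: the argument is cheap enough to copy
    x += [1]              # in-place operation, but only on the copy
    x = x + [1]           # rebinds the local name to a new list
    return x


original = [0]
returned = fun_without_side_effects(original)
print(returned, original)
```

After the copy, it no longer matters which of the two statements works in place and which one rebinds, because neither can reach the caller's object.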


Why is exec() dangerous? [duplicate]

I've seen this multiple times in multiple places, but never have found a satisfying explanation as to why this should be the case.
So, hopefully, one will be presented here. Why should we (at least, generally) not use exec() and eval()?
EDIT: I see that people are assuming that this question pertains to web servers – it doesn't. I can see why an unsanitized string being passed to exec could be bad. Is it bad in non-web-applications?
There are often clearer, more direct ways to get the same effect. If you build a complex string and pass it to exec, the code is difficult to follow, and difficult to test.
Example: I wrote code that read in string keys and values and set corresponding fields in an object. It looked like this:
for key, val in values:
    fieldName = valueToFieldName[key]
    fieldType = fieldNameToType[fieldName]
    if fieldType is int:
        s = 'object.%s = int(%s)' % (fieldName, val)
    # Many clauses like this...
    exec(s)
That code isn't too terrible for simple cases, but as new types cropped up it got more and more complex. When there were bugs they always triggered on the call to exec, so stack traces didn't help me find them. Eventually I switched to a slightly longer, less clever version that set each field explicitly.
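The answer does not show the explicit version it ended up with. As a sketch of one common alternative (not necessarily what the author wrote), setattr can set a field whose name is only known at runtime, with no string of code involved. The Record class and the sample mappings are hypothetical.

```python
class Record:
    """Hypothetical target object; stands in for the answer's `object`."""


def set_fields(target, values, value_to_field_name, field_name_to_type):
    for key, val in values:
        field_name = value_to_field_name[key]
        field_type = field_name_to_type[field_name]
        # Calling the type converts the string; setattr replaces the exec call.
        setattr(target, field_name, field_type(val))


record = Record()
set_fields(
    record,
    [("a", "42"), ("n", "Ada")],
    {"a": "age", "n": "name"},
    {"age": int, "name": str},
)
print(record.age, record.name)
```

If a conversion fails here, the traceback points at the field_type(val) line rather than at an opaque exec call.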
The first rule of code clarity is that each line of your code should be easy to understand by looking only at the lines near it. This is why goto and global variables are discouraged. exec and eval make it easy to break this rule badly.
When you need exec and eval, yeah, you really do need them.
But, the majority of the in-the-wild usage of these functions (and the similar constructs in other scripting languages) is totally inappropriate and could be replaced with other simpler constructs that are faster, more secure and have fewer bugs.
You can, with proper escaping and filtering, use exec and eval safely. But the kind of coder who goes straight for exec/eval to solve a problem (because they don't understand the other facilities the language makes available) isn't the kind of coder that's going to be able to get that processing right; it's going to be someone who doesn't understand string processing and just blindly concatenates substrings, resulting in fragile insecure code.
It's the Lure Of Strings. Throwing string segments around looks easy and fools naïve coders into thinking they understand what they're doing. But experience shows the results are almost always wrong in some corner (or not-so-corner) case, often with potential security implications. This is why we say eval is evil. This is why we say regex-for-HTML is evil. This is why we push SQL parameterisation. Yes, you can get all these things right with manual string processing... but unless you already understand why we say those things, chances are you won't.
eval() and exec() can promote lazy programming. More importantly it indicates the code being executed may not have been written at design time therefore not tested. In other words, how do you test dynamically generated code? Especially across browsers.
Security aside, eval and exec are often marked as undesirable because of the complexity they induce. When you see an eval call you often don't know what's really going on behind it, because it acts on data that's usually in a variable. This makes code harder to read.
Invoking the full power of the interpreter is a heavy weapon that should be only reserved for very tricky cases. In most cases, however, it's best avoided and simpler tools should be employed.
That said, like all generalizations, be wary of this one. In some cases, exec and eval can be valuable. But you must have a very good reason to use them. See this post for one acceptable use.
In contrast to what most answers are saying here, exec is actually part of the recipe for building super-complete decorators in Python, as you can duplicate everything about the decorated function exactly, producing the same signature for the purposes of documentation and such. It's key to the functionality of the widely used decorator module (http://pypi.python.org/pypi/decorator/). Other cases where exec/eval are essential is when constructing any kind of "interpreted Python" type of application, such as a Python-parsed template language (like Mako or Jinja).
So it's not like the presence of these functions is an immediate sign of an "insecure" application or library. Using them in the naive javascripty way to evaluate incoming JSON or something, yes that's very insecure. But as always, it's all in the way you use it and these are very essential functions.
I have used eval() in the past (and still do from time-to-time) for massaging data during quick and dirty operations. It is part of the toolkit that can be used for getting a job done, but should NEVER be used for anything you plan to use in production such as any command-line tools or scripts, because of all the reasons mentioned in the other answers.
You cannot trust your users--ever--to do the right thing. In most cases they will, but you have to expect them to do all of the things you never thought of and find all of the bugs you never expected. This is precisely where eval() goes from being a tool to a liability.
A perfect example of this would be constructing a QuerySet in Django. A query accepts keyword arguments that look something like this:
results = Foo.objects.filter(whatever__contains='pizza')
If you're programmatically assigning arguments, you might think to do something like this:
results = eval("Foo.objects.filter(%s__%s=%s)" % (field, matcher, value))
But there is always a better way that doesn't use eval(), which is unpacking a dictionary into keyword arguments:
results = Foo.objects.filter( **{'%s__%s' % (field, matcher): value} )
By doing it this way, it's not only faster performance-wise, but also safer and more Pythonic.
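The same dictionary-unpacking technique can be tried without Django. In the sketch below, describe_filter is a hypothetical stand-in for any function that accepts keyword arguments; it only reports what it received.

```python
def describe_filter(**conditions):
    """Hypothetical stand-in for a function that takes keyword arguments."""
    return sorted(conditions.items())


field, matcher, value = "whatever", "contains", "pizza"
lookup = {"%s__%s" % (field, matcher): value}

# ** unpacks the dictionary, so the computed key becomes the keyword name.
print(describe_filter(**lookup))
```

The value travels as an ordinary Python object, so it never needs to be quoted or escaped into source code.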
Moral of the story?
Use of eval() is ok for small tasks, tests, and truly temporary things, but bad for permanent usage because there is almost certainly always a better way to do it!
Allowing these functions in a context where they might run user input is a security issue, and sanitizers that actually work are hard to write.
Same reason you shouldn't login as root: it's too easy to shoot yourself in the foot.
Don't try to do the following on your computer:
s = "import shutil; shutil.rmtree('/nonexisting')"
exec(s)
Now assume somebody can control s from a web application, for example.
Reason #1: One security flaw (ie. programming errors... and we can't claim those can be avoided) and you've just given the user access to the shell of the server.
Try this in the interactive interpreter and see what happens:
>>> import sys
>>> eval('{"name" : %s}' % ("sys.exit(1)"))
Of course, this is a corner case, but it can be tricky to prevent things like this.
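When the string is only supposed to contain data, the standard library's ast.literal_eval is one alternative that is not mentioned in the answers above. As a small sketch, it accepts Python literals and refuses anything that would have to be executed, such as the function call in the example:

```python
import ast

# A plain literal is parsed into the corresponding Python object.
safe = ast.literal_eval('{"name": "pizza"}')
print(safe)

# A function call is not a literal, so it is rejected instead of being run.
try:
    ast.literal_eval('{"name": sys.exit(1)}')
except ValueError:
    outcome = "rejected"
else:
    outcome = "accepted"
print(outcome)
```

This only covers the "string that should be a literal" case; it is not a general sandbox for arbitrary code.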

Returning a Mutable Argument From A Python Function

Looking for some style clarification on Python function design.
I understand that Python is "pass reference by value" in its function call semantics, however I still often see code where people have returned a mutable object they've done work on within the function. A simple example:
def example(pandasDataframe):
    pandasDataframe['New Col'] = pandasDataframe['Current Col'] + 'Foo'
    return pandasDataframe
And then the function is used like this
df = example(df)
To me, this is a waste of time as both the return statement and the assignment in the call are simply not required (for mutable objects). Yet this is a very common idiom (particularly in Pandas code)
Is this a recognised/formal convention when coding in Python?
I'm wondering if this is regarded as laudable defensive programming, or at least some of the time a lack of understanding by some programmers?
Can anyone clarify any formal rules, or if this is left to the developer as a matter of personal taste/opinion?
Phil, this idiom is called Method Chaining.
In your case you can apply a DataFrame method directly to the result of example function. E.g.
mean_val = example(DataFrame.from_dict(some_dict)).applymap(some_func)["some_field"].mean()
Without chaining you need to write something like this:
tmp_frame = DataFrame.from_dict(some_dict)
example(tmp_frame)
tmp_frame = tmp_frame.applymap(some_func)
tmp_slice = tmp_frame["some_field"]
mean_val = tmp_slice.mean()
My 2 cents on this:
To start, Python is more easily thought of as pass by object. Labels such as pass by value and pass by reference assume a C/Java context as common ground that we can all agree upon. But when something new comes around that cannot be defined by the current conventions or concepts, it produces this kind of debate.
My answer goes back to design and programming paradigms and assumptions that depend on backgrounds and how people are used to develop software.
Functional programming (pure functions) answers some of these. Object-oriented programming allows what is possible in Python with mutability (and in a way directly related to your question).
Python enables many paradigms to be used, and gives a lot of power (which should be used wisely), and that is why Python relies on many conventions.
I would say that returning a value (such as in your case) is a good approach as it is clear, explicit and easier to maintain and refactor (Explicit is better than implicit.; If the implementation is hard to explain, it's a bad idea.)

When coding in Python, how do I achieve guarantees of correctness similar to those I get with Haskell's type system?

Using Haskell's type system I know that at some point in the program, a variable must contain say an Int or a list of strings. For code that compiles, the type checker offers certain guarantees that for instance I'm not trying to add an Int and a String.
Are there any tools to provide similar guarantees for Python code?
I know about and practice TDD.
The quick answer is "not really". While tools like PyLint (which is very good BTW) will give you a lot of help and good advice on what constitutes good Python style, that isn't exactly what you're looking for and it certainly isn't a real substitute for things like HM type inference.
There are some interesting research projects in this area, notably Gradual Typing by Jeremy Siek and colleagues and some really interesting ideas like the blame calculus of Wadler and Findler.
Practically speaking, I think the best you can achieve is by using some sensibly chosen runtime methods. Use the inspect module to test the type of an object (but remember to be true to Python's duck typing and so on). Use assert statements liberally. Or (possibly 'and') use something like Design by Contract using decorators. There are lots of ways to implement these idioms, but this is typically done on a per-project basis. You may want to think about whether and how such methods affect the performance and resource usage of your programs, if this is critical for you. There have, however, been some efforts to standardise techniques like DBC for Python, but these haven't (yet) been pushed into the CPython trunk. Here's hoping though :)
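As a rough sketch of the decorator idea (a per-project helper, not a standard library facility), the hypothetical check_types decorator below uses inspect.signature to match call arguments to parameter names and compares each value against the function's annotations at call time:

```python
import functools
import inspect


def check_types(func):
    """Hypothetical decorator: check annotated argument types at call time."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = func.__annotations__.get(name)
            # Only plain classes are checked; other annotations are ignored.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def add_int(a: int, b: int) -> int:
    return a + b


print(add_int(1, 2))
```

Unlike a compile-time checker, this only catches a mistake on code paths that actually run, and every call pays for the check.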
Python is a dynamically and strongly typed programming language. What that means is that you can define a variable without explicitly stating its type, but the value it is bound to always has a definite type, and operations between incompatible types are not silently coerced.
For example,
after x = 5, x holds an integer, so you cannot concatenate it with a string: x + "hello" raises a TypeError.
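A short sketch of that behaviour: the addition fails at runtime because the two values have incompatible types, while the name itself can later be rebound to a value of another type.

```python
x = 5
try:
    x + "hello"          # int + str is not defined
except TypeError as error:
    message = str(error)
print(message)

x = "now a string "      # rebinding the name to a str is allowed
print(x + "hello")
```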

Python design mistakes [closed]

Closed 11 years ago.
A while ago, when I was learning Javascript, I studied Javascript: the good parts, and I particularly enjoyed the chapters on the bad and the ugly parts. Of course, I did not agree with everything, as summing up the design defects of a programming language is to a certain extent subjective - although, for instance, I guess everyone would agree that the keyword with was a mistake in Javascript. Nevertheless, I find it useful to read such reviews: even if one does not agree, there is a lot to learn.
Is there a blog entry or some book describing design mistakes for Python? For instance I guess some people would count the lack of tail call optimization a mistake; there may be other issues (or non-issues) which are worth learning about.
You asked for a link or other source, but there really isn't one. The information is spread over many different places. What really constitutes a design mistake, and do you count just syntactic and semantic issues in the language definition, or do you include pragmatic things like platform and standard library issues and specific implementation issues? You could say that Python's dynamism is a design mistake from a performance perspective, because it makes it hard to make a straightforward efficient implementation, and it makes it hard (I didn't say completely impossible) to make an IDE with code completion, refactoring, and other nice things. At the same time, you could argue for the pros of dynamic languages.
Maybe one approach to start thinking about this is to look at the language changes from Python 2.x to 3.x. Some people would of course argue that print being a function is inconvenient, while others think it's an improvement. Overall, there are not that many changes, and most of them are quite small and subtle. For example, map() and filter() return iterators instead of lists, range() behaves like xrange() used to, and dict methods like dict.keys() return views instead of lists. Then there are some changes related to integers, and one of the big changes is binary/string data handling. It's now text and data, and text is always Unicode. There are several syntactic changes, but they are more about consistency than revamping the whole language.
From this perspective, it appears that Python has been pretty well designed on the language (syntax and semantics) level since at least 2.x. You can always argue about indentation-based block syntax, but we all know that doesn't lead anywhere... ;-)
Another approach is to look at what alternative Python implementations are trying to address. Most of them address performance in some way, some address platform issues, and some add or make changes to the language itself to more efficiently solve certain kinds of tasks. Unladen swallow wants to make Python significantly faster by optimizing the runtime byte-compilation and execution stages. Stackless adds functionality for efficient, heavily threaded applications by adding constructs like microthreads and tasklets, channels to allow bidirectional tasklet communication, scheduling to run tasklets cooperatively or preemptively, and serialisation to suspend and resume tasklet execution. Jython allows using Python on the Java platform and IronPython on the .Net platform. Cython is a Python dialect which allows calling C functions and declaring C types, allowing the compiler to generate efficient C code from Cython code. Shed Skin brings implicit static typing into Python and generates C++ for standalone programs or extension modules. PyPy implements Python in a subset of Python, and changes some implementation details like adding garbage collection instead of reference counting. The purpose is to allow Python language and implementation development to become more efficient due to the higher-level language. Py V8 bridges Python and JavaScript through the V8 JavaScript engine – you could say it's solving a platform issue. Psyco is a special kind of JIT that dynamically generates special versions of the running code for the data that is currently being handled, which can give speedups for your Python code without having to write optimised C modules.
Of these, something can be said about the current state of Python by looking at PEP-3146 which outlines how Unladen Swallow would be merged into CPython. This PEP is accepted and is thus the Python developers' judgement of what is the most feasible direction to take at the moment. Note it addresses performance, not the language per se.
So really I would say that Python's main design problems are in the performance domain – but these are basically the same challenges that any dynamic language has to face, and the Python family of languages and implementations are trying to address the issues. As for outright design mistakes like the ones listed in Javascript: the good parts, I think the meaning of "mistake" needs to be more explicitly defined, but you may want to check out the following for thoughts and opinions:
FLOSS Weekly 11: Guido van Rossum (podcast August 4th, 2006)
The History of Python blog
Is there a blog entry or some book describing design mistakes for Python?
Yes.
It's called the Py3K list of backwards-incompatible changes.
Start here: http://docs.python.org/release/3.0.1/whatsnew/3.0.html
Read all the Python 3.x release notes for additional details on the mistakes in Python 2.
My biggest peeve with Python - and one which was not really addressed in the move to 3.x - is the lack of proper naming conventions in the standard library.
Why, for example, does the datetime module contain a class itself called datetime? (To say nothing of why we have separate datetime and time modules, but also a datetime.time class!) Why is datetime.datetime in lower case, but decimal.Decimal is upper case? And please, tell me why we have that terrible mess under the xml namespace: xml.sax, but xml.etree.ElementTree - what is going on there?
Try these links:
http://c2.com/cgi/wiki?PythonLanguage
http://c2.com/cgi/wiki?PythonProblems
Things that frequently surprise inexperienced developers are candidate mistakes. Here is one, default arguments:
http://www.deadlybloodyserious.com/2008/05/default-argument-blunders/
A personal language peeve of mine is name binding for lambdas / local functions:
fns = []
for i in range(10):
    fns.append(lambda: i)
for fn in fns:
    print(fn()) # !!! always 9 - not what I'd naively expect
IMO, I'd much prefer looking up the names referenced in a lambda at declaration time. I understand the reasons for why it works the way it does, but still...
You currently have to work around it by binding i into a new name whose value doesn't change, using a function closure.
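A minimal sketch of that closure workaround, with a hypothetical helper named make_getter: each call creates a new scope, so every lambda sees its own value rather than the shared loop variable.

```python
def make_getter(value):
    """Return a function that remembers the value passed in at creation time."""
    return lambda: value


fns = [make_getter(i) for i in range(10)]
results = [fn() for fn in fns]
print(results)
```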
This is more of a minor problem with the language, rather than a fundamental mistake, but: Property overriding. If you override a property (using getters and setters), there is no easy way of getting the parent class' property.
Yeah, it's strange but I guess that's what you get for having mutable variables.
I think the reason is that the "i" refers to a box which has a mutable value and the "for" loop will change that value over time, so reading the box value later gets you the only value there is left.
I don't know how one would fix that short of making it a functional programming language without mutable variables (at least without unchecked mutable variables).
The workaround I use is creating a new variable with a default value (default values being evaluated at DEFINITION time in Python, which is annoying at other times) which causes copying of the value to the new box:
fns = []
for i in range(10):
    fns.append(lambda j=i: j)
for fn in fns:
    print(fn()) # works
I find it surprising that nobody mentioned the global interpreter lock.
One of the things I find most annoying in Python is using writelines() and readlines() on a file. readlines() not only returns a list of lines, but it also still has the \n characters at the end of each line, so you have to always end up doing something like this to strip them:
lines = [l.replace("\n", "").replace("\r", "") for l in f.readlines()]
And when you want to use writelines() to write lines to a file, you have to add \n at the end of every line in the list before you write them, sort of like this:
f.writelines([l + "\n" for l in lines])
writelines() and readlines() should take care of endline characters in an OS independent way, so you don't have to deal with it yourself.
You should just be able to go:
lines = f.readlines()
and it should return a list of lines, without \n or \r characters at the end of the lines.
Likewise, you should just be able to go:
f.writelines(lines)
To write a list of lines to a file, and it should use the operating system's preferred end-of-line characters when writing the file; you shouldn't need to do this yourself to the list first.
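For what it's worth, the stripping and re-adding can be kept short with str.splitlines and str.join. The sketch below uses io.StringIO in place of a real file so that it runs anywhere; with a file opened in text mode, Python by default also translates "\n" to the platform's line ending on write.

```python
import io

# Reading: splitlines drops "\n" and "\r\n" endings in one step.
source = io.StringIO("first\nsecond\r\nthird\n")
lines = source.read().splitlines()
print(lines)

# Writing: join the lines with "\n" and add the final newline once.
target = io.StringIO()
target.write("\n".join(lines) + "\n")
print(repr(target.getvalue()))
```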
My biggest dislike is range(), because it doesn't do what you'd expect, e.g.:
>>> for i in range(1,10): print i,
1 2 3 4 5 6 7 8 9
A naive user coming from another language would expect 10 to be printed as well.
You asked for links; I have written a document on that topic some time ago: http://segfaulthunter.github.com/articles/biggestsurprise/
I think there's a lot of weird stuff in python in the way they handle builtins/constants. Like the following:
True = "hello"
False = "hello"
print True == False
That prints True...
def sorted(x):
    print("Haha, pwned")
sorted([4, 3, 2, 1])
Lolwut? sorted is a builtin global function. The worst example in practice is list, which people tend to use as a convenient name for a local variable and end up clobbering the global builtin.
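A small sketch of the list case: inside the hypothetical function below, the local name hides the builtin, yet the original is still reachable through the standard builtins module.

```python
import builtins


def shadowing_demo():
    list = [4, 3, 2, 1]  # this local name now hides the builtin list
    try:
        list("abc")      # fails: a list instance is not callable
    except TypeError:
        shadowed = True
    else:
        shadowed = False
    # The builtin itself is untouched and still available via the module.
    return shadowed, builtins.list("abc")


print(shadowing_demo())
```

Choosing another name, such as items, avoids the problem altogether.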

Is late binding consistent with the philosophy of "readability counts"?

I am sorry all - I am not here to blame Python. This is just a reflection on whether what I believe is right. Being a Python devotee for two years, I have been writing only small apps and singing Python's praises wherever I go. I recently had the chance to read Django's code, and have started wondering if Python really follows its "readability counts" philosophy. For example,
class A:
    a = 10
    b = "Madhu"
    def somemethod(self, arg1):
        self.c = 20.22
        d = "some local variable"
        # do something
        ....
        ...
    def somemethod2 (self, arg2):
        self.c = "Changed the variable"
        # do something 2
        ...
It's difficult to track the flow of code in situations where the instance variables are created upon use (i.e. self.c in the above snippet). It's not possible to see which instance variables are defined when reading a substantial amount of code written in this manner. It becomes very frustrating even when reading a class with just 6-8 methods and not more than 100-150 lines of code.
I am interested in knowing if my reading of this code is skewed by C++/Java style, since most other languages follow the same approach as them. Is there a Pythonic way of reading this code more fluently? What made Python developers adopt this strategy keeping "readability counts" in mind?
The code fragment you present is fairly atypical (which might also be because you probably made it up):
you wouldn't normally have an instance variable (self.c) that is a floating point number at some point, and a string at a different point. It should be either a number or a string all the time.
you normally don't bring instance variables into life in an arbitrary method. Instead, you typically have a constructor (__init__) that initializes all variables.
you typically don't have instance variables named a, b, c. Instead, they have some speaking names.
With these fixed, your example would be much more readable.
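A sketch of what the three fixes might look like together. The class and attribute names are hypothetical and only illustrate the conventions; they are not taken from the question or from Django.

```python
class Account:
    """Hypothetical class: every instance variable is created in __init__."""

    def __init__(self, owner_name):
        self.owner_name = owner_name  # always a string
        self.balance = 0.0            # always a float
        self.last_note = ""           # always a string

    def deposit(self, amount):
        self.balance += amount

    def annotate(self, note):
        self.last_note = note


account = Account("Madhu")
account.deposit(20.22)
account.annotate("first deposit")
print(sorted(vars(account)))
```

Reading __init__ alone is then enough to learn which instance variables exist and what type each one holds.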
A sufficiently talented miscreant can write unreadable code in any language. Python attempts to impose some rules on structure and naming to nudge coders in the right direction, but there's no way to force such a thing.
For what it's worth, I try to limit the scope of local variables to the area where they're used in every language that I use - for me, not having to maintain a huge mental dictionary makes re-familiarizing myself with a bit of code much, much easier.
I agree that what you have seen can be confusing and ought to be accompanied by documentation. But confusing things can happen in any language.
In your own code, you should apply whatever conventions make things easiest for you to maintain the code. With respect to this particular issue, there are a number of possible things that can help.
Using something like Epydoc, you can specify all the instance variables a class will have. Be scrupulous about documenting your code, and be equally scrupulous about ensuring that your code and your documentation remain in sync.
Adopt coding conventions that encourage the kind of code you find easiest to maintain. There's nothing better than setting a good example.
Keep your classes and functions small and well-defined. If they get too big, break them up. It's easier to figure out what's going on that way.
If you really want to insist that instance variables be declared before referenced, there are some metaclass tricks you can use. e.g., You can create a common base class that, using metaclass logic, enforces the convention that only variables that are declared when the subclass is declared can later be set.
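The metaclass approach is not shown here. A simpler built-in mechanism with a similar effect, swapped in for this sketch, is __slots__: a class that declares it only accepts the listed instance attributes, so a mistyped name fails immediately. The Point class is hypothetical.

```python
class Point:
    __slots__ = ("x", "y")  # the only instance attributes allowed

    def __init__(self, x, y):
        self.x = x
        self.y = y


point = Point(1, 2)
try:
    point.z = 3  # not declared in __slots__
except AttributeError:
    rejected = True
else:
    rejected = False
print(rejected)
```

Note that __slots__ was designed mainly to save memory, and a subclass that does not declare its own __slots__ regains a normal instance dictionary, so this is a convention aid rather than a strict guarantee.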
This problem is easily solved by specifying coding standards such as declaring all instance variables in the __init__ method of your object. This isn't really a problem with Python as much as with the programmer.
If what the code is doing becomes mysterious for some reason .. there should either be comments or the function names should make it obvious.
This is just my opinion though.
I personally think not having to declare variables is one of the dangerous things in Python, especially when doing classes. It is all too easy to accidentally create a variable by simple mistyping and then boggle at the code at length, unable to find the mistake.
Adding a property just before you need it will prevent you from using it before it's got a value. Personally, I always find classes hard to follow just from reading source - I read the documentation and find out what it's supposed to do, and then it usually makes sense when I read the source again.
The fact that such stuff is allowed is only useful in rare times for prototyping; while Javascript tends to allow anything and maybe such an example could be considered normal (I don't really know), in Python this is mostly a negative byproduct of omission of type declaration, which can help speeding up development - if you at some point change your mind on the type of a variable, fixing type declarations can take more time than the fixes to actual code, in some cases, including the renaming of a type, but also cases where you use a different type with some similar methods and no superclass/subclass relationship.
