I've seen this multiple times in multiple places, but never have found a satisfying explanation as to why this should be the case.
So, hopefully, one will be presented here. Why should we (at least, generally) not use exec() and eval()?
EDIT: I see that people are assuming that this question pertains to web servers – it doesn't. I can see why an unsanitized string being passed to exec could be bad. Is it bad in non-web-applications?
There are often clearer, more direct ways to get the same effect. If you build a complex string and pass it to exec, the code is difficult to follow, and difficult to test.
Example: I wrote code that read in string keys and values and set corresponding fields in an object. It looked like this:
for key, val in values:
    fieldName = valueToFieldName[key]
    fieldType = fieldNameToType[fieldName]
    if fieldType is int:
        s = 'object.%s = int(%s)' % (fieldName, val)
    #Many clauses like this...
    exec(s)
That code isn't too terrible for simple cases, but as new types cropped up it got more and more complex. When there were bugs they always triggered on the call to exec, so stack traces didn't help me find them. Eventually I switched to a slightly longer, less clever version that set each field explicitly.
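A minimal sketch of what an exec-free version of that loop might look like, assuming each field type can be used as a converter callable. The Record class and the two lookup tables are hypothetical stand-ins for the names in the snippet above; setattr takes the place of the generated string.

```python
class Record:
    """Hypothetical target object with two fields."""
    def __init__(self):
        self.age = None
        self.name = None


# Hypothetical lookup tables, playing the role of valueToFieldName
# and fieldNameToType from the snippet above.
value_to_field_name = {"a": "age", "n": "name"}
field_name_to_type = {"age": int, "name": str}


def apply_values(obj, values):
    """Set each field on obj, converting the raw value with the field's type."""
    for key, val in values:
        field_name = value_to_field_name[key]
        field_type = field_name_to_type[field_name]
        # setattr does what the exec'd assignment string did, without building code.
        setattr(obj, field_name, field_type(val))
    return obj


record = apply_values(Record(), [("a", "42"), ("n", "Ada")])
```

If a conversion fails here, the traceback points at the field_type(val) call with the offending value, rather than at a call to exec.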
The first rule of code clarity is that each line of your code should be easy to understand by looking only at the lines near it. This is why goto and global variables are discouraged. exec and eval make it easy to break this rule badly.
When you need exec and eval, yeah, you really do need them.
But, the majority of the in-the-wild usage of these functions (and the similar constructs in other scripting languages) is totally inappropriate and could be replaced with other simpler constructs that are faster, more secure and have fewer bugs.
You can, with proper escaping and filtering, use exec and eval safely. But the kind of coder who goes straight for exec/eval to solve a problem (because they don't understand the other facilities the language makes available) isn't the kind of coder that's going to be able to get that processing right; it's going to be someone who doesn't understand string processing and just blindly concatenates substrings, resulting in fragile insecure code.
It's the Lure Of Strings. Throwing string segments around looks easy and fools naïve coders into thinking they understand what they're doing. But experience shows the results are almost always wrong in some corner (or not-so-corner) case, often with potential security implications. This is why we say eval is evil. This is why we say regex-for-HTML is evil. This is why we push SQL parameterisation. Yes, you can get all these things right with manual string processing... but unless you already understand why we say those things, chances are you won't.
eval() and exec() can promote lazy programming. More importantly it indicates the code being executed may not have been written at design time therefore not tested. In other words, how do you test dynamically generated code? Especially across browsers.
Security aside, eval and exec are often marked as undesirable because of the complexity they induce. When you see an eval call you often don't know what's really going on behind it, because it acts on data that's usually in a variable. This makes code harder to read.
Invoking the full power of the interpreter is a heavy weapon that should be only reserved for very tricky cases. In most cases, however, it's best avoided and simpler tools should be employed.
That said, like all generalizations, be wary of this one. In some cases, exec and eval can be valuable. But you must have a very good reason to use them. See this post for one acceptable use.
In contrast to what most answers are saying here, exec is actually part of the recipe for building super-complete decorators in Python, as you can duplicate everything about the decorated function exactly, producing the same signature for the purposes of documentation and such. It's key to the functionality of the widely used decorator module (http://pypi.python.org/pypi/decorator/). exec/eval are also essential when constructing any kind of "interpreted Python" type of application, such as a Python-parsed template language (like Mako or Jinja).
So it's not like the presence of these functions is an immediate sign of an "insecure" application or library. Using them in the naive javascripty way to evaluate incoming JSON or something, yes, that's very insecure. But as always, it's all in the way you use them, and these are very essential functions.
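For the "evaluate incoming JSON" case specifically, a small sketch of the non-eval route, using the standard library's json module. The parse_payload helper and the sample strings are made up for illustration.

```python
import json


def parse_payload(text):
    """Parse untrusted text as JSON data; nothing in it is ever executed."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Anything that is not plain JSON data is rejected.
        return None


good = parse_payload('{"name": "pizza", "count": 3}')
# An expression that eval() would happily run is simply invalid JSON here.
bad = parse_payload('{"name": __import__("os").getcwd()}')
```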
I have used eval() in the past (and still do from time-to-time) for massaging data during quick and dirty operations. It is part of the toolkit that can be used for getting a job done, but should NEVER be used for anything you plan to use in production such as any command-line tools or scripts, because of all the reasons mentioned in the other answers.
You cannot trust your users--ever--to do the right thing. In most cases they will, but you have to expect them to do all of the things you never thought of and find all of the bugs you never expected. This is precisely where eval() goes from being a tool to a liability.
A perfect example of this would be constructing a QuerySet in Django. A query accepts keyword arguments that look something like this:
results = Foo.objects.filter(whatever__contains='pizza')
If you're programmatically assigning arguments, you might think to do something like this:
results = eval("Foo.objects.filter(%s__%s=%s)" % (field, matcher, value))
But there is always a better way that doesn't use eval(), which is unpacking a dictionary into keyword arguments:
results = Foo.objects.filter( **{'%s__%s' % (field, matcher): value} )
By doing it this way, it's not only faster performance-wise, but also safer and more Pythonic.
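The same unpacking trick can be tried without Django. The sketch below uses a hypothetical filter_items function that understands only a "contains" matcher; it is a stand-in for illustration, not how a QuerySet works internally.

```python
def filter_items(items, **lookups):
    """Hypothetical stand-in for QuerySet.filter: supports only '<field>__contains'."""
    result = items
    for lookup, value in lookups.items():
        field, matcher = lookup.split("__")
        if matcher != "contains":
            raise ValueError(f"unsupported matcher: {matcher}")
        result = [item for item in result if value in item[field]]
    return result


menu = [{"dish": "pizza margherita"}, {"dish": "lasagne"}]

# The keyword name is built at runtime, then unpacked with **.
field, matcher, value = "dish", "contains", "pizza"
results = filter_items(menu, **{f"{field}__{matcher}": value})
```

Because value travels as an ordinary object and never becomes source code, there is nothing to escape or quote.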
Moral of the story?
Use of eval() is OK for small tasks, tests, and truly temporary things, but bad for permanent usage because there is almost always a better way to do it!
Allowing these functions in a context where they might run user input is a security issue, and sanitizers that actually work are hard to write.
Same reason you shouldn't login as root: it's too easy to shoot yourself in the foot.
Don't try to do the following on your computer:
s = "import shutil; shutil.rmtree('/nonexisting')"
exec(s)  # eval(s) would raise SyntaxError, since eval only accepts expressions
Now assume somebody can control s from a web application, for example.
Reason #1: One security flaw (i.e. a programming error, and we can't claim those can be avoided) and you've just given the user access to the shell of the server.
Try this in the interactive interpreter and see what happens:
>>> import sys
>>> eval('{"name" : %s}' % ("sys.exit(1)"))
Of course, this is a corner case, but it can be tricky to prevent things like this.
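When the input is only supposed to be a Python literal, one way to prevent this is ast.literal_eval from the standard library, which accepts literals and rejects function calls. A minimal sketch, with a hypothetical parse_literal helper:

```python
import ast


def parse_literal(text):
    """Return the literal described by text, or None if it is anything else."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Calls, names, and other non-literal nodes end up here.
        return None


safe = parse_literal('{"name": "spam"}')
# The same string that would exit the interpreter under eval() is rejected.
rejected = parse_literal('{"name": sys.exit(1)}')
```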
I recently found out how to dynamically create variables in python through this method:
vars()['my_variable'] = 'Some Value'
Thus creating the variable my_variable.
My question is, is this a good idea? Or should I always declare the variables ahead of time?
I think it's preferable to use a dictionary if it's possible:
vars_dict = {}
vars_dict["my_variable"] = 'Some Value'
vars_dict["my_variable2"] = 'Some Value'
I think it's more pythonic.
This is a bad idea, since it gets much harder to analyze the code, both for a human being looking at the source, and for tools like pylint or pychecker. You'll be a lot more likely to introduce bugs if you use tricks like that. If you think you need that feature at some time, think really hard if you can solve your problem in a simpler and more conventional way. I've used Python for almost 20 years, and never felt a need to do that.
If you have more dynamic needs, just use a normal dictionary, or possibly something like json.
One of the great things with Python, thanks to its dynamic nature and good standard collection types, is that you can avoid putting logic in text strings. The Python interpreter, the syntax highlighting in your IDE, IntelliSense, and code analysis tools all look at your source code, provide helpful suggestions, and find bugs and weaknesses. This doesn't work if your data structure or logic has been hidden in text strings.
More stupid and rigid languages, such as C++ and Java, often make developers resort to string-based data structures such as XML or JSON, since they don't have convenient collections like Python lists or dicts. This means that you hide business logic from the compiler and other safety checks built into the language or tools, and have to do a lot of checks that your development tools would otherwise do for you. In Python you don't have to do that ... so don't!
There is no guarantee that vars()['myvariable'] = 'Some value' and myvariable = 'Some value' have the same effect. From the documentation:
Without an argument, vars() acts like locals(). Note, the locals
dictionary is only useful for reads since updates to the locals
dictionary are ignored.
This code is simply wrong.
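A small sketch of the difference, assuming CPython. Inside a function, writing through vars() does not create a name that the function body can read; the function and variable names below are made up for illustration.

```python
def assign_through_vars():
    # Inside a function, vars() behaves like locals(); the write is not
    # reflected in the function's real local variables.
    vars()["my_variable"] = "Some Value"
    try:
        return my_variable
    except NameError:
        return "my_variable was never created"


def assign_through_dict():
    # The explicit dictionary works the same everywhere.
    values = {}
    values["my_variable"] = "Some Value"
    return values["my_variable"]


outcome = assign_through_vars()
stored = assign_through_dict()
```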
Pros:
- adds another level of indirection, makes the environment more dynamic
  - in particular, allows avoiding more code duplication
Cons:
- not applicable for function namespaces (due to optimization)
- adds another level of indirection, makes the environment more dynamic
  - "lexical references" are much harder to track and maintain
    - if created names are arbitrary, conflicts are waiting to happen
  - it's hard to find the ins and outs in the code base and predict its behaviour
    - that's why these tricks may upset code checking tools like pylint
- if variables are processed in a similar way, they probably belong together separately from others (in a dedicated dict) rather than reusing a namespace dict, making it a mess in the process
In brief, at the abstraction level Python's language and runtime features are designed for, it's only good in small, well-defined amounts.
I don't see what the advantage of it would be, and it would make your code harder to understand.
So no I don't think it is a good idea.
I am new to dynamic languages in general, and I have discovered that languages like Python prefer simple data structures, like dictionaries, for sending data between parts of a system (across functions, modules, etc).
In the C# world, when two parts of a system communicate, the developer defines a class (possibly one that implements an interface) that contains properties (like a Person class with a Name, Birth date, etc) where the sender in the system instantiates the class and assigns values to the properties. The receiver then accesses these properties. The class is called a DTO and it is "well-defined" and explicit. If I remove a property from the DTO's class, the compiler will instantly warn me of all parts of the code that use that DTO and are attempting to access what is now a non-existent property. I know exactly what has broken in my codebase.
In Python, functions that produce data (senders) create implicit DTOs by building up dictionaries and returning them. Coming from a compiled world, this scares me. I immediately think of the scenario of a large code base where a function producing a dictionary has the name of a key changed (or a key is removed altogether) and boom, tons of potential KeyErrors begin to crop up as pieces of the code base that work with that dictionary and expect a key are no longer able to access the data they were expecting. Without unit testing, the developer would have no reliable way of knowing where these errors will appear.
Maybe I misunderstand altogether. Are dictionaries a best practice tool for passing data around? If so, how do developers solve this kind of problem? How are implicit data structures and the functions that use them maintained? How do I become less afraid of what seems like a huge amount of uncertainty?
Coming from a compiled world, this scares me. I immediately think of
the scenario of a large code base where a function producing a
dictionary has the name of a key changed (or a key is removed
altogether) and boom- tons of potential KeyErrors begin to crop up as
pieces of the code base that work with that dictionary and expect a
key are no longer able to access the data they were expecting.
I would just like to highlight this part of your question, because I feel this is the main point you are trying to understand.
Python's development philosophy is a bit different; as objects can mutate without throwing errors (for example, you can add properties to instances without having them declared in the class) a common programming practice in Python is EAFP:
EAFP
Easier to ask for forgiveness than permission. This common Python
coding style assumes the existence of valid keys or attributes and
catches exceptions if the assumption proves false. This clean and fast
style is characterized by the presence of many try and except
statements. The technique contrasts with the LBYL style common to many
other languages such as C.
The LBYL referred to from the quote above is "Look Before You Leap":
LBYL
Look before you leap. This coding style explicitly tests for
pre-conditions before making calls or lookups. This style contrasts
with the EAFP approach and is characterized by the presence of many if
statements.
In a multi-threaded environment, the LBYL approach can risk
introducing a race condition between “the looking” and “the leaping”.
For example, the code, if key in mapping: return mapping[key] can fail
if another thread removes key from mapping after the test, but before
the lookup. This issue can be solved with locks or by using the EAFP
approach.
So I would say this is a bit of the norm and in Python you expect the objects will behave well and handle themselves with grace (mainly by throwing up lots of exceptions). Traditional "object hiding" and "interface contracts" are not what Python is all about. It is just like learning anything else, you have to acclimate to the programming environment and its rules.
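The two styles quoted above, side by side on a dictionary lookup. This is only an illustrative sketch; the function names and the sample person dictionary are made up.

```python
def lookup_lbyl(mapping, key, default=None):
    """Look before you leap: test the precondition, then do the lookup."""
    if key in mapping:
        return mapping[key]
    return default


def lookup_eafp(mapping, key, default=None):
    """Easier to ask forgiveness: do the lookup, handle the failure."""
    try:
        return mapping[key]
    except KeyError:
        return default


person = {"name": "Ada"}
found = lookup_lbyl(person, "name")
missing = lookup_eafp(person, "birth_date", "unknown")
```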
The other part of your question:
Are dictionaries a best practice tool for passing data around? If so,
how do developers solve this kind of problem?
The answer here depends on your problem domain. If your problem domain does not lend itself to custom objects, then you can pass around any kind of container (lists, tuples, dictionaries). If, however, objects are the only thing you have for passing around decorated ("rich") data, then your code becomes littered with classes that don't define behavior but rather properties of things.
Oh, by the way - this getting of keys and raising KeyError problem is already solved, as Python dictionaries have a get method, which can return a default value (it returns the sentinel None object by default) when a key doesn't exist:
>>> d = {'a': 'b'}
>>> d['b']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'b'
>>> d.get('b')  # returns None, the only object the interactive interpreter does not print
>>> d.get('b','default')
'default'
When using Python for a large project, automated testing is a must. Otherwise you would never dare to do any serious refactoring, and the code base would rot in no time: every change would try to touch as little as possible, leading to bad solutions (simply because you'd be too scared to implement the correct one).
Indeed the above is true even with C++ or, as it often happens with large projects, with mixed-languages solutions.
Just a few HOURS ago, for example, I had to make a branch for a four-line bugfix (one of the lines is a single brace) for a specific customer. The trunk had evolved too much from the version he has in production, and the guy in charge of the release process told me that the customer's use cases had not yet been covered by manual testing in the current version, so I could not upgrade his installation.
The compiler can tell you something, but it cannot provide any confidence in stability after a refactoring if the software is complex. The idea that if some piece of code compiles then it's correct is bogus (possibly with the exception of hello_world.cpp).
That said you normally don't use dictionaries in Python for everything, unless you really care about the dynamic aspect (but in this case the code doesn't access the dictionary with a literal key). If your Python code has a lot of d["foo"] instead of d[k] when using dicts then I'd say there is a smell of a design problem.
I don't think passing dictionaries is the only way to pass structured data across parts of a system. I've seen lots of people use classes for that. Actually namedtuple is a good fit for that as well.
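As a rough illustration of the namedtuple suggestion, here is a hypothetical Person record modelled on the DTO from the question. The field names and sample values are made up.

```python
from collections import namedtuple

# Hypothetical DTO: the fields are declared once, in one place.
Person = namedtuple("Person", ["name", "birth_date"])


def load_person():
    """Sender side: every field must be supplied, or the call raises TypeError."""
    return Person(name="Ada", birth_date="1815-12-10")


person = load_person()

# Receiver side: a misspelled or removed field fails loudly at the access.
try:
    person.nmae
    typo_detected = False
except AttributeError:
    typo_detected = True
```

Unlike a dictionary literal, the definition gives readers and static analysis tools a single place that lists which fields exist.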
Without unit testing, the developer would have no reliable way of
knowing where these errors will appear
Now why would you not write unit tests?
In Python you don't rely on a compiler to catch your errors. If you really need static checking of your code, you can use one of several static analysis tools out there (see this question)
My karate instructor is fond of saying, "a block is a lock is a throw is a blow." What he means is this: When we come to a technique in a form, although it might seem to look like a block, a little creativity and examination shows that it can also be seen as some kind of joint lock, or some kind of throw, or some kind of blow.
So it is with the way the Django template syntax uses the dot (".") character. It perceives it first as a dictionary lookup, but it will also treat it as a class attribute, a method, or list index - in that order. The assumption seems to be that, one way or another, we are looking for a piece of knowledge. Whatever means may be employed to store that knowledge, we'll treat it in such a way as to get it into the template.
Why doesn't Python do the same? If there's a case where I might have assigned a dictionary term spam['eggs'], but know for sure that spam has an attribute eggs, why not let me just write spam.eggs and sort it out the way Django templates do?
Otherwise, I have to catch an AttributeError and add three additional lines of code.
I'm particularly interested in the philosophy that drives this setup. Is it regarded as part of strong typing?
Django templates and Python are two unrelated languages. They also have different target audiences.
In Django templates, the target audience is designers, who probably don't want to learn 4 different ways of doing roughly the same thing (a dictionary lookup). Thus there is a single syntax in Django templates that performs the lookup in several possible ways.
Python has quite a different audience. Developers actually make use of the many different ways of doing similar things, and overload each with distinct meaning. When one fails it should fail, because that is what the developer means for it to do.
JUST MY correct OPINION's opinion is indeed correct. I can't say why Guido did it this way but I can say why I'm glad that he did.
I can look at code and know right away whether some expression is accessing the 'b' key in a dict-like object a, the 'b' attribute on the object a, a method being called on a, or the b index into the sequence a.
Python doesn't have to try all of the above options every time there is an attribute lookup. Imagine if every time one indexed into a list, Python had to try three other options first. List-intensive programs would drag. Python is slow enough!
It means that when I'm writing code, I have to know what I'm doing. I can't just toss objects around and hope that I'll get the information somewhere somehow. I have to know that I want to lookup a key, access an attribute, index a list or call a method. I like it that way because it helps me think clearly about the code that I'm writing. I know what the identifiers are referencing and what attributes and methods I'm expecting the object of those references to support.
Of course Guido van Rossum might have just flipped a coin for all I know (he probably didn't), so you would have to ask him yourself if you really want to know.
As for your comment about having to surround these things with try blocks, it probably means that you're not writing very robust code. Generally, you want your code to expect to get some piece of information from a dict-like object, list-like object or a regular object. You should know which way it's going to do it and let anything else raise an exception.
The exception to this is that it's OK to conflate attribute access and method calls using the property decorator and more general descriptors. This is only good if the method doesn't take arguments.
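A short sketch of that exception: a hypothetical Circle class whose diameter is computed by a method but read like a plain attribute, thanks to the property decorator.

```python
class Circle:
    """Hypothetical class used only to illustrate property."""

    def __init__(self, radius):
        self.radius = radius

    @property
    def diameter(self):
        # Computed on each access; callers read it without parentheses.
        return 2 * self.radius


circle = Circle(radius=1.5)
diameter = circle.diameter
```

This only works because diameter takes no arguments besides self; a computation that needs arguments has to remain an ordinary method call.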
The different methods of accessing attributes do different things. If you have a function foo, the two lines of code
a = foo
a = foo()
do two very different things. Without distinct syntax to reference and call functions there would be no way for Python to know whether the variable should be a reference to foo or the result of running foo. The () syntax removes the ambiguity.
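The distinction can be checked directly. In this sketch foo is a throwaway function that returns 42.

```python
def foo():
    return 42


reference = foo   # no parentheses: the function object itself
result = foo()    # parentheses: the value returned by calling it

same_object = reference is foo
called_later = reference()
```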
Lists and dictionaries are two very different data structures. One of the things that determine which one is appropriate in a given situation is how its contents can be accessed (key Vs index). Having separate syntax for both of them reinforces the notion that these two things are not the same and neither one is always appropriate.
It makes sense for these distinctions to be ignored in a template language: the person writing the HTML doesn't care, and the template language doesn't have function pointers, so it knows you don't want one. Programmers who write the Python that drives the template, however, do care about these distinctions.
In addition to the points already posted, consider this. Python uses special member variables and functions to provide metadata about the object. Both the interpreter and programmers make heavy use of these. For example, both dicts and lists have a __len__ member function. Now, if a dict's data were accessed by using the . operator, a potential ambiguity arises if the dict has a key called __len__. You could special-case these, but many objects have a __dict__ attribute which is a mapping of member names and values. If that object happened to be a container, which also defined a __len__ attribute, you would end up with an utter mess.
Problems like this would end up turning Python into a mishmash of special cases that the programmer would have to constantly be aware of. This would detract from the reason why many people use Python in the first place, i.e., its elegant simplicity.
Now, consider that new users often shadow built-ins (if the code in SO questions is any indication) and having something like this starts to look like a really bad idea, since it would exacerbate the problem many-fold.
In addition to the responses above, it's not practical to merge dictionary lookup and object lookup in general because of the restrictions on object members.
What if your key has whitespace? What if it's an int, or a frozenset, etc.? Dot notation can't account for these discrepancies, so while it's an acceptable tradeoff for a templating language, it's unacceptable for a general-purpose programming language like Python.
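To make the ambiguity concrete, here is a sketch of a hypothetical AttrDict that falls back to key lookup when attribute lookup fails, roughly in the spirit of the template behaviour. It is an illustration of the problem, not a recommended class.

```python
class AttrDict(dict):
    """Hypothetical dict that also answers attribute lookups, template-style."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup has already failed.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


spam = AttrDict({"eggs": 1, "items": "my data", "two words": 2, 3: "three"})

eggs = spam.eggs            # works: no attribute is in the way
items_attr = spam.items     # the dict method wins, not the stored value
items_key = spam["items"]   # the stored value is reachable only by key
# Keys such as "two words" or 3 cannot be written after a dot at all.
```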
I am sorry all - I am not here to blame Python. This is just a reflection on whether what I believe is right. Being a Python devotee for two years, I have been writing only small apps and singing Python's praises wherever I go. I recently had the chance to read Django's code, and have started wondering if Python really follows its "readability counts" philosophy. For example,
class A:
    a = 10
    b = "Madhu"
    def somemethod(self, arg1):
        self.c = 20.22
        d = "some local variable"
        # do something
        ....
    ...
    def somemethod2(self, arg2):
        self.c = "Changed the variable"
        # do something 2
        ...
It's difficult to track the flow of code in situations where the instance variables are created upon use (i.e. self.c in the above snippet). It's not possible to see which instance variables are defined when reading a substantial amount of code written in this manner. It becomes very frustrating even when reading a class with just 6-8 methods and not more than 100-150 lines of code.
I am interested in knowing if my reading of this code is skewed by C++/Java style, since most other languages follow the same approach as them. Is there a Pythonic way of reading this code more fluently? What made Python developers adopt this strategy keeping "readability counts" in mind?
The code fragment you present is fairly atypical (which might also be because you probably made it up):
- You wouldn't normally have an instance variable (self.c) that is a floating point number at some point, and a string at a different point. It should be either a number or a string all the time.
- You normally don't bring instance variables into life in an arbitrary method. Instead, you typically have a constructor (__init__) that initializes all variables.
- You typically don't have instance variables named a, b, c. Instead, they have some speaking names.
With these fixed, your example would be much more readable.
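One possible reading of those three points, applied to a made-up class. The Account name, its fields, and its methods are invented for illustration and do not come from the question.

```python
class Account:
    """Hypothetical rewrite: descriptive names, one type per attribute,
    and every instance variable created in __init__."""

    default_limit = 10  # class-level constant shared by all instances

    def __init__(self, owner):
        self.owner = owner    # always a string
        self.balance = 0.0    # always a float

    def deposit(self, amount):
        self.balance += amount

    def withdraw(self, amount):
        self.balance -= amount


account = Account("Madhu")
account.deposit(20.22)
```

Reading __init__ alone now tells you which instance variables exist and what kind of value each one holds.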
A sufficiently talented miscreant can write unreadable code in any language. Python attempts to impose some rules on structure and naming to nudge coders in the right direction, but there's no way to force such a thing.
For what it's worth, I try to limit the scope of local variables to the area where they're used in every language that I use - for me, not having to maintain a huge mental dictionary makes re-familiarizing myself with a bit of code much, much easier.
I agree that what you have seen can be confusing and ought to be accompanied by documentation. But confusing things can happen in any language.
In your own code, you should apply whatever conventions make things easiest for you to maintain the code. With respect to this particular issue, there are a number of possible things that can help.
Using something like Epydoc, you can specify all the instance variables a class will have. Be scrupulous about documenting your code, and be equally scrupulous about ensuring that your code and your documentation remain in sync.
Adopt coding conventions that encourage the kind of code you find easiest to maintain. There's nothing better than setting a good example.
Keep your classes and functions small and well-defined. If they get too big, break them up. It's easier to figure out what's going on that way.
If you really want to insist that instance variables be declared before they are referenced, there are some metaclass tricks you can use. For example, you can create a common base class that, using metaclass logic, enforces the convention that only variables that are declared when the subclass is declared can later be set.
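A minimal sketch of one such trick, under the assumption that "declared" means "assigned in a class body". DeclaredAttrsMeta, Strict, and Point are hypothetical names, and this is only one of several ways the idea could be implemented.

```python
class DeclaredAttrsMeta(type):
    """Hypothetical metaclass: records the public names declared in class bodies."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        declared = set()
        for klass in cls.__mro__:
            # Collect names assigned in each class body, skipping private ones.
            declared.update(k for k in vars(klass) if not k.startswith("_"))
        cls._declared = frozenset(declared)
        return cls


class Strict(metaclass=DeclaredAttrsMeta):
    """Common base class: instances may only set names declared on the class."""

    def __setattr__(self, name, value):
        if name not in type(self)._declared:
            raise AttributeError(
                f"{name!r} was not declared on {type(self).__name__}"
            )
        super().__setattr__(name, value)


class Point(Strict):
    x = 0
    y = 0


p = Point()
p.x = 3          # declared in the class body, so allowed
try:
    p.z = 1      # never declared, so rejected
    typo_rejected = False
except AttributeError:
    typo_rejected = True
```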
This problem is easily solved by specifying coding standards such as declaring all instance variables in the __init__ method of your object. This isn't really a problem with Python so much as with the programmer.
If what the code is doing becomes mysterious for some reason .. there should either be comments or the function names should make it obvious.
This is just my opinion though.
I personally think not having to declare variables is one of the dangerous things in Python, especially when doing classes. It is all too easy to accidentally create a variable by simple mistyping and then boggle at the code at length, unable to find the mistake.
Adding a property just before you need it will prevent you from using it before it's got a value. Personally, I always find classes hard to follow just from reading source - I read the documentation and find out what it's supposed to do, and then it usually makes sense when I read the source again.
Allowing this sort of thing is useful only on rare occasions, for prototyping. JavaScript tends to allow anything, and there such an example might be considered normal (I don't really know). In Python it is mostly a negative byproduct of omitting type declarations, an omission that can help speed up development: if at some point you change your mind about the type of a variable, fixing type declarations can in some cases take more time than fixing the actual code. That includes renaming a type, but also cases where you switch to a different type that has similar methods and no superclass/subclass relationship.