Are there any operations in Python which are guaranteed never to fail?

In C++, in order for code to be robust in the presence of exceptions, it is often necessary to rely on the fact that a few simple operations are guaranteed never to fail (and hence never to throw an exception). Examples of these operations include assignment of integers and swapping of standard containers.
Are there any operations in Python which provide this no-fail guarantee?

Python is a higher-level language than C and C++. Anything can involve code execution behind the scenes, and no name is exempt from looking up its current, possibly overridden value. It might be possible to identify some operations that are guaranteed never to raise an exception, but I suspect that set of operations would be so small that it provides no benefit over the usual assumption that anything can raise an exception at any time.
And the identification of those operations would require limiting your Python environment. For example, you can assign a trace function which is invoked for every line of your Python program. With a suitably crafted trace function, even 1+1 could raise an exception. So do we assume that there is no trace function? What about redefining builtins?
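For a taste of how far that goes, here is a minimal sketch (the function name and message are made up for illustration): a trace function installed with sys.settrace is invoked for every traced line, so even 1+1 can be made to raise.

    import sys

    def chaos(frame, event, arg):
        # Invoked on frame entry; returning itself enables per-line tracing.
        if event == "line":
            raise RuntimeError("the trace function objects")
        return chaos

    def demo():
        x = 1 + 1   # even this raises, via the trace function
        return x

    sys.settrace(chaos)
    try:
        demo()
    except RuntimeError as e:
        print("caught:", e)
    finally:
        sys.settrace(None)   # restore sanity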
Practically speaking, you need to adopt a different mindset for Python: exceptions happen, and you can't know ahead of time what they might be. As Mark Amery says in the comments, C++ needs to avoid memory leaks and uninitialized variables, which are not issues in Python.

Related

Why is using global in python threading bad practice?

I read all over various websites how using global is bad. I have an application where I am storing say, 300 objects, in an array. I want to have 8 threads running through these 300 objects. These objects are different sizes, say between 10 and 50,000 integers and randomly distributed (think worst case scenario here).
Basically, I want to start up 8 threads, do a process on an object, report or store the results, and pick up a new object, 300 times.
The solution I can think of is to set a global lock and a global counter, lock the array, get the current object, increment the counter, release the lock.
There is 1 lock for 8 threads. There is 1 counter for 8 threads. I have 2 global objects. I store results in a dictionary, possibly also global to make it visible to all threads, but it also needs to be thread-safe. I am not bothering to do something stupid like subclassing Thread and passing along 300/8 objects to each thread, because multiprocessing.Pool does that for me. So how would you do it? Also, convince me that using global in this situation is bad.
Classifying approaches as either "good" or "bad" is a bit simplistic -- in practice, if a design makes sense to you and accomplishes the goals you set out to accomplish, then it doesn't matter whether other people (except possibly your boss) think it's "good" or not; it either works or it doesn't. On the other hand, if your design causes you a lot of pain and suffering, that's a sign that you might not be using the most suitable design for the task at hand.
That said, there are some valid reasons why a lot of people think that global variables are problematic, particularly when combined with multithreading.
The general problem with global variables (with or without multithreading) is that as your program grows larger, it becomes increasingly difficult to mentally keep track of which parts of your program might be reading and/or updating the global variables' values at which times -- since they are global, by definition all parts of your program have access to them, so when you're trying to trace through your program to figure out who set a global variable to some unexpected value, the list of suspects can become unmanageably large. (This isn't much of a problem for small programs, but the larger your program grows, the worse it becomes -- and a lot of programmers have learned, through painful experience, that it's better to nip the problem in the bud by avoiding globals wherever possible in the first place than to have to go back and rewrite a big, complicated, buggy program later on.)
In the specific use-case of a multithreaded program, the anybody-could-be-accessing-my-global-variable-at-any-time property becomes even more fraught with peril, since in a multithreaded scenario, any (non-immutable) data that is shared between threads can only be safely accessed with proper serialization (e.g. by locking a mutex before reading/writing the shared data, and unlocking it afterwards). Ideally programmers would never accidentally read or write any shared+mutable data without locking the mutex -- but programmers are human and will inevitably make mistakes; if given the ability to do so, sooner or later you (or someone else) will forget that access to a particular global variable needs to be serialized, and will just go ahead and read/write it, and then you're in for a lot of pain, because the symptoms will be rare and random, and the cause of the fault won't be obvious.
So smart programmers try to make it impossible to fall into that sort of trap, usually by limiting access to the shared-state to a specific, small, carefully-written set of functions (a.k.a. an API) that implement the serialization correctly so that no other code has to. When doing that, you want to make sure that only the code in this particular API has access to the shared data, and that nobody else does -- something that is impossible to do with a global variable, as by definition everyone has direct access to it.
There is also one performance-related reason why people prefer not to mix global variables and multithreading: the more serialization you have to do, the less your program can exploit the power of multiple CPU cores. In particular, it does you no good to have an 8-core CPU if 7 of your 8 threads are spending most of their time blocked, waiting for a mutex to become available.
So how does that relate to globals? It's related in that in most cases it's difficult or impossible to prove that a global variable won't ever be accessed by another thread, which means all accesses to that global variable need to be serialized. With a non-global variable, on the other hand, you can make sure to give a reference to that variable to only a single thread -- at which point you have effectively guaranteed that only that one thread will ever access the variable (since the other threads have no references to it, you know they can't access it), and because you have that guarantee, you no longer need to serialize access to that data, and now your thread can run more efficiently because it never has to block waiting for a mutex.
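As a concrete alternative to the global lock + counter design from the question, here is a minimal sketch under assumed placeholders (process_object, NUM_WORKERS, and the synthetic object list are illustrative): each worker pulls from a thread-safe queue.Queue, so no globals and no hand-rolled locking are needed.

    import queue
    import threading

    NUM_WORKERS = 8          # matches the 8 threads in the question

    def process_object(obj):
        return sum(obj)      # stand-in for the real per-object work

    def worker(in_q, out_q):
        while True:
            obj = in_q.get()
            if obj is None:  # sentinel: no more work for this worker
                break
            out_q.put(process_object(obj))

    in_q, out_q = queue.Queue(), queue.Queue()
    objects = [[1] * (10 + n) for n in range(300)]   # ~300 variously sized objects
    for obj in objects:
        in_q.put(obj)
    for _ in range(NUM_WORKERS):
        in_q.put(None)       # one sentinel per worker

    threads = [threading.Thread(target=worker, args=(in_q, out_q))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    results = [out_q.get() for _ in range(len(objects))]
    print(len(results), "results")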
(Btw note that CPython in particular suffers from a severe form of implicit serialization caused by Python's Global Interpreter Lock, which means that even the best multithreaded, CPU-bound Python code is unlikely to use more than a single CPU core at a time. The only way around that is to use multiprocessing instead, or to do the bulk of your program's computations in a lower-level language such as C, so that it can execute without holding the GIL.)

RAII in Python: What's the point of __del__?

At first glance, it seems like Python's __del__ special method offers much the same advantages a destructor has in C++. But according to the Python documentation (https://docs.python.org/3.4/reference/datamodel.html), there is no guarantee that your object's __del__ method ever gets called at all!
It is not guaranteed that __del__() methods are called for objects that still exist when the interpreter exits.
So in other words, the method is useless! Isn't it? A hook function that may or may not get called really doesn't do much good, so __del__ offers nothing with regard to RAII. If I have some essential cleanup, I don't need it to run some of the time, whenever the GC feels like it; I need it to run reliably, deterministically and 100% of the time.
I know that Python provides context managers, which are far more useful for that task, but why was __del__ kept around at all? What's the point?
__del__ is a finalizer. It is not a destructor. Finalizers and destructors are entirely different animals.
Destructors are called reliably, and only exist in languages with deterministic memory management (such as C++). Python's context managers (the with statement) can achieve similar effects in certain circumstances. These are reliable because the lifespan of an object is precisely fixed; in C++, objects die when they are explicitly deleted or when some scope is exited (or when a smart pointer deletes them in response to its own destruction). And that's when destructors run.
Finalizers are not called reliably. The only valid use of a finalizer is as an emergency safety net (NB: this article is written from a .NET perspective, but the concepts translate reasonably well). For instance, the file objects returned by open() automatically close themselves when finalized. But you're still supposed to close them yourself (e.g. using the with statement). This is because the objects are destroyed dynamically by the garbage collector, which may or may not run right away, and with generational garbage collection, it may or may not collect some objects in any given pass. Since nobody knows what kinds of optimizations we might invent in the future, it's safest to assume that you just can't know when the garbage collector will get around to collecting your objects. That means you cannot rely on finalizers.
In the specific case of CPython, you get slightly stronger guarantees, thanks to the use of reference counting (which is far simpler and more predictable than garbage collection). If you can ensure that you never create a reference cycle involving a given object, that object's finalizer will be called at a predictable point (when the last reference dies). This is only true of CPython, the reference implementation, and not of PyPy, IronPython, Jython, or any other implementations.
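To make the contrast concrete, here is a minimal sketch with a hypothetical Resource class: the context-manager exit runs deterministically at the end of the with block, while the finalizer runs only whenever the object happens to be reclaimed.

    class Resource:
        def __init__(self, name):
            self.name = name

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc, tb):
            print(self.name, "released deterministically")

        def __del__(self):
            print(self.name, "finalizer ran (eventually, maybe)")

    with Resource("managed") as r:
        pass                  # __exit__ runs here, guaranteed

    r2 = Resource("unmanaged")
    del r2                    # CPython's refcounting runs __del__ right away;
                              # other implementations make no such promise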
Because __del__ does get called. It's just unclear when it will be, because in CPython, if you have circular references, the refcount mechanism can't take care of reclaiming the object (and thus finalizing it via __del__) and must delegate that to the garbage collector.
The garbage collector then has a problem: it cannot know in which order to break the circular references, because breaking them in the wrong order may trigger additional problems (e.g. freeing memory that is still needed by the finalizer of another object in the same cycle, triggering a segfault).
The caveat you quote exists because the interpreter may exit for reasons that prevent it from performing the cleanup (e.g. it segfaults, or some C module impolitely calls exit()).
There's PEP 442, covering safe object finalization, which was implemented in Python 3.4. I suggest you take a look at it:
https://www.python.org/dev/peps/pep-0442/

Why do I need the GIL for PyMem_Malloc()?

As per this discussion, PyMem_Malloc() requires the GIL; however, if the function is nothing more than an alias for malloc(), who cares?
Because it is sometimes more than simply an alias for malloc(). Sometimes it is an alias for _PyMem_DebugMalloc() and there is some global accounting there to keep track of unique memory objects. There's no real point in releasing the GIL just for a PyMem_Malloc() call, so you're probably doing something more complicated in C. If that's the case, you can simply call malloc() and not get any of the debugging stuff.

Will the collections.deque "pop" methods release GIL?

I have a piece of code where I have a processing thread and a monitor thread. In the processing thread, I have a call to the collections.deque.popleft function. I wanted to know whether this function releases the GIL, because I want my monitor thread to run even when the processing function is blocked on the popleft call.
Instead of answering this specific question I'll answer a different question:
What is the Global Interpreter Lock (GIL), and when will it block my program?
In short, the GIL protects the interpreter's state from becoming corrupted by concurrent threads.
For a sense of what it is for, consider the low-level implementation of dict, which somewhere has an array of keys, organized for quick lookup. When you write some code like:
myDict['foo'] = 'bar'
the python interpreter needs to adjust its collection of keys. That might involve things like making more room for the additional key as well as adding the particular key to that array.
If multiple, concurrent threads are modifying that dict, then one thread might reallocate the array while another is in the middle of modifying it, which could cause unpredictable, probably bad behavior (anything from corrupted data to a segfault, a Heartbleed-like leak of sensitive memory contents, or arbitrary code execution).
Since that's not the sort of state you can reasonably describe or prevent at the level of your Python application, the runtime goes to great lengths to prevent those sorts of problems from occurring. The way it does so is that certain parts of the interpreter, such as the modification of a dict, are surrounded by a PyGILState_Ensure()/PyGILState_Release() pair, so that critical operations always reach a consistent state.
Note however that the scope of this lock is very narrow; it doesn't attempt to protect from general data races. It won't protect you from writing a program whose threads overwrite each other's work in a common container (say, a collections.deque); it only guarantees that even if you do write such a program, it won't cause the interpreter to crash: you'll always have a valid, working deque. You can add additional application-level locks, as queue.Queue does, to give good concurrent semantics to your application.
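A small sketch of that narrow scope (assuming a standard GIL-enabled CPython): the interpreter's internals stay intact, but a read-modify-write on shared data is still a race.

    import threading

    counter = 0

    def bump(n):
        global counter
        for _ in range(n):
            counter += 1   # load, add, store: another thread can interleave here

    threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Never a crash, but the total is often short of 800000, depending on
    # the CPython version and thread switch interval.
    print(counter)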
Since every operation that the GIL protects is a change to interpreter state, it never needs to block on external events; those events won't cause the interpreter state to change, so, for example, a signaling condition variable cannot corrupt memory.
The only time you might have a problem is when you have several unblocked threads: since they are potentially all executing code in the low-level interpreter, they'll compete for the GIL, and only one thread can hold it, blocking the other threads that also want to do some computation.
Unless you are writing C extensions, you probably don't need to worry about it; and unless you have multiple, compute-bound threads in Python, you won't be affected by it either.
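To see the compute-bound case directly, a quick timing sketch (again assuming a standard GIL-enabled CPython build): two threads doing pure-Python arithmetic take about as long as doing the same work sequentially.

    import threading
    import time

    def burn(n):
        while n:
            n -= 1

    N = 10_000_000

    start = time.perf_counter()
    burn(N)
    burn(N)
    print("sequential :", round(time.perf_counter() - start, 2), "s")

    start = time.perf_counter()
    ts = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    # Roughly the same wall time, or worse: the GIL serializes the bytecode.
    print("two threads:", round(time.perf_counter() - start, 2), "s")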
Yes -- deque is thread-safe (thanks #hemanths): http://docs.python.org/2/library/collections.html#collections.deque
No, because collections.deque is not thread-safe. Use a Queue, or make your own deque subclass.

Python Daemons - Program Structure and Exception Control

I've been doing amateur coding in Python for a while now and feel quite comfortable with it. Recently though I've been writing my first Daemon and am trying to come to terms with how my programs should flow.
With my past programs, exceptions could be handled by simply aborting the program, perhaps after some minor cleaning up. The only consideration I had to give to program structure was the effective handling of non-exception input. In effect, "Garbage In, Nothing Out".
In my Daemon, there is an outer loop that effectively never ends, with a sleep statement inside it to control the interval at which things happen. Processing valid input data is easy, but I'm struggling to understand the best practice for dealing with exceptions. Sometimes the exception may occur several levels down in nested functions, and each needs to return something to its parent, which must, in turn, return something to its parent, until control returns to the outermost loop. Each function must be capable of handling any exception condition, not only for itself but also for all its subordinates.
I apologise for the vagueness of my question but I'm wondering if anyone could offer me some general pointers into how these exceptions should be handled. Should I be looking at spawning sub-processes that can be terminated without impact to the parent? A (remote) possibility is that I'm doing things correctly and actually do need all that nested handling. Another very real possibility is that I haven't got a clue what I'm talking about. :)
Steve
Exceptions are designed for the purpose of (potentially) not being caught immediately -- that's how they differ from having a function return a value that means "error". Each exception can be caught at the level where you want to (and can) do something about it.
At a minimum, you could start by catching all exceptions at the main loop and logging a message. This is simple and ensures that your daemon won't die. At the main loop it's probably too late to fix most problems, so you can catch specific exceptions sooner. E.g. if a file has the wrong format, catch the exception in the routine that opens and tries to use the file, not deep in the parsing code where the problem is discovered; perhaps you can try another format. Basically if there's a place where you could recover from a particular error condition, catch it there and do so.
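As a minimal sketch of that baseline (do_scheduled_work and the 60-second interval are illustrative placeholders): catch broad exceptions at the outer loop, log the traceback, and keep going.

    import logging
    import time

    logging.basicConfig(level=logging.INFO)

    def do_scheduled_work():
        ...   # the daemon's real per-interval work goes here

    while True:
        try:
            do_scheduled_work()
        except Exception:
            # logging.exception records the full traceback; the daemon survives.
            logging.exception("scheduled work failed; carrying on")
        time.sleep(60)   # the interval-controlling sleep from the question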
The answer will be "it depends".
If an exception occurs in some low-level function, it may be appropriate to catch it there if there is enough information available at that level to let the function complete successfully in spite of the exception. E.g. when reading triangles from an .stl file, the normal vector of the triangle is both explicitly given and implicitly given by the sequence of the three points that make up the triangle. So if the normal vector is given as (0,0,0), which is a 0-length vector and should trigger an exception in the constructor of a Normal vector class, that exception can be safely caught in the constructor of a Triangle class, because the normal can still be calculated by other means.
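Here is a minimal sketch of that .stl example, with hypothetical Normal and Triangle classes: the exception is caught exactly where enough information exists to recover.

    class Normal:
        def __init__(self, x, y, z):
            if (x, y, z) == (0, 0, 0):
                raise ValueError("zero-length normal vector")
            self.xyz = (x, y, z)

    def cross(a, b, c):
        # Normal direction implied by the winding order of the three points.
        ux, uy, uz = (b[i] - a[i] for i in range(3))
        vx, vy, vz = (c[i] - a[i] for i in range(3))
        return (uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx)

    class Triangle:
        def __init__(self, p1, p2, p3, normal_xyz):
            try:
                self.normal = Normal(*normal_xyz)
            except ValueError:
                # Enough information is available right here to recover:
                # derive the normal from the point sequence instead.
                self.normal = Normal(*cross(p1, p2, p3))
            self.points = (p1, p2, p3)

    t = Triangle((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 0))
    print(t.normal.xyz)   # (0, 0, 1), recovered from the winding order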
If there is not enough information available to handle an exception, it should trickle upwards to a level where it can be handled. E.g. if you are writing a module to read and interpret a file format, it should raise an exception if the file it was given doesn't match the file format. In this case it is probably the top level of the program using that module that should handle the exception and communicate with the user. (Or in case of a daemon, log the error and carry on.)
