Why is using global in python threading bad practice?


I read all over various websites how using global is bad. I have an application where I am storing say, 300 objects, in an array. I want to have 8 threads running through these 300 objects. These objects are different sizes, say between 10 and 50,000 integers and randomly distributed (think worst case scenario here).
Basically, I want to start up 8 threads, do a process on an object, report or store the results, and pick up a new object, 300 times.
The solution I can think of is to set a global lock and a global counter, lock the array, get the current object, increment the counter, release the lock.
There is 1 lock for 8 threads. There is 1 counter for 8 threads. I have 2 global objects. I store results in a dictionary, possibly also global to make it visible to all threads but also threadsafe. I am not bothering to do something stupid like subclassing thread and passing along 300/8 objects to each object because multiprocessing.pool does that for me. So how would you do it? Also, convince me that using global in this situation is bad.

Classifying approaches as either "good" or "bad" is a bit simplistic -- in practice, if a design makes sense to you and accomplishes the goals you set out to accomplish, then it doesn't matter whether other people (except possibly your boss) think it's "good" or not; it either works or it doesn't. On the other hand, if your design causes you a lot of pain and suffering, that's a sign that you might not be using the most suitable design for the task at hand.
That said, there are some valid reasons why a lot of people think that global variables are problematic, particularly when combined with multithreading.
The general problem with global variables (with or without multithreading) is that as your program grows larger, it becomes increasingly difficult to mentally keep track of which parts of your program might be reading and/or updating the global variables' values at which times -- since they are global, by definition all parts of your program have access to them, so when you're trying to trace through your program to figure out who it was who set a global variable to some unexpected value, the list of suspects can become unmanageably large. (This isn't much of a problem for small programs, but the larger your program grows, the worse this problem becomes -- and a lot of programmers have learned, through painful experience, that it's better to nip the problem in the bud by avoiding globals wherever possible in the first place than to have to go back and rewrite a big, complicated, buggy program later on.)
In the specific use-case of a multithreaded program, the anybody-could-be-accessing-my-global-variable-at-any-time property becomes even more fraught with peril, since in a multithreaded scenario, any (non-immutable) data that is shared between threads can only be safely accessed with proper serialization (e.g. by locking a mutex before reading/writing the shared data, and unlocking it afterwards). Ideally programmers would never accidentally read or write any shared+mutable data without locking the mutex -- but programmers are human and will inevitably make mistakes; if given the ability to do so, sooner or later you (or someone else) will forget that access to a particular global variable needs to be serialized, and will just go ahead and read/write it, and then you're in for a lot of pain, because the symptoms will be rare and random, and the cause of the fault won't be obvious.
So smart programmers try to make it impossible to fall into that sort of trap, usually by limiting access to the shared-state to a specific, small, carefully-written set of functions (a.k.a. an API) that implement the serialization correctly so that no other code has to. When doing that, you want to make sure that only the code in this particular API has access to the shared data, and that nobody else does -- something that is impossible to do with a global variable, as by definition everyone has direct access to it.
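As a minimal sketch of that idea applied to the question's 300-object scenario: the lock, the counter and the results dictionary live inside two small classes, and worker code can only reach them through methods that do the locking. The names WorkDispenser, ResultStore, worker and run are hypothetical, and sum() merely stands in for the real per-object processing.

```python
import threading


class WorkDispenser:
    """Hands out (index, item) pairs one at a time; the lock never leaves this class."""

    def __init__(self, items):
        self._items = list(items)
        self._next = 0
        self._lock = threading.Lock()

    def take(self):
        """Return the next unclaimed (index, item) pair, or None when all are claimed."""
        with self._lock:
            if self._next >= len(self._items):
                return None
            index = self._next
            self._next += 1
            return index, self._items[index]


class ResultStore:
    """Collects results; callers never touch the underlying dict directly."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._results[key] = value

    def snapshot(self):
        with self._lock:
            return dict(self._results)


def worker(dispenser, results):
    # Hypothetical worker: claim an object, process it, store the result, repeat.
    while True:
        claimed = dispenser.take()
        if claimed is None:
            return
        index, obj = claimed
        results.put(index, sum(obj))  # sum() stands in for the real processing


def run(objects, thread_count=8):
    dispenser = WorkDispenser(objects)
    results = ResultStore()
    threads = [threading.Thread(target=worker, args=(dispenser, results))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.snapshot()
```

Nothing here is global: run() creates both objects and passes them to the threads explicitly, so the list of code that can touch the shared state is exactly the four methods above.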
There is also one performance-related reason why people prefer not to mix global variables and multithreading: the more serialization you have to do, the less your program can exploit the power of multiple CPU cores. In particular, it does you no good to have an 8-core CPU if 7 of your 8 threads are spending most of their time blocked, waiting for a mutex to become available.
So how does that relate to globals? It's related in that in most cases it's difficult or impossible to prove that a global variable won't ever be accessed by another thread, which means all accesses to that global variable need to be serialized. With a non-global variable, on the other hand, you can make sure to give a reference to that variable to only a single thread -- at which point you have effectively guaranteed that only that one thread will ever access the variable (since the other threads have no references to it, you know they can't access it), and because you have that guarantee, you no longer need to serialize access to that data, and now your thread can run more efficiently because it never has to block waiting for a mutex.
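A rough sketch of that single-owner idea, assuming the work items can be handed out through a queue.Queue: each thread receives its own private results list, so appending to it needs no lock, and the main thread only reads the lists after every thread has been joined. The names worker and process_all are hypothetical, and sum() is a placeholder for the real work.

```python
import queue
import threading


def worker(tasks, private_results):
    """Drain the task queue; private_results is referenced by this thread only."""
    while True:
        try:
            index, obj = tasks.get_nowait()
        except queue.Empty:
            return
        # No lock needed: no other thread holds a reference to this list.
        private_results.append((index, sum(obj)))


def process_all(objects, thread_count=8):
    tasks = queue.Queue()  # the only shared object; it does its own locking
    for item in enumerate(objects):
        tasks.put(item)

    per_thread = [[] for _ in range(thread_count)]
    threads = [threading.Thread(target=worker, args=(tasks, bucket))
               for bucket in per_thread]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge only after join(), when no worker can still be writing.
    merged = {}
    for bucket in per_thread:
        merged.update(bucket)
    return merged
```

The only serialization left is inside the queue itself, when a thread picks up its next object.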
(Btw, note that CPython in particular suffers from a severe form of implicit serialization caused by its Global Interpreter Lock, which means that even the best multithreaded, CPU-bound Python code will be unlikely to use more than a single CPU core at a time. The only way to get around that is to use multiprocessing instead, or to do the bulk of your program's computations in a lower-level language such as C, so that it can execute without holding the GIL.)

Related

Unclear documentation: Sharing state between processes

there is a part in the Python documentation that is unclear to me:
https://docs.python.org/3.4/library/multiprocessing.html#sharing-state-between-processes
"As mentioned above, when doing concurrent programming it is usually best to avoid using shared state as far as possible."
But I cannot find any description above 17.2.1.5 that describes why it is best to avoid using shared state. Any ideas?

Shared state is like a global variable, but… more global.
Not only do you have to consider what parts of your code are reading and modifying the state, but also which running copy of your code is accessing it, and how. This gets even trickier when the state is mutable, i.e. can be changed.
To make sure one thread doesn't stomp on what another thread is doing you have to coordinate access to the state. That could be done using semaphores, message-passing, software transactional memory, etc.
See also https://softwareengineering.stackexchange.com/questions/148108/why-is-global-state-so-evil.

How to share objects and data between python processes in real-time?

I'm trying to find a reasonable approach in Python for a real-time application, multiprocessing and large files.
A parent process spawns two or more children. The first child reads data and keeps it in memory, and the others process it in a pipeline fashion. The data should be organized into an object, sent to the following process, processed, sent, processed, and so on.
Available mechanisms such as Pipe, Queue and Managers do not seem adequate because of their overhead (serialization, etc.).
Is there an adequate approach for this?

I've used Celery and Redis for real-time multiprocessing in high memory applications, but it really depends on what you're trying to accomplish.
The biggest benefits I've found in Celery over built-in multiprocessing tools (Pipe/Queue) are:
Low overhead. You call a function directly, no need to serialize data.
Scaling. Need to ramp up worker processes? Just add more workers.
Transparency. Easy to inspect tasks/workers and find bottlenecks.
For really squeezing out performance, ZMQ is my go to. A lot more work to set up and fine-tune, but it's as close to bare sockets as you can safely get.
Disclaimer: This is all anecdotal. It really comes down to what your specific needs are. I'd benchmark different options with sample data before you go down any path.

First, a suspicion that message-passing may be inadequate because of all the overhead is not a good reason to overcomplicate your program. It's a good reason to build a proof of concept and come up with some sample data and start testing. If you're spending 80% of your time pickling things or pushing stuff through queues, then yes, that's probably going to be a problem in your real life code—assuming the amount of work your proof of concept does is reasonably comparable to your real code. But if you're spending 98% of your time doing the real work, then there is no problem to solve. Message passing will be simpler, so just use it.
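A proof of concept along those lines could start as small as the sketch below. It only times pickle round trips against a placeholder workload, so it approximates the serialization part of queue overhead and ignores the cost of moving bytes through a pipe. The names measure, sample and the squared-sum workload are hypothetical stand-ins for your real data and processing.

```python
import pickle
import time


def measure(payload, work, repeats=20):
    """Return (seconds spent on pickle round trips, seconds spent in work)."""
    start = time.perf_counter()
    for _ in range(repeats):
        pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
    transfer = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(repeats):
        work(payload)
    compute = time.perf_counter() - start
    return transfer, compute


# Placeholder sample: 50,000 integers, the largest object size in the question above.
sample = list(range(50_000))
transfer, compute = measure(sample, lambda data: sum(x * x for x in data))
share = transfer / (transfer + compute)
print(f"serialization share of total time: {share:.0%}")
```

If the printed share is small for realistic payloads and realistic work, the message-passing overhead is probably not worth designing around.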
Also, even if you do identify a problem here, that doesn't mean that you have to abandon message passing; it may just be a problem with what's built in to multiprocessing. Technologies like 0MQ and Celery may have lower overhead than a simple queue. Even being more careful about what you send over the queue can make a huge difference.
But if message passing is out, the obvious alternative is data sharing. This is explained pretty well in the multiprocessing docs, along with the pros and cons of each.
Sharing state between processes describes the basics of how to do it. There are other alternatives, like using mmapped files or platform-specific shared memory APIs, but there's not much reason to do that over multiprocessing unless you need, e.g., persistent storage between runs.
There are two big problems to deal with, but both can be dealt with.
First, you can't share Python objects, only simple values. Python objects have internal references to each other all over the place, the garbage collector can't see references to objects in other processes' heaps, and so on. So multiprocessing.Value can only hold the same basic kinds of native values as array.array, and multiprocessing.Array can hold (as you'd guess by the name) 1D arrays of the same values, and that's it. For anything more complicated, if you can define it in terms of a ctypes.Structure, you can use https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.sharedctypes, but this still means that any references between objects have to be indirect. (For example, you often have to store indices into an array.) (Of course none of this is bad news if you're using NumPy, because you're probably already storing most of your data in NumPy arrays of simple values, which are sharable.)
Second, shared data are of course subject to race conditions. And, unlike multithreading within a single process, you can't rely on the GIL to help protect you here; there are multiple interpreters that can all be trying to modify the same data at the same time. So you have to use locks or conditions to protect things.
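Both points can be sketched with multiprocessing.Value and multiprocessing.Array, which hold plain C integers here (type code "i"). The function names add_many and run_demo are hypothetical; the sketch assumes each worker increments one shared counter under its lock and writes only its own slot of the shared array.

```python
import multiprocessing


def add_many(counter, totals, slot, count):
    """Increment the shared counter `count` times, then record count in totals[slot]."""
    for _ in range(count):
        # counter.value += 1 is a read-modify-write, so it needs the lock.
        with counter.get_lock():
            counter.value += 1
    totals[slot] = count  # each worker writes only its own slot


def run_demo(workers=4, count=1000):
    counter = multiprocessing.Value("i", 0)        # one shared C int
    totals = multiprocessing.Array("i", workers)   # shared 1D array of C ints, zero-filled
    procs = [multiprocessing.Process(target=add_many,
                                     args=(counter, totals, slot, count))
             for slot in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value, totals[:]
```

To try it, call run_demo() from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start child processes by re-importing the main module.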

For multiprocessing pipeline check out MPipe.
For shared memory (specifically NumPy arrays) check out numpy-sharedmem.
I've used these to do high-performance realtime, parallel image processing (average accumulation and face detection using OpenCV) while squeezing out all available resources from a multi-core CPU system. Check out Sherlock if interested. Hope this helps.

One option is to use something like brain-plasma that maintains a shared-memory object namespace that is independent of the Python process or thread. Kind of like Redis but can be used with big objects and has a simple API, built on top of Apache Arrow.
$ pip install brain-plasma
# process 1
from brain_plasma import Brain
brain = Brain()
brain['myvar'] = 657
# process 2
from brain_plasma import Brain
brain = Brain()
brain['myvar']
# >>> 657

Python 3.8 now offers shared memory access between processes using multiprocessing.shared_memory. All you hand off between processes is a string that references the shared memory block. In the consuming process you get a memoryview object, which supports slicing without copying the data the way byte arrays do. If you are using numpy, it can reference the memory block in an O(1) operation, allowing fast transfers of large blocks of numeric data. As far as I understand, generic objects still need to be deserialized, since a raw byte array is what the consuming process receives.
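A minimal sketch of that hand-off, kept inside one process so it stays short: the "producer" creates a block and the "consumer" attaches to it using nothing but the block's name, which is the string you would normally send to another process. The helper names write_block and read_block are hypothetical. The length is passed separately because the block may be rounded up to a larger size than requested.

```python
from multiprocessing import shared_memory


def write_block(payload):
    """Create a shared memory block holding payload and return its handle."""
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload   # shm.buf is a memoryview over the block
    return shm


def read_block(name, length):
    """Attach to an existing block by name and copy out `length` bytes."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bytes(shm.buf[:length])
    finally:
        shm.close()                    # detach; the block itself stays alive


producer = write_block(b"hello")
try:
    # Only producer.name (a str) would need to travel between processes.
    received = read_block(producer.name, 5)
finally:
    producer.close()
    producer.unlink()                  # free the block once nobody needs it
```

The creator is responsible for calling unlink() exactly once; every other user of the block only calls close().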

Will the collections.deque "pop" methods release GIL?

I have a piece of code where I have a processing thread and a monitor thread. In the processing thread, I have a call to the collections.deque.popleft function. I wanted to know if this function releases the GIL, because I want to run my monitor thread even when the processing function is blocked on the popleft function.

Instead of answering this specific question I'll answer a different question:
What is the Global Interpreter Lock (GIL), and when will it block my program?
In short, the GIL protects the interpreter's state from becoming corrupted by concurrent threads.
For a sense of what it is for, consider the low-level implementation of dict, which somewhere has an array of keys, organized for quick lookup. When you write some code like:
myDict['foo'] = 'bar'
the python interpreter needs to adjust its collection of keys. That might involve things like making more room for the additional key as well as adding the particular key to that array.
If multiple concurrent threads are modifying that dict, then one thread might reallocate the array while another is in the middle of modifying it, which could cause unpredictable and probably bad behavior (anything from corrupted data or a segfault to a Heartbleed-like leak of sensitive memory contents or arbitrary code execution).
Since that's not the sort of state you can reasonably describe or prevent at the level of your python application, the run-time goes to great lengths to prevent those sorts of problems from occurring. The way it does this is that certain parts of the interpreter, such as the modification of a dict, are surrounded by a PyGILState_Ensure()/PyGILState_Release() pair, so that critical operations always reach a consistent state.
Note however that the scope of this lock is very narrow; it doesn't attempt to protect from general data races. It won't protect you from writing a program with multiple threads overwriting each other's work in a common container (say, a collections.deque); it only guarantees that even if you do write such a program, it won't cause the interpreter to crash, and you'll always have a valid, working deque. You can add additional application locks, as in queue.Queue, to give good concurrent semantics to your application.
Since every operation that the GIL protects is a change in the interpreter state, it never blocks on external events; since those events won't cause the interpreter state to be changed, a signaling condition variable cannot corrupt memory.
The only time you might have a problem is when you have several unblocked threads: since they are potentially all executing code in the low-level interpreter, they'll compete for the GIL, and only one thread can hold it at a time, blocking the other threads that also want to do some computation.
Unless you are writing C extensions, you probably don't need to worry about it, and unless you have multiple, compute bound threads, in python, you won't be affected by it, either.
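To make the deque versus queue.Queue distinction concrete, here is a small sketch with hypothetical names (popleft_or_none, processing_thread). It relies on two documented behaviours: popleft() on an empty deque raises IndexError immediately rather than waiting, and Queue.get() blocks until an item arrives. In CPython a thread blocked in Queue.get() is waiting on a lock and does not hold the GIL, so a monitor thread keeps running.

```python
import collections
import queue
import threading


def popleft_or_none(d):
    """deque.popleft() never waits: on an empty deque it raises IndexError."""
    try:
        return d.popleft()
    except IndexError:
        return None


def processing_thread(work, results):
    """Block in Queue.get() until work arrives; None is the stop signal."""
    while True:
        item = work.get()  # waits here without holding the GIL
        if item is None:
            return
        results.append(item * 2)


work = queue.Queue()
results = []
worker = threading.Thread(target=processing_thread, args=(work, results))
worker.start()

# The main thread plays the monitor: it keeps running while the worker waits.
empty_result = popleft_or_none(collections.deque())
for n in (1, 2, 3):
    work.put(n)
work.put(None)
worker.join()
```

If the processing thread really has to wait for data, Queue.get() (optionally with a timeout) is the blocking call to use; a bare deque would need a polling loop or an extra condition variable.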

Yes -- deque is thread-safe (thanks @hemanths): http://docs.python.org/2/library/collections.html#collections.deque
(This corrects my original answer, which read: "No, because collections.deque is not thread-safe. Use a Queue, or make your own deque subclass.")

Is CCKeyDerivationPBKDF thread safe?

I'm using CCKeyDerivationPBKDF to generate and verify password hashes in a concurrent environment and I'd like to know whether it is thread safe. The documentation of the function doesn't mention thread safety at all, so I'm currently using a lock to be on the safe side, but I'd prefer not to use a lock if I don't have to.

After going through the source code of CCKeyDerivationPBKDF(), I find it to be "thread unsafe". While the code for CCKeyDerivationPBKDF() uses many library functions which are thread-safe (e.g. bzero), most user-defined functions (e.g. PRF) and the underlying functions called from them are potentially thread-unsafe (e.g. due to the use of several pointers and unsafe casting of memory, as in CCHMac). Unless they make all the underlying functions thread-safe, or provide some mechanism to at least make it conditionally thread-safe, I would suggest sticking with your approach, or modifying the CommonCrypto code to make it thread-safe and using that code.
Hope it helps.

Lacking documentation or source code, one option is to build a test app with say 10 threads looping on calls to CCKeyDerivationPBKDF with a random selection from say 10 different sets of arguments with 10 known results.
Each thread checks the result of a call to make sure it is what is expected. Each thread should also have a usleep() call for some random amount of time (bell curve sitting on say 10% of the time each call to CCKeyDerivationPBKDF takes) in this loop in order to attempt to interleave operations as much as possible.
You'll probably want to instrument it with debugging that keeps track of how much concurrency you are able to generate. With a 10% sleep time and 10 threads, you should be able to keep 9 threads concurrent.
If it makes it through an aggregate of say 100,000,000 calls without an error, I'd assume it was thread safe. Of course you could run it for much longer than that to get greater assurances.
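The shape of such a test app can be sketched in Python, with hashlib.pbkdf2_hmac standing in for CCKeyDerivationPBKDF (an assumption made purely for illustration; it says nothing about CommonCrypto itself). The names CASES, ROUNDS, derive, hammer and stress are hypothetical, the sleep is uniform rather than bell-shaped to keep the code short, and the call counts are far below the 100,000,000 suggested above.

```python
import hashlib
import random
import threading
import time

# Ten argument sets with known results, as described above.
CASES = [(b"password%d" % i, b"salt%d" % i) for i in range(10)]
ROUNDS = 200  # PBKDF2 iteration count, kept low so the sketch runs quickly


def derive(password, salt):
    """Stand-in for the function under test."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, ROUNDS)


# Expected results are computed single-threaded before any concurrency starts.
EXPECTED = [derive(password, salt) for password, salt in CASES]


def hammer(calls, mismatches, seed):
    """Call derive() repeatedly with random cases and record any wrong result."""
    rng = random.Random(seed)
    for _ in range(calls):
        i = rng.randrange(len(CASES))
        if derive(*CASES[i]) != EXPECTED[i]:
            mismatches.append(i)  # list.append is thread-safe in CPython
        time.sleep(rng.uniform(0, 0.0005))  # short random pause to vary interleaving


def stress(thread_count=10, calls_per_thread=100):
    """Run the hammer threads and return the number of mismatches seen."""
    mismatches = []
    threads = [threading.Thread(target=hammer,
                                args=(calls_per_thread, mismatches, seed))
               for seed in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(mismatches)
```

A return value of 0 only means no fault was observed in that run; as the answer says, longer runs give more assurance but never proof.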

How do I count bytecodes in Python so I can modify sys.setcheckinterval appropriately

I have a port scanning application that uses work queues and threads.
It uses simple TCP connections and spends a lot of time waiting for packets to come back (up to half a second). Thus the threads don't need to fully execute (i.e. first half sends a packet, context switch, does stuff, comes back to thread which has network data waiting for it).
I suspect I can improve performance by modifying the sys.setcheckinterval from the default of 100 (which lets up to 100 bytecodes execute before switching to another thread).
But without knowing how many bytecodes are actually executing in a thread or function I'm flying blind and simply guessing values, testing and relying on the testing shows a measurable difference (which is difficult since the amount of code being executed is minimal; a simple socket connection, thus network jitter will likely affect any measurements more than changing sys.setcheckinterval).
Thus I would like to find out how many bytecodes are in certain code executions (i.e. total for a function or in execution of a thread) so I can make more intelligent guesses at what to set sys.setcheckinterval to.

At a higher level (method or class), the dis module should help.
But if you need finer granularity, tracing will be unavoidable. Tracing operates on a line-by-line basis, but explained here is a great hack to dive deeper, down to the bytecode level. Hats off to Ned Batchelder.
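For the dis route, a minimal sketch is below. It counts the instructions that appear in a function's code object, which is a static count, not the number of bytecodes actually executed in a given run (loops and calls are not expanded, and nested functions have their own code objects). The names count_instructions and probe are hypothetical. Note also that sys.setcheckinterval was removed in Python 3.9; on Python 3.2 and later, thread switching is governed by the time-based sys.setswitchinterval instead.

```python
import dis


def count_instructions(func):
    """Number of bytecode instructions in func's own code object (static count)."""
    return sum(1 for _ in dis.get_instructions(func))


def probe(host, port):
    # Hypothetical stand-in for the scanner's per-connection function.
    address = (host, port)
    return address


print(count_instructions(probe))
# dis.dis(probe) prints the full listing if you want to see each instruction.
```

The exact number varies between Python versions, since the bytecode set changes from release to release.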

Reasoning about a system of this complexity will rarely produce the right answer. Measure the results, and use the setting that runs the fastest. If as you say, testing can't measure the difference in various settings of setcheckinterval, then why bother changing it? Only measurable differences are interesting. If your test run is too short to provide meaningful data, then make the run longer until it does.

" I suspect I can improve performance by modifying the sys.setcheckinterval"
This rarely works. Correct behavior can't depend on timing -- you can't control timing. Slight changes on OS, hardware, patch level of Python or phase of the moon will change how your application behaves.
The select module is what you use to wait for I/O. Your application can be structured as a main loop that does the select and queues up work for other threads. The other threads wait for entries in their queue of requests to process.
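A rough sketch of that structure, using socket.socketpair() in place of real network connections so it runs anywhere: the main loop waits in select.select() and pushes whatever arrives onto a queue.Queue, and a worker thread takes requests off that queue. The names handle_requests and main_loop, the upper() "processing" and the None stop signal are all hypothetical.

```python
import queue
import select
import socket
import threading


def handle_requests(requests, replies):
    """Worker thread: process queued data until a None stop signal arrives."""
    while True:
        data = requests.get()
        if data is None:
            return
        replies.append(data.upper())  # placeholder for real processing


def main_loop(sockets, requests, expected, timeout=1.0):
    """Wait for readable sockets and hand their data to the worker queue."""
    received = 0
    while received < expected:
        readable, _, _ = select.select(sockets, [], [], timeout)
        if not readable:
            break  # nothing arrived within the timeout
        for sock in readable:
            requests.put(sock.recv(4096))
            received += 1


local, remote = socket.socketpair()
requests, replies = queue.Queue(), []
worker = threading.Thread(target=handle_requests, args=(requests, replies))
worker.start()

remote.sendall(b"pong")             # stands in for a reply from a scanned host
main_loop([local], requests, expected=1)

requests.put(None)
worker.join()
local.close()
remote.close()
```

With this layout a single thread can watch hundreds of sockets, and no tuning of the interpreter's switching behaviour is involved.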
