Blocking I/O in Python

Newbie on Python and multiple threading.
I read some articles on what blocking and non-blocking I/O are, and the main difference seems to be that blocking I/O only allows tasks to be executed sequentially, while non-blocking I/O allows multiple tasks to be executed concurrently.
If that's the case, how can blocking I/O operations (e.g., some of Python's standard built-in functions) be used with multiple threads?

Blocking I/O blocks the thread it's running in, not the whole process (at least in this context, and on a standard PC).
Multithreading is not affected by definition - only the current thread gets blocked.

The global interpreter lock (in CPython) is a measure put in place so that only one active Python thread executes at a time. As frustrating as it can be, this is a good thing: it exists to prevent interpreter corruption.
When a blocking operation is encountered, the current thread yields the lock and thus allows other threads to execute while it is blocked. However, when threads are CPU bound (making purely Python calls), only one thread executes no matter how many are running.
It is interesting to note that in Python 3.2, code was added to mitigate the effects of the global interpreter lock. It is also interesting to note that other implementations of Python do not have a global interpreter lock.
Please note this is a limitation of the Python code and that the underlying libraries may still be processing data.
Also, in many cases, when it comes to I/O, to avoid blocking, a useful way to handle IO is using polling and eventing:
Polling involves checking whether the operation would block, i.e., testing whether there is data. For example, if you are trying to read data from a socket, you would use select() or poll().
Eventing involves using callbacks so that your thread is notified when a relevant I/O operation has occurred.
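For instance, here is a minimal polling sketch (the endpoints below are placeholders, not anything from the question): select() watches several sockets at once and blocks, without burning CPU, until at least one is readable.

import select
import socket

# Open two connections and issue a request on each; select() then blocks
# until at least one socket has data to read.
socks = []
for _ in range(2):
    s = socket.create_connection(("example.com", 80))
    s.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    socks.append(s)

while socks:
    readable, _, _ = select.select(socks, [], [])  # blocks until ready
    for s in readable:
        data = s.recv(4096)
        if data:
            print("read", len(data), "bytes")
        else:                  # empty read means the peer closed the connection
            s.close()
            socks.remove(s)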

How to manipulate the GIL when many threads need to execute?

My understanding is that the typical GIL manipulations involve, e.g., blocking I/O operations. Hence one would want to release the lock before the I/O operation and reacquire it once it has completed.
I'm currently facing a different scenario with a C extension: I am creating X windows that are exposed to Python via the Canvas class. When the method show() is called on an instance, a new UI thread is started using PyThreads (with a call to PyThread_start_new_thread). This new thread is responsible for drawing on the X window, using the Python code specified in the on_draw method of a subclass of Canvas. A pure C event loop is started in the main thread that simply checks for events on the X window and, for the time being, only captures the WM_DELETE_EVENT.
So I have potentially many threads (one for each X window) that want to execute Python code and the main thread that does not execute any Python code at all.
How do I release/acquire the GIL in order to allow the UI threads to get into the interpreter in an orderly fashion?
The rule is easy: you need to hold the GIL to access Python machinery (any API starting with Py<...> and any PyObject).
So, you can release it whenever you don't need any of that.
Anything further than this is the fundamental problem of locking granularity: potential benefits vs locking overhead. There was an experiment for Py 1.4 to replace the GIL with more granular locks that failed exactly because the overhead proved prohibitive.
That's why it's typically released for code chunks involving call(s) to external facilities that can take arbitrary time (especially if they involve waiting for external events) -- if you don't release the lock, Python will just be idling during this time.
Heeding this rule, you will get to your goal automatically: whenever a thread can't proceed further (whether it's I/O, signal from another thread, or even so much as a time.sleep() to avoid a busy loop), it will release the lock and allow other threads to proceed in its stead. The GIL assigning mechanism strives to be fair (see issue8299 for exploration on how fair it is), releasing the programmer from bothering about any bias stemming solely from the engine.
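A tiny illustration of that rule in pure Python (the names here are just for the demo): while the main thread blocks in time.sleep(), the GIL is released, so the counting thread keeps making progress.

import threading
import time

done = False

def counter():
    n = 0
    while not done:
        n += 1              # pure-Python work, needs the GIL
    print("counted to", n)

t = threading.Thread(target=counter)
t.start()
time.sleep(1.0)             # main thread blocks here; the GIL is released
done = True
t.join()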
I think the problem stems from the fact that, in my opinion, the official documentation is a bit ambiguous on the meaning of Non-Python created threads. Quoting from it:
When threads are created using the dedicated Python APIs (such as the threading module), a thread state is automatically associated to them and the code showed above is therefore correct. However, when threads are created from C (for example by a third-party library with its own thread management), they don’t hold the GIL, nor is there a thread state structure for them.
The parts that I find off-putting are "dedicated Python APIs" and "created from C". As I have stated in the OP, I am calling PyThread_start_new_thread. Whilst this creates a new thread from C, this function is not part of a third-party library, but of the dedicated Python (C) APIs. Based on this assumption, I ruled out that I actually needed to use the PyGILState_Ensure/PyGILState_Release paradigm.
As far as I can tell from what I've seen with my experiments, a thread created from C with (just) PyThread_start_new_thread should be considered as a non-Python created thread.

CPU utilization while waiting for I/O to be ready in asynchronous programs

In an asynchronous program (e.g., asyncio, Twisted, etc.), all system calls must be non-blocking. That means a non-blocking select (or something equivalent) needs to be executed in every iteration of the main loop. That seems more wasteful than the multi-threaded approach, where each thread can use a blocking call and sleep (without wasting CPU resources) until the socket is ready.
Does this sometimes cause asynchronous programs to be slower than their multi-threaded alternatives (despite thread switching costs), or is there some mechanism that makes this not a valid concern?
When working with select in a single-threaded program, you do not have to continuously check the results. The right way to work with it is to let it block until the relevant I/O has arrived, just like in the multi-threaded case.
However, instead of waiting on a single socket (or other I/O), the select call takes a list of relevant sockets and blocks until any of them is ready.
Once that happens, select wakes up and returns a list of the sockets (or I/Os) that are ready. It is up to the coder to handle those ready sockets in the required way, and then, if the code has nothing else to do, it might start another iteration of select.
As you can see, no polling loop is required; the program does not consume CPU resources until one or more of the required sockets are ready. Moreover, if several sockets become ready almost together, the code wakes up once, handles all of them, and only then starts select again. Add to that the fact that the program does not require the resource overhead of several threads, and you can see why this is more effective in terms of OS resources.
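To make the pattern concrete, here is a minimal single-threaded echo-server sketch using the selectors module (the higher-level wrapper around select that asyncio itself builds on); the port number is arbitrary.

import selectors
import socket

# One selector watches the listening socket and every accepted connection;
# sel.select() blocks, using no CPU, until something is ready.
sel = selectors.DefaultSelector()

def accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, read)

def read(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)        # simple echo
    else:
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("localhost", 5000))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:
    for key, _ in sel.select():   # blocks until at least one socket is ready
        key.data(key.fileobj)     # dispatch to the registered callback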
In my question I separated the I/O handling into two categories: polling, represented by non-blocking select, and "callback", represented by blocking select. (The blocking select sleeps the thread, so it's not strictly speaking a callback; but conceptually it is similar to one, since it doesn't use CPU cycles until the I/O is ready. Since I don't know the precise term, I'll just use "callback".)
I assumed that the asynchronous model cannot use "callback" I/O. It now seems to me that this assumption was incorrect. While an asynchronous program should not be using non-blocking select, and it cannot strictly request a traditional callback from the OS either, it can certainly provide the OS with its main event loop and, say, a coroutine, and ask the OS to create a task in that event loop using that coroutine when an I/O socket is ready. This would not use any of the program's CPU cycles until the I/O is ready. (It might use the OS kernel's CPU cycles if it uses polling rather than interrupts for I/O, but that would be the case even with a multi-threaded program.)
Of course, this requires that the OS support the asynchronous framework used by the program. It probably doesn't. But even then, it seems quite straightforward to add a middle layer that uses a single separate thread and a blocking select to talk to the OS, and whenever I/O is ready, creates a task in the program's main event loop. If this layer is included in the interpreter, the program would look perfectly asynchronous. If this layer is added as a library, the program would be largely asynchronous, apart from a simple additional thread that converts synchronous I/O to asynchronous I/O.
I have no idea whether any of this is done in Python, but it seems plausible conceptually.
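For what it's worth, asyncio's default event loop works essentially this way already (its selector blocks inside the loop itself, via mechanisms like loop.add_reader, with no extra thread). A rough sketch of the middle-layer idea as a separate thread - io_bridge and the socketpair here are purely illustrative:

import asyncio
import select
import socket
import threading

# Hypothetical middle layer: a helper thread blocks in select(), reads the
# data, and hands the result to the asyncio event loop thread-safely.
def io_bridge(loop, sock, handle):
    while True:
        select.select([sock], [], [])            # blocks, no CPU used
        data = sock.recv(4096)                   # read in the helper thread
        loop.call_soon_threadsafe(handle, data)  # schedule on the event loop
        if not data:                             # peer closed the connection
            break

async def main():
    loop = asyncio.get_running_loop()
    a, b = socket.socketpair()                   # stand-in for a real connection
    threading.Thread(target=io_bridge, args=(loop, a, print), daemon=True).start()
    b.sendall(b"hello")                          # trigger some I/O
    await asyncio.sleep(1)                       # the async program keeps running

asyncio.run(main())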

When should I be using asyncio over regular threads, and why? Does it provide performance increases?

I have a pretty basic understanding of multithreading in Python and an even basic-er understanding of asyncio.
I'm currently writing a small Curses-based program (eventually going to be using a full GUI, but that's another story) that handles the UI and user IO in the main thread, and then has two other daemon threads (each with their own queue/worker-method-that-gets-things-from-a-queue):
a watcher thread that watches for time-based and conditional (e.g. posts to a message board, received messages, etc.) events to occur and then puts required tasks into...
the other (worker) daemon thread's queue which then completes them.
All three threads are continuously running concurrently, which leads me to some questions:
When the worker thread's queue (or, more generally, any thread's queue) is empty, should it be stopped until it has something to do again, or is it okay to leave it continuously running? Do concurrent threads take up a lot of processing power when they aren't doing anything other than watching their queues?
Should the two threads' queues be combined? Since the watcher thread is continuously running a single method, I guess the worker thread would be able to just pull tasks from the single queue that the watcher thread puts in.
I don't think it'll matter since I'm not multiprocessing, but is this setup affected by Python's GIL (which I believe still exists in 3.4) in any way?
Should the watcher thread be running continuously like that? From what I understand, and please correct me if I'm wrong, asyncio is supposed to be used for event-based multithreading, which seems relevant to what I'm trying to do.
The main thread is basically always just waiting for the user to press a key to access a different part of the menu. This seems like a situation asyncio would be perfect for, but, again, I'm not sure.
Thanks!
When the worker thread's queue (or, more generally, any thread's queue) is empty, should it be stopped until it has something to do again, or is it okay to leave it continuously running? Do concurrent threads take up a lot of processing power when they aren't doing anything other than watching their queues?
You should just use a blocking call to queue.get(). That will leave the thread blocked on I/O, which means the GIL will be released, and no processing power (or at least a very minimal amount) will be used. Don't use non-blocking gets in a while loop, since that's going to require a lot more CPU wakeups.
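For example, a minimal sketch of that worker pattern (all names here are illustrative):

import queue
import threading

tasks = queue.Queue()

def worker():
    while True:
        task = tasks.get()      # blocks here, GIL released, while the queue is empty
        if task is None:        # a common shutdown convention
            break
        task()                  # run the unit of work
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()
tasks.put(lambda: print("hello from the worker"))
tasks.join()                    # wait until the task has been processed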
Should the two threads' queues be combined? Since the watcher thread is continuously running a single method, I guess the worker thread would be able to just pull tasks from the single queue that the watcher thread puts in.
If all the watcher is doing is pulling things off a queue and immediately putting them into another queue, where they get consumed by a single worker, it sounds like unnecessary overhead - you may as well consume them directly in the worker. It's not exactly clear to me that that's the case, though - is the watcher consuming from a queue, or just putting items into one? If it is consuming from a queue, who is putting stuff into it?
I don't think it'll matter since I'm not multiprocessing, but is this setup affected by Python's GIL (which I believe still exists in 3.4) in any way?
Yes, this is affected by the GIL. Only one of your threads can run Python bytecode at a time, so you won't get true parallelism, except when threads are running I/O (which releases the GIL). If your worker thread is doing CPU-bound activities, you should seriously consider running it in a separate process via multiprocessing, if possible.
Should the watcher thread be running continuously like that? From what I understand, and please correct me if I'm wrong, asyncio is supposed to be used for event-based multithreading, which seems relevant to what I'm trying to do.
It's hard to say, because I don't know exactly what "running continuously" means. What is it doing continuously? If it spends most of its time sleeping or blocking on a queue, it's fine - both of those things release the GIL. If it's constantly doing actual work, that will require the GIL, and therefore degrade the performance of the other threads in your app (assuming they're trying to do work at the same time). asyncio is designed for programs that are I/O-bound, and can therefore be run in a single thread, using asynchronous I/O. It sounds like your program may be a good fit for that depending on what your worker is doing.
The main thread is basically always just waiting for the user to press a key to access a different part of the menu. This seems like a situation asyncio would be perfect for, but, again, I'm not sure.
Any program where you're mostly waiting for I/O is potentially a good fit for asyncio - but only if you can find a library that makes curses (or whatever other GUI library you eventually choose) play nicely with it. Most GUI frameworks come with their own event loop, which will conflict with asyncio's, so you would need a library that can make the two event loops cooperate. You'd also need to make sure you can find asyncio-compatible versions of any other synchronous-I/O-based libraries your application uses (e.g. a database driver).
That said, you're not likely to see any kind of performance improvement by switching from your thread-based program to something asyncio-based. It'll likely perform about the same. Since you're only dealing with 3 threads, the overhead of context switching between them isn't very significant, so switching to a single-threaded, asynchronous I/O approach isn't going to make a very big difference. asyncio will help you avoid thread synchronization complexity (if that's an issue with your app - it's not clear that it is), and at least theoretically it would scale better if your app needed lots of threads, but it doesn't seem like that's the case. I think for you it basically comes down to which style you prefer to code in (assuming you can find all the asyncio-compatible libraries you need).

python threads & sockets

I have a "I just want to understand it" question..
first, I'm using python 2.6.5 on Ubuntu.
So.. threads in python (via the thread module) are only "threads", and the GIL just runs code blocks from each "thread" for a certain period of time, and so on and so on.. and there aren't actually real threads here..
So the question is - if I have a blocking socket in one thread, and I send data and block the thread for like 5 seconds, I expected it to block the whole program, because it's one C command (sock.send) that is blocking the thread. But I was surprised to see that the main thread continues to run.
So the question is - how is the GIL able to continue and run the rest of the code after it reaches a blocking command like send? Doesn't it have to use real threads here?
Thanks.
Python uses "real" threads, i.e. threads of the underlying platform. On Linux, it will use the pthread library (if you are interested, here is the implementation).
What is special about Python's threads is the GIL: A thread can only modify Python data structures if it holds this global lock. Thus, many Python operations cannot make use of multiple processor cores. A thread with a blocking socket won't hold the GIL though, so it does not affect other threads.
The GIL is often misunderstood, making people believe threads are almost useless in Python. The only thing the GIL prevents is concurrent execution of "pure" Python code on multiple processor cores. If you use threads to make a GUI responsive or to run other code during blocking I/O, the GIL won't affect you. If you use threads to run code in some C extension like NumPy/SciPy concurrently on multiple processor cores, the GIL won't affect you either.
Python wiki page on GIL mentions that
Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL.
The GIL (Global Interpreter Lock) is just a lock; it does not run anything by itself. Rather, the Python interpreter acquires and releases that lock as necessary. As a rule, the lock is held when running Python code, but released for calls to lower-level functions (such as sock.send). As Python threads are real OS-level threads, threads will not run Python code in parallel, but if one thread invokes a long-running C function, the GIL is released and another thread can run Python code until the first one finishes.
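A quick sketch of this in action (Python 3 syntax; socket.socketpair() stands in for a real connection): the thread blocked in recv() releases the GIL, and the main thread keeps running.

import socket
import threading
import time

# The second socket of the pair never sends, so recv() blocks indefinitely
# inside a C call - with the GIL released the whole time.
a, b = socket.socketpair()

def blocker():
    a.recv(1024)                # blocks in C, GIL released

threading.Thread(target=blocker, daemon=True).start()
for i in range(3):
    print("main thread still running:", i)
    time.sleep(1)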

Python threads all executing on a single core

I have a Python program that spawns many threads, runs 4 at a time, and each performs an expensive operation. Pseudocode:
for obj in objects:
    t = Thread(target=process, args=(obj,))  # args must be a tuple, hence the comma
    # if fewer than 4 threads are currently running, t.start(). Otherwise, add t to queue
But when the program is run, Activity Monitor in OS X shows that 1 of the 4 logical cores is at 100% and the others are at nearly 0. Obviously I can't force the OS to do anything but I've never had to pay attention to performance in multi-threaded code like this before so I was wondering if I'm just missing or misunderstanding something.
Thanks.
Note that in many cases (and virtually all cases where your "expensive operation" is a calculation implemented in Python), multiple threads will not actually run concurrently due to Python's Global Interpreter Lock (GIL).
The GIL is an interpreter-level lock. This lock prevents execution of multiple threads at once in the Python interpreter. Each thread that wants to run must wait for the GIL to be released by the other thread, which means your multi-threaded Python application is essentially single threaded, right? Yes. Not exactly. Sort of.

CPython uses what’s called “operating system” threads under the covers, which is to say each time a request to make a new thread is made, the interpreter actually calls into the operating system’s libraries and kernel to generate a new thread. This is the same as Java, for example. So in memory you really do have multiple threads and normally the operating system controls which thread is scheduled to run. On a multiple processor machine, this means you could have many threads spread across multiple processors, all happily chugging away doing work.

However, while CPython does use operating system threads (in theory allowing multiple threads to execute within the interpreter simultaneously), the interpreter also forces the GIL to be acquired by a thread before it can access the interpreter and stack and can modify Python objects in memory all willy-nilly. The latter point is why the GIL exists: The GIL prevents simultaneous access to Python objects by multiple threads. But this does not save you (as illustrated by the Bank example) from being a lock-sensitive creature; you don’t get a free ride. The GIL is there to protect the interpreter’s memory, not your sanity.
See the Global Interpreter Lock section of Jesse Noller's post for more details.
To get around this problem, check out Python's multiprocessing module.
multiple processes (with judicious use of IPC) are[...] a much better approach to writing apps for multi-CPU boxes than threads.
-- Guido van Rossum (creator of Python)
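A minimal sketch of that suggestion, with process standing in for the question's expensive operation:

from multiprocessing import Pool

def process(item):
    return item * item          # placeholder for the expensive operation

if __name__ == "__main__":
    # 4 worker processes sidestep the GIL entirely, so CPU-bound work
    # genuinely runs on 4 cores.
    with Pool(processes=4) as pool:
        results = pool.map(process, range(100))
    print(results[:5])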
Edit based on a comment from @spinkus:
If Python can't run multiple threads simultaneously, then why have threading at all?
Threads can still be very useful in Python when doing simultaneous operations that do not need to modify the interpreter's state. This includes many (most?) long-running function calls that are not in-Python calculations, such as I/O (file access or network requests) and calculations on NumPy arrays. These operations release the GIL while waiting for a result, allowing the program to continue executing. Then, once the result is received, the thread must re-acquire the GIL in order to use that result in "Python-land".
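For instance, a small sketch of the I/O case (the URLs are placeholders): each urlopen() call releases the GIL while waiting on the network, so the downloads overlap in time even under the GIL.

import threading
import urllib.request

urls = ["https://example.com", "https://example.org"]

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        print(url, len(resp.read()), "bytes")

threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()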
Python has a Global Interpreter Lock, which can prevent threads of interpreted code from being processed concurrently.
http://en.wikipedia.org/wiki/Global_Interpreter_Lock
http://wiki.python.org/moin/GlobalInterpreterLock
For ways to get around this, try the multiprocessing module, as advised here:
Does running separate python processes avoid the GIL?
AFAIK, in CPython the Global Interpreter Lock means that there can't be more than one block of Python code being run at any one time. Although this does not really affect anything on a single-processor/single-core machine, on a multicore machine it means you effectively have only one thread running at any one time - causing all the other cores to be idle.
