I am running some code that has X workers, each worker pulling tasks from a queue every second. For this I use Twisted's task.LoopingCall. Each worker fulfills its request (scraping some data) and then pushes the response back to another queue. All of this is done in the reactor thread, since I am not deferring anything to other threads.
I am wondering whether I should run all these jobs in separate threads or leave them as they are. And if so, is there a problem if I call task.LoopingCall every second from each thread?
No, you shouldn't use threads. You can't start a LoopingCall from another thread (unless you use reactor.callFromThread), and it wouldn't make your code any faster anyway.
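If you ever do need to start one from another thread, a minimal sketch of the callFromThread hand-off (poll here is a hypothetical stand-in for a worker body):

    from twisted.internet import reactor, task

    def poll():
        pass  # hypothetical worker: pull a task, scrape, push the response

    def start_loop():
        # LoopingCall must be started in the reactor thread
        task.LoopingCall(poll).start(1.0)

    # safe from any non-reactor thread: hand the call over to the reactor
    reactor.callFromThread(start_loop)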
If you notice a performance problem, you may want to profile your workload, figure out where the CPU-intensive work is, and then put that work into multiple processes, spawned with spawnProcess. You really can't skip the step where you figure out where the expensive work is, though: there's no magic pixie dust you can sprinkle on your Twisted application that will make it faster. If you choose a part of your code which isn't very intensive and doesn't contend for scarce resources like CPU or disk, then you will discover that the overhead of moving work to a different process outweighs any benefit of having it there.
You shouldn't use threads for that. Doing it all in the reactor thread is ok. If your scraping uses twisted.web.client to do the network access, it shouldn't block, so you will go as fast as it gets.
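For example, a hedged sketch of one worker's loop using twisted.web.client.Agent; handle_body and the URL are placeholders, not from the question:

    from twisted.internet import reactor, task
    from twisted.web.client import Agent, readBody

    agent = Agent(reactor)

    def handle_body(body):
        pass  # hypothetical: parse the page, push the result onto the out-queue

    def poll():
        # non-blocking HTTP request; the reactor stays free while it's in flight
        d = agent.request(b"GET", b"http://example.com/data")
        d.addCallback(readBody)
        d.addCallback(handle_body)
        return d  # LoopingCall waits for this Deferred before the next tick

    task.LoopingCall(poll).start(1.0)  # one worker, firing every second
    reactor.run()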
First, beware that Twisted's reactor sometimes uses internal threads (for DNS resolution, for example) without telling you anything. Of course, I haven't seen your program in particular.
Second, in Python (that is, in CPython), spawning threads to do computation (as opposed to blocking I/O) has little benefit. Read up on the GIL (Global Interpreter Lock).
In an asynchronous program (e.g. asyncio, Twisted, etc.), all system calls must be non-blocking. That means a non-blocking select (or something equivalent) needs to be executed in every iteration of the main loop. That seems more wasteful than the multi-threaded approach, where each thread can use a blocking call and sleep (without wasting CPU resources) until its socket is ready.
Does this sometimes cause asynchronous programs to be slower than their multi-threaded alternatives (despite thread switching costs), or is there some mechanism that makes this not a valid concern?
When working with select in a single-threaded program, you do not have to continuously check the results. The right way to work with it is to let it block until the relevant I/O has arrived, just as in the multi-threaded case.
However, instead of waiting on a single socket (or other I/O), the select call takes a list of relevant sockets and blocks until any of them is ready.
Once that happens, select wakes up and returns the list of sockets (or I/Os) that are ready. It is up to the coder to handle those ready sockets in the required way, and then, if the code has nothing else to do, it might start another iteration of select.
As you can see, no polling loop is required; the program does not consume CPU resources until one or more of the required sockets are ready. Moreover, if a few sockets become ready at almost the same time, the code wakes up once, handles all of them, and only then starts select again. Add to that the fact that the program does not require the resource overhead of several threads, and you can see why this is more effective in terms of OS resources.
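In Python, the standard-library selectors module wraps exactly this pattern. A minimal sketch of the block-then-dispatch loop described above, using a toy listening socket:

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    server = socket.socket()
    server.bind(("localhost", 8000))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ)

    while True:
        # blocks here, consuming no CPU, until one or more sockets are ready
        for key, events in sel.select():
            if key.fileobj is server:
                conn, _ = server.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = key.fileobj.recv(1024)
                if not data:  # peer closed the connection
                    sel.unregister(key.fileobj)
                    key.fileobj.close()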
In my question I separated the I/O handling into two categories: polling represented by non-blocking select, and "callback" represented by the blocking select. (The blocking select sleeps the thread, so it's not strictly speaking a callback; but conceptually it is similar to a callback, since it doesn't use CPU cycles until the I/O is ready. Since I don't know the precise term, I'll just use "callback").
I assumed that the asynchronous model cannot use "callback" I/O. It now seems to me that this assumption was incorrect. While an asynchronous program should not be using a non-blocking select, and it cannot strictly request a traditional callback from the OS either, it can certainly provide the OS with its main event loop and, say, a coroutine, and ask the OS to create a task in that event loop using that coroutine when an I/O socket is ready. This would not use any of the program's CPU cycles until the I/O is ready. (It might use the OS kernel's CPU cycles if the kernel uses polling rather than interrupts for I/O, but that would be the case even with a multi-threaded program.)
Of course, this requires that the OS support the asynchronous framework used by the program. It probably doesn't. But even then, it seems quite straightforward to add a middle layer that uses a single separate thread and a blocking select to talk to the OS and, whenever I/O is ready, creates a task in the program's main event loop. If this layer is included in the interpreter, the program would look perfectly asynchronous. If this layer is added as a library, the program would be largely asynchronous, apart from a simple additional thread that converts synchronous I/O to asynchronous I/O.
I have no idea whether any of this is done in Python, but it seems plausible conceptually.
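For what it's worth, asyncio ships with a layer along these lines: loop.run_in_executor hands a blocking call to a thread pool and delivers the result back to the event loop as an awaitable. A small sketch (the URL is just a placeholder):

    import asyncio
    import urllib.request

    def blocking_fetch(url):
        # ordinary synchronous, blocking I/O
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    async def main():
        loop = asyncio.get_running_loop()
        # runs in a worker thread; the event loop stays free meanwhile
        data = await loop.run_in_executor(None, blocking_fetch, "http://example.com/")
        print(len(data))

    asyncio.run(main())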
I have a pretty basic understanding of multithreading in Python and an even basic-er understanding of asyncio.
I'm currently writing a small curses-based program (eventually it will use a full GUI, but that's another story) that handles the UI and user I/O in the main thread, and then has two other daemon threads (each with its own queue and a worker method that pulls items from that queue):
a watcher thread that watches for time-based and conditional (e.g. posts to a message board, received messages, etc.) events to occur and then puts required tasks into...
the other (worker) daemon thread's queue which then completes them.
All three threads are continuously running concurrently, which leads me to some questions:
When the worker thread's queue (or, more generally, any thread's queue) is empty, should the thread be stopped until it has something to do again, or is it okay to leave it continuously running? Do concurrent threads take up a lot of processing power when they aren't doing anything other than watching their queues?
Should the two threads' queues be combined? Since the watcher thread is continuously running a single method, I guess the worker thread would be able to just pull tasks from the single queue that the watcher thread puts in.
I don't think it'll matter since I'm not multiprocessing, but is this setup affected by Python's GIL (which I believe still exists in 3.4) in any way?
Should the watcher thread be running continuously like that? From what I understand, and please correct me if I'm wrong, asyncio is supposed to be used for event-based multithreading, which seems relevant to what I'm trying to do.
The main thread is basically always just waiting for the user to press a key to access a different part of the menu. This seems like a situation asyncio would be perfect for, but, again, I'm not sure.
Thanks!
When the worker thread's queue (or, more generally, any thread's queue) is empty, should the thread be stopped until it has something to do again, or is it okay to leave it continuously running? Do concurrent threads take up a lot of processing power when they aren't doing anything other than watching their queues?
You should just use a blocking call to queue.get(). That will leave the thread blocked waiting on the queue, which means the GIL will be released, and no processing power (or at least a very minimal amount) will be used. Don't use non-blocking gets in a while loop, since that will require a lot more CPU wakeups.
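A minimal sketch of that pattern (the None sentinel for shutdown is a common convention, not something from your code):

    import queue
    import threading

    task_q = queue.Queue()

    def worker():
        while True:
            job = task_q.get()   # blocks (releasing the GIL) until a task arrives
            if job is None:      # sentinel: shut the worker down
                break
            job()                # tasks are assumed to be callables here
            task_q.task_done()

    threading.Thread(target=worker, daemon=True).start()
    task_q.put(lambda: print("hello from the worker"))
    task_q.put(None)             # stop the worker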
Should the two threads' queues be combined? Since the watcher thread is continuously running a single method, I guess the worker thread would be able to just pull tasks from the single queue that the watcher thread puts in.
If all the watcher is doing is pulling things off a queue and immediately putting them into another queue, where they get consumed by a single worker, it sounds like unnecessary overhead - you may as well consume them directly in the worker. It's not exactly clear to me if that's the case, though - is the watcher consuming from a queue, or just putting items into one? If it is consuming from a queue, who is putting stuff into it?
I don't think it'll matter since I'm not multiprocessing, but is this setup affected by Python's GIL (which I believe still exists in 3.4) in any way?
Yes, this is affected by the GIL. Only one of your threads can run Python bytecode at a time, so you won't get true parallelism, except when threads are doing I/O (which releases the GIL). If your worker thread is doing CPU-bound work, you should seriously consider running it in a separate process via multiprocessing, if possible.
Should the watcher thread be running continuously like that? From what I understand, and please correct me if I'm wrong, asyncio is supposed to be used for event-based multithreading, which seems relevant to what I'm trying to do.
It's hard to say, because I don't know exactly what "running continuously" means. What is it doing continuously? If it spends most of its time sleeping or blocking on a queue, it's fine - both of those things release the GIL. If it's constantly doing actual work, that will require the GIL, and therefore degrade the performance of the other threads in your app (assuming they're trying to do work at the same time). asyncio is designed for programs that are I/O-bound, and can therefore be run in a single thread, using asynchronous I/O. It sounds like your program may be a good fit for that depending on what your worker is doing.
The main thread is basically always just waiting for the user to press a key to access a different part of the menu. This seems like a situation asyncio would be perfect for, but, again, I'm not sure.
Any program where you're mostly waiting for I/O is potentially a good fit for asyncio - but only if you can find a library that makes curses (or whatever other GUI library you eventually choose) play nicely with it. Most GUI frameworks come with their own event loop, which will conflict with asyncio's, so you'd need a library that bridges the two. You'd also need to make sure you can find asyncio-compatible versions of any other synchronous-I/O-based library your application uses (e.g. a database driver).
That said, you're not likely to see any kind of performance improvement by switching from your thread-based program to something asyncio-based. It'll likely perform about the same. Since you're only dealing with three threads, the overhead of context switching between them isn't very significant, so switching from that to a single-threaded, asynchronous I/O approach isn't going to make a very big difference. asyncio will help you avoid thread synchronization complexity (if that's an issue with your app - it's not clear that it is), and at least theoretically, it would scale better if your app ever needed lots of threads, but it doesn't seem like that's the case. I think for you, it's basically down to which style you prefer to code in (assuming you can find all the asyncio-compatible libraries you need).
I've been reading about the reactor design pattern, specifically in the context of the Python Twisted networking framework. My simple understanding of the reactor design is that there is a single thread that will sit and wait until one or more I/O sources (or file descriptors) become available, and then it will synchronously loop through each of those sources, invoking whatever callbacks are specified for each of them. This does mean that the program as a whole blocks if any of the callbacks are themselves blocking. And regardless, once all callbacks have executed, the reactor goes back to waiting for more I/O sources to become ready.
What are the pros and cons of this, compared to asynchronously looping through each source as it appears, i.e. launching a separate thread for each source? I imagine this may be less efficient if all your callbacks are very fast, as the OS now has to deal with managing multiple threads and swapping between them. But it seems that it's now impossible to block the main program, and as an added benefit, the main reactor can keep listening for sources. In short, why does something like Twisted not do this, instead of keeping to a single-threaded model?
What are the pros and cons of this, compared to asynchronously looping through each source as it appears, i.e. launching a separate thread for each source?
What you're describing is basically what happens in a multithreaded program that uses blocking I/O APIs. In this case, the "reactor" moves into the kernel and the "asynchronous looping" is the kernel completing some outstanding blocking operation, freeing up a user-space thread to proceed.
The cons of this approach are the greatly increased complexity with respect to thread safety (i.e., correctness) that it incurs compared to a strictly single-threaded approach.
The pros are better utilization of multiple CPUs (but running multiple single-threaded event-driven processes often offers this benefit as well) and the greater number of programmers who are familiar and comfortable (though often mistakenly so) with the multithreading approach to concurrency.
Also related, though, are the PyPy team's efforts towards providing a better abstraction over the conventional multithreading model. PyPy's work towards Software Transactional Memory (STM) could offer a system in which work is dispatched asynchronously to multiple worker threads without violating the assumptions that are valid in a strictly single-threaded system. If this works out, it could offer the best of both worlds.
But it seems that it's now impossible to block the main program,
I'm not a Python guy, but I have done this in the context of Boost.Asio. You're correct: your callbacks need to execute quickly and return control to the main reactor. The idea is to use only asynchronous calls in your callbacks. For example, you wouldn't use an API for sending an IP datagram that blocks and returns a status code. Instead, you'd use a non-blocking API where you register success and failure callbacks. This lets the send call return immediately. The reactor will then call the success/failure callback once the OS has dealt with the packet.
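In Twisted terms, since this thread is about Python, the same pattern is attaching success and failure callbacks to a Deferred rather than blocking on a status code. A hedged sketch:

    from twisted.internet import reactor
    from twisted.web.client import Agent

    def on_success(response):
        print("got response:", response.code)

    def on_failure(failure):
        print("request failed:", failure.getErrorMessage())

    agent = Agent(reactor)
    d = agent.request(b"GET", b"http://example.com/")  # returns immediately
    d.addCallbacks(on_success, on_failure)             # the reactor fires one later
    reactor.run()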
I have been reading up on the threaded model of programming versus the asynchronous model in this really good article: http://krondo.com/blog/?p=1209
However, the article mentions the following points.
An async program will simply outperform a sync program by switching between tasks whenever there is I/O.
Threads are managed by the operating system.
I remember reading that threads are managed by the operating system by moving TCBs between the ready queue and the waiting queue (amongst other queues). In this case, threads don't waste time waiting either, do they?
In light of the above mentioned, what are the advantages of async programs over threaded programs?
It is very difficult to write code that is thread-safe. With asynchronous code, you know exactly where the code will shift from one task to the next, so race conditions are much harder to come by.
Threads consume a fair amount of memory, since each thread needs its own stack. With async code, all the code shares the same stack, and the stack is kept small by continuously unwinding it between tasks.
Threads are OS structures and therefore cost the platform more memory to support. There is no such problem with asynchronous tasks.
Update 2022:
Many languages now support stackless coroutines (async/await). These allow us to write a task almost synchronously while yielding to other tasks (awaiting) at set places (when sleeping, or waiting for networking or other threads).
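In Python, that looks like the following sketch; each await is a set place where the task yields to others:

    import asyncio

    async def job(name, delay):
        print(name, "started")
        await asyncio.sleep(delay)  # yields here until the "I/O" completes
        print(name, "finished")

    async def main():
        # both jobs run concurrently on a single thread
        await asyncio.gather(job("a", 1.0), job("b", 0.5))

    asyncio.run(main())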
There are two ways to create threads:
synchronous threading - the parent creates one (or more) child threads and then must wait for each child to terminate. Synchronous threading is often referred to as the fork-join model.
asynchronous threading - the parent and child run concurrently/independently of one another. Multithreaded servers typically follow this model.
resource - http://www.amazon.com/Operating-System-Concepts-Abraham-Silberschatz/dp/0470128720
Assume you have two tasks which do not involve any I/O (on a multiprocessor machine). In this case, threads outperform async, because async, like a single-threaded program, executes the tasks in order, whereas threads can execute both tasks simultaneously.
Assume you have two tasks which involve I/O (on a multiprocessor machine). In this case, async and threads perform more or less the same (performance might vary based on the number of cores, scheduling, how CPU-intensive the tasks are, etc.). Async also takes fewer resources, has lower overhead, and is less complex to program than a multi-threaded program.
How does it work?
Thread 1 executes Task 1; since the task is waiting on I/O, the thread is moved to the I/O waiting queue. Similarly, Thread 2 executes Task 2; since it also involves I/O, that thread is moved to the I/O waiting queue too. As soon as a thread's I/O request is resolved, it is moved to the ready queue so the scheduler can schedule it for execution.
Async executes Task 1 and, without waiting for its I/O to complete, continues with Task 2; then it waits for the I/O of both tasks to complete. It completes the tasks in the order of I/O completion.
Async is best suited for tasks which involve web service calls, database queries, and the like; threads are best for CPU-intensive tasks.
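A small sketch of the "order of I/O completion" point, with asyncio.sleep standing in for a web service or database call:

    import asyncio

    async def io_task(name, delay):
        await asyncio.sleep(delay)  # stands in for a web or database call
        return name

    async def main():
        tasks = [io_task("task 1", 2.0), io_task("task 2", 1.0)]
        # results arrive in I/O-completion order: task 2, then task 1
        for fut in asyncio.as_completed(tasks):
            print(await fut, "completed")

    asyncio.run(main())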
The video below explains the async vs. threaded model and when to use each:
https://www.youtube.com/watch?v=kdzL3r-yJZY
Hope this is helpful.
First of all, note that many of the details of how threads are implemented and scheduled are very OS-specific. In general, you shouldn't need to worry about threads waiting on each other, since the OS and the hardware will attempt to arrange for them to run efficiently, whether asynchronously on a single-processor system or in parallel on multi-processors.
Once a thread has finished waiting for something, say I/O, it can be thought of as runnable. Threads that are runnable will be scheduled for execution at some point soon. Whether this is implemented as a simple queue or something more sophisticated is, again, OS- and hardware-specific. You can think of the set of blocked threads as a set rather than as a strictly ordered queue.
Note that on a single-processor system, asynchronous programs as defined here are equivalent to threaded programs.
See http://en.wikipedia.org/wiki/Thread_(computing)#I.2FO_and_scheduling
However, the use of blocking system calls in user threads (as opposed to kernel threads) or fibers can be problematic. If a user thread or a fiber performs a system call that blocks, the other user threads and fibers in the process are unable to run until the system call returns. A typical example of this problem is when performing I/O: most programs are written to perform I/O synchronously. When an I/O operation is initiated, a system call is made, and does not return until the I/O operation has been completed. In the intervening period, the entire process is "blocked" by the kernel and cannot run, which starves other user threads and fibers in the same process from executing.
According to this, your whole process might be blocked, and no thread will be scheduled, when one thread is blocked in I/O. I think this is OS-specific and will not always hold.
To be fair, let's point out the benefit of Threads under CPython GIL compared to async approach:
it's easier to first write typical code that has one flow of events (no parallel execution) and then run multiple copies of it in separate threads: each copy stays responsive, while the benefit of executing all I/O operations in parallel is achieved automatically;
many time-proven libraries are sync and are therefore easy to include in the threaded version, but not in the async one;
some sync libraries actually release the GIL at the C level, which allows parallel execution for tasks beyond I/O-bound ones: e.g. NumPy;
it's harder to write async code in general: the inclusion of a heavy CPU-bound section will make concurrent tasks unresponsive, or one may forget to await a result and finish execution early.
So if there are no immediate plans to scale your services beyond ~100 concurrent connections it may be easier to start with a threaded version and then rewrite it... using some other more performant language like Go.
Async I/O means there is already a thread in the driver that does the job, so you are duplicating functionality and incurring some overhead. On the other hand, it is often not documented how exactly the driver thread behaves, and in complex scenarios, when you want to control timeout/cancellation/start/stop behaviour or synchronization with other threads, it makes sense to implement your own thread. It is also sometimes easier to reason in sync terms.
I have a situation where I'm downloading a lot of files. Right now everything runs on one main Python thread, and downloads as many as 3000 files every few minutes. The problem is that the time it takes to do this is too long. I realize Python has no true multi-threading, but is there a better way of doing this? I was thinking of launching multiple threads since the I/O bound operations should not require access to the global interpreter lock, but perhaps I misunderstand that concept.
Multithreading is just fine for the specific purpose of speeding up I/O on the net (although asynchronous programming would give even greater performance). CPython's multithreading is quite "true" (native OS threads) -- what you're probably thinking of is the GIL, the global interpreter lock that stops different threads from simultaneously running Python code. But all the I/O primitives give up the GIL while they're waiting for system calls to complete, so the GIL is not relevant to I/O performance!
For asynchronous programming, the most powerful framework around is Twisted, but it can take a while to get the hang of it if you've never done such programming. It would probably be simpler for you to get extra I/O performance via the use of a pool of threads.
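A pool-of-threads sketch with the standard library (the URL list is a placeholder):

    import concurrent.futures
    import urllib.request

    urls = ["http://example.com/file%d" % i for i in range(100)]  # hypothetical

    def fetch(url):
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, len(resp.read())

    # 20 threads can all block in I/O at once; the GIL is released while they wait
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        for url, size in pool.map(fetch, urls):
            print(url, size)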
You could always take a look at multiprocessing.
is there a better way of doing this?
Yes
I was thinking of launching multiple threads since the I/O bound operations
Don't.
At the OS level, all the threads in a process are sharing a limited set of I/O resources.
If you want real speed, spawn as many heavyweight OS processes as your platform will tolerate. The OS is really, really good about balancing I/O workloads among processes. Make the OS sort this out.
Folks will say that spawning 3000 processes is bad, and they're right. You probably only want to spawn a few hundred at a time.
What you really want is the following.
A shared message queue in which the 3000 URIs are queued up.
A few hundred workers which are all reading from the queue.
Each worker gets a URI from the queue and gets the file.
The workers can stay running. When the queue's empty, they'll just sit there, waiting for work.
"every few minutes" you dump the 3000 URI's into the queue to make the workers start working.
This will tie up every resource on your processor, and it's quite trivial. Each worker is only a few lines of code. Loading the queue is a special "manager" that's just a few lines of code, also.
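A hedged sketch of that design with multiprocessing; the worker count and URIs are placeholders:

    import multiprocessing
    import urllib.request

    def worker(q):
        # workers stay running; q.get() blocks until a URI arrives
        while True:
            uri = q.get()
            try:
                with urllib.request.urlopen(uri) as resp:
                    resp.read()  # a real worker would write this to disk
            finally:
                q.task_done()

    if __name__ == "__main__":
        q = multiprocessing.JoinableQueue()
        for _ in range(200):  # "a few hundred" workers
            multiprocessing.Process(target=worker, args=(q,), daemon=True).start()

        # the "manager": every few minutes, dump the URIs into the queue
        for uri in ["http://example.com/%d" % i for i in range(3000)]:
            q.put(uri)
        q.join()  # wait until every URI has been fetched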
Gevent is perfect for this.
Gevent's use of greenlets (lightweight coroutines in the same Python process) offers you asynchronous operations without compromising code readability or introducing abstract 'reactor' concepts into your mix.
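A minimal sketch, monkey-patching first so the blocking urllib call becomes cooperative:

    from gevent import monkey
    monkey.patch_all()  # make stdlib sockets yield to the gevent hub

    import gevent
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return len(resp.read())

    # greenlets are cheap; these all "block" cooperatively in the same thread
    jobs = [gevent.spawn(fetch, "http://example.com/") for _ in range(100)]
    gevent.joinall(jobs, timeout=30)
    print([job.value for job in jobs])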