I've been reading about the reactor design pattern, specifically in the context of the Python Twisted networking framework. My simple understanding of the reactor design is that there is a single thread that will sit and wait until one or more I/O sources (or file descriptors) become available, and then it will synchronously loop through each of those sources, doing whatever callbacks specified for each of these sources. Which does mean that the program as a whole would block if any of the callbacks are themselves blocking. And regardless, once all callbacks have executed, the reactor goes back to waiting for more I/O sources to become ready.
What are the pros and cons of this, compared to asynchronously looping through each source as they appear, i.e. launching a separate thread for each source. I imagine this may be less efficient if all your callbacks are very fast, as the OS now has to deal with managing multiple threads and swapping between them. But it seems that it's now impossible to block the main program, and as an added benefit, the main reactor can keep listening for sources. In short, why does something like Twisted not do this, instead keeping to a single-threaded model?
What are the pros and cons of this, compared to asynchronously looping through each source as they appear, i.e. launching a separate thread for each source.
What you're describing is basically what happens in a multithreaded program that uses blocking I/O APIs. In this case, the "reactor" moves into the kernel and the "asynchronous looping" is the kernel completing some outstanding blocking operation, freeing up a user-space thread to proceed.
The cons of this approach are the greatly increased complexity with respect to thread-safety (ie, correctness) that it incurs compared to a strictly single-threaded approach.
The pros are better utilization of multiple CPUs (but running multiple single-threaded event-driven processes often offers this benefit as well) and the greater number of programmers who are familiar and comfortable (though often mistakenly so) with the multithreading approach to concurrency.
Also related, though, are the PyPy team's efforts towards providing a better abstraction over the conventional multithreading model. PyPy's work towards Software Transactional Memory (STM) could offer a system in which work is dispatched asynchronously to multiple worker threads without violating the assumptions that are valid in a strictly single-threaded system. If this works out, it could offer the best of both worlds.
But it seems that it's now impossible to block the main program,
I'm not a Python guy but have done this in the context of Boost.Asio. You're correct: your callbacks need to execute quickly and return control to the main reactor. The idea is to only use asynchronous calls in your callbacks. For example, you wouldn't use an API for sending an IP datagram that blocks and returns a status code. Instead, you'd use a non-blocking API where you register success and failure callbacks. This lets the send call return immediately. The reactor will then call the success/failure callback once the OS has dealt with the packet.
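The success/failure callback idea can be sketched in Python with asyncio instead of Boost.Asio. This is only an illustration: send_datagram is a hypothetical stand-in that does no real networking, and the fail flag exists purely to exercise the error path.

```python
import asyncio

results = []


async def send_datagram(payload, fail=False):
    # Hypothetical stand-in for a non-blocking send: yield to the loop once,
    # then either report the number of bytes "sent" or raise an error.
    await asyncio.sleep(0)
    if fail:
        raise OSError("network unreachable")
    return len(payload)


def on_done(future):
    # Called by the event loop once the operation has finished.
    error = future.exception()
    if error is None:
        results.append(("ok", future.result()))
    else:
        results.append(("error", str(error)))


async def main():
    good = asyncio.ensure_future(send_datagram(b"hello"))
    bad = asyncio.ensure_future(send_datagram(b"hello", fail=True))
    for operation in (good, bad):
        operation.add_done_callback(on_done)
    # Starting the sends returned immediately; now let the loop run them.
    await asyncio.wait([good, bad])
    await asyncio.sleep(0)  # give any remaining callbacks a chance to run


asyncio.run(main())
```

Neither call blocks the caller; the loop invokes on_done later with the outcome, which is the same shape as registering success and failure handlers.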
Related
In an asynchronous program (e.g., asyncio, Twisted, etc.), all system calls must be non-blocking. That means a non-blocking select (or something equivalent) needs to be executed in every iteration of the main loop. That seems more wasteful than the multi-threaded approach where each thread can use a blocking call and sleep (without wasting CPU resources) until the socket is ready.
Does this sometimes cause asynchronous programs to be slower than their multi-threaded alternatives (despite thread switching costs), or is there some mechanism that makes this not a valid concern?
When working with select in a single thread program, you do not have to continuously check the results. The right way to work with it is to let it block until the relevant I/O has arrived, just like in the case of multi threads.
However, instead of waiting for a single socket (or other I/O), the select call gets a list of relevant sockets, and blocks until any of them becomes ready.
Once that happens, select wakes up and returns a list of the sockets (or I/Os) that are ready. It is up to the coder to handle those ready sockets in the required way, and then, if the code has nothing else to do, it might start another iteration of the select.
As you can see, no polling loop is required; the program does not require CPU resources until one or more of the required sockets are ready. Moreover, if a few sockets become ready at almost the same time, the code wakes up once, handles all of them, and only then starts select again. Add to that the fact that the program does not require the resource overhead of several threads, and you can see why this is more efficient in terms of OS resources.
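A minimal sketch of that blocking select loop, assuming two local socket pairs stand in for real network connections (the names a_recv, b_recv and the 1-second timeout are illustrative choices):

```python
import select
import socket

# Two connected pairs stand in for two independent client connections.
a_recv, a_send = socket.socketpair()
b_recv, b_send = socket.socketpair()

a_send.sendall(b"from a")
b_send.sendall(b"from b")

pending = {a_recv: "a", b_recv: "b"}
received = {}

while pending:
    # Blocks (sleeping, not spinning) until at least one socket is readable.
    readable, _, _ = select.select(list(pending), [], [], 1.0)
    if not readable:
        break  # timed out; nothing more is coming in this sketch
    # One wake-up may report several ready sockets; handle all of them.
    for sock in readable:
        received[pending.pop(sock)] = sock.recv(1024)

for sock in (a_recv, a_send, b_recv, b_send):
    sock.close()
```

The process uses no CPU while it sits inside select.select, and a single wake-up can service every socket that became ready in the meantime.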
In my question I separated the I/O handling into two categories: polling represented by non-blocking select, and "callback" represented by the blocking select. (The blocking select sleeps the thread, so it's not strictly speaking a callback; but conceptually it is similar to a callback, since it doesn't use CPU cycles until the I/O is ready. Since I don't know the precise term, I'll just use "callback").
I assumed that asynchronous model cannot use "callback" I/O. It now seems to me that this assumption was incorrect. While an asynchronous program should not be using non-blocking select, and it cannot strictly request a traditional callback from the OS either, it can certainly provide OS with its main event loop and say a coroutine, and ask the OS to create a task in that event loop using that coroutine when an I/O socket is ready. This would not use any of the program's CPU cycles until the I/O is ready. (It might use OS kernel's CPU cycles if it uses polling rather than interrupts for I/O, but that would be the case even with a multi-threaded program.)
Of course, this requires that the OS supports the asynchronous framework used by the program. It probably doesn't. But even then, it seems quite straightforward to add a middle layer that uses a single separate thread and blocking select to talk to the OS and, whenever I/O is ready, creates a task in the program's main event loop. If this layer is included in the interpreter, the program would look perfectly asynchronous. If this layer is added as a library, the program would be largely asynchronous, apart from a simple additional thread that converts synchronous I/O to asynchronous I/O.
I have no idea whether any of this is done in python, but it seems plausible conceptually.
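The middle layer described above can be sketched with asyncio, whose loop.call_soon_threadsafe lets another thread schedule work on the event loop. This is an illustration of the idea, not a claim about how any interpreter implements I/O; io_thread and inbox are hypothetical names, and a local socket pair stands in for a real connection.

```python
import asyncio
import select
import socket
import threading


def io_thread(sock, loop, inbox):
    # Runs outside the event loop: sleeps in a blocking select, then hands
    # whatever arrived to the loop in a thread-safe way.
    while True:
        select.select([sock], [], [])  # blocks until sock is readable
        data = sock.recv(1024)
        loop.call_soon_threadsafe(inbox.put_nowait, data)
        if not data:  # empty bytes means the peer closed the connection
            break


async def main():
    loop = asyncio.get_running_loop()
    inbox = asyncio.Queue()
    reader, writer = socket.socketpair()
    worker = threading.Thread(target=io_thread, args=(reader, loop, inbox))
    worker.start()

    writer.sendall(b"hello")
    writer.close()

    chunks = []
    while True:
        data = await inbox.get()  # the coroutine only wakes when data exists
        if not data:
            break
        chunks.append(data)

    worker.join()
    reader.close()
    return b"".join(chunks)


message = asyncio.run(main())
```

From the coroutine's point of view the program is fully asynchronous; the only blocking call lives in the helper thread.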
I'm trying to write a scalable custom web server.
Here's what I have so far:
The main loop and request interpreter are in Cython. The main loop accepts connections and assigns the sockets to one of the processes in the pool (has to be processes, threads won't get any benefit from multi-core hardware because of the GIL).
Each process has a thread pool. The process assigns the socket to a thread.
The thread calls recv (blocking) on the socket and waits for data. When some shows up, it gets piped into the request interpreter, and then sent via WSGI to the application running in that thread.
Now I've heard about epoll and am a little confused. Is there any benefit to using epoll to get socket data and then pass that directly to the processes? Or should I just go the usual route of having each thread wait on recv?
PS: What is epoll actually used for? It seems like multithreading and blocking fd calls would accomplish the same thing.
If you're already using multiple threads, epoll doesn't offer you much additional benefit.
The point of epoll is that a single thread can listen for activity on many file descriptors simultaneously (and respond to events on each as they occur), and thus provide event-driven multitasking without requiring the spawning of additional threads. Threads are relatively cheap (compared to spawning processes), but each one does require some overhead (after all, they each have to maintain a call stack).
If you wanted to, you could rewrite your pool processes to be single-threaded using epoll, which would reduce your overall thread usage count, but of course you'd have to consider whether that's something you care about or not - in general, for low numbers of simultaneous requests on each worker, the overhead of spawning threads wouldn't matter, but if you want each worker to be able to handle 1000s of open connections, that overhead can become significant (and that's where epoll shines).
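As a rough illustration of one thread serving several connections, here is a sketch using the standard selectors module, which picks epoll on Linux when it is available. The three local socket pairs and the names make_reader and conn0..conn2 are assumptions made for the example, not part of any real server.

```python
import selectors
import socket

selector = selectors.DefaultSelector()  # epoll on Linux, kqueue/select elsewhere
log = []


def make_reader(name):
    def on_readable(sock):
        data = sock.recv(1024)
        if data:
            log.append((name, data))
        else:
            # Empty read: the peer closed, so stop watching this socket.
            selector.unregister(sock)
            sock.close()
    return on_readable


pairs = [socket.socketpair() for _ in range(3)]
for index, (reader, _) in enumerate(pairs):
    reader.setblocking(False)
    selector.register(reader, selectors.EVENT_READ, make_reader(f"conn{index}"))

# Simulate three clients that each send one message and disconnect.
for index, (_, writer) in enumerate(pairs):
    writer.sendall(b"ping %d" % index)
    writer.close()

# The event loop: one thread, no per-connection stacks.
while selector.get_map():
    for key, _ in selector.select(timeout=1.0):
        key.data(key.fileobj)  # key.data holds the callback registered above

selector.close()
```

Adding more connections only adds entries to the selector, not threads, which is where the savings at thousands of open connections come from.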
But...
What you're describing sounds suspiciously like you're basically reinventing the wheel - your:
main loop and request interpreter
pool of processes
sounds almost exactly like:
nginx (or any other load balancer/reverse proxy)
A pre-forking tornado app
Tornado is a single-threaded web server python module using epoll, and it has the capability built-in for pre-forking (meaning that it spawns multiple copies of itself as separate processes, effectively creating a process pool). Tornado is based on the tech created to power Friendfeed - they needed a way to handle huge numbers of open connections for long-polling clients looking for new real-time updates.
If you're doing this as a learning process, then by all means, reinvent away! It's a great way to learn. But if you're actually trying to build an application on top of these kinds of things, I'd highly recommend considering using the existing, stable, communally-developed projects - it'll save you a lot of time, false starts, and potential gotchas.
The epoll function (and the other functions in the same family, poll and select) allows you to write single-threaded networking code that manages multiple network connections. Since there is no threading, there is no need for synchronisation as would be required in a multi-threaded program (this can be difficult to get right).
On the other hand, you'll need to have an explicit state machine for each connection. In a threaded program, this state machine is implicit.
These functions just offer another way to multiplex multiple connections in a process. Sometimes it is easier not to use threads; other times you're already using threads, and thus it is easier just to use blocking sockets (which release the GIL in Python).
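To make "explicit state machine" concrete, here is a small sketch for a made-up length-prefixed protocol (a decimal length, a newline, then that many bytes). The protocol and the class name FramedConnection are invented for illustration. In a threaded server the same logic would just be two blocking reads in a row; here the position in the protocol has to be stored between calls.

```python
class FramedConnection:
    """Per-connection parser state for a hypothetical length-prefixed protocol."""

    def __init__(self):
        self.buffer = b""
        self.state = "WAIT_HEADER"
        self.expected = 0
        self.messages = []

    def feed(self, data):
        # Called by the event loop whenever some bytes arrive; data may
        # contain a partial header, a partial body, or several messages.
        self.buffer += data
        while True:
            if self.state == "WAIT_HEADER":
                if b"\n" not in self.buffer:
                    return  # header incomplete, wait for more bytes
                header, self.buffer = self.buffer.split(b"\n", 1)
                self.expected = int(header)
                self.state = "WAIT_BODY"
            else:
                if len(self.buffer) < self.expected:
                    return  # body incomplete, wait for more bytes
                body = self.buffer[:self.expected]
                self.buffer = self.buffer[self.expected:]
                self.messages.append(body)
                self.state = "WAIT_HEADER"


connection = FramedConnection()
for chunk in (b"5\nhel", b"lo3\nab", b"c"):
    connection.feed(chunk)
```

Each connection gets its own instance, so the single thread can switch between connections at any point without losing track of where each one is.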
I am running some code that has X workers, each worker pulling tasks from a queue every second. For this I use twisted's task.LoopingCall() function. Each worker fulfills its request (scrape some data) and then pushes the response back to another queue. All this is done in the reactor thread since I am not deferring this to any other thread.
I am wondering whether I should run all these jobs in separate threads or leave them as they are. And if so, is there a problem if I call task.LoopingCall every second from each thread ?
No, you shouldn't use threads. You can't call LoopingCall from a thread (unless you use reactor.callFromThread), but it wouldn't help you make your code faster.
If you notice a performance problem, you may want to profile your workload, figure out where the CPU-intensive work is, and then put that work into multiple processes, spawned with spawnProcess. You really can't skip the step where you figure out where the expensive work is, though: there's no magic pixie dust you can sprinkle on your Twisted application that will make it faster. If you choose a part of your code which isn't very intensive and doesn't require blocking resources like CPU or disk, then you will discover that the overhead of moving work to a different process may outweigh any benefit of having it there.
You shouldn't use threads for that. Doing it all in the reactor thread is ok. If your scraping uses twisted.web.client to do the network access, it shouldn't block, so you will go as fast as it gets.
First, beware that Twisted's reactor sometimes multithreads and assigns tasks without telling you anything. Of course, I haven't seen your program in particular.
Second, in Python (that is, in CPython) spawning threads to do non-blocking computation has little benefit. Read up on the GIL (Global Interpreter Lock).
I have been reading up on the threaded model of programming versus the asynchronous model from this really good article. http://krondo.com/blog/?p=1209
However, the article mentions the following points.
An async program will simply outperform a sync program by switching between tasks whenever there is I/O.
Threads are managed by the operating system.
I remember reading that threads are managed by the operating system by moving TCBs between the Ready-Queue and the Waiting-Queue (amongst other queues). In this case, threads don't waste time on waiting either, do they?
In light of the above mentioned, what are the advantages of async programs over threaded programs?
It is very difficult to write code that is thread safe. With asynchronous code, you know exactly where the code will shift from one task to the next, and race conditions are therefore much harder to come by.
Threads consume a fair amount of memory since each thread needs to have its own stack. With async code, all the code shares the same stack, and the stack is kept small because it is continuously unwound between tasks.
Threads are OS structures and therefore require more memory for the platform to support. There is no such problem with asynchronous tasks.
Update 2022:
Many languages now support stackless co-routines (async/await). This allows us to write a task almost synchronously while yielding to other tasks (awaiting) at set places (sleeping or waiting for networking or other threads)
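A small sketch of that style using Python's asyncio: each worker reads like sequential code, and the only places where another task can run are the await points. The worker names and delays are arbitrary, and asyncio.sleep stands in for real network waits.

```python
import asyncio

log = []


async def worker(name, delay):
    log.append(f"{name} start")
    # The only point where this task gives up control to the others.
    await asyncio.sleep(delay)
    log.append(f"{name} done")


async def main():
    # Both workers run concurrently on a single thread.
    await asyncio.gather(worker("a", 0.05), worker("b", 0.01))


asyncio.run(main())
```

Because switches happen only at await, the two appends inside worker can never be interleaved with another task's code in between unless an await sits there.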
There are two ways to create threads:
synchronous threading - the parent creates one (or more) child threads and then must wait for each child to terminate. Synchronous threading is often referred to as the fork-join model.
asynchronous threading - the parent and child run concurrently/independently of one another. Multithreaded servers typically follow this model.
resource - http://www.amazon.com/Operating-System-Concepts-Abraham-Silberschatz/dp/0470128720
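The fork-join (synchronous threading) model can be sketched with Python's threading module as follows; the square function and the four-element workload are placeholders chosen for the example.

```python
import threading

results = [None] * 4


def square(index):
    # Each child writes to its own slot, so no lock is needed here.
    results[index] = index * index


# Fork: the parent creates and starts the children.
children = [threading.Thread(target=square, args=(i,)) for i in range(4)]
for child in children:
    child.start()

# Join: the parent waits for every child before using the results.
for child in children:
    child.join()
```

In the asynchronous threading model the parent would skip the join step and carry on with its own work while the children run.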
Assume you have 2 tasks that do not involve any I/O (on a multiprocessor machine).
In this case threads outperform async, because async, like a single-threaded program, executes your tasks in order, while threads can execute both tasks simultaneously.
Assume you have 2 tasks that involve I/O (on a multiprocessor machine).
In this case async and threads perform more or less the same (performance might vary based on the number of cores, scheduling, how CPU-intensive the tasks are, etc.). Async also takes fewer resources, has lower overhead, and is less complex to program than a multithreaded program.
How does it work?
Thread 1 executes Task 1; since it is waiting for I/O, it is moved to the I/O waiting queue. Similarly, Thread 2 executes Task 2; since it also involves I/O, it is moved to the I/O waiting queue. As soon as a thread's I/O request is resolved, it is moved to the ready queue so the scheduler can schedule the thread for execution.
Async executes Task 1 and, without waiting for its I/O to complete, continues with Task 2; then it waits for the I/O of both tasks to complete. It completes the tasks in the order of I/O completion.
Async is best suited for tasks that involve web service calls, database queries, etc.
Threads are best suited for CPU-intensive tasks.
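A rough way to see the I/O case for yourself, assuming time.sleep is an acceptable stand-in for a blocking I/O wait (fake_io, the 0.1-second delay and the task count of 4 are arbitrary choices for the sketch):

```python
import threading
import time


def fake_io(delay=0.1):
    # Stand-in for a blocking I/O call; the thread sleeps without using CPU.
    time.sleep(delay)


def run_sequential(count):
    start = time.monotonic()
    for _ in range(count):
        fake_io()
    return time.monotonic() - start


def run_threaded(count):
    start = time.monotonic()
    threads = [threading.Thread(target=fake_io) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.monotonic() - start


sequential_elapsed = run_sequential(4)
threaded_elapsed = run_threaded(4)
```

The waits overlap in the threaded run, so it should finish in roughly the time of one wait rather than four. An async version overlaps the waits in the same way on a single thread.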
The video below explains the async vs. threaded models and when to use each:
https://www.youtube.com/watch?v=kdzL3r-yJZY
Hope this is helpful.
First of all, note that a lot of the detail of how threads are implemented and scheduled is very OS-specific. In general, you shouldn't need to worry about threads waiting on each other, since the OS and the hardware will attempt to arrange for them to run efficiently, whether asynchronously on a single-processor system or in parallel on multi-processors.
Once a thread has finished waiting for something, say I/O, it can be thought of as runnable. Threads that are runnable will be scheduled for execution at some point soon. Whether this is implemented as a simple queue or something more sophisticated is, again, OS- and hardware-specific. You can think of the set of blocked threads as a set rather than as a strictly ordered queue.
Note that on a single-processor system, asynchronous programs as defined here are equivalent to threaded programs.
see http://en.wikipedia.org/wiki/Thread_(computing)#I.2FO_and_scheduling
However, the use of blocking system calls in user threads (as opposed to kernel threads) or fibers can be problematic. If a user thread or a fiber performs a system call that blocks, the other user threads and fibers in the process are unable to run until the system call returns. A typical example of this problem is when performing I/O: most programs are written to perform I/O synchronously. When an I/O operation is initiated, a system call is made, and does not return until the I/O operation has been completed. In the intervening period, the entire process is "blocked" by the kernel and cannot run, which starves other user threads and fibers in the same process from executing.
According to this, your whole process might be blocked, and no thread will be scheduled while one thread is blocked in I/O. I think this is OS-specific and will not always hold.
To be fair, let's point out the benefit of Threads under CPython GIL compared to async approach:
it's easier first to write typical code that has one flow of events (no parallel execution) and then to run multiple copies of it in separate threads: it will keep each copy responsive, while the benefit of executing all I/O operations in parallel will be achieved automatically;
many time-proven libraries are sync and therefore easy to be included in the threaded version, and not in async one;
some sync libraries actually let GIL go at C level that allows parallel execution for tasks beyond I/O-bound ones: e.g. NumPy;
it's harder to write async code in general: the inclusion of a heavy CPU-bound section will make concurrent tasks not responsive, or one may forget to await the result and finish execution earlier.
So if there are no immediate plans to scale your services beyond ~100 concurrent connections it may be easier to start with a threaded version and then rewrite it... using some other more performant language like Go.
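For the point about sync libraries in async code, one common workaround is to push the blocking call onto a worker thread with loop.run_in_executor so the event loop stays responsive. A minimal sketch, where blocking_call is a hypothetical stand-in for a synchronous library function and the ticker exists only to show the loop is still running:

```python
import asyncio
import time


def blocking_call():
    # Stand-in for a synchronous library call that would stall the loop.
    time.sleep(0.2)
    return "done"


async def ticker(ticks, stop):
    # Keeps recording ticks for as long as the event loop is free to run it.
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main():
    ticks = []
    stop = asyncio.Event()
    tick_task = asyncio.ensure_future(ticker(ticks, stop))
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool executor.
    result = await loop.run_in_executor(None, blocking_call)
    stop.set()
    await tick_task
    return result, len(ticks)


result, tick_count = asyncio.run(main())
```

Calling blocking_call() directly inside main would freeze the ticker for the whole 0.2 seconds, which is the unresponsiveness the list above warns about.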
Async I/O means there is already a thread in the driver that does the job, so you are duplicating functionality and incurring some overhead. On the other hand, often it is not documented how exactly the driver thread behaves, and in complex scenarios, when you want to control timeout/cancellation/start/stop behaviour, synchronization with other threads, it makes sense to implement your own thread. It is also sometimes easier to reason in sync terms.
What are the modules used to write multi-threaded applications in Python? I'm aware of the basic concurrency mechanisms provided by the language and also of Stackless Python, but what are their respective strengths and weaknesses?
In order of increasing complexity:
Use the threading module
Pros:
It's really easy to run any function (any callable in fact) in its own thread.
Sharing data is, if not easy (locks are never easy :), at least simple.
Cons:
As mentioned by Juergen, Python threads cannot actually concurrently access state in the interpreter (there's one big lock, the infamous Global Interpreter Lock). What that means in practice is that threads are useful for I/O-bound tasks (networking, writing to disk, and so on), but not at all useful for doing concurrent computation.
Use the multiprocessing module
In the simple use case this looks exactly like using threading except each task is run in its own process not its own thread. (Almost literally: If you take Eli's example, and replace threading with multiprocessing, Thread, with Process, and Queue (the module) with multiprocessing.Queue, it should run just fine.)
Pros:
Actual concurrency for all tasks (no Global Interpreter Lock).
Scales to multiple processors, can even scale to multiple machines.
Cons:
Processes are slower than threads.
Data sharing between processes is trickier than with threads.
Memory is not implicitly shared. You either have to explicitly share it or you have to pickle variables and send them back and forth. This is safer, but harder. (If it matters, the Python developers increasingly seem to be pushing people in this direction.)
Use an event model, such as Twisted
Pros:
You get extremely fine control over priority, over what executes when.
Cons:
Even with a good library, asynchronous programming is usually harder than threaded programming, hard both in terms of understanding what's supposed to happen and in terms of debugging what actually is happening.
In all cases I'm assuming you already understand many of the issues involved with multitasking, specifically the tricky issue of how to share data between tasks. If for some reason you don't know when and how to use locks and conditions you have to start with those. Multitasking code is full of subtleties and gotchas, and it's really best to have a good understanding of concepts before you start.
You've already gotten a fair variety of answers, from "fake threads" all the way to external frameworks, but I've seen nobody mention Queue.Queue -- the "secret sauce" of CPython threading.
To expand: as long as you don't need to overlap pure-Python CPU-heavy processing (in which case you need multiprocessing -- but it comes with its own Queue implementation, too, so you can with some needed cautions apply the general advice I'm giving;-), Python's built-in threading will do... but it will do it much better if you use it advisedly, e.g., as follows.
"Forget" shared memory, supposedly the main plus of threading vs multiprocessing -- it doesn't work well, it doesn't scale well, never has, never will. Use shared memory only for data structures that are set up once before you spawn sub-threads and never changed afterwards -- for everything else, make a single thread responsible for that resource, and communicate with that thread via Queue.
Devote a specialized thread to every resource you'd normally think to protect by locks: a mutable data structure or cohesive group thereof, a connection to an external process (a DB, an XMLRPC server, etc), an external file, etc, etc. Get a small thread pool going for general purpose tasks that don't have or need a dedicated resource of that kind -- don't spawn threads as and when needed, or the thread-switching overhead will overwhelm you.
Communication between two threads is always via Queue.Queue -- a form of message passing, the only sane foundation for multiprocessing (besides transactional memory, which is promising but for which I know of no production-worthy implementations except in Haskell).
Each dedicated thread managing a single resource (or small cohesive set of resources) listens for requests on a specific Queue.Queue instance. Threads in a pool wait on a single shared Queue.Queue (Queue is solidly threadsafe and won't fail you in this).
Threads that just need to queue up a request on some queue (shared or dedicated) do so without waiting for results, and move on. Threads that eventually DO need a result or confirmation for a request queue a pair (request, receivingqueue) with an instance of Queue.Queue they just made, and eventually, when the response or confirmation is indispensable in order to proceed, they get (waiting) from their receivingqueue. Be sure you're ready to get error-responses as well as real responses or confirmations (Twisted's deferreds are great at organizing this kind of structured response, BTW!).
You can also use Queue to "park" instances of resources which can be used by any one thread but never be shared among multiple threads at one time (DB connections with some DBAPI components, cursors with others, etc) -- this lets you relax the dedicated-thread requirement in favor of more pooling (a pool thread that gets from the shared queue a request needing a queueable resource will get that resource from the appropriate queue, waiting if necessary, etc etc).
Twisted is actually a good way to organize this minuet (or square dance as the case may be), not just thanks to deferreds but because of its sound, solid, highly scalable base architecture: you may arrange things to use threads or subprocesses only when truly warranted, while doing most things normally considered thread-worthy in a single event-driven thread.
But, I realize Twisted is not for everybody -- the "dedicate or pool resources, use Queue up the wazoo, never do anything needing a Lock or, Guido forbid, any synchronization procedure even more advanced, such as semaphore or condition" approach can still be used even if you just can't wrap your head around async event-driven methodologies, and will still deliver more reliability and performance than any other widely-applicable threading approach I've ever stumbled upon.
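A minimal sketch of the "one thread owns the resource, everyone else talks to it through a queue" pattern, written for Python 3, where the module is named queue rather than Queue. The counter resource, the names counter_owner and client, and the None shutdown sentinel are all illustrative assumptions.

```python
import queue
import threading


def counter_owner(requests):
    # The only thread that ever touches `counts`, so no lock is needed.
    counts = {}
    while True:
        item = requests.get()
        if item is None:  # sentinel: shut down
            break
        key, reply_queue = item
        counts[key] = counts.get(key, 0) + 1
        if reply_queue is not None:
            reply_queue.put(counts[key])


requests = queue.Queue()
owner = threading.Thread(target=counter_owner, args=(requests,))
owner.start()


def client(key, times):
    # Fire-and-forget requests: queue them up and move on.
    for _ in range(times):
        requests.put((key, None))


clients = [threading.Thread(target=client, args=("hits", 100)) for _ in range(4)]
for thread in clients:
    thread.start()
for thread in clients:
    thread.join()

# A request that needs an answer carries its own reply queue.
reply = queue.Queue()
requests.put(("hits", reply))
total = reply.get()  # blocks until the owner thread responds

requests.put(None)
owner.join()
```

Four clients each queue 100 increments, and the final request with a reply queue adds one more and reports the running total.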
It depends on what you're trying to do, but I'm partial to just using the threading module in the standard library because it makes it really easy to take any function and just run it in a separate thread.
from threading import Thread

def f():
    ...

def g(arg1, arg2, arg3=None):
    ...

Thread(target=f).start()
Thread(target=g, args=[5, 6], kwargs={"arg3": 12}).start()
And so on. I often have a producer/consumer setup using a synchronized queue provided by the Queue module
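# Python 2: the module is named Queue (it is queue in Python 3)
from Queue import Queue
from threading import Thread

q = Queue()

def consumer():
    # Each consumer blocks on q.get() until a producer puts a row
    while True:
        print sum(q.get())

def producer(data_source):
    # Turn each line of whitespace-separated integers into a list
    for line in data_source:
        q.put(map(int, line.split()))

Thread(target=producer, args=[SOME_INPUT_FILE_OR_SOMETHING]).start()
for i in range(10):
    Thread(target=consumer).start()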
Kamaelia is a python framework for building applications with lots of communicating processes.
In Kamaelia you build systems from simple components that talk to each other. This speeds development, massively aids maintenance and also means you build naturally concurrent software. It's intended to be accessible by any developer, including novices. It also makes it fun :)
What sort of systems? Network servers, clients, desktop applications, pygame based games, transcode systems and pipelines, digital TV systems, spam eradicators, teaching tools, and a fair amount more :)
Here's a video from Pycon 2009. It starts by comparing Kamaelia to Twisted and Parallel Python and then gives a hands on demonstration of Kamaelia.
Easy Concurrency with Kamaelia - Part 1 (59:08)
Easy Concurrency with Kamaelia - Part 2 (18:15)
Regarding Kamaelia, the answer above doesn't really cover the benefit here. Kamaelia's approach provides a unified interface, which is pragmatic not perfect, for dealing with threads, generators & processes in a single system for concurrency.
Fundamentally it provides a metaphor of a running thing which has inboxes, and outboxes. You send messages to outboxes, and when wired together, messages flow from outboxes to inboxes. This metaphor/API remains the same whether you're using generators, threads or processes, or speaking to other systems.
The "not perfect" part is due to syntactic sugar not being added as yet for inboxes and outboxes (though this is under discussion) - there is a focus on safety/usability in the system.
Taking the producer consumer example using bare threading above, this becomes this in Kamaelia:
Pipeline(Producer(), Consumer() )
In this example it doesn't matter whether these are threaded components or otherwise; the only difference between them from a usage perspective is the base class for the component. Generator components communicate using lists, threaded components using Queue.Queues, and process-based components using os.pipes.
The reason behind this approach, though, is to make it harder to create hard-to-debug bugs. In threading, or any shared-memory concurrency, the number one problem you face is accidentally broken shared data updates. By using message passing you eliminate one class of bugs.
If you use bare threading and locks everywhere you're generally working on the assumption that when you write code that you won't make any mistakes. Whilst we all aspire to that, it's very rare that will happen. By wrapping up the locking behaviour in one place you simplify where things can go wrong. (Context handlers help, but don't help with accidental updates outside the context handler)
Obviously not every piece of code can be written as message passing and shared style which is why Kamaelia also has a simple software transactional memory (STM), which is a really neat idea with a nasty name - it's more like version control for variables - ie check out some variables, update them and commit back. If you get a clash you rinse and repeat.
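The "check out, update, commit, retry on clash" idea can be sketched in a few lines. This is a toy written for illustration and is not Kamaelia's actual STM API; VersionedStore, CommitClash and increment are hypothetical names.

```python
import threading


class CommitClash(Exception):
    """Raised when a variable changed between checkout and commit."""


class VersionedStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._slots = {}  # name -> (version, value)

    def checkout(self, name, default=None):
        with self._lock:
            return self._slots.get(name, (0, default))

    def commit(self, name, version, value):
        with self._lock:
            current_version, _ = self._slots.get(name, (0, None))
            if current_version != version:
                raise CommitClash(name)  # someone else committed first
            self._slots[name] = (version + 1, value)


def increment(store, name):
    # Rinse and repeat until the commit goes through without a clash.
    while True:
        version, value = store.checkout(name, 0)
        try:
            store.commit(name, version, value + 1)
            return
        except CommitClash:
            continue


store = VersionedStore()


def worker():
    for _ in range(200):
        increment(store, "hits")


workers = [threading.Thread(target=worker) for _ in range(4)]
for thread in workers:
    thread.start()
for thread in workers:
    thread.join()

final_version, final_value = store.checkout("hits")
```

The locking lives in one small class, and callers never hold a lock across their own logic; a lost race shows up as an exception and a retry rather than as silently corrupted data.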
Relevant links:
Europython 09 tutorial
Monthly releases
Mailing list
Examples
Example Apps
Reusable components (generator & thread)
Anyway, I hope that's a useful answer. FWIW, the core reason behind Kamaelia's setup is to make concurrency safer & easier to use in python systems, without the tail wagging the dog. (ie the big bucket of components
I can understand why the other Kamaelia answer was modded down, since even to me it looks more like an ad than an answer. As the author of Kamaelia it's nice to see enthusiasm though I hope this contains a bit more relevant content :-)
And that's my way of saying, please take the caveat that this answer is by definition biased, but for me, Kamaelia's aim is to try and wrap what is IMO best practice. I'd suggest trying a few systems out, and seeing which works for you. (also if this is inappropriate for stack overflow, sorry - I'm new to this forum :-)
I would use the Microthreads (Tasklets) of Stackless Python, if I had to use threads at all.
A whole online game (massively multiplayer) is built around Stackless and its multithreading principle -- since the original is just too slow for the massively multiplayer property of the game.
Threads in CPython are widely discouraged. One reason is the GIL -- a global interpreter lock -- that serializes threading for many parts of the execution. My experience is that it is really difficult to create fast applications this way. My example programs were all slower with threading -- with one core (but many waits for input should have made some performance boost possible).
With CPython, rather use separate processes if possible.
If you really want to get your hands dirty, you can try using generators to fake coroutines. It probably isn't the most efficient in terms of work involved, but coroutines do offer you very fine control of co-operative multitasking rather than pre-emptive multitasking you'll find elsewhere.
One advantage you'll find is that by and large, you will not need locks or mutexes when using co-operative multitasking, but the more important advantage for me was the nearly-zero switching speed between "threads". Of course, Stackless Python is said to be very good for that as well; and then there's Erlang, if it doesn't have to be Python.
Probably the biggest disadvantage in co-operative multitasking is the general lack of workaround for blocking I/O. And in the faked coroutines, you'll also encounter the issue that you can't switch "threads" from anything but the top level of the stack within a thread.
After you've made an even slightly complex application with fake coroutines, you'll really begin to appreciate the work that goes into process scheduling at the OS level.
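For a feel of what faking coroutines with generators looks like, here is a minimal round-robin scheduler sketch. The task function and the run_round_robin name are invented for the example; each yield is the point where a "thread" voluntarily hands control back.

```python
from collections import deque


def task(name, steps, log):
    for step in range(steps):
        log.append((name, step))
        yield  # co-operative switch point: give the scheduler control


def run_round_robin(tasks):
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)  # run the task until its next yield
        except StopIteration:
            continue  # task finished, drop it
        ready.append(current)  # otherwise put it at the back of the line


log = []
run_round_robin([task("a", 2, log), task("b", 3, log)])
```

Note that the yield has to appear in the generator function itself: a plain helper function called from task cannot suspend the task, which is the top-level-only limitation mentioned above.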