Twisted with queue for CPU-bound tasks

Twisted with queue for CPU-bound tasks - python

I have an HTTP server which does some IO stuff, then does some CPU-bound stuff (PIL) and then replies with data (magnitude of megabyte or so).
(a) My first idea is something like this: a process for server and IO, based on Twisted, and several processes for PIL stuff, with queue.
If this architecture is reasonable, then there probably is a library which does exactly that: multiprocess queue for Twisted. However, I'm not really experienced in Twisted and know nothing of its community so the only thing I found is ampoule, for which I found neither docs nor description which would persuade me that it's the right tool for a job.
(b) Another idea is to just run several servers in several threads, with both IO and CPU stuff going in each on of them. This seems stupid because CPU stuff will block, but maybe I'm not really understanding it.
So, questions:
Is any of these architectures reasonable?
How would you implement it (using Twisted + ampoule or what?)
For (a), how would you send a huge pile of data from "worker" to the server thread? Or maybe I can tell the worker to write into the response directly somehow?
How many "workers" are reasonable?

Yes, those architectures are probably reasonable
I'd probably use ampoule too, but I don't know a lot about it right now either. This link is the closest thing to a good introduction that I remember seeing.
Since it sounds like your workers will always be on the same machine, you should be able to use shared memory.
That depends pretty wildly on the number of cores you have available and how intensive the work is (both with respect to CPU time and other resources, like memory and disk). It's probably hard to give any answer better than "benchmark with around 1-5 processes per core and see what's fastest".

Related

Is the Multi interface in curl faster or more efficient than using multiple easy interfaces?

I am making something which involves pycurl since pycurl depends on libcurl, I was reading through its documentation and came across this Multi interface where you could perform several transfers using a single multi object. I was wondering if this is faster/more memory efficient than having miltiple easy interfaces ? I was wondering what is the advantage with this approach since the site barely says,
"Enable multiple simultaneous transfers in the same thread without making it complicated for the application."

You are trying to optimize something that doesn't matter at all.
If you want to download 200 URLs as fast as possible, you are going to spend 99.99% of your time waiting for those 200 requests, limited by your network and/or the server(s) you're downloading from. The key to optimizing that is to make the right number of concurrent requests. Anything you can do to cut down the last 0.01% will have no visible effect on your program. (See Amdahl's Law.)
Different sources give different guidelines, but typically it's somewhere between 6-12 requests, no more than 2-4 to the same server. Since you're pulling them all from Google, I'd suggest starting 4 concurrent requests, then, if that's not fast enough, tweaking that number until you get the best results.
As for space, the cost of storing 200 pages is going to far outstrip the cost of a few dozen bytes here and there for overhead. Again, what you want to optimize is those 200 pages—by storing them to disk instead of in memory, by parsing them as they come in instead of downloading everything and then parsing everything, etc.
Anyway, instead of looking at what command-line tools you have and trying to find a library that's similar to those, look for libraries directly. pycurl can be useful in some cases, e.g., when you're trying to do something complicated and you already know how to do it with libcurl, but in general, it's going to be a lot easier to use either stdlib modules like urllib or third-party modules designed to be as simple as possible like requests.
The main example for ThreadPoolExecutor in the docs shows how to do exactly what you want to do. (If you're using Python 2.x, you'll have to pip install futures to get the backport for ThreadPoolExecutor, and use urllib2 instead of urllib.request, but otherwise the code will be identical.)

Having multiple easy interfaces running concurrently in the same thread means building your own reactor and driving curl at a lower level. That's painful in C, and just as painful in Python, which is why libcurl offers, and recommends, multi.
But that "in the same thread" is key here. You can also create a pool of threads and throw the easy instances into that. In C, that can still be painful; in Python, it's dead simple. In fact, the first example in the docs for using a concurrent.futures.ThreadPoolExecutor does something similar, but actually more complicated than you need here, and it's still just a few lines of code.
If you're comparing multi vs. easy with a manual reactor, the simplicity is the main benefit. In C, you could easily implement a more efficient reactor than the one libcurl uses; in Python, that may or may not be true. But in either language, the performance cost of switching among a handful of network requests is going to be so tiny compared to everything else you're doing—especially waiting for those network requests—that it's unlikely to ever matter.
If you're comparing multi vs. easy with a thread pool, then a reactor can definitely outperform threads (except on platforms where you can tie a thread pool to a proactor, as with Windows I/O completion ports), especially for huge numbers of concurrent connections. Also, each thread needs its own stack, which typically means about 1MB of memory pages allocated (although not all of them used), which can be a serious problem in 32-bit land for huge numbers of connections. That's why very few serious servers use threads for connections. But in a client making a handful of connections, none of this matters; again, the costs incurred by wasting 8 threads vs. using a reactor will be so small compared to the real costs of your program that they won't matter.

Multithreading With Very Large Number of Threads

I'm working on simulating a mesh network with a large number of nodes. The nodes pass data between different master nodes throughout the network.
Each master comes live once a second to receive the information, but the slave nodes don't know when the master is up or not, so when they have information to send, they try and do so every 5 ms for 1 second to make sure they can find the master.
Running this on a regular computer with 1600 nodes results in 1600 threads and the performance is extremely bad.
What is a good approach to handling the threading so each node acts as if it is running on its own thread?
In case it matters, I'm building the simulation in python 2.7, but I'm open to changing to something else if that makes sense.

For one, are you really using regular, default Python threads available in the default Python 2.7 interpreter (CPython), and is all of your code in Python? If so, you are probably not actually using multiple CPU cores because of the global interpreter lock CPython has (see https://wiki.python.org/moin/GlobalInterpreterLock). You could maybe try running your code under Jython, just to check if performance will be better.
You should probably rethink your application architecture and switch to manually scheduling events instead of using threads, or maybe try using something like greenlets (https://stackoverflow.com/a/15596277/1488821), but that would probably mean less precise timings because of lack of parallelism.

To me, 1600 threads sounds like a lot but not excessive given that it's a simulation. If this were a production application it would probably not be production-worthy.
A standard machine should have no trouble handling 1600 threads. As to the OS this article could provide you with some insights.
When it comes to your code a Python script is not a native application but an interpreted script and as such will require more CPU resources to execute.
I suggest you try implementing the simulation in C or C++ instead which will produce a native application which should execute more efficiently.

Do not use threading for that. If sticking to Python, let the nodes perform their actions one by one. If the performance you get doing so is OK, you will not have to use C/C++. If the actions each node perform are simple, that may work. Anyway, there is no reason to use threads in Python at all. Python threads are usable mostly for making blocking I/O not to block your program, not for multiple CPU kernels utilization.
If you want to really use parallel processing and to write your nodes as if they were really separated and exchanging only using messages, you may use Erlang (http://www.erlang.org/). It is a functional language very well suited for executing parallel processes and making them exchange messages. Erlang processes do not map to OS threads, and you may create thousands of them. However, Erlang is a purely functional language and may seem extremely strange if you have never used such languages. And it also is not very fast, so, like Python, it is unlikely to handle 1600 actions every 5ms unless the actions are rather simple.
Finally, if you do not get desired performance using Python or Erlang, you may move to C or C++. However, still do not use 1600 threads. In fact, using threads to gain performance is reasonable only if the number of threads does not dramatically exceed number of CPU kernels. A reactor pattern (with several reactor threads) is what you may need in that case (http://en.wikipedia.org/wiki/Reactor_pattern). There is an excellent implementation of the reactor pattern in boost.asio library. It is explained here: http://www.gamedev.net/blog/950/entry-2249317-a-guide-to-getting-started-with-boostasio/

Some random thoughts here:
I did rather well with several hundred threads working like this in Java; it can be done with the right language. (But I haven't tried this in Python.)
In any language, you could run the master node code in one thread; just have it loop continuously, running the code for each master in each cycle. You'll lose the benefits of multiple cores that way, though. On the other hand, you'll lose the problems of multithreading, too. (You could have, say, 4 such threads, utilizing the cores but getting the multithreading headaches back. It'll keep the thread-overhead down, too, but then there's blocking...)
One big problem I had was threads blocking each other. Enabling 100 threads to call the same method on the same object at the same time without waiting for each other requires a bit of thought and even research. I found my multithreading program at first often used only 25% of a 4-core CPU even when running flat out. This might be one reason you're running slow.
Don't have your slave nodes repeat sending data. The master nodes should come alive in response to data coming in, or have some way of storing it until they do come alive, or some combination.
It does pay to have more threads than cores. Once you have two threads, they can block each other (and will if they share any data). If you have code to run that won't block, you want to run it in its own thread so it won't be waiting for code that does block to unblock and finish. I found once I had a few threads, they started to multiply like crazy--hence my hundreds-of-threads program. Even when 100 threads block at one spot despite all my brilliance, there's plenty of other threads to keep the cores busy!

Python - Waiting for input from lots of sockets

I'm working on a simple experiment in Python. I have a "master" process, in charge of all the others, and every single process has a connection via unix socket to the master process. I would like to be able for the master process to be able to monitor all of the sockets for a response - but there could theoretically be almost a hundred of them. How would threads impact the memory and performance of the application? What would be the best solution? Thanks a lot!

One hundred simultaneous threads might be pushing the reasonable limits of threading. If you find this is the cleanest way to organize your code, I'd say give it a try, but threading really doesn't scale very far.
What works better is to use a technique like select to wait for one of the sockets to be readable / writable / or has an error to report. This mechanism lets you go to sleep until something interesting happens, handle as many sockets have content to handle, and then go back to sleep again, all in a single thread of execution. Removing the multi-threading can often reduce chances for errors, and this style of programming should get you into the hundreds of connections no trouble. (If you want to go beyond about 100, I'd use the poll functionality instead of select -- constantly rebuilding the list of interesting file descriptors takes time that poll does not require.)
Something to consider is the Python Twisted Framework. They've gone to some length to provide a consistent way to hook callbacks onto events for this exact sort of programming. (If you're familiar with node.js, it's a bit like that, but Python.) I must admit a slight aversion to Twisted -- I never got very far in their documentation without being utterly baffled -- but a lot of people made it further in the docs than I did. You might find it a better fit than I have.

The easiest way to conduct comparative tests of threads versus processes for socket handling is to use the SocketServer in Python's standard library. You can easily switch approaches (while keeping everything else the same) by inheriting from either ThreadingMixIn or ForkingMixIn. Here is a simple example to get you started.
Another alternative is a select/poll approach using non-blocking sockets in a single process and a single thread.
If you're interested in software that is already fully developed and highly evolved, consider these high-performance Python based server packages:
The Twisted framework uses the async single process, single thread style.
The Tornado framework is similar (less evolved, less full featured, but easier to understand)
And Gunicorn which is a high-performance forking server.

What would I use Stackless Python for?

There are many questions related to Stackless Python. But none answering this my question, I think (correct me if wrong - please!). There's some buzz about it all the time so I curious to know. What would I use Stackless for? How is it better than CPython?
Yes it has green threads (stackless) that allow quickly create many lightweight threads as long as no operations are blocking (something like Ruby's threads?). What is this great for? What other features it has I want to use over CPython?

It allows you to work with massive amounts of concurrency. Nobody sane would create one hundred thousand system threads, but you can do this using stackless.
This article tests doing just that, creating one hundred thousand tasklets in both Python and Google Go (a new programming language): http://dalkescientific.com/writings/diary/archive/2009/11/15/100000_tasklets.html
Surprisingly, even if Google Go is compiled to native code, and they tout their co-routines implementation, Python still wins.
Stackless would be good for implementing a map/reduce algorithm, where you can have a very large number of reducers depending on your input data.

Stackless Python's main benefit is the support for very lightweight coroutines. CPython doesn't support coroutines natively (although I expect someone to post a generator-based hack in the comments) so Stackless is a clear improvement on CPython when you have a problem that benefits from coroutines.
I think the main area where they excel are when you have many concurrent tasks running within your program. Examples might be game entities that run a looping script for their AI, or a web server that is servicing many clients with pages that are slow to create.
You still have many of the typical problems with concurrency correctness however regarding shared data, but the deterministic task switching makes it easier to write safe code since you know exactly where control will be transferred and therefore know the exact points at which the shared state must be up to date.

Thirler already mentioned that stackless was used in Eve Online. Keep in mind, that:
(..) stackless adds a further twist to this by allowing tasks to be separated into smaller tasks, Tasklets, which can then be split off the main program to execute on their own. This can be used for fire-and-forget tasks, like sending off an email, or dispatching an event, or for IO operations, e.g. sending and receiving network packets. One tasklet waits for a packet from the network while others continue running the game loop.
It is in some ways like threads, but is non-preemptive and explicitly scheduled, so there are fewer issues with synchronization. Also, switching between tasklets is much faster than thread switching, and you can have a huge number of active tasklets whereas the number of threads is severely limited by the computer hardware.
(got this citation from here)
At PyCon 2009 there was given a very interesting talk, describing why and how Stackless is used at CCP Games.
Also, there is a very good introductory material, which describes why stackless is a good solution for Your applications. (it may be somewhat old, but I think that it is worth reading).

EVEOnline is largely programmed in Stackless Python. They have several dev blogs on the use of it. It seems it is very useful for high performance computing.

While I've not used Stackless itself, I have used Greenlet for implementing highly-concurrent network applications. Some of the use cases Linden Lab has put it towards are: high-performance smart proxies, a fast system for distributing commands over huge numbers of machines, and an application that does a ton of database writes and reads (at a ratio of about 1:2, which is very write-heavy, so it's spending most of its time waiting for the database to return), and a web-crawler-type-thing for internal web data. Basically any app that's expecting to have to do a lot of network I/O will benefit from being able to create a bajillion lightweight threads. 10,000 connected clients doesn't seem like a huge deal to me.
Stackless or Greenlet aren't really a complete solution, though. They are very low-level and you're going to have to do a lot of monkeywork to build an application with them that uses them to their fullest. I know this because I maintain a library that provides a networking and scheduling layer on top of Greenlet, specifically because writing apps is so much easier with it. There are a bunch of these now; I maintain Eventlet, but also there is Concurrence, Chiral, and probably a few more that I don't know about.
If the sort of app you want to write sounds like what I wrote about, consider one of these libraries. The choice of Stackless vs Greenlet is somewhat less important than deciding what library best suits the needs of what you want to do.

The basic usefulness for green threads, the way I see it, is to implement a system in which you have a large amount of objects that do high latency operations. A concrete example would be communicating with other machines:
def Run():
# Do stuff
request_information() # This call might block
# Proceed doing more stuff
Threads let you write the above code naturally, but if the number of objects is large enough, threads just cannot perform adequately. But you can use green threads even for in really large amounts. The request_information() above could switch out to some scheduler where other work is waiting and return later. You get all the benefits of being able to call "blocking" functions as if they return immediately without using threads.
This is obviously very useful for any kind of distributed computing if you want to write code in a straightforward way.
It is also interesting for multiple cores to mitigate waiting for locks:
def Run():
# Do some calculations
green_lock(the_foo)
# Do some more calculations
The green_lock function would basically attempt to acquire the lock and just switch out to a main scheduler if it fails due to other cores using the object.
Again, green threads are being used to mitigate blocking, allowing code to be written naturally and still perform well.

Writing a socket-based server in Python, recommended strategies?

I was recently reading this document which lists a number of strategies that could be employed to implement a socket server. Namely, they are:
Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Serve many clients with each server thread, and use asynchronous I/O
serve one client with each server thread, and use blocking I/O
Build the server code into the kernel
Now, I would appreciate a hint on which should be used in CPython, which we know has some good points, and some bad points. I am mostly interested in performance under high concurrency, and yes a number of the current implementations are too slow.
So if I may start with the easy one, "5" is out, as I am not going to be hacking anything into the kernel.
"4" Also looks like it must be out because of the GIL. Of course, you could use multiprocessing in place of threads here, and that does give a significant boost. Blocking IO also has the advantage of being easier to understand.
And here my knowledge wanes a bit:
"1" is traditional select or poll which could be trivially combined with multiprocessing.
"2" is the readiness-change notification, used by the newer epoll and kqueue
"3" I am not sure there are any kernel implementations for this that have Python wrappers.
So, in Python we have a bag of great tools like Twisted. Perhaps they are a better approach, though I have benchmarked Twisted and found it too slow on a multiple processor machine. Perhaps having 4 twisteds with a load balancer might do it, I don't know. Any advice would be appreciated.

asyncore is basically "1" - It uses select internally, and you just have one thread handling all requests. According to the docs it can also use poll. (EDIT: Removed Twisted reference, I thought it used asyncore, but I was wrong).
"2" might be implemented with python-epoll (Just googled it - never seen it before).
EDIT: (from the comments) In python 2.6 the select module has epoll, kqueue and kevent build-in (on supported platforms). So you don't need any external libraries to do edge-triggered serving.
Don't rule out "4", as the GIL will be dropped when a thread is actually doing or waiting for IO-operations (most of the time probably). It doesn't make sense if you've got huge numbers of connections of course. If you've got lots of processing to do, then python may not make sense with any of these schemes.
For flexibility maybe look at Twisted?
In practice your problem boils down to how much processing you are going to do for requests. If you've got a lot of processing, and need to take advantage of multi-core parallel operation, then you'll probably need multiple processes. On the other hand if you just need to listen on lots of connections, then select or epoll, with a small number of threads should work.

How about "fork"? (I assume that is what the ForkingMixIn does) If the requests are handled in a "shared nothing" (other than DB or file system) architecture, fork() starts pretty quickly on most *nixes, and you don't have to worry about all the silly bugs and complications from threading.
Threads are a design illness forced on us by OSes with too-heavy-weight processes, IMHO. Cloning a page table with copy-on-write attributes seems a small price, especially if you are running an interpreter anyway.
Sorry I can't be more specific, but I'm more of a Perl-transitioning-to-Ruby programmer (when I'm not slaving over masses of Java at work)
Update: I finally did some timings on thread vs fork in my "spare time". Check it out:
http://roboprogs.com/devel/2009.04.html
Expanded:
http://roboprogs.com/devel/2009.12.html

One sollution is gevent. Gevent maries a libevent based event polling with lightweight cooperative task switching implemented by greenlet.
What you get is all the performance and scalability of an event system with the elegance and straightforward model of blocking IO programing.
(I don't know what the SO convention about answering to realy old questions is, but decided I'd still add my 2 cents)

Can I suggest additional links?
cogen is a crossplatform library for network oriented, coroutine based programming using the enhanced generators from python 2.5. On the main page of cogen project there're links to several projects with similar purpose.

I like Douglas' answer, but as an aside...
You could use a centralized dispatch thread/process that listens for readiness notifications using select and delegates to a pool of worker threads/processes to help accomplish your parallelism goals.
As Douglas mentioned, however, the GIL won't be held during most lengthy I/O operations (since no Python-API things are happening), so if it's response latency you're concerned about you can try moving the critical portions of your code to CPython API.

http://docs.python.org/library/socketserver.html#asynchronous-mixins
As for multi-processor (multi-core) machines. With CPython due to GIL you'll need at least one process per core, to scale. As you say that you need CPython, you might try to benchmark that with ForkingMixIn. With Linux 2.6 might give some interesting results.
Other way is to use Stackless Python. That's how EVE solved it. But I understand that it's not always possible.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.