After doing gevent/eventlet monkey patching, can I assume that whenever a DB driver (e.g. redis-py, pymongo) does its I/O through the standard library (e.g. socket), it will be asynchronous?
So is eventlet's monkey patching enough to make, for example, redis-py non-blocking in an eventlet application?
From what I know, it should be enough as long as I'm careful about connection usage (e.g. using a separate connection for each greenlet). But I want to be sure.
If you know what else is required, or how to use DB drivers correctly with gevent/eventlet, please share that as well.
You can assume it will be magically patched if all of the following are true.
You're sure all of the I/O is built on top of standard Python sockets or other things that eventlet/gevent monkeypatches. No files, no native (C) socket objects, etc.
You pass aggressive=True to patch_all (or patch_select), or you're sure the library doesn't use select or anything similar.
The driver doesn't use any (implicit) internal threads. (If the driver does use threads internally, patch_thread may work, but it may not.)
If you're not sure, it's pretty easy to test—probably easier than reading through the code and trying to work it out. Have one greenlet that just does something like this:
while True:
    print("running")
    gevent.sleep(0.1)
Then have another that runs a slow query against the database. If it's monkeypatched, the looping greenlet will keep printing "running" 10 times/second; if not, the looping greenlet will not get to run while the program is blocked on the query.
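For example, here is a minimal sketch of that test, using redis-py and a local Redis server purely as placeholders for whatever driver and database you're checking:

import gevent
from gevent import monkey
monkey.patch_all()

import redis                      # stand-in for the driver under test

def heartbeat():
    while True:
        print("running")
        gevent.sleep(0.1)

def slow_query():
    # DEBUG SLEEP makes the Redis server block for 5 seconds, standing in for any slow query
    redis.Redis().execute_command("DEBUG", "SLEEP", "5")

gevent.joinall([gevent.spawn(heartbeat), gevent.spawn(slow_query)], timeout=6)

If the patching worked, "running" keeps printing while the query is in flight; if not, the output stalls for the full 5 seconds.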
So, what do you do if your driver blocks?
The easiest solution is to use a truly concurrent threadpool for DB queries. The idea is that you fire off each query (or batch) as a threadpool job and greenlet-block on the completion of that job. (For really simple cases, where you don't need many concurrent queries, you can just spawn a threading.Thread for each one instead, but usually you can't get away with that.)
If the driver does significant CPU work (e.g., you're using something that runs an in-process cache, or even an entire in-process DBMS like sqlite), you want this threadpool to actually be implemented on top of processes, because otherwise the GIL may prevent your greenlets from running. Otherwise (especially if you care about Windows), you probably want to use OS threads. (However, this means you can't patch_thread(); if you need to do that, use processes.)
If you're using eventlet, and you want to use threads, there's a built-in simple solution called tpool that may be sufficient. If you're using gevent, or you need to use processes, this won't work. Unfortunately, blocking a greenlet (without blocking the whole event loop) on a real threading object is a bit different between eventlet and gevent, and not documented very well, but the tpool source should give you the idea. Beyond that part, the rest is just using concurrent.futures (see futures on pypi if you need this in 2.x or 3.1) to execute the tasks on a ThreadPoolExecutor or ProcessPoolExecutor. (Or, if you prefer, you can go right to threading or multiprocessing instead of using futures.)
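For the eventlet case, a minimal sketch of the tpool approach might look like this (the query function and its arguments are just placeholders):

from eventlet import tpool

def blocking_query(conn, sql):
    # placeholder for any driver call that blocks a real OS thread
    return conn.execute(sql)

def green_query(conn, sql):
    # runs blocking_query in eventlet's thread pool; only the calling greenlet waits
    return tpool.execute(blocking_query, conn, sql)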
Can you explain why I should use OS threads on Windows?
The quick summary is: If you stick to threads, you can pretty much just write cross-platform code, but if you go to processes, you're effectively writing code for two different platforms.
First, read the Programming guidelines for the multiprocessing module (both the "All platforms" section and the "Windows" section). Fortunately, a DB wrapper shouldn't run into most of this. You only need to deal with processes via the ProcessPoolExecutor. And, whether you wrap things up at the cursor-op level or the query level, all your arguments and return values are going to be simple types that can be pickled. Still, it's something you have to be careful about, which otherwise wouldn't be an issue.
Meanwhile, Windows has very low overhead for its intra-process synchronization objects, but very high overhead for its inter-process ones. (It also has very fast thread creation and very slow process creation, but that's not important if you're using a pool.) So, how do you deal with that? I had a lot of fun creating OS threads to wait on the cross-process sync objects and signal the greenlets, but your definition of fun may vary.
Finally, tpool can be adapted trivially to a ppool for Unix, but it takes more work on Windows (and you'll have to understand Windows to do that work).
abarnert's answer is correct and very comprehensive. I just want to add that there is no "aggressive" patching in eventlet; that is probably a gevent feature. Also, a library using select is not a problem, because eventlet can monkey patch that too.
Indeed, in most cases eventlet.monkey_patch() is all you need. Of course, it must be done before creating any sockets.
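For example, a minimal sketch with redis-py (redis here is just the driver from the question):

import eventlet
eventlet.monkey_patch()          # call this first, before any sockets exist

import redis                     # redis-py now talks over green sockets
r = redis.Redis()
print(r.ping())                  # blocks only the calling greenlet, not the whole process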
If you still have any issues, feel free to open an issue or write to the eventlet mailing list or G+ community. All relevant links can be found at http://eventlet.net/
Related
What I want is in the title. The background is that I have thousands of requests to send to a very slow RESTful interface, in a program where no third-party packages may be imported except requests.
The speed of multithreading and multiprocessing is limited by the GIL and by the 4-core machine the program will run on.
I know you can implement a limited form of coroutines in Python 2.7 with generators and the yield keyword, but how can I do thousands of requests with that limited coroutine ability?
Example
url_list = ["https://www.example.com/rest?id={}".format(num) for num in range(10000)]
results = request_all(url_list) # do asynchronously
First, you're starting from an incorrect premise.
The speed of multiprocessing is not limited by the GIL at all.
The speed of multiprocessing is only limited by the number of cores for CPU-bound work, which yours is not. And async doesn't work at all for CPU-bound work, so multiprocessing would be 4x better than async, not worse.
The speed of multithreading is only limited by the GIL for CPU-bound code, which, again, yours is not.
The speed of multithreading is barely affected by the number of cores. If your code is CPU-bound, the threads mostly end up serialized on a single core. But again, async is even worse here, not better.
The reason people use async is not that it solves any of these problems; in fact, it only makes them worse. The main advantage is that if you have a ton of workers that are doing almost no work, you can schedule a ton of waiting-around coroutines more cheaply than a ton of waiting-around threads or processes. The secondary advantage is that you can tie the selector loop to the scheduler loop and eliminate a bit of overhead coordinating them.
Second, you can't use requests with asyncio in the first place. It expects to be able to block the whole thread on socket reads. There was a project to rewrite it around an asyncio-based transport adapter, but it was abandoned unfinished.
The usual way around that is to use it in threads, e.g., with run_in_executor. But if the only thing you're doing is requests, building an event loop just to dispatch things to a thread pool executor is silly; just use the executor directly.
Third, I doubt you actually need to have thousands of requests running in parallel. Although of course the details depend on your service or your network or whatever the bottleneck is, it's almost always more efficient to have a thread pool that can run, say, 12 or 64 requests in parallel, with the other thousands queued up behind them.
Handling thousands of concurrent connections (and therefore workers) is usually something you only have to do on a server. Occasionally you have to do it on a client that's aggregating data from a huge number of different services. But if you're just hitting a single service, there's almost never any benefit to that much concurrency.
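For example, a minimal sketch of that executor approach (16 workers is an illustrative guess, and url_list is the list from the question):

import requests
from concurrent.futures import ThreadPoolExecutor   # on Python 2.7 this needs the "futures" backport from PyPI

def fetch(url):
    # placeholder worker: one blocking GET per URL
    return requests.get(url, timeout=30).text

def request_all(url_list):
    # 16 concurrent requests; the remaining thousands of URLs wait in the executor's queue
    with ThreadPoolExecutor(max_workers=16) as executor:
        return list(executor.map(fetch, url_list))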
Fourth, if you really do want a coroutine-based event loop in Python 2, by far the easiest way is to use gevent or greenlets or another such library.
Yes, they give you an event loop hidden under the covers where you can't see it, and "magic" coroutines where the yielding happens inside methods like socket.send and Thread.join instead of being explicitly visible with await or yield from, but the plus side is that they already work—and, in fact, the magic means they work with requests, which anything you build will not.
Of course you don't want to use any third-party libraries. Building something just like greenlets yourself on top of Stackless or PyPy is pretty easy; building it for CPython is a lot more work. And then you still have to do all the monkeypatching that gevent does to make libraries like sockets work like magic, or rewrite requests around explicit greenlets.
Anyway, if you really want to build an event loop on top of just plain yield, you can.
In Greg Ewing's original papers on why Python needed to add yield from, he included examples of a coroutine event loop with just yield, and a better one that uses an explicit trampoline to yield to—with a simple networking-driven example. He even wrote an automatic translator from code for the (at the time not implemented) yield from to Python 3.1.
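To make the plain-yield idea concrete, a toy round-robin scheduler over generators might look something like this (a bare-bones sketch with no I/O and no trampoline, not Ewing's code):

from collections import deque

def run_all(tasks):
    # each task is a generator that yields whenever it is willing to let another task run
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)              # resume the task until its next yield
        except StopIteration:
            continue                # task finished; drop it
        ready.append(task)          # still running; put it at the back of the queue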
Notice that having to bounce every yield off a trampoline makes things a lot less efficient. There's really no way around that. That's a good part of the reason we have yield from in the language.
But that's just the scheduler part with a bit of toy networking. You still need to integrate a selectors loop and then write coroutines to replace all of the socket functions you need. Consider how long asyncio took Guido to build when he knew Python inside and out and had yield from to work with… but then you can steal most of his design, so it won't be quite that bad. Still, it's going to be a lot of work.
(Oh, and you don't have selectors in Python 2. If you don't care about Windows, it's pretty easy to build the part you need out of the select module, but if you do care about Windows, it's a lot more work.)
And remember, because requests won't work with your code, you're also going to need to reimplement most of it as well. Or, maybe better, port aiohttp from asyncio to your framework.
And, in the end, I'd be willing to give you odds that the result is not going to be anywhere near as efficient as aiohttp in Python 3, or requests on top of gevent in Python 2, or just requests on top of a thread pool in either.
And, of course, you'll be the only person in the world using it. asyncio had hundreds of bugs to fix between tulip and going into the stdlib, which were only detected because dozens of early adopters (including people who are serious experts on this kind of thing) were hammering on it. And requests, aiohttp, gevent, etc. are all used by thousands of servers handling zillions of dollars worth of business, so you benefit from all of those people finding bugs and needing fixes. Whatever you build almost certainly won't be nearly as reliable as any of those solutions.
All this for something you're probably going to need to port to Python 3 anyway, since Python 2 hits end-of-life in less than a year and a half, and distros and third-party libraries are already disengaging from it. For a relevant example, requests 3.0 is going to require at least Python 3.5; if you want to stick with Python 2.7, you'll be stuck with requests 2.1 forever.
I have a Python library that performs asynchronous network I/O via multicast, which may garner replies from other services. It hides the dirty work by returning a Future which will capture a reply. I am integrating this library into an existing gevent application. The call pattern is as simple as:
future = service.broadcast()
# next call blocks the current thread
reply = future.result(some_timeout)
Under the hood, concurrent.futures.Future.result() uses threading.Condition.wait().
With a monkey-patched threading module, this seems fine and safe, and non-blocking with greenlets.
Is there any reason to be worried here or when mixing gevent and concurrent.futures?
Well, as far as I can tell, futures isn't documented to work on top of threading.Condition, and gevent isn't documented to be able to patch futures safely. So, in theory, someone could write a Python implementation that would break gevent.
But in practice? It's hard to imagine what such an implementation would look like. You obviously need some kind of sync objects to make a Future work. Sure, you could use an Event, Lock, and RLock instead of a Condition, but that won't cause a problem for gevent. The only way an implementation could plausibly break things would be to go directly to the pthreads/Win32/Java/.NET/whatever sync objects instead of using the wrappers in threading.
How would you deal with that if it happened? Well, futures is implemented in pure Python, and it's pretty simple Python, and there's a fully functional backport that works with 2.5+/3.2+. So, you'd just have to grab that backport and swap out concurrent.futures for futures.
So, if you're doing something wacky like deploying a server that's going to run for 5 years unattended and may have its Python repeatedly upgraded underneath it, maybe I'd install the backport now and use that instead.
Otherwise, I'd just document the assumption (and the workaround in case it's ever broken) in the appropriate place, and then just use the stdlib module.
I'm working on a simple experiment in Python. I have a "master" process, in charge of all the others, and every single process has a connection via Unix socket to the master process. I would like the master process to be able to monitor all of the sockets for a response, but there could theoretically be almost a hundred of them. How would threads impact the memory and performance of the application? What would be the best solution? Thanks a lot!
One hundred simultaneous threads might be pushing the reasonable limits of threading. If you find this is the cleanest way to organize your code, I'd say give it a try, but threading really doesn't scale very far.
What works better is to use a technique like select to wait until one of the sockets is readable, writable, or has an error to report. This mechanism lets you go to sleep until something interesting happens, handle as many sockets as have content to handle, and then go back to sleep again, all in a single thread of execution. Removing the multi-threading can often reduce the chances for errors, and this style of programming should get you into the hundreds of connections with no trouble. (If you want to go beyond about 100, I'd use the poll functionality instead of select; constantly rebuilding the list of interesting file descriptors takes time that poll does not require.)
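A minimal single-threaded sketch of that pattern (worker_socks would be your list of connected Unix-domain sockets, and handle_reply is whatever the master does with a message):

import select

def monitor(worker_socks, handle_reply):
    while worker_socks:
        readable, _, errored = select.select(worker_socks, [], worker_socks, 1.0)
        for sock in readable:
            data = sock.recv(4096)
            if data:
                handle_reply(sock, data)
            else:                               # empty read: the worker closed its end
                worker_socks.remove(sock)
                sock.close()
        for sock in errored:
            if sock in worker_socks:            # may already have been dropped above
                worker_socks.remove(sock)
                sock.close()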
Something to consider is the Python Twisted Framework. They've gone to some length to provide a consistent way to hook callbacks onto events for this exact sort of programming. (If you're familiar with node.js, it's a bit like that, but Python.) I must admit a slight aversion to Twisted -- I never got very far in their documentation without being utterly baffled -- but a lot of people made it further in the docs than I did. You might find it a better fit than I have.
The easiest way to conduct comparative tests of threads versus processes for socket handling is to use the SocketServer in Python's standard library. You can easily switch approaches (while keeping everything else the same) by inheriting from either ThreadingMixIn or ForkingMixIn. Here is a simple example to get you started.
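Something like this echo server, for instance (a sketch; swapping ThreadingMixIn for ForkingMixIn is the only change needed to compare the two models):

import SocketServer          # spelled "socketserver" in Python 3

class EchoHandler(SocketServer.StreamRequestHandler):
    def handle(self):
        # one thread (or forked child) runs this per connection
        for line in self.rfile:
            self.wfile.write(line)

class ThreadedEchoServer(SocketServer.ThreadingMixIn, SocketServer.TCPServer):
    allow_reuse_address = True

if __name__ == "__main__":
    ThreadedEchoServer(("localhost", 9999), EchoHandler).serve_forever()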
Another alternative is a select/poll approach using non-blocking sockets in a single process and a single thread.
If you're interested in software that is already fully developed and highly evolved, consider these high-performance Python based server packages:
The Twisted framework uses the async single process, single thread style.
The Tornado framework is similar (less evolved, less full featured, but easier to understand)
And Gunicorn, which is a high-performance forking server.
I'm confused about Twisted threading.
I've heard and read more than a few articles and books, and sat through a few presentations, on the subject of threading vs. processes in Python. It just seems to me that unless one is doing lots of I/O or wants to utilize shared memory across jobs, the right choice is to use multiprocessing.
However, from what I've seen so far, it seems like Twisted uses Threads (pThreads from the python threading module). And Twisted seems to perform really really well in processing lots of data.
I've got a fairly large number of processes that I'd like to distribute processing to using the MapReduce pattern in Python on a single node/server. They don't do any IO really, they just do a lot of processing.
Is the Twisted reactor the right tool for this job?
The short answer to your question: no, twisted threading is not the right solution for heavy processing.
If you have a lot of processing to do, twisted's threading will still be subject to the GIL (Global Interpreter Lock). Without going into a long in-depth explanation, the GIL is what allows only one thread at a time to execute Python code. What this means in effect is you will not be able to take advantage of multiple cores with a single multi-threaded twisted process. That said, some C modules (such as bits of SciPy) can release the GIL and run multi-threaded, though the associated Python code is still effectively single-threaded.
What twisted's threading is mainly useful for is using it along with blocking I/O based modules. A prime example of this is database APIs, because the db-api spec doesn't account for asynchronous use cases, and most database modules adhere to the spec. Thus, to use PostgreSQL for example from a twisted app, one has to either block or use something like twisted.enterprise.adbapi, which is a wrapper that uses twisted.internet.threads.deferToThread to allow a SQL query to execute while other stuff is going on. This can allow other Python code to run because the socket module (among most others involving operating system I/O) will release the GIL while in a system call.
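A sketch of that deferToThread pattern, where the connection object, query, and the handle_rows/log_failure callbacks are all placeholders:

from twisted.internet import reactor, threads

def blocking_query(conn, sql):
    # stand-in for any blocking DB-API call
    return conn.execute(sql).fetchall()

def run(conn, handle_rows, log_failure):
    d = threads.deferToThread(blocking_query, conn, "SELECT * FROM things")
    d.addCallbacks(handle_rows, log_failure)    # fires later, without blocking the reactor
    reactor.run()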
That said, you can use twisted to write a network application talking to many twisted (or non-twisted, if you'd like) workers. Each worker could then work on little bits of work, and you would not be restricted by the GIL, because each worker would be its own completely isolated process. The master process can then make use of many of twisted's asynchronous primitives. For example, you could use a DeferredList to wait on a number of results coming from any number of workers, and then run a response handler when all of the Deferreds complete (thus allowing you to do your map call). If you want to go down this route, I recommend looking at twisted.protocols.amp, which is their Asynchronous Message Protocol, and can be used very trivially to implement a network-based RPC or map-reduce.
The downside of running many disparate processes versus something like multiprocessing is that:
1) you lose out on simple process management, and
2) the subprocesses can't share memory as they would if they were forked on a Unix system.
Though for modern systems, 2) is rarely a problem unless you are running hundreds of subprocesses. And problem 1) can be solved by using a process management system like supervisord.
Edit: For more on Python and the GIL, you should watch Dave Beazley's talks on the subject (website, video, slides).
I was recently reading this document which lists a number of strategies that could be employed to implement a socket server. Namely, they are:
Serve many clients with each thread, and use nonblocking I/O and level-triggered readiness notification
Serve many clients with each thread, and use nonblocking I/O and readiness change notification
Serve many clients with each server thread, and use asynchronous I/O
Serve one client with each server thread, and use blocking I/O
Build the server code into the kernel
Now, I would appreciate a hint on which should be used in CPython, which we know has some good points, and some bad points. I am mostly interested in performance under high concurrency, and yes a number of the current implementations are too slow.
So if I may start with the easy one, "5" is out, as I am not going to be hacking anything into the kernel.
"4" Also looks like it must be out because of the GIL. Of course, you could use multiprocessing in place of threads here, and that does give a significant boost. Blocking IO also has the advantage of being easier to understand.
And here my knowledge wanes a bit:
"1" is traditional select or poll which could be trivially combined with multiprocessing.
"2" is the readiness-change notification, used by the newer epoll and kqueue
"3" I am not sure there are any kernel implementations for this that have Python wrappers.
So, in Python we have a bag of great tools like Twisted. Perhaps they are a better approach, though I have benchmarked Twisted and found it too slow on a multiple processor machine. Perhaps having 4 twisteds with a load balancer might do it, I don't know. Any advice would be appreciated.
asyncore is basically "1" - It uses select internally, and you just have one thread handling all requests. According to the docs it can also use poll. (EDIT: Removed Twisted reference, I thought it used asyncore, but I was wrong).
"2" might be implemented with python-epoll (Just googled it - never seen it before).
EDIT: (from the comments) In Python 2.6 the select module has epoll, kqueue and kevent built in (on supported platforms), so you don't need any external libraries to do edge-triggered serving.
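For illustration, a minimal level-triggered echo server on the stdlib epoll wrapper might look like this (a sketch, Linux only, not production code):

import select
import socket

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 8000))
server.listen(128)
server.setblocking(False)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)
conns = {}

while True:
    for fd, events in ep.poll(1):
        if fd == server.fileno():                     # new connection
            conn, _ = server.accept()
            conn.setblocking(False)
            ep.register(conn.fileno(), select.EPOLLIN)
            conns[conn.fileno()] = conn
        elif events & (select.EPOLLHUP | select.EPOLLERR):
            ep.unregister(fd)
            conns.pop(fd).close()
        elif events & select.EPOLLIN:
            data = conns[fd].recv(4096)
            if data:
                conns[fd].sendall(data)               # naive echo; real code would buffer writes
            else:                                     # client closed the connection
                ep.unregister(fd)
                conns.pop(fd).close()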
Don't rule out "4", as the GIL will be dropped when a thread is actually doing or waiting for IO-operations (most of the time probably). It doesn't make sense if you've got huge numbers of connections of course. If you've got lots of processing to do, then python may not make sense with any of these schemes.
For flexibility maybe look at Twisted?
In practice your problem boils down to how much processing you are going to do for requests. If you've got a lot of processing, and need to take advantage of multi-core parallel operation, then you'll probably need multiple processes. On the other hand if you just need to listen on lots of connections, then select or epoll, with a small number of threads should work.
How about "fork"? (I assume that is what the ForkingMixIn does) If the requests are handled in a "shared nothing" (other than DB or file system) architecture, fork() starts pretty quickly on most *nixes, and you don't have to worry about all the silly bugs and complications from threading.
Threads are a design illness forced on us by OSes with too-heavy-weight processes, IMHO. Cloning a page table with copy-on-write attributes seems a small price, especially if you are running an interpreter anyway.
Sorry I can't be more specific, but I'm more of a Perl-transitioning-to-Ruby programmer (when I'm not slaving over masses of Java at work)
Update: I finally did some timings on thread vs fork in my "spare time". Check it out:
http://roboprogs.com/devel/2009.04.html
Expanded:
http://roboprogs.com/devel/2009.12.html
One solution is gevent. Gevent marries libevent-based event polling with lightweight cooperative task switching implemented by greenlet.
What you get is all the performance and scalability of an event system with the elegance and straightforward model of blocking I/O programming.
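A minimal sketch of what that looks like in practice (the URL is just a placeholder):

import gevent
from gevent import monkey
monkey.patch_all()               # patch the stdlib before any sockets exist

import urllib2                   # Python 2 stdlib; its sockets are now cooperative

def fetch(url):
    return urllib2.urlopen(url).read()

jobs = [gevent.spawn(fetch, "http://example.com/") for _ in range(10)]
gevent.joinall(jobs, timeout=10)
print([len(job.value) for job in jobs if job.value is not None])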
(I don't know what the SO convention on answering really old questions is, but I decided I'd still add my 2 cents.)
Can I suggest additional links?
cogen is a cross-platform library for network-oriented, coroutine-based programming using the enhanced generators from Python 2.5. On the main page of the cogen project there are links to several projects with a similar purpose.
I like Douglas' answer, but as an aside...
You could use a centralized dispatch thread/process that listens for readiness notifications using select and delegates to a pool of worker threads/processes to help accomplish your parallelism goals.
As Douglas mentioned, however, the GIL won't be held during most lengthy I/O operations (since no Python-API things are happening), so if it's response latency you're concerned about, you can try moving the critical portions of your code to the CPython C API.
http://docs.python.org/library/socketserver.html#asynchronous-mixins
As for multi-processor (multi-core) machines: with CPython, due to the GIL, you'll need at least one process per core to scale. As you say that you need CPython, you might try to benchmark that with ForkingMixIn. With Linux 2.6 it might give some interesting results.
Another way is to use Stackless Python. That's how EVE solved it. But I understand that it's not always possible.