Python Twisted and database connections - python

Our projects at work include synchronous applications (short lived) and asynchronous Twisted applications (long lived). We're re-factoring our database and are going to build an API module to decouple all of the SQL in that module. I'd like to create that API so both synchronous and asynchronous applications can use it. For the synchronous applications I'd like calls to the database API to just return data (blocking) just like using MySQLdb, but for the asynchronous applications I'd like calls to the same API functions/methods to be non-blocking, probably returning a deferred. Anyone have any hints, suggestions or help they might offer me to do this?
Thanks in advance,
Doug

twisted.enterprise.adbapi seems the way to go -- do you think it fails to match your requirements, and if so, can you please explain why?

Within Twisted, you basically want a wrapper around a function which returns a Deferred (such as the Twisted DB layer), waits for it's results, and returns them. However, you can't busy-wait, since that's using up your reactor cycles, and checking for a task to complete using the Twisted non-blocking wait is probably inefficient.
Will inlineCallbacks or deferredGenerator solve your problem? They require a modern Twisted. See the twistedmatrix docs.
def thingummy():
thing = yield makeSomeRequestResultingInDeferred()
print thing #the result! hoorj!
thingummy = inlineCallbacks(thingummy)
Another option would be to have two methods which execute the same SQL template, one which uses runInteraction, which blocks, and one which uses runQuery, which returns a Deferred, but that would involve more code paths which do the same thing.

Have you considered borrowing a page from continuation-passing style? Stackless Python supports continuations directly, if you're using it, and the approach appears to have gained some interest already.

All the database libraries I've seen seem to be stubbornly synchronous.
It appears that Twisted.enterprise.abapi solves this problem by using a threads to manage a connection pool and wrapping the underlying database libraries. This is obviously not ideal, but I suppose it would work, but I haven't actually tried it myself.
Ideally there would be some way to have sqlalchemy and twisted integrated. I found this project, nadbapi, which claims to do it, but it looks like it hasn't been updated since 2007.

Related

How can I do asynchronous programming but hide it in Python?

Am just getting my head round Twisted, threading, stackless, etc. etc. and would appreciate some high level advice.
Suppose I have remote clients 1 and 2, connected via a websocket running in a page on their browsers. Here is the ideal goal:
for cl in (1,2):
guess[cl] = show(cl, choice("Pick a number:", range(1,11)))
checkpoint()
if guess[1] == guess[2]:
show((1,2), display("You picked the same number!"))
Ignoring the mechanics of show, choice and display, the point is that I want the show call to be asynchronous. Each client gets shown the choice. The code waits at checkpoint() for all the threads (or whatever) to rejoin.
I would be interested in hearing answers even if they involve hairy things like rewriting the source code. I'd also be interested in less hairy answers which involve compromising a bit on the syntax.
The most simple solution code wise is to use a framework like Autobahn which support remote procdure calls (RPC). That means you can call some JavaScript in the browser and wait for the result.
If you want to call two clients, you will have to use threads.
You can also do it manually. The approach works along these lines:
You need to pass a callback to show().
show() needs to register the callback with some kind of string ID in a global dict
show() must send this ID to the client
When the client sends the answer, it must include the ID.
The Python handler can then remove the callback from the global dict and invoke it with the answer
The callback needs to collect the results.
When it has enough results (two in your case), it must send status updates to the client.
You can simplify the code using yield but the theory behind is a bit complex to understand: What does the "yield" keyword do in Python? and coroutines
In Python, the most widely-used approach to async/event-based network programming that hides that model from the programmer is probably gevent.
Beware: this kind of trickery works by making tasks yield control implicitly, which encourages the same sorts of surprising bugs that tend to appear when OS threads are involved. Local reasoning about such problems is significantly harder than with explicit yielding, and the convenience of avoiding callbacks might not be worth the trouble introduced by the inherent pitfalls. Perhaps just as important to a library author like yourself: this approach is not pure Python, and would force dependencies and interpreter restrictions on the users of your library.
A lot of discussion about this topic sprouted up (especially between the gevent and twisted camps) while Guido was working on the asyncio library, which was called tulip at the time. He summarized the main issues here.

Using concurrent.futures.Future with greenlets/gevent

I have a python library that performs asynchronous network via multicast which may garner replies from other services. It hides the dirty work by returning a Future which will capture a reply. I am integrating this library into an existing gevent application. The call pattern is as simple as:
future = service.broadcast()
# next call blocks the current thread
reply = future.result(some_timeout)
Under the hood, concurrent.futures.Future.result() uses threading.Condition.wait().
With a monkey-patched threading module, this seems fine and safe, and non-blocking with greenlets.
Is there any reason to be worried here or when mixing gevent and concurrent.futures?
Well, as far as I can tell, futures isn't documented to work on top of threading.Condition, and gevent isn't documented to be able to patch futures safely. So, in theory, someone could write a Python implementation that would break gevent.
But in practice? It's hard to imagine what such an implementation would look like. You obviously need some kind of sync objects to make a Future work. Sure, you could use an Event, Lock, and Rlock instead of a Condition, but that won't cause a problem for gevent. The only way an implementation could plausibly break things would be to go directly to the pthreads/Win32/Java/.NET/whatever sync objects instead of using the wrappers in threading.
How would you deal with that if it happened? Well, futures is implemented in pure Python, and it's pretty simple Python, and there's a fully functional backport that works with 2.5+/3.2+. So, you'd just have to grab that backport and swap out concurrent.futures for futures.
So, if you're doing something wacky like deploying a server that's going to run for 5 years unattended and may have its Python repeatedly upgraded underneath it, maybe I'd install the backport now and use that instead.
Otherwise, I'd just document the assumption (and the workaround in case it's ever broken) in the appropriate place, and then just use the stdlib module.

How to simultaneously query two APIs in Python?

Using web.py, I'm building a website in which I display search results from two third party websites through their public API. Unfortunately, for the APIs to send back the result takes about 4 seconds. If I query the second API only after I received the answer from the first, this obviously takes me about 8 seconds, which is way too long. To bring this down I want to send the requests to the APIs simultaneously and simply continue as soon as I received an answer from both the APIs.
My problem is now: how to do this?
I've never worked with parallel computing, but I've heard of multiprocessing and threading. I don't really know what the difference or advantages of each are. I also know that for example C++ is able to do parallel computations. It could therefore also be an option to write the part that queries the APIs in C++ (I'm a beginner in C++, but I think I'd manage). Finally, there could of course be options that I am totally overlooking. Maybe web.py has some options to do this, or maybe there are Python modules which are specifically made to do this?
Since only researching and understanding all of these options would take me quite a lot of time, I thought I'd ask you guys here for some tips.
So which one do you think I should go for? And most importantly: why? All tips are welcome!
You want an asynchronous HTTP request library. Examples of this would be gevent, or grequests.
Alternatively, you could use Python's built-in threading module to run synchronous requests in multiple threads.
Either way, no need to go to another language.

Gevent/Eventlet monkey patching for DB drivers

After doing Gevent/Eventlet monkey patching - can I assume that whenever DB driver (eg redis-py, pymongo) uses IO through standard library (eg socket) it will be asynchronous?
So using eventlets monkey patching is enough to make eg: redis-py non blocking in eventlet application?
From what I know it should be enough if I take care about connection usage (eg to use different connection for each greenlet). But I want to be sure.
If you known what else is required, or how to use DB drivers correctly with Gevent/Eventlet please type it also.
You can assume it will be magically patched if all of the following are true.
You're sure of the I/O is built on top of standard Python sockets or other things that eventlet/gevent monkeypatches. No files, no native (C) socket objects, etc.
You pass aggressive=True to patch_all (or patch_select), or you're sure the library doesn't use select or anything similar.
The driver doesn't use any (implicit) internal threads. (If the driver does use threads internally, patch_thread may work, but it may not.)
If you're not sure, it's pretty easy to test—probably easier than reading through the code and trying to work it out. Have one greenlet that just does something like this:
while True:
print("running")
gevent.sleep(0.1)
Then have another that runs a slow query against the database. If it's monkeypatched, the looping greenlet will keep printing "running" 10 times/second; if not, the looping greenlet will not get to run while the program is blocked on the query.
So, what do you do if your driver blocks?
The easiest solution is to use a truly concurrent threadpool for DB queries. The idea is that you fire off each query (or batch) as a threadpool job and greenlet-block your gevent on the completion of that job. (For really simple cases, where you don't need many concurrent queries, you can just spawn a threading.Thread for each one instead, but usually you can't get away with that.)
If the driver does significant CPU work (e.g., you're using something that runs an in-process cache, or even an entire in-process DBMS like sqlite), you want this threadpool to actually be implemented on top of processes, because otherwise the GIL may prevent your greenlets from running. Otherwise (especially if you care about Windows), you probably want to use OS threads. (However, this means you can't patch_threads(); if you need to do that, use processes.)
If you're using eventlet, and you want to use threads, there's a built-in simple solution called tpool that may be sufficient. If you're using gevent, or you need to use processes, this won't work. Unfortunately, blocking a greenlet (without blocking the whole event loop) on a real threading object is a bit different between eventlet and gevent, and not documented very well, but the tpool source should give you the idea. Beyond that part, the rest is just using concurrent.futures (see futures on pypi if you need this in 2.x or 3.1) to execute the tasks on a ThreadPoolExecutor or ProcessPoolExecutor. (Or, if you prefer, you can go right to threading or multiprocessing instead of using futures.)
Can you explain why I should use OS threads on Windows?
The quick summary is: If you stick to threads, you can pretty much just write cross-platform code, but if you go to processes, you're effectively writing code for two different platforms.
First, read the Programming guidelines for the multiprocessing module (both the "All platforms" section and the "Windows" section). Fortunately, a DB wrapper shouldn't run into most of this. You only need to deal with processes via the ProcessPoolExecutor. And, whether you wrap things up at the cursor-op level or the query level, all your arguments and return values are going to be simple types that can be pickled. Still, it's something you have to be careful about, which otherwise wouldn't be an issue.
Meanwhile, Windows has very low overhead for its intra-process synchronization objects, but very high overhead for its inter-process ones. (It also has very fast thread creation and very slow process creation, but that's not important if you're using a pool.) So, how do you deal with that? I had a lot of fun creating OS threads to wait on the cross-process sync objects and signal the greenlets, but your definition of fun may vary.
Finally, tpool can be adapted trivially to a ppool for Unix, but it takes more work on Windows (and you'll have to understand Windows to do that work).
abarnert's answer is correct and very comprehensive. I just want to add that there is no "aggressive" patching in eventlet, probably gevent feature. Also if library uses select that is not a problem, because eventlet can monkey patch that too.
Indeed, in most cases eventlet.monkey_patch() is all you need. Of course, it must be done before creating any sockets.
If you still have any issues, feel free to open issue or write to eventlet mailing list or G+ community. All relevant links can be found at http://eventlet.net/

Threads vs Asynchronous Networking (Twisted) Python

I am writing an implementation of a NAT. My algorithm is as follows:
Packet comes in
Check against lookup table if external, add to lookup table if internal
Swap the source address and send the packet on its way
I have been reading about Twisted. I was curious if Twisted takes advantage of multicore CPUs? Assume the system has thousands of users and one packet comes right after the other. With twisted can the lookup table operations be taking place at the same time on each core. I hear with threads the GIL will not allow this anyway. Perhaps I could benifit from multiprocessing>
Nginix is asynchronous and happily serves thousands of users at the same time.
Using threads with twisted is discouraged. It has very good performance when used asynchronously, but the code you write for the request handlers must not block. So if your handler is a pretty big piece of code, break it up into smaller parts and utilize twisted's famous Deferreds to attach the other parts via callbacks. It certainly requires a somewhat different thinking than most programmers are used to, but it has benefits. If the code has blocking parts, like database operations, or accessing other resources via network to get some result, try finding asynchronous libraries for those tasks too, so you can use Deferreds in those cases also. If you can't use asynchronous libraries you may finally use the deferToThread function, which will run the function you want to call in a different thread and return a Deferred for it, and fire your callback when finished, but it's better to use that as a last resort, if nothing else can be done.
Here is the official tutorial for Deferreds:
http://twistedmatrix.com/documents/10.1.0/core/howto/deferredindepth.html
And another nice guide, which can help to get used to think in "async mode":
http://ezyang.com/twisted/defer2.html

Categories