Making file-handling code compatible with asyncio

Making file-handling code compatible with asyncio - python

The "traditional" way for a library to take file input is to do something like this:
def foo(file_obj):
data = file_obj.read()
# Do other things here
The client code is responsible for opening the file, seeking to the appropriate point (if necessary), and closing it. If the client wants to hand us a pipe or socket (or a StringIO, for that matter), they can do that and it Just Works.
But this isn't compatible with asyncio, which requires a syntax more like this:
def foo(file_obj):
data = yield from file_obj.read()
# Do other things here
Naturally, this syntax only works with asyncio objects; trying to use it with traditional file objects makes a mess. The reverse is also true.
Worse, it seems to me there's no way to wrap this yield from inside a traditional .read() method, because we need to yield all the way up to the event loop, not just at the site where the reading happens. The gevent library does do something like this, but I don't see how to adapt their greenlet code into generators.
If I'm writing a library that handles file input, how should I deal with this situation? Do I need two versions of the foo() function? I have many such functions; duplicating all of them is not scalable.
I could tell my client developers to use run_in_executor() or some equivalent, but that feels like working against asyncio instead of with it.

This is one of the downsides of explicit asynchronous frameworks. Unlike gevent, which can monkeypatch synchronous code to make it asynchronous without any code changes, you can't make synchronous code asyncio-compatible without rewriting it to use asyncio.coroutine and yield from (or at least asyncio.Futures and callbacks) all the way down.
There's no way that I know of to have the same function work properly in both an asyncio and normal, synchronous context; any code that's asyncio compatible is going to rely on the event loop to be running to drive the asynchronous portions, so it won't work in a normal context, and synchronous code is always going to end up blocking the event loop if its run in an asyncio context. This is why you generally see asyncio-specific (or at least asynchronous framework-specific) versions of libraries alongside synchronous versions. There's just no good way to present a unified API that works with both.

Having considered this some more, I've come to the conclusion that it is possible to do this, but it's not exactly beautiful.
Start with the traditional version of foo():
def foo(file_obj):
data = file_obj.read()
# Do other things here
We need to pass a file object which will behave "correctly" here. When the file object needs to do I/O, it should follow this process:
It creates a new event.
It creates a closure which, when invoked, performs the necessary I/O and then sets the event.
It hands the closure off to the event loop using call_soon_threadsafe().
It blocks on the event.
Here's some example code:
import asyncio, threading
# inside the file object class
def read(self):
event = threading.Event()
def closure():
# self.reader is an asyncio StreamReader or similar
self._tmp = yield from self.reader.read()
event.set()
asyncio.get_event_loop().call_soon_threadsafe(closure)
event.wait()
return self._tmp
We then arrange for foo(file_obj) to be run in an executor (e.g. using run_in_executor() as suggested in the OP).
The nice thing about this technique is that it works even if the author of foo() has no knowledge of asyncio. It also ensures I/O is served on the event loop, which could be desirable in certain circumstances.

Related

Enhance synchronous software API to allow asynchronous consuming

I have a module in Python 3.5+, providing a function that reads some data from a remote web API and returns it. The function relies on a wrapper function, which in turn uses the library requests to make the HTTP call.
Here it is (omitting on purpose all data validation logic and exception handling):
# module fetcher.py
import requests
# high-level module API
def read(some_params):
resp = requests.get('http://example.com', params=some_params)
return resp.json()
# wrapper for the actual remote API call
def get_data(some_params):
return call_web_api(some_params)
The module is currently imported and used by multiple clients.
As of today, the call to get_data is inherently synchronous: this means that whoever uses the function fetcher.read() knows that this is going to block the thread the function is executed on.
What I would love to achieve
I want to allow the fetcher.read() to be run both in a synchronous and an asynchronous fashion (eg. via an event loop).
This is in order to keep compatibility with existing callers consuming the module and at the same time to offer the possibility
to leverage non-blocking calls to allow a better throughput for callers that do want to call the function asynchronously.
This said, my legitimate wish is to modify the original code as little as possible...
As of today, the only thing I know is that Requests does not support asynchronous operations out of the box and therefore I should switch to an asyncio-friendly HTTP client (eg. aiohttp) in order to provide a non-blocking behaviour
How would the above code need to be modified to meet my desiderata? Which also leads me to ask: is there any best practice about enhancing sync software APIs to async contexts?

I want to allow the fetcher.read() to be run both in a synchronous and an asynchronous fashion (eg. via an event loop).
I don't think it is feasible for the same function to be usable via both sync and async API because the usage patterns are so different. Even if you could somehow make it work, it would be just too easy to mess things up, especially taking into account Python's dynamic-typing nature. (For example, users might accidentally forget to await their functions in async code, and the sync code would kick in, thus blocking their event loop.)
Instead, I would recommend the actual API to be async, and to create a trivial sync wrapper that just invokes the entry points using run_until_complete. Something along these lines:
# new module afetcher.py (or fetcher_async, or however you like it)
import aiohttp
# high-level module API
async def read(some_params):
async with aiohttp.request('GET', 'http://example.com', params=some_params) as resp:
return await resp.json()
# wrapper for the actual remote API call
async def get_data(some_params):
return call_web_api(some_params)
Yes, you switch from using requests to aiohttp, but the change is mechanical as the APIs are very similar in spirit.
The sync module would exist for backward compatibility and convenience, and would trivially wrap the async functionality:
# module fetcher.py
import afetcher
def read(some_params):
loop = asyncio.get_event_loop()
return loop.run_until_complete(afetcher.read(some_params))
...
This approach provides both sync and async version of the API, without code duplication because the sync version consists of trivial trampolines, whose definition can be further compressed using appropriate decorators.
The async fetcher module should have a nice short name, so that the users don't feel punished for using the async functionality. It should be easy to use, and it actually provides a lot of new features compared to the sync API, most notably low-overhead parallelization and reliable cancellation.
The route that is not recommended is using run_in_executor or similar thread-based tool to run requests in a thread pool under the hood. That implementation doesn't provide the actual benefits of using asyncio, but incurs all the costs. In that case it is better to continue providing the synchronous API and leave it to the users to use concurrent.futures or similar tools for parallel execution, where they're at least aware they're using threads.

Use twisted with custom main loop [duplicate]

I have an existing program that has its own main loop, and does computations based on input it receives - let's say from the user, to make it simple. I want to now do the computations remotely instead of locally, and I decided to implement the RPCs in Twisted.
Ideally I just want to change one of my functions, say doComputation(), to make a call to twisted to perform the RPC, get the results, and return. The rest of the program should stay the same. How can I accomplish this, though? Twisted hijacks the main loop when I call reactor.run(). I also read that you don't really have threads in twisted, that all the tasks run in sequence, so it seems I can't just create a LoopingCall and run my main loop that way.

You have a couple of different options, depending on what sort of main loop your existing program has.
If it's a mainloop from a GUI library, Twisted may already have support for it. In that case, you can just go ahead and use it.
You could also write your own reactor. There isn't a lot of great documentation for this, but you can look at the way that qtreactor implements a reactor plugin externally to Twisted.
You can also write a minimal reactor using threadedselectreactor. The documentation for this is also sparse, but the wxpython reactor is implemented using it. Personally I wouldn't recommend this approach as it is difficult to test and may result in confusing race conditions, but it does have the advantage of letting you leverage almost all of Twisted's default networking code with only a thin layer of wrapping.
If you are really sure that you don't want your doComputation to be asynchronous, and you want your program to block while waiting for Twisted to answer, do the following:
start Twisted in another thread before your main loop starts up, with something like twistedThread = Thread(target=reactor.run); twistedThread.start()
instantiate an object to do your RPC communication (let's say, RPCDoer) in your own main loop's thread, so that you have a reference to it. Make sure to actually kick off its Twisted logic with reactor.callFromThread so you don't need to wrap all of its Twisted API calls.
Implement RPCDoer.doRPC to return a Deferred, using only Twisted API calls (i.e. don't call into your existing application code, so you don't need to worry about thread safety for your application objects; pass doRPC all the information that it needs as arguments).
You can now implement doComputation like this:
def doComputation(self):
rpcResult = blockingCallFromThread(reactor, self.myRPCDoer.doRPC)
return self.computeSomethingFrom(rpcResult)
Remember to call reactor.callFromThread(reactor.stop); twistedThread.join() from your main-loop's shutdown procedure, otherwise you may see some confusing tracebacks or log messages on exit.
Finally, one option that you should really consider, especially in the long term: dump your existing main loop, and figure out a way to just use Twisted's. In my experience this is the right answer for 9 out of 10 askers of questions like this. I'm not saying that this is always the way to go - there are plenty of cases where you really need to keep your own main loop, or where it's just way too much effort to get rid of the existing loop. But, maintaining your own loop is work too. Keep in mind that the Twisted loop has been extensively tested by millions of users and used in a huge variety of environments. If your loop is also extremely mature, that may not be a big deal, but if you're writing a small, new program, the difference in reliability may be significant.

It seems like the correct and very simple answer here is a LoopingCall:
http://www.saltycrane.com/blog/2008/10/running-functions-periodically-using-twisteds-loopingcall/
from datetime import datetime
from twisted.internet.task import LoopingCall
from twisted.internet import reactor
def doComputation():
print "Custom fn run at", datetime.now()
lc = LoopingCall(doComputation)
lc.start(0.1) # run your own loop 10 times a second
# put your other twisted here
reactor.run()

Python Twisted Deferred : clarification needed

I am hoping for some clarification on the best way to deal with handling "first" deferreds , ie not just adding callbacks and errbacks to existing Twisted methods that return a deferred, but the best way of creating those original deferreds.
As a concrete example, here are 2 variations of the same method :
it just counts the number of lines in some rather big text files, and is used as the starting point for a chain of deferreds.
Method 1:
This one does not feel so good, as the deferred is fired directly by the reactor.callLater method.
def get_line_count(self):
deferred = defer.Deferred()
def count_lines(result):
try:
print_file = file(self.print_file_path, "r")
self.line_count = sum(1 for line in print_file)
print_file.close()
return self.line_count
except Exception as inst:
raise InvalidFile()
deferred.addCallback(count_lines)
reactor.callLater(1, deferred.callback, None)
return deferred
Method 2:
slightly better , as the deferred is actually fired when the result is available
def get_line_count(self):
deferred = defer.Deferred()
def count_lines():
try:
print_file = file(self.print_file_path, "r")
self.line_count = sum(1 for line in print_file)
print_file.close()
deferred.callback(self.line_count)
except Exception as inst:
deferred.errback(InvalidFile())
reactor.callLater(1, count_lines)
return deferred
Note: You could also point out that both of these are actually synchronous, and potentially blocking methods, (and I perhaps could use "MaybeDeferred"?).
But well, that that is actually one of the aspects I get confused by.
For Method 2, if the count_lines method is very slow (counting the lines in some huge files etc), will it potentially "block" the whole Twisted app ?
I read quite a lot of documentation on how callbacks and errbacks and the reactor behave together (callbacks need to be executed quickly, or return deferreds themselves etc), but in this case , I just don't see and would really appreciate some pointers/examples etc
Are there some articles/clear explanations that deal with the best approach to creating these "first" deferreds? I have read through these excellent articles , and they have helped a lot with some of the basic understanding, but I still feel like I am missing a piece.
For blocking code, would this be this a typicall case for DeferToThread or reactor.spawnprocess ?
I read through a lot of questions like this one and this article, but I still am not 100% sure on how to deal with potentially blocking code, mostly when dealing with file i/o
Sorry if any of this seems too basic , but I really want to get the hang of using Twisted more thoroughly. (It has been a really powerful tool for all the more network-oriented aspects).
Thank you for your time!

Yes, you've got it right: you need threads or separate processes to avoid blocking the Twisted event loop. Using Deferreds wont magically make your code non-blocking. For your questions:
Yes, you would block the event loop if count_lines is very slow. Deferring it to a thread would solve this.
I used Twisteds documentation to learn how Deferreds work, but I guess you've already been through that. The article on database support was information since it clearly says that this library is built using threads. This is how you bridge the synchronous–asynchronous gap.
If the call is truly blocking, then you need to DeferToThread. Python itself is kind-of single threaded, meaning that only one thread can execute Python byte code at a time. However, if the thread you create will block on I/O anyway, then this model works fine: the thread will release the global interpreter lock and so let other Python threads run, including the main thread with the Twisted event loop.
It can also be the case that you can use non-blocking I/O in your code. This can be done with the select module, for example. In that case, you don't need a separate thread. Twisted uses this technique internally and you don't have to think of this if you do normal network I/O. But if you're doing something exotic, then it's good to know how things are built so that you can do the same.
I hope that makes things a bit clearer!

Use my own main loop in twisted

I have an existing program that has its own main loop, and does computations based on input it receives - let's say from the user, to make it simple. I want to now do the computations remotely instead of locally, and I decided to implement the RPCs in Twisted.
Ideally I just want to change one of my functions, say doComputation(), to make a call to twisted to perform the RPC, get the results, and return. The rest of the program should stay the same. How can I accomplish this, though? Twisted hijacks the main loop when I call reactor.run(). I also read that you don't really have threads in twisted, that all the tasks run in sequence, so it seems I can't just create a LoopingCall and run my main loop that way.

You have a couple of different options, depending on what sort of main loop your existing program has.
If it's a mainloop from a GUI library, Twisted may already have support for it. In that case, you can just go ahead and use it.
You could also write your own reactor. There isn't a lot of great documentation for this, but you can look at the way that qtreactor implements a reactor plugin externally to Twisted.
You can also write a minimal reactor using threadedselectreactor. The documentation for this is also sparse, but the wxpython reactor is implemented using it. Personally I wouldn't recommend this approach as it is difficult to test and may result in confusing race conditions, but it does have the advantage of letting you leverage almost all of Twisted's default networking code with only a thin layer of wrapping.
If you are really sure that you don't want your doComputation to be asynchronous, and you want your program to block while waiting for Twisted to answer, do the following:
start Twisted in another thread before your main loop starts up, with something like twistedThread = Thread(target=reactor.run); twistedThread.start()
instantiate an object to do your RPC communication (let's say, RPCDoer) in your own main loop's thread, so that you have a reference to it. Make sure to actually kick off its Twisted logic with reactor.callFromThread so you don't need to wrap all of its Twisted API calls.
Implement RPCDoer.doRPC to return a Deferred, using only Twisted API calls (i.e. don't call into your existing application code, so you don't need to worry about thread safety for your application objects; pass doRPC all the information that it needs as arguments).
You can now implement doComputation like this:
def doComputation(self):
rpcResult = blockingCallFromThread(reactor, self.myRPCDoer.doRPC)
return self.computeSomethingFrom(rpcResult)
Remember to call reactor.callFromThread(reactor.stop); twistedThread.join() from your main-loop's shutdown procedure, otherwise you may see some confusing tracebacks or log messages on exit.
Finally, one option that you should really consider, especially in the long term: dump your existing main loop, and figure out a way to just use Twisted's. In my experience this is the right answer for 9 out of 10 askers of questions like this. I'm not saying that this is always the way to go - there are plenty of cases where you really need to keep your own main loop, or where it's just way too much effort to get rid of the existing loop. But, maintaining your own loop is work too. Keep in mind that the Twisted loop has been extensively tested by millions of users and used in a huge variety of environments. If your loop is also extremely mature, that may not be a big deal, but if you're writing a small, new program, the difference in reliability may be significant.

It seems like the correct and very simple answer here is a LoopingCall:
http://www.saltycrane.com/blog/2008/10/running-functions-periodically-using-twisteds-loopingcall/
from datetime import datetime
from twisted.internet.task import LoopingCall
from twisted.internet import reactor
def doComputation():
print "Custom fn run at", datetime.now()
lc = LoopingCall(doComputation)
lc.start(0.1) # run your own loop 10 times a second
# put your other twisted here
reactor.run()

Python: time a method call and stop it if time is exceeded

I need to dynamically load code (comes as source), run it and get the results. The code that I load always includes a run method, which returns the needed results. Everything looks ridiculously easy, as usual in Python, since I can do
exec(source) #source includes run() definition
result = run(params)
#do stuff with result
The only problem is, the run() method in the dynamically generated code can potentially not terminate, so I need to only run it for up to x seconds. I could spawn a new thread for this, and specify a time for .join() method, but then I cannot easily get the result out of it (or can I). Performance is also an issue to consider, since all of this is happening in a long while loop
Any suggestions on how to proceed?
Edit: to clear things up per dcrosta's request: the loaded code is not untrusted, but generated automatically on the machine. The purpose for this is genetic programming.

The only "really good" solutions -- imposing essentially no overhead -- are going to be based on SIGALRM, either directly or through a nice abstraction layer; but as already remarked Windows does not support this. Threads are no use, not because it's hard to get results out (that would be trivial, with a Queue!), but because forcibly terminating a runaway thread in a nice cross-platform way is unfeasible.
This leaves high-overhead multiprocessing as the only viable cross-platform solution. You'll want a process pool to reduce process-spawning overhead (since presumably the need to kill a runaway function is only occasional, most of the time you'll be able to reuse an existing process by sending it new functions to execute). Again, Queue (the multiprocessing kind) makes getting results back easy (albeit with a modicum more caution than for the threading case, since in the multiprocessing case deadlocks are possible).
If you don't need to strictly serialize the executions of your functions, but rather can arrange your architecture to try two or more of them in parallel, AND are running on a multi-core machine (or multiple machines on a fast LAN), then suddenly multiprocessing becomes a high-performance solution, easily paying back for the spawning and IPC overhead and more, exactly because you can exploit as many processors (or nodes in a cluster) as you can use.

You could use the multiprocessing library to run the code in a separate process, and call .join() on the process to wait for it to finish, with the timeout parameter set to whatever you want. The library provides several ways of getting data back from another process - using a Value object (seen in the Shared Memory example on that page) is probably sufficient. You can use the terminate() call on the process if you really need to, though it's not recommended.

You could also use Stackless Python, as it allows for cooperative scheduling of microthreads. Here you can specify a maximum number of instructions to execute before returning. Setting up the routines and getting the return value out is a little more tricky though.

I could spawn a new thread for this, and specify a time for .join() method, but then I cannot easily get the result out of it
If the timeout expires, that means the method didn't finish, so there's no result to get. If you have incremental results, you can store them somewhere and read them out however you like (keeping threadsafety in mind).
Using SIGALRM-based systems is dicey, because it can deliver async signals at any time, even during an except or finally handler where you're not expecting one. (Other languages deal with this better, unfortunately.) For example:
try:
# code
finally:
cleanup1()
cleanup2()
cleanup3()
A signal passed up via SIGALRM might happen during cleanup2(), which would cause cleanup3() to never be executed. Python simply does not have a way to terminate a running thread in a way that's both uncooperative and safe.
You should just have the code check the timeout on its own.
import threading
from datetime import datetime, timedelta
local = threading.local()
class ExecutionTimeout(Exception): pass
def start(max_duration = timedelta(seconds=1)):
local.start_time = datetime.now()
local.max_duration = max_duration
def check():
if datetime.now() - local.start_time > local.max_duration:
raise ExecutionTimeout()
def do_work():
start()
while True:
check()
# do stuff here
return 10
try:
print do_work()
except ExecutionTimeout:
print "Timed out"
(Of course, this belongs in a module, so the code would actually look like "timeout.start()"; "timeout.check()".)
If you're generating code dynamically, then generate a timeout.check() call at the start of each loop.

Consider using the stopit package that could be useful in some cases you need timeout control. Its doc emphasizes the limitations.
https://pypi.python.org/pypi/stopit

a quick google for "python timeout" reveals a TimeoutFunction class

Executing untrusted code is dangerous, and should usually be avoided unless it's impossible to do so. I think you're right to be worried about the time of the run() method, but the run() method could do other things as well: delete all your files, open sockets and make network connections, begin cracking your password and email the result back to an attacker, etc.
Perhaps if you can give some more detail on what the dynamically loaded code does, the SO community can help suggest alternatives.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Making file-handling code compatible with asyncio - python

Related

Enhance synchronous software API to allow asynchronous consuming

Use twisted with custom main loop [duplicate]

Python Twisted Deferred : clarification needed

Use my own main loop in twisted

Python: time a method call and stop it if time is exceeded

Categories

Resources