How can I make the following object instantiation asynchronous:
class IndexHandler(tornado.web.RequestHandler):
    def get(self, id):
        # Async the following
        data = MyDataFrame(id)
        self.write(data.getFoo())
MyDataFrame returns a pandas DataFrame object, and constructing it can take some time depending on the file it has to parse.
MyDataFrame() is a synchronous interface; to use it without blocking you need to do one of two things:
Rewrite it to be asynchronous. You can't really make an __init__ method asynchronous, so you'll need to refactor things into a static factory function instead of a constructor. In most cases this path only makes sense if the method depends on network I/O (and not reading from the filesystem or processing the results on the CPU).
Run it on a worker thread and asynchronously wait for its result on the main thread. From the way you've framed the question, this sounds like the right approach for you. I recommend the concurrent.futures package (in the standard library since Python 3.2; available via pip install futures for 2.x).
This would look something like:
@tornado.gen.coroutine
def get(self, id):
    data = yield executor.submit(MyDataFrame, id)
    self.write(data.getFoo())
where executor is a global instance of ThreadPoolExecutor.
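For readers who want something they can run without Tornado installed, here is a minimal sketch of the same worker-thread pattern using only the standard library. SlowFrame and handle are hypothetical stand-ins (for MyDataFrame and the request handler's get method), and the 0.05 second sleep only simulates parsing a file. On Tornado 5 and later, which runs on the asyncio event loop, the same run_in_executor line should work inside an async def get.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Global executor, as in the answer above
executor = ThreadPoolExecutor(max_workers=4)


class SlowFrame:
    """Hypothetical stand-in for MyDataFrame: the constructor blocks."""

    def __init__(self, id):
        time.sleep(0.05)  # simulate parsing a file
        self.id = id

    def getFoo(self):
        return "foo-%s" % self.id


async def handle(id):
    # Build the object on a worker thread; the event loop stays free meanwhile
    loop = asyncio.get_running_loop()
    data = await loop.run_in_executor(executor, SlowFrame, id)
    return data.getFoo()


async def main():
    # Three "requests" are served concurrently
    return await asyncio.gather(handle(1), handle(2), handle(3))


results = asyncio.run(main())
executor.shutdown()
```

The three constructors run on separate worker threads, so the total time is close to one sleep rather than three.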
Consider a simple workflow like this:
from dask.distributed import Client
import time

with Client() as client:
    futs = client.map(time.sleep, list(range(10)))
The above code will submit and almost immediately cancel the futures since the context manager will close. It's possible to keep the context manager open until tasks are completed with client.gather, however that will block further execution in the current process.
I am interested in submitting tasks to multiple clusters (e.g. local and distributed) within the same process, ideally without blocking the current process. It's straightforward to do with explicit definition of different clients and clusters, but is it also possible with context managers (one for each unique client/cluster)?
It might sound like a bit of an anti-pattern, but maybe there is a way to close the cluster only after all futures have finished running. I tried fire_and_forget and also tried passing shutdown_on_close=False, but that doesn't seem to be implemented.
For some Dask cluster/scheduler types, such as the dask-cloudprovider ECSCluster, the approach described above using the with block and shutdown_on_close=False would work fine.
Both ECSCluster and SLURMCluster are derived from SpecCluster. However, ECSCluster passes its **kwargs (including shutdown_on_close) down to the SpecCluster constructor via this call:
super().__init__(**kwargs)
(see the ECSCluster code here)
SLURMCluster does not: it calls the JobQueueCluster constructor which in turn instantiates SpecCluster with only a subset of its parameters:
super().__init__(
    scheduler=scheduler,
    worker=worker,
    loop=loop,
    security=security,
    silence_logs=silence_logs,
    asynchronous=asynchronous,
    name=name,
)
See the JobQueueCluster code here
Therefore SLURMCluster/JobQueueCluster is ignoring shutdown_on_close (and other optional parameters). Looks like an update to JobQueueCluster would be required for your use case.
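On the separate question of holding several context managers open without nesting with blocks, one option that may help is contextlib.ExitStack from the standard library. The sketch below is only an illustration of that idea: it uses two ThreadPoolExecutor instances as stand-ins for the two clients (an assumption, so that it runs without a cluster), and the names local_pool, remote_pool and square are hypothetical. It is untested against real Dask clients, which, as noted in the question, cancel pending futures when they close, so the results still have to be collected before the stack is closed.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import ExitStack


def square(x):
    return x * x


# ExitStack keeps every entered context manager open until close() is called
stack = ExitStack()
local_pool = stack.enter_context(ThreadPoolExecutor(max_workers=2))
remote_pool = stack.enter_context(ThreadPoolExecutor(max_workers=2))

# Submitting does not block; both pools work in the background
local_futs = [local_pool.submit(square, i) for i in range(5)]
remote_futs = [remote_pool.submit(square, i) for i in range(5, 10)]

# ... other work in the current process can happen here ...

# Collect results only when they are needed, then close everything at once
results = [f.result() for f in local_futs + remote_futs]
stack.close()
```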
I have a class which processes a bunch of work elements asynchronously (mainly due to overlapping HTTP connection requests) using asyncio. A very simplified example to demonstrate the structure of my code:
import asyncio
from concurrent.futures import ThreadPoolExecutor

class Work:
    ...

    def worker(self, item):
        # do some work on item...
        return

    def queue(self):
        # generate the work items...
        yield from range(100)

    async def run(self):
        with ThreadPoolExecutor(max_workers=10) as executor:
            loop = asyncio.get_event_loop()
            tasks = [
                loop.run_in_executor(executor, self.worker, item)
                for item in self.queue()
            ]
            for result in await asyncio.gather(*tasks):
                pass

work = Work()
asyncio.run(work.run())
In practice, the workers need to access a shared container-like object and call its methods which are not async-safe. For example, let's say the worker method calls a function defined like this:
def func(shared_obj, value):
    for node in shared_obj.filter(value):
        shared_obj.remove(node)
However, calling func from a worker might affect the other asynchronous workers in this or any other function involving the shared object. I know that I need to use some synchronization, such as a global lock, but I don't find its usage easy:
asyncio.Lock can be used only in async functions, so I would have to mark all such function definitions as async
I would also have to await all calls of these functions
await is also usable only in async functions, so eventually all functions between worker and func would be async
if the worker was async, it would not be possible to pass it to loop.run_in_executor (it does not await)
Furthermore, some of the functions where I would have to add async may be generic in the sense that they should be callable from asynchronous as well as "normal" context.
I'm probably missing something serious in the whole concept. With the threading module, I would just create a lock and work with it in a couple of places, without having to further annotate the functions. Also, there is a nice solution to wrap the shared object such that all access is transparently guarded by a lock. I'm wondering if something similar is possible with asyncio...
I'm probably missing something serious in the whole concept. With the threading module, I would just create a lock...
What you are missing is that you're not really using asyncio at all. run_in_executor serves to integrate CPU-bound or legacy sync code into an asyncio application. It works by submitting the function to a ThreadPoolExecutor and returning an awaitable handle which gets resolved once the function completes. This is "async" in the sense of running in the background, but not in the sense that is central to asyncio. An asyncio program is composed of non-blocking pieces that use async/await to suspend execution when data is unavailable and rely on the event loop to efficiently wait for multiple events at once and resume appropriate async functions.
In other words, as long as you rely on run_in_executor, you are just using threading (more precisely concurrent.futures with a threading executor). You can use a threading.Lock to synchronize between functions, and things will work exactly as if you used threading in the first place.
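As a minimal sketch of that point, the following runs plain synchronous methods through run_in_executor and guards the shared container with an ordinary threading.Lock. SharedList and remove_matching are hypothetical stand-ins for the shared object and func from the question; nothing in the call chain has to be marked async.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedList:
    """Hypothetical shared container; every access goes through one lock."""

    def __init__(self, items):
        self._items = list(items)
        self._lock = threading.Lock()

    def remove_matching(self, value):
        # Filter and remove as one guarded step, like func() in the question
        with self._lock:
            matches = [x for x in self._items if x % value == 0]
            for x in matches:
                self._items.remove(x)
            return len(matches)

    def snapshot(self):
        with self._lock:
            return list(self._items)


async def run(shared):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=4) as executor:
        tasks = [
            loop.run_in_executor(executor, shared.remove_matching, value)
            for value in (2, 3, 5)
        ]
        return await asyncio.gather(*tasks)


shared = SharedList(range(1, 31))
removed = asyncio.run(run(shared))
remaining = shared.snapshot()
```

Whichever order the threads run in, each number divisible by 2, 3 or 5 is removed exactly once, because the filter and the removals happen under the same lock.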
To get the benefits of asyncio such as scaling to a large number of concurrent tasks or reliable cancellation, you should design your program as async (or mostly async) from the ground up. Then you'll be able to modify shared data atomically simply by doing it between two awaits, or use asyncio.Lock for synchronized modification across awaits.
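A small sketch of both cases, assuming a fully async design: the counter update has no await between its read and its write, so it needs no lock, while the total is read, then awaited on, then written, so it is protected with asyncio.Lock. The names worker, shared and main are made up for the illustration.

```python
import asyncio


async def worker(shared, lock, value):
    # No await between read and write: other coroutines cannot interleave here
    shared["count"] += 1

    # The update below spans an await, so it is guarded by an asyncio.Lock
    async with lock:
        current = shared["total"]
        await asyncio.sleep(0)  # suspension point; other workers may run now
        shared["total"] = current + value


async def main():
    shared = {"count": 0, "total": 0}
    lock = asyncio.Lock()
    await asyncio.gather(*(worker(shared, lock, v) for v in range(10)))
    return shared


result = asyncio.run(main())
```

Without the lock, every worker would read total as 0 before any of them wrote it back, and most of the additions would be lost.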
I'm running a REST server using Flask, and I have a method that updates some variables that other methods only read. I'd like to be able to safely update these variables, but I'm not sure how to approach this:
Is there some built-in Flask feature to suspend other requests while a specific one is being handled? If that method isn't running, other methods are free to run concurrently.
Perhaps I need to use some thread lock? I reviewed the locks Python's threading library has to offer, and couldn't find a lock that offers two kinds of locking: for writing and for reading. Do I need to implement such a thing myself?
I think a lock probably is what you want; an example of how to use one is as follows:
from threading import RLock

class App(object):
    def __init__(self):
        self._lock = RLock()
        self._thing = 0

    def read_thing(self):
        with self._lock:
            print(self._thing)

    def write_thing(self):
        with self._lock:
            self._thing += 1
So, let's imagine this object of ours (App) is created and then accessed from two different threads (e.g. two different requests); the lock object is used in a context-management fashion (the "with" keyword) to ensure that all operations that could be thread-unsafe are done within the lock.
Under the hood, the lock guarantees that only one thread at a time can run a block guarded by it, so any other thread that reaches for the variable through these methods has to wait until the lock is released.
This means we can spam read_thing and write_thing to our hearts' content in as many threads as we like, and we shouldn't break anything.
So, for your Flask app, declare a lock and then whenever you access those variables you're worried about, do so inside the lock.
NOTE: If you're working with dictionaries, be sure to take copies of the dictionary ("copy.deepcopy" is one way), because otherwise you'll pass a reference to the actual dictionary and you'll be back to being thread-unsafe.
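To make the note about dictionaries concrete, here is a minimal sketch in which the reader gets a deep copy taken while the lock is held. The Settings class and its keys are hypothetical; the point is only that mutating the returned copy cannot affect the shared state.

```python
import copy
from threading import RLock


class Settings(object):
    """Hypothetical shared state that readers receive as a private copy."""

    def __init__(self):
        self._lock = RLock()
        self._data = {"limits": {"max": 10}}

    def read(self):
        with self._lock:
            # Copy inside the lock so no writer can change it mid-copy
            return copy.deepcopy(self._data)

    def update(self, key, value):
        with self._lock:
            self._data["limits"][key] = value


settings = Settings()
snapshot = settings.read()
snapshot["limits"]["max"] = 999  # only the caller's copy changes
settings.update("min", 1)        # writers go through the lock
current = settings.read()
```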
The "traditional" way for a library to take file input is to do something like this:
def foo(file_obj):
    data = file_obj.read()
    # Do other things here
The client code is responsible for opening the file, seeking to the appropriate point (if necessary), and closing it. If the client wants to hand us a pipe or socket (or a StringIO, for that matter), they can do that and it Just Works.
But this isn't compatible with asyncio, which requires a syntax more like this:
def foo(file_obj):
    data = yield from file_obj.read()
    # Do other things here
Naturally, this syntax only works with asyncio objects; trying to use it with traditional file objects makes a mess. The reverse is also true.
Worse, it seems to me there's no way to wrap this yield from inside a traditional .read() method, because we need to yield all the way up to the event loop, not just at the site where the reading happens. The gevent library does do something like this, but I don't see how to adapt their greenlet code into generators.
If I'm writing a library that handles file input, how should I deal with this situation? Do I need two versions of the foo() function? I have many such functions; duplicating all of them is not scalable.
I could tell my client developers to use run_in_executor() or some equivalent, but that feels like working against asyncio instead of with it.
This is one of the downsides of explicit asynchronous frameworks. Unlike gevent, which can monkeypatch synchronous code to make it asynchronous without any code changes, you can't make synchronous code asyncio-compatible without rewriting it to use asyncio.coroutine and yield from (or at least asyncio.Futures and callbacks) all the way down.
There's no way that I know of to have the same function work properly in both an asyncio and normal, synchronous context; any code that's asyncio compatible is going to rely on the event loop to be running to drive the asynchronous portions, so it won't work in a normal context, and synchronous code is always going to end up blocking the event loop if it's run in an asyncio context. This is why you generally see asyncio-specific (or at least asynchronous framework-specific) versions of libraries alongside synchronous versions. There's just no good way to present a unified API that works with both.
Having considered this some more, I've come to the conclusion that it is possible to do this, but it's not exactly beautiful.
Start with the traditional version of foo():
def foo(file_obj):
    data = file_obj.read()
    # Do other things here
We need to pass a file object which will behave "correctly" here. When the file object needs to do I/O, it should follow this process:
It creates a new event.
It creates a closure which, when invoked, performs the necessary I/O and then sets the event.
It hands the closure off to the event loop using call_soon_threadsafe().
It blocks on the event.
Here's some example code:
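import asyncio, threading

# inside the file object class
# self.loop (an assumption) is the event loop that owns self.reader;
# read() itself is called from a worker thread, never from the loop thread
def read(self):
    event = threading.Event()

    def closure():
        # self.reader is an asyncio StreamReader or similar.
        # A plain callback cannot "yield from", so schedule the read as a
        # task and set the event from its completion callback.
        task = asyncio.ensure_future(self.reader.read())

        def done(task):
            self._tmp = task.result()
            event.set()

        task.add_done_callback(done)

    self.loop.call_soon_threadsafe(closure)
    event.wait()
    return self._tmp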
We then arrange for foo(file_obj) to be run in an executor (e.g. using run_in_executor() as suggested in the OP).
The nice thing about this technique is that it works even if the author of foo() has no knowledge of asyncio. It also ensures I/O is served on the event loop, which could be desirable in certain circumstances.
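Here is a self-contained sketch of the whole arrangement that can be run as-is. It is a variant of the technique rather than the exact code above: instead of a hand-rolled event and closure it uses asyncio.run_coroutine_threadsafe, which wraps the same steps and also propagates exceptions to the waiting thread. BlockingReader is a hypothetical wrapper name, and the StreamReader is fed by hand only to keep the example free of real sockets.

```python
import asyncio


class BlockingReader:
    """File-like wrapper: read() blocks the calling thread, not the loop."""

    def __init__(self, reader, loop):
        self.reader = reader  # an asyncio StreamReader or similar
        self.loop = loop      # the event loop that owns the reader

    def read(self):
        # Schedule the coroutine on the loop, then block this worker thread
        future = asyncio.run_coroutine_threadsafe(self.reader.read(), self.loop)
        return future.result()


def foo(file_obj):
    # Traditional synchronous library code with no knowledge of asyncio
    data = file_obj.read()
    return data.upper()


async def main():
    loop = asyncio.get_running_loop()
    reader = asyncio.StreamReader()
    reader.feed_data(b"hello ")
    reader.feed_data(b"world")
    reader.feed_eof()
    # foo must run in an executor; calling it on the loop thread would deadlock
    return await loop.run_in_executor(None, foo, BlockingReader(reader, loop))


result = asyncio.run(main())
```

The one hard rule is that foo() never runs on the event loop thread, because the blocking wait would then stop the loop from ever performing the read.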
First of all, I know I can use threading to accomplish such a task, like so:
import Queue
import threading

# called by each thread
def do_stuff(q, arg):
    result = heavy_operation(arg)
    q.put(result)

operations = range(1, 10)
q = Queue.Queue()
for op in operations:
    t = threading.Thread(target=do_stuff, args = (q,op))
    t.daemon = True
    t.start()

s = q.get()
print s
However, in Google App Engine there's something called ndb tasklets, and according to their documentation you can execute code in parallel using them.
Tasklets are a way to write concurrently running functions without
threads; tasklets are executed by an event loop and can suspend
themselves blocking for I/O or some other operation using a yield
statement. The notion of a blocking operation is abstracted into the
Future class, but a tasklet may also yield an RPC in order to wait for
that RPC to complete.
Is it possible to accomplish something like the example with threading above?
I already know how to retrieve entities using get_async() (from the examples on their doc page), but it's very unclear to me how this applies to parallel code execution.
Thanks.
The answer depends on what your heavy_operation really is. If heavy_operation uses RPCs (Remote Procedure Calls, such as datastore access, UrlFetch, etc.), then the answer is yes.
I asked a similar question in "how to understand appengine ndb.tasklet?"; you may find more details there.
May I put any kind of code inside a function and decorate it as ndb.tasklet? Then used it as async function later. Or it must be appengine RPC?
The Answer
Technically yes, but it will not run asynchronously. When you decorate a non-yielding function with @tasklet, its Future's value is computed and set when you call that function. That is, it runs through the entire function when you call it. If you want to achieve asynchronous operation, you must yield on something that does asynchronous work. Generally in GAE it will work its way down to an RPC call.