Python asyncio: synchronize all access to a shared object

I have a class which processes a bunch of work elements asynchronously (mainly due to overlapping HTTP connection requests) using asyncio. A very simplified example to demonstrate the structure of my code:
import asyncio
from concurrent.futures import ThreadPoolExecutor

class Work:
    ...

    def worker(self, item):
        # do some work on item...
        return

    def queue(self):
        # generate the work items...
        yield from range(100)

    async def run(self):
        with ThreadPoolExecutor(max_workers=10) as executor:
            loop = asyncio.get_event_loop()
            tasks = [
                loop.run_in_executor(executor, self.worker, item)
                for item in self.queue()
            ]
            for result in await asyncio.gather(*tasks):
                pass

work = Work()
asyncio.run(work.run())
In practice, the workers need to access a shared container-like object and call its methods which are not async-safe. For example, let's say the worker method calls a function defined like this:
def func(shared_obj, value):
    for node in shared_obj.filter(value):
        shared_obj.remove(node)
However, calling func from a worker might affect the other asynchronous workers in this or any other function involving the shared object. I know that I need to use some synchronization, such as a global lock, but I don't find its usage easy:
- asyncio.Lock can be used only in async functions, so I would have to mark all such function definitions as async
- I would also have to await all calls of these functions
- await is itself usable only in async functions, so eventually every function between worker and func would become async
- if the worker were async, it could no longer be passed to loop.run_in_executor (which expects a plain callable and does not await the result)
Furthermore, some of the functions where I would have to add async may be generic in the sense that they should be callable from asynchronous as well as "normal" context.
I'm probably missing something serious in the whole concept. With the threading module, I would just create a lock and work with it in a couple of places, without having to further annotate the functions. Also, there is a nice solution to wrap the shared object such that all access is transparently guarded by a lock. I'm wondering if something similar is possible with asyncio...

I'm probably missing something serious in the whole concept. With the threading module, I would just create a lock...
What you are missing is that you're not really using asyncio at all. run_in_executor serves to integrate CPU-bound or legacy sync code into an asyncio application. It works by submitting the function to a ThreadPoolExecutor and returning an awaitable handle which gets resolved once the function completes. This is "async" in the sense of running in the background, but not in the sense that is central to asyncio: an asyncio program is composed of non-blocking pieces that use async/await to suspend execution when data is unavailable and rely on the event loop to efficiently wait for multiple events at once and resume the appropriate async functions.
In other words, as long as you rely on run_in_executor, you are just using threading (more precisely concurrent.futures with a threading executor). You can use a threading.Lock to synchronize between functions, and things will work exactly as if you used threading in the first place.
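For the func from the question, that could look like this, as a minimal sketch (the module-level shared_lock is an addition for illustration, not part of the original code):

import threading

shared_lock = threading.Lock()

def func(shared_obj, value):
    # the workers run in ThreadPoolExecutor threads, so an ordinary
    # threading.Lock is enough to serialize access to shared_obj
    with shared_lock:
        for node in shared_obj.filter(value):
            shared_obj.remove(node)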
To get the benefits of asyncio such as scaling to a large number of concurrent tasks or reliable cancellation, you should design your program as async (or mostly async) from the ground up. Then you'll be able to modify shared data atomically simply by doing it between two awaits, or use asyncio.Lock for synchronized modification across awaits.
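A minimal sketch of that async-first shape, with asyncio.sleep and a plain set standing in for the real HTTP calls and shared container:

import asyncio

class Work:
    def __init__(self):
        self.shared = set(range(100))   # stand-in for the shared container
        self.lock = asyncio.Lock()

    async def worker(self, item):
        await asyncio.sleep(0.01)       # stand-in for a non-blocking HTTP request
        async with self.lock:           # only required if the critical section itself awaits;
            self.shared.discard(item)   # code between two awaits is already atomic

    async def run(self):
        await asyncio.gather(*(self.worker(i) for i in range(100)))

asyncio.run(Work().run())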

Related

is asyncio.to_thread() method different to ThreadPoolExecutor?

I see that the asyncio.to_thread() method has been added in Python 3.9+; its description says it runs blocking code in a separate thread. See the example below:
import asyncio
import time

def blocking_io():
    print(f"start blocking_io at {time.strftime('%X')}")
    # Note that time.sleep() can be replaced with any blocking
    # IO-bound operation, such as file operations.
    time.sleep(1)
    print(f"blocking_io complete at {time.strftime('%X')}")

async def main():
    print(f"started main at {time.strftime('%X')}")
    await asyncio.gather(
        asyncio.to_thread(blocking_io),
        asyncio.sleep(1))
    print(f"finished main at {time.strftime('%X')}")

asyncio.run(main())

# Expected output:
#
# started main at 19:50:53
# start blocking_io at 19:50:53
# blocking_io complete at 19:50:54
# finished main at 19:50:54
From the explanation, it seems to use a thread mechanism rather than coroutine context switching. Does this mean it is not actually async after all? Is it the same as traditional multi-threading, as in concurrent.futures.ThreadPoolExecutor? What is the benefit of using a thread this way, then?
The source code of to_thread is quite simple. It boils down to awaiting run_in_executor with the default executor (the executor argument is None), which is a ThreadPoolExecutor.
In fact, yes, this is traditional multithreading. The code intended to run in a separate thread is not asynchronous, but to_thread allows you to await its result asynchronously.
Also note that the function runs in the context of the current task, so its context variable values will be available inside the func.
async def to_thread(func, /, *args, **kwargs):
    """Asynchronously run function *func* in a separate thread.

    Any *args and **kwargs supplied for this function are directly passed
    to *func*. Also, the current :class:`contextvars.Context` is propagated,
    allowing context variables from the main thread to be accessed in the
    separate thread.

    Return a coroutine that can be awaited to get the eventual result of *func*.
    """
    loop = events.get_running_loop()
    ctx = contextvars.copy_context()
    func_call = functools.partial(ctx.run, func, *args, **kwargs)
    return await loop.run_in_executor(None, func_call)
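A small sketch of that context propagation, using a made-up request_id context variable:

import asyncio
import contextvars

request_id = contextvars.ContextVar("request_id", default=None)

def blocking_work():
    # runs in a worker thread, yet sees the value set in the calling task
    return f"handling request {request_id.get()}"

async def main():
    request_id.set("abc-123")
    print(await asyncio.to_thread(blocking_work))   # -> handling request abc-123

asyncio.run(main())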
You would use asyncio.to_thread whenever you need to call a blocking API from a third-party lib that either does not have an asyncio adapter/interface, or where you do not want to create one because you just need a limited number of functions from that lib.
A concrete example: I am currently writing an application that will eventually run as a daemon, at which point it will use asyncio for its core event loop. The event loop will involve monitoring a unix socket for notifications which will trigger the daemon to take an action.
For rapid prototyping it's currently a CLI, but one of the dependencies/external systems the daemon will interact with is called libvirt, an abstraction layer for virtual machine management written in C with a Python wrapper called libvirt-python.
The Python bindings are blocking and communicate with the libvirt daemon over a separate unix socket using a blocking request/response protocol.
You can conceptually think of a call to the libvirt bindings as each function internally making an HTTP request to a server and waiting for the server to complete the action. The exact mechanics of how it does that are not important for this discussion, just that it is a blocking IO operation that depends on an external process and may take some time; i.e. this is not a CPU-bound call, and therefore it can be offloaded to a thread and awaited.
If I were to directly call "domains = libvirt.conn.listAllDomains()" in an async function, that would block my asyncio event loop until I got a response from libvirt. So if any events were received on the unix socket my main loop is monitoring, they would not be processed while we are waiting for the libvirt daemon to look up all domains and return the list of them to us.
If I use "domains = await asyncio.to_thread(libvirt.conn.listAllDomains)" instead, the await call will suspend my current coroutine until we get the response, yielding execution back to the asyncio event loop. That means that if the daemon receives a notification while we are waiting on libvirt, it can be scheduled to run concurrently instead of being blocked.
In my application I will also need to read and write Linux special files in /sys. Linux has native aio file support which can be used with asyncio via aiofile; however, Linux does not support the aio interface for managing special files, so I would have to use blocking IO.
One way to do that in an async application would be to wrap the function that writes to the special files in asyncio.to_thread.
I could, and might, use a decorator that calls run_in_executor directly since I own the write_sysfs function, but if I did not, then to_thread is more polite than monkeypatching someone else's lib and less work than creating my own wrapper API.
Hopefully those are useful examples of where you might want to use to_thread. It's really just a convenience function, and you can use run_in_executor to do the same thing with no additional overhead.
If you need to support older Python releases you might also prefer run_in_executor, since it predates the introduction of to_thread, but if you can assume 3.9+ then it's a nice addition to leverage when you need it.
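As a rough illustration of the decorator idea mentioned above (the decorator name and the write_sysfs body are hypothetical, shown only to sketch the pattern):

import asyncio
import functools

def run_in_thread(func):
    # wrap a blocking function so that awaiting it runs it in the default executor
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, functools.partial(func, *args, **kwargs))
    return wrapper

@run_in_thread
def write_sysfs(path, value):
    # hypothetical blocking helper; opening /sys files is ordinary blocking IO
    with open(path, "w") as f:
        f.write(value)

# usage from async code: await write_sysfs("/sys/.../some_file", "1")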

Python: why use AsyncIO if not with asyncio.gather()?

I recently started looking into asynchronous programming in Python. Let's say we want to run a function asynchronously, an example below:
async def print_i_async(no):
    print("Async: Preparing print of " + str(no))
    await asyncio.sleep(1)
    print(str(no))

async def main_async(no):
    await asyncio.gather(*(print_i_async(i) for i in range(no)))

asyncio.run(main_async(no))
This will, as expected, work asynchronously. It's not clear to me, however, why we would use asynchronous functions if not with asyncio.gather(). For example:
def print_i_serial(no):
    print("Serial: Preparing print of " + str(no))
    time.sleep(1)
    print(str(no))

for i in range(5):
    print_i_serial(i)

for i in range(5):
    asyncio.run(print_i_async(i))
These two functions produce the same result. Am I missing something? Is there any reason we would use an async def if we don't use asyncio.gather(), given this is how we actually get asynchronous results?
There are many reasons to use asyncio besides gather.
What you are really asking is: are there more ways to create concurrent executions besides gather?
To that the answer is yes.
Yes, gather is one of the simplest and most straightforward examples for creating concurrency with asyncio, but it's not limited to gather.
What gather does is take a bunch of awaitables (wrapping coroutines in tasks where needed), wait for all of them, and return the results once all the futures are ready (plus a bunch of other things, such as propagating cancellation).
Let's examine just two more examples of ways to achieve concurrency:
as_completed - similarly to gather, you send in a bunch of awaitables, but instead of waiting for all of them to be ready, this method returns you the futures as they become ready, unordered.
Another example is to create tasks yourself, e.g. with event_loop.create_task(). This will allow you to create a task that will run on the event loop, which you can later await. In the meantime (until you await the task) you can continue running other code, and basically achieve concurrency (note the task will not run straightaway, but only when you yield control back to the event loop, and it handles the task).
There are many more ways to achieve concurrency. You can start with these examples (the 2nd one is actually a general way you can use to create lots of different concurrent "topologies" of executions).
You can start by reading https://docs.python.org/3/library/asyncio-task.html
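A minimal sketch of the two approaches just described, with work() standing in for any coroutine:

import asyncio

async def work(i):
    await asyncio.sleep(1)
    return i

async def demo_as_completed():
    # results arrive in completion order, not submission order
    for fut in asyncio.as_completed([work(i) for i in range(5)]):
        print(await fut)

async def demo_create_task():
    # the task starts running as soon as we yield control to the event loop
    task = asyncio.create_task(work(42))
    print("doing other things while the task runs...")
    await asyncio.sleep(0.5)   # any await gives the task a chance to run
    print(await task)          # collect the result

asyncio.run(demo_as_completed())
asyncio.run(demo_create_task())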

How to efficiently use asyncio when calling a method on a BaseProxy?

I'm working on an application that uses LevelDB and that uses multiple long-lived processes for different tasks.
Since LevelDB only allows a single process to maintain a database connection, all our database access is funneled through a special database process.
To access the database from another process we use a BaseProxy. But since we are using asyncio our proxy shouldn't block on these APIs that call into the db process which then eventually read from the db. Therefore we implement the APIs on the proxy using an executor.
loop = asyncio.get_event_loop()
return await loop.run_in_executor(
    thread_pool_executor,
    self._callmethod,
    method_name,
    args,
)
And while that works just fine, I wonder if there's a better alternative to wrapping the _callmethod call of the BaseProxy in a ThreadPoolExecutor.
The way I understand it, the BaseProxy calling into the DB process is the textbook example of waiting on IO, so using a thread for this seems unnecessarily wasteful.
In a perfect world, I'd assume an async _acallmethod to exist on the BaseProxy but unfortunately that API does not exist.
So, my question basically boils down to: When working with BaseProxy is there a more efficient alternative to running these cross process calls in a ThreadPoolExecutor?
Unfortunately, the multiprocessing library is not suited to conversion to asyncio; what you have is the best you can do if you must use BaseProxy to handle your IPC (inter-process communication).
While it is true that the library uses blocking I/O, you can't easily reach in and re-work the blocking parts to use non-blocking primitives instead. If you were to insist on going this route, you'd have to patch or rewrite the internal implementation details of that library, but being internal implementation details, these can differ from Python point release to point release, making any patching fragile and prone to break with minor Python upgrades. The _callmethod method is part of a deep hierarchy of abstractions involving threads, socket or pipe connections, and serializers. See multiprocessing/connection.py and multiprocessing/managers.py.
So your options here are to stick with your current approach (using a threadpool executor to shove BaseProxy._callmethod() to another thread) or to implement your own IPC solution using asyncio primitives. Your central database-access process would act as a server for your other processes to connect to as a client, either using sockets or named pipes, using an agreed-upon serialisation scheme for client requests and server responses. This is what multiprocessing implements for you, but you'd implement your own (simpler) version, using asyncio streams and whatever serialisation scheme best suits your application patterns (e.g. pickle, JSON, protobuffers, or something else entirely).
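A rough sketch of what such a custom IPC layer could look like, assuming a newline-delimited JSON protocol over a unix socket; run_db_method and the socket path are placeholders, not part of any existing API:

import asyncio
import json

# --- database process: serve queries over a unix socket ---
async def handle_client(reader, writer):
    async for line in reader:
        request = json.loads(line)
        result = run_db_method(request["method"], request["args"])  # placeholder for the real DB call
        writer.write(json.dumps(result).encode() + b"\n")
        await writer.drain()
    writer.close()

async def serve(path="/tmp/db.sock"):
    server = await asyncio.start_unix_server(handle_client, path=path)
    async with server:
        await server.serve_forever()

# --- client processes: one coroutine per request ---
async def call_db(method, *args, path="/tmp/db.sock"):
    reader, writer = await asyncio.open_unix_connection(path)
    writer.write(json.dumps({"method": method, "args": list(args)}).encode() + b"\n")
    await writer.drain()
    response = json.loads(await reader.readline())
    writer.close()
    await writer.wait_closed()
    return response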
A thread pool is what you want. aioprocessing provides some async functionality of multiprocessing, but it does it using threads as you have proposed. I suggest making an issue against python if there isn't one for exposing true async multiprocessing.
https://github.com/dano/aioprocessing
In most cases, this library makes blocking calls to multiprocessing methods asynchronous by executing the call in a ThreadPoolExecutor
Assuming you have Python and the database running on the same system (i.e. you are not looking to async any network calls), you have two options:
1. What you are already doing (run in executor). It blocks the DB thread, but the main thread remains free to do other stuff. This is not purely non-blocking, but it is quite an acceptable solution for I/O-blocking cases, with the small overhead of maintaining a thread.
2. For a truly non-blocking solution (one that can run in a single thread without blocking) you have to have (a) native support for async callbacks from the DB for each fetch call and (b) wrap that in your custom event loop implementation. Here you subclass the base loop and override methods to integrate your DB callbacks. For example, you can create a base loop that implements a pipe server: the DB writes to the pipe and Python polls the pipe. See the implementation of the Proactor event loop in the asyncio code base. Note: I have never implemented any custom event loop.
I am not familiar with LevelDB, but for a key-value store it is not clear whether such a fetch callback and a purely non-blocking implementation would bring any significant benefit. If you are doing multiple fetches inside an iterator and that is your main problem, you can make the loop async (with each fetch still blocking) and improve your performance. Below is some dummy code that illustrates this.
import asyncio
import random
import time

async def talk_to_db(d):
    """
    blocking db iteration. sleep is the fetch function.
    """
    for k, v in d.items():
        time.sleep(1)
        yield (f"{k}:{v}")

async def talk_to_db_async(d):
    """
    real non-blocking db iteration. fetch (sleep) is native async here
    """
    for k, v in d.items():
        await asyncio.sleep(1)
        yield (f"{k}:{v}")

async def talk_to_db_async_loop(d):
    """
    semi-non-blocking db iteration. fetch is blocking, but the
    loop is not.
    """
    for k, v in d.items():
        time.sleep(1)
        yield (f"{k}:{v}")
        await asyncio.sleep(0)

async def db_call_wrapper(db):
    async for row in talk_to_db(db):
        print(row)

async def db_call_wrapper_async(db):
    async for row in talk_to_db_async(db):
        print(row)

async def db_call_wrapper_async_loop(db):
    async for row in talk_to_db_async_loop(db):
        print(row)

async def func(i):
    await asyncio.sleep(5)
    print(f"done with {i}")

database = {i: random.randint(1, 20) for i in range(20)}

async def main():
    db_coro = db_call_wrapper(database)
    coros = [func(i) for i in range(20)]
    coros.append(db_coro)
    await asyncio.gather(*coros)

async def main_async():
    db_coro = db_call_wrapper_async(database)
    coros = [func(i) for i in range(20)]
    coros.append(db_coro)
    await asyncio.gather(*coros)

async def main_async_loop():
    db_coro = db_call_wrapper_async_loop(database)
    coros = [func(i) for i in range(20)]
    coros.append(db_coro)
    await asyncio.gather(*coros)

# run the blocking db iteration
loop = asyncio.get_event_loop()
loop.run_until_complete(main())

# run the non-blocking db iteration
loop = asyncio.get_event_loop()
loop.run_until_complete(main_async())

# run the non-blocking (loop only) db iteration
loop = asyncio.get_event_loop()
loop.run_until_complete(main_async_loop())
This is something you can try. Otherwise, I would say your current method is quite efficient. I do not think BaseProxy can give you an async acall API; it does not know how to handle the callback from your db.

Is it safe when two asyncio tasks access the same awaitable object?

Simply speaking, thread-safe means that it is safe for more than one thread to access the same resource, and I know asyncio fundamentally uses a single thread.
However, more than one asyncio task could access a resource at the same time, just like in multi-threading.
For example, consider a DB connection (where the object is not thread-safe but supports asyncio operations):
1. Schedule Task A and Task B, both accessing the same DB object.
2. The IO loop executes Task A.
3. Task A awaits an IO operation on the DB object (which takes long enough).
4. The IO loop executes Task B.
5. Step 3's IO operation is still in progress (not done).
6. Task B awaits an IO operation on the same DB object.
Now Task B is accessing the same object at the same time.
Is it completely safe in asyncio, and if so, what makes it safe?
Using the same asyncio object from multiple tasks is safe in general. As an example, aiohttp has a session object, and it is expected for multiple tasks to access the same session "in parallel".
if so, what makes it safe?
The basic architecture of asyncio allows for multiple coroutines to await a single future result - they will simply all subscribe to the future's completion, and all will be scheduled to run once the result is ready. And this applies not only to coroutines, but also to synchronous code that subscribes to the future using add_done_callback.
That is how asyncio will handle your scenario: tasks A and B will ultimately subscribe to some future awaited by the DB object. Once the result is available, it will be delivered to both of them in turn.
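A minimal illustration of that mechanism, with two tasks awaiting the same future:

import asyncio

async def waiter(name, fut):
    # both tasks await the same future; each one gets the result when it is set
    result = await fut
    print(f"{name} got {result!r}")

async def main():
    fut = asyncio.get_running_loop().create_future()
    tasks = [asyncio.create_task(waiter("A", fut)),
             asyncio.create_task(waiter("B", fut))]
    await asyncio.sleep(0.1)   # let both tasks start waiting
    fut.set_result("data")     # both waiters are woken up with the same value
    await asyncio.gather(*tasks)

asyncio.run(main())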
Pitfalls typically associated with multi-threaded programming do not apply to asyncio because:
Unlike with threads, it is very predictable where a context switch can occur - just look at await statements in the code (and also async with and async for - but those are still very visible keywords). Anything between them is, for all intents and purposes, atomic. This eliminates the need for synchronization primitives to protect objects, as well as the mistakes that result from mishandling such tools.
All access to data happens from the thread that runs the event loop. This eliminates the possibility of a data race, reading of shared memory that is being concurrently written to.
One scenario in which multi-tasking could fail is multiple consumers attaching to the same stream-like resource. For example, if several tasks try to await reader.read(n) on the same reader stream, exactly one of them will get the new data [1], and the others will keep waiting until new data arrives. The same applies to any shared streaming resource, including file descriptors or generators shared by multiple objects. And even then, one of the tasks is guaranteed to obtain the data, and the integrity of the stream object will not be compromised in any way.
[1] One task receiving the data only applies if the tasks share the reader and each task separately calls data = await reader.read(n). If one were to extract a future with fut = asyncio.ensure_future(reader.read(n)) (without using await), share the future among multiple tasks, and await it in each task with data = await fut, all tasks would be notified of the particular chunk of data that ends up returned by that future.
No, asyncio is not thread-safe. Generally only one thread should have control over an event loop and/or a resource associated with the event loop. If some other thread wants to access it, it should do so via special methods, like call_soon_threadsafe.
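For example, a minimal sketch of handing work to the loop from another thread:

import asyncio
import threading

def from_worker_thread(loop, message):
    # the only safe way to touch the loop from a foreign thread
    loop.call_soon_threadsafe(print, f"received: {message}")

async def main():
    loop = asyncio.get_running_loop()
    threading.Thread(target=from_worker_thread, args=(loop, "hello")).start()
    await asyncio.sleep(0.1)   # give the scheduled callback a chance to run

asyncio.run(main())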

Making file-handling code compatible with asyncio

The "traditional" way for a library to take file input is to do something like this:
def foo(file_obj):
    data = file_obj.read()
    # Do other things here
The client code is responsible for opening the file, seeking to the appropriate point (if necessary), and closing it. If the client wants to hand us a pipe or socket (or a StringIO, for that matter), they can do that and it Just Works.
But this isn't compatible with asyncio, which requires a syntax more like this:
def foo(file_obj):
    data = yield from file_obj.read()
    # Do other things here
Naturally, this syntax only works with asyncio objects; trying to use it with traditional file objects makes a mess. The reverse is also true.
Worse, it seems to me there's no way to wrap this yield from inside a traditional .read() method, because we need to yield all the way up to the event loop, not just at the site where the reading happens. The gevent library does do something like this, but I don't see how to adapt their greenlet code into generators.
If I'm writing a library that handles file input, how should I deal with this situation? Do I need two versions of the foo() function? I have many such functions; duplicating all of them is not scalable.
I could tell my client developers to use run_in_executor() or some equivalent, but that feels like working against asyncio instead of with it.
This is one of the downsides of explicit asynchronous frameworks. Unlike gevent, which can monkeypatch synchronous code to make it asynchronous without any code changes, you can't make synchronous code asyncio-compatible without rewriting it to use asyncio.coroutine and yield from (or at least asyncio.Futures and callbacks) all the way down.
There's no way that I know of to have the same function work properly in both an asyncio and a normal, synchronous context; any code that's asyncio-compatible is going to rely on the event loop running to drive the asynchronous portions, so it won't work in a normal context, and synchronous code is always going to end up blocking the event loop if it's run in an asyncio context. This is why you generally see asyncio-specific (or at least asynchronous framework-specific) versions of libraries alongside synchronous versions. There's just no good way to present a unified API that works with both.
Having considered this some more, I've come to the conclusion that it is possible to do this, but it's not exactly beautiful.
Start with the traditional version of foo():
def foo(file_obj):
    data = file_obj.read()
    # Do other things here
We need to pass a file object which will behave "correctly" here. When the file object needs to do I/O, it should follow this process:
1. It creates a new event.
2. It creates a closure which, when invoked, performs the necessary I/O and then sets the event.
3. It hands the closure off to the event loop using call_soon_threadsafe().
4. It blocks on the event.
Here's some example code:
import asyncio, threading

# inside the file object class
def read(self):
    event = threading.Event()

    async def closure():
        # self.reader is an asyncio StreamReader or similar
        self._tmp = await self.reader.read()
        event.set()

    # call_soon_threadsafe cannot drive a coroutine by itself, so the closure
    # is wrapped in a task; self._loop is assumed to hold a reference to the
    # event loop running in the main thread (asyncio.get_event_loop() cannot
    # be called from a worker thread)
    self._loop.call_soon_threadsafe(lambda: asyncio.ensure_future(closure()))
    event.wait()
    return self._tmp
We then arrange for foo(file_obj) to be run in an executor (e.g. using run_in_executor() as suggested in the OP).
The nice thing about this technique is that it works even if the author of foo() has no knowledge of asyncio. It also ensures I/O is served on the event loop, which could be desirable in certain circumstances.
