I would like to listen for events from multiple instances of the same object and then merge these event streams into one stream. For example, if I use async generators:
class PeriodicYielder:
    def __init__(self, period: int) -> None:
        self.period = period

    async def updates(self):
        while True:
            await asyncio.sleep(self.period)
            yield self.period
I can successfully listen for events from one instance:
async def get_updates_from_one():
    each_1 = PeriodicYielder(1)
    async for n in each_1.updates():
        print(n)
        # 1
        # 1
        # 1
        # ...
But how can I get events from multiple async generators? In other words: how can I iterate through multiple async generators in the order in which they are ready to produce the next value?
async def get_updates_from_multiple():
    each_1 = PeriodicYielder(1)
    each_2 = PeriodicYielder(2)
    async for n in magic_async_join_function(each_1.updates(), each_2.updates()):
        print(n)
        # 1
        # 1
        # 2
        # 1
        # 1
        # 2
        # ...
Is there such a magic_async_join_function in the stdlib or in a third-party module?
You can use the wonderful aiostream library. It'll look like this:
import asyncio
from aiostream import stream

async def test1():
    for _ in range(5):
        await asyncio.sleep(0.1)
        yield 1

async def test2():
    for _ in range(5):
        await asyncio.sleep(0.2)
        yield 2

async def main():
    combine = stream.merge(test1(), test2())
    async with combine.stream() as streamer:
        async for item in streamer:
            print(item)

asyncio.run(main())
Result:
1
1
2
1
1
2
1
2
2
2
If you wanted to avoid the dependency on an external library (or as a learning exercise), you could merge the async iterators using a queue:
def merge_async_iters(*aiters):
    # merge async iterators, proof of concept
    queue = asyncio.Queue(1)

    async def drain(aiter):
        async for item in aiter:
            await queue.put(item)

    async def merged():
        while not all(task.done() for task in tasks):
            yield await queue.get()

    tasks = [asyncio.create_task(drain(aiter)) for aiter in aiters]
    return merged()
This passes the test from Mikhail's answer, but it's not perfect: it doesn't propagate the exception in case one of the async iterators raises. Also, if the task that exhausts the merged generator returned by merge_async_iters() gets cancelled, or if the same generator is not exhausted to the end, the individual drain tasks are left hanging.
A more complete version could handle the first issue by detecting an exception and transmitting it through the queue. The second issue can be resolved by having the merged generator cancel the drain tasks as soon as the iteration is abandoned. With those changes, the resulting code looks like this:
def merge_async_iters(*aiters):
    queue = asyncio.Queue(1)
    run_count = len(aiters)
    cancelling = False

    async def drain(aiter):
        nonlocal run_count
        try:
            async for item in aiter:
                await queue.put((False, item))
        except Exception as e:
            if not cancelling:
                await queue.put((True, e))
            else:
                raise
        finally:
            run_count -= 1

    async def merged():
        try:
            while run_count:
                raised, next_item = await queue.get()
                if raised:
                    cancel_tasks()
                    raise next_item
                yield next_item
        finally:
            cancel_tasks()

    def cancel_tasks():
        nonlocal cancelling
        cancelling = True
        for t in tasks:
            t.cancel()

    tasks = [asyncio.create_task(drain(aiter)) for aiter in aiters]
    return merged()
Different approaches to merging async iterators can be found in this answer, and also this one, where the latter allows for adding new streams mid-stride. The complexity and subtlety of these implementations show that, while it is useful to know how to write one, actually doing so is best left to well-tested external libraries such as aiostream that cover all the edge cases.
Related
I have a complex function Vehicle.set_data, which has many nested functions, API calls, DB calls, etc. For the sake of this example, I will simplify it.
I am trying to use Async IO to run Vehicle.set_data on multiple vehicles at once. Here is my Vehicle model:
class Vehicle:
    def __init__(self, token):
        self.token = token

    # Works async
    async def set_data(self):
        await asyncio.sleep(random.random() * 10)

    # Does not work async
    # def set_data(self):
    #     time.sleep(random.random() * 10)
And here is my Async IO routine:
async def set_vehicle_data(vehicle):
    # sleep for T seconds on average
    await vehicle.set_data()

def get_random_string():
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5))

async def producer(queue):
    count = 0
    while True:
        count += 1
        # produce a token and send it to a consumer
        token = get_random_string()
        vehicle = Vehicle(token)
        print(f'produced {vehicle.token}')
        await queue.put(vehicle)
        if count > 3:
            break

async def consumer(queue):
    while True:
        vehicle = await queue.get()
        # process the token received from a producer
        print(f'Starting consumption for vehicle {vehicle.token}')
        await set_vehicle_data(vehicle)
        queue.task_done()
        print(f'Ending consumption for vehicle {vehicle.token}')

async def main():
    queue = asyncio.Queue()
    # #todo now, do I need multiple producers
    producers = [asyncio.create_task(producer(queue))
                 for _ in range(3)]
    consumers = [asyncio.create_task(consumer(queue))
                 for _ in range(3)]
    # with both producers and consumers running, wait for
    # the producers to finish
    await asyncio.gather(*producers)
    print('---- done producing')
    # wait for the remaining tasks to be processed
    await queue.join()
    # cancel the consumers, which are now idle
    for c in consumers:
        c.cancel()

asyncio.run(main())
In the example above, this commented section of code does not allow multiple vehicles to process at once:
    # Does not work async
    # def set_data(self):
    #     time.sleep(random.random() * 10)
Because this is such a complex query in our actual codebase, it would be a tremendous refactor to go flag every single nested function with async and await. Is there any way I can make this function work async without marking up my whole codebase with async?
You can run the function in a separate thread with asyncio.to_thread (available since Python 3.9):
await asyncio.to_thread(self.set_data)
If you're using Python < 3.9, use loop.run_in_executor instead:
loop = asyncio.get_event_loop()
await loop.run_in_executor(None, self.set_data)
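To see why this helps, here is a minimal sketch, assuming Python 3.9+. The function blocking_set_data is a hypothetical stand-in for the real synchronous Vehicle.set_data, and the short fixed delay is only there to keep the demo quick:

```python
import asyncio
import time

def blocking_set_data(delay):
    # Hypothetical stand-in for a synchronous method full of blocking calls.
    time.sleep(delay)
    return delay

async def main():
    start = time.perf_counter()
    # Each call runs in a worker thread of the default executor,
    # so the three blocking sleeps overlap instead of running back to back.
    results = await asyncio.gather(
        *(asyncio.to_thread(blocking_set_data, 0.2) for _ in range(3))
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

Run serially, the three calls would take about 0.6 seconds; offloaded to threads they should finish in roughly the time of one call. Note that the blocking code now runs in several threads at once, so it has to be thread-safe.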
I am working on a sample program that reads from a data source (CSV or RDBMS) in chunks, applies some transformations and sends the result via a socket to a server.
But because the CSV is very large, for testing purposes I want to stop reading after a few chunks.
Unfortunately something goes wrong, and I do not know what it is or how to fix it. Probably I have to do some cancellation, but I am not sure where and how. I get the following error:
Task was destroyed but it is pending!
task: <Task pending coro=<<async_generator_athrow without __name__>()>>
The sample code is:
import asyncio
import json

async def readChunks():
    # this is basically a dummy alternative for reading csv in chunks
    df = [{"chunk_" + str(x) : [r for r in range(10)]} for x in range(10)]
    for chunk in df:
        await asyncio.sleep(0.001)
        yield chunk

async def send(row):
    j = json.dumps(row)
    print(f"to be sent: {j}")
    await asyncio.sleep(0.001)

async def main():
    i = 0
    async for chunk in readChunks():
        for k, v in chunk.items():
            await asyncio.gather(send({k:v}))
        i += 1
        if i > 5:
            break
        #print(f"item in main via async generator is {chunk}")

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
Many async resources, such as generators, need to be cleaned up with the help of an event loop. When an async for loop stops iterating an async generator via break, the generator is cleaned up by the garbage collector only. This means the task is pending (waits for the event loop) but gets destroyed (by the garbage collector).
The most straightforward fix is to aclose the generator explicitly:
async def main():
    i = 0
    aiter = readChunks()  # name iterator in order to ...
    try:
        async for chunk in aiter:
            ...
            i += 1
            if i > 5:
                break
    finally:
        await aiter.aclose()  # ... clean it up when done
These patterns can be simplified using the asyncstdlib library (disclaimer: I maintain this library). asyncstdlib.islice allows taking a fixed number of items before cleanly closing the generator:
import asyncstdlib as a

async def main():
    async for chunk in a.islice(readChunks(), 5):
        ...
If the break condition is dynamic, scoping the iterator guarantees cleanup in any case:
import asyncstdlib as a

async def main():
    async with a.scoped_iter(readChunks()) as aiter:
        async for idx, chunk in a.enumerate(aiter):
            ...
            if idx >= 5:
                break
This works...
import asyncio
import json
import logging

logging.basicConfig(format='%(asctime)s.%(msecs)03d %(message)s',
                    datefmt='%S')
root = logging.getLogger()
root.setLevel(logging.INFO)

async def readChunks():
    # this is basically a dummy alternative for reading csv in chunks
    df = [{"chunk_" + str(x) : [r for r in range(10)]} for x in range(10)]
    for chunk in df:
        await asyncio.sleep(0.002)
        root.info('readChunks: next chunk coming')
        yield chunk

async def send(row):
    j = json.dumps(row)
    root.info(f"to be sent: {j}")
    await asyncio.sleep(0.002)

async def main():
    i = 0
    root.info('main: starting to read chunks')
    async for chunk in readChunks():
        for k, v in chunk.items():
            root.info(f'main: sending an item')
            #await asyncio.gather(send({k:v}))
            stuff = await send({k:v})
        i += 1
        if i > 5:
            break
        #print(f"item in main via async generator is {chunk}")

##loop = asyncio.get_event_loop()
##loop.run_until_complete(main())
##loop.close()

if __name__ == '__main__':
    asyncio.run(main())
... At least it runs and finishes.
The issue with stopping an async generator by breaking out of an async for loop is described in bugs.python.org/issue38013 and looks like it was fixed in 3.7.5.
However, using
loop = asyncio.get_event_loop()
loop.set_debug(True)
loop.run_until_complete(main())
loop.close()
I get a debug error but no Exception in Python 3.8.
Task was destroyed but it is pending!
task: <Task pending name='Task-8' coro=<<async_generator_athrow without __name__>()>>
Using the higher level API asyncio.run(main()) with debugging ON I do not get the debug message. If you are going to try and upgrade to Python 3.7.5-9 you probably should still use asyncio.run().
The problem is simple: you exit the loop early, but the async generator is not exhausted yet (it is still pending):
...
if i > 5:
    break
...
Your readChunks generator is still suspended, waiting on the event loop, when you break out of the loop without letting it finish.
That's why asyncio reports that the task was destroyed but it is pending.
In short, the async task was still doing its work in the background, but you killed it by breaking out of the loop (stopping the program).
I want to gather data from asyncio loops running in sibling processes with Python 3.7
Ideally I would use a multiprocess.JoinableQueue, relying on its join() call for synchronization.
However, its synchronization primitives block the event loop in full (see my partial answer below for an example).
Illustrative prototype:
class MP_GatherDict(dict):
    '''A per-process dictionary which can be gathered from a single one'''
    def __init__(self):
        self.q = multiprocess.JoinableQueue()
        super().__init__()

    async def worker_process_server(self):
        while True:
            (await?) self.q.put(dict(self)) # Put a shallow copy
            (await?) self.q.join() # Wait for it to be gathered

    async def gather(self):
        all_dicts = []
        while not self.q.empty():
            all_dicts.append(await self.q.get())
            self.q.task_done()
        return all_dicts
Note that the put->get->join->put flow might not work as expected but this question really is about using multiprocess primitives in asyncio event loop...
The question would then be: how best to await multiprocess primitives from an asyncio event loop?
This test shows that multiprocess.Queue.get() blocks the whole event loop:
mp_q = mp.JoinableQueue()

async def mp_queue_wait():
    try:
        print('Queue:',mp_q.get(timeout=2))
    except Exception as ex:
        print('Queue:',repr(ex))

async def main_loop_task():
    task = asyncio.get_running_loop().create_task(mp_queue_wait())
    for i in range(3):
        print(i, os.times())
        await asyncio.sleep(1)
    await task
    print(repr(task))

asyncio.run(main_loop_task())
Whose output is:
0 posix.times_result(user=0.41, system=0.04, children_user=0.0, children_system=0.0, elapsed=17208620.18)
Queue: Empty()
1 posix.times_result(user=0.41, system=0.04, children_user=0.0, children_system=0.0, elapsed=17208622.18)
2 posix.times_result(user=0.41, system=0.04, children_user=0.0, children_system=0.0, elapsed=17208623.18)
<Task finished coro=<mp_queue_wait() done,...> result=None>
So I am looking at asyncio.loop.run_in_executor() as the next possible answer, however spawning an executor/thread just for this seems overkill...
Here is same test using the default executor:
async def mp_queue_wait():
    try:
        result = await asyncio.get_running_loop().run_in_executor(None,mp_q.get,True,2)
    except Exception as ex:
        result = ex
    print('Queue:',repr(result))
    return result
And the (desired) result:
0 posix.times_result(user=0.36, system=0.02, children_user=0.0, children_system=0.0, elapsed=17210674.65)
1 posix.times_result(user=0.37, system=0.02, children_user=0.0, children_system=0.0, elapsed=17210675.65)
Queue: Empty()
2 posix.times_result(user=0.37, system=0.02, children_user=0.0, children_system=0.0, elapsed=17210676.66)
<Task finished coro=<mp_queue_wait() done, defined at /home/apozuelo/Documents/5G_SBA/Tera5G/services/db.py:211> result=Empty()>
This comes a bit late, but:
You need to create an async wrapper around the mp.JoinableQueue(), since both get() and put() are blocking calls: when invoked directly from a coroutine, they block the thread that runs the event loop.
There are two approaches for this:
1. Use threads.
2. Use asyncio.sleep() together with the get_nowait() and put_nowait() methods.
I chose option 2 since it is easy.
from queue import Queue, Full, Empty
from typing import Any, Generic, TypeVar
from asyncio import sleep

T = TypeVar('T')

class AsyncQueue(Generic[T]):
    """Async wrapper for queue.Queue"""

    SLEEP: float = 0.01

    def __init__(self, queue: Queue[T]):
        self._Q : Queue[T] = queue

    async def get(self) -> T:
        while True:
            try:
                return self._Q.get_nowait()
            except Empty:
                await sleep(self.SLEEP)

    async def put(self, item: T) -> None:
        while True:
            try:
                self._Q.put_nowait(item)
                return None
            except Full:
                await sleep(self.SLEEP)

    def task_done(self) -> None:
        self._Q.task_done()
        return None
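For comparison, option 1 (threads) could look roughly like the sketch below. It assumes Python 3.9+ for asyncio.to_thread; the class name ThreadedAsyncQueue is hypothetical, and a plain queue.Queue stands in for the multiprocessing queue to keep the demo self-contained:

```python
import asyncio
import queue

class ThreadedAsyncQueue:
    """Hypothetical wrapper: runs the blocking get()/put() in worker threads."""

    def __init__(self, q):
        self._q = q

    async def get(self):
        # The blocking call runs in the default executor, not on the event loop.
        return await asyncio.to_thread(self._q.get)

    async def put(self, item):
        await asyncio.to_thread(self._q.put, item)

async def demo():
    aq = ThreadedAsyncQueue(queue.Queue())
    getter = asyncio.create_task(aq.get())
    # The loop stays responsive while get() is blocked in its thread.
    await asyncio.sleep(0.05)
    await aq.put("payload")
    return await getter

received = asyncio.run(demo())
print(received)
```

The trade-off: there is no polling delay, but every pending get() occupies a worker thread, and a thread blocked inside get() cannot be cancelled from asyncio, so passing a timeout to the underlying call may be advisable in real code.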
I have the following method in my Tornado handler:
async def get(self):
    url = 'url here'
    try:
        async for batch in downloader.fetch(url):
            self.write(batch)
            await self.flush()
    except Exception as e:
        logger.warning(e)
This is the code for downloader.fetch():
async def fetch(url, **kwargs):
    timeout = kwargs.get('timeout', aiohttp.ClientTimeout(total=12))
    response_validator = kwargs.get('response_validator', json_response_validator)
    extractor = kwargs.get('extractor', json_extractor)
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(url) as resp:
                response_validator(resp)
                async for batch in extractor(resp):
                    yield batch
    except aiohttp.client_exceptions.ClientConnectorError:
        logger.warning("bad request")
        raise
    except asyncio.TimeoutError:
        logger.warning("server timeout")
        raise
I would like to yield the "batch" object from multiple downloaders in parallel.
I want the first available batch from the first downloader and so on until all downloaders finished. Something like this (this is not working code):
async for batch in [downloader.fetch(url1), downloader.fetch(url2)]:
....
Is this possible? How can I modify what I am doing in order to be able to yield from multiple coroutines in parallel?
How can I modify what I am doing in order to be able to yield from multiple coroutines in parallel?
You need a function that merges two async sequences into one, iterating over both in parallel and yielding elements from one or the other, as they become available. While such a function is not included in the current standard library, you can find one in the aiostream package.
You can also write your own merge function, as shown in this answer:
async def merge(*iterables):
    iter_next = {it.__aiter__(): None for it in iterables}
    while iter_next:
        for it, it_next in iter_next.items():
            if it_next is None:
                fut = asyncio.ensure_future(it.__anext__())
                fut._orig_iter = it
                iter_next[it] = fut
        done, _ = await asyncio.wait(iter_next.values(),
                                     return_when=asyncio.FIRST_COMPLETED)
        for fut in done:
            iter_next[fut._orig_iter] = None
            try:
                ret = fut.result()
            except StopAsyncIteration:
                del iter_next[fut._orig_iter]
                continue
            yield ret
Using that function, the loop would look like this:
async for batch in merge(downloader.fetch(url1), downloader.fetch(url2)):
....
Edit:
As mentioned in the comments, the method below does not execute the given routines in parallel.
Check out the aitertools library.
import asyncio
import aitertools
async def f1():
    await asyncio.sleep(5)
    yield 1

async def f2():
    await asyncio.sleep(6)
    yield 2

async def iter_funcs():
    async for x in aitertools.chain(f2(), f1()):
        print(x)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(iter_funcs())
It seems that the functions being iterated must be coroutines.
Given a regular generator, you can get an iterator from it that can only be consumed once and continue where you left off. Like this -
sync_gen = (i for i in range(10))

def fetch_batch_sync(num_tasks, job_list):
    for i, job in enumerate(job_list):
        yield job
        if i == num_tasks - 1:
            break

>>> sync_gen_iter = sync_gen.__iter__()
>>> for i in fetch_batch_sync(2, sync_gen_iter):
...     print(i)
...
0
1
>>> for i in fetch_batch_sync(3, sync_gen_iter):
...     print(i)
...
2
3
4
Is there a way to do the same with an async generator?
async def fetch_batch_async(num_tasks, job_list_iter):
    async for i, job in enumerate(job_list_iter):
        yield job
        if i == num_tasks - 1:
            break
The main difference between regular and async generators is that async generators implement __aiter__ and __anext__ instead of __iter__ and __next__, and __anext__ returns an awaitable. This is why ordinary for and enumerate fail to recognize them as iterables.
As with regular generators, it is possible to extract a subset of values out of an async generator, but you need to use the appropriate tools. fetch_batch_async already uses async for, but it should also use an async version of enumerate; for example:
async def aenumerate(aiterable, start=0):
    i = start
    async for obj in aiterable:
        yield i, obj
        i += 1
fetch_batch_async would use it exactly like enumerate:
async def fetch_batch_async(num_tasks, job_list_iter):
    async for i, job in aenumerate(job_list_iter):
        yield job
        if i == num_tasks - 1:
            break
Finally, this code uses fetch_batch_async to extract several items out of an infinite async iterator:
import asyncio, time

async def infinite():
    while True:
        yield time.time()
        await asyncio.sleep(.1)

async def main():
    async for received in fetch_batch_async(10, infinite()):
        print(received)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
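To check that this also continues where the previous batch left off, as in the synchronous example from the question, here is a small self-contained sketch. The numbers() generator and demo() coroutine are hypothetical helpers added for illustration; aenumerate and fetch_batch_async are repeated from above so the snippet runs on its own:

```python
import asyncio

async def aenumerate(aiterable, start=0):
    i = start
    async for obj in aiterable:
        yield i, obj
        i += 1

async def fetch_batch_async(num_tasks, job_list_iter):
    async for i, job in aenumerate(job_list_iter):
        yield job
        if i == num_tasks - 1:
            break

async def numbers():
    # Hypothetical job source: an async counterpart of (i for i in range(10)).
    for n in range(10):
        await asyncio.sleep(0)
        yield n

async def demo():
    jobs = numbers()
    # Both batches pull from the same underlying async iterator.
    first = [job async for job in fetch_batch_async(2, jobs)]
    second = [job async for job in fetch_batch_async(3, jobs)]
    await jobs.aclose()  # close the partially consumed source explicitly
    return first, second

first, second = asyncio.run(demo())
print(first, second)
```

This works because breaking out of async for does not close the iterator being consumed, so the second batch resumes right after the last item of the first one.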