My application reads data from a slow I/O source, does some processing, and then writes the result to a local file. I've implemented this with generators like so:
import time

def io_task(x):
    print("requesting data for input %s" % x)
    time.sleep(1)  # this simulates a blocking I/O task
    return 2*x

def producer(xs):
    for x in xs:
        yield io_task(x)

def consumer(xs):
    with open('output.txt', 'w') as fp:
        for x in xs:
            print("writing %s" % x)
            fp.write(str(x) + '\n')

data = [1, 2, 3, 4, 5]
consumer(producer(data))
Now I'd like to parallelize this task with the help of asyncio, but I can't seem to figure out how. The main issue for me is to directly feed data through a generator from the producer to the consumer while letting asyncio make multiple parallel requests to io_task(x). Also, this whole async def vs. @asyncio.coroutine thing is confusing me.
Can someone show me how to build a minimal working example that uses asyncio from this sample code?
(Note: It is not ok to just make calls to io_task(), buffer the results and then write them to a file. I need a solution that works on large data sets that can exceed the main memory, that's why I've been using generators so far. It is however safe to assume that the consumer is always faster than all producers combined)
Since Python 3.6 introduced asynchronous generators, very few changes are needed to make your code compatible with asyncio.
The io_task function becomes a coroutine:
import asyncio

async def io_task(x):
    await asyncio.sleep(1)
    return 2*x
The producer generator becomes an asynchronous generator:
async def producer(xs):
    for x in xs:
        yield await io_task(x)
The consumer function becomes a coroutine and uses aiofiles, asynchronous context management and asynchronous iteration:
import aiofiles

async def consumer(xs):
    async with aiofiles.open('output.txt', 'w') as fp:
        async for x in xs:
            await fp.write(str(x) + '\n')
And the main coroutine runs in an event loop:
data = [1,2,3,4,5]
main = consumer(producer(data))
loop = asyncio.get_event_loop()
loop.run_until_complete(main)
loop.close()
Also, you may consider using aiostream to pipeline some processing operations between the producer and the consumer.
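For illustration, here is a rough sketch of such a pipeline, based on aiostream's documented stream.iterate and pipe.map operators; the x + 1 step is made up for the example:

import asyncio
from aiostream import stream, pipe

async def main():
    # build a pipeline: async producer -> hypothetical processing stage
    xs = (
        stream.iterate(producer(data))   # wrap the async generator above
        | pipe.map(lambda x: x + 1)      # made-up processing step
    )
    async with xs.stream() as streamer:  # consume the pipeline lazily
        async for x in streamer:
            print(x)

asyncio.run(main())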
EDIT: The different I/O tasks can easily be run concurrently on the producer side by using as_completed:
async def producer(xs):
    coros = [io_task(x) for x in xs]
    for future in asyncio.as_completed(coros):
        yield await future
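Note that as_completed yields results in completion order rather than input order. If the consumer must see results in the original order, here is a sketch of a variant that still runs the I/O tasks concurrently, at the cost of keeping one pending task per item in memory:

async def producer(xs):
    # start all tasks eagerly so they run concurrently...
    tasks = [asyncio.create_task(io_task(x)) for x in xs]
    for task in tasks:
        # ...but yield the results in the original input order
        yield await task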
I'd like to use asyncio to do a lot of simultaneous non-blocking IO in Python. However, I want the use of asyncio to be abstracted away from the user: under the hood there are a lot of asynchronous calls going on simultaneously to speed things up, but for the user there's a single, synchronous call.
Basically something like this:
async def _slow_async_fn(address):
    data = await async_load_data(address)
    return data

def synchronous_blocking_io():
    addresses = ...
    tasks = []
    for address in addresses:
        tasks.append(_slow_async_fn(address))
    all_results = some_fn(asyncio.gather(*tasks))
    return all_results
The problem is, how can I achieve this in a way that's agnostic to the user's running environment? If I use a pattern like asyncio.get_event_loop().run_until_complete(), I run into issues if the code is being called inside an environment like Jupyter where there's already an event loop running. Is there a way to robustly gather the results of a set of asynchronous tasks that doesn't require pushing async/await statements all the way up the program?
The restriction on running loops is per thread, so running a new event loop is possible, as long as it is in a new thread.
import asyncio
import concurrent.futures

async def gatherer_of(tasks):
    # It's necessary to wrap asyncio.gather() in a coroutine (reasons beyond scope)
    return await asyncio.gather(*tasks)

def synchronous_blocking_io():
    addresses = ...
    tasks = []
    for address in addresses:
        tasks.append(_slow_async_fn(address))
    loop = asyncio.new_event_loop()
    return loop.run_until_complete(gatherer_of(tasks))

def synchronous_blocking_io_wrapper():
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        fut = executor.submit(synchronous_blocking_io)
        return fut.result()

# Testing
async def async_runner():
    # Simulating execution from a running loop
    return synchronous_blocking_io_wrapper()

# Run from a synchronous client
# print(synchronous_blocking_io_wrapper())

# Run from an async client
# print(asyncio.run(async_runner()))
The same result can be achieved with a ProcessPoolExecutor, by manually running synchronous_blocking_io in a new thread and joining it, by starting an entirely new process, and so forth. As long as you are not in the same thread, you won't conflict with any running event loop.
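For completeness, a minimal sketch of the raw-thread variant (the run_in_new_thread helper is made up for the example and ignores exception propagation):

import threading

def run_in_new_thread(fn):
    # hypothetical helper: run fn() in a fresh thread and return its result
    result = []
    t = threading.Thread(target=lambda: result.append(fn()))
    t.start()
    t.join()
    return result[0]

# run_in_new_thread(synchronous_blocking_io) behaves like the executor wrapper above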
I encountered an unusual error when using an async generator, as in the following example.
import asyncio

async def demo():
    async def get_data():
        for i in range(5):  # loop: for or while
            await asyncio.sleep(1)  # some IO code
            yield i

    datas = get_data()
    await asyncio.gather(
        anext(datas),
        anext(datas),
        anext(datas),
        anext(datas),
        anext(datas),
    )

if __name__ == '__main__':
    asyncio.run(demo())
Console output:
2022-05-11 23:55:24,530 DEBUG asyncio 29180 30600 Using proactor: IocpProactor
Traceback (most recent call last):
  File "E:\workspace\develop\python\crawlerstack-proxypool\demo.py", line 77, in <module>
    asyncio.run(demo())
  File "D:\devtools\Python310\lib\asyncio\runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "D:\devtools\Python310\lib\asyncio\base_events.py", line 641, in run_until_complete
    return future.result()
  File "E:\workspace\develop\python\crawlerstack-proxypool\demo.py", line 66, in demo
    await asyncio.gather(
RuntimeError: anext(): asynchronous generator is already running
Situation description: I have loop logic that fetches a batch of data from Redis at a time, and I want to use yield to return the results. But this error occurs when I create concurrent tasks.
Is there a good solution for this situation? I don't mean to change the way I'm using the generator now, but rather to ask whether I can tell that it's already running, or use something like a lock, and wait for it to finish before executing anext.
Maybe my current logic is unreasonable, but I'd also like to understand the critical details, so I can appreciate how serious this is.
Thank you for your help.
TL;DR: the right way
Async generators are a poor fit for parallel consumption. See my explanations below. As a proper workaround, use asyncio.Queue for the communication between producers and consumers:
import asyncio
import random

queue = asyncio.Queue()

async def producer():
    for item in range(5):
        await asyncio.sleep(random.random())  # imitate async fetching
        print('item fetched:', item)
        await queue.put(item)

async def consumer():
    while True:
        item = await queue.get()
        await asyncio.sleep(random.random())  # imitate async processing
        print('item processed:', item)

await asyncio.gather(producer(), consumer(), consumer())
The above code snippet works well for an infinite stream of items: for example, a web server, which runs forever serving requests from clients. But what if we need to process a finite number of items? How should consumers know when to stop?
This deserves another question on Stack Overflow to cover all alternatives, but the simplest option is a sentinel approach, described below.
Sentinel: finite data streams approach
Introduce a sentinel = object(). When all items from the external data source have been fetched and put into the queue, the producer must push as many sentinels to the queue as there are consumers. Once a consumer fetches the sentinel, it knows it should stop: if item is sentinel: break from the loop.
sentinel = object()
consumers_count = 2

async def producer():
    ...  # the same code as above
    if new_item is None:  # if no new data
        for _ in range(consumers_count):
            await queue.put(sentinel)

async def consumer():
    while True:
        ...  # the same code as above
        if item is sentinel:
            break

await asyncio.gather(
    producer(),
    *(consumer() for _ in range(consumers_count)),
)
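For reference, here is the sentinel pattern assembled into a complete runnable script (a sketch: the external data source is simulated with range and random sleeps):

import asyncio
import random

sentinel = object()
consumers_count = 2

async def producer(queue):
    for item in range(5):
        await asyncio.sleep(random.random())  # imitate async fetching
        await queue.put(item)
    for _ in range(consumers_count):  # one sentinel per consumer
        await queue.put(sentinel)

async def consumer(queue):
    while True:
        item = await queue.get()
        if item is sentinel:
            break
        await asyncio.sleep(random.random())  # imitate async processing
        print('item processed:', item)

async def main():
    queue = asyncio.Queue()
    await asyncio.gather(
        producer(queue),
        *(consumer(queue) for _ in range(consumers_count)),
    )

asyncio.run(main())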
TL;DR [2]: a dirty workaround
Since you'd rather not change your async generator approach, here is an asyncgen-based alternative. To resolve this issue (in a simple yet dirty way), you may wrap the source async generator with a lock:
async def with_lock(agen, lock: asyncio.Lock):
    while True:
        async with lock:  # only one consumer is allowed to read
            try:
                yield await anext(agen)
            except StopAsyncIteration:
                break

lock = asyncio.Lock()  # a common lock for all consumers
await asyncio.gather(
    # every consumer must have its own "wrapped" generator
    anext(with_lock(datas, lock)),
    anext(with_lock(datas, lock)),
    ...
)
This will ensure only one consumer awaits for an item from the generator at a time. While this consumer awaits, other consumers are being executed, so parallelization is not lost.
A roughly equivalent code with async for (looks a little smarter):
async def with_lock(agen, lock: asyncio.Lock):
    await lock.acquire()
    async for item in agen:
        lock.release()
        yield item
        await lock.acquire()
    lock.release()
However, this code only handles the async generator's anext method, whereas the generator API also includes the aclose and athrow methods. See the explanation below.
Though you may add support for these to the with_lock function too, I would recommend either subclassing the generator and handling the lock support inside, or, better, using the Queue-based approach from above.
See contextlib.aclosing for some inspiration.
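For instance, contextlib.aclosing (Python 3.10+) guarantees that the wrapped generator's aclose is awaited on exit; a sketch:

from contextlib import aclosing

async def consume_all(datas, lock):
    # aclosing() awaits agen.aclose() even if the loop body raises
    async with aclosing(with_lock(datas, lock)) as agen:
        async for item in agen:
            ...  # process item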
Explanation
Both sync and async generators have a special attribute: .gi_running (for regular generators) and .ag_running (for async ones). You may discover them by executing dir on a generator:
>>> dir(i for i in range(0))
[..., 'gi_running', ...]
They are set to True when a generator's .__next__ or .__anext__ method is executed (next(...) and anext(...) are just syntactic sugar for those).
This prevents re-executing next(...) on a generator, when another next(...) call on the same generator is already being executed: if the running flag is True, an exception is raised (for a sync generator it raises ValueError: generator already executing).
So, returning to your example, when you run await anext(datas) (via asyncio.gather), the following happens:
1. datas.ag_running is set to True.
2. The execution flow steps into the datas.__anext__ method.
3. Once an inner await statement is reached inside the __anext__ method (await asyncio.sleep(1) in your case), asyncio's loop switches to another consumer.
4. Now, another consumer tries to call await anext(datas) too, but since the datas.ag_running flag is still set to True, this results in a RuntimeError.
Why is this flag needed?
A generator's execution can be suspended and resumed, but only at yield statements. Thus, if a generator is paused at an inner await statement, it cannot be "resumed", because its state disallows it.
That's why a parallel next/anext call to a generator raises an exception: it is not ready to be resumed, it is already running.
athrow and aclose
Generators' API (both sync and async) includes not only the send/asend methods for iteration, but also:
- close/aclose to release generator-allocated resources (e.g. a database connection) on exit or on an exception,
- throw/athrow to inform the generator that it has to handle an exception.
aclose and athrow are async methods too. This means that if two consumers try to close/throw an underlying generator in parallel, you will encounter the same issue, since the generator will be closing (or handling an exception) while being closed (or thrown into) again.
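A minimal sketch of the same clash with aclose: the cleanup below awaits, so the generator is still "running" when the second aclose arrives, and the call is expected to fail with RuntimeError: aclose(): asynchronous generator is already running:

import asyncio

async def agen():
    try:
        while True:
            yield 0
    finally:
        await asyncio.sleep(1)  # slow async cleanup keeps the generator "running"

async def main():
    g = agen()
    await anext(g)  # start the generator
    # the second aclose() arrives while the first is suspended in `finally`
    await asyncio.gather(g.aclose(), g.aclose())

asyncio.run(main())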
Sync generators example
Though this is a frequent case for async generators, reproducing it with sync generators is not as straightforward, since synchronous next(...) calls are rarely interrupted.
One of the ways to interrupt a sync generator is to run a multithreaded code with multiple consumers (run in parallel threads) reading from a single generator. In that case, when the generator's code is interrupted while executing a next call, all other consumers' parallel attempts to call next will result in an exception.
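A sketch of that multithreaded scenario; it is timing-dependent, so the ValueError may take a few runs to appear:

import threading

def gen():
    for i in range(10_000_000):
        yield i

g = gen()

def consume():
    try:
        for _ in g:
            pass
    except ValueError as e:
        print(e)  # generator already executing

threads = [threading.Thread(target=consume) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()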
Another way to achieve this is demonstrated in PEP 255 (the PEP that introduced generators) via a self-consuming generator:
>>> def g():
...     i = next(me)
...     yield i
...
>>> me = g()
>>> next(me)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in g
ValueError: generator already executing
When the outer next(me) is called, it sets me.gi_running to True and then executes the generator function's code. The subsequent inner next(me) call then leads to a ValueError.
Conclusion
Generators (especially async ones) work best when consumed by a single reader. Supporting multiple consumers is hard, since it requires patching the behaviour of all the generator's methods, and is thus discouraged.
I have the following code snippet which I want to transform into asynchronous code (data tends to be a large Iterable):
transformed_data = (do_some_transformation(d) for d in data)
stacked_jsons = "\n\n".join(json.dumps(t, separators=(",", ":")) for t in transformed_data)
I managed to rewrite the do_some_transformation-function to be async so I can do the following:
transformed_data = (await do_some_transformation(d) for d in data)
async_generator = (json.dumps(t, separators=(",", ":")) async for t in transformed_data)
stacked_jsons = ???
What's the best way to incrementally join the jsons produced by the async generator so that the joining process is also asynchronous?
This snippet is part of a larger I/O-bound application which has many asynchronous components and thus would profit from making everything asynchronous.
The point of str.join is to transform an entire list at once.[1] If items arrive incrementally, it can be advantageous to accumulate them one by one.
async def join(by: str, _items: 'AsyncIterable[str]') -> str:
    """Asynchronously join items with some string"""
    result = ""
    async for item in _items:
        if result and by:  # only add the separator between items
            result += by
        result += item
    return result
The async for loop is sufficient to let the async iterable suspend between items so that other tasks may run. The primary advantage of this approach is that even for very many items, this never stalls the event loop for longer than adding the next item.
This utility can directly digest the async generator:
stacked_jsons = await join("\n\n", (json.dumps(t, separators=(",", ":")) async for t in transformed_data))
When it is known that the data is small enough for str.join to run in adequate time, one can instead convert the data to a list directly and use str.join:
stacked_jsons = "\n\n".join([json.dumps(event, separators=(",", ":")) async for t in transformed_data])
The [... async for ...] construct is an asynchronous list comprehension. This internally works asynchronously to iterate, but produces a regular list once all items are fetched – only this resulting list is passed to str.join and can be processed synchronously.
[1] Even when joining an iterable, str.join will internally turn it into a list first.
A more in-depth explanation of my comment:
Asyncio is a great tool if your processor has a lot of waiting to do.
For example: when you make a request to a database over the network, after the request is sent your CPU just does nothing until it gets an answer.
Using the async/await syntax, you can have your processor execute other tasks while "waiting" for the current one to finish. This does not mean they run in parallel; there is only one task running at a time.
In your case (from what I can see) the CPU never waits for anything; it is constantly running string operations.
If you want to run these operations in parallel, you might want to take a look at process pools.
A process pool is not bound to a single process and core, but spreads the processing over several cores to run it in parallel.
from concurrent.futures import ProcessPoolExecutor

def main():
    with ProcessPoolExecutor() as executor:
        transformed_data = executor.map(do_some_transformation, data)  # returns an iterable
        stacked_jsons = "\n\n".join(json.dumps(t, separators=(",", ":")) for t in transformed_data)

if __name__ == '__main__':
    main()
I hope the provided code can help you.
P.S. The if __name__ == '__main__' part is required.
Edit: I saw your comment about 10k dicts. Assuming you have 8 cores (ignoring multithreading), each process will only transform 1250 dicts, instead of the 10k your main thread does now. These processes run simultaneously, and although the performance increase is not linear, it should process them a lot faster.
TL;DR: Consider using producer/consumer pattern, if do_some_transformation is IO bound, and you really want an incremental aggregation.
Of course, async itself only brings an advantage if you actually have any other proper async tasks to begin with.
As @MisterMiyagi said, if do_some_transformation is IO bound and time-consuming, firing all transformations as a horde of async tasks can be a good idea.
Example code:
import asyncio
import json

data = ({"large": "data"},) * 3  # large
stacked_jsons = ""

async def transform(d: dict, q: asyncio.Queue) -> None:
    # `do_some_transformation`: long IO bound task
    await asyncio.sleep(1)
    await q.put(d)

# WARNING: incremental concatenation of strings would be slow,
# since strings are immutable.
async def join(q: asyncio.Queue):
    global stacked_jsons
    while True:
        d = await q.get()
        stacked_jsons += json.dumps(d, separators=(",", ":")) + "\n\n"
        q.task_done()

async def main():
    q = asyncio.Queue()
    producers = [asyncio.create_task(transform(d, q)) for d in data]
    consumer = asyncio.create_task(join(q))
    await asyncio.gather(*producers)
    await q.join()  # implicitly awaits consumers, too
    consumer.cancel()
    print(stacked_jsons)

if __name__ == "__main__":
    import time
    s = time.perf_counter()
    asyncio.run(main())
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.2f} seconds.")
This way, the do_some_transformation calls don't block each other. Output:
$ python test.py
{"large":"data"}
{"large":"data"}
{"large":"data"}
test.py executed in 1.00 seconds.
Besides, I don't think incremental concatenation of strings is a good idea, since strings are immutable and a lot of memory would be wasted ;)
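If that memory churn becomes a concern, a common variant is to collect the fragments in a list and join once at the end; a sketch adapting the consumer above:

parts = []

async def join(q: asyncio.Queue):
    while True:
        d = await q.get()
        parts.append(json.dumps(d, separators=(",", ":")))
        q.task_done()

# after the queue is drained:
# stacked_jsons = "\n\n".join(parts)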
Reference: Async IO in Python: A Complete Walkthrough - Real Python
I'm reading in several thousand files at once, and for each file I need to perform operations on before yielding rows from each file. To increase performance I thought I could use asyncio to perhaps perform operations on files (and yield rows) whilst waiting for new files to be read in.
However from print statements I can see that all the files are opened and gathered, then each file is iterated over (same as would occur without asyncio).
I feel like I'm missing something quite obvious here which is making my asynchronous attempts, synchronous.
import asyncio

async def open_files(file):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

async def async_generator():
    file_outputs = await asyncio.gather(*[open_files(file) for file in files])
    for file_output in file_outputs:
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

async def main():
    async for yield_value in async_generator():
        pass

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
Output:
opening files
opening files
.
.
.
using open file
using open file
EDIT
Using the code supplied by @user4815162342, I noticed that, although it was 3x quicker, the set of rows yielded from the generator was slightly different than when done without concurrency. I'm unsure as of yet whether this is because some yields were missed from each file, or whether the files were somehow re-ordered. So I introduced the following changes to the code from @user4815162342 and passed a lock into pool.submit().
I should have mentioned when first asking: the ordering of the rows in each file, and of the files themselves, is required.
import concurrent.futures
import multiprocessing

def open_files(file, lock):
    with open(file) as file:
        # do stuff (the lock is used here)
        print('opening files')
        return x

def generator():
    m = multiprocessing.Manager()
    lock = m.Lock()
    pool = concurrent.futures.ThreadPoolExecutor()
    file_output_futures = [pool.submit(open_files, file, lock) for file in files]
    for fut in concurrent.futures.as_completed(file_output_futures):
        file_output = fut.result()
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

def main():
    for yield_value in generator():
        pass

if __name__ == '__main__':
    main()
This way my non-concurrent and concurrent approaches yield the same values each time, however I have just lost all the speed gained from using concurrency.
I feel like I'm missing something quite obvious here which is making my asynchronous attempts, synchronous.
There are two issues with your code. The first one is that asyncio.gather() by design waits for all the futures to complete in parallel, and only then returns their results. So the processing you do in the generator is not interspersed with the IO in open_files as was your intention, but only begins after all the calls to open_files have returned. To process async calls as they are done, you should be using something like asyncio.as_completed.
The second and more fundamental issue is that, unlike threads which can parallelize synchronous code, asyncio requires everything to be async from the ground up. It's not enough to add async to a function like open_files to make it async. You need to go through the code and replace any blocking calls, such as calls to IO, with equivalent async primitives. For example, connecting to a network port should be done with open_connection, and so on. If your async function doesn't await anything, as appears to be the case with open_files, it will execute exactly like a regular function and you won't get any benefits of asyncio.
Since you use IO on regular files, and operating systems don't expose portable async interface for regular files, you are unlikely to profit from asyncio. There are libraries like aiofiles that use threads under the hood, but they are as likely to make your code slower than to speed it up because their nice-looking async APIs involve a lot of internal thread synchronization. To speed up your code, you can use a classic thread pool, which Python exposes through the concurrent.futures module. For example (untested):
import concurrent.futures

def open_files(file):
    with open(file) as file:
        # do stuff
        print('opening files')
        return x

def generator():
    pool = concurrent.futures.ThreadPoolExecutor()
    file_output_futures = [pool.submit(open_files, file) for file in files]
    for fut in file_output_futures:
        file_output = fut.result()
        print('using open file')
        for row in file_output:
            # Do stuff to row
            yield row

def main():
    for yield_value in generator():
        pass

if __name__ == '__main__':
    main()
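A side note on the ordering requirement from the question's edit: Executor.map returns results in input order while still running the calls concurrently, so no lock is needed; a sketch:

import concurrent.futures

def generator():
    with concurrent.futures.ThreadPoolExecutor() as pool:
        # map() yields results in the order of `files`,
        # even though the underlying calls run concurrently
        for file_output in pool.map(open_files, files):
            for row in file_output:
                # Do stuff to row
                yield row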
I have a small program that loads a pretty heavy CSV (over 800MB, in chunks, using pandas.read_csv to limit memory usage) and performs a few API calls to servers "out in the wild", and finally builds a result object which is then stored in a database.
I have added caching for the network requests where possible, but even then, the code takes over 10 hours to complete. When I profile the code with PySpy, most of it is waiting for network requests.
I tried converting it to use asyncio to speed things up, and have managed to get the code to work on a small subset of the input file. However, with the full file, the memory use becomes prohibitive.
Here is what I have tried:
import asyncio
import pandas as pd
import httpx

async def process_item(item, client):
    # send a few requests with httpx session
    # process results
    await save_results_to_db(res)

def get_items_from_csv():
    # loads the heavy CSV file lazily
    # (a plain generator: nothing here awaits, and it is iterated with `for` below)
    for chunk in pd.read_csv(filename, ...):
        for row in chunk.itertuples():
            item = item_from_row(row)
            yield item

async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        for item in get_items_from_csv():
            tasks.append(process_item(item, client))
        await asyncio.gather(*tasks)

asyncio.run(main())
Is there a way to avoid creating the tasks list, which becomes a very heavy object with over 1.5M items in it? The other downside of this is that no task seems to be processed until the entire file has been read, which is not ideal.
I'm using python 3.7 but can easily upgrade to 3.8 if needed.
I think what you are looking for here is not running in batches but running N workers which concurrently pull tasks off of a queue.
N = 10  # scale based on the processing power and memory you have

async def main():
    async with httpx.AsyncClient() as client:
        tasks = asyncio.Queue()
        for item in get_items_from_csv():
            tasks.put_nowait(process_item(item, client))

        async def worker():
            while not tasks.empty():
                await tasks.get_nowait()
            # for a server:
            # while task := await tasks.get():
            #     await task

        await asyncio.gather(*[worker() for _ in range(N)])
I used an asyncio.Queue, but you can also just use a collections.deque, since all tasks are added to the queue before any worker starts. The former is especially useful when running workers in a long-running process (e.g. a server) where items may be queued asynchronously.
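To also avoid materializing everything up front (the memory concern in the question), a variant queues raw items through a bounded queue while a producer coroutine reads the CSV lazily; a sketch reusing the question's helpers:

import asyncio
import httpx

N = 10

async def main():
    queue = asyncio.Queue(maxsize=100)  # backpressure: bounds memory use

    async def produce():
        for item in get_items_from_csv():
            await queue.put(item)  # blocks while the queue is full
        for _ in range(N):
            await queue.put(None)  # one stop-sentinel per worker

    async def worker(client):
        while (item := await queue.get()) is not None:
            await process_item(item, client)

    async with httpx.AsyncClient() as client:
        await asyncio.gather(produce(), *[worker(client) for _ in range(N)])

asyncio.run(main())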