Asyncify string joining in Python

I have the following code snippet which I want to transform into asynchronous code (data tends to be a large Iterable):
transformed_data = (do_some_transformation(d) for d in data)
stacked_jsons = "\n\n".join(json.dumps(t, separators=(",", ":")) for t in transformed_data)
I managed to rewrite the do_some_transformation function to be async so I can do the following:
transformed_data = (await do_some_transformation(d) for d in data)
async_generator = (json.dumps(t, separators=(",", ":")) async for t in transformed_data)
stacked_jsons = ???
What's the best way to incrementally join the jsons produced by the async generator so that the joining process is also asynchronous?
This snippet is part of a larger I/O-bound application which has many asynchronous components and would thus profit from asyncifying everything.

The point of str.join is to transform an entire list at once.1 If items arrive incrementally, it can be advantageous to accumulate them one by one.
async def join(by: str, _items: 'AsyncIterable[str]') -> str:
    """Asynchronously joins items with some string"""
    result = ""
    async for item in _items:
        if result and by:  # only add the separator between items
            result += by
        result += item
    return result
The async for loop is sufficient to let the async iterable suspend between items so that other tasks may run. The primary advantage of this approach is that even for very many items, this never stalls the event loop for longer than adding the next item.
This utility can directly digest the async generator:
stacked_jsons = await join("\n\n", (json.dumps(t, separators=(",", ":")) async for t in transformed_data))
When it is known that the data is small enough that str.join runs in adequate time, one can directly convert the data to a list instead and use str.join:
stacked_jsons = "\n\n".join([json.dumps(t, separators=(",", ":")) async for t in transformed_data])
The [... async for ...] construct is an asynchronous list comprehension. This internally works asynchronously to iterate, but produces a regular list once all items are fetched – only this resulting list is passed to str.join and can be processed synchronously.
1 Even when joining an iterable, str.join will internally turn it into a list first.
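For completeness, here is a minimal, self-contained sketch of the async join helper in use (the join helper from above is repeated so the snippet runs on its own; the do_some_transformation body and the data are placeholders, not the poster's code):
import asyncio
import json
from typing import AsyncIterable

async def join(by: str, items: AsyncIterable[str]) -> str:
    """Asynchronously join items, yielding to the event loop between items."""
    result = ""
    async for item in items:
        if result and by:  # only add the separator between items
            result += by
        result += item
    return result

async def do_some_transformation(d: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for real async work
    return d

async def main() -> None:
    data = [{"a": 1}, {"b": 2}]  # placeholder data
    transformed = (await do_some_transformation(d) for d in data)
    stacked_jsons = await join(
        "\n\n", (json.dumps(t, separators=(",", ":")) async for t in transformed)
    )
    print(stacked_jsons)

asyncio.run(main())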

A more in-depth explanation of my comment:
Asyncio is a great tool if your processor has a lot of waiting to do.
For example: when you make a request to a database over the network, after the request is sent your CPU just does nothing until it gets an answer.
Using the async/await syntax you can have your processor execute other tasks while "waiting" for the current one to finish. This does not mean it runs them in parallel; there is only one task running at a time.
In your case (from what I can see) the CPU never waits for anything, it is constantly running string operations.
If you want to run these operations in parallel you might want to take a look at process pools.
A process pool is not bound to a single process and core but spreads the processing over several cores to run it in parallel.
import json
from concurrent.futures import ProcessPoolExecutor

def main():
    with ProcessPoolExecutor() as executor:
        transformed_data = executor.map(do_some_transformation, data)  # returns an iterable
        stacked_jsons = "\n\n".join(json.dumps(t, separators=(",", ":")) for t in transformed_data)

if __name__ == '__main__':
    main()
I hope the provided code can help you.
P.S. The if __name__ == '__main__' guard is required when spawning worker processes.
Edit: I saw your comment about 10k dicts. Assuming you have 8 cores (ignoring hyper-threading), each process will only transform 1250 dicts instead of the 10k your main thread does now. These processes run simultaneously, and although the performance increase is not linear, it should process them a lot faster.
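If the rest of the application is already running inside an asyncio event loop, one possible variant (a sketch, not the poster's code) is to hand the CPU-bound transformations to the process pool through loop.run_in_executor so the loop stays responsive; the do_some_transformation body and data below are placeholders:
import asyncio
import json
from concurrent.futures import ProcessPoolExecutor

def do_some_transformation(d):  # placeholder for the question's function
    return {k: v * 2 for k, v in d.items()}

data = [{"x": i} for i in range(10)]  # placeholder data

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as executor:
        # Offload each CPU-bound transformation to a worker process.
        futures = [loop.run_in_executor(executor, do_some_transformation, d) for d in data]
        transformed_data = await asyncio.gather(*futures)
    return "\n\n".join(json.dumps(t, separators=(",", ":")) for t in transformed_data)

if __name__ == "__main__":
    print(asyncio.run(main()))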

TL;DR: Consider using the producer/consumer pattern if do_some_transformation is I/O bound and you really want incremental aggregation.
Of course, async itself only brings an advantage if you actually have other proper async tasks to begin with.
As @MisterMiyagi said, if do_some_transformation is I/O bound and time-consuming, firing off all the transformations as a horde of async tasks can be a good idea.
Example code:
import asyncio
import json

data = ({"large": "data"},) * 3  # large

stacked_jsons = ""

async def transform(d: dict, q: asyncio.Queue) -> None:
    # `do_some_transformation`: long IO bound task
    await asyncio.sleep(1)
    await q.put(d)

# WARNING: incremental concatenation of strings would be slow,
# since strings are immutable.
async def join(q: asyncio.Queue):
    global stacked_jsons
    while True:
        d = await q.get()
        stacked_jsons += json.dumps(d, separators=(",", ":")) + "\n\n"
        q.task_done()

async def main():
    q = asyncio.Queue()
    producers = [asyncio.create_task(transform(d, q)) for d in data]
    consumer = asyncio.create_task(join(q))
    await asyncio.gather(*producers)
    await q.join()  # Implicitly awaits consumers, too
    consumer.cancel()
    print(stacked_jsons)

if __name__ == "__main__":
    import time
    s = time.perf_counter()
    asyncio.run(main())
    elapsed = time.perf_counter() - s
    print(f"{__file__} executed in {elapsed:0.2f} seconds.")
This way the do_some_transformation calls don't block each other. Output:
$ python test.py
{"large":"data"}
{"large":"data"}
{"large":"data"}
test.py executed in 1.00 seconds.
Besides, I don't think incremental concatenation of strings is a good idea, since strings are immutable and a lot of memory would be wasted ;)
Reference: Async IO in Python: A Complete Walkthrough - Real Python
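Building on that concatenation concern, a hedged alternative (a sketch, not part of the original answer) is to have the consumer collect the serialized chunks in a list and join them once at the end:
import asyncio
import json

data = ({"large": "data"},) * 3  # placeholder data

async def transform(d: dict, q: asyncio.Queue) -> None:
    await asyncio.sleep(1)  # stand-in for a long I/O-bound task
    await q.put(d)

async def collect(q: asyncio.Queue, parts: list) -> None:
    while True:
        d = await q.get()
        parts.append(json.dumps(d, separators=(",", ":")))
        q.task_done()

async def main() -> str:
    q = asyncio.Queue()
    parts = []
    producers = [asyncio.create_task(transform(d, q)) for d in data]
    consumer = asyncio.create_task(collect(q, parts))
    await asyncio.gather(*producers)
    await q.join()
    consumer.cancel()
    # A single join at the end avoids repeatedly reallocating an ever-growing string.
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(asyncio.run(main()))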

Related

Stream processing mixed sync and async items

I have a list of objects to process. Some can be processed immediately, but others need to be processed by first fetching a URL. The organization looks something like:
processed_items = []
for item in list:
    if url := item.get('location'):
        fetched_item = fetch_item_from(url)
        processed_item = process(fetched_item)
    else:
        processed_item = process(item)
    if processed_item:
        processed_items.append(processed_item)
The problem is that there are so many items that the only way to handle this in a memory efficient way is to process these files as they come in. On the other hand, doing them sequentially like this takes forever -- it's much more efficient to make the network requests asynchronously.
In theory, you could save all the items with URLs, then fetch them all at once using tasks and asyncio.gather. I have actually done this and it works. But this list of unfetched items can quickly eat up your memory, since the items are being streamed in, and making a ton of network requests all at once can make the server mad.
I think I'm looking for a result that leaves me with an array like
processed_items = [1, 2, <awaitable>, 3, <awaitable>, ...]
which I can then await the result of.
Is this the right approach? And if so, what's this design pattern called? Any first steps?
Just execute your code above in an asynchronous function, in a way that each item is processed in a separate task, and wrap your fetch_item_from function in an async function that uses an asyncio.Semaphore to limit the number of parallel requests to whatever you find optimal - be it 7, 10, 50 or 100.
If the rest of your processing is just CPU intensive you won't need any other async features there.
Actually, if your fetch_item_from is not async itself, you can simply use run_in_executor - and the nature of the concurrent.futures.Executor itself will limit the number of concurrent requests, without the need to use a Semaphore at all.
import asyncio

MAXREQUESTS = 20

# Use this part if your original `fetch_item_from` is synchronous:
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(MAXREQUESTS)

async def fetch_item_from_with_executor(url):
    loop = asyncio.get_running_loop()
    # This is automatically limited to the number of workers in the executor
    return await loop.run_in_executor(executor, fetch_item_from, url)

# Use this part if fetch_item_from is asynchronous itself
semaphore = asyncio.Semaphore(MAXREQUESTS)

async def fetch_item_from_async(url):
    async with semaphore:
        return await fetch_item_from(url)

# common code:
async def process_item(item):
    if url := item.get('location'):
        item = await fetch_item_from_with_executor(url)  # or: fetch_item_from_async(url)
    return process(item)

async def main(list_):
    pending_list = [asyncio.create_task(process_item(item)) for item in list_]
    processed_items = []
    while pending_list:
        # The timeout=10 below is optional, and will return control
        # here with the already completed tasks every 10 seconds:
        # this way you can print some progress indicator to see how
        # things are going - or even improve the code so that
        # finished tasks are yielded earlier to be consumed by the callers of "main"
        # in parallel.
        # If the timeout argument is omitted, all items are processed in a single batch.
        done, pending_list = await asyncio.wait(pending_list, timeout=10)
        processed_items.extend(done)
    # retrieve the results from each task and filter out the falsy (None?) ones:
    return [result for item in processed_items if (result := item.result())]

list_ = ...
processed_items = asyncio.run(main(list_))
(Missing above is any error handling - if either fetch_item_from or process can raise an exception, you have to unfold the list comprehension that blindly calls .result() on each task, in order to separate the tasks that raised from the ones that completed successfully.)
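A hedged sketch of that unfolding (the logging choice is an assumption; adapt the error handling to your needs):
import logging

def split_results(finished_tasks):
    """Separate successful task results from raised exceptions."""
    results, errors = [], []
    for task in finished_tasks:
        exc = task.exception()  # None if the task completed without raising
        if exc is not None:
            errors.append(exc)
            logging.warning("item failed: %r", exc)
        elif (result := task.result()):
            results.append(result)  # keep only truthy results, as in the comprehension above
    return results, errors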

Python asyncio - how to use if function being called has nothing to return

I have 200 pairs of paths to diff. I wrote a little function that will diff each pair and update a dictionary which itself is one of the arguments to the function. Assume MY_DIFFER is some diffing tool I am calling via subprocess under the hood.
async def do_diff(path1, path2, result):
    result[f"{path1} {path2}"] = MY_DIFFER(path1, path2)
As you can see I have nothing to return from this async function. I am just capturing the result in result.
I call this function in parallel elsewhere using asyncio like so:
path_tuples = [("/path11", "/path12"), ("/path21", "/path22"), ... ]
result = {}

loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(do_diff(path1, path2, result) for path1, path2 in path_tuples)
    )
)
Questions:
I don't know where to put await in the do_diff function. But the code seems to work without it as well.
I am not sure if the diffs are really happening in parallel, because when I look at the output of ps -eaf in another terminal, I see only one instance of the underlying tool I am calling at a time.
The speed of execution is same as when I was doing the diffs sequentially
So I am clearly doing something wrong. How can I REALLY do the diffs in parallel?
PS: I am in Python 3.6
Remember that asyncio doesn't run things in parallel, it runs things concurrently, using a cooperative multitasking model -- which means that coroutines need to explicitly yield time to other coroutines for them to run. This is what the await command does; it says "go run some other coroutines while I'm waiting for something to finish".
If you're never awaiting on something, you're not getting concurrent execution.
What you want is for your do_diff method to be able to await on the execution of your external tool, but you can't do that with just the subprocess module. You can do that using the run_in_executor method, which arranges to run a synchronous command (e.g., subprocess.run) in a separate thread or process and wait asynchronously for the result. That might look something like:
async def do_diff(path1, path2, result):
    loop = asyncio.get_event_loop()
    result[f"{path1} {path2}"] = await loop.run_in_executor(None, MY_DIFFER, path1, path2)
This will by default run MY_DIFFER in a separate thread, although you can utilize a separate process instead by passing an explicit executor as the first argument to run_in_executor.
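For example, a sketch of the process-based variant (assuming MY_DIFFER from the question and its arguments are picklable, which multiprocessing requires):
import asyncio
from concurrent.futures import ProcessPoolExecutor

process_pool = ProcessPoolExecutor()  # defaults to one worker per CPU core

async def do_diff(path1, path2, result):
    loop = asyncio.get_event_loop()
    # Run the blocking differ (MY_DIFFER, the question's external tool) in a
    # separate process instead of a thread.
    result[f"{path1} {path2}"] = await loop.run_in_executor(process_pool, MY_DIFFER, path1, path2)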
Per my comment, solving this with concurrent.futures might look something like this:
import concurrent.futures
import time

# dummy function that just sleeps for 2 seconds
# replace this with your actual code
def do_diff(path1, path2):
    print(f"diffing path {path1} and {path2}")
    time.sleep(2)
    return path1, path2, "information about diff"

# create 200 path tuples for demonstration purposes
path_tuples = [(f"/path{x}.1", f"/path{x}.2") for x in range(200)]

futures = []
with concurrent.futures.ProcessPoolExecutor(max_workers=100) as executor:
    for path1, path2 in path_tuples:
        # submit the job to the executor
        futures.append(executor.submit(do_diff, path1, path2))

    # read the results
    for future in futures:
        print(future.result())

How to use asyncio with a very long list of tasks (generator)

I have a small program that loads a pretty heavy CSV (over 800MB, in chunks, using pandas.read_csv to limit memory usage) and performs a few API calls to servers "out in the wild", and finally builds a result object which is then stored in a database.
I have added caching for the network requests where possible, but even then, the code takes over 10 hours to complete. When I profile the code with PySpy, most of it is waiting for network requests.
I tried converting it to use asyncio to speed things up, and have managed to get the code to work on a small subset of the input file. However, with the full file, the memory use becomes prohibitive.
Here is what I have tried:
import asyncio

import pandas as pd
import httpx

async def process_item(item, client):
    # send a few requests with httpx session
    # process results
    await save_results_to_db(res)

async def get_items_from_csv():
    # loads the heavy CSV file
    for chunk in pd.read_csv(filename, ...):
        for row in chunk.itertuples():
            item = item_from_row(row)
            yield item

async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        async for item in get_items_from_csv():
            tasks.append(process_item(item, client))
        await asyncio.gather(*tasks)

asyncio.run(main())
Is there a way to avoid creating the tasks list, which becomes a very heavy object with over 1.5M items in it? The other downside of this is that no task seems to be processed until the entire file has been read, which is not ideal.
I'm using python 3.7 but can easily upgrade to 3.8 if needed.
I think what you are looking for here is not running in batches but running N workers which concurrently pull tasks off of a queue.
N = 10  # scale based on the processing power and memory you have

async def main():
    async with httpx.AsyncClient() as client:
        tasks = asyncio.Queue()
        async for item in get_items_from_csv():
            tasks.put_nowait(process_item(item, client))

        async def worker():
            while not tasks.empty():
                await tasks.get_nowait()  # get_nowait() returns a coroutine; awaiting it runs process_item
            # for a server:
            # while task := await tasks.get():
            #     await task

        await asyncio.gather(*[worker() for _ in range(N)])
I used an asyncio.Queue but you can also just use a collections.deque since all tasks are being added to the queue prior to starting a worker. The former is especially useful when running workers that run in a long running process (e.g. a server) where items may be asynchronously queued.
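If holding every coroutine in the queue up front is still too heavy, a hedged variant (a sketch, not the answer's code) is a bounded queue fed by a dedicated producer, so reading the CSV and processing overlap; process_item and get_items_from_csv are the question's functions, and the sentinel handling is an assumption:
import asyncio
import httpx

N = 10           # number of workers
SENTINEL = None  # marks the end of the stream

async def producer(queue):
    # Reads items lazily; put() blocks when the queue is full (backpressure).
    async for item in get_items_from_csv():
        await queue.put(item)
    for _ in range(N):
        await queue.put(SENTINEL)  # one stop signal per worker

async def worker(queue, client):
    while (item := await queue.get()) is not SENTINEL:
        await process_item(item, client)

async def main():
    queue = asyncio.Queue(maxsize=2 * N)  # bounded queue keeps memory use flat
    async with httpx.AsyncClient() as client:
        await asyncio.gather(producer(queue), *[worker(queue, client) for _ in range(N)])

asyncio.run(main())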

Parallelize generators with asyncio

My application reads data from a slow i/o source, does some processing and then writes it to a local file. I've implemented this with generators like so:
import time

def io_task(x):
    print("requesting data for input %s" % x)
    time.sleep(1)  # this simulates a blocking I/O task
    return 2*x

def producer(xs):
    for x in xs:
        yield io_task(x)

def consumer(xs):
    with open('output.txt', 'w') as fp:
        for x in xs:
            print("writing %s" % x)
            fp.write(str(x) + '\n')

data = [1,2,3,4,5]
consumer(producer(data))
Now I'd like to parallelize this task with the help of asyncio, but I can't seem to figure out how. The main issue for me is to directly feed data through a generator from the producer to the consumer while letting asyncio make multiple parallel requests to io_task(x). Also, this whole async def vs. @asyncio.coroutine thing is confusing me.
Can someone show me how to build a minimal working example that uses asyncio from this sample code?
(Note: It is not ok to just make calls to io_task(), buffer the results and then write them to a file. I need a solution that works on large data sets that can exceed the main memory, that's why I've been using generators so far. It is however safe to assume that the consumer is always faster than all producers combined)
Since python 3.6 and asynchronous generators, very few changes need be applied to make your code compatible with asyncio.
The io_task function becomes a coroutine:
async def io_task(x):
    await asyncio.sleep(1)
    return 2*x
The producer generator becomes an asynchronous generator:
async def producer(xs):
    for x in xs:
        yield await io_task(x)
The consumer function becomes a coroutine and uses aiofiles, asynchronous context management and asynchronous iteration:
async def consumer(xs):
    async with aiofiles.open('output.txt', 'w') as fp:
        async for x in xs:
            await fp.write(str(x) + '\n')
And the main coroutine runs in an event loop:
data = [1,2,3,4,5]
main = consumer(producer(data))
loop = asyncio.get_event_loop()
loop.run_until_complete(main)
loop.close()
Also, you may consider using aiostream to pipeline some processing operations between the producer and the consumer.
EDIT: The different I/O tasks can easily be run concurrently on the producer side by using as_completed:
async def producer(xs):
    coros = [io_task(x) for x in xs]
    for future in asyncio.as_completed(coros):
        yield await future
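Put together, a runnable sketch of the whole pipeline (assuming aiofiles is installed and Python 3.7+ for asyncio.run; note that with as_completed the results are written in completion order, not input order):
import asyncio
import aiofiles

async def io_task(x):
    await asyncio.sleep(1)  # simulates a non-blocking I/O task
    return 2 * x

async def producer(xs):
    coros = [io_task(x) for x in xs]
    for future in asyncio.as_completed(coros):
        yield await future  # yields each result as soon as it is ready

async def consumer(xs):
    async with aiofiles.open('output.txt', 'w') as fp:
        async for x in xs:
            await fp.write(str(x) + '\n')

data = [1, 2, 3, 4, 5]
asyncio.run(consumer(producer(data)))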

When to use and when not to use Python 3.5 `await` ?

I'm getting the flow of using asyncio in Python 3.5 but I haven't seen a description of what things I should be awaiting and things I should not be or where it would be neglible. Do I just have to use my best judgement in terms of "this is an IO operation and thus should be awaited"?
By default all your code is synchronous. You can make it asynchronous by defining functions with async def and "calling" these functions with await. A more correct question would be "When should I write asynchronous code instead of synchronous?". The answer is "When you can benefit from it". In cases where you work with I/O operations, as you noted, you will usually benefit:
# Synchronous way:
download(url1)  # takes 5 sec.
download(url2)  # takes 5 sec.
# Total time: 10 sec.

# Asynchronous way:
await asyncio.gather(
    async_download(url1),  # takes 5 sec.
    async_download(url2)   # takes 5 sec.
)
# Total time: only 5 sec. (+ little overhead for using asyncio)
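For instance, a small runnable sketch of that timing claim, with async_download simulated by asyncio.sleep (the URLs and the 5-second delay are placeholders):
import asyncio
import time

async def async_download(url):
    await asyncio.sleep(5)  # stands in for a 5-second network transfer
    return f"contents of {url}"

async def main():
    start = time.perf_counter()
    await asyncio.gather(async_download("url1"), async_download("url2"))
    print(f"both downloads done in {time.perf_counter() - start:.1f} sec.")  # ~5.0

asyncio.run(main())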
Of course, if you created a function that uses asynchronous code, this function should be asynchronous too (should be defined as async def). But any asynchronous function can freely use synchronous code. It makes no sense to cast synchronous code to asynchronous without some reason:
# extract_links(url) should be async because it uses async func async_download() inside
async def extract_links(url):
    # async_download() was created async to get benefit of I/O
    html = await async_download(url)
    # parse() doesn't work with I/O, there's no sense to make it async
    links = parse(html)
    return links
One very important thing is that any long synchronous operation (> 50 ms, for example, it's hard to say exactly) will freeze all your asynchronous operations for that time:
async def extract_links(url):
    data = await download(url)
    links = parse(data)
    # if search_in_very_big_file() takes much time to process,
    # all your running async funcs (somewhere else in code) will be frozen
    # you need to avoid this situation
    links_found = search_in_very_big_file(links)
You can avoid this by calling long-running synchronous functions in a separate process (and awaiting the result):
executor = ProcessPoolExecutor(2)

async def extract_links(url):
    data = await download(url)
    links = parse(data)
    # Now your main process can handle other async functions while the separate process is running
    links_found = await loop.run_in_executor(executor, search_in_very_big_file, links)
One more example: when you need to use requests in asyncio. requests.get is just a synchronous long-running function, which you shouldn't call inside async code (again, to avoid freezing). But it runs long because of I/O, not because of long calculations. In that case, you can use ThreadPoolExecutor instead of ProcessPoolExecutor to avoid some multiprocessing overhead:
executor = ThreadPoolExecutor(2)

async def download(url):
    response = await loop.run_in_executor(executor, requests.get, url)
    return response.text
You do not have much freedom. If you need to call a function you need to find out if this is a usual function or a coroutine. You must use the await keyword if and only if the function you are calling is a coroutine.
If async functions are involved there should be an "event loop" which orchestrates these async functions. Strictly speaking it's not necessary, since you can "manually" run the async method by sending values to it, but you probably don't want to do that. The event loop keeps track of not-yet-finished coroutines and chooses the next one to continue running. The asyncio module provides an implementation of an event loop, but it is not the only possible implementation.
Consider these two lines of code:
x = get_x()
do_something_else()
and
x = await aget_x()
do_something_else()
The semantics are absolutely the same: call a method which produces some value, assign it to the variable x once the value is ready, and do something else. In both cases the do_something_else function will be called only after the previous line of code has finished. It doesn't even mean that before, after, or during the execution of the asynchronous aget_x method, control will necessarily be yielded to the event loop.
Still there are some differences:
the second snippet can appear only inside another async function
aget_x is not a usual function but a coroutine (that is, it is either declared with the async keyword or decorated as a coroutine)
aget_x is able to "communicate" with the event loop, that is, yield some objects to it. The event loop should be able to interpret these objects as requests to perform some operations (e.g. to send a network request and wait for the response, or just suspend this coroutine for n seconds). A usual get_x function is not able to communicate with the event loop.
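As a hedged illustration of the "communicating with the event loop" point, two coroutines only interleave where they actually await something that suspends (a small runnable sketch, not the answer's code):
import asyncio

async def ticker(name, delay):
    for i in range(3):
        print(f"{name}: tick {i}")
        # asyncio.sleep() yields control to the event loop,
        # which lets the other coroutine run in the meantime.
        await asyncio.sleep(delay)

async def main():
    await asyncio.gather(ticker("A", 0.1), ticker("B", 0.1))

asyncio.run(main())
# Output interleaves A and B because each await suspends the running coroutine.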
