Writing web responses to file in an asynchronous program - Python

I'm working on replacing my implementation of a server query tool that uses ThreadPoolExecutors with fully asynchronous calls using asyncio and aiohttp. Most of the transition is straightforward since network calls are non-blocking IO; it's the saving of the responses that has me in a conundrum.
All the examples I am using, even the docs for both libraries, use asyncio.gather() which collects all the awaitable results. In my case, these results can be files in the many GB range, and I don't want to store them in memory.
What's an appropriate way to solve this? Is it to use asyncio.as_completed() and then:
for f in as_completed(aws):
    earliest_result = await f
    # Assumes `loop` is defined under the `if __name__` block outside the coroutine
    loop = get_event_loop()
    # Run the blocking IO in an executor and write to file
    _ = await loop.run_in_executor(None, save_result, earliest_result)
Doesn't this introduce a thread (assuming I use a ThreadPoolExecutor by default), thus making this an asynchronous, multi-threaded program rather than an asynchronous, single-threaded program?
Further, does this ensure only one earliest_result is being written to file at any time? I don't want the call to await loop.run_in_executor(...) to be running, then another result comes in and I try to write to the same file; I could limit it with a semaphore, I suppose.

I'd suggest making use of the aiohttp streaming API: write your responses directly to disk instead of RAM and return file names instead of the responses themselves from gather. Doing so won't use much memory at all. This is a small demo of what I mean:
import asyncio
import aiofiles
from aiohttp import ClientSession

async def make_request(session, url):
    response = await session.request(method="GET", url=url)
    filename = url.split('/')[-1]
    # Open the file once and stream the body into it chunk by chunk,
    # so the full payload never has to fit in memory.
    async with aiofiles.open(filename, "wb") as f:
        async for data in response.content.iter_chunked(1024):
            await f.write(data)
    return filename

async def main():
    urls = ['https://github.com/Tinche/aiofiles',
            'https://github.com/aio-libs/aiohttp']
    async with ClientSession() as session:
        coros = [make_request(session, url) for url in urls]
        result_files = await asyncio.gather(*coros)
        print(result_files)

asyncio.run(main())

Very clever way of using the asyncio.gather method by @merrydeath.
I tweaked the helper function like below and got a big performance boost:
response = await session.get(url)
filename = url.split('/')[-1]
async with aiofiles.open(filename, "wb") as f:
    # read() loads the whole body into memory at once
    await f.write(await response.read())
Results may differ depending on the download connection speed.

In my case, these results can be files in the many GB range, and I don't want to store them in memory.
If I'm right that in your code a single aws item means downloading a single file, you may face the following problem: while as_completed lets you move data from RAM to disk as soon as each download completes, all your aws run in parallel, each keeping its own buffer with a partly downloaded file in RAM at the same time.
To avoid this you'll need to use a semaphore to ensure that not too many files are downloading in parallel in the first place, and thus prevent excessive RAM use.
Here's an example of using a semaphore.
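A rough sketch of what that could look like with aiohttp and aiofiles (the helper names, the limit of 3, and the commented-out example call are illustrative, not from the original post):
import asyncio
import aiofiles
from aiohttp import ClientSession

async def fetch_and_save(session, sem, url):
    # wait here if the maximum number of downloads is already in flight
    async with sem:
        async with session.get(url) as response:
            filename = url.split('/')[-1]
            async with aiofiles.open(filename, "wb") as f:
                async for chunk in response.content.iter_chunked(1024):
                    await f.write(chunk)
    return filename

async def main(urls):
    # at most 3 downloads keep their buffers in RAM at any one time
    sem = asyncio.Semaphore(3)
    async with ClientSession() as session:
        return await asyncio.gather(*(fetch_and_save(session, sem, url) for url in urls))

# asyncio.run(main(['https://github.com/Tinche/aiofiles', 'https://github.com/aio-libs/aiohttp']))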
Doesn't this introduce a thread (assuming I use a ThreadPoolExecutor by default), thus making this an asynchronous, multi-threaded program rather than an asynchronous, single-threaded program?
I'm not sure I understand your question, but yes, your code will use threads; only save_result will be executed inside those threads, though. All other code still runs in the single main thread. Nothing bad here.
Further, does this ensure only one earliest_result is being written to file at any time?
Yes, it does[*]. To be precise, the await keyword on the last line of your snippet ensures it:
_ = await loop.run_in_executor(None, save_result, earliest_result)
You can read it as: "Start executing run_in_executor asynchronously and suspend execution at this line until run_in_executor is done and has returned a result".
[*] Yes, as long as you don't run multiple for f in as_completed(aws) loops in parallel in the first place.

Related

Consuming multiple async generators natively in Python

I'm trying to create a simple network monitoring app in Python. It should essentially:
Run multiple scripts (in this case, bash commands like "ping" and "traceroute") infinitely and simultaneously
Yield each line from the output of each subprocess; each line should then be consumed elsewhere in the program and sent to a Kafka topic
Do some extra processing on the topic and send the data to InfluxDB (but that's less relevant - I do it with Faust).
What I did:
I tried using an async generator:
async def run(command: str):
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    while True:
        line = await proc.stdout.readline()
        if line:
            yield line
Then consume it elsewhere in the program:
...
async for output_line in run("some_command"):
    # do something with line
This works fine for a single subprocess; however, I'm not sure what to do when I need multiple async generators to run in parallel and be consumed in parallel - something like asyncio.gather, maybe, but for async generators.
What do you think would be the best approach to go about doing this? Upon searching I found the aiostream module, which can merge multiple async generators like so. I can then instead yield a tuple with the line and, say, the command I gave, to identify which generator the output line came from.
However, maybe there's a simpler solution, hopefully a native one?
Thanks!
What you are looking for is asyncio.gather, which runs multiple awaitable objects concurrently.
To use it, I think your first task is to wrap your parsing code into a single function, like:
async def parse(cmd):
    async for output_line in run(cmd):
        ...  # do something with output_line
Then in another function/context, wrap the parse with gather:
result = await asyncio.gather(
    parse("cmd1"),
    parse("cmd2"),
    parse("cmd3"),
)
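Putting that together with the run() generator from the question, a minimal runnable sketch could look like the one below. The ping commands are placeholders, and a break on EOF is added so each consumer stops once its command exits:
import asyncio

async def run(command: str):
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    while True:
        line = await proc.stdout.readline()
        if not line:  # EOF: the command exited
            break
        yield line

async def parse(cmd: str):
    async for output_line in run(cmd):
        # placeholder: send the line to Kafka, log it, etc.
        print(cmd, output_line.decode().rstrip())

async def main():
    # each parse() consumes its own generator; gather drives them concurrently
    await asyncio.gather(
        parse("ping -c 3 127.0.0.1"),
        parse("ping -c 3 localhost"),
    )

asyncio.run(main())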

Why is create_task() needed to create a queue of coroutines using asyncio gather?

I have the following code running in an event loop, where I'm downloading a large number of files using asyncio and restricting the number of concurrent downloads using an asyncio.Queue:
download_tasks = asyncio.Queue()
for file in files:
    # download_file() is an async function that downloads a file from Microsoft blob storage
    # that is basically await blob.download_blob()
    download_tasks.put_nowait(asyncio.create_task(download_file(file=file)))

async def worker():
    while not download_tasks.empty():
        return await download_tasks.get_nowait()

worker_limit = 10
# each call to download_file() returns a pandas dataframe
df_list = await asyncio.gather(*[worker() for _ in range(worker_limit)], return_exceptions=True)
df = pd.concat(df_list)
This code seems to run fine, but I originally had the for loop defined as:
for file in files:
    # download_file() is an async function that downloads a file from Microsoft blob storage
    # that is basically await blob.download_blob()
    download_tasks.put_nowait(download_file(file=file))
With this code, the result is the same but I get the following warning:
RuntimeWarning: coroutine 'download_file' was never awaited
Looking at asyncio examples, sometimes I see create_task() used when creating a list or queue of coroutines to be run in gather and sometimes I don't. Why is it needed in my case and what's the best practice for using it?
Edit: As "user2357112 supports Monica" discourteously pointed out, the return statement within worker() doesn't really make sense. The point of this code is to limit concurrency, because I may have to download thousands of files at a time and would like to limit it to 10 at a time using the queue. So my actual question is: how can I use gather to return all my results using this queue implementation?
Edit 2: I seem to have found an easy solution that works, using a semaphore instead of a queue, with the following code adapted from this answer: https://stackoverflow.com/a/61478547/4844593:
download_tasks = []
for file in files:
    download_tasks.append(download_file(file=file))

async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(*(sem_task(task) for task in tasks))

df_list = await gather_with_concurrency(10, *download_tasks)
return pd.concat(df_list)
As "user2357112 supports Monica" notes, the original issue probably comes from the workers having a return so each worker will download one file then quit, meaning any coroutines after the first 10 will be ignored and never awaited (you can probably see that if you log information about download_tasks after the supposed completion of your processing).
The create_tasks defeats that because it will immediately schedule the downloading at the same time (defeating the attempted rate limiting / workers pool), then the incorrect worker code will just ignore anything after the first 10 items.
Anyway, the difference between coroutines (e.g. bare async functions) and tasks is that tasks are independently scheduled. That is, once you've created a task it lives its life independently and you don't have to await it if you don't want its result. That is similar to JavaScript's async functions.
Coroutines, however, don't do anything until they are awaited; they only make progress if they are explicitly polled, and that is only done by awaiting them (directly or indirectly, e.g. gather or wait will await/poll the objects they wrap).
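To address the edit directly, here is a sketch (mine, not the answerer's) of a worker that keeps draining the queue instead of returning after one item, so gather still limits concurrency to worker_limit while every download_file() coroutine gets awaited:
import asyncio

async def worker(download_tasks, results):
    # keep pulling coroutines until the queue is empty,
    # instead of returning after the first one
    while not download_tasks.empty():
        coro = download_tasks.get_nowait()
        results.append(await coro)

async def download_all(files, worker_limit=10):
    download_tasks = asyncio.Queue()
    for file in files:
        # queue bare coroutines; they only start running when a worker awaits them
        download_tasks.put_nowait(download_file(file=file))

    results = []
    await asyncio.gather(*(worker(download_tasks, results) for _ in range(worker_limit)))
    return results  # e.g. pd.concat(results) as in the question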

How do I make my list comprehension (and the function it calls) run asynchronously?

class Class1():
    def func1(self):
        self.conn.send('something')
        data = self.conn.recv()
        return data

class Class2():
    def func2(self):
        [class1.func1() for class1 in self.classes]
How do I make that last line run asynchronously in Python? I've been googling but can't understand async/await and don't know which functions I should be putting async in front of. In my case, all the class1.func1 calls need to send before any of them can receive anything. I was also seeing that __aiter__ and __anext__ need to be implemented, but I don't know how those are used in this context. Thanks!
It is indeed possible to fire off multiple requests and asynchronously wait for them. Because Python is traditionally a synchronous language, you have to be very careful about what libraries you use with asynchronous Python. Any library that blocks the main thread (such as requests) will break your entire asynchronicity. aiohttp is a common choice for asynchronously making web API calls in Python. What you want is to create a bunch of future objects inside a Python list and await it. A future is an object that represents a value that will eventually resolve to something.
EDIT: Since the function that actually makes the API call is synchronous and blocking and you don't have control over it, you will have to run that function in a separate thread.
Async List Comprehensions in Python
import asyncio

async def main():
    loop = asyncio.get_event_loop()
    futures = [asyncio.ensure_future(loop.run_in_executor(None, get_data, data))
               for data in data_name_list]
    await asyncio.gather(*futures)  # wait for all the future objects to resolve
    # Do something with futures
    # ...

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
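Adapted to the Class1/Class2 example from the question, a rough sketch might look like the following. It assumes self.conn.send/recv are blocking calls you can't change, so each func1() runs in the default thread pool; the calls overlap, but this does not strictly guarantee that every send completes before the first receive starts:
import asyncio

class Class2:
    async def func2(self):
        loop = asyncio.get_running_loop()
        # self.classes is assumed to be populated elsewhere, as in the question.
        # Each blocking func1() call is dispatched to the default thread pool,
        # so the send/recv round trips overlap instead of running one by one.
        futures = [
            loop.run_in_executor(None, class1.func1)
            for class1 in self.classes
        ]
        return await asyncio.gather(*futures)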

How to use asyncio with a very long list of tasks (generator)

I have a small program that loads a pretty heavy CSV (over 800MB, in chunks, using pandas.read_csv to limit memory usage) and performs a few API calls to servers "out in the wild", and finally builds a result object which is then stored in a database.
I have added caching for the network requests where possible, but even then the code takes over 10 hours to complete. When I profile the code with py-spy, most of the time is spent waiting for network requests.
I tried converting it to use asyncio to speed things up, and have managed to get the code to work on a small subset of the input file. However, with the full file, the memory use becomes prohibitive.
Here is what I have tried:
import asyncio
import pandas as pd
import httpx

async def process_item(item, client):
    # send a few requests with httpx session
    # process results
    await save_results_to_db(res)

def get_items_from_csv():
    # a plain generator is enough here: nothing inside needs await,
    # and main() iterates it with a regular for loop
    # loads the heavy CSV file
    for chunk in pd.read_csv(filename, ...):
        for row in chunk.itertuples():
            item = item_from_row(row)
            yield item

async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        for item in get_items_from_csv():
            tasks.append(process_item(item, client))
        await asyncio.gather(*tasks)

asyncio.run(main())
Is there a way to avoid creating the tasks list, which becomes a very heavy object with over 1.5M items in it? The other downside of this is that no task seems to be processed until the entire file has been read, which is not ideal.
I'm using python 3.7 but can easily upgrade to 3.8 if needed.
I think what you are looking for here is not running in batches but running N workers which concurrently pull tasks off of a queue.
N = 10  # scale based on the processing power and memory you have

async def main():
    async with httpx.AsyncClient() as client:
        tasks = asyncio.Queue()
        for item in get_items_from_csv():
            tasks.put_nowait(process_item(item, client))

        async def worker():
            while not tasks.empty():
                await tasks.get_nowait()
            # for a server:
            # while task := await tasks.get():
            #     await task

        await asyncio.gather(*[worker() for _ in range(N)])
I used an asyncio.Queue, but you could also just use a collections.deque, since all tasks are added to the queue before any worker starts. A queue is especially useful for workers in a long-running process (e.g. a server) where items may be queued asynchronously.
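If you also want to avoid building the whole queue (and reading the whole CSV) before any work starts, one variant is to bound the queue and run a producer alongside the workers. This is a sketch reusing get_items_from_csv and process_item from the question; the queue size of 100 is an illustrative knob, not a recommendation:
import asyncio
import httpx

N = 10            # concurrent workers
QUEUE_SIZE = 100  # how many pending items may sit in memory at once

async def main():
    async with httpx.AsyncClient() as client:
        queue = asyncio.Queue(maxsize=QUEUE_SIZE)

        async def producer():
            for item in get_items_from_csv():
                await queue.put(item)   # blocks while the queue is full
            for _ in range(N):
                await queue.put(None)   # one sentinel per worker

        async def worker():
            while True:
                item = await queue.get()
                if item is None:        # sentinel: no more work
                    break
                await process_item(item, client)

        await asyncio.gather(producer(), *(worker() for _ in range(N)))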

When to use and when not to use Python 3.5 `await` ?

I'm getting the flow of using asyncio in Python 3.5, but I haven't seen a description of what things I should be awaiting, what things I should not be, or where it would be negligible. Do I just have to use my best judgement in terms of "this is an IO operation and thus should be awaited"?
By default all your code is synchronous. You can make it asynchronous by defining functions with async def and "calling" these functions with await. A more correct question would be "When should I write asynchronous code instead of synchronous?". The answer is "When you can benefit from it". In cases when you work with I/O operations, as you noted, you will usually benefit:
# Synchronous way:
download(url1)  # takes 5 sec.
download(url2)  # takes 5 sec.
# Total time: 10 sec.

# Asynchronous way:
await asyncio.gather(
    async_download(url1),  # takes 5 sec.
    async_download(url2)   # takes 5 sec.
)
# Total time: only 5 sec. (+ little overhead for using asyncio)
Of course, if you created a function that uses asynchronous code, this function should be asynchronous too (defined with async def). But any asynchronous function can freely use synchronous code. It makes no sense to turn synchronous code into asynchronous without some reason:
# extract_links(url) should be async because it uses async func async_download() inside
async def extract_links(url):
    # async_download() was created async to get benefit of I/O
    html = await async_download(url)
    # parse() doesn't work with I/O, there's no sense to make it async
    links = parse(html)
    return links
One very important thing is that any long-running synchronous operation (> 50 ms, for example; it's hard to say exactly) will freeze all your asynchronous operations for that time:
async def extract_links(url):
    data = await download(url)
    links = parse(data)
    # if search_in_very_big_file() takes much time to process,
    # all your running async funcs (somewhere else in code) will be frozen
    # you need to avoid this situation
    links_found = search_in_very_big_file(links)
You can avoid this by calling long-running synchronous functions in a separate process (and awaiting the result):
from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(2)

async def extract_links(url):
    data = await download(url)
    links = parse(data)
    # Now your main process can handle other async functions while the separate process is running
    links_found = await loop.run_in_executor(executor, search_in_very_big_file, links)
One more example: when you need to use requests in asyncio. requests.get is just a synchronous, long-running function, which you shouldn't call inside async code (again, to avoid freezing). But it runs long because of I/O, not because of long calculations. In that case, you can use ThreadPoolExecutor instead of ProcessPoolExecutor to avoid some multiprocessing overhead:
import requests
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(2)

async def download(url):
    response = await loop.run_in_executor(executor, requests.get, url)
    return response.text
You do not have much freedom. If you need to call a function, you need to find out whether it is a usual function or a coroutine. You must use the await keyword if and only if the function you are calling is a coroutine.
If async functions are involved, there should be an "event loop" which orchestrates these async functions. Strictly speaking it's not necessary, you can "manually" run the async method by sending values to it, but probably you don't want to do that. The event loop keeps track of not-yet-finished coroutines and chooses the next one to continue running. The asyncio module provides an implementation of an event loop, but it is not the only possible implementation.
Consider these two lines of code:
x = get_x()
do_something_else()
and
x = await aget_x()
do_something_else()
The semantics are absolutely the same: call a method which produces some value, and when the value is ready, assign it to the variable x and do something else. In both cases the do_something_else function will be called only after the previous line of code has finished. It doesn't even mean that control will be yielded to the event loop before, after, or during the execution of the asynchronous aget_x method.
Still there are some differences:
the second snippet can appear only inside another async function
aget_x is not a usual function but a coroutine (that is, either declared with the async keyword or decorated as a coroutine)
aget_x is able to "communicate" with the event loop, that is, yield some objects to it. The event loop should be able to interpret these objects as requests to perform some operations (e.g. to send a network request and wait for the response, or just suspend this coroutine for n seconds). A usual get_x function is not able to communicate with the event loop. The sketch below illustrates this.
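A small illustration of that last point, with asyncio.sleep standing in for any operation that yields control to the event loop:
import asyncio

async def aget_x():
    # awaiting asyncio.sleep hands control back to the event loop,
    # so other coroutines can run during the pause
    await asyncio.sleep(1)
    return 42

async def main():
    # both calls finish in about one second total,
    # because each yields to the loop while it waits
    print(await asyncio.gather(aget_x(), aget_x()))  # [42, 42]

asyncio.run(main())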
