I'm trying to create a simple network monitoring app in Python. It should essentially:
Run multiple scripts (in this case, bash commands like "ping" and "traceroute") infinitely and simultaneously
Yield each line from the output of each subprocess; each line should then be consumed elsewhere in the program and sent to a Kafka topic
Do some extra processing on the topic and send the data to InfluxDB (but that's less relevant - I do it with Faust).
What I did:
I tried using an async generator:
import asyncio

async def run(command: str):
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    while True:
        line = await proc.stdout.readline()
        if line:
            yield line
        else:  # EOF: the subprocess closed its stdout
            break
Then consume it elsewhere in the program:
...
async for output_line in run("some_command"):
    # do something with output_line
This works fine for a single subprocess; however, I'm not sure what to do when I need multiple async generators to run in parallel and be consumed in parallel - something like asyncio.gather, maybe, but for async generators.
What do you think would be the best approach to go about doing this? Upon searching I found the aiostream module, which can merge multiple async generators, as in the sketch below. I can then instead yield a tuple with the line and, say, the command I gave, to identify which generator the output line came from.
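For reference, the aiostream merge pattern I mean would look roughly like this - a sketch based on aiostream's documented usage, reusing the run() generator above:

from aiostream import stream

async def main():
    combined = stream.merge(run("ping 127.0.0.1"), run("traceroute 127.0.0.1"))
    async with combined.stream() as streamer:
        async for line in streamer:
            ...  # each line comes from whichever subprocess produced output first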
However, maybe there's a simpler solution, hopefully a native one?
Thanks!
What you are looking for is asyncio.gather, which runs multiple awaitable objects concurrently.
To use it, I think your first task is to wrap your parsing code into a single function, like:
async def parse(cmd):
    async for output_line in run(cmd):
        # do something with output_line
Then in another function/context, wrap the parse with gather:
result = await asyncio.gather(
    parse("cmd1"),
    parse("cmd2"),
    parse("cmd3"),
)
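If all the lines ultimately need to reach one consumer (e.g. your Kafka producer), one way to wire that up - a minimal sketch, with handle_line standing in for whatever sends to Kafka - is a variation of parse() that pushes (command, line) tuples onto a shared asyncio.Queue:

import asyncio

async def parse(cmd, queue):
    # tag each line with the command it came from
    async for output_line in run(cmd):
        await queue.put((cmd, output_line))

async def consume(queue):
    # consumes forever; cancel or adapt this for finite runs
    while True:
        cmd, line = await queue.get()
        handle_line(cmd, line)  # hypothetical: produce to your Kafka topic here

async def main():
    queue = asyncio.Queue()
    commands = ["ping 127.0.0.1", "traceroute 127.0.0.1"]
    await asyncio.gather(consume(queue), *(parse(cmd, queue) for cmd in commands))

asyncio.run(main())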
Related
I'm having problems wrapping an external task to parallelize it. I'm a newbie with asyncio so maybe I'm doing something wrong:
I have an animate method that I have also declared as async.
But that calls an external library that uses various iterators etc.
I'm wondering if something in a library is able to block asyncio at the top level?
animate(item) is the problem. If I define another async task, it will run multiple calls concurrently and 'gather' them later.
So am I doing it wrong, or is it possible the library has been written such that it can't simply be parallelized with asyncio?
I also tried wrapping the call to animate with another async method, without luck.
MAX_JOBS = 1  # how long for
ITEMS_PER_JOB = 4  # how many images per job/user request eg for packs

async def main():
    for i in range(0, MAX_JOBS):
        clogger.info('job index', i)
        job = get_next()
        await process_job(job)

async def process_job(job):
    batch = generate_batch(job)
    coros = [animate(item) for idx, item in enumerate(batch)]
    asyncio.gather(*coros)

asyncio.run(main())
The animate func has some internals like:
async def animate(options):
    for frame in tqdm(animator.render(), initial=animator.start_frame_idx, total=args.max_frames):
        pass
OK, never mind - it seems the libraries themselves have to be written with coroutines, but there are other options like
to_thread
run_in_executor
though I'm not sure which is best in 2023. A sketch of the to_thread approach is below.
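For what it's worth, a minimal sketch of the to_thread route (Python 3.9+), assuming animate is really a blocking, synchronous call into the library rather than a true coroutine:

import asyncio
import time

def animate(item):
    # stand-in for the blocking library call
    time.sleep(1)
    return item

async def process_job(batch):
    # each blocking animate() call runs in its own worker thread,
    # so the event loop can drive them concurrently
    return await asyncio.gather(*(asyncio.to_thread(animate, item) for item in batch))

print(asyncio.run(process_job(range(4))))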
The tasks from asyncio.gather do not run concurrently
I have 200 pairs of paths to diff. I wrote a little function that will diff each pair and update a dictionary which itself is one of the arguments to the function. Assume MY_DIFFER is some diffing tool I am calling via subprocess under the hood.
async def do_diff(path1, path2, result):
    result[f"{path1} {path2}"] = MY_DIFFER(path1, path2)
As you can see I have nothing to return from this async function. I am just capturing the result in result.
I call this function in parallel elsewhere using asyncio like so:
path_tuples = [("/path11", "/path12"), ("/path21", "/path22"), ... ]
result = {}
loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(do_diff(path1, path2, result) for path1, path2 in path_tuples)
    )
)
Questions:
I don't know where to put await in the do_diff function. But the code seems to work without it as well.
I am not sure if the diffs are really happening in parallel, because when I look at the output of ps -eaf in another terminal, I see only one instance of the underlying tool I am calling at a time.
The speed of execution is the same as when I was doing the diffs sequentially.
So I am clearly doing something wrong. How can I REALLY do the diffs in parallel?
PS: I am in Python 3.6
Remember that asyncio doesn't run things in parallel, it runs things concurrently, using a cooperative multitasking model -- which means that coroutines need to explicitly yield time to other coroutines for them to run. This is what the await keyword does; it says "go run some other coroutines while I'm waiting for something to finish".
If you're never awaiting on something, you're not getting concurrent execution.
What you want is for your do_diff method to be able to await on the execution of your external tool, but you can't do that with just the subprocess module. You can do that using the run_in_executor method, which arranges to run a synchronous command (e.g., subprocess.run) in a separate thread or process and wait asynchronously for the result. That might look something like:
async def do_diff(path1, path2, result):
    loop = asyncio.get_event_loop()
    result[f"{path1} {path2}"] = await loop.run_in_executor(None, MY_DIFFER, path1, path2)
This will by default run MY_DIFFER in a separate thread, although you can use a separate process instead by passing an explicit executor as the first argument to run_in_executor, as sketched below.
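A sketch of that process-pool variant, reusing the MY_DIFFER placeholder from the question (for a process pool, MY_DIFFER and its arguments must be picklable):

import asyncio
import concurrent.futures

# one pool shared by all diffs
executor = concurrent.futures.ProcessPoolExecutor(max_workers=10)

async def do_diff(path1, path2, result):
    loop = asyncio.get_event_loop()
    # the diff now runs in a separate process instead of a thread
    result[f"{path1} {path2}"] = await loop.run_in_executor(executor, MY_DIFFER, path1, path2)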
Per my comment, solving this with concurrent.futures might look something like this:
import concurrent.futures
import time

# dummy function that just sleeps for 2 seconds
# replace this with your actual code
def do_diff(path1, path2):
    print(f"diffing path {path1} and {path2}")
    time.sleep(2)
    return path1, path2, "information about diff"

# create 200 path tuples for demonstration purposes
path_tuples = [(f"/path{x}.1", f"/path{x}.2") for x in range(200)]

futures = []
with concurrent.futures.ProcessPoolExecutor(max_workers=100) as executor:
    for path1, path2 in path_tuples:
        # submit the job to the executor
        futures.append(executor.submit(do_diff, path1, path2))

    # read the results
    for future in futures:
        print(future.result())
I have the following code running in an event loop where I'm downloading a large number of files using asyncio and restricting the number of files downloaded using asyncio.Queue:
download_tasks = asyncio.Queue()
for file in files:
    # download_file() is an async function that downloads a file from Microsoft blob storage
    # that is basically await blob.download_blob()
    download_tasks.put_nowait(asyncio.create_task(download_file(file=file)))

async def worker():
    while not download_tasks.empty():
        return await download_tasks.get_nowait()

worker_limit = 10

# each call to download_file() returns a pandas dataframe
df_list = await asyncio.gather(*[worker() for _ in range(worker_limit)], return_exceptions=True)
df = pd.concat(df_list)
This code seems to run fine, but I originally had the for loop defined as:
for file in files:
    # download_file() is an async function that downloads a file from Microsoft blob storage
    # that is basically await blob.download_blob()
    download_tasks.put_nowait(download_file(file=file))
With this code, the result is the same but I get the following warning:
RuntimeWarning: coroutine 'download_file' was never awaited
Looking at asyncio examples, sometimes I see create_task() used when creating a list or queue of coroutines to be run in gather and sometimes I don't. Why is it needed in my case and what's the best practice for using it?
Edit: As #user2357112supportsMonica discourteously pointed out, the return statement within worker() doesn't really make sense. The point of this code is to limit concurrency because I may have to download thousands at a time and would like to limit it to 10 at a time using the queue. So my actual question is, how can I use gather to return all my results using this queue implementation?
Edit 2: I seem to have found an easy solution that works using a semaphore instead of a queue, with the following code adapted from this answer https://stackoverflow.com/a/61478547/4844593:
download_tasks = []
for file in files:
    download_tasks.append(download_file(file=file))

async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(*(sem_task(task) for task in tasks))

df_list = await gather_with_concurrency(10, *download_tasks)
return pd.concat(df_list)
As "user2357112 supports Monica" notes, the original issue probably comes from the workers having a return so each worker will download one file then quit, meaning any coroutines after the first 10 will be ignored and never awaited (you can probably see that if you log information about download_tasks after the supposed completion of your processing).
The create_tasks defeats that because it will immediately schedule the downloading at the same time (defeating the attempted rate limiting / workers pool), then the incorrect worker code will just ignore anything after the first 10 items.
Anyway the difference between coroutines (e.g. bare async functions) and tasks is that tasks are independently scheduled. That is, once you've created a task it lives its life independently and you don't have to await it if you don't want its result. That is similar to Javascript's async functions.
coroutines, however, don't do anything until they are awaited, they will only progress if they are explicitelly polled and that is only done by awaiting them (directly or indirectly e.g. gather or wait will await/poll the objects they wrap).
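For reference, a minimal sketch of a queue-draining worker that keeps the intended concurrency limit - note it puts bare coroutines (not tasks) on the queue, so nothing starts downloading until a worker picks it up; download_file, files and pd are the names from the question:

download_tasks = asyncio.Queue()
for file in files:
    download_tasks.put_nowait(download_file(file=file))

async def worker(results):
    # keep pulling coroutines off the queue until it is empty,
    # instead of returning after the first one
    while not download_tasks.empty():
        coro = download_tasks.get_nowait()
        results.append(await coro)

worker_limit = 10
results = []
await asyncio.gather(*[worker(results) for _ in range(worker_limit)])
df = pd.concat(results)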
class Class1():
    def func1(self):
        self.conn.send('something')
        data = self.conn.recv()
        return data

class Class2():
    def func2(self):
        [class1.func1() for class1 in self.classes]
How do I make that last line asynchronous in Python? I've been googling but can't understand async/await and don't know which functions I should be putting async in front of. In my case, all the class1.func1 calls need to send before any of them can receive anything. I was also seeing that __aiter__ and __anext__ need to be implemented, but I don't know how those are used in this context. Thanks!
It is indeed possible to fire off multiple requests and asynchronously wait for them. Because Python is traditionally a synchronous language, you have to be very careful about what libraries you use with asynchronous Python. Any library that blocks the main thread (such as requests) will break your entire asynchronicity. aiohttp is a common choice for asynchronously making web API calls in Python. What you want is to create a bunch of future objects inside a Python list and await it. A future is an object that represents a value that will eventually resolve to something.
EDIT: Since the function that actually makes the API call is synchronous and blocking and you don't have control over it, you will have to run that function in a separate thread.
Async List Comprehensions in Python
import asyncio

async def main():
    loop = asyncio.get_event_loop()
    futures = [asyncio.ensure_future(loop.run_in_executor(None, get_data, data)) for data in data_name_list]
    await asyncio.gather(*futures)  # wait for all the future objects to resolve
    # Do something with futures
    # ...

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
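On newer Python versions (3.9+), the same idea can be written more compactly with asyncio.to_thread - a sketch, assuming get_data is the blocking function and data_name_list is the list from the question:

import asyncio

async def main():
    # each blocking get_data call runs in the default thread pool
    results = await asyncio.gather(*(asyncio.to_thread(get_data, data) for data in data_name_list))
    # do something with results
    return results

asyncio.run(main())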
Working on replacing my implementation of a server query tool that uses ThreadPoolExecutors with all asynchronous calls using asyncio and aiohttp. Most of the transition is straightforward since network calls are non-blocking IO; it's the saving of the responses that has me in a conundrum.
All the examples I am using, even the docs for both libraries, use asyncio.gather() which collects all the awaitable results. In my case, these results can be files in the many GB range, and I don't want to store them in memory.
What's an appropriate way to solve this? Is it to use asyncio.as_completed() and then:
for f in as_completed(aws):
    earliest_result = await f
    # Assumes `loop` defined under `if __name__` block outside coroutine
    loop = get_event_loop()
    # Run the blocking IO in an executor and write to file
    _ = await loop.run_in_executor(None, save_result, earliest_result)
Doesn't this introduce a thread (assuming I use a ThreadPoolExecutor by default) thus making this an asynchronous, multi-threaded program vice an asynchronous, single-threaded program?
Further, does this ensure only 1 earliest_result is being written to file at any time? I don't want the call to await loop.run_in_executor(...) to be running, then another result comes in and I try to write to the same file; I could limit this with a semaphore, I suppose.
I'd suggest making use of the aiohttp streaming API. Write your responses directly to disk instead of RAM and return file names instead of the responses themselves from gather. Doing so won't use a lot of memory at all. This is a small demo of what I mean:
import asyncio
import aiofiles
from aiohttp import ClientSession

async def make_request(session, url):
    response = await session.request(method="GET", url=url)
    filename = url.split('/')[-1]
    async for data in response.content.iter_chunked(1024):
        async with aiofiles.open(filename, "ba") as f:
            await f.write(data)
    return filename

async def main():
    urls = ['https://github.com/Tinche/aiofiles',
            'https://github.com/aio-libs/aiohttp']
    async with ClientSession() as session:
        coros = [make_request(session, url) for url in urls]
        result_files = await asyncio.gather(*coros)
        print(result_files)

asyncio.run(main())
Very clever way of using the asyncio.gather method by #merrydeath.
I tweaked the helper function like below and got a big performance boost:
response = await session.get(url)
filename = url.split('/')[-1]
async with aiofiles.open(filename, "ba") as f:
    await f.write(await response.read())  # note: this buffers the whole response body in memory
Results may differ depending on the download connection speed.
In my case, these results can be files in the many GB range, and I don't want to store them in memory.
If I'm right and a single item in aws means downloading a single file in your code, you may face the following problem: while as_completed lets you move data from RAM to HDD as soon as possible, all of your aws run in parallel, each holding its own buffer with a partly downloaded file in RAM simultaneously.
To avoid this, you'll need to use a semaphore to ensure that not too many files are downloaded in parallel in the first place, thus preventing RAM overuse.
Here's an example of using a semaphore.
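(A minimal sketch of the idea; download_one is a placeholder for whatever coroutine produces a single result:)

import asyncio

async def download_one(url):
    # placeholder for the real download coroutine
    await asyncio.sleep(1)
    return url

async def bounded_download(semaphore, url):
    # only a limited number of downloads run (and buffer data) at once
    async with semaphore:
        return await download_one(url)

async def main():
    semaphore = asyncio.Semaphore(5)  # at most 5 downloads in flight
    aws = [bounded_download(semaphore, f"https://example.com/{i}") for i in range(20)]
    for f in asyncio.as_completed(aws):
        earliest_result = await f
        print(earliest_result)

asyncio.run(main())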
Doesn't this introduce a thread (assuming I use a ThreadPoolExecutor by default) thus making this an asynchronous, multi-threaded program vice an asynchronous, single-threaded program?
I'm not sure I understand your question, but yes, your code will use threads - however, only save_result will be executed inside those threads. All other code still runs in the single main thread. Nothing bad here.
Further, does this ensure only 1 earliest_result is being written to file at any time?
Yes, it does[*]. To be precise, the await keyword on the last line of your snippet ensures it:
_ = await loop.run_in_executor(None, save_result, earliest_result)
You can read it as: "Start executing run_in_executor asynchronously and suspend the execution flow at this line until run_in_executor is done and has returned a result".
[*] Yes, as long as you don't run multiple for f in as_completed(aws) loops in parallel in the first place.