I'm having problems wrapping an external task to parallelize it. I'm a newbie with asyncio so maybe I'm doing something wrong:
I have an animate method that I have also declared as async.
But that calls an external library that uses various iterators etc.
I'm wondering if something in a library is able to block asyncio at the top level?
The call to animate(item) is the problem. If I define a different async task instead, multiple calls run concurrently and I can 'gather' them later.
So am I doing it wrong, or is it possible the library has been written in such a way that it can't simply be parallelized with asyncio?
I also tried wrapping the call to animate with another async method, without luck.
import asyncio

MAX_JOBS = 1  # how long for
ITEMS_PER_JOB = 4  # how many images per job/user request, e.g. for packs

async def main():
    for i in range(0, MAX_JOBS):
        clogger.info('job index', i)
        job = get_next()
        await process_job(job)

async def process_job(job):
    batch = generate_batch(job)
    coros = [animate(item) for idx, item in enumerate(batch)]
    await asyncio.gather(*coros)  # gather() has to be awaited to wait for the coroutines

asyncio.run(main())
The animate function has some internals and looks roughly like this:
async def animate(options):
    for frame in tqdm(animator.render(), initial=animator.start_frame_idx, total=args.max_frames):
        pass
OK, never mind. It seems a library has to be written with coroutines to cooperate with asyncio, but there are other options for blocking code, like:
to_thread
run_in_executor
Not sure which is best in 2023, though.
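A minimal sketch of the to_thread option, under some assumptions: render_frames is a hypothetical stand-in for the blocking library call (such as animator.render()), and the timings are made up. asyncio.to_thread (Python 3.9+) is a convenience wrapper that runs a function in the loop's default thread pool, so for simple cases it is usually the shorter choice, while run_in_executor is the lower-level call that lets you pass your own executor. Threads only give real parallelism if the blocking work releases the GIL (I/O or native code); for pure-Python CPU work a process pool is likely the better fit.

```python
import asyncio
import threading
import time

def render_frames(item, frames=3):
    """Hypothetical stand-in for a blocking library call such as animator.render()."""
    for _ in range(frames):
        time.sleep(0.05)  # simulates blocking work that never yields to the event loop
    return item, threading.current_thread().name

async def animate(item):
    # to_thread() runs the blocking function in a worker thread and awaits its result
    return await asyncio.to_thread(render_frames, item)

async def process_job(batch):
    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(animate(item) for item in batch))

results = asyncio.run(process_job(["a", "b", "c", "d"]))
print(results)
```

The printed thread names should show that the items were rendered in worker threads rather than on the event loop's thread.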
The tasks from asyncio.gather does not work concurrently
I'd like to use asyncio to do a lot of simultaneous non-blocking IO in Python. However, I want that use of asyncio to be abstracted away from the user: under the hood there are a lot of asynchronous calls going on simultaneously to speed things up, but for the user there's a single, synchronous call.
Basically something like this:
async def _slow_async_fn(address):
    data = await async_load_data(address)
    return data

def synchronous_blocking_io():
    addresses = ...
    tasks = []
    for address in addresses:
        tasks.append(_slow_async_fn(address))
    all_results = some_fn(asyncio.gather(*tasks))
    return all_results
The problem is, how can I achieve this in a way that's agnostic to the user's running environment? If I use a pattern like asyncio.get_event_loop().run_until_complete(), I run into issues when the code is called inside an environment like Jupyter, where there's already an event loop running. Is there a way to robustly gather the results of a set of asynchronous tasks that doesn't require pushing async/await statements all the way up the program?
The restriction on running loops is per thread, so running a new event loop is possible, as long as it is in a new thread.
import asyncio
import concurrent.futures

async def gatherer_of(tasks):
    # It's necessary to wrap asyncio.gather() in a coroutine (reasons beyond scope)
    return await asyncio.gather(*tasks)

def synchronous_blocking_io():
    addresses = ...
    tasks = []
    for address in addresses:
        tasks.append(_slow_async_fn(address))
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(gatherer_of(tasks))
    finally:
        loop.close()  # release the loop's resources once the results are in

def synchronous_blocking_io_wrapper():
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        fut = executor.submit(synchronous_blocking_io)
        return fut.result()

# Testing
async def async_runner():
    # Simulating execution from a running loop
    return synchronous_blocking_io_wrapper()

# Run from synchronous client
# print(synchronous_blocking_io_wrapper())

# Run from async client
# print(asyncio.run(async_runner()))
The same result can be achieved in other ways: with a ProcessPoolExecutor, by manually running synchronous_blocking_io in a new thread and joining it, by starting an entirely new process, and so forth. As long as you are not in the same thread, you won't conflict with any running event loop.
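For reference, here is a self-contained variant of the same idea that can be run as-is. It is only a sketch: _slow_async_fn fakes the load with asyncio.sleep, the addresses are made-up strings, and asyncio.run() is used inside the worker thread instead of managing the loop by hand.

```python
import asyncio
import concurrent.futures

async def _slow_async_fn(address):
    # Hypothetical stand-in for the question's async_load_data(address)
    await asyncio.sleep(0.05)
    return f"data from {address}"

async def _gather_all(addresses):
    return await asyncio.gather(*(_slow_async_fn(a) for a in addresses))

def synchronous_blocking_io(addresses):
    # asyncio.run() creates a fresh loop in the calling thread and closes it afterwards
    return asyncio.run(_gather_all(addresses))

def synchronous_blocking_io_wrapper(addresses):
    # A worker thread has no running loop, so this is safe even when the caller has one
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(synchronous_blocking_io, addresses).result()

async def async_client():
    # Simulates a caller that already has a running loop, such as a Jupyter cell
    return synchronous_blocking_io_wrapper(["a", "b"])

print(synchronous_blocking_io_wrapper(["a", "b"]))  # from a synchronous client
print(asyncio.run(async_client()))                  # from inside a running loop
```

Note that the wrapper still blocks its caller until the results arrive, so when it is called from a running loop, that loop is paused for the duration. That is the price of offering a purely synchronous call.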
Just can't wrap my head around solving this issue, so maybe someone here can enlighten me or maybe even tell me that what I want to achieve isn't possible. :)
Problem statement:
I have an asyncio event loop, on that loop I create a task by supplying my asynchronous coroutine work(). I could then go ahead and cancel the task by invoking its cancel() method - this works.
But in my very special case, the asynchronous task itself spawns another operation, which is an underlying blocking / synchronous function.
What happens now, if I decide to cancel the task, is that my asynchronous work() function will be cancelled appropriately, however, the synchronous function is still going to be executed as if nothing ever happened.
I tried to make an example as simple as possible to illustrate my problem:
import asyncio
import time

def sync_work():
    time.sleep(10)
    print("sync work completed")
    return "sync_work_result"

async def work(loop):
    result = await loop.run_in_executor(None, sync_work)
    print(f"sync_work {result}")
    print("work completed")

async def main(loop):
    t1 = loop.create_task(work(loop))
    await asyncio.sleep(4)
    t1.cancel()

loop = asyncio.get_event_loop()
try:
    asyncio.ensure_future(main(loop))
    loop.run_forever()
except KeyboardInterrupt:
    pass
finally:
    print("loop closing")
    loop.close()
This will print out sync work completed after about 10 seconds.
How would I invoke the synchronous function in a way, that would allow me to terminate it once my asynchronous task is cancelled? The tricky part is, that I would not have control over sync_work() as this comes from another external package.
I'm open to other approaches of calling my synchronous function from an asynchronous function that would allow it to be terminated properly in some kind of way.
I'm trying to speed up some code that calls an api_caller(), which is a generator that you can iterate over to get results.
My synchronous code looks something like this:
def process_comment_tree(p):
    # time consuming breadth first search that makes another api call...
    return

def process_post(p):
    process_comment_tree(p)

def process_posts(kw):
    for p in api_caller(query=kw):  # possibly 1000s of results
        process_post(p)

def process_kws(kws):
    for kw in kws:
        process_posts(kw)

process_kws(kws=['python', 'threads', 'music'])
When I run this code on a long list of kws, it takes around 18 minutes to complete.
When I use threads:
with concurrent.futures.ThreadPoolExecutor(max_workers=len(KWS)) as pool:
    for result in pool.map(process_posts, ['python', 'threads', 'music']):
        print(f'result: {result}')
the code completes in around 3 minutes.
Now, I'm trying to use Trio for the first time, but I'm having trouble.
async def process_comment_tree(p):
    # same as before...
    return

async def process_post(p):
    await process_comment_tree(p)

async def process_posts(kw):
    async with trio.open_nursery() as nursery:
        for p in r.api.search_submissions(query=kw):
            nursery.start_soon(process_post, p)

async def process_kws(kws):
    async with trio.open_nursery() as nursery:
        for kw in kws:
            nursery.start_soon(process_posts, kw)

trio.run(process_kws, ['python', 'threads', 'music'])
This still takes around 18 minutes to execute. Am I doing something wrong here, or is something like trio/async not appropriate for my problem setup?
Trio, and async libraries in general, work by switching to a different task while waiting for something external, like an API call. In your code example, it looks like you start a bunch of tasks, but none of them ever waits for something external in a way that lets Trio switch to another task. I would recommend reading this part of the tutorial; it gives an idea of what that means: https://trio.readthedocs.io/en/stable/tutorial.html#task-switching-illustrated
Basically, your code has to call a function that will pass control back to the run loop so that it can switch to a different task.
If your api_caller generator makes calls to an external API, that's likely to be something you can replace with async calls. You'll need to use an async HTTP library, like HTTPX or hip.
On the other hand, if there's nothing in your code that has to wait for something external, then async won't help your code go faster.
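The difference between a task that blocks and one that passes control back can be seen in a small timing experiment. The sketch below uses asyncio rather than Trio only so that it runs with the standard library alone; the same principle applies to Trio. The two task functions and the 0.2 second delay are made up for illustration.

```python
import asyncio
import time

async def blocking_task():
    time.sleep(0.2)  # blocks the whole event loop; no other task can run meanwhile

async def yielding_task():
    await asyncio.sleep(0.2)  # hands control back to the loop while waiting

async def timed(task_fn, count=3):
    # Start `count` tasks together and measure how long they take as a group
    start = time.perf_counter()
    await asyncio.gather(*(task_fn() for _ in range(count)))
    return time.perf_counter() - start

blocking_total = asyncio.run(timed(blocking_task))
yielding_total = asyncio.run(timed(yielding_task))
print(f"blocking: {blocking_total:.2f}s, yielding: {yielding_total:.2f}s")
```

The blocking tasks run one after another even though they are declared async, so their total is roughly the sum of the delays, while the yielding tasks overlap.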
class Class1():
    def func1(self):
        self.conn.send('something')
        data = self.conn.recv()
        return data

class Class2():
    def func2(self):
        [class1.func1() for class1 in self.classes]
How do I make that last line run asynchronously in Python? I've been googling but can't understand async/await and don't know which functions I should be putting async in front of. In my case, all the class1.func1 calls need to send before any of them can receive anything. I also saw that __aiter__ and __anext__ need to be implemented, but I don't know how those are used in this context. Thanks!
It is indeed possible to fire off multiple requests and asynchronously
wait for them. Because Python is traditionally a synchronous language,
you have to be very careful about what libraries you use with
asynchronous Python. Any library that blocks the main thread (such as
requests) will break your entire asynchronicity. aiohttp is a common
choice for asynchronously making web API calls in Python. What you
want is to create a bunch of future objects inside a Python list and
await them. A future is an object that represents a value that will
eventually resolve to something.
EDIT: Since the function that actually makes the API call is
synchronous and blocking and you don't have control over it, you will
have to run that function in a separate thread.
Async List Comprehensions in Python
import asyncio

async def main():
    loop = asyncio.get_event_loop()
    futures = [asyncio.ensure_future(loop.run_in_executor(None, get_data, data)) for data in data_name_list]
    await asyncio.gather(*futures)  # wait for all the future objects to resolve
    # Do something with futures
    # ...

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
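Applied to the question's requirement that every connection sends before any of them receives, the same thread-based idea might look like the sketch below. FakeConn is a hypothetical stand-in for the real connection object (its send and recv only simulate the behaviour), and asyncio.to_thread (Python 3.9+) is used here in place of run_in_executor.

```python
import asyncio
import time

class FakeConn:
    """Hypothetical stand-in for a blocking connection object (not a real library)."""
    def __init__(self, name):
        self.name = name
        self.sent = None

    def send(self, msg):
        self.sent = msg  # assumed to return quickly

    def recv(self):
        time.sleep(0.1)  # simulates blocking until the reply arrives
        return f"{self.name} got reply to {self.sent!r}"

async def send_all_then_receive(conns):
    for conn in conns:
        conn.send("something")  # every send happens before any recv starts
    # Each blocking recv() runs in a worker thread; gather keeps the input order
    return await asyncio.gather(*(asyncio.to_thread(c.recv) for c in conns))

replies = asyncio.run(send_all_then_receive([FakeConn("a"), FakeConn("b")]))
print(replies)
```

No __aiter__ or __anext__ is needed for this pattern; those are only required for objects used with async for.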
I'm getting the flow of using asyncio in Python 3.5, but I haven't seen a description of which things I should be awaiting, which I should not, and where it would be negligible. Do I just have to use my best judgement in terms of "this is an IO operation and thus should be awaited"?
By default all your code is synchronous. You can make it asynchronous by defining functions with async def and "calling" these functions with await. A more correct question would be "When should I write asynchronous code instead of synchronous?". The answer is "When you can benefit from it". When you work with I/O operations, as you noted, you will usually benefit:
# Synchronous way:
download(url1)  # takes 5 sec.
download(url2)  # takes 5 sec.
# Total time: 10 sec.

# Asynchronous way:
await asyncio.gather(
    async_download(url1),  # takes 5 sec.
    async_download(url2)   # takes 5 sec.
)
# Total time: only 5 sec. (+ little overhead for using asyncio)
Of course, if you created a function that uses asynchronous code, this function should be asynchronous too (should be defined as async def). But any asynchronous function can freely use synchronous code. It makes no sense to cast synchronous code to asynchronous without some reason:
# extract_links(url) should be async because it uses async func async_download() inside
async def extract_links(url):
    # async_download() was created async to get benefit of I/O
    html = await async_download(url)
    # parse() doesn't work with I/O, there's no sense to make it async
    links = parse(html)
    return links
One very important thing is that any long synchronous operation (> 50 ms, for example, it's hard to say exactly) will freeze all your asynchronous operations for that time:
async def extract_links(url):
    data = await download(url)
    links = parse(data)
    # if search_in_very_big_file() takes much time to process,
    # all your running async funcs (somewhere else in code) will be frozen
    # you need to avoid this situation
    links_found = search_in_very_big_file(links)
You can avoid this by calling long-running synchronous functions in a separate process (and awaiting the result):
import asyncio
from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(2)

async def extract_links(url):
    data = await download(url)
    links = parse(data)
    # Now your main process can handle other async functions while the separate process is running
    loop = asyncio.get_running_loop()
    links_found = await loop.run_in_executor(executor, search_in_very_big_file, links)
One more example: when you need to use requests in asyncio. requests.get is just a synchronous long-running function, which you shouldn't call inside async code (again, to avoid freezing). But it runs long because of I/O, not because of long calculations. In that case, you can use ThreadPoolExecutor instead of ProcessPoolExecutor to avoid some multiprocessing overhead:
import asyncio
from concurrent.futures import ThreadPoolExecutor

import requests

executor = ThreadPoolExecutor(2)

async def download(url):
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(executor, requests.get, url)
    return response.text
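The freezing effect, and the way an executor avoids it, can be made visible with a small heartbeat task. This is only an illustrative sketch: slow_sync_call is a hypothetical stand-in for requests.get or any other blocking call, and the delays are arbitrary.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def slow_sync_call():
    time.sleep(0.3)  # hypothetical stand-in for requests.get() or another blocking call
    return "done"

async def heartbeat(ticks):
    # Records a timestamp every 50 ms for as long as the loop lets it run
    while True:
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.05)

async def run(use_executor):
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    if use_executor:
        loop = asyncio.get_running_loop()
        with ThreadPoolExecutor(2) as executor:
            await loop.run_in_executor(executor, slow_sync_call)
    else:
        slow_sync_call()  # freezes the loop: the heartbeat cannot tick
    beat.cancel()
    return len(ticks)

frozen_ticks = asyncio.run(run(use_executor=False))
live_ticks = asyncio.run(run(use_executor=True))
print(frozen_ticks, live_ticks)
```

When the blocking call runs directly inside the coroutine, the heartbeat only gets its initial tick; when it runs in the executor, the heartbeat keeps ticking throughout the call.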
You do not have much freedom. If you need to call a function you need to find out if this is a usual function or a coroutine. You must use the await keyword if and only if the function you are calling is a coroutine.
If async functions are involved, there should be an "event loop" which orchestrates them. Strictly speaking it's not necessary: you can "manually" run an async function by sending values to it, but you probably don't want to do that. The event loop keeps track of not-yet-finished coroutines and chooses the next one to continue running. The asyncio module provides an implementation of an event loop, but it is not the only possible implementation.
Consider these two lines of code:
x = get_x()
do_something_else()
and
x = await aget_x()
do_something_else()
The semantics are exactly the same: call a function which produces some value, assign that value to the variable x when it is ready, and then do something else. In both cases the do_something_else function is called only after the previous line of code has finished. Using await does not even guarantee that control is yielded to the event loop before, after, or during the execution of the asynchronous aget_x function.
Still there are some differences:
the second snippet can appear only inside another async function
aget_x is not a usual function but a coroutine (that is, it is either declared with the async keyword or decorated as a coroutine)
aget_x is able to "communicate" with the event loop: that is, to yield some objects to it. The event loop should be able to interpret these objects as requests to do some operations (e.g. to send a network request and wait for the response, or just to suspend this coroutine for n seconds). The usual get_x function is not able to communicate with the event loop.
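To make this "communication" concrete, here is a toy sketch that drives a coroutine by hand with send(), without asyncio. The Suspend awaitable, the request string, and tiny_loop are all hypothetical names invented for this illustration; a real event loop does far more than this.

```python
class Suspend:
    """Hypothetical awaitable that yields a request object to whoever drives the coroutine."""
    def __init__(self, request):
        self.request = request

    def __await__(self):
        # The bare yield hands self.request to the driver; the value sent back
        # becomes the result of the await expression
        reply = yield self.request
        return reply

async def aget_x():
    reply = await Suspend("please sleep for 1 second")
    return f"x (driver replied: {reply})"

def tiny_loop(coro):
    """A toy 'event loop': record each yielded request, then resume the coroutine."""
    requests = []
    to_send = None  # the first send() must be None to start the coroutine
    while True:
        try:
            request = coro.send(to_send)
        except StopIteration as stop:
            return stop.value, requests  # the coroutine's return value
        requests.append(request)
        to_send = "ok"  # a real loop would perform the requested operation here

result, seen = tiny_loop(aget_x())
print(result, seen)
```

The coroutine's request travels out through the yield inside __await__, and the driver's answer travels back in through send(). A plain function like get_x has no such channel.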