I'm trying to speed up some code that calls an api_caller(), which is a generator that you can iterate over to get results.
My synchronous code looks something like this:
def process_comment_tree(p):
    # time-consuming breadth-first search that makes another API call...
    return

def process_post(p):
    process_comment_tree(p)

def process_posts(kw):
    for p in api_caller(query=kw):  # possibly 1000s of results
        process_post(p)

def process_kws(kws):
    for kw in kws:
        process_posts(kw)

process_kws(kws=['python', 'threads', 'music'])
When I run this code on a long list of kws, it takes around 18 minutes to complete.
When I use threads:
kws = ['python', 'threads', 'music']
with concurrent.futures.ThreadPoolExecutor(max_workers=len(kws)) as pool:
    for result in pool.map(process_posts, kws):
        print(f'result: {result}')
the code completes in around 3 minutes.
Now, I'm trying to use Trio for the first time, but I'm having trouble.
async def process_comment_tree(p):
    # same as before...
    return

async def process_post(p):
    await process_comment_tree(p)

async def process_posts(kw):
    async with trio.open_nursery() as nursery:
        for p in api_caller(query=kw):
            nursery.start_soon(process_post, p)

async def process_kws(kws):
    async with trio.open_nursery() as nursery:
        for kw in kws:
            nursery.start_soon(process_posts, kw)

trio.run(process_kws, ['python', 'threads', 'music'])
This still takes around 18 minutes to execute. Am I doing something wrong here, or is something like trio/async not appropriate for my problem setup?
Trio, and async libraries in general, work by switching to a different task while waiting for something external, like an API call. In your code example, it looks like you start a bunch of tasks, but none of them ever actually waits on something external, so there is no opportunity to switch. I would recommend reading this part of the tutorial; it gives an idea of what that means: https://trio.readthedocs.io/en/stable/tutorial.html#task-switching-illustrated
Basically, your code has to call a function that will pass control back to the run loop so that it can switch to a different task.
If your api_caller generator makes calls to an external API, those calls are likely something you can replace with async calls. You'll need to use an async HTTP library, like HTTPX or hip.
On the other hand, if there's nothing in your code that has to wait for something external, then async won't help your code go faster.
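For illustration, here's a minimal sketch of what an async api_caller might look like using HTTPX under Trio. The endpoint URL, pagination scheme, and parameter names are all hypothetical; process_post is the function from your own code:

import httpx
import trio

async def api_caller(client, query):
    url = 'https://api.example.com/search'  # hypothetical endpoint
    page = 0
    while True:
        # awaiting the request hands control back to Trio's scheduler,
        # so other tasks can run while this one waits on the network
        resp = await client.get(url, params={'q': query, 'page': page})
        results = resp.json()
        if not results:
            return
        for item in results:
            yield item
        page += 1

async def process_posts(kw):
    async with httpx.AsyncClient() as client:
        async with trio.open_nursery() as nursery:
            async for p in api_caller(client, kw):
                nursery.start_soon(process_post, p)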
Related
I'm having problems wrapping an external task to parallelize it. I'm a newbie with asyncio so maybe I'm doing something wrong:
I have an animate method that I have also declared as async.
But that calls an external library that uses various iterators etc.
I'm wondering if something in a library is able to block asyncio at the top level?
animate(item) is the problem. I expected that if I define it as another async task, multiple calls will run concurrently, to be 'gather'ed later.
So am I doing it wrong, or is it possible the library has been written such that it can't simply be parallelized with asyncio?
I also tried wrapping the call to animate with another async method, without luck.
MAX_JOBS = 1        # how long for
ITEMS_PER_JOB = 4   # how many images per job/user request, e.g. for packs

async def main():
    for i in range(0, MAX_JOBS):
        clogger.info('job index', i)
        job = get_next()
        await process_job(job)

async def process_job(job):
    batch = generate_batch(job)
    coros = [animate(item) for idx, item in enumerate(batch)]
    await asyncio.gather(*coros)  # note: gather must be awaited to run at all

asyncio.run(main())
The animate func has internals like:
async def animate(options):
    for frame in tqdm(animator.render(), initial=animator.start_frame_idx, total=args.max_frames):
        pass
OK, never mind: it seems libraries have to be written with coroutines to cooperate with asyncio, but there are other options, like
to_thread
run_in_executor
Not sure which is best in 2023, though.
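For reference, a minimal sketch of the to_thread approach (available since Python 3.9; on older versions, loop.run_in_executor does the same job). This assumes animate is rewritten as a plain synchronous function, since the library blocks anyway, and reuses generate_batch from the code above:

import asyncio

def animate(item):
    # blocking call into the external rendering library
    ...

async def process_job(job):
    batch = generate_batch(job)
    # each blocking animate() call runs in its own worker thread,
    # so the event loop stays responsive and the calls can overlap
    coros = [asyncio.to_thread(animate, item) for item in batch]
    await asyncio.gather(*coros)

Note this only helps if the library releases the GIL during its heavy work (e.g. I/O or native code); for pure-Python CPU work, a ProcessPoolExecutor is the better fit.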
The tasks from asyncio.gather do not run concurrently
There are many posts on SO asking specific questions about asyncio, but I cannot grasp which approach to use in a given situation.
Let's say I want to parse and crawl a number of web pages in parallel. I can do this in at least 3 different ways with asyncio:
With pool.submit:
with ThreadPoolExecutor(max_workers=10) as pool:
    result_futures = list(map(lambda x: pool.submit(my_func, x), my_list))
    for future in as_completed(result_futures):
        results.append(future.result())
    return results
With asyncio.gather:
loop = asyncio.get_running_loop()
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [loop.run_in_executor(pool, my_func, x) for x in my_list]
    results = await asyncio.gather(*futures)
With just pool.map:
with ThreadPoolExecutor(max_workers=10) as pool:
    results = [x for x in pool.map(my_func, my_list)]
my_func is something like
async def my_func(arg):
    async with aiohttp.ClientSession() as session:
        async with session.post(...):
            ...
Could somebody help me understand what would be the differences between those 3 approaches? I understand that I can, for example, handle exceptions independently in the first one, but any other differences?
None of these. ThreadPoolExecutor and run_in_executor will all execute your code in another thread, regardless of whether you use the asyncio loop to watch their execution. At that point you might as well not use asyncio at all: the whole idea of async is precisely to run everything in a single thread, saving some CPU cycles and avoiding many of the race conditions that emerge in multi-threaded code.
If your my_func is using async correctly all the way down (it looks like it is, but the code is incomplete), you have to create an asyncio Task for each call to your async function. For that, the shortest path is indeed asyncio.gather:
import asyncio
import aiohttp  # and whatever else "my_func" uses

async def my_func(x):
    ...

async def main(my_list):
    return await asyncio.gather(*(my_func(x) for x in my_list))

my_list = ...
results = asyncio.run(main(my_list))
And that is all there is to it.
Now going back to your code, and checking the differences:
Your code works almost by chance: you really just passed the async function and its parameters to the thread pool executor. When an async function is called this way, it returns immediately, with no work done. That means nothing (except some thin boilerplate code that creates the coroutines) is executed in your thread pool. The values returned by the calls that run in the target threads (i.e. the actual my_func(x) calls) are coroutine objects: these are the objects that must be awaited in the main thread, and it is the awaiting that actually performs the network I/O. That is: your my_func is a "coroutine function", and when called it returns immediately with a "coroutine object"; only when that coroutine object is awaited is the code inside my_func actually executed.
Now, with that out of the way: in your first snippet you call future.result() on the concurrent.futures Future, and that just gives you the coroutine object, so that code does not work. If you wrote results.append(await future.result()) then, barring exceptions, it would work, but it would make all the calls in sequence: await suspends the current coroutine until the awaited object resolves, and since all the awaiting happens in this same loop, the calls queue up and execute in order, with zero parallelism.
Your pool.map code does the same. Your asyncio.gather code is wrong in a different way: loop.run_in_executor will take your call, run it on another thread, and give you an awaitable object suitable for use with gather. However, awaiting it returns the "coroutine object", not the result of the HTTP call.
Your real options for handling exceptions raised in the concurrent code are asyncio.gather, asyncio.wait, or asyncio.as_completed. Check the docs here: https://docs.python.org/3/library/asyncio-task.html
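For example, here is a minimal sketch of per-call exception handling with asyncio.as_completed, reusing my_func and my_list from the question:

import asyncio

async def main(my_list):
    results = []
    # as_completed yields awaitables in completion order, so each
    # call's exception can be caught and handled independently
    for fut in asyncio.as_completed([my_func(x) for x in my_list]):
        try:
            results.append(await fut)
        except Exception as exc:
            results.append(exc)  # or log it, retry, etc.
    return results

results = asyncio.run(main(my_list))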
I've recently had a run-in with asynchronous functions in Python, and I wonder how one could turn a synchronous function into an asynchronous one.
For example, take the library for translation via the Google API, pygoogletranslation. One might wonder how to translate many different words asynchronously. Of course, you could place them all in one request, but then the Google API would consider it a text and treat it accordingly, which would produce incorrect results.
How could one turn this code:
from pygoogletranslation import Translator

translator = Translator()
translations = []
words = ['partying', 'sightseeing', 'sleeping', 'catering']
for word in words:
    translations.append(translator.translate(word, src='en', dest='es'))
print(translations)
Into this:
from pygoogletranslation import Translator
import asyncio

translator = Translator()
translation_tasks = []
words = ['partying', 'sightseeing', 'sleeping', 'catering']
for word in words:
    translation_tasks.append(
        asyncio.create_task(translator.translate(word, src='en', dest='es'))
    )
translations = asyncio.run(
    asyncio.gather(*translation_tasks, return_exceptions=True)
)
print(translations)
Considering the function translate doesn't have a built-in async implementation?
You will have to create an async function and then run it. Though if translate doesn't have built-in async support or is blocking, using async will not make it faster; it's probably better to use multithreading/multiprocessing as suggested in the comments.
async def main():
    async def one_iteration(word):
        output.append(translator.translate(word, src='en', dest='es'))
    coros = [one_iteration(word) for word in words]
    await asyncio.gather(*coros)

asyncio.run(main())
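For the multithreading route, a minimal sketch using asyncio.to_thread (Python 3.9+) to push each blocking translate call onto a worker thread, so the HTTP waits overlap instead of running one after another:

import asyncio
from pygoogletranslation import Translator

translator = Translator()
words = ['partying', 'sightseeing', 'sleeping', 'catering']

async def main():
    # each blocking translate() call runs in a worker thread;
    # the event loop starts the next one instead of waiting
    coros = [asyncio.to_thread(translator.translate, word, src='en', dest='es')
             for word in words]
    return await asyncio.gather(*coros)

translations = asyncio.run(main())
print(translations)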
As mentioned in other answers, calling a blocking function is useless with asyncio. In this particular case, I suggest you use google-cloud-translate, which is the official translation library from Google.
You could have done something like this in your current library:
async def do_task(word):
    return translator.translate(word, ...)

async def main():
    # Create translator
    ...
    await asyncio.gather(*(do_task(word) for word in words))
But this will just run the tasks the same way as without asyncio. The real gain with asyncio is that while something is pending or waiting, it can do something else; e.g., while waiting for a response from one server, it can send another request.
How will Python know that some work is pending? Only when the function (a coroutine here) notifies the event loop via the await keyword. So you definitely need to use a library that natively supports async operations. The above-mentioned google-cloud-translate is such a library. You can do:
import asyncio
from google.cloud import translate

async def main():
    # Async-supported Google translate client
    client = translate.TranslationServiceAsyncClient()
    words = ['partying', 'sightseeing', 'sleeping', 'catering']
    results = await asyncio.gather(*[
        client.translate_text(parent=f"projects/{project_name}", contents=[word],
                              source_language_code="en", target_language_code="es")
        for word in words
    ])
    print(results)

asyncio.run(main())
You can see that this client actually takes a list of strings as input, so you could directly pass the whole list here. According to the docs, the limit for that is 1024 strings per request, so if your list is bigger than that you still have to split it into multiple requests.
You might have to set up credentials etc. for this client, though; that is outside the scope of this question.
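If your list is longer than that limit, here is a minimal sketch of splitting it into chunks and sending the requests concurrently. The 1024 figure comes from the docs quote above; translate_many is an illustrative helper name, and project_name is assumed to be configured as in the previous snippet:

async def translate_many(client, words, chunk_size=1024):
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    # one request per chunk, all in flight at the same time
    return await asyncio.gather(*[
        client.translate_text(parent=f"projects/{project_name}", contents=chunk,
                              source_language_code="en", target_language_code="es")
        for chunk in chunks
    ])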
To make a function async, you need to define it with async def and change it to use other async functions for anything that might block - for example, instead of requests you'd use aiohttp, and so on. The point of the effort is that the function can then be executed by an event loop along with other such functions. Whenever an async function needs to wait for something, as signaled by the await keyword, it suspends to the event loop and gives others a chance to execute. The event loop will seamlessly coordinate concurrent execution of a possibly large number of such async functions. See e.g. this answer for more details.
If a critical blocking function that you depend on doesn't have an async implementation, you can use run_in_executor (or, beginning with Python 3.9, asyncio.to_thread) to make it async. Note, however, that such solutions are "cheating" because they use threads under the hood, so they will not provide the benefits normally associated with asyncio, such as the ability to scale beyond the number of threads in the thread pool, or the ability to cancel execution of coroutines.
This is gonna be a bad explanation but I don't know how else to word this so bear with me please.
I have one function:
async def request():
    # this can only be called n times at once
    ...
But as it says it can only be called n times at once. Is it possible to have some sort of pool with a limited number of objects so I can do this:
async def request():
    async with poolOfOneHundred.acquire():
        ...  # do something

and then Python would acquire 100 of these; once it got to the 101st call, it would wait at the async with statement until another request() finished and a slot in the pool freed up.
Is this a thing? If not, how could I implement something like this?
Does this make any sense?
You're looking for asyncio.Semaphore.
Here's a minimal sketch of how to use it (the limit of 100 and the simulated workload are illustrative):
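import asyncio

sem = asyncio.Semaphore(100)  # at most 100 request() calls run at once

async def request(i):
    async with sem:  # the 101st caller waits here until a slot frees up
        await asyncio.sleep(1)  # stand-in for the real work
        return i

async def main():
    # launch 250 calls; they proceed 100 at a time
    results = await asyncio.gather(*(request(i) for i in range(250)))
    print(len(results))

asyncio.run(main())

On older Python versions, create the semaphore inside the running loop (e.g. inside main()) to avoid loop-binding issues; since Python 3.10 it binds to the running loop on first use.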
I'm getting the flow of using asyncio in Python 3.5, but I haven't seen a description of what things I should await, what things I should not, and where it would be negligible. Do I just have to use my best judgement, as in "this is an I/O operation and thus should be awaited"?
By default all your code is synchronous. You can make it asynchronous by defining functions with async def and "calling" these functions with await. A more correct question would be "When should I write asynchronous code instead of synchronous?". The answer is "When you can benefit from it". In cases when you work with I/O operations, as you noted, you will usually benefit:
# Synchronous way:
download(url1)  # takes 5 sec.
download(url2)  # takes 5 sec.
# Total time: 10 sec.

# Asynchronous way:
await asyncio.gather(
    async_download(url1),  # takes 5 sec.
    async_download(url2),  # takes 5 sec.
)
# Total time: only 5 sec. (+ little overhead for using asyncio)
Of course, if you create a function that uses asynchronous code, this function should be asynchronous too (defined with async def). But any asynchronous function can freely use synchronous code. It makes no sense to cast synchronous code to asynchronous without a reason:
# extract_links(url) should be async because it uses the async func async_download() inside
async def extract_links(url):
    # async_download() was made async to get the benefit of I/O
    html = await async_download(url)
    # parse() doesn't work with I/O; there's no sense in making it async
    links = parse(html)
    return links
One very important thing is that any long synchronous operation (> 50 ms, for example; it's hard to say exactly) will freeze all your asynchronous operations for that time:
async def extract_links(url):
    data = await download(url)
    links = parse(data)
    # if search_in_very_big_file() takes much time to process,
    # all your running async funcs (somewhere else in the code) will be frozen;
    # you need to avoid this situation
    links_found = search_in_very_big_file(links)
You can avoid this by running long synchronous operations in a separate process (and awaiting the result):
executor = ProcessPoolExecutor(2)

async def extract_links(url):
    data = await download(url)
    links = parse(data)
    # now the main process can handle other async functions
    # while the separate process is running
    links_found = await loop.run_in_executor(executor, search_in_very_big_file, links)
One more example: when you need to use requests with asyncio. requests.get is just a synchronous long-running function, which you shouldn't call inside async code (again, to avoid freezing). But it runs long because of I/O, not because of long calculations. In that case, you can use ThreadPoolExecutor instead of ProcessPoolExecutor to avoid some multiprocessing overhead:
executor = ThreadPoolExecutor(2)

async def download(url):
    response = await loop.run_in_executor(executor, requests.get, url)
    return response.text
You do not have much freedom. If you need to call a function you need to find out if this is a usual function or a coroutine. You must use the await keyword if and only if the function you are calling is a coroutine.
If async functions are involved there should be an "event loop" that orchestrates them. Strictly speaking it's not necessary; you can "manually" run a coroutine by sending values to it, but probably you don't want to do that. The event loop keeps track of not-yet-finished coroutines and chooses the next one to continue running. The asyncio module provides an implementation of an event loop, but it is not the only possible implementation.
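As an aside, here's a minimal sketch of that "manual" driving: a coroutine is just an object you can step with .send(), no event loop required (asyncio.sleep(0) suspends exactly once and doesn't need a loop):

import asyncio

async def aget_x():
    await asyncio.sleep(0)  # suspends once, yielding control
    return 42

coro = aget_x()
coro.send(None)          # run up to the first suspension point
try:
    coro.send(None)      # resume; the coroutine runs to completion
except StopIteration as stop:
    print(stop.value)    # 42, the coroutine's return value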
Consider these two lines of code:
x = get_x()
do_something_else()
and
x = await aget_x()
do_something_else()
The semantics are exactly the same: call a method that produces some value, assign the value to variable x when it's ready, and do something else. In both cases the do_something_else function will be called only after the previous line of code has finished. It doesn't even mean that control will be yielded to the event loop before, after, or during the execution of the asynchronous aget_x method; the coroutine may run to completion without ever suspending.
Still, there are some differences:
the second snippet can appear only inside another async function;
aget_x is not a usual function but a coroutine (that is, either declared with the async keyword or decorated as a coroutine);
aget_x is able to "communicate" with the event loop, that is, yield some objects to it. The event loop should be able to interpret these objects as requests to perform some operations (e.g. to send a network request and wait for the response, or just to suspend this coroutine for n seconds). A usual get_x function is not able to communicate with the event loop.
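To see the difference in action, here is a minimal self-contained sketch: two coroutines suspend at their await points, and the event loop advances the other one in the meantime, so both finish in roughly one second instead of two:

import asyncio

async def aget_x(name, delay):
    # 'await' suspends this coroutine and hands control back
    # to the event loop until the sleep resolves
    await asyncio.sleep(delay)
    print(f'{name} resumed after {delay}s')
    return delay

async def main():
    # both coroutines share one thread; while one is suspended,
    # the event loop runs the other
    results = await asyncio.gather(aget_x('a', 1), aget_x('b', 1))
    print(results)

asyncio.run(main())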