Retrieving data from python's coroutine object - python

I am trying to learn async, and now I am trying to get whois information for a batch of domains. I found this lib aiowhois, but there are only a few strokes of information, not enough for such newbie as I am.
This code works without errors, but I don't know how to print data from parsed whois variable, which is coroutine object.
resolv = aiowhois.Whois(timeout=10)
async def coro(url, sem):
parsed_whois = await resolv.query(url)
async def main():
tasks = []
sem = asyncio.Semaphore(4)
for url in domains:
task = asyncio.Task(coro(url, sem))
tasks.append(task)
await asyncio.gather(*tasks)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())

You can avoid using tasks. Just apply gather to the coroutine directly.
In case you are confused about the difference, this SO QA might help you (especially the second answer).
You can have each coroutine return its result, without resorting to global variables:
async def coro(url):
return await resolv.query(url)
async def main():
domains = ...
ops = [coro(url) for url in domains]
rets = await asyncio.gather(*ops)
print(rets)
Please see the official docs to learn more about how to use gather or wait or even more options
Note: if you are using the latest python versions, you can also simplify the loop running with just
asyncio.run(main())
Note 2: I have removed the semaphore from my code, as it's unclear why you need it and where.

all_parsed_whois = [] # make a global
async def coro(url, sem):
all_parsed_whois.append(await resolv.query(url))
If you want the data as soon as it is available you could task.add_done_callback()
python asyncio add_done_callback with async def

Related

Parallelize requests call with asyncio in Python

I'm trying to create an interface to an API, and I want to have the option to easily run the requests sync or asynchronously, and I came up with the following code.
import asyncio
import requests
def async_run(coro_list):
loop = asyncio.get_event_loop()
futures = [loop.run_in_executor(None, asyncio.run, coro) for coro in coro_list]
result = loop.run_until_complete(asyncio.gather(*futures))
return result
def sync_get(url):
return requests.get(url)
async def async_get(url):
return sync_get(url)
coro_list = [async_get("https://google.com"), async_get("https://google.com")]
responses = async_run(coro_list)
print(responses)
For me it's very intuitive to either call sync_get or create a list of async_get and call async_run, and requires no knowledge of async Python to understand how it works.
The only problem is that loop.run_in_executor(None, asyncio.run, coro) doesn't sound too optimal, and I couldn't find anyone else running this code on Github. So I'm wondering, is there a simpler way to accomplish the objective of abstracting these threading and asyncio concepts in some similar way, or is this code already optimal?
asyncio.run() is usually used as the main entry to run async code from sync code.
loop.run_in_executor(None, asyncio.run, coro) cause an event loop created in executor threads to run coro in coro_list. Why not directly run sync_get in executor threads?
import asyncio
import requests
def async_run(url_list):
loop = asyncio.get_event_loop()
futures = [loop.run_in_executor(None, sync_get, url) for url in url_list]
result = await asyncio.gather(*futures)
return result
def sync_get(url):
return requests.get(url)
#
# async def async_get(url):
# return sync_get(url)
url_list = ["https://google.com", "https://google.com"]
responses = asyncio.run(async_run(url_list))
print(responses)
There are async libaries, eg. aiohttp and httpx, to accomplish similar work.
At the end I chose not to cover completely asyncio under my interface.
Still with the goal of having not having to manage 2 "requests" functions, I made the API async first, and run the synchronous one with asyncio and I ended up with something like this.
def sync_request():
return asyncio.run(async_request(...))
async def async_request():
return await aiohttp.request(...) # pseudo code

When to Use Await Keywork in Python?

I'm currently trying to learn asyncio in Python. I know that the await keyword tells the loop that it can switch coroutines. However, when should I actually use it? Why not put it before everything?
Additionally, why is the await before 'response.text()', why not before the session.get(url)?
async def print_preview(url):
# connect to the server
async with aiohttp.ClientSession() as session:
# create get request
async with session.get(url) as response:
# wait for response
response = await response.text()
# print first 3 not empty lines
count = 0
lines = list(filter(lambda x: len(x) > 0, response.split('\n')))
print('-'*80)
for line in lines[:3]:
print(line)
print()
You use await with functions that are marked as coroutine in the documentation. For example, ClientResponse.text is marked as coroutine, while ClientResponse.close is not, which means you must await the former and must not await the latter. If you forget to await a coroutine, it simply won't execute and its return value will be a "coroutine object", which is useless (except for use with await).
session.get() returns an async context manager. When passed to async with, the coroutines it implements are awaited behind the scenes.
Also note that awaiting is not the only thing you can do with coroutines, the other is converting them into tasks, which allows them to run in parallel (without additional cost on the OS level). For more information, consult a tutorial on asyncio.

asyncio gather yielding results as they come in

I want to be able to yield results from a set of tasks run by gather as they come in for further processing.
# Not real code, but an example
async for response in asyncio.gather(*[aiohttp.get(url) for url in ['https://www.google.com', 'https://www.amazon.com']]):
await process_response(response)
At present, I can use the gather method to run all get concurrently, but must wait until they're all complete to process them. I'm still new to Python async/await, so maybe there's some obvious way of doing this I'm missing.
# What I can do now
responses = await asyncio.gather(*[aiohttp.get(url) for url in ['https://www.google.com', 'https://www.amazon.com']])
await asyncio.gather(*[process_response(response) for response in responses])
Thanks!
gather as you already noted will wait until all coroutines are done, thus you need to find another way.
For example you can use function asyncio.as_completed that seems to do exactly what you want.
import asyncio
async def echo(t):
await asyncio.sleep(t)
return t
async def main():
coros = [
echo(3),
echo(2),
echo(1),
]
for first_completed in asyncio.as_completed(coros):
res = await first_completed
print(f'Done {res}')
asyncio.run(main())
Result:
Done 1
Done 2
Done 3
[Finished in 3 sec]

Parallel asynchronous IO in Python's coroutines

Simple example: I need to make two unrelated HTTP requests in parallel. What's the simplest way to do that? I expect it to be like that:
async def do_the_job():
with aiohttp.ClientSession() as session:
coro_1 = session.get('http://httpbin.org/get')
coro_2 = session.get('http://httpbin.org/ip')
return combine_responses(await coro_1, await coro_2)
In other words, I want to initiate IO operations and wait for their results so they effectively run in parallel. This can be achieved with asyncio.gather:
async def do_the_job():
with aiohttp.ClientSession() as session:
coro_1 = session.get('http://example.com/get')
coro_2 = session.get('http://example.org/tp')
return combine_responses(*(await asyncio.gather(coro_1, coro_2)))
Next, I want to have some complex dependency structure. I want to start operations when I have all prerequisites for them and get results when I need the results. Here helps asyncio.ensure_future which makes separate task from coroutine which is managed by event loop separately:
async def do_the_job():
with aiohttp.ClientSession() as session:
fut_1 = asyncio.ensure_future(session.get('http://httpbin.org/ip'))
coro_2 = session.get('http://httpbin.org/get')
coro_3 = session.post('http://httpbin.org/post', data=(await coro_2)
coro_3_result = await coro_3
return combine_responses(await fut_1, coro_3_result)
Is it true that, to achieve parallel non-blocking IO with coroutines in my logic flow, I have to use either asyncio.ensure_future or asyncio.gather (which actually uses asyncio.ensure_future)? Is there a less "verbose" way?
Is it true that normally developers have to think what coroutines should become separate tasks and use aforementioned functions to gain optimal performance?
Is there a point in using coroutines without multiple tasks in event loop?
How "heavy" are event loop tasks in real life? Surely, they're "lighter" than OS threads or processes. To what extent should I strive for minimal possible number of such tasks?
I need to make two unrelated HTTP requests in parallel. What's the
simplest way to do that?
import asyncio
import aiohttp
async def request(url):
async with aiohttp.ClientSession() as session:
async with session.get(url) as resp:
return await resp.text()
async def main():
results = await asyncio.gather(
request('http://httpbin.org/delay/1'),
request('http://httpbin.org/delay/1'),
)
print(len(results))
loop = asyncio.get_event_loop()
try:
loop.run_until_complete(main())
loop.run_until_complete(loop.shutdown_asyncgens())
finally:
loop.close()
Yes, you may achieve concurrency with asyncio.gather or creating task with asyncio.ensure_future.
Next, I want to have some complex dependency structure? I want to
start operations when I have all prerequisites for them and get
results when I need the results.
While code you provided will do job, it would be nicer to split concurrent flows on different coroutines and again use asyncio.gather:
import asyncio
import aiohttp
async def request(url):
async with aiohttp.ClientSession() as session:
async with session.get(url) as resp:
return await resp.text()
async def get_ip():
return await request('http://httpbin.org/ip')
async def post_from_get():
async with aiohttp.ClientSession() as session:
async with session.get('http://httpbin.org/get') as resp:
get_res = await resp.text()
async with session.post('http://httpbin.org/post', data=get_res) as resp:
return await resp.text()
async def main():
results = await asyncio.gather(
get_ip(),
post_from_get(),
)
print(len(results))
loop = asyncio.get_event_loop()
try:
loop.run_until_complete(main())
loop.run_until_complete(loop.shutdown_asyncgens())
finally:
loop.close()
Is it true that normally developers have to think what coroutines
should become separate tasks and use aforementioned functions to gain
optimal performance?
Since you use asyncio you probably want to run some jobs concurrently to gain performance, right? asyncio.gather is a way to say - "run these jobs concurrently to get their results faster".
In case you shouldn't have to think what jobs should be ran concurrently to gain performance you may be ok with plain sync code.
Is there a point in using coroutines without multiple tasks in event
loop?
In your code you don't have to create tasks manually if you don't want it: both snippets in this answer don't use asyncio.ensure_future. But internally asyncio uses tasks constantly (for example, as you noted asyncio.gather uses tasks itself).
How "heavy" are event loop tasks in real life? Surely, they're
"lighter" than OS threads or processes. To what extent should I strive
for minimal possible number of such tasks?
Main bottleneck in async program is (almost always) network: you shouldn't worry about number of asyncio coroutines/tasks at all.

Python 3.5 async/await with real code example

I've read tons of articles and tutorial about Python's 3.5 async/await thing. I have to say I'm pretty confused, because some use get_event_loop() and run_until_complete(), some use ensure_future(), some use asyncio.wait(), and some use call_soon().
It seems like I have a lot choices, but I have no idea if they are completely identical or there are cases where you use loops and there are cases where you use wait().
But the thing is all examples work with asyncio.sleep() as simulation of real slow operation which returns an awaitable object. Once I try to swap this line for some real code the whole thing fails. What the heck are the differences between approaches written above and how should I run a third-party library which is not ready for async/await. I do use the Quandl service to fetch some stock data.
import asyncio
import quandl
async def slow_operation(n):
# await asyncio.sleep(1) # Works because it's await ready.
await quandl.Dataset(n) # Doesn't work because it's not await ready.
async def main():
await asyncio.wait([
slow_operation("SIX/US9884981013EUR4"),
slow_operation("SIX/US88160R1014EUR4"),
])
# You don't have to use any code for 50 requests/day.
quandl.ApiConfig.api_key = "MY_SECRET_CODE"
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
I hope you get the point how lost I feel and how simple thing I would like to have running in parallel.
If a third-party library is not compatible with async/await then obviously you can't use it easily. There are two cases:
Let's say that the function in the library is asynchronous and it gives you a callback, e.g.
def fn(..., clb):
...
So you can do:
def on_result(...):
...
fn(..., on_result)
In that case you can wrap such functions into the asyncio protocol like this:
from asyncio import Future
def wrapper(...):
future = Future()
def my_clb(...):
future.set_result(xyz)
fn(..., my_clb)
return future
(use future.set_exception(exc) on exception)
Then you can simply call that wrapper in some async function with await:
value = await wrapper(...)
Note that await works with any Future object. You don't have to declare wrapper as async.
If the function in the library is synchronous then you can run it in a separate thread (probably you would use some thread pool for that). The whole code may look like this:
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
# Initialize 10 threads
THREAD_POOL = ThreadPoolExecutor(10)
def synchronous_handler(param1, ...):
# Do something synchronous
time.sleep(2)
return "foo"
# Somewhere else
async def main():
loop = asyncio.get_event_loop()
futures = [
loop.run_in_executor(THREAD_POOL, synchronous_handler, param1, ...),
loop.run_in_executor(THREAD_POOL, synchronous_handler, param1, ...),
loop.run_in_executor(THREAD_POOL, synchronous_handler, param1, ...),
]
await asyncio.wait(futures)
for future in futures:
print(future.result())
with THREAD_POOL:
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
If you can't use threads for whatever reason then using such a library simply makes entire asynchronous code pointless.
Note however that using synchronous library with async is probably a bad idea. You won't get much and yet you complicate the code a lot.
You can take a look at the following simple working example from here. By the way it returns a string worth reading :-)
import aiohttp
import asyncio
async def fetch(client):
async with client.get('https://docs.aiohttp.org/en/stable/client_reference.html') as resp:
assert resp.status == 200
return await resp.text()
async def main():
async with aiohttp.ClientSession() as client:
html = await fetch(client)
print(html)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())

Categories