Parallelize requests call with asyncio in Python

I'm trying to create an interface to an API, and I want the option to easily run the requests synchronously or asynchronously. I came up with the following code:
import asyncio
import requests
def async_run(coro_list):
    loop = asyncio.get_event_loop()
    futures = [loop.run_in_executor(None, asyncio.run, coro) for coro in coro_list]
    result = loop.run_until_complete(asyncio.gather(*futures))
    return result
def sync_get(url):
    return requests.get(url)
async def async_get(url):
    return sync_get(url)
coro_list = [async_get("https://google.com"), async_get("https://google.com")]
responses = async_run(coro_list)
print(responses)
For me it's very intuitive to either call sync_get or create a list of async_get and call async_run, and requires no knowledge of async Python to understand how it works.
The only problem is that loop.run_in_executor(None, asyncio.run, coro) doesn't look optimal, and I couldn't find anyone else using this pattern on GitHub. So I'm wondering: is there a simpler way to abstract these threading and asyncio concepts in a similar way, or is this code already optimal?

asyncio.run() is usually used as the main entry point for running async code from sync code.
loop.run_in_executor(None, asyncio.run, coro) creates a new event loop in each executor thread just to run one coroutine from coro_list. Why not run sync_get directly in the executor threads?
import asyncio
import requests
async def async_run(url_list):
    # Must be "async def" because the body uses await.
    loop = asyncio.get_running_loop()
    # Each blocking sync_get call runs in the default thread pool.
    futures = [loop.run_in_executor(None, sync_get, url) for url in url_list]
    result = await asyncio.gather(*futures)
    return result
def sync_get(url):
    return requests.get(url)
# async_get is no longer needed:
# async def async_get(url):
#     return sync_get(url)
url_list = ["https://google.com", "https://google.com"]
responses = asyncio.run(async_run(url_list))
print(responses)
There are also async libraries, e.g. aiohttp and httpx, that accomplish similar work.

In the end I chose not to hide asyncio completely behind my interface.
Still aiming to avoid maintaining two separate "requests" functions, I made the API async-first and run the synchronous version through asyncio.run. I ended up with something like this:
def sync_request():
    return asyncio.run(async_request(...))
async def async_request():
    return await aiohttp.request(...)  # pseudo code
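A minimal runnable sketch of this async-first layout, assuming a stand-in coroutine in place of the real HTTP call. Here asyncio.sleep simulates the network, and the names fetch_many and the returned strings are hypothetical, not part of the original code:

```python
import asyncio


async def async_request(url):
    # Stand-in for a real async HTTP call; asyncio.sleep simulates latency.
    await asyncio.sleep(0.01)
    return f"response from {url}"


def sync_request(url):
    # Synchronous facade: start an event loop, run the coroutine, return its result.
    # asyncio.run() raises RuntimeError when called from inside a running event loop,
    # so this wrapper is only meant for plain synchronous callers.
    return asyncio.run(async_request(url))


async def fetch_many(urls):
    # Async callers can overlap several requests with gather.
    return await asyncio.gather(*(async_request(u) for u in urls))


if __name__ == "__main__":
    print(sync_request("https://example.com"))
    print(asyncio.run(fetch_many(["https://example.com/a", "https://example.com/b"])))
```

The design keeps one implementation (the coroutine) and derives the blocking variant from it, so there is only one code path to maintain.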

Related

Fetch file from localhost without web server

I want to fetch a file from localhost asynchronously, without a web server. It seems this should be possible using the file:// scheme. The following code sample is adapted from the documentation, but it doesn't work:
import aiohttp
import asyncio
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()
async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'file://localhost/Users/user/test.txt')
        print(html)
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
How can I make it work?
The only way I see is to run "curl file://path" in a separate thread pool using run_in_executor, but I think there should be a way to fix the code.

If you need to obtain the contents of a local file, you can do it with ordinary Python built-ins, such as:
with open('/Users/user/test.txt') as rd:
    html = rd.read()
If the file is not very large, and is stored on a local filesystem, you don't even need to make it async, as reading it will be fast enough not to disturb the event loop. If the file is large or reading it might be slow for other reasons, you should read it through run_in_executor to prevent it from blocking other asyncio code. For example (untested):
import asyncio
def read_file_sync(file_name):
    # Open the file that was passed in, not a hard-coded path.
    with open(file_name) as rd:
        return rd.read()
async def read_file(file_name):
    loop = asyncio.get_event_loop()
    # Run the blocking read in the default thread pool.
    html = await loop.run_in_executor(None, read_file_sync, file_name)
    return html
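As a usage sketch of the run_in_executor approach above, the following reads several files concurrently. The temporary files, their contents, and the helper names read_files and demo are assumptions made only so the example is self-contained:

```python
import asyncio
import os
import tempfile


def read_file_sync(file_name):
    # Ordinary blocking read.
    with open(file_name, encoding="utf-8") as rd:
        return rd.read()


async def read_files(paths):
    loop = asyncio.get_running_loop()
    # One executor job per file; gather returns results in the order of paths.
    jobs = [loop.run_in_executor(None, read_file_sync, p) for p in paths]
    return await asyncio.gather(*jobs)


def demo():
    # Create three small files in a temporary directory, then read them back.
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = []
        for i in range(3):
            path = os.path.join(tmp_dir, f"file{i}.txt")
            with open(path, "w", encoding="utf-8") as wr:
                wr.write(f"contents {i}")
            paths.append(path)
        return asyncio.run(read_files(paths))


if __name__ == "__main__":
    print(demo())
```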

python ThreadPoolExecutor close all threads when I get a result

I am running a web scraper class whose method is named self.get_with_random_proxy_using_chain.
I am trying to send multithreaded calls to the same url, and would like that once there is a result from any thread, the method returns a response and closes other still active threads.
So far my code looks like this (probably naive):
from concurrent.futures import ThreadPoolExecutor, as_completed
from os import cpu_count
from time import sleep
# class initiation etc
max_workers = cpu_count() * 5
urls = [url_to_open] * 50
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_url = []
    for url in urls:  # I had to use a loop with a sleep so as not to overload the proxy server
        future_to_url.append(executor.submit(self.get_with_random_proxy_using_chain,
                                             url,
                                             timeout,
                                             update_proxy_score,
                                             unwanted_keywords,
                                             unwanted_status_codes,
                                             random_universe_size,
                                             file_path_to_save_streamed_content))
        sleep(0.5)
    for future in as_completed(future_to_url):
        if future.result() is not None:
            return future.result()
But it runs all the threads.
Is there a way to close all threads once the first future has completed?
I am using Windows and Python 3.7.x.
So far I found this link, but I can't get it to work (the program still runs for a long time).

As far as I know, running futures cannot be cancelled. Quite a lot has been written about this. And there are even some workarounds.
But I would suggest taking a closer look at the asyncio module. It is quite well suited for such tasks.
Below is a simple example in which several concurrent requests are made and, upon receiving the first result, the rest are canceled.
import asyncio
from typing import Set
from aiohttp import ClientSession
async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()
async def wait_for_first_response(tasks):
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for p in pending:
        p.cancel()
    return done.pop().result()
async def request_one_of(*urls):
    tasks = set()
    async with ClientSession() as session:
        for url in urls:
            task = asyncio.create_task(fetch(url, session))
            tasks.add(task)
        return await wait_for_first_response(tasks)
async def main():
    response = await request_one_of("https://wikipedia.org", "https://apple.com")
    print(response)
asyncio.run(main())
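If the thread pool has to stay, one possible workaround is cooperative cancellation: the workers check a shared threading.Event and exit early once it is set. The sketch below assumes the worker can poll a flag between steps, which a single blocking HTTP call cannot do on its own. The worker here only simulates work by sleeping in small slices, and the names worker and first_result are hypothetical:

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def worker(delay, stop_event):
    # Simulated job: sleeps in short slices and gives up once stop_event is set.
    deadline = time.monotonic() + delay
    while time.monotonic() < deadline:
        if stop_event.is_set():
            return None
        time.sleep(0.01)
    return delay


def first_result(delays):
    stop_event = threading.Event()
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        futures = [executor.submit(worker, d, stop_event) for d in delays]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        stop_event.set()      # ask running workers to stop
        for future in pending:
            future.cancel()   # only has an effect on jobs that have not started yet
        return next(iter(done)).result()


if __name__ == "__main__":
    print(first_result([0.5, 0.05, 0.5]))
```

Leaving the with block still waits for every worker thread, but they return almost immediately because each one sees the event.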

Retrieving data from python's coroutine object

I am trying to learn async, and now I am trying to get whois information for a batch of domains. I found the aiowhois library, but its documentation is only a few lines long, which is not enough for a newbie like me.
This code runs without errors, but I don't know how to print the data from the parsed_whois variable, which is a coroutine object.
resolv = aiowhois.Whois(timeout=10)
async def coro(url, sem):
    parsed_whois = await resolv.query(url)
async def main():
    tasks = []
    sem = asyncio.Semaphore(4)
    for url in domains:
        task = asyncio.Task(coro(url, sem))
        tasks.append(task)
    await asyncio.gather(*tasks)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())

You can avoid using tasks. Just apply gather to the coroutine directly.
In case you are confused about the difference, this SO QA might help you (especially the second answer).
You can have each coroutine return its result, without resorting to global variables:
async def coro(url):
    return await resolv.query(url)
async def main():
    domains = ...
    ops = [coro(url) for url in domains]
    rets = await asyncio.gather(*ops)
    print(rets)
Please see the official docs to learn more about gather, wait, and the other available options.
Note: if you are using Python 3.7 or later, you can also simplify running the loop with just
asyncio.run(main())
Note 2: I have removed the semaphore from my code, as it's unclear why you need it and where.
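If the semaphore in the question was meant to cap the number of simultaneous queries, one way to keep it alongside gather is sketched below. The fake_query coroutine is a stand-in for resolv.query, and the limit of 4 simply mirrors the question; both are assumptions:

```python
import asyncio


async def fake_query(domain):
    # Stand-in for resolv.query(domain); asyncio.sleep simulates the lookup.
    await asyncio.sleep(0.01)
    return f"whois data for {domain}"


async def limited_query(domain, sem):
    # At most `limit` coroutines are inside this block at any moment.
    async with sem:
        return await fake_query(domain)


async def main(domains, limit=4):
    sem = asyncio.Semaphore(limit)
    # gather returns the results in the same order as the input domains.
    return await asyncio.gather(*(limited_query(d, sem) for d in domains))


if __name__ == "__main__":
    for line in asyncio.run(main(["example.com", "example.org", "example.net"])):
        print(line)
```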

all_parsed_whois = []  # make a global
async def coro(url, sem):
    all_parsed_whois.append(await resolv.query(url))
If you want the data as soon as it is available, you could use task.add_done_callback(); see the related question "python asyncio add_done_callback with async def".
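A small sketch of the add_done_callback idea, again with a stand-in coroutine instead of resolv.query. The domain names and delays are made up so that the completion order is predictable:

```python
import asyncio


async def fake_query(domain, delay):
    # Stand-in for resolv.query(domain); the delay simulates lookup time.
    await asyncio.sleep(delay)
    return f"whois data for {domain}"


async def main():
    arrived = []  # results in completion order, not submission order

    def on_done(task):
        # Called by the event loop as soon as the task finishes.
        arrived.append(task.result())

    tasks = []
    for domain, delay in [("slow.example", 0.05), ("fast.example", 0.01)]:
        task = asyncio.create_task(fake_query(domain, delay))
        task.add_done_callback(on_done)
        tasks.append(task)
    await asyncio.gather(*tasks)
    return arrived


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The callback receives the finished task, so task.result() is available immediately (and re-raises if the task failed).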

Parallel asynchronous IO in Python's coroutines

Simple example: I need to make two unrelated HTTP requests in parallel. What's the simplest way to do that? I expect it to be like that:
async def do_the_job():
    async with aiohttp.ClientSession() as session:
        coro_1 = session.get('http://httpbin.org/get')
        coro_2 = session.get('http://httpbin.org/ip')
        return combine_responses(await coro_1, await coro_2)
In other words, I want to initiate IO operations and wait for their results so they effectively run in parallel. This can be achieved with asyncio.gather:
async def do_the_job():
    async with aiohttp.ClientSession() as session:
        coro_1 = session.get('http://example.com/get')
        coro_2 = session.get('http://example.org/tp')
        return combine_responses(*(await asyncio.gather(coro_1, coro_2)))
Next, I want a more complex dependency structure: start each operation as soon as all its prerequisites are available, and collect results only when I need them. This is where asyncio.ensure_future helps: it wraps a coroutine in a separate task that the event loop schedules independently:
async def do_the_job():
    async with aiohttp.ClientSession() as session:
        fut_1 = asyncio.ensure_future(session.get('http://httpbin.org/ip'))
        coro_2 = session.get('http://httpbin.org/get')
        coro_3 = session.post('http://httpbin.org/post', data=(await coro_2))
        coro_3_result = await coro_3
        return combine_responses(await fut_1, coro_3_result)
Is it true that, to achieve parallel non-blocking IO with coroutines in my logic flow, I have to use either asyncio.ensure_future or asyncio.gather (which actually uses asyncio.ensure_future)? Is there a less "verbose" way?
Is it true that normally developers have to think what coroutines should become separate tasks and use aforementioned functions to gain optimal performance?
Is there a point in using coroutines without multiple tasks in event loop?
How "heavy" are event loop tasks in real life? Surely, they're "lighter" than OS threads or processes. To what extent should I strive for minimal possible number of such tasks?

I need to make two unrelated HTTP requests in parallel. What's the
simplest way to do that?
import asyncio
import aiohttp
async def request(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()
async def main():
    results = await asyncio.gather(
        request('http://httpbin.org/delay/1'),
        request('http://httpbin.org/delay/1'),
    )
    print(len(results))
loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
    loop.run_until_complete(loop.shutdown_asyncgens())
finally:
    loop.close()
Yes, you can achieve concurrency with asyncio.gather or by creating a task with asyncio.ensure_future.
Next, I want to have some complex dependency structure. I want to
start operations when I have all prerequisites for them and get
results when I need the results.
While the code you provided will do the job, it would be nicer to split the concurrent flows into separate coroutines and again use asyncio.gather:
import asyncio
import aiohttp
async def request(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.text()
async def get_ip():
    return await request('http://httpbin.org/ip')
async def post_from_get():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://httpbin.org/get') as resp:
            get_res = await resp.text()
        async with session.post('http://httpbin.org/post', data=get_res) as resp:
            return await resp.text()
async def main():
    results = await asyncio.gather(
        get_ip(),
        post_from_get(),
    )
    print(len(results))
loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(main())
    loop.run_until_complete(loop.shutdown_asyncgens())
finally:
    loop.close()
Is it true that normally developers have to think what coroutines
should become separate tasks and use aforementioned functions to gain
optimal performance?
Since you use asyncio, you probably want to run some jobs concurrently to gain performance, right? asyncio.gather is a way to say "run these jobs concurrently to get their results faster".
If you don't need to think about which jobs should run concurrently to gain performance, you may be fine with plain sync code.
Is there a point in using coroutines without multiple tasks in event
loop?
In your code you don't have to create tasks manually if you don't want to: neither snippet in this answer uses asyncio.ensure_future. But internally asyncio uses tasks constantly (for example, as you noted, asyncio.gather uses tasks itself).
How "heavy" are event loop tasks in real life? Surely, they're
"lighter" than OS threads or processes. To what extent should I strive
for minimal possible number of such tasks?
The main bottleneck in an async program is (almost always) the network: you shouldn't worry about the number of asyncio coroutines/tasks at all.
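To get a rough feel for how cheap tasks are, the sketch below creates ten thousand of them and reports the elapsed time. The task count and the tiny_job coroutine are arbitrary choices for illustration, and the timing will vary by machine:

```python
import asyncio
import time


async def tiny_job(i):
    # Yield to the event loop once, then return.
    await asyncio.sleep(0)
    return i


async def main(n=10_000):
    # One task per job; all of them are scheduled on the same event loop.
    tasks = [asyncio.create_task(tiny_job(i)) for i in range(n)]
    results = await asyncio.gather(*tasks)
    return sum(results)


if __name__ == "__main__":
    start = time.perf_counter()
    total = asyncio.run(main())
    print(f"sum={total}, elapsed={time.perf_counter() - start:.3f}s")
```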

Python 3.5 async/await with real code example

I've read tons of articles and tutorials about Python 3.5's async/await. I have to say I'm pretty confused, because some use get_event_loop() and run_until_complete(), some use ensure_future(), some use asyncio.wait(), and some use call_soon().
It seems like I have a lot of choices, but I have no idea whether they are all equivalent or whether some cases call for loops and others for wait().
The thing is, all the examples use asyncio.sleep() to simulate a real slow operation that returns an awaitable object. Once I try to swap this line for some real code, the whole thing fails. What are the differences between the approaches listed above, and how should I run a third-party library that is not ready for async/await? I use the Quandl service to fetch some stock data.
import asyncio
import quandl
async def slow_operation(n):
    # await asyncio.sleep(1)  # Works because it's await ready.
    await quandl.Dataset(n)  # Doesn't work because it's not await ready.
async def main():
    await asyncio.wait([
        slow_operation("SIX/US9884981013EUR4"),
        slow_operation("SIX/US88160R1014EUR4"),
    ])
# You don't have to use any code for 50 requests/day.
quandl.ApiConfig.api_key = "MY_SECRET_CODE"
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
I hope this shows how lost I feel and how simple the thing I want to run in parallel is.

If a third-party library is not compatible with async/await then obviously you can't use it easily. There are two cases:
Let's say that the function in the library is asynchronous and it gives you a callback, e.g.
def fn(..., clb):
    ...
So you can do:
def on_result(...):
    ...
fn(..., on_result)
In that case you can wrap such functions into the asyncio protocol like this:
from asyncio import Future
def wrapper(...):
    future = Future()
    def my_clb(...):
        future.set_result(xyz)
    fn(..., my_clb)
    return future
(use future.set_exception(exc) on exception)
Then you can simply call that wrapper in some async function with await:
value = await wrapper(...)
Note that await works with any Future object. You don't have to declare wrapper as async.
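A runnable sketch of the wrapper pattern above, under the assumption that the library invokes its callback from another thread. The legacy_fetch function is a hypothetical stand-in for such a library, built on threading.Timer. Because asyncio futures are not thread-safe, the result is handed back to the loop with call_soon_threadsafe:

```python
import asyncio
import threading


def legacy_fetch(value, callback):
    # Hypothetical callback-style API: calls callback(result) from a timer thread.
    threading.Timer(0.05, callback, args=(value * 2,)).start()


def fetch_future(value):
    # Plain (non-async) function that returns an awaitable Future.
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    def on_result(result):
        # Runs in the timer thread, so hand the result over to the loop thread.
        loop.call_soon_threadsafe(future.set_result, result)

    legacy_fetch(value, on_result)
    return future


async def main():
    return await fetch_future(21)


if __name__ == "__main__":
    print(asyncio.run(main()))  # prints 42
```

If the library calls back on the event loop thread itself, future.set_result(result) can be called directly instead.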
If the function in the library is synchronous then you can run it in a separate thread (probably you would use some thread pool for that). The whole code may look like this:
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
# Initialize 10 threads
THREAD_POOL = ThreadPoolExecutor(10)
def synchronous_handler(param1, ...):
    # Do something synchronous
    time.sleep(2)
    return "foo"
# Somewhere else
async def main():
    loop = asyncio.get_event_loop()
    futures = [
        loop.run_in_executor(THREAD_POOL, synchronous_handler, param1, ...),
        loop.run_in_executor(THREAD_POOL, synchronous_handler, param1, ...),
        loop.run_in_executor(THREAD_POOL, synchronous_handler, param1, ...),
    ]
    await asyncio.wait(futures)
    for future in futures:
        print(future.result())
with THREAD_POOL:
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
If you can't use threads for whatever reason, then using such a library makes the asynchronous code pointless.
Note, however, that using a synchronous library with async is probably a bad idea. You won't gain much, and you complicate the code a lot.
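For reference, on Python 3.9 and later the thread-based approach can be written more compactly with asyncio.to_thread. The sketch below uses a hypothetical blocking_fetch function that just sleeps, standing in for a synchronous library call, and measures that the two calls overlap:

```python
import asyncio
import time


def blocking_fetch(name):
    # Stand-in for a synchronous library call that waits on I/O.
    time.sleep(0.2)
    return f"data for {name}"


async def main():
    start = time.perf_counter()
    # Each call runs in a worker thread, so the two waits overlap.
    results = await asyncio.gather(
        asyncio.to_thread(blocking_fetch, "A"),
        asyncio.to_thread(blocking_fetch, "B"),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```

The elapsed time should be close to one call's duration rather than the sum of both, although the exact figure depends on the machine.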

You can take a look at the following simple working example from here. By the way it returns a string worth reading :-)
import aiohttp
import asyncio
async def fetch(client):
    async with client.get('https://docs.aiohttp.org/en/stable/client_reference.html') as resp:
        assert resp.status == 200
        return await resp.text()
async def main():
    async with aiohttp.ClientSession() as client:
        html = await fetch(client)
        print(html)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
