aiohttp: rate limiting parallel requests


APIs often have rate limits that users have to follow. As an example let's take 50 requests/second. Sequential requests take 0.5-1 second and thus are too slow to come close to that limit. Parallel requests with aiohttp, however, exceed the rate limit.
To poll the API as fast as allowed, one needs to rate limit parallel calls.
Examples that I found so far decorate session.get, approximately like so:
session.get = rate_limited(max_calls_per_second)(session.get)
This works well for sequential calls. Trying to implement this in parallel calls does not work as intended.
Here's some code as example:
async with aiohttp.ClientSession() as session:
    session.get = rate_limited(max_calls_per_second)(session.get)
    tasks = (asyncio.ensure_future(download_coroutine(
        timeout, session, url)) for url in urls)
    process_responses_function(await asyncio.gather(*tasks))
The problem with this is that it will rate-limit the queueing of the tasks. The execution with gather will still happen more or less at the same time. Worst of both worlds ;-).
Yes, I found a similar question right here, "aiohttp: set maximum number of requests per second", but neither reply answers the actual question of limiting the rate of requests. Also, the blog post from Quentin Pradet only rate-limits the queueing.
To wrap it up: How can one limit the number of requests per second for parallel aiohttp requests?

If I understand you correctly, you want to limit the number of simultaneous requests?
There is an object in asyncio named Semaphore; it works like an asynchronous lock that a fixed number of tasks can hold at the same time.
semaphore = asyncio.Semaphore(50)
#...
async def limit_wrap(url):
    async with semaphore:
        # do what you want
        ...
#...
results = await asyncio.gather(*[limit_wrap(url) for url in urls])
Update:
Suppose I make 50 concurrent requests and they all finish in 2 seconds. That doesn't reach the limit (it is only 25 requests per second).
That means I should make 100 concurrent requests so that, if they also all finish in 2 seconds, the rate is 50 requests per second. But before you actually make those requests, how can you know how long they will take to finish?
Or, if you don't care about requests finished per second but about requests made per second, you can do this:
async def loop_wrap(urls):
    for url in urls:
        asyncio.ensure_future(download(url))
        await asyncio.sleep(1/50)
loop = asyncio.new_event_loop()
loop.create_task(loop_wrap(urls))
loop.run_forever()
The code above will create a Future instance every 1/50 second.
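The "requests made per second" idea above can also be packaged as a small reusable limiter that tasks call before starting a request. The following is only a sketch under assumptions: `StartRateLimiter`, `download` and `demo` are hypothetical names, and `asyncio.sleep(0.05)` stands in for the real HTTP call. All tasks are started with gather(), but each one waits for its own start slot, so the starts are spaced 1/rate seconds apart.

```python
import asyncio
import time


class StartRateLimiter:
    """Hypothetical helper: lets at most `rate` callers start per second."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._lock = asyncio.Lock()
        self._next_slot = 0.0  # monotonic time of the next free start slot

    async def wait(self) -> None:
        # The lock serialises slot assignment so two tasks never share a slot.
        async with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        # Sleep outside the lock so other tasks can reserve later slots.
        await asyncio.sleep(slot - now)


async def download(limiter, url, started):
    await limiter.wait()
    started.append(time.monotonic())  # record when the "request" starts
    await asyncio.sleep(0.05)         # stand-in for the real HTTP call
    return url


async def demo(n=10, rate=50):
    limiter = StartRateLimiter(rate)
    started = []
    results = await asyncio.gather(
        *(download(limiter, u, started) for u in range(n)))
    return results, started


results, started = asyncio.run(demo())
print(f"{len(results)} requests started over {max(started) - min(started):.2f}s")
```

Unlike a semaphore, this does not cap how many requests are in flight; slow responses simply overlap. Combining it with a semaphore or a connector limit would bound both the rate and the concurrency.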

I approached the problem by creating a subclass of aiohttp.ClientSession() with a ratelimiter based on the leaky-bucket algorithm. I use asyncio.Queue() for ratelimiting instead of Semaphores. I’ve only overridden the _request() method. I find this approach cleaner since you only replace session = aiohttp.ClientSession() with session = ThrottledClientSession(rate_limit=15).
import asyncio
import time
from typing import Optional
import aiohttp
class ThrottledClientSession(aiohttp.ClientSession):
    """
    Rate-throttled client session class inherited from aiohttp.ClientSession
    USAGE:
        replace `session = aiohttp.ClientSession()`
        with `session = ThrottledClientSession(rate_limit=15)`
    see https://stackoverflow.com/a/60357775/107049
    """
    MIN_SLEEP = 0.1
    def __init__(self, rate_limit: float = None, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.rate_limit = rate_limit
        self._fillerTask = None
        self._queue = None
        self._start_time = time.time()
        if rate_limit is not None:
            if rate_limit <= 0:
                raise ValueError('rate_limit must be positive')
            self._queue = asyncio.Queue(min(2, int(rate_limit) + 1))
            self._fillerTask = asyncio.create_task(self._filler(rate_limit))
    def _get_sleep(self) -> Optional[float]:
        if self.rate_limit is not None:
            return max(1 / self.rate_limit, self.MIN_SLEEP)
        return None
    async def close(self) -> None:
        """Close rate-limiter's "bucket filler" task"""
        if self._fillerTask is not None:
            self._fillerTask.cancel()
            try:
                await asyncio.wait_for(self._fillerTask, timeout=0.5)
            except asyncio.TimeoutError as err:
                print(str(err))
        await super().close()
    async def _filler(self, rate_limit: float = 1):
        """Filler task to fill the leaky bucket algo"""
        try:
            if self._queue is None:
                return
            self.rate_limit = rate_limit
            sleep = self._get_sleep()
            updated_at = time.monotonic()
            fraction = 0
            extra_increment = 0
            # start with a full bucket
            for i in range(0, self._queue.maxsize):
                self._queue.put_nowait(i)
            while True:
                if not self._queue.full():
                    now = time.monotonic()
                    increment = rate_limit * (now - updated_at)
                    fraction += increment % 1
                    extra_increment = fraction // 1
                    items_2_add = int(min(self._queue.maxsize - self._queue.qsize(), int(increment) + extra_increment))
                    fraction = fraction % 1
                    for i in range(0, items_2_add):
                        self._queue.put_nowait(i)
                    updated_at = now
                await asyncio.sleep(sleep)
        except asyncio.CancelledError:
            print('Cancelled')
        except Exception as err:
            print(str(err))
    async def _allow(self) -> None:
        if self._queue is not None:
            # debug
            # if self._start_time == None:
            #     self._start_time = time.time()
            await self._queue.get()
            self._queue.task_done()
        return None
    async def _request(self, *args, **kwargs) -> aiohttp.ClientResponse:
        """Throttled _request()"""
        await self._allow()
        return await super()._request(*args, **kwargs)
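The refill arithmetic inside _filler() is easier to follow in isolation. The sketch below pulls it out into a pure function; `tokens_to_add` is a hypothetical name used only for illustration and is not part of the class. It shows how the fractional part of each refill is carried over, so that a rate such as 1.5 tokens per tick still averages out correctly.

```python
def tokens_to_add(rate, elapsed, fraction, free_slots):
    """Return (whole tokens to add now, fractional remainder to carry over)."""
    increment = rate * elapsed            # tokens earned since the last refill
    fraction += increment % 1             # bank the fractional part
    extra = int(fraction // 1)            # whole tokens formed by banked fractions
    tokens = min(free_slots, int(increment) + extra)
    return tokens, fraction % 1


# 6 requests/second refilled every 0.25 s earns 1.5 tokens per tick.
fraction = 0.0
added = []
for _ in range(4):
    tokens, fraction = tokens_to_add(6.0, 0.25, fraction, free_slots=100)
    added.append(tokens)
print(added, sum(added))  # [1, 2, 1, 2] 6
```

Over one second the ticks add 6 tokens in total, which matches the configured rate. When `free_slots` is smaller than the earned amount (the queue is nearly full), the surplus is dropped rather than saved.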

I liked @sraw's approach with asyncio, but their answer didn't quite cut it for me. Since I don't know whether each call to download will be faster or slower than the rate limit, I want the option to run many in parallel when requests are slow and one at a time when requests are very fast, so that I'm always right at the rate limit.
I do this by using a queue with a producer that produces new tasks at the rate limit, and many consumers that either all wait on the next job if they're fast, or work through a backlog in the queue if they're slow, running as fast as the processor/network allow:
import asyncio
from datetime import datetime
async def download(url):
    # download or whatever
    task_time = 1/10
    await asyncio.sleep(task_time)
    result = datetime.now()
    return result, url
async def producer_fn(queue, urls, max_per_second):
    for url in urls:
        await queue.put(url)
        await asyncio.sleep(1/max_per_second)
async def consumer(work_queue, result_queue):
    while True:
        url = await work_queue.get()
        result = await download(url)
        work_queue.task_done()
        await result_queue.put(result)
urls = range(20)
async def main():
    work_queue = asyncio.Queue()
    result_queue = asyncio.Queue()
    num_consumer_tasks = 10
    max_per_second = 5
    consumers = [asyncio.create_task(consumer(work_queue, result_queue))
                 for _ in range(num_consumer_tasks)]
    producer = asyncio.create_task(producer_fn(work_queue, urls, max_per_second))
    await producer
    # wait for the remaining tasks to be processed
    await work_queue.join()
    # cancel the consumers, which are now idle
    for c in consumers:
        c.cancel()
    while not result_queue.empty():
        result, url = await result_queue.get()
        print(f'{url} finished at {result}')
asyncio.run(main())

I developed a library named octopus-api (https://pypi.org/project/octopus-api/), that enables you to rate limit and set the number of connections (parallel) calls to the endpoint using aiohttp under the hood. The goal of it is to simplify all the aiohttp setup needed.
Here is an example of how to use it, where the get_ethereum is the user-defined request function:
from octopus_api import TentacleSession, OctopusApi
from typing import Dict, List
if __name__ == '__main__':
    async def get_ethereum(session: TentacleSession, request: Dict):
        async with session.get(url=request["url"], params=request["params"]) as response:
            body = await response.json()
            return body
    client = OctopusApi(rate=50, resolution="sec", connections=6)
    result: List = client.execute(requests_list=[{
        "url": "https://api.pro.coinbase.com/products/ETH-EUR/candles?granularity=900&start=2021-12-04T00:00:00Z&end=2021-12-04T00:00:00Z",
        "params": {}}] * 1000, func=get_ethereum)
    print(result)
The TentacleSession works the same as how you write POST, GET, PUT and PATCH for aiohttp.ClientSession.
Let me know if it helps your issue related to rate limits and parallel calls.

As for the question of n requests being sent at the same time when gather() is called, the key is to use create_task() with an await asyncio.sleep(1.1) before every call. A task created with create_task() is scheduled immediately and starts running at the next await:
for i in range(THREADS):
    await asyncio.sleep(1.1)
    tasks.append(
        asyncio.create_task(getData(session, q, ''.join(random.choice(string.ascii_lowercase) for i in range(10))))
    )
await asyncio.gather(*tasks)
The other issue, limiting the number of simultaneous connections, is also solved in the example below by using the ClientSession() context in async_payload_wrapper and setting a connector with a limit.
With this setup I can run 25 coroutines (THREADS=25) that each loop over a queue of URLs without violating a rule of at most 25 concurrent connections:
async def send_request(session, url, routine):
    start_time = time.time()
    print(f"{routine}, sending request: {datetime.now()}")
    params = {
        'api_key': 'nunya',
        'url': '%s' % url,
        'render_js': 'false',
        'premium_proxy': 'false',
        'country_code':'us'
    }
    try:
        async with session.get(url='http://yourAPI.com',params=params,) as response:
            data = await response.content.read()
            print(f"{routine}, done request: {time.time() - start_time} seconds")
            return data
    except asyncio.TimeoutError as e:
        print('timeout---------------------')
        errors.append(url)
    except aiohttp.ClientResponseError as e:
        print('request failed - Server Error')
        errors.append(url)
    except Exception as e:
        errors.append(url)
async def getData(session, q, test):
    while True:
        if not q.empty():
            url = q.get_nowait()
            resp = await send_request(session, url ,test)
            if resp is not None:
                processData(resp, test, url)
        else:
            print(f'{test} queue empty')
            break
async def async_payload_wrapper():
    tasks = []
    q = asyncio.Queue()
    for url in urls:
        await q.put(url)
    async with ClientSession(connector=aiohttp.TCPConnector(limit=THREADS), timeout=ClientTimeout(total=61), raise_for_status=True) as session:
        for i in range(THREADS):
            await asyncio.sleep(1.1)
            tasks.append(
                asyncio.create_task(getData(session, q, ''.join(random.choice(string.ascii_lowercase) for i in range(10))))
            )
        await asyncio.gather(*tasks)
if __name__ == '__main__':
    start_time = time.time()
    asyncio.run(async_payload_wrapper())

Related

Why does my asyncio function stop after the first task?

This is my first attempt at asynchronous programming in Python, but I am running into a problem where my results stop after the first task is finished, as opposed to returning all of the results after every task has finished executing.
In api.py, I have a search_async function that ultimately makes the request using the aiohttp.ClientSession object being passed around. The search_value_async function is a wrapper around it that is called from app.py:
# api.py
async def search_async(self, session, offset=0):
    endpoint = 'https://example.com'
    query_string = urlencode({ 'offset': offset })
    lookup_url = f'{endpoint}?{query_string}'
    async with session.get(lookup_url, headers=self.get_resource_headers()) as response:
        if response.status not in range(200, 299):
            return {
                'Status': response.status
            }
        return await response.json()
async def search_value_async(self, session, offset=0):
    return await self.search_async(session, offset)
# app.py
async def get_recommendations(queries):
    async with aiohttp.ClientSession() as session:
        data = await get_all_queries(session, queries)
        return data
async def get_all_queries(session, queries):
    tasks = []
    for query in queries:
        for offset in range(0, 1000, 50):
            tasks.append(asyncio.create_task(api.search_value_async(session, query, offset)))
    results = await asyncio.gather(*tasks)
    return results
def main():
    # queries = ...
    data = []
    results = asyncio.run(get_recommendations(queries))
    data.extend(results)
    recommendations = normalize_data(data)
    return data
So far, I've confirmed that the correct number of coroutines is being created, and I was able to diagnose that the number of results I get back when running asynchronously is equivalent to only the first task being run.
I'm new to this, so my understanding could be wrong, but if all my tasks are being created, I would expect await asyncio.gather(*tasks) to give me the results from all of my completed tasks, not just the first one.

How can I make Async IO work on a non async function?

I have a complex function Vehicle.set_data, which has many nested functions, API calls, DB calls, etc. For the sake of this example, I will simplify it.
I am trying to use Async IO to run Vehicle.set_data on multiple vehicles at once. Here is my Vehicle model:
class Vehicle:
    def __init__(self, token):
        self.token = token
    # Works async
    async def set_data(self):
        await asyncio.sleep(random.random() * 10)
    # Does not work async
    # def set_data(self):
    #     time.sleep(random.random() * 10)
And here is my Async IO routine:
async def set_vehicle_data(vehicle):
    # sleep for T seconds on average
    await vehicle.set_data()
def get_random_string():
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5))
async def producer(queue):
    count = 0
    while True:
        count += 1
        # produce a token and send it to a consumer
        token = get_random_string()
        vehicle = Vehicle(token)
        print(f'produced {vehicle.token}')
        await queue.put(vehicle)
        if count > 3:
            break
async def consumer(queue):
    while True:
        vehicle = await queue.get()
        # process the token received from a producer
        print(f'Starting consumption for vehicle {vehicle.token}')
        await set_vehicle_data(vehicle)
        queue.task_done()
        print(f'Ending consumption for vehicle {vehicle.token}')
async def main():
    queue = asyncio.Queue()
    # #todo now, do I need multiple producers
    producers = [asyncio.create_task(producer(queue))
                 for _ in range(3)]
    consumers = [asyncio.create_task(consumer(queue))
                 for _ in range(3)]
    # with both producers and consumers running, wait for
    # the producers to finish
    await asyncio.gather(*producers)
    print('---- done producing')
    # wait for the remaining tasks to be processed
    await queue.join()
    # cancel the consumers, which are now idle
    for c in consumers:
        c.cancel()
asyncio.run(main())
In the example above, this commented section of code does not allow multiple vehicles to process at once:
# Does not work async
# def set_data(self):
#     time.sleep(random.random() * 10)
Because this is such a complex query in our actual codebase, it would be a tremendous refactor to go flag every single nested function with async and await. Is there any way I can make this function work async without marking up my whole codebase with async?

You can run the function in a separate thread with asyncio.to_thread:
await asyncio.to_thread(self.set_data)
If you're using Python < 3.9, use loop.run_in_executor instead:
loop = asyncio.get_event_loop()
await loop.run_in_executor(None, self.set_data)
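As a rough illustration of the effect, here is a minimal sketch that runs a blocking set_data on three vehicles at once. It assumes Python 3.9+ for asyncio.to_thread; the `done` attribute and the fixed 0.2 second `time.sleep` are stand-ins added for the demo and are not part of the question's model.

```python
import asyncio
import time


class Vehicle:
    def __init__(self, token):
        self.token = token
        self.done = False  # demo-only flag, not in the original model

    def set_data(self):
        # Blocking call standing in for the real API/DB work.
        time.sleep(0.2)
        self.done = True


async def main():
    vehicles = [Vehicle(f"v{i}") for i in range(3)]
    start = time.monotonic()
    # Each blocking call runs in a worker thread of the default executor,
    # so the event loop stays free and the calls overlap.
    await asyncio.gather(*(asyncio.to_thread(v.set_data) for v in vehicles))
    return vehicles, time.monotonic() - start


vehicles, elapsed = asyncio.run(main())
print(f"{len(vehicles)} vehicles processed in {elapsed:.2f}s")
```

Run sequentially, the three calls would take about 0.6 seconds; in threads they overlap. The usual caveat applies: the blocking code now runs in several threads at once, so anything it shares (DB connections, caches) needs to be thread-safe.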

Python parallelising "async for"

I have the following method in my Tornado handler:
async def get(self):
    url = 'url here'
    try:
        async for batch in downloader.fetch(url):
            self.write(batch)
            await self.flush()
    except Exception as e:
        logger.warning(e)
This is the code for downloader.fetch():
async def fetch(url, **kwargs):
    timeout = kwargs.get('timeout', aiohttp.ClientTimeout(total=12))
    response_validator = kwargs.get('response_validator', json_response_validator)
    extractor = kwargs.get('extractor', json_extractor)
    try:
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get(url) as resp:
                response_validator(resp)
                async for batch in extractor(resp):
                    yield batch
    except aiohttp.client_exceptions.ClientConnectorError:
        logger.warning("bad request")
        raise
    except asyncio.TimeoutError:
        logger.warning("server timeout")
        raise
I would like to yield the "batch" object from multiple downloaders in parallel.
I want the first available batch from whichever downloader has one ready, and so on until all downloaders have finished. Something like this (this is not working code):
async for batch in [downloader.fetch(url1), downloader.fetch(url2)]:
    ....
Is this possible? How can I modify what I am doing in order to be able to yield from multiple coroutines in parallel?

How can I modify what I am doing in order to be able to yield from multiple coroutines in parallel?
You need a function that merges two async sequences into one, iterating over both in parallel and yielding elements from one or the other, as they become available. While such a function is not included in the current standard library, you can find one in the aiostream package.
You can also write your own merge function, as shown in this answer:
async def merge(*iterables):
    iter_next = {it.__aiter__(): None for it in iterables}
    while iter_next:
        for it, it_next in iter_next.items():
            if it_next is None:
                fut = asyncio.ensure_future(it.__anext__())
                fut._orig_iter = it
                iter_next[it] = fut
        done, _ = await asyncio.wait(iter_next.values(),
                                     return_when=asyncio.FIRST_COMPLETED)
        for fut in done:
            iter_next[fut._orig_iter] = None
            try:
                ret = fut.result()
            except StopAsyncIteration:
                del iter_next[fut._orig_iter]
                continue
            yield ret
Using that function, the loop would look like this:
async for batch in merge(downloader.fetch(url1), downloader.fetch(url2)):
    ....
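Another way to get the same merging behaviour is to let one task per source push items into a shared asyncio.Queue. This is a different technique from the merge() function above, sketched here under assumptions: `merge_via_queue` and `ticker` are hypothetical names, and `ticker` stands in for downloader.fetch(). Errors raised by a source are only re-raised after all sources have finished.

```python
import asyncio


async def merge_via_queue(*iterables):
    """Yield items from several async iterables as they become available."""
    queue = asyncio.Queue()
    done = object()  # sentinel marking the end of one source

    async def drain(it):
        try:
            async for item in it:
                await queue.put(item)
        finally:
            await queue.put(done)

    tasks = [asyncio.create_task(drain(it)) for it in iterables]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is done:
                remaining -= 1
            else:
                yield item
        # Re-raise any exception that a source raised while draining.
        await asyncio.gather(*tasks)
    finally:
        for t in tasks:
            t.cancel()


async def ticker(name, delay, count):
    # Stand-in for an async generator such as downloader.fetch(url).
    for i in range(count):
        await asyncio.sleep(delay)
        yield f"{name}{i}"


async def demo():
    return [x async for x in merge_via_queue(ticker("a", 0.01, 3),
                                             ticker("b", 0.025, 2))]


merged = asyncio.run(demo())
print(merged)
```

Items from the same source keep their relative order, while items from different sources interleave according to when they arrive. Note that the queue is unbounded here, so a fast producer is not slowed down by a slow consumer.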

Edit:
As mentioned in the comments, the method below does not execute the given routines in parallel.
Check out the aitertools library.
import asyncio
import aitertools
async def f1():
    await asyncio.sleep(5)
    yield 1
async def f2():
    await asyncio.sleep(6)
    yield 2
async def iter_funcs():
    async for x in aitertools.chain(f2(), f1()):
        print(x)
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(iter_funcs())
It seems that the functions being iterated must be coroutines.

Python asyncio task list generation without executing the function

While working in asyncio, I'm trying to use a list comprehension to build my task list. The basic form of the function is as follows:
import asyncio
import urllib.request as req
@asyncio.coroutine
def coro(term):
    print(term)
    google = "https://www.google.com/search?q=" + term.replace(" ", "+") + "&num=100&start=0"
    request = req.Request(google, None, headers)
    # (some beautiful soup stuff)
My goal is to use a list of terms to create my task list:
terms = ["pie", "chicken" ,"things" ,"stuff"]
tasks=[
    coro("pie"),
    coro("chicken"),
    coro("things"),
    coro("stuff")]
My initial thought was:
loop = asyncio.get_event_loop()
tasks = [my_coroutine(term) for term in terms]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
This doesn't create the task list; it runs the function during the list comprehension. Is there a shortcut for creating the task list without writing out every task?

Your HTTP client does not support asyncio, so you will not get the expected results. Try this to see that .wait() does work as you expect:
import asyncio
import random
@asyncio.coroutine
def my_coroutine(term):
    print("start", term)
    yield from asyncio.sleep(random.uniform(1, 3))
    print("end", term)
terms = ["pie", "chicken", "things", "stuff"]
loop = asyncio.get_event_loop()
tasks = [my_coroutine(term) for term in terms]
print("Here we go!")
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
If you use asyncio.gather() you get one future encapsulating all your tasks, which can easily be canceled with .cancel(). This is demonstrated here with the Python 3.5+ async def/await syntax (but it works the same with @coroutine and yield from):
import asyncio
import random
async def my_coroutine(term):
    print("start", term)
    n = random.uniform(0.2, 1.5)
    await asyncio.sleep(n)
    print("end", term)
    return "Term {} slept for {:.2f} seconds".format(term, n)
async def stop_all():
    """Cancels all still running tasks after one second"""
    await asyncio.sleep(1)
    print("stopping")
    fut.cancel()
    return ":-)"
loop = asyncio.get_event_loop()
terms = ["pie", "chicken", "things", "stuff"]
tasks = (my_coroutine(term) for term in terms)
fut = asyncio.gather(stop_all(), *tasks, return_exceptions=True)
print("Here we go!")
loop.run_until_complete(fut)
for task_result in fut.result():
    if not isinstance(task_result, Exception):
        print("OK", task_result)
    else:
        print("Failed", task_result)
loop.close()
And finally, if you want to use an async HTTP client, try aiohttp. First install it with:
pip install aiohttp
then try this example, which uses asyncio.as_completed:
import asyncio
import aiohttp
async def fetch(session, url):
    print("Getting {}...".format(url))
    async with session.get(url) as resp:
        text = await resp.text()
        return "{}: Got {} bytes".format(url, len(text))
async def fetch_all():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, "http://httpbin.org/delay/{}".format(delay))
                 for delay in (1, 1, 2, 3, 3)]
        for task in asyncio.as_completed(tasks):
            print(await task)
    return "Done."
loop = asyncio.get_event_loop()
resp = loop.run_until_complete(fetch_all())
print(resp)
loop.close()
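With async def, calling the function only creates a coroutine object; the body does not run until the coroutine is awaited or scheduled. The following sketch makes that visible for the list comprehension from the question. The `log` list and the upper-casing are illustrative additions only, and `asyncio.sleep(0)` stands in for real async I/O.

```python
import asyncio

log = []


async def coro(term):
    log.append(term)          # runs only once the coroutine is scheduled
    await asyncio.sleep(0)    # stand-in for real async I/O
    return term.upper()


async def main():
    terms = ["pie", "chicken", "things", "stuff"]
    coros = [coro(term) for term in terms]   # builds the list, runs nothing
    ran_early = list(log)                    # snapshot taken before scheduling
    results = await asyncio.gather(*coros)
    return ran_early, results


ran_early, results = asyncio.run(main())
print(ran_early, results)
```

The snapshot taken right after the list comprehension is empty, and the bodies only run once gather() schedules them. A blocking call such as urllib inside the body would still block the event loop at that point, which is why an async HTTP client is needed.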

This works in Python 3.5 (which added the new async/await syntax):
import asyncio
async def coro(term):
    for i in range(3):
        await asyncio.sleep(int(len(term)))  # just sleep
        print("cor1", i, term)
terms = ["pie", "chicken", "things", "stuff"]
tasks = [coro(term) for term in terms]
loop = asyncio.get_event_loop()
cors = asyncio.wait(tasks)
loop.run_until_complete(cors)
Shouldn't your version yield from req.Request(google, None, headers)? And (what library is that?) is this library even made for use with asyncio?
(Here is the same code with the Python <= 3.4 syntax; the missing parts are the same as above):
@asyncio.coroutine
def coro(term):
    for i in range(3):
        yield from asyncio.sleep(int(len(term)))  # just sleep
        print("cor1", i, term)

Create the tasks inside a running event loop, then gather them:
async def main():
    tasks = []
    while terms:
        tasks.append(asyncio.create_task(coro(terms.pop())))
    return await asyncio.gather(*tasks, return_exceptions=True)
responses = asyncio.run(main())

aiohttp: set maximum number of requests per second

How can I set maximum number of requests per second (limit them) in client side using aiohttp?

Although it's not exactly a limit on the number of requests per second, note that since v2.0, when using a ClientSession, aiohttp automatically limits the number of simultaneous connections to 100.
You can modify the limit by creating your own TCPConnector and passing it into the ClientSession. For instance, to create a client limited to 50 simultaneous requests:
import aiohttp
connector = aiohttp.TCPConnector(limit=50)
client = aiohttp.ClientSession(connector=connector)
In case it's better suited to your use case, there is also a limit_per_host parameter (which is off by default) that you can pass to limit the number of simultaneous connections to the same "endpoint". Per the docs:
limit_per_host (int) – limit for simultaneous connections to the same endpoint. Endpoints are the same if they are have equal (host, port, is_ssl) triple.
Example usage:
import aiohttp
connector = aiohttp.TCPConnector(limit_per_host=50)
client = aiohttp.ClientSession(connector=connector)

I found one possible solution here: http://compiletoi.net/fast-scraping-in-python-with-asyncio.html
Doing 3 requests at the same time is cool, doing 5000, however, is not so nice. If you try to do too many requests at the same time, connections might start to get closed, or you might even get banned from the website.
To avoid this, you can use a semaphore. It is a synchronization tool that can be used to limit the number of coroutines that do something at some point. We'll just create the semaphore before creating the loop, passing as an argument the number of simultaneous requests we want to allow:
sem = asyncio.Semaphore(5)
Then, we just replace:
page = yield from get(url, compress=True)
by the same thing, but protected by a semaphore:
with (yield from sem):
    page = yield from get(url, compress=True)
This will ensure that at most 5 requests can be done at the same time.
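The snippet above uses the old generator-based syntax; the with (yield from sem) form was removed in Python 3.9, and the current spelling is async with sem. A minimal sketch of the same idea follows. The `state` counters are added only to observe the concurrency, and `asyncio.sleep` stands in for the HTTP request.

```python
import asyncio


async def fetch(sem, url, state):
    async with sem:  # at most 5 coroutines are inside this block at once
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        state["active"] -= 1
    return url


async def main():
    sem = asyncio.Semaphore(5)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fetch(sem, u, state) for u in range(20)))
    return results, state["peak"]


results, peak = asyncio.run(main())
print(f"{len(results)} requests, peak concurrency {peak}")
```

This caps how many requests are in flight at the same time, not how many are started per second; if every request finishes in 10 ms, five slots still allow hundreds of requests per second.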

This is an example without aiohttp, but you can wrap any async method or aiohttp.request using the Limit decorator:
import asyncio
import time
class Limit(object):
    def __init__(self, calls=5, period=1):
        self.calls = calls
        self.period = period
        self.clock = time.monotonic
        self.last_reset = 0
        self.num_calls = 0
    def __call__(self, func):
        async def wrapper(*args, **kwargs):
            if self.num_calls >= self.calls:
                await asyncio.sleep(self.__period_remaining())
            period_remaining = self.__period_remaining()
            if period_remaining <= 0:
                self.num_calls = 0
                self.last_reset = self.clock()
            self.num_calls += 1
            return await func(*args, **kwargs)
        return wrapper
    def __period_remaining(self):
        elapsed = self.clock() - self.last_reset
        return self.period - elapsed
@Limit(calls=5, period=2)
async def test_call(x):
    print(x)
async def worker():
    for x in range(100):
        await test_call(x + 1)
asyncio.run(worker())

None of the solutions in the other answers work (I've already tried them) if the API's rate limit counts time from the end of each request, so I'm posting a new one that should work:
class Limiter:
    def __init__(self, calls_limit: int = 5, period: int = 1):
        self.calls_limit = calls_limit
        self.period = period
        self.semaphore = asyncio.Semaphore(calls_limit)
        self.requests_finish_time = []
    async def sleep(self):
        if len(self.requests_finish_time) >= self.calls_limit:
            sleep_before = self.requests_finish_time.pop(0)
            if sleep_before >= time.monotonic():
                await asyncio.sleep(sleep_before - time.monotonic())
    def __call__(self, func):
        async def wrapper(*args, **kwargs):
            async with self.semaphore:
                await self.sleep()
                res = await func(*args, **kwargs)
                self.requests_finish_time.append(time.monotonic() + self.period)
            return res
        return wrapper
Usage:
@Limiter(calls_limit=5, period=1)
async def api_call(url):
    ...
async def main():
    tasks = [asyncio.create_task(api_call(url)) for url in urls]
    await asyncio.gather(*tasks)
if __name__ == '__main__':
    loop = asyncio.get_event_loop_policy().get_event_loop()
    loop.run_until_complete(main())
