Problem I'm trying to solve:
I'm making many API requests to a server. I'm trying to create delays between async API calls to comply with the server's rate limit policy.
What I want it to do
I want it to behave like this:
Make api request #1
wait 0.1 seconds
Make api request #2
wait 0.1 seconds
... and so on ...
repeat until all requests are made
gather the responses and return the results in one object (results)
Issue:
When I introduced asyncio.sleep() or time.sleep() in the code, it still made the API requests almost instantaneously. It seemed to delay the execution of print(), but not the API requests. I suspect that I have to create the delays within the loop, not in fetch_one() or fetch_all(), but I couldn't figure out how to do so.
Code block:
async def fetch_all(loop, urls, delay):
    results = await asyncio.gather(*[fetch_one(loop, url, delay) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay):
    # time.sleep(delay)
    # asyncio.sleep(delay)
    async with aiohttp.ClientSession(loop=loop) as session:
        async with session.get(url, ssl=SSLContext()) as resp:
            # print("An api call to ", url, " is made at ", time.time())
            # print(resp)
            # note: the response object itself is not awaitable; read the body instead
            return await resp.json()
delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
Versions I'm using:
python 3.8.5
aiohttp 3.7.4
asyncio 3.4.3
I would appreciate any tips on guiding me to the right direction!
The call to asyncio.gather launches all the requests "simultaneously" - and on the other hand, if you simply awaited each task one after the other (or serialized them with a lock), you would not gain anything from the concurrency at all.
The simplest thing to do, if you know the rate at which you can issue the requests, is simply to increase the asynchronous pause before each request in succession - a simple global variable can do that:
next_delay = 0.1

async def fetch_all(loop, urls, delay):
    results = await asyncio.gather(*[fetch_one(loop, url, delay) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay):
    global next_delay
    next_delay += delay
    await asyncio.sleep(next_delay)
    async with aiohttp.ClientSession(loop=loop) as session:
        async with session.get(url, ssl=SSLContext()) as resp:
            # print("An api call to ", url, " is made at ", time.time())
            # print(resp)
            return await resp.json()
delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
Now, if you want to, say, issue 5 requests and only then issue the next 5, you can use a synchronization primitive gated on how many API calls are currently active - an asyncio.Event as below, or an asyncio.Condition with its wait_for (shown in a sketch after the next snippet):
active_calls = 0
MAX_CALLS = 5

async def fetch_all(loop, urls, delay):
    event = asyncio.Event()
    event.set()
    results = await asyncio.gather(*[fetch_one(loop, url, delay, event) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay, event):
    global active_calls
    active_calls += 1
    if active_calls > MAX_CALLS:
        event.clear()
    await event.wait()
    try:
        async with aiohttp.ClientSession(loop=loop) as session:
            async with session.get(url, ssl=SSLContext()) as resp:
                # print("An api call to ", url, " is made at ", time.time())
                # print(resp)
                return await resp.json()
    finally:
        active_calls -= 1
        if active_calls == 0:
            event.set()
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
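For reference, here is a minimal sketch of the Condition-based variant mentioned above; the request body is elided, and active_calls and MAX_CALLS play the same roles as in the Event example:

import asyncio

MAX_CALLS = 5      # same illustrative limit as above
active_calls = 0

async def fetch_all(urls):
    cond = asyncio.Condition()
    return await asyncio.gather(
        *[fetch_one(cond, url) for url in urls], return_exceptions=True)

async def fetch_one(cond, url):
    global active_calls
    async with cond:
        # wait_for re-checks the predicate each time another task notifies
        await cond.wait_for(lambda: active_calls < MAX_CALLS)
        active_calls += 1
    try:
        ...  # the aiohttp request from the snippets above would go here
    finally:
        async with cond:
            active_calls -= 1
            cond.notify_all()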
For both examples, if your design should avoid global variables (strictly speaking, these are "module" variables), you could either move the functions into a class, work on an instance, and promote the globals to instance attributes, or use a mutable container, such as a list holding the active_calls value in its first item, and pass that around as a parameter.
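For instance, a rough sketch of the class-based variant of the first example (the class name is illustrative):

import asyncio
import aiohttp

class StaggeredFetcher:
    """Keeps the shared delay counter as instance state instead of a module global."""

    def __init__(self, delay):
        self.delay = delay
        self.next_delay = 0.0

    async def fetch_all(self, urls):
        return await asyncio.gather(
            *[self.fetch_one(url) for url in urls], return_exceptions=True)

    async def fetch_one(self, url):
        # same staggering idea as the first example, but on an attribute
        self.next_delay += self.delay
        await asyncio.sleep(self.next_delay)
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                return await resp.json()

# usage: asyncio.get_event_loop().run_until_complete(
#            StaggeredFetcher(delay=0.1).fetch_all(urls))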
When you use asyncio.gather you run all the fetch_one coroutines concurrently. All of them wait for the delay together, then make their API calls practically at the same time.
To solve the issue, you should either await fetch_one one by one inside fetch_all (see the sketch after the snippet below), or use a Semaphore to signal that the next request shouldn't start before the previous one is done.
Here's the idea:
import asyncio

_sem = asyncio.Semaphore(1)

async def fetch_all(loop, urls, delay):
    results = await asyncio.gather(*[fetch_one(loop, url, delay) for url in urls], return_exceptions=True)
    return results

async def fetch_one(loop, url, delay):
    async with _sem:  # the next coroutine(s) will block here until the previous one is done
        await asyncio.sleep(delay)
        async with aiohttp.ClientSession(loop=loop) as session:
            async with session.get(url, ssl=SSLContext()) as resp:
                # print("An api call to ", url, " is made at ", time.time())
                # print(resp)
                return await resp.json()
delay = 0.1
urls = ['some string list of urls']
loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_all(loop, urls, delay))
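And here is a rough sketch of the "one by one" alternative, reusing fetch_one from the original question; exceptions are collected by hand to mimic return_exceptions=True:

import asyncio

async def fetch_all_sequential(loop, urls, delay):
    # awaiting each coroutine in turn serializes the requests,
    # so the sleep below actually spaces the calls out
    results = []
    for url in urls:
        try:
            results.append(await fetch_one(loop, url, delay))
        except Exception as exc:
            results.append(exc)
        await asyncio.sleep(delay)
    return results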
Related
Do I need to use threads too? I'm new to async; I was previously creating one thread for every request I wanted to make, but it was very inefficient.
I want something like:
def async_request(i):
    response = requests.post(url, headers=headers, json=body(i), timeout=5)
    r = response.json()
    if r['result'] > 50:
        ...  # do something

i = 0
while True:
    async_request(i)
    i += 1
    time.sleep(0.01)  # 10 ms delay
I want to send a request every 10 ms, and I don't want to wait for one request's response before sending another; and after I receive the response inside async_request, I want to act on the result immediately if it meets the requirements.
I think that using async for your problem is a good idea. However, you have to change multiple things. First, I would use asyncio, which lets you place tasks on the event loop for processing. Second, aiohttp for non-blocking HTTP requests. The following code uses both to achieve your goal of sending multiple requests in short intervals:
import asyncio
import aiohttp
import json


async def send_post_request(session, url, headers, body):
    # use session.post for a POST request; resp.json() parses the response body
    async with session.post(url, headers=headers, data=body) as resp:
        return await resp.json()


async def async_request(session, i):
    # TODO change to your params
    test_url = 'https://jsonplaceholder.typicode.com/todos/1'
    body = json.dumps({'title': 'foo', 'body': 'bar', 'userId': i})
    headers = {'content-type': 'application/json'}
    result = await send_post_request(session, test_url, headers, body)
    # TODO process JSON result
    print(result)


async def main():
    session = aiohttp.ClientSession()
    loop = asyncio.get_event_loop()
    i = 0
    while True:
        # Add a new request to the event loop
        loop.create_task(async_request(session, i))
        i += 1
        # TODO change sleeping period to what you want
        await asyncio.sleep(0.1)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
P.S.: I used similar code in a project and noticed that you can clearly send more requests with asyncio than with traditional threads in a short amount of time.
I first make a simple request to get a JSON containing all the names, then I iterate over the names, create an asynchronous awaitable call for each one, store them in a list called "tasks", and then gather all of them.
The problem is that the server limits the number of API responses per minute, and no matter how low I set the semaphore value, this code takes the same (too short) amount of time to make the API calls, as if the semaphore didn't exist at all. How do I control the API call rate?
<some code>
url = "http://example.com/"
response = requests.request("GET", url, headers=headers)

async def get_api(session, url_dev):
    async with session.get(url_dev, headers=headers) as resp:
        result = await resp.json()
        return result

async def main():
    async with aiohttp.ClientSession() as session:
        sem = asyncio.Semaphore(1)
        tasks = []
        for i in response.json()["Names"]:
            url_dev = "https://example.com/example/" + str(i["Id"])
            await sem.acquire()
            async with sem:
                tasks.append(asyncio.create_task(get_api(session, url_dev)))
        full_list = list()
        async with sem:
            full_list = await asyncio.gather(*tasks)

asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
asyncio.run(main())
A semaphore really isn't the right tool to manage rate limiting here, unless you increment the semaphore in a separate loop, add a sleep inside the critical section, or schedule a follow-up task that sleeps and then releases the semaphore.
Further, you've queued all of the tasks inside the critical section, but the execution happens asynchronously to the critical section because you queued each one as a task. You need to have the semaphore inside the get_api method.
Also, you're acquiring the semaphore twice; either use the acquire method with try/finally, or use async with, but not both (see the docs). A sketch of the two correct forms follows.
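For illustration, the two equivalent forms side by side; do_request is a hypothetical stand-in for the real aiohttp call:

import asyncio

async def do_request():
    await asyncio.sleep(0.1)   # stand-in for the real aiohttp call
    return "ok"

async def with_try_finally(sem):
    # form 1: explicit acquire paired with try/finally
    await sem.acquire()
    try:
        return await do_request()
    finally:
        sem.release()

async def with_context_manager(sem):
    # form 2: async with acquires and releases for you
    async with sem:
        return await do_request()

async def main():
    sem = asyncio.Semaphore(1)
    print(await asyncio.gather(with_try_finally(sem), with_context_manager(sem)))

asyncio.run(main())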
Here is a simple script illustrating a task loop that never starts more than 5 tasks per 5-second interval:
import asyncio


async def dequeue(sem, sleep):
    """Wait for a duration and then increment the semaphore"""
    try:
        await asyncio.sleep(sleep)
    finally:
        sem.release()


async def task(sem, sleep, data):
    """Decrement the semaphore, schedule an increment, and then work"""
    await sem.acquire()
    asyncio.create_task(dequeue(sem, sleep))
    # logic here
    print(data)


async def main():
    max_concurrent = 5
    sleep = 5
    sem = asyncio.Semaphore(max_concurrent)
    tasks = [asyncio.create_task(task(sem, sleep, i)) for i in range(15)]
    await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(main())
You could also wrap this logic in a decorator if you want to get really fancy:
import asyncio
import functools


def rate_limited(max_concurrent, duration):
    def decorator(func):
        semaphore = asyncio.Semaphore(max_concurrent)

        async def dequeue():
            try:
                await asyncio.sleep(duration)
            finally:
                semaphore.release()

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            await semaphore.acquire()
            asyncio.create_task(dequeue())
            return await func(*args, **kwargs)

        return wrapper
    return decorator
Then the code becomes the following (note the semaphore was created outside of asyncio.run, so you need to use the default loop for it to work properly):
@rate_limited(max_concurrent=5, duration=5)
async def task(i):
    print(i)


async def main():
    tasks = [asyncio.create_task(task(i)) for i in range(7)]
    await asyncio.gather(*tasks)


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
You should acquire and release the semaphore object when you run the request to the API endpoint in get_api, instead of when you create the tasks and gather the results. Also, based on your sample use case, there should be no need to manually call sem.acquire and sem.release when you use its context manager instead:
async def get_api(session, sem: asyncio.Semaphore, url_dev):
    # below, using both the semaphore and session.get in a context manager
    # now, the semaphore will properly block requests when the limit has been reached, until others have finished
    async with sem, session.get(url_dev, headers=headers) as resp:
        result = await resp.json()
        return result


async def main():
    sem = asyncio.Semaphore(1)
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in response.json()["Names"]:
            url_dev = "https://example.com/example/" + str(i["Id"])
            # passing the semaphore instance to get_api
            tasks.append(asyncio.create_task(get_api(session, sem, url_dev)))
        full_list = await asyncio.gather(*tasks)
I'm writing code to get some links from a list of input URLs using asyncio, aiohttp and BeautifulSoup.
Here's a snippet of the relevant code:
def async_get_jpg_links(links):
    def extractLinks(ep_num, html):
        soup = bs4.BeautifulSoup(html, 'lxml',
                                 parse_only=bs4.SoupStrainer('article'))
        main = soup.findChildren('img')
        return ep_num, [img_link.get('data-src') for img_link in main]

    async def get_htmllinks(session, ep_num, ep_link):
        async with session.get(ep_link) as response:
            html_txt = await response.text()
            return extractLinks(ep_num, html_txt)

    async def get_jpg_links(ep_links):
        async with aiohttp.ClientSession() as session:
            tasks = [get_htmllinks(session, num, link)
                     for num, link in enumerate(ep_links, 1)]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    return loop.run_until_complete(get_jpg_links(links))
I then later call jpgs_links = dict(async_get_jpg_links(hrefs)), where hrefs is a bunch of links (~170 links).
jpgs_links should be a dictionary with numerical keys and lists as values. Some of the values come back as empty lists (which should instead be filled with data). When I cut down the number of links in hrefs, more of the lists come back full.
For the photo below, I reran the same code a minute apart, and as you can see, different lists come back empty and different ones come back full each time.
Could it be that asyncio.gather is not waiting for all the tasks to finish?
How can I get asyncio to get me to return no empty lists, while keeping the number of links in hrefs high?
So, turns out that some of the urls I sent in threw up the error:
raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 504, message='Gateway Time-out',...
So I changed
async def get_htmllinks(session, ep_num, ep_link):
    async with session.get(ep_link) as response:
        html_txt = await response.text()
        return extractLinks(ep_num, html_txt)
to
async def get_htmllinks(session, ep_num, ep_link):
    html_txt = None
    while not html_txt:
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                html_txt = await response.text()
        except aiohttp.ClientResponseError:
            await asyncio.sleep(1)
    return extractLinks(ep_num, html_txt)
What this does is retry the connection after sleeping for a second (the await asyncio.sleep(1) does that).
Nothing to do with asyncio or BeautifulSoup, apparently.
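If you would rather not retry forever, a bounded variant might look like this (a sketch; max_retries and the growing back-off are assumptions, and extractLinks and session come from the snippet above):

import asyncio
import aiohttp

async def get_htmllinks(session, ep_num, ep_link, max_retries=5):
    for attempt in range(max_retries):
        try:
            async with session.get(ep_link) as response:
                response.raise_for_status()
                return extractLinks(ep_num, await response.text())
        except aiohttp.ClientResponseError:
            # wait a little longer after each failed attempt
            await asyncio.sleep(attempt + 1)
    return ep_num, []  # give up on this link and return an empty list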
I need to make an API request for several pieces of data, and then process each result. The request is paginated, so I'm currently doing
def get_results():
    while True:
        response = api(num_results=5)
        if response is None:  # No more results
            break
        yield response

def process_data():
    for page in get_results():
        for result in page:
            do_stuff(result)
process_data()
I'm hoping to use asyncio to retrieve the next page of results from the API while I'm processing the current one, instead of waiting for results, processing them, then waiting again. I've modified the code to
import asyncio
async def get_results():
    while True:
        response = api(num_results=5)
        if response is None:  # No more results
            break
        yield response

async def process_data():
    async for page in get_results():
        for result in page:
            do_stuff(result)
asyncio.run(process_data())
I'm not sure if this is doing what I intend it to. Is this the right way to make processing the current page of API results and getting the next page of results asynchronous?
Maybe you can use asyncio.Queue to refactor your code into a producer/consumer pattern:
import asyncio
import random

q = asyncio.Queue()


async def api(num_results):
    # you could use aiohttp to fetch the api
    # fake content
    await asyncio.sleep(1)
    fake_response = random.random()
    if fake_response < 0.1:
        return None
    return fake_response


async def get_results(q):
    while True:
        response = await api(num_results=5)
        if response is None:
            # indicate the producer is done
            print('Producer Done')
            await q.put(None)
            break
        print('Producer: ', response)
        await q.put(response)


async def process_data():
    while True:
        data = await q.get()
        if not data:
            print('Consumer Done')
            break
        # process data however you want; if it's CPU intensive, you can call loop.run_in_executor
        # fake the processing taking a little time
        await asyncio.sleep(3)
        print('Consume', data)


loop = asyncio.get_event_loop()
loop.create_task(get_results(q))
loop.run_until_complete(process_data())
Coming back to the question:
Is this the right way to make processing the current page of API results and getting the next page of results asynchronous?
It's not, because get_results() is only resumed after each do_stuff(result) call finishes, so fetching and processing still happen one after the other. A sketch of one way to actually overlap them follows.
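A minimal sketch of one way to overlap the two, assuming api() and do_stuff() from the question (api() is a blocking call, so it is pushed to the default thread pool):

import asyncio
from functools import partial

async def process_data():
    loop = asyncio.get_running_loop()
    fetch = partial(api, num_results=5)   # api() and do_stuff() come from the question
    # start fetching the first page in a worker thread
    next_page = loop.run_in_executor(None, fetch)
    while True:
        page = await next_page
        if page is None:   # no more results
            break
        # kick off the fetch of the next page before processing the current one
        next_page = loop.run_in_executor(None, fetch)
        for result in page:
            do_stuff(result)

asyncio.run(process_data())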
I have a script that checks the status code for a couple hundred thousand supplied websites, and I was trying to integrate a Semaphore into the flow to speed up processing. The problem is that whenever I integrate a Semaphore, I just get a list populated with None objects, and I'm not entirely sure why.
I have mostly been copying code from other sources as I don't fully grok asynchronous programming yet, but it seems like when I debug I should be getting results out of the function, yet something goes wrong when I gather the results. I've tried juggling my looping, my gathering, ensuring futures, etc., but nothing seems to return a list of things that work.
async def fetch(session, url):
    try:
        async with session.head(url, allow_redirects=True) as resp:
            return url, resp.real_url, resp.status, resp.reason
    except Exception as e:
        return url, None, e, 'Error'

async def bound_fetch(sem, session, url):
    async with sem:
        await fetch(session, url)

async def run(urls):
    timeout = 15
    tasks = []
    sem = asyncio.Semaphore(100)
    conn = aiohttp.TCPConnector(limit=64, ssl=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        for url in urls:
            task = asyncio.wait_for(bound_fetch(sem, session, url), timeout)
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        # responses = [await f for f in tqdm.tqdm(asyncio.as_completed(tasks), total=len(tasks))]
        return responses
urls = ['https://google.com', 'https://yahoo.com']
loop = asyncio.ProactorEventLoop()
data = loop.run_until_complete(run(urls))
I've commented out the progress bar component, but that implementation returns the desired results when there is no semaphore.
Any help would be greatly appreciated. I am furiously reading up on asynchronous programming, but I can't wrap my mind around it yet.
You should explicitly return the result of the awaited coroutine.
Replace this code...
async def bound_fetch(sem, session, url):
    async with sem:
        await fetch(session, url)
... with this:
async def bound_fetch(sem, session, url):
    async with sem:
        return await fetch(session, url)