Handling bad requests in aiohttp - python

I'm trying to figure out a way to handle bad requests from an API within an asynchronous function using aiohttp. This is what I've got for testing:
async def fetch(session):
    url = 'http://httpbin.org/status/404'
    async with session.request('GET', url) as response:
        if response.status == 200:
            try:
                r = await response.json()
                return r
            except ValueError:
                return
        else:
            return None

async def fetch_all(project_list):
    output = []
    async with ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(session, project)) for project in project_list]
        for future in await asyncio.gather(*tasks):
            output += future
    return output

def get_data(project_list):
    loop = asyncio.get_event_loop()
    futures = asyncio.ensure_future(fetch_all(project_list))
    output = loop.run_until_complete(futures)
    return output
In this example, project_list is just a list of integers.
In this instance, fetch() should return None since the response will undoubtedly be 404. The problem arises in fetch_all() where I tell it to += future. I get a TypeError: 'coroutine' object is not iterable. Basically I'd like this to return nothing and in this case, += nothing to that list. In a perfect world I'd receive a proper JSON response every time, but I'd like to account for the random instance where I receive a bad response from the server.
From what I've read, @asyncio.coroutine would return None, but async values have to be awaited if I'm understanding it correctly.

First, you don't need to wrap the coroutines in ensure_future if you want to use gather. Second, you are creating fetch tasks with two arguments, session and project, but your fetch function definition only accepts session. And one more thing you can change is removing the loop that iterates over the gather result, because gather already gives you the output you want to return.
The code will look like this:
async def fetch(session, project):
    url = 'http://httpbin.org/status/404'
    async with session.request('GET', url) as response:
        if response.status == 200:
            try:
                return await response.json()
            except ValueError:
                pass
        return None
async def fetch_all(project_list):
    async with ClientSession() as session:
        tasks = [fetch(session, project) for project in project_list]
        return await asyncio.gather(*tasks)
def get_data(project_list):
    loop = asyncio.get_event_loop()
    output = loop.run_until_complete(fetch_all(project_list))
    return output
As for advice: use ensure_future only if you want to start a coroutine running immediately but will only need its result later.
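For illustration, here is a minimal sketch of that pattern; the slow_job and fire_and_collect names and the sleep durations are just placeholders, not part of the original code:
import asyncio

async def slow_job():
    # Stand-in for any long-running coroutine
    await asyncio.sleep(3)
    return 42

async def fire_and_collect():
    # ensure_future schedules slow_job on the event loop right away...
    future = asyncio.ensure_future(slow_job())
    # ...so other work can happen while it runs in the background...
    await asyncio.sleep(1)
    # ...and the result is only awaited when it is actually needed.
    result = await future
    print(result)

asyncio.run(fire_and_collect())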

I'm not completely sure this is the best way to do it but I put this together and it worked. If anyone can correct me, please do.
async def fetch_all(project_list):
    output = []
    async with ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(session, project)) for project in project_list]
        for future in await asyncio.gather(*tasks):
            if future is not None:  # Check if the future is None before adding it
                output += future
    return output
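A slightly more compact variant (just a sketch, reusing the fetch from above) collects only the non-None results with a list comprehension; note that it appends each whole response rather than using +=:
import asyncio
from aiohttp import ClientSession

# assumes the fetch(session, project) coroutine defined earlier
async def fetch_all(project_list):
    async with ClientSession() as session:
        tasks = [fetch(session, project) for project in project_list]
        results = await asyncio.gather(*tasks)
    # keep only the successful (non-None) responses
    return [r for r in results if r is not None]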

Related

Asyncio + Aiohttp Memory Leak when running async function in for loop (python)

I am making a Python function which makes a lot of requests to an API. The function works like this:
async def get_one(session, url):
    try:
        async with session.get(url) as resp:
            resp = await resp.json()
    except:
        resp = None
    return resp, url

async def get_all(session, urls):
    tasks = [asyncio.create_task(get_one(session, url)) for url in urls]
    results = await asyncio.gather(*tasks)
    return results

async def make_requests(urls):
    timeout = aiohttp.ClientTimeout(sock_read=10, sock_connect=10, total=0.1*len(urls))
    connector = aiohttp.TCPConnector(limit=125)
    async with aiohttp.ClientSession(connector=connector, skip_auto_headers=['User-Agent'], timeout=timeout) as session:
        data = await get_all(session, urls)
    return data

def main(urls):
    results = []
    while urls:
        retry = []
        response = asyncio.run(make_requests(urls))
        for resp, url in response:
            if resp is not None:
                results.append(resp)
            else:
                retry.append(url)
        urls = retry
    return results
The problem is that my function keeps building up memory: the more errors there are in the try-except block inside the get_one function, the more times I have to retry, and the more memory it consumes (something is preventing Python from collecting the garbage).
I have come across an old answer (Asyncio with memory leak (Python)) stating that create_task() (or ensure_future) is responsible for this, as it keeps a reference to the original task.
But it is still not clear to me whether this is really the case, or how to solve this issue if it is. Any help will be appreciated, thank you!

On Python Asyncio I am trying to return a value from one function to another

Using Python asyncio, I am trying to return a value from one function to another. When the if statement at the end of the "check" function is True, the returned value should go to the "print" function.
async def check(i):
    async with aiohttp.ClientSession() as session:
        url = f'https://pokeapi.co/api/v2/pokemon/{i}'
        async with session.get(url) as resp:
            data = await resp.text()
            if data['state'] == 'yes':
                return data['state']

# how do I keep the structure of asyncio and pass this result to the "print" function?
async def print(here should be the return of the "check" function, if there is one):
    print()
    await asyncio.sleep(0)

async def main():
    for i in range(0, 5):
        await asyncio.gather(check(i),
                             print())
Thank You (-:
Your code is going to run everything synchronously. You need to restructure things a bit to see any value from asyncio.
async def check(i):
    async with aiohttp.ClientSession() as session:
        url = f'https://pokeapi.co/api/v2/pokemon/{i}'
        async with session.get(url) as resp:
            data = await resp.text()
            if data['state'] == 'yes':
                return data['state']

async def main():
    aws = [check(i) for i in range(5)]
    results = await asyncio.gather(*aws)
    for result in results:
        print(result)
This will allow your aiohttp requests to run asynchronously. Assuming print is really just a wrapper around the builtin, you don't need it and can just use the builtin.
If, however, print actually does something else, you should use asyncio.as_completed instead of asyncio.gather.
async def my_print(result):
    print(result)
    await asyncio.sleep(0)

async def main():
    aws = [check(i) for i in range(5)]
    for coro in asyncio.as_completed(aws):
        result = await coro
        await my_print(result)
Simple solution: do not run the two functions concurrently. One of them clearly needs the other to finish.
async def print_it(i):
    value = await check(i)
    if value is not None:
        print(value)
There is an implicit return None when a function finishes its last statement, i.e. when return data['state'] is NOT executed in check(). In that case nothing is printed - adjust the code if that is not correct.
Of course, you should then start only the print_it coroutines, without starting the checks directly.
If you really need to run the functions concurrently for some reason, use a Queue. The producer puts the data into the queue, and the consumer gets the value when it is available.
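Here is a minimal sketch of that producer/consumer idea, assuming the check() coroutine from above; the producer, consumer names and the None sentinel are illustrative choices, not part of the original code:
import asyncio

async def producer(queue, n):
    for i in range(n):
        value = await check(i)        # check() as defined above
        if value is not None:         # skip the implicit-None case mentioned earlier
            await queue.put(value)
    await queue.put(None)             # sentinel: tell the consumer to stop

async def consumer(queue):
    while True:
        value = await queue.get()
        if value is None:
            break
        print(value)

async def main():
    queue = asyncio.Queue()
    await asyncio.gather(producer(queue, 5), consumer(queue))

asyncio.run(main())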

Python Asyncio - Running the same task in parallel

I'm trying to run this code in parallel:
availablePrefix = {"http://URL-to-somewehere.com": "true", "http://URL-to-somewehere-else.com": "true"}

def main():
    while True:
        prefixUrl = getFreePrefix()  # Waits until new url is free
        sendRequest("https://stackoverflow.com/", prefixUrl)

def getFreePrefix():
    while True:
        for prefix in self.availablePrefix.keys():
            if availablePrefix.get(prefix) == "true":
                availablePrefix[prefix] = "false"  # Can't be used for another request
                return prefix

async def sendRequest(self, prefix, suffix):
    url = prefix + "/" + suffix
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            response = await resp.text()
            availablePrefix[prefix] = "true"  # Can be used again
            return json.loads(response)
Basically, I'm trying to run the main() function in parallel.
The main() function is stuck until getFreePrefix() returns a new prefix (URL to my server). With the help of this prefix we can access my server and start a request.
If this prefix is in use, it is set to false to indicate that it can't be used for another request right now (when the request is completed, it is set back to true).
What I want to achieve is that every time a prefix becomes free, a new request is run in parallel.
Thanks for helping!
With your inconsistent use of self, I can't tell whether parts of your code are supposed to be part of a class or not. It also appears that your intention is for function main to run in an infinite loop and, as soon as a key of availablePrefix has been processed, for it to be available for processing again. In your current, non-concurrent code, I believe that this could have been accomplished more simply as:
# simple list:
availablePrefix = ["http://URL-to-somewehere.com", "http://URL-to-somewehere-else.com"]

def main():
    while True:
        for prefixUrl in availablePrefix:
            sendRequest("https://stackoverflow.com/", prefixUrl)
You then get rid of the getFreePrefix method and remove the code from sendRequest that updates availablePrefix, which is now a list rather than a dictionary. The other improvement I would make is to create the aiohttp.ClientSession() instance only once, in main, and pass it as an argument to whatever needs it.
Moving on. To repeatedly process the prefixes concurrently, the simplest way I know is:
import asyncio
import json
import aiohttp

availablePrefix = ["http://URL-to-somewehere.com", "http://URL-to-somewehere-else.com"]

async def main():
    # create the session instance once and pass it as an argument:
    async with aiohttp.ClientSession() as session:
        while True:
            tasks = {asyncio.create_task(sendRequest(session, "https://stackoverflow.com/", prefixUrl)) for prefixUrl in availablePrefix}
            for task in asyncio.as_completed(tasks):
                result = await task

async def sendRequest(session, prefix, suffix):
    url = prefix + "/" + suffix
    async with session.get(url) as resp:
        response = await resp.text()
        return json.loads(response)

asyncio.run(main())

aiohttp with asyncio and Semaphores returning a list filled with Nones

I have a script that checks the status code for a couple hundred thousand supplied websites, and I was trying to integrate a Semaphore into the flow to speed up processing. The problem is that whenever I integrate a Semaphore, I just get a list populated with None objects, and I'm not entirely sure why.
I have been mostly copying code from other sources, as I don't fully grok asynchronous programming yet, but it seems like when I debug I should be getting results out of the function, yet something is going wrong when I gather the results. I've tried juggling around my looping, my gathering, ensuring futures, etc., but nothing seems to return a list of things that work.
async def fetch(session, url):
    try:
        async with session.head(url, allow_redirects=True) as resp:
            return url, resp.real_url, resp.status, resp.reason
    except Exception as e:
        return url, None, e, 'Error'

async def bound_fetch(sem, session, url):
    async with sem:
        await fetch(session, url)

async def run(urls):
    timeout = 15
    tasks = []
    sem = asyncio.Semaphore(100)
    conn = aiohttp.TCPConnector(limit=64, ssl=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        for url in urls:
            task = asyncio.wait_for(bound_fetch(sem, session, url), timeout)
            tasks.append(task)
        responses = await asyncio.gather(*tasks)
        # responses = [await f for f in tqdm.tqdm(asyncio.as_completed(tasks), total=len(tasks))]
        return responses

urls = ['https://google.com', 'https://yahoo.com']
loop = asyncio.ProactorEventLoop()
data = loop.run_until_complete(run(urls))
I've commented out the progress bar component, but that implementation returns the desired results when there is no semaphore.
Any help would be greatly appreciated. I am furiously reading up on asynchronous programming, but I can't wrap my mind around it yet.
You should explicitly return the results of the coroutines you await. Without the return, bound_fetch implicitly returns None, which is why gather gives you a list of None values.
Replace this code...
async def bound_fetch(sem, session, url):
    async with sem:
        await fetch(session, url)
... with this:
async def bound_fetch(sem, session, url):
    async with sem:
        return await fetch(session, url)
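With that fix in place, run returns the (url, real_url, status, reason) tuples produced by fetch, so the data list from loop.run_until_complete can be split into successes and failures, for example (just a sketch; the ok/failed names are illustrative):
# data is the list returned by loop.run_until_complete(run(urls)) above
ok = [r for r in data if r[3] != 'Error']
failed = [r for r in data if r[3] == 'Error']
print(len(ok), 'succeeded,', len(failed), 'failed')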

How to initiate next request before yielding in asynchronous generator in python

I'm attempting to get some data from a paginated API (specifically github's, but the API doesn't matter for this question). I'm using a python asynchronous generator to yield each individual row from each page. The code looks something like this:
async def get_data():
    cursor = None
    async with aiohttp.ClientSession() as session:
        while True:
            async with session.get(build_url(cursor)) as response:
                data = await response.json()
                for row in get_rows(data):
                    yield row
                if not has_next_page(data):
                    return
                cursor = get_next_cursor(data)
So, this basically works. However, one of the minor flaws is that it doesn't initiate the next request until after all the rows have been yielded from the current page. Is there a good way to initiate that processing inside of this loop, before starting to yield? In particular, I want to make sure that the async with is still evaluated correctly when doing asyncio.ensure_future, which is the API for initiating background work.
You'll need at least one extra coroutine to achieve that, and bridge the two with an asyncio.Queue:
async def get_data():
    queue = asyncio.Queue()

    async def fetch_all_pages():
        cursor = None
        async with aiohttp.ClientSession() as session:
            while True:
                async with session.get(build_url(cursor)) as response:
                    data = await response.json()
                    await queue.put(data)
                    if not has_next_page(data):
                        # signal the peer to exit
                        await queue.put(None)
                        break
                    cursor = get_next_cursor(data)

    asyncio.ensure_future(fetch_all_pages())
    while True:
        data = await queue.get()
        if not data:
            break
        for row in get_rows(data):
            yield row
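For completeness, a quick sketch of how this generator would then be consumed; process_row is just a placeholder for whatever is done with each row:
async def main():
    async for row in get_data():
        process_row(row)  # placeholder for real per-row handling

asyncio.run(main())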
