Aiohttp session timeout doesn't cancel the request - python

I have this piece of code where I send a POST request and set a maximum timeout on it using the aiohttp package:
from aiohttp import ClientTimeout, ClientSession

response_code = None
timeout = ClientTimeout(total=2)
async with ClientSession(timeout=timeout) as session:
    try:
        async with session.post(
            url="some url", json=post_payload, headers=headers,
        ) as response:
            response_code = response.status
    except Exception as err:
        logger.error(err)
That part works; however, the request does not appear to be canceled when the timeout is reached and the except clause is triggered - I still receive it on the other end, even though an exception has been raised. I would like the request to be canceled automatically whenever the timeout is reached. Thanks in advance.
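A minimal sketch of the same snippet with the timeout error caught explicitly (the URL and payload are placeholders): the timeout raises asyncio.TimeoutError and closes the connection on the client side, but it cannot retract a POST body that the server has already received, which matches the behaviour described above.

import asyncio
from aiohttp import ClientSession, ClientTimeout

async def post_with_timeout(url, post_payload, headers):
    # Sketch only: placeholder URL/payload, 2-second total budget as in the question.
    timeout = ClientTimeout(total=2)
    async with ClientSession(timeout=timeout) as session:
        try:
            async with session.post(url, json=post_payload, headers=headers) as response:
                return response.status
        except asyncio.TimeoutError:
            # The client gives up here; the server may still process the request
            # if the body was already transmitted before the timeout fired.
            return None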

Related

Python: best way to check for list of URLs

I have a file defining a list of RSS feeds:
RSS_FEEDS = [
    "https://www.fanpage.it/feed/",
    "https://www.ilfattoquotidiano.it/feed/",
    "https://forbes.it/feed/",
    "https://formiche.net/feed/",
]
I wrote the following test:
import requests

from feeds import RSS_FEEDS

for rssfeed in RSS_FEEDS:
    response = requests.get(rssfeed)
    assert response.status_code == 200
Are there more efficient (download less stuff) ways?
How would you handle a slow response vs a dead link?
The above would just tell me if the URL is fetchable, but how could I assess if it's a valid RSS stream?
You could also solve it with the aiohttp library together with asyncio, like this:
from aiohttp import ClientSession
from asyncio import gather, create_task, run, set_event_loop, set_event_loop_policy
from traceback import format_exc
import sys

# This is necessary on my Windows computer
if sys.version_info[0] == 3 and sys.version_info[1] >= 8 and sys.platform.startswith('win'):  # Check for operating system
    from asyncio import ProactorEventLoop, WindowsSelectorEventLoopPolicy
    set_event_loop(ProactorEventLoop())
    set_event_loop_policy(WindowsSelectorEventLoopPolicy())  # Bug is not present in Linux

RSS_FEEDS = [
    "https://www.fanpage.it/feed/",
    "https://www.ilfattoquotidiano.it/feed/",
    "https://forbes.it/feed/",
    "https://formiche.net/feed/",
]

async def GetRessource(url: str, session: ClientSession) -> dict:
    try:
        async with session.get(url) as response:
            if response.status == 200:
                return response.status
            else:
                r: str = await response.text()
                print(f"Error, got response code: {response.status} message: {r}")
    except Exception:
        print(f"General Exception:\n{format_exc()}")
    return {}

async def GetUrls() -> None:
    async with ClientSession() as session:
        Tasks: list = [create_task(GetRessource(url, session)) for url in RSS_FEEDS]
        Results: list = await gather(*Tasks, return_exceptions=False)
        for result in Results:
            assert result == 200

async def main():
    await GetUrls()

if __name__ == "__main__":
    run(main())
Result of Results:
200
200
200
200
It's checking the URLs in parallel.
To optimize network usage, add a timeout parameter to the get request to limit how long you wait for a response, and a stream parameter so that only a portion of the response is downloaded in chunks rather than the entire file.
To handle a slow or dead link, add a timeout parameter to the get request so an exception is raised if the response takes too long, and catch and handle the exceptions raised by the get request, such as TimeoutError, ConnectionError, and HTTPError (e.g. retry, log the error).
To validate an RSS stream, use a library like feedparser to parse the response and determine whether it's a valid RSS feed, as well as look for specific elements/attributes in the response (e.g. channel, item, title, link) that are required for an RSS feed.
import requests
import feedparser
from requests.exceptions import Timeout, ConnectionError, HTTPError

for rssfeed in RSS_FEEDS:
    try:
        response = requests.get(rssfeed, timeout=5)
        response.raise_for_status()
        feed = feedparser.parse(response.content)
        if not feed.bozo:
            pass  # feed is valid
        else:
            pass  # feed is invalid
    except (Timeout, ConnectionError, HTTPError) as e:
        # handle exceptions here (e.g. retry, log error)
        pass
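The "download less stuff" idea mentioned above can be sketched like this (a hedged example, not part of the original answer): stream each response and read only the first chunk, enough to confirm the feed responds and starts like an RSS/Atom document. The chunk size and the byte-level check are assumptions.

import requests

for rssfeed in RSS_FEEDS:
    # stream=True defers the body download; iter_content pulls it in chunks
    with requests.get(rssfeed, timeout=5, stream=True) as response:
        response.raise_for_status()
        first_chunk = next(response.iter_content(chunk_size=1024), b"")
        looks_like_feed = b"<rss" in first_chunk or b"<feed" in first_chunk
        print(rssfeed, response.status_code, looks_like_feed)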

Why am I not getting a connection error when my API request fails on dropped wifi?

I am pulling data down from an API that has a limit of 250 records per call. There are a total of 100,000 records I need to pull down, 250 at a time. I run my application leveraging the get_stats function below. It works fine for a while, but when my wifi drops while I am in the middle of the get request, the request hangs and I don't get an exception back, causing the rest of the application to hang as well.
I have tested turning off my wifi when the function is NOT in the middle of the get request and it does return back the ConnectionError exception.
How do I go about handling the situation where my app is in the middle of the get request and my wifi drops? I am thinking I need to do a timeout to give my wifi time to reconnect and then retry but how do I go about doing that? Or is there another way?
import json
import requests

def get_stats(url, version):
    headers = {
        "API_version": version,
        "API_token": "token"
    }
    try:
        r = requests.get(url, headers=headers)
        print(f"Status code: {r.status_code}")
        return json.loads(r.text)
    except requests.exceptions.Timeout:
        # Maybe set up for a retry, or continue in a retry loop
        print("Error here in timeout")
    except requests.exceptions.TooManyRedirects:
        # Tell the user their URL was bad and try a different one
        print("Redirect errors here")
    except requests.exceptions.ConnectionError as r:
        print("Connection error")
        r = "Connection Error"
        return r
    except requests.exceptions.RequestException as e:
        # catastrophic error. bail.
        print("System errors here")
        raise SystemExit(e)
To set a timeout on the request, call requests.get like this:
r = requests.get(url, headers=headers, timeout=10)
The end goal is to get the data, so just make the call again after a failure, possibly with a sleep in between.
Edit: I would say that the timeout is the sleep.
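A hedged sketch of that retry idea (the retry count, backoff, and timeout values here are arbitrary choices, not from the answer above):

import time
import requests

def get_stats_with_retry(url, version, retries=3, backoff=5):
    # Retry the GET with a timeout, sleeping between attempts so the wifi
    # has a chance to reconnect.
    headers = {"API_version": version, "API_token": "token"}
    for attempt in range(retries):
        try:
            r = requests.get(url, headers=headers, timeout=10)
            r.raise_for_status()
            return r.json()
        except (requests.exceptions.Timeout,
                requests.exceptions.ConnectionError) as err:
            print(f"Attempt {attempt + 1} failed: {err}")
            time.sleep(backoff)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts")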

Downloading Images with Python aiohttp: ClientPayloadError: Response payload is not completed

Prerequisites:
Python 3.9.5
aiohttp 3.7.4.post0
Hello! I am trying to download images from a given URL, and 99% of the time it works just fine. Here is the snippet:
import io

import aiohttp
from PIL import Image

async def download_image(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status != 200:
                raise exceptions.FileNotFound()  # exceptions is a project-local module
            data = await response.read()
            img = Image.open(io.BytesIO(data))
            return img
But sometimes, at the step data = await response.read(), the function throws aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed. The exception can be raised for a particular image, and on a second attempt the very same image loads fine.
aiohttp documentation states:
This exception can only be raised while reading the response payload
if one of these errors occurs:
invalid compression
malformed chunked encoding
not enough data that satisfy Content-Length HTTP header.
What can I do to debug precisely what raises the exception? It seems to me that some data gets corrupted during session.get(url), bits flipping here and there. Is there a better way to retry the image download than catching the error around the download_image call and repeating it?
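For illustration, one way to wrap the download in a retry loop; this is a sketch, with the attempt count, the one-second pause, and the raise_for_status() call (in place of the custom exceptions.FileNotFound) as assumptions:

import asyncio
import io

import aiohttp
from PIL import Image

async def download_image_with_retry(url, attempts=3):
    last_error = None
    for _ in range(attempts):
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as response:
                    response.raise_for_status()
                    data = await response.read()
                    return Image.open(io.BytesIO(data))
        except aiohttp.ClientPayloadError as err:
            last_error = err
            await asyncio.sleep(1)  # brief pause before retrying
    raise last_error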

Combining async and sync requests in python?

I am trying to make a request to server A, where the response will be a list of requests, which I will make to server B.
Currently request to server A is just a simple sync request like this:
import requests
req = requests.get('https://server-a.com')
data = req.json()
list_of_requests = data['requests'] # requests for server B
Since list_of_requests can be a few thousand items long, I would like to use async to speed up the requests to B.
I've looked at several examples of async HTTP requests using aiohttp, such as from
https://towardsdatascience.com/fast-and-async-in-python-accelerate-your-requests-using-asyncio-62dafca83c33
import aiohttp
import asyncio
import json
import os
from aiohttp import ClientSession
from requests.exceptions import HTTPError

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes?q=isbn:"
LIST_ISBN = [
    '9780002005883',
    '9780002238304',
    '9780002261982',
    '9780006163831',
    '9780006178736',
    '9780006280897',
    '9780006280934',
    '9780006353287',
    '9780006380832',
    '9780006470229',
]

def extract_fields_from_response(response):
    """Extract fields from API's response"""
    item = response.get("items", [{}])[0]
    volume_info = item.get("volumeInfo", {})
    title = volume_info.get("title", None)
    subtitle = volume_info.get("subtitle", None)
    description = volume_info.get("description", None)
    published_date = volume_info.get("publishedDate", None)
    return (
        title,
        subtitle,
        description,
        published_date,
    )

async def get_book_details_async(isbn, session):
    """Get book details using Google Books API (asynchronously)"""
    url = GOOGLE_BOOKS_URL + isbn
    try:
        response = await session.request(method='GET', url=url)
        response.raise_for_status()
        print(f"Response status ({url}): {response.status}")
    except HTTPError as http_err:
        print(f"HTTP error occurred: {http_err}")
    except Exception as err:
        print(f"An error occurred: {err}")
    response_json = await response.json()
    return response_json

async def run_program(isbn, session):
    """Wrapper for running program in an asynchronous manner"""
    try:
        response = await get_book_details_async(isbn, session)
        parsed_response = extract_fields_from_response(response)
        print(f"Response: {json.dumps(parsed_response, indent=2)}")
    except Exception as err:
        print(f"Exception occurred: {err}")
        pass

async with ClientSession() as session:
    await asyncio.gather(*[run_program(isbn, session) for isbn in LIST_ISBN])
However, all of the examples I have looked at start with the list of requests already defined. My question is, what is the proper pythonic way/pattern of combining a single sync request and then using that request to 'spawn' async tasks?
Thanks a bunch!
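One common pattern (a sketch, assuming server B's endpoints return JSON and the items in list_of_requests are plain URLs): do the single synchronous call first, then hand the resulting list to asyncio.run() with an aiohttp session and a gather over one task per URL.

import asyncio

import aiohttp
import requests

async def fetch(session, url):
    # Placeholder handling; adapt to whatever server B actually returns.
    async with session.get(url) as response:
        return await response.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

def main():
    # One plain synchronous request to server A ...
    list_of_requests = requests.get('https://server-a.com').json()['requests']
    # ... then the async fan-out to server B.
    results = asyncio.run(fetch_all(list_of_requests))
    print(len(results))

if __name__ == '__main__':
    main()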

Correctly catch aiohttp TimeoutError when using asyncio.gather

This is my first question here on Stack Overflow, so I apologize if I did something stupid or missed something.
I am trying to make asynchronous aiohttp GET requests to many API endpoints at a time to check the status of these pages: the result should be a triple of the form
(url, True, "200") in case of a working link and (url, False, response_status) in case of a "problematic link". This is the atomic function for each call:
async def ping_url(url, session, headers, endpoint):
    try:
        async with session.get((url + endpoint), timeout=5, headers=headers) as response:
            return url, (response.status == 200), str(response.status)
    except Exception as e:
        test_logger.info(url + ": " + e.__class__.__name__)
        return url, False, repr(e)
These are wrapped into a function using asyncio.gather() which also creates the aiohttp Session:
async def ping_urls(urllist, endpoint):
    headers = ...  # not relevant
    async with ClientSession() as session:
        try:
            results = await asyncio.gather(*[ping_url(url, session, headers, endpoint)
                                             for url in urllist], return_exceptions=True)
        except Exception as e:
            print(repr(e))
    return results
The whole thing is called from a main that looks like this:
urls = ...  # not relevant
loop = asyncio.get_event_loop()
try:
    loop.run_until_complete(ping_urls(urls, endpoint))
except Exception as e:
    pass
finally:
    loop.close()
This works most of the time, but if the list is pretty long, I noticed that as soon as I get one TimeoutError, the execution stops and I get TimeoutError for all the other URLs after the first one that timed out. If I omit the timeout in the innermost function I get somewhat better results, but then it is not that fast anymore. Is there a way to control the timeouts for the single API calls instead of one big general timeout for the whole list of URLs?
Any kind of help would be extremely appreciated; I am stuck on my bachelor thesis because of this issue.
You may want to try setting a session timeout for your client session. This can be done like this:
async def ping_urls(urllist, endpoint):
    headers = ...  # not relevant
    timeout = ClientTimeout(total=TIMEOUT_SECONDS)
    async with ClientSession(timeout=timeout) as session:
        try:
            results = await asyncio.gather(
                *[
                    ping_url(url, session, headers, endpoint)
                    for url in urllist
                ],
                return_exceptions=True
            )
        except Exception as e:
            print(repr(e))
    return results
This should set the ClientSession instance to have TIMEOUT_SECONDS as the timeout. Obviously you will need to set that value to something appropriate!
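If you specifically want per-call control rather than one session-wide value, aiohttp also accepts a timeout argument on the individual request; here is a hedged variant of ping_url (the 5-second budget is an arbitrary choice):

from aiohttp import ClientTimeout

async def ping_url(url, session, headers, endpoint):
    # Each request gets its own budget, overriding the session default.
    per_request_timeout = ClientTimeout(total=5)
    try:
        async with session.get(url + endpoint, headers=headers,
                               timeout=per_request_timeout) as response:
            return url, (response.status == 200), str(response.status)
    except Exception as e:
        return url, False, repr(e)

Combined with return_exceptions=True in the gather call, a timeout on one URL should then stay confined to that task.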
I struggled with the exceptions as well. Then I found the hint that I can also print the type of the exception, and with that build appropriate exception handling.
try:
    ...
except Exception as e:
    print(f'Error: {e} of Type: {type(e)}')
So with this you can find out what kinds of errors occur, and you can catch and handle them individually. For example:
try:
    ...
except aiohttp.ClientConnectionError as e:
    ...  # deal with this type of exception
except aiohttp.ClientResponseError as e:
    ...  # handle individually
except asyncio.exceptions.TimeoutError as e:
    ...  # these kinds of errors happened to me as well
