Downloading Images with Python aiohttp: ClientPayloadError: Response payload is not completed

Prerequisites:
Python 3.9.5
aiohttp 3.7.4.post0
Hello! I am trying to download images from a given URL, and 99% of the time it works just fine. Here is the snippet:
import io

import aiohttp
from PIL import Image

async def download_image(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            if response.status != 200:
                raise exceptions.FileNotFound()
            data = await response.read()
            img = Image.open(io.BytesIO(data))
            return img
But sometimes, on the data = await response.read() step, the function throws aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed. The exception can be raised for a certain image, and on the second attempt to load that exact image it works again.
aiohttp documentation states:
This exception can only be raised while reading the response payload
if one of these errors occurs:
invalid compression
malformed chunked encoding
not enough data that satisfy Content-Length HTTP header.
What can I do to debug precisely what raises the exception? To me it seems that some data gets corrupted during session.get(url), with bits flipping here and there. Is there a better way to retry the image download than catching the error when calling download_image and repeating the call?
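As a sketch of one possible retry strategy (not from the original question; the attempt count and delay are arbitrary choices, and raise_for_status() stands in for the custom exceptions.FileNotFound), the retry loop can live inside the download function itself so that only ClientPayloadError triggers another attempt:

import asyncio
import io

import aiohttp
from PIL import Image

async def download_image_with_retry(url, attempts=3, delay=1.0):
    async with aiohttp.ClientSession() as session:
        for attempt in range(1, attempts + 1):
            try:
                async with session.get(url) as response:
                    response.raise_for_status()  # non-200 responses are not retried here
                    data = await response.read()
                    return Image.open(io.BytesIO(data))
            except aiohttp.ClientPayloadError:
                if attempt == attempts:
                    raise  # give up and let the caller see the error
                await asyncio.sleep(delay)  # brief pause before retrying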

Related

Python: best way to check for list of URLs

I have a file defining a list of RSS feeds:
RSS_FEEDS = [
    "https://www.fanpage.it/feed/",
    "https://www.ilfattoquotidiano.it/feed/",
    "https://forbes.it/feed/",
    "https://formiche.net/feed/",
]
I wrote the following test:
import requests

from feeds import RSS_FEEDS

for rssfeed in RSS_FEEDS:
    response = requests.get(rssfeed)
    assert response.status_code == 200
Are there more efficient ways (that download less data)?
How would you handle a slow response vs a dead link?
The above would just tell me if the URL is fetchable, but how could I assess if it's a valid RSS stream?
You could also solve it using the aiohttp library together with asyncio, like this:
from aiohttp import ClientSession
from asyncio import gather, create_task, run, set_event_loop, set_event_loop_policy
from traceback import format_exc
import sys

# This is necessary on my Windows computer
if sys.version_info[0] == 3 and sys.version_info[1] >= 8 and sys.platform.startswith('win'):  # Check for operating system
    from asyncio import ProactorEventLoop, WindowsSelectorEventLoopPolicy
    set_event_loop(ProactorEventLoop())
    set_event_loop_policy(WindowsSelectorEventLoopPolicy())  # Bug is not present in Linux

RSS_FEEDS = [
    "https://www.fanpage.it/feed/",
    "https://www.ilfattoquotidiano.it/feed/",
    "https://forbes.it/feed/",
    "https://formiche.net/feed/",
]

async def GetRessource(url: str, session: ClientSession) -> dict:
    try:
        async with session.get(url) as response:
            if response.status == 200:
                return response.status
            else:
                r: str = await response.text()
                print(f"Error, got response code: {response.status} message: {r}")
    except Exception:
        print(f"General Exception:\n{format_exc()}")
    return {}

async def GetUrls() -> None:
    async with ClientSession() as session:
        Tasks: list = [create_task(GetRessource(url, session)) for url in RSS_FEEDS]
        Results: list = await gather(*Tasks, return_exceptions=False)
        for result in Results:
            assert result == 200

async def main():
    await GetUrls()

if __name__ == "__main__":
    run(main())
Result of Results:
200
200
200
200
It checks the URLs in parallel.
To optimize network usage, add a timeout parameter to the get request to limit how long you wait for a response, and use a stream parameter so that only a portion of the response is downloaded in chunks rather than the entire file (see the streaming sketch after the code below).
To handle a slow or dead link, add a timeout parameter to the get request so that an exception is raised if the response takes too long, and catch and handle the exceptions raised by the get request, such as Timeout, ConnectionError, and HTTPError (e.g. retry, log the error).
To validate an RSS stream, use a library like feedparser to parse the response and determine whether it is a valid RSS feed, and look for the elements/attributes required for an RSS feed (e.g. channel, item, title, link).
import requests
import feedparser
from requests.exceptions import Timeout, ConnectionError, HTTPError

for rssfeed in RSS_FEEDS:
    try:
        response = requests.get(rssfeed, timeout=5)
        response.raise_for_status()
        feed = feedparser.parse(response.content)
        if not feed.bozo:
            pass  # feed is valid
        else:
            pass  # feed is invalid
    except (Timeout, ConnectionError, HTTPError) as e:
        # handle exceptions here (e.g. retry, log error)
        pass
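As a hedged illustration of the stream suggestion above (this helper is not from the original answer), requests can defer downloading the body with stream=True, so a reachability check costs little more than the status line and headers:

import requests

def url_is_fetchable(url, timeout=5):
    # stream=True defers the body download; leaving the response unread
    # and closing it (via the with-block) means the feed itself is never downloaded.
    try:
        with requests.get(url, timeout=timeout, stream=True) as response:
            return response.status_code == 200
    except requests.RequestException:
        return False

Some servers also answer HEAD requests, in which case requests.head(url, timeout=timeout) transfers even less.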

Python-Aiohttp/Asyncio API request returning ContentTypeError - JSON with unexpected mimetype, but not always

I am attempting to make an API request, pull down specific chunks of the response, and ultimately save it into a file for later processing. I also want to mention up front that the script works fine until I begin to pull larger sets of data.
When I widen the params to a larger date range, I receive:
aiohttp.client_exceptions.ContentTypeError: 0, message='Attempt to decode JSON with unexpected mimetype: text/html'
async def get_dataset(session, url):
    async with session.get(url=url, headers=headers, params=params) as resp:
        dataset = await resp.json()
        return dataset['time_entries']

async def main():
    tasks = []
    async with aiohttp.ClientSession() as session:
        for page in range(1, total_pages):
            url = "https://api.harvestapp.com/v2/time_entries?page=" + str(page)
            tasks.append(asyncio.ensure_future(get_dataset(session, url)))
        dataset = await asyncio.gather(*tasks)
If I keep my params small enough, it works without issue. But with too large a date range the error pops up, and nothing past the snippet I shared above runs.
More code for reference:
url_address = "https://api.harvestapp.com/v2/time_entries/"

headers = {
    "Content-Type": 'application/json',
    "Authorization": authToken,
    "Harvest-Account-ID": accountID
}

params = {
    "from": StartDate,
    "to": EndDate
}
Any ideas on what would cause this to work on certain data sizes but fail on larger sets? I am assuming the JSON is becoming malformed at some point, but I am unsure how to examine that and/or prevent it from happening, since I am able to pull multiple pages from the API and append them successfully on the smaller data pulls.
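As a hedged debugging sketch (not part of the original question or the accepted fix), checking the status code and content type before decoding makes it obvious when the API has returned an HTML error page instead of JSON; headers and params are assumed to be the dicts defined above, and the function name is just for the sketch:

async def get_dataset_debug(session, url):
    async with session.get(url=url, headers=headers, params=params) as resp:
        content_type = resp.headers.get('Content-Type', '')
        if resp.status != 200 or 'json' not in content_type:
            # Read the raw body so the actual error page (e.g. a 429 notice) is visible.
            body = await resp.text()
            print(f"Unexpected response {resp.status} ({content_type}) from {url}: {body[:200]}")
            return []
        dataset = await resp.json()
        return dataset['time_entries']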
OP: Thank you to the others who gave answers. I discovered the issue and implemented a solution. A friend pointed out that aiohttp can return that error message if the response is an error page instead of the expected JSON content, i.e. an HTML page returning HTTP 429 Too Many Requests. I looked up the API limits and found they are set to 100 requests per 15 seconds.
My solution was to use the asyncio-throttle module, which allowed me to limit the requests and time period directly. You can find it on the developer's GitHub.
Here is my updated code with the implementation; very simple! In my case I needed to limit my requests to 100 per 15 seconds, which you can see below as well.
from asyncio_throttle import Throttler

async def get_dataset(session, url, throttler):
    while True:
        async with throttler:
            async with session.get(url=url, headers=headers, params=params) as resp:
                dataset = await resp.json()
                return dataset['time_entries']

async def main():
    tasks = []
    throttler = Throttler(rate_limit=100, period=15)
    async with aiohttp.ClientSession() as session:
        for page in range(1, total_pages):
            url = "https://api.harvestapp.com/v2/time_entries?page=" + str(page)
            tasks.append(asyncio.ensure_future(get_dataset(session, url, throttler)))
        dataset = await asyncio.gather(*tasks)

Aiohttp session timeout doesn't cancel the request

I have this piece of code where I send a POST request and set a maximum timeout on it using the aiohttp package:
from aiohttp import ClientTimeout, ClientSession

response_code = None
timeout = ClientTimeout(total=2)
async with ClientSession(timeout=timeout) as session:
    try:
        async with session.post(
            url="some url", json=post_payload, headers=headers,
        ) as response:
            response_code = response.status
    except Exception as err:
        logger.error(err)
That part works; however, the request appears not to be canceled when the timeout is reached and the except clause runs. I still receive it on the other end, even though an exception has been raised. I would like the request to be canceled automatically whenever the timeout is reached. Thanks in advance.

AIOHTTP having request body/content/text when calling raise_for_status

I'm using FastAPI with aiohttp. I built a singleton for a persistent session and I'm using it to open the session at startup and close it at shutdown.
Requirement: the response body is precious; in case of a failure I must log it with the other details.
Because of how raise_for_status behaves, I had to write these ugly functions which handle each HTTP method; this is one of them:
async def post(self, url: str, json: dict, headers: dict) -> ClientResponse:
    response = await self.session.post(url=url, json=json, headers=headers)
    response_body = await response.text()
    try:
        response.raise_for_status()
    except Exception:
        logger.exception('Request failed',
                         extra={'url': url, 'json': json, 'headers': headers, 'body': response_body})
        raise
    return response
If I could count on raise_for_status to also return the body (response.text()),
I could just initialize the session with ClientSession(raise_for_status=True) and write clean code:
response = await self.session.post(url=url, json=json, headers=headers)
Is there a way to somehow force raise_for_status to also return the payload/body, maybe in the initialization of the ClientSession?
Thanks for the help.
It is not possible with aiohttp and raise_for_status. As Andrew Svetlov answered here:
Consider response as closed after raising an exception.
Technically it can contain a partial body but there is no any guarantee.
There is no reason to read it, the body could be very huge, 1GiB is not a limit.
If you need a response content for non-200 -- read it explicitly.
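If the goal is simply to avoid duplicating the per-method helpers, one sketch (not from the original answer; it assumes the same self.session and logger as in the question) is a single wrapper around ClientSession.request that reads the body explicitly before raising:

async def request_and_log(self, method: str, url: str, **kwargs) -> ClientResponse:
    # One wrapper for GET/POST/PUT/... via session.request
    response = await self.session.request(method, url, **kwargs)
    response_body = await response.text()  # read explicitly, per the advice above
    try:
        response.raise_for_status()
    except Exception:
        logger.exception('Request failed',
                         extra={'method': method, 'url': url, 'body': response_body})
        raise
    return response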
Alternatively, consider using the httpx library in this way (it is widely used in conjunction with FastAPI):
import httpx

# Event hooks for AsyncClient must be async functions.
async def raise_on_4xx_5xx(response):
    await response.aread()  # make sure the body is available on the raised exception
    response.raise_for_status()

async with httpx.AsyncClient(event_hooks={'response': [raise_on_4xx_5xx]}) as client:
    try:
        r = await client.get('http://httpbin.org/status/418')
    except httpx.HTTPStatusError as e:
        print(e.response.text)

How can I upload files through aiohttp using response from get request?

To start off, I am writing an async wrapper for the WordPress REST API. I have a WordPress site hosted on Bluehost, and I am working with the endpoint for media (image) uploads. I have successfully managed to upload an image, but there are two changes I would like to make. The second change is what I really want, but out of curiosity I would like to know how to implement change 1 too. I'll provide the code first and then some details.
Working code
async def upload_local_pic2(self, local_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'title': title, 'status': 'publish'}
        mpwriter.append_json(json)
        with open(local_url, 'rb') as f:
            print(f)
            payload = mpwriter.append(f)
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
Change 1
The first change is uploading using aiofiles.open() instead of just using open(), as I expect to be processing lots of files. The following code does not work.
async def upload_local_pic(self, local_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'title': title, 'status': 'publish'}
        mpwriter.append_json(json)
        async with aiofiles.open(local_url, 'rb') as f:
            print(f)
            payload = mpwriter.append(f)
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
Change 2
My other change is that I would like to have another function that can upload the files directly to the WordPress server without downloading them locally. So instead of getting a local picture, I want to pass in the URL of an image online. The following code also does not work.
async def upload_pic(self, image_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'title': title, 'status': 'publish'}
        mpwriter.append_json(json)
        async with self.session.get(image_url) as image_response:
            image_content = image_response.content
            print(image_content)
            payload = mpwriter.append(image_content)
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
Details/Debugging
I'm trying to figure out why each one won't work. I think the key is the calls to print(image_content) and print(f), which show exactly what I am passing to mpwriter.append.
In the example that works where I just use the standard Python open() function, I am apparently passing in <_io.BufferedReader name='/redactedfilepath/index.jpeg'>
In the change 1 example with aiofile, I am passing in <aiofiles.threadpool.binary.AsyncBufferedReader object at 0x7fb803122250>
WordPress will return this HTML:
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
And finally, in change 2, where I try to pass in what the get request to the URL gives me, I get
<StreamReader 292 bytes>. The response returned by WordPress is the same Mod_Security page as above.
Any idea how I can make these examples work? It seems like they are all some type of IO reader, but I guess the underlying aiohttp code treats them differently.
Also, this shouldn't really matter, but this is the URL I am passing into the change 2 example.
Ok, so I figured out both changes.
For the first change, when trying to read a file with aiofiles, I need to read the whole file instead of passing in the file handle. Also, I need to set the content disposition manually.
async def upload_local_pic(self, local_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'status': 'publish'}
        mpwriter.append_json(json)
        async with aiofiles.open(local_url, mode='rb') as f:
            contents = await f.read()
            payload = mpwriter.append(contents)
            payload.set_content_disposition('attachment', filename=title + '.jpg')
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
For the second change, uploading a file directly from a URL, it's a similar concept. Instead of passing in the reader that would stream the content, I need to read the entire content first. I also need to set the content disposition manually.
async def upload_pic(self, image_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'status': 'publish'}
        mpwriter.append_json(json)
        async with self.session.get(image_url) as image_response:
            image_content = await image_response.read()
            payload = mpwriter.append(image_content)
            payload.set_content_disposition('attachment', filename=title + '.jpg')
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
I will answer only the title of the post (and not the questions in between).
The following code gives a short example of how to upload a file from URL#1 to URL#2, without needing to download the file to the local machine first and only then upload it.
I will give two examples here:
1. Read all the file content into memory (still without saving it to disk). This is of course not great when working with huge files.
2. Read and send the file in chunks (so we don't hold all the file content at once).
Example #1: Reading all file content AT ONCE and uploading
import asyncio
import aiohttp

async def http_upload_from_url(src, dst):
    async with aiohttp.ClientSession() as session:
        src_resp = await session.get(src)
        #print(src_resp)
        dst_resp = await session.post(dst, data=src_resp.content)
        #print(dst_resp)

try:
    asyncio.run(http_upload_from_url(SRC_URL, DST_URL))
except Exception as e:
    print(e)
Example #2: Reading file content IN CHUNKS and uploading
import asyncio
import aiohttp

async def url_sender(url=None, chunk_size=65536):
    async with aiohttp.ClientSession() as session:
        resp = await session.get(url)
        #print(resp)
        async for chunk in resp.content.iter_chunked(chunk_size):
            #print(f"send chunk with size {len(chunk)}")
            yield chunk

async def chunked_http_upload_from_url(src, dst):
    async with aiohttp.ClientSession() as session:
        resp = await session.post(dst, data=url_sender(src))
        #print(resp)
        #print(await resp.text())

try:
    asyncio.run(chunked_http_upload_from_url(SRC_URL, DST_URL))
except Exception as e:
    print(e)
Some notes:
You need to define SRC_URL and DST_URL.
I've only added the prints for debugging (in case you don't get a [200 OK] response).
