FastAPI - How to get the response body in Middleware

Is there any way to get the response content in a middleware?
The following code is a copy from here.
@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    return response

The response body is an iterator that can be consumed only once; after it has been iterated through, it cannot be iterated again. Thus, you either have to save all the iterated data to a list (or a bytes variable) and use that to return a custom Response, or initialise the iterator again from the saved data. The options below demonstrate both approaches. In case you would like to get the request body inside the middleware as well, please have a look at this answer.
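To see why the body has to be buffered, here is a minimal standard-library sketch. It does not use FastAPI; fake_body_iterator is a hypothetical stand-in for response.body_iterator. It shows that an async generator yields nothing on a second pass:

```python
import asyncio


async def fake_body_iterator():
    # Hypothetical stand-in for response.body_iterator
    for chunk in (b"hello ", b"world"):
        yield chunk


async def consume_twice():
    body_iterator = fake_body_iterator()
    first_pass = [chunk async for chunk in body_iterator]
    # The generator is exhausted now, so a second pass yields no chunks
    second_pass = [chunk async for chunk in body_iterator]
    return first_pass, second_pass


first_pass, second_pass = asyncio.run(consume_twice())
print(first_pass, second_pass)
```

This is why both options either rebuild the iterator from the saved chunks or return a new Response built from them.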
Option 1
Save the data to a list and use iterate_in_threadpool to initiate the iterator again, as described here - which is what StreamingResponse uses, as shown here.
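from fastapi import FastAPI, Request
from starlette.concurrency import iterate_in_threadpool

app = FastAPI()

@app.middleware("http")
async def some_middleware(request: Request, call_next):
    response = await call_next(request)
    # Drain the one-shot iterator into a list of byte chunks
    response_body = [chunk async for chunk in response.body_iterator]
    # Re-create the iterator so that the response can still be sent to the client
    response.body_iterator = iterate_in_threadpool(iter(response_body))
    print(f"response_body={response_body[0].decode()}")
    return response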
Note 1: If your code uses StreamingResponse, response_body[0] would return only the first chunk of the response. To get the entire response body, you should join that list of bytes (chunks), as shown below (.decode() returns a string representation of the bytes object):
print(f"response_body={(b''.join(response_body)).decode()}")
Note 2: If you have a StreamingResponse streaming a body that wouldn't fit into your server's RAM (for example, a response of 30GB), you may run into memory errors when iterating over the response.body_iterator (this applies to both options listed in this answer). To avoid that, you could loop through response.body_iterator (as shown in Option 2), but instead of storing the chunks in an in-memory variable, store them somewhere on the disk. However, you would then need to retrieve the entire response data from that disk location and load it into RAM in order to send it back to the client, which could extend the delay in responding to the client even more. In that case, you could load the contents into RAM in chunks and use StreamingResponse, similar to what has been demonstrated here, here, as well as here, here and here (in Option 1, you can just pass your iterator/generator function to iterate_in_threadpool). However, I would not suggest following that approach; instead, have such endpoints returning large streaming responses excluded from the middleware, as described in this answer.
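As a rough illustration of the disk-based idea in Note 2, the following standard-library sketch writes the chunks to a temporary file and then reads them back in fixed-size pieces. The names fake_body_iterator, spool_to_disk and read_in_chunks are hypothetical, and the tiny CHUNK_SIZE is chosen only to make the demo visible:

```python
import asyncio
import tempfile

CHUNK_SIZE = 4  # tiny on purpose, so the demo yields several chunks


async def fake_body_iterator():
    # Hypothetical stand-in for response.body_iterator
    for chunk in (b"first-", b"second-", b"third"):
        yield chunk


async def spool_to_disk(body_iterator):
    # Write each chunk to a temporary file instead of keeping it in memory.
    # Note: tmp.write() is a blocking call, which is acceptable for a sketch.
    tmp = tempfile.TemporaryFile()
    async for chunk in body_iterator:
        tmp.write(chunk)
    tmp.seek(0)
    return tmp


def read_in_chunks(tmp, chunk_size=CHUNK_SIZE):
    # Yield the spooled data back in fixed-size chunks, then clean up
    try:
        while chunk := tmp.read(chunk_size):
            yield chunk
    finally:
        tmp.close()


spooled = asyncio.run(spool_to_disk(fake_body_iterator()))
chunks = list(read_in_chunks(spooled))
print(chunks)
```

In a middleware, a generator such as read_in_chunks(spooled) could then be handed to iterate_in_threadpool, as mentioned for Option 1.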
Option 2
The below demonstrates another approach, where the response body is stored in a bytes object (instead of a list, as shown above), and is used to return a custom Response directly (along with the status_code, headers and media_type of the original response).
from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.middleware("http")
async def some_middleware(request: Request, call_next):
    response = await call_next(request)
    response_body = b""
    async for chunk in response.body_iterator:
        response_body += chunk
    print(f"response_body={response_body.decode()}")
    return Response(content=response_body, status_code=response.status_code,
                    headers=dict(response.headers), media_type=response.media_type)

Related

How to Upload a large File (≥3GB) to FastAPI backend?

I am trying to upload a large file (≥3GB) to my FastAPI server, without loading the entire file into memory, as my server has only 2GB of free memory.
Server side:
async def uploadfiles(upload_file: UploadFile = File(...)):
Client side:
m = MultipartEncoder(fields = {"upload_file":open(file_name,'rb')})
prefix = "http://xxx:5000"
url = "{}/v1/uploadfiles".format(prefix)
try:
    req = requests.post(
        url,
        data=m,
        verify=False,
    )
which returns:
HTTP 422 {"detail":[{"loc":["body","upload_file"],"msg":"field required","type":"value_error.missing"}]}
I am not sure what MultipartEncoder actually sends to the server, so that the request does not match. Any ideas?

With requests-toolbelt library, you have to pass the filename as well, when declaring the field for upload_file, as well as set the Content-Type header—which is the main reason for the error you get, as you are sending the request without setting the Content-Type header to multipart/form-data, followed by the necessary boundary string—as shown in the documentation. Example:
filename = 'my_file.txt'
m = MultipartEncoder(fields={'upload_file': (filename, open(filename, 'rb'))})
r = requests.post(url, data=m, headers={'Content-Type': m.content_type})
print(r.request.headers) # confirm that the 'Content-Type' header has been set
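To make the role of the boundary concrete, here is a small standard-library sketch that extracts the media type and the boundary parameter from a Content-Type header value. The helper name parse_content_type and the sample header values are illustrative assumptions, not what any particular client sends:

```python
from email.message import Message


def parse_content_type(header_value: str):
    # Reuse the stdlib header parser to split the media type from its parameters
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_param("boundary")


with_boundary = parse_content_type("multipart/form-data; boundary=abc123")
without_boundary = parse_content_type("application/x-www-form-urlencoded")
print(with_boundary, without_boundary)
```

Without a multipart/form-data Content-Type that carries a boundary, the server has nothing to split the body parts with, so the upload_file field is reported as missing.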
However, I wouldn't recommend using a library (i.e., requests-toolbelt) that hasn't provided a new release for over three years now. I would suggest using Python requests instead, as demonstrated in this answer and that answer (also see Streaming Uploads and Chunk-Encoded Requests), or, preferably, use the HTTPX library, which supports async requests (if you had to send multiple requests simultaneously), as well as streaming File uploads by default, meaning that only one chunk at a time will be loaded into memory (see the documentation). Examples are given below.
Option 1 (Fast) - Upload File and Form data using .stream()
As previously explained in detail in this answer, when you declare an UploadFile object, FastAPI/Starlette, under the hood, uses a SpooledTemporaryFile with the max_size attribute set to 1MB, meaning that the file data is spooled in memory until the file size exceeds the max_size, at which point the contents are written to disk; more specifically, to a temporary file in your OS's temporary directory (see this answer on how to find/change the default temporary directory) that you later need to read the data from, using the .read() method. Hence, this whole process makes uploading files quite slow, especially if the file is large (as you'll see in Option 2 later on).
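The spooling behaviour can be observed with the standard library alone. The sketch below uses a 1KB threshold instead of 1MB to keep the demo small, and it reads the private _rolled attribute, which is a CPython implementation detail used here purely for illustration:

```python
import tempfile

MAX_SIZE = 1024  # small threshold for the demo (the text above mentions 1MB)

with tempfile.SpooledTemporaryFile(max_size=MAX_SIZE) as spooled:
    spooled.write(b"x" * MAX_SIZE)
    # _rolled is a private CPython attribute, read here only for illustration
    rolled_at_limit = spooled._rolled
    spooled.write(b"x")  # one byte over the threshold
    rolled_over_limit = spooled._rolled
    spooled.seek(0)
    total_bytes = len(spooled.read())

print(rolled_at_limit, rolled_over_limit, total_bytes)
```

Once the threshold is exceeded, the data lives in a real temporary file on disk, and every later read goes through the filesystem.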
To avoid that and speed up the process, as the linked answer above suggested, one can access the request body as a stream. As per Starlette documentation, if you use the .stream() method, the (request) byte chunks are provided without storing the entire body to memory (and later to a temporary file, if the body size exceeds 1MB). This method allows you to read and process the byte chunks as they arrive. The below takes the suggested solution a step further, by using the streaming-form-data library, which provides a Python parser for parsing streaming multipart/form-data input chunks. This means that not only can you upload Form data along with File(s), but you also don't have to wait for the entire request body to be received in order to start parsing the data. The way it's done is that you initialise the main parser class (passing the HTTP request headers that help to determine the input Content-Type, and hence, the boundary string used to separate each body part in the multipart payload, etc.), and associate one of the Target classes to define what should be done with a field when it has been extracted out of the request body. For instance, FileTarget would stream the data to a file on disk, whereas ValueTarget would hold the data in memory (this class can be used for either Form or File data as well, if you don't need the file(s) saved to the disk). It is also possible to define your own custom Target classes. I have to mention that the streaming-form-data library does not currently support async calls to I/O operations, meaning that the writing of chunks happens synchronously (within a def function). However, as the endpoint below uses .stream() (which is an async function), it will give up control for other tasks/requests to run on the event loop while waiting for data to become available from the stream.
You could also run the function for parsing the received data in a separate thread and await it, using Starlette's run_in_threadpool()—e.g., await run_in_threadpool(parser.data_received, chunk)—which is used by FastAPI internally when you call the async methods of UploadFile, as shown here. For more details on def vs async def, please have a look at this answer.
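The pattern of awaiting a synchronous call in a worker thread can be sketched with the standard library as well. Here asyncio.to_thread (Python 3.9+) is swapped in for Starlette's run_in_threadpool, and ChunkCollector is a hypothetical stand-in for an object exposing a synchronous data_received() method:

```python
import asyncio
import hashlib


class ChunkCollector:
    # Hypothetical stand-in for an object with a synchronous data_received()
    def __init__(self):
        self.digest = hashlib.sha256()
        self.received = 0

    def data_received(self, chunk: bytes) -> None:
        self.digest.update(chunk)
        self.received += len(chunk)


async def fake_stream():
    # Hypothetical stand-in for request.stream()
    for chunk in (b"abc", b"defg", b"hi"):
        yield chunk


async def consume(stream, collector):
    async for chunk in stream:
        # Run the blocking call in a worker thread and await its completion
        await asyncio.to_thread(collector.data_received, chunk)
    return collector


collector = asyncio.run(consume(fake_stream(), ChunkCollector()))
print(collector.received)
```

Handing each chunk to a thread adds some overhead, so this is mainly worthwhile when the per-chunk work is slow enough to hold up the event loop.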
You can also perform certain validation tasks, e.g., ensuring that the input size does not exceed a certain value. This can be done using the MaxSizeValidator. However, as this would only be applied to the fields you defined (and hence, it wouldn't prevent a malicious user from sending an extremely large request body, which could consume server resources to the point that the application ends up crashing), the below incorporates a custom MaxBodySizeValidator class that is used to make sure that the request body size does not exceed a pre-defined value. Both validators described above solve the problem of limiting the upload file size (as well as the entire request body size) in a likely better way than the one described here, which uses UploadFile, and hence, the file needs to be entirely received and saved to the temporary directory before performing the check (not to mention that the approach does not take into account the request body size at all). Using an ASGI middleware such as this would be an alternative solution for limiting the request body. Also, in case you are using Gunicorn with Uvicorn, you can define limits with regard to, for example, the number of HTTP header fields in a request, the size of an HTTP request header field, and so on (see the documentation). Similar limits can be applied when using reverse proxy servers, such as Nginx (which also allows you to set the maximum request body size using the client_max_body_size directive).
A few notes for the example below. Since it uses the Request object directly, and not UploadFile and Form objects, the endpoint won't be properly documented in the auto-generated docs at /docs (if that's important for your app at all). This also means that you have to perform some checks yourself, such as whether the required fields for the endpoint were received or not, and if they were in the expected format. For instance, for the data field, you could check whether data.value is empty or not (empty would mean that the user has either not included that field in the multipart/form-data, or sent an empty value), as well as if isinstance(data.value, str). As for the file(s), you can check whether file_.multipart_filename is not empty; however, since a filename could likely not be included in the Content-Disposition by some user, you may also want to check if the file exists in the filesystem, using os.path.isfile(filepath) (Note: you need to make sure there is no pre-existing file with the same name in that specified location; otherwise, the aforementioned function would always return True, even when the user did not send the file).
Regarding the applied size limits, the MAX_REQUEST_BODY_SIZE below must be larger than the MAX_FILE_SIZE (plus the size of all the Form values) you expect to receive, as the raw request body (that you get from using the .stream() method) includes a few more bytes for the --boundary and Content-Disposition header for each of the fields in the body. Hence, you should add a few more bytes, depending on the Form values and the number of files you expect to receive (hence the MAX_FILE_SIZE + 1024 below).
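To get a feel for how many extra bytes the multipart framing adds, the sketch below assembles a multipart/form-data body by hand and measures the overhead. The build_multipart helper is a hypothetical illustration, not what HTTPX or any other client does internally, and real clients may add slightly different headers:

```python
import uuid


def build_multipart(fields, files, boundary):
    # Assemble a multipart/form-data body by hand (illustration only)
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}".encode(),
                  f'Content-Disposition: form-data; name="{name}"'.encode(),
                  b"", value.encode()]
    for name, (filename, content) in files.items():
        lines += [f"--{boundary}".encode(),
                  f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode(),
                  b"Content-Type: application/octet-stream",
                  b"", content]
    lines += [f"--{boundary}--".encode(), b""]
    return b"\r\n".join(lines)


file_content = b"\x00" * 10_000
form_value = "Hello World!"
body = build_multipart({"data": form_value},
                       {"file": ("bigFile.zip", file_content)},
                       boundary=uuid.uuid4().hex)
# Bytes that belong to the framing rather than to the file or the Form value
overhead = len(body) - len(file_content) - len(form_value)
print(overhead)
```

The overhead grows with the number of parts and the length of field names and filenames, which is why the margin should be sized for the request you actually expect.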
app.py
from fastapi import FastAPI, Request, HTTPException, status
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget, ValueTarget
from streaming_form_data.validators import MaxSizeValidator
import streaming_form_data
from starlette.requests import ClientDisconnect
import os

MAX_FILE_SIZE = 1024 * 1024 * 1024 * 4  # = 4GB
MAX_REQUEST_BODY_SIZE = MAX_FILE_SIZE + 1024

app = FastAPI()

class MaxBodySizeException(Exception):
    def __init__(self, body_len: int):
        self.body_len = body_len

class MaxBodySizeValidator:
    def __init__(self, max_size: int):
        self.body_len = 0
        self.max_size = max_size

    def __call__(self, chunk: bytes):
        # Count every byte of the raw request body, not just the file field
        self.body_len += len(chunk)
        if self.body_len > self.max_size:
            raise MaxBodySizeException(body_len=self.body_len)

@app.post('/upload')
async def upload(request: Request):
    body_validator = MaxBodySizeValidator(MAX_REQUEST_BODY_SIZE)
    filename = request.headers.get('Filename')

    if not filename:
        raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
            detail='Filename header is missing')
    try:
        filepath = os.path.join('./', os.path.basename(filename))
        file_ = FileTarget(filepath, validator=MaxSizeValidator(MAX_FILE_SIZE))
        data = ValueTarget()
        parser = StreamingFormDataParser(headers=request.headers)
        parser.register('file', file_)
        parser.register('data', data)

        async for chunk in request.stream():
            body_validator(chunk)
            parser.data_received(chunk)
    except ClientDisconnect:
        print("Client Disconnected")
    except MaxBodySizeException as e:
        raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
            detail=f'Maximum request body size limit ({MAX_REQUEST_BODY_SIZE} bytes) exceeded ({e.body_len} bytes read)')
    except streaming_form_data.validators.ValidationError:
        raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
            detail=f'Maximum file size limit ({MAX_FILE_SIZE} bytes) exceeded')
    except Exception:
        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail='There was an error uploading the file')

    if not file_.multipart_filename:
        raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, detail='File is missing')

    print(data.value.decode())
    print(file_.multipart_filename)

    return {"message": f"Successfully uploaded {filename}"}
As mentioned earlier, to upload the data (on client side), you can use the HTTPX library, which supports streaming file uploads by default, and thus allows you to send large streams/files without loading them entirely into memory. You can pass additional Form data as well, using the data argument. Below, a custom header, i.e., Filename, is used to pass the filename to the server, so that the server instantiates the FileTarget class with that name (you could use the X- prefix for custom headers, if you wish; however, it is not officially recommended anymore).
To upload multiple files, use a header for each file (or use random names on server side, and once the file has been fully uploaded, you can optionally rename it using the file_.multipart_filename attribute), pass a list of files, as described in the documentation (Note: use a different field name for each file, so that they won't overlap when parsing them on server side, e.g., files = [('file', open('bigFile.zip', 'rb')), ('file_2', open('bigFile2.zip', 'rb'))]), and finally, define the Target classes on server side accordingly.
test.py
import httpx
import time

url = 'http://127.0.0.1:8000/upload'
files = {'file': open('bigFile.zip', 'rb')}
headers = {'Filename': 'bigFile.zip'}
data = {'data': 'Hello World!'}

with httpx.Client() as client:
    start = time.time()
    r = client.post(url, data=data, files=files, headers=headers)
    end = time.time()
    print(f'Time elapsed: {end - start}s')
    print(r.status_code, r.json(), sep=' ')
Upload both File and JSON body
In case you would like to upload both file(s) and JSON instead of Form data, you can use the approach described in Method 3 of this answer, thus also saving you from performing manual checks on the received Form fields, as explained earlier (see the linked answer for more details). To do that, make the following changes in the code above.
app.py
#...
from fastapi import Form
from pydantic import BaseModel, ValidationError
from typing import Optional
from fastapi.encoders import jsonable_encoder

class Base(BaseModel):
    name: str
    point: Optional[float] = None
    is_accepted: Optional[bool] = False

def checker(data: str = Form(...)):
    try:
        model = Base.parse_raw(data)
    except ValidationError as e:
        raise HTTPException(detail=jsonable_encoder(e.errors()), status_code=status.HTTP_422_UNPROCESSABLE_ENTITY)
    return model

#...

@app.post('/upload')
async def upload(request: Request):
    #...
    # place this after the try-except block
    model = checker(data.value.decode())
    print(model.dict())
test.py
#...
import json
data = {'data': json.dumps({"name": "foo", "point": 0.13, "is_accepted": False})}
#...
Option 2 (Slow) - Upload File and Form data using UploadFile and Form
If you would like to use a normal def endpoint instead, see this answer.
app.py
from fastapi import FastAPI, File, UploadFile, Form, HTTPException, status
import aiofiles
import os

CHUNK_SIZE = 1024 * 1024  # adjust the chunk size as desired
app = FastAPI()

@app.post("/upload")
async def upload(file: UploadFile = File(...), data: str = Form(...)):
    try:
        filepath = os.path.join('./', os.path.basename(file.filename))
        async with aiofiles.open(filepath, 'wb') as f:
            # Copy the spooled upload to its final location, one chunk at a time
            while chunk := await file.read(CHUNK_SIZE):
                await f.write(chunk)
    except Exception:
        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail='There was an error uploading the file')
    finally:
        await file.close()

    return {"message": f"Successfully uploaded {file.filename}"}
As mentioned earlier, using this option would take longer for the file upload to complete, and as HTTPX uses a default timeout of 5 seconds, you will most likely get a ReadTimeout exception (as the server will need some time to read the SpooledTemporaryFile in chunks and write the contents to a permanent location on the disk). Thus, you can configure the timeout (see the Timeout class in the source code too), and more specifically, the read timeout, which "specifies the maximum duration to wait for a chunk of data to be received (for example, a chunk of the response body)". If set to None instead of some positive numerical value, there will be no timeout on read.
test.py
import httpx
import time

url = 'http://127.0.0.1:8000/upload'
files = {'file': open('bigFile.zip', 'rb')}
headers = {'Filename': 'bigFile.zip'}
data = {'data': 'Hello World!'}
timeout = httpx.Timeout(None, read=180.0)

with httpx.Client(timeout=timeout) as client:
    start = time.time()
    r = client.post(url, data=data, files=files, headers=headers)
    end = time.time()
    print(f'Time elapsed: {end - start}s')
    print(r.status_code, r.json(), sep=' ')

Return File/Streaming response from online video URL in FastAPI

I am using FastAPI to return a video response from googlevideo.com. This is the code I am using:
@app.get(params.api_video_route)
async def get_api_video(url=None):
    def iter():
        req = urllib.request.Request(url)
        with urllib.request.urlopen(req) as resp:
            yield from io.BytesIO(resp.read())
    return StreamingResponse(iter(), media_type="video/mp4")
but this is not working.
I want this Node.js code to be converted into Python FastAPI:
app.get("/download-video", function(req, res) {
    http.get(decodeURIComponent(req.query.url), function(response) {
        res.setHeader("Content-Length", response.headers["content-length"]);
        if (response.statusCode >= 400)
            res.status(500).send("Error");
        response.on("data", function(chunk) { res.write(chunk); });
        response.on("end", function() { res.end(); });
    });
});

I encountered similar issues but solved them all. The main idea is to create a session with requests.Session() and yield the chunks one by one, instead of fetching all the content and yielding it at once. This works very nicely without causing any memory issues at all.
@app.get(params.api_video_route)
async def get_api_video(url=None):
    def iter():
        session = requests.Session()
        r = session.get(url, stream=True)
        r.raise_for_status()
        for chunk in r.iter_content(1024*1024):
            yield chunk
    return StreamingResponse(iter(), media_type="video/mp4")

The quick solution would be to replace yield from io.BytesIO(resp.read()) with the one below (see FastAPI documentation - StreamingResponse for more details).
yield from resp
However, instead of using urllib.request and resp.read() (which would read the entire file contents into memory, hence the reason for taking too long to respond), I would suggest you use the HTTPX library, which, among other things, provides async support as well. Also, it supports Streaming Responses (see async Streaming Responses too), and thus, you can avoid loading the entire response body into memory at once (especially, when dealing with large files). Below are provided examples in both synchronous and asynchronous ways on how to stream a video from a given URL.
Note: Both versions below would allow multiple clients to connect to the server and get the video stream without being blocked, as a normal def endpoint in FastAPI is run in an external threadpool that is then awaited, instead of being called directly (as it would block the server), thus ensuring that FastAPI will still work asynchronously. Even if you defined the endpoint of the first example below with async def instead, it would still not block the server, as StreamingResponse will run the code (for sending the body chunks) in an external threadpool that is then awaited (have a look at this comment and the source code here), if the function for streaming the response body (i.e., iterfile() in the examples below) is a normal generator/iterator (as in the first example) and not an async one (as in the second example). However, if you had some other I/O or CPU blocking operations inside that endpoint, it would result in blocking the server, and hence, you should drop the async definition on that endpoint. The second example demonstrates how to implement the video streaming in an async def endpoint, which is useful when you have to call other async functions inside the endpoint that you have to await, and it also saves FastAPI from running the endpoint in an external threadpool. For more details on def vs async def, please have a look at this answer.
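The idea of iterating a normal generator in worker threads while the event loop stays responsive can be sketched with the standard library. This is only a rough analogue and not Starlette's implementation; asyncio.to_thread requires Python 3.9+, and blocking_chunks, iterate_in_thread and heartbeat are hypothetical names:

```python
import asyncio
import time


def blocking_chunks():
    # Hypothetical sync generator; time.sleep() stands in for blocking I/O
    for i in range(3):
        time.sleep(0.05)
        yield f"chunk-{i}".encode()


_DONE = object()  # sentinel marking exhaustion, so StopIteration never leaves the thread


async def iterate_in_thread(sync_iterator):
    # Pull each item in a worker thread so that the event loop stays free
    while True:
        item = await asyncio.to_thread(next, sync_iterator, _DONE)
        if item is _DONE:
            break
        yield item


async def heartbeat(ticks):
    # Runs concurrently; it could not tick while the event loop was blocked
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def main():
    ticks = []
    beat = asyncio.create_task(heartbeat(ticks))
    chunks = [chunk async for chunk in iterate_in_thread(blocking_chunks())]
    await beat
    return chunks, ticks


chunks, ticks = asyncio.run(main())
print(chunks, len(ticks))
```

If blocking_chunks() were iterated directly inside an async def function, the heartbeat task would be stalled for the duration of every time.sleep() call.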
The below examples use iter_bytes() and aiter_bytes() methods, respectively, to get the response body in chunks. These functions, as described in the documentation links above and in the source code here, can handle gzip, deflate, and brotli encoded responses. One can alternatively use the iter_raw() method to get the raw response bytes, without applying content decoding (if it is not needed). This method, in contrast to iter_bytes(), allows you to optionally define the chunk_size for streaming the response content, e.g., iter_raw(1024 * 1024). However, this doesn't mean that you read the body in chunks of that size from the server (that is serving the file) directly. If you had a closer look at the source code of iter_raw(), you would see that it just uses a ByteChunker that stores the byte contents into memory (using BytesIO() stream) and returns the content in fixed-size chunks, depending on the chunk size you passed to the function (whereas raw_stream_bytes, as shown in the linked source code above, contains the actual byte chunk read from the stream).
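The re-chunking idea can be illustrated with a few lines of plain Python. This is a simplified sketch of the concept, not HTTPX's ByteChunker itself, and rechunk is a hypothetical name:

```python
from typing import Iterable, Iterator


def rechunk(chunks: Iterable[bytes], chunk_size: int) -> Iterator[bytes]:
    # Buffer incoming chunks and emit pieces of exactly chunk_size bytes;
    # whatever is left over at the end is emitted as a final, shorter piece.
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= chunk_size:
            yield bytes(buffer[:chunk_size])
            del buffer[:chunk_size]
    if buffer:
        yield bytes(buffer)


network_chunks = [b"ab", b"cdefg", b"h", b"ij"]  # sizes decided by the network
fixed = list(rechunk(network_chunks, chunk_size=4))
print(fixed)
```

The sizes of the incoming chunks are still dictated by the underlying stream; only the sizes handed to the caller are normalised.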
Using HTTPX with def endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()

@app.get('/video')
def get_video(url: str):
    def iterfile():
        with httpx.stream("GET", url) as r:
            for chunk in r.iter_bytes():
                yield chunk
    return StreamingResponse(iterfile(), media_type="video/mp4")
Using HTTPX with async def endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()

@app.get('/video')
async def get_video(url: str):
    async def iterfile():
        async with httpx.AsyncClient() as client:
            async with client.stream("GET", url) as r:
                async for chunk in r.aiter_bytes():
                    yield chunk
    return StreamingResponse(iterfile(), media_type="video/mp4")
You can use public videos provided here to test the above. Example:
http://127.0.0.1:8000/video?url=http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
If you would like to return a custom Response or FileResponse instead—which I wouldn't really recommend in case you are dealing with large video files, as you should either read the entire contents into memory, or save the contents to a temporary file on disk that you later have to read again into memory, in order to send it back to the client—please have a look at this answer and this answer.

httpx: How to access specific responses from gathered request tasks?

I want to use HTTPX (within FastAPI, if that matters) to make asynchronous http requests to an outside API and store the responses as individual variables for processing in slightly different ways depending on which URL was fetched. I'm modifying the code from this StackOverflow answer.
import asyncio
import httpx

async def perform_request(client, url):
    response = await client.get(url)
    return response.text

async def gather_tasks(*urls):
    async with httpx.AsyncClient() as client:
        tasks = [perform_request(client, url) for url in urls]
        result = await asyncio.gather(*tasks)
        return result

async def f():
    url1 = "https://api.com/object=562"
    url2 = "https://api.com/object=383"
    url3 = "https://api.com/object=167"
    url4 = "https://api.com/object=884"
    result = await gather_tasks(url1, url2, url3, url4)
    # print(result[0])
    # print(result[1])
    # DO THINGS WITH url2, SOMETHING ELSE WITH url4, ETC.

if __name__ == '__main__':
    asyncio.run(f())
What's the best way to access the individual responses? (If I use result[n] I wouldn't know which response I'm working with.)
And I'm pretty new to httpx and async operations in general so please share if you have any suggestions for how to achieve it in a better way.

Regardless of AsyncIO, I would probably put the logic inside gather_tasks. There you know the response, and you can define all the if else logic you want to proceed with the right path.
In my opinion you have two options:
1 - Process the request right away
In this case f would only initialize the urls and trigger the processing, everything else would happen inside gather_tasks.
2 - "Enrich" the response
In gather_tasks you can understand which kind of operation to do next, and "attach" to the response some sort of code to define it. For example, you could return a dict with two keys: response and operation. This would be the most explicit way of doing this, but you could also use a list or a tuple, you just need to know where the response and the "next step code" is within them.
This is useful if the further processing must happen later instead of right away.
Makes sense?
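One way to sketch the "enrich the response" idea: asyncio.gather returns results in the same order as the awaitables passed to it, so the URLs can be zipped with the results into a dict. In the example below, perform_request is a hypothetical stand-in that makes no network call:

```python
import asyncio


async def perform_request(url):
    # Hypothetical stand-in for `await client.get(url)`; no network involved
    await asyncio.sleep(0.01)
    return f"response for {url}"


async def gather_by_url(*urls):
    # gather() returns results in the order the awaitables were passed in
    results = await asyncio.gather(*(perform_request(url) for url in urls))
    return dict(zip(urls, results))


urls = ("https://api.com/object=562", "https://api.com/object=383")
responses = asyncio.run(gather_by_url(*urls))
print(responses["https://api.com/object=383"])
```

With a mapping like this, the processing step can look up each response by its URL instead of relying on a list index.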

Long running requests with asyncio and aiohttp

Apologies for asking with what may be considered redundant, but I'm finding it extremely difficult to figure out what are the current recommended best practices for using asyncio and aiohttp.
I'm working with an API that ultimately returns a link to a generated CSV file. There are two steps in using the API.
1 - Submit a request that triggers a long running process and returns a status URL.
2 - Poll the status URL until the status_code is 201 and then get the URL of the CSV file from the headers.
Here's a stripped down example of how I can successfully do this synchronously with requests.
import time
import requests

def submit_request(id):
    """Submit request to create CSV for specified id"""
    body = {'id': id}
    response = requests.get(
        url='https://www.example.com/endpoint',
        json=body
    )
    response.raise_for_status()
    return response

def get_status(request_response):
    """Check whether the CSV has been created."""
    status_response = requests.get(
        url=request_response.headers['Location']
    )
    status_response.raise_for_status()
    return status_response

def get_data_url(id, poll_interval=10):
    """Submit request to create CSV for specified ID, wait for it to finish,
    and return the URL of the CSV.
    Wait between status requests based on poll_interval.
    """
    response = submit_request(id)
    while True:
        status_response = get_status(response)
        if status_response.status_code == 201:
            break
        time.sleep(poll_interval)
    data_url = status_response.headers['Location']
    return data_url
What I'd like to do is be able to submit a group of requests at once, and then wait on all of them to be finished. But I'm not clear on how to structure this with asyncio and aiohttp.
One option would be to first submit all of the requests and then use await.gather (or something) to get all of the status URLs. Then start another event loop where I continuously poll the status_urls until they have all completed and I end up with a list of data URLs.
Alternatively, I suppose I could create a single function that submits the request, gets the status URL, and then polls that until it completes. In that case I would just have a single event loop where I submit each of the IDs that I want processed.
If some pseudo code for those options would be useful I can try to provide it. I've looked at a lot of different examples where you submit requests for a bunch of URLs asynchronously -- this for example -- but I'm finding that I get a bit lost when trying to translate them to this slightly more complicated scenario where I submit the request and then get back a new URL to poll.

FYI based on the comments above my current solution is something like this.
import asyncio
import aiohttp

async def get_data_url(session, id):
    url = 'https://www.example.com/endpoint'
    body = {'id': id}
    async with session.post(url=url, json=body) as response:
        response.raise_for_status()
        status_url = response.headers['Location']
    while True:
        async with session.get(url=status_url) as status_response:
            status_response.raise_for_status()
            if status_response.status == 201:
                return status_response.headers['Location']
        await asyncio.sleep(10)

async def main(access_token, id):
    headers = {'token': access_token}
    async with aiohttp.ClientSession(headers=headers) as session:
        data_url = await get_data_url(session, id)
        return data_url
This works though I'm still not sure on best practices for submitting a set of IDs. I think asyncio.gather would work but it looks like it's deprecated. Ideally I would have a queue of say 100 IDs and only have 5 requests running at any given time. I've found some examples like this but they depend on asyncio.Queue which is also deprecated.
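A note on the deprecation concern: asyncio.gather and asyncio.Queue themselves are not deprecated; what was deprecated in Python 3.8 (and removed in 3.10) is their loop parameter. One possible way to keep only 5 requests in flight is an asyncio.Semaphore combined with gather. The sketch below makes no network calls; fake_get_data_url and the returned URL pattern are hypothetical stand-ins for get_data_url(session, id):

```python
import asyncio

MAX_CONCURRENT = 5


async def fake_get_data_url(job_id, state):
    # Hypothetical stand-in for get_data_url(session, id); no network involved
    state["running"] += 1
    state["peak"] = max(state["peak"], state["running"])
    await asyncio.sleep(0.01)  # pretend to submit the request and poll
    state["running"] -= 1
    return f"https://www.example.com/data/{job_id}.csv"


async def bounded(semaphore, job_id, state):
    # At most MAX_CONCURRENT coroutines get past this point at once
    async with semaphore:
        return await fake_get_data_url(job_id, state)


async def main(job_ids):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"running": 0, "peak": 0}
    urls = await asyncio.gather(*(bounded(semaphore, i, state) for i in job_ids))
    return urls, state["peak"]


data_urls, peak = asyncio.run(main(range(100)))
print(len(data_urls), peak)
```

All 100 coroutines are created up front, but the semaphore lets only 5 of them run the request-and-poll step at a time; a worker pool reading from an asyncio.Queue would be an alternative design.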

aiohttp: How to efficiently check HTTP headers before downloading response body?

I am writing a web crawler using asyncio/aiohttp. I want the crawler to download only HTML content and skip everything else. I wrote a simple function to filter URLs based on extensions, but this is not reliable because many download links do not include a filename/extension in them.
I could use aiohttp.ClientSession.head() to send a HEAD request, check the Content-Type field to make sure it's HTML, and then send a separate GET request. But this will increase the latency by requiring two separate requests per page (one HEAD, one GET), and I'd like to avoid that if possible.
Is it possible to just send a regular GET request, and set aiohttp into "streaming" mode to download just the header, and then proceed with the body download only if the MIME type is correct? Or is there some (fast) alternative method for filtering out non-HTML content that I should consider?
UPDATE
As requested in the comments, I've included some example code of what I mean by making two separate HTTP requests (one HEAD request and one GET request):
import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com']
results = []

async def get_urls_async(urls):
    loop = asyncio.get_running_loop()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            print(f"This is the first (HEAD) request we send for {u}")
            tasks.append(loop.create_task(session.get(u)))
        results = []
        for t in asyncio.as_completed(tasks):
            response = await t
            url = response.url
            if "text/html" in response.headers["Content-Type"]:
                print("Sending the 2nd (GET) request to retrieve body")
                r = await session.get(url)
                results.append((url, await r.read()))
            else:
                print(f"Not HTML, rejecting: {url}")
    return results

results = asyncio.run(get_urls_async(urls))

This is a protocol problem: if you do a GET, the server wants to send the body. If you don't retrieve the body, you have to discard the connection (this is in fact what aiohttp does if you don't do a read() before __aexit__ on the response).
So the above code should do more or less what you want. Note that the server may already send more than just the headers in the first chunk.
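Since the headers are available before the body has been read, the decision can be made from the headers alone. The helper below is a hypothetical sketch of a slightly more defensive Content-Type check than the substring test in the question: it tolerates a missing header and ignores parameters such as the charset. Plain dicts are used for the demo, so the header name is matched case-sensitively here:

```python
def is_html(headers) -> bool:
    # Tolerate a missing header and ignore parameters such as "; charset=utf-8"
    content_type = headers.get("Content-Type", "")
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in {"text/html", "application/xhtml+xml"}


checks = [
    is_html({"Content-Type": "text/html; charset=UTF-8"}),
    is_html({"Content-Type": "application/pdf"}),
    is_html({}),
]
print(checks)
```

In the crawler, the body would then be read with await response.read() only when is_html(response.headers) is true, and the response would be left unread otherwise.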
