How to make a client request to an external server avoiding the cache using aiohttp - python

We are using aiohttp to make multiple requests to various website vendors to grab their latest data.
Some of the content providers serve the data from a cache. Is it possible to request the data directly from the server? We have tried passing in the headers parameter, with no luck.
from aiohttp import ClientSession

async def fetch(url):
    headers = {'Cache-Control': 'no-cache'}
    async with ClientSession() as session:
        async with session.get(url, headers=headers, proxy="OUR-PROXY") as response:
            return await response.read()
The goal is to get the Last-Modified header, which is not present on the cached response.

Try adding an extra query parameter with a dynamic value (e.g. a timestamp) to the URL.
This will prevent caching on the server side even if the server ignores Cache-Control.
Example:
from: https://example.com/test
to: https://example.com/test?timestamp=20180724181234
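A minimal sketch of that idea applied to the fetch coroutine from the question (the parameter name timestamp is arbitrary; aiohttp's params argument appends it to the query string):

import time
from aiohttp import ClientSession

async def fetch(url):
    # Cache-busting sketch: a unique query value makes every request
    # look like a new resource to intermediate caches.
    params = {'timestamp': str(int(time.time()))}
    headers = {'Cache-Control': 'no-cache'}
    async with ClientSession() as session:
        async with session.get(url, params=params, headers=headers) as response:
            print(response.headers.get('Last-Modified'))
            return await response.read()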


Python Requests URL response too slow, how to make it quicker?

I have this code in Python:

import requests

session = requests.Session()
for i in range(0, len(df_1)):
    page = session.head(df_1['listing_url'].loc[i], allow_redirects=False, stream=True)
    if page.status_code == 200:
        df_1['condition'][i] = 'active'
    else:
        df_1['condition'][i] = 'false'
df_1 is my data frame, and the column "listing_url" has more than 500 rows.
I want to check whether each URL is active and record the result in my data frame, but this code takes a long time. How can I reduce the runtime?
The problem with your current approach is that the requests run sequentially (synchronously), which means a new request can't be sent before the prior one has finished.
What you are looking for is handling those requests asynchronously. Sadly, the requests library does not support asynchronous requests. A newer library that has a similar API to requests but supports async is httpx; aiohttp is another popular choice. With httpx you can do something like this:
import asyncio
import httpx

listing_urls = list(df_1['listing_url'])

async def do_tasks():
    async with httpx.AsyncClient() as client:
        tasks = [client.head(url) for url in listing_urls]
        responses = await asyncio.gather(*tasks)
        return {r.url: r.status_code for r in responses}

url_2_status = asyncio.run(do_tasks())
This will give you a mapping of {url: status_code}; you should be able to take it from there.
This solution assumes you are using Python 3.7 or newer (for asyncio.run). Also remember to install httpx.
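A hypothetical follow-up for writing the statuses back into df_1. It leans on two details worth hedging: asyncio.gather preserves input order and dicts preserve insertion order, so the values line up with listing_urls as long as that column has no duplicates:

# Sketch: map gathered status codes back onto the data frame,
# assuming listing_urls contains no duplicate URLs.
df_1['condition'] = [
    'active' if status == 200 else 'false'
    for status in url_2_status.values()
]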

When using a proxy my request is downgraded to HTTP/1

I am using the httpx library, which allows me to make HTTP/2 requests to target sites.
However, when I use a proxy, it seems to automatically downgrade my request to HTTP/1.
For example:
import httpx

async def main():
    client = httpx.AsyncClient(http2=True)
    response = await client.get('someurl', headers=headers)
    print(response.http_version)
This prints HTTP/2.
But the same thing using a proxy, like client = httpx.AsyncClient(http2=True, proxies=someproxydictionary),
prints HTTP/1.
Why does this behavior happen only when routing through proxies?

aiohttp / Getting response object out of context manager

I'm currently taking my first baby steps with aiohttp (coming from the requests module).
I tried to simplify the requests a bit so I won't have to use a context manager for each request in my main module.
Therefore I tried this:
async def get(session, url, headers, proxies=None):
    async with session.get(url, headers=headers, proxy=proxies) as response:
        response_object = response
        return response_object
But it resulted in:
<class 'aiohttp.client_exceptions.ClientConnectionError'> - Connection closed
The response is available inside the context manager; when I access it there, everything works.
But shouldn't it also be possible to save it in the variable response_object and return it, so I can access it outside of the context manager?
Is there any workaround for this?
If you don't mind the body being loaded inside the get coroutine, you could read it there before returning:
async def get(session, url, headers, proxies=None):
    async with session.get(url, headers=headers, proxy=proxies) as response:
        await response.read()
        return response
And then use the already-read body like this:

resp = await get(session, 'http://python.org', {})
print(await resp.text())
Under the hood, the read method caches the body in a member named _body, and when you later call json (or text), aiohttp first checks whether the body has already been read.
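For completeness, a self-contained sketch of the same pattern (python.org is just an example target):

import asyncio
import aiohttp

async def get(session, url, headers, proxies=None):
    async with session.get(url, headers=headers, proxy=proxies) as response:
        # Reading here caches the body in _body, so the response stays
        # usable after the connection is released.
        await response.read()
        return response

async def main():
    async with aiohttp.ClientSession() as session:
        resp = await get(session, 'http://python.org', {})
        print(resp.status)
        print((await resp.text())[:100])  # served from the cached body

asyncio.run(main())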

Sending files using python 'aiohttp' produces "There was an error parsing the body"

I am trying to make two services communicate. The first API is exposed to the user;
the second is hidden and processes files, so the first redirects requests to it.
I want to make the POST request asynchronous using aiohttp, but I am facing this error: "There was an error parsing the body".
To recreate the error, let's say this is the server code:
from fastapi import FastAPI
from fastapi import UploadFile, File

app = FastAPI()

@app.post("/upload")
async def transcript_file(file: UploadFile = File(...)):
    pass
And this is the client code :
from fastapi import FastAPI
import aiohttp
app = FastAPI()
#app.post("/upload_client")
async def async_call():
async with aiohttp.ClientSession() as session:
headers = {'accept': '*/*',
'Content-Type': 'multipart/form-data'}
file_dict = {"file": open("any_file","rb")}
async with session.post("http://localhost:8000/upload", headers=headers, data=file_dict) as response:
return await response.json()
Description:
Run the server on port 8000 and the client on any port you like.
Open the browser and load the docs page on the client.
Execute the POST request and see the error.
Environment:
aiohttp = 3.7.4
fastapi = 0.63.0
uvicorn = 0.13.4
python-multipart = 0.0.2
Python version: 3.8.8
From this answer:
If you are using one of the multipart/* content types, you are actually required to specify the boundary parameter in the Content-Type header; otherwise the server (in the case of an HTTP request) will not be able to parse the payload.
You need to remove the explicit setting of the Content-Type header; aiohttp will add it implicitly for you, including the boundary parameter.
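Under that fix, the client from the question reduces to something like this sketch (same hypothetical endpoint and file name as above):

import aiohttp

async def async_call():
    async with aiohttp.ClientSession() as session:
        # No Content-Type header here: given a dict with a file object,
        # aiohttp builds the multipart/form-data body and sets the
        # Content-Type header, including the boundary, itself.
        file_dict = {"file": open("any_file", "rb")}
        async with session.post("http://localhost:8000/upload",
                                data=file_dict) as response:
            return await response.json()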

aiohttp: How to efficiently check HTTP headers before downloading response body?

I am writing a web crawler using asyncio/aiohttp. I want the crawler to download only HTML content and skip everything else. I wrote a simple function to filter URLs based on extensions, but this is not reliable because many download links do not include a filename/extension.
I could use aiohttp.ClientSession.head() to send a HEAD request, check the Content-Type field to make sure it's HTML, and then send a separate GET request. But this increases latency by requiring two separate requests per page (one HEAD, one GET), and I'd like to avoid that if possible.
Is it possible to just send a regular GET request, put aiohttp into a "streaming" mode that downloads only the headers, and then proceed with the body download only if the MIME type is correct? Or is there some (fast) alternative method for filtering out non-HTML content that I should consider?
UPDATE
As requested in the comments, I've included some example code of what I mean by making two separate HTTP requests (one HEAD request and one GET request):
import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com']

async def get_urls_async(urls):
    loop = asyncio.get_running_loop()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            print(f"This is the first (HEAD) request we send for {u}")
            # NOTE: despite the print, this is actually a GET request;
            # the body is not downloaded until read() is called.
            tasks.append(loop.create_task(session.get(u)))
        results = []
        for t in asyncio.as_completed(tasks):
            response = await t
            url = response.url
            if "text/html" in response.headers.get("Content-Type", ""):
                print("Sending the 2nd (GET) request to retrieve the body")
                r = await session.get(url)
                results.append((url, await r.read()))
            else:
                print(f"Not HTML, rejecting: {url}")
        return results

results = asyncio.run(get_urls_async(urls))
This is a protocol problem: if you do a GET, the server wants to send the body. If you don't retrieve the body, you have to discard the connection (this is in fact what aiohttp does if you don't call read() before __aexit__ on the response).
So the above code should do more or less what you want. NOTE: the server may already send more than just the headers in the first chunk.
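For illustration, a minimal sketch of that single-request pattern with a made-up helper fetch_if_html: the headers are available as soon as the GET returns, and the body is read only for HTML responses.

import asyncio
import aiohttp

async def fetch_if_html(session, url):
    async with session.get(url) as response:
        if "text/html" in response.headers.get("Content-Type", ""):
            return await response.read()  # body is only downloaded here
        response.close()  # discard the connection without reading the body
        return None

async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch_if_html(session, 'http://www.google.com')
        print(len(html) if html else 'skipped')

asyncio.run(main())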
