Python aiohttp client - disable encoding of redirected URL

I'm using the aiohttp client to send requests to a URL and collect the redirected URLs.
In my case, the redirected URL contains Unicode text, and I want it unmodified.
For example, the actual redirected URL is example.com/förderprojekte-e-v, but the aiohttp client auto-encodes it and returns example.com/f\udcf6rderprojekte-e-v.
How can I make aiohttp disable the auto-encoding of redirected URLs?
For the requests module, this solution works, but I need help for aiohttp.
My code:
async def fetch(url):
    # url = 'https://www.example.com/test/123'
    # client is an aiohttp.ClientSession
    async with client.get(url, allow_redirects=True) as resp:
        html = await resp.read()
        # comes back as example.com/f\udcf6rderprojekte-e-v
        redir_url = str(resp.url)
Or at least tell me how to convert \udcf6 to ö.

You can rebuild the unencoded URL from the raw (undecoded) path, e.g. in an aiohttp server handler:
async def handle_redirected_url(request):
    # 'https'
    scheme = request.scheme
    # 'example.com'
    host = request.host
    # '/f\udcf6rderprojekte-e-v'
    unencoded_path = request.raw_path
    unencoded_url = scheme + '://' + host + unencoded_path
    return web.json_response(status=200, data={'unencoded_url': unencoded_url})
See the request object attributes in the aiohttp web reference: https://docs.aiohttp.org/en/stable/web_reference.html
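If you only need to repair the already-mangled string on the client side, note that \udcf6 is a lone surrogate produced by Python's surrogateescape error handler as a stand-in for the raw byte 0xf6, which is ö in Latin-1. A minimal sketch of the round-trip, assuming the original URL bytes were Latin-1 encoded:
# 'f\udcf6rderprojekte-e-v' -> b'f\xf6rderprojekte-e-v' -> 'förderprojekte-e-v'
raw = redir_url.encode('utf-8', 'surrogateescape')  # recover the original raw bytes
fixed = raw.decode('latin-1')  # assumption: the site percent-encoded Latin-1 bytes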

Related

When using a proxy my request is set to HTTP/1

I'm using the httpx library, which allows me to make HTTP/2 requests to target sites.
However, when I use a proxy it seems to automatically set my request to HTTP/1, i.e.:
async def main():
    client = httpx.AsyncClient(http2=True)
    response = await client.get('someurl', headers=headers)
    print(response.http_version)
This prints HTTP/2.
But doing the same thing with a proxy, like so: client = httpx.AsyncClient(http2=True, proxies=someproxydictionary), prints HTTP/1.
Why is this behavior happening only when routing through proxies?

Sending files using python 'aiohttp' produces "There was an error parsing the body"

I am trying to make two services communicate. The first API is exposed to the user; the second is hidden and can process files, so the first can redirect requests to it.
I want to make the post request asynchronous using aiohttp, but I am facing this error: "There was an error parsing the body"
To recreate the error:
Let's say this is the server code:
from fastapi import FastAPI
from fastapi import UploadFile, File

app = FastAPI()

@app.post("/upload")
async def transcript_file(file: UploadFile = File(...)):
    pass
And this is the client code:
from fastapi import FastAPI
import aiohttp

app = FastAPI()

@app.post("/upload_client")
async def async_call():
    async with aiohttp.ClientSession() as session:
        headers = {'accept': '*/*',
                   'Content-Type': 'multipart/form-data'}
        file_dict = {"file": open("any_file", "rb")}
        async with session.post("http://localhost:8000/upload",
                                headers=headers, data=file_dict) as response:
            return await response.json()
Description:
Run the server on port 8000 and the client on any port you like.
Open the browser and open the docs page on the client.
Execute the post request and see the error.
Environment:
aiohttp = 3.7.4
fastapi = 0.63.0
uvicorn = 0.13.4
python-multipart = 0.0.2
Python version: 3.8.8
From this answer:
If you are using one of multipart/* content types, you are actually required to specify the boundary parameter in the Content-Type header, otherwise the server (in the case of an HTTP request) will not be able to parse the payload.
You need to remove the explicit setting of the Content-Type header; the aiohttp client will add it implicitly for you, including the boundary parameter.
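Applied to the client above, that would look something like this (a sketch; with the explicit header gone, aiohttp generates the multipart/form-data Content-Type, boundary included, from the file payload):
@app.post("/upload_client")
async def async_call():
    async with aiohttp.ClientSession() as session:
        # no explicit Content-Type: aiohttp sets
        # 'multipart/form-data; boundary=...' automatically
        headers = {'accept': '*/*'}
        file_dict = {"file": open("any_file", "rb")}
        async with session.post("http://localhost:8000/upload",
                                headers=headers, data=file_dict) as response:
            return await response.json()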

How to use OAuth1 with aiohttp

I have successfully implemented OAuth1 with the regular requests module like this:
import requests
from requests_oauthlib import OAuth1
oauth = OAuth1(
    client_key=oauth_cred["consumer_key"],
    client_secret=oauth_cred["consumer_secret"],
    resource_owner_key=oauth_cred["access_token"],
    resource_owner_secret=oauth_cred["access_token_secret"]
)
session = requests.Session()
session.auth = oauth
When trying to transfer this to aiohttp, I have not been able to get it to work. Substituting aiohttp.ClientSession() for requests.Session() gives me {'errors': [{'code': 215, 'message': 'Bad Authentication data.'}]}.
I have looked at some solutions on the internet like https://github.com/klen/aioauth-client, but this seems to be a different approach. I just want it to function exactly like in my example above.
I tried
import aiohttp
from aioauth_client import TwitterClient
oauth = TwitterClient(
    consumer_key=oauth_cred["consumer_key"],
    consumer_secret=oauth_cred["consumer_secret"],
    oauth_token=oauth_cred["access_token"],
    oauth_token_secret=oauth_cred["access_token_secret"]
)
session = aiohttp.ClientSession()
session.auth = oauth
but I got the same error.
How can I get this to work?
Using oauthlib:
import oauthlib.oauth1, aiohttp, asyncio

async def main():
    # Create the Client. This can be reused for multiple requests.
    client = oauthlib.oauth1.Client(
        client_key=oauth_cred['consumer_key'],
        client_secret=oauth_cred['consumer_secret'],
        resource_owner_key=oauth_cred['access_token'],
        resource_owner_secret=oauth_cred['access_token_secret']
    )

    # Define your request. In my code I'm POSTing so that's what I have here,
    # but if you're doing something different you'll need to change this a bit.
    uri = '...'
    http_method = 'POST'
    body = '...'
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded'
    }

    # Sign the request data. This needs to be called for each request you make.
    uri, headers, body = client.sign(
        uri=uri,
        http_method=http_method,
        body=body,
        headers=headers
    )

    # Make your request with the signed data.
    async with aiohttp.ClientSession() as session:
        async with session.post(uri, data=body, headers=headers, raise_for_status=True) as r:
            ...

# asyncio.run has a bug on Windows in Python 3.8 https://bugs.python.org/issue39232
# asyncio.run(main())
asyncio.get_event_loop().run_until_complete(main())
The oauthlib.oauth1.Client constructor takes a bunch more parameters too if you need them (for basic use you don't). The official documentation isn't very thorough, but the doc comment on the method itself is pretty good.
The doc comment on the Client.sign method has more information about the parameters it takes.
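If you are making a GET request instead, the signing call would look roughly like this (a sketch; with the default signature type the OAuth parameters are carried in the Authorization header, so there is no body to sign):
async def get_signed(client):
    # sign a GET: client.sign returns the (unchanged) uri, signed headers, and body
    uri, headers, _ = client.sign(
        uri='https://api.example.com/resource',  # hypothetical endpoint
        http_method='GET'
    )
    async with aiohttp.ClientSession() as session:
        async with session.get(uri, headers=headers, raise_for_status=True) as r:
            return await r.json()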

aiohttp: How to efficiently check HTTP headers before downloading response body?

I am writing a web crawler using asyncio/aiohttp. I want the crawler to download only HTML content and skip everything else. I wrote a simple function to filter URLs based on their extension, but this is not reliable because many download links do not include a filename/extension.
I could use aiohttp.ClientSession.head() to send a HEAD request, check the Content-Type field to make sure it's HTML, and then send a separate GET request. But this increases latency by requiring two separate requests per page (one HEAD, one GET), and I'd like to avoid that if possible.
Is it possible to send a regular GET request, put aiohttp into a "streaming" mode that downloads just the headers, and then proceed with the body download only if the MIME type is correct? Or is there some (fast) alternative method for filtering out non-HTML content that I should consider?
UPDATE
As requested in the comments, I've included some example code of what I mean by making two separate HTTP requests (one HEAD request and one GET request):
import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com']
results = []

async def get_urls_async(urls):
    loop = asyncio.get_running_loop()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            print(f"This is the first (HEAD) request we send for {u}")
            tasks.append(loop.create_task(session.get(u)))
        results = []
        for t in asyncio.as_completed(tasks):
            response = await t
            url = response.url
            if "text/html" in response.headers["Content-Type"]:
                print("Sending the 2nd (GET) request to retrieve body")
                r = await session.get(url)
                results.append((url, await r.read()))
            else:
                print(f"Not HTML, rejecting: {url}")
        return results

results = asyncio.run(get_urls_async(urls))
This is a protocol problem: if you do a GET, the server wants to send the body. If you don't retrieve the body, you have to discard the connection (this is in fact what aiohttp does if you don't call read() before __aexit__ on the response).
So the above code should do more or less what you want. NOTE: the server may already send more than just the headers in the first chunk.
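In other words, a single GET per page is enough: aiohttp hands back the response object as soon as the headers have arrived, and the body is only transferred when you read it. A minimal sketch of that pattern (the helper name is mine):
async def fetch_if_html(session, url):
    async with session.get(url) as resp:
        if "text/html" in resp.headers.get("Content-Type", ""):
            return await resp.read()  # only now is the body downloaded
        # leave the body unread: __aexit__ discards the connection
        return None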

How to make a client request to an external server avoiding the cache using aiohttp

We are using aiohttp to make multiple requests to various website vendors to grab their latest data.
Some of the content providers serve the data from a cache. Is it possible to request the data from the server directly? We have tried passing the headers parameter, with no luck.
async def fetch(url):
    global response
    headers = {'Cache-Control': 'no-cache'}
    async with ClientSession() as session:
        async with session.get(url, headers=headers, proxy="OUR-PROXY") as response:
            return await response.read()
The goal is to get the last-modified date header, which is not provided from the cache request.
Try adding an additional variable with a dynamic value (e.g. a timestamp) to the URL.
This will prevent caching on the server side even if it ignores Cache-Control.
Example:
from: https://example.com/test
to: https://example.com/test?timestamp=20180724181234
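Applied to the fetch coroutine above, a sketch (the parameter name timestamp is arbitrary; any name the server ignores will do):
import time
from aiohttp import ClientSession

async def fetch(url):
    headers = {'Cache-Control': 'no-cache'}
    # dynamic cache-busting query parameter, e.g. the current timestamp
    params = {'timestamp': time.strftime('%Y%m%d%H%M%S')}
    async with ClientSession() as session:
        async with session.get(url, headers=headers, params=params) as response:
            return await response.read()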
