Does aiofile write and read in the background to avoid blocking the executing thread? - python

I have recently been working with Python and I am unsure about asyncio. The program requests a URL, parses a tag from each page, and finally writes the tags to a local file. It uses the aiofiles library to write the tags into the file. I read that aiofiles lets one open a file asynchronously and use its methods as coroutines. Does this mean that while the tags are being written to the local file in the background, I can continue to execute other tasks (such as requesting other URLs and parsing the pages that have already been fetched) without having to wait until all tags have been written to the file?
Here is the relevant part of the code (fetch() and parse() are defined elsewhere; imports added for context):

import logging
from typing import IO
import aiofiles

logger = logging.getLogger(__name__)

async def fetch():
    ...

async def parse():
    ...

async def write_one(file: IO, url: str, **kwargs) -> None:
    """Write the found HREFs from `url` to `file`."""
    res = await parse(url=url, **kwargs)
    if not res:
        return None
    async with aiofiles.open(file, "a") as f:
        for p in res:
            await f.write(f"{url}\t{p}\n")
        logger.info("Wrote results for source URL: %s", url)
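To make the question concrete, here is a minimal sketch of how I imagine driving it (bulk_crawl_and_write and its arguments are hypothetical, and fetch()/parse() are assumed to be fully implemented):

import asyncio

async def bulk_crawl_and_write(file: str, urls: list) -> None:
    # One task per URL: while one task awaits a file write, the event
    # loop is free to run the fetch/parse work of the other tasks.
    tasks = [asyncio.create_task(write_one(file=file, url=url)) for url in urls]
    await asyncio.gather(*tasks)

# asyncio.run(bulk_crawl_and_write("foundurls.txt", ["https://example.com"]))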

Related

Return File/Streaming response from online video URL in FastAPI

I am using FastAPI to return a video response from googlevideo.com. This is the code I am using:
@app.get(params.api_video_route)
async def get_api_video(url=None):
    def iter():
        req = urllib.request.Request(url)
        with urllib.request.urlopen(req) as resp:
            yield from io.BytesIO(resp.read())
    return StreamingResponse(iter(), media_type="video/mp4")
but this is not working
I want this Node.js code converted into Python FastAPI:

app.get("/download-video", function (req, res) {
  http.get(decodeURIComponent(req.query.url), function (response) {
    res.setHeader("Content-Length", response.headers["content-length"]);
    if (response.statusCode >= 400)
      res.status(500).send("Error");
    response.on("data", function (chunk) { res.write(chunk); });
    response.on("end", function () { res.end(); });
  });
});
I encountered similar issues but solved them all. The main idea is to create a session with requests.Session() and yield the chunks one by one, instead of reading all the content and yielding it at once. This works very nicely without causing any memory issues at all.
@app.get(params.api_video_route)
async def get_api_video(url=None):
    def iter():
        session = requests.Session()
        r = session.get(url, stream=True)
        r.raise_for_status()
        for chunk in r.iter_content(1024 * 1024):
            yield chunk
    return StreamingResponse(iter(), media_type="video/mp4")
The quick solution would be to replace yield from io.BytesIO(resp.read()) with the line below (see the FastAPI documentation on StreamingResponse for more details).
yield from resp
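For completeness, here is the question's endpoint with only that line changed (a sketch; the /download-video route and the url query parameter are illustrative stand-ins for params.api_video_route):

import urllib.request
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/download-video")
async def get_api_video(url: str):
    def iter_content():
        req = urllib.request.Request(url)
        with urllib.request.urlopen(req) as resp:
            # Stream the raw response instead of buffering it all in memory.
            yield from resp
    return StreamingResponse(iter_content(), media_type="video/mp4")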
However, instead of using urllib.request and resp.read() (which would read the entire file contents into memory, hence the reason for taking too long to respond), I would suggest you use the HTTPX library, which, among other things, provides async support as well. It also supports Streaming Responses (see async Streaming Responses too), and thus you can avoid loading the entire response body into memory at once (especially when dealing with large files). Examples of how to stream a video from a given URL are provided below, in both synchronous and asynchronous ways.
Note: Both versions below would allow multiple clients to connect to the server and get the video stream without being blocked, as a normal def endpoint in FastAPI is run in an external threadpool that is then awaited, instead of being called directly (which would block the server)—thus ensuring that FastAPI will still work asynchronously. Even if you defined the endpoint of the first example below with async def instead, it would still not block the server, as StreamingResponse will run the code (for sending the body chunks) in an external threadpool that is then awaited (have a look at this comment and the source code here), provided that the function for streaming the response body (i.e., iterfile() in the examples below) is a normal generator/iterator (as in the first example) and not an async one (as in the second example). However, if you had some other I/O or CPU blocking operations inside that endpoint, they would block the server, and hence you should drop the async definition on that endpoint. The second example demonstrates how to implement the video streaming in an async def endpoint, which is useful when you have to call other async functions inside the endpoint that you have to await, and it also saves FastAPI from running the endpoint in an external threadpool. For more details on def vs async def, please have a look at this answer.
The examples below use the iter_bytes() and aiter_bytes() methods, respectively, to get the response body in chunks. These functions, as described in the documentation links above and in the source code here, can handle gzip, deflate, and brotli encoded responses. One can alternatively use the iter_raw() method to get the raw response bytes without applying content decoding (if it is not needed). This method, in contrast to iter_bytes(), allows you to optionally define the chunk_size for streaming the response content, e.g., iter_raw(1024 * 1024). However, this doesn't mean that you read the body in chunks of that size from the server (that is serving the file) directly. If you had a closer look at the source code of iter_raw(), you would see that it just uses a ByteChunker that stores the byte contents in memory (using a BytesIO() stream) and returns the content in fixed-size chunks, depending on the chunk size you passed to the function (whereas raw_stream_bytes, as shown in the linked source code above, contains the actual byte chunk read from the stream).
Using HTTPX with def endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()

@app.get('/video')
def get_video(url: str):
    def iterfile():
        with httpx.stream("GET", url) as r:
            for chunk in r.iter_bytes():
                yield chunk
    return StreamingResponse(iterfile(), media_type="video/mp4")
Using HTTPX with async def endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()

@app.get('/video')
async def get_video(url: str):
    async def iterfile():
        async with httpx.AsyncClient() as client:
            async with client.stream("GET", url) as r:
                async for chunk in r.aiter_bytes():
                    yield chunk
    return StreamingResponse(iterfile(), media_type="video/mp4")
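Relatedly, if content decoding is not needed, the synchronous example could use iter_raw() with an explicit chunk_size instead of iter_bytes(). A quick sketch (the 1 MiB chunk size and the /video-raw path are arbitrary choices):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import httpx

app = FastAPI()

@app.get('/video-raw')
def get_video_raw(url: str):
    def iterfile():
        with httpx.stream("GET", url) as r:
            # Raw (undecoded) bytes, returned in ~1 MiB chunks.
            for chunk in r.iter_raw(1024 * 1024):
                yield chunk
    return StreamingResponse(iterfile(), media_type="video/mp4")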
You can use public videos provided here to test the above. Example:
http://127.0.0.1:8000/video?url=http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4
If you would like to return a custom Response or FileResponse instead—which I wouldn't really recommend in case you are dealing with large video files, as you should either read the entire contents into memory, or save the contents to a temporary file on disk that you later have to read again into memory, in order to send it back to the client—please have a look at this answer and this answer.

How can I upload files through aiohttp using response from get request?

To start off, I am writing an async wrapper for the WordPress REST API. I have a Wordpress site hosted on Bluehost. I am working with the endpoint for media (image) uploads. I have successfully managed to upload an image but there are 2 changes I would like to make. The second change is what I really want, but out of curiosity, I would like to know how to implement change 1 too. I'll provide the code first and then some details.
Working code
async def upload_local_pic2(self, local_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'title': title, 'status': 'publish'}
        mpwriter.append_json(json)
        with open(local_url, 'rb') as f:
            print(f)
            payload = mpwriter.append(f)
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
Change 1
The first change is uploading using aiofiles.open() instead of just using open() as I expect to be processing lots of files. The following code does not work.
async def upload_local_pic(self, local_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'title': title, 'status': 'publish'}
        mpwriter.append_json(json)
        async with aiofiles.open(local_url, 'rb') as f:
            print(f)
            payload = mpwriter.append(f)
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
Change 2
My other change is that I would like to have another function that can upload the files directly to the WordPress server without downloading them locally. So instead of getting a local picture, I want to pass in the url of an image online. The following code also does not work.
async def upload_pic(self, image_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'title': title, 'status': 'publish'}
        mpwriter.append_json(json)
        async with self.session.get(image_url) as image_response:
            image_content = image_response.content
            print(image_content)
            payload = mpwriter.append(image_content)
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
Details/Debugging
I'm trying to figure out why each one won't work. I think the key is in the calls to print(image_content) and print(f), which show exactly what I am passing to mpwriter.append.
In the example that works where I just use the standard Python open() function, I am apparently passing in <_io.BufferedReader name='/redactedfilepath/index.jpeg'>
In the change 1 example with aiofile, I am passing in <aiofiles.threadpool.binary.AsyncBufferedReader object at 0x7fb803122250>
Wordpress will return this html:
b'<head><title>Not Acceptable!</title></head><body><h1>Not Acceptable!</h1><p>An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.</p></body></html>'
And finally, in change 2 where I try to pass in what the get request to the url gives me I get
<StreamReader 292 bytes>. The response returned by WordPress is the same as above with Mod Security.
Any idea how I can make these examples work? It seems like they are all some type of io reader but I guess the underlying aiohttp code treats them differently.
Also this shouldn't really matter, but this is the url I am passing into the change 2 example.
Ok, so I figured out both changes.
For the first change, when reading the file with aiofiles I need to read the whole file and append its contents, instead of passing in the file handle. I also need to set the content disposition manually.
async def upload_local_pic(self, local_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'status': 'publish'}
        mpwriter.append_json(json)
        async with aiofiles.open(local_url, mode='rb') as f:
            contents = await f.read()
            payload = mpwriter.append(contents)
            payload.set_content_disposition('attachment', filename=title + '.jpg')
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
For the second change, it's a similar concept with just uploading a file directly from the URL. Instead of passing in the handler that will read the content, I need to read the entire content first. I also need to set the content-disposition manually.
async def upload_pic(self, image_url, date, title):
    url = f'{self.base_url}/wp-json/wp/v2/media'
    with aiohttp.MultipartWriter() as mpwriter:
        json = {'status': 'publish'}
        mpwriter.append_json(json)
        async with self.session.get(image_url) as image_response:
            image_content = await image_response.read()
            payload = mpwriter.append(image_content)
            payload.set_content_disposition('attachment', filename=title + '.jpg')
            async with self.session.post(url, data=payload) as response:
                x = await response.read()
                print(x)
I will answer only the title of the post (and not the questions in between).
The following code should give a short example of how to upload a file from URL#1 to URL#2 (without the need to download the file to the local machine and only then do the upload).
I will give two examples here:
1. Read all of the file content into memory (without saving it locally first). This is of course not so good when working with huge files...
2. Read and send the file in chunks (so we don't read the whole file content at once).
Example #1: Reading all file content AT ONCE and uploading
import asyncio
import aiohttp

async def http_upload_from_url(src, dst):
    async with aiohttp.ClientSession() as session:
        src_resp = await session.get(src)
        # print(src_resp)
        dst_resp = await session.post(dst, data=src_resp.content)
        # print(dst_resp)

try:
    asyncio.run(http_upload_from_url(SRC_URL, DST_URL))
except Exception as e:
    print(e)
Example #2: Reading file content IN CHUNKS and uploading
import asyncio
import aiohttp

async def url_sender(url=None, chunk_size=65536):
    async with aiohttp.ClientSession() as session:
        resp = await session.get(url)
        # print(resp)
        async for chunk in resp.content.iter_chunked(chunk_size):
            # print(f"send chunk with size {len(chunk)}")
            yield chunk

async def chunked_http_upload_from_url(src, dst):
    async with aiohttp.ClientSession() as session:
        resp = await session.post(dst, data=url_sender(src))
        # print(resp)
        # print(await resp.text())

try:
    asyncio.run(chunked_http_upload_from_url(SRC_URL, DST_URL))
except Exception as e:
    print(e)
Some notes:
You need to define SRC_URL and DST_URL.
I've only added the prints for debugging (in case you don't get a [200 OK] response).

Long running requests with asyncio and aiohttp

Apologies for asking what may be considered redundant, but I'm finding it extremely difficult to figure out the current recommended best practices for using asyncio and aiohttp.
I'm working with an API that ultimately returns a link to a generated CSV file. There are two steps in using the API.
1. Submit a request that triggers a long-running process and returns a status URL.
2. Poll the status URL until the status_code is 201, then get the URL of the CSV file from the headers.
Here's a stripped down example of how I can successfully do this synchronously with requests.
import time
import requests

def submit_request(id):
    """Submit request to create CSV for specified id"""
    body = {'id': id}
    response = requests.get(
        url='https://www.example.com/endpoint',
        json=body
    )
    response.raise_for_status()
    return response

def get_status(request_response):
    """Check whether the CSV has been created."""
    status_response = requests.get(
        url=request_response.headers['Location']
    )
    status_response.raise_for_status()
    return status_response

def get_data_url(id, poll_interval=10):
    """Submit request to create CSV for specified ID, wait for it to finish,
    and return the URL of the CSV.

    Wait between status requests based on poll_interval.
    """
    response = submit_request(id)
    while True:
        status_response = get_status(response)
        if status_response.status_code == 201:
            break
        time.sleep(poll_interval)
    data_url = status_response.headers['Location']
    return data_url
What I'd like to do is be able to submit a group of requests at once, and then wait on all of them to be finished. But I'm not clear on how to structure this with asyncio and aiohttp.
One option would be to first submit all of the requests and then use asyncio.gather (or something like it) to collect all of the status URLs. Then start another event loop where I continuously poll the status URLs until they have all completed, ending up with a list of data URLs.
Alternatively, I suppose I could create a single function that submits the request, gets the status URL, and then polls that until it completes. In that case I would just have a single event loop where I submit each of the IDs that I want processed.
If some pseudo code for those options would be useful I can try to provide it. I've looked at a lot of different examples where you submit requests for a bunch of URLs asynchronously -- this for example -- but I'm finding that I get a bit lost when trying to translate them to this slightly more complicated scenario where I submit the request and then get back a new URL to poll.
FYI based on the comments above my current solution is something like this.
import asyncio
import aiohttp

async def get_data_url(session, id):
    url = 'https://www.example.com/endpoint'
    body = {'id': id}
    async with session.post(url=url, json=body) as response:
        response.raise_for_status()
        status_url = response.headers['Location']
    while True:
        async with session.get(url=status_url) as status_response:
            status_response.raise_for_status()
            if status_response.status == 201:
                return status_response.headers['Location']
        await asyncio.sleep(10)

async def main(access_token, id):
    headers = {'token': access_token}
    async with aiohttp.ClientSession(headers=headers) as session:
        data_url = await get_data_url(session, id)
        return data_url
This works, though I'm still not sure about best practices for submitting a set of IDs. I think asyncio.gather would work but it looks like it's deprecated. Ideally I would have a queue of, say, 100 IDs and only have 5 requests running at any given time. I've found some examples like this but they depend on asyncio.Queue, which is also deprecated.
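For what it's worth, asyncio.gather itself is not deprecated (only its explicit loop parameter was removed); here is a rough sketch of capping concurrency with an asyncio.Semaphore around the get_data_url() coroutine above (the limit of 5 and the range of IDs are illustrative):

import asyncio
import aiohttp

MAX_CONCURRENT = 5  # illustrative limit

async def get_data_url_limited(session, semaphore, id):
    # At most MAX_CONCURRENT of these run their body at the same time.
    async with semaphore:
        return await get_data_url(session, id)

async def main(access_token, ids):
    headers = {'token': access_token}
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession(headers=headers) as session:
        tasks = [get_data_url_limited(session, semaphore, id) for id in ids]
        # gather() returns results in the same order as `ids`.
        return await asyncio.gather(*tasks)

# data_urls = asyncio.run(main('my-token', list(range(100))))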

aiohttp: How to efficiently check HTTP headers before downloading response body?

I am writing a web crawler using asyncio/aiohttp. I want the crawler to download only HTML content and skip everything else. I wrote a simple function to filter URLs based on extensions, but this is not reliable because many download links do not include a filename/extension.
I could use aiohttp.ClientSession.head() to send a HEAD request, check the Content-Type field to make sure it's HTML, and then send a separate GET request. But this will increase the latency by requiring two separate requests per page (one HEAD, one GET), and I'd like to avoid that if possible.
Is it possible to just send a regular GET request, and set aiohttp into "streaming" mode to download just the header, and then proceed with the body download only if the MIME type is correct? Or is there some (fast) alternative method for filtering out non-HTML content that I should consider?
UPDATE
As requested in the comments, I've included some example code of what I mean by making two separate HTTP requests (one HEAD request and one GET request):
import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com']
results = []

async def get_urls_async(urls):
    loop = asyncio.get_running_loop()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            print(f"This is the first (HEAD) request we send for {u}")
            tasks.append(loop.create_task(session.get(u)))
        results = []
        for t in asyncio.as_completed(tasks):
            response = await t
            url = response.url
            if "text/html" in response.headers["Content-Type"]:
                print("Sending the 2nd (GET) request to retrieve the body")
                r = await session.get(url)
                results.append((url, await r.read()))
            else:
                print(f"Not HTML, rejecting: {url}")
        return results

results = asyncio.run(get_urls_async(urls))
This is a protocol problem: if you do a GET, the server wants to send the body. If you don't retrieve the body, you have to discard the connection (this is in fact what aiohttp does if you don't do a read() before __aexit__ on the response).
So the above code should do more or less what you want. NOTE: the server may already send more than just the headers in the first chunk.
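Here is a rough sketch of that idea with a single GET per URL: the headers are available as soon as the response starts, and the body is downloaded only when the Content-Type looks like HTML; leaving the body unread lets aiohttp discard the connection on exit from the context manager (URLs are taken from the question; the sequential loop is just for brevity):

import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com']

async def fetch_html_only(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        for u in urls:
            async with session.get(u) as response:
                # Headers are parsed before any of the body is read.
                if "text/html" in response.headers.get("Content-Type", ""):
                    results.append((response.url, await response.read()))
                else:
                    # Body never read: the connection is discarded on exit.
                    print(f"Not HTML, rejecting: {response.url}")
    return results

results = asyncio.run(fetch_html_only(urls))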

How to create an async generator in Python?

I'm trying to rewrite this Python2.7 code to the new async world order:
import multiprocessing

def get_api_results(func, iterable):
    pool = multiprocessing.Pool(5)
    for res in pool.map(func, iterable):
        yield res
map() blocks until all results have been computed, so I'm trying to rewrite this as an async implementation that will yield results as soon as they are ready. Like map(), return values must be returned in the same order as iterable. I tried this (I need requests because of legacy auth requirements):
import asyncio
import requests

def get(i):
    r = requests.get('https://example.com/api/items/%s' % i)
    return i, r.json()

async def get_api_results():
    loop = asyncio.get_event_loop()
    futures = []
    for n in range(1, 11):
        futures.append(loop.run_in_executor(None, get, n))
    async for f in futures:
        k, v = await f
        yield k, v

for r in get_api_results():
    print(r)
but with Python 3.6 I'm getting:
File "scratch.py", line 16, in <module>
for r in get_api_results():
TypeError: 'async_generator' object is not iterable
How can I accomplish this?
Regarding your older (2.7) code - multiprocessing is considered a powerful drop-in replacement for the much simpler threading module for concurrently processing CPU-intensive tasks, where threading does not work so well. Your code is probably not CPU bound - since it just needs to make HTTP requests - so threading might have been enough to solve your problem.
However, instead of using threading directly, Python 3+ has a nice module called concurrent.futures that offers a cleaner API via cool Executor classes. This module is also available for Python 2.7 as an external package.
The following code works on python 2 and python 3:
# For python 2, first run:
#
#     pip install futures
#
from __future__ import print_function

import requests
from concurrent import futures

URLS = [
    'http://httpbin.org/delay/1',
    'http://httpbin.org/delay/3',
    'http://httpbin.org/delay/6',
    'http://www.foxnews.com/',
    'http://www.cnn.com/',
    'http://europe.wsj.com/',
    'http://www.bbc.co.uk/',
    'http://some-made-up-domain.coooom/',
]

def fetch(url):
    r = requests.get(url)
    r.raise_for_status()
    return r.content

def fetch_all(urls):
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        print("All URLs submitted.")
        for future in futures.as_completed(future_to_url):
            url = future_to_url[future]
            if future.exception() is None:
                yield url, future.result()
            else:
                # print('%r generated an exception: %s' % (
                #     url, future.exception()))
                yield url, None

for url, s in fetch_all(URLS):
    status = "{:,.0f} bytes".format(len(s)) if s is not None else "Failed"
    print('{}: {}'.format(url, status))
This code uses futures.ThreadPoolExecutor, based on threading. A lot of the magic is in as_completed(), used here.
Your Python 3.6 code above uses run_in_executor(); with None it falls back to the loop's default futures.ThreadPoolExecutor(), and it does not really use asynchronous IO!!
If you really want to go forward with asyncio, you will need to use an HTTP client that supports asyncio, such as aiohttp. Here is some example code:
import asyncio
import aiohttp

async def fetch(session, url):
    print("Getting {}...".format(url))
    async with session.get(url) as resp:
        text = await resp.text()
    return "{}: Got {} bytes".format(url, len(text))

async def fetch_all():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, "http://httpbin.org/delay/{}".format(delay))
                 for delay in (1, 1, 2, 3, 3)]
        for task in asyncio.as_completed(tasks):
            print(await task)
    return "Done."

loop = asyncio.get_event_loop()
resp = loop.run_until_complete(fetch_all())
print(resp)
loop.close()
As you can see, asyncio also has an as_completed(), now using real asynchronous IO, utilizing only one thread on one process.
You put your event loop in another coroutine. Don't do that. The event loop is the outermost 'driver' of async code, and should be run synchronously.
If you need to process the fetched results, write more coroutines that do so. They could take the data from a queue, or could be driving the fetching directly.
You could have a main function that fetches and processes results, for example:
async def main(loop):
    for n in range(1, 11):
        future = loop.run_in_executor(None, get, n)
        k, v = await future
        # do something with the result

loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
I'd make the get() function properly async too using an async library like aiohttp so you don't have to use the executor at all.
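As a sketch of that last suggestion, get() can be rewritten as a coroutine on top of aiohttp (the example.com endpoint is the question's placeholder), with asyncio.gather preserving input order the way map() does, and the results consumed with async for (asyncio.run requires Python 3.7+):

import asyncio
import aiohttp

async def get(session, i):
    # Placeholder endpoint from the question.
    async with session.get('https://example.com/api/items/%s' % i) as r:
        return i, await r.json()

async def get_api_results():
    async with aiohttp.ClientSession() as session:
        # gather() keeps the results in the same order as the inputs.
        for result in await asyncio.gather(*(get(session, n) for n in range(1, 11))):
            yield result

async def main():
    async for k, v in get_api_results():
        print(k, v)

asyncio.run(main())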
