asyncio.gather() and aiohttp not behaving as expected - python

I am pretty new to asyncio and I am trying to learn it seriously, so don't bash me too much.
I have written this code where I am trying to use asyncio and aiohttp to fetch input files from an external API while another coroutine fetches a file from an AWS S3 bucket and processes it.
Stripped down, my code looks like this:
async def fetchFromAPI():
    ...
    print("Fetching from my API...")
    try:
        chunk_counter = 1
        async with aiohttp.ClientSession(timeout=timeoutObject) as session:
            async with session.get(url) as r:
                r.raise_for_status()
                with open(rawResponseFilepath, "w+") as f:
                    async for chunk in r.content.iter_chunked(8192):
                        chunk = chunk.decode('ASCII')
                        f.write(chunk)
                        print(f"Chunk #{chunk_counter} download completed.")
                        chunk_counter += 1
        print(f"Streaming complete. Datarecord file available at {rawResponseFilepath}")
    except Exception as e:
        print(f"Could not query API. Error: {tb.format_exc()}.")
        raise
    ...
async def fetchFromS3AndProcess():
    ...
    print("Fetching from s3...")
    s3_client.download_file(filename, localfile)
    print("Done fetching from S3.")
    processFile(filename)
    print("Done processing.")
async def main():
    await asyncio.gather(fetchFromAPI(), fetchFromS3AndProcess())

if __name__ == '__main__':
    asyncio.run(main())
What I am observing from the print statements (exemplified):
Fetching from my API...
Fetching from s3...
Done fetching from S3.
Done processing.
Chunk #1 download completed.
Chunk #2 download completed.
.
.
.
Chunk #10000 download completed.
Streaming complete. Datarecord file available at dataRecord.dr.
What I am expecting instead is to see the chunk download messages printed in between the S3 fetching and processing print statements, signifying that the two downloads actually start at the same time.
Am I missing something obvious? Should I provide more info on the underlying coroutines to see if I am blocking somewhere?
If there is too little context to illustrate the problem well, I shall expand. Thanks!
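For reference while reading the code above: both s3_client.download_file (boto3) and processFile are ordinary blocking calls, so while they run nothing else on the event loop can proceed. Below is a minimal sketch of how they could be pushed onto a worker thread, assuming Python 3.9+ for asyncio.to_thread and reusing the placeholder names from the question; it is an illustration, not the original code.
import asyncio

async def fetchFromS3AndProcess():
    print("Fetching from s3...")
    # boto3's download_file is a plain blocking call; running it in a worker
    # thread keeps the event loop free to service the aiohttp coroutine.
    await asyncio.to_thread(s3_client.download_file, filename, localfile)
    print("Done fetching from S3.")
    # processFile is also synchronous, so it is offloaded the same way.
    await asyncio.to_thread(processFile, filename)
    print("Done processing.")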

Related

How to really make this operation async in python?

I have very recently started checking out asyncio for Python. The use case is that I hit an endpoint /refresh_data and the data is refreshed (from an S3 bucket). But this should be a non-blocking operation; other API endpoints should still be able to be serviced.
So in my controller, I have:
def refresh_data():
    myservice.refresh_data()
    return jsonify(dict(ok=True))
and in my service I have:
async def refresh_data():
    try:
        s_result = await self.s3_client.fetch()
    except (FileNotFound, IOError) as e:
        logger.info("problem")
    gather = {i.pop("x"): i async for i in summary_result}
    # ... some other stuff
and in my client:
async def fetch():
    result = pd.read_parquet("bucket", "pyarrow", cols, filts).to_dict(orient="col1")
    return result
And when I run it, I see this error:
TypeError: 'async for' requires an object with __aiter__ method, got coroutine
I don't know how to move past this. Adding async definitely makes it return a coroutine type, but either I have implemented this messily or I haven't fully understood the asyncio package in Python. I have been working off of simple examples, but I'm not sure what's wrong with what I've done here.
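The error message points at the difference between a coroutine and an async iterable: async for only works on objects that implement __aiter__ (such as an async generator), while a plain async def function returns a coroutine that has to be awaited. A minimal standalone illustration of the distinction, using hypothetical names rather than the service code from the question:
import asyncio

async def fetch_once():
    # A plain coroutine function: calling it gives a coroutine,
    # which must be awaited to get the result.
    return [{"x": 1, "v": "a"}, {"x": 2, "v": "b"}]

async def fetch_stream():
    # An async generator: it has __aiter__, so `async for` works on it.
    for item in [{"x": 1, "v": "a"}, {"x": 2, "v": "b"}]:
        yield item

async def main():
    rows = await fetch_once()                 # OK: await the coroutine first
    summary = {i.pop("x"): i for i in rows}   # then iterate the plain result
    async for item in fetch_stream():         # OK: async for over an async iterable
        print(item)
    print(summary)

asyncio.run(main())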

How stream a response from a Twisted server?

Issue
My problem is that I can't write a server that streams the response my application sends back.
The response is not retrieved chunk by chunk, but as a single block once the iterator has finished iterating.
Approach
When I write the response with the write method of Request, it correctly understands that we are sending a chunk.
I checked whether Twisted uses a buffer size, but the message-size check seems to be done in doWrite.
After spending some time debugging, it seems that the reactor only reads and writes at the end.
If I understood correctly how a reactor works in Twisted, it writes and reads when the file descriptor is available.
What is a file descriptor in Twisted?
Why is it not available after writing the response?
Example
I have written a minimal script of what I would like my server to look like.
It's an "ASGI-like" server that runs an application which iterates over a function yielding very large strings:
# async_stream_server.py
import asyncio
from twisted.internet import asyncioreactor

twisted_loop = asyncio.new_event_loop()
asyncioreactor.install(twisted_loop)

import time
from sys import stdout

from twisted.web import http
from twisted.python.log import startLogging
from twisted.internet import reactor, endpoints

CHUNK_SIZE = 2**16

def async_partial(async_fn, *partial_args):
    async def wrapped(*args):
        return await async_fn(*partial_args, *args)
    return wrapped

def iterable_content():
    for _ in range(5):
        time.sleep(1)
        yield b"a" * CHUNK_SIZE

async def application(send):
    for part in iterable_content():
        await send(
            {
                "body": part,
                "more_body": True,
            }
        )
    await send({"more_body": False})

class Dummy(http.Request):
    def process(self):
        asyncio.ensure_future(
            application(send=async_partial(self.handle_reply)),
            loop=asyncio.get_event_loop()
        )

    async def handle_reply(self, message):
        http.Request.write(self, message.get("body", b""))
        if not message.get("more_body", False):
            http.Request.finish(self)
        print('HTTP response chunk')

class DummyFactory(http.HTTPFactory):
    def buildProtocol(self, addr):
        protocol = http.HTTPFactory.buildProtocol(self, addr)
        protocol.requestFactory = Dummy
        return protocol

startLogging(stdout)
endpoints.serverFromString(reactor, "tcp:1234").listen(DummyFactory())
asyncio.set_event_loop(reactor._asyncioEventloop)
reactor.run()
To execute this example:
in a terminal, run:
python async_stream_server.py
in another terminal, run:
curl http://localhost:1234/
You will have to wait a while before you see the whole message.
Details
$ python --version
Python 3.10.4
$ pip list
Package Version Editable project location
----------------- ------- --------------------------------------------------
asgiref 3.5.0
Twisted 22.4.0
You just need to sprinkle some more async over it.
As written, the iterable_content generator blocks the reactor until it finishes generating content. This is why you see no results until it is done. The reactor does not get control of execution back until it finishes.
That's only because you used time.sleep to insert a delay into it. time.sleep blocks. This -- and everything else in the "asynchronous" application -- is really synchronous and keeps control of execution until it is done.
If you replace iterable_content with something that's really asynchronous, like an asynchronous generator:
async def iterable_content():
    for _ in range(5):
        await asyncio.sleep(1)
        yield b"a" * CHUNK_SIZE
and then iterate over it asynchronously with async for:
async def application(send):
    async for part in iterable_content():
        await send(
            {
                "body": part,
                "more_body": True,
            }
        )
    await send({"more_body": False})
then the reactor has a chance to run in between iterations and the server begins to produce output chunk by chunk.

python/httpx/asyncio: httpx.RemoteProtocolError: Server disconnected without sending a response

I am attempting to optimize a simple web scraper that I made. It gets a list of urls from a table on a main page and then goes to each of those "sub" urls and gets information from those pages. I was able to write it successfully both synchronously and with concurrent.futures.ThreadPoolExecutor(). However, I am now trying to optimize it to use asyncio and httpx, as these seem to be very fast for making hundreds of http requests.
I wrote the following script using asyncio and httpx however, I keep getting the following errors:
httpcore.RemoteProtocolError: Server disconnected without sending a response.
RuntimeError: The connection pool was closed while 4 HTTP requests/responses were still in-flight.
It appears that I keep losing the connection when I run the script. I even attempted running a synchronous version of it and got the same error. I was thinking that the remote server was blocking my requests; however, I am able to run my original program and go to each of the urls from the same IP address without issue.
What would cause this exception and how do you fix it?
import httpx
import asyncio

async def get_response(client, url):
    resp = await client.get(url, headers=random_user_agent())  # Gets a random user agent.
    html = resp.text
    return html

async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')
        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))
        detail_responses = await asyncio.gather(*tasks)
        for resp in detail_responses:
            event = get_details(resp)  # Parse url and get desired info

asyncio.run(main())
I've had the same issue. The problem occurs when there is an exception in one of the asyncio.gather tasks: when it is raised, it causes the httpx client to call __aexit__ and cancel all of the current requests. You can bypass this by passing return_exceptions=True to asyncio.gather:
async def main():
    async with httpx.AsyncClient() as client:
        tasks = []
        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')
        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))
        detail_responses = await asyncio.gather(*tasks, return_exceptions=True)
        for resp in detail_responses:
            # here you would need to do smth with the exceptions
            # if isinstance(resp, Exception): ...
            event = get_details(resp)  # Parse url and get desired info
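With return_exceptions=True, failed requests come back as exception objects inside detail_responses, so they need to be filtered out before parsing. A small sketch of what that final loop might look like (get_details is the same placeholder as in the question):
for resp in detail_responses:
    if isinstance(resp, Exception):
        # A failed request shows up here as an exception object
        # instead of cancelling the other tasks.
        print(f"Request failed: {resp!r}")
        continue
    event = get_details(resp)  # Parse url and get desired info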

Async HTTP API call for each line of file - Python

I am working on a big data problem and am stuck with some concurrency and async io issues. The problem is as follows:
1) I have multiple huge files (~4 GB each, up to 15 of them) which I am processing using ProcessPoolExecutor from the concurrent.futures module this way:
def process(source):
    files = os.listdir(source)
    with ProcessPoolExecutor() as executor:
        future_to_url = {executor.submit(process_individual_file, source, input_file): input_file
                         for input_file in files}
        for future in as_completed(future_to_url):
            data = future.result()
2) Now, in each file, I want to go line by line, process each line to create a particular JSON, group 2K such JSONs together, and hit an API with that request to get the response. Here is the code:
def process_individual_file(source, input_file):
    limit = 2000
    with open(source + input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                response = requests.post(API_URL, json=json_array)
                # check response status here
                limit = 2000
3) Now the problem: since the number of lines in each file is really large and the API call is blocking and a bit slow to respond, the program takes a huge amount of time to complete.
4) What I want to achieve is to make that API call asynchronous, so that I can keep processing the next batch of 2000 while the API call is happening.
5) Things I have tried so far: I was trying to implement this using asyncio, but there we need to collect the set of future tasks and wait for completion using the event loop. Something like this:
async def process_individual_file(source, input_file):
    tasks = []
    limit = 2000
    with open(source + input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                tasks.append(asyncio.ensure_future(call_api(json_array)))
                limit = 2000
    await asyncio.wait(tasks)

ioloop = asyncio.get_event_loop()
ioloop.run_until_complete(process_individual_file(source, input_file))
ioloop.close()
6) I really don't understand this, because it is indirectly the same as before: it waits to collect all the tasks before launching them. Can someone help me with what the correct architecture for this problem should be? How can I call the API asynchronously, without collecting all the tasks first, and with the ability to process the next batch in parallel?
I am really not understanding this because this is indirectly the same as previous as it waits to collect all tasks before launching them.
No, you are wrong here. When you create an asyncio.Task with asyncio.ensure_future, it starts executing the call_api coroutine immediately. This is how tasks in asyncio work:
import asyncio

async def test(i):
    print(f'{i} started')
    await asyncio.sleep(i)

async def main():
    tasks = [
        asyncio.ensure_future(test(i))
        for i
        in range(3)
    ]
    await asyncio.sleep(0)
    print('At this moment tasks are already started')
    await asyncio.wait(tasks)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
Output:
0 started
1 started
2 started
At this moment tasks are already started
The problem with your approach is that process_individual_file is not actually asynchronous: it does a large amount of CPU-bound work without returning control to your asyncio event loop. That is a problem: the function blocks the event loop, making it impossible for the tasks to be executed.
A very simple but effective solution is to return control to the event loop manually with asyncio.sleep(0) at regular points inside process_individual_file, for example on reading each line:
async def process_individual_file(source, input_file):
    tasks = []
    limit = 2000
    with open(source + input_file) as sf:
        for line in sf:
            await asyncio.sleep(0)  # Return control to event loop to allow it execute tasks
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                tasks.append(asyncio.ensure_future(call_api(json_array)))
                limit = 2000
    await asyncio.wait(tasks)
Update:
there will be more than millions of requests to be done and hence I am feeling uncomfortable to store future objects for all of them in a list
That makes sense. Nothing good will happen if you run a million parallel network requests. The usual way to set a limit in this case is to use a synchronization primitive like asyncio.Semaphore.
I advise you to make a generator that yields json_array batches from the file, and to acquire the Semaphore before adding a new task and release it when the task is done. You will get clean code that is protected against too many tasks running in parallel.
It will look something like this:
def get_json_array(input_file):
    json_array = []
    limit = 2000
    with open(input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                yield json_array  # the generator separates file-reading logic from adding tasks
                json_array = []
                limit = 2000

sem = asyncio.Semaphore(50)  # don't allow more than 50 parallel requests

async def process_individual_file(input_file):
    for json_array in get_json_array(input_file):
        await sem.acquire()  # file reading won't resume until there's room for a new task
        task = asyncio.ensure_future(call_api(json_array))
        task.add_done_callback(lambda t: sem.release())  # when a task is done, free a slot for the next one
        task.add_done_callback(lambda t: print(t.result()))  # print the result when a call_api task finishes
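One detail worth noting when adapting this: the coroutine above returns as soon as the file has been read, even if some requests are still in flight. A sketch of the same idea that also waits for the remaining tasks before returning (same sem, get_json_array and call_api as above):
async def process_individual_file(input_file):
    tasks = []
    for json_array in get_json_array(input_file):
        await sem.acquire()  # don't read further until there is room for a new task
        task = asyncio.ensure_future(call_api(json_array))
        task.add_done_callback(lambda t: sem.release())
        tasks.append(task)
    if tasks:
        await asyncio.wait(tasks)  # wait for the tail of in-flight requests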

Run spider while web application running based on Python asyncio module

My practice now:
I let my backend catch the GET request sent by the front-end page and run my scrapy spider every time the page is refreshed or loaded. The crawled data is then shown on my front page. Here's the code; I call a subprocess to run the spider:
from subprocess import run

@get('/api/get_presentcode')
def api_get_presentcode():
    if os.path.exists("/static/presentcodes.json"):
        run("rm presentcodes.json", shell=True)
    run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
    with open("/static/presentcodes.json") as data_file:
        data = json.load(data_file)
    logging.info(data)
    return data
It works well.
What I want:
However, the spider crawls a website which barely changes, so there is no need to crawl it that often.
So I want to run my scrapy spider every 30 minutes, using coroutines, just in the backend.
What I tried and succeeded:
from subprocess import run

# init of my web application
async def init(loop):
    ....

async def run_spider():
    while True:
        print("Run spider...")
        await asyncio.sleep(10)  # to check results more obviously

loop = asyncio.get_event_loop()
tasks = [run_spider(), init(loop)]
loop.run_until_complete(asyncio.wait(tasks))
loop.run_forever()
It works well too.
But when I change the code of run_spider() to this (which is basically the same as the first version):
async def run_spider():
    while True:
        if os.path.exists("/static/presentcodes.json"):
            run("rm presentcodes.json", shell=True)
        run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
        await asyncio.sleep(20)
the spider runs only the first time and the crawled data is stored to presentcodes.json successfully, but the spider is never called again 20 seconds later.
Questions
What's wrong with my program? Is it because I called a subprocess in a coroutine, and that is invalid?
Any better ideas for running a spider while the main application is running?
Edit:
Let me put the code of my web app init function here first:
async def init(loop):
    logging.info("App started at {0}".format(datetime.now()))
    await orm.create_pool(loop=loop, user='root', password='', db='myBlog')
    app = web.Application(loop=loop, middlewares=[
        logger_factory, auth_factory, response_factory
    ])
    init_jinja2(app, filters=dict(datetime=datetime_filter))
    add_routes(app, 'handlers')
    add_static(app)
    srv = await loop.create_server(app.make_handler(), '127.0.0.1', 9000)  # It seems something happened here.
    logging.info('server started at http://127.0.0.1:9000')  # this log didn't show up.
    return srv
My thought is that the main app got the coroutine event loop 'stuck', so the spider is never called back afterwards.
Let me check the source code of create_server and run_until_complete...
Probably not a complete answer, and I would not do it the way you do. But calling subprocess.run from within an asyncio coroutine is definitely not correct. Coroutines offer cooperative multitasking, so when you call subprocess.run from within a coroutine, that coroutine effectively stops your whole app until the called process is finished.
One thing you need to understand when working with asyncio is that control flow can be switched from one coroutine to another only when you call await (or yield from, or async for, async with and other shortcuts). If you do some long action without calling any of those, then you block all other coroutines until that action is finished.
What you need to use is asyncio.subprocess which will properly return control flow to other parts of your application (namely webserver) while the subprocess is running.
Here is how the actual run_spider() coroutine could look:
import asyncio

async def run_spider():
    while True:
        sp = await asyncio.subprocess.create_subprocess_shell(
            "scrapy crawl presentcodespider -o ../static/presentcodes.new.json",
            cwd="./presentcodeSpider")
        code = await sp.wait()
        if code != 0:
            print("Warning: something went wrong, code %d" % code)
            continue  # retry immediately
        if os.path.exists("/static/presentcodes.new.json"):
            # output was created, overwrite older version (if any)
            os.rename("/static/presentcodes.new.json", "/static/presentcodes.json")
        else:
            print("Warning: output file was not found")
        await asyncio.sleep(20)
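For completeness, one way this coroutine could be wired into the web app's startup code (a sketch only; init(loop) is the question's own init coroutine, and os must be imported for the os.path/os.rename calls above):
import asyncio
import os

loop = asyncio.get_event_loop()
srv = loop.run_until_complete(init(loop))  # start the web application first
asyncio.ensure_future(run_spider())        # schedule the spider as a background task
loop.run_forever()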
