Run spider while web application running based on Python asyncio module


My practice now:
My backend handles the GET request sent by the front-end page and runs my Scrapy spider every time the page is loaded or refreshed. The crawled data is then shown on the front page. Here's the code; I call a subprocess to run the spider:
import json
import logging
import os
from subprocess import run

@get('/api/get_presentcode')
def api_get_presentcode():
    if os.path.exists("/static/presentcodes.json"):
        run("rm presentcodes.json", shell=True)
    run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
    with open("/static/presentcodes.json") as data_file:
        data = json.load(data_file)
    logging.info(data)
    return data
It works well.
What I want:
However, the spider crawls a website that barely changes, so there is no need to crawl it that often.
So I want to run my Scrapy spider every 30 minutes from a coroutine, entirely in the backend.
What I tried and succeeded:
import asyncio
from subprocess import run

# init of my web application
async def init(loop):
    ....

async def run_spider():
    while True:
        print("Run spider...")
        await asyncio.sleep(10)  # to check results more obviously

loop = asyncio.get_event_loop()
tasks = [run_spider(), init(loop)]
loop.run_until_complete(asyncio.wait(tasks))
loop.run_forever()
It works well too.
But when I change run_spider() to this (which is basically the same as the first version):
async def run_spider():
    while True:
        if os.path.exists("/static/presentcodes.json"):
            run("rm presentcodes.json", shell=True)
        run("scrapy crawl presentcodespider -o ../static/presentcodes.json", shell=True, cwd="./presentcodeSpider")
        await asyncio.sleep(20)
the spider ran only the first time. The crawled data was stored in presentcodes.json successfully, but the spider was never called again after the 20 seconds had passed.
Questions
What's wrong with my program? Is it invalid to call a subprocess inside a coroutine?
Is there a better way to run a spider while the main application is running?
Edit:
Let me put the code of my web app init function here first:
async def init(loop):
    logging.info("App started at {0}".format(datetime.now()))
    await orm.create_pool(loop=loop, user='root', password='', db='myBlog')
    app = web.Application(loop=loop, middlewares=[
        logger_factory, auth_factory, response_factory
    ])
    init_jinja2(app, filters=dict(datetime=datetime_filter))
    add_routes(app, 'handlers')
    add_static(app)
    srv = await loop.create_server(app.make_handler(), '127.0.0.1', 9000)  # It seems something happened here.
    logging.info('server started at http://127.0.0.1:9000')  # this log didn't show up.
    return srv
My guess is that the main app leaves the event loop 'stuck', so the spider coroutine is never called back afterwards.
Let me check the source code of create_server and run_until_complete.

Probably not a complete answer, and I would not do it the way you do. But calling subprocess.run from within an asyncio coroutine is definitely not correct. Coroutines offer cooperative multitasking, so when you call a blocking subprocess function from within a coroutine, that coroutine effectively stops your whole app until the called process has finished.
One thing you need to understand when working with asyncio is that control flow can be switched from one coroutine to another only when you call await (or yield from, or async for, async with and other shortcuts). If you do some long action without calling any of those then you block any other coroutines until this action is finished.
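To make this concrete, here is a minimal sketch (not taken from the question; the names ticker and count_ticks are made up for illustration). It counts how often a background coroutine gets to run while another coroutine pauses for 0.2 seconds, once with the blocking time.sleep() and once with await asyncio.sleep():

```python
import asyncio
import time


async def ticker(ticks):
    # Records a timestamp every 10 ms, but only when the event loop lets it run.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def count_ticks(use_blocking_sleep):
    ticks = []
    task = asyncio.create_task(ticker(ticks))
    await asyncio.sleep(0)  # yield once so the ticker can start
    if use_blocking_sleep:
        time.sleep(0.2)  # blocks the whole event loop, like subprocess.run
    else:
        await asyncio.sleep(0.2)  # hands control back to the event loop
    task.cancel()
    return len(ticks)


blocking_ticks = asyncio.run(count_ticks(True))
cooperative_ticks = asyncio.run(count_ticks(False))
print(blocking_ticks, cooperative_ticks)
```

With the blocking call the ticker only runs once, before the pause begins; with the awaited sleep it keeps running during the pause. A web server sharing the loop is starved in the same way as the ticker.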
What you need to use is asyncio.subprocess which will properly return control flow to other parts of your application (namely webserver) while the subprocess is running.
Here is how actual run_spider() coroutine could look:
import asyncio
import os

async def run_spider():
    while True:
        sp = await asyncio.subprocess.create_subprocess_shell(
            "scrapy crawl presentcodespider -o ../static/presentcodes.new.json",
            cwd="./presentcodeSpider")
        code = await sp.wait()
        if code != 0:
            print("Warning: something went wrong, code %d" % code)
            continue  # retry immediately
        if os.path.exists("/static/presentcodes.new.json"):
            # output was created, overwrite older version (if any)
            os.rename("/static/presentcodes.new.json", "/static/presentcodes.json")
        else:
            print("Warning: output file was not found")
        await asyncio.sleep(20)

Related

How to stop execution of FastAPI endpoint after a specified time to reduce CPU resource usage/cost?

Use case
The client micro service, which calls /do_something, has a timeout of 60 seconds in the request/post() call. This timeout is fixed and can't be changed. So if /do_something takes 10 mins, /do_something is wasting CPU resources since the client micro service is NOT waiting after 60 seconds for the response from /do_something, which wastes CPU for 10 mins and this increases the cost. We have limited budget.
The current code looks like this:
import time
from uvicorn import Server, Config
from random import randrange
from fastapi import FastAPI

app = FastAPI()

def some_func(text):
    """
    Some computationally heavy function
    whose execution time depends on input text size
    """
    randinteger = randrange(1,120)
    time.sleep(randinteger)  # simulate processing of text
    return text

@app.get("/do_something")
async def do_something():
    response = some_func(text="hello world")
    return {"response": response}

# Running
if __name__ == '__main__':
    server = Server(Config(app=app, host='0.0.0.0', port=3001))
    server.run()
Desired Solution
Here /do_something should stop processing the current request after 60 seconds and wait for the next request.
If execution of the endpoint is forcibly stopped after 60 seconds, we should be able to log it with a custom message.
This should not kill the service, and it should work with multithreading/multiprocessing.
I tried the code below, but when the timeout happens the server gets killed.
Is there a way to fix this?
import logging
import time
import timeout_decorator
from uvicorn import Server, Config
from random import randrange
from fastapi import FastAPI

app = FastAPI()

@timeout_decorator.timeout(seconds=2, timeout_exception=StopIteration, use_signals=False)
def some_func(text):
    """
    Some computationally heavy function
    whose execution time depends on input text size
    """
    randinteger = randrange(1,30)
    time.sleep(randinteger)  # simulate processing of text
    return text

@app.get("/do_something")
async def do_something():
    try:
        response = some_func(text="hello world")
    except StopIteration:
        logging.warning(f'Stopped /do_something > endpoint due to timeout!')
    else:
        logging.info(f'( Completed < /do_something > endpoint')
    return {"response": response}

# Running
if __name__ == '__main__':
    server = Server(Config(app=app, host='0.0.0.0', port=3001))
    server.run()

This answer is not about improving CPU time (as you mentioned in the comments section), but rather explains what happens when you define an endpoint with normal def or async def, and provides solutions for running blocking operations inside an endpoint.
You are asking how to stop the processing of a request after a while, in order to process further requests. It does not make much sense to start processing a request and then, 60 seconds later, stop it as if it never happened (wasting server resources all that time and keeping other requests waiting). You should instead leave the handling of requests to the FastAPI framework itself. When you define an endpoint with async def, it runs on the main thread (in the event loop), i.e., the server processes requests sequentially as long as there is no await call inside the endpoint (just like in your case). The keyword await passes function control back to the event loop. In other words, it suspends the execution of the surrounding coroutine and tells the event loop to let something else run until the awaited task completes (and has returned the result data). The await keyword only works within an async function.
Since you perform a heavy CPU-bound operation inside your async def endpoint (by calling your some_func() function), and you never give up control for other requests to run in the event loop (e.g., by awaiting some coroutine), the server will be blocked and will wait for that request to be fully processed before moving on to the next one(s); have a look at this answer for more details.
Solutions
One solution would be to define your endpoint with normal def instead of async def. In brief, when you declare an endpoint with normal def instead of async def in FastAPI, it is run in an external threadpool that is then awaited, instead of being called directly (as it would block the server); hence, FastAPI would still work asynchronously.
Another solution, as described in this answer, is to keep the async def definition and run the CPU-bound operation in a separate thread and await it, using Starlette's run_in_threadpool(), thus ensuring that the main thread (event loop), where coroutines are run, does not get blocked. As described by @tiangolo here, "run_in_threadpool is an awaitable function, the first parameter is a normal function, the next parameters are passed to that function directly. It supports sequence arguments and keyword arguments". Example:
from fastapi.concurrency import run_in_threadpool
res = await run_in_threadpool(cpu_bound_task, text='Hello world')
Since this is a CPU-bound operation, it would be preferable to run it in a separate process, using ProcessPoolExecutor, as described in the link provided above. This can be integrated with asyncio, in order to await the process finishing its work and returning the result(s). Note that, as described in the link above, it is important to protect the main loop of code to avoid recursive spawning of subprocesses, etc.; essentially, your code must be under if __name__ == '__main__'. Example:
import concurrent.futures
from functools import partial
import asyncio

loop = asyncio.get_running_loop()
with concurrent.futures.ProcessPoolExecutor() as pool:
    res = await loop.run_in_executor(pool, partial(cpu_bound_task, text='Hello world'))
About Request Timeout
With regards to the recent update on your question about the client having a fixed 60s request timeout; if you are not behind a proxy such as Nginx that would allow you to set the request timeout, and/or you are not using gunicorn, which would also allow you to adjust the request timeout, you could use a middleware, as suggested here, to set a timeout for all incoming requests. The suggested middleware (example is given below) uses asyncio's .wait_for() function, which waits for an awaitable function/coroutine to complete with a timeout. If a timeout occurs, it cancels the task and raises asyncio.TimeoutError.
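As a framework-free illustration of that behaviour, here is a minimal sketch using only asyncio; slow_handler and handle_with_timeout are hypothetical names standing in for the endpoint and the middleware, not FastAPI APIs. It shows wait_for() cancelling the awaited coroutine and raising asyncio.TimeoutError:

```python
import asyncio

cancel_log = []  # records whether the handler saw the cancellation


async def slow_handler():
    try:
        await asyncio.sleep(5)  # stands in for awaitable work in an endpoint
        return "finished"
    except asyncio.CancelledError:
        # wait_for() cancels the task when the timeout expires
        cancel_log.append("cancelled")
        raise


async def handle_with_timeout(timeout):
    try:
        return await asyncio.wait_for(slow_handler(), timeout=timeout)
    except asyncio.TimeoutError:
        return "timed out"


result = asyncio.run(handle_with_timeout(0.1))
print(result, cancel_log)
```

Note that the cancellation is only delivered at an await point. A handler that is stuck in time.sleep() or a CPU-bound loop never reaches one, so the timeout cannot interrupt it.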
Regarding your comment below:
My requirement is not unblocking next request...
Again, please read carefully the first part of this answer to understand that if you define your endpoint with async def and do not await some coroutine inside, but instead perform some CPU-bound task (as you already do), it will block the server until it is completed (and even the approach below won't work as expected). That's like saying that you would like FastAPI to process one request at a time; in that case, there is no reason to use an ASGI framework such as FastAPI, which takes advantage of the async/await syntax (i.e., processing requests asynchronously), in order to provide fast performance. Hence, you either need to drop the async definition from your endpoint (as mentioned earlier above), or, preferably, run your synchronous CPU-bound task using ProcessPoolExecutor, as described earlier.
Also, your comment in some_func():
Some computationally heavy function whose execution time depends on
input text size
indicates that instead of (or along with) setting a request timeout, you could check the length of the input text (using a dependency function, for instance) and raise an HTTPException if the text's length exceeds some pre-defined value that is known beforehand to require more than 60s of processing. That way, your system won't waste resources on a task that you already know will not be completed in time.
Working Example
import time
import uvicorn
import asyncio
import concurrent.futures
from functools import partial
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from starlette.status import HTTP_504_GATEWAY_TIMEOUT
from fastapi.concurrency import run_in_threadpool

REQUEST_TIMEOUT = 2  # adjust timeout as desired

app = FastAPI()

@app.middleware('http')
async def timeout_middleware(request: Request, call_next):
    try:
        return await asyncio.wait_for(call_next(request), timeout=REQUEST_TIMEOUT)
    except asyncio.TimeoutError:
        return JSONResponse({'detail': f'Request exceeded the time limit for processing'},
                            status_code=HTTP_504_GATEWAY_TIMEOUT)

def cpu_bound_task(text):
    time.sleep(5)
    return text

@app.get('/')
async def main():
    loop = asyncio.get_running_loop()
    with concurrent.futures.ProcessPoolExecutor() as pool:
        res = await loop.run_in_executor(pool, partial(cpu_bound_task, text='Hello world'))
    return {'response': res}

if __name__ == '__main__':
    uvicorn.run(app)

How stream a response from a Twisted server?

Issue
My problem is that I can't write a server that streams the response my application sends back.
The response is not delivered chunk by chunk, but as a single block once the iterator has finished iterating.
Approach
When I write the response with the write method of Request, it understands well that it is a chunk that we send.
I checked if there was a buffer size used by Twisted, but the message size check seems to be done in the doWrite.
After spending some time debugging, it seems that the reactor only reads and writes at the end.
If I understood correctly how a reactor works with Twisted, it writes and reads when the file descriptor is available.
What is a file descriptor in Twisted ?
Why is it not available after writing the response ?
Example
I have written a minimal script of what I would like my server to look like.
It's an "ASGI-like" server that runs an application, which iterates over a function that returns a very large string:
# async_stream_server.py
import asyncio
from twisted.internet import asyncioreactor

twisted_loop = asyncio.new_event_loop()
asyncioreactor.install(twisted_loop)

import time
from sys import stdout
from twisted.web import http
from twisted.python.log import startLogging
from twisted.internet import reactor, endpoints

CHUNK_SIZE = 2**16

def async_partial(async_fn, *partial_args):
    async def wrapped(*args):
        return await async_fn(*partial_args, *args)
    return wrapped

def iterable_content():
    for _ in range(5):
        time.sleep(1)
        yield b"a" * CHUNK_SIZE

async def application(send):
    for part in iterable_content():
        await send(
            {
                "body": part,
                "more_body": True,
            }
        )
    await send({"more_body": False})

class Dummy(http.Request):
    def process(self):
        asyncio.ensure_future(
            application(send=async_partial(self.handle_reply)),
            loop=asyncio.get_event_loop()
        )

    async def handle_reply(self, message):
        http.Request.write(self, message.get("body", b""))
        if not message.get("more_body", False):
            http.Request.finish(self)
        print('HTTP response chunk')

class DummyFactory(http.HTTPFactory):
    def buildProtocol(self, addr):
        protocol = http.HTTPFactory.buildProtocol(self, addr)
        protocol.requestFactory = Dummy
        return protocol

startLogging(stdout)
endpoints.serverFromString(reactor, "tcp:1234").listen(DummyFactory())
asyncio.set_event_loop(reactor._asyncioEventloop)
reactor.run()
To execute this example:
in a terminal, run:
python async_stream_server.py
in another terminal, run:
curl http://localhost:1234/
You will have to wait a while before you see the whole message.
Details
$ python --version
Python 3.10.4
$ pip list
Package           Version Editable project location
----------------- ------- --------------------------------------------------
asgiref           3.5.0
Twisted           22.4.0

You just need to sprinkle some more async over it.
As written, the iterable_content generator blocks the reactor until it finishes generating content. This is why you see no results until it is done. The reactor does not get control of execution back until it finishes.
That's only because you used time.sleep to insert a delay into it. time.sleep blocks. This -- and everything else in the "asynchronous" application -- is really synchronous and keeps control of execution until it is done.
If you replace iterable_content with something that's really asynchronous, like an asynchronous generator:
async def iterable_content():
    for _ in range(5):
        await asyncio.sleep(1)
        yield b"a" * CHUNK_SIZE
and then iterate over it asynchronously with async for:
async def application(send):
    async for part in iterable_content():
        await send(
            {
                "body": part,
                "more_body": True,
            }
        )
    await send({"more_body": False})
then the reactor has a chance to run in between iterations and the server begins to produce output chunk by chunk.
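If the content source really is blocking and cannot be rewritten as an asynchronous generator (an assumption; the question does not say so), one possible workaround is to advance the synchronous generator in a worker thread. The sketch below uses plain asyncio and has not been tried under the Twisted reactor; iterate_in_thread and blocking_content are hypothetical helpers:

```python
import asyncio
import time

CHUNK_SIZE = 2**16
_DONE = object()  # sentinel: StopIteration cannot be passed through a Future


def blocking_content():
    # Stand-in for a generator that blocks between chunks.
    for _ in range(3):
        time.sleep(0.05)
        yield b"a" * CHUNK_SIZE


async def iterate_in_thread(sync_iterable):
    # Wraps a blocking iterable as an async generator; each next() call runs
    # in the default thread pool so the event loop stays free in between.
    loop = asyncio.get_running_loop()
    iterator = iter(sync_iterable)
    while True:
        part = await loop.run_in_executor(None, next, iterator, _DONE)
        if part is _DONE:
            break
        yield part


async def collect():
    sizes = []
    async for part in iterate_in_thread(blocking_content()):
        sizes.append(len(part))
    return sizes


sizes = asyncio.run(collect())
print(sizes)
```

In the application above, `async for part in iterate_in_thread(iterable_content())` would take the place of the plain for loop.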

Concurrent execution of two python methods

I'm creating a script that posts a message to both Discord and Twitter, depending on some input. I have two methods (in separate .py files), post_to_twitter and post_to_discord. What I want to achieve is that both of these try to execute even if the other fails (e.g. if there is some exception with login).
Here is the relevant code snippet for posting to discord:
def post_to_discord(message, channel_name):
    client = discord.Client()

    @client.event
    async def on_ready():
        channel = # getting the right channel
        await channel.send(message)
        await client.close()

    client.run(discord_config.token)
and here is the snippet for posting to twitter part (stripped from the try-except blocks):
def post_to_twitter(message):
    auth = tweepy.OAuthHandler(twitter_config.api_key, twitter_config.api_key_secret)
    auth.set_access_token(twitter_config.access_token, twitter_config.access_token_secret)
    api = tweepy.API(auth)
    api.update_status(message)
Now, both of these work perfectly fine on their own and when being called synchronously from the same method:
def main(message):
    post_discord.post_to_discord(message)
    post_tweet.post_to_twitter(message)
However, I just cannot get them to work concurrently (i.e. to try to post to twitter even if discord fails or vice-versa). I've already tried a couple of different approaches with multi-threading and with asyncio.
Among others, I've tried the solution from this question. But got an error No module named 'IPython'. When I omitted the IPython line, changed the methods to async, I got this error: RuntimeError: Cannot enter into task <ClientEventTask state=pending event=on_ready coro=<function post_to_discord.<locals>.on_ready at 0x7f0ee33e9550>> while another task <Task pending name='Task-1' coro=<main() running at post_main.py:31>> is being executed..
To be honest, I'm not even sure if asyncio would be the right approach for my use case, so any insight is much appreciated.
Thank you.

In this case running the two things in completely separate threads (and completely separate event loops) is probably the easiest option at your level of expertise. For example, try this:
import post_discord, post_tweet
import concurrent.futures

def main(message):
    with concurrent.futures.ThreadPoolExecutor() as pool:
        fut1 = pool.submit(post_discord.post_to_discord, message)
        fut2 = pool.submit(post_tweet.post_to_twitter, message)
    # here closing the threadpool will wait for both futures to complete
    # make exceptions visible
    for fut in (fut1, fut2):
        try:
            fut.result()
        except Exception as e:
            print("error: ", e)

HTTP server kick-off background python script without blocking

I'd like to be able to trigger a long-running python script via a web request, in bare-bones fashion. Also, I'd like to be able to trigger other copies of the script with different parameters while initial copies are still running.
I've looked at flask, aiohttp, and queueing possibilities. Flask and aiohttp seem to have the least overhead to set up. I plan on executing the existing python script via subprocess.run (however, I did consider refactoring the script into libraries that could be used in the web response function).
With aiohttp, I'm trying something like:
ingestion_service.py:
import time
from aiohttp import web
from pprint import pprint

routes = web.RouteTableDef()

@routes.get("/ingest_pipeline")
async def test_ingest_pipeline(request):
    '''
    Get the job_conf specified from the request and activate the script
    '''
    #subprocess.run the command with lookup of job conf file
    response = web.Response(text=f"Received data ingestion request")
    await response.prepare(request)
    await response.write_eof()
    #eventually this would be subprocess.run call
    time.sleep(80)
    return response

def init_func(argv):
    app = web.Application()
    app.add_routes(routes)
    return app
But though the initial request returns immediately, subsequent requests block until the initial request is complete. I'm running a server via:
python -m aiohttp.web -H localhost -P 8080 ingestion_service:init_func
I know that multithreading and concurrency may provide better solutions than asyncio. In this case, I'm not looking for a robust solution, just something that will allow me to run multiple scripts at once via http request, ideally with minimal memory costs.

OK, there were a couple of issues with what I was doing. Namely, time.sleep() is blocking, so asyncio.sleep() should be used. However, since I'm interested in spawning a subprocess, I can use asyncio.subprocess to do that in a non-blocking fashion.
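As a rough, standalone illustration of the non-blocking behaviour (outside aiohttp; run_script is a made-up helper, and the child process is just the current Python interpreter sleeping), two subprocesses can be started and awaited together:

```python
import asyncio
import sys


async def run_script(seconds):
    # Launches a child Python process that sleeps, standing in for a
    # long-running script, and waits for it without blocking the event loop.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", f"import time; time.sleep({seconds})")
    return await proc.wait()


async def main():
    # Both children run at the same time; gather() collects their exit codes.
    return await asyncio.gather(run_script(0.3), run_script(0.3))


codes = asyncio.run(main())
print(codes)
```

Because each wait is awaited, the event loop remains free to serve other requests while the child processes run.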
nb:
asyncio: run one function threaded with multiple requests from websocket clients
https://docs.python.org/3/library/asyncio-subprocess.html.
Using these helps, but there's still an issue with the web handler terminating the subprocess. Luckily, there's a solution here:
https://docs.aiohttp.org/en/stable/web_advanced.html
aiojobs has a decorator "atomic" that will protect the process until it is complete. So, code along these lines will function:
from aiojobs.aiohttp import setup, atomic
import asyncio
import os
from aiohttp import web

@atomic
async def ingest_pipeline(request):
    #be careful what you pass through to shell, lest you
    #give away the keys to the kingdom
    shell_command = "[your command here]"
    response_text = f"running {shell_command}"
    response_code = 200
    response = web.Response(text=response_text, status=response_code)
    await response.prepare(request)
    await response.write_eof()
    ingestion_process = await asyncio.create_subprocess_shell(shell_command,
                                                              stdout=asyncio.subprocess.PIPE,
                                                              stderr=asyncio.subprocess.PIPE)
    stdout, stderr = await ingestion_process.communicate()
    return response

def init_func(argv):
    app = web.Application()
    setup(app)
    app.router.add_get('/ingest_pipeline', ingest_pipeline)
    return app
This is very bare bones, but might help others looking for a quick skeleton for a temporary internal solution.

How to use tornado's asynchttpclient alone?

I'm new to tornado.
What I want is to write some functions that fetch webpages asynchronously. Since no request handlers, apps, or servers are involved here, I think I can use tornado.httpclient.AsyncHTTPClient alone.
But all the sample code seems to live inside a Tornado server or request handler. When I try to use it alone, it never works.
For example:
def handle(self,response):
    print response
    print response.body

@tornado.web.asynchronous
def fetch(self,url):
    client=tornado.httpclient.AsyncHTTPClient()
    client.fetch(url,self.handle)

fetch('http://www.baidu.com')
It says "'str' object has no attribute 'application'", but I'm trying to use it alone?
or :
@tornado.gen.coroutine
def fetch_with_coroutine(url):
    client=tornado.httpclient.AsyncHTTPClient()
    response=yield http_client.fetch(url)
    print response
    print response.body
    raise gen.Return(response.body)

fetch_with_coroutine('http://www.baidu.com')
doesn't work either.
Earlier, I tried passing a callback to AsyncHTTPClient.fetch and then starting the IOLoop. That works, and the webpage source code is printed. But I can't figure out what to do with the IOLoop.

@tornado.web.asynchronous can only be applied to certain methods in RequestHandler subclasses; it is not appropriate for this usage.
Your second example is the correct structure, but you need to actually run the IOLoop. The best way to do this in a batch-style program is IOLoop.current().run_sync(fetch_with_coroutine). This starts the IOLoop, runs your callback, then stops the IOLoop. You should run a single function within run_sync(), and then use yield within that function to call any other coroutines.
For a more complete example, see https://github.com/tornadoweb/tornado/blob/master/demos/webspider/webspider.py

Here's an example I've used in the past...
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop

AsyncHTTPClient.configure(None, defaults=dict(user_agent="MyUserAgent"))
http_client = AsyncHTTPClient()

def handle_response(response):
    if response.error:
        print("Error: %s" % response.error)
    else:
        print(response.body)

async def get_content():
    # Tornado 6 removed the callback argument of fetch(), so await the
    # response and pass it to the handler explicitly.
    response = await http_client.fetch("https://www.integralist.co.uk/", raise_error=False)
    handle_response(response)

async def main():
    await get_content()
    print("get_content() has finished.")

if __name__ == "__main__":
    io_loop = IOLoop.current()
    io_loop.run_sync(main)
I've also detailed how to use Pipenv with tox.ini and Flake8 with this tornado example so others should be able to get up and running much more quickly https://gist.github.com/fd603239cacbb3d3d317950905b76096
