How to design a FastAPI app with independent background computation? - Python

I've created a Python main application, main.py, which I invoke with uvicorn main.main --reload. That of course runs the following code...
if __name__ == '__main__':
    main()
That part of the application runs constantly, reads data and processes it until the application is aborted manually. I use asyncio to run coroutines.
Task
I would like to build a small html dashboard on it, which can display the data that is constantly computed.
Question
How can I run these background calculations of main.py and still implement a dashboard/website with fastapi and jinja2?
What is the best practice/architecture to structure the files, i.e. the background code and the FastAPI app code? E.g. is there an initial startup function in FastAPI where I could invoke the background computation in a coroutine, or should it be the other way around?
How would you invoke the application according to your recommendation?
What I have achieved so far
I can run the main application without any FastAPI code, and I can run the dashboard without the background tasks. Both work fine independently. But FastAPI does not run when I add its code to the main application with the background computation. (How could it?!? I can only invoke either the main application or the FastAPI app.)
Any architectural concepts are appreciated.
Thank you.

FastAPI doesn't run because the Python interpreter can't reach it until your computations complete. You should start your web app independently of the main process; I strongly recommend using docker-compose.
As the FastAPI docs recommend, you should use Dramatiq or Celery for heavy background tasks, or you can just run a separate service among your compose services, for example:
# background.py
if __name__ == '__main__':
    main()

# main.py
app = FastAPI()
docker-compose.yml:
services:
  web-app-interface:
    command: uvicorn main:app ...
  my-daemon:
    command: python background.py
You can make them communicate with a message broker, such as RabbitMQ etc.
And never use multiprocessing with uvicorn; it can cause process leaks, because uvicorn manages its own workers.
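As one possible illustration of the broker-based communication mentioned above, here is a minimal sketch that uses Redis as a simple stand-in for a full message broker (the key name, values, and endpoint are made up for the example): the background service writes its latest result, and the FastAPI service reads it on request.
# background.py -- minimal sketch, assuming a local Redis instance and the redis-py client
import time
import redis

r = redis.Redis()

def main():
    value = 0
    while True:
        value += 1                      # stand-in for the real computation
        r.set("latest_result", value)   # publish the latest result
        time.sleep(1)

if __name__ == '__main__':
    main()

# main.py -- the FastAPI side reads whatever the background service last wrote
import redis
from fastapi import FastAPI

app = FastAPI()
r = redis.Redis()

@app.get("/latest")
def latest():
    raw = r.get("latest_result")
    return {"latest_result": raw.decode() if raw else None}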

A good approach is to use the on_event decorator with "startup". The only thing to keep in mind is to use asyncio.create_task to invoke the background task. As long as you don't await it, it will not block, and thus fastapi/uvicorn can continue to serve HTTP requests.
my_service = MyService()

@app.on_event('startup')
async def service_tasks_startup():
    """Start all the non-blocking service tasks, which run in the background."""
    asyncio.create_task(my_service.start_processing_data())
Also, with this said, any request can consume the data of this background service.
@app.get("/")
def root():
    return my_service.value
Think of MyService as any class of your liking: Kafka consumption, computations, etc. Of course, value is just an example attribute of the MyService class.
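To make the pieces above concrete, here is a minimal, self-contained sketch of how they could fit together in a single main.py (MyService and its value attribute are illustrative, not taken from the original post), started with uvicorn main:app --reload:
import asyncio

from fastapi import FastAPI

app = FastAPI()

class MyService:
    """Illustrative stand-in for the long-running computation."""

    def __init__(self):
        self.value = 0

    async def start_processing_data(self):
        # Runs forever; replace the body with the real data processing.
        while True:
            self.value += 1
            await asyncio.sleep(1)

my_service = MyService()

@app.on_event('startup')
async def service_tasks_startup():
    # Fire-and-forget: not awaiting the task keeps the event loop free for HTTP requests.
    asyncio.create_task(my_service.start_processing_data())

@app.get("/")
def root():
    return {"value": my_service.value}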

Related

Periodic and non-periodic tasks with Django + Telegram + Celery

I am building a project based on Django and one of my intentions is to have a telegram bot which is receiving information from a Telegram group. I was able to implement the bot to send messages in Telegram, no issues.
At this moment I have a couple of Celery tasks which are running with Beat, and also the Django web app, which are decoupled. All good here.
I have seen that python-telegram-bot is running a function in one of the examples (https://github.com/python-telegram-bot/python-telegram-bot/blob/master/examples/echobot.py) which waits idle to receive data from Telegram. Now, all my Celery tasks are at this moment periodic and are called every 10 or 60 minutes by Beat.
How can I run this non-periodic task with Celery in my configuration? I am saying non-periodic because I understood that it will wait for content until it is manually interrupted.
Django~=3.2.6
celery~=5.1.2
CELERY_BEAT_SCHEDULE = {
    'task_1': {
        'task': 'apps.envc.tasks.Fetch1',
        'schedule': 600.0,
    },
    'task_2': {
        'task': 'apps.envc.tasks.Fetch2',
        'schedule': crontab(minute='*/60'),
    },
    'task_3': {
        'task': 'apps.envc.tasks.Analyze',
        'schedule': 600,
    },
}
In my tasks.py I have one of the tasks like this:
@celery_app.task(name='apps.envc.tasks.TelegramBot')
def TelegramBot():
    status = start_bot()
    return status
And as the start_bot implementation, I simply copied the echobot.py example and added my TOKEN there (of course the functions for the different commands from the example are also there).
Set up a webhook instead of polling with Celery
With Django, you shouldn't be using Celery to run Telegram polling (what you call PTB's “non-periodic task”, which is better described as a long-running process or service). Celery is designed for definite tasks, not indefinitely-running processes.
Since Django implies that you're already running a web server, the webhook option is a better fit. (Remember that you can either do polling or set up a webhook in order to receive updates from Telegram's servers.) The option that @CallMeStag suggested, of using a non-threading webhook setup, makes the most sense for Django-PTB integration.
You can do the bot setup (defining and registering your handler functions on a Dispatcher instance) in a separate module; to avoid threading, you should pass update_queue=None, workers=0 to your Dispatcher instantiation. And then, use it in a Django view, like this:
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from telegram import Update

from .telegram_init import telegram_bot, telegram_dispatcher

...

@csrf_exempt
def telegram_webhook(request):
    data = json.loads(request.body)
    update = Update.de_json(data, telegram_bot)
    telegram_dispatcher.process_update(update)
    return JsonResponse({})
where telegram_bot is the Bot instance that I use for instantiating telegram_dispatcher. (I left out error handling in this snippet.)
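For reference, the separate bot-setup module mentioned above could look roughly like this minimal sketch, assuming python-telegram-bot's v13-style API (the token placeholder and the start handler are illustrative):
# telegram_init.py
from telegram import Bot
from telegram.ext import Dispatcher, CommandHandler

telegram_bot = Bot(token='YOUR_BOT_TOKEN')

# update_queue=None and workers=0 keep the dispatcher synchronous (no threads)
telegram_dispatcher = Dispatcher(telegram_bot, update_queue=None, workers=0)

def start(update, context):
    update.message.reply_text('Hello from the Django-served bot.')

telegram_dispatcher.add_handler(CommandHandler('start', start))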
Why avoid threading? Threads in the general sense are not forbidden in Django, but in the context of PTB, threading usually means running bot updaters or dispatchers in a long-running thread that shares an update/message queue, and that's a complication that neither looks nice nor plays well with, for example, a typical Django deployment that uses multiple Gunicorn workers in separate processes. There is, however, a motivation for using multithreading (multiple processes, actually, using Celery) in Django-PTB integration; see below.
Development environment caveat
The above setup is what you'd want to use for a basic production system. But during dev, unless your dev machine is internet-facing with a fixed IP, you probably can't use a webhook, so you'd still want to do polling. One way to do this is by creating a custom Django management command:
<my_app>/management/commands/polltelegram.py:
from django.core.management.base import BaseCommand

from my_django_project.telegram_init import telegram_updater

class Command(BaseCommand):
    help = 'Run Telegram bot polling.'

    def handle(self, *args, **options):
        telegram_updater.start_polling()
        self.stdout.write(
            'Telegram bot polling started. '
            'Press CTRL-BREAK to terminate.'
        )
        telegram_updater.idle()
        self.stdout.write('Polling stopped.')
And then, during dev, run python manage.py polltelegram to fetch and process Telegram updates. (Run this along with python manage.py runserver to be able to use the main Django app simultaneously; the polling runs in a separate process with this setup, not just a separate thread.)
When Celery makes sense
Celery does have a role to play if you're integrating PTB with Django, and this is when reliability is a concern. For instance, when you want to be able to retry sending replies in case of transient network issues. Another potential issue is that the non-threading webhook setup detailed above can, in a high-traffic scenario, run into flood/rate limits. PTB's current solution for this, MessageQueue, uses threading, and while it can work, it can introduce other problems, for example interference with Django's autoreload function when running runserver during dev.
A more elegant and reliable solution is to use Celery to run the message sending function of PTB. This allows for retries and rate limiting for better reliability.
Briefly described, this integration can still use the non-threading webhook setup above, but you have to isolate the Bot.send_message() function into a Celery task, and then make sure that all handlers call this Celery task asynchronously instead of using the bot to run send_message() in the webhook process 'eagerly'.
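As a rough sketch of that isolation (the module names celery_init and telegram_init, the task name, and the retry settings are illustrative assumptions, not the post's actual code):
# tasks.py
from .celery_init import celery_app        # an already-configured Celery app
from .telegram_init import telegram_bot    # the Bot instance from the setup module

@celery_app.task(bind=True, max_retries=3)
def send_telegram_message(self, chat_id, text):
    try:
        telegram_bot.send_message(chat_id=chat_id, text=text)
    except Exception as exc:
        # retry on transient network errors with a short backoff
        raise self.retry(exc=exc, countdown=5)
Handlers would then call send_telegram_message.delay(chat_id, text) instead of replying directly, so the webhook view returns quickly and retries happen in the worker.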
In PTB, Updater.start_polling/webhook() starts a background thread that waits for incoming updates. Updater.idle() blocks the main thread and when receiving a stop signal, it ends the background thread mentioned above.
I'm not familiar with Celery and only know the basics of Django, but I see a few options here that I'd like to point out.
You can run the PTB-related code in a standalone thread, i.e. a thread that calls Updater.start_polling and Updater.idle. To end that thread on shutdown, you'll have to forward the stop signal to that thread.
Vice versa, you can run PTB in the main thread and the Django & Celery related tasks in a standalone thread.
You don't have to use Updater. Since you're using Django anyway, you could switch to a webhook-based solution for receiving updates, where Django serves as webhook for you. You can even eliminate threading for PTB completely by calling Dispatcher.process_update manually. Please see this wiki page for more info on custom webhook solutions
Finally, I'd like to mention that PTB comes with a built-in solution of scheduling tasks, see the wiki page on Job Queue. This may or may not be relevant for you depending on your setup.
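As an illustration of that last point, a minimal sketch of PTB's JobQueue, assuming the v13-style API (the token, chat id, and interval are placeholders):
from telegram.ext import Updater

SOME_CHAT_ID = 123456789  # placeholder chat id

def periodic_job(context):
    # periodic work goes here; context.bot is available for sending messages
    context.bot.send_message(chat_id=SOME_CHAT_ID, text='periodic update')

updater = Updater(token='YOUR_BOT_TOKEN')
updater.job_queue.run_repeating(periodic_job, interval=600, first=10)
updater.start_polling()
updater.idle()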
Disclaimer: I'm currently the maintainer of python-telegram-bot.

How to implement an API which will run a python script with the data from the POST request

I want to run a python script which basically monitors any changes happening to a particular directory (the directory to monitor is passed as part of the POST request). Every time the API is called (I'm using FastAPI), a new instance of the script has to be started to monitor that particular directory and send back a "success" message as response if the script was started successfully. Further I am planning to add another API endpoint that will stop the script that is running to watch a directory.
Can message queues like RQ or Celery be used to achieve this? Please note that I want new scripts to be started every time the API is called so multiple instances of the script should run at the same time. I am using watchdog module to monitor the file system.
I don't know how to do this the correct way, but this is what I have come up with so far, where a new thread is created for each API call:
from fastapi import FastAPI
from schemas import Data  # pydantic schema model for the API
from threading import Thread
import filewatcher  # the script that has to be run

app = FastAPI()

@app.post('/register/event')
def register_watchdog(data: Data):
    th = Thread(target=filewatcher.create_watchdog, args=(data,))
    th.start()
    return {"status": "success"}
What is the best way to achieve this? One further question is, can I implement my script as a Linux service that can run in the background?
In fact, it is rather trivial to call this function directly if you need to return the result of the directory-watching function:
...
def register_watchdog(data: Data):
    return {"result": filewatcher.create_watchdog(data)}
But if you want to run some time-consuming process in the background, you really should use AMQP with workers. RabbitMQ with Celery is the right choice for this; it will make it easy to scale your system.
What is the best way to achieve this? One further question is, can I implement my script as a Linux service that can run in the background?
Yes, you can indeed run RabbitMQ together with Celery as a Linux service in the background, e.g. using supervisor (example), but this is not the best practice. Look in the direction of containerizing the elements of your system. You can wrap Celery in a Docker container and make it easy to run it along with the AMQP service and your web application (example).
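As a rough sketch of that direction (the broker URL, the task name, and the assumption that filewatcher.create_watchdog accepts the request data are illustrative, not taken from the question):
# tasks.py
from celery import Celery
import filewatcher  # the existing watchdog script, assumed importable

celery_app = Celery('tasks', broker='amqp://guest@localhost//')

@celery_app.task
def watch_directory(path):
    # runs inside a worker process, so the API request returns immediately
    return filewatcher.create_watchdog(path)
The FastAPI endpoint would then call watch_directory.delay(data.path) instead of starting a thread, with data.path being whatever field the Data schema actually exposes.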

Tornado server caused Django unable to handle concurrent requests

I wrote a Django website that handles concurrent database requests and subprocess calls perfectly fine, if I just run "python manage.py runserver"
This is my model
class MyModel:
    ...
    def foo(self):
        args = [......]
        pipe = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
In my view:
def call_foo(request):
    my_model = MyModel()
    my_model.foo()
However, after I wrap it using a Tornado server, it's no longer able to handle concurrent requests. When I click my website where it sends an async GET request to this call_foo() function, my app seems unable to handle other requests. For example, if I open the home page URL, it keeps waiting and won't display until the above subprocess call in foo() has finished.
If I do not use Tornado, everything works fine.
Below is my code to start the tornado server. Is there anything that I did wrong?
MAX_WAIT_SECONDS_BEFORE_SHUTDOWN = 5

def sig_handler(sig, frame):
    logging.warning('Caught signal: %s', sig)
    tornado.ioloop.IOLoop.instance().add_callback(force_shutdown)

def force_shutdown():
    logging.info("Stopping tornado server")
    server.stop()
    logging.info('Will shutdown in %s seconds ...', MAX_WAIT_SECONDS_BEFORE_SHUTDOWN)
    io_loop = tornado.ioloop.IOLoop.instance()
    deadline = time.time() + MAX_WAIT_SECONDS_BEFORE_SHUTDOWN

    def stop_loop():
        now = time.time()
        if now < deadline and (io_loop._callbacks or io_loop._timeouts):
            io_loop.add_timeout(now + 1, stop_loop)
        else:
            io_loop.stop()
            logging.info('Force Shutdown')

    stop_loop()

def main():
    parse_command_line()
    logging.info("starting tornado web server")
    os.environ['DJANGO_SETTINGS_MODULE'] = 'mydjango.settings'
    django.setup()
    wsgi_app = tornado.wsgi.WSGIContainer(django.core.handlers.wsgi.WSGIHandler())
    tornado_app = tornado.web.Application([
        (r'/(favicon\.ico)', tornado.web.StaticFileHandler, {'path': "static"}),
        (r'/static/(.*)', tornado.web.StaticFileHandler, {'path': "static"}),
        ('.*', tornado.web.FallbackHandler, dict(fallback=wsgi_app)),
    ])
    global server
    server = tornado.httpserver.HTTPServer(tornado_app)
    server.listen(options.port)
    signal.signal(signal.SIGTERM, sig_handler)
    signal.signal(signal.SIGINT, sig_handler)
    tornado.ioloop.IOLoop.instance().start()
    logging.info("Exit...")

if __name__ == '__main__':
    main()
There is nothing wrong with your set-up. This is by design.
So, the WSGI protocol (and thus Django) uses a synchronous model. It means that when your app starts processing a request, it takes control and gives it back only when the request is finished. That's why it can process only a single request at a time. To allow simultaneous requests, one usually launches a WSGI application in multithreaded or multiprocessed mode.
The Tornado server, on the other side, uses an asynchronous model. The idea here is to have its own scheduler instead of the OS scheduler that works with threads and processes. So your code runs some logic, then launches some long task (DB call, URL fetch), sets up what to run when the task finishes, and gives control back to the scheduler.
Giving control back to the scheduler is the crucial part; it allows an async server to work fast because it can start processing a new request while the previous one is waiting for data.
This answer explains sync/async in detail. It focuses on the client, but I think you can see the idea.
So what's wrong with your code: Popen does not give control to the IOLoop. Python does nothing until your subprocess is finished, and so it cannot process other requests, not even Django's requests. runserver "works" here because it's multithreaded. So while one thread is entirely locked, other threads can still process requests.
For this reason it's usually not recommended to run WSGI apps under an async server like Tornado. The doc claims it will be less scalable, but you can see the problem in your own code. So if you need both servers (e.g. Tornado for sockets and Django for the main site), I'd suggest running both behind nginx, and using uwsgi or gunicorn to run Django. Or take a look at the django-channels app instead of Tornado.
Besides, while it works in a test environment, I guess it's not a recommended way to do what you're trying to achieve. It's hard to suggest a solution, as I don't know what you call with Popen, but it seems to be something long-running. Maybe you should take a look at the Celery project. It's a package for running long-term background jobs.
However, back to running subprocesses. In Tornado you can use tornado.process.Subprocess. It's a wrapper over Popen that allows it to work with the IOLoop. Unfortunately I don't know if you can use it in the WSGI part under Tornado. There are some projects I remember, like django-futures, but they seem to be abandoned.
As another quick and dirty fix, you can run Tornado with several processes. Check this example on how to fork the server. But I won't recommend using this in production anyway (forking is OK, running the WSGI fallback is not).
So to summarize, I would rewrite your code to do one of the following:
Run the Popen call in some background queue, like Celery
Process such views with Tornado and use the tornado.process module to run the subprocess (a minimal sketch follows below).
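For the second option, a minimal sketch of a native Tornado handler built on tornado.process.Subprocess might look like this (the handler name and command are illustrative; it would have to be registered in the Application routes before the WSGI fallback):
from tornado import gen, web
from tornado.process import Subprocess

class LongTaskHandler(web.RequestHandler):
    @gen.coroutine
    def get(self):
        # Subprocess wraps Popen and integrates with the IOLoop, so other
        # requests keep being served while the external command runs.
        proc = Subprocess(['sleep', '10'], stdout=Subprocess.STREAM)
        yield proc.wait_for_exit()
        self.write('done')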
And overall, I'd look for another deployment infrastructure, and would not run Django under Tornado.

Stop a background process in flask without creating zombie processes

I need to start a long-running background process with subprocess when someone visits a particular view.
My code:
from flask import Flask
import subprocess

app = Flask(__name__)

@app.route("/")
def index():
    subprocess.Popen(["sleep", "10"])
    return "hi\n"

if __name__ == "__main__":
    app.run(debug=True)
This works great, for the most part.
The problem is that when the process (sleep) ends, ps -Af | grep sleep shows it as [sleep] <defunct>.
From what I've read, this is because I still have a reference to the process in flask.
Is there a way to drop this reference after the process exits?
I tried doing g.subprocess = subprocess.Popen(["sleep", "10"]), and waiting for the process to end in an @app.after_request handler so I can use del on it, but this prevents flask from returning the response until the subprocess exits - I need it to return the response before the subprocess exits.
Note:
I need the subprocess.Popen operation to be non-blocking - this is important.
As I've suggested in the comments, one of the cleanest and most robust ways of achieving this kind of thing in Python is by using Celery.
Celery requires a broker transport for messaging, for which RabbitMQ is the default, and at least one process with workers running. However, the thing that increases readability and maintainability is that the worker code can co-exist in the same file or files as your server app. You invoke the remote procedures as though they were simple function calls.
Celery can handle retries, post-task events, and lots of other things for free, everything with mature code hardened by years of use in production.
This is your example after rewriting it for use with Celery:
from flask import Flask
from celery import Celery
import subprocess

app = Flask(__name__)
celery_app = Celery("test")

@celery_app.task
def run_process():
    subprocess.Popen(["sleep", "5"])

@app.route("/")
def index():
    run_process.delay()
    return "hi\n"

if __name__ == "__main__":
    app.run(debug=True, port=8080)
This code works in a system with the RabbitMQ server running with default options (I installed the package and started the service, with no configuration whatsoever; of course in production you would have to tune that, but if everything is to be on the same server, it may not even be needed).
With RabbitMQ in place, one starts the worker process with a command line like celery worker -A bla1.celery_app -D (pip install celery in the same virtualenv where you have your Flask). Then just launch the Flask server and see it working.
Of course this has even more advantages if you are doing more work in Python itself than just calling an external process. It can have access to your database models, and you can perform asynchronous actions that modify objects there (and eventually trigger responses for the user, such as "flash" messages on the user session, or e-mails).
I've seen a lot of "poor man's parallel processing" using subprocess.Popen and letting it run freely, but that often leads to zombie problems, as you noted.
You could run your process in a thread (in that case, there's no need for Popen; just use call, or check_call if you want to raise an exception if the process failed). call and check_call (or run since Python 3.5) wait for the process to complete, so there are no zombies, and since you're running it in a thread you're not blocked.
import subprocess
import threading

def in_background():
    subprocess.call(["sleep", "10"])

@app.route("/")
def index():
    t = threading.Thread(target=in_background)
    t.start()
    return "hi\n"
Note: To wait for thread completion you'd have to use t.join(), and for that you'd have to keep a reference to the t thread object.
BTW, I suppose that your real process isn't sleep; otherwise this isn't very useful and time.sleep(10) would do the same (always in a thread, of course!).

Asynchronous Flask functions

I have an app in Flask that I need to perform some asynchronous action in. I've read about Celery, but I'm not sure if it's the right fit.
Basically I have a button that takes input and runs a query to return back to the template, and this is quick, but I want it to also run another task (passing a SOAP envelope to a web service), and this is slow. I don't want the user to have to wait for the web-service call to finish. I'd like the query that returns new data to the template to happen as quickly as possible, and the web-service call to happen in the background.
Is this doable?
I know there are lots of Celery related threads here, but this might provide some service.
Using Celery for asynchronous activity requires more than just installing and importing the lib.
Requirements:
Celery lib
Queue broker, like Redis (an in-memory DB), installed
Separate file that creates celery object
I found the Flask documentation on using Celery with Flask lacking. My preferred method was to create a tasks.py file and put in:
from celery import Celery
# Other imports for functionality here

app = Celery('tasks', broker='redis://localhost:6379')

@app.task
def your_function(args):
    # do something with args
    return something
Then in the application file make sure this is imported:
from tasks import your_function
And then call it asynchronously where you need to in the app (.delay() sends the task to the worker instead of running it inline):
your_function.delay(args)
Then you must make sure that a Celery daemon/worker is running. This can be done by init, by systemd, by launchctl, or manually at the CLI (not ideal). Redis must also be running and listening on the URL you give it.
I hope this helps someone else.
Sounds like you need Tornado! An asynchronous web server gateway compatible with Flask:
from tornado.wsgi import WSGIContainer
from tornado.httpserver import HTTPServer
from tornado.ioloop import IOLoop
from YourModule import app
http_server = HTTPServer(WSGIContainer(app))
http_server.listen(8080)
IOLoop.instance().start()
I prefer tornado for its speed, reliability, and simplicity with Flask, which I love for its beauty
