How to design a resilient and highly available service in python?

How to design a resilient and highly available service in python? - python

I am trying to design a resilient and highly available python API back-end service. The core service is designed to run continuously. The service has to run independently for each of my tenants. This is required as the core service is a blocking service and each tenant's execution needs to be independent from any other tenant's service.
The core service is to be started by a provisioning service. The provisioner is also a continuously running service and is to be responsible for doing the house-keeping functions i.e start the core service on tenant sign-up, check for the required environment and attributes and stop the core service etc.
Currently I am using the multiprocessing module to spawn child instances of the core service from the provisioner service. Having a multi-threaded service with one thread for each tenant is also an option but that has the drawback of disruption of service for other tenant if any of the threads craches. Ideally I would like all these to run as background processes. The problems are
If I daemonize the provisioner service, multiprocessing will not let that daemon to create child processes. This is written here
If the provisioner service dies, then all the children will become orphans. How do I get back from this situation.
Obviously, I am open to solutions that do not follow this multiprocessing usage model.

I would recommend you take a different approach. Use the system tools available in your distribution to manage the life-cycle of your processes instead of spawning them yourself. The provisioner would be much simpler as well, as it will not have to reproduce what your operating system can do with little effort.
On Ubuntu/CentOS 6 systems you can use Upstart, which has a great deal of advantages compared to the old sysvinit (aggressive parallelisation, respawning, simple init config syntax, etc).
There is also SystemD which is similar to upstart in design, and comes default in OpenSuse.
The provisioner could then be used only to create the needed init config for each service, and start or stop them using the subprocess module. You could then monitor your instances in case upstart was not able to respawn an instance, and send an alert, or try to start the service again.
Using this approach, you isolate all instances of user services from one another. If the provisioner crashes, the rest of the services will remain up.
For example, say your provisioner is running in the background. It gets a message via AMQP or some other means to create a user and start services for that user. One possible flow youd be:
create user
Do any bootstrap needed for new users
Create /etc/init/[username]_service.conf
start [username]_service
The init script could look similar to:
description "start Service for [username]"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
# Run before process
pre-start script
end script
exec /bin/su -c "/path/to/your/app" <username>
This way you offload process management from your provisioner to the system upstart daemon. You only need to do job management in a simple way (create/destroy services when a user is created or deleted).

On debian-like you can wrap not demonized service with
start-stop-daemon --start --quiet --background --make-pidfile --pidfile $PIDFILE --exec $DAEMON --chuid $USER --chdir $DIR -- \
$DAEMON_ARGS
Children must die after proceesing task.
Parent process must be so simle so posible, only "resieve task - spawn child" in main loop.

Related

How do you schedule some python scripts to run regularly on a Windows PC?

I have some python scripts that I look to run daily form a Windows PC.
My current workflow is:
The desktop PC stays all every day except for a weekly restart over the weekend
After the restart I open VS Code and run a little bash script ./start.sh that kicks off the tasks.
The above works reasonably fine, but it is also fairly painful. I need to re-run start.sh if I ever close VS Code (eg. for an update). Also the processes use some local python libraries so I need to stop them if I'm going to update them.
With regards to how to do this properly, 4 tools came to mind:
Windows Scheduler
Airflow
Prefect (https://www.prefect.io/)
Rocketry (https://rocketry.readthedocs.io/en/stable/)
However, I can't quite get my head around the fundamental issue that Prefect/Airflow/Rocketry run on my PC then there is nothing that will restart them after the PC reboots. I'm also not sure they will give me the isolation I'd prefer on these tools.
Docker came to mind, I could put each task into a docker image and run them via some form of docker swarm or something like that. But not sure if I'm re-inventing the wheel.
I'm 100% sure I'm not the first person in this situation. Could anyone point me to a guide on how this could be done well?
Note:
I am not considering running the python scripts in the cloud. They interact with local tools that are only licenced for my PC.

You can definitely use Prefect for that - it's very lightweight and seems to be matching what you're looking for. You install it with pip install prefect, start Orion API server: prefect orion start and once you create a Deployment, and start an agent prefect agent start -q default you can even configure schedule from the UI
For more information about Deployments, check our FAQ section.

It sounds Rocketry could also be suitable. Rocketry can shut down itself using a task. You could do a task that:
Runs on the main thread and process (blocking starting new tasks)
Waits or terminates all the currently running tasks (use the session)
Calls session.shut_down() which sets a flag to the scheduler.
There is also a app configuration shut_cond which is simply a condition. If this condition is True, the scheduler exits so alternatively you can use this.
Then after the line app.run() you simply have a line that runs shutdown -r (restart) command on shell using a subprocess library, for example. Then you need something that starts Rocketry again when the restart is completed. For this, perhaps this could be an answer: https://superuser.com/a/954957, or use Windows scheduler to have a simple startup task that starts Rocketry.
Especially if you had Linux machines (Raspberry Pis for example), you could integrate Rocketry with FastAPI and make a small cluster in which Rocketry apps communicate with each other, just put script with Rocketry as a startup service. One machine could be a backup that calls another machine's API which runs Linux restart command. Then the backup executes tasks until the primary machine answers to requests again (is up and running).
But as the author of the library, I'm possibly biased toward my own projects. But Rocketry very capable on complex scheduling problems, that's the purpose of the project.

You can use schtasks for windows to schedule the tasks like running bash script or python script and it's pretty reliable too.

Keep Python script running as a TCP socket server on AWS machine

I need to deploy a Python script on a AWS machine with Ubuntu server 18.04.
In the script there is a TCP server using a custom TCP port (let's say the 9999), which handles the clients' requests in different threads.
The problem is that I don't know which could be the best practice to keep this script running if there is any problem (the main TCP server thread dies for whatever reason).
Furthermore, I don't really know which could be the best practice to run this kind of script in the AWS EC2 system.
So far I am manually starting the script via SSH. Everything in the script logic works well, the problem is how to start and keep running such script.

You should take a look at the systemd suite. It can be used to manage the status of your script. It can restart the script if it dies, or if the node is rebooted.
Here's an example service.
Create the file below in this location: /lib/systemd/system/example.service
[Unit]
Description=A short description of the script.
[Service]
Type=simple
# Script location
ExecStart=/path/to/some/script.py
# Restart the script in all circumstances (e.g If it exits successfully, fails or crashes).
Restart=always
[Install]
WantedBy=multi-user.target
Then set the service to start automatically on boot and start the service:
chmod 644 /lib/systemd/system/example.service
systemctl enable example
systemctl start example
There are a lot of resources available if you want to learn more about systemd. I'd suggest the links below:
[0] https://www.freedesktop.org/wiki/Software/systemd/
[1] https://github.com/torfsen/python-systemd-tutorial
[2] https://www.linode.com/docs/quick-answers/linux/start-service-at-boot/#create-a-custom-systemd-service
[3] https://medium.com/#benmorel/creating-a-linux-service-with-systemd-611b5c8b91d6
As for general best practices, it is difficult to provide advice without knowing more about your script. It is not recommended to use the python HTTPServer module for Production workloads, because it only implements basic security checks.

How can I start redis queue worker without command line interface?

I use this tool http://python-rq.org/
I have a flask app, but I could not find the way to start rq worker other than with rq cli, e.g. $ rq worker
I need to have the worker running all the time, how can I make it running as a service? I need the service to start up on boot also.

You should investigate some supervisor program to control your rq worker. Take a look at supervisor or systemd. I personally use supervisord and it's pretty popular in the Python community.
This is how any supervisor program works (not to be confused with supervisord): the supervisor itself is a service (controlled by another service! e.g., systemd, initd, etc) and it runs programs specified in its configuration file. If a program exits or has issues, the supervisor will respawn it.
If you were in the docker ecosystem, it'd be simpler because docker can be your supervisor.

Multiple options :
For dev purpose, I recommand a docker container (https://docker.com). For production purpose, the cleanest way is : use the packaged version for your system (assuming you're using a GNU/Linux box) along with a dedicated systemd unit.
For example, on Debian :
apt-get update && apt-get install redis
Then edit redis.conf, and start it :
systemctl start redis
Enable it (== start redis during startup)
systemctl enable redis

Python daemon and systemd service

I have a simple Python script working as a daemon. I am trying to create systemd script to be able to start this script during startup.
Current systemd script:
[Unit]
Description=Text
After=syslog.target
[Service]
Type=forking
User=node
Group=node
WorkingDirectory=/home/node/Node/
PIDFile=/var/run/zebra.pid
ExecStart=/home/node/Node/node.py
[Install]
WantedBy=multi-user.target
node.py:
if __name__ == '__main__':
with daemon.DaemonContext():
check = Node()
check.run()
run contains while True loop.
I try to run this service with systemctl start zebra-node.service. Unfortunately service never finished stating sequence - I have to press Ctrl+C.
Script is running, but status is activating and after a while it change to deactivating.
Now I am using python-daemon (but before I tried without it and the symptoms were similar).
Should I implement some additional features to my script or is systemd file incorrect?

The reason, it does not complete the startup sequence is, that for Type forking your startup process is expected to fork and exit (see $ man systemd.service - search for forking).
Simply use only the main process, do not daemonize
One option is to do less. With systemd, there is often no need to create daemons and you may directly run the code without daemonizing.
#!/usr/bin/python -u
from somewhere import Node
check = Node()
check.run()
This allows using simpler Type of service called simple, so your unit file would look like.
[Unit]
Description=Simplified simple zebra service
After=syslog.target
[Service]
Type=simple
User=node
Group=node
WorkingDirectory=/home/node/Node/
ExecStart=/home/node/Node/node.py
StandardOutput=syslog
StandardError=syslog
[Install]
WantedBy=multi-user.target
Note, that the -u in python shebang is not necessary, but in case you print something out to the stdout or stderr, the -u makes sure, there is no output buffering in place and printed lines will be immediately caught by systemd and recorded in journal. Without it, it would appear with some delay.
For this purpose I added into unit file the lines StandardOutput=syslog and StandardError=syslog. If you do not care about printed output in your journal, do not care about these lines (they do not have to be present).
systemd makes daemonization obsolete
While the title of your question explicitly asks about daemonizing, I guess, the core of the question is "how to make my service running" and while using main process seems much simpler (you do not have to care about daemons at all), it could be considered answer to your question.
I think, that many people use daemonizing just because "everybody does it". With systemd the reasons for daemonizing are often obsolete. There might be some reasons to use daemonization, but it will be rare case now.
EDIT: fixed python -p to proper python -u. thanks kmftzg

It is possible to daemonize like Schnouki and Amit describe. But with systemd this is not necessary. There are two nicer ways to initialize the daemon: socket-activation and explicit notification with sd_notify().
Socket activation works for daemons which want to listen on a network port or UNIX socket or similar. Systemd would open the socket, listen on it, and then spawn the daemon when a connection comes in. This is the preferred approch because it gives the most flexibility to the administrator. [1] and [2] give a nice introduction, [3] describes the C API, while [4] describes the Python API.
[1] http://0pointer.de/blog/projects/socket-activation.html
[2] http://0pointer.de/blog/projects/socket-activation2.html
[3] http://www.freedesktop.org/software/systemd/man/sd_listen_fds.html
[4] http://www.freedesktop.org/software/systemd/python-systemd/daemon.html#systemd.daemon.listen_fds
Explicit notification means that the daemon opens the sockets itself and/or does any other initialization, and then notifies init that it is ready and can serve requests. This can be implemented with the "forking protocol", but actually it is nicer to just send a notification to systemd with sd_notify().
Python wrapper is called systemd.daemon.notify and will be one line to use [5].
[5] http://www.freedesktop.org/software/systemd/python-systemd/daemon.html#systemd.daemon.notify
In this case the unit file would have Type=notify, and call
systemd.daemon.notify("READY=1") after it has established the sockets. No forking or daemonization is necessary.

You're not creating the PID file.
systemd expects your program to write its PID in /var/run/zebra.pid. As you don't do it, systemd probably thinks that your program is failing, hence deactivating it.
To add the PID file, install lockfile and change your code to this:
import daemon
import daemon.pidlockfile
pidfile = daemon.pidlockfile.PIDLockFile("/var/run/zebra.pid")
with daemon.DaemonContext(pidfile=pidfile):
check = Node()
check.run()
(Quick note: some recent update of lockfile changed its API and made it incompatible with python-daemon. To fix it, edit daemon/pidlockfile.py, remove LinkFileLock from the imports, and add from lockfile.linklockfile import LinkLockFile as LinkFileLock.)
Be careful of one other thing: DaemonContext changes the working dir of your program to /, making the WorkingDirectory of your service file useless. If you want DaemonContext to chdir into another directory, use DaemonContext(pidfile=pidfile, working_directory="/path/to/dir").

I came across this question when trying to convert some python init.d services to systemd under CentOS 7. This seems to work great for me, by placing this file in /etc/systemd/system/:
[Unit]
Description=manages worker instances as a service
After=multi-user.target
[Service]
Type=idle
User=node
ExecStart=/usr/bin/python /path/to/your/module.py
Restart=always
TimeoutStartSec=10
RestartSec=10
[Install]
WantedBy=multi-user.target
I then dropped my old init.d service file from /etc/init.d and ran sudo systemctl daemon-reload to reload systemd.
I wanted my service to auto restart, hence the restart options. I also found using idle for Type made more sense than simple.
Behavior of idle is very similar to simple; however, actual execution
of the service binary is delayed until all active jobs are dispatched.
This may be used to avoid interleaving of output of shell services
with the status output on the console.
More details on the options I used here.
I also experimented with keeping the old service and having systemd resart the service but I ran into some issues.
[Unit]
# Added this to the above
#SourcePath=/etc/init.d/old-service
[Service]
# Replace the ExecStart from above with these
#ExecStart=/etc/init.d/old-service start
#ExecStop=/etc/init.d/old-service stop
The issues I experienced was that the init.d service script was used instead of the systemd service if both were named the same. If you killed the init.d initiated process, the systemd script would then take over. But if you ran service <service-name> stop it would refer to the old init.d service. So I found the best way was to drop the old init.d service and the service command referred to the systemd service instead.
Hope this helps!

Also, you most likely need to set daemon_context=True when creating the DaemonContext().
This is because, if python-daemon detects that if it is running under a init system, it doesn't detach from the parent process. systemd expects that the daemon process running with Type=forking will do so. Hence, you need that, else systemd will keep waiting, and finally kill the process.
If you are curious, in python-daemon's daemon module, you will see this code:
def is_detach_process_context_required():
""" Determine whether detaching process context is required.
Return ``True`` if the process environment indicates the
process is already detached:
* Process was started by `init`; or
* Process was started by `inetd`.
"""
result = True
if is_process_started_by_init() or is_process_started_by_superserver():
result = False
Hopefully this explains better.

How can I communicate with Celery on Cloud Foundry?

I have a wsgi app with a celery component. Basically, when certain requests come in they can hand off relatively time-consuming tasks to celery. I have a working version of this product on a server I set up myself, but our client recently asked me to deploy it to Cloud Foundry. Since Celery is not available as a service on Cloud Foundry, we (me and the client's deployment team) decided to deploy the app twice – once as a wsgi app and once as a standalone celery app, sharing a rabbitmq service.
The code between the apps is identical. The wsgi app responds correctly, returning the expected web pages. vmc logs celeryapp shows that celery is to be up-and-running, but when I send requests to wsgi that should become celery tasks, they disappear as soon as they get to a .delay() statement. They neither appear in the celery logs nor do they appear as an error.
Attempts to debug:
I can't use celery.contrib.rdb in Cloud Foundry (to supply a telnet interface to pdb), as each app is sandboxed and port-restricted.
I don't know how to find the specific rabbitmq instance these apps are supposed to share, so I can see what messages it's passing.
Update: to corroborate the above statement about finding rabbitmq, here's what happens when I try to access the node that should be sharing celery tasks:
root#cf:~# export RABBITMQ_NODENAME=eecef185-e1ae-4e08-91af-47f590304ecc
root#cf:~# export RABBITMQ_NODE_PORT=57390
root#cf:~# ~/cloudfoundry/.deployments/devbox/deploy/rabbitmq/sbin/rabbitmqctl list_queues
Listing queues ...
=ERROR REPORT==== 18-Jun-2012::11:31:35 ===
Error in process <0.36.0> on node 'rabbitmqctl17951#cf' with exit value: {badarg,[{erlang,list_to_existing_atom,["eecef185-e1ae-4e08-91af-47f590304ecc#localhost"]},{dist_util,recv_challenge,1},{dist_util,handshake_we_started,1}]}
Error: unable to connect to node 'eecef185-e1ae-4e08-91af-47f590304ecc#cf': nodedown
diagnostics:
- nodes and their ports on cf: [{'eecef185-e1ae-4e08-91af-47f590304ecc',57390},
{rabbitmqctl17951,36032}]
- current node: rabbitmqctl17951#cf
- current node home dir: /home/cf
- current node cookie hash: 1igde7WRgkhAea8fCwKncQ==
How can I debug this and/or why are my tasks vanishing?

Apparently the problem was caused by a deadlock between the broker and the celery worker, such that the worker would never acknowledge the task as complete, and never accept a new task, but never crashed or failed either. The tasks weren't vanishing; they were simply staying in queue forever.
Update: The deadlock was caused by the fact that we were running celeryd inside a wrapper script that installed dependencies. (Literally pip install -r requirements.txt && ./celeryd -lINFO). Because of how Cloud Foundry manages process trees, Cloud Foundry would try to kill the parent process (bash), which would HUP celeryd, but ultimately lots of child processes would never die.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.